For more than a decade now, the Hive table format has been a ubiquitous presence in the big data ecosystem, managing petabytes of data with remarkable efficiency and scale. But as data volumes, data variety, and data usage grow, users face many challenges when using Hive tables because of its antiquated directory-based table format. Some of the common issues include constrained schema evolution, static partitioning of data, and long planning times caused by S3 directory listings.
Apache Iceberg is a modern table format that not only addresses these problems but also offers additional features like time travel, partition evolution, table versioning, schema evolution, strong consistency guarantees, object store file layout (the ability to distribute files belonging to one logical partition across many prefixes to avoid object store throttling), hidden partitioning (users don't need to be intimately aware of partitioning), and more. Therefore, the Apache Iceberg table format is poised to replace the traditional Hive table format in the coming years.
However, as there are already 25 million terabytes of data stored in the Hive table format, migrating existing tables from the Hive table format to the Iceberg table format is necessary for performance and cost. Depending on the size and usage patterns of the data, several different strategies can be pursued to achieve a successful migration. In this blog, I will describe a few strategies one could undertake for various use cases. While these instructions are written for Cloudera Data Platform (CDP), Cloudera Data Engineering, and Cloudera Data Warehouse, one can easily extrapolate them to other services and other use cases as well.
There are a few scenarios that one might encounter. Several of these use cases might fit your workload, and you might be able to mix and match the potential solutions provided to suit your needs. They are meant as a general guide. In all the use cases we are trying to migrate a table named "events."
You have the ability to stop your clients from writing to the respective Hive table for the duration of your migration. This is ideal because it means you don't have to change any of your client code. Sometimes this is the only choice available if you have hundreds of clients that can potentially write to a table. It can be much easier to simply stop all those jobs than to allow them to continue during the migration process.
In-place table migration
Solution 1A: using Spark's migrate procedure
Iceberg's Spark extensions provide a built-in procedure called "migrate" to migrate an existing table from the Hive table format to the Iceberg table format. They also provide a "snapshot" procedure that creates an Iceberg table with a different name backed by the same underlying data. You could first create a snapshot table, run sanity checks on it, and verify that everything is in order.
Once you are satisfied, you can drop the snapshot table and proceed with the migration using the migrate procedure. Keep in mind that the migrate procedure creates a backup table named "events__BACKUP__." As of this writing, the "__BACKUP__" suffix is hardcoded; there is an effort underway to let the user pass a custom backup suffix in the future.
Keep in mind that neither the migrate nor the snapshot procedure modifies the underlying data: they perform an in-place migration. They merely read the underlying data (not even a full read; they just read the Parquet headers) and create the corresponding Iceberg metadata files. Since the underlying data files are not modified, you may not be able to take full advantage of the benefits offered by Iceberg right away. You could optimize your table now, or at a later stage, using the "rewrite_data_files" procedure. This will be discussed in a later blog. Now let's discuss the pros and cons of this approach.
- Can do the migration in stages: first do the migration and then carry out the optimization later using the rewrite_data_files procedure (blog to follow).
- Relatively fast, as the underlying data files are kept in place. You don't have to worry about creating a temporary table and swapping it later; the procedure does that for you atomically once the migration is finished.
- Since a Hive backup is available, one can revert the change entirely by dropping the newly created Iceberg table and renaming the Hive backup table (__backup__) to its original name.
- If the underlying data is not optimized, or has a lot of small files, those disadvantages are carried forward to the Iceberg table as well. Query engines (Impala, Hive, Spark) may mitigate some of these problems by using Iceberg's metadata files. The underlying data file locations will not change, so if the prefixes of the file paths are shared across many files, you may continue to suffer from S3 throttling (see Object Store File Layout for how to configure it properly).
- In CDP we only support migrating external tables; Hive managed tables cannot be migrated. Also, the underlying file format of the table has to be one of Avro, ORC, or Parquet.
Note: There is also a SparkAction in the Java API.
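The optimization step mentioned above can be performed at any point after the migration with the rewrite_data_files procedure. A minimal sketch, again assuming a `spark_catalog` catalog and a `db` database:

```sql
-- Compact small files into larger, well-sized ones.
-- bin_pack is the default rewrite strategy.
CALL spark_catalog.system.rewrite_data_files(table => 'db.events');
```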
Solution 1B: using Hive's "ALTER TABLE" command
Cloudera implemented an easy way to do the migration in Hive. All you have to do is alter the table properties to set the storage handler to "HiveIcebergStorageHandler."
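In Hive, that looks something like the following. This is a sketch based on the Cloudera-provided storage handler; verify the exact handler class name against the documentation for your CDP version:

```sql
-- Switch the table's storage handler so Hive treats it as an Iceberg table.
ALTER TABLE events
SET TBLPROPERTIES ('storage_handler' = 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler');
```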
The pros and cons of this approach are essentially the same as Solution 1A. The migration is done in place and the underlying data files are not modified. Hive creates Iceberg's metadata files for the exact same table.
Shadow table migration
Solution 1C: using the CTAS statement
This solution is the most generic, and it can potentially be used with any processing engine (Spark/Hive/Impala) that supports SQL-like syntax.
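For example, in Spark SQL a shadow table can be created with a CTAS statement like the one below. This is a sketch: the table names and the S3 path are illustrative placeholders, not values from your environment.

```sql
-- Create the shadow Iceberg table by fully reading and rewriting the data.
-- Specifying LOCATION explicitly makes a later rename metadata-only.
CREATE TABLE db.iceberg_events
USING iceberg
LOCATION 's3://my-bucket/warehouse/iceberg_events'
AS SELECT * FROM db.events;
```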
You can run basic sanity checks on the data to see if the newly created table is sound.
Once you are satisfied with your sanity checking, you can rename your "events" table to "backup_events" and then rename "iceberg_events" to "events." Keep in mind that in some cases the rename operation can trigger a rename of the underlying data directory. If that is the case and your underlying data store is an object store like S3, it will trigger a full copy of your data and can be very expensive. If the location clause was specified when creating the Iceberg table, then renaming the Iceberg table will not cause the underlying data files to move; the name changes only in the Hive metastore. The same applies to Hive tables as well. If your original Hive table was not created with the location clause specified, then the rename to backup will trigger a directory rename. In that case, if your filesystem is object store based, it may be best to drop the table altogether. Given the nuances around table renames, it is critical to test with dummy tables on your system and check that you see the desired behavior before you perform these operations on critical tables.
You can drop your "backup_events" table if you wish.
Your clients can now resume their read/write operations on "events," and they don't even need to know that the underlying table format has changed. Now let's discuss the pros and cons of this approach.
- The newly created data is well optimized for Iceberg, and the data will be distributed well.
- Any existing small files will be coalesced automatically.
- The procedure is common across all the engines.
- The newly created data files can take advantage of Iceberg's Object Store File Layout, so that the file paths have different prefixes, reducing object store throttling. Please see the linked documentation for how to take advantage of this feature.
- This approach is not necessarily limited to migrating a Hive table. One could use the same approach to migrate tables from other formats such as Delta, Hudi, etc.
- You can change the data format, say from ORC to Parquet.
- This triggers a full read and write of the data, so it can be an expensive operation.
- Your entire data set will be duplicated. You need to have sufficient storage space available. This should not be a problem in a public cloud backed by an object store.
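As one concrete example, the Object Store File Layout mentioned in the pros above can be enabled with a table property. This is a sketch: the property name follows the open source Apache Iceberg documentation, so verify it against the Iceberg version shipped with your platform.

```sql
-- Spread new data files across randomized key prefixes
-- to avoid object store (e.g., S3) request throttling.
ALTER TABLE db.iceberg_events
SET TBLPROPERTIES ('write.object-storage.enabled' = 'true');
```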
You don’t have the posh of lengthy downtime to do your migration. You wish to let your shoppers or jobs proceed writing the information to the desk. This requires some planning and testing, however is feasible with some caveats. Right here is a method you are able to do it with Spark. You may probably extrapolate the concepts introduced to different engines.
- Create an Iceberg desk with the specified properties. Take into account that it’s a must to maintain the partitioning scheme the identical for this to work accurately.
- Modify your shoppers or jobs to put in writing to each tables in order that they write to the “iceberg_events” desk and “occasions” desk. However for now, they solely learn from the “occasions” desk. Seize the timestamp from which your shoppers began writing to each of the tables.
- You programmatically checklist all of the information within the Hive desk that had been inserted earlier than the timestamp you captured in step 2.
- Add all of the information captured in step 3 to the Iceberg desk utilizing the “add_files” process. The “add_files” process will merely add the file to your Iceberg desk. You additionally may be capable to reap the benefits of your desk’s partitioning scheme to skip step 3 solely and add information to your newly created Iceberg desk utilizing the “add_files” process.
- When you don’t have entry to Spark you may merely learn every of the information listed in step 3 and insert them into the “iceberg_events.”
- When you efficiently add all the information information, you may cease your shoppers from studying/writing to the outdated “occasions” and use the brand new “iceberg_events.”
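The create-and-register steps above can be sketched in Spark SQL as follows. The catalog name `spark_catalog`, the `db` database, and the column and partition names shown are assumptions; adapt them to your own schema.

```sql
-- Step 1: create the Iceberg table with the same partitioning scheme
-- as the original Hive table (columns shown are illustrative).
CREATE TABLE db.iceberg_events (
  id BIGINT,
  payload STRING,
  event_date DATE)
USING iceberg
PARTITIONED BY (event_date);

-- Step 4: register the existing Hive data files with the Iceberg table
-- without rewriting them. partition_filter restricts the import to one
-- partition, so repeat the call for each partition written before the
-- dual-write cutover (or omit it to import every file in the table).
CALL spark_catalog.system.add_files(
  table => 'db.iceberg_events',
  source_table => 'db.events',
  partition_filter => map('event_date', '2023-01-01'));
```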
Some caveats and notes
- In step 2, you can control which tables your clients/jobs write to using a flag fetched from an external source such as an environment variable, a database (like Redis) pointer, or a properties file. That way you only need to modify your client/job code once and don't have to keep modifying it for each step.
- In step 2, you are capturing a timestamp that will be used to determine the files needed for step 3; this can be affected by clock drift on your nodes. So you might want to sync all your nodes before you start the migration process.
- If your table is partitioned by date and time (as most real-world data is), such that all incoming data goes to a new partition every day, then you can program your clients to start writing to both tables from a specific date and time. That way you only have to worry about adding the data from the old table ("events") to the new table ("iceberg_events") from before that date and time, and you can take advantage of your partitioning scheme and skip step 3 entirely. This is the approach that should be used whenever possible.
Any large migration is tough and has to be thought through carefully. Thankfully, as discussed above, there are multiple strategies at our disposal to do it effectively depending on your use case. If you have the ability to stop all your jobs while the migration is happening, it is relatively straightforward; but if you want to migrate with minimal to no downtime, that requires some planning and careful thinking about your data layout. You can use a combination of the above approaches to best suit your needs.
To learn more:
- For more on table migration, please refer to the respective online documentation for Cloudera Data Warehouse (CDW) and Cloudera Data Engineering (CDE).
- Watch our webinar Supercharge Your Analytics with Open Data Lakehouse Powered by Apache Iceberg. It includes a live demo recording of Iceberg capabilities.
- Try Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML) by signing up for a 60-day trial, or test drive CDP. You can also schedule a demo by clicking here, or if you are interested in chatting about Apache Iceberg in CDP, contact your account team.