
  • Bulk loading is the process of preparing and loading HFiles (HBase’s own file format) directly into the RegionServers, thus bypassing the write path and obviating those issues entirely. This process is similar to ETL and looks like this:

1. Extract the data from a source

  • HBase doesn’t manage this part of the process. In other words, you cannot tell HBase to prepare HFiles by reading the data directly from MySQL; rather, you have to do it by your own means.
  • For example, you could run mysqldump on a table and upload the resulting files to HDFS, or just grab your Apache HTTP log files (one way to stage such a file in HDFS is sketched after this list).
  • In any case, your data needs to be in HDFS before the next step.
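
A minimal sketch of staging a local dump file in HDFS with the standard Hadoop FileSystem API is shown below. The paths and the dump file name are hypothetical placeholders; any equivalent mechanism (hdfs dfs -put, Flume, etc.) works just as well.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageDump {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Copy a local mysqldump output (hypothetical paths) into HDFS
    // so it is available to the transform step's MapReduce job.
    fs.copyFromLocalFile(new Path("/tmp/mytable.dump"),
                         new Path("/user/etl/input/mytable.dump"));
    fs.close();
  }
}
```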

2. Transform the data into HFiles.

  • This step requires a MapReduce job, and for most input types you will have to write the Mapper yourself (a sketch of such a job follows this list).
  • The job will need to emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value.
  • The Reducer is handled by HBase; you configure it using HFileOutputFormat.configureIncrementalLoad() and it does the following:

    • Inspects the table to configure a total order partitioner
    • Uploads the partitions file to the cluster and adds it to the DistributedCache
    • Sets the number of reduce tasks to match the current number of regions
    • Sets the output key/value class to match HFileOutputFormat’s requirements
    • Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
  • At this stage, one HFile will be created per region in the output folder.
  • Keep in mind that the input data is almost completely rewritten, so you will need at least twice as much free disk space as the original data set occupies.
    • For example, for a 100GB mysqldump you should have at least 200GB of available disk space in HDFS. You can delete the dump file at the end of the process.
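
To make this concrete, here is a hedged sketch of such a job: a Mapper that turns tab-separated lines of rowkey<TAB>value (a hypothetical input layout) into Puts, and a driver that lets HFileOutputFormat.configureIncrementalLoad() wire up the partitioner and the PutSortReducer. The table name, column family, qualifier, and paths are placeholders, and the HTable-based signature shown matches older HBase releases; newer versions use HFileOutputFormat2 with a Table and RegionLocator instead.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileTransform {

  // Emits the row key as the Key and a Put as the Value, as described above.
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final byte[] CF = Bytes.toBytes("cf");     // assumed column family
    private static final byte[] COL = Bytes.toBytes("value"); // assumed qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) {
        return; // skip malformed lines
      }
      byte[] rowKey = Bytes.toBytes(fields[0]);
      Put put = new Put(rowKey);
      put.add(CF, COL, Bytes.toBytes(fields[1])); // addColumn() on newer HBase versions
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hfile-transform");
    job.setJarByClass(HFileTransform.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(LineToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/user/etl/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/etl/hfiles"));

    // Inspects the table, sets up the total order partitioner, the reducer
    // count and the PutSortReducer, as listed in the bullets above.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```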

3. Load the files into HBase by telling the RegionServers where to find them.

  • This is done with LoadIncrementalHFiles (more commonly known as the completebulkload tool): pass it a URL that locates the files in HDFS, and it will load each file into the relevant region via the RegionServer that serves it (see the sketch after this list).
  • In the event that a region was split after the files were created, the tool will automatically split the HFile according to the new boundaries.
  • This process isn’t very efficient, so if your table is currently being written to by other processes, it’s best to get the files loaded as soon as the transform step is done.
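
The load can also be triggered programmatically. Below is a minimal sketch, assuming the older HTable-based doBulkLoad() signature and the same hypothetical output directory as the transform sketch; newer HBase versions take a Table plus RegionLocator instead, and the completebulkload command-line tool wraps the same class.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, "mytable"); // hypothetical table name
    // Points the RegionServers at the HFiles produced by the transform step;
    // each file is moved into the region that owns its key range.
    loader.doBulkLoad(new Path("/user/etl/hfiles"), table);
    table.close();
  }
}
```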

Putting it all together, the data flow goes from the original source to HDFS, where the RegionServers will simply move the files to their regions’ directories.
