Data Modeling from Site
How to design schemas in HBase.
- HBase has no rich query language comparable to SQL in relational databases. It forgoes that capability, along with relationships, joins, and similar features, to focus on scalability with good performance and fault tolerance.
- So when working with HBase you need to design the row keys and table structure in terms of rows and column families to match the data access patterns of your application.
- This is the opposite of what you do with relational databases, where you start with a normalized schema of separate tables and then use SQL joins to combine the data in the ways you need.
- With HBase you design your tables specific to how they will be accessed by applications, so you need to think much more up-front about how data is accessed.
- You are much closer to the bare metal with HBase than with relational databases, which abstract away implementation details and storage mechanisms.
- However, for applications that need to store massive amounts of data with inherent scalability, good performance, and tolerance of server failures, the potential benefits can far outweigh the costs.
Row Key is Critical!
- When scanning data in HBase, the row key is critical since it is the primary means to restrict the rows scanned.
- Because there is no rich query language like SQL, you typically create a scan using start and stop row keys and optionally add filters to further restrict the rows and columns returned.
- In order to have some flexibility when scanning, the row key should be designed to contain the information you need to find specific subsets of data.
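The idea of restricting a scan by start and stop row keys can be illustrated without a cluster. The following is only a minimal in-memory sketch, not the HBase client API: the `rows` dictionary, the `scan` function, and the `<user_id>-<yyyymmdd>` key layout are all assumptions made up for illustration. It follows the usual scan convention of an inclusive start row and an exclusive stop row.

```python
import bisect

# Toy stand-in for an HBase table (NOT the real client API).
# Row keys use a hypothetical "<user_id>-<yyyymmdd>" layout.
rows = {
    "alice-20240101": {"d:msg": "hello"},
    "alice-20240215": {"d:msg": "lunch?"},
    "bob-20240105": {"d:msg": "invoice"},
    "bob-20240301": {"d:msg": "reminder"},
    "carol-20240120": {"d:msg": "welcome"},
}
# HBase stores rows sorted by key; sorting the keys once mimics that.
sorted_keys = sorted(rows)


def scan(start_row, stop_row, column_filter=None):
    """Return (row_key, columns) pairs with start_row <= key < stop_row.

    column_filter is an optional predicate on a row's columns, standing in
    for the filters that further restrict what a scan returns.
    """
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    results = []
    for key in sorted_keys[lo:hi]:
        columns = rows[key]
        if column_filter is None or column_filter(columns):
            results.append((key, columns))
    return results


# All of bob's rows: "." is the ASCII character right after "-",
# so "bob." sorts after every key that starts with "bob-".
bob_rows = scan("bob-", "bob.")

# Only bob's rows from March 2024, thanks to the date in the key.
bob_march = scan("bob-202403", "bob-202404")

# A filter narrows the result further after the key range is applied.
bob_invoices = scan("bob-", "bob.", lambda cols: cols["d:msg"] == "invoice")
```

Because the date is part of the key, the March query touches only the matching slice of the sorted keys; information that is not in the key can only be checked by a filter after the rows are read.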
Wide Table Design
HBase does not have foreign key relationships like in relational databases, but because it supports rows having up to millions of columns, one way to design tables in HBase is to encapsulate related information in the same row - a "wide" table design.
- You are storing all information related to a row together in as many columns as there are data items.
- When HBase retrieves columns it returns them in sorted order, just like row keys.
- This kind of design can work well if the number of columns is relatively modest, as it would be for blog comments or a person's contact information.
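A wide row can be pictured as one row key that maps to many columns, read back in sorted column order. The sketch below is a plain-Python model under assumed names: the `info:` and `comment:` prefixes, the zero-padded comment numbers, and the `put` and `get_columns` helpers are hypothetical and are not HBase calls.

```python
# Toy model of a single "wide" row: one row key, many columns.
# The "info:" and "comment:" prefixes are hypothetical column families.
blog_post_row = {}


def put(row, column, value):
    """Store one cell in the row (stand-in for a column write)."""
    row[column] = value


def get_columns(row, prefix=""):
    """Return (column, value) pairs in sorted column order.

    An optional prefix limits the result to one group of columns.
    """
    return [(col, row[col]) for col in sorted(row) if col.startswith(prefix)]


# Columns are written in arbitrary order...
put(blog_post_row, "comment:0003", "Nice write-up")
put(blog_post_row, "info:title", "Intro to HBase")
put(blog_post_row, "comment:0001", "First!")
put(blog_post_row, "comment:0002", "Thanks for sharing")

# ...but come back sorted, so zero-padded qualifiers keep comments in order.
comments = get_columns(blog_post_row, "comment:")
all_columns = get_columns(blog_post_row)
```

Zero-padding the comment number is a design choice of this sketch: since columns sort as text, `comment:0010` would otherwise sort before `comment:0002`.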
Tall Table Design
If instead you are modeling something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you might choose instead to spread a user's emails, transactions, or sensor readings across multiple rows (a "tall" design) and design the row keys to allow efficient scanning and pagination.
- For an inbox the row key might look like <user_id>-<reversed_email_timestamp>, which would permit easily scanning and paginating a user's inbox, while for financial transactions the row key might be <user_id>-<reversed_transaction_timestamp>.
- This kind of design can be called "tall" since you are spreading information about the same thing (e.g. readings from the same sensor, transactions in an account) across multiple rows.
- A tall design is something to consider if there will be an ever-expanding amount of information, as would be the case in a scenario involving data collection from a huge network of sensors.
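To make the reversed-timestamp key concrete, here is a minimal sketch. It assumes the common approach of subtracting the timestamp from the largest signed 64-bit value so that newer items sort first; the source does not say how the reversal is done, so treat that as an assumption. The names `inbox_row_key` and `page`, and the dictionary standing in for a table, are hypothetical.

```python
# Assumption: "reversed" means MAX_TS - timestamp, so newer rows sort first.
MAX_TS = 2**63 - 1          # largest signed 64-bit integer
WIDTH = len(str(MAX_TS))    # 19 digits; zero-padding keeps text order numeric


def inbox_row_key(user_id, timestamp_ms):
    """Build a "<user_id>-<reversed_timestamp>" row key."""
    reversed_ts = MAX_TS - timestamp_ms
    return f"{user_id}-{reversed_ts:0{WIDTH}d}"


# Toy table: row key -> email subject (a dict, not a real HBase table).
emails = {
    inbox_row_key("alice", 1700000000000): "oldest",
    inbox_row_key("alice", 1700000500000): "middle",
    inbox_row_key("alice", 1700000900000): "newest",
    inbox_row_key("bob", 1700000100000): "other user",
}


def page(table, user_id, page_size, after_key=None):
    """Return up to page_size (row_key, value) pairs for one user.

    Rows come back in key order, which is newest first for this key layout.
    after_key is the last row key of the previous page (exclusive).
    """
    prefix = f"{user_id}-"
    keys = [k for k in sorted(table) if k.startswith(prefix)]
    if after_key is not None:
        keys = [k for k in keys if k > after_key]
    return [(k, table[k]) for k in keys[:page_size]]


first_page = page(emails, "alice", 2)
second_page = page(emails, "alice", 2, after_key=first_page[-1][0])
```

Pagination here needs nothing more than the last key of the previous page, which is what makes this key layout convenient for an inbox. In a real table the prefix check and the `after_key` comparison would become the start and stop rows of a scan rather than a pass over every key.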
Other
Designing row keys and table structures is a key part of working with HBase and, given its fundamental architecture, will remain so. Beyond that, there are other things you can do to add alternative schemes for data access.
- You could implement full-text searching via Apache Lucene either within rows or external to HBase (search Google for HBASE-3529).
- You can also create (and maintain) secondary indexes to permit alternate row key schemes for tables.
- For example in our people table the composite row key consists of the name and a unique identifier. But if we desire to access people by their birth date, telephone area code, email address, or any other number of ways, we could add secondary indexes to enable that form of interaction.
- Note, however, that adding secondary indexes is not something to be taken lightly; every time you write to the "main" table (e.g. people) you will need to also update all the secondary indexes!
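The maintenance cost of a secondary index can be seen in a small sketch. This is an in-memory illustration only: the `people` and `people_by_email` dictionaries, the `put_person` and `get_person_by_email` helpers, and the `smith-john-0001` style of row key are assumptions, not part of HBase or of the original schema.

```python
# Toy "main" table and one secondary index, both plain dicts.
people = {}            # row key (name + unique id) -> columns
people_by_email = {}   # email address -> row key in the main table


def put_person(row_key, columns):
    """Write to the main table AND keep the email index in step."""
    old = people.get(row_key)
    if old is not None and old.get("email") != columns.get("email"):
        # The email changed or was removed: drop the stale index entry.
        people_by_email.pop(old.get("email"), None)
    people[row_key] = columns
    if "email" in columns:
        people_by_email[columns["email"]] = row_key


def get_person_by_email(email):
    """Look up the index first, then fetch the row from the main table."""
    row_key = people_by_email.get(email)
    return None if row_key is None else people[row_key]


put_person("smith-john-0001", {"email": "john@example.com", "area_code": "703"})
put_person("doe-jane-0002", {"email": "jane@example.com", "area_code": "212"})

# Updating a person means updating the index as well.
put_person("smith-john-0001", {"email": "jsmith@example.com", "area_code": "703"})

found = get_person_by_email("jsmith@example.com")
stale = get_person_by_email("john@example.com")
```

Every write touches two structures, and an index per access path (birth date, area code, and so on) would add one more each. In a real deployment the main-table write and the index write are separate operations, so the application also has to handle the case where one succeeds and the other fails.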
Summary
- Unlike relational models in which you work hard to normalize data and then use SQL as a flexible way to join the data in various ways, with HBase you need to think much more up-front about the data access patterns, because retrieval by row key and table scans are the only two ways to access data.
- In other words, there is no joining across multiple HBase tables and projecting out the columns you need.
- When you retrieve data, ask HBase only for the exact data you need.