HBase on Ubuntu: Data Storage Architecture and Mechanisms
HBase, a distributed NoSQL database built on Hadoop HDFS, stores data in a column-oriented, scalable, and fault-tolerant manner. On Ubuntu (or any Linux-based system), HBase leverages HDFS as its underlying storage layer, ensuring data durability and high availability through replication. Below is a structured breakdown of its core storage components and workflows:
HBase’s data storage is organized around four key components: HDFS, the HFile format, the MemStore, and the HLog (write-ahead log). Each serves a distinct role in the data lifecycle:
HBase relies on HDFS (Hadoop Distributed File System) to store all persistent data. Tables, regions, and files are distributed across multiple nodes in the HDFS cluster, providing fault tolerance (via replication) and parallel processing capabilities. For example, in a pseudo-distributed setup, the hbase.rootdir property in hbase-site.xml is configured to point to an HDFS path (e.g., hdfs://localhost:9000/hbase), ensuring all HBase data is stored in HDFS.
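For reference, a minimal hbase-site.xml for such a pseudo-distributed setup might look like the sketch below; the HDFS host and port are examples and depend on your Hadoop configuration:

```xml
<!-- hbase-site.xml: minimal pseudo-distributed sketch; the HDFS URL is illustrative -->
<configuration>
  <property>
    <!-- Root directory for all HBase data, stored in HDFS -->
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <!-- Run against HDFS rather than the local filesystem -->
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
```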
HFile is HBase’s binary file format for storing table data. It is optimized for sequential scans and random reads, with features like a multi-level block index for fast key lookups, optional Bloom filters that let reads skip files which cannot contain a given row, and per-block compression to reduce storage and I/O.
Before data is written to HDFS, it is stored in MemStore (an in-memory buffer, one per column family per Region). MemStore serves two purposes:
- It keeps incoming writes sorted by RowKey, so each flushed HFile is already ordered for efficient scans.
- It buffers writes in memory; when a MemStore exceeds a configurable threshold (hbase.hregion.memstore.flush.size), its contents are flushed to disk as a new HFile.
The HLog, or write-ahead log (implemented as a Hadoop SequenceFile), records every write operation (Puts, Deletes) before it is applied to MemStore. This ensures data durability: if a RegionServer crashes, the HLog can be replayed to recover data that had not yet been flushed. Each HLog entry includes:
- An HLogKey (identifying the table, region, and sequence number).
- A KeyValue object (the actual data being written).
HBase organizes data into a table-based model with the following hierarchy:
A table is a collection of rows, split into Regions (horizontal partitions) for scalability. Each Region is served by exactly one RegionServer at a time, while its underlying HFiles live in HDFS and are replicated across nodes.
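As a quick illustration, the Java client API can list a table's Regions and the RegionServers hosting them. This is a minimal sketch; the users table name is hypothetical:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class ListRegions {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("users"))) {
            // Each Region covers a contiguous RowKey range and is served by one RegionServer
            for (HRegionLocation loc : locator.getAllRegionLocations()) {
                System.out.println(loc.getRegion().getRegionNameAsString()
                        + " -> " + loc.getServerName());
            }
        }
    }
}
```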
A row is identified by a unique RowKey (a byte array), which determines which Region (and thus which RegionServer) serves the row. Rows are stored in lexicographic byte order (sorted by RowKey), enabling efficient range scans.
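Because rows are sorted, a range scan only needs to touch the Regions covering that key range. A minimal sketch with the Java client, assuming a hypothetical users table keyed by user ID:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Rows are sorted by RowKey, so this reads only the range [user100, user200)
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user100"))  // inclusive
                    .withStopRow(Bytes.toBytes("user200"));  // exclusive
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```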
Columns are grouped into Column Families (defined at table creation time). Each Column Family is a separate storage unit, with its own compression, caching, and replication settings. For example, a table with Column Families cf1 (user profile) and cf2 (order history) will store data for each family in distinct HFiles.
Within a Column Family, columns are identified by a Column Qualifier (e.g., cf1:name, cf1:email). This allows dynamic addition of columns without schema changes.
The smallest unit of data, a Cell is identified by the combination of RowKey, Column Family, Column Qualifier, and Timestamp (version), and holds the stored value. Each RowKey/family/qualifier coordinate can store multiple versions of data (sorted in reverse chronological order, with the latest version first). Versions are retained based on per-family policies (e.g., keep the last 3 versions) to manage storage usage.
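Tying the model together, the sketch below creates the example table with cf1 and cf2 as separate column families, a three-version retention policy on cf1, and a TTL on cf2. The table name and settings mirror the examples above and are purely illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateUsersTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users"))
                    // cf1: user profile, keep the 3 most recent versions of each cell
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("cf1"))
                            .setMaxVersions(3)
                            .build())
                    // cf2: order history, stored in its own HFiles with its own settings
                    // (here: expire cells after 90 days)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                            .newBuilder(Bytes.toBytes("cf2"))
                            .setTimeToLive(90 * 24 * 60 * 60)
                            .build())
                    .build();
            admin.createTable(table);
        }
    }
}
```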
When data is written to HBase, it follows a three-step process to ensure durability and performance (sketched in code below):
- The write is first appended to the HLog (WAL), making it durable even if the server crashes.
- The data is then inserted into the MemStore of the target Region and column family, and the client is acknowledged.
- Once the MemStore exceeds its flush threshold, its contents are written to HDFS as a new, immutable HFile.
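A single Put exercises this whole path: it is appended to the WAL, buffered in the MemStore, and acknowledged. A minimal sketch, reusing the hypothetical table and families from above:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("user123"))  // RowKey
                    .addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"),
                               Bytes.toBytes("Alice"))
                    // SYNC_WAL forces the WAL append before the write is acknowledged
                    .setDurability(Durability.SYNC_WAL);
            table.put(put);  // server side: WAL append -> MemStore insert -> ack
        }
    }
}
```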
Reading data from HBase involves multiple layers to optimize performance:
- The MemStore is checked first for recently written, not-yet-flushed data.
- The BlockCache, an in-memory cache of recently read HFile blocks, is consulted next.
- Finally, HFiles on HDFS are read, using block indexes and Bloom filters to skip files that cannot contain the requested row.
The results from all layers are merged so the client always sees the newest matching versions.
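From the client's perspective this layering is transparent; a Get simply returns the merged result. A minimal sketch, again using the hypothetical users table:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // The RegionServer merges MemStore, BlockCache, and HFile data for this row
            Get get = new Get(Bytes.toBytes("user123"))
                    .addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```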
To maintain performance, HBase performs two critical background processes: compaction and region splitting.
Compaction merges multiple small HFiles into a single larger HFile. This reduces the number of files on HDFS (improving read performance) and removes deleted or expired data (based on TTL or version policies). There are two types of compaction: minor compaction, which merges a subset of adjacent small HFiles, and major compaction, which rewrites all HFiles in a store into one and permanently drops deleted or expired cells.
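Both kinds can also be requested through the Admin API, for example to schedule a major compaction after a bulk delete. A hedged sketch (the table name is hypothetical; the compaction itself still runs asynchronously on the RegionServers):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class CompactExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName users = TableName.valueOf("users");
            admin.compact(users);       // request a minor compaction
            admin.majorCompact(users);  // request a major compaction (rewrites all HFiles)
        }
    }
}
```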
When a Region grows too large (exceeding hbase.hregion.max.filesize), it is split into two smaller Regions, each containing roughly half of the original data. Both daughter Regions initially remain on the same RegionServer; the load balancer may later move one elsewhere to distribute load. Splitting is triggered automatically but can also be initiated manually.
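Manual splitting is also exposed via the Admin API. A minimal sketch; the split point below is an illustrative RowKey:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Ask HBase to split the Region containing "user500" at that RowKey
            admin.split(TableName.valueOf("users"), Bytes.toBytes("user500"));
        }
    }
}
```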
This architecture enables HBase to handle petabytes of data with low-latency reads/writes, making it suitable for use cases like real-time analytics, IoT data storage, and large-scale key-value lookups.