What is HBase
HBase is an open-source, distributed, scalable, non-relational (NoSQL, meaning "not only SQL") database with a multidimensional data model. HBase runs on top of HDFS (the Hadoop Distributed File System). It is designed to store large, sparse data sets in a fault-tolerant way.
HBase is built for random access: it provides fast random reads and writes, aiming for high throughput and low latency. This makes it a good choice for applications that require fast, random access to large data sets. It provides in-memory comparison and Bloom filtering. It is a column-oriented database that organizes data into rows and columns; each intersection of a row and a column is a cell.
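The row, column, and cell model described above can be pictured with a small in-memory sketch. This is only an illustration in plain Python, not the HBase API; the `table`, `put`, and `get` names and the `family:qualifier` column strings are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory model of a sparse table:
# row key -> {"family:qualifier": value}
table = defaultdict(dict)

def put(row_key, column, value):
    """Store one cell; only cells that are written take up space."""
    table[row_key][column] = value

def get(row_key, column):
    """Return the cell value, or None if that cell was never written."""
    return table.get(row_key, {}).get(column)

put("user1", "info:name", "Alice")
put("user1", "info:email", "alice@example.com")
put("user2", "info:name", "Bob")  # user2 has no email cell at all

print(get("user1", "info:email"))  # alice@example.com
print(get("user2", "info:email"))  # None
```

Because a missing cell is simply absent rather than stored as an empty value, a table with many optional columns stays compact, which is what "sparse" refers to.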
Why Use Apache HBase
HBase makes it feasible to read data in near real time. Data volumes grow day by day, and applications need to operate quickly on large data sets; HBase addresses this by allowing random reads and writes of structured and unstructured data. Hadoop on its own reads data sequentially, whereas HBase performs read/write operations in parallel and uses Bloom filters. A Bloom filter checks whether a value may be present in a set, which lets HBase skip data that cannot contain the requested value and so speeds up reads.
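To make the Bloom filter idea concrete, here is a minimal sketch in plain Python. It is not HBase's implementation; the class name, the bit-array size, and the use of salted SHA-256 hashes are assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Hypothetical minimal Bloom filter over string keys."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive several bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-001")
print(bf.might_contain("row-001"))  # True: added keys are never reported absent
print(bf.might_contain("row-999"))  # usually False; True would be a false positive
```

The useful property is the one-sided error: a negative answer is always correct, so a reader can safely skip a file whose filter says the key is absent.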
Because HBase provides fast read/write access to large data sets, it suits applications that need fast read/write operations at that scale. It supports ordered partitioning, in which the rows of a column family are stored in row-key order. It uses region-based scans for faster reads. HBase also provides strict consistency, because writes to a given row go through a single writer. It uses versioning to keep track of changes, which is one of the differences between HBase tables and RDBMS tables.
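Storing rows in row-key order is what makes range scans cheap: the start of a range can be located by binary search and the matching rows read contiguously. The sketch below shows the idea in plain Python; the row keys and the `scan` helper are hypothetical and do not use the HBase API.

```python
import bisect

# Hypothetical row keys, kept in sorted (lexicographic) order.
row_keys = sorted(["user#0042", "order#0007", "user#0001", "order#0100", "user#0420"])

def scan(start_row, stop_row):
    """Return the row keys in [start_row, stop_row), located by binary search."""
    lo = bisect.bisect_left(row_keys, start_row)
    hi = bisect.bisect_left(row_keys, stop_row)
    return row_keys[lo:hi]

# "$" is the character right after "#", so this range covers every "user#..." key.
print(scan("user#", "user$"))  # ['user#0001', 'user#0042', 'user#0420']
```

This is also why row-key design matters: rows that are read together should share a key prefix so that they sort next to each other.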
When to Use HBase
The key question is when HBase is the right choice. If an application works with a large data set and needs fast read/write operations and full CRUD support, HBase can be considered as a solution in the following scenarios:
- When the data is very large (petabytes or exabytes): a column-oriented approach stores the data of one column together, so it can be accessed faster.
- When the data is unstructured and a row-oriented approach can efficiently handle only a small number of rows and columns.
- When a large amount of semi-structured and unstructured data has to be processed, for which the column-oriented approach is well suited.
- When applications perform online analytical processing (OLAP), such as data mining, data warehousing, and other analytics.
- When transactional (ACID) guarantees are needed at the level of a single row.
- When multiple versions of the data exist and all of them need to be stored.
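The last scenario, keeping several versions of the same value, can be sketched as a cell that stores timestamped values and retains only the newest few. This is an illustrative model in plain Python; the `VersionedCell` class and its `max_versions` parameter are assumptions, not HBase API names.

```python
class VersionedCell:
    """Hypothetical cell that keeps up to max_versions timestamped values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        del self.versions[self.max_versions:]  # drop the oldest extras

    def get(self, timestamp=None):
        """Return the newest value at or before timestamp (latest if None)."""
        for ts, value in self.versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None

cell = VersionedCell(max_versions=3)
for ts, city in [(100, "Paris"), (200, "Berlin"), (300, "Rome"), (400, "Oslo")]:
    cell.put(ts, city)

print(cell.get())     # Oslo (latest version)
print(cell.get(250))  # Berlin (newest version at or before timestamp 250)
print(cell.get(100))  # None (the oldest version was dropped)
```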
NoSQL Database
NoSQL stands for "not only SQL". A NoSQL database can represent data in formats other than tables, and the different types of NoSQL database are distinguished by the format they use. NoSQL databases are schema-free, which removes the need to design tables before loading data into them, and they can hold structured, semi-structured, and unstructured data. Several types of NoSQL database are available and can be adopted according to the design and requirements.
Key Value Pair
This is the simplest type of NoSQL database. It is a schema-less database that contains keys and values: every item in the database is stored as an attribute name (the key) together with its value. A key-value store is good at processing a constant stream of read/write operations with low latency.
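As a rough sketch of the key-value model, the whole interface reduces to put, get, and delete by key. The class below is a hypothetical in-memory stand-in written for illustration, not the API of any particular product.

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op rather than an error.
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", "logged-in")
print(store.get("session:42"))  # logged-in
store.delete("session:42")
print(store.get("session:42"))  # None
```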
Document Database
A document database uses the same key-value approach, but the value is semi-structured data. It supports formats such as JSON and XML, and each such structure is treated as a document. A document consists of key-value pairs, key-array pairs, or nested documents.
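A small example may help show what such a document looks like. The JSON below is made up for illustration; it contains a plain key-value pair, a nested document, and a key-array pair, and is parsed with Python's standard `json` module.

```python
import json

# Hypothetical order document with a nested document and an array.
doc_text = """
{
  "_id": "order-1",
  "customer": {"name": "Alice", "city": "Paris"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ]
}
"""

order = json.loads(doc_text)
total_qty = sum(item["qty"] for item in order["items"])

print(order["customer"]["city"])  # Paris
print(total_qty)                  # 3
```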
Graph Store
A graph store can be a better approach for web-based applications. It helps store information about networks, such as social connections.
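The social-connection case can be sketched with an adjacency structure, where each person maps to the set of people they are connected to. The names and helper functions below are hypothetical and only illustrate the kind of query a graph store is meant to answer.

```python
from collections import defaultdict

# Hypothetical undirected social graph: person -> set of direct friends.
graph = defaultdict(set)

def add_friendship(a, b):
    graph[a].add(b)
    graph[b].add(a)

def friends_of_friends(person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph[friend]
    return result - direct - {person}

add_friendship("alice", "bob")
add_friendship("bob", "carol")
add_friendship("carol", "dave")
add_friendship("alice", "erin")

print(friends_of_friends("alice"))  # {'carol'}
```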
Wide Column Store
A column-oriented approach can be used to optimize access to large data sets. Wide column stores keep data in column format instead of row format and give optimized results for large data sets. Columns are logically grouped into column families, which can be created either when the schema is defined or at runtime. These databases provide fast read and write access.
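The difference between row format and column format can be shown with a few records. This sketch only illustrates the general column-oriented idea in plain Python, with made-up data; it does not model column families or any specific product's storage format.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Alice", "age": 30},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Carol", "age": 35},
]

# Column-oriented layout: the values of one column are kept together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one column touches only that column's values.
average_age = sum(columns["age"]) / len(columns["age"])

print(columns["name"])  # ['Alice', 'Bob', 'Carol']
print(average_age)      # 30.0
```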
Features of HBase
- HBase supports atomic read/write operations at the row level: while one process is reading or writing a row, other processes are prevented from reading or writing it, which gives consistent reads and writes.
- HBase allows any number of columns to be added at runtime.
- Provides automatic failure recovery by using the WAL (Write-Ahead Log) with the help of HDFS.
- For query optimization and high query performance, HBase supports Bloom filters and a block cache.
- Offers transparent, automatic splitting and redistribution of table contents, as the data is stored on a distributed file system.
- HBase provides high availability and supports failure and recovery across LAN and WAN.
- Supports programmatic access through a Java API.
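The write-ahead log feature listed above follows a simple rule: record every change in a durable log before applying it in memory, so the in-memory state can be rebuilt after a crash. The sketch below shows that rule in plain Python, assuming a local file in place of HDFS; the `TinyStore` class and its JSON log format are hypothetical and are not how HBase encodes its WAL.

```python
import json
import os
import tempfile

class TinyStore:
    """Hypothetical store that logs each write before applying it in memory."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}
        self._replay()

    def _replay(self):
        # On startup, rebuild the in-memory state from the log, if one exists.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                entry = json.loads(line)
                self.memstore[entry["row"]] = entry["value"]

    def put(self, row, value):
        # 1. Append to the log and force it to disk.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps({"row": row, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only then apply the change in memory.
        self.memstore[row] = value

wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(wal_file)
store.put("row1", "v1")
store.put("row2", "v2")

# Simulate a restart: a new instance recovers its state from the log alone.
recovered = TinyStore(wal_file)
print(recovered.memstore)  # {'row1': 'v1', 'row2': 'v2'}
```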
