Hive SerDe

SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization, and also interprets the serialized data as individual fields for processing.

Deserializer:- The deserializer in a Hive SerDe converts binary or string data into a Java object that Hive can process.

Serializer:- The serializer in a Hive SerDe converts a Java object into a format that can be stored in HDFS or a Hive table.

HDFS files -> Input File Format -> <key, value> -> Deserializer -> Row object

Row object -> Serializer -> <key, value> -> Output File Format -> HDFS files
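
To make the two flows above concrete, here is a minimal sketch in plain Java. It is not Hive's actual SerDe API: the class name ToySerDe, the comma delimiter, and the two-column layout (an integer id and a string name) are all assumptions made for illustration. It only shows the idea that a stored record becomes a row object on the way in, and a row object becomes a stored record on the way out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy class: it mirrors the two directions of a SerDe but does
// not implement or depend on any real Hive interface.
class ToySerDe {
    private static final String DELIMITER = ",";

    // Deserializer direction: stored text record -> row object (list of fields).
    // Assumes a two-column layout: an integer id followed by a string name.
    static List<Object> deserialize(String record) {
        String[] parts = record.split(DELIMITER, -1);
        List<Object> row = new ArrayList<>();
        row.add(Integer.parseInt(parts[0].trim()));
        row.add(parts[1]);
        return row;
    }

    // Serializer direction: row object -> text record ready to be written out.
    static String serialize(List<Object> row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) {
                sb.append(DELIMITER);
            }
            sb.append(row.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object> row = deserialize("1,alice");
        System.out.println(row);            // prints [1, alice]
        System.out.println(serialize(row)); // prints 1,alice
    }
}
```

In a real deployment the input and output file formats handle the HDFS side, and the SerDe only sees one record at a time, which is the part this sketch imitates.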

Why is Hive SerDe required?

Hive can operate on Java objects only. It does not read or write the stored data itself; it relies on SerDe classes to perform the read and write operations on Hive tables. Users can also write their own SerDe if the built-in ones do not meet their requirements.

Which SerDes does Hive support?

Apache Hive uses the lazy SerDe by default. It also ships with a few built-in SerDes that can be used as required.

    JsonSerDe stores data as plain text files in JSON format. It reads JSON files and loads them into Hive tables. The JSON SerDe is available in the HCatalog project; before that it was available in the hive-contrib project. It is available from Hive version 0.12 (hive-contrib) and later (hcatalog-core).

    AvroSerDe allows users to read and write data in the Avro file format through Hive tables. It is supported in Hive version 0.9.1 and later.

    ORC stands for Optimized Row Columnar, a highly efficient way to store Hive data. ORC files improve performance when Hive reads, writes and processes data, and the format was developed to overcome the limitations of other Hive file formats. It is supported in Hive version 0.11 and later.

    Parquet is a columnar storage format, best suited to column-oriented data. It was developed to support efficient compression and encoding schemes. It is supported in Hive version 0.13 and later.

    RegexSerDe stores data as plain text files and uses a regular expression to map each line of data to columns.

    CSV SerDe stores data as plain text files in CSV format. It has one limitation: it treats all columns as type string. Even if you declare a column with a non-string data type (e.g. int), the column is shown as string when you run the desc table or show create table command. To get the desired column type, create a view over the table or use a CAST operation. It is supported in Hive version 0.14 and later.
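
The Regex entry in the list above can be pictured with a short plain-Java sketch, in which each capturing group of a regular expression becomes one column of the row. This is not Hive code: the class name RegexRowSketch, the "<host> <method> <status>" line layout, and the choice to return null for a non-matching line are assumptions made for illustration. Every captured value is a string here, so turning one into a number is an explicit step, similar in spirit to the CAST mentioned for the CSV SerDe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: plain java.util.regex, no Hive classes involved.
class RegexRowSketch {
    // Assumed line layout for illustration: "<host> <method> <status>".
    private static final Pattern LINE = Pattern.compile("^(\\S+) (\\S+) (\\d+)$");

    // Returns one string per capturing group, or null when the line does not match.
    static List<String> toColumns(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> columns = new ArrayList<>();
        for (int g = 1; g <= m.groupCount(); g++) {
            columns.add(m.group(g));
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> columns = toColumns("127.0.0.1 GET 200");
        System.out.println(columns); // prints [127.0.0.1, GET, 200]
        // Explicit conversion of a string column to an int.
        int status = Integer.parseInt(columns.get(2));
        System.out.println(status + 1); // prints 201
    }
}
```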

Details Courtesy: cwiki.apache.org and Apache Hive Developer Guide
