Science – Future of Lithuania / Mokslas – Lietuvos Ateitis, Vol 9, No 3 (2017)

A Comparison of HDFS Compact Data Formats: Avro Versus Parquet

Daiga Plase (University of Latvia, Latvia)
Laila Niedrite (University of Latvia, Latvia)
Romans Taranovs (Riga Technical University, Latvia)

Abstract


This paper compares the compact file formats Avro and Parquet with plain text formats in terms of data query performance. Several data query patterns are evaluated. The experiments were run on CDH 5.4, Cloudera’s open-source Apache Hadoop distribution. The results show that the compact data formats (Avro and Parquet) take up less storage space than plain text data formats because they use binary encoding and compression. Furthermore, queries against the column-based Parquet format are faster than queries against the text formats and Avro.
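
The two effects summarized above (smaller binary storage and cheaper single-column reads from a columnar layout) can be illustrated with a minimal sketch. This is not the benchmark from the paper and it does not use the real Avro or Parquet encodings. The record schema (`id`, `key`, `value`), the synthetic data, and all function names are assumptions made only for illustration, and only the Python standard library is used.

```python
import random
import struct
import zlib

# Hypothetical fixed-width record: int32 id, int32 key, float64 value (16 bytes, no padding).
ROW_FORMAT = "<iid"


def make_rows(n, seed=0):
    """Generate n synthetic (id, key, value) records; purely illustrative data."""
    rng = random.Random(seed)
    return [(i, rng.randint(1_000_000, 9_999_999), rng.random() * 1000.0)
            for i in range(n)]


def encode_text(rows):
    """CSV-like plain text: every number is stored as decimal characters."""
    return "\n".join(f"{i},{k},{v!r}" for i, k, v in rows).encode("utf-8")


def encode_binary_rows(rows):
    """Row-oriented binary layout: whole records stored one after another."""
    return b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)


def encode_binary_columns(rows):
    """Column-oriented binary layout: all values of one field stored contiguously."""
    n = len(rows)
    return {
        "id": struct.pack(f"<{n}i", *(r[0] for r in rows)),
        "key": struct.pack(f"<{n}i", *(r[1] for r in rows)),
        "value": struct.pack(f"<{n}d", *(r[2] for r in rows)),
    }


def sum_key_from_rows(blob):
    """Row layout: every full record is decoded just to reach one field."""
    return sum(rec[1] for rec in struct.iter_unpack(ROW_FORMAT, blob))


def sum_key_from_columns(columns):
    """Column layout: only the bytes of the requested column are decoded."""
    key_bytes = columns["key"]
    return sum(struct.unpack(f"<{len(key_bytes) // 4}i", key_bytes))


if __name__ == "__main__":
    rows = make_rows(10_000)
    text = encode_text(rows)
    binary_rows = encode_binary_rows(rows)
    columns = encode_binary_columns(rows)

    print("text bytes:           ", len(text))
    print("binary bytes:         ", len(binary_rows))
    # zlib stands in here for whatever codec a real format would apply.
    for name, data in columns.items():
        print(f"column {name!r}: raw={len(data)} compressed={len(zlib.compress(data))}")
    print("bytes read for SUM(key), row layout:   ", len(binary_rows))
    print("bytes read for SUM(key), column layout:", len(columns["key"]))
```

In this toy setting, an aggregate over one field reads a quarter of the bytes in the column layout, and each column can be compressed on its own. Real formats add schemas, block structure, and more elaborate encodings, so actual ratios will differ.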

Article in: English

Article published: 2017-07-04

Keyword(s): Big Data; Hadoop; HDFS; Hive; Avro; Parquet.

DOI: 10.3846/mla.2017.1033

Science – Future of Lithuania / Mokslas – Lietuvos Ateitis ISSN 2029-2341, eISSN 2029-2252
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.