P-Codec: Parallel Compressed File Decompression Algorithm for Hadoop

  • Idris Hanafi, Southern Connecticut State University
  • Amal Abdel-Raouf, Computer Science Department, Southern Connecticut State University, USA and the Electronic Research Institute (ERI), Egypt
Keywords: Hadoop, MapReduce, HDFS, Compression, Parallelism

Abstract

The increasing amount and size of data handled by data analytics applications running on Hadoop has created a need for faster data processing. One effective method for handling large data sizes is compression. Data compression not only makes network I/O processing faster, but also provides better utilization of resources. However, this approach undermines one of Hadoop’s main strengths, the parallelism of map and reduce tasks. The number of map tasks created is determined by the size of the file, so compressing a large file reduces the number of mappers, which in turn decreases parallelism. Consequently, standard Hadoop takes longer to process compressed data. In this paper, we propose the design and implementation of a Parallel Compressed File Decompressor (P-Codec) that improves the performance of Hadoop when processing compressed data. P-Codec includes two modules. The first module decompresses data upon retrieval by a data node while the data is being uploaded to the Hadoop Distributed File System (HDFS). This reduces the runtime of a job by removing the burden of decompression from the MapReduce phase. The second module is a decompressed map task divider that increases parallelism by dynamically changing the map task split sizes based on the size of the final decompressed block. Our experimental results using five different MapReduce benchmarks show an average improvement of approximately 80% compared to standard Hadoop.
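
The relationship between input size and map task count described in the abstract can be illustrated with a small sketch. This is a simplified model, not the authors' implementation: it assumes one map task per split, a 128 MB split size, and hypothetical file sizes (a 4096 MB file compressed to 1024 MB), and it ignores the details of Hadoop's actual split computation.

```python
import math

# Assumed split size in MB (illustrative; configurable in a real cluster).
SPLIT_SIZE_MB = 128


def map_task_count(input_size_mb, split_size_mb=SPLIT_SIZE_MB):
    """Simplified model: one map task per split, at least one task."""
    return max(1, math.ceil(input_size_mb / split_size_mb))


# Hypothetical sizes, chosen only for illustration.
original_mb = 4096    # size of the data once decompressed
compressed_mb = 1024  # size of the same data when stored compressed

# Splits computed from the compressed size yield fewer map tasks.
tasks_from_compressed = map_task_count(compressed_mb)

# Splits computed from the decompressed size restore the parallelism.
tasks_from_decompressed = map_task_count(original_mb)

print(tasks_from_compressed, tasks_from_decompressed)
```

Under these assumptions, sizing splits by the decompressed data gives four times as many map tasks as sizing them by the compressed file, which is the loss of parallelism the abstract describes.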

Author Biographies

Idris Hanafi, Southern Connecticut State University

Graduate researcher in the field of Big Data Systems at Southern Connecticut State University (SCSU). He is also a first-year master's student at SCSU studying Computer Science.

Amal Abdel-Raouf, Computer Science Department, Southern Connecticut State University, USA and the Electronic Research Institute (ERI), Egypt

Professor at the Computer Science Department, Southern Connecticut State University, USA, and Researcher at the ERI, Egypt
Published
2016-05-24
How to Cite
Hanafi, I., & Abdel-Raouf, A. (2016). P-Codec: Parallel Compressed File Decompression Algorithm for Hadoop. INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY, 15(8), 6991-6998. https://doi.org/10.24297/ijct.v15i8.1500
Section
Articles