PERFORMING ANALYTICS ON HEALTHCARE DATA USING HADOOP HIVE by Upasna Sharma
Big Data analysis poses significant challenges for organizations seeking to extract meaningful information from the enormous variety and volume of openly available data related to finance, business, healthcare, and other domains. Data is generated not only by people but also by machines: satellites, wearable gadgets, wind turbines, and buildings equipped with recorders, smoke detectors, and cameras. Sensors installed on countless devices contribute substantially to Big Data generation. The main concern is therefore no longer the availability of data, but how to analyse it to draw new inferences and gain knowledge. The main aim of the proposed work is to analyse large data sets effectively and to compare the performance of Hive over MapReduce, Impala, and Hive over Spark in terms of data load time and average query time. In addition, query optimization has been performed using the columnar storage formats ORC and Parquet, of which ORC is the more recent in the Hadoop ecosystem. Impala outperforms the other platforms for query execution over plain text files, providing the lowest query latency. Hive over Spark proves an effective platform for storing and analysing large data sets when partitioning and compression are combined. Using the ORC and Parquet formats yields approximately a 50% improvement in query execution time for Hive over MapReduce. Impala does not support the ORC format, so it was evaluated only with Parquet, on which it shows a 60-70% improvement in query execution performance.
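To illustrate how partitioning and columnar storage combine in Hive, the following is a minimal HiveQL sketch. The table and column names (`patient_records`, `state`, etc.) are hypothetical placeholders, not the schema used in the study; only the `PARTITIONED BY` and `STORED AS ORC` clauses reflect the techniques the abstract describes.

```sql
-- Hypothetical healthcare table stored in the ORC columnar format
-- and partitioned by state, so queries filtering on state read
-- only the matching partition directories.
CREATE TABLE patient_records (
  patient_id   BIGINT,
  diagnosis    STRING,
  admit_date   DATE,
  total_charge DOUBLE
)
PARTITIONED BY (state STRING)
STORED AS ORC;

-- Allow dynamic partition values to come from the SELECT itself.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewriting a plain-text staging table into the ORC table compresses
-- the data; subsequent queries scan the smaller columnar files.
INSERT OVERWRITE TABLE patient_records PARTITION (state)
SELECT patient_id, diagnosis, admit_date, total_charge, state
FROM patient_records_text;
```

For the Parquet comparison, the same sketch applies with `STORED AS PARQUET`, which is the variant Impala can query, since Impala does not read ORC.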