
Pig and Hive: Hadoop stepping stones for data scientists - Part 1

Posted by Bala Deshpande on Tue, Jan 12, 2016 @ 10:12 AM

In today's data-overloaded world, there are five types of data that require the use of big data tools such as Hadoop:

  1. Clickstream data from websites 
  2. Text data from social media and websites
  3. Geo-coded data from satellite/GPS applications
  4. Log data from IT systems
  5. Sensor or machine generated data from manufacturing

All of the above exhibit the basic three V's of big data: volume, velocity and variety. Hadoop is growing in popularity because it allows companies not only to store (or retain) vast volumes of data, but also to process that data and use it as a strategic business asset, which is what data really is!

Biggest difference between Hadoop and traditional databases

In a relational database, a structure (a data model or schema) must be set up before the data is written to the database, which forces the data to fit into a particular data model. With Hadoop, however, data is quickly loaded into a file system in its raw format without any schema. When the data is later retrieved from the file system, any schema can be applied at that point to fit the specific use case and needs of the application.
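To make the schema-on-read idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the sample records, the field names and the `read_with_schema` helper are all hypothetical, and an in-memory string stands in for a file in a distributed file system. The point is only that the same raw bytes can be read under two different schemas, chosen at read time.

```python
import csv
import io

# Raw records are stored exactly as they arrived; nothing is enforced on write.
# (Hypothetical sample data standing in for a file in a distributed file system.)
RAW = "2016-01-12T10:12:00,sensor-7,71.3\n2016-01-12T10:12:05,sensor-9,68.9\n"


def read_with_schema(raw, schema):
    """Apply a schema, given as (field name, converter) pairs, at read time."""
    rows = []
    for values in csv.reader(io.StringIO(raw)):
        # zip() stops at the shorter input, so a schema that names fewer
        # fields than the record holds simply ignores the trailing fields.
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, values)})
    return rows


# Use case A: keep every field and parse the reading as a number.
full_schema = [("timestamp", str), ("sensor_id", str), ("reading", float)]

# Use case B: only care about which sensor reported and when.
id_schema = [("timestamp", str), ("sensor_id", str)]

readings = read_with_schema(RAW, full_schema)
sensors = read_with_schema(RAW, id_schema)
```

Neither schema had to exist when the data was written, which is the contrast with the relational approach described above.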

This does not imply that Hadoop will replace traditional databases completely! Hadoop simply enables businesses to store data which was typically too expensive and too voluminous to store earlier. This data belongs to one of the 5 types listed above.

We have discussed elsewhere in this blog how the last two data types will soon overwhelm the rest, especially with the coming of the Internet of Things era. Here are a few use cases, all of which relate to data type 5 above.

Use case 1: One of the most critical functions of manufacturing is product quality control. Manufacturing plants with elaborate quality control mechanisms receive billions of readouts from factory sensors designed to detect failures, and have turned to Hadoop to store, process and analyze this data cost-effectively.
Use case 2: Aircraft flight sensors continuously record and store critical operational parameters, which engineers and ground personnel can later digest, dissect and display as key metrics to increase fuel efficiency, reduce maintenance issues and predict or prevent critical failures.
Use case 3: Automotive black boxes continuously record and store critical engine and vehicle parameters, enabling applications similar to those in the aircraft use case. However, automotive applications may also involve collecting driving history data to understand driving patterns and optimize routes and fuel consumption.

But no matter what data type you have to deal with or what your use case is, there are a few common process steps that are always required if the objective is to generate insights from this mass of digital exhaust. The data must be

  1. loaded into the Hadoop file system,
  2. given some schema,
  3. transformed,
  4. analyzed
  5. and then visualized.
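The five steps above can be sketched end to end in a few lines of plain Python. This is only an illustration of the flow, not Hadoop, Pig or Hive code: the sample readouts, the field names, the 0.8 alert threshold and every function name are assumptions made up for this example.

```python
from collections import defaultdict

# Step 1: load. A list of raw lines stands in for files in the Hadoop file system.
raw_lines = [
    "plant-a,press-1,0.82",
    "plant-a,press-1,0.91",
    "plant-b,press-4,0.40",
    "plant-b,press-4,bad-value",
]


def apply_schema(line):
    """Step 2: name and type the fields; return None for lines that do not fit."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    plant, machine, value = parts
    try:
        return {"plant": plant, "machine": machine, "vibration": float(value)}
    except ValueError:
        return None


def transform(records, threshold=0.8):
    """Step 3: drop unparseable rows and flag readings above the threshold."""
    return [dict(r, alert=r["vibration"] > threshold)
            for r in records if r is not None]


def analyze(records):
    """Step 4: count alerts per machine."""
    counts = defaultdict(int)
    for r in records:
        counts[r["machine"]] += int(r["alert"])
    return dict(counts)


def visualize(counts):
    """Step 5: a text bar chart, one '#' per alert."""
    return [f"{machine}: {'#' * n}" for machine, n in sorted(counts.items())]


records = transform([apply_schema(line) for line in raw_lines])
alert_counts = analyze(records)
chart = visualize(alert_counts)
```

On real volumes each step would be handed to a tool from the Hadoop ecosystem rather than to a Python loop, but the order of the steps stays the same.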

The Hadoop ecosystem has developed a whole factory farm of tools suited to these processes, depending on the data types, sources and so on. This puts more load on data scientists, who must master these myriad tools in order to drive insights.

Pig and Hive are two of the most commonly used tools. Specifically, they allow data scientists to accomplish steps 2, 3 and, to some extent, 4 in the list above. These tools have many overlapping functions, and a beginner is sometimes confused about which tool to use when. A good rule of thumb is to use Pig for steps 2 and 3, and Hive for steps 3 and 4. In the next article of this series we will dig deeper into the similarities and differences between Pig and Hive.



Topics: big data