In today's data-overloaded world, five types of data commonly require big data tools such as Hadoop:
- Clickstream data from websites
- Text data from social media and websites
- Geo-coded data from satellite/GPS applications
- Log data from IT systems
- Sensor or machine generated data from manufacturing
All five meet the basic three V's of big data: volume, velocity and variety. Hadoop's growing popularity comes from the fact that it lets companies not only store (or retain) vast volumes of data, but also process that data and use it as the strategic business asset it really is.
The biggest difference between Hadoop and traditional databases
In a relational database, a structure (a data model or schema) must be set up before any data is written, which forces the data to fit that particular model; this is often called schema-on-write. With Hadoop, by contrast, data is quickly loaded into the file system in its raw format, without any schema. When the data is later retrieved from the file system, whatever schema fits the specific use case and needs of the application can be applied at that point; this is known as schema-on-read.
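The schema-on-read idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not Hadoop itself: the file contents, field names, and delimiter below are made up, but HDFS behaves the same way at scale, storing raw bytes and letting each job decide how to interpret them.

```python
import io

# 1. Write raw data with no schema attached: just lines of delimited text.
#    (io.StringIO stands in for a file in HDFS here.)
raw = io.StringIO()
raw.write("2016-03-01T10:00:00\t/home\t42\n")
raw.write("2016-03-01T10:00:05\t/pricing\t42\n")

# 2. Apply a schema only at read time, chosen to fit this use case.
#    A different job could read the same bytes with a different schema.
schema = ("timestamp", "page", "user_id")
raw.seek(0)
records = [dict(zip(schema, line.rstrip("\n").split("\t"))) for line in raw]

print(records[0]["page"])
```

The key point is that nothing about the storage step constrains the read step: the schema lives in the reading code, not in the data.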
This does not imply that Hadoop will replace traditional databases completely! Hadoop simply enables businesses to store data that was previously too voluminous, and therefore too expensive, to retain. That data typically belongs to one of the five types listed above.
We have discussed elsewhere in this blog how the last two data types will soon overwhelm the rest, especially as the internet of things era arrives. Here are a few use cases, all relating to data type 5 above.
But no matter which data type or use case you are dealing with, a few common process steps are always required if the objective is to generate insights from this mass of digital exhaust. The data must be
- loaded into the Hadoop file system,
- have a schema applied,
- and then visualized.
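The three steps above can be sketched end to end on a handful of server-log lines. In a real deployment the load step would target HDFS and the schema and aggregation work would run in a tool like Pig or Hive; the log lines and field names here are illustrative only.

```python
from collections import Counter

# Step 1, "load": raw log lines land in storage untouched, no schema.
raw_lines = [
    "192.168.0.1 GET /index 200",
    "192.168.0.2 GET /index 404",
    "192.168.0.1 GET /about 200",
]

# Step 2, apply a schema: split each line into named fields at read time.
fields = ("ip", "method", "path", "status")
records = [dict(zip(fields, line.split())) for line in raw_lines]

# Step 3, visualize: aggregate into something chart-ready (hits per path).
hits_per_path = Counter(r["path"] for r in records)
print(hits_per_path)
```

The aggregation in the last step is exactly the kind of work a Hive query or Pig script performs on a cluster, just at a vastly larger scale.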
The Hadoop ecosystem has developed a veritable factory of tools, each suited to particular steps, data types and sources. This puts an added load on data scientists, who must master these myriad tools in order to drive insights.
Pig and Hive are two of the most commonly used of these tools. Specifically, they let data scientists handle the schema-application step and, to some extent, the visualization step above. The two tools have many overlapping functions, and a beginner is sometimes confused about which one to use when. A good rule of thumb is to use Pig for loading and transforming the raw data, and Hive for applying a schema and querying the data into a form ready for visualization. In the next article of this series we will dig deeper into the similarities and differences between Pig and Hive.
Click for more details about specific Pig and Hive training ...