Battles in data science

In the late nineteenth century, electricity generation and distribution faced a major dilemma: whether to use direct current (DC) or alternating current (AC) to transmit electric power from generating stations to end users. Today, in data science, these kinds of either-or battles are becoming passé.

Technology was evolving quite rapidly (in a relative sense), and there were supporters in both camps; prominent in the DC camp was, of course, Thomas Edison. Around the same time in history, another technology war was fought between the internal combustion engine and the battery-powered electric car. The 1970s and '80s brought the war of videotape formats, VHS vs. Betamax, and as recently as 2008 we all witnessed the Blu-ray vs. HD DVD format war.

Today, the field of data science is also evolving at breakneck speed, and new technologies are coming out at a far more rapid pace than at any time in the 20th century. What is particularly striking is the short window of time each technology gets to prove itself. This superfast evolution makes the battles not so much A-versus-B as a matter of who can work effectively with whom to displace incumbents. In fact, since the technologies in question are still evolving, many of these battles may seem like non-contests because each technology targets a different immediate application. Technology A may come out and state that it is actually not competing with B but in fact complementing it (!) because of its intended use. This is quite true of some newly evolving technologies such as Hadoop and Spark, for instance.

Hadoop (whose 1.0 release is officially only three years and three months old as of March 2015!) allows businesses to capture and process structured and unstructured data using a distributed computing paradigm called MapReduce, and it allows all of this to happen on commodity hardware. Hadoop itself is, of course, open source, but several commercial vendors make it enterprise-ready. Hadoop offers a variety of tools to help with analytics: Pig for ETL, Hive for querying, and Mahout for machine learning. All of them work on the MapReduce framework, which requires data to be “mapped” (formatted into key-value pairs), shuffled across the cluster, sorted, and “reduced” (aggregating the values for each key). Whereas previously such large quantities of data could not even be processed, this framework lets us split the data among tens or hundreds of nodes and process it on all of them in a distributed fashion.
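As a single-process sketch (ignoring the actual distribution of work across nodes), the map, shuffle/sort, and reduce phases of a word count, MapReduce's canonical example, look roughly like this in Python; the variable names are illustrative, not Hadoop's API:

```python
from collections import defaultdict

# Toy "documents" standing in for input splits spread across nodes.
documents = [
    "big data on commodity hardware",
    "data science on big clusters",
]

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: group all values by key, as the cluster would
# after routing each key to a single reducer.
shuffled = defaultdict(list)
for key, value in sorted(mapped):
    shuffled[key].append(value)

# Reduce phase: aggregate the values collected for each key.
counts = {key: sum(values) for key, values in shuffled.items()}

print(counts)
```

In the real framework each phase runs in parallel on many machines, but the key-value contract between the phases is exactly this.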

However, there are overhead costs with MapReduce, particularly if the data processing is iterative. Because MapReduce discards the data that was loaded into memory after each map or reduce step is performed, iteratively analyzing data becomes time-intensive if not impossible. (Note that in data science we are typically not talking about one-pass computations such as ETL operations, but about iterative calculations that reuse the same dataset elements repeatedly.) Spark overcomes these hurdles by creating what is called a resilient distributed dataset (RDD). A significant difference in the way Spark processes data is that it persists these RDDs in memory, so that iterative computations become easy, or even possible at all. Initially one might think that Spark is likely to displace Hadoop, but that is not entirely correct: Spark still uses the distributed computing framework and can therefore work well with Hadoop. Early users claim that using Spark within Hadoop can speed up computing by anywhere from 2x to 100x! The combination of the two can actually threaten large database systems. Hadoop, because of its schema-free data loading, allows businesses to store virtually any kind of data, and with its schema-on-read facility it can impose structure after the data is already stored (the opposite of the schema-on-write required by traditional RDBMSs). Combined with the hyperfast processing offered by Spark, this can potentially erode the advantages database systems provide, such as fast analytical querying.
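The payoff of in-memory persistence can be shown with a toy Python sketch. Here `load_dataset` is a hypothetical stand-in for an expensive disk or HDFS read, not Spark's actual API; the point is only how often it runs:

```python
calls = {"loads": 0}

def load_dataset():
    # Stand-in for an expensive disk/HDFS read; counts how often it runs.
    calls["loads"] += 1
    return list(range(1_000_000))

# Without persistence: a MapReduce-style job re-materializes the data
# on every pass of an iterative algorithm.
for _ in range(3):
    total = sum(load_dataset())
loads_without_cache = calls["loads"]  # one load per iteration

# With RDD-style persistence: load once, keep it in memory, iterate cheaply.
calls["loads"] = 0
cached = load_dataset()
for _ in range(3):
    total = sum(cached)
loads_with_cache = calls["loads"]  # subsequent passes hit memory
```

Spark's `persist()`/`cache()` on an RDD buys exactly this second pattern, with the "resilient" part meaning a lost partition can be recomputed from its lineage.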
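Schema-on-read can likewise be sketched in a few lines of plain Python: raw records land in storage untouched, and a structure of our choosing is imposed only at query time. The `read_with_schema` helper is illustrative, not part of any Hadoop tool:

```python
import json

# Schema-free load: raw records are stored as-is, with no table design
# up front. Records need not even share the same fields.
raw_store = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "raj", "clicks": 7, "referrer": "email"}',  # extra field is fine
]

def read_with_schema(store, fields):
    # Schema-on-read: parse each raw record and project it onto the
    # fields the query cares about, ignoring everything else.
    for line in store:
        record = json.loads(line)
        yield tuple(record.get(field) for field in fields)

rows = list(read_with_schema(raw_store, ["user", "clicks"]))
```

A traditional RDBMS would instead reject the second record at load time unless the table schema already anticipated the `referrer` column.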

All of this evolution presents an important challenge for both businesses and data science practitioners: keeping pace with the technology. One thing, however, remains constant: before diving head first into any new thing or flavor of the month, it is critical to rank the most significant business problems first and then strategize about which technology is most appropriate for the top problems. If you start with the technology first, you end up with a solution in search of a problem. Don't do that; it is a guaranteed money pit. But no matter what, keep an eye out for these kinds of battles in data science.

Originally posted on Mon, Mar 16, 2015 @ 07:30 AM

Photo by Nick Fewings on Unsplash
