In the first article of this series on fault detection we discussed the background of the business problem that can be addressed by data science. In the second article we shared a conceptual data platform that would be required to perform the integration and analytics. In this third article of the series we will talk briefly about the challenges that are typical for a data science platform in this type of analysis, and some of the lessons learned.
3 main challenges for a data science platform and lessons learned
1. Unbalanced data
The dataset used in our benchmarking study typifies a very common problem in data science: unbalanced data. The overall fault rate was less than 5%. To train a model effectively on such data, we need a balanced training set to fit the model and an unbalanced, unseen test set to evaluate its performance. The details of this were explained in an earlier blog. The main lesson from this challenge is that there are two criteria of interest when evaluating model performance: the overall test accuracy and the test recall. Many models report high test accuracy but score poorly on recall. The best algorithm will maximize both of these values, as seen in the image, where the best performing algorithms cluster in the top right.
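To make the gap between accuracy and recall concrete, here is a minimal Python sketch. The label counts and the two "models" are hypothetical, chosen only to mimic a fault rate below 5%; they are not results from the benchmark study.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall); recall is computed for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall

# Hypothetical test set: 1000 examples, 40 of them faults (a 4% fault rate).
y_true = [1] * 40 + [0] * 960

# Hypothetical model A: never predicts a fault.
never_fault = [0] * 1000
acc_a, rec_a = accuracy_and_recall(y_true, never_fault)

# Hypothetical model B: catches 30 of the 40 faults, but raises 100 false alarms.
catches_faults = [1] * 30 + [0] * 10 + [1] * 100 + [0] * 860
acc_b, rec_b = accuracy_and_recall(y_true, catches_faults)

print(f"model A: accuracy={acc_a:.2f}, recall={rec_a:.2f}")
print(f"model B: accuracy={acc_b:.2f}, recall={rec_b:.2f}")
```

In this made-up case model A has the higher accuracy yet finds no faults at all, which is why both numbers need to be read together.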
2. Cross Validation
It is a non-trivial task to create a non-standard train/test split for the kind of unbalanced dataset that emerges from most preventive maintenance applications. None of the data science platforms we considered in our benchmark study had standard code for this, so it needed to be developed. One needs to split the minority and majority classes and work with each one separately to create the correct balance for the training and test sets. Special care also had to be taken when manipulating the indices to make sure that the same examples were not used in both the training and test sets. Finally, after creating the training and test sets, it was necessary to shuffle the datasets before using them. Very few programs provide this facility out of the box.
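The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function name, the 30% hold-out fraction, and the choice to balance the training set by undersampling the majority class are all hypothetical.

```python
import random

def balanced_train_unbalanced_test(labels, test_fraction=0.3, minority=1, seed=0):
    """Return (train_idx, test_idx): a balanced training set and a test set
    that keeps the natural class imbalance. Indices never overlap."""
    rng = random.Random(seed)

    # Split the indices by class and work with each class separately.
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    rng.shuffle(minority_idx)
    rng.shuffle(majority_idx)

    # Hold out the same fraction of each class, so the test set stays unbalanced.
    n_min_test = round(len(minority_idx) * test_fraction)
    n_maj_test = round(len(majority_idx) * test_fraction)
    test_idx = minority_idx[:n_min_test] + majority_idx[:n_maj_test]

    # Assumption: balance the training set by undersampling the majority class
    # down to the number of remaining minority examples.
    min_train = minority_idx[n_min_test:]
    maj_train = majority_idx[n_maj_test:][:len(min_train)]
    train_idx = min_train + maj_train

    # Guard against the same example landing in both sets.
    assert not set(train_idx) & set(test_idx), "train/test overlap"

    # Shuffle so the classes are not grouped together in either set.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Hypothetical dataset: 1000 examples with a 5% fault rate.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = balanced_train_unbalanced_test(labels)

train_faults = sum(labels[i] for i in train_idx)
test_faults = sum(labels[i] for i in test_idx)
print(f"train: {len(train_idx)} examples, {train_faults} faults")
print(f"test:  {len(test_idx)} examples, {test_faults} faults")
```

Undersampling discards most of the majority-class training examples; oversampling the minority class is another option, with the same requirement that the test indices are set aside first.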
3. Comparing algorithms in a data science platform: apples-to-oranges
It is not always easy to perform an apples-to-apples comparison of an algorithm across data science platforms. Each algorithm has its own set of tuning parameters and they don’t necessarily overlap. Care has to be exercised in ensuring that the parameter sets are uniform across software platforms to make one-to-one comparisons. Even when the parameters can be set identically, the optimum for one software tool will not necessarily be the optimum for another. So if any of the platforms returns a performance value completely out of the range of the rest, this may require additional exploration. Finally, in-memory performance may not match distributed performance: this is another factor that may skew software performance on a fault detection dataset.
Originally posted on Thu, Aug 27, 2015 @ 08:00 AM