Using a chi-squared calculator to improve classification accuracy
In a previous article we discussed how to use logistic regression to perform a simple classification and interpret the coefficients of the model. In this article we will share some methods for improving the model's performance and addressing common data issues.
Recall that we are trying to predict survival for the passengers of the Titanic. The data for this model comes from the free-to-participate (and also prize-free) Kaggle competition. Using only two predictors, gender (sex) and travel class (pclass), our model achieved about 78% accuracy on the training data. Here is a view of the data set (go to Kaggle for the full data!):
Based on this data set, our local Meetup group developed additional derived attributes, shown in yellow in the table below. The idea is to see whether we can manually "split" the data before submitting it to a solver.
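The exact definitions the group used for the derived attributes are not spelled out above, so the following pandas sketch is only an illustration of how such attributes might be derived from the original columns; the age cut-off for child and the rules for spouse and together are assumptions, not the group's actual definitions.

```python
import pandas as pd

# A few illustrative passenger rows (not the real Kaggle data).
df = pd.DataFrame({
    "age":   [4, 35, 58, 30],
    "sibsp": [1, 1, 0, 0],   # siblings/spouses aboard
    "parch": [1, 1, 0, 0],   # parents/children aboard
})

# Assumed derivations -- illustrative only:
df["child"] = (df["age"] < 13).astype(int)                      # assumed age cut-off
df["spouse"] = ((df["age"] >= 18) & (df["sibsp"] > 0)).astype(int)
df["together"] = ((df["sibsp"] + df["parch"]) > 0).astype(int)  # travelling with family
print(df)
```

Each derived column is a simple categorical flag computed from existing attributes, which is what makes a chi-squared screen a natural next step.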
However, before submitting all these attributes blindly to a data mining algorithm, it is a good idea to check whether any of the new attributes add predictive value. To do this we can run a quick feature-selection pass. Since most of the added attributes are categorical, we used a chi-squared test of independence to verify whether these new attributes are indeed useful.
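A chi-squared test of independence works on a contingency table of attribute values versus the target. As a minimal sketch (using scipy rather than KeyConnect, and with made-up counts rather than the real Titanic figures):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: sex vs. survived. The counts are illustrative, not the
# actual Titanic numbers.
sex = pd.Series(["female"] * 3 + ["male"] * 5, name="sex")
survived = pd.Series([1, 1, 1, 0, 0, 0, 0, 1], name="survived")

# Build the contingency table and run the test of independence.
table = pd.crosstab(sex, survived)
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```

A small p-value indicates the attribute and the target are unlikely to be independent, i.e. the attribute probably carries predictive value.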
Applying this sort of feature selection is easy with a tool like KeyConnect, which has a built-in chi-squared calculator. (This video explains how to use KeyConnect for precisely this type of problem.) As a comparative benchmark we also included the attributes sex and pclass (because we know they are significant contributors to accuracy). The image below shows the dependencies between the target variable (survived) and the remaining categorical variables.
If we set a 5% influence level on the target (green line) as the cut-off, we can reduce the dimension of our attribute set from 10 to 6. pclass and sex are still the dominant variables; among the newly derived variables, only together, child and spouse seem to have a strong effect on the target. embarked, parch and sibsp are the attributes from the original data set that round out the categorical predictors with more than 5% influence on the target variable.
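KeyConnect's exact "influence" metric is not documented here; a common analogue of this cut-off step is to keep only the attributes whose chi-squared p-value against the target falls below a significance level. A sketch of that screening loop, under that assumption:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def select_by_chi2(df, target, alpha=0.05):
    """Keep categorical columns whose chi-squared test of independence
    against the target is significant at the given level."""
    keep = []
    for col in df.columns.drop(target):
        table = pd.crosstab(df[col], df[target])
        _, p, _, _ = chi2_contingency(table)
        if p < alpha:
            keep.append(col)
    return keep

# Synthetic check: 'sex' is perfectly associated with the target,
# 'noise' is independent of it.
df = pd.DataFrame({
    "survived": [1] * 20 + [0] * 20,
    "sex":      ["f"] * 20 + ["m"] * 20,
    "noise":    ["a", "b"] * 20,
})
selected = select_by_chi2(df, "survived")
print(selected)
```

On this synthetic frame only sex survives the screen, which mirrors the dimensionality reduction described above: informative attributes stay, uninformative ones are dropped.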
The table below shows the performance of a Weka Logistic Regression operator used in our predictive model.
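The Weka pipeline itself is not shown above; as a rough stand-in, here is how the same kind of model could be fit on the selected categorical predictors with scikit-learn's LogisticRegression (the data below is a tiny synthetic stand-in, not the real training set, so the accuracy it prints is not the article's 78% figure):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the Titanic training data.
df = pd.DataFrame({
    "sex":      ["female", "male"] * 10,
    "pclass":   [1, 3, 2, 3, 1, 3, 2, 3, 1, 3] * 2,
    "survived": [1, 0] * 10,
})

# One-hot encode the categorical predictors, as a logistic model expects
# numeric inputs.
X = pd.get_dummies(df[["sex", "pclass"]], columns=["sex", "pclass"])
y = df["survived"]

model = LogisticRegression().fit(X, y)
acc = model.score(X, y)   # accuracy on the training data, as in the article
print(f"training accuracy: {acc:.2f}")
```

Scoring on the training data matches how the article reports accuracy; for a leaderboard submission one would of course evaluate on held-out data instead.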
One question we have to address is the significant increase in the computation time when we add the ticket and group attributes to the model. From this analysis it seems that the slight increase in predictive accuracy is not really justified by a 60x increase in computational expense.
Can we achieve better accuracy? Judging by the leaderboard, a few more percentage points are clearly attainable. The data set currently has several attributes with missing values, and one would expect the accuracy to improve depending on how those variables are handled. In an upcoming article we will discuss a systematic process for dealing with missing values.
Try out the basic version of KeyConnect which is always free.