

Blog - The Analytics Compass


Using a chi-squared calculator to improve classification accuracy

  
  
  

In a previous article we showed how to use logistic regression to perform a simple classification and how to interpret the model's coefficients. In this article we share some methods for improving the model's performance and for addressing common data issues.

Recall that we are trying to predict which passengers of the Titanic survived. The data for this model comes from the free-to-enter (and also prize-free) Kaggle competition. Using only two predictors, gender and travel class, our model achieved about 78% accuracy on the training data. Here is a view of the data set (go to Kaggle for the full data!):

[Image: Titanic data for the logistic regression model]

Starting from this data set, our local Meetup group developed additional derived attributes, highlighted in yellow in the table below. The idea is to see if we can manually "split" the data before submitting it to a solver.

[Image: Titanic derived data for the logistic regression model]

However, before blindly submitting all these attributes to a data mining algorithm, it is a good idea to check whether any of the new attributes add predictive value. To do this we can run a quick feature selection process. Since most of the added attributes were categorical, we used a chi-squared test of independence to check whether they are indeed useful.
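
To make the test concrete, here is a minimal sketch of a chi-squared test of independence in plain Python. It is not how KeyConnect computes its result; the function name `chi_squared_independence` and the toy columns are our own illustrative assumptions, not the actual Kaggle data.

```python
from collections import Counter


def chi_squared_independence(xs, ys):
    """Return (statistic, degrees_of_freedom) for two categorical columns.

    Hypothetical helper for illustration only.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    observed = Counter(zip(xs, ys))  # cell counts of the contingency table
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    statistic = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            # Count expected in this cell if the two columns were independent
            expected = r_total * c_total / n
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy data (NOT the Titanic data): 20 made-up passengers
survived = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
sex = ["f"] * 10 + ["m"] * 10   # strongly related to the target
noise = ["a", "b"] * 10         # unrelated to the target

stat_sex, dof_sex = chi_squared_independence(sex, survived)
stat_noise, dof_noise = chi_squared_independence(noise, survived)
print(stat_sex, dof_sex)
print(stat_noise, dof_noise)
```

With one degree of freedom, a statistic above roughly 3.84 rejects independence at the 5% significance level, so in this toy example `sex` would be kept and `noise` dropped.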

Applying this sort of feature selection is easy with a tool like KeyConnect, which has a built-in chi-squared calculator. (This video explains how to use KeyConnect for precisely this type of problem.) As a comparative benchmark we also included the attributes sex and pclass, because we know they are significant contributors to accuracy. The image below shows the dependencies between the target variable (survived) and the remaining categorical variables.

[Image: feature selection on the Kaggle data using the KeyConnect chi-squared calculator]

If we set a 5% influence level on the target (green line) as the cut-off, we can reduce the dimension of our attribute set from 10 to 6. pclass and sex are still the dominant variables. Among the newly derived variables, only together, child and spouse seem to have a strong effect on the target. embarked, parch and sibsp are the attributes from the original data set that round out the categorical predictors with more than 5% influence on the target variable.
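
The cut-off step can be sketched in code as well. We do not know exactly how KeyConnect defines its influence measure, so the sketch below substitutes Cramér's V (the chi-squared statistic rescaled to the range 0 to 1) as a stand-in; a cut-off of 0.05 here is therefore not equivalent to the 5% line in the chart. The names `cramers_v` and `select_features` and the toy table are assumptions for illustration.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two categorical columns (0 = none, 1 = perfect)."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    rows, cols = Counter(xs), Counter(ys)
    smaller = min(len(rows), len(cols)) - 1
    if smaller == 0:
        return 0.0  # a constant column carries no information
    chi2 = sum(
        (observed[(r, c)] - rows[r] * cols[c] / n) ** 2 / (rows[r] * cols[c] / n)
        for r in rows
        for c in cols
    )
    return math.sqrt(chi2 / (n * smaller))


def select_features(table, target, cutoff):
    """Score every column against the target and keep those above the cutoff."""
    scores = {
        name: cramers_v(values, table[target])
        for name, values in table.items()
        if name != target
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score > cutoff]
    return kept, scores


# Toy table (NOT the Titanic data)
toy_table = {
    "survived": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
    "sex": ["f"] * 10 + ["m"] * 10,
    "noise": ["a", "b"] * 10,
}
kept, scores = select_features(toy_table, target="survived", cutoff=0.05)
print(kept, scores)
```

Sorting the scores before filtering mirrors the ranked bar chart: the strongest attributes come first, and everything below the line is discarded.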

The table below shows the performance of a Weka Logistic Regression operator used in our predictive model.

[Image: classification accuracy improvement using chi-squared feature selection]

One question we have to address is the significant increase in computation time when we add the ticket and group attributes to the model. From this analysis, the slight increase in predictive accuracy does not seem to justify a 60x increase in computational expense.
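
One likely explanation, which we have not verified for this model, is that attributes with many distinct values expand into many indicator columns when a logistic regression encodes them as binary variables. The sketch below simply counts how many such columns a set of categorical attributes would produce; `one_hot_width` and the toy rows are hypothetical and only illustrate the effect.

```python
def one_hot_width(rows, columns):
    """Count the indicator columns a one-hot encoding of `columns` would create."""
    return sum(len({row[col] for row in rows}) for col in columns)


# Toy rows (NOT the Titanic data): ticket values are almost all unique
rows = [
    {"sex": "f", "pclass": 1, "ticket": "T-001"},
    {"sex": "m", "pclass": 3, "ticket": "T-002"},
    {"sex": "f", "pclass": 3, "ticket": "T-003"},
    {"sex": "m", "pclass": 1, "ticket": "T-004"},
    {"sex": "m", "pclass": 2, "ticket": "T-004"},
]

base_width = one_hot_width(rows, ["sex", "pclass"])
wide_width = one_hot_width(rows, ["sex", "pclass", "ticket"])
print(base_width, wide_width)
```

On a real data set with hundreds of distinct ticket values, the encoded model would have hundreds of extra coefficients to fit, which is consistent with a large jump in run time for little gain in accuracy.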

Can we achieve better accuracy? Judging by the leaderboard, a few more points are clearly possible. The data set currently has several attributes with missing values, and one would expect the accuracy to improve depending on how these are handled. In an upcoming article we will discuss a systematic process for dealing with missing values.

Try out the basic version of KeyConnect, which is always free.


 
