
Learning Data Science: feature selection for clustering

Posted by Bala Deshpande on Wed, Apr 13, 2016 @ 02:07 PM

We frequently get questions about whether we have chosen all the right parameters to build a machine learning model. There are two scenarios: either we have plenty of attributes (or variables) and need to select the best ones, or we have only a handful of attributes and need to know whether they are impactful. Both are classic feature engineering challenges.

Most of the time, feature selection questions pop up as a prelude to model building. Recently, however, one of the trainees in our data science course, drawing on his experience working with some real data, asked: "can we tell which attributes were most important in determining why a particular example (or data point) ended up in a particular cluster?"

Two things made this question unique: first, it is feature selection in reverse; second, feature selection typically does not get enough attention in unsupervised techniques (such as clustering).

In this article, we will show how quickly the question can be answered, especially if you are comfortable with a tool like RapidMiner. All you need to do is pull in any standard example set (such as the Iris dataset used here), build a k-means clustering model, and add an attribute weighting operator after the model is built. There are, of course, a few details, such as how to verify that the attribute ranking actually worked. These are described below, along with the XML for implementing this simple but instructive use case.

Step 1: No feature selection

Pull in the Iris example set, Normalize the data using Z-transformation, and Rename the variables. Put together the process as shown below, noting that the Select Attributes operator in the middle is disabled for Step 1. After building a k-means Clustering model (with k = 3), we change the roles of two attributes: the flower name becomes an id variable (meaning it will not count as a feature) and cluster becomes the label. This allows us to rank the attributes with an operator such as Weight by Information Gain Ratio.

[Figure: the RapidMiner process layout]

When we run this process, we obtain 3 clusters roughly corresponding to the 3 flower types. However, as the bar chart below shows, there is considerable error, particularly in the versicolor and virginica groups. For a perfect separation, we should see the 3 colors (red, green and blue) as 3 separate bars.

[Figure: bar chart of cluster membership by flower type]
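
If you want to sanity-check this outside RapidMiner, here is a minimal scikit-learn sketch of Step 1. It is an analogue of the process above, not the process itself: it z-normalizes the Iris data, runs k-means with k = 3, and tabulates clusters against the true species to expose the contamination just described. The random_state values are arbitrary, chosen only for reproducibility.

# A rough scikit-learn analogue of Step 1 (not the RapidMiner process itself).
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # Z-transformation
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Contingency table of cluster id vs. true species: a clean clustering
# would show one dominant species per cluster row.
species = pd.Series(iris.target_names[iris.target], name="species")
print(pd.crosstab(pd.Series(clusters, name="cluster"), species))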

When we check the results of the attribute ranking, we see that 2 of the 4 attributes, petal W and petal L, have the highest weights, while sepal W and sepal L rank much lower. Of course, flower name, which is the id, has 0 weight (as it should).

[Figure: attribute weights from Weight by Information Gain Ratio]
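
Continuing the sketch above, the role changes in the RapidMiner process can be approximated by treating the cluster id as the label and scoring each feature against it. Mutual information is used here as a stand-in for RapidMiner's information gain ratio weighting; the two are related but not identical measures.

# Rank features against the cluster assignment (continues the snippet above).
# Mutual information stands in for Weight by Information Gain Ratio.
from sklearn.feature_selection import mutual_info_classif

weights = mutual_info_classif(X, clusters, random_state=42)
for name, w in sorted(zip(iris.feature_names, weights), key=lambda t: -t[1]):
    print(f"{name}: {w:.3f}")
# Expect the petal measurements to rank well above the sepal ones.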


Step 2: Include feature selection

Based on the feature ranking table from above, we now deselect the sepal * variables by enabling the Select Attributes operator. When we run this process, the clusters show much better separation between the 3 flower types: each cluster now contains (mostly) a single flower type. This demonstrates that feature selection can benefit unsupervised learning as well. It also answers the question raised earlier: the petal * attributes have higher significance in determining which cluster an example from this data set belongs to. The figure below shows significantly less contamination of the wrong class in each cluster compared to before; a scikit-learn sketch of this step follows the figure.

[Figure: bar chart of cluster membership after feature selection]
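
The same step can be sketched in scikit-learn by keeping only the petal columns and re-clustering. This continues the snippets above; in scikit-learn's Iris feature ordering, columns 2 and 3 are petal length and petal width.

# Step 2 sketch: re-cluster on the petal features only (continues from above).
X_petal = X[:, 2:4]  # petal length, petal width
clusters2 = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_petal)
print(pd.crosstab(pd.Series(clusters2, name="cluster"), species))
# The off-diagonal counts should shrink noticeably compared to Step 1.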

For more on feature selection, read some of the earlier articles on this blog. The final XML of the above process is included below.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.0.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="normalize" compatibility="7.0.001" expanded="true" height="103" name="Normalize" width="90" x="45" y="136"/>
      <operator activated="true" class="rename" compatibility="7.0.001" expanded="true" height="82" name="Rename" width="90" x="45" y="289">
        <parameter key="old_name" value="label"/>
        <parameter key="new_name" value="flower type"/>
        <list key="rename_additional_attributes">
          <parameter key="a1" value="sepal W"/>
          <parameter key="a2" value="sepal L"/>
          <parameter key="a3" value="petal W"/>
          <parameter key="a4" value="petal L"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="136">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="sepal L|sepal W"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="k_means" compatibility="7.0.001" expanded="true" height="82" name="Clustering" width="90" x="447" y="34">
        <parameter key="k" value="3"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.0.001" expanded="true" height="82" name="Set Role" width="90" x="447" y="136">
        <parameter key="attribute_name" value="cluster"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="flower type" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="weight_by_information_gain_ratio" compatibility="7.0.001" expanded="true" height="82" name="Weight by Information Gain Ratio" width="90" x="447" y="289">
        <parameter key="normalize_weights" value="true"/>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Weight by Information Gain Ratio" to_port="example set"/>
      <connect from_op="Weight by Information Gain Ratio" from_port="weights" to_port="result 2"/>
      <connect from_op="Weight by Information Gain Ratio" from_port="example set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

Topics: k-means clustering, feature engineering