The Analytics Compass Blog is aimed at two types of readers:
individuals who want to build analytics expertise and
small businesses who want to understand how analytics can help them improve their business performance.
This is the second and concluding part of the article that shows how one of the disadvantages of principal component analysis (PCA) for feature selection or dimension reduction can be addressed using mutual-information-based tools.
In a previous article series we discussed the application of principal component analysis (PCA) using RapidMiner to reduce the dimension of a dataset. One point made there was that raw data is often not the best input for PCA, and that a normalization (based on z-scores or ranges) needs to be applied to the data before running it. The RapidMiner process below illustrates this setup.
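Separately from the RapidMiner process, the effect of normalization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the income and age figures are made up, the helper names (zscore, covariance, first_pc) are hypothetical, and the closed-form eigenvector only works for the two-variable case shown here.

```python
import math
import statistics

def zscore(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)
    return [(x - mean) / std for x in column]

def covariance(a, b):
    """Sample covariance of two equal-length columns."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def first_pc(a, b):
    """Unit-length direction of the first principal component of two columns."""
    saa, sbb, sab = covariance(a, a), covariance(b, b), covariance(a, b)
    # Largest eigenvalue of the 2x2 covariance matrix, in closed form.
    top = (saa + sbb) / 2 + math.sqrt(((saa - sbb) / 2) ** 2 + sab ** 2)
    # (sab, top - saa) is an eigenvector for that eigenvalue.
    vx, vy = sab, top - saa
    norm = math.hypot(vx, vy)
    if norm == 0:
        # Uncorrelated columns where the first one has the larger variance.
        return 1.0, 0.0
    return vx / norm, vy / norm

# Hypothetical data: income in dollars, age in years.
income = [30000, 45000, 52000, 61000, 75000, 88000]
age = [25, 38, 31, 47, 42, 55]

raw_direction = first_pc(income, age)
scaled_direction = first_pc(zscore(income), zscore(age))

print("first PC on raw data:      ", raw_direction)
print("first PC on z-scored data: ", scaled_direction)
```

On the raw data the first component points almost entirely along income, simply because income is measured in much larger units. After z-scoring, both variables contribute equally to the component.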
In earlier articles we discussed how to build a simple, additive cost model to aggregate data from multiple production or distribution centers to obtain a "corporate" level model for cost forecasting. We also presented a means to build a "multi-faceted aggregate cost model" by selecting critical factors from each individual cost center using the t-test.
In many previous articles we have described the use of the chi-square test for detecting relationships between variables. It was pointed out that if the variables are numerical, we can use a simple correlation analysis or a mutual-information-based analysis. However, the chi-square test is usually the only option if we have nominal or categorical variables. The chi-square test can also be used with numerical variables by converting them into nominal or categorical types.
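The last point, converting numerical variables into nominal ones before applying the chi-square test, can be sketched as follows. This is a minimal illustration, not the procedure used in the articles: the data is made up, the median split is just one of many possible binning choices, and the function names are hypothetical. The statistic is compared against 3.841, the standard 5% critical value for one degree of freedom.

```python
from collections import Counter
import statistics

def to_nominal(values, labels=("low", "high")):
    """Convert a numeric column to two categories by splitting at the median."""
    cut = statistics.median(values)
    return [labels[1] if v > cut else labels[0] for v in values]

def chi_square(xs, ys):
    """Pearson chi-square statistic and degrees of freedom for two nominal columns."""
    n = len(xs)
    observed = Counter(zip(xs, ys))
    row_totals, col_totals = Counter(xs), Counter(ys)
    stat = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            # Expected count if the two variables were independent.
            expected = row_total * col_total / n
            stat += (observed[(row, col)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical numeric data: hours of machine use and a quality score.
hours = list(range(1, 21))
score = [41, 45, 43, 50, 48, 72, 52, 55, 75, 54,
         70, 58, 74, 78, 57, 80, 82, 79, 85, 88]

stat, dof = chi_square(to_nominal(hours), to_nominal(score))
CRITICAL_5_PERCENT_DOF_1 = 3.841  # chi-square critical value, 1 degree of freedom

print("chi-square statistic:", round(stat, 3), "with", dof, "degree(s) of freedom")
print("related at the 5% level:", stat > CRITICAL_5_PERCENT_DOF_1)
```

Binning discards information, so the result depends on where the cut points are placed. A finer binning needs more data to keep the expected count in each cell reasonably large.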
In this article we will showcase a simple example to illustrate the use of KeyConnect, a mutual information based feature selection tool, to improve the predictive accuracy of models. A couple of quick caveats regarding KeyConnect:
In a previous article we discussed situations in which we need to combine data from several different distribution centers to develop an aggregate cost model. There we developed a very basic or "naive" aggregation of data to generate a final cost forecasting model. In this article we will present an approach for intelligently combining data from different centers to develop an aggregate cost model.
There is an old proverb that says "the herbs that grow in your own backyard are usually the best medicine". This saying is pretty apt today when we consider data and its value for manufacturing. Dave Evans, Chief Futurist at Cisco Systems, highlighted ten trends that will shape future manufacturing. This was presented at the Manufacturing Leadership summit and is nicely summarized here. One of these trends was called The Zettaflood is Coming: Big data is allowing us to predict more things and change the way we plan.
Customer Lifetime Value emphasizes that value is derived from the customer relationship. This emphasis orients the firm towards the customer, enabling a customer-centric business culture. Customer-centricity is contrasted with product or service-centricity. Direct marketing theory is based on customer-centricity.
A common but basic question on implementing decision tree analyses using RapidMiner is the following: "How do I deploy the decision tree that I have trained using my data?" What the questioner really wants is to classify some new data using the decision tree they have built in RapidMiner.
We indicated that there are two main types of feature selection algorithms: wrapper type and filter type. A wrapper algorithm works within another machine learning program such as multiple linear regression. Good examples are backward elimination and forward selection. Each iteration removes a variable from, or introduces a variable into, the regression model so as to improve model performance. The iterations stop when a preset performance criterion (such as adjusted R-squared or RMS error) is reached or exceeded. The inherent advantage of wrapper-type methods is that multicollinearity issues are handled automatically. However, you gain no knowledge, either beforehand or afterwards, about the actual relationships between the variables.
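As a rough sketch of the wrapper idea, the Python below runs forward selection around an ordinary least-squares regression and stops once a preset adjusted R-squared target is reached. Everything here is an assumption made for illustration: the data is made up (y depends on x1 and x2, while x3 is unrelated), the function names and the 0.95 target are hypothetical, and the small normal-equations solver stands in for whatever regression routine a real tool would use.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def adjusted_r2(columns, y):
    """Fit y on the given columns plus an intercept; return adjusted R-squared."""
    n, k = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k + 1)]
           for p in range(k + 1)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k + 1)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

def forward_selection(features, y, target=0.95):
    """Add one feature per iteration until adjusted R-squared reaches the target."""
    selected, best = [], 0.0  # the intercept-only model has adjusted R-squared 0
    remaining = list(features)
    while remaining and best < target:
        scores = {name: adjusted_r2([features[f] for f in selected + [name]], y)
                  for name in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best:
            break  # no remaining feature improves the model
        selected.append(candidate)
        remaining.remove(candidate)
        best = scores[candidate]
    return selected, best

# Hypothetical data: y is built from x1 and x2 plus small noise; x3 is unrelated.
features = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    "x3": [2, 7, 1, 8, 2, 8, 1, 8, 2, 8],
}
noise = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1]
y = [2 * a + 3 * b + e for a, b, e in zip(features["x1"], features["x2"], noise)]

selected, best = forward_selection(features, y)
print("selected features:", selected)
print("adjusted R-squared:", round(best, 4))
```

Backward elimination follows the same pattern in reverse: start with all features and drop one per iteration. In both cases the output is only a list of retained variables and a performance score, which is why the procedure says little about how the variables relate to one another.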