“Truth is a Pathless Land”

...but finding an effective solution to your business problem does not have to be. The business analytics landscape can certainly appear pathless, with a myriad of techniques and vendor tools on the market.

Simafore provides tools and expertise to:

  • Integrate data
  • Select and deploy appropriate analytics
  • Institutionalize processes

About this Blog

The Analytics Compass Blog is aimed at two types of readers:

  • individuals who want to build analytics expertise and 

  • small businesses who want to understand how analytics can help them improve their business performance. 

If you fall into one of these categories, join hundreds of others and subscribe now!


FREE SMB Survey Report

  • Our report (pdf) is a survey of analytics usage and needs of more than 100 SMBs
  • Find out which analytics applications have
    • Highest value
    • Most demand
    • Best ROI

Is your business analytics ready?

Find out now by taking the free grader!


Blog - The Analytics Compass


Feature selection with mutual information, Part 2: PCA disadvantages

  
  
  

This is the second and concluding part of the article which shows how one of the disadvantages of principal component analysis (PCA) for feature selection or dimension reduction can be addressed using mutual information based tools.
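As a rough illustration of the quantity that mutual information based tools rank features by, the sketch below estimates mutual information between two discrete variables from their observed frequencies. The function name and the toy data are assumptions made for this example; they are not taken from the article or from any particular tool.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_indep = (count_x[x] / n) * (count_y[y] / n)
        # Each observed pair contributes p(x, y) * log2(p(x, y) / (p(x) * p(y))).
        mi += p_joint * log2(p_joint / p_indep)
    return mi


# Hypothetical toy data: one feature determines the target, the other does not.
target = [0, 0, 1, 1]
informative = [0, 0, 1, 1]
uninformative = [0, 1, 0, 1]

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
```

A feature that shares more information with the target gets a higher score, and a filter type method would keep the highest scoring features.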

Feature selection with mutual information, Part 1: PCA disadvantages

  
  
  

In a previous article series we discussed using principal component analysis (PCA) in RapidMiner to reduce the dimension of a dataset. One point made there was that raw data is often not the best input for PCA, and that a normalization (based on z-scores or ranges) needs to be applied to the data before running it. The RapidMiner process below illustrates this setup.
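PCA is driven by variance, so an attribute measured in large units can dominate the components unless the data is rescaled first. The two normalizations mentioned above can be sketched in a few lines of plain Python. The column names and values are made up for illustration, and this is not the RapidMiner process itself.

```python
from statistics import mean, pstdev


def z_score(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no variance
    return [(x - mu) / sigma for x in column]


def range_scale(column):
    """Rescale a column linearly onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical attributes on very different scales.
income = [30000.0, 45000.0, 60000.0, 75000.0]
rooms = [2.0, 3.0, 3.0, 4.0]

income_z = z_score(income)
rooms_z = z_score(rooms)
rooms_ranged = range_scale(rooms)
```

After z-scoring, both attributes have the same variance, so neither one dominates the principal components simply because of its units.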

Cost forecasting by aggregating time series data for cost modeling

  
  
  

In earlier articles we discussed how to build a simple, additive cost model that aggregates data from multiple production or distribution centers into a "corporate" level model for cost forecasting. We also presented a means to build a "multi-faceted aggregate cost model" by selecting critical factors from each individual cost center using the t-test.

Using chi-square test for feature selection and key driver analysis

  
  
  

In many previous articles we have described the use of the chi-square test for detecting relationships between variables. It was pointed out that if the variables are numerical, we can use a simple correlation analysis or a mutual information based analysis. However, the chi-square test is usually the only option if we have nominal or categorical variables. The chi-square test can also be used with numerical variables by converting them into nominal or categorical types.
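For readers who want to see the arithmetic, here is a minimal sketch of the chi-square statistic computed from a contingency table of observed counts. The table values are invented for illustration, and the function name is an assumption of this example.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical 2x2 table: rows are two customer segments, columns are buy / no buy.
related = [[30, 10], [10, 30]]
unrelated = [[20, 20], [20, 20]]

stat_related, dof_related = chi_square_statistic(related)
stat_unrelated, dof_unrelated = chi_square_statistic(unrelated)

# For 1 degree of freedom the 5% critical value is about 3.841.
is_significant = stat_related > 3.841
```

A statistic well above the critical value suggests the two variables are related; a value near zero is what independence would produce.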

Higher model accuracy with mutual information based feature selection

  
  
  

In this article we will showcase a simple example to illustrate the use of KeyConnect, a mutual information based feature selection tool, to improve the predictive accuracy of models. A couple of quick caveats regarding KeyConnect:

Using the t-test to build aggregate cost models for cost forecasting

  
  
  

In a previous article we discussed situations in which we need to combine data from several different distribution centers to develop an aggregate cost model. There we developed a very basic or "naive" aggregation of data to generate a final cost forecasting model. In this article we will present an approach for intelligently combining data from different centers to develop an aggregate cost model.

7 reasons why big data for manufacturing analytics is yesterday's news

  
  
  

There is an old proverb that says "the herbs that grow in your own backyard are usually the best medicine". This saying is pretty apt today when we consider data and its value for manufacturing. Dave Evans, Chief Futurist at Cisco Systems, highlighted ten trends which will shape future manufacturing. This was presented at the Manufacturing Leadership summit and is nicely summarized here. One of these trends was called "The Zettaflood is Coming": big data is allowing us to predict more things and change the way we plan.

The importance of customer lifetime value formula for business

  
  
  

Customer Lifetime Value emphasizes that value is derived from the customer relationship. This emphasis orients the firm towards the customer, enabling a customer-centric business culture. Customer-centricity is contrasted with product or service-centricity. Direct marketing theory is based on customer-centricity.

Deploying decision trees to classify new samples using RapidMiner

  
  
  

A common but basic question on implementing decision tree analyses using RapidMiner is the following: "How do I deploy the decision tree that I have trained using my data?" What the questioner really wants is to classify some new data using the decision tree they have built in RapidMiner.
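Conceptually, deploying a tree means walking each new sample from the root to a leaf and reading off the label. In RapidMiner this is typically handled by applying the trained model to the new example set; the sketch below only mimics the idea in plain Python. The tree structure, attribute names, and samples are hypothetical stand-ins for a tree learned from training data.

```python
# Hypothetical tree, standing in for one that was learned from training data.
# Internal nodes are dicts; leaves are class labels.
TREE = {
    "feature": "outlook",
    "branches": {
        "sunny": {"feature": "humidity_high", "branches": {True: "no", False: "yes"}},
        "overcast": "yes",
        "rain": {"feature": "windy", "branches": {True: "no", False: "yes"}},
    },
}


def classify(tree, sample):
    """Follow the branches that match the sample until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node


# New, unlabeled samples that were not part of the training data.
new_samples = [
    {"outlook": "sunny", "humidity_high": True, "windy": False},
    {"outlook": "overcast", "humidity_high": True, "windy": True},
    {"outlook": "rain", "humidity_high": False, "windy": True},
]

predictions = [classify(TREE, sample) for sample in new_samples]
```

The key point is that the new data must carry the same attributes, with the same names and value types, as the data used to train the tree.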

Mutual information based filter vs. wrapper type feature selection

  
  
  

We indicated that there are two main types of feature selection algorithms: wrapper type and filter type. A wrapper algorithm works within another machine learning program such as multiple linear regression. Good examples are backward elimination and forward selection. Each iteration uses a regression model to either remove or introduce a variable in a way that improves model performance. The iterations stop when a preset performance criterion (such as adjusted r-square or RMS error) is reached or exceeded. The inherent advantage of wrapper type methods is that multi-collinearity issues are automatically handled. However, you get no prior knowledge about the actual relationships between the variables (nor are you likely to be interested in them afterwards).
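The wrapper idea can be sketched as a small forward selection loop around an ordinary least squares fit, stopping once a preset RMS error is reached. This is a minimal illustration under stated assumptions: the data is synthetic, the helper names are invented for this example, and a real analysis would evaluate the error on held-out data rather than on the training rows.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rms_error(rows, y, features):
    """Fit least squares (with intercept) on the chosen features; return RMS error."""
    design = [[1.0] + [row[f] for f in features] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    residuals = [t - sum(b * v for b, v in zip(beta, d)) for d, t in zip(design, y)]
    return (sum(r * r for r in residuals) / len(y)) ** 0.5


def forward_selection(rows, y, target_rmse):
    """Add one feature per iteration until the RMS error target is reached."""
    selected = []
    remaining = list(range(len(rows[0])))
    best = rms_error(rows, y, selected)
    while remaining and best > target_rmse:
        score, feature = min((rms_error(rows, y, selected + [f]), f) for f in remaining)
        if score >= best:
            break  # no remaining feature improves the model
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Synthetic data: y = 2*x0 + 3*x1 + 1, while x2 is unrelated to y.
rows = [[1, 2, 5], [2, 1, 3], [3, 4, 8], [4, 3, 1], [5, 6, 7], [6, 5, 2]]
y = [9, 8, 19, 18, 29, 28]

selected, final_rmse = forward_selection(rows, y, target_rmse=0.01)
```

Backward elimination follows the same pattern in reverse: start with all variables and drop the one whose removal hurts performance least at each step.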
