Predictive analytics is the use of advanced statistical and data mining techniques to address business problems. There are a couple of commonly used and accepted data mining processes or methodologies to follow. The best known is the Cross-Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of industry partners and consultants in the mid-1990s.
The CRISP-DM methodology is simply a broad outline of steps to follow for data mining. It starts with Business Understanding, followed by Data Understanding, Data Preparation, Modeling, Evaluation, and finally Deployment of the model. Vendors such as SAS have developed their own processes to help people use their tools. SAS calls its process "SEMMA", for Sample, Explore, Modify, Model and Assess. SAS is careful to state that this is not a methodology, but rather a workflow that is commonly used and recommended with its data mining products.
Whichever process you follow, the details more or less come down to the following six steps, framed here as questions.
Question 1: Are we developing an analytics solution for a "one-shot deal", i.e., simply answering a specific business question, or for an ongoing process? This question establishes the purpose of building a model, and how you answer it affects the questions that follow.
Question 2: Where is the data? This seems like an obvious question, but in many cases the data is not available from a single source. It may be spread across multiple databases, some internal and some external, and must be brought together before any analysis can start.
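As a rough illustration of pulling two sources together, here is a minimal Python sketch of a left join on a shared key. The record layout and the names `internal`, `external`, `customer_id`, `spend` and `region` are all made up for this example; real sources would more likely be database tables or files.

```python
# Hypothetical records from two separate sources (names are illustrative only).
internal = [
    {"customer_id": 1, "spend": 120.0},
    {"customer_id": 2, "spend": 75.5},
    {"customer_id": 3, "spend": 210.0},
]
external = [
    {"customer_id": 1, "region": "West"},
    {"customer_id": 2, "region": "East"},
]

def left_join(left, right, key):
    """Keep every row of `left`; attach matching `right` fields, or None if no match."""
    lookup = {row[key]: row for row in right}
    right_cols = {col for row in right for col in row if col != key}
    merged = []
    for row in left:
        combined = {col: None for col in right_cols}  # default when no match exists
        combined.update(lookup.get(row[key], {}))
        combined.update(row)  # values from the left source take precedence
        merged.append(combined)
    return merged

merged = left_join(internal, external, "customer_id")
```

Customer 3 has no external record, so its `region` comes back as `None`, which is exactly the kind of missing value that Question 3 has to deal with.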
Question 3: What kind of preparation is needed for the data? Typically this involves normalizing the data, developing a strategy for missing values, and identifying and removing outliers. Visually exploring the dataset (using histograms, scatter plots, box-and-whisker plots, bubble plots, etc.) is often the first step. Even looking at a spreadsheet can sometimes help (if the data set is small)!
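The three preparation tasks mentioned above can be sketched in a few lines of Python. This is only one possible set of choices, and all of them are assumptions for illustration: missing values are filled with the median, outliers are dropped using the common 1.5 x IQR rule, and the result is min-max scaled to the range 0 to 1. The `prepare` function and the sample values are hypothetical.

```python
import statistics

def prepare(values):
    """Impute missing values, drop IQR outliers, then min-max scale to [0, 1]."""
    # 1. Missing values: replace None with the median of the observed values.
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]

    # 2. Outliers: keep only points within 1.5 * IQR of the quartiles.
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in filled if low <= v <= high]

    # 3. Normalization: rescale the remaining values to the range [0, 1].
    lo, hi = min(kept), max(kept)
    if hi == lo:
        return [0.0 for _ in kept]
    return [(v - lo) / (hi - lo) for v in kept]

raw = [10, 12, None, 11, 13, 12, 95, 14]  # one missing value, one suspicious reading
cleaned = prepare(raw)
```

Whether an extreme value such as 95 is a data error or a genuine observation is a business judgment, so dropping it automatically, as this sketch does, is not always the right call.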
Question 4: Do we need to reduce the data dimension? When predicting a target or response variable, it is important to select predictors that are correlated with the response. Including predictors that are unrelated to the response tends to increase the variance of the model (reducing precision), while leaving out predictors that are related to it tends to increase the error, or bias (reducing accuracy). Running a correlation or a bivariate analysis will help select the right variables.
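A simple way to run such a correlation screen is to compute the Pearson correlation of each candidate predictor with the response and rank the predictors by its absolute value. The sketch below assumes small in-memory lists; the column names `ad_spend` and `store_id` and the numbers are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (spread_x * spread_y)

def rank_predictors(predictors, response):
    """Return (name, correlation) pairs, strongest absolute correlation first."""
    scores = {name: pearson(col, response) for name, col in predictors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical data: one predictor that tracks the response, one that does not.
response = [2, 4, 6, 8, 10]
predictors = {
    "store_id": [3, 1, 4, 1, 5],
    "ad_spend": [1, 2, 3, 4, 5],
}
ranked = rank_predictors(predictors, response)
```

Keep in mind that Pearson correlation only measures linear association, so a predictor with a strong nonlinear relationship to the response can still score low on this screen.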
Question 5: Is this a classification or prediction problem? This depends entirely on the response variable. A binary (Yes or No) or categorical (multiple classes) response variable will require classification techniques. A numerical target variable will require prediction techniques, which are often called regression techniques.
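The rule above can be written down as a small check on the response column. This is a deliberately naive sketch: the `problem_type` function is hypothetical, and it assumes classes are stored as text labels or booleans. Classes encoded as numbers (for example 0 and 1) would be misread as a prediction problem, so a real check needs some knowledge of what the column means.

```python
def problem_type(response):
    """Guess the problem type from the values of the response variable."""
    values = [v for v in response if v is not None]
    if not values:
        raise ValueError("response variable has no observed values")

    # Text labels or booleans: a binary or categorical response.
    if all(isinstance(v, (str, bool)) for v in values):
        return "classification"

    # Numbers (excluding booleans, which Python treats as ints): a numerical response.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "prediction"

    raise ValueError("response variable mixes labels and numbers")

churn_kind = problem_type(["Yes", "No", "No", "Yes"])
sales_kind = problem_type([1520.0, 980.5, 1310.0])
```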
Question 6: Which technique to use? Clearly the choice will be (somewhat) narrowed after answering Question 5. However, with the computing power available today, this question may be best answered by trying out several techniques, in parallel or in sequence, and comparing the results!
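Trying several techniques usually means fitting each one on the same training data and comparing them on data held back for testing. The sketch below does this for two very simple prediction techniques, a mean baseline and a one-variable least-squares line, scored by root mean squared error (RMSE). The techniques, the data and the names are assumptions chosen to keep the example short; the same loop works for any set of candidate models.

```python
import math

def fit_mean(xs, ys):
    """Baseline: always predict the average of the training responses."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Simple linear regression fitted by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on the given data."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical data, split into a training set and a held-out test set.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x, test_y = [7, 8], [14.1, 15.9]

candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {
    name: rmse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)  # technique with the lowest held-out error
```

Scoring on held-out data rather than on the training data matters here, because a flexible technique can look excellent on the data it was fitted to and still predict poorly on new cases.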
Nevertheless, it is beneficial to draw on past experience and select techniques that have worked well before for certain classes of problems. visTASC is a completely free repository of such associations, and can help you with this question. Try visTASC by signing up below.