A typical problem facing many manufacturers is developing accurate price quotes for their customers. For example, many automotive subsystems that go into a car are complex assemblies of hundreds of parts. Every time a new vehicle is designed, many of these subsystems need to be changed in some way – from minor adjustments to, sometimes, an entirely new design. However, the underlying parts or “ingredients” rarely vary; only their configurations change. For example, an electrical subsystem will still need copper, plastic and some metal, but the amounts of these materials and the way in which they are put together can vary significantly. In this article we explore how the data science technique of k-means clustering can be applied to price and cost modeling.
Suppliers receive Requests for Proposals (RFPs) from car manufacturers before the launch of a new design cycle. Responding quickly and accurately to these RFPs can make or break business. Doing so requires a deep understanding of the supplier's own subsystems, from materials to engineering to manufacturing. Developing that level of understanding is possible with accumulated data.
Data mining can be effectively employed to quickly identify relationships between the various underlying parts and the total cost to manufacture. It does not matter whether labor costs are relatively invariant and material costs highly variable, or vice versa, or both are highly variable. As long as data exist that relate the hundreds (or thousands) of historical designs to their respective costs, data mining can help.
One of the first things we can do is segment the products into families or groups. Manufacturers typically have their own internal classifications based on the type of vehicle or the specific application within vehicles. However, when we start to mine the data, it is better to set aside these strategic classifications and let the data determine the natural groups. We can employ any of several clustering techniques to do this.
If we can break down every subsystem into its constituent parts (called attributes, the same term used in data mining), we can identify which of them are the key performance indicators that drive cost. Before building such a model, however, we can learn much about the data by using clustering as a preprocessing step. The idea is to group subsystems into clusters and associate each cluster with a representative cost: Cluster 1 might typically cost $3, Cluster 2 might cost $4 on average, and so on.
Then, every time a new RFP rolls in, we can quickly identify the cluster into which the new design falls and get a good idea of the representative cost to manufacture it.
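To make the idea concrete, here is a minimal sketch in Python using scikit-learn. Everything in it is an assumption made for illustration: the three material attributes, the synthetic historical designs, the made-up cost formula, the choice of three clusters, and the helper name `quote`. Standardizing the attributes before clustering is also our choice here, so that no single material dominates the distance calculation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical historical designs: grams of copper, plastic and steel per subsystem.
rng = np.random.default_rng(0)
group_means = np.array([[50.0, 200.0, 30.0],
                        [120.0, 150.0, 80.0],
                        [300.0, 400.0, 150.0]])
X = np.vstack([m + rng.normal(0, 5, size=(40, 3)) for m in group_means])

# Made-up unit costs: a fixed amount plus material content plus noise.
cost = 2.0 + X @ np.array([0.008, 0.002, 0.004]) + rng.normal(0, 0.05, size=len(X))

# Put all attributes on a common scale, then cluster.
scaler = StandardScaler().fit(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(X))

# Representative cost per cluster: mean historical cost of its members.
rep_cost = np.array([cost[km.labels_ == c].mean() for c in range(km.n_clusters)])

def quote(design):
    """Return (cluster id, representative cost) for a new design."""
    c = int(km.predict(scaler.transform(np.atleast_2d(design)))[0])
    return c, float(rep_cost[c])

# A new RFP whose design resembles the second group of historical designs.
new_design = [115.0, 155.0, 78.0]
cluster_id, estimate = quote(new_design)
print(f"cluster {cluster_id}, representative cost ${estimate:.2f}")
```

The cluster mean is only a first estimate; the spread of costs within the cluster gives a sense of how much to trust it.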
Once we have run a basic clustering, we must also look at which attributes are the key drivers of cluster separation. These same attributes might also help in subsequent modeling to determine more accurate pricing using linear or nonlinear regression models. If you use k-means clustering, you can identify them by plotting each attribute's centroid value for the various clusters in a parallel chart. Packages such as RapidMiner provide built-in tools for this kind of analysis.
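Outside a tool like RapidMiner, a rough numerical stand-in for eyeballing the centroid chart is to rank attributes by how far apart their centroid values are across clusters. The sketch below is one possible way to do that; the function name, the attribute names and the centroid values are all made up for illustration, and it assumes the attributes were standardized before clustering so that the spreads are comparable.

```python
import numpy as np

def rank_attributes_by_centroid_spread(centroids, names):
    """Sort attributes by the range (max - min) of their centroid values across clusters."""
    centroids = np.asarray(centroids, dtype=float)
    spread = centroids.max(axis=0) - centroids.min(axis=0)
    order = np.argsort(spread)[::-1]  # largest spread first
    return [(names[i], float(spread[i])) for i in order]

# Made-up centroids: rows are clusters, columns are standardized attributes.
centroids = [[-1.2,  0.1,  0.9,  0.00],
             [ 0.3,  0.0, -1.1,  0.15],
             [ 1.0, -0.1,  0.2, -0.10]]
names = ["attr_1", "attr_2", "attr_3", "attr_4"]

ranking = rank_attributes_by_centroid_spread(centroids, names)
for name, spread in ranking:
    print(f"{name}: {spread:.2f}")
```

Attributes at the top of the list are the ones whose lines would diverge most in the parallel chart.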
For example, in the centroid chart for this analysis, attributes 3, 4, 5 and 13 have such divergent centroids that they are quite likely the drivers of the clustering. These attributes may also contribute to the cost variances between the different clusters. While we cannot conclude that a priori, this analysis gives us a starting point for building regression models that use these few attributes as the independent variables (cost being the dependent or predicted variable).
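As a sketch of that next step, the following fits an ordinary least squares model of cost on a small subset of attributes using NumPy. The data are synthetic and the "driver" columns are chosen by construction, so this only illustrates the mechanics; with real data, the subset would come from the centroid analysis and the fit would need to be validated on held-out designs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical attribute matrix with 6 standardized columns.
# By construction, only columns 0 and 2 influence cost.
X = rng.normal(size=(n, 6))
cost = 3.0 + 0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(0, 0.1, size=n)

# Regress cost on the suspected driver attributes only.
drivers = [0, 2]
A = np.column_stack([np.ones(n), X[:, drivers]])  # intercept + drivers
coef, *_ = np.linalg.lstsq(A, cost, rcond=None)

# Coefficient of determination on the training data.
pred = A @ coef
r2 = 1.0 - np.sum((cost - pred) ** 2) / np.sum((cost - cost.mean()) ** 2)

print("intercept and slopes:", np.round(coef, 2))
print("R^2:", round(float(r2), 3))
```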
Before we get to that point, however, we also need to know whether these clusters are the best possible groupings we can arrive at. Cluster performance measures such as intracluster and intercluster distances help answer that question: we want the intracluster distances (within a cluster) to be small and the intercluster distances (between clusters) to be large. A commonly used evaluation measure that combines both ideas is the Davies-Bouldin index. It measures how distinct the clusters are, taking into account both the cohesiveness of each cluster (the distance between its data points and its center) and the separation between clusters. The lower the value of the Davies-Bouldin index, the better the clustering.
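One common way to use the index is to run k-means for several values of k and keep the value with the lowest score. The sketch below does this with scikit-learn's `davies_bouldin_score` on synthetic two-dimensional data that we generate around three made-up centers; the range of k values tried is also an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Synthetic data: three well-separated groups of 50 points each.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + rng.normal(0, 1.0, size=(50, 2)) for c in centers])

# Score k-means solutions for a range of cluster counts.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

# Lower Davies-Bouldin index means tighter, better separated clusters.
best_k = min(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: Davies-Bouldin = {s:.3f}")
print("best k:", best_k)
```

On real cost data the minimum is rarely this clear-cut, so it is worth checking whether the chosen grouping also makes sense from an engineering standpoint.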
Originally posted on Tue, Nov 12, 2013 @ 09:06 AM