Data Mining: Hidden Relationships
June 2012
Ivan Obolensky
On Tuesday morning, Tom Larking, the store manager of a local supermarket chain, received an email from corporate headquarters. It said that during the Wednesday and Friday night aisle restocking, a large Bud Lite six-pack display was to be placed at the end of each of the diaper and baby aisles. Further, the prices of all beer and diapers were to be raised to the top of the pricing scale for Thursday and Saturday only.
Why did the manager receive such a message?
The message was the result of a data mining search done at the home office. The search discovered that at this particular store, men bought diaper cases in large quantities on Thursdays and Saturdays and also picked up a six-pack or two of Bud Lite. By setting the prices higher and making beer more accessible, corporate was able to tap into buyer purchasing patterns to move select items at higher prices.
Strategically, the store can now use the profitability gained on these sales to offset the discounting of other items that are not moving well and still make profitability targets.
Imagine this analysis being done on hundreds of items with specific instructions as to shelf layout and pricing. By data mining and statistical analysis of sales, inventory, shelf space, restocking and movements of specific items, large retail chains can strategically manage their inventories to gain better performance.1
There are two important data mining points to this story:
Firstly, the knowledge that large numbers of men come into this particular store and buy diapers and beer on Thursdays and Saturdays is surprising. Chances are that without a data mining search, the correlation would never have been discovered.
Secondly, it was the combination of knowledge and the ability to utilize it that allowed the store to take advantage of a real-world link between beer and diapers.
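One common way to measure whether two items sell together more often than chance would suggest is a statistic called lift. The following is only a minimal sketch of the idea, not the method the chain in the story used. The baskets are invented for illustration, and the function name lift is a hypothetical helper.

```python
from typing import List, Set


def lift(baskets: List[Set[str]], a: str, b: str) -> float:
    """Return P(a and b) / (P(a) * P(b)) over a list of shopping baskets.

    A value above 1 means the two items appear together more often than
    they would if purchases of each were independent of the other.
    Assumes both items appear in at least one basket.
    """
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)


# Invented example data: each set is one customer's purchase.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"milk", "chips"},
    {"bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

beer_diaper_lift = lift(baskets, "diapers", "beer")
print(beer_diaper_lift)
```

A lift well above 1 only flags a pattern worth investigating. Whether the pattern is real and usable is a separate question.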
This is data mining at work, and it is happening everywhere from the gas station convenience store to high-end retailers, from the IRS to the Department of Homeland Security, and from Las Vegas casinos to credit card companies.
Data mining is the discovery of previously unknown and potentially useful relationships from information. The term covers several disciplines. It starts with the procurement and storage of information in databases. The data is prepared for analysis and then subjected to various algorithms and statistical methods, as well as artificial intelligence, to process the data and discover non-obvious relationships. Once a non-obvious pattern is found, it must be verified as real, and then a plan of action thought out to take advantage of the information.
The most expensive part of data mining is the gathering, storing, and transformation of the data into a useable format so that it can be processed. Data sets these days often involve several terabytes worth of data. A terabyte is 1,000 billion bytes of information. It is a costly process and finding meaningful and useful relationships can require a great deal of work.
The first step is simply to describe the data. An example might be car license plate numbers. Each number has to be entered one at a time correctly. Another category might be make of car and year of model. This means translating the manufacturer into one number and the model type into another and entering them next to the correct license plate number.2
Once entered into a database, one can use the processing power of a computer to summarize this information such as simply adding up all the cars with valid license plates that are made by a specific manufacturer. One could expand this by doing a year-by-year census and finding out the number of cars on the road for each maker and then graphing the results. One could break the data down into new cars added that year. By analyzing the data in graph form, one can get an idea of which car company is putting the largest number of new cars on the road in a particular year.
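As a rough illustration of this kind of summary, the sketch below counts cars by maker and by maker and year. The records, the maker names, and the helper top_maker_for_year are all invented for the example, and the model year is assumed to stand in for the year a car was added to the road.

```python
from collections import Counter

# Invented registration records: (license plate, maker, model year).
records = [
    ("ABC123", "Maker A", 2010),
    ("DEF456", "Maker A", 2011),
    ("GHI789", "Maker B", 2011),
    ("JKL012", "Maker A", 2011),
    ("MNO345", "Maker B", 2012),
]

# Total cars on the road for each maker.
cars_per_maker = Counter(maker for _, maker, _ in records)

# Year-by-year census: cars for each (maker, model year) pair.
cars_per_maker_year = Counter((maker, year) for _, maker, year in records)


def top_maker_for_year(year):
    """Return the maker with the most cars of the given model year, or None."""
    counts = Counter(m for _, m, model_year in records if model_year == year)
    return counts.most_common(1)[0][0] if counts else None


print(cars_per_maker)
print(top_maker_for_year(2011))
```

The counts in cars_per_maker_year are what one would plot to compare makers year by year.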
Although this data is interesting, it is not particularly useful except on a factual basis. In and of itself, this information does not build a predictive model of the future that a company or an individual can act on to their advantage.
Suppose there is a strong correlation between purchases of a certain type of automotive product and ownership of twelve-year-old white cars. This might be significant. But even with this potentially valuable information, two questions remain. Is there an actual link between customers who own twelve-year-old white cars and product sales? And can the information be used in such a way as to create a competitive advantage?
Data mining has to do with finding patterns, but how valuable those patterns are and what can be done with them depends on the user having expert understanding of their business and being able to verify that the pattern actually exists.
This real-world linkage is emphasized because correlation and coincidence do not imply causation. Just because something happens at the same time as something else does not establish a causal link.
In the stock market, some investors believe that one should look at market behavior during the first week of the year and the first month of the year to predict how the whole year will turn out. Is there a causal connection or is this just coincidence? Would you risk your money on this correlation?
What about giving 40 million vaccinations in one week to 40 million people, all over the age of 50? Suppose the likelihood of any one of the 40 million suddenly dropping dead for any reason in a 24-hour period is one in 100,000. This means that 400 of the 40 million are expected to die on any particular day. Given this information, what is the likelihood of an individual receiving the vaccine and dropping dead within two hours? If they did, is it a matter of coincidence or causality? Is the vaccine to blame?
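The arithmetic can be worked through in a few lines. This is a sketch under two assumptions that are not in the figures above: deaths are spread evenly across the 24 hours of a day, and the vaccine has no effect whatsoever on the chance of dying.

```python
recipients = 40_000_000
daily_death_probability = 1 / 100_000  # chance of dying in any 24-hour period

# Expected deaths among all recipients on any given day.
expected_deaths_per_day = recipients * daily_death_probability

# Assumption: deaths are spread evenly over the day, so a two-hour window
# carries 2/24 of the daily risk.
window_hours = 2
p_individual_in_window = daily_death_probability * window_hours / 24

# Expected number of recipients who die within two hours of their shot
# purely by chance, counted over the whole campaign.
expected_deaths_in_window = recipients * p_individual_in_window

print(round(expected_deaths_per_day))
print(p_individual_in_window)
print(round(expected_deaths_in_window))
```

For any one person the chance is under one in a million, yet across 40 million recipients roughly 33 such deaths would be expected by coincidence alone.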
The distinction is important and has real-world implications in terms of the law and insurance claims. In case one thinks the above is completely hypothetical, during the 1976 swine flu vaccine fiasco, the US government ended up paying out 1.3 billion dollars in claims. The vaccine was blamed for 25 deaths. Attorneys for the US government argued that many of these were simply coincidental rather than causal. Given such a large number of recipients, a certain number of deaths would be expected by chance alone. This argument held up until a statistically significant number of recipients developed crippling Guillain-Barre Syndrome after being injected with the swine flu vaccine. The government’s case collapsed, and a US government-subsidized fund was set up to handle the claims. The coincidence ended up being causal.3
In the example of the automotive product and the owners of twelve-year-old white cars, would it be worthwhile sending a mass mailing to all owners of eleven-year-old white cars in an effort to lock in their business in the twelfth year?
Is there a way to verify that this model will work before the business owner commits to the expense of the campaign? A test campaign might be justified, and its results inspected to see whether a larger one might work.
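One hedged way to inspect the results of such a test is to compare the purchase rate of owners who received the mailing with that of a similar group who did not. The sketch below uses a standard two-proportion z-test. The customer counts are invented, and the function names are hypothetical helpers rather than part of any particular library.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two response rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / standard_error


def one_sided_p_value(z):
    """Probability of a z statistic at least this large if the mailing did nothing."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Invented test campaign: 1,000 owners were mailed and 60 bought the product;
# 1,000 comparable owners were not mailed and 40 bought it anyway.
z = two_proportion_z(60, 1000, 40, 1000)
p = one_sided_p_value(z)
print(z, p)
```

A small p-value suggests the difference is unlikely to be chance, but it does not by itself show that a full campaign would pay for its cost.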
Data mining success depends on the quality of data provided and the ability to intelligently inspect and use the results. It is not a recipe for instant success in business, government, or intelligence gathering. It requires painstaking accumulation and accurate input of information and a real working knowledge of one’s business to be able to take the often surprising relationships discovered and turn them into constructive advantage. But when correctly used, data mining can create a significant advantage for the user.
1 Palace, B. (1996, spring). Data Mining. Retrieved June 19, 2012, from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/index.htm.
2 Two Crows Corporation. (2005). Introduction to Data Mining and Knowledge Discovery, Third Edition. Retrieved June 19, 2012, from Two Crows Consulting: http://www.twocrows.com/intro-dm.pdf.
3 Freedman, D.A., & Stark, P.B. (1999, August 15). The Swine Flu Vaccine and Guillain-Barre Syndrome: A Case Study in Relative Risk and Specific Causation. Retrieved June 19, 2012, from Department of Statistics, University of California, Berkeley: http://www.stat.berkeley.edu/~census/546.pdf.
© 2012 Ivan Obolensky. All rights reserved. No part of this publication can be reproduced without the written permission from the author.