Association rules are part of every data miner's arsenal. Haven't heard
about it? I am pretty sure you've had. Association rules are a
substantial part of every e-shop, of every supermarket and every tool
that aims to analyze data.
Does the picture look familiar? If
you've ever bought something at amazon, you might have noticed that
they are kinda obsessed with showing you items related to your order.
Where do they get this information? It is not stored statically in the
database, instead it is computed from the overall orders using the
association rules mining algorithms.
Do
you think that items in your favorite supermarket are organized
randomly? No, they are organized in a way that maximizes a chance that
the items are bought. Again, this is information, that can be easily
discovered using association rules mining algorithms.
From
the previous paragraphs you might suspect, that association rules
express relations (associations) between items. More formally,
association rule is an implication of form A -> B, where the left
side, A, is called premise and it represents a condition which must be
true, for the right side, B (conclusion) to hold. A rule A->B can be
interpreted as
If A happens, than B happens.
This
is a very generic interpretation, because the true interpretation
depends on the domain.I am now a supermarket employee and I got
following rule from the mining software:
Bread -> Milk
The rule can be translated as:
Customers, who bought bread, also bought milk
Now I magically transform into a website traffic analyst and see a rule
/news/obama.html -> /sport/tour-de-france.html
and I instantly know that
those, who read news about Barrack Obama, also read news about le Tour and not only that, I know that those who are interested in Barrack Obama are interested in Tour de France.
Woosh, flash of light and I am now a doctor, looking at the rule
vasculitis -> paraneoplastic syndrome
and I see that there is a serious chance that my vasculitis patients will suffer paraneoplastic syndrome.
The important thing is, that association rules helped me to discover hidden knowledge (that's why they call it data mining),
but the more important thing is, that I can act based on the knowledge.
I can move the milk closer to bread to sell more of it together and
generate more income. I can recommend stuff to my e-shop visitors, I
can treat my vasculitis patients and run some tests to detect
paraneoplastic syndrome early and maybe save lives.
So what do you need to get started? You need data of course, but not just any data, you need data in a form of transactions.
These transactions have nothing to do with the database transactions.
Instead, the transaction is a logical group of somehow related items.
You might have groups of market basket items, groups of links clicked
on one web page visit, group of one patient's diseases.. Such groups
are then called transactions.
When
I said, that rule interpretation depends on domain, it was only half of
the truth. The other half is, that the interpretation also depends on
your transactions. The interpretation simply depends on what you are mining, and what you are mining is based on how you define your transaction.
I'll
now do a simple, manual association rule mining, using the classical
market basket analysis example. We define our transaction as a content
of a basket.
| Transaction Id | Items |
| 1 |
bread, milk, butter, cocoa, cheese |
| 2 |
bread, butter, milk, cheese |
| 3 |
bread, butter, olives |
| 4 |
milk, sugar, butter, cheese |
We
have four baskets, four customers and their data. Looking at the items,
we see, that transactions 1,2,3 contain bread and butter. We have just
found our very first rule.
bread -> butter
There are other rules in our data, for example rule
milk -> cheese
found
in transactions 1,2,4. Although association rule mining may seem like a
very trivial task at the first look, imagine finding the rules in
dataset of billions of transactions.
The rules presented so far have all one big downside. There is no way to tell which rule is better,
it is impossible to compare them. To get past this limitation, we can
add several classifiers to the rule, which will represent the strength
of the rule. They are commonly known as interestingness measures, because the strength of the rule is equal to its interestingness.
The two classical measures, which were introduced by R. Agrawal, an association rule pioneer, are called support and confidence.
Support is a measure, which represents how often did the rule apply. It
is a percentage of all transaction, where the items in the rule were
found.
Confidence is a percentage of all transactions, which contain items on the left and on the right side of the rule.
| Id | Transactions | bread + cheese | bread -> cheese | cheese -> bread |
| 1 |
bread, cheese, honey, apples |
 |
 |
 |
| 2 |
milk, bread, cheese, pasta |
 |
 |
 |
| 3 |
milk, bread, apples |
|
 |
|
| 4 |
bread, milk |
|
 |
|
| 5 |
milk, pasta, cheese |
|
|
 |
| 6 |
milk, bread, cheese |
 |
 |
 |
Look
at the table above. Bread and cheese can be found in transactions
1,2,6, we have total six transactions, so the support of the rule bread
-> cheese and cheese -> bread is 3/6 or 50%. Now take the rule
bread -> cheese. Our customers bought bread in transactions
1,2,3,4,6, but bought cheese only in transactions 1,2 and 6. So five
customers bought bread, but only three of them bought also a cheese, so
the support of the rule is 3/5.
It should be
pretty clear from the examples, that both interestingness measures are
important, because they both quantify the rule and express its
strength. But not only that, the interestingness measure are the key
concept, that actually enables the mining.
Association
rule mining is formally defined as a process of finding the rules,
where the support and confidence of the rule are greater than the user
provided values of minimum support and minimum confidence, further
referred to as minconf and minsup. The two values actually prune the search space and make mining possible.
Take
the last example. There is a rule milk -> apples [1/6, 1/1], which
can be found in only one transaction. Is this rule interesting? It
isn't and yet it is there. This is not a problem if we have six items
in six transactions, but it would be great problem, had we thousands of
items in billions of transactions. If you specify minsup=0.5,
minconf=0.8 you will effectively filter out all uninteresting items. If
you specify the values too low, you will end up with tons of rules,
because the items will be associated with each other in all possible
ways. On the other side, if you specify the values too high, you might
not find a single rule. There is no universal advice as to what values
should you set, the best way is to experiment.
What
we'll be talking about next time? In the next posts I will show some
practical examples using RapidMiner mining tool, explain the algorithm
behind, tell about the problems this model has and explain why support
and confidence are bad measures.
Credit:http://tkramar.blogspot.com/2008/09/introduction-to-association-rules-minig.html