Apriori and FP-Growth Algorithms
Hello everyone… I am planning to write an article series on Data Mining and its Concepts and today I will explain you in detail to solve a common problem type related to Apriori and FP-Growth algorithms.
First, let's get a problem.
Question:
A grocery store chain keeps a record of weekly transactions where each transaction represents the items bought during one cash register transaction. The executives of the chain receive a summarized report of the transactions indicating what types of items have sold at what quantity. In addition, they periodically request information about what items are commonly purchased together. In addition, they periodically request information about what items are commonly purchased together. In this case, there are five transactions and five items, as shown in Table 1. Suppose that the minimum support and confidence are S=30% and 50%, respectively.
- Define the notions of support and confidence.
- Calculate the Support for all the subsets given in Table 2.
- Calculate the support and confidence of the association rules in Table 3.
- Find all frequent items for the above database using the Apriori Algorithm
- Find all frequent items using the FP-growth algorithm.
Transaction ID
|
Items
|
T1
|
Cake, Cream, Sugar
|
T2
|
Cake, Sugar
|
T3
|
Cake, Sugar, Milk
|
T4
|
Coke, Cake
|
T5
|
Coke, Milk
|
Table 1
Set
|
Support
|
Coke
| |
Cake
| |
Sugar
| |
Cream
| |
Milk
| |
Coke, Cake
| |
Coke, Milk
| |
Cake, Sugar
| |
Cake, Cream
| |
Cake, Milk
| |
Sugar, Milk
| |
Sugar, Cream
| |
Cake, Sugar, Cream
| |
Cake, Sugar, Milk
|
Table 2
X=> Y
|
Support
|
Confidence
|
Cake=>Sugar
| ||
Sugar=>Cake
| ||
Coke=>Cake
| ||
Sugar=>Cream
| ||
Cream=>Sugar
| ||
Cream=>Milk
|
Table 3
Answer:
1.
- Support is an indication of how frequently the itemset appears in the database.
- Confidence is an indication of how often the rule has been found to be true.
For rule A=>C
Confidence = support({A∪C})
support({A})
2.
First, we need to get the occurrence of each item/itemset in Table 1. Here we have 5 unique items; Bread, Jelly, Butter, Milk, Beer. For the sets given in Table 2, occurrences in Table 1 are as following,
First, we need to get the occurrence of each item/itemset in Table 1. Here we have 5 unique items; Bread, Jelly, Butter, Milk, Beer. For the sets given in Table 2, occurrences in Table 1 are as following,
Set
|
Number of occurrences
| |
Coke
|
11
|
2
|
Cake
|
1111
|
4
|
Sugar
|
111
|
3
|
Cream
|
1
|
1
|
Milk
|
11
|
2
|
Coke, Cake
|
1
|
1
|
Coke, Milk
|
1
|
1
|
Cake, Sugar
|
111
|
3
|
Cake, Cream
|
1
|
1
|
Cake, Milk
|
1
|
1
|
Sugar, Milk
|
1
|
1
|
Sugar, Cream
|
1
|
1
|
Cake, Sugar, Cream
|
1
|
1
|
Cake, Sugar, Milk
|
1
|
1
|
Table A
By the definition of support, it is about an indication of how frequently an itemset occurs in the dataset. Therefore, the above results can be considered as the support of each itemset.
Or we can present support as a percentage. Total number of unique items are 5. (Bread, Jelly, Butter, Milk, Beer)
Ex:
Support of Coke = Number of occurrence of Coke x 100%
Number of items
= 2 x 100%
5
= 0.4 x 100%
= 40 %
We can calculate the support for all the itemsets given in Table 2 like this.
Set
|
Support
| |
Coke
|
2 x 100%
5
|
= 40%
|
Cake
|
4 x 100%
5
|
= 80%
|
Sugar
|
3 x 100%
5
|
= 60%
|
Cream
|
1 x 100%
5
|
= 20%
|
Milk
|
2 x 100%
5
|
= 40%
|
Coke, Cake
|
1 x 100%
5
|
= 20%
|
Coke, Milk
|
1 x 100%
5
|
= 20%
|
Cake, Sugar
|
3 x 100%
5
|
= 60%
|
Cake, Cream
|
1 x 100%
5
|
= 20%
|
Cake, Milk
|
1 x 100%
5
|
= 20%
|
Sugar, Milk
|
1 x 100%
5
|
= 20%
|
Sugar, Cream
|
1 x 100%
5
|
= 20%
|
Cake, Sugar, Cream
|
1 x 100%
5
|
= 20%
|
Cake, Sugar, Milk
|
1 x 100%
5
|
= 20%
|
Table 2
3. When it comes to association rules support can be calculated as below.
Example:
For the rule Cake=>Sugar
Support({Cake∪Sugar}) is the number of occurrences of itemset cake and sugar. Therefore, in our Table A we have found it as 3. So the support is 3 x 100% = 60%.
5
Now let’s find the support of association rules given in Table 3.
X=> Y
|
Number of occurrences of itemset
|
Support
| |
Cake=>Sugar
|
111
|
3
|
3 x 100% = 60%
5
|
Sugar=>Cake
|
111
|
3
|
3 x 100% = 60%
5
|
Coke=>Cake
|
1
|
1
|
1 x 100% = 20%
5
|
Sugar=>Cream
|
1
|
1
|
1 x 100% = 20%
5
|
Cream=>Sugar
|
1
|
1
|
1 x 100% = 20%
5
|
Cream=>Milk
|
0
|
0
|
1 x 100% = 20%
5
|
Table B
By definition, Confidence is an indication of how often the rule has been found to be true.
For rule A=>C
Confidence = support({A∪C})
support({A})
Therefore we can calculate the confidence by considering the results we got in Table A and B.
Ex:
Confidence (Cake=>Sugar) = Support({Cake∪Sugar})
Support({Cake})
= Number of occurrences of Cake,Sugar
Number of occurrences of Cake
= 3
4
4
= 0.75
Or as a percentage 3 x 100% = 75%
4
Therefore the confidence of association rules given in the Table 3 can be calculated as following,
X=> Y
|
Confidence
|
Cake=>Sugar
|
3 x 100% = 75%
4
|
Sugar=>Cake
|
3 x 100% = 100%
3
|
Coke=>Cake
|
1 x 100% = 50%
2
|
Sugar=>Cream
|
1 x 100% = 33.33%
3
|
Cream=>Sugar
|
1 x 100% = 100%
1
|
Cream=>Milk
|
0
|
Table C
Now we can summarize above calculations into Table 3.
X=> Y
|
Support
|
Confidence
|
Cake=>Sugar
|
60%
|
75%
|
Sugar=>Cake
|
60%
|
100%
|
Coke=>Cake
|
20%
|
50%
|
Sugar=>Cream
|
20%
|
33.33%
|
Cream=>Sugar
|
20%
|
100%
|
Cream=>Milk
|
0
|
0
|
Table 3
4. According to Apriori Algorithm, to be a frequent item, an item should satisfy the minimum support of the data set it belongs to.
Given that the minimum support is 30%. So here we have to compare the support of each individual item with the minimum support.
Ex:
Support(Coke) = 40%
Here Support(Coke) > Minimum Support
Therefore Coke is a frequent item.
Therefore, according to Table 2,
Item
|
Support
|
Satisfy Minimum Support
| |
Coke
|
40%
|
> 30%
|
Yes
|
Cake
|
80%
|
> 30%
|
Yes
|
Sugar
|
60%
|
> 30%
|
Yes
|
Cream
|
20%
|
< 30%
|
No
|
Milk
|
40%
|
> 30%
|
Yes
|
Table D
Therefore, frequent singletons are {Coke, Cake, Sugar, Milk}.
Now, let’s pair the frequent singletons to find their support.
Coke-Cake, Coke-Sugar, Coke-Milk, Cake-Sugar, Cake-Milk, Sugar-Milk
Item Pair
|
Occurrence
|
Support
|
Satisfy Minimum Support
| ||
Coke, Cake
|
1
|
1
|
1 x 100% = 20%
5
|
< 30%
|
No
|
Coke, Sugar
|
-
|
0
|
0
|
< 30%
|
No
|
Coke, Milk
|
1
|
1
|
1 x 100% = 20%
5
|
< 30%
|
No
|
Cake, Sugar
|
111
|
3
|
3 x 100% = 60%
5
|
> 30%
|
Yes
|
Cake, Milk
|
1
|
1
|
1 x 100% = 20%
5
|
< 30%
|
No
|
Sugar, Milk
|
1
|
1
|
1 x 100% = 20%
5
|
< 30%
|
No
|
Table E
Therefore the only frequent item pair is {Cake, Sugar}.
5. To find the items using FP-growth algorithm first we need to prioritize the unique items by their occurrence in Table 1.
Note: Only the items which support the minimum support should be prioritized.
Therefore the items which satisfy minimum support can be prioritized as following by Referring Table 1 and Table D.
Item
|
Number of occurrences
|
Priority
|
Coke
|
2
|
3
|
Cake
|
4
|
1
|
Sugar
|
3
|
2
|
Milk
|
2
|
4
|
Table F
Now let’s arrange the items in transaction dataset given in Table 1 according to the order of the priority.
Note: Here we will have to left the items which do not satisfy minimum support.
- Cake
- Sugar
- Coke
- Milk
Transaction ID
|
Items
|
Priority Order
|
T1
|
Cake, Cream, Sugar
|
Cake, Sugar
|
T2
|
Cake, Sugar
|
Cake, Sugar
|
T3
|
Cake, Sugar, Milk
|
Cake, Sugar, Milk
|
T4
|
Coke, Cake
|
Cake, Coke
|
T5
|
Coke, Milk
|
Coke, Milk
|
Table G
Now we have to draw the FP-Growth tree; I will explain a simple and easy way to draw the tree.
- Always the starting node is an empty node.
- Look at Table G; you can see Cake continuously comes down as the first time in each row. Draw a branch for it.
- Sugar comes three times continuously down each row after Cake. Draw a branch for Sugar under Cake.
- In the first two rows we can find Cake and Sugar only; when it comes to the third row we can find Milk after Sugar. Draw a branch for Milk under Sugar.
- Now first 3 rows are completed. In the fourth row, we can find Coke after Cake. Draw a branch for Coke under Cake.
- Now 5th row is left and there we can find Coke which doesn’t come under Cake. Draw a branch from the root for Coke.
- A Milk item is left in the 5th row after Coke. Draw a branch for Milk under the relevant Coke node.
Comments
Post a Comment