Apriori and FP-Growth Algorithms

Hello everyone… I am planning to write an article series on Data Mining and its Concepts and today I will explain you in detail to solve a common problem type related to Apriori and FP-Growth algorithms.

First, let's get a problem.

Question:
A grocery store chain keeps a record of weekly transactions where each transaction represents the items bought during one cash register transaction. The executives of the chain receive a summarized report of the transactions indicating what types of items have sold at what quantity. In addition, they periodically request information about what items are commonly purchased together. In addition, they periodically request information about what items are commonly purchased together. In this case, there are five transactions and five items, as shown in Table 1. Suppose that the minimum support and confidence are S=30% and 50%, respectively.

  1. Define the notions of support and confidence.
  2. Calculate the Support for all the subsets given in Table 2.
  3. Calculate the support and confidence of the association rules in Table 3.
  4. Find all frequent items for the above database using the Apriori Algorithm
  5. Find all frequent items using the FP-growth algorithm.

Transaction ID
Items
T1
Cake, Cream, Sugar
T2
Cake, Sugar
T3
Cake, Sugar, Milk
T4
Coke, Cake
T5
Coke, Milk
Table 1

Set
Support
Coke

Cake

Sugar

Cream

Milk

Coke, Cake

Coke, Milk

Cake, Sugar

Cake, Cream

Cake, Milk

Sugar, Milk

Sugar, Cream

Cake, Sugar, Cream

Cake, Sugar, Milk

Table 2

X=> Y
Support
Confidence
Cake=>Sugar


Sugar=>Cake


Coke=>Cake


Sugar=>Cream


Cream=>Sugar


Cream=>Milk


Table 3

Answer:
1.
  • Support is an indication of how frequently the itemset appears in the database.
  • Confidence is an indication of how often the rule has been found to be true.
For rule A=>C
Confidence = support({A∪C})
           support({A})

2.  
First, we need to get the occurrence of each item/itemset in Table 1. Here we have 5 unique items; Bread, Jelly, Butter, Milk, Beer. For the sets given in Table 2, occurrences in Table 1 are as following,


Set
Number of occurrences
Coke
11
2
Cake
1111
4
Sugar
111
3
Cream
1
1
Milk
11
2
Coke, Cake
1
1
Coke, Milk
1
1
Cake, Sugar
111
3
Cake, Cream
1
1
Cake, Milk
1
1
Sugar, Milk
1
1
Sugar, Cream
1
1
Cake, Sugar, Cream
1
1
Cake, Sugar, Milk
1
1
Table A


By the definition of support, it is about an indication of how frequently an itemset occurs in the dataset. Therefore, the above results can be considered as the support of each itemset.



Or we can present support as a percentage. Total number of unique items are 5. (Bread, Jelly, Butter, Milk, Beer)



Ex:
Support of Coke = Number of occurrence of Coke  x 100%
Number of items
                          = 2  x 100%
5
         = 0.4 x 100%
         = 40 %

We can calculate the support for all the itemsets given in Table 2 like this.

Set
Support
Coke
2  x 100%
5
= 40%
Cake
4  x 100%
5
= 80%
Sugar
3  x 100%
5
= 60%
Cream
1  x 100%
5
= 20%
Milk
2  x 100%
5
= 40%
Coke, Cake
1  x 100%
5
= 20%
Coke, Milk
1  x 100%
5
= 20%
Cake, Sugar
3  x 100%
5
= 60%
Cake, Cream
1  x 100%
5
= 20%
Cake, Milk
1  x 100%
5
= 20%
Sugar, Milk
1  x 100%
5
= 20%
Sugar, Cream
1  x 100%
5
= 20%
Cake, Sugar, Cream
1  x 100%
5
= 20%
Cake, Sugar, Milk
1  x 100%
5
= 20%
Table 2
3. When it comes to association rules support can be calculated as below.
  
Example:
For the rule Cake=>Sugar
Support({Cake∪Sugar}) is the number of occurrences of itemset cake and sugar. Therefore, in our Table A we have found it as 3. So the support is 3  x 100% = 60%.
    5

Now let’s find the support of association rules given in Table 3.


X=> Y
Number of occurrences of itemset
Support
Cake=>Sugar
111
3
3  x 100% = 60%
5
Sugar=>Cake
111
3
3  x 100% = 60%
5
Coke=>Cake
1
1
1  x 100% = 20%
5
Sugar=>Cream
1
1
1  x 100% = 20%
5
Cream=>Sugar
1
1
1  x 100% = 20%
5
Cream=>Milk
0
0
1  x 100% = 20%
5
Table B

By definition, Confidence is an indication of how often the rule has been found to be true.
For rule A=>C
Confidence = support({A∪C})
           support({A})

Therefore we can calculate the confidence by considering the results we got in Table A and B.

Ex:
Confidence (Cake=>Sugar) = Support({Cake∪Sugar})
Support({Cake})
          = Number of occurrences of Cake,Sugar
Number of occurrences of Cake
          = 3
  4
          = 0.75

Or  as a percentage 3 x 100% = 75%
         4

Therefore the confidence of association rules given in the Table 3 can be calculated as following,

X=> Y
Confidence
Cake=>Sugar
3  x 100% = 75%
4
Sugar=>Cake
3  x 100% = 100%
3
Coke=>Cake
1  x 100% = 50%
2
Sugar=>Cream
1  x 100% = 33.33%
3
Cream=>Sugar
1  x 100% = 100%
1
Cream=>Milk
0
Table C

Now we can summarize above calculations into Table 3.


X=> Y
Support
Confidence
Cake=>Sugar
60%
75%
Sugar=>Cake
60%
100%
Coke=>Cake
20%
50%
Sugar=>Cream
20%
33.33%
Cream=>Sugar
20%
100%
Cream=>Milk
0
0
Table 3

4. According to Apriori Algorithm, to be a frequent item, an item should satisfy the minimum support of the data set it belongs to.
Given that the minimum support is 30%. So here we have to compare the support of each individual item with the minimum support.

Ex:

Support(Coke) = 40%
Here Support(Coke) > Minimum Support
Therefore Coke is a frequent item.

Therefore, according to Table 2,

Item
Support
Satisfy Minimum Support
Coke
40%
> 30%
Yes
Cake
80%
> 30%
Yes
Sugar
60%
> 30%
Yes
Cream
20%
< 30%
No
Milk
40%
> 30%
Yes
Table D

Therefore, frequent singletons are {Coke, Cake, Sugar, Milk}.

Now, let’s pair the frequent singletons to find their support.

Coke-Cake, Coke-Sugar, Coke-Milk, Cake-Sugar, Cake-Milk, Sugar-Milk

Item Pair
Occurrence
Support
Satisfy Minimum Support
Coke, Cake
1
1
1  x 100% = 20%
5
< 30%
No
Coke, Sugar
-
0
0
< 30%
No
Coke, Milk
1
1
1  x 100% = 20%
5
< 30%
No
Cake, Sugar
111
3
3  x 100% = 60%
5
> 30%
Yes
Cake, Milk
1
1
1  x 100% = 20%
5
< 30%
No
Sugar, Milk
1
1
1  x 100% = 20%
5
< 30%
No
Table E

Therefore the only frequent item pair is {Cake, Sugar}.

5. To find the items using FP-growth algorithm first we need to prioritize the unique items by their occurrence in Table 1.
Note: Only the items which support the minimum support should be prioritized.

Therefore the items which satisfy minimum support can be prioritized as following by Referring Table 1 and Table D.

Item
Number of occurrences
Priority
Coke
2
3
Cake
4
1
Sugar
3
2
Milk
2
4
Table F

Now let’s arrange the items in transaction dataset given in Table 1 according to the order of the priority.
Note: Here we will have to left the items which do not satisfy minimum support.
  1. Cake
  2. Sugar
  3. Coke
  4. Milk

Transaction ID
Items
Priority Order
T1
Cake, Cream, Sugar
Cake, Sugar
T2
Cake, Sugar
Cake, Sugar
T3
Cake, Sugar, Milk
Cake, Sugar, Milk
T4
Coke, Cake
Cake, Coke
T5
Coke, Milk
Coke, Milk
Table G

Now we have to draw the FP-Growth tree; I will explain a simple and easy way to draw the tree.

  • Always the starting node is an empty node.
  • Look at Table G; you can see Cake continuously comes down as the first time in each row. Draw a branch for it.


  • Sugar comes three times continuously down each row after Cake. Draw a branch for Sugar under Cake.

  • In the first two rows we can find Cake and Sugar only; when it comes to the third row we can find Milk after Sugar. Draw a branch for Milk under Sugar.

  • Now first 3 rows are completed. In the fourth row, we can find Coke after Cake. Draw a branch for Coke under Cake.


  • Now 5th row is left and there we can find Coke which doesn’t come under Cake. Draw a branch from the root for Coke.


  • A Milk item is left in the 5th row after Coke. Draw a branch for Milk under the relevant Coke node.



Above you can find the FP-Growth tree.






Comments

Popular posts from this blog

UML - Use Case Diagrams

How I faced to OCJP exam...

International Women's Day Celebration 2016 by Women Techmakers Sri Lanka