Apriori and FP-Growth Algorithms

Hello everyone… I am planning to write an article series on Data Mining and its Concepts and today I will explain you in detail to solve a common problem type related to Apriori and FP-Growth algorithms.

First, let's get a problem.

Question:

A grocery store chain keeps a record of weekly transactions where each transaction represents the items bought during one cash register transaction. The executives of the chain receive a summarized report of the transactions indicating what types of items have sold at what quantity. In addition, they periodically request information about what items are commonly purchased together. In addition, they periodically request information about what items are commonly purchased together. In this case, there are five transactions and five items, as shown in Table 1. Suppose that the minimum support and confidence are S=30% and 50%, respectively.

Define the notions of support and confidence.
Calculate the Support for all the subsets given in Table 2.
Calculate the support and confidence of the association rules in Table 3.
Find all frequent items for the above database using the Apriori Algorithm
Find all frequent items using the FP-growth algorithm.

Transaction ID	Items
T1	Cake, Cream, Sugar
T2	Cake, Sugar
T3	Cake, Sugar, Milk
T4	Coke, Cake
T5	Coke, Milk

Table 1

Set	Support
Coke
Cake
Sugar
Cream
Milk
Coke, Cake
Coke, Milk
Cake, Sugar
Cake, Cream
Cake, Milk
Sugar, Milk
Sugar, Cream
Cake, Sugar, Cream
Cake, Sugar, Milk

Table 2

X=> Y	Support	Confidence
Cake=>Sugar
Sugar=>Cake
Coke=>Cake
Sugar=>Cream
Cream=>Sugar
Cream=>Milk

Table 3

Answer:

Support is an indication of how frequently the itemset appears in the database.
Confidence is an indication of how often the rule has been found to be true.

For rule A=>C

Confidence = support({A∪C})

support({A})

2.
First, we need to get the occurrence of each item/itemset in Table 1. Here we have 5 unique items; Bread, Jelly, Butter, Milk, Beer. For the sets given in Table 2, occurrences in Table 1 are as following,

Set	Number of occurrences
Coke	11	2
Cake	1111	4
Sugar	111	3
Cream	1	1
Milk	11	2
Coke, Cake	1	1
Coke, Milk	1	1
Cake, Sugar	111	3
Cake, Cream	1	1
Cake, Milk	1	1
Sugar, Milk	1	1
Sugar, Cream	1	1
Cake, Sugar, Cream	1	1
Cake, Sugar, Milk	1	1

Table A

By the definition of support, it is about an indication of how frequently an itemset occurs in the dataset. Therefore, the above results can be considered as the support of each itemset.

Or we can present support as a percentage. Total number of unique items are 5. (Bread, Jelly, Butter, Milk, Beer)

Ex:

Support of Coke = Number of occurrence of Coke x 100%

Number of items

= 2 x 100%

= 0.4 x 100%

= 40 %

We can calculate the support for all the itemsets given in Table 2 like this.

Set	Support
Coke	2 x 100% 5	= 40%
Cake	4 x 100% 5	= 80%
Sugar	3 x 100% 5	= 60%
Cream	1 x 100% 5	= 20%
Milk	2 x 100% 5	= 40%
Coke, Cake	1 x 100% 5	= 20%
Coke, Milk	1 x 100% 5	= 20%
Cake, Sugar	3 x 100% 5	= 60%
Cake, Cream	1 x 100% 5	= 20%
Cake, Milk	1 x 100% 5	= 20%
Sugar, Milk	1 x 100% 5	= 20%
Sugar, Cream	1 x 100% 5	= 20%
Cake, Sugar, Cream	1 x 100% 5	= 20%
Cake, Sugar, Milk	1 x 100% 5	= 20%

Table 2

3. When it comes to association rules support can be calculated as below.

Example:

For the rule Cake=>Sugar

Support({Cake∪Sugar}) is the number of occurrences of itemset cake and sugar. Therefore, in our Table A we have found it as 3. So the support is 3 x 100% = 60%.

Now let’s find the support of association rules given in Table 3.

X=> Y	Number of occurrences of itemset		Support
Cake=>Sugar	111	3	3 x 100% = 60% 5
Sugar=>Cake	111	3	3 x 100% = 60% 5
Coke=>Cake	1	1	1 x 100% = 20% 5
Sugar=>Cream	1	1	1 x 100% = 20% 5
Cream=>Sugar	1	1	1 x 100% = 20% 5
Cream=>Milk	0	0	1 x 100% = 20% 5

Table B

By definition, Confidence is an indication of how often the rule has been found to be true.

For rule A=>C

Confidence = support({A∪C})

support({A})

Therefore we can calculate the confidence by considering the results we got in Table A and B.

Ex:

Confidence (Cake=>Sugar) = Support({Cake∪Sugar})

Support({Cake})

= Number of occurrences of Cake,Sugar

Number of occurrences of Cake

= 3
4

= 0.75

Or as a percentage 3 x 100% = 75%

Therefore the confidence of association rules given in the Table 3 can be calculated as following,

X=> Y	Confidence
Cake=>Sugar	3 x 100% = 75% 4
Sugar=>Cake	3 x 100% = 100% 3
Coke=>Cake	1 x 100% = 50% 2
Sugar=>Cream	1 x 100% = 33.33% 3
Cream=>Sugar	1 x 100% = 100% 1
Cream=>Milk	0

Table C

Now we can summarize above calculations into Table 3.

X=> Y	Support	Confidence
Cake=>Sugar	60%	75%
Sugar=>Cake	60%	100%
Coke=>Cake	20%	50%
Sugar=>Cream	20%	33.33%
Cream=>Sugar	20%	100%
Cream=>Milk	0	0

Table 3

4. According to Apriori Algorithm, to be a frequent item, an item should satisfy the minimum support of the data set it belongs to.

Given that the minimum support is 30%. So here we have to compare the support of each individual item with the minimum support.

Ex:

Support(Coke) = 40%

Here Support(Coke) > Minimum Support

Therefore Coke is a frequent item.

Therefore, according to Table 2,

Item	Support	Satisfy Minimum Support
Coke	40%	> 30%	Yes
Cake	80%	> 30%	Yes
Sugar	60%	> 30%	Yes
Cream	20%	< 30%	No
Milk	40%	> 30%	Yes

Table D

Therefore, frequent singletons are {Coke, Cake, Sugar, Milk}.

Now, let’s pair the frequent singletons to find their support.

Coke-Cake, Coke-Sugar, Coke-Milk, Cake-Sugar, Cake-Milk, Sugar-Milk

Item Pair	Occurrence		Support	Satisfy Minimum Support
Coke, Cake	1	1	1 x 100% = 20% 5	< 30%	No
Coke, Sugar	-	0	0	< 30%	No
Coke, Milk	1	1	1 x 100% = 20% 5	< 30%	No
Cake, Sugar	111	3	3 x 100% = 60% 5	> 30%	Yes
Cake, Milk	1	1	1 x 100% = 20% 5	< 30%	No
Sugar, Milk	1	1	1 x 100% = 20% 5	< 30%	No

Table E

Therefore the only frequent item pair is {Cake, Sugar}.

5. To find the items using FP-growth algorithm first we need to prioritize the unique items by their occurrence in Table 1.

Note: Only the items which support the minimum support should be prioritized.

Therefore the items which satisfy minimum support can be prioritized as following by Referring Table 1 and Table D.

Item	Number of occurrences	Priority
Coke	2	3
Cake	4	1
Sugar	3	2
Milk	2	4

Table F

Now let’s arrange the items in transaction dataset given in Table 1 according to the order of the priority.

Note: Here we will have to left the items which do not satisfy minimum support.

Cake
Sugar
Coke
Milk

Transaction ID	Items	Priority Order
T1	Cake, Cream, Sugar	Cake, Sugar
T2	Cake, Sugar	Cake, Sugar
T3	Cake, Sugar, Milk	Cake, Sugar, Milk
T4	Coke, Cake	Cake, Coke
T5	Coke, Milk	Coke, Milk

Table G

Now we have to draw the FP-Growth tree; I will explain a simple and easy way to draw the tree.

Always the starting node is an empty node.
Look at Table G; you can see Cake continuously comes down as the first time in each row. Draw a branch for it.

Sugar comes three times continuously down each row after Cake. Draw a branch for Sugar under Cake.

In the first two rows we can find Cake and Sugar only; when it comes to the third row we can find Milk after Sugar. Draw a branch for Milk under Sugar.

Now first 3 rows are completed. In the fourth row, we can find Coke after Cake. Draw a branch for Coke under Cake.

Now 5th row is left and there we can find Coke which doesn’t come under Cake. Draw a branch from the root for Coke.

A Milk item is left in the 5th row after Coke. Draw a branch for Milk under the relevant Coke node.

Above you can find the FP-Growth tree.

Search This Blog

Dinukaa's Blog

Apriori and FP-Growth Algorithms

Comments

Post a Comment

Popular posts from this blog

UML - Use Case Diagrams

How I faced to OCJP exam...

International Women's Day Celebration 2016 by Women Techmakers Sri Lanka