
Market Basket Analysis


Written by Heena Rijhwani

Before looking at market basket analysis (MBA), let’s see what frequent patterns are.

Frequent patterns are itemsets, subsequences, or substructures that appear frequently in a data set. For instance, a set of items, such as milk and bread, that are frequently bought together is a frequent itemset. A subsequence, such as first buying some milk, then eggs, and then coffee, is a (frequent) sequential pattern if it occurs frequently in a shopping history database. A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. If it occurs frequently, it is called a (frequent) structured pattern. Finding these patterns is essential for mining associations, correlations, and many other interesting relationships. It also helps in data classification, clustering, and other data mining tasks.

Association Rule Mining

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets. This information can provide actionable insights for decision-making and business strategy. An example of frequent itemset mining is market basket analysis. It analyzes customer behavior by finding associations between the different items that customers place in their shopping baskets. Knowing which items are frequently purchased together helps businesses develop marketing strategies, and it can help sales and marketing teams increase productivity, revenue and customer acquisition.

For instance, if customers are buying bread, how likely are they to also buy butter on the same trip to the supermarket? This information can be used to increase sales by placing these goods together.

Let’s look at another example.

An unforeseen yet interesting discovery was that of the correlation between sales of beer and diapers. The information that customers who purchase beer also tend to buy diapers at the same time is represented in the following association rule:

Beer => Diapers [support = 2%, confidence = 60%]

Here, a support of 2% means that beer and diapers are purchased together in 2% of all transactions, and a confidence of 60% means that 60% of the customers who purchased beer also bought diapers.

Association rules must satisfy a minimum support threshold and a minimum confidence threshold. In other words, we must find all rules X & Y => Z that meet the minimum support and confidence, where support, s, is the probability that a transaction contains {X, Y, Z} and confidence, c, is the conditional probability that a transaction containing {X, Y} also contains Z.

Syntax:

Body => Head [support, confidence]

Beer => Diapers [support = 2%, confidence = 60%]
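To make the two measures concrete, here is a minimal sketch in plain Python. The five transactions are made up for illustration (they are not the data behind the 2% / 60% figures above), and the helper names `support` and `confidence` are my own.

```python
# Hypothetical toy transactions, invented for this sketch.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head, transactions):
    """Estimate P(head | body) as support(body and head) / support(body)."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)

s = support({"beer", "diapers"}, transactions)
c = confidence({"beer"}, {"diapers"}, transactions)
print(f"Beer => Diapers [support = {s:.0%}, confidence = {c:.0%}]")
```

In this toy data, beer and diapers appear together in 2 of 5 baskets, and 2 of the 3 beer baskets also contain diapers.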

Types of association rules

  • Boolean association rule - If a rule involves associations between the presence or absence of items, it is a Boolean association rule. Example: buys(X, “computer”) ⇒ buys(X, “HP printer”)
  • Quantitative association rule - If a rule describes associations between quantitative items or attributes, it is a quantitative association rule. Example: age(X, “30…39”) ^ income(X, “40,000…50,000”) ⇒ buys(X, “high resolution TV”)
  • Single-dimensional association rule - A rule such as buys(X, “computer”) ⇒ buys(X, “software”) is single-dimensional because it refers to only one dimension, buys.
  • Multidimensional association rule - If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule.

Apriori Algorithm

Apriori is a commonly used algorithm for finding frequent itemsets. It is based on the Apriori property, which states: all non-empty subsets of a frequent itemset must also be frequent. Equivalently, if a set cannot pass a test, all of its supersets will fail the same test as well. The algorithm uses this prior knowledge of frequent itemset properties in an iterative approach known as level-wise search, where k-itemsets are used to explore (k + 1)-itemsets.

How does it work?

First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The two major steps of the Apriori algorithm are the join and prune steps. The join step is used to construct new candidate sets. The prune step filters out candidate itemsets whose subsets (from the prior level) are not frequent.
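The join and prune steps can be sketched as a single helper. This is a simplified illustration, not a tuned implementation: the join here unions every pair of frequent (k-1)-itemsets instead of using the classic shared-prefix join, and both the function name `generate_candidates` and the sample L2 are assumptions made for the example.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev_frequent = set(prev_frequent)

    # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset.
    joined = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:
                joined.add(union)

    # Prune step: by the Apriori property, a candidate survives only if
    # every one of its (k-1)-subsets is itself frequent.
    return {
        c for c in joined
        if all(frozenset(sub) in prev_frequent for sub in combinations(c, k - 1))
    }

# Hypothetical frequent 2-itemsets (L2), invented for this sketch.
L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
C3 = generate_candidates(L2, 3)
print(C3)
```

The join produces {A, B, C}, {A, C, E} and {B, C, E}, but the first two are pruned because {A, B} and {A, E} are not in L2, so only {B, C, E} has to be counted against the database.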


Let’s look at an example. Given TDB:


First we will scan the database to get a count for each of the items to get C1.


Then select those items with min support to get L1. Here min_sup=2.


Generate the candidate 2-itemsets (C2), then scan the database and keep those that meet the minimum support to get L2:


Now only one 3-itemset is possible, with support = 2. This gives L3.
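The whole level-wise search fits in a few lines of Python. The transaction database below is a hypothetical stand-in, since the original table is not reproduced here. It is chosen so that, with min_sup = 2, it also ends with a single frequent 3-itemset of support 2, but it is not necessarily the same data. The function name `apriori` is likewise just a label for this sketch.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: count} for every itemset whose count is >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan per level: count each candidate's occurrences.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1 -> L1: count single items and keep the frequent ones.
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_sup}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join, then prune candidates that have an infrequent (k-1)-subset.
        joined = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transaction database (an assumption for this sketch).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, min_sup=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this data, item D is dropped at level 1, four 2-itemsets survive at level 2, and {B, C, E} is the only frequent 3-itemset.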

Disadvantage of Apriori

A drawback of Apriori is that it repeatedly scans the database and checks a large set of candidates by pattern matching. Generating many candidate sets and scanning the database multiple times reduces the efficiency of the algorithm. One solution is FP-Growth (Frequent Pattern Growth).

To better understand the concept of FP-Growth, have a look at its application in MBA and the corresponding outputs and conclusion.

This is the Groceries dataset we will be using.


First we find the top 20 most sold items.
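As a rough idea of what this step involves, here is a sketch using only the standard library. The rows are made-up stand-ins for the Groceries data, and `top_items` is a hypothetical helper name; the original analysis may well have used a different tool.

```python
from collections import Counter

# Made-up sample rows standing in for the Groceries data (assumption).
transactions = [
    ["whole milk", "rolls/buns", "soda"],
    ["whole milk", "yogurt"],
    ["whole milk", "rolls/buns"],
    ["soda"],
    ["whole milk"],
]

def top_items(transactions, n=20):
    """Return (item, count, share of all items sold) for the n best sellers."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return [(item, c, c / total) for item, c in counts.most_common(n)]

for item, count, share in top_items(transactions, n=3):
    print(f"{item:12s} {count:3d} {share:.1%}")
```

The share column is each item's count divided by the total number of items sold, which is one way to read "contribution to total sales".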


Then we look at the visualization of top 20 items and their contribution to total sales.


Pruning the dataset for Frequently Bought Items:


We get transactions with a minimum length of 2 and contribution of at least 40% to the total sales.
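The exact pruning rule is not spelled out, so the sketch below is only one plausible reading: keep the best-selling items until together they account for at least 40% of all items sold, restrict every transaction to those items, and then drop transactions left with fewer than 2 items. The function name, parameters and sample data are all assumptions.

```python
from collections import Counter

def prune_transactions(transactions, min_share=0.40, min_length=2):
    """Keep top-selling items covering >= min_share of sales, then
    keep only transactions that still hold >= min_length of them."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())

    kept, covered = set(), 0
    for item, n in counts.most_common():
        if covered / total >= min_share:
            break
        kept.add(item)
        covered += n

    pruned = []
    for t in transactions:
        reduced = [i for i in t if i in kept]
        if len(reduced) >= min_length:
            pruned.append(reduced)
    return kept, pruned

# Made-up sample data (assumption), not the Groceries dataset.
transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["milk", "jam"],
    ["bread", "tea"],
    ["eggs", "jam"],
]
kept, pruned = prune_transactions(transactions)
print(kept, pruned)
```

Here milk and bread together cover 6 of the 11 items sold, so they are the only items kept, and two transactions remain after pruning.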

Let’s look at association rule mining with FP-Growth.

  • Based on Support

support(A) = (number of transactions in which A appears) / (total number of transactions)

  • Based on Confidence and Lift

The lift of a rule is defined as:

lift(X ⟶ Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

It is the ratio of the observed support of X and Y together to the support that would be expected if X and Y were independent, which is the product of their individual supports.

If the value of lift is greater than 1, itemset Y is more likely to be bought when itemset X is bought, while a value less than 1 implies that itemset Y is less likely to be bought when itemset X is bought. A lift of exactly 1 means the two itemsets occur independently of each other.
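The lift formula translates directly into code. The baskets below are invented for illustration, and `support` and `lift` are hypothetical helper names.

```python
# Made-up baskets for illustration (assumption).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"milk", "tea"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(x, y, transactions):
    """lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    x, y = set(x), set(y)
    return support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions)
    )

print(lift({"bread"}, {"butter"}, transactions))  # above 1: bought together
print(lift({"bread"}, {"milk"}, transactions))    # below 1: rarely together
```

Note that lift is symmetric: swapping X and Y gives the same value, which is why it is usually read together with confidence.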


Sorting the Association Rules:
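A minimal sketch of this sorting step, assuming rules with a single item on each side and made-up baskets. The helper `pairwise_rules`, its `min_support` parameter and the choice to sort by lift and then confidence are illustrative assumptions, not the original code.

```python
from itertools import permutations

# Made-up baskets for illustration (assumption).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"milk", "tea"},
]

def pairwise_rules(transactions, min_support=0.2):
    """Build single-item rules A => B and sort them by lift, then confidence."""
    n = len(transactions)

    def supp(items):
        return sum(1 for t in transactions if items <= t) / n

    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s = supp({a, b})
        if s < min_support:
            continue  # skip rules whose itemset is not frequent enough
        rules.append({
            "body": a,
            "head": b,
            "support": s,
            "confidence": s / supp({a}),
            "lift": s / (supp({a}) * supp({b})),
        })
    return sorted(rules, key=lambda r: (r["lift"], r["confidence"]), reverse=True)

rules = pairwise_rules(transactions)
for r in rules:
    print(f'{r["body"]} => {r["head"]} '
          f'[support = {r["support"]:.0%}, confidence = {r["confidence"]:.0%}, '
          f'lift = {r["lift"]:.2f}]')
```

Sorting by lift first puts the rules that deviate most from independence at the top, and confidence breaks ties between rules with the same lift.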

