Association Rule Learning (Apriori, Eclat) Explained
Key Concepts
- Association Rule Learning
- Apriori Algorithm
- Eclat Algorithm
- Support
- Confidence
- Lift
- Frequent Itemsets
1. Association Rule Learning
Association Rule Learning is a data mining technique used to discover interesting relationships between variables in large datasets. It is commonly used in market basket analysis to identify patterns in customer purchasing behavior.
2. Apriori Algorithm
The Apriori Algorithm is a classic algorithm for learning association rules. It works by identifying frequent itemsets in the dataset and then generating association rules from these itemsets. The algorithm uses a "bottom-up" approach, where frequent itemsets are extended one item at a time and groups of candidates are tested against the dataset.
Example:
```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Sample one-hot encoded basket data: one row per transaction
data = {'Milk':   [1, 0, 1, 1, 0],
        'Bread':  [1, 1, 0, 1, 1],
        'Butter': [1, 1, 1, 0, 1],
        'Beer':   [0, 1, 0, 1, 0]}
df = pd.DataFrame(data).astype(bool)  # mlxtend expects boolean columns

# Applying Apriori: mine frequent itemsets, then derive rules meeting a confidence threshold
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules)
```
3. Eclat Algorithm
The Eclat Algorithm is another approach to finding frequent itemsets. Unlike the Apriori Algorithm, Eclat uses a "depth-first" search and set intersection instead of candidate generation. It is particularly efficient for dense datasets.
Example:
```python
import pandas as pd
from pyECLAT import ECLAT

# Same five baskets, in the transaction format pyECLAT expects:
# one row per transaction, items spread across columns (None pads short rows)
transactions = [['Milk', 'Bread', 'Butter'],
                ['Bread', 'Butter', 'Beer'],
                ['Milk', 'Butter', None],
                ['Milk', 'Bread', 'Beer'],
                ['Bread', 'Butter', None]]
df = pd.DataFrame(transactions)

# Applying Eclat: fit returns per-itemset transaction indexes and supports
eclat = ECLAT(data=df, verbose=True)
indexes, supports = eclat.fit(min_support=0.5,
                              min_combination=2,
                              max_combination=2)
print(supports)
```
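The set-intersection idea behind Eclat can also be sketched without a library. A minimal sketch, assuming the same five sample baskets: store the data vertically as an item-to-transaction-id map, then compute each pair's support as the size of the intersection of its members' tidlists.

```python
from itertools import combinations

# The five sample baskets above, in vertical (item -> transaction ids) layout
tidlists = {
    "Milk":   {0, 2, 3},
    "Bread":  {0, 1, 3, 4},
    "Butter": {0, 1, 2, 4},
    "Beer":   {1, 3},
}
n = 5              # total number of transactions
min_support = 0.5

# A 2-itemset's tidlist is the intersection of its members' tidlists;
# its support is the intersection size divided by n
frequent_pairs = {}
for a, b in combinations(tidlists, 2):
    tids = tidlists[a] & tidlists[b]
    if len(tids) / n >= min_support:
        frequent_pairs[(a, b)] = len(tids) / n

print(frequent_pairs)  # only {Bread, Butter} clears the 0.5 threshold
```

Because intersections only shrink tidlists, Eclat can prune an itemset's extensions as soon as its tidlist drops below the support threshold, which is what makes the depth-first search efficient on dense data.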
4. Support
Support is a measure of how frequently the itemset appears in the dataset. It is defined as the ratio of the number of transactions containing the itemset to the total number of transactions.
\[ \text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total number of transactions}} \]
5. Confidence
Confidence is a measure of the reliability of the rule. It is defined as the ratio of the number of transactions containing both A and B to the number of transactions containing A.
\[ \text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} \]
6. Lift
Lift is a measure of how much more often the items A and B occur together than expected if they were statistically independent. A lift value greater than 1 indicates a positive association (the items co-occur more often than chance), a value of 1 indicates independence, and a value below 1 indicates a negative association.
\[ \text{Lift}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A) \times \text{Support}(B)} \]
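Plugging the sample baskets into this formula shows both cases; a minimal sketch with a hypothetical `lift` helper:

```python
# The five sample baskets from the examples above, as Python sets
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Beer"},
    {"Milk", "Butter"},
    {"Milk", "Bread", "Beer"},
    {"Bread", "Butter"},
]

def lift(a, b, transactions):
    """Support(A ∪ B) / (Support(A) * Support(B))."""
    n = len(transactions)
    s_ab = sum(1 for t in transactions if (a | b) <= t) / n
    s_a = sum(1 for t in transactions if a <= t) / n
    s_b = sum(1 for t in transactions if b <= t) / n
    return s_ab / (s_a * s_b)

print(lift({"Beer"}, {"Bread"}, transactions))  # above 1: positive association
print(lift({"Milk"}, {"Bread"}, transactions))  # below 1: negative association
```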
7. Frequent Itemsets
Frequent Itemsets are sets of items that appear together in the dataset with a frequency greater than or equal to a specified minimum support threshold. These itemsets are used to generate association rules.
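On a dataset this small, the frequent itemsets can be enumerated by brute force, which makes the minimum-support threshold concrete. A minimal sketch over the same five baskets (not how Apriori or Eclat search, which prune instead of enumerating everything):

```python
from itertools import combinations

# The five sample baskets from the examples above, as Python sets
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter", "Beer"},
    {"Milk", "Butter"},
    {"Milk", "Bread", "Beer"},
    {"Bread", "Butter"},
]
items = ["Milk", "Bread", "Butter", "Beer"]
min_support = 0.5

# Check every candidate itemset of every size against the threshold
frequent = []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = sum(1 for t in transactions if set(combo) <= t) / len(transactions)
        if s >= min_support:
            frequent.append((combo, s))

print(frequent)  # three frequent single items plus {Bread, Butter}
```

Note the downward-closure (Apriori) property in the output: every subset of the frequent pair {Bread, Butter} is itself frequent, which is the observation both algorithms exploit to prune the search.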
Analogies
Think of Association Rule Learning as a detective trying to find patterns in a grocery store. The Apriori Algorithm is like a methodical detective who checks each item one by one, while the Eclat Algorithm is like a detective who looks for intersections between items. Support is like the popularity of an item, confidence is like the reliability of a pattern, and lift is like the surprise factor of finding two items together more often than expected.