diff --git a/docs/sources/CHANGELOG.md b/docs/sources/CHANGELOG.md index c0fa2b804..d9d5b5167 100755 --- a/docs/sources/CHANGELOG.md +++ b/docs/sources/CHANGELOG.md @@ -21,6 +21,7 @@ The CHANGELOG for the current development version is available at - The `mlxtend.evaluate.bootstrap_point632_score` now supports `fit_params`. ([#861](https://github.com/rasbt/mlxtend/pull/861)) - The `mlxtend/plotting/decision_regions.py` function now has a `contourf_kwargs` for matplotlib to change the look of the decision boundaries if desired. ([#881](https://github.com/rasbt/mlxtend/pull/881) via [[pbloem](https://github.com/pbloem)]) +- The new `mlxtend.frequent_patterns.metrics` module provides the **Kulczynski measure** and the **imbalance ratio** as `kulczynski_measure` and `imbalance_ratio`. ([#840](https://github.com/rasbt/mlxtend/issues/840)) ##### Changes diff --git a/docs/sources/user_guide/frequent_patterns/metrics.ipynb b/docs/sources/user_guide/frequent_patterns/metrics.ipynb new file mode 100644 index 000000000..838774e6a --- /dev/null +++ b/docs/sources/user_guide/frequent_patterns/metrics.ipynb @@ -0,0 +1,389 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Evaluating the Quality of Association Rules" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A strong association rule may or may not be interesting for a specific application. Several measures have been developed to help evaluate the quality of association rules. `mlxtend` implements two such measures, the Kulczynski measure and the imbalance ratio."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Kulczynski Measure:\n", + "\n", + "The Kulczynski measure $K_{A,B}$ can be interpreted as the average of the confidence that $A ⇒ B$ and the confidence that $B ⇒ A$.\n", + "\n", + "The Kulczynski measure $K_{A,B} ∈ [0, 1]$ of the itemsets $A ⊆ I$ and\n", + "$B ⊆ I$ such that $A ∩ B = \\varnothing$ is given by\n", + "\n", + "$$K_{A,B} = \\frac{V_{A⇒B} + V_{B⇒A}}{2}$$\n", + "\n", + "where $V_{A⇒B}$ denotes the confidence of the rule $A ⇒ B$; equivalently,\n", + "\n", + "$$K_{A,B} = \\frac{1}{2} \\Bigg[\\frac{sup(A \\cup B)}{sup(A)} + \\frac{sup(A \\cup B)}{sup(B)} \\Bigg]$$\n", + "\n", + "- If $K_{A,B} = 0$, then $A ⊆ T$ implies that $B \\nsubseteq T$ for any transaction $T$\n", + "- If $K_{A,B} = 1$, then $A ⊆ T$ implies that $B ⊆ T$ for any transaction $T$\n", + "- Note that the Kulczynski measure is symmetric: $K_{A,B} = K_{B,A}$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Imbalance Ratio:\n", + "\n", + "The imbalance ratio $I_{A,B}$ can be interpreted as the absolute difference between the support counts of $A$ and $B$, divided by the number of transactions that contain $A$, $B$, or both $A$ and $B$.\n", + "\n", + "The imbalance ratio $I_{A,B} ∈ [0, 1]$ of the itemsets $A ⊆ I$ and $B ⊆ I$ is given by\n", + "\n", + "$$I_{A,B} =\\frac{|N_A − N_B|}{N_A + N_B − N_{A∪B}}$$\n", + "\n", + "where $N_X$ denotes the support count of the itemset $X$.\n", + "\n", + "- If $I_{A,B} = 0$, then $A$ and $B$ have the same support\n", + "- If $I_{A,B} = 1$, then either $A$ or $B$ has zero support\n", + "- Note that the imbalance ratio is symmetric: $I_{A,B} = I_{B,A}$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "[1] Chapter 6 of J. Han, M. Kamber, J.
Pei, “Data Mining: Concepts and Techniques”, 3rd edition, Elsevier/Morgan Kaufmann, 2012" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example 1 -- Evaluate Kulczynski Measure of an Association rule:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
antecedentsconsequentsantecedent supportconsequent supportsupportconfidenceliftleverageconviction
0(Eggs)(Kidney Beans)0.81.00.81.001.000.00inf
1(Kidney Beans)(Eggs)1.00.80.80.801.000.001.0
2(Eggs)(Onion)0.80.60.60.751.250.121.6
3(Onion)(Eggs)0.60.80.61.001.250.12inf
4(Milk)(Kidney Beans)0.61.00.61.001.000.00inf
5(Onion)(Kidney Beans)0.61.00.61.001.000.00inf
6(Yogurt)(Kidney Beans)0.61.00.61.001.000.00inf
7(Eggs, Onion)(Kidney Beans)0.61.00.61.001.000.00inf
8(Eggs, Kidney Beans)(Onion)0.80.60.60.751.250.121.6
9(Onion, Kidney Beans)(Eggs)0.60.80.61.001.250.12inf
10(Eggs)(Onion, Kidney Beans)0.80.60.60.751.250.121.6
11(Onion)(Eggs, Kidney Beans)0.60.80.61.001.250.12inf
\n", + "
" + ], + "text/plain": [ + " antecedents consequents antecedent support \\\n", + "0 (Eggs) (Kidney Beans) 0.8 \n", + "1 (Kidney Beans) (Eggs) 1.0 \n", + "2 (Eggs) (Onion) 0.8 \n", + "3 (Onion) (Eggs) 0.6 \n", + "4 (Milk) (Kidney Beans) 0.6 \n", + "5 (Onion) (Kidney Beans) 0.6 \n", + "6 (Yogurt) (Kidney Beans) 0.6 \n", + "7 (Eggs, Onion) (Kidney Beans) 0.6 \n", + "8 (Eggs, Kidney Beans) (Onion) 0.8 \n", + "9 (Onion, Kidney Beans) (Eggs) 0.6 \n", + "10 (Eggs) (Onion, Kidney Beans) 0.8 \n", + "11 (Onion) (Eggs, Kidney Beans) 0.6 \n", + "\n", + " consequent support support confidence lift leverage conviction \n", + "0 1.0 0.8 1.00 1.00 0.00 inf \n", + "1 0.8 0.8 0.80 1.00 0.00 1.0 \n", + "2 0.6 0.6 0.75 1.25 0.12 1.6 \n", + "3 0.8 0.6 1.00 1.25 0.12 inf \n", + "4 1.0 0.6 1.00 1.00 0.00 inf \n", + "5 1.0 0.6 1.00 1.00 0.00 inf \n", + "6 1.0 0.6 1.00 1.00 0.00 inf \n", + "7 1.0 0.6 1.00 1.00 0.00 inf \n", + "8 0.6 0.6 0.75 1.25 0.12 1.6 \n", + "9 0.8 0.6 1.00 1.25 0.12 inf \n", + "10 0.6 0.6 0.75 1.25 0.12 1.6 \n", + "11 0.8 0.6 1.00 1.25 0.12 inf " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "from mlxtend.preprocessing import TransactionEncoder\n", + "from mlxtend.frequent_patterns import apriori, association_rules\n", + "from mlxtend.frequent_patterns import metrics\n", + "\n", + "dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n", + " ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],\n", + " ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],\n", + " ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],\n", + " ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]\n", + "\n", + "te = TransactionEncoder()\n", + "te_ary = te.fit_transform(dataset)\n", + "df = pd.DataFrame(te_ary, columns=te.columns_)\n", + "freq_items = apriori(df, min_support=0.6, use_colnames=True)\n", + "rules = association_rules(freq_items, metric=\"confidence\", 
min_threshold=0.7)\n", + "rules" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.875" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = frozenset(['Onion'])\n", + "b = frozenset(['Kidney Beans', 'Eggs'])\n", + "metrics.kulczynski_measure(rules, a, b)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example 2 -- Evaluate Imbalance Ratio of an Association rule:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.2500000000000001" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = frozenset(['Onion'])\n", + "b = frozenset(['Kidney Beans', 'Eggs'])\n", + "metrics.imbalance_ratio(freq_items, a, b)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.2" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/mlxtend/frequent_patterns/metrics.py b/mlxtend/frequent_patterns/metrics.py new file mode 100644 index 000000000..74427a940 --- /dev/null +++ b/mlxtend/frequent_patterns/metrics.py @@ -0,0 +1,109 @@ +# mlxtend Machine Learning Library Extensions +# +# Functions to measure quality of association rules +# +# Author: Mohammed Niyas +# +# License: BSD 3 clause + + +def kulczynski_measure(df, antecedent, consequent): + """Calculates the Kulczynski measure for a given rule.
+ + Parameters + ----------- + df : pandas DataFrame + pandas DataFrame of association rules + with columns ['antecedents', 'consequents', 'confidence'] + + antecedent : set or frozenset + Antecedent of the rule + consequent : set or frozenset + Consequent of the rule + + Returns + ---------- + The Kulczynski measure + K(A,C) = (confidence(A->C) + confidence(C->A)) / 2, range: [0, 1]. + """ + if not df.shape[0]: + raise ValueError('The input DataFrame `df` containing ' + 'the association rules is empty.') + + # check for mandatory columns + required_columns = ["antecedents", "consequents", "confidence"] + if not all(col in df.columns for col in required_columns): + raise ValueError( + "DataFrame needs to contain the " + "columns 'antecedents', 'consequents' and 'confidence'" + ) + + # get confidence of antecedent to consequent rule + a_to_c = df[ + (df["antecedents"] == antecedent) & (df["consequents"] == consequent)] + try: + a_to_c_confidence = a_to_c["confidence"].iloc[0] + except IndexError: + a_to_c_confidence = 0 + + # get confidence of consequent to antecedent rule + c_to_a = df[ + (df["antecedents"] == consequent) & (df["consequents"] == antecedent)] + try: + c_to_a_confidence = c_to_a["confidence"].iloc[0] + except IndexError: + c_to_a_confidence = 0 + return (a_to_c_confidence + c_to_a_confidence) / 2 + + +def imbalance_ratio(df, a, b): + """ + Calculates the imbalance ratio for a given pair of itemsets. + + Parameters + ----------- + df : pandas DataFrame + pandas DataFrame of frequent itemsets + with columns ['support', 'itemsets'] + a : set or frozenset + First itemset + b : set or frozenset + Second itemset + + Returns + ---------- + The imbalance ratio + I(A,B) = |support(A) - support(B)| / + (support(A) + support(B) - support(A+B)), range: [0, 1].
+ """ + if not df.shape[0]: + raise ValueError('The input DataFrame `df` containing ' + 'the frequent itemsets is empty.') + + # check for mandatory columns + if not all(col in df.columns for col in ["support", "itemsets"]): + raise ValueError("DataFrame needs to contain the " + "columns 'support' and 'itemsets'") + + # get support of a + try: + sA = df[df["itemsets"] == a].support.iloc[0] + except IndexError: + sA = 0 + + # get support of b + try: + sB = df[df["itemsets"] == b].support.iloc[0] + except IndexError: + sB = 0 + + # get support of a union b + try: + sAB = df[df["itemsets"] == a.union(b)].support.iloc[0] + except IndexError: + sAB = 0 + + try: + return abs(sA - sB) / (sA + sB - sAB) + except ZeroDivisionError: + return 0 diff --git a/mlxtend/frequent_patterns/tests/test_metrics.py b/mlxtend/frequent_patterns/tests/test_metrics.py new file mode 100644 index 000000000..a439e16df --- /dev/null +++ b/mlxtend/frequent_patterns/tests/test_metrics.py @@ -0,0 +1,89 @@ +import pandas as pd +from mlxtend.preprocessing import TransactionEncoder +from mlxtend.frequent_patterns import apriori, association_rules +from mlxtend.frequent_patterns import metrics +from numpy.testing import assert_raises as numpy_assert_raises + + +dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'], + ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'], + ['Milk', 'Apple', 'Kidney Beans', 'Eggs'], + ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'], + ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']] + +te = TransactionEncoder() +te_ary = te.fit_transform(dataset) +df = pd.DataFrame(te_ary, columns=te.columns_) +df_freq_items_with_colnames = apriori(df, min_support=0.6, use_colnames=True) +df_strong_rules = association_rules( + df_freq_items_with_colnames, metric="confidence", min_threshold=0.7) + + +def test_kulczynski_measure_default(): + a = frozenset(['Onion']) + b = frozenset(['Kidney Beans', 'Eggs']) + assert
metrics.kulczynski_measure(df_strong_rules, a, b) == 0.875 + + +def test_kulczynski_measure_set(): + a = set(['Onion']) + b = set(['Kidney Beans', 'Eggs']) + assert metrics.kulczynski_measure(df_strong_rules, a, b) == 0.875 + + +def test_kulczynski_measure_no_antecedent(): + a = frozenset(['Laptop']) + b = frozenset(['Kidney Beans', 'Eggs']) + assert metrics.kulczynski_measure(df_strong_rules, a, b) == 0.0 + + +def test_kulczynski_measure_no_consequent(): + a = frozenset(['Onion']) + b = frozenset(['Laptop']) + assert metrics.kulczynski_measure(df_strong_rules, a, b) == 0.0 + + +def test_kulczynski_measure_empty_df(): + a = frozenset(['Onion']) + b = frozenset(['Kidney Beans', 'Eggs']) + numpy_assert_raises( + ValueError, metrics.kulczynski_measure, pd.DataFrame(), a, b) + + +def test_imbalance_ratio_default(): + a = frozenset(['Onion']) + b = frozenset(['Kidney Beans', 'Eggs']) + assert abs(metrics.imbalance_ratio( + df_freq_items_with_colnames, a, b) - 0.25) < 1e-9 + + +def test_imbalance_ratio_set(): + a = set(['Onion']) + b = set(['Kidney Beans', 'Eggs']) + assert abs(metrics.imbalance_ratio( + df_freq_items_with_colnames, a, b) - 0.25) < 1e-9 + + +def test_imbalance_ratio_no_itemset_a(): + a = frozenset([]) + b = frozenset(['Laptop']) + assert metrics.imbalance_ratio(df_freq_items_with_colnames, a, b) == 0.0 + + +def test_imbalance_ratio_no_itemset_b(): + a = frozenset(['Laptop']) + b = frozenset([]) + assert metrics.imbalance_ratio(df_freq_items_with_colnames, a, b) == 0.0 + + +def test_imbalance_ratio_no_itemset_a_b(): + a = frozenset([]) + b = frozenset([]) + assert metrics.imbalance_ratio(df_freq_items_with_colnames, a, b) == 0.0 + + +def test_imbalance_ratio_empty_df(): + a = frozenset(['Onion']) + b = frozenset(['Kidney Beans', 'Eggs']) + numpy_assert_raises( + ValueError, metrics.imbalance_ratio, pd.DataFrame(), a, b)
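Reviewer note: as an independent sanity check on the expected values in the notebook and the tests (0.875 and roughly 0.25), both metrics can be recomputed from raw supports with plain Python, without mlxtend. This is only a sketch for verification; the `support` helper below is hypothetical and not part of the PR.

```python
# Recompute both metrics from raw supports for the itemset pair
# A = {Onion}, B = {Kidney Beans, Eggs} from the example dataset.
transactions = [
    {'Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'},
    {'Milk', 'Apple', 'Kidney Beans', 'Eggs'},
    {'Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'},
    {'Corn', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs'},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


a = frozenset(['Onion'])
b = frozenset(['Kidney Beans', 'Eggs'])

s_a, s_b, s_ab = support(a), support(b), support(a | b)  # 0.6, 0.8, 0.6

# K(A,B) = (sup(A ∪ B)/sup(A) + sup(A ∪ B)/sup(B)) / 2
kulczynski = 0.5 * (s_ab / s_a + s_ab / s_b)  # ≈ 0.875

# I(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))
imbalance = abs(s_a - s_b) / (s_a + s_b - s_ab)  # ≈ 0.25

print(kulczynski, imbalance)
```

The results agree with the notebook's outputs up to floating-point noise, which is also why the notebook cell prints `0.2500000000000001` rather than an exact `0.25`.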