Commit 922f44f

add multiplexer function (#263)
1 parent 38d5964 commit 922f44f

10 files changed: +456 -1 lines changed

docs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ pages:
 - user_guide/data/boston_housing_data.md
 - user_guide/data/iris_data.md
 - user_guide/data/loadlocal_mnist.md
+- user_guide/data/make_multiplexer_dataset.md
 - user_guide/data/mnist_data.md
 - user_guide/data/three_blobs_data.md
 - user_guide/data/wine_data.md

docs/sources/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ The CHANGELOG for the current development version is available at
 - Added `'leverage'` and `'conviction'` as evaluation metrics to the `frequent_patterns.association_rules` function. [#246](https://github.com/rasbt/mlxtend/pull/246) & [#247](https://github.com/rasbt/mlxtend/pull/247)
 - Added a `loadings_` attribute to `PrincipalComponentAnalysis` to compute the factor loadings of the features on the principal components. [#251](https://github.com/rasbt/mlxtend/pull/251)
 - Allow grid search over classifiers/regressors in ensemble and stacking estimators [#259](https://github.com/rasbt/mlxtend/pull/259)
+- New `make_multiplexer_dataset` function that creates a dataset generated by an n-bit Boolean multiplexer for evaluating supervised learning algorithms [#263](https://github.com/rasbt/mlxtend/pull/263)

 ##### Changes

docs/sources/USER_GUIDE_INDEX.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 - [boston_housing_data](user_guide/data/boston_housing_data.md)
 - [iris_data](user_guide/data/iris_data.md)
 - [loadlocal_mnist](user_guide/data/loadlocal_mnist.md)
+- [make_multiplexer_dataset](user_guide/data/make_multiplexer_dataset.md)
 - [mnist_data](user_guide/data/mnist_data.md)
 - [three_blobs_data](user_guide/data/three_blobs_data.md)
 - [wine_data](user_guide/data/wine_data.md)
Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Make Multiplexer Dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Function that creates a dataset generated by an n-bit Boolean multiplexer for evaluating supervised learning algorithms."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> `from mlxtend.data import make_multiplexer_dataset`    "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Overview"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `make_multiplexer_dataset` function creates a dataset generated by an n-bit Boolean multiplexer. Such a dataset follows a simple rule, based on the behavior of an electric multiplexer, yet it presents a relatively challenging classification problem for supervised learning algorithms because of interactions between features (epistasis), as may be encountered in many real-world scenarios [1].\n",
+    "\n",
+    "The following illustration depicts a 6-bit multiplexer that consists of 2 address bits and 4 register bits. The address bits, converted to decimal representation, point to a position in the register bits. For example, if the address bits are \"00\" (0 in decimal), they point to the register bit at position 0. The value of the register bit pointed to determines the class label. For example, if the register bit at position 0 is 0, the class label is 0. Conversely, if the register bit at position 0 is 1, the class label is 1. \n",
+    "\n",
+    "![](make_multiplexer_dataset_data_files/6bit_multiplexer.png)\n",
+    "\n",
+    "\n",
+    "In the example above, the address bits \"10\" (2 in decimal) point to the 3rd register position (as we start counting from index 0), which has a bit value of 1. Hence, the class label is 1.\n",
+    "\n",
+    "Below are a few more examples:\n",
+    "\n",
+    "1. Address bits: [0, 1], register bits: [1, 0, 1, 1], class label: 0\n",
+    "2. Address bits: [0, 1], register bits: [1, 1, 1, 0], class label: 1\n",
+    "3. Address bits: [1, 0], register bits: [1, 0, 0, 1], class label: 0\n",
+    "4. Address bits: [1, 1], register bits: [1, 1, 1, 0], class label: 0\n",
+    "5. Address bits: [0, 1], register bits: [0, 1, 1, 0], class label: 1\n",
+    "6. Address bits: [0, 1], register bits: [1, 0, 0, 1], class label: 0\n",
+    "7. Address bits: [0, 1], register bits: [0, 1, 1, 1], class label: 1\n",
+    "8. Address bits: [0, 1], register bits: [0, 0, 0, 0], class label: 0\n",
+    "9. Address bits: [1, 0], register bits: [1, 0, 1, 1], class label: 1\n",
+    "10. Address bits: [0, 1], register bits: [1, 1, 1, 1], class label: 1\n",
+    "\n",
+    "Note that in the implementation of the multiplexer function, setting the number of address bits to 2 results in a 6-bit multiplexer, as two address bits can index 2^2=4 different register positions (2 bit + 4 bit = 6 bit). If we choose 3 address bits instead, 2^3=8 positions are covered, resulting in an 11-bit (3 bit + 8 bit = 11 bit) multiplexer, and so forth.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### References\n",
+    "\n",
+    "- [1] Urbanowicz, R. J., & Browne, W. N. (2017). *Introduction to Learning Classifier Systems*. Springer."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example 1 -- 6-bit multiplexer"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This simple example illustrates how to create a dataset from a 6-bit multiplexer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Features:\n",
+      " [[0 1 0 1 0 1]\n",
+      " [1 0 0 0 1 1]\n",
+      " [0 1 1 1 0 0]\n",
+      " [0 1 1 1 0 0]\n",
+      " [0 0 1 1 0 0]\n",
+      " [0 1 0 0 0 0]\n",
+      " [0 1 1 0 1 1]\n",
+      " [1 0 1 0 0 0]\n",
+      " [1 0 0 1 0 1]\n",
+      " [1 0 1 0 0 1]]\n",
+      "\n",
+      "Class labels:\n",
+      " [1 1 1 1 1 0 0 0 0 0]\n"
+     ]
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "from mlxtend.data import make_multiplexer_dataset\n",
+    "\n",
+    "\n",
+    "X, y = make_multiplexer_dataset(address_bits=2, \n",
+    "                                sample_size=10,\n",
+    "                                positive_class_ratio=0.5, \n",
+    "                                shuffle=False,\n",
+    "                                random_seed=123)\n",
+    "\n",
+    "print('Features:\\n', X)\n",
+    "print('\\nClass labels:\\n', y)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "## make_multiplexer_dataset\n",
+      "\n",
+      "*make_multiplexer_dataset(address_bits=2, sample_size=100, positive_class_ratio=0.5, shuffle=False, random_seed=None)*\n",
+      "\n",
+      "Function to create a binary n-bit multiplexer dataset.\n",
+      "\n",
+      "New in mlxtend v0.9\n",
+      "\n",
+      "**Parameters**\n",
+      "\n",
+      "- `address_bits` : int (default: 2)\n",
+      "\n",
+      " A positive integer that determines the number of address\n",
+      " bits in the multiplexer, which in turn determine the\n",
+      " n-bit capacity of the multiplexer and therefore the\n",
+      " number of features. The number of features is determined by\n",
+      " the number of address bits. For example, 2 address bits\n",
+      " will result in a 6 bit multiplexer and consequently\n",
+      " 6 features (2 + 2^2 = 6). If `address_bits=3`, then\n",
+      " this results in an 11-bit multiplexer as (3 + 2^3 = 11)\n",
+      " with 11 features.\n",
+      "\n",
+      "\n",
+      "- `sample_size` : int (default: 100)\n",
+      "\n",
+      " The total number of samples generated.\n",
+      "\n",
+      "\n",
+      "- `positive_class_ratio` : float (default: 0.5)\n",
+      "\n",
+      " The fraction (a float between 0 and 1)\n",
+      " of the `sample_size` samples\n",
+      " that have class label 1.\n",
+      " If `positive_class_ratio=0.5` (default), then\n",
+      " the ratio of class 0 and class 1 samples is perfectly balanced.\n",
+      "\n",
+      "\n",
+      "- `shuffle` : Bool (default: False)\n",
+      "\n",
+      " Whether or not to shuffle the features and labels.\n",
+      " If `False` (default), the samples are returned in sorted\n",
+      " order starting with the samples with class label 1\n",
+      " and followed by the samples with class label 0.\n",
+      "\n",
+      "\n",
+      "- `random_seed` : int (default: None)\n",
+      "\n",
+      " Random seed used for generating the multiplexer samples and shuffling.\n",
+      "\n",
+      "**Returns**\n",
+      "\n",
+      "- `X, y` : [n_samples, n_features], [n_class_labels]\n",
+      "\n",
+      " X is the feature matrix with the number of samples equal\n",
+      " to `sample_size`. The number of features is determined by\n",
+      " the number of address bits. For instance, 2 address bits\n",
+      " will result in a 6 bit multiplexer and consequently\n",
+      " 6 features (2 + 2^2 = 6).\n",
+      " All features are binary (values in {0, 1}).\n",
+      " y is a 1-dimensional array of class labels in {0, 1}.\n",
+      "\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "with open('../../api_modules/mlxtend.data/make_multiplexer_dataset.md', 'r') as f:\n",
+    "    s = f.read() \n",
+    "print(s)"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
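
As a minimal sketch of the labeling rule described in the notebook's overview, the label of a single sample can be computed by reading the address bits as a binary number and using it as an index into the register bits. The helper name `multiplexer_label` is hypothetical and not part of mlxtend; it only mirrors the rule for illustration.

```python
def multiplexer_label(bits, address_bits):
    """Return the class label for one sample (address bits, then register bits)."""
    expected_len = address_bits + 2 ** address_bits
    if len(bits) != expected_len:
        raise ValueError('expected %d bits, got %d' % (expected_len, len(bits)))
    address = bits[:address_bits]
    register = bits[address_bits:]
    # Read the address bits as a binary number, most significant bit first.
    position = int(''.join(str(b) for b in address), 2)
    return register[position]


# Examples 1, 3 and 9 from the overview (2 address bits + 4 register bits).
print(multiplexer_label([0, 1, 1, 0, 1, 1], address_bits=2))  # 0
print(multiplexer_label([1, 0, 1, 0, 0, 1], address_bits=2))  # 0
print(multiplexer_label([1, 0, 1, 0, 1, 1], address_bits=2))  # 1
```

The length check makes the feature-count relation explicit: 2 address bits require 2 + 2^2 = 6 bits in total, and 3 address bits require 3 + 2^3 = 11.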

mlxtend/data/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -11,7 +11,9 @@
 from .local_mnist import loadlocal_mnist
 from .boston_housing import boston_housing_data
 from .three_blobs import three_blobs_data
+from .multiplexer import make_multiplexer_dataset

 __all__ = ["iris_data", "wine_data", "autompg_data",
            "loadlocal_mnist", "mnist_data",
-           "boston_housing_data", "three_blobs_data"]
+           "boston_housing_data", "three_blobs_data",
+           "make_multiplexer_dataset"]

mlxtend/data/multiplexer.py

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+# Sebastian Raschka 2014-2017
+# mlxtend Machine Learning Library Extensions
+#
+# A function for creating a multiplexer dataset for classification.
+# Author: Sebastian Raschka <sebastianraschka.com>
+#
+# License: BSD 3 clause
+
+import numpy as np
+
+
+def make_multiplexer_dataset(address_bits=2, sample_size=100,
+                             positive_class_ratio=0.5, shuffle=False,
+                             random_seed=None):
+    """
+    Function to create a binary n-bit multiplexer dataset.
+
+    New in mlxtend v0.9
+
+    Parameters
+    ---------------
+    address_bits : int (default: 2)
+        A positive integer that determines the number of address
+        bits in the multiplexer, which in turn determine the
+        n-bit capacity of the multiplexer and therefore the
+        number of features. The number of features is determined by
+        the number of address bits. For example, 2 address bits
+        will result in a 6 bit multiplexer and consequently
+        6 features (2 + 2^2 = 6). If `address_bits=3`, then
+        this results in an 11-bit multiplexer as (3 + 2^3 = 11)
+        with 11 features.
+
+    sample_size : int (default: 100)
+        The total number of samples generated.
+
+    positive_class_ratio : float (default: 0.5)
+        The fraction (a float between 0 and 1)
+        of the `sample_size` samples
+        that have class label 1.
+        If `positive_class_ratio=0.5` (default), then
+        the ratio of class 0 and class 1 samples is perfectly balanced.
+
+    shuffle : Bool (default: False)
+        Whether or not to shuffle the features and labels.
+        If `False` (default), the samples are returned in sorted
+        order starting with the samples with class label 1
+        and followed by the samples with class label 0.
+
+    random_seed : int (default: None)
+        Random seed used for generating the multiplexer samples and shuffling.
+
+    Returns
+    --------
+    X, y : [n_samples, n_features], [n_class_labels]
+        X is the feature matrix with the number of samples equal
+        to `sample_size`. The number of features is determined by
+        the number of address bits. For instance, 2 address bits
+        will result in a 6 bit multiplexer and consequently
+        6 features (2 + 2^2 = 6).
+        All features are binary (values in {0, 1}).
+        y is a 1-dimensional array of class labels in {0, 1}.
+
+    """
+
+    if not isinstance(address_bits, int):
+        raise AttributeError('address_bits'
+                             ' must be an integer. Got %s.' %
+                             type(address_bits))
+    if address_bits < 1:
+        raise AttributeError('Number of address_bits'
+                             ' must be greater than 0. Got %s.' % address_bits)
+    register_bits = 2**address_bits
+    total_bits = address_bits + register_bits
+    X_pos, y_pos = [], []
+    X_neg, y_neg = [], []
+
+    # use numpy's instead of python's round because of consistent
+    # banker's rounding behavior across versions
+    n_positives = int(np.round(sample_size*positive_class_ratio))
+    n_negatives = sample_size - n_positives
+
+    rng = np.random.RandomState(random_seed)
+
+    def gen_randsample():
+        all_bits = [rng.randint(0, 2) for i in range(total_bits)]
+        address_str = ''.join(str(c) for c in all_bits[:address_bits])
+        register_pos = int(address_str, base=2)
+        class_label = all_bits[address_bits:][register_pos]
+        return all_bits, class_label
+
+    while len(y_pos) < n_positives or len(y_neg) < n_negatives:
+
+        all_bits, class_label = gen_randsample()
+
+        if class_label and len(y_pos) < n_positives:
+            X_pos.append(all_bits)
+            y_pos.append(class_label)
+
+        elif not class_label and len(y_neg) < n_negatives:
+            X_neg.append(all_bits)
+            y_neg.append(class_label)
+
+    X, y = X_pos + X_neg, y_pos + y_neg
+    X, y = np.array(X, dtype=int), np.array(y, dtype=int)
+
+    if shuffle:
+        p = rng.permutation(y.shape[0])
+        X, y = X[p], y[p]
+
+    return X, y
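
The class-count arithmetic in the function above can be illustrated in isolation. The sketch below assumes the same rounding call, `np.round`, which rounds halves to the nearest even integer; the built-in `round` behaves this way in Python 3 but rounds halves away from zero in Python 2. The helper `class_counts` is hypothetical and not part of mlxtend.

```python
import numpy as np


def class_counts(sample_size, positive_class_ratio):
    """Split sample_size into (n_positives, n_negatives)."""
    # np.round rounds halves to the nearest even integer (banker's rounding).
    n_positives = int(np.round(sample_size * positive_class_ratio))
    return n_positives, sample_size - n_positives


print(class_counts(10, 0.5))  # (5, 5)
print(class_counts(5, 0.5))   # 2.5 rounds to 2, giving (2, 3)
print(class_counts(7, 0.5))   # 3.5 rounds to 4, giving (4, 3)
```

For odd sample sizes with a ratio of 0.5, the two classes therefore differ by one sample, and which class gets the extra sample depends on the even-rounding rule.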

mlxtend/data/tests/test_data.py renamed to mlxtend/data/tests/test_datasets.py

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
+# Sebastian Raschka 2014-2017
+# mlxtend Machine Learning Library Extensions
+#
+# Author: Sebastian Raschka <sebastianraschka.com>
+#
+# License: BSD 3 clause
+
+
 from mlxtend.data import iris_data
 from mlxtend.data import wine_data
 from mlxtend.data import autompg_data
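
The tests for the new function are not shown in this view. One way such a check could look, as a sketch only, is to recompute the labels from the features and compare them with the labels printed in the notebook example. The names `multiplexer_labels` and `test_notebook_example_follows_rule` are hypothetical; the arrays are copied from the notebook output in this commit, so the sketch does not import mlxtend.

```python
import numpy as np


def multiplexer_labels(X, address_bits):
    """Compute the class label of every row of a 0/1 feature matrix."""
    X = np.asarray(X)
    address = X[:, :address_bits]
    register = X[:, address_bits:]
    # Weights 2^(k-1), ..., 2, 1 turn each row of address bits into an index.
    weights = 2 ** np.arange(address_bits - 1, -1, -1)
    positions = address.dot(weights)
    # Pick, for each row, the register bit at that row's position.
    return register[np.arange(X.shape[0]), positions]


def test_notebook_example_follows_rule():
    # Feature matrix and labels copied from the notebook output.
    X = np.array([[0, 1, 0, 1, 0, 1],
                  [1, 0, 0, 0, 1, 1],
                  [0, 1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0, 0],
                  [0, 0, 1, 1, 0, 0],
                  [0, 1, 0, 0, 0, 0],
                  [0, 1, 1, 0, 1, 1],
                  [1, 0, 1, 0, 0, 0],
                  [1, 0, 0, 1, 0, 1],
                  [1, 0, 1, 0, 0, 1]])
    y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    assert X.shape[1] == 2 + 2 ** 2
    np.testing.assert_array_equal(multiplexer_labels(X, address_bits=2), y)


test_notebook_example_follows_rule()
```

The vectorized indexing avoids building a string per row, which is a design choice of this sketch rather than something taken from the commit.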
