Using Label Transforms

In this guide, we demonstrate how to use the transforms that are available on LabelTimes. Each transform returns a copy of the label times, so we can try multiple transforms in different settings without recalculating the labels. This makes it quicker to compare which labels lead to better performance.

Generate Labels

Let’s start by generating labels on a mock dataset of transactions. Each label is defined as the total amount spent by a customer within one hour of transactions.

[1]:
import composeml as cp

def total_spent(df):
    return df['amount'].sum()

label_maker = cp.LabelMaker(
    labeling_function=total_spent,
    target_entity='customer_id',
    time_index='transaction_time',
    window_size='1h',
)

labels = label_maker.search(
    cp.demos.load_transactions(),
    num_examples_per_instance=10,
    label_type='continuous',
    minimum_data='2h',
    gap='2min',
    verbose=True,
)
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 50/50

To get an idea of how the labels look, we preview the data frame.

[2]:
labels.head()
[2]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 217.94
1 1 2014-01-01 02:47:30 217.94
2 1 2014-01-01 02:49:30 217.94
3 1 2014-01-01 02:51:30 217.94
4 1 2014-01-01 02:53:30 217.94

Threshold on Labels

LabelTimes.threshold() creates binary labels by testing whether label values are above a threshold. In this example, a threshold is applied to determine which customers spent more than 100.

[3]:
labels.threshold(100).head()
[3]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 True
1 1 2014-01-01 02:47:30 True
2 1 2014-01-01 02:49:30 True
3 1 2014-01-01 02:51:30 True
4 1 2014-01-01 02:53:30 True
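
The comparison behind a threshold can be sketched in plain Python. This is a minimal illustration and not Compose's implementation: the helper name apply_threshold is hypothetical, the totals are a few values that appear in the outputs of this guide, and a strict greater-than comparison is assumed from the word "above".

```python
# A few totals that appear in the outputs of this guide.
totals = [217.94, 290.39, 196.25, 53.22]


def apply_threshold(values, threshold):
    """Return True for each value strictly greater than the threshold.

    Assumption: "above" means strictly greater than, so a value equal
    to the threshold maps to False.
    """
    return [value > threshold for value in values]


binary_labels = apply_threshold(totals, 100)
print(binary_labels)
```

Only the total of 53.22 falls below 100, so it is the only value mapped to False.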

Lead Label Times

LabelTimes.apply_lead() shifts the label times earlier. This is useful for training a model to predict in advance. In this example, a one-hour lead is applied to the label times.

[4]:
labels.apply_lead('1h').head()
[4]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 01:45:30 217.94
1 1 2014-01-01 01:47:30 217.94
2 1 2014-01-01 01:49:30 217.94
3 1 2014-01-01 01:51:30 217.94
4 1 2014-01-01 01:53:30 217.94
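
The shift itself is ordinary datetime arithmetic. The following sketch, which uses only the standard library and a hypothetical helper named shift_earlier, subtracts a one-hour lead from the first two cutoff times shown above. It illustrates the idea and is not how Compose is implemented.

```python
from datetime import datetime, timedelta

# The first two cutoff times from the preview of the labels.
cutoff_times = [
    datetime(2014, 1, 1, 2, 45, 30),
    datetime(2014, 1, 1, 2, 47, 30),
]


def shift_earlier(times, lead):
    """Return new datetimes moved earlier by the given lead."""
    return [time - lead for time in times]


lead_times = shift_earlier(cutoff_times, timedelta(hours=1))
for time in lead_times:
    print(time)
```

The original list is left untouched, which mirrors how each transform returns a copy.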

Bin Labels

LabelTimes.bin() bins the labels into discrete intervals. Bins can be based on either values or quantiles, and the widths of the bins can either be defined by the user or divided equally. The following examples go through each combination.

Value Based

To use bins based on values, set quantiles to False, which is the default value.

Equal Width

To group values into bins of equal width, set bins as a scalar value. In this example, the total spent is grouped into bins of equal width.

[5]:
labels.bin(4, quantiles=False).head()
[5]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 (198.455, 271.072]
1 1 2014-01-01 02:47:30 (198.455, 271.072]
2 1 2014-01-01 02:49:30 (198.455, 271.072]
3 1 2014-01-01 02:51:30 (198.455, 271.072]
4 1 2014-01-01 02:53:30 (198.455, 271.072]
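
The interval in the output above can be checked by hand. As a minimal sketch, assume that equal-width bins split the range between the smallest and largest label into equal parts, and take the minimum (53.22) and maximum (343.69) from the descriptive statistics shown later in this guide. The helper name equal_width_edges is hypothetical and not part of Compose.

```python
def equal_width_edges(low, high, n_bins):
    """Return n_bins + 1 evenly spaced edges from low to high."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]


# Minimum and maximum of total_spent, as reported by describe().
edges = equal_width_edges(53.22, 343.69, 4)
print(edges)
```

The third and fourth edges come out at roughly 198.455 and 271.0725, which agrees with the interval displayed for 217.94. A library may nudge the lowest edge slightly so that the minimum value falls inside the first right-closed bin; that detail is left out here.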

Custom Widths

To group values into bins of custom widths, set bins as an array of values to define edges. In this example, the total spent is grouped into bins of custom widths.

[6]:
inf = float('inf')
edges = [-inf, 34, 50, 67, inf]
labels.bin(edges, quantiles=False).head()
[6]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 (67.0, inf]
1 1 2014-01-01 02:47:30 (67.0, inf]
2 1 2014-01-01 02:49:30 (67.0, inf]
3 1 2014-01-01 02:51:30 (67.0, inf]
4 1 2014-01-01 02:53:30 (67.0, inf]
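
To see why 217.94 lands in (67.0, inf], the lookup can be sketched with the standard bisect module. This assumes right-closed intervals, as the bracket notation in the output suggests; find_interval is a hypothetical helper, not a Compose function.

```python
from bisect import bisect_left

inf = float('inf')
edges = [-inf, 34, 50, 67, inf]


def find_interval(value, edges):
    """Return the right-closed interval (left, right] that contains value."""
    index = bisect_left(edges, value) - 1
    if index < 0 or index >= len(edges) - 1:
        raise ValueError('value is outside the given edges')
    return edges[index], edges[index + 1]


print(find_interval(217.94, edges))
print(find_interval(50, edges))
```

Because the intervals are closed on the right, a value of exactly 50 belongs to (34, 50] and not to (50, 67].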

Quantile Based

To use bins based on quantiles, quantiles should be set to True.

Equal Width

To group values into quantile bins of equal width, set bins to the number of quantiles as a scalar value (e.g. 4 for quartiles, 10 for deciles, etc.). In this example, the total spent is grouped into bins based on the quartiles.

[7]:
labels.bin(4, quantiles=True).head()
[7]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 (196.25, 217.94]
1 1 2014-01-01 02:47:30 (196.25, 217.94]
2 1 2014-01-01 02:49:30 (196.25, 217.94]
3 1 2014-01-01 02:51:30 (196.25, 217.94]
4 1 2014-01-01 02:53:30 (196.25, 217.94]

To verify the quartile values, we can check the descriptive statistics.

[8]:
stats = labels.total_spent.describe()
stats = stats.round(3).to_string()
print(stats)
count     50.000
mean     215.182
std       90.518
min       53.220
25%      196.250
50%      217.940
75%      290.390
max      343.690

Custom Widths

To group values into quantile bins of custom widths, set bins as an array of quantiles. In this example, the total spent is grouped into quantile bins of custom widths.

[9]:
quantiles = [0, .34, .5, .67, 1]
labels.bin(quantiles, quantiles=True).head()
[9]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 (196.25, 217.94]
1 1 2014-01-01 02:47:30 (196.25, 217.94]
2 1 2014-01-01 02:49:30 (196.25, 217.94]
3 1 2014-01-01 02:51:30 (196.25, 217.94]
4 1 2014-01-01 02:53:30 (196.25, 217.94]
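
Quantile levels are turned into bin edges by looking up the value at each level. The sketch below assumes linear interpolation between neighboring sorted values, which this guide does not confirm for Compose. The data is made up for illustration and is not the transactions dataset, and the helper name quantile is hypothetical.

```python
import math


def quantile(sorted_values, q):
    """Return the q-th quantile using linear interpolation (an assumption)."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    low_value = sorted_values[lower]
    high_value = sorted_values[upper]
    return low_value + fraction * (high_value - low_value)


# Made-up values, used only to illustrate the calculation.
values = sorted([10.0, 20.0, 30.0, 40.0, 50.0])
levels = [0, .34, .5, .67, 1]
edges = [quantile(values, q) for q in levels]
print(edges)
```

The levels 0 and 1 always map to the minimum and maximum, so every value falls between the outermost edges.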

Label Bins

To assign bins with custom labels, set labels to the array of values. The number of labels needs to match the number of bins. In this example, the total spent is grouped into bins with custom labels.

[10]:
values = ['low', 'medium', 'high']
labels.bin(3, labels=values).head()
[10]:
customer_id cutoff_time total_spent
id
0 1 2014-01-01 02:45:30 medium
1 1 2014-01-01 02:47:30 medium
2 1 2014-01-01 02:49:30 medium
3 1 2014-01-01 02:51:30 medium
4 1 2014-01-01 02:53:30 medium
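
The mapping from a value to a named bin can be sketched as follows, assuming three equal-width, right-closed bins between the minimum (53.22) and maximum (343.69) reported by the descriptive statistics in this guide. The helper label_for is hypothetical and only illustrates the idea.

```python
import math


def label_for(value, low, high, names):
    """Return the name of the equal-width, right-closed bin holding value."""
    n_bins = len(names)
    width = (high - low) / n_bins
    index = math.ceil((value - low) / width) - 1
    # Clamp so the minimum and maximum stay in the first and last bins.
    index = max(0, min(index, n_bins - 1))
    return names[index]


names = ['low', 'medium', 'high']
print(label_for(217.94, 53.22, 343.69, names))
```

A total of 217.94 sits in the middle third of the range, which is consistent with the medium label in the output above.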

Describe Labels

LabelTimes.describe() prints the label distribution along with the settings and transforms that were used to make the labels. This is useful as a reference for understanding how the labels were generated from raw data. The label distribution also helps determine whether the labels are imbalanced. In this example, a description of the labels is printed after transforming the labels into discrete values.

[11]:
labels.threshold(100).describe()
Label Distribution
------------------
True      42
False      8
Total:    50


Settings
--------
num_examples_per_instance        10
minimum_data                     2h
window_size                  <Hour>
gap                            2min


Transforms
----------
1. threshold
  - value:    100
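
The distribution above already hints at an imbalance. As a small sketch using only the standard library, the counts printed by describe (42 True and 8 False) can be turned into shares to quantify it. The list below is rebuilt from those counts and is not read from Compose.

```python
from collections import Counter

# Rebuilt from the distribution printed above: 42 True and 8 False.
binary_labels = [True] * 42 + [False] * 8

counts = Counter(binary_labels)
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(label, f'{share:.0%}')
```

With 84% of the labels being True, a model that always predicts True would already look accurate, which is worth keeping in mind when choosing metrics.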

Sample Labels

LabelTimes.sample() will sample the labels based on a number or fraction. Samples can be reproduced by fixing random_state to an integer.

To sample 10 labels, n is set to 10.

[12]:
labels.sample(n=10, random_state=0)
[12]:
customer_id cutoff_time total_spent
id
28 3 2014-01-01 04:01:05 196.25
11 2 2014-01-01 02:02:00 290.39
10 2 2014-01-01 02:00:00 290.39
41 5 2014-01-01 03:48:25 53.22
2 1 2014-01-01 02:49:30 217.94
27 3 2014-01-01 03:59:05 196.25
38 4 2014-01-01 02:55:00 225.18
31 4 2014-01-01 02:41:00 343.69
22 3 2014-01-01 03:49:05 196.25
4 1 2014-01-01 02:53:30 217.94

Similarly, to sample 10% of labels, frac is set to 0.1.

[13]:
labels.sample(frac=.1, random_state=0)
[13]:
customer_id cutoff_time total_spent
id
28 3 2014-01-01 04:01:05 196.25
11 2 2014-01-01 02:02:00 290.39
10 2 2014-01-01 02:00:00 290.39
41 5 2014-01-01 03:48:25 53.22
2 1 2014-01-01 02:49:30 217.94
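
The role of random_state can be illustrated with a seeded generator from the standard library. In this sketch, sample_ids is a hypothetical helper that draws from 50 row ids by number or by fraction. It will not reproduce the exact rows shown above, since Compose does not use this generator.

```python
import random


def sample_ids(ids, n=None, frac=None, random_state=None):
    """Draw ids without replacement, by count (n) or by fraction (frac)."""
    if (n is None) == (frac is None):
        raise ValueError('pass exactly one of n or frac')
    size = n if n is not None else round(frac * len(ids))
    return random.Random(random_state).sample(ids, size)


ids = list(range(50))
first = sample_ids(ids, n=10, random_state=0)
again = sample_ids(ids, n=10, random_state=0)
tenth = sample_ids(ids, frac=0.1, random_state=0)

print(first == again)
print(len(tenth))
```

Fixing the seed makes both draws of 10 identical, and a fraction of 0.1 over 50 ids yields 5 rows, which matches the number of rows in the output above.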

Categorical Labels

When working with categorical labels, the number or fraction of labels for each category can be sampled by using a dictionary. Let’s bin the labels into 4 bins to make them categorical.

[14]:
categorical = labels.bin(4, labels=['A', 'B', 'C', 'D'])

To sample 2 labels per category, map each category to the number 2.

[15]:
n = {'A': 2, 'B': 2, 'C': 2, 'D': 2}
categorical.sample(n=n, random_state=0)
[15]:
customer_id cutoff_time total_spent
id
46 5 2014-01-01 03:58:25 A
42 5 2014-01-01 03:50:25 A
26 3 2014-01-01 03:57:05 B
48 5 2014-01-01 04:02:25 B
6 1 2014-01-01 02:57:30 C
38 4 2014-01-01 02:55:00 C
11 2 2014-01-01 02:02:00 D
16 2 2014-01-01 02:12:00 D

Similarly, to sample 10% of labels per category, map each category to 0.1.

[16]:
frac = {'A': .1, 'B': .1, 'C': .1, 'D': .1}
categorical.sample(frac=frac, random_state=0)
[16]:
customer_id cutoff_time total_spent
id
46 5 2014-01-01 03:58:25 A
26 3 2014-01-01 03:57:05 B
6 1 2014-01-01 02:57:30 C
11 2 2014-01-01 02:02:00 D
16 2 2014-01-01 02:12:00 D
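
Sampling per category amounts to grouping the rows by category and then drawing from each group separately. The sketch below shows that idea with the standard library. The rows are made up, sample_per_category is a hypothetical helper, and the result is not expected to match the rows Compose returns.

```python
import random
from collections import defaultdict


def sample_per_category(rows, n, random_state=None):
    """Draw n[category] row ids from each category, without replacement."""
    rng = random.Random(random_state)
    groups = defaultdict(list)
    for row_id, category in rows:
        groups[category].append(row_id)
    return {
        category: rng.sample(groups[category], size)
        for category, size in n.items()
    }


# Made-up rows: 20 ids cycling through the categories A to D.
rows = [(i, 'ABCD'[i % 4]) for i in range(20)]
n = {'A': 2, 'B': 2, 'C': 2, 'D': 2}
picked = sample_per_category(rows, n, random_state=0)
print(picked)
```

Each category contributes exactly the requested number of rows, regardless of how many rows it holds overall.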