Getting Started

In this example, we generate labels from a mock dataset of transactions. For each customer, we want to label whether the total purchase amount over the next hour of transactions will exceed 100. We also want to make each prediction one hour in advance.

Load Data

With the package installed, we load the data. To get an idea of what the transactions look like, we preview the data frame.

[1]:
import composeml as cp

df = cp.datasets.transactions()

df[df.columns[:5]].head()
[1]:
                     transaction_id  session_id  product_id  amount  customer_id
transaction_time
2014-01-01 03:13:51             190          14           5  120.52            1
2014-01-01 11:04:42             350          19           3   65.43            3
2014-01-02 11:44:35             254          11           5  128.51            4
2014-01-02 17:12:39             337          16           2  105.15            2
2014-01-02 17:46:20             177          29           5   65.11            1

Create Labeling Function

First, we define the function that returns the total purchase amount given an hour of transactions.

[2]:
def my_labeling_function(df_slice):
    label = df_slice["amount"].sum()
    return label

Construct Label Maker

With the labeling function, we create the LabelMaker for our prediction problem. We need an hour of transactions for each label, so we set window_size to one hour.

[3]:
label_maker = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=my_labeling_function,
    window_size="1h",
)
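To make the role of window_size more concrete, here is a minimal plain-Python sketch of summing the amounts that fall inside a one-hour window. It does not use composeml. The helper name window_total and the sample rows are hypothetical, and the half-open window boundary used here is an assumption rather than a description of how the library anchors or bounds its windows.

```python
from datetime import datetime, timedelta


def window_total(rows, start, window=timedelta(hours=1)):
    """Sum the amounts of rows whose time satisfies start <= time < start + window."""
    end = start + window
    return sum(amount for time, amount in rows if start <= time < end)


# Hypothetical (timestamp, amount) pairs for a single customer.
rows = [
    (datetime(2014, 1, 1, 3, 0), 40.0),
    (datetime(2014, 1, 1, 3, 30), 70.5),
    (datetime(2014, 1, 1, 4, 15), 20.0),  # falls outside the first one-hour window
]

total = window_total(rows, start=datetime(2014, 1, 1, 3, 0))
print(total)
```

Only the first two rows fall inside the window, so the third amount does not contribute to the total.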

Generate Labels

Next, we search for and extract the labels automatically by using LabelMaker.search().

[4]:
labels = label_maker.search(
    df,
    minimum_data="1h",
    num_examples_per_instance=25,
    gap=1,
    verbose=True,
)

labels.head()
Elapsed: 00:00 | Remaining: 00:00 | Progress: 100%|██████████| customer_id: 125/125
[4]:
          customer_id          cutoff_time  my_labeling_function
label_id
0                   1  2014-01-01 04:13:51                 65.11
1                   1  2014-01-03 15:41:34                101.08
2                   1  2014-01-05 11:46:10                 16.78
3                   1  2014-01-06 09:54:58                108.16
4                   1  2014-01-08 08:54:02                 48.33
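In the output above, the first cutoff time for customer 1 (04:13:51) falls exactly one hour after that customer's first previewed transaction (03:13:51), which is consistent with minimum_data="1h" delaying the start of the search. The sketch below only reproduces that arithmetic with the standard library. The variable names are illustrative, and this is an observation from the displayed output, not a statement about the library's internals.

```python
from datetime import datetime, timedelta

# First transaction shown for customer 1 in the data preview.
first_transaction = datetime(2014, 1, 1, 3, 13, 51)

# Stand-in for minimum_data="1h".
minimum_data = timedelta(hours=1)

first_cutoff = first_transaction + minimum_data
print(first_cutoff)
```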

Transform Labels

With the generated LabelTimes, we will apply specific transforms for our prediction problem.

Apply Threshold on Labels

We apply LabelTimes.threshold() to make the labels binary: a label is True when the total amount exceeds 100.

[5]:
labels = labels.threshold(100)

labels.head()
[5]:
          customer_id          cutoff_time  my_labeling_function
label_id
0                   1  2014-01-01 04:13:51                 False
1                   1  2014-01-03 15:41:34                  True
2                   1  2014-01-05 11:46:10                 False
3                   1  2014-01-06 09:54:58                  True
4                   1  2014-01-08 08:54:02                 False
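As a quick check of the thresholding step, the following plain-Python sketch applies the same comparison to the five totals previewed earlier. It assumes a strict greater-than comparison, following the word "exceed" in the problem statement. How the library treats a total of exactly 100 is not shown in this example.

```python
# Totals for the first five labels, copied from the search output.
totals = [65.11, 101.08, 16.78, 108.16, 48.33]
threshold = 100

# Assumption: "exceeds" means strictly greater than the threshold.
binary_labels = [total > threshold for total in totals]
print(binary_labels)
```

The result matches the True and False values in the thresholded preview.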

Lead Label Times

Lastly, we use LabelTimes.apply_lead() to shift the label times one hour earlier so that predictions are made in advance.

[6]:
labels = labels.apply_lead('1h')

labels.head()
[6]:
          customer_id          cutoff_time  my_labeling_function
label_id
0                   1  2014-01-01 03:13:51                 False
1                   1  2014-01-03 14:41:34                  True
2                   1  2014-01-05 10:46:10                 False
3                   1  2014-01-06 08:54:58                  True
4                   1  2014-01-08 07:54:02                 False
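The effect of the lead can be reproduced by hand: each cutoff time moves back by one hour while the label value stays the same. A minimal standard-library sketch, using the first two cutoff times from the thresholded preview (the variable names are illustrative):

```python
from datetime import datetime, timedelta

# Cutoff times before the lead is applied, copied from the thresholded output.
cutoff_times = [
    datetime(2014, 1, 1, 4, 13, 51),
    datetime(2014, 1, 3, 15, 41, 34),
]

# Stand-in for the "1h" lead.
lead = timedelta(hours=1)

# Shift each cutoff time earlier by the lead.
shifted = [time - lead for time in cutoff_times]
for time in shifted:
    print(time)
```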

Describe Labels

We can use LabelTimes.describe() to print the label distribution along with the settings and transforms used to make the labels.

[7]:
labels.describe()
Label Distribution
------------------
False      75
True       50
Total:    125


Settings
--------
gap                           1
minimum_data                 1h
num_examples_per_instance    25
window_size                  1h


Transforms
----------
1. threshold
  - value:    100

2. apply_lead
  - value:    1h
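The label distribution reported above also gives the class balance directly. As a small worked calculation (plain Python, with the counts copied from the describe() output and illustrative variable names):

```python
# Counts copied from the "Label Distribution" section of the describe() output.
label_counts = {False: 75, True: 50}

total = sum(label_counts.values())
positive_rate = label_counts[True] / total

print(total)
print(positive_rate)
```

Under these counts, 40% of the labels are positive, which is worth keeping in mind when choosing evaluation metrics for a model trained on them.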

Plot Labels

Plots are also available for insight into the labels.

Label Distribution

labels.plot.distribution()
[Figure: label distribution]

Label Count vs. Time

labels.plot.count_by_time(figsize=(7, 5))
[Figure: label count vs. time]