Predict Next Purchase

In this example, we will generate labels on online grocery orders provided by Instacart using Compose. The labels can be used to train a machine learning model to predict whether a customer will buy a specific product within the next month.

[1]:
import composeml as cp
import data

Load Data

You can download the data directly from Instacart here. After downloading the data, you can set folder as an absolute path to the directory of the CSV files. Alternatively, you can place the CSV files inside the data folder in the root directory of this notebook. With the files in place, we preview the data to get an idea on how the grocery orders looks.

[2]:
df = data.load_orders(folder='data', nrows=1000000)

df.head()
[2]:
order_id product_id add_to_cart_order reordered product_name aisle_id department_id department user_id order_time
0 120 33120 13 0 Organic Egg Whites 86 16 dairy eggs 23750 2015-01-11 08:00:00
1 120 31323 7 0 Light Wisconsin String Cheese 21 16 dairy eggs 23750 2015-01-11 08:00:00
2 120 1503 8 0 Low Fat Cottage Cheese 108 16 dairy eggs 23750 2015-01-11 08:00:00
3 120 28156 11 0 Total 0% Nonfat Plain Greek Yogurt 120 16 dairy eggs 23750 2015-01-11 08:00:00
4 120 41273 4 0 Broccoli Florets 123 4 produce 23750 2015-01-11 08:00:00

Generate Labels

Now with the grocery orders loaded, we are ready to generate labels for our prediction problem.

Create Labeling Function

To get started, we define the labeling function that will return whether a customer purchased the product in a given month.

[3]:
def bought_product(df, product_name):
    purchased = df.product_name.str.contains(product_name).any()
    return purchased

Construct Label Maker

With the labeling function, we create the label maker for our prediction problem. To process one month of orders for each customer, we set the target_entity to the customer ID and the window_size to one month. When window size is set to 1MS, the window size will end on the first day of the next month. Alias definitions are listed here.

[4]:
lm = cp.LabelMaker(
    target_entity='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='1MS',
)

Search Labels

Next, the label maker will search through the data continously to label whether a customer bought bananas in a given month. This happens when we use LabelMaker.search and set the product_name to bananas. If you are running this code yourself, feel free to expirement with other products (e.g. limes, avocados, etc.) and different time frames!

[5]:
lt = lm.search(
    df.sort_values('order_time'),
    minimum_data='2015-01-01',
    num_examples_per_instance=-1,
    product_name='Banana',
    verbose=True,
)

lt.head()
Elapsed: 01:37 | Remaining: 00:00 | Progress: 100%|██████████| user_id: 19477/19477
[5]:
user_id cutoff_time bought_product
id
0 4 2015-01-01 False
1 7 2015-01-01 False
2 10 2015-01-01 False
3 10 2015-02-01 False
4 13 2015-01-01 False

Describe Labels

With the generate label times, we can use LabelTimes.describe to print out the distribution with the settings and transforms that were used to make these labels. This is useful as a reference for understanding how the labels were generated from raw data. Also, the label distribution is helpful for determining if we have imbalanced labels.

[6]:
lt.describe()
Label Distribution
------------------
False     13752
True       7044
Total:    20796


Settings
--------
num_examples_per_instance              -1
minimum_data                   2015-01-01
window_size                  <MonthBegin>
gap                                  None


Transforms
----------
No transforms applied

Plot Labels

Additionally, there are plots available for insight to the labels.

Distribution

This plot shows the label distribution.

[7]:
lt.plot.distribution();
../../_images/examples_predict-next-purchase_example_13_0.png

Count by Time

This plot shows the label distribution across cutoff times.

[8]:
lt.plot.count_by_time();
../../_images/examples_predict-next-purchase_example_15_0.png