In this project, I build a predictive model that estimates whether a customer will complete a coupon once they first view it.
The model is trained on users who have no null values in their attributes and who viewed BOGO or discount coupons at least 3 times.
(Even if a customer never views or shows the coupon in the shop, the coupon can still be completed automatically once he spends the required amount.)
This model mainly uses 4 kinds of features for prediction.
1. Load in files
2. Clean the data
3. Univariate Exploration
4. Multivariate Exploration
5. Make a dataframe in which each row represents one coupon sent during the observation period
6. Basic Feature Engineering
1. Load in files
2. Feature engineering 1 (make columns representing the distribution of the customer's past coupon-viewing times)
3. Feature engineering 2 (make a column representing the customer's past total completion rate)
4. Feature engineering 3 (make columns representing the customer's past completion rate in each of 4 situations)
5. Create a model
6. Interpretation of the model
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline
# details of 10 coupons
portfolio = pd.read_csv('data/portfolio_clean.csv')
# each customer's attribute
profile = pd.read_csv('data/profile_clean.csv')
# all actions of all customers
transcript = pd.read_csv('data/transcript_clean.csv')
# dataframe in which each row represents one coupon sent
interaction = pd.read_csv("data/interaction.csv")
# subset of "interaction": offers that were viewed first
offer_viewed = pd.read_csv("data/firstly_viewed_offers.csv")
pd.options.display.max_rows = 100
pd.set_option("display.max_columns", 500)
portfolio.head()
profile.head()
transcript.head()
print(interaction.shape)
interaction.head()
print(offer_viewed.shape)
offer_viewed.head()
# remove offers whose receivers have null values in their attributes
offer_viewed_no_null = offer_viewed[offer_viewed.isnull().iloc[:,:10].sum(axis = 1) == 0]
print(offer_viewed_no_null.shape)
offer_viewed_no_null.groupby("person")["offer_bogo"].count().value_counts().sort_index()
This means that I can use 3828 rows for creating a predictive model. Again, I will make a model that predicts whether [a customer who has already received 3 or more offers] would complete the next offer he views.
# customers who got more than or equal to 3 discount/BOGO coupons
customers_for_model = offer_viewed_no_null.groupby("person")["offer_bogo"].count().index[offer_viewed_no_null.groupby("person")["offer_bogo"].count() >= 3]
# extract the rows of offers sent to those customers
offer_viewed_no_null_3or_more = offer_viewed_no_null[offer_viewed_no_null.person.isin(customers_for_model)]
offer_viewed_no_null_3or_more.shape
# The time when the last offer was sent to each customer
offer_person_time = offer_viewed_no_null_3or_more.groupby("person").max()["t_received"]
print(offer_person_time.shape)
offer_person_time.head()
# Boolean mask which indicates whether the row is the last offer for the person
last_mask = (offer_viewed_no_null_3or_more["person"] + offer_viewed_no_null_3or_more["t_received"].astype(str)).isin(
offer_person_time.index + offer_person_time.astype(str))
offer_viewed_no_null_3or_more_last = offer_viewed_no_null_3or_more[last_mask]
print(offer_viewed_no_null_3or_more_last.shape)
offer_viewed_no_null_3or_more_last.head()
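As a side note, the same "last offer per person" rows could also be selected with a groupby; the sketch below should give the same rows, except that idxmax keeps only one row per person if someone received two offers at the exact same time.
# equivalent selection of each person's last received offer (illustrative alternative)
idx_last = offer_viewed_no_null_3or_more.groupby("person")["t_received"].idxmax()
offer_viewed_no_null_3or_more.loc[idx_last].head()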
The distribution of how long each customer took to view past offers can be a good variable representing his behavior pattern. Note that the distribution should exclude viewing times for offers he had already completed before viewing. (It is easy to imagine that such a customer viewed the coupon only because he noticed the discount being applied.)
For this section, the type of coupon (BOGO, discount, or informational) doesn't matter.
# extract necessary columns
t_viewed = interaction.copy()[["person", "t_received", "t_viewed"]]
# replace t = nan (never viewed) with 1000 (an arbitrarily long time: treated as taking 1000 hours to view)
t_viewed.t_viewed = t_viewed.t_viewed.replace(np.nan, 1000)
# boolean mask: True if the offer was not viewed within t = 120 hours
mask_over_120hour = t_viewed.t_viewed.astype(int) // 120 > 0
# offers not viewed within 120 hours (including the never-viewed ones) are lumped into one bucket labeled 777
t_viewed.loc[mask_over_120hour, "t_viewed"] = 777
t_viewed.head(10)
t_viewed.info()
# Create an empty dataframe. Each row will represent one customer.
index = pd.Series(t_viewed.t_viewed.unique()).sort_values()
df = pd.DataFrame(columns=index)
df["person"] = offer_viewed_no_null_3or_more_last.person
df = df.replace(np.nan, 0)
df.head()
# fill 1 in "df" according to when the customer of each row viewed coupons in the past.
t = time.time()
person_index = []
# for each of the row which represents the last offer to each person who has got >= 3 offers...
for i, (each_person, each_time) in enumerate(zip(offer_viewed_no_null_3or_more_last.person, offer_viewed_no_null_3or_more_last.t_received)):
# offers sent to that person BEFORE that last offer
offers_before = t_viewed[(t_viewed.person == each_person) & (t_viewed.t_received < each_time)]
for t in offers_before.t_viewed:
temp_row = df.iloc[i].copy()
temp_row.loc[t] += 1
df.iloc[i] = temp_row
time.time() - t
# "df" represents the time distribution each customer spent before he viewed given coupons in the past.
print(df.shape)
df.head()
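For reference, the same per-person counts could probably be built without the double loop by joining each past offer to the person's last-offer time and using pd.crosstab. This is only a sketch and I have not verified that it matches the loop above exactly.
# vectorized alternative (illustrative sketch): count past offers per person per t_viewed bucket
last_t = offer_viewed_no_null_3or_more_last[["person", "t_received"]].rename(columns={"t_received": "t_last"})
past_views = t_viewed.merge(last_t, on="person")
past_views = past_views[past_views.t_received < past_views.t_last]
counts_alt = pd.crosstab(past_views.person, past_views.t_viewed)
counts_alt.head()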
# Convert df into Cumulative Distribution of number of viewed coupons
df2 = df.replace(np.nan, 0)
df2.iloc[:,:-1] = df2.iloc[:,:-1].cumsum(axis=1)
df2 = df2.rename(columns = {777:"denom_viewed"})
df2.head()
# Convert the cumulative counts into the cumulative probability that a coupon was viewed by each time
for column_name in df2.columns[:-2]:
    df2[column_name] = df2[column_name].astype(float) / df2.denom_viewed
print(df2.shape)
df2.sample(5)
# merge this into "offer_viewed_no_null_3or_more_last"
merged_df = offer_viewed_no_null_3or_more_last.merge(df2)
print(merged_df.shape)
merged_df.head()
This is the simplest and most helpful feature representing the customer's behavior pattern. It is the completion rate of the coupons sent to him, [no matter which coupon it was] and [no matter how many hours were left until the deadline when he opened it], etc.
In section 4, completion rates broken down by the situation in which he viewed the coupon are also added as new features.
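The loop below computes this per customer; an equivalent groupby-based sketch (my own illustration using the same dataframes, not the original code) would look like this:
# illustrative sketch: past offer count and completion rate per person, computed with groupby
last_t_per_person = offer_viewed_no_null_3or_more.groupby("person").t_received.transform("max")
past_offers = offer_viewed_no_null_3or_more[offer_viewed_no_null_3or_more.t_received < last_t_per_person]
past_stats = past_offers.groupby("person").agg(
    past_offers_num=("t_completed", "size"),
    past_completed=("t_completed", "count"),
)
past_stats["past_completion_rate"] = past_stats.past_completed / past_stats.past_offers_num
past_stats.head()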
# dictionary: person -> how many offers (BOGO/discount) were sent to him before his last offer
dict_person_to_past_offer_num = {}
# dictionary: person -> probability that an offer (BOGO/discount) sent to him was completed
dict_person_to_past_completion_rate = {}
for each_person, each_time in zip(merged_df.person, merged_df.t_received):
    prev_trans_df = offer_viewed_no_null_3or_more[(offer_viewed_no_null_3or_more.person == each_person) & (offer_viewed_no_null_3or_more.t_received < each_time)]
    dict_person_to_past_offer_num[each_person] = prev_trans_df.shape[0]
    dict_person_to_past_completion_rate[each_person] = prev_trans_df.t_completed.count() / prev_trans_df.shape[0]
# add the columns using dictionaries made up above.
merged_df["past_offers_num"] = merged_df["person"].apply(lambda x : dict_person_to_past_offer_num[x])
merged_df["past_completion_rate"] = merged_df["person"].apply(lambda x : dict_person_to_past_completion_rate[x])
print(merged_df.shape)
merged_df.head()
# Confirm that the 2 newly added columns store plausible values
print(merged_df.past_completion_rate.unique())
print(merged_df.past_offers_num.unique())
To some extent, the probability that a coupon is completed depends on the situation in which the customer viewed it. In other words, even for the same coupon:
Situation 1: 12 hours are left and he has to spend $10 more to complete it. Situation 2: 5 days are left and he has to spend $1 more to complete it.
In this case, it is easy to imagine that situation 2 motivates the customer more.
With that in mind, I divided the past situations in which he viewed coupons into 4, as written below.
When the customer viewed the coupon,
- The valid time left was short (less than or equal to 128 hours)
- The valid time left was long (more than 128 hours)
✕ (crossed with)
- He had to spend a little more money to complete the coupon (less than $10)
- He had to spend a lot more money to complete the coupon (more than or equal to $10)
I calculated the probability (0~1) that the person completed an offer in each of these 4 situations in the past and added them as 4 new variables.
Ideally, the reward amount would also be used and the past situations would be divided into 8, but the dataset I have here is too limited, so I did not do that.
# list of customers who received 3 or more offers and not have null attributes
customers_3_or_more_offers = offer_viewed_no_null_3or_more.person.unique()
print(len(customers_3_or_more_offers))
print(customers_3_or_more_offers[:5])
# the time when the last offer was sent to each person
person_to_last_offer_time = offer_viewed_no_null_3or_more.groupby("person").t_received.max()
person_to_last_offer_time.head()
# Create a dictionary from the above. (key = "person id" -> value = "time when the last offer was sent to him")
dict_person_to_last_offer_t = {}
for each_person, each_t in zip(person_to_last_offer_time.index, person_to_last_offer_time):
    dict_person_to_last_offer_t[each_person] = each_t
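The same dictionary can also be built in one line:
# equivalent one-liner: person id -> time the last offer was sent
dict_person_to_last_offer_t = person_to_last_offer_time.to_dict()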
# dataframe in which each row represents one offer.
# The offers in this df are limited to:
#   offers sent to people who received 3 or more offers
#   offers sent BEFORE each person's last offer
offer_viewed_no_null_3or_not_last = offer_viewed_no_null_3or_more\
[offer_viewed_no_null_3or_more.t_received <\
offer_viewed_no_null_3or_more.person.apply(lambda x : dict_person_to_last_offer_t[x])]
offer_viewed_no_null_3or_more.shape, offer_viewed_no_null_3or_not_last.shape
I check the past offers mainly from 2 points of view.
# distribution of the remaining time when a customer viewed an offer
(offer_viewed_no_null_3or_not_last.duration * 24 - offer_viewed_no_null_3or_not_last.t_viewed).plot("hist", bins = 50);
# distribution of the remaining amount to be spent to complete the offer when the customer viewed it
(offer_viewed_no_null_3or_not_last.difficulty - offer_viewed_no_null_3or_not_last.amt_till_viewed).plot("hist", bins = 50);
Some of the remaining amounts take negative values. I check the rows which do.
# rows of offers whose remaining amount needed to complete the offer is negative
offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.difficulty - offer_viewed_no_null_3or_not_last.amt_till_viewed)<0]
# Take a closer look at 1 of them.
transcript[transcript.person == "c58726d95ba8447ba01036e9d50da94c"]
For example, a customer may have already spent $1.5 by the time he viewed a coupon, and how much he still needs to spend depends on the coupon type.
For simplicity, I just take the amount into account.
# How much more money did he have to spend when he viewed an offer?
offer_viewed_no_null_3or_not_last["goal_amt_when_viewed"] = offer_viewed_no_null_3or_not_last.difficulty\
- offer_viewed_no_null_3or_not_last.amt_till_viewed
# but for a $5 BOGO, for example, he had to spend $5 more even if he had already spent $4.9 when he viewed the offer
mask_bogo = (offer_viewed_no_null_3or_not_last.offer_bogo == 1)
offer_viewed_no_null_3or_not_last.loc[mask_bogo,"goal_amt_when_viewed"] = offer_viewed_no_null_3or_not_last.loc[mask_bogo, "difficulty"]
# How many hours were left when he viewed the offer
offer_viewed_no_null_3or_not_last["left_hours_when_viewed"] = offer_viewed_no_null_3or_not_last.duration * 24\
- offer_viewed_no_null_3or_not_last.t_viewed
plt.figure(figsize=(15,10))
sns.regplot(data=offer_viewed_no_null_3or_not_last, x="left_hours_when_viewed", y="goal_amt_when_viewed",\
scatter_kws={"alpha":0.08}, fit_reg=False, x_jitter=3.5, y_jitter=0.6)
plt.title("Situation when each coupon was viewed")
plt.xlabel("Time left till the coupon get invalid (hour)");
plt.ylabel("Amount of money the customer had to use to achieve the coupon ($)");
I set the thresholds [at 128 hours for the time left] and [at $10 for the amount of money still to be spent].
These values were chosen so that the number of offers in each of the 4 groups is not too imbalanced.
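With these thresholds, each past offer falls into one of the 4 situations. As a compact illustration (the notebook below does the same split with four boolean masks), the situation could also be written as a single label column:
# illustrative sketch: label each past offer with its situation using np.select
short_mask = offer_viewed_no_null_3or_not_last.left_hours_when_viewed <= 128
little_mask = offer_viewed_no_null_3or_not_last.goal_amt_when_viewed < 10
situation = np.select(
    [short_mask & little_mask, short_mask & ~little_mask, ~short_mask & little_mask],
    ["short_little", "short_lot", "long_little"],
    default="long_lot",
)
pd.Series(situation).value_counts()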
merged_df2 = merged_df.copy()
# 8751 offers should be treated as past offers
offer_viewed_no_null_3or_not_last.shape
# offers for which, when viewed, [<= 128 hours] were left and [less than $10] still had to be spent
short_little = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed <= 128)
                                                 & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed < 10)]
# offers for which, when viewed, [<= 128 hours] were left and [$10 or more] still had to be spent
short_lot = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed <= 128)
                                              & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed >= 10)]
# offers for which, when viewed, [more than 128 hours] were left and [less than $10] still had to be spent
long_little = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed > 128)
                                                & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed < 10)]
# offers for which, when viewed, [more than 128 hours] were left and [$10 or more] still had to be spent
long_lot = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed > 128)
                                             & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed >= 10)]
# number of offers which belong to each of 4 situations
short_little.shape[0], short_lot.shape[0], long_little.shape[0], long_lot.shape[0]
# probability that offer was achieved in each situation (for all customers)
print("short hours left, a little more purchase needed")
print(short_little.t_completed.count()/short_little.shape[0])
print("\nshort hours left, a lot more purchase needed")
print(short_lot.t_completed.count()/short_lot.shape[0])
print("\nlong left, a little more purchase needed")
print(long_little.t_completed.count()/long_little.shape[0])
print("\nlong hours left, a lot more purchase needed")
print(long_lot.t_completed.count()/long_lot.shape[0])
# 1. short time left, a little more purchase needed
# add columns which represent the past completion rate
# in the situation of (short hours left, a little more purchase needed)
df_temp1 = (short_little.groupby("person")["t_completed"].count() / short_little.groupby("person")["age"].count())
df_temp2 = short_little.groupby("person")["age"].count()
df_short_little_comp_rate = pd.DataFrame(index=df_temp1.index)
df_short_little_comp_rate["short_little_comp_rate"] = df_temp1
df_short_little_count = pd.DataFrame(index=df_temp2.index)
df_short_little_count["short_little_count"] = df_temp2
merged_df2 = merged_df2.merge(df_short_little_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_short_little_count, how="left", on="person")
merged_df2.head()
# 2. short time left, a lot more purchase needed
# add columns which represent the past completion rate
# in the situation of (short hours left, a lot more purchase needed)
df_temp1 = (short_lot.groupby("person")["t_completed"].count() / short_lot.groupby("person")["age"].count())
df_temp2 = short_lot.groupby("person")["age"].count()
df_short_lot_comp_rate = pd.DataFrame(index=df_temp1.index)
df_short_lot_comp_rate["short_lot_comp_rate"] = df_temp1
df_short_lot_count = pd.DataFrame(index=df_temp2.index)
df_short_lot_count["short_lot_count"] = df_temp2
merged_df2 = merged_df2.merge(df_short_lot_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_short_lot_count, how="left", on="person")
merged_df2.head()
# 3. long time left, a little more purchase needed
# add columns which represent the past completion rate
# in the situation of (long hours left, a little more purchase needed)
df_temp1 = (long_little.groupby("person")["t_completed"].count() / long_little.groupby("person")["age"].count())
df_temp2 = long_little.groupby("person")["age"].count()
df_long_little_comp_rate = pd.DataFrame(index=df_temp1.index)
df_long_little_comp_rate["long_little_comp_rate"] = df_temp1
df_long_little_count = pd.DataFrame(index=df_temp2.index)
df_long_little_count["long_little_count"] = df_temp2
merged_df2 = merged_df2.merge(df_long_little_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_long_little_count, how="left", on="person")
merged_df2.head()
# 4. long time left, a lot more purchase needed
# add columns which represent the past completion rate
# in the situation of (long hours left, a lot more purchase needed)
df_temp1 = long_lot.groupby("person")["t_completed"].count() / long_lot.groupby("person")["age"].count()
df_temp2 = long_lot.groupby("person")["age"].count()
df_long_lot_comp_rate = pd.DataFrame(index=df_temp1.index)
df_long_lot_comp_rate["long_lot_comp_rate"] = df_temp1
df_long_lot_count = pd.DataFrame(index=df_temp2.index)
df_long_lot_count["long_lot_count"] = df_temp2
merged_df2 = merged_df2.merge(df_long_lot_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_long_lot_count, how="left", on="person")
merged_df2.head()
# replace nan with 0 in the 4 "*_count" columns
merged_df2.short_little_count = merged_df2.short_little_count.replace(np.nan, 0)
merged_df2.short_lot_count = merged_df2.short_lot_count.replace(np.nan, 0)
merged_df2.long_little_count = merged_df2.long_little_count.replace(np.nan, 0)
merged_df2.long_lot_count = merged_df2.long_lot_count.replace(np.nan, 0)
merged_df2.head()
assert merged_df2.short_little_count.sum() == short_little.shape[0]
assert merged_df2.short_lot_count.sum() == short_lot.shape[0]
assert merged_df2.long_little_count.sum() == long_little.shape[0]
assert merged_df2.long_lot_count.sum() == long_lot.shape[0]
# How many of the 4 situations has each customer experienced before his last offer?
# (the "*_comp_rate" columns are still null for situations a customer never experienced)
(~merged_df2[["short_little_comp_rate","short_lot_comp_rate","long_little_comp_rate","long_lot_comp_rate"]].isnull()).sum(axis=1).value_counts().sort_index()
There are a few possible ways to impute these null values:
1. Replace null with the mean completion rate of that column (that is, the mean rate in each situation):
   - short_little = 0.5143843498273878
   - short_lot = 0.3923841059602649
   - long_little = 0.6988130563798219
   - long_lot = 0.6162315193457062
2. Replace null with that person's "past_completion_rate".
   - e.g., the first person's null rates would all become 1
3. A more complicated way, like estimating the NaNs by combining 1 and 2.
Of course imputation 3 could be considered the best way, but I won't dig deeper into it in this analysis, because the data I have here is too small; more data would be needed to make that approach work.
# How many hours were left when the customer viewed the last offer?
merged_df2["t_left_when_viewed"] = merged_df2.duration * 24 - merged_df2.t_viewed
# How much more money was needed when the customer viewed the last offer?
merged_df2["amt_needed_when_viewed"] = merged_df2.difficulty - merged_df2.amt_till_viewed
# For BOGO, it is equal to the difficulty
mask_bogo = (merged_df2.offer_bogo == 1)
merged_df2.loc[mask_bogo, "amt_needed_when_viewed"] = merged_df2.loc[mask_bogo, "difficulty"]
merged_df2.head()
# Save this dataframe in a file
merged_df2.to_csv("data/merged_df2.csv", index=False)
I used a Random Forest Classifier to predict 1 (he would complete the coupon) or 0 (he would not). I divided the dataframe into a training dataset (3445 rows) and a testing dataset (383 rows), and did a grid search with cv = 2 to tune hyperparameters.
The reason I chose a Random Forest Classifier is that it is easy to interpret after training; this is done in section 6.
As written in section 4, there are null values in the 4 "*_comp_rate" columns.
I decided that the imputation method for these columns should be chosen by experimenting, so I tried these 2 ways:
1. Replace null with the mean completion rate of that column (that is, the mean rate of each situation):
   - short_little = 0.5143843498273878
   - short_lot = 0.3923841059602649
   - long_little = 0.6988130563798219
   - long_lot = 0.616231519345706
2. Replace null with that person's "past_completion_rate".
To state the result first, (1) and (2) produced almost the same result: both models achieved an f1-score of 0.80 on the testing dataset after tuning.
merged_df3 = pd.read_csv("data/merged_df2.csv")
merged_df3.head()
# Create the target y [1 (completed) or 0 (not completed)]
## null in the column "t_completed" means that the offer in that row was not completed.
y = (~merged_df3.t_completed.isnull()).astype(int)
indep_var = ['person', 'age', 'income', 'gender_F', 'gender_M',
'gender_O', 'became_year', 'became_month_sin', 'became_month_cos',
'became_day_sin', 'became_day_cos', 'became_dow_sin',
'became_dow_cos', 'offer_bogo', 'offer_disc', 'difficulty',
'duration', 'reward', 'email', 'mobile', 'social', 'web',
't_received', 't_viewed', 'amt_till_viewed',
'0.0', '6.0', '12.0', '18.0', '24.0', '30.0',
'36.0', '42.0', '48.0', '54.0', '60.0', '66.0', '72.0', '78.0',
'84.0', '90.0', '96.0', '102.0', '108.0', '114.0', 'denom_viewed',
'past_offers_num', 'past_completion_rate',
'short_little_comp_rate', 'short_little_count',
'short_lot_comp_rate', 'short_lot_count', 'long_little_comp_rate',
'long_little_count', 'long_lot_comp_rate', 'long_lot_count',
"t_left_when_viewed","amt_needed_when_viewed"]
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import pickle
merged_df3.short_little_comp_rate.replace(np.nan, short_little.t_completed.count()/short_little.shape[0], inplace=True)
merged_df3.short_lot_comp_rate.replace(np.nan, short_lot.t_completed.count()/short_lot.shape[0], inplace=True)
merged_df3.long_little_comp_rate.replace(np.nan, long_little.t_completed.count()/long_little.shape[0], inplace=True)
merged_df3.long_lot_comp_rate.replace(np.nan, long_lot.t_completed.count()/long_lot.shape[0], inplace=True)
merged_df3.head()
# split into training data and testing dataset
X_train, X_test, y_train, y_test = train_test_split(merged_df3[indep_var], y, test_size = 0.10, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Using default hyperparameters....
rfc = RandomForestClassifier()
rfc.fit(X=X_train[indep_var2], y=y_train)
y_train_pred = rfc.predict(X_train[indep_var2])
y_test_pred = rfc.predict(X_test[indep_var2])
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.classification_report(y_test, y_test_pred))
# Using grid search...
# Number of trees in random forest
n_estimators = [400, 500, 600]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [20,30,40,50,70]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 15, 20,100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 50]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the parameter grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
f1_scorer = make_scorer(f1_score, pos_label=1)
gd1 = GridSearchCV(RandomForestClassifier(), param_grid = random_grid, cv=2, scoring=f1_scorer, verbose=True, n_jobs=-1)
gd1.fit(X_train[indep_var2], y_train)
# the results in training dataset / testing dataset
print(metrics.classification_report(y_train, gd1.best_estimator_.predict(X_train[indep_var2])))
print(metrics.classification_report(y_test, gd1.best_estimator_.predict(X_test[indep_var2])))
gd1.best_estimator_
merged_df4 = pd.read_csv("data/merged_df2.csv")
merged_df4.head()
# fill in null values with the customer's total past completion rate.
null_mask = merged_df4.short_little_comp_rate.isnull()
merged_df4.loc[null_mask, "short_little_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
null_mask = merged_df4.short_lot_comp_rate.isnull()
merged_df4.loc[null_mask, "short_lot_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
null_mask = merged_df4.long_little_comp_rate.isnull()
merged_df4.loc[null_mask, "long_little_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
null_mask = merged_df4.long_lot_comp_rate.isnull()
merged_df4.loc[null_mask, "long_lot_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
merged_df4.head(5)
# split into training data and testing dataset
X_train, X_test, y_train, y_test = train_test_split(merged_df4[indep_var], y, test_size = 0.1, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Using default hyperparameters....
rfc = RandomForestClassifier()
rfc.fit(X=X_train[indep_var2], y=y_train)
y_train_pred = rfc.predict(X_train[indep_var2])
y_test_pred = rfc.predict(X_test[indep_var2])
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.classification_report(y_test, y_test_pred))
# Using grid search...
# Number of trees in random forest
n_estimators = [400, 500, 600]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [20,30,40,50,70]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 15, 20,100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 50]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the parameter grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
f1_scorer = make_scorer(f1_score, pos_label=1)
gd2 = GridSearchCV(RandomForestClassifier(), param_grid = random_grid, cv=2, scoring=f1_scorer, verbose=True, n_jobs=-1)
gd2.fit(X_train[indep_var2], y_train)
# the results in training dataset / testing dataset
print(metrics.classification_report(y_train, gd2.best_estimator_.predict(X_train[indep_var2])))
print(metrics.classification_report(y_test, gd2.best_estimator_.predict(X_test[indep_var2])))
gd2.best_estimator_
I save this model in a file "model.pkl".
# Save the model "gd1" in a pkl file
with open("model.pkl", 'wb') as file:
pickle.dump(gd1, file)
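For completeness, the saved model can be loaded back and used for prediction like this (a usage sketch; "new_offers" is a hypothetical dataframe with the same feature columns as the training data):
# load the saved GridSearchCV object and predict on new rows
with open("model.pkl", 'rb') as file:
    loaded_model = pickle.load(file)
# a fitted GridSearchCV exposes predict() via its refit best estimator;
# "new_offers" is a placeholder for the rows to score
y_pred_new = loaded_model.predict(new_offers[indep_var2])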
The reason I used a Random Forest Classifier is that it lets me see how important each feature is for the classification. I visualize the feature importances in descending order.
(REFERENCE : http://aiweeklynews.com/archives/50653819.html)
# feature importances
feature = gd1.best_estimator_.feature_importances_
# print the feature importances in descending order
f = pd.DataFrame({'number': range(0, len(feature)), 'feature': feature[:], 'f_name': indep_var2})
f2 = f.sort_values('feature', ascending=False)
# feature names
label = indep_var2[0:]
# indices sorted by feature importance (descending)
indices = np.argsort(feature)[::-1]
for i in range(len(feature)):
    print(str(i + 1) + " " + str(label[indices[i]]) + " " + str(feature[indices[i]]))
plt.figure(figsize=(20,10))
plt.title('Feature Importance')
plt.bar(range(len(feature)),feature[indices], color='blue', align='center')
plt.xticks(range(len(feature)), f2.f_name, rotation=90)
plt.xlim([-1, len(feature)])
plt.tight_layout()
plt.show()
1 past_completion_rate 0.18659521953364147
2 long_lot_comp_rate 0.08191259118268183
3 became_year 0.06747546281501476
4 income 0.06312119581440186
5 t_left_when_viewed 0.04825733767458709
6 long_little_comp_rate 0.037443030677931574
7 short_lot_comp_rate 0.0370296331200642
8 age 0.0339851178911399
9 t_viewed 0.029519463245205893
10 short_little_comp_rate 0.029358644621018382
1 past_completion_rate 0.10823659095118901
2 long_lot_comp_rate 0.08595954918332138
3 short_lot_comp_rate 0.08367514878558703
4 long_little_comp_rate 0.07800267360062899
5 short_little_comp_rate 0.07629602818899717
6 income 0.05515854732093211
7 became_year 0.04978646626207604
8 t_left_when_viewed 0.04372529933213142
9 age 0.032500881997055905
10 reward 0.028411643826730427
past_completion_rate(1)
long_lot_comp_rate(2)
short_lot_comp_rate(3)
long_little_comp_rate(4)
short_little_comp_rate(5)
are all about past completion rates, and it makes sense that these are the most important features of all. The feature engineering in this notebook was worth doing!
became_year(6)
income(7)
age(9)
are attributes of the customer. It appears that a customer's attributes are more relevant to whether he completes a coupon than the coupon's type is.
t_left_when_viewed(8)
reward(10)
As expected, the remaining time and the reward of a coupon affect customers' behavior. Interestingly, the amount of additional purchase the customer has to make to complete the coupon is not as important as these 2 factors.