In this project, I build a predictive model that estimates whether a customer will complete a coupon once they first view it.
The model is trained on users who have no null values in their attributes and who viewed BOGO or discount coupons at least 3 times.
(Even if a customer never views or shows the coupon in the shop, the coupon can still be completed automatically once he spends the required amount.)
This model mainly uses 4 kinds of features for prediction.
1. Load in files
2. Clean the data
3. Univariate Exploration
4. Multivariate Exploration
5. Make a dataframe in which each row represents one coupon sent during the observation period
6. Basic Feature Engineering
1. Load in files
2. Feature engineering 1 (make columns representing the distribution of the customer's past coupon-viewing times)
3. Feature engineering 2 (make a column representing the customer's past total completion rate)
4. Feature engineering 3 (make columns representing the customer's past completion rate in each of 4 situations)
5. Create a model
6. Interpretation of the model
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline
# details of 10 coupons
portfolio = pd.read_csv('data/portfolio_clean.csv')
# each customer's attribute
profile = pd.read_csv('data/profile_clean.csv')
# all actions of all customers
transcript = pd.read_csv('data/transcript_clean.csv')
# dataframe in which each row represents one coupon sent
interaction = pd.read_csv("data/interaction.csv")
# subset of "interaction": offers that were viewed first
offer_viewed = pd.read_csv("data/firstly_viewed_offers.csv")
pd.options.display.max_rows = 100
pd.set_option("display.max_columns", 500)
portfolio.head()
profile.head()
transcript.head()
print(interaction.shape)
interaction.head()
print(offer_viewed.shape)
offer_viewed.head()
# remove offers whose receivers have null values in their attributes
offer_viewed_no_null = offer_viewed[offer_viewed.isnull().iloc[:,:10].sum(axis = 1) == 0]
print(offer_viewed_no_null.shape)
offer_viewed_no_null.groupby("person")["offer_bogo"].count().value_counts().sort_index()
This means that I can use 3828 rows for creating a predictive model. Again, I will make a model that predicts whether [a customer who has already received 3 or more offers] would complete the next offer he views.
# customers who got more than or equal to 3 discount/BOGO coupons
customers_for_model = offer_viewed_no_null.groupby("person")["offer_bogo"].count().index[offer_viewed_no_null.groupby("person")["offer_bogo"].count() >= 3]
# extract the rows of offers sent to those customers
offer_viewed_no_null_3or_more = offer_viewed_no_null[offer_viewed_no_null.person.isin(customers_for_model)]
offer_viewed_no_null_3or_more.shape
# The time when the last offer was sent to each customer
offer_person_time = offer_viewed_no_null_3or_more.groupby("person").max()["t_received"]
print(offer_person_time.shape)
offer_person_time.head()
# Boolean mask which indicates whether the row is the last offer for the person
last_mask = (offer_viewed_no_null_3or_more["person"] + offer_viewed_no_null_3or_more["t_received"].astype(str)).isin(
offer_person_time.index + offer_person_time.astype(str))
offer_viewed_no_null_3or_more_last = offer_viewed_no_null_3or_more[last_mask]
print(offer_viewed_no_null_3or_more_last.shape)
offer_viewed_no_null_3or_more_last.head()
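As a side note, the same "last offer per person" rows could also be selected with a groupby; the sketch below should give the same rows, except that idxmax keeps only one row per person if someone received two offers at the exact same time.
# equivalent selection of each person's last received offer (illustrative alternative)
idx_last = offer_viewed_no_null_3or_more.groupby("person")["t_received"].idxmax()
offer_viewed_no_null_3or_more.loc[idx_last].head()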
The distribution of how long each customer took to view past offers can be a good variable representing his behavior pattern. Note that the distribution should exclude viewing times for offers he had already completed before viewing. (It is easy to imagine that such a customer viewed the coupon only because he noticed the discount being applied.)
For this section, the type of coupon (BOGO, discount, or informational) doesn't matter.
# extract necessary columns
t_viewed = interaction.copy()[["person", "t_received", "t_viewed"]]
# replace t = nan (never viewed) with 1000 (an arbitrarily long time: treated as taking 1000 hours to view)
t_viewed.t_viewed = t_viewed.t_viewed.replace(np.nan, 1000)
# boolean mask: True if the offer was not viewed within t = 120 hours
mask_over_120hour = t_viewed.t_viewed.astype(int) // 120 > 0
# offers not viewed within 120 hours (including the never-viewed ones) are lumped into one bucket labeled 777
t_viewed.loc[mask_over_120hour, "t_viewed"] = 777
t_viewed.head(10)
t_viewed.info()
# Create an empty dataframe. Each row will represent one customer.
index = pd.Series(t_viewed.t_viewed.unique()).sort_values()
df = pd.DataFrame(columns=index)
df["person"] = offer_viewed_no_null_3or_more_last.person
df = df.replace(np.nan, 0)
df.head()
# fill 1 in "df" according to when the customer of each row viewed coupons in the past.
t = time.time()
person_index = []
# for each of the row which represents the last offer to each person who has got >= 3 offers...
for i, (each_person, each_time) in enumerate(zip(offer_viewed_no_null_3or_more_last.person, offer_viewed_no_null_3or_more_last.t_received)):
# offers sent to that person BEFORE that last offer
offers_before = t_viewed[(t_viewed.person == each_person) & (t_viewed.t_received < each_time)]
for t in offers_before.t_viewed:
temp_row = df.iloc[i].copy()
temp_row.loc[t] += 1
df.iloc[i] = temp_row
time.time() - t
# "df" represents the time distribution each customer spent before he viewed given coupons in the past.
print(df.shape)
df.head()
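For reference, the same per-person counts could probably be built without the double loop by joining each past offer to the person's last-offer time and using pd.crosstab. This is only a sketch and I have not verified that it matches the loop above exactly.
# vectorized alternative (illustrative sketch): count past offers per person per t_viewed bucket
last_t = offer_viewed_no_null_3or_more_last[["person", "t_received"]].rename(columns={"t_received": "t_last"})
past_views = t_viewed.merge(last_t, on="person")
past_views = past_views[past_views.t_received < past_views.t_last]
counts_alt = pd.crosstab(past_views.person, past_views.t_viewed)
counts_alt.head()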
# Convert df into Cumulative Distribution of number of viewed coupons
df2 = df.replace(np.nan, 0)
df2.iloc[:,:-1] = df2.iloc[:,:-1].cumsum(axis=1)
df2 = df2.rename(columns = {777:"denom_viewed"})
df2.head()
# Convert the cumulative counts into the cumulative probability that a coupon was viewed by each time
for column_name in df2.columns[:-2]:
    df2[column_name] = df2[column_name].astype(float) / df2.denom_viewed
print(df2.shape)
df2.sample(5)
# merge this into "offer_viewed_no_null_3or_more_last"
merged_df = offer_viewed_no_null_3or_more_last.merge(df2)
print(merged_df.shape)
merged_df.head()
This is the simplest and most helpful feature representing the customer's behavior pattern. It is the completion rate of the coupons sent to him, [no matter which coupon it was] and [no matter how many hours were left until the deadline when he opened it], etc.
In section 4, completion rates broken down by the situation in which he viewed the coupon are also added as new features.
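The loop below computes this per customer; an equivalent groupby-based sketch (my own illustration using the same dataframes, not the original code) would look like this:
# illustrative sketch: past offer count and completion rate per person, computed with groupby
last_t_per_person = offer_viewed_no_null_3or_more.groupby("person").t_received.transform("max")
past_offers = offer_viewed_no_null_3or_more[offer_viewed_no_null_3or_more.t_received < last_t_per_person]
past_stats = past_offers.groupby("person").agg(
    past_offers_num=("t_completed", "size"),
    past_completed=("t_completed", "count"),
)
past_stats["past_completion_rate"] = past_stats.past_completed / past_stats.past_offers_num
past_stats.head()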
# dictionary: person -> how many offers (BOGO/discount) were sent to him before his last offer
dict_person_to_past_offer_num = {}
# dictionary: person -> probability that an offer (BOGO/discount) sent to him was completed
dict_person_to_past_completion_rate = {}
for each_person, each_time in zip(merged_df.person, merged_df.t_received):
    prev_trans_df = offer_viewed_no_null_3or_more[(offer_viewed_no_null_3or_more.person == each_person) & (offer_viewed_no_null_3or_more.t_received < each_time)]
    dict_person_to_past_offer_num[each_person] = prev_trans_df.shape[0]
    dict_person_to_past_completion_rate[each_person] = prev_trans_df.t_completed.count() / prev_trans_df.shape[0]
# add the columns using dictionaries made up above.
merged_df["past_offers_num"] = merged_df["person"].apply(lambda x : dict_person_to_past_offer_num[x])
merged_df["past_completion_rate"] = merged_df["person"].apply(lambda x : dict_person_to_past_completion_rate[x])
print(merged_df.shape)
merged_df.head()
# Confirm that the 2 newly added columns store plausible values
print(merged_df.past_completion_rate.unique())
print(merged_df.past_offers_num.unique())
To some extent, the probability that a coupon is completed depends on the situation in which the customer viewed it. In other words, even for the same coupon:
Situation 1: 12 hours are left and he has to spend $10 more to complete it. Situation 2: 5 days are left and he has to spend $1 more to complete it.
In this case, it is easy to imagine that situation 2 motivates the customer more.
With that in mind, I divided the past situations in which he viewed coupons into 4, as written below.
When the customer viewed the coupon,
- The valid time left was short (less than or equal to 128 hours)
- The valid time left was long (more than 128 hours)
✕ (crossed with)
- He had to spend a little more money to complete the coupon (less than $10)
- He had to spend a lot more money to complete the coupon (more than or equal to $10)
I calculated the probability (0~1) that the person completed an offer in each of these 4 situations in the past and added them as 4 new variables.
Ideally, the reward amount would also be used and the past situations would be divided into 8, but the dataset I have here is too limited, so I did not do that.
# list of customers who received 3 or more offers and not have null attributes
customers_3_or_more_offers = offer_viewed_no_null_3or_more.person.unique()
print(len(customers_3_or_more_offers))
print(customers_3_or_more_offers[:5])
# the time when the last offer was sent to each person
person_to_last_offer_time = offer_viewed_no_null_3or_more.groupby("person").t_received.max()
person_to_last_offer_time.head()
# Create a dictionary from the above. (key = "person id" -> value = "time when the last offer was sent to him")
dict_person_to_last_offer_t = {}
for each_person, each_t in zip(person_to_last_offer_time.index, person_to_last_offer_time):
    dict_person_to_last_offer_t[each_person] = each_t
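The same dictionary can also be built in one line:
# equivalent one-liner: person id -> time the last offer was sent
dict_person_to_last_offer_t = person_to_last_offer_time.to_dict()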
# dataframe in which each row represents one offer.
# The offers in this df are limited to:
#   offers sent to people who received 3 or more offers
#   offers sent BEFORE each person's last offer
offer_viewed_no_null_3or_not_last = offer_viewed_no_null_3or_more\
[offer_viewed_no_null_3or_more.t_received <\
offer_viewed_no_null_3or_more.person.apply(lambda x : dict_person_to_last_offer_t[x])]
offer_viewed_no_null_3or_more.shape, offer_viewed_no_null_3or_not_last.shape
I check the past offers mainly from 2 points of view.
# distribution of the remaining time when a customer viewed an offer
(offer_viewed_no_null_3or_not_last.duration * 24 - offer_viewed_no_null_3or_not_last.t_viewed).plot("hist", bins = 50);
# distribution of the remaining amount to be spent to complete the offer when the customer viewed it
(offer_viewed_no_null_3or_not_last.difficulty - offer_viewed_no_null_3or_not_last.amt_till_viewed).plot("hist", bins = 50);
Some of the remaining amounts take negative values. I check the rows which do.
# rows of offers whose remaining amount needed to complete the offer is negative
offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.difficulty - offer_viewed_no_null_3or_not_last.amt_till_viewed)<0]
# Take a closer look at 1 of them.
transcript[transcript.person == "c58726d95ba8447ba01036e9d50da94c"]
For example, a customer may have already spent $1.5 by the time he viewed a coupon, and how much he still needs to spend depends on the coupon type.
For simplicity, I just take the amount into account.
# How much more money did he have to spend when he viewed an offer?
offer_viewed_no_null_3or_not_last["goal_amt_when_viewed"] = offer_viewed_no_null_3or_not_last.difficulty\
- offer_viewed_no_null_3or_not_last.amt_till_viewed
# but for a $5 BOGO, for example, he had to spend $5 more even if he had already spent $4.9 when he viewed the offer
mask_bogo = (offer_viewed_no_null_3or_not_last.offer_bogo == 1)
offer_viewed_no_null_3or_not_last.loc[mask_bogo,"goal_amt_when_viewed"] = offer_viewed_no_null_3or_not_last.loc[mask_bogo, "difficulty"]
# How many hours were left when he viewed the offer
offer_viewed_no_null_3or_not_last["left_hours_when_viewed"] = offer_viewed_no_null_3or_not_last.duration * 24\
- offer_viewed_no_null_3or_not_last.t_viewed
plt.figure(figsize=(15,10))
sns.regplot(data=offer_viewed_no_null_3or_not_last, x="left_hours_when_viewed", y="goal_amt_when_viewed",\
scatter_kws={"alpha":0.08}, fit_reg=False, x_jitter=3.5, y_jitter=0.6)
plt.title("Situation when each coupon was viewed")
plt.xlabel("Time left till the coupon get invalid (hour)");
plt.ylabel("Amount of money the customer had to use to achieve the coupon ($)");
I set the thresholds [at 128 hours for the time left] and [at $10 for the amount of money still to be spent].
These values were chosen so that the number of offers in each of the 4 groups is not too imbalanced.
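With these thresholds, each past offer falls into one of the 4 situations. As a compact illustration (the notebook below does the same split with four boolean masks), the situation could also be written as a single label column:
# illustrative sketch: label each past offer with its situation using np.select
short_mask = offer_viewed_no_null_3or_not_last.left_hours_when_viewed <= 128
little_mask = offer_viewed_no_null_3or_not_last.goal_amt_when_viewed < 10
situation = np.select(
    [short_mask & little_mask, short_mask & ~little_mask, ~short_mask & little_mask],
    ["short_little", "short_lot", "long_little"],
    default="long_lot",
)
pd.Series(situation).value_counts()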
merged_df2 = merged_df.copy()
# 8751 offers should be treated as past offers
offer_viewed_no_null_3or_not_last.shape
# offers for which, when viewed, [<= 128 hours] were left and [less than $10] still had to be spent
short_little = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed <= 128)
                                                 & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed < 10)]
# offers for which, when viewed, [<= 128 hours] were left and [$10 or more] still had to be spent
short_lot = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed <= 128)
                                              & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed >= 10)]
# offers for which, when viewed, [more than 128 hours] were left and [less than $10] still had to be spent
long_little = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed > 128)
                                                & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed < 10)]
# offers for which, when viewed, [more than 128 hours] were left and [$10 or more] still had to be spent
long_lot = offer_viewed_no_null_3or_not_last[(offer_viewed_no_null_3or_not_last.left_hours_when_viewed > 128)
                                             & (offer_viewed_no_null_3or_not_last.goal_amt_when_viewed >= 10)]
# number of offers which belong to each of 4 situations
short_little.shape[0], short_lot.shape[0], long_little.shape[0], long_lot.shape[0]
# probability that offer was achieved in each situation (for all customers)
print("short hours left, a little more purchase needed")
print(short_little.t_completed.count()/short_little.shape[0])
print("\nshort hours left, a lot more purchase needed")
print(short_lot.t_completed.count()/short_lot.shape[0])
print("\nlong left, a little more purchase needed")
print(long_little.t_completed.count()/long_little.shape[0])
print("\nlong hours left, a lot more purchase needed")
print(long_lot.t_completed.count()/long_lot.shape[0])
# 1. short time left, a little more purchase needed
# add columns which represent the past completion rate
# in the situation of (short hours left, a little more purchase needed)
df_temp1 = (short_little.groupby("person")["t_completed"].count() / short_little.groupby("person")["age"].count())
df_temp2 = short_little.groupby("person")["age"].count()
df_short_little_comp_rate = pd.DataFrame(index=df_temp1.index)
df_short_little_comp_rate["short_little_comp_rate"] = df_temp1
df_short_little_count = pd.DataFrame(index=df_temp2.index)
df_short_little_count["short_little_count"] = df_temp2
merged_df2 = merged_df2.merge(df_short_little_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_short_little_count, how="left", on="person")
merged_df2.head()
# 2. short time left, a lot more purchase needed
# add columns which represent the past completion rate
# in the situation of (short hours left, a lot more purchase needed)
df_temp1 = (short_lot.groupby("person")["t_completed"].count() / short_lot.groupby("person")["age"].count())
df_temp2 = short_lot.groupby("person")["age"].count()
df_short_lot_comp_rate = pd.DataFrame(index=df_temp1.index)
df_short_lot_comp_rate["short_lot_comp_rate"] = df_temp1
df_short_lot_count = pd.DataFrame(index=df_temp2.index)
df_short_lot_count["short_lot_count"] = df_temp2
merged_df2 = merged_df2.merge(df_short_lot_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_short_lot_count, how="left", on="person")
merged_df2.head()
# 3. long time left, a little more purchase needed
# add columns which represent the past completion rate
# in the situation of (long hours left, a little more purchase needed)
df_temp1 = (long_little.groupby("person")["t_completed"].count() / long_little.groupby("person")["age"].count())
df_temp2 = long_little.groupby("person")["age"].count()
df_long_little_comp_rate = pd.DataFrame(index=df_temp1.index)
df_long_little_comp_rate["long_little_comp_rate"] = df_temp1
df_long_little_count = pd.DataFrame(index=df_temp2.index)
df_long_little_count["long_little_count"] = df_temp2
merged_df2 = merged_df2.merge(df_long_little_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_long_little_count, how="left", on="person")
merged_df2.head()
# 4. long time left, a lot more purchase needed
# add columns which represent the past completion rate
# in the situation of (long hours left, a lot more purchase needed)
df_temp1 = long_lot.groupby("person")["t_completed"].count() / long_lot.groupby("person")["age"].count()
df_temp2 = long_lot.groupby("person")["age"].count()
df_long_lot_comp_rate = pd.DataFrame(index=df_temp1.index)
df_long_lot_comp_rate["long_lot_comp_rate"] = df_temp1
df_long_lot_count = pd.DataFrame(index=df_temp2.index)
df_long_lot_count["long_lot_count"] = df_temp2
merged_df2 = merged_df2.merge(df_long_lot_comp_rate, how="left", on="person")
merged_df2 = merged_df2.merge(df_long_lot_count, how="left", on="person")
merged_df2.head()
# replace nan with 0 in the 4 "*_count" columns
merged_df2.short_little_count = merged_df2.short_little_count.replace(np.nan, 0)
merged_df2.short_lot_count = merged_df2.short_lot_count.replace(np.nan, 0)
merged_df2.long_little_count = merged_df2.long_little_count.replace(np.nan, 0)
merged_df2.long_lot_count = merged_df2.long_lot_count.replace(np.nan, 0)
merged_df2.head()
assert merged_df2.short_little_count.sum() == short_little.shape[0]
assert merged_df2.short_lot_count.sum() == short_lot.shape[0]
assert merged_df2.long_little_count.sum() == long_little.shape[0]
assert merged_df2.long_lot_count.sum() == long_lot.shape[0]
# How many of the 4 situations has each customer experienced before his last offer?
# (the "*_comp_rate" columns are still null for situations a customer never experienced)
(~merged_df2[["short_little_comp_rate","short_lot_comp_rate","long_little_comp_rate","long_lot_comp_rate"]].isnull()).sum(axis=1).value_counts().sort_index()
There are a few possible ways to impute these null values:
1. Replace null with the mean completion rate of that column (that is, the mean rate in each situation):
   - short_little = 0.5143843498273878
   - short_lot = 0.3923841059602649
   - long_little = 0.6988130563798219
   - long_lot = 0.6162315193457062
2. Replace null with that person's "past_completion_rate".
   - e.g., the first person's null rates would all become 1
3. A more complicated way, like estimating the NaNs by combining 1 and 2.
Of course imputation 3 could be considered the best way, but I won't dig deeper into it in this analysis, because the data I have here is too small; more data would be needed to make that approach work.
# How many hours were left when the customer viewed the last offer?
merged_df2["t_left_when_viewed"] = merged_df2.duration * 24 - merged_df2.t_viewed
# How much more money was needed when the customer viewed the last offer?
merged_df2["amt_needed_when_viewed"] = merged_df2.difficulty - merged_df2.amt_till_viewed
# For BOGO, it is equal to the difficulty
mask_bogo = (merged_df2.offer_bogo == 1)
merged_df2.loc[mask_bogo, "amt_needed_when_viewed"] = merged_df2.loc[mask_bogo, "difficulty"]
merged_df2.head()
# Save this dataframe in a file
merged_df2.to_csv("data/merged_df2.csv", index=False)
I used a Random Forest Classifier to predict 1 (he would complete the coupon) or 0 (he would not). I divided the dataframe into a training dataset (3445 rows) and a testing dataset (383 rows), and did a grid search with cv = 2 to tune hyperparameters.
The reason I chose a Random Forest Classifier is that it is easy to interpret after training; this is done in section 6.
As written in section 4, there are null values in the 4 "*_comp_rate" columns.
I decided that the imputation method for these columns should be chosen by experimenting, so I tried these 2 ways:
1. Replace null with the mean completion rate of that column (that is, the mean rate of each situation):
   - short_little = 0.5143843498273878
   - short_lot = 0.3923841059602649
   - long_little = 0.6988130563798219
   - long_lot = 0.616231519345706
2. Replace null with that person's "past_completion_rate".
To state the result first, (1) and (2) produced almost the same result: both models achieved an f1-score of 0.80 on the testing dataset after tuning.
merged_df3 = pd.read_csv("data/merged_df2.csv")
merged_df3.head()
# Create the target y [1 (completed) or 0 (not completed)]
## null in the column "t_completed" means that the offer in that row was not completed.
y = (~merged_df3.t_completed.isnull()).astype(int)
indep_var = ['person', 'age', 'income', 'gender_F', 'gender_M',
'gender_O', 'became_year', 'became_month_sin', 'became_month_cos',
'became_day_sin', 'became_day_cos', 'became_dow_sin',
'became_dow_cos', 'offer_bogo', 'offer_disc', 'difficulty',
'duration', 'reward', 'email', 'mobile', 'social', 'web',
't_received', 't_viewed', 'amt_till_viewed',
'0.0', '6.0', '12.0', '18.0', '24.0', '30.0',
'36.0', '42.0', '48.0', '54.0', '60.0', '66.0', '72.0', '78.0',
'84.0', '90.0', '96.0', '102.0', '108.0', '114.0', 'denom_viewed',
'past_offers_num', 'past_completion_rate',
'short_little_comp_rate', 'short_little_count',
'short_lot_comp_rate', 'short_lot_count', 'long_little_comp_rate',
'long_little_count', 'long_lot_comp_rate', 'long_lot_count',
"t_left_when_viewed","amt_needed_when_viewed"]
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import pickle
merged_df3.short_little_comp_rate.replace(np.nan, short_little.t_completed.count()/short_little.shape[0], inplace=True)
merged_df3.short_lot_comp_rate.replace(np.nan, short_lot.t_completed.count()/short_lot.shape[0], inplace=True)
merged_df3.long_little_comp_rate.replace(np.nan, long_little.t_completed.count()/long_little.shape[0], inplace=True)
merged_df3.long_lot_comp_rate.replace(np.nan, long_lot.t_completed.count()/long_lot.shape[0], inplace=True)
merged_df3.head()
# split into training data and testing dataset
X_train, X_test, y_train, y_test = train_test_split(merged_df3[indep_var], y, test_size = 0.10, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Using default hyperparameters....
rfc = RandomForestClassifier()
rfc.fit(X=X_train[indep_var2], y=y_train)
y_train_pred = rfc.predict(X_train[indep_var2])
y_test_pred = rfc.predict(X_test[indep_var2])
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.classification_report(y_test, y_test_pred))
# Using grid search...
# Number of trees in random forest
n_estimators = [400, 500, 600]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [20,30,40,50,70]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 15, 20,100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 50]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the parameter grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
f1_scorer = make_scorer(f1_score, pos_label=1)
gd1 = GridSearchCV(RandomForestClassifier(), param_grid = random_grid, cv=2, scoring=f1_scorer, verbose=True, n_jobs=-1)
gd1.fit(X_train[indep_var2], y_train)
# the results in training dataset / testing dataset
print(metrics.classification_report(y_train, gd1.best_estimator_.predict(X_train[indep_var2])))
print(metrics.classification_report(y_test, gd1.best_estimator_.predict(X_test[indep_var2])))
gd1.best_estimator_
merged_df4 = pd.read_csv("data/merged_df2.csv")
merged_df4.head()
# fill in null values with the customer's total past completion rate.
null_mask = merged_df4.short_little_comp_rate.isnull()
merged_df4.loc[null_mask, "short_little_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
null_mask = merged_df4.short_lot_comp_rate.isnull()
merged_df4.loc[null_mask, "short_lot_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
null_mask = merged_df4.long_little_comp_rate.isnull()
merged_df4.loc[null_mask, "long_little_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
null_mask = merged_df4.long_lot_comp_rate.isnull()
merged_df4.loc[null_mask, "long_lot_comp_rate"] = merged_df4.loc[null_mask, "past_completion_rate"]
merged_df4.head(5)
# split into training data and testing dataset
X_train, X_test, y_train, y_test = train_test_split(merged_df4[indep_var], y, test_size = 0.1, random_state = 0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Using default hyperparameters....
rfc = RandomForestClassifier()
rfc.fit(X=X_train[indep_var2], y=y_train)
y_train_pred = rfc.predict(X_train[indep_var2])
y_test_pred = rfc.predict(X_test[indep_var2])
print(metrics.classification_report(y_train, y_train_pred))
print(metrics.classification_report(y_test, y_test_pred))
# Using grid search...
# Number of trees in random forest
n_estimators = [400, 500, 600]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [20,30,40,50,70]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [5, 10, 15, 20,100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10, 50]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the parameter grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
f1_scorer = make_scorer(f1_score, pos_label=1)
gd2 = GridSearchCV(RandomForestClassifier(), param_grid = random_grid, cv=2, scoring=f1_scorer, verbose=True, n_jobs=-1)
gd2.fit(X_train[indep_var2], y_train)
# the results in training dataset / testing dataset
print(metrics.classification_report(y_train, gd2.best_estimator_.predict(X_train[indep_var2])))
print(metrics.classification_report(y_test, gd2.best_estimator_.predict(X_test[indep_var2])))
gd2.best_estimator_
I save this model in a file "model.pkl".
# Save the model "gd1" in a pkl file
with open("model.pkl", 'wb') as file:
pickle.dump(gd1, file)
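For completeness, the saved model can be loaded back and used for prediction like this (a usage sketch; "new_offers" is a hypothetical dataframe with the same feature columns as the training data):
# load the saved GridSearchCV object and predict on new rows
with open("model.pkl", 'rb') as file:
    loaded_model = pickle.load(file)
# a fitted GridSearchCV exposes predict() via its refit best estimator;
# "new_offers" is a placeholder for the rows to score
y_pred_new = loaded_model.predict(new_offers[indep_var2])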
The reason I used a Random Forest Classifier is that it lets me see how important each feature is for the classification. I visualize the feature importances in descending order.
(REFERENCE : http://aiweeklynews.com/archives/50653819.html)
# feature importances
feature = gd1.best_estimator_.feature_importances_
# print the feature importances in descending order
f = pd.DataFrame({'number': range(0, len(feature)), 'feature': feature[:], 'f_name': indep_var2})
f2 = f.sort_values('feature', ascending=False)
# feature names
label = indep_var2[0:]
# indices sorted by feature importance (descending)
indices = np.argsort(feature)[::-1]
for i in range(len(feature)):
    print(str(i + 1) + " " + str(label[indices[i]]) + " " + str(feature[indices[i]]))
plt.figure(figsize=(20,10))
plt.title('Feature Importance')
plt.bar(range(len(feature)),feature[indices], color='blue', align='center')
plt.xticks(range(len(feature)), f2.f_name, rotation=90)
plt.xlim([-1, len(feature)])
plt.tight_layout()
plt.show()
1 past_completion_rate 0.18659521953364147
2 long_lot_comp_rate 0.08191259118268183
3 became_year 0.06747546281501476
4 income 0.06312119581440186
5 t_left_when_viewed 0.04825733767458709
6 long_little_comp_rate 0.037443030677931574
7 short_lot_comp_rate 0.0370296331200642
8 age 0.0339851178911399
9 t_viewed 0.029519463245205893
10 short_little_comp_rate 0.029358644621018382
1 past_completion_rate 0.10823659095118901
2 long_lot_comp_rate 0.08595954918332138
3 short_lot_comp_rate 0.08367514878558703
4 long_little_comp_rate 0.07800267360062899
5 short_little_comp_rate 0.07629602818899717
6 income 0.05515854732093211
7 became_year 0.04978646626207604
8 t_left_when_viewed 0.04372529933213142
9 age 0.032500881997055905
10 reward 0.028411643826730427
past_completion_rate(1)
long_lot_comp_rate(2)
short_lot_comp_rate(3)
long_little_comp_rate(4)
short_little_comp_rate(5)
are all about past completion rates, and it makes sense that these are the most important features of all. The feature engineering in this notebook was worth doing!
became_year(6)
income(7)
age(9)
are attributes of the customer. It appears that a customer's attributes are more relevant to whether he completes a coupon than the coupon's type is.
t_left_when_viewed(8)
reward(10)
As expected, the remaining time and the reward of a coupon affect customers' behavior. Interestingly, the amount of additional purchase the customer has to make to complete the coupon is not as important as these 2 factors.