Introduction
This is a capstone project from Udacity's Machine Learning Engineer Nanodegree programme; you can read about it here. The datasets in this project mimic consumer behaviour on the Starbucks rewards mobile app. Periodically, Starbucks disseminates messages containing three types of offers to consumers on its mobile app: BOGO (buy-one-get-one) offers, discount offers and informational offers.
BOGO offers require that a customer spends a particular amount or purchases the required number of items to qualify for a reward. Discount offers give customers the chance to purchase certain items at a reduced price. The last category, informational, is not necessarily an offer but gives information about certain products. Customers receive these offers through a variety of channels: web, email, mobile and social media.
Generally, companies make offers or deploy ads to their customers for a number of reasons; this practice is referred to as sales promotion. Reasons for sales promotion include increasing sales, gaining market share from the competition, gaining new distribution opportunities and so on. Sales promotions demand a significant amount of resources and are only successful when their objectives are achieved.
Problem Statement
The typical flow of an appropriate offer begins with Offer Received. The next stage is Offer Viewed, at which point the customer has viewed the offer. The third stage is Transaction, where the customer makes a transaction in accordance with the offer viewed. The last stage is Offer Completed, in which the customer has met the demands of the offer and made the appropriate transactions.
However, some consumers do not complete the offer. In certain cases, offers are only received and never viewed, while others are viewed but no transaction is conducted. There are also cases where transactions are conducted that were not influenced by offers at all. These scenarios, which depict incomplete offers, point to an inability to match customers with offers they are likely to complete. As such, this project sets out to determine whether a particular customer will respond to an offer or not.
Solution Statement
To solve the problem stated above, a classifier was trained to determine whether a customer will respond to a particular offer or not. The project also identified the features that influence whether customers respond to an offer.
Datasets
This project makes use of three distinct datasets/inputs, namely: portfolio, profile and transcript.
1. Portfolio: This dataset contains details of the offers made by Starbucks to its consumers. Size: 10 rows and 6 columns. Features: difficulty (int), reward (int), id (string), offer_type (string), duration (int), channels (list of strings).
2. Profile: This dataset contains details of the customers of Starbucks' products. Size: 17000 rows and 5 columns. Features: id (str), age (int), became_member_on (int), gender (str), income (float).
3. Transcript: This dataset contains records of customer events: transactions, plus offers received, viewed and completed. Size: 306534 rows and 4 columns. Features: event (str), person (str), time (int), value (dict of strings).
The datasets were provided by Starbucks and Udacity as part of the Machine Learning Engineer Capstone Project. They depict the purchase decisions of consumers.
Methodology
Metrics
This project used the following metrics:
Confusion Matrix: The confusion matrix details the True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN). Its purpose is to show the number of correct and wrong predictions made by the model. You can read more on this here.
Accuracy Score: After this, the model's accuracy score was generated. Accuracy measures the ratio of correct predictions to the total number of instances evaluated: Accuracy = (TP + TN) / (TP + TN + FP + FN).
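Both metrics are available in scikit-learn; here is a minimal sketch (the y_true and y_pred values below are placeholders for illustration, not results from this project):

from sklearn.metrics import confusion_matrix, accuracy_score

# Placeholder labels and predictions, purely for illustration
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_true, y_pred))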
Libraries and Packages used in the project
The libraries and/or packages used for this project include:
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn (sklearn)
Data Cleaning and Processing
Portfolio Dataset: The following was done to clean and prepare the portfolio dataset (a code sketch follows the list):
- Renaming the id column to offer_id
- One hot encoding the offer_type column
- Separating the list values in the channels column and one hot encoding them
- Changing the duration column from days to hours
- Dropping the offer_type, duration and channels columns
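Here is a minimal sketch of those steps, assuming pandas; the exact implementation in the notebook may differ:

import pandas as pd

def clean_portfolio_data(portfolio):
    """Clean the portfolio dataset following the steps listed above."""
    # Rename the id column to offer_id
    clean_portfolio = portfolio.rename(columns={'id': 'offer_id'})
    # One hot encode the offer_type column (bogo, discount, informational)
    offer_type_dummies = pd.get_dummies(clean_portfolio['offer_type'])
    # Separate the list values in the channels column and one hot encode them
    channel_dummies = clean_portfolio['channels'].str.join('|').str.get_dummies()
    # Change the duration column from days to hours
    clean_portfolio['duration_by_hours'] = clean_portfolio['duration'] * 24
    # Drop the original columns and attach the encoded ones
    clean_portfolio = clean_portfolio.drop(columns=['offer_type', 'duration', 'channels'])
    return pd.concat([clean_portfolio, offer_type_dummies, channel_dummies], axis=1)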
This is the dataset after cleaning:
Profile Dataset: The following was done to clean and prepare the profile dataset (a code sketch follows the list):
- Checking for and dropping rows with null values
- Transforming the 'became_member_on' column to date format
- Separating the dates in the 'became_member_on' column into distinct columns: became_member_year (year), became_member_month (month) and became_member_day (day)
- Calculating the period the customers have been members (membership_tenure)
- Creating an age_grade column from the 'age' column in the profile dataset
- One hot encoding the 'became_member_year', 'became_member_month', 'income_range', 'age_grade' columns
- Renaming the id column to customer_id
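A minimal sketch of these steps, assuming pandas; the membership reference date and the age bin edges are illustrative assumptions:

import pandas as pd

def clean_profile_data(profile):
    """Clean the profile dataset following the steps listed above."""
    # Check for and drop rows with null values
    clean_profile = profile.dropna().copy()
    # Transform became_member_on (e.g. 20170425) to date format
    clean_profile['became_member_on'] = pd.to_datetime(clean_profile['became_member_on'], format='%Y%m%d')
    # Separate the membership date into year, month and day columns
    clean_profile['became_member_year'] = clean_profile['became_member_on'].dt.year
    clean_profile['became_member_month'] = clean_profile['became_member_on'].dt.month
    clean_profile['became_member_day'] = clean_profile['became_member_on'].dt.day
    # Membership tenure in days, measured from the most recent membership date (an assumption)
    latest = clean_profile['became_member_on'].max()
    clean_profile['membership_tenure'] = (latest - clean_profile['became_member_on']).dt.days
    # Bucket ages into grades (these bin edges are illustrative)
    clean_profile['age_grade'] = pd.cut(clean_profile['age'],
                                        bins=[17, 35, 55, 75, 120],
                                        labels=['18-35', '36-55', '56-75', '76+'])
    # Rename the id column to customer_id
    # (one hot encoding of the year, month, income_range and age_grade columns would follow)
    return clean_profile.rename(columns={'id': 'customer_id'})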
This is the cleaned dataset:
Transcript Dataset: The following was done to clean and process the transcript dataset (a code sketch follows the list):
- Checking for null values
- Changing the name of the 'person' column to 'customer_id'
- Removing IDs associated with null values in the profile dataset
- Separating the dataset into offers_df and transaction_df
- Extracting dict values from the value columns of the newly created datasets to new columns
offers_df: containing data on offers received, viewed and completed
transaction_df: containing data on transactions made
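A minimal sketch of those steps, assuming pandas and that the value column holds dicts keyed 'offer id'/'offer_id' and 'amount' (an assumption based on the raw data):

import pandas as pd

def clean_transcript_data(transcript, clean_profile):
    """Clean the transcript dataset following the steps listed above."""
    # Rename the person column and drop customers removed during profile cleaning
    clean_transcript = transcript.rename(columns={'person': 'customer_id'})
    clean_transcript = clean_transcript[clean_transcript['customer_id'].isin(clean_profile['customer_id'])]
    # Separate the dataset into offer events and transactions
    offers_df = clean_transcript[clean_transcript['event'] != 'transaction'].copy()
    transaction_df = clean_transcript[clean_transcript['event'] == 'transaction'].copy()
    # Extract the dict values from the value column into new columns
    offers_df['offer_id'] = offers_df['value'].apply(lambda v: v.get('offer id', v.get('offer_id')))
    transaction_df['amount'] = transaction_df['value'].apply(lambda v: v.get('amount'))
    # One hot encode the event column ('offer received', 'offer viewed', 'offer completed')
    offers_df = pd.concat([offers_df, pd.get_dummies(offers_df['event'])], axis=1)
    return offers_df.drop(columns=['value', 'event']), transaction_df.drop(columns=['value', 'event'])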
Data Visualisations
The following were visualised from the profile dataset:
- income_range frequency distribution
- Gender distribution by year
Creating a Combined Dataset
Following the cleaning and processing of the three datasets, the next step was to combine them into one dataset named combined_df. combined_df holds all the data points represented in the previously cleaned datasets.
import numpy as np
import pandas as pd

def create_combined_data(clean_portfolio, clean_profile, offers_df, transaction_df):
    """
    Create a combined dataframe from the transaction, demographic and offer data.
    ARGS:
    clean_portfolio - (dataframe) offer metadata
    clean_profile - (dataframe) customer demographic data
    offers_df - (dataframe) offers data for customers
    transaction_df - (dataframe) transaction data for customers
    """
    combined_data = []  # Initialize empty list for combined data
    customer_id_list = offers_df['customer_id'].unique().tolist()  # Unique customers in offers_df
    # Iterate over each customer
    for cust_id in customer_id_list:
        # Select the customer's profile, offers and transactions
        cust_profile = clean_profile[clean_profile['customer_id'] == cust_id]
        cust_offers_data = offers_df[offers_df['customer_id'] == cust_id]
        cust_transaction_df = transaction_df[transaction_df['customer_id'] == cust_id]
        # Select received, viewed and completed offer data from the customer's offers
        offer_received_data = cust_offers_data[cust_offers_data['offer received'] == 1]
        offer_viewed_data = cust_offers_data[cust_offers_data['offer viewed'] == 1]
        offer_completed_data = cust_offers_data[cust_offers_data['offer completed'] == 1]
        rows = []  # Initialize empty list for this customer's records
        # Iterate over each offer received by the customer
        for off_id in offer_received_data['offer_id'].values.tolist():
            offer_row = clean_portfolio[clean_portfolio['offer_id'] == off_id]
            # Select the offer duration and the time the offer was received,
            # then calculate the time when the offer ends
            duration = offer_row['duration_by_hours'].values[0]
            off_recd_time = offer_received_data.loc[offer_received_data['offer_id'] == off_id, 'time'].values[0]
            off_end_time = off_recd_time + duration
            # Boolean array: did the customer view the offer within the offer period?
            offers_viewed = np.logical_and(offer_viewed_data['time'] >= off_recd_time,
                                           offer_viewed_data['time'] <= off_end_time)
            # Customer transactions within the offer period
            cust_tran_within_period = cust_transaction_df[np.logical_and(cust_transaction_df['time'] >= off_recd_time,
                                                                         cust_transaction_df['time'] <= off_end_time)]
            # Check if the offer type is 'bogo' or 'discount'
            if offer_row['bogo'].values[0] == 1 or offer_row['discount'].values[0] == 1:
                # Boolean array: did the customer complete the offer within the offer period?
                offers_comp = np.logical_and(offer_completed_data['time'] >= off_recd_time,
                                             offer_completed_data['time'] <= off_end_time)
                # The customer responded to a bogo/discount offer if it was viewed and
                # completed, and the spend within the period meets the offer difficulty
                cust_response = (offers_viewed.sum() > 0 and offers_comp.sum() > 0 and
                                 cust_tran_within_period['amount'].sum() >= offer_row['difficulty'].values[0])
            # Check if the offer type is 'informational'
            elif offer_row['informational'].values[0] == 1:
                # The customer responded to an informational offer if it was viewed
                # and any transaction was made within the offer period
                cust_response = offers_viewed.sum() > 0 and len(cust_tran_within_period) > 0
            else:
                cust_response = False
            # Build a record for this customer/offer pair with the required information
            cust_rec = {'cust_response': int(cust_response),
                        'time': off_recd_time,
                        'total_amount': cust_tran_within_period['amount'].sum()}
            cust_rec.update(cust_profile.squeeze().to_dict())
            cust_rec.update(offer_row.squeeze().to_dict())
            rows.append(cust_rec)
        # Add this customer's records to the combined data
        combined_data.extend(rows)
    # Convert the combined data to a dataframe and reorder its columns
    combined_data_df = pd.DataFrame(combined_data)
    col_order = ['customer_id', 'offer_id', 'time']
    port_ls = clean_portfolio.columns.tolist()
    port_ls.remove('offer_id')
    pro_ls = clean_profile.columns.tolist()
    pro_ls.remove('customer_id')
    col_order.extend(port_ls)
    col_order.extend(pro_ls)
    col_order.extend(['total_amount', 'cust_response'])
    combined_data_df = combined_data_df.reindex(col_order, axis=1)
    combined_data_df.to_csv('combined_data2.csv', index=False)
    return combined_data_df

combined_df = create_combined_data(clean_portfolio, clean_profile, offers_df, transaction_df)
combined_df
In an effort to generate more insights, I created a dataset that showed the number of offers that were successful.
Here is a visualisation of each offer_type and its rate of success (in percentages).
From the preceding, we can see that discount offers are the most successful. Generally, discount and bogo offers outperformed informational offers.
Predictive Modelling
Benchmarking
This project adopted a benchmarking approach by training and testing three distinct classifiers and measuring their performance in terms of their accuracy scores. These classifiers are: Logistic Regression, Random Forest Classifier and Gradient Boosting Classifier. The classifiers were initialised like so:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

lr = LogisticRegression(random_state=42)
rfc = RandomForestClassifier(random_state=42)
gbc = GradientBoostingClassifier(random_state=42)
Scaling combined_df
In machine learning, it is standard practice to scale the data before feeding it to a model; features on very different ranges can otherwise dominate training. The scaling of the dataset was done with scikit-learn's MinMaxScaler, initialised at the top of the code below.
Next, I scaled the values in the combined_df dataset. This was done with a scale_features function:
from sklearn.preprocessing import MinMaxScaler

# Initialise the scaler
scaler = MinMaxScaler()

# A list of the features we want to scale
features_to_scale = ['difficulty', 'duration_by_hours', 'reward', 'membership_tenure', 'total_amount']

def scale_features(df, feat=features_to_scale):
    """
    This function scales the listed features in a given dataframe.
    ARGS:
    df (dataframe): dataframe having features to scale
    feat (list): list of features in dataframe to scale
    """
    # Prepare a dataframe with the features to scale
    df_feat_scale = df[feat]
    # Apply feature scaling
    df_feat_scale = pd.DataFrame(scaler.fit_transform(df_feat_scale),
                                 columns=df_feat_scale.columns, index=df_feat_scale.index)
    # Drop the original features from df and add the scaled features
    df = df.drop(columns=feat, axis=1)
    df_scaled = pd.concat([df, df_feat_scale], axis=1)
    return df_scaled

# Applying the scale_features function to the dataset
combined_df_scaled = scale_features(combined_df, feat=features_to_scale)
combined_df_scaled
Splitting combined_df_scaled into train and test sets
The scaled data was split into training and testing sets with scikit-learn's train_test_split().
from sklearn.model_selection import train_test_split

# Splitting the dataset
# X holds the independent variables (features) that act as the input of the model;
# the target and the non-numeric id columns are excluded (models cannot train on raw string ids)
X = combined_df_scaled.drop(columns=['cust_response', 'customer_id', 'offer_id'])
# y is the variable we are trying to predict
y = combined_df_scaled['cust_response']
# Split the data into train and test sets with train_test_split()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
The next task was to check the distribution of target values in the y_train set.
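A check along these lines (a sketch using pandas' value_counts) produces the percentage distribution:

# Percentage distribution of the target classes in the training set
print(round(y_train.value_counts(normalize=True) * 100, 1))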
1 53.8
0 46.2
From the distribution above, we can tell that the training set is almost balanced; it is not extremely imbalanced. As such, there was no need to worry about dealing with class imbalance.
Training the Classifiers
The three classifiers above were trained in a single loop, with their training accuracy scores generated. Here is the code:
def classifier_trainer(X_train, y_train):
    """
    This function trains the identified classifiers with the X_train and y_train sets.
    ARGS:
    X_train - Independent variables (Features) training set
    y_train - Dependent variable (Target) training set
    """
    classifiers = [lr, rfc, gbc]
    for classifier in classifiers:
        # Fit the classifier and report its accuracy on the training set
        training = classifier.fit(X_train, y_train)
        score = 'Classifier_Score : {}'.format(classifier.score(X_train, y_train))
        line_breaker = '*****************************'
        print(training)
        print(score)
        print(line_breaker)

# Calling the created function on the training sets
classifier_trainer(X_train, y_train)
Results:
LogisticRegression(random_state=42)
Classifier_Score : 0.8416111707841031
*****************************
RandomForestClassifier(random_state=42)
Classifier_Score : 0.9999355531686359
*****************************
GradientBoostingClassifier(random_state=42)
Classifier_Score : 0.9114285714285715
*****************************
Testing the Classifiers
The preceding represents the training scores of the identified models, but training scores alone are not enough to choose a model. The test accuracy scores of the models were also needed, so the following was done:
def classifier_tester(X_test, y_test):
    """
    This function tests the identified classifiers with the X_test and y_test sets.
    ARGS:
    X_test - Independent variables (Features) testing set
    y_test - Dependent variable (Target) testing set
    """
    classifiers = [lr, rfc, gbc]
    for classifier in classifiers:
        # Report the classifier's accuracy on the held-out test set
        score = 'Classifier_test_Score : {}'.format(classifier.score(X_test, y_test))
        line_breaker = '*****************************'
        print(classifier.__class__.__name__)
        print(score)
        print(line_breaker)

classifier_tester(X_test, y_test)
Results:
LogisticRegression
Classifier_test_Score : 0.8434664929076237
*****************************
RandomForestClassifier
Classifier_test_Score : 0.9252669039145908
*****************************
GradientBoostingClassifier
Classifier_test_Score : 0.9099794496516466
*****************************
From the training and testing scores above, the Random Forest Classifier outperformed the other models, with a training and testing accuracy of 0.999 and 0.925 respectively. This is likely because the relationship between the features and the target is non-linear, which tree-based ensembles capture well; for more info, please see here. Predictions were made with rfc (the Random Forest Classifier) and a confusion matrix was generated to ascertain the true and false predictions.
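That step can be sketched with scikit-learn's confusion_matrix, assuming the fitted rfc and the test split from above:

from sklearn.metrics import confusion_matrix

# Predict on the test set with the Random Forest Classifier
y_pred = rfc.predict(X_test)
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))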
Although the model performed well, the false positives and false negatives were a bit high, which made tuning necessary. Here is the tuning code used:
clf = RandomForestClassifier(n_estimators=60, criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

RandomForestClassifier was used in this case because the Random Forest Classifier was the adopted model. With the tuned model, new predictions were made and another confusion matrix was generated.
The tuning worked, in that it reduced the false negatives to 450 from 463 in the initial results. The true negatives also increased from 10349 in the previous confusion matrix to 10362. These parameters can be changed further to get better results.
Results
Following the predictions made, a chart was generated picturing the most important features in the modelling process. These features are also the elements that inform a customer's response to the offers made by Starbucks.
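Such a chart can be produced from the model's feature_importances_ attribute; here is a sketch assuming the tuned clf, matplotlib and the training features from above:

import pandas as pd
import matplotlib.pyplot as plt

# Feature importances from the tuned Random Forest, labelled by feature name
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 10))
plt.title('Feature importance')
plt.tight_layout()
plt.show()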
The top four features from the chart above are: total_amount, membership_tenure, social and difficulty.
Total amount: The total amount of money spent by a customer determines, to a large extent, whether they will respond to an offer or not.
Membership tenure: The membership tenure of customers also determines whether they will respond to an offer or not.
Social: Social is one of the channels through which customers receive offers. From the visualisation above, customers who received offers through social channels are more likely to respond to offers.
Difficulty: Difficulty denotes the minimum amount required to be spent before an offer can be completed. The chart shows that difficulty influences a customer's decision to respond to an offer or not.
Conclusion
The project set out to determine whether a particular customer will respond to an offer or not. Following data exploration and cleaning, the project involved training three classifiers, namely: Logistic Regression, Random Forest Classifier and Gradient Boosting Classifier. Their scores were measured to determine the model that performs best. With a training accuracy of 0.999 and a test accuracy of 0.925, the Random Forest Classifier was chosen and tuned with certain parameters. The predictions were measured with a confusion matrix to identify the "correct" and "wrong" predictions. With 8111 true positives, 1028 false positives, 450 false negatives and 10362 true negatives, it can be said that the model performed well. The project also went further to identify the most influential features in the dataset: total_amount, membership_tenure, social and difficulty.
Here is a link to the GitHub repository containing the notebook and files used in this project. For enquiries, send me a mail: seyiogunnowo@gmail.com
Thank you for reading. I hope you learnt a thing or two.