Prediction of Prospective Wheelchair Ramp Buyers in R

Author

Iman Mousavi

Published

August 5, 2022

As a part of certification examination for professional data scientist in DataCamp, a fictitious case study is given to the candidate that requires not only standard tools and skills expected from a data savvy, but to give business insights in order to overcome an obstacle to making more profit.

Case Study

Company Background

National Accessibility currently installs wheelchair ramps for office buildings and schools. The marketing manager wants the company to start installing ramps for event venues as well. According to a new survey, approximately 40% of event venues are not wheelchair accessible. However, it is not easy to know whether a venue already has a ramp installed.

It is a waste of time to contact venues that already have a ramp installed, and it also looks bad for the company. They would like the help of the data science team in predicting which venues already have a ramp installed.

Customer Question

The marketing manager would like to know: - Can you develop a model to predict whether an event venue already has a wheelchair ramp installed?

Success Criteria

To reduce the amount of time wasted by the company contacting venues that already have a ramp, at least two-thirds of venues predicted to be without a ramp should not have a ramp.

Data

In the CSV file, the following variables have been gathered:
- venue_name: Character, name of the venue.
- Loud music / events: Character, whether the venue hosts loud events (True) or not (False).
- Venue provides alcohol: Numeric, whether the venue provides alcohol (1) or not (0).
- Wi-Fi: Character, whether the venue provides wi-fi (True) or not (False).
- supervenue: Character, whether the venue qualifies as a supervenue (True) or not (False).
- U-Shaped_max: Numeric, the total capacity of the u-shaped portion of the theater.
- max_standing: Numeric, the total standing capacity of the venue.
- Theatre_max: Numeric, the total capacity of the theatre.
- Promoted / ticketed events: Character, whether the venue hosts promoted/ticket events (True) or not (False).
- Wheelchair accessible: Character, whether the venue is wheelchair accessible (True) or not (False).

Exploratory Data Anaysis

Basic Exploration

First of all, it’s recommended to take a look at the data and its structure.

venue_name	Loud music / events	Venue provides alcohol	Wi-Fi	supervenue	U-Shaped_max	max_standing	Theatre_max	Promoted / ticketed events	Wheelchair accessible
techspace aldgate east	FALSE	0	TRUE	FALSE	35.04545	0	112.7159	FALSE	FALSE
green rooms hotel	TRUE	1	TRUE	FALSE	40.00000	120	80.0000	TRUE	FALSE
148 leadenhall street	FALSE	0	TRUE	FALSE	35.04545	0	112.7159	FALSE	FALSE
conway hall	FALSE	0	TRUE	FALSE	35.04545	60	60.0000	FALSE	FALSE
gridiron building	FALSE	0	TRUE	FALSE	35.04545	0	112.7159	FALSE	FALSE
kimpton fitzroy london	TRUE	1	TRUE	FALSE	6.00000	0	112.7159	TRUE	FALSE

Variable names are not standard, so they should be turned into lower case with underscores (snake_case).

 [1] "venue_name"               "loud_music_events"       
 [3] "venue_provides_alcohol"   "wi_fi"                   
 [5] "supervenue"               "u_shaped_max"            
 [7] "max_standing"             "theatre_max"             
 [9] "promoted_ticketed_events" "wheelchair_accessible"

Then, we need to check each variable whether the data types are in a right format or not.

Rows: 3,910
Columns: 10
$ venue_name               <chr> "techspace aldgate east", "green rooms hotel"…
$ loud_music_events        <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ venue_provides_alcohol   <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, …
$ wi_fi                    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ supervenue               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ u_shaped_max             <dbl> 35.04545, 40.00000, 35.04545, 35.04545, 35.04…
$ max_standing             <dbl> 0, 120, 0, 60, 0, 0, 0, 200, 0, 180, 300, 46,…
$ theatre_max              <dbl> 112.7159, 80.0000, 112.7159, 60.0000, 112.715…
$ promoted_ticketed_events <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ wheelchair_accessible    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

There are 3910 rows or records in this data set with 10 columns. Last column or wheelchair_accessible is the target variable.

Missing Values

There are 0 in this data set and the data types stated in the glimpse object implies no odd values are in the columns. Therefore, we can continue our analysis with no worries.

Cross Field Validation

In this data, theatre_max must be greater than u_shaped_max and max_standing. We check if such constraint is met.

After forcing the constraint, the number of rows in the data set dropped to 2329 implying that there were some cases in which our restraint have been violated.

Duplicated Venue Observations

venue_name	loud_music_events	venue_provides_alcohol	wi_fi	supervenue	u_shaped_max	theatre_max	promoted_ticketed_events	wheelchair_accessible
1 cornhill	FALSE	1	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 cornhill	FALSE	1	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 cornhill	FALSE	1	TRUE	FALSE	35.04545	55.0000	FALSE	FALSE
1 cornhill	FALSE	1	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 king street	FALSE	0	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 king street	FALSE	0	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 king william street	FALSE	1	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 king william street	FALSE	1	TRUE	FALSE	35.04545	112.7159	FALSE	FALSE
1 king william street	FALSE	1	TRUE	FALSE	35.04545	112.7159	FALSE	TRUE
1 wimpole street	FALSE	1	TRUE	FALSE	35.04545	80.0000	TRUE	TRUE

Duplicated observations must be removed. To do so, data is grouped by the venues and then, summarized with median function to calculate the median values of variables to have just one value.

In the next table, first few rows of the number of cases that contain information of one venue is shown.

venue_name	n
1 cornhill	5
1 king street	3
1 king william street	4
1 wimpole street	3
107 cheapside	6
110 rochester row	5

The following table shows the first few rows of the data set after aggregating duplicated cases.

venue_name	venue_provides_alcohol	wi_fi	supervenue	u_shaped_max	max_standing	theatre_max	promoted_ticketed_events	wheelchair_accessible
1 cornhill	1	1	0	35	0	113	0	0
1 king street	0	1	0	35	0	113	0	0
1 king william street	1	1	0	35	0	113	0	0
1 wimpole street	1	1	0	35	0	113	1	1
107 cheapside	0	1	1	35	0	91	1	1
110 rochester row	0	1	0	16	40	40	1	1

Again, we should check if there are some other duplicated cases. Now, 0 duplicated cases exist in the data set, so we go on to the next part.

Summary Statistics

Temporarily, boolean and binary data types should be transformed into labeled categorical type. However, in the model preprocessing part, categorical variables change into binary (dummy) variables.

Before modeling, it’s imperative to grasp a general idea of the data set. In this section, a summary statistics has been provided for numerical and categorical (binary) variables.

Data Frame Summary

Dimensions: 1066 x 9
Duplicates: 446

No	Variable	Stats / Values	Freqs (% of Valid)
1	loud_music_events [factor]	1. No 2. Yes	738 (69.2%) 328 (30.8%)
2	venue_provides_alcohol [factor]	1. No 2. Yes	375 (35.2%) 691 (64.8%)
3	wi_fi [factor]	1. No 2. Yes	97 ( 9.1%) 969 (90.9%)
4	supervenue [factor]	1. No 2. Yes	979 (91.8%) 87 ( 8.2%)
5	u_shaped_max [numeric]	Mean (sd) : 33.7 (7.6) min < med < max: 4 < 35 < 105 IQR (CV) : 0 (0.2)	45 distinct values
6	max_standing [numeric]	Mean (sd) : 50.4 (152.9) min < med < max: 0 < 30 < 4000 IQR (CV) : 60 (3)	81 distinct values
7	theatre_max [numeric]	Mean (sd) : 122.1 (159.4) min < med < max: 8 < 113 < 4000 IQR (CV) : 0 (1.3)	108 distinct values
8	promoted_ticketed_events [factor]	1. No 2. Yes	704 (66.0%) 362 (34.0%)
9	wheelchair_accessible [factor]	1. No 2. Yes	627 (58.8%) 439 (41.2%)

Data Visualization

Maximum Capacity

Highly Skewed! Better to plot with log scale.

The reason our box plot has been compacted into a thick line is the distribution of theatre_max variable. The majority of data is slightly above 100 and other values are detected as outliers.

U-Shaped Maximum Capacity

Still outliers exist, but the density distribution shows less skewness.

Maximum Standing Capacity

Log-scaled plots reveal more insights into the distributions. Note that there are 270 zeros in this variable and we need to add one unit to circumvent errors.

Categorical Variables

Venues with loud-music events are almost half of those with no loud music.

Venues providing alcohol are almost twice the number of venues with no alcohol.

Most venues provide wi-fi service to their customers and it doesn’t really gives that much information. This feature has the potential to be excluded due to its low variation.

Few venues are labelled “Supervenue”. This variable can’t be useful explaining the probability of target labels.

Number of venues with promoted events is almost half of the number of those that have no such events.

Target Variable

Whether a venue has access to wheelchair ramp is our target variable and we attempt to predict the probability of each outcome for each case.

Target variable is approximately balanced.

Relationship Between Features

theatre_max has correlation with max_standing and u_shaped_max, which was expected beforehand.

Feature Distributions By Target Variable

Those venues with no wheelchair ramp have also less variance in their maximum capacity comparing with venues with the accessibility.

Correlation Coefficients

term	u_shaped_max	max_standing	theatre_max
u_shaped_max	NA	0.120	0.133
max_standing	0.120	NA	0.913
theatre_max	0.133	0.913	NA

Collinearity exists between theatre_max and max_standing!

Modeling with Tidymodels

The objective of the following analysis is predicting whether the venue has already have ramp accessibility or not. In other words, there are two possible outcomes:
- In case wheelchair_accessible was TRUE then Marketing Department dismiss such venues.
- In case wheelchair_accessible = FALSE then Marketing Department contacts them.

The problem at hand is a binary classification under supervised machine learning.

KPI to compare models

In the following anaylses, specificity, which is defined as the number of True negative prediction out of all negative predicted, is our main focus. Business Criterion is that out of all venues predicted with no ramp accessibility, at least 67% must be predicted accurately.

Specificity is the KPI to evaluate the analyses.

Criterion : specificity > 67%

Preprocessing

In order to use step_log, we would better off changing 0 values in max_standing to 1.

First few rows of the training data after preprocessing steps are as follows:

max_standing	theatre_max	wheelchair_accessible	loud_music_events_Yes	venue_provides_alcohol_Yes	promoted_ticketed_events_Yes
−1.570	0.134	No	0.000	0.000	0.000
−1.570	0.134	No	0.000	0.000	0.000
0.103	0.134	No	1.000	1.000	1.000
−1.570	0.134	No	0.000	0.000	0.000
−1.570	0.134	No	0.000	0.000	0.000
−1.570	0.134	No	0.000	0.000	0.000

First few rows of the test data after preprocessing steps are as follows:

max_standing	theatre_max	wheelchair_accessible	venue_provides_alcohol_Yes	promoted_ticketed_events_Yes
−1.570	0.134	No	1.000	0.000
−1.570	0.134	No	1.000	0.000
0.490	−2.065	Yes	0.000	1.000
−1.570	0.134	No	0.000	0.000
−1.570	0.134	Yes	1.000	0.000
−1.570	0.134	No	0.000	0.000

Linear Logistic Regression

Let’s start the analysis with the simplest form of model for classification. Linear Logistic Regression assumes a linear relationship between predictors and the probability of getting value of 1 for the outcome.

The coefficients, standard errors, t-statistics, and p-values of the logistic model fitted on the training set have been shown in the next table.

term	estimate	std.error	statistic	p.value
(Intercept)	−0.837	0.135	−6.198	0.000
max_standing	0.024	0.076	0.320	0.749
theatre_max	−0.057	0.072	−0.790	0.430
loud_music_events_Yes	−0.221	0.171	−1.297	0.194
venue_provides_alcohol_Yes	0.507	0.162	3.131	0.002
promoted_ticketed_events_Yes	0.625	0.158	3.943	0.000

Fitted model on the training set is used for prediction on the test set. Evaluation metrics such as accuracy, sensitivity, specificity, and roc auc are given below:

.metric	.estimator	.estimate
accuracy	binary	0.60
sens	binary	0.32
spec	binary	0.80
roc_auc	binary	0.63

Accuracy and ROC AUC is not good enough, despite reasonable specificity. Due to non-linearity relationship between predictors and outcome, let’s try the basic tree-based model.

Decision Tree

Decision tree is capable of capturing nonlinear contributions of input variables to predict the probability of target variable. Although more complex algorithms exist and they might perform outstandingly, we start by a rather simple one.

Again, the evaluation metrics have been gathered in the next table:

.metric	.estimator	.estimate
accuracy	binary	0.66
sens	binary	0.38
spec	binary	0.87
roc_auc	binary	0.67

We obtained results with improvement comparing with the linear logistic regression. Note that tuning the model can help enhance the accuracy and other relevant metrics.

Tuning Decision Tree

min_n	.metric	.estimator	mean	n	std_err	.config
27.000	roc_auc	binary	0.657	3.000	0.027	Preprocessor1_Model003
28.000	roc_auc	binary	0.657	3.000	0.027	Preprocessor1_Model023
27.000	roc_auc	binary	0.657	3.000	0.027	Preprocessor1_Model049

.metric	.estimator	.estimate
accuracy	binary	0.659
sens	binary	0.443
spec	binary	0.810
roc_auc	binary	0.660

Still need to improve other metrics, let’s try random forest model, which is much more powerful than decision tree.

Random Forest

Random forest is a more complicated form of decision tree model consisting of a multitude of trees that utilize a distinct bootstrap sample of the training set. Finally, in a classification problem, the majority voting is the deciding criteria to assign each case to one of the classes.

.metric	.estimator	.estimate
accuracy	binary	0.678
sens	binary	0.420
spec	binary	0.857
roc_auc	binary	0.717

Results are better than decision tree model, however, it needs to be tuned for better performance.

trees	min_n	.metric	.estimator	mean	n	std_err	.config
492.000	22.000	roc_auc	binary	0.683	3.000	0.024	Preprocessor1_Model021
682.000	20.000	roc_auc	binary	0.682	3.000	0.026	Preprocessor1_Model029
91.000	34.000	roc_auc	binary	0.682	3.000	0.021	Preprocessor1_Model013

.metric	.estimator	.estimate
accuracy	binary	0.682
sens	binary	0.409
spec	binary	0.873
roc_auc	binary	0.719

XGBoost

Last but not least, we go on using another complicated machine learning models called XGBoost. It’s an optimized gradient-boosting machine learning model that has advantages like great speed and performance, outperforming single-algorithm models, and state-of-the-art performance in many ML tasks.

.metric	.estimator	.estimate
accuracy	binary	0.692
sens	binary	0.455
spec	binary	0.857
roc_auc	binary	0.703

Now, we continue by tunning XGBoost as well.

tree_depth	learn_rate	sample_size	.metric	.estimator	mean	n	std_err	.config
4.000	0.080	0.570	roc_auc	binary	0.684	3.000	0.008	Preprocessor1_Model128
1.000	0.264	0.237	roc_auc	binary	0.683	3.000	0.012	Preprocessor1_Model176
3.000	0.027	0.520	roc_auc	binary	0.681	3.000	0.014	Preprocessor1_Model015

.metric	.estimator	.estimate
accuracy	binary	0.706
sens	binary	0.432
spec	binary	0.897
roc_auc	binary	0.715

XGBoost outperformed random forest model according to evaluation metrics in general. All of them are acceptable. Consequently, XGBoost is the final model. Because the criterion is met, more complex models are dismissed.

Final Model is XGBoost to detect which venues still haven’t special ramps for their audience.

Criterion is met, Specificity > 0.67 Accuracy and ROC AUC improved as well!

Business Focus

In marketing, like other departments, resources such as money, time, and people are scarce. That is, we need to identify prospect customers so that we manage to forge a close relatioship and offer our product to solve their issue.

CPC or Cost per Contact is defined as the money or any other kind of cost that the company incur to persuade one person to make a purchase. By targeting the customers with high probability of purchasing, we avoid calling every single venue, and with less time and money we accomplish what we are looking for.

Addressing the Problem

We have developed a model which is powerful to select only venues with high chances of ordering. Marketing team calls merely those prospects, so CPC is lowered and the company can contact potential customers before anyone else.

Recommandations

Recently, digital marketing has become one of the prominent tools to contact customers and offer the service they’re looking for. Instead of calling predicted prospects, we can do some preliminary actions:

Contact via email explaining why the prospects would be better off if they order a ramp for wheelchair users.
What features distinguish venues based on their need to use ramps? For instance, it would be instructive to give some statistics about venues with similar features and trigger feeling of necessity to have the ramp.
Sending a video or even a simulation is helpful because customers can have a sense where the ramp will be constructed and whether they’re ok with it or not.

--- title: "Prediction of Prospective Wheelchair Ramp Buyers in R" author: "Iman Mousavi" date: "2022-08-5" format: html: theme: cosmo css: styles.css execute: echo: false cache: false warning: false toc: true toccolor: "#023E8A" toc-title: Sections theme: theme.scss code-link: true code-fold: show code-tools: true highlight-style: github --- As a part of certification examination for professional data scientist in DataCamp, a fictitious case study is given to the candidate that requires not only standard tools and skills expected from a data savvy, but to give business insights in order to overcome an obstacle to making more profit. ![Wheelchair Ramp for Venues](wheelchair%20ramp.jpg) # Case Study ## Company Background National Accessibility currently installs wheelchair ramps for office buildings and schools. The marketing manager wants the company to start installing ramps for event venues as well. According to a new survey, approximately 40% of event venues are not wheelchair accessible. However, it is not easy to know whether a venue already has a ramp installed. It is a waste of time to contact venues that already have a ramp installed, and it also looks bad for the company. They would like the help of the data science team in predicting which venues already have a ramp installed. ## Customer Question The marketing manager would like to know: - Can you develop a model to predict whether an event venue already has a wheelchair ramp installed? ## Success Criteria To reduce the amount of time wasted by the company contacting venues that already have a ramp, at least two-thirds of venues predicted to be without a ramp should not have a ramp. ## Data In the CSV file, the following variables have been gathered:\ - `venue_name`: Character, name of the venue.\ - `Loud music / events`: Character, whether the venue hosts loud events (True) or not (False).\ - `Venue provides alcohol`: Numeric, whether the venue provides alcohol (1) or not (0).\ - `Wi-Fi`: Character, whether the venue provides wi-fi (True) or not (False).\ - `supervenue`: Character, whether the venue qualifies as a supervenue (True) or not (False).\ - `U-Shaped_max`: Numeric, the total capacity of the u-shaped portion of the theater.\ - `max_standing`: Numeric, the total standing capacity of the venue.\ - `Theatre_max`: Numeric, the total capacity of the theatre.\ - `Promoted / ticketed events`: Character, whether the venue hosts promoted/ticket events (True) or not (False).\ - `Wheelchair accessible`: Character, whether the venue is wheelchair accessible (True) or not (False).\ ```{r Import Packages} #| include: false if (!require("pacman")) { install.packages("pacman") library(pacman) } p_load(dplyr, ggplot2, readr, stringr, corrr, tidymodels, ranger, xgboost, gt, summarytools, broom) ``` ```{r Reading Data set} #| include: false event_ven <- read_csv("/Users/iman/Documents/Iman/Portfolio/Iman Mousavi Portfolio/posts/Wheelchair Ramp/event_venues.csv") ``` # Exploratory Data Anaysis ## Basic Exploration First of all, it's recommended to take a look at the data and its structure. ```{r Data Overview} head(event_ven) |> gt() ``` Variable names are not standard, so they should be turned into lower case with underscores (snake_case). ```{r Col Names to snake_case} colnames(event_ven) <- str_replace_all(colnames(event_ven), " ", "_") |> str_replace_all("-","_") |> str_replace("_/_", "_") |> str_to_lower() colnames(event_ven) ``` Then, we need to check each variable whether the data types are in a right format or not. ```{r Glimpse} glimpse(event_ven) ``` **There are `r dim(event_ven)[1]` rows or records in this data set with `r dim(event_ven)[2]` columns. Last column or `wheelchair_accessible` is the target variable.** ## Missing Values ```{r Missing Values} num_miss <- sum(is.na(event_ven)) ``` There are `r num_miss` in this data set and the data types stated in the glimpse object implies no odd values are in the columns. Therefore, we can continue our analysis with no worries. ## Cross Field Validation In this data, `theatre_max` must be greater than `u_shaped_max` and `max_standing`. We check if such constraint is met. ```{r Cross Field Validation} event_ven_cfv <- event_ven |> filter(theatre_max >= max_standing & theatre_max >= u_shaped_max) ``` After forcing the constraint, the number of rows in the data set dropped to `r dim(event_ven_cfv)[1]` implying that there were some cases in which our restraint have been violated. ## Duplicated Venue Observations ```{r Duplicated Cases} event_ven_cfv |> filter(duplicated(venue_name)) |> arrange(venue_name) |> head(n = 10) |> gt() ``` Duplicated observations must be removed. To do so, data is grouped by the venues and then, summarized with median function to calculate the median values of variables to have just one value. In the next table, first few rows of the number of cases that contain information of one venue is shown. ```{r Number of Duplicates} event_ven_cfv |> group_by(venue_name) |> summarize(n = n()) |> head() |> gt() ``` The following table shows the first few rows of the data set after aggregating duplicated cases. ```{r Aggregating Duplicates} event_ven_Yesdups <- event_ven_cfv |> group_by(venue_name) |> summarize_all(median) |> mutate_if(is.numeric, round) event_ven_Yesdups |> head() |> gt() ``` Again, we should check if there are some other duplicated cases. Now, `r sum(duplicated(event_ven_Yesdups))` duplicated cases exist in the data set, so we go on to the next part. ## Summary Statistics Temporarily, boolean and binary data types should be transformed into labeled categorical type. However, in the model preprocessing part, categorical variables change into binary (dummy) variables. ```{r Labelled Categories} categorizer_func <- function(x) { x <- ifelse(x == 1, "Yes", "No") } event_ven_cat <- event_ven_Yesdups |> mutate_at(vars(loud_music_events, venue_provides_alcohol, wi_fi, supervenue, promoted_ticketed_events, wheelchair_accessible), categorizer_func) |> mutate_at(vars(loud_music_events, venue_provides_alcohol, wi_fi, supervenue, promoted_ticketed_events, wheelchair_accessible), as.factor) ``` Before modeling, it's imperative to grasp a general idea of the data set. In this section, a summary statistics has been provided for numerical and categorical (binary) variables. ```{r Summary Statistics} #| results: asis dfSummary(event_ven_cat |> select(-venue_name), plain.ascii = FALSE, style = "grid", graph.col = FALSE, graph.magnif = 0.75, tmp.img.dir = "/tmp", valid.col = FALSE) ``` ### Data Visualization #### Maximum Capacity ```{r theatre_max Density Plot} ggplot(event_ven_cat, aes(theatre_max)) + geom_density(bw = 50) + theme(text = element_text(size = 12)) + labs(x = "Maximum Capacity of Venue", y = "Density", title = "Density Plot") ``` ```{r theatre_max Box Plot} ggplot(event_ven_cat, aes(theatre_max)) + geom_boxplot() + theme(text = element_text(size = 12)) + labs(x = "Maximum Capacity of Venue", title = "Box Plot") ``` Highly Skewed! Better to plot with log scale. ```{r theatre_max Box Plot Log} ggplot(event_ven_cat, aes(theatre_max)) + geom_boxplot() + scale_x_log10() + theme(text = element_text(size = 12)) + labs(x = "Maximum Capacity of Venue", title = "Box Plot") ``` The reason our box plot has been compacted into a thick line is the distribution of `theatre_max` variable. The majority of data is slightly above 100 and other values are detected as outliers. #### U-Shaped Maximum Capacity ```{r u_shaped_max Density Plot} ggplot(event_ven_cat, aes(u_shaped_max)) + geom_density() + theme(text = element_text(size = 12)) + labs(x = "Maximum U-Shaped Capacity of Venue", y = "Density", title = "Density Plot") ``` ```{r u_shaped_max Box Plot} ggplot(event_ven_cat, aes(u_shaped_max)) + geom_boxplot() + theme(text = element_text(size = 12)) + labs(x = "Maximum U-Shaped Capacity of Venue", title = "Box Plot") ``` Still outliers exist, but the density distribution shows less skewness. #### Maximum Standing Capacity ```{r max_standing Density Plot} ggplot(event_ven_cat, aes(max_standing)) + geom_density() + theme(text = element_text(size = 12)) + labs(x = "Maximum Standing Capacity of Venue", y = "Density", title = "Density Plot") ``` ```{r max_standing Box Plot} ggplot(event_ven_cat, aes(max_standing)) + geom_boxplot() + theme(text = element_text(size = 12)) + labs(x = "Maximum Standing Capacity of Venue", title = "Box Plot") ``` Log-scaled plots reveal more insights into the distributions. Note that there are 270 zeros in this variable and we need to add one unit to circumvent errors. ```{r max_standing Density Plot Log} ggplot(event_ven_cat, aes(max_standing + 1)) + geom_density() + scale_x_log10() + theme(text = element_text(size = 12)) + labs(x = "Log Maximum Standing Capacity of Venue", y = "Density", title = "Density Plot") ``` ```{r max_standing Box Plot Log} ggplot(event_ven_cat, aes(max_standing + 1)) + geom_boxplot() + scale_x_log10() + theme(text = element_text(size = 12)) + labs(x = "Log Maximum Standing Capacity of Venue", title = "Box Plot") ``` #### Categorical Variables ```{r loud_music_events Bar Plot} ggplot(event_ven_cat, aes(loud_music_events)) + geom_bar(width = 0.6) + scale_y_continuous(breaks = seq(0, 800, 100)) + theme(panel.grid.major.x = element_blank()) + labs(x = "Venue Hosts Loud Events", y = "Count", title = "Loud Music Bar Plot") ``` Venues with loud-music events are almost half of those with no loud music. ```{r venue_provides_alcohol Bar Plot} ggplot(event_ven_cat, aes(venue_provides_alcohol)) + geom_bar(width = 0.6) + scale_y_continuous(breaks = seq(0, 800, 100)) + theme(panel.grid.major.x = element_blank()) + labs(x = "Venue Serves Alcohol", y = "Count", title = "Alcohol Bar Plot") ``` Venues providing alcohol are almost twice the number of venues with no alcohol. ```{r wi_fi Bar Plot} ggplot(event_ven_cat, aes(wi_fi)) + geom_bar(width = 0.6) + scale_y_continuous(breaks = seq(0, 1000, 100)) + theme(panel.grid.major.x = element_blank()) + labs(x = "Venue has Wi-Fi", y = "Count", title = "Wi-Fi Bar Plot") ``` Most venues provide wi-fi service to their customers and it doesn't really gives that much information. This feature has the potential to be excluded due to its low variation. ```{r supervenue Bar Plot} ggplot(event_ven_cat, aes(supervenue)) + geom_bar(width = 0.6) + scale_y_continuous(breaks = seq(0, 1000, 100)) + theme(panel.grid.major.x = element_blank()) + labs(x = "Supervenue", y = "Count", title = "Supervenue Bar Plot") ``` Few venues are labelled "Supervenue". This variable can't be useful explaining the probability of target labels. ```{r promoted_ticketed_events Bar Plot} ggplot(event_ven_cat, aes(promoted_ticketed_events)) + geom_bar(width = 0.6) + scale_y_continuous(breaks = seq(0, 800, 100)) + theme(panel.grid.major.x = element_blank()) + labs(x = "Promoted / Ticketed Events", y = "Count", title = "Promoted / Ticketed Events Bar Plot") ``` Number of venues with promoted events is almost half of the number of those that have no such events. #### Target Variable Whether a venue has access to wheelchair ramp is our target variable and we attempt to predict the probability of each outcome for each case. ```{r wheelchair_accessible Bar Plot} ggplot(event_ven_cat, aes(wheelchair_accessible)) + geom_bar(width = 0.6) + scale_y_continuous(breaks = seq(0, 700, 100)) + theme(panel.grid.major.x = element_blank()) + labs(x = "Wheelchair Ramp Access", y = "Count", title = "Wheelchair Ramp Bar Plot") ``` **Target variable is approximately balanced.** #### Relationship Between Features ```{r u-shpaped_max vs. theatre_max Log Plot} ggplot(event_ven_cat, aes(theatre_max, u_shaped_max)) + geom_point(alpha = 0.5) + scale_x_log10() + scale_y_log10() + labs(x = "Maximum Capacity of theater", y = "Maximum Capacity of U-Shaped Portion", title = "Max U-Shaped Portion vs. Theater Max in Log Scale") ``` ```{r Max Standing vs. Theatre Max Log Plot} ggplot(event_ven_cat, aes(theatre_max, max_standing)) + geom_point(alpha = 0.4) + scale_x_log10() + scale_y_log10() + labs(x = "Maximum Capacity of theater", y = "Maximum Standing Capacity", title = "Max Standing vs. Theater Max in Log Scale") ``` `theatre_max` has correlation with `max_standing` and `u_shaped_max`, which was expected beforehand. #### Feature Distributions By Target Variable ```{r Theater Max Dist by Wheelchair Accessiblity Log Plot} ggplot(event_ven_cat, aes(theatre_max, fill = wheelchair_accessible)) + geom_density(bw = 0.1 ,alpha = 0.3) + scale_x_log10() + labs(x = "Theater Max", title = "Theater Max Capacity Density Plot in Log Scale") ``` Those venues with no wheelchair ramp have also less variance in their maximum capacity comparing with venues with the accessibility. #### Correlation Coefficients ```{r Corr Coeff} event_ven_cat |> correlate() |> gt() |> fmt_number(decimals = 3) ``` Collinearity exists between theatre_max and max_standing! # Modeling with Tidymodels The objective of the following analysis is predicting whether the venue has already have ramp accessibility or not. In other words, there are two possible outcomes:\ - In case `wheelchair_accessible` was TRUE then Marketing Department dismiss such venues.\ - In case `wheelchair_accessible` = FALSE then Marketing Department contacts them.\ The problem at hand is a binary classification under supervised machine learning. ## KPI to compare models In the following anaylses, specificity, which is defined as the number of True negative prediction out of all negative predicted, is our main focus. Business Criterion is that out of all venues predicted with no ramp accessibility, at least 67% must be predicted accurately. Specificity is the KPI to evaluate the analyses. Criterion : specificity \> 67% ## Preprocessing In order to use step_log, we would better off changing 0 values in max_standing to 1. ```{r max_standing 0 to 1} event_ven_cat$max_standing <- ifelse(event_ven_cat$max_standing == 0, 1, event_ven_cat$max_standing) ``` ```{r Seed and Split} set.seed(123) event_split <- initial_split(event_ven_cat[,-1], prop = 0.8, strata = wheelchair_accessible) event_training <- training(event_split) event_test <- testing(event_split) ``` ```{r Formula and Recipe} formula <- wheelchair_accessible ~ . event_recipe <- recipe(formula, data = event_training) |> step_log(recipes::all_numeric()) |> step_normalize(recipes::all_numeric()) |> step_dummy(recipes::all_nominal_predictors()) |> step_corr(recipes::all_numeric_predictors(),threshold = 0.8) |> step_nzv(recipes::all_predictors(), freq_cut = 80/20) ``` First few rows of the training data after preprocessing steps are as follows: ```{r Prep and Bake of Training} event_prep <- event_recipe |> prep(training = event_training) event_training_baked <- event_prep |> bake(new_data = NULL) head(event_training_baked) |> gt() |> fmt_number(decimals = 3) ``` First few rows of the test data after preprocessing steps are as follows: ```{r Bake of Testing} event_test_baked <- event_prep |> bake(event_test) head(event_test_baked) |> gt() |> fmt_number(decimals = 3) ``` ## Linear Logistic Regression Let's start the analysis with the simplest form of model for classification. Linear Logistic Regression assumes a linear relationship between predictors and the probability of getting value of 1 for the outcome. The coefficients, standard errors, t-statistics, and p-values of the logistic model fitted on the training set have been shown in the next table. ```{r Logistic Regression} glm_func <- glm(formula, event_training_baked, family = "binomial") tidy(glm_func) |> gt() |> fmt_number(decimals = 3) ``` ```{r Custom Metrics} event_metrics <- metric_set(accuracy, sens, spec, roc_auc) ``` ```{r Logistic Model} logistic_reg_mdl <- logistic_reg() |> set_engine("glm") |> set_mode("classification") ``` ```{r Logistic Workflow} logistic_wkfl <- workflow() |> add_model(logistic_reg_mdl) |> add_recipe(event_recipe) ``` Fitted model on the training set is used for prediction on the test set. Evaluation metrics such as accuracy, sensitivity, specificity, and roc auc are given below: ```{r Logistic Fit and Evaluation Metrics} logistic_fit <- logistic_wkfl |> last_fit(split = event_split) logistic_results <- logistic_fit |> collect_predictions() event_metrics(logistic_results, truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") |> gt() |> fmt_number(decimals = 2) ``` Accuracy and ROC AUC is not good enough, despite reasonable specificity. Due to non-linearity relationship between predictors and outcome, let's try the basic tree-based model. ## Decision Tree Decision tree is capable of capturing nonlinear contributions of input variables to predict the probability of target variable. Although more complex algorithms exist and they might perform outstandingly, we start by a rather simple one. ```{r Decision Tree} dt_mdl <- decision_tree() |> set_engine("rpart") |> set_mode("classification") ``` ```{r DT Workflow} dt_wkfl <- workflow() |> add_model(dt_mdl) |> add_recipe(event_recipe) ``` Again, the evaluation metrics have been gathered in the next table: ```{r DT Fit and Evaluation Metrics} dt_fit <- dt_wkfl |> last_fit(split = event_split) dt_results <- dt_fit |> collect_predictions() event_metrics(dt_results, truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") |> gt() |> fmt_number(decimal = 2) ``` We obtained results with improvement comparing with the linear logistic regression. Note that tuning the model can help enhance the accuracy and other relevant metrics. ### Tuning Decision Tree ```{r CV Folds} set.seed(123) event_folds <- vfold_cv(event_training, v = 3, strata = wheelchair_accessible) ``` ```{r DT Tuning} dt_tune_mdl <- decision_tree( cost_complexity = tune(), min_n = tune() ) |> set_engine("rpart") |> set_mode("classification") dt_tune_wkfl <- dt_wkfl |> update_model(dt_tune_mdl) dt_grid <- grid_random(extract_parameter_set_dials(dt_tune_mdl), size = 150) ``` ```{r DT Tuning Workflow} dt_tuning <- dt_tune_wkfl |> tune_grid(resamples = event_folds, grid = dt_grid, metrics = event_metrics) autoplot(dt_tuning) ``` ```{r DT Tuning Showing Best} show_best(dt_tuning, metric = "roc_auc", n = 3) |> gt() |> fmt_number(decimals = 3) best_dt_mdl <- dt_tuning |> select_best(metric = "roc_auc") ``` ```{r DT Tuned Workflow} dt_tuned_wkfl <- dt_tune_wkfl |> finalize_workflow(best_dt_mdl) ``` ```{r DT Tuned Fit and Evaluation Metrics} dt_tuned_wkfl_fit <- dt_tuned_wkfl |> last_fit(split = event_split) dt_results <- dt_tuned_wkfl_fit |> collect_predictions() |> event_metrics(truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") dt_results |> gt() |> fmt_number(decimals = 3) ``` Still need to improve other metrics, let's try random forest model, which is much more powerful than decision tree. ## Random Forest Random forest is a more complicated form of decision tree model consisting of a multitude of trees that utilize a distinct bootstrap sample of the training set. Finally, in a classification problem, the majority voting is the deciding criteria to assign each case to one of the classes. ```{r Random Forest} ranforest_mdl <- rand_forest() |> set_engine("ranger") |> set_mode("classification") ``` ```{r RF Workflow} ranforest_wkfl <- workflow() |> add_model(ranforest_mdl) |> add_recipe(event_recipe) ranforest_wkfl_fit <- ranforest_wkfl |> last_fit(split = event_split) ``` ```{r RF Fit and Evaluation Metrics} ranforest_wkfl_fit |> collect_predictions() |> event_metrics(truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") |> gt() |> fmt_number(decimals = 3) ``` Results are better than decision tree model, however, it needs to be tuned for better performance. ```{r RF Tuning} ranforest_tune_mdl <- rand_forest(trees = tune(), min_n = tune()) |> set_engine("ranger") |> set_mode("classification") ``` ```{r RF Tuning Workflow} ranforest_tune_wkfl <- ranforest_wkfl |> update_model(ranforest_tune_mdl) ranforest_grid <- grid_random(extract_parameter_set_dials(ranforest_tune_mdl), size = 150) ``` ```{r RF Tuned Workflow} ranforest_tuning <- ranforest_tune_wkfl |> tune_grid(resamples = event_folds, grid = ranforest_grid, metrics = event_metrics) autoplot(ranforest_tuning) ``` ```{r RF Show Best} show_best(ranforest_tuning, metric = "roc_auc", n = 3) |> gt() |> fmt_number(decimals = 3) best_ranforest_mdl <- ranforest_tuning |> select_best(metric = "roc_auc") ``` ```{r RF Tuned and Evaluation Metrics} ranforest_wkfl <- ranforest_tune_wkfl |> finalize_workflow(best_ranforest_mdl) ranforest_wkfl_fit <- ranforest_wkfl |> last_fit(split = event_split) ranforest_results <- ranforest_wkfl_fit |> collect_predictions() |> event_metrics(truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") ranforest_results |> gt() |> fmt_number(decimals = 3) ``` ## XGBoost Last but not least, we go on using another complicated machine learning models called XGBoost. It's an optimized gradient-boosting machine learning model that has advantages like great speed and performance, outperforming single-algorithm models, and state-of-the-art performance in many ML tasks. ```{r XGBoost} boost_mdl <- boost_tree() |> set_engine("xgboost") |> set_mode("classification") ``` ```{r XGBoost Workflow} boost_wkfl <- workflow() |> add_model(boost_mdl) |> add_recipe(event_recipe) ``` ```{r XGBoost Fit and Evaluation Metrics} boost_wkfl_fit <- boost_wkfl |> last_fit(split = event_split) boost_wkfl_fit |> collect_predictions() |> event_metrics(truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") |> gt() |> fmt_number(decimals = 3) ``` Now, we continue by tunning XGBoost as well. ```{r XGBoost Tuning} boost_tune_mdl <- boost_tree(learn_rate = tune(), tree_depth = tune(), sample_size = tune()) |> set_engine("xgboost") |> set_mode("classification") ``` ```{r XGBoost Tuning Workflow} boost_tune_wkfl <- boost_wkfl |> update_model(boost_tune_mdl) boost_grid <- grid_random(extract_parameter_set_dials(boost_tune_wkfl), size = 200) ``` ```{r XGBoost Tuned Workflow} boost_tuning <- boost_tune_wkfl |> tune_grid(resamples = event_folds, grid = boost_grid, metrics = event_metrics) autoplot(boost_tuning) ``` ```{r XGBoost Show Best} show_best(boost_tuning, metric = "roc_auc", n = 3) |> gt() |> fmt_number(decimals = 3) best_boost_mdl <- boost_tuning |> select_best(metric = "roc_auc") ``` ```{r XGBoost Finalize Workflow} boost_wkfl <- boost_tune_wkfl |> finalize_workflow(best_boost_mdl) boost_wkfl_fit <- boost_wkfl |> last_fit(split = event_split) ``` ```{r XGBoost Tuned Fit and Evaluation Metrics} boost_results <- boost_wkfl_fit |> collect_predictions() |> event_metrics(truth = wheelchair_accessible, estimate = .pred_class, .pred_Yes, event_level = "second") boost_results |> gt() |> fmt_number(decimals = 3) ``` XGBoost outperformed random forest model according to evaluation metrics in general. All of them are acceptable. Consequently, XGBoost is the final model. Because the criterion is met, more complex models are dismissed. Final Model is XGBoost to detect which venues still haven't special ramps for their audience. Criterion is met, Specificity \> 0.67 Accuracy and ROC AUC improved as well! # Business Focus In marketing, like other departments, resources such as money, time, and people are scarce. That is, we need to identify prospect customers so that we manage to forge a close relatioship and offer our product to solve their issue. CPC or Cost per Contact is defined as the money or any other kind of cost that the company incur to persuade one person to make a purchase. By targeting the customers with high probability of purchasing, we avoid calling every single venue, and with less time and money we accomplish what we are looking for. ## Addressing the Problem We have developed a model which is powerful to select only venues with high chances of ordering. Marketing team calls merely those prospects, so CPC is lowered and the company can contact potential customers before anyone else. ## Recommandations Recently, digital marketing has become one of the prominent tools to contact customers and offer the service they're looking for. Instead of calling predicted prospects, we can do some preliminary actions: 1. Contact via email explaining why the prospects would be better off if they order a ramp for wheelchair users. 2. What features distinguish venues based on their need to use ramps? For instance, it would be instructive to give some statistics about venues with similar features and trigger feeling of necessity to have the ramp. 3. Sending a video or even a simulation is helpful because customers can have a sense where the ramp will be constructed and whether they're ok with it or not.