venue_name | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events | Wheelchair accessible |
---|---|---|---|---|---|---|---|---|---|
techspace aldgate east | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
green rooms hotel | TRUE | 1 | TRUE | FALSE | 40.00000 | 120 | 80.0000 | TRUE | FALSE |
148 leadenhall street | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
conway hall | FALSE | 0 | TRUE | FALSE | 35.04545 | 60 | 60.0000 | FALSE | FALSE |
gridiron building | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
kimpton fitzroy london | TRUE | 1 | TRUE | FALSE | 6.00000 | 0 | 112.7159 | TRUE | FALSE |
Prediction of Prospective Wheelchair Ramp Buyers in R
As part of DataCamp's professional data scientist certification exam, the candidate is given a fictitious case study. It calls not only for the standard tools and skills expected of a data-savvy analyst, but also for business insight into how the company can overcome an obstacle to making more profit.
Case Study
Company Background
National Accessibility currently installs wheelchair ramps for office buildings and schools. The marketing manager wants the company to start installing ramps for event venues as well. According to a new survey, approximately 40% of event venues are not wheelchair accessible. However, it is not easy to know whether a venue already has a ramp installed.
It is a waste of time to contact venues that already have a ramp installed, and it also looks bad for the company. They would like the help of the data science team in predicting which venues already have a ramp installed.
Customer Question
The marketing manager would like to know:
- Can you develop a model to predict whether an event venue already has a wheelchair ramp installed?
Success Criteria
To reduce the amount of time wasted by the company contacting venues that already have a ramp, at least two-thirds of venues predicted to be without a ramp should not have a ramp.
Data
In the CSV file, the following variables have been gathered:
- venue_name: Character, name of the venue.
- Loud music / events: Character, whether the venue hosts loud events (True) or not (False).
- Venue provides alcohol: Numeric, whether the venue provides alcohol (1) or not (0).
- Wi-Fi: Character, whether the venue provides wi-fi (True) or not (False).
- supervenue: Character, whether the venue qualifies as a supervenue (True) or not (False).
- U-Shaped_max: Numeric, the total capacity of the u-shaped portion of the theater.
- max_standing: Numeric, the total standing capacity of the venue.
- Theatre_max: Numeric, the total capacity of the theatre.
- Promoted / ticketed events: Character, whether the venue hosts promoted/ticket events (True) or not (False).
- Wheelchair accessible: Character, whether the venue is wheelchair accessible (True) or not (False).
Exploratory Data Analysis
Basic Exploration
First of all, it’s recommended to take a look at the data and its structure.
Variable names are not standard, so they should be turned into lower case with underscores (snake_case).
[1] "venue_name" "loud_music_events"
[3] "venue_provides_alcohol" "wi_fi"
[5] "supervenue" "u_shaped_max"
[7] "max_standing" "theatre_max"
[9] "promoted_ticketed_events" "wheelchair_accessible"
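The renaming step can be sketched in base R as follows. This is a minimal sketch, not necessarily the author's code: the helper to_snake is hypothetical, and a package function such as janitor::clean_names() would do the same job.

```r
# Original column names as they appear in the CSV header
raw_names <- c(
  "venue_name", "Loud music / events", "Venue provides alcohol", "Wi-Fi",
  "supervenue", "U-Shaped_max", "max_standing", "Theatre_max",
  "Promoted / ticketed events", "Wheelchair accessible"
)

# Hypothetical helper: lower-case, collapse runs of non-alphanumerics
# into a single underscore, then trim leading/trailing underscores
to_snake <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

clean <- to_snake(raw_names)
print(clean)
```

With a data frame df loaded from the CSV, names(df) <- to_snake(names(df)) would apply the same renaming in place.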
Next, we check whether each variable is stored with an appropriate data type.
Rows: 3,910
Columns: 10
$ venue_name <chr> "techspace aldgate east", "green rooms hotel"…
$ loud_music_events <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ venue_provides_alcohol <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, …
$ wi_fi <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ supervenue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ u_shaped_max <dbl> 35.04545, 40.00000, 35.04545, 35.04545, 35.04…
$ max_standing <dbl> 0, 120, 0, 60, 0, 0, 0, 200, 0, 180, 300, 46,…
$ theatre_max <dbl> 112.7159, 80.0000, 112.7159, 60.0000, 112.715…
$ promoted_ticketed_events <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ wheelchair_accessible <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
There are 3910 rows (records) and 10 columns in this data set. The last column, wheelchair_accessible, is the target variable.
Missing Values
There are 0 missing values in this data set, and the data types shown in the glimpse output imply that the columns contain no odd values. Therefore, we can continue our analysis with no worries.
Cross Field Validation
In this data, theatre_max must be greater than u_shaped_max and max_standing. We check whether this constraint is met.
After enforcing the constraint, the number of rows in the data set dropped to 2329, implying that the constraint was violated in some cases.
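A cross-field check of this kind can be sketched in base R on a small made-up data frame. The rows below are illustrative, not taken from the CSV, and the strict comparison follows the wording above; use >= instead if venues with equal capacities should be kept.

```r
# Toy data: rows "b" and "c" violate the constraint
venues <- data.frame(
  venue_name   = c("a", "b", "c", "d"),
  u_shaped_max = c(35, 40, 35, 6),
  max_standing = c(0, 120, 60, 0),
  theatre_max  = c(112.7, 80, 60, 112.7)
)

# Keep only rows where theatre capacity exceeds both other capacities
keep <- venues$theatre_max > venues$u_shaped_max &
  venues$theatre_max > venues$max_standing
valid <- venues[keep, ]

# Number of rows dropped by the constraint
dropped <- nrow(venues) - nrow(valid)
print(valid)
print(dropped)
```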
Duplicated Venue Observations
venue_name | loud_music_events | venue_provides_alcohol | wi_fi | supervenue | u_shaped_max | max_standing | theatre_max | promoted_ticketed_events | wheelchair_accessible |
---|---|---|---|---|---|---|---|---|---|
1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 55.0000 | FALSE | FALSE |
1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 king street | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 king street | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 king william street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 king william street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
1 king william street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | TRUE |
1 wimpole street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 80.0000 | TRUE | TRUE |
Duplicated observations must be removed. To do so, the data is grouped by venue and summarized with the median function, so that each venue keeps a single (median) value per variable.
The next table shows the first few rows of the per-venue record counts.
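The aggregation can be sketched in base R as below. The data frame is invented for illustration, and converting logical columns to 0/1 before taking the median is an assumption about how the booleans were handled; a dplyr pipeline with group_by() and summarise() would be an equivalent route.

```r
# Toy data with repeated venues (values are illustrative)
venues <- data.frame(
  venue_name = c("a", "a", "a", "b", "b"),
  theatre_max = c(112.7, 55, 112.7, 80, 80),
  wheelchair_accessible = c(FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Convert logical columns to 0/1 so the median is well defined
is_lgl <- vapply(venues, is.logical, logical(1))
venues[is_lgl] <- lapply(venues[is_lgl], as.numeric)

# One row per venue: median of every other column within each venue
deduped <- aggregate(. ~ venue_name, data = venues, FUN = median)
print(deduped)
```

With an even number of records, the median of a 0/1 column can come out as 0.5, so a rounding rule may be needed before the column is treated as binary again.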
venue_name | n |
---|---|
1 cornhill | 5 |
1 king street | 3 |
1 king william street | 4 |
1 wimpole street | 3 |
107 cheapside | 6 |
110 rochester row | 5 |
The following table shows the first few rows of the data set after aggregating duplicated cases.
venue_name | loud_music_events | venue_provides_alcohol | wi_fi | supervenue | u_shaped_max | max_standing | theatre_max | promoted_ticketed_events | wheelchair_accessible |
---|---|---|---|---|---|---|---|---|---|
1 cornhill | 0 | 1 | 1 | 0 | 35 | 0 | 113 | 0 | 0 |
1 king street | 0 | 0 | 1 | 0 | 35 | 0 | 113 | 0 | 0 |
1 king william street | 0 | 1 | 1 | 0 | 35 | 0 | 113 | 0 | 0 |
1 wimpole street | 0 | 1 | 1 | 0 | 35 | 0 | 113 | 1 | 1 |
107 cheapside | 0 | 0 | 1 | 1 | 35 | 0 | 91 | 1 | 1 |
110 rochester row | 0 | 0 | 1 | 0 | 16 | 40 | 40 | 1 | 1 |
Again, we should check whether any duplicated cases remain. Now 0 duplicated cases exist in the data set, so we go on to the next part.
Summary Statistics
For the summary, boolean and binary variables are temporarily transformed into labeled categorical (factor) variables. Later, in the model preprocessing part, these categorical variables are converted back into binary (dummy) variables.
Before modeling, it’s imperative to get a general idea of the data set. In this section, summary statistics are provided for the numerical and categorical (binary) variables.
Data Frame Summary
Dimensions: 1066 x 9
Duplicates: 446
No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
---|---|---|---|---|
1 | loud_music_events [factor] | 1. No, 2. Yes | 738 (69.2%), 328 (30.8%) | 0 (0.0%) |
2 | venue_provides_alcohol [factor] | 1. No, 2. Yes | 375 (35.2%), 691 (64.8%) | 0 (0.0%) |
3 | wi_fi [factor] | 1. No, 2. Yes | 97 (9.1%), 969 (90.9%) | 0 (0.0%) |
4 | supervenue [factor] | 1. No, 2. Yes | 979 (91.8%), 87 (8.2%) | 0 (0.0%) |
5 | u_shaped_max [numeric] | Mean (sd): 33.7 (7.6); min < med < max: 4 < 35 < 105; IQR (CV): 0 (0.2) | 45 distinct values | 0 (0.0%) |
6 | max_standing [numeric] | Mean (sd): 50.4 (152.9); min < med < max: 0 < 30 < 4000; IQR (CV): 60 (3) | 81 distinct values | 0 (0.0%) |
7 | theatre_max [numeric] | Mean (sd): 122.1 (159.4); min < med < max: 8 < 113 < 4000; IQR (CV): 0 (1.3) | 108 distinct values | 0 (0.0%) |
8 | promoted_ticketed_events [factor] | 1. No, 2. Yes | 704 (66.0%), 362 (34.0%) | 0 (0.0%) |
9 | wheelchair_accessible [factor] | 1. No, 2. Yes | 627 (58.8%), 439 (41.2%) | 0 (0.0%) |
Data Visualization
Maximum Capacity
Highly Skewed! Better to plot with log scale.
The reason our box plot has been compacted into a thick line is the distribution of the theatre_max variable. The majority of values are slightly above 100, and the other values are detected as outliers.
U-Shaped Maximum Capacity
Outliers still exist, but the density distribution shows less skewness.
Maximum Standing Capacity
Log-scaled plots reveal more about the distributions. Note that there are 270 zeros in this variable, so we add one unit before taking the logarithm, because log(0) is undefined.
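The effect of the one-unit shift is easy to see on a few made-up capacities (not the real column). In R, log(0) evaluates to -Inf, which a plot cannot place on the axis:

```r
# Illustrative standing capacities, including zeros
max_standing <- c(0, 120, 0, 60, 200)

# Without the shift, zeros map to -Inf
log_raw <- log(max_standing)

# Adding one unit keeps every value finite; log1p(x) computes log(1 + x)
log_shifted <- log(max_standing + 1)

print(log_raw)
print(log_shifted)
```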
Categorical Variables
Venues with loud-music events are almost half of those with no loud music.
Venues providing alcohol are almost twice the number of venues with no alcohol.
Most venues provide wi-fi service to their customers, so this feature doesn’t carry much information. It is a candidate for exclusion due to its low variation.
Few venues are labelled “Supervenue”. This variable is unlikely to be useful in explaining the probability of the target labels.
Number of venues with promoted events is almost half of the number of those that have no such events.
Target Variable
Whether a venue has a wheelchair ramp is our target variable, and we attempt to predict the probability of each outcome for each case.
Target variable is approximately balanced.
Relationship Between Features
theatre_max has correlation with max_standing and u_shaped_max, which was expected beforehand.
Feature Distributions By Target Variable
Venues with no wheelchair ramp also show less variance in their maximum capacity than venues that are wheelchair accessible.
Correlation Coefficients
term | u_shaped_max | max_standing | theatre_max |
---|---|---|---|
u_shaped_max | NA | 0.120 | 0.133 |
max_standing | 0.120 | NA | 0.913 |
theatre_max | 0.133 | 0.913 | NA |
Collinearity exists between theatre_max and max_standing!
Modeling with Tidymodels
The objective of the following analysis is to predict whether a venue already has ramp accessibility. In other words, there are two possible outcomes:
- If wheelchair_accessible is TRUE, the Marketing Department dismisses the venue.
- If wheelchair_accessible is FALSE, the Marketing Department contacts the venue.
The problem at hand is a binary classification task in supervised machine learning.
KPI to compare models
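In the following analyses, our main focus is specificity, defined as the number of true negative predictions divided by the number of actual negatives (venues without a ramp). The business criterion is that, out of all venues predicted to have no ramp, at least 67% must be predicted correctly. Strictly speaking, that proportion is the negative predictive value; specificity is used here as the KPI that stands in for it.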
Specificity is the KPI to evaluate the analyses.
Criterion : specificity > 67%
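Both quantities can be read off a confusion matrix. The sketch below uses made-up labels (not the article's predictions) and treats No, meaning no ramp, as the negative class. Specificity is the share of truly ramp-less venues that are predicted as such, while the share of predicted ramp-less venues that truly lack a ramp is the negative predictive value (NPV), which is what the success criterion is worded in terms of.

```r
# Hypothetical truth and predictions for nine venues
lv <- c("No", "Yes")
truth <- factor(c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"), levels = lv)
pred  <- factor(c("No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "No"), levels = lv)

tn <- sum(truth == "No"  & pred == "No")   # no ramp, predicted no ramp
fp <- sum(truth == "No"  & pred == "Yes")  # no ramp, predicted ramp
fn <- sum(truth == "Yes" & pred == "No")   # ramp, predicted no ramp

specificity <- tn / (tn + fp)  # among venues that truly have no ramp
npv         <- tn / (tn + fn)  # among venues predicted to have no ramp

print(c(specificity = specificity, npv = npv))
```

The two values can differ noticeably on the same predictions, so it is worth reporting both when checking the two-thirds threshold.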
Preprocessing
To use step_log, we are better off changing the 0 values in max_standing to 1, because the logarithm of 0 is undefined.
First few rows of the training data after preprocessing steps are as follows:
max_standing | theatre_max | wheelchair_accessible | loud_music_events_Yes | venue_provides_alcohol_Yes | promoted_ticketed_events_Yes |
---|---|---|---|---|---|
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
0.103 | 0.134 | No | 1.000 | 1.000 | 1.000 |
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
First few rows of the test data after preprocessing steps are as follows:
max_standing | theatre_max | wheelchair_accessible | loud_music_events_Yes | venue_provides_alcohol_Yes | promoted_ticketed_events_Yes |
---|---|---|---|---|---|
−1.570 | 0.134 | No | 0.000 | 1.000 | 0.000 |
−1.570 | 0.134 | No | 0.000 | 1.000 | 0.000 |
0.490 | −2.065 | Yes | 0.000 | 0.000 | 1.000 |
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
−1.570 | 0.134 | Yes | 0.000 | 1.000 | 0.000 |
−1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
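A recipe along the following lines would produce columns like those above. It is a sketch under assumptions: the toy data frame is invented, and the exact steps (dropping wi_fi, supervenue, and u_shaped_max with step_rm, log-transforming both capacity variables, normalizing before creating dummies) are inferred from the output rather than taken from the author's code.

```r
library(recipes)

lv <- c("No", "Yes")
# Invented training data with the same column names as the article
train <- data.frame(
  loud_music_events        = factor(c("No", "Yes", "No", "No", "Yes", "No"), levels = lv),
  venue_provides_alcohol   = factor(c("No", "Yes", "Yes", "No", "Yes", "No"), levels = lv),
  wi_fi                    = factor(c("Yes", "Yes", "Yes", "No", "Yes", "Yes"), levels = lv),
  supervenue               = factor(c("No", "No", "No", "No", "Yes", "No"), levels = lv),
  u_shaped_max             = c(35, 40, 35, 35, 16, 35),
  max_standing             = c(0, 120, 0, 60, 40, 0),
  theatre_max              = c(113, 150, 113, 80, 60, 113),
  promoted_ticketed_events = factor(c("No", "Yes", "No", "No", "Yes", "No"), levels = lv),
  wheelchair_accessible    = factor(c("No", "No", "No", "Yes", "Yes", "No"), levels = lv)
)

rec <- recipe(wheelchair_accessible ~ ., data = train) |>
  # Assumed: low-information or redundant predictors are dropped
  step_rm(wi_fi, supervenue, u_shaped_max) |>
  # Replace 0 with 1 so the log transform stays finite
  step_mutate(max_standing = ifelse(max_standing == 0, 1, max_standing)) |>
  step_log(max_standing, theatre_max) |>
  # Centre and scale the numeric predictors
  step_normalize(all_numeric_predictors()) |>
  # Turn the Yes/No factors into 0/1 dummy columns
  step_dummy(all_nominal_predictors())

# Estimate the steps on the training data and return the processed data
baked <- bake(prep(rec), new_data = NULL)
print(baked)
```

The same prepped recipe would then be applied to the test set with bake(..., new_data = test), so that the test data is scaled with the training means and standard deviations.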
Linear Logistic Regression
Let’s start the analysis with the simplest form of model for classification. Linear logistic regression assumes a linear relationship between the predictors and the log-odds of the outcome taking the value 1.
The coefficients, standard errors, z-statistics, and p-values of the logistic model fitted on the training set are shown in the next table.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | −0.837 | 0.135 | −6.198 | 0.000 |
max_standing | 0.024 | 0.076 | 0.320 | 0.749 |
theatre_max | −0.057 | 0.072 | −0.790 | 0.430 |
loud_music_events_Yes | −0.221 | 0.171 | −1.297 | 0.194 |
venue_provides_alcohol_Yes | 0.507 | 0.162 | 3.131 | 0.002 |
promoted_ticketed_events_Yes | 0.625 | 0.158 | 3.943 | 0.000 |
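A fit of this kind can be sketched with base R's glm() as a stand-in for the tidymodels workflow used in the article. The data are simulated and the coefficients used to simulate them are arbitrary, so the numbers will not match the table above. The point is the shape of the output and the fact that the model is linear on the log-odds scale.

```r
set.seed(42)
n <- 400

# Simulated binary predictors named after two of the dummy columns
sim <- data.frame(
  venue_provides_alcohol_Yes   = rbinom(n, 1, 0.65),
  promoted_ticketed_events_Yes = rbinom(n, 1, 0.35)
)

# Arbitrary "true" model: linear in the log-odds, mapped to a probability
p <- plogis(-0.8 +
  0.5 * sim$venue_provides_alcohol_Yes +
  0.6 * sim$promoted_ticketed_events_Yes)
sim$wheelchair_accessible <- rbinom(n, 1, p)

fit <- glm(wheelchair_accessible ~ ., data = sim, family = binomial)

# Estimate, standard error, z value, and p-value per term
coefs <- summary(fit)$coefficients
print(round(coefs, 3))

# Predictions on the log-odds scale and on the probability scale
log_odds <- predict(fit, type = "link")
probs    <- predict(fit, type = "response")
```

The probability is the inverse-logit of the linear predictor, so plogis(log_odds) reproduces probs.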
The model fitted on the training set is used for prediction on the test set. Evaluation metrics such as accuracy, sensitivity, specificity, and ROC AUC are given below:
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.60 |
sens | binary | 0.32 |
spec | binary | 0.80 |
roc_auc | binary | 0.63 |
Accuracy and ROC AUC are not good enough, despite reasonable specificity. Since the relationship between the predictors and the outcome may be non-linear, let’s try a basic tree-based model.
Decision Tree
A decision tree is capable of capturing nonlinear contributions of the input variables when predicting the probability of the target variable. Although more complex algorithms exist and might perform better, we start with a rather simple one.
Again, the evaluation metrics have been gathered in the next table:
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.66 |
sens | binary | 0.38 |
spec | binary | 0.87 |
roc_auc | binary | 0.67 |
The results improve on those of the linear logistic regression. Note that tuning the model can help enhance accuracy and other relevant metrics.
Tuning Decision Tree
cost_complexity | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|
0.000 | 27.000 | roc_auc | binary | 0.657 | 3.000 | 0.027 | Preprocessor1_Model003 |
0.000 | 28.000 | roc_auc | binary | 0.657 | 3.000 | 0.027 | Preprocessor1_Model023 |
0.000 | 27.000 | roc_auc | binary | 0.657 | 3.000 | 0.027 | Preprocessor1_Model049 |
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.659 |
sens | binary | 0.443 |
spec | binary | 0.810 |
roc_auc | binary | 0.660 |
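A tuning run of this shape can be sketched with tidymodels as follows. The data are simulated stand-ins for the preprocessed training set, the three folds are an assumption based on the n = 3 column in the tuning table, and the grid size of 10 is chosen only to keep the example quick.

```r
library(tidymodels)

set.seed(123)
n <- 300
# Simulated stand-in for the preprocessed training data
train <- tibble(
  max_standing = rnorm(n),
  theatre_max = rnorm(n),
  promoted_ticketed_events_Yes = rbinom(n, 1, 0.35)
)
p <- plogis(-0.8 + 0.9 * train$promoted_ticketed_events_Yes + 0.5 * train$max_standing)
train$wheelchair_accessible <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))

# Decision tree with two hyperparameters marked for tuning
tree_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")

tree_wf <- workflow() |>
  add_formula(wheelchair_accessible ~ .) |>
  add_model(tree_spec)

# Three-fold cross-validation, stratified on the outcome
folds <- vfold_cv(train, v = 3, strata = wheelchair_accessible)

tuned <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = 10,
  metrics = metric_set(roc_auc)
)

# Top candidates by mean ROC AUC, and the single best combination
show_best(tuned, metric = "roc_auc", n = 3)
best <- select_best(tuned, metric = "roc_auc")
print(best)
```

The selected parameters would then be plugged back in with finalize_workflow(tree_wf, best) before refitting on the full training set and evaluating on the test set.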
Other metrics still need to improve, so let’s try a random forest model, which is typically more powerful than a single decision tree.
Random Forest
Random forest is a more complex, ensemble form of the decision tree model: it consists of a multitude of trees, each trained on a distinct bootstrap sample of the training set. In a classification problem, majority voting across the trees is the deciding criterion for assigning each case to one of the classes.
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.678 |
sens | binary | 0.420 |
spec | binary | 0.857 |
roc_auc | binary | 0.717 |
The results are better than those of the decision tree model; however, the model still needs to be tuned for better performance.
trees | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|
492.000 | 22.000 | roc_auc | binary | 0.683 | 3.000 | 0.024 | Preprocessor1_Model021 |
682.000 | 20.000 | roc_auc | binary | 0.682 | 3.000 | 0.026 | Preprocessor1_Model029 |
91.000 | 34.000 | roc_auc | binary | 0.682 | 3.000 | 0.021 | Preprocessor1_Model013 |
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.682 |
sens | binary | 0.409 |
spec | binary | 0.873 |
roc_auc | binary | 0.719 |
XGBoost
Last but not least, we move on to another complex machine learning model called XGBoost. It is an optimized gradient-boosting implementation known for its speed, for often outperforming single-model approaches, and for state-of-the-art performance in many ML tasks.
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.692 |
sens | binary | 0.455 |
spec | binary | 0.857 |
roc_auc | binary | 0.703 |
Now, we continue by tuning XGBoost as well.
tree_depth | learn_rate | sample_size | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
4.000 | 0.080 | 0.570 | roc_auc | binary | 0.684 | 3.000 | 0.008 | Preprocessor1_Model128 |
1.000 | 0.264 | 0.237 | roc_auc | binary | 0.683 | 3.000 | 0.012 | Preprocessor1_Model176 |
3.000 | 0.027 | 0.520 | roc_auc | binary | 0.681 | 3.000 | 0.014 | Preprocessor1_Model015 |
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.706 |
sens | binary | 0.432 |
spec | binary | 0.897 |
roc_auc | binary | 0.715 |
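In general, the tuned XGBoost model outperformed the tuned random forest: accuracy (0.706 vs. 0.682), sensitivity (0.432 vs. 0.409), and specificity (0.897 vs. 0.873) are all higher, while ROC AUC is marginally lower (0.715 vs. 0.719). All of these values are acceptable. Consequently, XGBoost is the final model. Because the criterion is met, more complex models are not considered.
The final model, XGBoost, is used to detect which venues still have no ramp for their audience.
The criterion is met (specificity 0.897 > 0.67), and accuracy and ROC AUC improved over the logistic regression baseline as well.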
Business Focus
In marketing, as in other departments, resources such as money, time, and people are scarce. That is, we need to identify prospective customers so that we can forge a close relationship with them and offer our product to solve their problem.
CPC, or Cost per Contact, is defined as the money, or any other kind of cost, that the company incurs to persuade one person to make a purchase. By targeting customers with a high probability of purchasing, we avoid calling every single venue and accomplish our goal with less time and money.
Addressing the Problem
We have developed a model that is able to select only the venues with a high chance of ordering. The marketing team calls only those prospects, so CPC is lowered and the company can reach potential customers before anyone else.
Recommendations
Recently, digital marketing has become one of the prominent tools to contact customers and offer the service they’re looking for. Instead of calling predicted prospects, we can do some preliminary actions:
- Contact prospects via email, explaining why they would be better off ordering a ramp for wheelchair users.
- Highlight the features that distinguish venues by their need for a ramp. For instance, it would be instructive to share statistics about venues with similar features, to trigger a feeling that the ramp is necessary.
- Send a video or even a simulation, so that customers get a sense of where the ramp will be constructed and whether they are OK with it.