Prediction of Prospective Wheelchair Ramp Buyers in R

Author

Iman Mousavi

Published

August 5, 2022

As part of DataCamp's Professional Data Scientist certification exam, candidates are given a fictitious case study. It requires not only the standard tools and skills expected of a data-savvy practitioner, but also business insight into how the company can overcome an obstacle to making more profit.

Wheelchair Ramp for Venues

Case Study

Company Background

National Accessibility currently installs wheelchair ramps for office buildings and schools. The marketing manager wants the company to start installing ramps for event venues as well. According to a new survey, approximately 40% of event venues are not wheelchair accessible. However, it is not easy to know whether a venue already has a ramp installed.

It is a waste of time to contact venues that already have a ramp installed, and it also looks bad for the company. They would like the help of the data science team in predicting which venues already have a ramp installed.

Customer Question

The marketing manager would like to know:
- Can you develop a model to predict whether an event venue already has a wheelchair ramp installed?

Success Criteria

To reduce the amount of time wasted by the company contacting venues that already have a ramp, at least two-thirds of venues predicted to be without a ramp should not have a ramp.

Data

In the CSV file, the following variables have been gathered:
- venue_name: Character, name of the venue.
- Loud music / events: Character, whether the venue hosts loud events (True) or not (False).
- Venue provides alcohol: Numeric, whether the venue provides alcohol (1) or not (0).
- Wi-Fi: Character, whether the venue provides wi-fi (True) or not (False).
- supervenue: Character, whether the venue qualifies as a supervenue (True) or not (False).
- U-Shaped_max: Numeric, the total capacity of the u-shaped portion of the theater.
- max_standing: Numeric, the total standing capacity of the venue.
- Theatre_max: Numeric, the total capacity of the theatre.
- Promoted / ticketed events: Character, whether the venue hosts promoted/ticket events (True) or not (False).
- Wheelchair accessible: Character, whether the venue is wheelchair accessible (True) or not (False).

Exploratory Data Analysis

Basic Exploration

First of all, it’s recommended to take a look at the data and its structure.

| venue_name | Loud music / events | Venue provides alcohol | Wi-Fi | supervenue | U-Shaped_max | max_standing | Theatre_max | Promoted / ticketed events | Wheelchair accessible |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| techspace aldgate east | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| green rooms hotel | TRUE | 1 | TRUE | FALSE | 40.00000 | 120 | 80.0000 | TRUE | FALSE |
| 148 leadenhall street | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| conway hall | FALSE | 0 | TRUE | FALSE | 35.04545 | 60 | 60.0000 | FALSE | FALSE |
| gridiron building | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| kimpton fitzroy london | TRUE | 1 | TRUE | FALSE | 6.00000 | 0 | 112.7159 | TRUE | FALSE |

The variable names do not follow a consistent convention, so they are converted to lower case with underscores (snake_case).

 [1] "venue_name"               "loud_music_events"       
 [3] "venue_provides_alcohol"   "wi_fi"                   
 [5] "supervenue"               "u_shaped_max"            
 [7] "max_standing"             "theatre_max"             
 [9] "promoted_ticketed_events" "wheelchair_accessible"   
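The renaming step could be sketched in base R as follows. The helper `to_snake_case` is a hypothetical name introduced here for illustration; the original analysis may well have used a package function instead, and this sketch only reproduces the names printed above.

```r
# Hypothetical helper (not from the original post): convert names to snake_case.
to_snake_case <- function(x) {
  x <- tolower(x)                    # lower-case everything
  x <- gsub("[^a-z0-9]+", "_", x)    # each run of non-alphanumerics becomes one underscore
  gsub("^_+|_+$", "", x)             # trim leading and trailing underscores
}

# Column names as listed in the data description
raw_names <- c("venue_name", "Loud music / events", "Venue provides alcohol",
               "Wi-Fi", "supervenue", "U-Shaped_max", "max_standing",
               "Theatre_max", "Promoted / ticketed events",
               "Wheelchair accessible")

clean_names <- to_snake_case(raw_names)
print(clean_names)
```

With a data frame `df` at hand, the same helper would be applied as `names(df) <- to_snake_case(names(df))`.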

Next, we check whether each variable is stored with an appropriate data type.

Rows: 3,910
Columns: 10
$ venue_name               <chr> "techspace aldgate east", "green rooms hotel"…
$ loud_music_events        <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ venue_provides_alcohol   <dbl> 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, …
$ wi_fi                    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ supervenue               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ u_shaped_max             <dbl> 35.04545, 40.00000, 35.04545, 35.04545, 35.04…
$ max_standing             <dbl> 0, 120, 0, 60, 0, 0, 0, 200, 0, 180, 300, 46,…
$ theatre_max              <dbl> 112.7159, 80.0000, 112.7159, 60.0000, 112.715…
$ promoted_ticketed_events <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE…
$ wheelchair_accessible    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

The data set has 3,910 rows (records) and 10 columns. The last column, wheelchair_accessible, is the target variable.

Missing Values

There are 0 missing values in this data set, and the data types reported in the glimpse output imply that the columns contain no odd values. Therefore, we can continue our analysis without concern.
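A missing-value check of this kind might look like the sketch below. It uses base R on a toy data frame whose values (including the deliberately missing ones) are made up for illustration; on the real data both counts would be zero.

```r
# Toy data with two deliberately missing values (hypothetical, not the real data)
venues <- data.frame(
  venue_name   = c("a", "b", "c"),
  max_standing = c(0, NA, 60),
  wi_fi        = c(TRUE, TRUE, NA),
  stringsAsFactors = FALSE
)

na_per_column <- colSums(is.na(venues))  # missing values in each column
total_na      <- sum(is.na(venues))      # missing values in the whole data frame

print(na_per_column)
print(total_na)
```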

Cross Field Validation

In this data, theatre_max must be greater than both u_shaped_max and max_standing. We check whether this constraint is met.

After enforcing the constraint, the number of rows in the data set dropped to 2329, implying that some cases violated it.
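The cross-field filter can be sketched in base R as below. The rows are toy values, and the text does not say whether ties are allowed; the sketch assumes a strict inequality, so `>` would become `>=` if equal capacities are acceptable.

```r
# Toy rows (hypothetical values, not the real data set)
venues <- data.frame(
  venue_name   = c("a", "b", "c", "d"),
  u_shaped_max = c(35, 40, 20, 50),
  max_standing = c(0, 120, 60, 10),
  theatre_max  = c(112, 80, 100, 45),
  stringsAsFactors = FALSE
)

# Keep only rows where theatre capacity exceeds both other capacities
valid <- subset(venues, theatre_max > u_shaped_max & theatre_max > max_standing)

# Number of rows that violated the constraint
n_dropped <- nrow(venues) - nrow(valid)

print(valid)
print(n_dropped)
```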

Duplicated Venue Observations

| venue_name | loud_music_events | venue_provides_alcohol | wi_fi | supervenue | u_shaped_max | max_standing | theatre_max | promoted_ticketed_events | wheelchair_accessible |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 55.0000 | FALSE | FALSE |
| 1 cornhill | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 king street | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 king street | FALSE | 0 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 king william street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 king william street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | FALSE |
| 1 king william street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 112.7159 | FALSE | TRUE |
| 1 wimpole street | FALSE | 1 | TRUE | FALSE | 35.04545 | 0 | 80.0000 | TRUE | TRUE |

Duplicated observations must be removed. To do so, the data is grouped by venue and summarized with the median function, so that each variable has a single value per venue.

The next table shows, for the first few venues, the number of rows that describe the same venue.
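A minimal base R sketch of this aggregation is given below. The rows are a small illustration modeled on the duplicated venues listed above, restricted to three columns, and the logical target is converted to 0/1 first so that a median is well defined. The original analysis may have used grouped summaries from a different package.

```r
# A few duplicated rows (illustrative subset of columns)
venues <- data.frame(
  venue_name   = c("1 cornhill", "1 cornhill", "1 cornhill",
                   "1 king street", "1 king street"),
  max_standing = c(0, 0, 0, 0, 0),
  theatre_max  = c(112.7159, 55, 112.7159, 112.7159, 112.7159),
  wheelchair_accessible = c(FALSE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Treat the logical column as 0/1 so the median can be computed
venues$wheelchair_accessible <- as.numeric(venues$wheelchair_accessible)

# Number of rows per venue
rows_per_venue <- table(venues$venue_name)

# One row per venue: the median of every remaining column
value_cols <- c("max_standing", "theatre_max", "wheelchair_accessible")
deduped <- aggregate(venues[, value_cols],
                     by = list(venue_name = venues$venue_name),
                     FUN = median)

print(rows_per_venue)
print(deduped)

# 0 means no venue name appears more than once after aggregation
remaining_duplicates <- anyDuplicated(deduped$venue_name)
```

Taking the median of a 0/1 column can yield 0.5 when a venue has an even number of rows split between the two values, so such ties would need an explicit rule before the column is treated as binary again.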

| venue_name | n |
| --- | --- |
| 1 cornhill | 5 |
| 1 king street | 3 |
| 1 king william street | 4 |
| 1 wimpole street | 3 |
| 107 cheapside | 6 |
| 110 rochester row | 5 |

The following table shows the first few rows of the data set after aggregating duplicated cases.

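| venue_name | loud_music_events | venue_provides_alcohol | wi_fi | supervenue | u_shaped_max | max_standing | theatre_max | promoted_ticketed_events | wheelchair_accessible |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 cornhill | 0 | 1 | 1 | 0 | 35 | 0 | 113 | 0 | 0 |
| 1 king street | 0 | 0 | 1 | 0 | 35 | 0 | 113 | 0 | 0 |
| 1 king william street | 0 | 1 | 1 | 0 | 35 | 0 | 113 | 0 | 0 |
| 1 wimpole street | 0 | 1 | 1 | 0 | 35 | 0 | 113 | 1 | 1 |
| 107 cheapside | 0 | 0 | 1 | 1 | 35 | 0 | 91 | 1 | 1 |
| 110 rochester row | 0 | 0 | 1 | 0 | 16 | 40 | 40 | 1 | 1 |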

Again, we check whether any duplicated cases remain. Now 0 duplicated cases exist in the data set, so we move on to the next part.

Summary Statistics

For the summary, boolean and binary variables are temporarily converted to labeled categorical variables (factors). Later, in the model preprocessing step, these categorical variables are converted back into binary (dummy) variables.

Before modeling, it is imperative to get a general idea of the data set. This section provides summary statistics for the numerical and categorical (binary) variables.

Data Frame Summary

Dimensions: 1066 x 9
Duplicates: 446

| No | Variable | Stats / Values | Freqs (% of Valid) | Missing |
| --- | --- | --- | --- | --- |
| 1 | loud_music_events [factor] | 1. No; 2. Yes | 738 (69.2%); 328 (30.8%) | 0 (0.0%) |
| 2 | venue_provides_alcohol [factor] | 1. No; 2. Yes | 375 (35.2%); 691 (64.8%) | 0 (0.0%) |
| 3 | wi_fi [factor] | 1. No; 2. Yes | 97 (9.1%); 969 (90.9%) | 0 (0.0%) |
| 4 | supervenue [factor] | 1. No; 2. Yes | 979 (91.8%); 87 (8.2%) | 0 (0.0%) |
| 5 | u_shaped_max [numeric] | Mean (sd): 33.7 (7.6); min < med < max: 4 < 35 < 105; IQR (CV): 0 (0.2) | 45 distinct values | 0 (0.0%) |
| 6 | max_standing [numeric] | Mean (sd): 50.4 (152.9); min < med < max: 0 < 30 < 4000; IQR (CV): 60 (3) | 81 distinct values | 0 (0.0%) |
| 7 | theatre_max [numeric] | Mean (sd): 122.1 (159.4); min < med < max: 8 < 113 < 4000; IQR (CV): 0 (1.3) | 108 distinct values | 0 (0.0%) |
| 8 | promoted_ticketed_events [factor] | 1. No; 2. Yes | 704 (66.0%); 362 (34.0%) | 0 (0.0%) |
| 9 | wheelchair_accessible [factor] | 1. No; 2. Yes | 627 (58.8%); 439 (41.2%) | 0 (0.0%) |

Data Visualization

Maximum Capacity

Highly skewed! It is better to plot this variable on a log scale.

The box plot is compressed into a thick line because of the distribution of the theatre_max variable: most values are slightly above 100, and the other values are flagged as outliers.

U-Shaped Maximum Capacity

Outliers still exist, but the density distribution shows less skewness.

Maximum Standing Capacity

Log-scaled plots reveal more about the distributions. Note that this variable contains 270 zeros, so we add one unit before taking the logarithm, because log(0) is undefined.
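The effect of the one-unit shift can be seen in a few lines of base R. The capacities below are hypothetical and only illustrate why zeros are a problem on a log scale.

```r
# Hypothetical standing capacities, including zeros
max_standing <- c(0, 0, 60, 120, 4000)

plain_log   <- log(max_standing)      # log(0) evaluates to -Inf in R
shifted_log <- log(max_standing + 1)  # shifting by one maps zeros to log(1) = 0
same_thing  <- log1p(max_standing)    # base R shorthand for log(1 + x)

print(plain_log)
print(shifted_log)
```

For large capacities the shift is negligible, so the shape of the distribution on the log scale is essentially unchanged.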

Categorical Variables

There are almost half as many venues with loud-music events as venues without.

Venues providing alcohol are almost twice as numerous as venues with no alcohol.

Most venues provide wi-fi to their customers, so this feature carries little information. It is a candidate for exclusion because of its low variation.

Few venues are labelled “Supervenue”. With so little variation, this variable is unlikely to be useful in explaining the probability of the target labels.

The number of venues with promoted events is almost half the number of venues without such events.

Target Variable

Whether a venue has a wheelchair ramp is our target variable, and we attempt to predict the probability of each outcome for each venue.

Target variable is approximately balanced.

Relationship Between Features

theatre_max is correlated with max_standing and, more weakly, with u_shaped_max, which was expected beforehand.

Feature Distributions By Target Variable

Venues with no wheelchair ramp also show less variance in their maximum capacity compared with venues that are accessible.

Correlation Coefficients

| term | u_shaped_max | max_standing | theatre_max |
| --- | --- | --- | --- |
| u_shaped_max | NA | 0.120 | 0.133 |
| max_standing | 0.120 | NA | 0.913 |
| theatre_max | 0.133 | 0.913 | NA |

Collinearity exists between theatre_max and max_standing (r = 0.913)!
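A correlation matrix of this kind, together with a simple flag for highly correlated pairs, could be produced as sketched below. The data are simulated and the 0.9 threshold is an arbitrary assumption; neither comes from the original analysis.

```r
set.seed(42)

# Simulated capacities: theatre_max is built to track max_standing closely
max_standing <- rexp(200, rate = 1 / 50)
theatre_max  <- 1.2 * max_standing + rnorm(200, sd = 10)
u_shaped_max <- rnorm(200, mean = 35, sd = 8)

caps <- data.frame(u_shaped_max, max_standing, theatre_max)
corr <- round(cor(caps), 3)   # Pearson correlation matrix
print(corr)

# Flag each pair once (upper triangle) when |r| exceeds the assumed threshold
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  stringsAsFactors = FALSE
)
print(flagged)
```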

Modeling with Tidymodels

The objective of the following analysis is to predict whether a venue already has a ramp. In other words, there are two possible outcomes:
- If wheelchair_accessible = TRUE, the Marketing Department dismisses the venue.
- If wheelchair_accessible = FALSE, the Marketing Department contacts the venue.

The problem at hand is binary classification, a supervised machine learning task.

KPI to compare models

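In the following analyses, our main focus is specificity, defined as the number of true negative predictions out of all actual negatives (here, venues without a ramp). The business criterion is that, out of all venues predicted to have no ramp, at least 67% must be predicted accurately. Strictly speaking, that proportion (true negatives out of all predicted negatives) is the negative predictive value; specificity is used as the working KPI in what follows.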

Specificity is the KPI to evaluate the analyses.

Criterion : specificity > 67%
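To make the metric concrete, the sketch below computes specificity and the closely related negative predictive value from confusion-matrix counts. The counts are hypothetical and are not the author's actual confusion matrix; the positive class is assumed to be "venue has a ramp".

```r
# Hypothetical confusion-matrix counts; positive class = venue has a ramp
tp <- 40    # has a ramp, predicted ramp
fn <- 48    # has a ramp, predicted no ramp
tn <- 108   # no ramp, predicted no ramp
fp <- 18    # no ramp, predicted ramp

specificity <- tn / (tn + fp)  # share of actual no-ramp venues predicted as no ramp
npv         <- tn / (tn + fn)  # share of predicted no-ramp venues that truly have no ramp
sensitivity <- tp / (tp + fn)  # share of actual ramp venues predicted as ramp

print(round(c(specificity = specificity, npv = npv, sensitivity = sensitivity), 3))

# The success criterion as worded in the case study: at least two-thirds
meets_criterion <- npv >= 2 / 3
print(meets_criterion)
```

The two quantities share a numerator but differ in the denominator, so a model can score well on one and less well on the other when the classes are imbalanced or sensitivity is low.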

Preprocessing

In order to use step_log, we are better off changing the 0 values in max_standing to 1.
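The idea behind this preprocessing can be sketched in base R without any modeling package. The values are toy numbers, and the standardization step is an assumption inferred from the scaled values shown in the preprocessed tables; the helper `prep_standing` is a hypothetical name. The centering and scaling statistics are estimated on the training set only and then reused for the test set.

```r
# Toy capacities (hypothetical values)
train <- data.frame(max_standing = c(0, 0, 60, 120, 300))
test  <- data.frame(max_standing = c(0, 40, 200))

# Hypothetical helper: replace zeros, take logs, then standardize
prep_standing <- function(x, mu, sigma) {
  x[x == 0] <- 1             # avoid log(0) = -Inf
  (log(x) - mu) / sigma      # center and scale with training statistics
}

# Statistics estimated on the training set only
train_log <- log(replace(train$max_standing, train$max_standing == 0, 1))
mu    <- mean(train_log)
sigma <- sd(train_log)

train$max_standing_prepped <- prep_standing(train$max_standing, mu, sigma)
test$max_standing_prepped  <- prep_standing(test$max_standing, mu, sigma)

print(train)
print(test)
```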

First few rows of the training data after preprocessing steps are as follows:

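| max_standing | theatre_max | wheelchair_accessible | loud_music_events_Yes | venue_provides_alcohol_Yes | promoted_ticketed_events_Yes |
| --- | --- | --- | --- | --- | --- |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
| 0.103 | 0.134 | No | 1.000 | 1.000 | 1.000 |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |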

First few rows of the test data after preprocessing steps are as follows:

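| max_standing | theatre_max | wheelchair_accessible | loud_music_events_Yes | venue_provides_alcohol_Yes | promoted_ticketed_events_Yes |
| --- | --- | --- | --- | --- | --- |
| −1.570 | 0.134 | No | 0.000 | 1.000 | 0.000 |
| −1.570 | 0.134 | No | 0.000 | 1.000 | 0.000 |
| 0.490 | −2.065 | Yes | 0.000 | 0.000 | 1.000 |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |
| −1.570 | 0.134 | Yes | 0.000 | 1.000 | 0.000 |
| −1.570 | 0.134 | No | 0.000 | 0.000 | 0.000 |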

Linear Logistic Regression

Let’s start the analysis with the simplest classification model. Linear logistic regression assumes a linear relationship between the predictors and the log-odds of the outcome taking the value 1.

The coefficients, standard errors, test statistics (z-values), and p-values of the logistic model fitted on the training set are shown in the next table.
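As an illustration of the model form, the sketch below fits a logistic regression with base R's `glm` on simulated data. The predictors, sample size, and coefficients are made up for the simulation, and the post itself fits the model through tidymodels, so this is not the author's code.

```r
set.seed(1)
n <- 500

# Simulated binary predictors (hypothetical prevalences)
alcohol  <- rbinom(n, 1, 0.65)
promoted <- rbinom(n, 1, 0.35)

# The log-odds are linear in the predictors; coefficients are made up
log_odds <- -0.8 + 0.5 * alcohol + 0.6 * promoted
ramp     <- rbinom(n, 1, plogis(log_odds))   # plogis is the inverse logit

fit <- glm(ramp ~ alcohol + promoted, family = binomial)

# Estimate, standard error, z value, and p-value for each term
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))

# Predicted probabilities, then classes at an assumed 0.5 threshold
p_hat <- predict(fit, type = "response")
pred  <- ifelse(p_hat >= 0.5, 1, 0)
print(table(predicted = pred, actual = ramp))
```

A positive coefficient raises the log-odds, and therefore the probability, of the outcome being 1; the size of the effect on the probability scale depends on the values of the other predictors.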

| term | estimate | std.error | statistic | p.value |
| --- | --- | --- | --- | --- |
| (Intercept) | −0.837 | 0.135 | −6.198 | 0.000 |
| max_standing | 0.024 | 0.076 | 0.320 | 0.749 |
| theatre_max | −0.057 | 0.072 | −0.790 | 0.430 |
| loud_music_events_Yes | −0.221 | 0.171 | −1.297 | 0.194 |
| venue_provides_alcohol_Yes | 0.507 | 0.162 | 3.131 | 0.002 |
| promoted_ticketed_events_Yes | 0.625 | 0.158 | 3.943 | 0.000 |

The model fitted on the training set is used for prediction on the test set. Evaluation metrics such as accuracy, sensitivity, specificity, and ROC AUC are given below:

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.60 |
| sens | binary | 0.32 |
| spec | binary | 0.80 |
| roc_auc | binary | 0.63 |

Accuracy and ROC AUC are not good enough, despite reasonable specificity. Since the relationship between the predictors and the outcome may be nonlinear, let’s try a basic tree-based model.

Decision Tree

A decision tree can capture nonlinear contributions of the input variables when predicting the probability of the target variable. Although more complex algorithms exist and might perform better, we start with a rather simple one.

Again, the evaluation metrics have been gathered in the next table:

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.66 |
| sens | binary | 0.38 |
| spec | binary | 0.87 |
| roc_auc | binary | 0.67 |

The results improve on the linear logistic regression. Note that tuning the model can help enhance accuracy and other relevant metrics.

Tuning Decision Tree

| cost_complexity | min_n | .metric | .estimator | mean | n | std_err | .config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.000 | 27.000 | roc_auc | binary | 0.657 | 3.000 | 0.027 | Preprocessor1_Model003 |
| 0.000 | 28.000 | roc_auc | binary | 0.657 | 3.000 | 0.027 | Preprocessor1_Model023 |
| 0.000 | 27.000 | roc_auc | binary | 0.657 | 3.000 | 0.027 | Preprocessor1_Model049 |

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.659 |
| sens | binary | 0.443 |
| spec | binary | 0.810 |
| roc_auc | binary | 0.660 |

Other metrics still need to improve, so let’s try a random forest model, which is generally more powerful than a single decision tree.

Random Forest

A random forest is an ensemble of many decision trees, each trained on a distinct bootstrap sample of the training set. In a classification problem, majority voting across the trees is the deciding criterion for assigning each case to one of the classes.
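The two ingredients described here, bootstrap sampling and majority voting, can be illustrated in base R without fitting any trees. The votes below are hypothetical, and `majority_vote` is a helper name introduced for this sketch only.

```r
set.seed(7)

# Bootstrap sample: row indices drawn with replacement for one tree
n_train  <- 10
boot_idx <- sample(n_train, size = n_train, replace = TRUE)
print(boot_idx)

# Hypothetical votes: one row per venue, one column per tree
votes <- matrix(
  c("No",  "No",  "Yes",
    "Yes", "Yes", "No",
    "No",  "No",  "No"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("venue_", 1:3), paste0("tree_", 1:3))
)

# Hypothetical helper: the most frequent class among the trees
majority_vote <- function(v) names(which.max(table(v)))

forest_pred <- apply(votes, 1, majority_vote)
print(forest_pred)
```

With an even number of trees a tie is possible; `which.max` then returns the first class in alphabetical order, so a real implementation needs an explicit tie-breaking rule.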

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.678 |
| sens | binary | 0.420 |
| spec | binary | 0.857 |
| roc_auc | binary | 0.717 |

The results are better than those of the decision tree model; however, the model still needs to be tuned for better performance.

| trees | min_n | .metric | .estimator | mean | n | std_err | .config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 492.000 | 22.000 | roc_auc | binary | 0.683 | 3.000 | 0.024 | Preprocessor1_Model021 |
| 682.000 | 20.000 | roc_auc | binary | 0.682 | 3.000 | 0.026 | Preprocessor1_Model029 |
| 91.000 | 34.000 | roc_auc | binary | 0.682 | 3.000 | 0.021 | Preprocessor1_Model013 |

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.682 |
| sens | binary | 0.409 |
| spec | binary | 0.873 |
| roc_auc | binary | 0.719 |

XGBoost

Last but not least, we use another more complex machine learning model called XGBoost. It is an optimized gradient-boosting model whose advantages include speed and strong predictive performance: it often outperforms single models and is competitive with the state of the art in many ML tasks.

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.692 |
| sens | binary | 0.455 |
| spec | binary | 0.857 |
| roc_auc | binary | 0.703 |

Now, we continue by tuning XGBoost as well.

| tree_depth | learn_rate | sample_size | .metric | .estimator | mean | n | std_err | .config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4.000 | 0.080 | 0.570 | roc_auc | binary | 0.684 | 3.000 | 0.008 | Preprocessor1_Model128 |
| 1.000 | 0.264 | 0.237 | roc_auc | binary | 0.683 | 3.000 | 0.012 | Preprocessor1_Model176 |
| 3.000 | 0.027 | 0.520 | roc_auc | binary | 0.681 | 3.000 | 0.014 | Preprocessor1_Model015 |

| .metric | .estimator | .estimate |
| --- | --- | --- |
| accuracy | binary | 0.706 |
| sens | binary | 0.432 |
| spec | binary | 0.897 |
| roc_auc | binary | 0.715 |

The tuned XGBoost model outperformed the tuned random forest on accuracy (0.706 vs. 0.682), sensitivity (0.432 vs. 0.409), and specificity (0.897 vs. 0.873), with a nearly identical ROC AUC (0.715 vs. 0.719). All of these values are acceptable. Consequently, XGBoost is the final model. Because the criterion is met, more complex models are dismissed.

The final model is XGBoost, used to detect which venues still lack ramps for their audience.

The criterion is met: specificity > 0.67. Accuracy and ROC AUC improved as well!

Business Focus

In marketing, as in other departments, resources such as money, time, and people are scarce. That is, we need to identify prospective customers so that we can build a close relationship and offer our product to solve their problem.

CPC, or Cost per Contact, is defined as the money or any other cost that the company incurs to persuade one person to make a purchase. By targeting customers with a high probability of purchasing, we avoid calling every single venue and accomplish our goal with less time and money.

Addressing the Problem

We have developed a model that can select only venues with a high chance of ordering. The marketing team calls only those prospects, so CPC is lowered and the company can contact potential customers before anyone else.

Recommendations

Recently, digital marketing has become one of the prominent tools for contacting customers and offering the service they are looking for. Instead of calling predicted prospects, we can take some preliminary actions:

  1. Contact prospects via email, explaining why they would be better off ordering a ramp for wheelchair users.

  2. Identify which features distinguish venues by their need for a ramp. For instance, it would be instructive to share statistics about venues with similar features and so create a sense that a ramp is necessary.

  3. Send a video or even a simulation, which helps customers get a sense of where the ramp will be constructed and whether they are comfortable with it.