Load all the required libraries:
library(tidyverse) #Data manipulation
library(caret) #Data preprocessing and evaluation metrics
library(GGally) #Extension of ggplot to draw pairplot
library(psych) #Another library to draw pairplot
library(caTools) #Train/test splitting
Import the Titanic dataset:
df = read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df
The Titanic dataset is a legendary dataset containing demographic and travel information about some of the Titanic's passengers, and the goal is to predict whether each passenger survived. The accident happened in 1912, when the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew.
From the above sample of the RMS Titanic data, one can see the various features recorded for each passenger on the ship.
First, let's investigate the descriptive statistics and data types of each feature in the data:
glimpse(df)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,~
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1~
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3~
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl~
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal~
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, ~
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0~
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0~
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,~
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"~
summary(df)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
colSums(is.na(df))
## PassengerId Survived Pclass Name Sex Age
## 0 0 0 0 0 177
## SibSp Parch Ticket Fare Cabin Embarked
## 0 0 0 0 0 0
Following a quick investigation, we discover that the only variable with missing (NA) values is Age. However, a closer examination of the data reveals that Cabin also has 687 missing values, which are represented by the empty string ''. As a result, we'll remove all rows with missing Age and drop the Cabin column. The PassengerId column will be removed as well, as it provides no useful information.
nrow(df[df$Cabin=='',])
## [1] 687
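As an aside, here is a hedged sketch (df_na is a throwaway copy, not used later) of converting the empty strings to proper NA values so that standard tools like is.na() and complete.cases() can see them; na_if() comes from dplyr:
# Sketch: treat empty strings as NA in every character column
df_na = df %>% mutate(across(where(is.character), ~ na_if(.x, '')))
colSums(is.na(df_na)) # Age 177, Cabin 687, Embarked 2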
Quick data cleaning:
df = df %>% select(-c(PassengerId,Cabin)) #drop the uninformative and mostly-missing columns
df = df[complete.cases(df),] #drop the 177 rows with missing Age
df_ori = df #keep a copy of the cleaned dataframe for the later feature engineering
df
Before doing the analysis, it's better for us to know the proportion of surviving passengers in the sample data. From the bar plot below, it is evident that roughly 40% of the passengers, around 290 of the 714, survived.
df %>%
  count(Survived = factor(Survived)) %>%
  mutate(prop = prop.table(n)) %>%
  ggplot(aes(x = Survived, y = n, label = paste(round(prop * 100, 2), '%'))) +
  geom_bar(stat = 'identity', fill = '#F8766D') +
  geom_text(position = position_dodge(width = .9), vjust = -0.5, size = 4) +
  theme_minimal() +
  labs(title = 'Count of Survived Passengers', subtitle = 'Survived Passengers represented by 1', y = 'count') +
  ylim(0, 500)
Now we can analyze the highlighted variables, i.e., Sex, Pclass, and Age. To make the plotting easier, let's convert Survived, Pclass, and Sex to factors (numerical to categorical).
# Factor the categorical variable Survived, Pclass, and Sex to make the plotting process easier
df$Survived=factor(df$Survived)
df$Pclass=factor(df$Pclass)
df$Sex=factor(df$Sex)
Age
library(patchwork)
p1=ggplot(df,aes(x=Age,fill=Survived))+geom_histogram()+theme_minimal()+labs(title='Histogram of Survived vs Age')
p2=ggplot(df,aes(x=Survived,y=Age))+geom_boxplot()+theme_minimal()+labs(title='Boxplot of Survived vs Age')
p3=p1+p2+plot_annotation(caption = 'Survived passengers represented by 1')
p3
At a glance, the age distributions of surviving (1) and non-surviving (0) passengers are nearly identical, so Age may not be a good predictor of Survived on its own. However, a closer examination of the histogram reveals that passengers aged roughly 0-8 years (children) do have a higher chance of survival than the others.
p1 = df %>% ggplot(aes(x=cut_interval(Age,length = 8)))+geom_bar(stat='count',fill='#F8766D')+theme_minimal()
p2 = df %>% ggplot(aes(x=cut_interval(Age,length = 8)))+geom_bar(stat='count',position = 'fill',aes(fill=Survived))+scale_y_continuous(labels = scales::percent)+theme_minimal()+labs(y='Survived Percentage')
p=p1/p2+plot_annotation(title='Age Binning with length=8',subtitle = 'Children have a relatively higher probability of surviving')
ggsave('age_bin.jpg',p,dpi=1000,width=9,height=6)
print(p)
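For a numeric view of the same pattern, here is a small sketch that tabulates the survival rate per 8-year age bin (Survived is already a factor at this point, so we compare against the level '1'):
# Survival rate per age bin; the youngest bin should stand out
df %>%
  mutate(AgeBin = cut_interval(Age, length = 8)) %>%
  group_by(AgeBin) %>%
  summarise(n = n(), SurvivalRate = mean(Survived == '1'))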
Sex
p1 = df %>% ggplot(aes(x=Sex))+geom_bar(stat='count',fill='#F8766D')+theme_minimal()
p2 = df %>% ggplot(aes(x=Sex))+geom_bar(stat='count',position = 'fill',aes(fill=Survived))+ scale_y_continuous(labels = scales::percent)+theme_minimal()+labs(y='Survived Percentage')
p1+p2+plot_annotation(title='Count of Survived vs Sex',subtitle='Survived passengers represented by 1')
The stacked bar plot shows that Sex may be a good predictor of the Survived variable, as female passengers have a survival probability of around 75%, about three times higher than that of male passengers.
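These percentages can be verified numerically with a quick cross-tabulation; a sketch using base R's prop.table():
# Row-wise survival proportions by Sex (each row sums to 1)
prop.table(table(df$Sex, df$Survived), margin = 1)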
Pclass
p1 = df %>% ggplot(aes(x=Pclass))+geom_bar(stat='count',fill='#F8766D')+theme_minimal()
p2 = df %>% ggplot(aes(x=Pclass))+geom_bar(stat='count',position = 'fill',aes(fill=Survived))+ scale_y_continuous(labels = scales::percent)+theme_minimal()+labs(y='Survived Percentage')
p1+p2+plot_annotation(title='Count of Survived vs Pclass',subtitle='Survived passengers represented by 1')
The same conclusion can be drawn for the Pclass variable, since the bar plot shows that the probability of survival decreases as Pclass increases (remember that Pclass 1 indicates upper class and Pclass 3 indicates lower class).
Lastly, I will plot the correlation and value distribution of each feature with a pairplot. However, we have to conduct some feature engineering beforehand. The Name and Ticket columns will be dropped first, since their data types are character, while Sex and Embarked will be one-hot encoded. Please note that Embarked still has 2 rows with empty string values; they will be imputed with 'S', the majority value.
df_ori=df_ori %>% mutate(Embarked=replace(Embarked,Embarked=='','S')) #There are 2 empty string values in Embarked
df_ori$Sex = ifelse(df_ori$Sex=='male',1,0) #1 for male and 0 for female
df_ori=df_ori %>% mutate(Embarked_Q=ifelse(Embarked=='Q',1,0),Embarked_S=ifelse(Embarked=='S',1,0)) #Embarked will be encoded to Embarked_S and Embarked_Q only to avoid multicollinearity
df_ori = df_ori %>% select(-c(Name,Ticket,Embarked))
df_ori
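As a side note, the same encoding could be produced with caret's dummyVars(); a hedged sketch follows (tmp and enc are illustrative names, and the analysis keeps the manual encoding above). With fullRank = TRUE one level per factor is dropped, matching the multicollinearity rationale:
# Sketch: equivalent one-hot encoding via caret::dummyVars
tmp = df %>% mutate(Embarked = factor(replace(Embarked, Embarked == '', 'S')))
enc = dummyVars(~ Sex + Embarked, data = tmp, fullRank = TRUE)
head(predict(enc, newdata = tmp)) # columns: Sex.male, Embarked.Q, Embarked.S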
Only numerical features now exist in the dataset:
glimpse(df_ori)
## Rows: 714
## Columns: 9
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1,~
## $ Pclass <int> 3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3, 2, 2, 3,~
## $ Sex <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,~
## $ Age <dbl> 22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55, 2~
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1, 0, 0, 0,~
## $ Parch <int> 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0, 0,~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625, 21.0750,~
## $ Embarked_Q <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ Embarked_S <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,~
Draw the pairplot and compute the Spearman correlation for each pair of features:
pairs.panels(df_ori,smooth = TRUE,density = TRUE,ellipses = TRUE,method='spearman') #based on my literature reading, Spearman is more suitable for data with categorical features
The pairplot appears to support our previous hypotheses: among all columns, Sex and Pclass have the highest absolute correlation with Survived. As expected, Age shows only a low correlation with the target variable.
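To read these correlations off directly rather than from the panels, here is a one-line sketch (df_ori is still entirely numeric at this point, and Survived itself naturally tops the list with a value of 1):
# Absolute Spearman correlation of every feature with Survived, strongest first
sort(abs(cor(df_ori, method = 'spearman')['Survived', ]), decreasing = TRUE)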
To summarize the findings, Pclass and Sex have the potential to be good predictors of the Survived variable. However, given that the Age distributions of surviving and non-surviving passengers are roughly similar, Age may not have the same impact on Survived as Pclass and Sex. It does provide one important piece of information, though: passengers under the age of eight have a better chance of survival.
The only variable we still need to transform before the predictive modelling is the target variable, i.e., Survived. Remember that Sex and Embarked are already encoded. Also, Pclass will be left as it is, since it is considered an ordinal variable.
df_ori$Survived=factor(df_ori$Survived)
Split the data into training and test sets with an 85/15 ratio:
set.seed(2233)
data_split=sample.split(df_ori$Survived,SplitRatio = 0.85)
train_df=subset(df_ori,data_split==TRUE)
test_df=subset(df_ori,data_split==FALSE)
As a result, we now have train_df with 606 observations and test_df with 108 observations.
train_df
test_df
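Note that sample.split() stratifies on the Survived variable, so the training and test sets keep roughly the same survival rate. As a hedged sketch, an equivalent stratified split with caret's createDataPartition() (train_alt and test_alt are illustrative names, not used below):
# Sketch: stratified train/test split via caret::createDataPartition
idx = createDataPartition(df_ori$Survived, p = 0.85, list = FALSE)
train_alt = df_ori[idx, ]
test_alt = df_ori[-idx, ]
prop.table(table(train_alt$Survived)) # survival rate close to the full data's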
Select the Survived, Pclass, Age, and Sex columns:
train_df_filtered=train_df %>% select(Survived,Pclass,Age,Sex)
test_df_filtered=test_df %>% select(Survived,Pclass,Age,Sex)
Build the logistic regression model and fit it to the training data:
model_logit=glm(Survived~.,train_df_filtered,family='binomial')
summary(model_logit)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_df_filtered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6771 -0.6917 -0.4105 0.6597 2.4263
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.911494 0.533678 9.203 < 2e-16 ***
## Pclass -1.290937 0.149474 -8.637 < 2e-16 ***
## Age -0.032694 0.008003 -4.085 4.4e-05 ***
## Sex -2.456765 0.223250 -11.005 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 818.52 on 605 degrees of freedom
## Residual deviance: 556.40 on 602 degrees of freedom
## AIC: 564.4
##
## Number of Fisher Scoring iterations: 5
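Before predicting, it can help to read the coefficients on the odds scale: exponentiating a coefficient gives the multiplicative change in the odds of survival per unit increase in that feature. A short sketch:
# Odds ratios from the fitted coefficients: e.g., exp(-2.4568) ~ 0.086 means
# being male (Sex = 1) multiplies the odds of survival by roughly 0.086
exp(coef(model_logit))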
Predict on the test data and evaluate the results:
y_predict_logit=predict(model_logit,test_df_filtered,type='response') #predicted survival probabilities
y_predict_logit_class=factor(ifelse(y_predict_logit>=0.5,1,0)) #class labels at the 0.5 cutoff
caret::confusionMatrix(y_predict_logit_class,test_df_filtered$Survived,positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56 10
## 1 8 34
##
## Accuracy : 0.8333
## 95% CI : (0.7494, 0.8981)
## No Information Rate : 0.5926
## P-Value [Acc > NIR] : 6.538e-08
##
## Kappa : 0.6524
##
## Mcnemar's Test P-Value : 0.8137
##
## Sensitivity : 0.7727
## Specificity : 0.8750
## Pos Pred Value : 0.8095
## Neg Pred Value : 0.8485
## Prevalence : 0.4074
## Detection Rate : 0.3148
## Detection Prevalence : 0.3889
## Balanced Accuracy : 0.8239
##
## 'Positive' Class : 1
##
The confusion matrix shows that our model performs well, with an accuracy of about 83 percent. However, because the model was trained on a small number of observations, I believe its accuracy may decrease when it is applied to test data with a larger number of observations. Another point to mention is that the model appears to be better at predicting passengers who did not survive: according to the confusion matrix, it correctly identifies 56/64 (87.5%) of the passengers who did not survive, versus 34/44 (77%) of those who did. This deficiency may come from the imbalanced distribution of the target variable Survived itself, which gives our model more examples of non-surviving passengers to learn from and makes it slightly biased.
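Since the 0.5 cutoff is a modelling choice rather than a given, a threshold-free evaluation is also worth a look. Here is a sketch using the pROC package (an additional dependency, not loaded at the top):
# ROC / AUC computed on the raw predicted probabilities (requires pROC)
library(pROC)
roc_logit = roc(test_df_filtered$Survived, y_predict_logit)
auc(roc_logit) # 1 = perfect ranking, 0.5 = random guessing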
Based on our data analysis in Question no. 1, we know that female, young, and upper-class passengers have a higher chance of survival. Our model clearly uses these patterns; that is why, in the table below, male and lower-class passengers receive the lowest predicted survival probabilities.
test_df_filtered=test_df_filtered %>% mutate(Survive_Predict=y_predict_logit_class,.after=Survived) %>% mutate(Survive_Prob=y_predict_logit,.after=Survive_Predict)
test_df_filtered %>% filter(Survived==0 & Survived==Survive_Predict) %>% arrange(Survive_Prob) #correctly predicted non-survivors, lowest predicted probability first
However, let's face it: every dataset, including ours, contains noise. Of course, there are male and lower-class passengers who did survive, and there is a high chance that our model will predict these observations incorrectly.
test_df_filtered %>% filter(Survived==1&Survived!=Survive_Predict) %>% arrange(Survive_Prob) %>% head(5)
The table above is meant to emphasize this noise. According to it, our model predicts that certain lower-class and male passengers have a low chance of survival, when in reality they survived.
To summarize the findings, I believe it is normal for our model to make errors, because noise exists in every dataset. Furthermore, the small number of observations in the training data, combined with the small number of columns we used, raises the possibility of an oversimplified or highly biased model. To improve the results, we can include more observations and more columns with significant effects on the target variable, Survived. Also, consider whether removing the rows with missing Age values is really a good idea. Perhaps we could improve the result by keeping them and imputing the missing ages (so the number of observations does not decrease) based on other columns, e.g., the title in each passenger's name. Hopefully, I'll be able to answer this question in future work.
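To close with a rough sketch of that last idea (assuming the title is the word right before the '.' in Name, and that the per-title median Age is a sensible fill):
# Sketch: impute missing Age with the median Age of each passenger's title
raw = read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
raw = raw %>%
  mutate(Title = str_extract(Name, '[A-Za-z]+(?=\\.)')) %>% # e.g., 'Mr', 'Mrs', 'Master'
  group_by(Title) %>%
  mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()
sum(is.na(raw$Age)) # 0, provided every title has at least one known Age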