Titanic Diaries - Part I: Data Dusting
In most data science tutorials I have seen, the data clean-up is treated casually, as an annoying obstacle on the way to the sexy Machine Learning bit. I was curious to see what difference, if any, a more careful approach to data cleanup would make in my Kaggle ranking. My approach: first, naively follow Datacamp's tutorial and submit my test-set labels as a benchmark; second, use a more elaborate data cleanup process and see whether taking the extra time actually moves my ranking up, or maybe down.
The Titanic data set¶
This data set can be used to train a machine learning algorithm to correctly classify passengers of the Titanic's first and last voyage as having survived the disaster or not. To that end, the data set available at Kaggle contains various data types pertaining to each passenger. The label to be predicted is the Survived feature (0 for died, 1 for survived). The data comes pre-divided: a training set that includes survival labels and a test set that does not. The goal is to produce survival predictions for the test set and upload the result to Kaggle for scoring.
import pandas as pd
import os
import seaborn as sb
import matplotlib.pyplot as pl
from sklearn.preprocessing import Imputer
%matplotlib inline
sb.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 1.5})
trainset = '/home/madhatter106/DATA/Titanic/train.csv'
testset = '/home/madhatter106/DATA/Titanic/test.csv'
dfTrain = pd.read_csv(trainset)
dfTest = pd.read_csv(testset)
dfTrain.head(2)
dfTest.head(2)
print(dfTrain.describe())
print('-' * 40)
print(dfTrain.info())
print('-' * 40)
print(dfTrain.isnull().sum())
Problems in the training dataset:
- Missing data in Age; almost 20% in fact, so we're definitely imputing that one.
- Missing data in Cabin; a whopping 687/891, and this feature doesn't seem usable to me without some information on the cabin layout, so I will drop it.
- Missing data in Embarked; just 2 entries missing, so maybe assign whichever value is most common?
- Some of the data appear spurious, e.g. min(Fare) is 0. I'll assume that to be an omission; these are passengers after all, and none seem related to the crew. The goal then is to figure out a scheme to replace the free tickets with more meaningful data.
- Some of the data seem useless; as a first cut: PassengerId, Ticket and, as mentioned before, Cabin.
- Some categorical data need to be transformed: Sex, Embarked.
print(dfTest.describe())
print('-' * 40)
print(dfTest.info())
print('-' * 40)
print(dfTest.isnull().sum())
Problems in the test set:
- Age is missing 86 entries
- Fare is missing 1 entry and there are some free tickets in this set too.
- Cabin is missing 327 entries (but we don't care)
Getting rid of features, first cut:¶
First I'm going to get rid of PassengerId, Ticket and Cabin. While dropping the Name data is tempting, I'm going to hold on to it for now, in case I can use titles to help infer missing Age data.
dfTrain.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
dfTest.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1, inplace=True)
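One caveat worth flagging (my addition): the Kaggle submission file keys on PassengerId, so that column needs to be stashed somewhere before scoring time. A minimal sketch that re-reads it from the raw CSV, since the column was just dropped; the variable name is my own:
# Keep the test-set PassengerId around for the eventual submission file;
# re-read it from the raw CSV since the column was just dropped above
testPassengerId = pd.read_csv(testset)['PassengerId']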
Before we impute/correct any of the Fare, Age and Embarked data, let's see if they appear to be factors.
sb.factorplot(y='Age', x='Survived', data=dfTrain, aspect=3);
Age is clearly a factor; what about Fare?
sb.factorplot(y='Fare', x='Survived', data=dfTrain, aspect=3);
Fare is clearly also a factor, but is it because it's a proxy for class?
sb.factorplot(x='Survived', y='Fare', hue='Pclass', data=dfTrain, aspect=3);
Interestingly, Fare appears to have an effect in $1^{st}$ class only. Let's look at 'Embarked'.
sb.countplot(x='Embarked', hue='Survived', data=dfTrain);
'Embarked', 'Age', and 'Fare' all seem to have an effect on survival, so I'll clean up all three features.
Correcting spurious entries and imputing missing data¶
First, since a lot of the transformations are common to the training and the test set, I'm going to combine the two so that imputation does not drive a statistical wedge between them.
dfTemp = pd.concat((dfTrain, dfTest), join='inner')
print(dfTemp.describe())
print('-' * 50)
print(dfTemp.info())
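A couple of cheap sanity checks on the concatenation (my addition): join='inner' keeps only the columns common to both sets, so the training-only Survived label should be gone, and the row counts should add up.
# Sanity-check the combined frame
assert len(dfTemp) == len(dfTrain) + len(dfTest)
assert 'Survived' not in dfTemp.columns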
Now we're ready to do some imputation/corrections.
Dealing with the "free" tickets in the training set:¶
dfTemp[dfTemp.Fare==0]
Some common details among "free ticket" cases:¶
- embarked in Southampton
- SibSp = 0
- Parch = 0
- Sex = male
- Pclass = 1, 2, 3 (i.e., all three classes show up)
It makes more sense to me to impute the class-dependent median fare of tickets bought in Southampton:
Fares_S_AllCl_Non0 = dfTemp.loc[(dfTemp.Fare != 0) & (dfTemp.Embarked == 'S'), ['Pclass', 'Fare']]
for i in range(1, 4):
    dfTrain.loc[(dfTrain.Fare == 0) & (dfTrain.Pclass == i), 'Fare'] = Fares_S_AllCl_Non0[Fares_S_AllCl_Non0.Pclass == i].Fare.median()
    dfTest.loc[(dfTest.Fare == 0) & (dfTest.Pclass == i), 'Fare'] = Fares_S_AllCl_Non0[Fares_S_AllCl_Non0.Pclass == i].Fare.median()
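A quick check along the same lines (again my addition) should confirm that no zero fares survive in either set:
# Verify the zero-fare replacement took in both sets
assert not (dfTrain.Fare == 0).any()
assert not (dfTest.Fare == 0).any()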
Imputing missing values¶
Imputing Embarked¶
Embarked is only missing 2 entries, so I'd impute based on whatever value is most common. But since the goal of this exercise is to be OCD:
dfTrain.loc[dfTrain.Embarked.isnull(), ['Name', 'Fare', 'Pclass']]
sb.factorplot(x='Pclass', y='Fare', hue='Embarked', data=dfTemp, aspect=2)
This suggests that 'S' is a relatively safe bet. But I still wonder: what is the most frequent value in 'Embarked'?
dfTrain['Embarked'].value_counts()
Now I'm fairly confident 'S' is the right value to impute for 'Embarked'.
dfTrain['Embarked'].fillna('S', inplace=True)
Imputing Fare (in the test set)¶
Based on the factor plot above of Fare against class, it seems safe to impute the missing Fare value based on passenger class.
pclass4fare = dfTest.loc[dfTest.Fare.isnull(), 'Pclass'].values[0]
missingClassMedianFare = dfTest[dfTest.Pclass == pclass4fare].Fare.dropna().median()
dfTest.loc[dfTest.Fare.isnull(), 'Fare'] = missingClassMedianFare
Imputing Age:¶
Age is one of those things that, to a first approximation, can be estimated from how a person is referred to. First I am going to catalogue the titles present in names and create another feature, "Title". To do this I need an inventory of all the titles present in the dataset. Titles appear to be the second word in the name string, ending with a '.'.
nameset = set()
for name in dfTemp.Name.values:
    for subname in name.split(' '):
        # Grab the part followed by a '.', but verify it's an abbreviated
        # title rather than a middle initial (titles run longer than 2 characters)
        if '.' in subname and len(subname) > 2:
            nameset.add(subname)
print(nameset)
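As an aside, pandas can do this inventory in one line with a regex via Series.str.extract; a sketch of that alternative (my addition, assuming every title is a word ending in a period that directly follows the surname, which is what the loop above confirms):
# Alternative: capture the first word that precedes a '.' in each name;
# since the title directly follows the surname, the first match is the title
titles = dfTemp.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.value_counts())
I'll stick with the set-intersection route, wrapped in a small helper: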
def createTitle(name):
    return list(set(name.split(' ')) & nameset)[0][:-1]
dfTrain['Title'] = dfTrain.Name.apply(createTitle)
dfTest['Title'] = dfTest.Name.apply(createTitle)
dfTemp['Title'] = dfTemp.Name.apply(createTitle)
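As a quick spot check (my addition), running createTitle on a name in the dataset's 'Surname, Title. Given names' format should return just the title, shorn of its period:
# Expect 'Mr': 'Mr.' is the only token shared by the name and nameset
print(createTitle('Braund, Mr. Owen Harris'))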
Does title help distinguish age?
f = pl.figure(figsize=(12, 8))
ax = f.add_subplot(111)
violin = sb.violinplot(x='Title', y='Age', data=dfTemp, ax=ax, scale='area')
for item in violin.get_xticklabels():
    item.set_rotation(45)
Not awesome. Still, while some titles come with a wide age range, others, like Master, have a markedly narrower one, and that could still be informative for my Age imputation. Also, I probably don't need all of the titles. Which titles correspond to missing ages?
dfTemp.Title[dfTemp.Age.isnull()].unique()
Clearly I don't need to use all of the 'Title' data. Based on the graph above, I will fold Ms into Miss and impute their missing ages from the median Age of Miss. I will impute the remaining missing ages directly from the median Age of their corresponding title.
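Before running the imputation, a quick look at the per-title Age medians that are about to be plugged in (this peek is my addition; note it runs before Ms gets folded into Miss):
# Per-title Age medians that will drive the imputation below
print(dfTemp.groupby('Title')['Age'].median())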
# re-titling Ms as Miss
dfTemp.loc[dfTemp.Title == 'Ms', 'Title'] = 'Miss'
dfTrain.loc[(dfTrain.Age.isnull()) &
            ((dfTrain.Title == 'Miss') | (dfTrain.Title == 'Ms')),
            'Age'] = dfTemp.loc[dfTemp.Title == 'Miss', 'Age'].median()
dfTest.loc[(dfTest.Age.isnull()) &
           ((dfTest.Title == 'Miss') | (dfTest.Title == 'Ms')),
           'Age'] = dfTemp.loc[dfTemp.Title == 'Miss', 'Age'].median()
for title in ['Mr', 'Mrs', 'Master', 'Dr']:
    dfTrain.loc[(dfTrain.Age.isnull()) &
                (dfTrain.Title == title), 'Age'] = dfTemp.loc[dfTemp.Title == title, 'Age'].median()
    dfTest.loc[(dfTest.Age.isnull()) &
               (dfTest.Title == title), 'Age'] = dfTemp.loc[dfTemp.Title == title, 'Age'].median()
print(dfTrain.info())
print('-' * 50)
print(dfTest.info())
Now I can simplify the data sets by dropping the 'Name' and 'Title' columns from both.
dfTrain.drop(['Name', 'Title'], axis=1, inplace=True)
dfTest.drop(['Name', 'Title'], axis=1, inplace=True)
Labeling categorical data¶
Sex and Embarked are categorical variables that need to be encoded numerically. Because neither feature is ordinal, I'm going to one-hot encode both, using pandas' get_dummies function. This avoids introducing a spurious "ranking" into these variables.
# one-hot encoding non-hierarchical categorical labels
dfTrain = pd.concat([dfTrain, pd.get_dummies(dfTrain[['Sex', 'Embarked']])], axis=1)
dfTest = pd.concat([dfTest, pd.get_dummies(dfTest[['Sex', 'Embarked']])], axis=1)
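For the record, get_dummies names the new columns <feature>_<value>, so both frames should now carry Sex_female, Sex_male, Embarked_C, Embarked_Q and Embarked_S; a quick way to confirm (my addition). Two design notes: encoding train and test separately only works here because every category occurs in both sets, and get_dummies also accepts drop_first=True if you want to avoid perfectly collinear dummies.
# Inspect the new one-hot columns
print([c for c in dfTrain.columns if c.startswith(('Sex_', 'Embarked_'))])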
Now that we don't need 'Sex' or 'Embarked' any more, we drop them from both sets.
dfTrain.drop(['Sex', 'Embarked'], axis=1, inplace=True)
dfTest.drop(['Sex', 'Embarked'], axis=1, inplace=True)
This concludes this part of the pre-processing: the data cleanup. Here's what the data sets look like now:
print(dfTrain.info())
print('-' * 50)
print(dfTest.info())
I'm going to pickle both Dataframes for safekeeping until the next blog...
dfTrain.to_pickle('/home/madhatter106/DATA/Titanic/dfTrainCln_I.pkl')
dfTest.to_pickle('/home/madhatter106/DATA/Titanic/dfTestCln_I.pkl')
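(For reference, pd.read_pickle is the inverse; something like this should restore the frames at the top of the next notebook:)
# dfTrain = pd.read_pickle('/home/madhatter106/DATA/Titanic/dfTrainCln_I.pkl')
# dfTest = pd.read_pickle('/home/madhatter106/DATA/Titanic/dfTestCln_I.pkl')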
In the next notebook, I'll be looking to do some additional post-cleanup pre-processing. Until next time!