Choosing the Right File Format for Your Data Lake: A Comprehensive Guide

Introduction: In today’s data-driven world, organizations are leveraging data lakes to store and analyze vast amounts of information. With the advent of cloud providers’ deep storage systems, choosing the appropriate file format has become crucial for efficient data management. This article explores the different file formats available, their advantages, and considerations to help you make informed decisions when setting up your data lake.

File Formats for Deep Storage Systems: Deep storage systems, such as S3 or GCS, offer cost-effective storage options for data lakes but lack strong ACID guarantees. When utilizing these systems, selecting the right file format is paramount. Here are some key points to consider:

  1. Structure of Your Data: Certain file formats, like JSON, Avro, and Parquet, support nested data, while others do not. However, it is important to note that not all formats handle nested data efficiently. Avro, for example, stands out as the most efficient format for handling nested data. On the other hand, Parquet nested types can be inefficient, and processing nested JSON can be CPU-intensive. In most cases, it is recommended to flatten the data during ingestion (see the flattening sketch after this list).
  2. Performance: File formats such as Avro and Parquet offer superior performance compared to others like JSON. The choice between Avro and Parquet depends on the specific use case. Parquet, being a columnar format, excels in SQL-based querying, while Avro is ideal for row-level transformations during ETL processes.
  3. Readability: Consider whether the data needs to be human-readable or not. JSON and CSV are text formats that are easily readable by humans. However, more performant formats like Parquet and Avro are binary, optimized for storage efficiency.
  4. Compression: Different file formats provide varying compression rates. It’s important to assess the trade-off between file size and CPU costs. Some compression algorithms offer faster processing but result in larger file sizes, while others prioritize better compression rates at the expense of slower processing.
  5. Schema Evolution: Changing data schemas in a data lake can be challenging compared to databases. However, formats like Avro and Parquet offer some degree of schema evolution, allowing you to modify the schema while still being able to query the data. Additionally, specialized tools like Delta Lake provide enhanced capabilities for handling schema changes.
  6. Compatibility: Formats such as JSON and CSV enjoy widespread adoption and compatibility with various tools. In contrast, more performant options may have fewer integration points but offer superior performance.
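
Point 1 above recommends flattening nested data during ingestion. As an illustration (a minimal sketch of my own, not from the original article), the snippet below uses pandas.json_normalize on a made-up nested event record and writes the flattened result to Parquet; the record fields and file names are assumptions, and writing Parquet requires the pyarrow (or fastparquet) package.

import pandas as pd

# Hypothetical nested records, e.g. events landed from an API as JSON
records = [
    {"id": 1, "user": {"name": "Ana", "country": "PT"}, "amount": 9.5},
    {"id": 2, "user": {"name": "Bo", "country": "SE"}, "amount": 3.2},
]

# json_normalize flattens the nested "user" object into user_name / user_country columns
flat = pd.json_normalize(records, sep="_")
print(flat.columns.tolist())  # e.g. ['id', 'amount', 'user_name', 'user_country']

# The flattened frame can then be written to a columnar format for the data lake
flat.to_parquet("events.parquet", index=False)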

File Format Options: Let’s explore some of the commonly used file formats for data lakes:

  1. CSV: Suitable for compatibility, spreadsheet processing, and small data sets. However, it lacks efficiency and cannot handle nested data well. Use CSV for exploratory analysis, proof-of-concepts, or small-scale datasets.
  2. JSON: Widely used in APIs and supports nested data. While it is human-readable, reading deeply nested fields can become challenging. JSON is great for small datasets, landing data, or API integration. For processing large amounts of data, consider converting it to a more efficient format (see the conversion sketch after this list).
  3. Avro: Excellent for storing row data efficiently, especially when combined with Kafka. Avro supports schemas and provides integration with Kafka. It is recommended for row-level operations and data ingestion. However, it may have slower read performance compared to other formats.
  4. Protocol Buffers: Ideal for APIs, particularly gRPC. Protocol Buffers are known for their speed and schema support, making them suitable for APIs and machine learning workflows.
  5. Parquet: A columnar storage format that works well with Hive and Spark for SQL-based querying. It offers schema support and efficient storage. Query engines can selectively read only the required columns, resulting in improved performance compared to Avro. Parquet serves as an excellent reporting layer for data lakes.
  6. ORC: Similar to Parquet, ORC offers better compression rates and enhanced schema evolution capabilities. Though less popular, it is a viable alternative to Parquet in certain use cases.
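
To make the conversion mentioned in the JSON and Parquet entries concrete, here is a minimal sketch, assuming pandas with the pyarrow engine and a hypothetical events.csv landing file with event_date and amount columns (all of these names are illustrative, not from the original article). It also shows the column pruning that makes a columnar format attractive as a reporting layer.

import pandas as pd

# Convert a landing CSV file into Parquet once, during ingestion
df = pd.read_csv("events.csv")
df.to_parquet("events.parquet", engine="pyarrow", index=False)

# Downstream readers can project only the columns they need,
# which is where a columnar format pays off for SQL-style reporting
report = pd.read_parquet("events.parquet", columns=["event_date", "amount"])
print(report.head())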

File Compression: Apart from choosing the right file format, selecting an appropriate compression algorithm is crucial for optimizing storage efficiency. Consider the trade-off between file size and CPU costs. For streaming data, snappy compression is recommended due to its low CPU requirements. For batch processing, bzip2 offers a good balance between compression rates and processing speed.
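
One rough way to evaluate that trade-off is to write the same data with different codecs and compare the resulting sizes (and, if needed, the write times). The sketch below is an illustration of my own rather than part of the article; it assumes pandas and pyarrow and uses a small synthetic DataFrame.

import os
import pandas as pd

df = pd.DataFrame({"user_id": range(100_000), "amount": [1.5] * 100_000})

# Write the same frame with two codecs and compare on-disk sizes
for codec in ["snappy", "gzip"]:
    path = f"sample_{codec}.parquet"
    df.to_parquet(path, engine="pyarrow", compression=codec, index=False)
    print(codec, os.path.getsize(path), "bytes")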

Conclusion: When setting up a data lake, selecting the right file format is vital for efficient storage, querying, and data processing. While CSV and JSON formats are widely adopted and easy to use, they lack the capabilities of more optimized formats. Parquet and Avro are commonly used in data lake ecosystems, offering distinct advantages for different use cases. Consider the structure of your data, performance requirements, readability, compression options, schema evolution capabilities, and compatibility with existing tools when making your file format decisions. By understanding the strengths and trade-offs of each format, you can build a robust data lake infrastructure that meets your organization’s needs efficiently.

DieTanic – Titanic: Machine Learning from Disaster

Sometimes life has a cruel sense of humor, giving you the thing you always wanted at the worst time possible. -Lisa Kleypas

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Hence the name DieTanic. It is a disaster the world has never forgotten.

Contents of the Notebook:

Part1: Exploratory Data Analysis(EDA):

1)Analysis of the features.

2)Finding any relations or trends considering multiple features.

Part2: Feature Engineering and Data Cleaning:

1)Adding a few features.

2)Removing redundant features.

3)Converting features into suitable form for modeling.

Part3: Predictive Modeling

1)Running Basic Algorithms.

2)Cross Validation.

3)Ensembling.

4)Important Features Extraction.

Part1: Exploratory Data Analysis(EDA)

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [2]:
data=pd.read_csv('../input/train.csv')
In [3]:
data.head()
Out[3]:

In [4]:
data.isnull().sum() #checking for total null values
Out[4]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Age, Cabin and Embarked have null values. I will try to fix them.

How many Survived??

In [5]:
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot('Survived',data=data,ax=ax[1])
ax[1].set_title('Survived')
plt.show()

It is evident that not many passengers survived the accident.

Out of the 891 passengers in the training set, only 342 survived, i.e. only 38.4% of the training set survived the crash. We need to dig deeper to get better insights from the data and see which categories of passengers survived and which did not.

We will check the survival rate using the different features of the dataset, such as Sex, Port of Embarkation, Age, etc.

First let us understand the different types of features.

Types Of Features

Categorical Features:

A categorical variable is one that has two or more categories, and each value in the feature can be assigned to one of them. For example, gender is a categorical variable with two categories (male and female). We cannot sort or impose any ordering on such variables. They are also known as nominal variables.

Categorical Features in the dataset: Sex,Embarked.

Ordinal Features:

An ordinal variable is similar to a categorical variable, but the difference is that its values have a relative ordering or can be sorted. E.g. if we have a feature like Height with values Tall, Medium and Short, then Height is an ordinal variable, since the values have a relative order.

Ordinal Features in the dataset: PClass

Continuous Feature:

A feature is said to be continuous if it can take values between any two points, or between the minimum and maximum values in the feature's column.

Continuous Features in the dataset: Age
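
As a quick sanity check of this classification (a small sketch of my own, not part of the original notebook; it reuses the pandas import and the data DataFrame from the cells above), we can look at the dtype and the number of distinct values of each column:

# Object columns with few distinct values are natural categorical candidates,
# while numeric columns with many distinct values behave as continuous features
summary = pd.DataFrame({'dtype': data.dtypes, 'distinct_values': data.nunique()})
print(summary.sort_values('distinct_values'))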

Analysing The Features

Sex–> Categorical Feature

In [6]:
data.groupby(['Sex','Survived'])['Survived'].count()
Out[6]:
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
In [7]:
f,ax=plt.subplots(1,2,figsize=(18,8))
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot('Sex',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Sex:Survived vs Dead')
plt.show()

This looks interesting. The number of men on the ship is a lot higher than the number of women, yet the number of women saved is almost twice the number of men saved. The survival rate for women on the ship is around 75%, while for men it is around 18-19%.

This looks to be a very important feature for modeling. But is it the best? Let's check the other features.

Pclass –> Ordinal Feature

In [8]:
pd.crosstab(data.Pclass,data.Survived,margins=True).style.background_gradient(cmap='summer_r')
Out[8]:
Survived 0 1 All
Pclass
1 80 136 216
2 97 87 184
3 372 119 491
All 549 342 891
In [9]:
f,ax=plt.subplots(1,2,figsize=(18,8))
data['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'],ax=ax[0])
ax[0].set_title('Number Of Passengers By Pclass')
ax[0].set_ylabel('Count')
sns.countplot('Pclass',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()

People say money can't buy everything, but we can clearly see that passengers of Pclass 1 were given very high priority during the rescue. Even though the number of passengers in Pclass 3 was a lot higher, their survival rate is very low, somewhere around 25%.

For Pclass 1 the survival rate is around 63%, while for Pclass 2 it is around 48%. So money and status matter. Such a materialistic world.

Let's dive in a little bit more and check for other interesting observations. Let's check the survival rate with Sex and Pclass together.

In [10]:
pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')
Out[10]:
Pclass 1 2 3 All
Sex Survived
female 0 3 6 72 81
1 91 70 72 233
male 0 77 91 300 468
1 45 17 47 109
All 216 184 491 891
In [11]:
sns.factorplot('Pclass','Survived',hue='Sex',data=data)
plt.show()

We use a factorplot in this case because it makes the separation of categorical values easy.

Looking at the CrossTab and the FactorPlot, we can easily infer that survival for women from Pclass 1 is about 95-96%, as only 3 out of 94 women from Pclass 1 died.

It is evident that irrespective of Pclass, women were given first priority during the rescue. Even men from Pclass 1 have a very low survival rate.

Looks like Pclass is also an important feature. Let's analyse the other features.

Age–> Continuous Feature

In [12]:
print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')
Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years
In [13]:
f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot("Pclass","Age", hue="Survived", data=data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot("Sex","Age", hue="Survived", data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

Observations:

1)The number of children increases with Pclass, and the survival rate for passengers below age 10 (i.e. children) looks good irrespective of the Pclass.

2)Survival chances for passengers aged 20-50 from Pclass 1 are high and even better for women.

3)For males, the survival chances decrease with an increase in age.

As we had seen earlier, the Age feature has 177 null values. To replace these NaN values, we can assign them the mean age of the dataset.

But the problem is that there were people of many different ages. We just can't assign a 4-year-old kid the mean age of 29 years. Is there any way to find out what age band a passenger lies in?

Bingo! We can check the Name feature. Looking at the feature, we can see that the names have a salutation like Mr or Mrs. Thus we can assign the mean ages of the Mr and Mrs groups to the respective missing values.

”What’s In A Name??”—> Feature :p

In [14]:
data['Initial']=data.Name.str.extract('([A-Za-z]+)\.') #extract the salutation (the text before the dot) from the Name

Okay, so here we are using the regex [A-Za-z]+\. . It looks for strings of characters between A-Z or a-z followed by a .(dot), so we successfully extract the initials from the Name.

In [15]:
pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r') #Checking the Initials with the Sex
Out[15]:
Initial Capt Col Countess Don Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Rev Sir
Sex
female 0 0 1 0 1 0 1 0 0 182 2 1 0 125 1 0 0
male 1 2 0 1 6 1 0 2 40 0 0 0 517 0 0 6 1

Okay, so there are some initials like Mlle or Mme that stand for Miss. I will replace them with Miss, and do the same kind of mapping for the other rare values.

In [16]:
data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)
In [17]:
data.groupby('Initial')['Age'].mean() #lets check the average age by Initials
Out[17]:
Initial
Master     4.574167
Miss      21.860000
Mr        32.739609
Mrs       35.981818
Other     45.888889
Name: Age, dtype: float64

Filling NaN Ages

In [18]:
## Assigning the NaN Values with the Ceil values of the mean ages
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
In [19]:
data.Age.isnull().any() #So no null values left finally 
Out[19]:
False
In [20]:
f,ax=plt.subplots(1,2,figsize=(20,10))
data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

Observations:

1)The toddlers (age < 5) were saved in large numbers (the Women and Children First policy).

2)The oldest Passenger was saved(80 years).

3)Maximum number of deaths were in the age group of 30-40.

In [21]:
sns.factorplot('Pclass','Survived',col='Initial',data=data)
plt.show()

The Women and Children First policy thus holds true irrespective of the class.

Embarked–> Categorical Value

In [22]:
pd.crosstab([data.Embarked,data.Pclass],[data.Sex,data.Survived],margins=True).style.background_gradient(cmap='summer_r')
Out[22]:
Sex female male All
Survived 0 1 0 1
Embarked Pclass
C 1 1 42 25 17 85
2 0 7 8 2 17
3 8 15 33 10 66
Q 1 0 1 1 0 2
2 0 2 1 0 3
3 9 24 36 3 72
S 1 2 46 51 28 127
2 6 61 82 15 164
3 55 33 231 34 353
All 81 231 468 109 889

Chances for Survival by Port Of Embarkation

In [23]:
sns.factorplot('Embarked','Survived',data=data)
fig=plt.gcf()
fig.set_size_inches(5,3)
plt.show()

The chance of survival is highest for Port C, at around 0.55, while it is lowest for S.

In [24]:
f,ax=plt.subplots(2,2,figsize=(20,15))
sns.countplot('Embarked',data=data,ax=ax[0,0])
ax[0,0].set_title('No. Of Passengers Boarded')
sns.countplot('Embarked',hue='Sex',data=data,ax=ax[0,1])
ax[0,1].set_title('Male-Female Split for Embarked')
sns.countplot('Embarked',hue='Survived',data=data,ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')
sns.countplot('Embarked',hue='Pclass',data=data,ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')
plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()

Observations:

1)The maximum number of passengers boarded from S, the majority of them from Pclass 3.

2)The passengers from C look to be lucky, as a good proportion of them survived. The reason for this may be the rescue of most of the Pclass 1 and Pclass 2 passengers.

3)Port S looks to be the port from which the majority of the rich people boarded. Still, the chances of survival are low here, because around 81% of the Pclass 3 passengers who boarded there didn't survive.

4)At Port Q, almost 95% of the passengers were from Pclass 3.

In [25]:
sns.factorplot('Pclass','Survived',hue='Sex',col='Embarked',data=data)
plt.show()

Observations:

1)The survival chances are almost 1 for women from Pclass 1 and Pclass 2, irrespective of the port.

2)Port S looks to be very unlucky for Pclass 3 passengers, as the survival rate for both men and women is very low. (Money matters.)

3)Port Q looks to be the unluckiest for men, as almost all of them were from Pclass 3.

Filling Embarked NaN

As we saw, the maximum number of passengers boarded from Port S, so we replace the NaN values with S.

In [26]:
data['Embarked'].fillna('S',inplace=True)
In [27]:
data.Embarked.isnull().any()# Finally No NaN values
Out[27]:
False

SibSp–> Discrete Feature

This feature represents the number of siblings or spouses a passenger had aboard, i.e. whether a person was alone or with family members.

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife

In [28]:
pd.crosstab([data.SibSp],data.Survived).style.background_gradient(cmap='summer_r')
Out[28]:
Survived 0 1
SibSp
0 398 210
1 97 112
2 15 13
3 12 4
4 15 3
5 5 0
8 7 0
In [29]:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('SibSp','Survived',data=data,ax=ax[0])
ax[0].set_title('SibSp vs Survived')
sns.factorplot('SibSp','Survived',data=data,ax=ax[1])
ax[1].set_title('SibSp vs Survived')
plt.close(2)
plt.show()
In [30]:
pd.crosstab(data.SibSp,data.Pclass).style.background_gradient(cmap='summer_r')
Out[30]:
Pclass 1 2 3
SibSp
0 137 120 351
1 71 55 83
2 5 8 15
3 3 1 12
4 0 0 18
5 0 0 5
8 0 0 7

Observations:

The barplot and factorplot show that if a passenger is alone on board with no siblings, they have a 34.5% survival rate. The survival rate roughly decreases as the number of siblings increases. This makes sense: if I have a family on board, I will try to save them instead of saving myself first. Surprisingly, the survival rate for families with 5-8 members is 0%. Could the reason be Pclass?

The reason is indeed Pclass. The crosstab shows that passengers with SibSp > 3 were all in Pclass 3. It is evident that all the large families (SibSp > 3) in Pclass 3 died.

Parch

In [31]:
pd.crosstab(data.Parch,data.Pclass).style.background_gradient(cmap='summer_r')
Out[31]:
Pclass 1 2 3
Parch
0 163 134 381
1 31 32 55
2 21 16 43
3 0 2 3
4 1 0 3
5 0 0 5
6 0 0 1

The crosstab again shows that larger families were in Pclass3.

In [32]:
f,ax=plt.subplots(1,2,figsize=(20,8))
sns.barplot('Parch','Survived',data=data,ax=ax[0])
ax[0].set_title('Parch vs Survived')
sns.factorplot('Parch','Survived',data=data,ax=ax[1])
ax[1].set_title('Parch vs Survived')
plt.close(2)
plt.show()

Observations:

Here too the results are quite similar. Passengers with their parents or children on board have a greater chance of survival. It, however, reduces as the number goes up.

The chances of survival are good for somebody who has 1-3 parents or children on the ship. Being alone also proves to be fatal, and the chances of survival decrease when somebody has more than 4 parents or children on the ship.

Fare–> Continuous Feature

In [33]:
print('Highest Fare was:',data['Fare'].max())
print('Lowest Fare was:',data['Fare'].min())
print('Average Fare was:',data['Fare'].mean())
Highest Fare was: 512.3292
Lowest Fare was: 0.0
Average Fare was: 32.2042079685746

The lowest fare is 0.0. Wow, a free luxurious ride!

In [34]:
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.distplot(data[data['Pclass']==1].Fare,ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.distplot(data[data['Pclass']==2].Fare,ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.distplot(data[data['Pclass']==3].Fare,ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()

There looks to be a large spread in the fares of passengers in Pclass 1, and this spread keeps decreasing as the class standard goes down. As Fare is also continuous, we can convert it into discrete values by using binning.

Observations in a Nutshell for all features:

Sex: The chance of survival for women is high compared to men.

Pclass: There is a visible trend that being a 1st class passenger gives you better chances of survival. The survival rate for Pclass 3 is very low. For women, the chance of survival from Pclass 1 is almost 1, and it is high too for those from Pclass 2. Money wins!

Age: Children less than 5-10 years old have a high chance of survival. A large number of passengers in the 15 to 35 age group died.

Embarked: This is a very interesting feature. The chances of survival at C look better, even though the majority of Pclass 1 passengers boarded at S. Passengers at Q were almost all from Pclass 3.

Parch+SibSp: Having 1-2 siblings or a spouse on board, or 1-3 parents/children, shows a greater probability of survival than being alone or having a large family travelling with you.

Correlation Between The Features

In [35]:
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2) #data.corr()-->correlation matrix
fig=plt.gcf()
fig.set_size_inches(10,8)
plt.show()

Interpreting The Heatmap

The first thing to note is that only the numeric features are compared, as obviously we cannot compute correlations between strings. Before interpreting the plot, let us see what exactly correlation is.

POSITIVE CORRELATION: If an increase in feature A leads to increase in feature B, then they are positively correlated. A value 1 means perfect positive correlation.

NEGATIVE CORRELATION: If an increase in feature A leads to decrease in feature B, then they are negatively correlated. A value -1 means perfect negative correlation.

Now let's say that two features are highly or perfectly correlated, so an increase in one leads to an increase in the other. This means that both features contain highly similar information and there is very little or no additional variance in information. This is known as multicollinearity, as both of them contain almost the same information.

So should we use both of them, when one of them is redundant? While training models, we should try to eliminate redundant features, as this reduces training time, among other advantages.

Now from the above heatmap, we can see that the features are not much correlated. The highest correlation is between SibSp and Parch, i.e. 0.41. So we can carry on with all the features.
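
To double-check that reading of the heatmap programmatically, here is a small sketch (my own addition, not part of the original notebook; it reuses numpy, pandas and the data DataFrame from the cells above) that lists the most strongly correlated numeric feature pairs:

num = data.select_dtypes(include='number')  # keep only the numeric columns
corr = num.corr().abs()  # absolute pairwise Pearson correlations
mask = ~np.eye(len(corr), dtype=bool)  # drop the trivial self-correlations on the diagonal
pairs = corr.where(mask).unstack().dropna().sort_values(ascending=False)
print(pairs.head(4))  # each pair appears twice, as (A, B) and (B, A)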

Part2: Feature Engineering and Data Cleaning

Now what is Feature Engineering?

Whenever we are given a dataset with features, it is not necessary that all of the features will be important. There may be many redundant features which should be eliminated. Also, we can add new features by observing or extracting information from the existing ones.

An example would be getting the Initial feature from the Name feature. Let's see if we can get any new features and eliminate a few. We will also transform the existing relevant features into a form suitable for predictive modeling.

Age_band

Problem With Age Feature:

As I have mentioned earlier, Age is a continuous feature, and there is a problem with continuous variables in machine learning models.

E.g. if I ask you to group or arrange sportspersons by Sex, we can easily segregate them into male and female.

Now if I ask you to group them by their Age, how would you do it? If there are 30 persons, there may be 30 distinct age values. This is problematic.

We need to convert these continuous values into categorical values by either binning or normalisation. I will be using binning, i.e. grouping a range of ages into a single bin and assigning them a single value.

Okay, so the maximum age of a passenger was 80. Let's divide the range 0-80 into 5 bins, so 80/5 = 16, i.e. bins of size 16.

In [36]:
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head(2)
Out[36]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Initial Age_band
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C Mrs 2
In [37]:
data['Age_band'].value_counts().to_frame().style.background_gradient(cmap='summer')#checking the number of passenegers in each band
Out[37]:
Age_band
1 382
2 325
0 104
3 69
4 11
In [38]:
sns.factorplot('Age_band','Survived',data=data,col='Pclass')
plt.show()

True that: the survival rate decreases as the age increases, irrespective of the Pclass.

Family_Size and Alone

At this point, we can create two new features called Family_Size and Alone and analyse them. Family_Size is the sum of Parch and SibSp. It gives us combined data so that we can check whether the survival rate has anything to do with the family size of the passengers. Alone denotes whether a passenger is alone or not.

In [39]:
data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']#family size
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1#Alone

f,ax=plt.subplots(1,2,figsize=(18,6))
sns.factorplot('Family_Size','Survived',data=data,ax=ax[0])
ax[0].set_title('Family_Size vs Survived')
sns.factorplot('Alone','Survived',data=data,ax=ax[1])
ax[1].set_title('Alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()

Family_Size=0 means that the passenger is alone. Clearly, if you are alone or Family_Size=0, then the chances of survival are very low. For family sizes > 4, the chances decrease too. This also looks to be an important feature for the model. Let's examine this further.

In [40]:
sns.factorplot('Alone','Survived',data=data,hue='Sex',col='Pclass')
plt.show()

It is visible that being alone is harmful irrespective of Sex or Pclass, except for Pclass 3, where the chances for females who are alone are higher than for those with family.

Fare_Range

Since Fare is also a continuous feature, we need to convert it into an ordinal value. For this we will use pandas.qcut.

What qcut does is split the values into the number of bins we pass, based on quantiles. So if we ask for 5 bins, it will arrange the values into 5 separate bins or value ranges, each containing roughly the same number of passengers.

In [41]:
data['Fare_Range']=pd.qcut(data['Fare'],4)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')
Out[41]:
Survived
Fare_Range
(-0.001, 7.91] 0.197309
(7.91, 14.454] 0.303571
(14.454, 31.0] 0.454955
(31.0, 512.329] 0.581081

As discussed above, we can clearly see that as the Fare_Range increases, the chances of survival increase.

Now we cannot pass the Fare_Range values as they are. We should convert them into single integer values, the same as we did for Age_band.

In [42]:
data['Fare_cat']=0
data.loc[data['Fare']<=7.91,'Fare_cat']=0
data.loc[(data['Fare']>7.91)&(data['Fare']<=14.454),'Fare_cat']=1
data.loc[(data['Fare']>14.454)&(data['Fare']<=31),'Fare_cat']=2
data.loc[(data['Fare']>31)&(data['Fare']<=513),'Fare_cat']=3
In [43]:
sns.factorplot('Fare_cat','Survived',data=data,hue='Sex')
plt.show()

Clearly, as Fare_cat increases, the survival chances increase. This may become an important feature during modeling, along with Sex.

Converting String Values into Numeric

Since we cannot pass strings to a machine learning model, we need to convert features like Sex, Embarked, etc. into numeric values.

In [44]:
data['Sex'].replace(['male','female'],[0,1],inplace=True)
data['Embarked'].replace(['S','C','Q'],[0,1,2],inplace=True)
data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4],inplace=True)

Dropping Unneeded Features

Name–> We don’t need name feature as it cannot be converted into any categorical value.

Age–> We have the Age_band feature, so no need of this.

Ticket–> It is a random string that cannot be categorised.

Fare–> We have the Fare_cat feature, so unneeded

Cabin–> A lot of NaN values and also many passengers have multiple cabins. So this is a useless feature.

Fare_Range–> We have the fare_cat feature.

PassengerId–> Cannot be categorised.

In [45]:
data.drop(['Name','Age','Ticket','Fare','Cabin','Fare_Range','PassengerId'],axis=1,inplace=True)
sns.heatmap(data.corr(),annot=True,cmap='RdYlGn',linewidths=0.2,annot_kws={'size':20})
fig=plt.gcf()
fig.set_size_inches(18,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

Now, in the above correlation plot, we can see some positively related features, such as SibSp and Family_Size, and Parch and Family_Size, and some negative ones like Alone and Family_Size.

Part3: Predictive Modeling

We have gained some insights from the EDA part, but with that alone we cannot accurately predict whether a passenger will survive or die. So now we will predict whether a passenger survives or not using some great classification algorithms. Following are the algorithms I will use to build the model:

1)Logistic Regression

2)Support Vector Machines(Linear and radial)

3)Random Forest

4)K-Nearest Neighbours

5)Naive Bayes

6)Decision Tree

In [46]:
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix
In [47]:
train,test=train_test_split(data,test_size=0.3,random_state=0,stratify=data['Survived'])
train_X=train[train.columns[1:]]
train_Y=train[train.columns[:1]]
test_X=test[test.columns[1:]]
test_Y=test[test.columns[:1]]
X=data[data.columns[1:]]
Y=data['Survived']

Radial Support Vector Machines(rbf-SVM)

In [48]:
model=svm.SVC(kernel='rbf',C=1,gamma=0.1)
model.fit(train_X,train_Y)
prediction1=model.predict(test_X)
print('Accuracy for rbf SVM is ',metrics.accuracy_score(prediction1,test_Y))
Accuracy for rbf SVM is  0.835820895522

Linear Support Vector Machine(linear-SVM)

In [49]:
model=svm.SVC(kernel='linear',C=0.1,gamma=0.1)
model.fit(train_X,train_Y)
prediction2=model.predict(test_X)
print('Accuracy for linear SVM is',metrics.accuracy_score(prediction2,test_Y))
Accuracy for linear SVM is 0.817164179104

Logistic Regression

In [50]:
model = LogisticRegression()
model.fit(train_X,train_Y)
prediction3=model.predict(test_X)
print('The accuracy of the Logistic Regression is',metrics.accuracy_score(prediction3,test_Y))
The accuracy of the Logistic Regression is 0.817164179104

Decision Tree

In [51]:
model=DecisionTreeClassifier()
model.fit(train_X,train_Y)
prediction4=model.predict(test_X)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction4,test_Y))
The accuracy of the Decision Tree is 0.80223880597

K-Nearest Neighbours(KNN)

In [52]:
model=KNeighborsClassifier() 
model.fit(train_X,train_Y)
prediction5=model.predict(test_X)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction5,test_Y))
The accuracy of the KNN is 0.832089552239

Now, the accuracy of the KNN model changes as we change the value of the n_neighbors parameter. The default value is 5. Let's check the accuracies over various values of n_neighbors.

In [53]:
a_index=list(range(1,11))
a=pd.Series()
x=[0,1,2,3,4,5,6,7,8,9,10]
for i in list(range(1,11)):
    model=KNeighborsClassifier(n_neighbors=i) 
    model.fit(train_X,train_Y)
    prediction=model.predict(test_X)
    a=a.append(pd.Series(metrics.accuracy_score(prediction,test_Y)))
plt.plot(a_index, a)
plt.xticks(x)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()
print('Accuracies for different values of n are:',a.values,'with the max value as ',a.values.max())
Accuracies for different values of n are: [ 0.75746269  0.79104478  0.80970149  0.80223881  0.83208955  0.81716418
  0.82835821  0.83208955  0.8358209   0.83208955] with the max value as  0.835820895522

Gaussian Naive Bayes

In [54]:
model=GaussianNB()
model.fit(train_X,train_Y)
prediction6=model.predict(test_X)
print('The accuracy of the NaiveBayes is',metrics.accuracy_score(prediction6,test_Y))
The accuracy of the NaiveBayes is 0.813432835821

Random Forests

In [55]:
model=RandomForestClassifier(n_estimators=100)
model.fit(train_X,train_Y)
prediction7=model.predict(test_X)
print('The accuracy of the Random Forests is',metrics.accuracy_score(prediction7,test_Y))
The accuracy of the Random Forests is 0.813432835821

The accuracy of a model is not the only factor that determines the robustness of the classifier. Let's say that a classifier is trained on some training data, tested on the test data, and scores an accuracy of 90%.

Now this seems to be a very good accuracy for a classifier, but can we confirm that it will be 90% for every new test set that comes along? The answer is no, because we can't determine which instances the classifier used to train itself. As the training and testing data change, the accuracy also changes; it may increase or decrease. This is known as model variance.

To overcome this and get a generalized model, we use cross validation.

Cross Validation

Many times the data is imbalanced, i.e. there may be a high number of class 1 instances but fewer instances of the other classes. Thus we should train and test our algorithm on every part of the dataset, and then take the average of all the noted accuracies.

1)K-Fold Cross Validation works by first dividing the dataset into k subsets.

2)Let's say we divide the dataset into (k=5) parts. We reserve 1 part for testing and train the algorithm on the other 4 parts.

3)We continue the process by changing the testing part in each iteration and training the algorithm on the other parts. The accuracies and errors are then averaged to get the average accuracy of the algorithm.

This is called K-Fold Cross Validation.

4)An algorithm may underfit for one training set and overfit for another. Thus, with cross-validation, we can achieve a more generalised model.

In [56]:
from sklearn.model_selection import KFold #for K-fold cross validation
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
kfold = KFold(n_splits=10, random_state=22) # k=10, split the data into 10 equal parts
xyz=[]
accuracy=[]
std=[]
classifiers=['Linear Svm','Radial Svm','Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']
models=[svm.SVC(kernel='linear'),svm.SVC(kernel='rbf'),LogisticRegression(),KNeighborsClassifier(n_neighbors=9),DecisionTreeClassifier(),GaussianNB(),RandomForestClassifier(n_estimators=100)]
for i in models:
    model = i
    cv_result = cross_val_score(model,X,Y, cv = kfold,scoring = "accuracy")
    cv_result=cv_result
    xyz.append(cv_result.mean())
    std.append(cv_result.std())
    accuracy.append(cv_result)
new_models_dataframe2=pd.DataFrame({'CV Mean':xyz,'Std':std},index=classifiers)       
new_models_dataframe2
Out[56]:
CV Mean Std
Linear Svm 0.793471 0.047797
Radial Svm 0.828290 0.034427
Logistic Regression 0.805843 0.021861
KNN 0.813783 0.041210
Decision Tree 0.810375 0.033901
Naive Bayes 0.801386 0.028999
Random Forest 0.813720 0.032062
In [57]:
plt.subplots(figsize=(12,6))
box=pd.DataFrame(accuracy,index=[classifiers])
box.T.boxplot()
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4215880e48>
In [58]:
new_models_dataframe2['CV Mean'].plot.barh(width=0.8)
plt.title('Average CV Mean Accuracy')
fig=plt.gcf()
fig.set_size_inches(8,5)
plt.show()

Classification accuracy can sometimes be misleading due to class imbalance. We can get a summarized result with the help of a confusion matrix, which shows where the model went wrong, i.e. which classes the model predicted incorrectly.

Confusion Matrix

It gives the number of correct and incorrect classifications made by the classifier.

In [59]:
f,ax=plt.subplots(3,3,figsize=(12,10))
y_pred = cross_val_predict(svm.SVC(kernel='rbf'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,0],annot=True,fmt='2.0f')
ax[0,0].set_title('Matrix for rbf-SVM')
y_pred = cross_val_predict(svm.SVC(kernel='linear'),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,1],annot=True,fmt='2.0f')
ax[0,1].set_title('Matrix for Linear-SVM')
y_pred = cross_val_predict(KNeighborsClassifier(n_neighbors=9),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[0,2],annot=True,fmt='2.0f')
ax[0,2].set_title('Matrix for KNN')
y_pred = cross_val_predict(RandomForestClassifier(n_estimators=100),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,0],annot=True,fmt='2.0f')
ax[1,0].set_title('Matrix for Random-Forests')
y_pred = cross_val_predict(LogisticRegression(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,1],annot=True,fmt='2.0f')
ax[1,1].set_title('Matrix for Logistic Regression')
y_pred = cross_val_predict(DecisionTreeClassifier(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[1,2],annot=True,fmt='2.0f')
ax[1,2].set_title('Matrix for Decision Tree')
y_pred = cross_val_predict(GaussianNB(),X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,y_pred),ax=ax[2,0],annot=True,fmt='2.0f')
ax[2,0].set_title('Matrix for Naive Bayes')
plt.subplots_adjust(hspace=0.2,wspace=0.2)
plt.show()

Interpreting Confusion Matrix

The main diagonal shows the number of correct predictions made for each class, while the off-diagonal cells show the number of wrong predictions. Let's consider the first plot, for rbf-SVM:

1)The number of correct predictions is 491 (for dead) + 247 (for survived), with the mean CV accuracy being (491+247)/891 = 82.8%, which we did get earlier.

2)Errors–> It wrongly classified 58 dead passengers as survived and 95 survived passengers as dead. Thus it made more mistakes when predicting survived passengers as dead.

By looking at all the matrices, we can say that rbf-SVM has a higher chance of correctly predicting dead passengers, while NaiveBayes has a higher chance of correctly predicting passengers who survived.
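
The same numbers can be read directly off the matrix in code. The snippet below is an illustration of my own (not part of the original notebook); it reuses svm, cross_val_predict, confusion_matrix, X and Y from the cells above:

y_pred = cross_val_predict(svm.SVC(kernel='rbf'), X, Y, cv=10)
cm = confusion_matrix(Y, y_pred)  # rows = actual class, columns = predicted class
correct = cm[0, 0] + cm[1, 1]  # dead predicted dead + survived predicted survived
print('accuracy:', correct / cm.sum())
print('dead predicted as survived:', cm[0, 1])
print('survived predicted as dead:', cm[1, 0])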

Hyper-Parameters Tuning

Machine learning models are like a black box. This black box has some default parameter values, which we can tune or change to get a better model. Parameters like C and gamma in the SVM model, and similarly different parameters for different classifiers, are called hyper-parameters; we can tune them to change the behaviour of the algorithm and get a better model. This is known as hyper-parameter tuning.

We will tune the hyper-parameters for the 2 best classifiers, i.e. the SVM and Random Forests.

SVM

In [60]:
from sklearn.model_selection import GridSearchCV
C=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
gamma=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
kernel=['rbf','linear']
hyper={'kernel':kernel,'C':C,'gamma':gamma}
gd=GridSearchCV(estimator=svm.SVC(),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 3 folds for each of 240 candidates, totalling 720 fits
0.828282828283
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed:   14.4s finished

Random Forests

In [61]:
n_estimators=range(100,1000,100)
hyper={'n_estimators':n_estimators}
gd=GridSearchCV(estimator=RandomForestClassifier(random_state=0),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   21.0s finished
0.817059483726
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=900, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

The best score for rbf-SVM is about 82.8% with C=0.5 and gamma=0.1. For Random Forest, the score is about 81.7% with n_estimators=900.

Ensembling

Ensembling is a good way to increase the accuracy or performance of a model. In simple words, it is the combination of various simple models to create a single powerful model.

Let's say we want to buy a phone and ask many people about it, each judging it on various parameters. We can then make a strong judgement about the product after combining all the different opinions. This is ensembling, and it improves the stability of a model. Ensembling can be done in ways like:

1)Voting Classifier

2)Bagging

3)Boosting.

Voting Classifier

It is the simplest way of combining predictions from many different simple machine learning models. It gives an average prediction result based on the predictions of all the submodels. The submodels or base models are all of different types.

In [62]:
from sklearn.ensemble import VotingClassifier
ensemble_lin_rbf=VotingClassifier(estimators=[('KNN',KNeighborsClassifier(n_neighbors=10)),
                                              ('RBF',svm.SVC(probability=True,kernel='rbf',C=0.5,gamma=0.1)),
                                              ('RFor',RandomForestClassifier(n_estimators=500,random_state=0)),
                                              ('LR',LogisticRegression(C=0.05)),
                                              ('DT',DecisionTreeClassifier(random_state=0)),
                                              ('NB',GaussianNB()),
                                              ('svm',svm.SVC(kernel='linear',probability=True))
                                             ], 
                       voting='soft').fit(train_X,train_Y)
print('The accuracy for ensembled model is:',ensemble_lin_rbf.score(test_X,test_Y))
cross=cross_val_score(ensemble_lin_rbf,X,Y, cv = 10,scoring = "accuracy")
print('The cross validated score is',cross.mean())
The accuracy for ensembled model is: 0.820895522388
The cross validated score is 0.823766031097

Bagging

Bagging is a general ensemble method. It works by training similar classifiers on random bootstrap samples of the dataset and then taking the average of all the predictions. Due to the averaging, there is a reduction in variance. Unlike the Voting Classifier, bagging makes use of similar classifiers.

Bagged KNN

Bagging works best with models that have high variance, such as a Decision Tree or Random Forests. We can also use KNN with a small value of n_neighbors, since a small value of n_neighbors gives a high-variance model.

In [63]:
from sklearn.ensemble import BaggingClassifier
model=BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=3),random_state=0,n_estimators=700)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged KNN is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged KNN is:',result.mean())
The accuracy for bagged KNN is: 0.835820895522
The cross validated score for bagged KNN is: 0.814889342867

Bagged DecisionTree

In [64]:
model=BaggingClassifier(base_estimator=DecisionTreeClassifier(),random_state=0,n_estimators=100)
model.fit(train_X,train_Y)
prediction=model.predict(test_X)
print('The accuracy for bagged Decision Tree is:',metrics.accuracy_score(prediction,test_Y))
result=cross_val_score(model,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for bagged Decision Tree is:',result.mean())
The accuracy for bagged Decision Tree is: 0.824626865672
The cross validated score for bagged Decision Tree is: 0.820482635342

Boosting

Boosting is an ensembling technique which uses sequential learning of classifiers. It is a step-by-step enhancement of a weak model. Boosting works as follows:

A model is first trained on the complete dataset. The model will get some instances right and some wrong. In the next iteration, the learner will focus more on the wrongly predicted instances, giving them more weight, and will thus try to predict those instances correctly. This iterative process continues, and new classifiers are added to the model, until the limit on the number of estimators or the desired accuracy is reached.

AdaBoost(Adaptive Boosting)

The weak learner or estimator in this case is a Decision Tree. But we can change the default base_estimator to any algorithm of our choice.

In [65]:
from sklearn.ensemble import AdaBoostClassifier
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.1)
result=cross_val_score(ada,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for AdaBoost is:',result.mean())
The cross validated score for AdaBoost is: 0.824952616048

Stochastic Gradient Boosting

Here too the weak learner is a Decision Tree.

In [66]:
from sklearn.ensemble import GradientBoostingClassifier
grad=GradientBoostingClassifier(n_estimators=500,random_state=0,learning_rate=0.1)
result=cross_val_score(grad,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for Gradient Boosting is:',result.mean())
The cross validated score for Gradient Boosting is: 0.818286233118

XGBoost

In [67]:
import xgboost as xg
xgboost=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
result=cross_val_score(xgboost,X,Y,cv=10,scoring='accuracy')
print('The cross validated score for XGBoost is:',result.mean())
The cross validated score for XGBoost is: 0.810471002156

We got the highest accuracy for AdaBoost. We will try to increase it with hyper-parameter tuning.

Hyper-Parameter Tuning for AdaBoost

In [68]:
n_estimators=list(range(100,1100,100))
learn_rate=[0.05,0.1,0.2,0.3,0.25,0.4,0.5,0.6,0.7,0.8,0.9,1]
hyper={'n_estimators':n_estimators,'learning_rate':learn_rate}
gd=GridSearchCV(estimator=AdaBoostClassifier(),param_grid=hyper,verbose=True)
gd.fit(X,Y)
print(gd.best_score_)
print(gd.best_estimator_)
Fitting 3 folds for each of 120 candidates, totalling 360 fits
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:  5.5min finished
0.83164983165
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=0.05, n_estimators=200, random_state=None)

The maximum accuracy we can get with AdaBoost is 83.16% with n_estimators=200 and learning_rate=0.05

Confusion Matrix for the Best Model

In [69]:
ada=AdaBoostClassifier(n_estimators=200,random_state=0,learning_rate=0.05)
result=cross_val_predict(ada,X,Y,cv=10)
sns.heatmap(confusion_matrix(Y,result),cmap='winter',annot=True,fmt='2.0f')
plt.show()

Feature Importance

In [70]:
f,ax=plt.subplots(2,2,figsize=(15,12))
model=RandomForestClassifier(n_estimators=500,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,0])
ax[0,0].set_title('Feature Importance in Random Forests')
model=AdaBoostClassifier(n_estimators=200,learning_rate=0.05,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[0,1],color='#ddff11')
ax[0,1].set_title('Feature Importance in AdaBoost')
model=GradientBoostingClassifier(n_estimators=500,learning_rate=0.1,random_state=0)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,0],cmap='RdYlGn_r')
ax[1,0].set_title('Feature Importance in Gradient Boosting')
model=xg.XGBClassifier(n_estimators=900,learning_rate=0.1)
model.fit(X,Y)
pd.Series(model.feature_importances_,X.columns).sort_values(ascending=True).plot.barh(width=0.8,ax=ax[1,1],color='#FD0F00')
ax[1,1].set_title('Feature Importance in XgBoost')
plt.show()

We can see the important features for various classifiers like Random Forests, AdaBoost, etc.

Observations:

1)Some of the common important features are Initial, Fare_cat, Pclass and Family_Size.

2)The Sex feature doesn't seem to be given any importance, which is surprising, as we had seen earlier that Sex combined with Pclass was a very good differentiating factor. Sex looks to be important only in Random Forests.

However, we can see that the feature Initial is at the top for many classifiers. We had already seen the strong relationship between Sex and Initial, so they both refer to the gender.

3)Similarly, Pclass and Fare_cat refer to the status of the passengers, and Family_Size overlaps with Alone, Parch and SibSp.

What is the best love story you can come up with in two sentences?

Answer by Dhawal Barot:

Here I go,

Story 1 –

To make her blush, he pulled out her hair stick, saying, "You look beautiful in open hair."

And she blushed, her cheeks flushed crimson with love, again!

Story 2 –

"What is that?", he said as he pointed to an empty wall. And with a peek on her cheek, he kissed her.

And she just fell for the same trick. Again. And twice in a row!!

Story 3 –

“Cigarette or kiss? You’ve to choose today!” She snarled as she demanded.

From that moment on, he found a new addiction to life. Her lips.!!

Story 4 –

He was like the president of anti-photogenic club.

But after the first couple of photographs with her, he installed Instagram.

Story 5 –

She prayed.

And, he prayed for her prayers.

Story 6 –

Once again, she ordered pizza and thrashed it into the dustbin.

After all, she had requested for the special delivery guy whom she had a huge crush on. Not the pizza.

Story 7 –

He hated coffee. But, just ordered another for both of them anyways.

Deep down he knew, it was to buy more time to be with her.

Story 8 –

In the literature of life, she wrote poems.

But, he wrote her.

Story 9 –

They broke up.

But she kept giving respect to every name similar to his.

Story 10 –

Both exchanged books.

The roses inside each book giggled.

Story 11 –

Somewhere, between “I need you” to “I want you”;

She realized that she started loving him, desperately!

Story 12 –

She clicked “Unblock” and squeezed “Add friend”.

Love – 1 Misunderstanding – 0

Good Job!

Terence Fletcher: I told you that story about how Charlie Parker became Charlie Parker, right?

Andrew Neiman: Yup, Jo Jones threw a cymbal at his head.

Terence Fletcher: Exactly. Parker’s a young kid, pretty good on the sax. Gets up to play at a cutting session… and he fucks it up. And Jones nearly decapitates him for it. And he’s laughed off-stage. Cries himself to sleep that night but the next morning, what does he do? He practices. And he practices and he practices with one goal in mind: Never to be laughed at again. And a year later, he goes back to the Reno… And he steps up on that stage and he plays the best motherfucking solo the world has ever heard. (beat) So imagine if Jones had just said: “Well, that’s okay Charlie. Eh… that was alright. Good job.” Then Charlie thinks to himself, “Well, shit. I did do a pretty good job.” End of story, no “Bird.” That, to me, is… an absolute tragedy. But that’s just what the world wants now! People wonder why jazz is dying. (beat) I’ll tell you man. And every Starbucks “jazz” album just proves my point, really. There are no two words in the English language more harmful… Than “good job.”

 

Which are the most clichéd scenes in Indian movies?

Answer by Jitender S Bhatia:

Cliches~

  • Rocky – Spoiled brat.
  • Kishan – Humble sweet guy.
  • Rosy – Vamp.
  • Radha – Saari clad temple bound girl.
  • Ramu Kaka – Faithful family servant.
  • Rich – Arrogant.
  • Poor – Pious.
  • Police Officer – Corrupt or Ultra Upright.
  • Police Constable – Joker. Goofy.
  • Judge – Order Order. Hammer.
  • Lawyer – Drama
  • Businessman – Cigar, Suit
  • Employee – Tie
  • Govt. Employee – File
  • Doctor – Appears only at operation theatre entrance.
  • Nurse – Crisp White. Tray. Injection.
  • Patient – Over Bandaged.
  • College – Picnic spot.
  • Professor – Abnormal.
  • Mother – Teary
  • Father – Stiff
  • Mother-in-Law – Pious or Danger
  • Hero – Hero
  • Heroine – Dances.
  • Hero’s sister – Docile.
  • Villain – Constipated look.
  • Milkman/Driver – Abnormal.
  • Land Line Phone – Very Loud.

Enjoy~

How would you react if you were stuck in an elevator with Chetan Bhagat?

Answer by Jitender S Bhatia:

Me: Hi, no security?

Chetan: Ah no.. i am just a writer.

Me: Writer? I have never read anything you wrote though.

Chetan: Really? Not even my newspaper articles?

Me: No.. but good that politicians are writing.

Chetan: I am just a writer – no politician.

Me: You look different on TV though.

Chetan: Really? Dude.. do you even recognise me?

Me: Of-course. You are Rahul Gandhi.

Chetan: <Long Silence> I am Chetan Bhagat.

Me: <Awkward Silence – serious face>

The Lift halts at a random floor.

A foreign lady gets in.

Lady: Ooooooh! Shaitaan… Shaitaan Bucket !

Chetan: <Silence – looks at me sideways>

Me: <Trying to correct her> Actually he is…

Lady: <interrupts> Oh so he is not Shaitaan Bucket? Good. Junk writer anyway hehe.

Chetan: <Awkward Silence>

Me: <looks at the floor – innocent face – suppressed laughter>

The Lift halts. Doors open

Chetan rushes out.

Lady: Who is he?

Me: He is indeed The Chetan Bhagat. Yes.

Lady: OMG.. <rushes out>

I see Chetan running with The Lady after him shouting: Hey Shaitaan Shaitaan!

I collapse with laughter on the lift floor as the doors close.

What is the best Facebook post that you’ve ever seen?

Answer by Raunak Singhi:

This is a joke I read on Facebook and pasted in another Quora answer; I don't remember the source:

It was the first day of a school in USA and a new Indian student named Chandrasekhar Subramanian entered the fourth grade.

The teacher said, "Let's begin by reviewing some American History. Who said 'Give me Liberty, or give me Death'?"

She saw a sea of blank faces except for Chandrasekhar, who had his hand up: 'Patrick Henry, 1775,' he said.

'Very good! Who said 'Government of the People, by the People, for the People, shall not perish from the Earth?''

Again, no response except from Chandrasekhar. 'Abraham Lincoln, 1863' said Chandrasekhar.

The teacher snapped at the class, 'Class, you should be ashamed. Chandrasekhar, who is new to our country, knows more about our history than you do.'

She heard a loud whisper: 'F ___ the Indians'

'Who said that?' she demanded. Chandrasekhar put his hand up. 'General Custer, 1862.'

At that point, a student in the back said, 'I'm gonna puke.'

The teacher glares around and asks 'All right! Now, who said that?' Again, Chandrasekhar says, 'George H. W. Bush to the Japanese Prime Minister, 1991.'

Now furious, another student yells, 'Oh yeah? Suck this!'

Chandrasekhar jumps out of his chair waving his hand and shouts to the teacher, 'Bill Clinton, to Monica Lewinsky, 1997.'

Now with almost mob hysteria someone said 'You little shit. If you say anything else, I'll kill you.' Chandrasekhar frantically yells at the top of his voice, 'Michael Jackson to the child witnesses testifying against him, 2004.'

The teacher fainted. And as the class gathered around the teacher on the floor, someone said, 'Oh shit, we're screwed!' And Chandrasekhar said quietly, 'I think it was Lehman Brothers, September 15th, 2008'.

Why are some Indians so furious about the BBC documentary ‘India’s Daughter’? Why did the government of India ban this documentary film?

Answer by Suchi Dey:

I think I can tell you exactly why! But first, let me tell you a story…

A couple of days back, I hired an Uber cab to go to a mall in Calcutta to meet some friends. The journey was about 40 minutes long and I was travelling alone. About 20 minutes into my journey, the driver asked me, "Madam, would you mind giving me a 5 star rating for this trip?"
 I said, "No, I don't mind. I will. But why do you ask suddenly?"

He replied with a sad, long face, "Madam, a few days back two lady passengers gave me extremely poor rating, dropping my rating to 2 stars."
When I asked why, this is what he said..
"At around 11.30 PM, I picked up two lady passengers from Quest mall. They were both extremely drunk. One was falling over the other. They sat in the car and started talking about their personal stuff aloud. They were talking about things that made me uncomfortable as a man. But that was still okay. Then they opened up cans of beer and started smoking too. I warned them.. "unko bola, yeh sab nahi chalega gaadi me" but they did not listen.

At one point, it was enough for me. I asked them to stop immediately or I would call Uber office and get their accounts blocked. Then they got angry and started calling me idiot, stupid and what not. They cursed me in english too. (In his thick Bihari accent it sounded funny to an extent, but I saw him weigh his every single word and it made drop dead sense)

Then I asked him, "what did you do then?"
He said, "I made myself to drop them till their home. I did not want to. But still, I did."

I'm sure you get the point of the story. That was my taste of a featured interview with an Indian man.

You ask why India is furious?? Here's why –

On International Women's day, the entire world is going to see a highly skewed picture of India and Indian men. Our men are not male chauvinists. They are supportive to women in equal measure. They are rational, responsible, sensible, protective and sensitive. We all know it, we all live and laugh with them. India's daughters' don't give a fuck about what a rapist has to say about men, women and culture. They just want him to be hanged!

Thanks for A2A.
—————————————————————————
EDIT:
Thanks all of you for sharing your responses and opinions.
This is a very sensitive issue – the issue of rape. Hence, I will cut short my response within limits of reasoning and statistics, not allowing for any emotional bias.

Okay, so you all are justified in all of your concerns. I have them too.

I said the documentary presents a "highly skewed" picture of India and Indian mentality perpetuating rapes.
Some people said, "No! Indian men and their mentality alone are majorly responsible for rapes."

I have nothing but some statistics to highlight here.

Guess which country is the rape capital of the world? It's one of our favourite countries. I'll tell you: it's the USA.

"In India, a country of over 1.2 billion people, 24,206 rapes were reported in 2011.  The same year in the United States, a nation of 300 million, 83,425 rapes were reported. In the United States, every 6.2 minutes a woman is raped."

Source: www.more.com, India not ‘rape capital of the world’

Moreover, India with its sick mentality and chauvinist men seems to take the issue of rape pretty seriously. Here you go…

"According to the Guardian, just 7% of reported rapes in the U.K. resulted in convictions during 2011-12. In Sweden, the conviction rate is as low as 10%. France had a conviction rate of 25% in 2006. Poor India, a developing nation with countless challenges, managed an impressive 24.2% conviction rate in 2012. That’s thanks to the efforts of a lot of good people — police, lawyers, victims and their families — working heroically with limited resources."

Source: TIME MAGAZINE ARTICLE: Why Rape Seems Worse in India Than Everywhere Else (but Actually Isn’t) | TIME.com

India features nowhere in the list of the top ten countries with the highest rates (per 100,000) of rape. This statistic gives a clearer picture than the total number of reported rapes, as the total grows with the population.

Source: http://www.statisticbrain.com/ra…

I would rather like to see few more documentaries along with this one, namely "USA's daughters" and of course "UK's daughters".

Finally, about the ban.
Well, when my nation's image, and that of half its population, is dragged and stretched to fit stereotypes, I am certainly not okay with it.
BBC airs its shows on radio, TV and other mass reaching media outlets. Some of our countrymen unfortunately are not educated enough to debate and discuss like us. They might take it in a way which is unhealthy, hatred-prone and anti-social. Why take that risk when frankly the video gave us nothing new, except a rapist narrating his story as the voice of Indian mindset.
I support the ban. That's my opinion. You all are free.

This would be my only edit as I see no point in stretching a sensitive issue beyond a respectable point.
