Pivot Tables Are Not Just An Excel Thing
If you’re a frequent Excel user, then you’ve had to make a pivot table or 10 in your day. It’s a quick and convenient way to slice data and identify key trends and remains to this day one of the key selling points of Excel (and the bane of junior analysts throughout corporate America).
When people unfamiliar with Python ask me what Pandas is, I respond that at a high level it’s like SQL and Excel all in one (with a lot more functionality). So given that it includes many of the best parts of Excel, it’s only natural that Pandas has its own implementation of Excel’s famous pivot table. Let’s see how it works and how it can help us analyze things quickly.
Getting Some Data
We’re going to work with the Titanic dataset (downloaded from Kaggle). I’ve downloaded the training section of it to a csv file and loaded it into a dataframe:
import numpy as np
import pandas as pdtitanic = pd.read_csv('train.csv')
Let’s make sure there are no null values before we begin building our pivot tables:
We have a few nulls so let’s get rid of them. Normally, we would think harder about how to treat nulls if we were trying to build a robust model. But today we’re just here to demonstrate how to build and use pivot tables with Pandas so we will be more cavalier about dropping data. We will drop the Cabin column altogether, and any other rows that contain non-Cabin nulls. The first line drops the Cabin column. The second line finds any other rows that contain nulls and drops them.
titanic.drop(axis=1, labels=['Cabin'], inplace=True)
titanic.dropna(axis=0, how='any', inplace=True)
The result is our dataframe going from 891 rows to 712. That’s not too bad. Both the fare and age columns have a lot of distinct values so we should bin them. We can take advantage of Pandas’ cut function (and the quantile method) to do this. The following block of code bins fare into 5 bins based on their quintiles (quintile 1 is the lowest fare, quintile 5 is the highest):
fare_bin_edges = [0,
max(titanic['Fare'])+1]titanic['fare_bin'] = pd.cut(titanic['Fare'],
And this block of code bins age (I manually assigned the bin edges). Note that the bins are ranges, but I named each bin using just the greater bin edge. For example, the first bin includes everyone who is between the ages of 0 and 10 years of age, but I just call it 10.
age_bin_edges = [0,10,20,30,40,50,60,70,80,1000]titanic['age_bin'] = pd.cut(titanic['Age'],
Cool, time to make some pivot tables.
Pivot Tabling The Titanic
To get our bearings, let’s first figure out the count of passengers by age group and gender. We can use the pivot_table function to calculate this:
In:pivot1 = pd.pivot_table(titanic, values=['PassengerId'],
Sex female male
10 31.0 33.0
20 46.0 69.0
30 81.0 149.0
40 54.0 100.0
50 31.0 55.0
60 14.0 28.0
70 2.0 14.0
80 NaN 5.0
For those unfamiliar with pivot tables, it’s basically a table where each cell is a filtered count (another way to think of it is as a 2 or more-dimensional groupby). For example, the value of 31 corresponds to age_bin=10 and gender=female — in other words, there were 31 female passengers aged 0 to 10 years old. So it helps us quickly eyeball the cross-sectional composition of our data across several features of interest. For example, from the previous pivot table (pivot1), we can see that most of the passengers in the Titanic training data were between the ages of 20 and 40 years old, and that there were more male passengers than female ones across all of our age bins. Before we make more tables, let’s quickly go over the arguments to the pivot_table function:
- The first argument is just the dataframe containing our data.
- The second one, values, is the one we want to apply the aggregating function on. For example, in this case we want to count the number of passengers (via PassengerId).
- The third argument, index, is the feature (or features) that we group by on the row axis. So the row labels of our table will be the unique values of this feature. In pivot1, it is age_bin.
- The fourth one, columns, is the feature (or features) that we group by on the column axis. In pivot1, it is gender.
- Finally, aggfunc is the function we apply to values. Here we apply count, but we can also apply sum, mean, min, max, or others.
Who Survived The Titanic?
Now let’s use a pivot table to answer the question: who was most likely to survive the Titanic? From the movie (I know I really did my historical research right?), I recall women and children being reserved spaces on the precious and scarce lifeboats. And according to the layout of most cruise ships, the premium cabins are higher up in the boat — an advantageous place to be when the ship is rapidly filling with water. So let’s make a pivot table where we group by age_bin along the row axis, and gender and passenger class along the column axis. Since the Survived feature of our data is in 0s and 1s (0 if the person did not survive, 1 and if they did), taking the mean of it is a convenient way of calculating the percentage of people that survived. That is:
percentage that survived
Now let’s code up our pivot table (we round the output to 2 decimal places to make it easier to interpret):
pivot2 = pd.pivot_table(titanic, values=['Survived'],
And here is pivot2. Each cell is the percentage of survivors within the subgroup (specified by age_bin, sex/gender, and Pclass). For example, take the top right-most cell — it states that 36% of male passengers aged between 0 and 10 years old and belonging to Pclass 3 survived.
It’s pretty clear what’s going on:
- If you were a female passenger who could afford a premium ticket (Pclass 1 or 2), chances are you survived the disaster.
- For female passengers in Pclass 3, it became a coin flip, regardless of your age surprisingly.
- Unless you were a child (10 or younger), male passengers were generally out of luck. Though Pclass 1 male passengers still were far more likely to survive than their less fortunate brethren.
- Also surprising is the massive disparity between survival rates of male children across the 3 Pclass levels. That might also have to do with the fact that there were 22 male children in Pclass 3 vs. just 11 total in the other 2 classes)
The impact of gender and sadly class becomes even more apparent if we simplify our pivot table by removing age (and shift Pclass to the row axis):
pivot3 = pd.pivot_table(titanic, values=['Survived'],
The differences are stark: females in Pclass 1 and Pclass 2 were likely to survive and males in Pclass 2 and Pclass 3 were likely to perish.
Thanks to our pivot table, we now have the basis of a simple model (logistic regression or a decision tree perhaps?). The truth is that we don’t always need a complicated machine learning model to figure out what’s going on (sometimes using too complicated of a model actually clouds our ability to see the truth). Many times, sound logic and some simple cross-tabulations (along with some heat maps or scatter plots) will get us most of the way there. So the next time you analyze your data, throw up a few pivot tables first. What you find might surprise you. Cheers and stay safe everyone! : Tony Yiu
Data Science @Solovis, Doing my Best to Explain Data Science and Finance in Plain English. Follow my publication at: https://medium.com/alpha-beta-blog
for future queries you can connect with him on linkedin.com/in/tony-yiu-7705097