# bring in the pandas!
import pandas as pd
Introduction
Having code that is clean, readable and logically ordered is invaluable. I discovered Structured Query Language (SQL) before Python, and as the name suggests, SQL already pushes you down this structured, logical road. I have only recently started to explore Python, and my experience so far is that code can quickly become scattered and difficult to follow, particularly during the exploratory data analysis (EDA) phase.
I have just finished actively watching Effective Pandas by Matt Harrison and decided to share the content via this blog. The video feels like a bit of a breakthrough for me, someone who is just starting out in the world of data science, and hopefully others will also benefit from reading this. Adopting the chaining style covered in this blog, wherever possible, should ensure that your code is cleaner and reads like a recipe of ordered steps, reducing potential ambiguity.
Initial set up
# check which version of pandas we're on
pd.__version__
'1.5.0'
# control the pandas display features
pd.options.display.min_rows = 20
Dataset
The dataset we will be exploring comes from https://www.fueleconomy.gov/feg/download.shtml, the official U.S. government source for fuel economy information. The zipped csv file can be downloaded from there, but we can read it in directly with pandas:
# read in our dataset
autos = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip')
/tmp/ipykernel_2753/3884461791.py:2: DtypeWarning: Columns (68,70,71,72,73,74,76,79) have mixed types. Specify dtype option on import or set low_memory=False.
autos = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip')
# Let's take a look
autos
barrels08 | barrelsA08 | charge120 | charge240 | city08 | city08U | cityA08 | cityA08U | cityCD | cityE | ... | mfrCode | c240Dscr | charge240b | c240bDscr | createdOn | modifiedOn | startStop | phevCity | phevHwy | phevComb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 15.695714 | 0.0 | 0.0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
1 | 29.964545 | 0.0 | 0.0 | 0.0 | 9 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
2 | 12.207778 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
3 | 29.964545 | 0.0 | 0.0 | 0.0 | 10 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
4 | 17.347895 | 0.0 | 0.0 | 0.0 | 17 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
5 | 14.982273 | 0.0 | 0.0 | 0.0 | 21 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
6 | 13.184400 | 0.0 | 0.0 | 0.0 | 22 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
7 | 13.733750 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
8 | 12.677308 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
9 | 13.184400 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41134 | 16.480500 | 0.0 | 0.0 | 0.0 | 18 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41135 | 12.677308 | 0.0 | 0.0 | 0.0 | 23 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41136 | 13.733750 | 0.0 | 0.0 | 0.0 | 21 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41137 | 11.771786 | 0.0 | 0.0 | 0.0 | 24 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41138 | 13.184400 | 0.0 | 0.0 | 0.0 | 21 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41139 | 14.982273 | 0.0 | 0.0 | 0.0 | 19 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41140 | 14.330870 | 0.0 | 0.0 | 0.0 | 20 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41141 | 15.695714 | 0.0 | 0.0 | 0.0 | 18 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41142 | 15.695714 | 0.0 | 0.0 | 0.0 | 18 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41143 | 18.311667 | 0.0 | 0.0 | 0.0 | 16 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.0 | NaN | Tue Jan 01 00:00:00 EST 2013 | Tue Jan 01 00:00:00 EST 2013 | NaN | 0 | 0 | 0 |
41144 rows × 83 columns
So our dataset includes 41,144 rows and 83 columns - that’s a lot of data! Let’s have a look at the columns:
autos.columns
Index(['barrels08', 'barrelsA08', 'charge120', 'charge240', 'city08',
'city08U', 'cityA08', 'cityA08U', 'cityCD', 'cityE', 'cityUF', 'co2',
'co2A', 'co2TailpipeAGpm', 'co2TailpipeGpm', 'comb08', 'comb08U',
'combA08', 'combA08U', 'combE', 'combinedCD', 'combinedUF', 'cylinders',
'displ', 'drive', 'engId', 'eng_dscr', 'feScore', 'fuelCost08',
'fuelCostA08', 'fuelType', 'fuelType1', 'ghgScore', 'ghgScoreA',
'highway08', 'highway08U', 'highwayA08', 'highwayA08U', 'highwayCD',
'highwayE', 'highwayUF', 'hlv', 'hpv', 'id', 'lv2', 'lv4', 'make',
'model', 'mpgData', 'phevBlended', 'pv2', 'pv4', 'range', 'rangeCity',
'rangeCityA', 'rangeHwy', 'rangeHwyA', 'trany', 'UCity', 'UCityA',
'UHighway', 'UHighwayA', 'VClass', 'year', 'youSaveSpend', 'guzzler',
'trans_dscr', 'tCharger', 'sCharger', 'atvType', 'fuelType2', 'rangeA',
'evMotor', 'mfrCode', 'c240Dscr', 'charge240b', 'c240bDscr',
'createdOn', 'modifiedOn', 'startStop', 'phevCity', 'phevHwy',
'phevComb'],
dtype='object')
Data Types
Let’s focus on a subset of the data: 14 of the 83 columns. We will also find out what types of data they hold, because getting the types right enables correct and efficient analysis.
# Let's drill down and focus on just 14 of the 83 columns
cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr',
        'fuelCost08', 'make', 'model', 'trany', 'range', 'createdOn', 'year']
# Let's see the data types for each column
autos[cols].dtypes
city08 int64
comb08 int64
highway08 int64
cylinders float64
displ float64
drive object
eng_dscr object
fuelCost08 int64
make object
model object
trany object
range int64
createdOn object
year int64
dtype: object
# Let's see how much memory is being used by column
autos[cols].memory_usage(deep=True)
Index 128
city08 329152
comb08 329152
highway08 329152
cylinders 329152
displ 329152
drive 3028369
eng_dscr 2135693
fuelCost08 329152
make 2606267
model 2813134
trany 2933276
range 329152
createdOn 3497240
year 329152
dtype: int64
# Let's see how much memory is being used in total
autos[cols].memory_usage(deep=True).sum()
19647323
Integers (int)
Integers (int) are whole numbers, with no decimal point. Let’s grab some summary statistics for our integer columns:
# summary stats for integer columns
autos[cols].select_dtypes(int).describe()
city08 | comb08 | highway08 | fuelCost08 | range | year | |
---|---|---|---|---|---|---|
count | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 |
mean | 18.369045 | 20.616396 | 24.504667 | 2362.335942 | 0.793506 | 2001.535266 |
std | 7.905886 | 7.674535 | 7.730364 | 654.981925 | 13.041592 | 11.142414 |
min | 6.000000 | 7.000000 | 9.000000 | 500.000000 | 0.000000 | 1984.000000 |
25% | 15.000000 | 17.000000 | 20.000000 | 1900.000000 | 0.000000 | 1991.000000 |
50% | 17.000000 | 20.000000 | 24.000000 | 2350.000000 | 0.000000 | 2002.000000 |
75% | 20.000000 | 23.000000 | 28.000000 | 2700.000000 | 0.000000 | 2011.000000 |
max | 150.000000 | 136.000000 | 124.000000 | 7400.000000 | 370.000000 | 2020.000000 |
Chaining
The above code works, but this style quickly becomes cluttered and unreadable as more steps pile up. A better way is to lean on SQL-style best practice, so that our code reads like a recipe of ordered steps:
# use chaining to grab summary stats for integer columns
(autos
 [cols]
 .select_dtypes(int)
 .describe()
)
city08 | comb08 | highway08 | fuelCost08 | range | year | |
---|---|---|---|---|---|---|
count | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 |
mean | 18.369045 | 20.616396 | 24.504667 | 2362.335942 | 0.793506 | 2001.535266 |
std | 7.905886 | 7.674535 | 7.730364 | 654.981925 | 13.041592 | 11.142414 |
min | 6.000000 | 7.000000 | 9.000000 | 500.000000 | 0.000000 | 1984.000000 |
25% | 15.000000 | 17.000000 | 20.000000 | 1900.000000 | 0.000000 | 1991.000000 |
50% | 17.000000 | 20.000000 | 24.000000 | 2350.000000 | 0.000000 | 2002.000000 |
75% | 20.000000 | 23.000000 | 28.000000 | 2700.000000 | 0.000000 | 2011.000000 |
max | 150.000000 | 136.000000 | 124.000000 | 7400.000000 | 370.000000 | 2020.000000 |
Same result, much more readable code! Chaining is also known as ‘flow programming’: rather than creating intermediate variables, we leverage the fact that most pandas operations return a new object that can be operated on directly.
Note: if an operation doesn’t fit naturally into a chain, pandas’ .pipe method lets us slot any function into the flow.
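As a taster, here is a minimal sketch of how .pipe slots an arbitrary function into a chain. The DataFrame, the `top_n_by` helper and its parameters are all made up for illustration, not from the video:

```python
import pandas as pd

# hypothetical helper: not a DataFrame method, so it can't be chained directly
def top_n_by(df, column, n=3):
    """Return the n rows with the largest values in `column`."""
    return df.nlargest(n, column)

df = pd.DataFrame({'make': ['A', 'B', 'C', 'D'],
                   'mpg': [21, 35, 28, 19]})

result = (df
 .assign(mpg_km=lambda d: d.mpg * 0.425144)  # rough miles/gal -> km/litre
 .pipe(top_n_by, 'mpg', n=2)                 # .pipe passes df in as the first argument
)
```

.pipe simply calls top_n_by(df, 'mpg', n=2) and returns its result, so the chain keeps flowing without an intermediate variable.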
Saving Space
import numpy as np
The int columns are currently stored as int64. Let’s try to free up some space by representing our data more memory-efficiently. We can use NumPy to help with this:
# Can comb08 column be int8?
np.iinfo(np.int8)
iinfo(min=-128, max=127, dtype=int8)
An int8 can represent 256 distinct values (2^8), covering the range -128 to 127. Our comb08 column has a maximum value of 136, so we can’t convert it to int8 without losing information. We can, however, convert the highway08 column (max 124) to int8. Let’s try int16:
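To make that check mechanical, here is a small sketch that compares a column’s min and max against a candidate dtype’s limits. The `fits` helper is my own, not a pandas or NumPy function:

```python
import numpy as np
import pandas as pd

def fits(series, dtype):
    """Check whether every value in an integer series fits inside dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

# mimic the comb08 column's observed min (7) and max (136)
comb08 = pd.Series([7, 20, 136])

can_be_int8 = fits(comb08, np.int8)    # False: 136 > 127, int8 would overflow
can_be_int16 = fits(comb08, np.int16)  # True: comfortably inside -32768..32767
```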
# Can comb08 column be int16?
np.iinfo(np.int16)
iinfo(min=-32768, max=32767, dtype=int16)
All of our data is within this range, so we can go ahead and convert all int64 columns to int16 (and int8 for highway08):
# convert from int64 to int16 and int8 in order to free up some memory
# also obtain summary statistics for integer columns
(autos
 [cols]
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16'})
 .select_dtypes(['integer'])
 .describe()
)
city08 | comb08 | highway08 | fuelCost08 | range | year | |
---|---|---|---|---|---|---|
count | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 |
mean | 18.369045 | 20.616396 | 24.504667 | 2362.335942 | 0.793506 | 2001.535266 |
std | 7.905886 | 7.674535 | 7.730364 | 654.981925 | 13.041592 | 11.142414 |
min | 6.000000 | 7.000000 | 9.000000 | 500.000000 | 0.000000 | 1984.000000 |
25% | 15.000000 | 17.000000 | 20.000000 | 1900.000000 | 0.000000 | 1991.000000 |
50% | 17.000000 | 20.000000 | 24.000000 | 2350.000000 | 0.000000 | 2002.000000 |
75% | 20.000000 | 23.000000 | 28.000000 | 2700.000000 | 0.000000 | 2011.000000 |
max | 150.000000 | 136.000000 | 124.000000 | 7400.000000 | 370.000000 | 2020.000000 |
Let’s see if we have saved any space by converting:
# check memory usage
(autos
 [cols]
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16'})
 .memory_usage(deep=True)
 .sum()  # was 19,647,323
)
18124995
So a saving, but not substantial - just under 8%. Let’s see if we can improve on this:
Floats
A floating point number (known as a float) has a decimal point, even if the fractional part is zero, for example: 1.13, 2.0, 1234.345. If a column contains both integers and floating point numbers (or missing values), pandas assigns the entire column the float data type so that no information is lost.
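A quick sketch of this upcasting behaviour on toy Series:

```python
import pandas as pd

ints_only = pd.Series([1, 2, 3])     # all integers: stays int64
mixed = pd.Series([1, 2.5, 3])       # one float upcasts the whole column to float64
with_nan = pd.Series([1, 2, None])   # NaN can't live in an int64 column, so float64
```

mixed and with_nan both come out as float64; this is exactly why cylinders, a column with missing values, shows up as float64 below.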
# Let's take a look at the columns with a float data type
(autos
 [cols]
 .select_dtypes('float')
)
cylinders | displ | |
---|---|---|
0 | 4.0 | 2.0 |
1 | 12.0 | 4.9 |
2 | 4.0 | 2.2 |
3 | 8.0 | 5.2 |
4 | 4.0 | 2.2 |
5 | 4.0 | 1.8 |
6 | 4.0 | 1.8 |
7 | 4.0 | 1.6 |
8 | 4.0 | 1.6 |
9 | 4.0 | 1.8 |
... | ... | ... |
41134 | 4.0 | 2.1 |
41135 | 4.0 | 1.9 |
41136 | 4.0 | 1.9 |
41137 | 4.0 | 1.9 |
41138 | 4.0 | 1.9 |
41139 | 4.0 | 2.2 |
41140 | 4.0 | 2.2 |
41141 | 4.0 | 2.2 |
41142 | 4.0 | 2.2 |
41143 | 4.0 | 2.2 |
41144 rows × 2 columns
cylinders looks int-like - we would expect the number of cylinders to be an integer, not a float (decimal).
# summary stats for cylinders
(autos
.cylinders
 .describe()
)
count 40938.000000
mean 5.717084
std 1.755517
min 2.000000
25% 4.000000
50% 6.000000
75% 6.000000
max 16.000000
Name: cylinders, dtype: float64
Oops, we have missing values - count = 40,938 but we have 41,144 rows.
# Let's count the various values for cylinders
(autos
 .cylinders
 .value_counts(dropna=False)
)
4.0 15938
6.0 14284
8.0 8801
5.0 771
12.0 626
3.0 279
NaN 206
10.0 170
2.0 59
16.0 10
Name: cylinders, dtype: int64
As anticipated, we have missing values (206), represented by NaN.
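A handy companion to value_counts(dropna=False) is .isna().sum(), which counts missing values per column. A tiny sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cylinders': [4.0, np.nan, 8.0, np.nan],
                   'displ': [2.0, 1.6, np.nan, 5.2]})

missing = df.isna().sum()  # Series mapping column name -> count of NaNs
```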
## where are they missing? We can use .query
(autos
 [cols]
 .query('cylinders.isna()')
)
city08 | comb08 | highway08 | cylinders | displ | drive | eng_dscr | fuelCost08 | make | model | trany | range | createdOn | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7138 | 81 | 85 | 91 | NaN | NaN | NaN | NaN | 800 | Nissan | Altra EV | NaN | 90 | Tue Jan 01 00:00:00 EST 2013 | 2000 |
7139 | 81 | 72 | 64 | NaN | NaN | 2-Wheel Drive | NaN | 900 | Toyota | RAV4 EV | NaN | 88 | Tue Jan 01 00:00:00 EST 2013 | 2000 |
8143 | 81 | 72 | 64 | NaN | NaN | 2-Wheel Drive | NaN | 900 | Toyota | RAV4 EV | NaN | 88 | Tue Jan 01 00:00:00 EST 2013 | 2001 |
8144 | 74 | 65 | 58 | NaN | NaN | NaN | NaN | 1000 | Ford | Th!nk | NaN | 29 | Tue Jan 01 00:00:00 EST 2013 | 2001 |
8146 | 45 | 39 | 33 | NaN | NaN | 2-Wheel Drive | NaN | 1700 | Ford | Explorer USPS Electric | NaN | 38 | Tue Jan 01 00:00:00 EST 2013 | 2001 |
8147 | 84 | 75 | 66 | NaN | NaN | NaN | NaN | 900 | Nissan | Hyper-Mini | NaN | 33 | Tue Jan 01 00:00:00 EST 2013 | 2001 |
9212 | 87 | 78 | 69 | NaN | NaN | 2-Wheel Drive | NaN | 850 | Toyota | RAV4 EV | NaN | 95 | Tue Jan 01 00:00:00 EST 2013 | 2002 |
9213 | 45 | 39 | 33 | NaN | NaN | 2-Wheel Drive | NaN | 1700 | Ford | Explorer USPS Electric | NaN | 38 | Tue Jan 01 00:00:00 EST 2013 | 2002 |
10329 | 87 | 78 | 69 | NaN | NaN | 2-Wheel Drive | NaN | 850 | Toyota | RAV4 EV | NaN | 95 | Tue Jan 01 00:00:00 EST 2013 | 2003 |
21413 | 22 | 24 | 28 | NaN | NaN | 4-Wheel Drive | NaN | 1750 | Subaru | RX Turbo | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1985 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34407 | 73 | 72 | 71 | NaN | NaN | Front-Wheel Drive | NaN | 900 | BYD | e6 | Automatic (A1) | 187 | Wed Mar 13 00:00:00 EDT 2019 | 2019 |
34408 | 118 | 108 | 97 | NaN | NaN | Front-Wheel Drive | NaN | 600 | Nissan | Leaf (62 kW-hr battery pack) | Automatic (A1) | 226 | Wed Mar 13 00:00:00 EDT 2019 | 2019 |
34409 | 114 | 104 | 94 | NaN | NaN | Front-Wheel Drive | NaN | 650 | Nissan | Leaf SV/SL (62 kW-hr battery pack) | Automatic (A1) | 215 | Wed Mar 13 00:00:00 EDT 2019 | 2019 |
34538 | 74 | 74 | 73 | NaN | NaN | All-Wheel Drive | NaN | 900 | Audi | e-tron | Automatic (A1) | 204 | Tue Apr 16 00:00:00 EDT 2019 | 2019 |
34561 | 80 | 76 | 72 | NaN | NaN | 4-Wheel Drive | NaN | 850 | Jaguar | I-Pace | Automatic (A1) | 234 | Thu May 02 00:00:00 EDT 2019 | 2020 |
34563 | 138 | 131 | 124 | NaN | NaN | Rear-Wheel Drive | NaN | 500 | Tesla | Model 3 Standard Range | Automatic (A1) | 220 | Thu May 02 00:00:00 EDT 2019 | 2019 |
34564 | 140 | 133 | 124 | NaN | NaN | Rear-Wheel Drive | NaN | 500 | Tesla | Model 3 Standard Range Plus | Automatic (A1) | 240 | Thu May 02 00:00:00 EDT 2019 | 2019 |
34565 | 115 | 111 | 107 | NaN | NaN | All-Wheel Drive | NaN | 600 | Tesla | Model S Long Range | Automatic (A1) | 370 | Thu May 02 00:00:00 EDT 2019 | 2019 |
34566 | 104 | 104 | 104 | NaN | NaN | All-Wheel Drive | NaN | 650 | Tesla | Model S Performance (19" Wheels) | Automatic (A1) | 345 | Thu May 02 00:00:00 EDT 2019 | 2019 |
34567 | 98 | 97 | 96 | NaN | NaN | All-Wheel Drive | NaN | 700 | Tesla | Model S Performance (21" Wheels) | Automatic (A1) | 325 | Thu May 02 00:00:00 EDT 2019 | 2019 |
206 rows × 14 columns
## chaining - add cylinders and displ columns replacing NaN with 0
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0))
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16'})
 .describe()
)
city08 | comb08 | highway08 | cylinders | displ | fuelCost08 | range | year | |
---|---|---|---|---|---|---|---|---|
count | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 | 41144.000000 |
mean | 18.369045 | 20.616396 | 24.504667 | 5.688460 | 3.277904 | 2362.335942 | 0.793506 | 2001.535266 |
std | 7.905886 | 7.674535 | 7.730364 | 1.797009 | 1.373415 | 654.981925 | 13.041592 | 11.142414 |
min | 6.000000 | 7.000000 | 9.000000 | 0.000000 | 0.000000 | 500.000000 | 0.000000 | 1984.000000 |
25% | 15.000000 | 17.000000 | 20.000000 | 4.000000 | 2.200000 | 1900.000000 | 0.000000 | 1991.000000 |
50% | 17.000000 | 20.000000 | 24.000000 | 6.000000 | 3.000000 | 2350.000000 | 0.000000 | 2002.000000 |
75% | 20.000000 | 23.000000 | 28.000000 | 6.000000 | 4.300000 | 2700.000000 | 0.000000 | 2011.000000 |
max | 150.000000 | 136.000000 | 124.000000 | 16.000000 | 8.400000 | 7400.000000 | 370.000000 | 2020.000000 |
# use this to inspect float sizes
np.finfo(np.float16)
finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)
## chaining - as above, but also downcast displ to float16
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'))
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16'})
)
city08 | comb08 | highway08 | cylinders | displ | drive | eng_dscr | fuelCost08 | make | model | trany | range | createdOn | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19 | 21 | 25 | 4 | 2.000000 | Rear-Wheel Drive | (FFS) | 2000 | Alfa Romeo | Spider Veloce 2000 | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1985 |
1 | 9 | 11 | 14 | 12 | 4.898438 | Rear-Wheel Drive | (GUZZLER) | 3850 | Ferrari | Testarossa | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1985 |
2 | 23 | 27 | 33 | 4 | 2.199219 | Front-Wheel Drive | (FFS) | 1550 | Dodge | Charger | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1985 |
3 | 10 | 11 | 12 | 8 | 5.199219 | Rear-Wheel Drive | NaN | 3850 | Dodge | B150/B250 Wagon 2WD | Automatic 3-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1985 |
4 | 17 | 19 | 23 | 4 | 2.199219 | 4-Wheel or All-Wheel Drive | (FFS,TRBO) | 2700 | Subaru | Legacy AWD Turbo | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
5 | 21 | 22 | 24 | 4 | 1.799805 | Front-Wheel Drive | (FFS) | 1900 | Subaru | Loyale | Automatic 3-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
6 | 22 | 25 | 29 | 4 | 1.799805 | Front-Wheel Drive | (FFS) | 1700 | Subaru | Loyale | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
7 | 23 | 24 | 26 | 4 | 1.599609 | Front-Wheel Drive | (FFS) | 1750 | Toyota | Corolla | Automatic 3-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
8 | 23 | 26 | 31 | 4 | 1.599609 | Front-Wheel Drive | (FFS) | 1600 | Toyota | Corolla | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
9 | 23 | 25 | 30 | 4 | 1.799805 | Front-Wheel Drive | (FFS) | 1700 | Toyota | Corolla | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41134 | 18 | 20 | 24 | 4 | 2.099609 | Front-Wheel Drive | (FFS) | 2100 | Saab | 900 | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41135 | 23 | 26 | 33 | 4 | 1.900391 | Front-Wheel Drive | (TBI) (FFS) | 1600 | Saturn | SL | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41136 | 21 | 24 | 30 | 4 | 1.900391 | Front-Wheel Drive | (MFI) (FFS) | 1750 | Saturn | SL | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41137 | 24 | 28 | 33 | 4 | 1.900391 | Front-Wheel Drive | (TBI) (FFS) | 1500 | Saturn | SL | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41138 | 21 | 25 | 32 | 4 | 1.900391 | Front-Wheel Drive | (MFI) (FFS) | 1700 | Saturn | SL | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41139 | 19 | 22 | 26 | 4 | 2.199219 | Front-Wheel Drive | (FFS) | 1900 | Subaru | Legacy | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41140 | 20 | 23 | 28 | 4 | 2.199219 | Front-Wheel Drive | (FFS) | 1850 | Subaru | Legacy | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41141 | 18 | 21 | 24 | 4 | 2.199219 | 4-Wheel or All-Wheel Drive | (FFS) | 2000 | Subaru | Legacy AWD | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41142 | 18 | 21 | 24 | 4 | 2.199219 | 4-Wheel or All-Wheel Drive | (FFS) | 2000 | Subaru | Legacy AWD | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41143 | 16 | 18 | 21 | 4 | 2.199219 | 4-Wheel or All-Wheel Drive | (FFS,TRBO) | 2900 | Subaru | Legacy AWD Turbo | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1993 |
41144 rows × 14 columns
# new memory usage
(autos
 .loc[:, cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'))
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16'})
 .memory_usage(deep=True)
 .sum()  # was 19,647,323
)
17590123
A further reduction.
Objects
In pandas, object is the catch-all dtype, and in practice it usually means columns of strings. Because every entry is stored as a full Python object rather than a fixed-width value, object columns are typically the most memory-hungry, as the per-column memory figures above showed.
# let's take a look at our object columns
(autos
 [cols]
 .select_dtypes(object)
)
drive | eng_dscr | make | model | trany | createdOn | |
---|---|---|---|---|---|---|
0 | Rear-Wheel Drive | (FFS) | Alfa Romeo | Spider Veloce 2000 | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
1 | Rear-Wheel Drive | (GUZZLER) | Ferrari | Testarossa | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
2 | Front-Wheel Drive | (FFS) | Dodge | Charger | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
3 | Rear-Wheel Drive | NaN | Dodge | B150/B250 Wagon 2WD | Automatic 3-spd | Tue Jan 01 00:00:00 EST 2013 |
4 | 4-Wheel or All-Wheel Drive | (FFS,TRBO) | Subaru | Legacy AWD Turbo | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
5 | Front-Wheel Drive | (FFS) | Subaru | Loyale | Automatic 3-spd | Tue Jan 01 00:00:00 EST 2013 |
6 | Front-Wheel Drive | (FFS) | Subaru | Loyale | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
7 | Front-Wheel Drive | (FFS) | Toyota | Corolla | Automatic 3-spd | Tue Jan 01 00:00:00 EST 2013 |
8 | Front-Wheel Drive | (FFS) | Toyota | Corolla | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
9 | Front-Wheel Drive | (FFS) | Toyota | Corolla | Automatic 4-spd | Tue Jan 01 00:00:00 EST 2013 |
... | ... | ... | ... | ... | ... | ... |
41134 | Front-Wheel Drive | (FFS) | Saab | 900 | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
41135 | Front-Wheel Drive | (TBI) (FFS) | Saturn | SL | Automatic 4-spd | Tue Jan 01 00:00:00 EST 2013 |
41136 | Front-Wheel Drive | (MFI) (FFS) | Saturn | SL | Automatic 4-spd | Tue Jan 01 00:00:00 EST 2013 |
41137 | Front-Wheel Drive | (TBI) (FFS) | Saturn | SL | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
41138 | Front-Wheel Drive | (MFI) (FFS) | Saturn | SL | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
41139 | Front-Wheel Drive | (FFS) | Subaru | Legacy | Automatic 4-spd | Tue Jan 01 00:00:00 EST 2013 |
41140 | Front-Wheel Drive | (FFS) | Subaru | Legacy | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
41141 | 4-Wheel or All-Wheel Drive | (FFS) | Subaru | Legacy AWD | Automatic 4-spd | Tue Jan 01 00:00:00 EST 2013 |
41142 | 4-Wheel or All-Wheel Drive | (FFS) | Subaru | Legacy AWD | Manual 5-spd | Tue Jan 01 00:00:00 EST 2013 |
41143 | 4-Wheel or All-Wheel Drive | (FFS,TRBO) | Subaru | Legacy AWD Turbo | Automatic 4-spd | Tue Jan 01 00:00:00 EST 2013 |
41144 rows × 6 columns
## drive looks categorical
(autos
 .drive
 .value_counts(dropna=False)
)
Front-Wheel Drive 14236
Rear-Wheel Drive 13831
4-Wheel or All-Wheel Drive 6648
All-Wheel Drive 3015
4-Wheel Drive 1460
NaN 1189
2-Wheel Drive 507
Part-time 4-Wheel Drive 258
Name: drive, dtype: int64
# Where are the missing values NaN ?
(autos
 [cols]
 .query('drive.isna()')
)
city08 | comb08 | highway08 | cylinders | displ | drive | eng_dscr | fuelCost08 | make | model | trany | range | createdOn | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7138 | 81 | 85 | 91 | NaN | NaN | NaN | NaN | 800 | Nissan | Altra EV | NaN | 90 | Tue Jan 01 00:00:00 EST 2013 | 2000 |
8144 | 74 | 65 | 58 | NaN | NaN | NaN | NaN | 1000 | Ford | Th!nk | NaN | 29 | Tue Jan 01 00:00:00 EST 2013 | 2001 |
8147 | 84 | 75 | 66 | NaN | NaN | NaN | NaN | 900 | Nissan | Hyper-Mini | NaN | 33 | Tue Jan 01 00:00:00 EST 2013 | 2001 |
18217 | 18 | 21 | 25 | 4.0 | 2.0 | NaN | (FFS) | 2000 | Alfa Romeo | Spider Veloce 2000 | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
18218 | 20 | 22 | 26 | 4.0 | 1.5 | NaN | (FFS) | 1900 | Bertone | X1/9 | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
18219 | 13 | 15 | 20 | 8.0 | 5.7 | NaN | (350 V8) (FFS) | 2800 | Chevrolet | Corvette | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
18220 | 13 | 15 | 20 | 8.0 | 5.7 | NaN | (350 V8) (FFS) | 2800 | Chevrolet | Corvette | Manual 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
18221 | 15 | 17 | 20 | 6.0 | 3.0 | NaN | (FFS,TRBO) | 2500 | Nissan | 300ZX | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
18222 | 16 | 18 | 20 | 6.0 | 3.0 | NaN | (FFS) | 2350 | Nissan | 300ZX | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
18223 | 16 | 18 | 22 | 6.0 | 3.0 | NaN | (FFS,TRBO) | 2350 | Nissan | 300ZX | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20063 | 13 | 15 | 19 | 8.0 | 5.0 | NaN | (FFS) CA model | 2800 | Mercury | Grand Marquis Wagon | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
20064 | 13 | 15 | 20 | 8.0 | 5.0 | NaN | (GM-OLDS) CA model | 2800 | Oldsmobile | Custom Cruiser Wagon | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
20065 | 14 | 16 | 19 | 8.0 | 5.0 | NaN | (GM-CHEV) CA model | 2650 | Pontiac | Parisienne Wagon | Automatic 4-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
20387 | 14 | 14 | 15 | 4.0 | 2.4 | NaN | (FFS) CA model | 3000 | Nissan | Pickup Cab Chassis | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 1984 |
21129 | 14 | 16 | 21 | 8.0 | 3.5 | NaN | GUZZLER FFS,TURBO | 3250 | Lotus | Esprit V8 | Manual 5-spd | 0 | Tue Jan 01 00:00:00 EST 2013 | 2002 |
23029 | 79 | 85 | 94 | NaN | NaN | NaN | Lead Acid | 800 | GMC | EV1 | Automatic (A1) | 55 | Tue Jan 01 00:00:00 EST 2013 | 1999 |
23030 | 35 | 37 | 39 | NaN | NaN | NaN | NiMH | 1750 | GMC | EV1 | Automatic (A1) | 105 | Tue Jan 01 00:00:00 EST 2013 | 1999 |
23032 | 49 | 48 | 46 | NaN | NaN | NaN | NaN | 1400 | Honda | EV Plus | Automatic (A1) | 81 | Tue Jan 01 00:00:00 EST 2013 | 1999 |
23037 | 49 | 48 | 46 | NaN | NaN | NaN | NaN | 1400 | Honda | EV Plus | Automatic (A1) | 81 | Tue Jan 01 00:00:00 EST 2013 | 1998 |
23040 | 102 | 98 | 94 | NaN | NaN | NaN | NaN | 650 | MINI | MiniE | Automatic (A1) | 100 | Tue Jan 01 00:00:00 EST 2013 | 2008 |
1189 rows × 14 columns
# let's look at the drive column, grouped by year
(autos
 [cols]
 .groupby('year')
 .drive
 .nunique()
)
year
1984 3
1985 4
1986 4
1987 3
1988 3
1989 3
1990 3
1991 3
1992 3
1993 3
1994 3
1995 4
1996 3
1997 4
1998 4
1999 4
2000 4
2001 4
2002 4
2003 4
2004 4
2005 4
2006 4
2007 4
2008 3
2009 4
2010 6
2011 5
2012 5
2013 5
2014 5
2015 5
2016 5
2017 5
2018 5
2019 5
2020 5
Name: drive, dtype: int64
# let's convert drive to category, replacing NaN with 'Other' using .assign and .astype
# and convert make to category, updating the .astype dictionary
# and check our memory usage
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'),
         drive=autos.drive.fillna('Other').astype('category'))
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16',
          'make': 'category'})
 .memory_usage(deep=True)
 .sum()  # was 19,647,323
)
12093275
As we can see, converting to category has freed up a lot of space: from 17,590,123 bytes down to 12,093,275, a reduction of just over 30%.
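The saving comes from how 'category' stores data: each distinct label is kept once, and the rows hold small integer codes. A sketch with a made-up low-cardinality column:

```python
import pandas as pd

# a repetitive string column, like drive: many rows, few distinct values
drive = pd.Series(['Front-Wheel Drive', 'Rear-Wheel Drive'] * 10_000)

as_object = drive.memory_usage(deep=True)                       # every row stores a Python string
as_category = drive.astype('category').memory_usage(deep=True)  # 2 labels plus small codes
```

The category version should be a small fraction of the object version; the payoff shrinks as the number of distinct values grows.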
# Let's inspect trany
# looks like it has two pieces of info embedded in the column
(autos
 .trany
 .value_counts(dropna=False)
)
Automatic 4-spd 11047
Manual 5-spd 8361
Automatic 3-spd 3151
Automatic (S6) 3106
Manual 6-spd 2757
Automatic 5-spd 2203
Automatic (S8) 1665
Automatic 6-spd 1619
Manual 4-spd 1483
Automatic (S5) 833
Automatic (variable gear ratios) 826
Automatic 7-spd 724
Automatic 8-spd 433
Automatic (AM-S7) 424
Automatic (S7) 327
Automatic 9-spd 293
Automatic (AM7) 245
Automatic (S4) 233
Automatic (AV-S6) 208
Automatic (A1) 201
Automatic (AM6) 151
Automatic (AV-S7) 139
Automatic (S10) 124
Automatic (AM-S6) 116
Manual 7-spd 114
Automatic (S9) 86
Manual 3-spd 77
Automatic (AM-S8) 60
Automatic (AV-S8) 47
Automatic 10-spd 25
Manual 4-spd Doubled 17
Automatic (AM5) 14
NaN 11
Automatic (AV-S10) 11
Automatic (AM8) 6
Automatic (AM-S9) 3
Automatic (L3) 2
Automatic (L4) 2
Name: trany, dtype: int64
11 NaN values
The information from the trany column seems to have two components:
- Automatic vs Manual
- Speed
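Before wiring this into the big chain, here is how the two string methods behave in isolation, on a few made-up trany values:

```python
import pandas as pd

trany = pd.Series(['Automatic 4-spd', 'Manual 5-spd', 'Automatic (S6)'])

automatic = trany.str.contains('Auto')              # boolean: is the row automatic?
speeds = trany.str.extract(r'(\d+)', expand=False)  # first run of digits, as strings
```

expand=False makes .str.extract return a Series rather than a one-column DataFrame, which is what we want when assigning a new column.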
# add new column for automatic using .str.contains
# add new column for speeds using .str.extract
# drop existing trany column
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'),
         drive=autos.drive.fillna('Other').astype('category'),
         automatic=autos.trany.str.contains('Auto'),
         speeds=autos.trany.str.extract(r'(\d+)', expand=False).fillna('20').astype('int8'))
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16',
          'make': 'category'})
 .drop(columns=['trany'])
 .memory_usage(deep=True)
 .sum()  # was 19,647,323
)
10631047
Great, another reduction - we have almost halved our original memory usage.
Dates
pandas contains extensive capabilities and features for working with time series data for all domains. Check out the documentation for more info.
We can convert the createdOn column from object to datetime using pd.to_datetime and .dt.tz_localize:
# add createdOn using pd.to_datetime .dt.tz_localize
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'),
         drive=autos.drive.fillna('Other').astype('category'),
         automatic=autos.trany.str.contains('Auto'),
         speeds=autos.trany.str.extract(r'(\d+)', expand=False).fillna('20').astype('int8'),
         createdOn=pd.to_datetime(autos.createdOn).dt.tz_localize('America/New_York'))
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8',
          'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16',
          'make': 'category'})
 .drop(columns=['trany'])
 .memory_usage(deep=True)
 .sum()  # was 19,647,323
)
/home/stephen137/mambaforge/lib/python3.10/site-packages/dateutil/parser/_parser.py:1207: UnknownTimezoneWarning: tzname EST identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
warnings.warn("tzname {tzname} identified but not understood. "
/home/stephen137/mambaforge/lib/python3.10/site-packages/dateutil/parser/_parser.py:1207: UnknownTimezoneWarning: tzname EDT identified but not understood. Pass `tzinfos` argument in order to correctly return a timezone-aware datetime. In a future version, this will raise an exception.
warnings.warn("tzname {tzname} identified but not understood. "
7462959
Excellent, we have successfully reduced our memory usage by 62%!
# Python doesn't like EST/EDT
(autos
 [cols]
 .createdOn
)
0 Tue Jan 01 00:00:00 EST 2013
1 Tue Jan 01 00:00:00 EST 2013
2 Tue Jan 01 00:00:00 EST 2013
3 Tue Jan 01 00:00:00 EST 2013
4 Tue Jan 01 00:00:00 EST 2013
5 Tue Jan 01 00:00:00 EST 2013
6 Tue Jan 01 00:00:00 EST 2013
7 Tue Jan 01 00:00:00 EST 2013
8 Tue Jan 01 00:00:00 EST 2013
9 Tue Jan 01 00:00:00 EST 2013
...
41134 Tue Jan 01 00:00:00 EST 2013
41135 Tue Jan 01 00:00:00 EST 2013
41136 Tue Jan 01 00:00:00 EST 2013
41137 Tue Jan 01 00:00:00 EST 2013
41138 Tue Jan 01 00:00:00 EST 2013
41139 Tue Jan 01 00:00:00 EST 2013
41140 Tue Jan 01 00:00:00 EST 2013
41141 Tue Jan 01 00:00:00 EST 2013
41142 Tue Jan 01 00:00:00 EST 2013
41143 Tue Jan 01 00:00:00 EST 2013
Name: createdOn, Length: 41144, dtype: object
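The fix for those timezone warnings can be sketched on a couple of toy strings in the same awkward format (hypothetical rows): replace the ambiguous EST/EDT abbreviations with explicit UTC offsets before parsing.

```python
import pandas as pd

# Two toy createdOn strings (hypothetical rows) in the troublesome format
created = pd.Series(['Tue Jan 01 00:00:00 EST 2013',
                     'Wed Jan 02 00:00:00 EST 2013'])

# dateutil cannot resolve the EST/EDT abbreviations, so swap in explicit
# UTC offsets before parsing
parsed = pd.to_datetime(created.replace({'EDT': '-04:00', 'EST': '-05:00'},
                                        regex=True))
```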
# Fix date warnings - move on to eng_dscr
# https://www.fueleconomy.gov/feg/findacarhelp.shtml
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'),
         drive=autos.drive.fillna('Other').astype('category'),
         automatic=autos.trany.str.contains('Auto'),
         speeds=autos.trany.str.extract(r'(\d)+').fillna('20').astype('int8'),
         createdOn=pd.to_datetime(autos.createdOn.replace({'EDT': '-04:00', 'EST': '-05:00'}, regex=True))
 )
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8', 'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16', 'make': 'category'})
 .drop(columns=['trany'])
 .eng_dscr
 .value_counts(dropna=False)
)
NaN 16153
(FFS) 8827
SIDI 5526
(FFS) CA model 926
(FFS) (MPFI) 734
FFV 701
(FFS,TRBO) 666
(350 V8) (FFS) 411
(GUZZLER) (FFS) 366
SOHC 354
...
B234L/R4 (FFS,TRBO) 1
GUZZLER V8 FFS,TURBO 1
4.6M FFS MPFI 1
CNG FFS 1
POLICE FFS MPFI 1
B308E5 FFS,TURBO 1
5.4E-R FFS MPFI 1
V-6 FFS 1
(GUZZLER) (FFS) (S-CHARGE) 1
R-ENG (FFS,TRBO) 1
Name: eng_dscr, Length: 558, dtype: int64
As we can see, the majority of values within the eng_dscr column are NaN and the remaining values are very messy. How should we deal with this?
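One pragmatic option, sketched here on a toy slice of the column (hypothetical values), is to collapse the mess into a single Boolean question with `.str.contains` - missing descriptions simply stay missing:

```python
import pandas as pd
import numpy as np

# Toy slice of the messy eng_dscr column (hypothetical values)
eng = pd.Series(['(FFS)', 'SIDI', '(FFS,TRBO)', np.nan])

# Collapse the mess into a single question: does the description mention FFS?
ffs = eng.str.contains('FFS')
```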
# drop eng_dscr column, and bring in an ffs column (feedback fuel system)
# check update to memory usage
(autos
 [cols]
 .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
         displ=autos.displ.fillna(0).astype('float16'),
         drive=autos.drive.fillna('Other').astype('category'),
         automatic=autos.trany.str.contains('Auto'),
         speeds=autos.trany.str.extract(r'(\d)+').fillna('20').astype('int8'),
         createdOn=pd.to_datetime(autos.createdOn.replace({'EDT': '-04:00', 'EST': '-05:00'}, regex=True)),
         ffs=autos.eng_dscr.str.contains('FFS')
 )
 .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8', 'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16', 'make': 'category'})
 .drop(columns=['trany', 'eng_dscr'])
 .memory_usage(deep=True)
 .sum()  # was 19,647,323
)
8676214
Functions - .apply
Let’s now create a function which brings together all the exploratory data analysis we have performed in one place:
def autos_tweak(autos):
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr', 'fuelCost08',
            'make', 'model', 'trany', 'range', 'createdOn', 'year']
    return (autos
            [cols]
            .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
                    displ=autos.displ.fillna(0).astype('float16'),
                    drive=autos.drive.fillna('Other').astype('category'),
                    automatic=autos.trany.str.contains('Auto'),
                    speeds=autos.trany.str.extract(r'(\d)+').fillna('20').astype('int8'),
                    createdOn=pd.to_datetime(autos.createdOn.replace({'EDT': '-04:00', 'EST': '-05:00'}, regex=True)),
                    ffs=autos.eng_dscr.str.contains('FFS')
            )
            .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8', 'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16', 'make': 'category'})
            .drop(columns=['trany', 'eng_dscr'])
    )
Look how neat and tidy the above code is compared to the following alternative approach:
a1 = autos[cols]
cyls = autos.cylinders.fillna(0)
cyls2 = cyls.astype('int8')
a1['cylinders'] = cyls2
displ = a1.displ
displ2 = displ.fillna(0)
displ3 = displ2.astype('float16')
a1.displ = displ3
a1.drive = autos.drive.fillna('Other').astype('category')
a1['automatic'] = autos.trany.str.contains('Auto')
speed = autos.trany.str.extract(r'(\d)+')
speedfill = speed.fillna('20')
speedint = speedfill.astype('int8')
a1['speeds'] = speedint
a1.createdOn = pd.to_datetime(autos.createdOn).dt.tz_localize('America/New_York')
a1.ffs = autos.eng_dscr.str.contains('FFS')
a1['highway08'] = autos.highway08.astype('int8')
a1['city08'] = autos.city08.astype('int8')
a1['comb08'] = autos.comb08.astype('int16')
a1['fuelCost08'] = autos.fuelCost08.astype('int16')
a1['range'] = autos.range.astype('int16')
a1['make'] = autos.make.astype('category')
a3 = a1.drop(columns=['trany', 'eng_dscr'])
Don’t Mutate
“you are missing the point, inplace rarely actually does something inplace. you are thinking that you are saving memory but you are not.”
jreback - Pandas core dev
https://github.com/pandas-dev/pandas/issues/16529#issuecomment-676518136
- in general, no performance benefits
- prohibits chaining
- SettingWithCopyWarning fun
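A minimal sketch (toy frame, not the autos data) of why inplace=True fights the chaining style: it returns None, so the chain dies, while returning a new object keeps it alive.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0]})

# inplace=True returns None, so nothing can be chained after it
result = df.fillna(0, inplace=True)

# returning a new object keeps the chain alive
cleaned = (pd.DataFrame({'a': [1.0, None, 3.0]})
           .fillna(0)
           .astype('int8')
          )
```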
Try to avoid using .apply (where possible)
def autos_tweak(autos):
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr', 'fuelCost08',
            'make', 'model', 'trany', 'range', 'createdOn', 'year']
    return (autos
            [cols]
            .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
                    displ=autos.displ.fillna(0).astype('float16'),
                    drive=autos.drive.fillna('Other').astype('category'),
                    automatic=autos.trany.str.contains('Auto'),
                    speeds=autos.trany.str.extract(r'(\d)+').fillna('20').astype('int8'),
                    createdOn=pd.to_datetime(autos.createdOn.replace({'EDT': '-04:00', 'EST': '-05:00'}, regex=True)),
                    ffs=autos.eng_dscr.str.contains('FFS')
            )
            .astype({'city08': 'int16', 'comb08': 'int16', 'highway08': 'int8', 'fuelCost08': 'int16', 'range': 'int16', 'year': 'int16', 'make': 'category'})
            .drop(columns=['trany', 'eng_dscr'])
    )

autos2 = autos_tweak(autos)
# try to be more Euro-centric
def to_lper100km(val):
    return 235.215 / val
%%timeit
autos2.city08.apply(to_lper100km)
5.3 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.3 ms (milliseconds) is equivalent to 5,300 μs (microseconds)
There is a lot of computational overhead with this method: pandas pulls each individual entry out of the Series, converts it to a Python object, passes it into the function, and then converts the result back into a pandas object.
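That overhead is easy to see on a toy Series (hypothetical values); both paths give identical answers, but only one of them loops in Python:

```python
import pandas as pd

city08 = pd.Series([10, 20, 47], dtype='int16')

def to_lper100km(val):
    # called once per element under .apply - lots of Python overhead
    return 235.215 / val

slow = city08.apply(to_lper100km)  # element-by-element Python loop
fast = 235.215 / city08            # one vectorized division
```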
%%timeit
# note that you can't run %%timeit with a leading #comment
# this gives the same results
235.215 / autos2.city08
84.5 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
84.5 μs (microseconds)
This gives the same answer but is more than 50 x faster than the .apply method, because it leverages modern CPU single instruction multiple data (SIMD) architecture: hand the CPU a block of data and do the division on all of it at once.
# create a function to return whether the make of the car is US
def is_american(val):
    return val in {'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'}
%%timeit
# use .apply
autos2.make.apply(is_american)
233 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
233 μs (microseconds)
%%timeit
# use .isin
autos2.make.isin({'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'})
448 µs ± 9.32 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
448 μs (microseconds)
%%timeit
# use .astype(str) and then .isin
autos2.make.astype(str).isin({'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'})
4.91 ms ± 65.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.91 ms (milliseconds) is equivalent to 4,910 μs (microseconds)
%%timeit
# use .astype(str) and then .apply
autos2.make.astype(str).apply(is_american)
8.23 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.23 ms (milliseconds) is equivalent to 8,230 μs (microseconds)
In the case of the categorical column - make:
- the .apply method on the function was fastest
- the .isin method was next fastest (~ 2 x slower)
- third fastest was the .astype(str).isin method (~ 21 x slower)
- finally the .astype(str).apply method (~ 35 x slower)
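As a sanity check that the faster paths agree, here is a toy sketch (hypothetical makes) comparing .isin directly on a categorical column with the .astype(str) detour - same answer, but the detour throws away the categorical encoding first:

```python
import pandas as pd

make = pd.Series(['Ford', 'Toyota', 'Tesla', 'BMW'], dtype='category')
us_makes = {'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'}

# membership test directly on the categorical column
direct = make.isin(us_makes)

# the .astype(str) detour converts every row back to a Python string first
via_str = make.astype(str).isin(us_makes)
```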
def country(val):
    if val in {'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'}:
        return 'US'
    # else
    return 'Other'
%%timeit
# use .apply
# Might be OK for strings, since they are not vectorized...
(autos2
 .assign(country=autos2.make.apply(country))
)
2.14 ms ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
# use .assign
values = {'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'}

(autos2
 .assign(country='US')
 .assign(country=lambda df_: df_.country.where(df_.make.isin(values), 'Other'))
)
4.31 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
# using NumPy np.select - specify a list of Booleans and, wherever they hold True
# (i.e. make is Chevrolet, Ford, Dodge...Tesla), the value ['US'] to put into the placeholder
# this method is not available within pandas (assumes numpy has been imported as np)
(autos2
 .assign(country=np.select([autos2.make.isin({'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'})],
                           ['US'], 'Other'))
)
3.36 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
# using NumPy np.where - specify a Boolean mask and, wherever it holds True
# (i.e. make is Chevrolet, Ford, Dodge...Tesla), the value ['US'] to put into the placeholder
# this method is not available within pandas (assumes numpy has been imported as np)
(autos2
 .assign(country=np.where(autos2.make.isin({'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'}),
                          ['US'], 'Other'))
)
3.38 ms ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Key takeaways
- if you find yourself using a for loop, alarm bells should be ringing!
- you could use .apply, but it is still slow
- prefer np.where or np.select
- the same result can often be obtained much faster using a list comprehension
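The np.select pattern from above, condensed into a self-contained toy sketch (hypothetical makes): a list of Boolean masks paired with a list of values, plus a default for rows no mask covers.

```python
import numpy as np
import pandas as pd

make = pd.Series(['Ford', 'Toyota', 'Tesla', 'BMW'])
us_makes = {'Chevrolet', 'Ford', 'Dodge', 'GMC', 'Tesla'}

# np.select pairs a list of Boolean masks with a matching list of values,
# plus a default for rows where no mask is True
country = np.select([make.isin(us_makes)], ['US'], 'Other')
```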
Aggregation
It is important as a data scientist to work with the raw data and get to know the finer details but, ultimately, providing higher level insights is our main goal. We can get there by aggregating the data. Let’s compare mileage by country by year…
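The basic pattern can be sketched first on a toy frame (hypothetical numbers): group on a key column, then aggregate the rest.

```python
import pandas as pd

# Toy frame standing in for autos2 (hypothetical numbers)
df = pd.DataFrame({'year':   [1984, 1984, 1985, 1985],
                   'comb08': [20, 22, 24, 26]})

# one row per year, with comb08 aggregated via the mean
by_year = df.groupby('year').mean()
```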
# start off with autos2
# group by year
# then grab the mean values
(autos2
 # year will therefore be our index
 .groupby('year')
 .mean()
)
/tmp/ipykernel_2753/262800323.py:7: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
.mean()
city08 | comb08 | highway08 | cylinders | displ | fuelCost08 | range | speeds | |
---|---|---|---|---|---|---|---|---|
year | ||||||||
1984 | 17.982688 | 19.881874 | 23.075356 | 5.385438 | 3.165017 | 2313.543788 | 0.000000 | 3.928208 |
1985 | 17.878307 | 19.808348 | 23.042328 | 5.375661 | 3.164080 | 2334.509112 | 0.000000 | 3.924750 |
1986 | 17.665289 | 19.550413 | 22.699174 | 5.425620 | 3.183762 | 2354.049587 | 0.000000 | 3.984298 |
1987 | 17.310345 | 19.228549 | 22.445068 | 5.412189 | 3.173949 | 2403.648757 | 0.000000 | 4.037690 |
1988 | 17.333628 | 19.328319 | 22.702655 | 5.461947 | 3.194899 | 2387.035398 | 0.000000 | 4.129204 |
1989 | 17.143972 | 19.125759 | 22.465742 | 5.488291 | 3.209926 | 2433.434519 | 0.000000 | 4.166522 |
1990 | 17.033395 | 19.000928 | 22.337662 | 5.496289 | 3.217369 | 2436.178108 | 0.000000 | 4.238404 |
1991 | 16.848940 | 18.825972 | 22.253534 | 5.598940 | 3.266809 | 2490.856890 | 0.000000 | 4.301237 |
1992 | 16.805531 | 18.862623 | 22.439786 | 5.623550 | 3.276159 | 2494.736842 | 0.000000 | 4.318466 |
1993 | 16.998170 | 19.104300 | 22.780421 | 5.602928 | 3.248540 | 2454.620311 | 0.000000 | 4.339433 |
1994 | 16.918534 | 19.012220 | 22.725051 | 5.704684 | 3.333190 | 2461.507128 | 0.000000 | 4.332994 |
1995 | 16.569804 | 18.797311 | 22.671148 | 5.892451 | 3.471776 | 2497.828335 | 0.000000 | 4.356774 |
1996 | 17.289780 | 19.584735 | 23.569211 | 5.627426 | 3.234789 | 2375.032342 | 0.000000 | 4.364812 |
1997 | 17.135171 | 19.429134 | 23.451444 | 5.666667 | 3.226933 | 2405.511811 | 0.000000 | 4.402887 |
1998 | 17.113300 | 19.518473 | 23.546798 | 5.633005 | 3.201979 | 2382.635468 | 0.229064 | 4.419951 |
1999 | 17.272300 | 19.611502 | 23.552817 | 5.667840 | 3.188794 | 2392.194836 | 0.570423 | 4.421362 |
2000 | 17.221429 | 19.526190 | 23.414286 | 5.713095 | 3.200517 | 2429.702381 | 0.348810 | 4.508333 |
2001 | 17.275521 | 19.479693 | 23.328211 | 5.720088 | 3.192452 | 2448.463227 | 0.261251 | 4.660812 |
2002 | 16.893333 | 19.168205 | 23.030769 | 5.827692 | 3.264525 | 2479.794872 | 0.136410 | 4.757949 |
2003 | 16.780651 | 19.000958 | 22.836207 | 5.942529 | 3.358259 | 2525.574713 | 0.090996 | 4.911877 |
2004 | 16.740642 | 19.067736 | 23.064171 | 5.957219 | 3.393626 | 2512.566845 | 0.000000 | 4.976827 |
2005 | 16.851630 | 19.193825 | 23.297599 | 5.944254 | 3.399485 | 2518.610635 | 0.000000 | 5.192110 |
2006 | 16.626812 | 18.959239 | 23.048913 | 6.100543 | 3.549294 | 2539.175725 | 0.000000 | 5.315217 |
2007 | 16.605684 | 18.978686 | 23.083481 | 6.166075 | 3.628539 | 2535.923623 | 0.000000 | 5.610124 |
2008 | 16.900590 | 19.276327 | 23.455771 | 6.192923 | 3.637796 | 2536.436394 | 0.084246 | 5.773378 |
2009 | 17.334459 | 19.735642 | 24.017736 | 6.122466 | 3.624839 | 2427.027027 | 0.000000 | 6.043074 |
2010 | 18.105500 | 20.588819 | 24.947701 | 5.965735 | 3.502548 | 2351.082056 | 0.000000 | 6.271416 |
2011 | 18.669027 | 21.011504 | 25.169912 | 5.980531 | 3.521903 | 2333.982301 | 0.259292 | 6.560177 |
2012 | 19.362847 | 21.819444 | 26.105035 | 5.910590 | 3.460015 | 2289.973958 | 0.782118 | 6.706597 |
2013 | 20.661318 | 23.125000 | 27.504223 | 5.762669 | 3.327529 | 2210.768581 | 1.255068 | 6.896959 |
2014 | 21.033469 | 23.531429 | 27.978776 | 5.745306 | 3.289703 | 2198.040816 | 1.405714 | 6.985306 |
2015 | 21.445830 | 24.038971 | 28.586906 | 5.635230 | 3.205085 | 2148.869836 | 2.208106 | 7.035853 |
2016 | 22.591918 | 25.150555 | 29.606973 | 5.463550 | 3.054415 | 2091.204437 | 4.546751 | 7.080032 |
2017 | 22.761021 | 25.249033 | 29.554524 | 5.453210 | 3.026032 | 2096.558391 | 4.336427 | 7.225058 |
2018 | 22.564732 | 25.019345 | 29.273065 | 5.438988 | 2.992239 | 2103.980655 | 3.519345 | 7.017113 |
2019 | 23.318147 | 25.627942 | 29.664389 | 5.368261 | 2.964679 | 2093.545938 | 5.565680 | 7.136674 |
2020 | 22.679426 | 25.267943 | 29.617225 | 5.071770 | 2.644994 | 2023.444976 | 2.282297 | 7.746411 |
# let's focus on just the comb08 and speeds columns
(autos2
 .groupby('year')
 [['comb08', 'speeds']]
 .mean()
)
comb08 | speeds | |
---|---|---|
year | ||
1984 | 19.881874 | 3.928208 |
1985 | 19.808348 | 3.924750 |
1986 | 19.550413 | 3.984298 |
1987 | 19.228549 | 4.037690 |
1988 | 19.328319 | 4.129204 |
1989 | 19.125759 | 4.166522 |
1990 | 19.000928 | 4.238404 |
1991 | 18.825972 | 4.301237 |
1992 | 18.862623 | 4.318466 |
1993 | 19.104300 | 4.339433 |
1994 | 19.012220 | 4.332994 |
1995 | 18.797311 | 4.356774 |
1996 | 19.584735 | 4.364812 |
1997 | 19.429134 | 4.402887 |
1998 | 19.518473 | 4.419951 |
1999 | 19.611502 | 4.421362 |
2000 | 19.526190 | 4.508333 |
2001 | 19.479693 | 4.660812 |
2002 | 19.168205 | 4.757949 |
2003 | 19.000958 | 4.911877 |
2004 | 19.067736 | 4.976827 |
2005 | 19.193825 | 5.192110 |
2006 | 18.959239 | 5.315217 |
2007 | 18.978686 | 5.610124 |
2008 | 19.276327 | 5.773378 |
2009 | 19.735642 | 6.043074 |
2010 | 20.588819 | 6.271416 |
2011 | 21.011504 | 6.560177 |
2012 | 21.819444 | 6.706597 |
2013 | 23.125000 | 6.896959 |
2014 | 23.531429 | 6.985306 |
2015 | 24.038971 | 7.035853 |
2016 | 25.150555 | 7.080032 |
2017 | 25.249033 | 7.225058 |
2018 | 25.019345 | 7.017113 |
2019 | 25.627942 | 7.136674 |
2020 | 25.267943 | 7.746411 |
%%timeit
# Watch out for the ordering!!!
# here we are grouping by year
# but then we are taking average of all columns - computationally expensive
# we are only interested in comb08 and speeds
(autos2
 .groupby('year')
 .mean()  #.median()
 #.quantile(.1)
 #.std()
 [['comb08', 'speeds']]
)
Visualizations
The pandas library includes a variety of visualization tools which allow us to communicate our findings visually. Note that it is very easy to show a variety of different plots quickly, simply by commenting out (#) to leave the desired plot:
# in pandas default plot is a line plot
# with index as the x axis and the selected grouped columns as the lines
(autos2
 .groupby('year')
 [['comb08', 'speeds']]
 #.mean()
 #.median()
 #.quantile(.1)
 .std()
 #.var()
 .plot()
)
<AxesSubplot: xlabel='year'>
# add country
(autos2
 .assign(country=autos2.make.apply(country))
 # can group by more than one column
 .groupby(['year', 'country'])
 .mean()
)
/tmp/ipykernel_2753/361744348.py:6: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
.mean()
city08 | comb08 | highway08 | cylinders | displ | fuelCost08 | range | speeds | ||
---|---|---|---|---|---|---|---|---|---|
year | country | ||||||||
1984 | Other | 19.384615 | 21.417330 | 24.847038 | 4.908046 | 2.690516 | 2118.125553 | 0.000000 | 3.969054 |
US | 16.079232 | 17.797119 | 20.669868 | 6.033613 | 3.809268 | 2578.871549 | 0.000000 | 3.872749 | |
1985 | Other | 19.284768 | 21.373068 | 24.816777 | 4.871965 | 2.636070 | 2141.997792 | 0.000000 | 3.958057 |
US | 16.275472 | 18.025157 | 21.020126 | 5.949686 | 3.765813 | 2553.899371 | 0.000000 | 3.886792 | |
1986 | Other | 19.167183 | 21.213622 | 24.650155 | 4.804954 | 2.536234 | 2149.148607 | 0.000000 | 4.069659 |
US | 15.945035 | 17.645390 | 20.464539 | 6.136525 | 3.925433 | 2588.741135 | 0.000000 | 3.886525 | |
1987 | Other | 18.633381 | 20.710414 | 24.186876 | 4.825963 | 2.583168 | 2227.318117 | 0.000000 | 4.142653 |
US | 15.611722 | 17.326007 | 20.208791 | 6.164835 | 3.932442 | 2630.036630 | 0.000000 | 3.902930 | |
1988 | Other | 18.668224 | 20.814642 | 24.437695 | 4.819315 | 2.531434 | 2207.476636 | 0.000000 | 4.205607 |
US | 15.577869 | 17.372951 | 20.420082 | 6.307377 | 4.067735 | 2623.258197 | 0.000000 | 4.028689 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2016 | Other | 21.903749 | 24.439716 | 28.866261 | 5.493414 | 2.992272 | 2127.608916 | 1.017224 | 7.296859 |
US | 25.061818 | 27.701818 | 32.265455 | 5.356364 | 3.277454 | 1960.545455 | 17.214545 | 6.301818 | |
2017 | Other | 22.423795 | 24.910521 | 29.208456 | 5.431662 | 2.919041 | 2114.110128 | 1.243854 | 7.474926 |
US | 24.003623 | 26.496377 | 30.829710 | 5.532609 | 3.420272 | 2031.884058 | 15.731884 | 6.304348 | |
2018 | Other | 22.310442 | 24.779868 | 29.042333 | 5.396990 | 2.886801 | 2121.448730 | 1.135466 | 7.391345 |
US | 23.526690 | 25.925267 | 30.145907 | 5.597865 | 3.391101 | 2037.900356 | 12.537367 | 5.601423 | |
2019 | Other | 23.084221 | 25.456922 | 29.560503 | 5.315586 | 2.839671 | 2093.659245 | 2.581801 | 7.545983 |
US | 24.169014 | 26.250000 | 30.042254 | 5.559859 | 3.419375 | 2093.133803 | 16.419014 | 5.647887 | |
2020 | Other | 22.579487 | 25.174359 | 29.543590 | 5.148718 | 2.692823 | 2050.256410 | 2.446154 | 7.743590 |
US | 24.071429 | 26.571429 | 30.642857 | 4.000000 | 1.978795 | 1650.000000 | 0.000000 | 7.785714 |
74 rows × 8 columns
# we can go deeper and apply multiple aggregates
# this is loosely equivalent to the sort of thing that a pivot table in Excel might provide
# penultimate row
def second_to_last(ser):
    return ser.iloc[-2]

(autos2
 .assign(country=autos2.make.apply(country))
 .groupby(['year', 'country'])
 # we can use .agg to include a list of different aggregation types - we can even call a function
 .agg(['min', 'mean', second_to_last])
)
/tmp/ipykernel_2753/2706922386.py:12: FutureWarning: ['drive', 'make', 'model', 'createdOn'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
.agg(['min', 'mean', second_to_last])
city08 | comb08 | highway08 | cylinders | ... | range | automatic | speeds | ffs | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | mean | second_to_last | min | mean | second_to_last | min | mean | second_to_last | min | ... | second_to_last | min | mean | second_to_last | min | mean | second_to_last | min | mean | second_to_last | ||
year | country | |||||||||||||||||||||
1984 | Other | 7 | 19.384615 | 14 | 8 | 21.417330 | 14 | 9 | 24.847038 | 15 | 2 | ... | 0 | False | 0.550840 | False | 3 | 3.969054 | 5 | False | 0.714554 | True |
US | 8 | 16.079232 | 15 | 9 | 17.797119 | 17 | 10 | 20.669868 | 19 | 4 | ... | 0 | False | 0.521059 | False | 3 | 3.872749 | 4 | False | 0.638801 | NaN | |
1985 | Other | 7 | 19.284768 | 19 | 8 | 21.373068 | 20 | 9 | 24.816777 | 22 | 0 | ... | 0 | False | 0.554084 | True | 3 | 3.958057 | 4 | False | 0.889160 | True |
US | 8 | 16.275472 | 14 | 10 | 18.025157 | 15 | 10 | 21.020126 | 17 | 3 | ... | 0 | False | 0.520755 | False | 3 | 3.886792 | 4 | False | 0.851351 | NaN | |
1986 | Other | 6 | 19.167183 | 10 | 7 | 21.213622 | 11 | 9 | 24.650155 | 12 | 0 | ... | 0 | False | 0.520124 | False | 3 | 4.069659 | 4 | False | 0.934211 | NaN |
US | 9 | 15.945035 | 16 | 10 | 17.645390 | 17 | 11 | 20.464539 | 19 | 3 | ... | 0 | False | 0.533688 | False | 3 | 3.886525 | 4 | False | 0.795699 | NaN | |
1987 | Other | 6 | 18.633381 | 12 | 7 | 20.710414 | 12 | 9 | 24.186876 | 12 | 2 | ... | 0 | False | 0.516405 | True | 3 | 4.142653 | 4 | False | 0.949778 | True |
US | 8 | 15.611722 | 12 | 9 | 17.326007 | 13 | 10 | 20.208791 | 14 | 3 | ... | 0 | False | 0.549451 | True | 3 | 3.902930 | 4 | False | 0.909457 | True | |
1988 | Other | 6 | 18.668224 | 12 | 7 | 20.814642 | 12 | 10 | 24.437695 | 12 | 2 | ... | 0 | False | 0.521807 | True | 3 | 4.205607 | 4 | False | 0.993681 | True |
US | 8 | 15.577869 | 14 | 9 | 17.372951 | 14 | 10 | 20.420082 | 15 | 3 | ... | 0 | False | 0.569672 | True | 3 | 4.028689 | 4 | False | 0.936306 | True | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2016 | Other | 10 | 21.903749 | 28 | 12 | 24.439716 | 30 | 13 | 28.866261 | 32 | 0 | ... | 0 | False | 0.837893 | True | 1 | 7.296859 | 7 | False | 0.000000 | False |
US | 11 | 25.061818 | 91 | 12 | 27.701818 | 93 | 16 | 32.265455 | 94 | 0 | ... | 200 | False | 0.850909 | True | 1 | 6.301818 | 1 | False | 0.000000 | NaN | |
2017 | Other | 10 | 22.423795 | 21 | 11 | 24.910521 | 24 | 11 | 29.208456 | 28 | 0 | ... | 0 | False | 0.848574 | True | 1 | 7.474926 | 7 | False | 0.000000 | False |
US | 11 | 24.003623 | 131 | 12 | 26.496377 | 126 | 15 | 30.829710 | 120 | 0 | ... | 310 | False | 0.858696 | True | 0 | 6.304348 | 1 | False | 0.000000 | NaN | |
2018 | Other | 9 | 22.310442 | 11 | 11 | 24.779868 | 12 | 11 | 29.042333 | 15 | 0 | ... | 0 | False | 0.863594 | True | 0 | 7.391345 | 0 | False | 0.000000 | NaN |
US | 11 | 23.526690 | 120 | 14 | 25.925267 | 116 | 15 | 30.145907 | 112 | 0 | ... | 310 | False | 0.882562 | True | 0 | 5.601423 | 1 | False | 0.000000 | NaN | |
2019 | Other | 9 | 23.084221 | 19 | 11 | 25.456922 | 22 | 14 | 29.560503 | 27 | 0 | ... | 0 | False | 0.879961 | True | 0 | 7.545983 | 8 | False | 0.000000 | NaN |
US | 11 | 24.169014 | 104 | 14 | 26.250000 | 104 | 15 | 30.042254 | 104 | 0 | ... | 345 | False | 0.915493 | True | 0 | 5.647887 | 1 | False | 0.000000 | NaN | |
2020 | Other | 13 | 22.579487 | 17 | 15 | 25.174359 | 20 | 18 | 29.543590 | 24 | 0 | ... | 0 | False | 0.871795 | True | 0 | 7.743590 | 0 | False | 0.000000 | False |
US | 20 | 24.071429 | 21 | 22 | 26.571429 | 24 | 26 | 30.642857 | 28 | 4 | ... | 0 | False | 0.857143 | True | 0 | 7.785714 | 0 | False | 0.000000 | False |
74 rows × 30 columns
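The multi-aggregate pattern with a custom function can be checked on a toy frame (hypothetical numbers) - each entry in the .agg list becomes a column, with a custom function labelled by its name:

```python
import pandas as pd

def second_to_last(ser):
    # penultimate value within each group
    return ser.iloc[-2]

df = pd.DataFrame({'year':   [1984, 1984, 1984, 1985, 1985],
                   'comb08': [20, 22, 24, 30, 40]})

# mix built-in aggregations with our own function
out = df.groupby('year').comb08.agg(['min', 'mean', second_to_last])
```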
# back to simpler example, adding plots
(autos2
 .assign(country=autos2.make.apply(country))
 # can group by more than one column
 .groupby(['year', 'country'])
 .mean()
 .plot()
)
/tmp/ipykernel_2753/1855101905.py:6: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
.mean()
<AxesSubplot: xlabel='year,country'>
This doesn’t really work as we can see. Let’s see if we can resolve this:
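The reshaping step we need can be sketched on a toy frame (hypothetical numbers): grouping by two keys gives a two-level index, and .unstack() pivots the inner level (country) out into the columns.

```python
import pandas as pd

df = pd.DataFrame({'year':    [1984, 1984, 1985, 1985],
                   'country': ['US', 'Other', 'US', 'Other'],
                   'comb08':  [18, 21, 19, 22]})

wide = (df
        .groupby(['year', 'country'])
        .mean()
        # move the inner index level (country) out into the columns
        .unstack()
       )
```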
# back to simpler example, adding plots
(autos2
 .assign(country=autos2.make.apply(country))
 # can group by more than one column
 .groupby(['year', 'country'])
 .mean()
 # rotates or unstacks the innermost index, country
 .unstack()
)
/tmp/ipykernel_2753/1937283994.py:6: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
.mean()
city08 | comb08 | highway08 | cylinders | displ | fuelCost08 | range | speeds | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
country | Other | US | Other | US | Other | US | Other | US | Other | US | Other | US | Other | US | Other | US |
year | ||||||||||||||||
1984 | 19.384615 | 16.079232 | 21.417330 | 17.797119 | 24.847038 | 20.669868 | 4.908046 | 6.033613 | 2.690516 | 3.809268 | 2118.125553 | 2578.871549 | 0.000000 | 0.000000 | 3.969054 | 3.872749 |
1985 | 19.284768 | 16.275472 | 21.373068 | 18.025157 | 24.816777 | 21.020126 | 4.871965 | 5.949686 | 2.636070 | 3.765813 | 2141.997792 | 2553.899371 | 0.000000 | 0.000000 | 3.958057 | 3.886792 |
1986 | 19.167183 | 15.945035 | 21.213622 | 17.645390 | 24.650155 | 20.464539 | 4.804954 | 6.136525 | 2.536234 | 3.925433 | 2149.148607 | 2588.741135 | 0.000000 | 0.000000 | 4.069659 | 3.886525 |
1987 | 18.633381 | 15.611722 | 20.710414 | 17.326007 | 24.186876 | 20.208791 | 4.825963 | 6.164835 | 2.583168 | 3.932442 | 2227.318117 | 2630.036630 | 0.000000 | 0.000000 | 4.142653 | 3.902930 |
1988 | 18.668224 | 15.577869 | 20.814642 | 17.372951 | 24.437695 | 20.420082 | 4.819315 | 6.307377 | 2.531434 | 4.067735 | 2207.476636 | 2623.258197 | 0.000000 | 0.000000 | 4.205607 | 4.028689 |
1989 | 18.533040 | 15.139831 | 20.662261 | 16.908898 | 24.252570 | 19.887712 | 4.879589 | 6.366525 | 2.542154 | 4.173385 | 2250.000000 | 2698.093220 | 0.000000 | 0.000000 | 4.264317 | 4.025424 |
1990 | 18.510109 | 14.850575 | 20.640747 | 16.577011 | 24.267496 | 19.485057 | 4.839813 | 6.466667 | 2.508090 | 4.265798 | 2238.258165 | 2728.735632 | 0.000000 | 0.000000 | 4.328149 | 4.105747 |
1991 | 18.087943 | 14.803279 | 20.174468 | 16.599532 | 23.809929 | 19.683841 | 5.029787 | 6.538642 | 2.610219 | 4.350876 | 2348.581560 | 2725.761124 | 0.000000 | 0.000000 | 4.341844 | 4.234192 |
1992 | 17.915374 | 14.895631 | 20.098731 | 16.735437 | 23.820874 | 20.063107 | 5.145275 | 6.446602 | 2.709618 | 4.251104 | 2373.272214 | 2703.762136 | 0.000000 | 0.000000 | 4.356841 | 4.252427 |
1993 | 18.084866 | 15.007772 | 20.309760 | 16.896373 | 24.172560 | 20.230570 | 5.114569 | 6.497409 | 2.683870 | 4.282793 | 2333.097595 | 2677.202073 | 0.000000 | 0.000000 | 4.371994 | 4.279793 |
1994 | 18.046474 | 14.952514 | 20.264423 | 16.829609 | 24.173077 | 20.201117 | 5.185897 | 6.608939 | 2.713631 | 4.413091 | 2326.041667 | 2697.625698 | 0.000000 | 0.000000 | 4.355769 | 4.293296 |
1995 | 17.678914 | 14.533724 | 20.091054 | 16.422287 | 24.263578 | 19.747801 | 5.444089 | 6.715543 | 2.908022 | 4.506701 | 2355.191693 | 2759.677419 | 0.000000 | 0.000000 | 4.380192 | 4.313783 |
1996 | 18.480545 | 14.926641 | 20.906615 | 16.961390 | 25.093385 | 20.544402 | 5.147860 | 6.579151 | 2.708768 | 4.278708 | 2250.291829 | 2622.586873 | 0.000000 | 0.000000 | 4.416342 | 4.262548 |
1997 | 18.090909 | 14.978632 | 20.509470 | 16.991453 | 24.678030 | 20.683761 | 5.261364 | 6.581197 | 2.786582 | 4.220544 | 2319.128788 | 2600.427350 | 0.000000 | 0.000000 | 4.452652 | 4.290598 |
1998 | 17.925267 | 15.288000 | 20.457295 | 17.408000 | 24.704626 | 20.944000 | 5.275801 | 6.436000 | 2.800378 | 4.104777 | 2295.373665 | 2578.800000 | 0.144128 | 0.420000 | 4.485765 | 4.272000 |
1999 | 17.925125 | 15.709163 | 20.386023 | 17.756972 | 24.577371 | 21.099602 | 5.377704 | 6.362550 | 2.832181 | 4.042677 | 2312.728785 | 2582.470120 | 0.251248 | 1.334661 | 4.507488 | 4.215139 |
2000 | 17.881849 | 15.714844 | 20.301370 | 17.757812 | 24.416096 | 21.128906 | 5.441781 | 6.332031 | 2.859111 | 3.979351 | 2385.958904 | 2529.492188 | 0.304795 | 0.449219 | 4.619863 | 4.253906 |
2001 | 17.941267 | 15.643939 | 20.289026 | 17.496212 | 24.372488 | 20.768939 | 5.479134 | 6.310606 | 2.872808 | 3.975823 | 2399.536321 | 2568.371212 | 0.187017 | 0.443182 | 4.761978 | 4.412879 |
2002 | 17.644412 | 15.083916 | 20.076923 | 16.979021 | 24.207547 | 20.195804 | 5.576197 | 6.433566 | 2.935398 | 4.057423 | 2425.689405 | 2610.139860 | 0.137881 | 0.132867 | 4.920174 | 4.367133 |
2003 | 17.565101 | 14.826087 | 19.953020 | 16.628763 | 24.052349 | 19.806020 | 5.683221 | 6.588629 | 3.032161 | 4.170778 | 2480.604027 | 2637.625418 | 0.127517 | 0.000000 | 5.154362 | 4.307692 |
2004 | 17.426290 | 14.928571 | 19.923833 | 16.805195 | 24.160934 | 20.165584 | 5.729730 | 6.558442 | 3.088666 | 4.199593 | 2476.719902 | 2607.305195 | 0.000000 | 0.000000 | 5.229730 | 4.308442 |
2005 | 17.412170 | 15.196610 | 19.892078 | 17.132203 | 24.189437 | 20.664407 | 5.773823 | 6.447458 | 3.151592 | 4.131402 | 2493.455798 | 2592.881356 | 0.000000 | 0.000000 | 5.362801 | 4.688136 |
2006 | 17.062575 | 15.300366 | 19.509025 | 17.285714 | 23.762936 | 20.875458 | 5.977136 | 6.476190 | 3.345220 | 4.170487 | 2527.496992 | 2574.725275 | 0.000000 | 0.000000 | 5.492178 | 4.776557 |
2007 | 16.996403 | 15.489726 | 19.452038 | 17.626712 | 23.742206 | 21.202055 | 6.044365 | 6.513699 | 3.423963 | 4.212841 | 2544.664269 | 2510.958904 | 0.000000 | 0.000000 | 5.864508 | 4.883562 |
2008 | 17.239869 | 15.770073 | 19.677985 | 17.937956 | 23.983571 | 21.697080 | 6.095290 | 6.518248 | 3.462049 | 4.223408 | 2551.369113 | 2486.678832 | 0.109529 | 0.000000 | 5.969332 | 5.120438 |
2009 | 17.696803 | 16.148014 | 20.186329 | 18.259928 | 24.590959 | 22.140794 | 5.970232 | 6.620939 | 3.402613 | 4.352489 | 2433.076075 | 2407.220217 | 0.000000 | 0.000000 | 6.189636 | 5.563177 |
2010 | 18.325342 | 17.278970 | 20.851598 | 19.600858 | 25.256849 | 23.785408 | 5.897260 | 6.223176 | 3.357208 | 4.048979 | 2374.429224 | 2263.304721 | 0.000000 | 0.000000 | 6.378995 | 5.866953 |
2011 | 19.247387 | 16.817844 | 21.635308 | 19.014870 | 25.855981 | 22.973978 | 5.851336 | 6.394052 | 3.319702 | 4.169094 | 2326.248548 | 2358.736059 | 0.340302 | 0.000000 | 6.714286 | 6.066914 |
2012 | 19.838052 | 17.802974 | 22.339751 | 20.111524 | 26.695357 | 24.167286 | 5.792752 | 6.297398 | 3.268908 | 4.087332 | 2282.502831 | 2314.498141 | 0.634202 | 1.267658 | 6.834655 | 6.286245 |
2013 | 20.982888 | 19.453815 | 23.471658 | 21.823293 | 27.860963 | 26.164659 | 5.658824 | 6.152610 | 3.179253 | 3.884311 | 2208.288770 | 2220.080321 | 0.853476 | 2.763052 | 7.033155 | 6.385542 |
2014 | 21.159919 | 20.506329 | 23.655870 | 23.012658 | 28.088057 | 27.523207 | 5.719636 | 5.852321 | 3.211738 | 3.614723 | 2212.196356 | 2139.029536 | 0.859312 | 3.683544 | 7.210526 | 6.046414 |
2015 | 21.350000 | 21.817490 | 23.935294 | 24.441065 | 28.481373 | 28.996198 | 5.604902 | 5.752852 | 3.101696 | 3.606063 | 2164.215686 | 2089.353612 | 0.638235 | 8.296578 | 7.211765 | 6.353612 |
2016 | 21.903749 | 25.061818 | 24.439716 | 27.701818 | 28.866261 | 32.265455 | 5.493414 | 5.356364 | 2.992272 | 3.277454 | 2127.608916 | 1960.545455 | 1.017224 | 17.214545 | 7.296859 | 6.301818 |
2017 | 22.423795 | 24.003623 | 24.910521 | 26.496377 | 29.208456 | 30.829710 | 5.431662 | 5.532609 | 2.919041 | 3.420272 | 2114.110128 | 2031.884058 | 1.243854 | 15.731884 | 7.474926 | 6.304348 |
2018 | 22.310442 | 23.526690 | 24.779868 | 25.925267 | 29.042333 | 30.145907 | 5.396990 | 5.597865 | 2.886801 | 3.391101 | 2121.448730 | 2037.900356 | 1.135466 | 12.537367 | 7.391345 | 5.601423 |
2019 | 23.084221 | 24.169014 | 25.456922 | 26.250000 | 29.560503 | 30.042254 | 5.315586 | 5.559859 | 2.839671 | 3.419375 | 2093.659245 | 2093.133803 | 2.581801 | 16.419014 | 7.545983 | 5.647887 |
2020 | 22.579487 | 24.071429 | 25.174359 | 26.571429 | 29.543590 | 30.642857 | 5.148718 | 4.000000 | 2.692823 | 1.978795 | 2050.256410 | 1650.000000 | 2.446154 | 0.000000 | 7.743590 | 7.785714 |
# back to simpler example, adding plots
(autos2
 .assign(country=autos2.make.apply(country))
 # can group by more than one column
 .groupby(['year','country'])
 .mean()  # .std()
 # rotates or unstacks the innermost index, country
 .unstack()
 # focus on city08 column
 .city08
 .plot()
 .legend(bbox_to_anchor=(1,1))
)
/tmp/ipykernel_2753/2308684326.py:6: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
.mean()
<matplotlib.legend.Legend at 0x7f2091b75e10>
# time series smoothing - rolling average(2)
(autos2
 .assign(country=autos2.make.apply(country))
 # can group by more than one column
 .groupby(['year','country'])
 .mean()  # .std()
 # rotates or unstacks the innermost index, country
 .unstack()
 # focus on city08 column
 .city08
 # rolling average to smooth time series
 .rolling(2)
 .mean()
 .plot()
 .legend(bbox_to_anchor=(1,1))
)
/tmp/ipykernel_2753/2271732648.py:6: FutureWarning: The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
.mean()
<matplotlib.legend.Legend at 0x7f2091a62ad0>
# can we emulate a SQL GROUP BY - HAVING query?
# let's only show grouped values over say 750
def vals_gt(df_, num):
    return df_[df_.gt(num)].dropna()

(autos2
 .assign(country=autos2.make.apply(country))
 # can group by more than one column
 .groupby(['year','country'])
 .count()
 .pipe(vals_gt, 750)
)
city08 | comb08 | highway08 | cylinders | displ | drive | fuelCost08 | make | model | range | createdOn | automatic | speeds | ffs | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | country | ||||||||||||||
1984 | Other | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1131.0 | 1065.0 |
1985 | Other | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 906.0 | 821.0 |
2017 | Other | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 766.0 |
2018 | Other | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 1063.0 | 844.0 |
2019 | Other | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 1033.0 | 855.0 |
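For one-off filters, the same HAVING-style step can also be written inline with a lambda passed to `.pipe`, instead of a named helper. Here is a minimal sketch on a small made-up frame (not the vehicles dataset), keeping only groups whose count exceeds a threshold:

```python
import pandas as pd

# toy data standing in for the vehicles dataset
df = pd.DataFrame({'year': [1984, 1984, 1985],
                   'country': ['Other', 'Other', 'US'],
                   'city08': [10, 20, 30]})

# GROUP BY ... HAVING count > 1, as a single chained expression
result = (df
          .groupby(['year', 'country'])
          .count()
          .pipe(lambda df_: df_[df_.city08.gt(1)]))
```

The named-helper version reads better when the filter is reused; the lambda keeps short, one-off steps inside the chain.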
Key takeaways
Having code that is clean, readable and has a logical flow is invaluable. Structured Query Language (SQL), as the name suggests, already pushes you down this road. Python, although often referred to as the Swiss Army knife of programming languages, can quickly become scattered and difficult to follow, particularly during the exploratory data analysis (EDA) phase. However, by adopting the chaining method covered in this blog, you can ensure that others (as well as your future self!) can follow exactly what is going on.
Although computer memory continues to get cheaper, I think it is good practice to keep track of how much data is being used, and this blog covers some interesting ways to go about this, e.g. numpy.iinfo. In the same breath, speed is also of the essence, and the handy Jupyter cell magic %%timeit provides an easy way to quantify it. With pandas there are generally several ways to arrive at the same outcome, but as this blog covers, there can be substantial time savings if you know which method to use and when, and which to avoid where possible, in particular .apply.
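To make the contrast concrete, here is a minimal before/after sketch on toy data (made up for illustration): the same transformation written with scattered intermediate variables, then as one chained recipe that reads top to bottom.

```python
import pandas as pd

df = pd.DataFrame({'make': ['Ford', 'BMW', 'Ford'],
                   'city08': [20, 25, 30]})

# scattered style: intermediate names pile up and obscure the flow
tmp = df[df.city08 > 15]
tmp = tmp.groupby('make').mean()
result_scattered = tmp.sort_values('city08')

# chained style: each step on its own line, like a recipe
result_chained = (df
                  .loc[df.city08 > 15]
                  .groupby('make')
                  .mean()
                  .sort_values('city08'))

# both produce identical results
assert result_scattered.equals(result_chained)
```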
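For instance, numpy.iinfo reports the bounds of an integer dtype, which tells you whether a column can be safely downcast to save memory. A quick sketch (the series here is made up):

```python
import numpy as np
import pandas as pd

# int8 holds -128..127
info = np.iinfo(np.int8)

s = pd.Series([1, 50, 100], dtype='int64')

# only downcast if every value fits in the smaller dtype
if info.min <= s.min() and s.max() <= info.max:
    s = s.astype('int8')

# one byte per value instead of eight
print(s.memory_usage(index=False))
```

In a notebook, wrapping a candidate approach in a %%timeit cell gives the matching speed comparison.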
This blog was produced in response to the Effective Pandas video presented by Matt Harrison, which you can view below. Thanks also to Santiago for the heads-up on Twitter.
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/zgbUk90aQ6A?start=482" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
If the %%HTML cell magic above is unfamiliar to you, then I recommend the video below by Corey Schafer: an excellent introduction for anyone starting out with Jupyter Notebooks.
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/HW29067qVWk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>