pandas
Aggregate Functions
groupby
For a DataFrame
, groupby
groups each unique value in a given column (or set of columns) and allows you to perform operations on those groups. For example, we can calculate the average departure delay (DepDelay
) for each Year
in our data by using groupby
.
import pandas as pd
myDF = pd.read_csv("flights_sample.csv")
myDF.head()
Year Month DayofMonth ... NASDelay SecurityDelay LateAircraftDelay 0 1987 10 14 ... NaN NaN NaN 1 1990 10 15 ... NaN NaN NaN 2 1990 10 17 ... NaN NaN NaN 3 1990 10 18 ... NaN NaN NaN 4 1991 10 19 ... NaN NaN NaN
myDF.groupby("Year").mean()
Month DayofMonth DayOfWeek ... NASDelay SecurityDelay LateAircraftDelay Year ... 1987 10.0 14.000000 3.000000 ... NaN NaN NaN 1990 10.0 16.666667 5.666667 ... NaN NaN NaN 1991 10.0 21.800000 3.800000 ... NaN NaN NaN
As you can see, the average for each column is now calculated for each of the 3 Year
values in the dataset: 1987, 1990, and 1991. If you wanted the DepDelay
column by itself you can isolate it before or after the calculation:
# before
print(myDF.groupby("Year")['DepDelay'].mean())
# after
print(myDF.groupby("Year").mean()['DepDelay'])
# the outputs are the same; we include only one print below
Year 1987 11 1990 3 1991 6
You have the ability to group by multiple variables as well! For example, you could find the mean DepDelay
for each day of the week for each year:
myDF.groupby(["Year", "DayOfWeek"])['DepDelay'].mean()
Year DayOfWeek 1987 3 11 1990 4 -1 6 11 7 -1 1991 1 19 3 -2 4 -2 5 1 6 14
You may notice that using groupby
will return your group as an index instead of a column (or multiple groups as a MultiIndex
). You can use the reset_index()
method return a DataFrame
with default indexing instead of the DataFrameGroupBy
object it usually returns:
myDF.groupby(["Year", "DayOfWeek"])['DepDelay'].mean().reset_index()
Year DayOfWeek DepDelay 0 1987 3 11 1 1990 4 -1 2 1990 6 11 3 1990 7 -1 4 1991 1 19 5 1991 3 -2 6 1991 4 -2 7 1991 5 1 8 1991 6 14
agg
pandas
also offers the agg
function, which takes another function (or list of functions) as its argument, returning the name of the function as the index and the result of the function’s application for each column.
Let’s look at an example from the documentation:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
[np.nan, np.nan, np.nan]],
columns=['A', 'B', 'C'])
print(df.agg(['sum', 'min']))
A B C sum 12.0 15.0 18.0 min 1.0 2.0 3.0
As mentioned, we get the name of our functions (sum
and min
) as the resulting indices, then the sum and maximum value is applied to each column.
When we have invalid entries (like the fourth row with all missing values), they are simply removed from consideration. If there are no complete columns that can be processed by the function(s), we get the error |
The end result of agg
is similar to groupby
, but it can be applied to groupby
itself to enhance its output AND/OR allows for different functions to be performed simultaneously on different columns. Here’s an example of both at work:
list_1 = ['Wisconsin', 'IU', 'Rutgers', 'Michigan State', 'Ohio State']
list_2 = ['home', 'away', 'home', 'away', 'home']
list_3 = [85, 78, 90, 75, 74]
list_4 = [500, 1000, 430, 4800, 10000]
myDF = pd.DataFrame(zip(list_1, list_2, list_3, list_4), columns=['opponent', 'location', 'temp', 'attendance'])
print(myDF.groupby('location').agg({'temp': 'mean', 'attendance': 'sum'}).reset_index())
location temp attendance 0 away 76.5 5800 1 home 83.0 10930
We group by location type ("home" or "away"), then supply agg
a dictionary
that asserts mean
will be applied to temp
, while sum
will be applied to attendance
, then reset the index to put location
back into the DataFrame
. That’s a lot of words to describe what we did with one line of code — agg
enabled us to use two different functions for analysis within one DataFrame
grouping.
transform
transform
, much like agg
, is an aggregating function that can be used on its own or to augment the result of groupby
.
Like agg
, it takes a function as an argument (including custom lambda
functions) and applies to the values or columns of a DataFrame
. What’s different here is that the output of transform
has to be the same length as the DataFrame
— if there are 8 rows in the input, there must be 8 rows in the output. We’ll demonstrate this with an example from the documentation:
df = pd.DataFrame({
"Date": [
"2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
"2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
"Data": [5, 8, 6, 1, 50, 100, 60, 120],
})
print(df)
print()
print(df.groupby('Date')['Data'].transform(sum))
Date Data 0 2015-05-08 5 1 2015-05-07 8 2 2015-05-06 6 3 2015-05-05 1 4 2015-05-08 50 5 2015-05-07 100 6 2015-05-06 60 7 2015-05-05 120 0 55 1 108 2 66 3 121 4 55 5 108 6 66 7 121 Name: Data, dtype: int64
We see that 55 repeats at indices 0 and 4, 108 at indices 1 and 5, and so on. This corresponds to the dates that match, leading to the obvious conclusion from our groupby
function that each date has one sum. We can use groupby
on its own to verify this:
print(df.groupby('Date')['Data'].sum())
Date 2015-05-05 121 2015-05-06 66 2015-05-07 108 2015-05-08 55 Name: Data, dtype: int64
As it would seem, transform
is useful when combined with groupby
to create a new column in your DataFrame
based on the result of groupby
.
Examples
Create the group_size
column in the experiment
data set, which includes the size of the response
group. Use groupby
and transform
to accomplish this in 1 line of code.
Click to see solution
experiment = pd.DataFrame({
"treatment": ['a', 'a', 'a', 'b', 'b', 'b'],
"response": ["y", "n", "o", "y", "y", "o"]
})
experiment['group_size'] = experiment.groupby('response').transform(len)
print(experiment)
treatment response group_size 0 a y 3 1 a n 1 2 a o 2 3 b y 3 4 b y 3 5 b o 2