pandas Aggregate Functions

groupby

For a DataFrame, groupby groups each unique value in a given column (or set of columns) and allows you to perform operations on those groups. For example, we can calculate the average departure delay (DepDelay) for each Year in our data by using groupby.

import pandas as pd

myDF = pd.read_csv("flights_sample.csv")
myDF.head()
   Year  Month  DayofMonth  ...  NASDelay  SecurityDelay  LateAircraftDelay
0  1987     10          14  ...       NaN            NaN                NaN
1  1990     10          15  ...       NaN            NaN                NaN
2  1990     10          17  ...       NaN            NaN                NaN
3  1990     10          18  ...       NaN            NaN                NaN
4  1991     10          19  ...       NaN            NaN                NaN
myDF.groupby("Year").mean()
      Month  DayofMonth  DayOfWeek  ...  NASDelay  SecurityDelay  LateAircraftDelay
Year                                ...
1987   10.0   14.000000   3.000000  ...       NaN            NaN                NaN
1990   10.0   16.666667   5.666667  ...       NaN            NaN                NaN
1991   10.0   21.800000   3.800000  ...       NaN            NaN                NaN

As you can see, the average for each column is now calculated for each of the 3 Year values in the dataset: 1987, 1990, and 1991. If you wanted the DepDelay column by itself you can isolate it before or after the calculation:

# before
print(myDF.groupby("Year")['DepDelay'].mean())
# after
print(myDF.groupby("Year").mean()['DepDelay'])

# the outputs are the same; we include only one print below
Year
1987    11
1990     3
1991     6

You have the ability to group by multiple variables as well! For example, you could find the mean DepDelay for each day of the week for each year:

myDF.groupby(["Year", "DayOfWeek"])['DepDelay'].mean()
Year  DayOfWeek
1987  3            11
1990  4            -1
      6            11
      7            -1
1991  1            19
      3            -2
      4            -2
      5             1
      6            14

You may notice that using groupby will return your group as an index instead of a column (or multiple groups as a MultiIndex). You can use the reset_index() method return a DataFrame with default indexing instead of the DataFrameGroupBy object it usually returns:

myDF.groupby(["Year", "DayOfWeek"])['DepDelay'].mean().reset_index()
   Year  DayOfWeek  DepDelay
0  1987          3        11
1  1990          4        -1
2  1990          6        11
3  1990          7        -1
4  1991          1        19
5  1991          3        -2
6  1991          4        -2
7  1991          5         1
8  1991          6        14


Examples

How do I calculate the number of rows for each Year?

Click to see solution
myDF.groupby("Year")["DepDelay"].median()

How do I calculate the median DepDelay by Year?

Click to see solution
myDF.groupby("Year")["DepDelay"].median()


agg

pandas also offers the agg function, which takes another function (or list of functions) as its argument, returning the name of the function as the index and the result of the function’s application for each column.

Let’s look at an example from the documentation:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])
print(df.agg(['sum', 'min']))
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

As mentioned, we get the name of our functions (sum and min) as the resulting indices, then the sum and maximum value is applied to each column.

When we have invalid entries (like the fourth row with all missing values), they are simply removed from consideration. If there are no complete columns that can be processed by the function(s), we get the error ValueError: no results.

The end result of agg is similar to groupby, but it can be applied to groupby itself to enhance its output AND/OR allows for different functions to be performed simultaneously on different columns. Here’s an example of both at work:

list_1 = ['Wisconsin', 'IU', 'Rutgers', 'Michigan State', 'Ohio State']
list_2 = ['home', 'away', 'home', 'away', 'home']
list_3 = [85, 78, 90, 75, 74]
list_4 = [500, 1000, 430, 4800, 10000]

myDF = pd.DataFrame(zip(list_1, list_2, list_3, list_4), columns=['opponent', 'location', 'temp', 'attendance'])

print(myDF.groupby('location').agg({'temp': 'mean', 'attendance': 'sum'}).reset_index())
  location  temp  attendance
0     away  76.5        5800
1     home  83.0       10930

We group by location type ("home" or "away"), then supply agg a dictionary that asserts mean will be applied to temp, while sum will be applied to attendance, then reset the index to put location back into the DataFrame. That’s a lot of words to describe what we did with one line of code — agg enabled us to use two different functions for analysis within one DataFrame grouping.


transform

transform, much like agg, is an aggregating function that can be used on its own or to augment the result of groupby.

Like agg, it takes a function as an argument (including custom lambda functions) and applies to the values or columns of a DataFrame. What’s different here is that the output of transform has to be the same length as the DataFrame — if there are 8 rows in the input, there must be 8 rows in the output. We’ll demonstrate this with an example from the documentation:

df = pd.DataFrame({
    "Date": [
        "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
        "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
    "Data": [5, 8, 6, 1, 50, 100, 60, 120],
})
print(df)
print()
print(df.groupby('Date')['Data'].transform(sum))
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120

0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64

We see that 55 repeats at indices 0 and 4, 108 at indices 1 and 5, and so on. This corresponds to the dates that match, leading to the obvious conclusion from our groupby function that each date has one sum. We can use groupby on its own to verify this:

print(df.groupby('Date')['Data'].sum())
Date
2015-05-05    121
2015-05-06     66
2015-05-07    108
2015-05-08     55
Name: Data, dtype: int64

As it would seem, transform is useful when combined with groupby to create a new column in your DataFrame based on the result of groupby.


Examples

Create the group_size column in the experiment data set, which includes the size of the response group. Use groupby and transform to accomplish this in 1 line of code.

Click to see solution
experiment = pd.DataFrame({
    "treatment": ['a', 'a', 'a', 'b', 'b', 'b'],
    "response": ["y", "n", "o", "y", "y", "o"]
})

experiment['group_size'] = experiment.groupby('response').transform(len)
print(experiment)
  treatment response  group_size
0         a        y           3
1         a        n           1
2         a        o           2
3         b        y           3
4         b        y           3
5         b        o           2