In the R programming language, Wickham (2011) popularized the so-called split-apply-combine strategy for data transformations. In essence, this strategy **splits** a dataset into distinct groups, **applies** one or more functions to each group, and then **combines** the results. `DataFrames.jl` fully supports split-apply-combine. We will use the student grades example as before. Suppose that we want to know each student’s mean grade:

```
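function all_grades()
    # Rename the year-specific grade columns to a shared :grade column.
    df1 = grades_2020()
    df1 = select(df1, :name, :grade_2020 => :grade)
    df2 = grades_2021()
    df2 = select(df2, :name, :grade_2021 => :grade)
    # Normalize "Bob 2" to "Bob" so both years refer to the same student.
    rename_bob2(data_col) = replace.(data_col, "Bob 2" => "Bob")
    df2 = transform(df2, :name => rename_bob2 => :name)
    # Stack both years into one long table.
    return vcat(df1, df2)
end
all_grades()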
```

| name | grade |
|---|---|
| Sally | 1.0 |
| Bob | 5.0 |
| Alice | 8.5 |
| Hank | 4.0 |
| Bob | 9.5 |
| Sally | 9.5 |
| Hank | 6.0 |

The strategy is to **split** the dataset into distinct students, **apply** the mean function to each student, and **combine** the result.

The split step is called `groupby`. Its second argument is the column by which we want to split the dataset:

`groupby(all_grades(), :name)`

```
GroupedDataFrame with 4 groups based on key: name
Group 1 (2 rows): name = "Sally"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Sally       1.0
   2 │ Sally       9.5
Group 2 (2 rows): name = "Bob"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Bob         5.0
   2 │ Bob         9.5
Group 3 (1 row): name = "Alice"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Alice       8.5
Group 4 (2 rows): name = "Hank"
 Row │ name    grade
     │ String  Float64
─────┼─────────────────
   1 │ Hank        4.0
   2 │ Hank        6.0
```
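To make the result of `groupby` more concrete, here is a minimal sketch of how a `GroupedDataFrame` can be inspected. The `grades` table is a hand-written stand-in for `all_grades()` (an assumption, so that the snippet runs on its own), and the variable names are illustrative only.

```julia
using DataFrames

# Hypothetical stand-in for all_grades(), with the same rows as the table above.
grades = DataFrame(
    name = ["Sally", "Bob", "Alice", "Hank", "Bob", "Sally", "Hank"],
    grade = [1.0, 5.0, 8.5, 4.0, 9.5, 9.5, 6.0],
)

gdf = groupby(grades, :name)

# A GroupedDataFrame behaves like a collection of sub-data-frames.
n_groups = length(gdf)

# Look up one group by its key; the result is a view into `grades`.
sally = gdf[(name = "Sally",)]

# Iterate over the groups to inspect each one.
group_sizes = [nrow(g) for g in gdf]
```

Because each group is a view rather than a copy, splitting a large table this way should be cheap.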

We apply the `mean` function from the `Statistics` module in Julia’s standard library:

`using Statistics`

To apply this function to each group, use the `combine` function:

```
gdf = groupby(all_grades(), :name)
combine(gdf, :grade => mean)
```

| name | grade_mean |
|---|---|
| Sally | 5.25 |
| Bob | 7.25 |
| Alice | 8.5 |
| Hank | 5.0 |
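The output column is named `grade_mean` automatically. As a hedged sketch, the `source => function => target` form of a pair lets us choose the name ourselves, and several aggregations can be passed to one `combine` call. The `grades` table again stands in for `all_grades()`, and the column names `:mean_grade` and `:n_grades` are made up for illustration.

```julia
using DataFrames, Statistics

# Hypothetical stand-in for all_grades().
grades = DataFrame(
    name = ["Sally", "Bob", "Alice", "Hank", "Bob", "Sally", "Hank"],
    grade = [1.0, 5.0, 8.5, 4.0, 9.5, 9.5, 6.0],
)

gdf = groupby(grades, :name)

# :grade => mean => :mean_grade sets the output column name explicitly;
# nrow => :n_grades counts the rows in each group.
result = combine(gdf, :grade => mean => :mean_grade, nrow => :n_grades)
```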

Imagine having to do this without the `groupby` and `combine` functions. We would need to loop over our data to split it up into groups, then loop over each group to apply a function, and finally loop over the results to gather them. Therefore, the split-apply-combine technique is a great one to know.
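For comparison, a manual version could look roughly like the sketch below. It uses plain vectors and a `Dict` instead of a data frame; the variable names are illustrative and the data is copied from the grades table.

```julia
using Statistics

student_names = ["Sally", "Bob", "Alice", "Hank", "Bob", "Sally", "Hank"]
student_grades = [1.0, 5.0, 8.5, 4.0, 9.5, 9.5, 6.0]

# Split: collect each student's grades in a dictionary keyed by name.
groups = Dict{String,Vector{Float64}}()
for (name, grade) in zip(student_names, student_grades)
    push!(get!(groups, name, Float64[]), grade)
end

# Apply and combine: compute the mean of each group and gather the
# per-student results into a new dictionary.
mean_grades = Dict(name => mean(grades) for (name, grades) in groups)
```

Even in this small case, the bookkeeping obscures the intent, which `groupby` and `combine` express in two lines.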

But what if we want to apply a function to multiple columns of our dataset?

```
group = [:A, :A, :B, :B]
X = 1:4
Y = 5:8
df = DataFrame(; group, X, Y)
```

| group | X | Y |
|---|---|---|
| A | 1 | 5 |
| A | 2 | 6 |
| B | 3 | 7 |
| B | 4 | 8 |

This is accomplished in a similar manner:

```
gdf = groupby(df, :group)
combine(gdf, [:X, :Y] .=> mean; renamecols=false)
```

| group | X | Y |
|---|---|---|
| A | 1.5 | 5.5 |
| B | 3.5 | 7.5 |

Note that we’ve used the dot `.` operator before the right arrow `=>` to indicate that `mean` has to be applied to each of the source columns `[:X, :Y]`.
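The dot is ordinary Julia broadcasting, so it can be checked without any data frame. This small sketch (variable names are illustrative) shows that the broadcasted form expands to one `column => function` pair per column:

```julia
using Statistics

# Broadcasting => over the vector of column names...
pairs_broadcast = [:X, :Y] .=> mean

# ...builds the same vector of pairs as writing each one out by hand.
pairs_explicit = [:X => mean, :Y => mean]

same_pairs = pairs_broadcast == pairs_explicit
```

Without the dot, `[:X, :Y] => mean` would be a single pair whose source is the whole vector of columns, which is a different request.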

To compose several functions, a simple approach is to define one function that performs the intended transformations in sequence. For instance, for a series of values, let’s first take the `mean` and then `round` the result to a whole number (also known as an integer, `Int`):

```
gdf = groupby(df, :group)
rounded_mean(data_col) = round(Int, mean(data_col))
combine(gdf, [:X, :Y] .=> rounded_mean; renamecols=false)
```

| group | X | Y |
|---|---|---|
| A | 2 | 6 |
| B | 4 | 8 |
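As an alternative sketch, the same helper can be built with Julia’s function composition operator `∘` instead of a named function definition. The name `rounded_mean_composed` is made up for this example.

```julia
using Statistics

# (f ∘ g)(x) is f(g(x)): take the mean first, then round it to an Int.
rounded_mean_composed = (x -> round(Int, x)) ∘ mean

group_a_x = rounded_mean_composed([1, 2])  # mean is 1.5
group_a_y = rounded_mean_composed([5, 6])  # mean is 5.5

# Julia's default rounding mode sends ties to the nearest even integer,
# so a mean of 2.5 rounds down to 2 rather than up to 3.
tie_to_even = round(Int, 2.5)
```

The composed function should be usable in `combine` in place of `rounded_mean`. The tie-breaking rule is worth keeping in mind when group means land exactly on `.5`, as they do in this dataset.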