Aggregation
To aggregate data in Vega-Lite, users can either use the `aggregate` property of an encoding field definition or the `aggregate` transform inside the `transform` array. An aggregate summarizes a table as one record for each group. To preserve the original table structure and instead add a new column with the aggregate values, use the join aggregate transform.
- Aggregate in Encoding Field Definition
- Aggregate Transform
- Supported Aggregation Operations
- Argmin / Argmax
Aggregate in Encoding Field Definition
```json
// A Single View or a Layer Specification
{
  ...,
  "mark/layer": ...,
  "encoding": {
    "x": {
      "aggregate": ..., // aggregate
      "field": ...,
      "type": "quantitative",
      ...
    },
    "y": ...,
    ...
  },
  ...
}
```
The `aggregate` property of a field definition can be used to compute aggregate summary statistics (e.g., median, min, max) over groups of data.

If at least one field in the specified encoding channels contains `aggregate`, the resulting visualization will show aggregate data. In this case, all fields without a specified aggregation function are treated as group-by fields^{1} in the aggregation process.
For example, the following bar chart aggregates the mean of `Acceleration`, grouped by the number of `Cylinders`.
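The bar chart described above could be specified roughly as follows. This is a minimal sketch: the data URL `data/cars.json` and the choice of an ordinal type for `Cylinders` are assumptions, not details stated in the text.

```json
{
  "description": "Sketch only: assumes a cars dataset at data/cars.json with Cylinders and Acceleration fields.",
  "data": {"url": "data/cars.json"},
  "mark": "bar",
  "encoding": {
    "x": {"field": "Cylinders", "type": "ordinal"},
    "y": {"aggregate": "mean", "field": "Acceleration", "type": "quantitative"}
  }
}
```

Because `x` has no `aggregate`, `Cylinders` acts as the group-by field, and the mean is computed once per distinct cylinder count.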
The `detail` channel can be used to specify additional summary and group-by fields without mapping the field(s) to any visual properties. For example, the following plots add `Origin` as a group-by field.
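A spec using `detail` in this way might look like the sketch below. The data URL and the `Horsepower` and `Displacement` fields are assumptions chosen for illustration; only `Origin` is named in the text.

```json
{
  "description": "Sketch only: assumes a cars dataset at data/cars.json; Horsepower and Displacement are illustrative field choices.",
  "data": {"url": "data/cars.json"},
  "mark": "point",
  "encoding": {
    "x": {"aggregate": "mean", "field": "Horsepower", "type": "quantitative"},
    "y": {"aggregate": "mean", "field": "Displacement", "type": "quantitative"},
    "detail": {"field": "Origin", "type": "nominal"}
  }
}
```

Without the `detail` channel, both aggregated fields would collapse into a single group. With it, the means are computed once per `Origin` value, even though `Origin` is not mapped to position, color, or any other visual property.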
^{1}The group-by fields are also known as independent/condition variables in statistics and dimensions in Business Intelligence. Similarly, the aggregate fields are known as dependent variables and measures.
Aggregate Transform
```json
// Any View Specification
{
  ...
  "transform": [
    {
      // Aggregate Transform
      "aggregate": [{"op": ..., "field": ..., "as": ...}],
      "groupby": [...]
    }
    ...
  ],
  ...
}
```
For example, here is the same bar chart, which aggregates the mean of `Acceleration` grouped by the number of `Cylinders`, but this time using the `aggregate` property as part of the `transform`.
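Expressed with an aggregate transform, that chart could be written along these lines. The data URL and the output field name `mean_acc` are assumptions made for this sketch.

```json
{
  "description": "Sketch only: assumes a cars dataset at data/cars.json; mean_acc is a hypothetical output field name.",
  "data": {"url": "data/cars.json"},
  "transform": [
    {
      "aggregate": [{"op": "mean", "field": "Acceleration", "as": "mean_acc"}],
      "groupby": ["Cylinders"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Cylinders", "type": "ordinal"},
    "y": {"field": "mean_acc", "type": "quantitative"}
  }
}
```

Here the encoding refers to the already aggregated field `mean_acc`, so no `aggregate` property is needed on the `y` channel.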
An `aggregate` transform in the `transform` array has the following properties:
Property | Type | Description |
---|---|---|
aggregate | AggregatedFieldDef[] | Required. Array of objects that define fields to aggregate. |
groupby | FieldName[] | The data fields to group by. If not specified, a single group containing all data objects will be used. |
Aggregated Field Definition for Aggregate Transform
Property | Type | Description |
---|---|---|
op | String | Required. The aggregation operation to apply to the fields (e.g., sum, average or count). See the full list of supported aggregation operations for more information. |
field | FieldName | The data field for which to compute the aggregate function. This is required for all aggregation operations except `count`. |
as | FieldName | Required. The output field names to use for each aggregated field. |
Note: It is important to `parse` your data types explicitly, especially if your dataset is likely to contain `null` values, in which case automatic type inference will fail.
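One way to declare types explicitly is the `parse` property of the data `format` object. The sketch below assumes a cars dataset at `data/cars.json` and treats `Acceleration` as the field that might contain `null` values; the output name `mean_acc` is hypothetical.

```json
{
  "description": "Sketch only: assumes data/cars.json; Acceleration is parsed as a number explicitly instead of relying on type inference.",
  "data": {
    "url": "data/cars.json",
    "format": {"type": "json", "parse": {"Acceleration": "number"}}
  },
  "transform": [
    {
      "aggregate": [{"op": "mean", "field": "Acceleration", "as": "mean_acc"}],
      "groupby": ["Cylinders"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Cylinders", "type": "ordinal"},
    "y": {"field": "mean_acc", "type": "quantitative"}
  }
}
```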
Supported Aggregation Operations
The supported aggregation operations are:
Operation | Description |
---|---|
count | The total count of data objects in the group. Note: `count` operates directly on the input objects and returns the same value regardless of the provided field. Similar to SQL's `count(*)`, `count` can be specified with a field `"*"`. |
valid | The count of field values that are not `null`, `undefined` or `NaN`. |
missing | The count of null or undefined field values. |
distinct | The count of distinct field values. |
sum | The sum of field values. |
mean | The mean (average) field value. |
average | The mean (average) field value. Identical to mean. |
variance | The sample variance of field values. |
variancep | The population variance of field values. |
stdev | The sample standard deviation of field values. |
stdevp | The population standard deviation of field values. |
stderr | The standard error of field values. |
median | The median field value. |
q1 | The lower quartile boundary of field values. |
q3 | The upper quartile boundary of field values. |
ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean field value. |
ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean field value. |
min | The minimum field value. |
max | The maximum field value. |
argmin | An input data object containing the minimum field value. |
argmax | An input data object containing the maximum field value. |
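Several operations from the table can be combined in a single aggregate transform. The following sketch assumes a cars dataset at `data/cars.json`; the output names `num_cars`, `mean_hp`, and `max_hp` are hypothetical.

```json
{
  "description": "Sketch only: assumes data/cars.json; num_cars, mean_hp and max_hp are hypothetical output field names.",
  "data": {"url": "data/cars.json"},
  "transform": [
    {
      "aggregate": [
        {"op": "count", "as": "num_cars"},
        {"op": "mean", "field": "Horsepower", "as": "mean_hp"},
        {"op": "max", "field": "Horsepower", "as": "max_hp"}
      ],
      "groupby": ["Origin"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "Origin", "type": "nominal"},
    "y": {"field": "num_cars", "type": "quantitative"}
  }
}
```

The `count` entry omits `field` because it counts input objects rather than field values. Every entry still needs its own `as` name.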
Argmin / Argmax
The `argmax` and `argmin` operations can be specified in an encoding field definition by setting `aggregate` to an object with an `argmax` or `argmin` property describing the field to maximize or minimize. For example, the following plot shows the production budget of the movie that has the highest US Gross in each major genre.
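The plot described above might be specified as in this sketch. The data URL and the exact field names (`US Gross`, `Production Budget`, `Major Genre`) are assumptions based on the wording in the text; some versions of the dataset may use different spellings.

```json
{
  "description": "Sketch only: assumes a movies dataset at data/movies.json with fields named as in the prose.",
  "data": {"url": "data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": {"argmax": "US Gross"},
      "field": "Production Budget",
      "type": "quantitative"
    },
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```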
This is equivalent to specifying `argmax` in an aggregate transform and encoding its nested data.
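A transform-based version could look like the sketch below. It assumes the same movies dataset and field names, a hypothetical output name `argmax_US_Gross`, and bracket notation in the encoding `field` to reach the nested value.

```json
{
  "description": "Sketch only: assumes data/movies.json; argmax_US_Gross is a hypothetical output name holding the whole input object.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [{"op": "argmax", "field": "US Gross", "as": "argmax_US_Gross"}],
      "groupby": ["Major Genre"]
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "argmax_US_Gross['Production Budget']", "type": "quantitative"},
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```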
Note: When accessing aggregated argmax/argmin fields, the aggregated fields must be flattened, due to the nested field issue. The aggregated fields can be flattened with the calculate transform as done in the CO2 example.
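Flattening with a `calculate` transform might be sketched as follows. The movies dataset, its field names, and the output names `argmax_US_Gross` and `top_budget` are assumptions for illustration; this is not the CO2 example itself.

```json
{
  "description": "Sketch only: assumes data/movies.json; argmax_US_Gross and top_budget are hypothetical field names.",
  "data": {"url": "data/movies.json"},
  "transform": [
    {
      "aggregate": [{"op": "argmax", "field": "US Gross", "as": "argmax_US_Gross"}],
      "groupby": ["Major Genre"]
    },
    {
      "calculate": "datum.argmax_US_Gross['Production Budget']",
      "as": "top_budget"
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {"field": "top_budget", "type": "quantitative"},
    "y": {"field": "Major Genre", "type": "nominal"}
  }
}
```

After the `calculate` step, `top_budget` is an ordinary top-level field, so the encoding no longer needs to reach into the nested object.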