Optimize and execute aggregation operations
This topic describes how the query optimizer and the query executor process aggregation operations. This reduces the amount of data to be transmitted and improve execution efficiency.
Aggregation
The semantics of aggregation is to aggregate input rows based on the columns that are specified in the GROUP BY clause, or to aggregate all input rows without grouping. PolarDB-X supports the following aggregate functions:
COUNT
SUM
AVG
MAX
MIN
BIT_OR
BIT_XOR
GROUP_CONCAT
Aggregation operators and aggregation optimization
The aggregation operators described in this topic are not pushed down to LogicalView. If an aggregation operator is pushed down to LogicalView, MySQL at the storage layer determines how to execute this operator. In most cases, the HashAgg and SortAgg operators are used to aggregate data.
HashAgg HashAgg uses a hash table to aggregate data by performing the following steps:
Selects a group for an input row based on the hash value of the grouping column of the input row.
Aggregates the data in the input row and the existing data in the group by executing the specified aggregate function.
Repeats the preceding steps until all input rows are processed. Then, HashAgg returns the aggregation result.
> explain select count(*) from t1 join t2 on t1.id = t2.id group by t1.name,t2.name;
Project(count(*)="count(*)")
HashAgg(group="name,name0", count(*)="COUNT()")
BKAJoin(condition="id = id", type="inner")
Gather(concurrent=true)
LogicalView(tables="t1", shardCount=2, sql="SELECT `id`, `name` FROM `t1` AS `t1`")
Gather(concurrent=true)
LogicalView(tables="t2_[0-3]", shardCount=4, sql="SELECT `id`, `name` FROM `t2` AS `t2` WHERE (`id` IN ('?'))")
In the result of the Explain statement, the HashAgg operator also contains the following key information:
group: The columns by which data is grouped. In this example, the name column of the t1 table is referenced by name in name,name0. The name column of the t2 table is referenced by name0 in name,name0. Duplicate column names are suffixed by numbers to help you identify each column.
Aggregate function: The output column name that corresponds to the aggregate function is specified before the equal sign (=). The calculation method is specified after the equal sign. In count(*)="COUNT()", count(*) specifies the output column name and COUNT() calculates the number of input rows.
You can disable HashAgg by using the following hint: /*+TDDL:cmd_extra(ENABLE_HASH_AGG=false)*/
.
SortAgg SortAgg aggregates data for each group in sequence after the input rows are sorted based on the grouping column:
Ensures that the input rows are sorted based on the specified grouping column. For example, if the MergeSort or MemSort operator is used, the input rows are sorted.
Reads the input data of each row in sequence, and then aggregates data for the row if the group of the row is the same as the current group.
Returns the aggregation result of the current group if the group of the row is different from the current group.
Compared with HashAgg, SortAgg consumes less memory because it needs only to handle one group each time. HashAgg consumes more memory because it must store all the groups in the memory.
> explain select count(*) from t1 join t2 on t1.id = t2.id group by t1.name,t2.name order by t1.name, t2.name;
Project(count(*)="count(*)")
MemSort(sort="name ASC,name0 ASC")
HashAgg(group="name,name0", count(*)="COUNT()")
BKAJoin(condition="id = id", type="inner")
Gather(concurrent=true)
LogicalView(tables="t1", shardCount=2, sql="SELECT `id`, `name` FROM `t1` AS `t1`")
Gather(concurrent=true)
LogicalView(tables="t2_[0-3]", shardCount=4, sql="SELECT `id`, `name` FROM `t2` AS `t2` WHERE (`id` IN ('?'))")
You can disable SortAgg by using the following hint: /*+TDDL:cmd_extra(ENABLE_SORT_AGG=false)*/
.
Optimize a two-phase aggregation
A two-phase aggregation consists of the partial aggregation and final aggregation phases. The system aggregates data for each result set, and then combines the aggregation results of all result sets. In the following example, partial aggregation operations are pushed down by HashAgg to each MySQL table shard. The AVG function is also converted to a combination of SUM and COUNT.
> explain select avg(age) from t2 group by name
Project(avg(age)="sum_pushed_sum / sum_pushed_count")
HashAgg(group="name", sum_pushed_sum="SUM(pushed_sum)", sum_pushed_count="SUM(pushed_count)")
Gather(concurrent=true)
LogicalView(tables="t2_[0-3]", shardCount=4, sql="SELECT `name`, SUM(`age`) AS `pushed_sum`, COUNT(`age`) AS `pushed_count` FROM `t2` AS `t2` GROUP BY `name`")
Two-phase aggregations largely reduce the amount of data to be transmitted and improve the execution efficiency. HashAgg is used in most scenarios. SortAgg is applicable only to the following scenarios:
The system needs to process a large amount of data. As a result, memory resources are insufficient.
The input rows of the aggregation operator are sorted based on the columns specified in the GROUP BY clause. This makes SortAgg more efficient because SortAgg does not need to sort the input data.
The data is severely skewed. In this case, HashAgg is less efficient. Use SortAgg instead.