Optimize and execute subqueries

A subquery is a SELECT statement that is nested inside the WHERE or HAVING clause of a query. This topic describes how the query optimizer and executor of PolarDB-X process subqueries.

Subquery

A subquery is classified as a non-correlated subquery or a correlated subquery based on whether it uses values from the outer query. A non-correlated subquery is executed independent of the variables of the outer query. In most cases, non-correlated subqueries are executed only once. A correlated subquery uses variables from the outer query. A correlated subquery is executed on each input row because the subquery must use the values of the variables from the outer query.

/* An example of non-correlated subqueries. */
SELECT * FROM lineitem WHERE l_partkey IN (SELECT p_partkey FROM part);

/* An example of correlated subqueries. l_suppkey is the column referenced from the outer query. */
SELECT * FROM lineitem WHERE l_partkey IN (SELECT ps_partkey FROM partsupp WHERE ps_suppkey = l_suppkey);

PolarDB-X supports most SQL subqueries. For more information, see SQL limits.

Execute a subquery

PolarDB-X can convert common subqueries to SEMIJOIN statements or JOIN statements to improve execution efficiency. This way, the system no longer needs to iterate a group of nested parameters if a large amount of data is involved. This significantly reduces the execution cost. This subquery conversion method is known as unnesting. The following example show how PolarDB-X unnests two subqueries by replacing them with JOIN statements in the execution plans.

> EXPLAIN SELECT p_partkey, (
      SELECT COUNT(ps_partkey) FROM partsupp WHERE ps_suppkey = p_partkey
      ) supplier_count FROM part;

Project(p_partkey="p_partkey", supplier_count="CASE(IS NULL($10), 0, $9)", cor=[$cor0])
  HashJoin(condition="p_partkey = ps_suppkey", type="left")
    Gather(concurrent=true)
      LogicalView(tables="part_[0-7]", shardCount=8, sql="SELECT * FROM `part` AS `part`")
    Project(count(ps_partkey)="count(ps_partkey)", ps_suppkey="ps_suppkey", count(ps_partkey)2="count(ps_partkey)")
      HashAgg(group="ps_suppkey", count(ps_partkey)="SUM(count(ps_partkey))")
        Gather(concurrent=true)
          LogicalView(tables="partsupp_[0-7]", shardCount=8, sql="SELECT `ps_suppkey`, COUNT(`ps_partkey`) AS `count(ps_partkey)` FROM `partsupp` AS `partsupp` GROUP BY `ps_suppkey`")

However, PolarDB-X cannot unnest subqueries in some scenarios. In these scenarios, a query can be executed only after the subqueries are executed. If the outer query involves a large amount of data, the iteration may be time-consuming. In the following example, the subquery cannot be unnested because the value of l_partkey is less than 50 in specific rows. Therefore, PolarDB-X performs a nested iteration. To improve the execution efficiency, we recommend that you delete the OR condition and rewrite the SQL statement.

> EXPLAIN SELECT * FROM lineitem WHERE l_partkey IN (SELECT ps_partkey FROM partsupp WHERE ps_suppkey = l_suppkey) OR l_partkey IS NOT

Filter(condition="IS(in,[$1])[29612489] OR l_partkey < ?0")
  Gather(concurrent=true)
    LogicalView(tables="QIMU_0000_GROUP,QIMU_0001_GROUP.lineitem_[0-7]", shardCount=8, sql="SELECT * FROM `lineitem` AS `lineitem`")

>> individual correlate subquery : 29612489
Gather(concurrent=true)
  LogicalView(tables="QIMU_0000_GROUP,QIMU_0001_GROUP.partsupp_[0-7]", shardCount=8, sql="SELECT * FROM (SELECT `ps_partkey` FROM `partsupp` AS `partsupp` WHERE (`ps_suppkey` = `l_suppkey`)) AS `t0` WHERE (((`l_partkey` = `ps_partkey`) OR (`l_partkey` IS NULL)) OR (`ps_partkey` IS NULL))")

results matching ""

    No results matching ""