Use IN queries in an efficient manner
This topic describes how to determine the optimal number of values in the IN clause of your query when the query is performed in PolarDB-X.
Introduction
In actual scenarios, you often need to use constant metrics as the query conditions to perform IN queries. For each IN query, the field values in the IN clause are the values of a partition key. For example, in an e-commerce scenario, all orders are recorded in a table named Order. This table is partitioned by order ID. A buyer often queries the details about specific orders based on the order list. If the buyer has two orders, the number of values in the IN clause of the query is two. In theory, the query is routed to two shards. The following sample code provides an example of the SQL statement:
SELECT * FROM ORDER WHERE ORDER_ID IN (id1,id2)
The number of values in the IN clause increases when the number of orders increases. In this case, an IN query may need to be routed to all the shards. This increases the response time of the query. The following figure shows the relationship between the number of values in the IN clause and the response time, and the relationship between the number of shards to scan and the response time.
To minimize the scan workload and reduce the number of SQL statements that are caused by increases in the number of values within the IN clause, PolarDB-X V5.4.8-16069335 or later supports dynamic partition pruning that is performed based on the number of values in the IN clause.
If the Order table in the preceding example has 128 shards and the number of values in the IN clause is 128, you need to run the following query:
SELECT * FROM ORDER WHERE ORDER_ID IN (id1,id2,id3....id128)
In versions earlier than PolarDB-X V5.4.8-16069335, the IN query can be routed to up to 128 shards if each value resides in a different shard. The IN clause in each physical query statement that is sent to ApsaraDB RDS for MySQL contains 128 values because the values are not pruned. This increases the workload of executing SQL statements. The following sample code provides an example on how queries are run:
SELECT * FROM ORDER WHERE ORDER_ID_1 IN (id1,id2,id3....id128);
SELECT * FROM ORDER WHERE ORDER_ID_2 IN (id1,id2,id3....id128);
SELECT * FROM ORDER WHERE ORDER_ID_3 IN (id1,id2,id3....id128);
.....
SELECT * FROM ORDER WHERE ORDER_ID_128 IN (id1,id2,id3....id128);
In PolarDB-X V5.4.8-16069335 or later, PolarDB-X calculates the shards to which the IN query is to be routed at the computing layer. Before each physical query statement is sent to ApsaraDB RDS for MySQL, PolarDB-X uses the dynamic partition pruning feature to ensure that the IN clause in the SQL statement of the query contains only the values that are stored in the shard. This increases the throughput and decreases the response time of an IN query. The following sample code provides an example on how queries are run:
SELECT * FROM ORDER WHERE ORDER_ID_1 IN (id1);
SELECT * FROM ORDER WHERE ORDER_ID_2 IN (id2,id12);
SELECT * FROM ORDER WHERE ORDER_ID_3 IN (id3,id4,id5);
.....
SELECT * FROM ORDER WHERE ORDER_ID_32 IN (id100....id128);
In addition, PolarDB-X uses the single-node parallel execution feature to run the IN query on different shards in parallel. For example, if the values in the IN clause of a query are distributed across 32 shards, the value of the degree of parallelism for each query is equal to the number of cores of your node. If your node in PolarDB-X has 16 cores, the default value of the degree of parallelism is 16. In this case, the IN query routed to 32 shards is completed in 2 batches. Alibaba Cloud provides the following recommendations based on experience to help you determine the number of values in the IN clause of a query:
Specify a number that is significantly smaller than the number of shards. This way, not all shards need to be scanned each time a query is performed.
Make sure that the number of values does not increase due to business development. This prevents the query performance from being reduced because of the business development.
To ensure that the response time is low and throughput is high, specify a number from the range of 8 to 32.
If you follow the preceding recommendations, linear scalability can be performed on your business to handle concurrent IN queries without obvious fluctuations in the response time. For example, PolarDB-X in which a 16-core node is deployed can concurrently run 10,000 IN queries. After another 16-core node is added, PolarDB-X can concurrently run 20,000 IN queries.
Comparison test
This test can help you determine the optimal number of values in the IN clause of a query to ensure low response time and a high throughput. In the test environment, two nodes and a table that has 64 shards are used. The specification for each of the nodes is 16 cores and 64 GB memory. Each of the table shards contains millions of data records. The test results show the changes in response time and throughput when the number of values in the IN clause and the number of concurrent queries increase. In PolarDB-X V5.4.8-16069335 or later, the dynamic partition pruning feature is optimized to process IN queries. The excessive values in the IN clause of each SQL statement in your query are pruned. The following figures show the test results.
The following figure shows the changes in response time when the number of concurrent queries and the number of values in the IN clause increase, after the dynamic partition pruning feature is enabled.
The following figure shows the changes in throughput when the number of concurrent queries and the number of values in the IN clause increase, after the dynamic partition pruning feature is enabled.
The following figure shows the changes in response time when the number of concurrent queries and the number of values in the IN clause increase, after the dynamic partition pruning feature is disabled.
The following figure shows the changes in throughput when the number of concurrent queries and the number of values in the IN clause increase, after the dynamic partition pruning feature is disabled.
The following conclusions can be drawn from this comparison test:
When the number of values in the IN clause is between 8 and 32, you can maintain low response time and a high throughput. In this case, the value of the degree of parallelism is close to the default value of the degree of parallelism of the single-node parallel execution feature that is provided by PolarDB-X. The default value of the degree of parallelism is equal to the number of cores of a node.
We recommend that you update your PolarDB-X to V5.4.8 or later to enable the dynamic partition pruning feature. After the dynamic partition pruning feature provided by PolarDB-X V5.4.8-16069335 or later is enabled, the response time for IN queries decreases and the throughput for IN queries increases.