Practice and optimization in high concurrency scenarios

This topic describes how to determine whether queries are point queries and how to convert queries into point queries for optimization.

Point queries are commonly used when applications access online transaction processing (OLTP) databases. In point queries, only a small amount of data in tables is scanned to return results. For example, when you view orders or commodity information on the Taobao shopping platform, the operations that are performed on the database are point queries. PolarDB-X reduces response time and improves resource usage for point queries. PolarDB-X supports high throughput and is suitable for scenarios in which a large number of read requests are concurrently processed.

What is a point query?

A point query is a query in which only a small amount of data is scanned. If a small amount of data is returned but a large amount of data is scanned in a query, the query is not a point query. For example, although select * from t1 order by c1 limit 1 returns only one data record, this query may not be a point query. The reason is that if no index is created on c1, all data in t1 needs to be scanned and sorted to return results.

In a standalone database, the most common point query is to query data by primary key. The results can be quickly returned after primary key indexes are scanned. In most cases, only logn records need to be scanned. If you query data by using other conditions, you can create a local secondary index. This way, the local secondary index is scanned to retrieve the primary key. Then, the table is accessed based on index row IDs to retrieve complete records. In scenarios in which the local secondary index contains all columns that are involved in the query, the table does not need to be accessed based on index row IDs.

PolarDB-X is a distributed database service. Partitioned tables are supported so that data can be distributed among different data nodes. Data is partitioned into multiple partitions, and then mappings between partitions and data nodes are established. One or more columns are selected as dimensions for data partitioning. These columns constitute the partition key. In distributed databases, query performance is linearly related to the amount of scanned data and positively correlated to the number of scanned partitions. Therefore, the definition of a point query must include the following information: A small number of partitions are scanned.

PolarDB-X provides transparency and distributed capabilities. By default, the primary key is used as the partition key. When data is queried based on the primary key, the partition that stores the data is located first, and then the primary key index on the partition is scanned to return results. This method provides the highest performance. If you query data by using other conditions, you can create a global secondary index. For more information, see GSI. A global secondary index can be used to optimize queries in the same manner as a local secondary index. The primary key is retrieved and then the table is accessed based on index row IDs to retrieve complete records. The difference is that the global secondary index is also a partitioned table. The data and the base table are stored on different data nodes. In most cases, data is transferred among nodes and network overhead is required when the table is accessed based on index row IDs. The cost of accessing the table based on index row IDs is higher than the cost of scanning data in a standalone database. In PolarDB-X, clustered indexes can be created to eliminate the need to access the table based on index row IDs. This ensures that the performance of queries is the same as the queries that are performed based on the primary key. Notes

The local secondary index and the global secondary index are used to provide high query performance at the expense of additional storage and write overhead. We recommend that you evaluate the impacts on write performance before you use the local secondary index and the global secondary index. Data is distributed in a different manner between the index table and the base table. All write operations that involve the global secondary index are encapsulated in distributed transactions by default. This ensures strong consistency between the global secondary index data and the data in the base table. Compared with scenarios in which no global secondary index is created, the response time for write requests increases by two to three times. Data is written to the index table and the base table in parallel. As a result, the probability of generating distributed deadlocks in highly concurrent write scenarios increases. We recommend that you create up to three global secondary indexes on each logical table.

Identify point queries

Point queries in distributed databases are queries in which a small number of shards and data are scanned. You can check the number of shards that are scanned for a query statement by viewing the execution plan of the statement. For more information about the execution plan, see Operators. The following code block provides an example of a point query:

> explain select c_custkey, c_name, c_address from customer where c_custkey = 42;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                            |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LogicalView(tables="TEST1_000002_GROUP.customer_IVgG_10", sql="SELECT `c_custkey`, `c_name`, `c_address` FROM `customer` AS `customer` WHERE (`c_custkey` = ?)") |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+

EXPLAIN EXECUTE is used to summarize and display the execution plan on data nodes. This allows you to determine whether the query hits the correct index on the data nodes. The execution plan on the data nodes is implemented based on MySQL. The execution plan is processed in the same manner as the execution plan in MySQL. For more information, see MySQL official documentation. The following code block provides an example of a point query:

> explain execute select c_custkey, c_name, c_address from customer where c_custkey = 42;
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
| id | select_type | table    | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra           |
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
| 1  | SIMPLE      | customer | NULL       | const | PRIMARY       | PRIMARY | 4       | const | 1    | 100      | Using pk access |
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+

Convert queries into point queries for optimization

All queries cannot be converted into point queries for optimization. For example, the following queries cannot be converted into point queries: the select * from t1 query that contains no filtering conditions to retrieve data, the inappropriate paged query select * from t1 where c1 = 1 limit 100000, 10, and IN queries in which the number of parameters increases due to business growth. The following two types of queries can be converted into point queries for optimization:

  1. Queries in which the scan range is fixed, such as queries that contain the following conditions: conditions that include equivalent conditions or can be simplified as equivalent conditions, BETWEEN AND conditions that narrow down the scan range, or IN conditions in which the number of parameters is fixed

  2. TopN queries in which a fixed number of rows are returned, such as select * from t1 where c1 > 42 limit 10 and select * from t1 order by c1 limit 10 and select * from t1 order by c1 limit 10

You can optimize these two types of queries by creating suitable indexes to convert full table scans into index scans. The following code block provides an example:

CREATE TABLE `customer` (
  `c_custkey` int(11) NOT NULL,
  `c_name` varchar(25) NOT NULL,
  `c_address` varchar(40) NOT NULL,
  `c_nationkey` int(11) NOT NULL,
  `c_phone` varchar(15) NOT NULL,
  `c_acctbal` decimal(15,2) NOT NULL,
  `c_mktsegment` varchar(10) NOT NULL,
  `c_comment` varchar(117) NOT NULL,
  PRIMARY KEY (`c_custkey`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 dbpartition by hash(`c_custkey`) tbpartition by hash(`c_custkey`) tbpartitions 4;

> explain select * from customer where c_phone = "11";
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Gather(concurrent=true)                                                                                                                                                                                                                          |
|   LogicalView(tables="[000000-000003].customer_[00-15]", shardCount=16, sql="SELECT `c_custkey`, `c_name`, `c_address`, `c_nationkey`, `c_phone`, `c_acctbal`, `c_mktsegment`, `c_comment` FROM `customer` AS `customer` WHERE (`c_phone` = ?)") |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The customer table has only the primary key index. Although an equivalent condition is specified for c_phone, all shards need to be scanned. You can optimize this query by creating a global secondary index.

> create global index g_i_phone on customer(c_phone) dbpartition by hash(c_phone);

> explain select * from customer where c_phone = "11";
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project(c_custkey="c_custkey", c_name="c_name", c_address="c_address", c_nationkey="c_nationkey", c_phone="c_phone", c_acctbal="c_acctbal", c_mktsegment="c_mktsegment", c_comment="c_comment")                                                                        |
|   BKAJoin(condition="c_custkey = c_custkey", type="inner")                                                                                                                                                                                                             |
|     IndexScan(tables="TEST1_000000_GROUP.g_i_phone_2CSp", sql="SELECT `c_custkey`, `c_phone` FROM `g_i_phone` AS `g_i_phone` WHERE (`c_phone` = ?)")                                                                                                                   |
|     Gather(concurrent=true)                                                                                                                                                                                                                                            |
|       LogicalView(tables="[000000-000003].customer_[00-15]", shardCount=16, sql="SELECT `c_custkey`, `c_name`, `c_address`, `c_nationkey`, `c_acctbal`, `c_mktsegment`, `c_comment` FROM `customer` AS `customer` WHERE ((`c_phone` = ?) AND (`c_custkey` IN (...)))") |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After the global secondary index is created, the query can be performed in two phases: a point query on the index table and table access based on index row IDs. In the second phase, only one shard is accessed. In the execution plan, a full table scan on the base table is performed when the table is accessed based on index row IDs. This is because the base table shards that need to be scanned depend on the result of the query on the index table and cannot be determined in the explain phase.

> drop index g_i_phone on customer;

> create clustered index g_i_phone on customer(c_phone) dbpartition by hash(c_phone);

> explain select * from customer where c_phone = "11";
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| IndexScan(tables="TEST1_000000_GROUP.g_i_phone_fHmZ", sql="SELECT `c_custkey`, `c_name`, `c_address`, `c_nationkey`, `c_phone`, `c_acctbal`, `c_mktsegment`, `c_comment` FROM `g_i_phone` AS `g_i_phone` WHERE (`c_phone` = ?)") |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After you use a clustered index instead of the global secondary index, the table does not need to be accessed based on index row IDs because the index table contains all columns of the base table. Then, in the execution plan, only the point query on the index table is performed.

The preceding examples describe the general procedure for using indexes to improve the performance of point queries. The key point is to find suitable columns on which indexes are created based on query characteristics. For complex queries that contain multiple conditions, you can use the built-in index recommendation feature of PolarDB-X to find the appropriate local secondary index and global secondary index. For more information, see Intelligent index recommendation.

results matching ""

    No results matching ""