MYSQL 中的GROUP BY 的方式 (1)(loose index scan松散扫描 tight index scan紧凑扫描)

发布时间：2020-08-17 18:47:11 作者：gaopengtttt
来源：ITPUB博客阅读：263

水平有限有误请指出，转载请说明出处

测试脚本：
create table tgrploose(p_id int primary key auto_increment,s_id1 int,s_id2 int,s_id3 int, key(s_id1,s_id2,s_id3));
create table tgrpnloose(p_id int primary key auto_increment,s_id1 int,s_id2 int,s_id3 int, key(s_id1,s_id2,s_id3));

delimiter //
create procedure inloose1()
begin
declare i int;
set i=0;
while i<20000 do
insert into tgrploose(s_id1,s_id2,s_id3) values(FLOOR((RAND()*2)),FLOOR((RAND()*3)),FLOOR((RAND()*4)) );
set i=i+1;
end while;
end;
//
delimiter ;

delimiter //
create procedure innloose()
begin
declare i int;
set i=0;
while i<20000 do
insert into tgrpnloose(s_id1,s_id2,s_id3) values(FLOOR((RAND()*10000)),FLOOR((RAND()*10000)),FLOOR((RAND()*10000)) );
set i=i+1;
end while;
end;
//
delimiter ;

call inloose();
call innloose();

一、MYSQL 中可能的GROUP BY 方式
1、loose index scan(松散索引扫描) 执行计划必然出现Using index for group-by
2、tight index scan(紧凑索引扫描) 执行计划必然出现Using index但是不涉及Using temporary; Using filesort
3、常规方式扫描   执行计划涉及到Using temporary; Using filesort

二、各种方式说明
1、loose index scan(松散索引扫描) 执行计划必然出现Using index for group-by
实际上这种扫描方式源于B+树索引结构的，我们回顾一下索引叶子结点的结构
首先考虑语句
while i<20000 do
insert into tgrploose(s_id1,s_id2,s_id3) values(FLOOR((RAND()*2)),FLOOR((RAND()*3)),FLOOR((RAND()*4)) );
明显这里我是向tgrploose插入大约20000条数据，同时s_id1的值只有2个不同值，s_id2的值只有3个不同的值，s_id3的值只有
4个不同的值，这样做是为了将loose(松散)的影响尽量放大，我们知道在key(s_id1,s_id2,s_id3)中索引的排列为先按照
s_id1的顺序排列如果相同按照s_id2的顺序排列如果相同按照s_id3排列，如果都相同按照该primary id排列就是我们这里的
p_id这个我早就证明过了，其实这也是B+树索引的特性，验证参考如下文章:
(http://blog.itpub.net/7728585/viewspace-2128817/)
实际上我们可以通过语句验证索引的结构

mysql> explain select s_id1,s_id2,s_id3,min(p_id),max(p_id),count(*) from tgrploose group by s_id1,s_id2,s_id3;
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | tgrploose | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using index |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.01 sec)

可以看到这里没有filesort没有排序，并且使用TYPE=INDEX的方式进行访问索引S_ID1，这种方式实际上就是group by中的紧凑扫描方式
mysql> select s_id1,s_id2,s_id3,min(p_id),max(p_id),count(*) from tgrploose group by s_id1,s_id2,s_id3 ;
+-------+-------+-------+-----------+-----------+----------+
| s_id1 | s_id2 | s_id3 | min(p_id) | max(p_id) | count(*) |
+-------+-------+-------+-----------+-----------+----------+
| 0 | 0 | 0 | 20 | 19991 | 882 |
| 0 | 0 | 1 | 10 | 19950 | 835 |
| 0 | 0 | 2 | 3 | 19979 | 830 |
| 0 | 0 | 3 | 1 | 19944 | 853 |
| 0 | 1 | 0 | 22 | 19999 | 762 |
| 0 | 1 | 1 | 26 | 19987 | 786 |
| 0 | 1 | 2 | 21 | 19968 | 867 |
| 0 | 1 | 3 | 19 | 19997 | 908 |
| 0 | 2 | 0 | 7 | 19916 | 848 |
| 0 | 2 | 1 | 14 | 19971 | 906 |
| 0 | 2 | 2 | 4 | 19988 | 870 |
| 0 | 2 | 3 | 80 | 19906 | 762 |
| 1 | 0 | 0 | 49 | 19990 | 779 |
| 1 | 0 | 1 | 38 | 19976 | 886 |
| 1 | 0 | 2 | 2 | 19981 | 857 |
| 1 | 0 | 3 | 37 | 19998 | 830 |
| 1 | 1 | 0 | 65 | 19993 | 839 |
| 1 | 1 | 1 | 5 | 19984 | 822 |
| 1 | 1 | 2 | 8 | 19996 | 808 |
| 1 | 1 | 3 | 6 | 19927 | 792 |
| 1 | 2 | 0 | 91 | 19992 | 797 |
| 1 | 2 | 1 | 24 | 20000 | 839 |
| 1 | 2 | 2 | 9 | 19965 | 779 |
| 1 | 2 | 3 | 12 | 19977 | 863 |
+-------+-------+-------+-----------+-----------+----------+
实际也是验证了我们说法s_id1的顺序排列如果相同按照s_id2的顺序排列如果相同按照s_id3排列，如果都相同按照该primary id排列。
那么考虑如下的group by查询
select s_id1,s_id2,s_id3,min(p_id) from tgrploose group by s_id1,s_id2,s_id3 ;

mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrploose group by s_id1,s_id2,s_id3 ;
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | tgrploose | NULL | range | s_id1 | s_id1 | 15 | NULL | 25 | 100.00 | Using index for group-by |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

因为s_id1,s_id2,s_id3在本列子中一共才24个不同的值，相当的稀疏，那就可以根据索引结构访问到叶子结点找到最小的那个p_id值即可，
其他的p_id就不用看了，然后跳到下一个s_id1,s_id2,s_id3组合，这就是稀疏扫描的优势，如果非要给稀疏下一个定义那么就是group by
后字段组合相对于表总行数的比率，这里表为20000行，s_id1,s_id2,s_id3为24个值，那么比率为24/20000,如果这个值越大则越稠密，
如果越小越则稀疏(这让我想到稀疏矩阵)。
其次我们要考虑一下loose稀疏索引扫描的性能问题（这部分为自己理解的，没有参考资料）一般的情况下我们使用type=INDEX这样的方式完成
group by，这种方式下访问是比较顺序的并且访问叶子结点即可，而稀疏索引扫描不得不多次使用根结点分支结点来定位，每次跳过的距离，这
个可能随机访问，并且多次访问跟节点和分支节点也是需要开销的，索引他们之间就存在一个性能的综合考虑，到底使用稀疏索引扫描还是紧凑
索引扫描其根源在于上面说的那个稀疏的比例问题。考虑前面给出的tgrpnloose这个表我插入数据时候将s_id1,s_id2,s_id3不同值都设置为10000
那么这个时候就非常稠密了，比较执行计划如下：
mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrpnloose group by s_id1,s_id2,s_id3;
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | tgrpnloose | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrploose group by s_id1,s_id2,s_id3;
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | tgrploose | NULL | range | s_id1 | s_id1 | 15 | NULL | 25 | 100.00 | Using index for group-by |
+----+-------------+-----------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)

可以看到MYSQL做出了选择，对tgrpnloose使用了紧凑索引扫描tight index scan(后面描述)，当出现Using index for group-by为使用稀疏索引扫描完成group
by。一些不能用到稀疏索引扫描来完成group by 的限制如下：
1、性能考虑(参考如上tgrpnloose的列子)
2、对单一数据表进行group by(参考tight index scan部分例子)
3、不能使用前缀索引(prefix index)
4、仅仅可以使用max(),min()聚合函数，其他聚合函数如count(),sum()不支持
如：
mysql> explain select a.s_id1,a.s_id2,count(*) from tgrploose a where a.s_id1>30 group by a.s_id1,a.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
| 1 | SIMPLE | a | NULL | range | s_id1 | s_id1 | 5 | NULL | 1 | 100.00 | Using where; Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+------+----------+--------------------------+
1 row in set, 1 warning (0.00 sec)
5、group by中必须满足最左侧列一致，如果不一致弃用松散扫描方式
如：
mysql> explain select a.s_id2,a.s_id3 from tgrploose a group by a.s_id2,a.s_id3;
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | a | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using index; Using temporary; Using filesort |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+----------------------------------------------+
1 row in set, 1 warning (0.00 sec)

注意一下这个查询，虽然使用了索引扫描，同时使用了Using temporary; Using filesort，也就是使用了临时表来存储a.s_id2,a.s_id3和值然后做了排序操作。
但这个不是紧凑索引扫描的方式，因为使用了临时表和排序。

在官方文档中如下查询都能使用到稀疏索引扫描(key(c1,c2,c3) table(c1,c2,c3,c4))
SELECT c1, c2 FROM t1 GROUP BY c1, c2;
SELECT DISTINCT c1, c2 FROM t1;
SELECT c1, MIN(c2) FROM t1 GROUP BY c1;
SELECT c1, c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT MAX(c3), MIN(c3), c1, c2 FROM t1 WHERE c2 > const GROUP BY c1, c2;
SELECT c2 FROM t1 WHERE c1 < const GROUP BY c1, c2;
SELECT c1, c2 FROM t1 WHERE c3 = const GROUP BY c1, c2;
SELECT COUNT(DISTINCT c1), SUM(DISTINCT c1) FROM t1;
SELECT COUNT(DISTINCT c1, c2), COUNT(DISTINCT c2, c1) FROM t1;
其实根据索引的结构和稀疏扫描的原理稍加考虑可以明白为什么能够使用到稀疏索引扫描

2、tight index scan(紧凑索引扫描) 执行计划必然出现Using index但是不涉及Using temporary; Using filesort
描述为仅对驱动表(注意是驱动表)中数据进行分组的时候，如果group by按照索引的顺序给出，如果有缺失需要使用column=constant的情况
比如：
按照顺序给出
mysql> explain select b.s_id1,b.s_id2,max(a.s_id3) from tgrpnloose b STRAIGHT_JOIN tgrploose a on a.s_id1=b.s_id1 group by b.s_id1,b.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+--------------------------+
| 1 | SIMPLE | b | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using where; Using index |
| 1 | SIMPLE | a | NULL | ref | s_id1 | s_id1 | 5 | test.b.s_id1 | 9991 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.01 sec)

顺序缺失但是使用s_id1=10
mysql> explain select b.s_id2,max(a.s_id3) from tgrpnloose b STRAIGHT_JOIN tgrploose a on a.s_id1=b.s_id1 where b.s_id1=10 group by b.s_id2;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
| 1 | SIMPLE | b | NULL | ref | s_id1 | s_id1 | 5 | const | 2 | 100.00 | Using where; Using index |
| 1 | SIMPLE | a | NULL | ref | s_id1 | s_id1 | 5 | const | 1 | 100.00 | Using index |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
2 rows in set, 1 warning (0.02 sec)

我们能够清楚的看到这种情况用不到索引稀疏扫描因为不是单表查询，但是在 Extra 都没有出现Using temporary; Using filesort，因为使用了tight
index scan(紧凑索引扫描)，注意type=ref 是b.s_id1=10因为索引不是唯一索引。
再看下面的例子：

mysql> explain select b.s_id2,max(a.s_id3) from tgrpnloose b STRAIGHT_JOIN tgrploose a on a.s_id1=b.s_id1 group by b.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | b | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | NULL | ref | s_id1 | s_id1 | 5 | test.b.s_id1 | 9991 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

明显group by b.s_id2不满足按照索引顺序进行group by也就是不满足最左原则，同时没有s_id1=10这样常量，使用了
Using index; Using temporary; Using filesort用到了临时表和filesort排序。
还要明确一点这里是驱动表一定要注意，如果不加STRAIGHT_JOIN MYSQL给出如下的执行计划
mysql> explain select b.s_id1,b.s_id2,max(a.s_id3) from tgrpnloose b join tgrploose a on a.s_id1=b.s_id1 group by b.s_id1,b.s_id2;
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | a | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | b | NULL | ref | s_id1 | s_id1 | 5 | test.a.s_id1 | 2 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+--------------+-------+----------+-----------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
和第一个列子的区别就是a变为了驱动表，b变为了被驱动表，使用了 Using index; Using temporary; Using filesort

另外tight index scan(紧凑索引扫描)当然也适用于单表的情况，如：
-- 不满足最左原则，弃用loose index scan
mysql> explain select b.s_id2 from tgrploose b where b.s_id1=10 group by b.s_id2;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
| 1 | SIMPLE | b | NULL | ref | s_id1 | s_id1 | 5 | const | 1 | 100.00 | Using where; Using index |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+--------------------------+
1 row in set, 1 warning (0.01 sec)
-- 聚合函数count，弃用loose index scan
mysql> explain select b.s_id1,count(*) from tgrploose b group by b.s_id1;
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | b | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
-- 性能考虑，弃用loose index scan
mysql> explain select s_id1,s_id2,s_id3,min(p_id) from tgrpnloose group by s_id1,s_id2,s_id3;
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
| 1 | SIMPLE | tgrpnloose | NULL | index | s_id1 | s_id1 | 15 | NULL | 19982 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------+-------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

简而言之，如果不能使用loose index scan会优先考虑tight index scan，来避免可能的使用临时表和排序操作

3、常规方式扫描,执行计划涉及到Using temporary; Using filesort
简单的说在没有办法使用索引规避排序的情况下需要使用这种方式进行group by，需要使用这种常规的方式，先将group by的字段放到临时表
然后进行排序去重操作。
上面已经给出了一列子这里再看一个
-- //不满足最左原则，弃用loose index scan,不满足缺失部分column=constant弃用tight index scan
mysql> explain select b.s_id3,count(*) from tgrploose b where b.s_id1=10 group by b.s_id3;
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | b | NULL | ref | s_id1 | s_id1 | 5 | const | 1 | 100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------+------------+------+---------------+-------+---------+-------+------+----------+-----------------------------------------------------------+

改为
mysql> explain select b.s_id3,count(*) from tgrploose b where b.s_id1=10 and b.s_id2=10 group by b.s_id3;
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+--------------------------+
| 1 | SIMPLE | b | NULL | ref | s_id1 | s_id1 | 10 | const,const | 1 | 100.00 | Using where; Using index |
+----+-------------+-------+------------+------+---------------+-------+---------+-------------+------+----------+--------------------------+
可以使用tight index scan(紧凑索引扫描)，这是前面说了的原则

三、oracle中的index skip scan
ORACLE中也有差不多的用法，但是范围更广不限于仅仅是group by，但是一般情况考虑为没有正确的为谓词建立前缀索引，效率不是最优，而是
比全表好一些，它仅仅适用于前导列distinct值很少，非前导列选择性比较高的情况，其实也就是前导列是稀疏的，而非前导列是稠密的。其性能的
问题应该也在于不断的使用根结点和分支节点定位和可能的随机读上，这里就不给出例子了。
四、总结
可以看到MYSQL三种group by 方式效率越来越低及(优化器选择正确的情况下)：
loose index scan(松散索引扫描)>tight index scan(紧凑索引扫描)>常规方式扫描
但是适用范围越来越广。

关于在trace中发现了代价的计算，这也是导致loose index scan(松散索引扫描)和tight index scan(紧凑索引扫描)切换的
根据，随后会在根据trace看看源码的判断。

tight index scan：   loose index scan：
T@2: | | | | | | | | | | opt: distinct_aggregate: 0 T@2: | | | | | | | | | | opt: distinct_aggregate: 0
T@2: | | | | | | | | | | opt: rows: 19983 T@2: | | | | | | | | | | opt: rows: 7
T@2: | | | | | | | | | | opt: cost: 8040.2   T@2: | | | | | | | | | | opt: cost: 9.8

这里也给出一个12条数据key(s_id1,s_id2) primary key(p_id)的索引的排列方式，方便大家理解
明显先按照s_id1排序，s_id1相同按照s_id2排序，s_id2相同按照主键p_id排序

s_id1	0	0	0	0	0	0	1	1	1	1	1	1
s_id2	0	0	1	1	2	2	0	0	1	1	2	2
p_id	5	12	1	6	3	9	2	8	4	10	0	11

MYSQL 中的GROUP BY 的方式 (1)(loose index scan松散扫描 tight index scan紧凑扫描)

相关阅读