PostgreSQL中HashAggregate与GroupAggregate的区别是什么

发布时间：2021-11-09 14:13:36 作者：iii
来源：亿速云阅读：309

本篇内容介绍了“PostgreSQL中HashAggregate与GroupAggregate的区别是什么”的有关知识，在实际案例的操作过程中，不少人都会遇到这样的困境，接下来就让小编带领大家学习一下如何处理这些情况吧！希望大家仔细阅读，能够学有所成！

案例一

首先我们看一个案例:
测试表:

drop table  if exists t_agg;
create table t_agg(bh varchar(20),c1 int,c2 int,c3 int,c4 int,c5 int,c6 int);
insert into t_agg select 'GZ01',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ02',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ03',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ04',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ05',col,col,col,col,col,col from generate_series(1,100000) as col;

执行查询:

testdb=# -- 禁用并行
testdb=# set max_parallel_workers_per_gather=0;
SET
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22427.00..22427.05 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5, c6
(5 rows)

PG的优化器选择了HashAggregate.
下面禁用HashAggregate,优化器只能选择GroupAggregate.可以看到两者的总成本比较:22427.05 vs 82968.97

testdb=# set enable_hashagg = off;
SET
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=67968.92..82968.97 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Sort  (cost=67968.92..69218.92 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5
         Sort Key: t_agg.bh
         ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
               Output: bh, c1, c2, c3, c4, c5
(8 rows)

案例二
下面用一个宽表来进行测试:分组键值很少,但聚合列很多

drop table  if exists t_agg_width;
create table t_agg_width
(bh varchar(20)
,c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int
,c11 int,c12 int,c13 int,c14 int,c15 int,c16 int,c17 int,c18 int,c19 int
,c21 int,c22 int,c23 int,c24 int,c25 int,c26 int,c27 int,c28 int,c29 int
,c31 int,c32 int,c33 int,c34 int,c35 int,c36 int,c37 int,c38 int,c39 int);
insert into t_agg_width 
select 'GZ01'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ02'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ03'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ04'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
-- 禁用hashagg
set enable_hashagg = off;
-- 禁用并行
set max_parallel_workers_per_gather=0;
select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;

在这种情况下,优化器仍会选择Hash

testdb=# explain verbose select bh
testdb-# ,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
testdb-# ,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
testdb-# ,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
testdb-# ,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
testdb-# from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=49889.00..49889.04 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(5 rows)
testdb=# set enable_hashagg = off;
SET
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=110266.28..148266.32 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=110266.28..111266.28 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c2
4, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

下面增大分组键值的分布,同时提高c1等列的选择率,再次测试:

testdb=# insert into t_agg_width 
testdb-# select 'GZ'||col
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# from generate_series(1,1000000) as col;
INSERT 0 1000000
testdb=# set enable_hashagg = on;
SET
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=440012.46..586553.52 rows=7414 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=440012.46..443866.86 rows=1541757 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..49681.57 rows=1541757 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c2
4, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

这一次选择的是GroupAggregate.

HashAggregate
HashAggregate,数据库会根据group by字段后面的值算出hash值,并在内存中维护对应的Hash表,比如select有n个聚合函数,那么在内存中就会维护n个Hash表.这种方式使用的内存比GroupAggregate要大,内存的使用与group by COLUMN中的COLUMN的唯一键值以及聚合列的多少成正比.

GroupAggregate
GroupAggregate,数据库先将表中的数据按group by的字段进行排序,然后对排好序的数据进行一次扫描,计算得到聚合的结果.这种方式需要先执行一次排序,计算复杂度上面要比HashAggregate要高,但这种方法的好处是与group by COLUMN中的COLUMN的唯一键值多寡/聚合列多寡无关,分组键值很多而且聚合列很多且列数据选择很高的情况下,会优于HashAggregate.

“PostgreSQL中HashAggregate与GroupAggregate的区别是什么”的内容就介绍到这里了，感谢大家的阅读。如果想了解更多行业相关的知识可以关注亿速云网站，小编将为大家输出更多高质量的实用文章！

PostgreSQL中HashAggregate与GroupAggregate的区别是什么

相关阅读