如何理解MYSQL-GroupCommit 和 2pc提交

发布时间：2021-11-16 11:34:56 作者：柒染
来源：亿速云阅读：363

这篇文章将为大家详细讲解有关如何理解MYSQL-GroupCommit 和 2pc提交，文章内容质量较高，因此小编分享给大家做个参考，希望大家阅读完这篇文章后对相关知识有一定的了解。

组提交(group commit)是MYSQL处理日志的一种优化方式，主要为了解决写日志时频繁刷磁盘的问题。组提交伴随着MYSQL的发展不断优化，从最初只支持redo log 组提交，到目前5.6官方版本同时支持redo log 和binlog组提交。组提交的实现大大提高了mysql的事务处理性能，将以innodb 存储引擎为例，详细介绍组提交在各个阶段的实现原理。

redo log的组提交

WAL(Write-Ahead-Logging)是实现事务持久性的一个常用技术，基本原理是在提交事务时，为了避免磁盘页面的随机写，只需要保证事务的redo log写入磁盘即可，这样可以通过redo log的顺序写代替页面的随机写，并且可以保证事务的持久性，提高了数据库系统的性能。虽然WAL使用顺序写替代了随机写，但是，每次事务提交，仍然需要有一次日志刷盘动作，受限于磁盘IO，这个操作仍然是事务并发的瓶颈。

组提交思想是，将多个事务redo log的刷盘动作合并，减少磁盘顺序写。Innodb的日志系统里面，每条redo log都有一个LSN(Log Sequence Number)，LSN是单调递增的。每个事务执行更新操作都会包含一条或多条redo log，各个事务将日志拷贝到log_sys_buffer时(log_sys_buffer 通过log_mutex

保护)，都会获取当前最大的LSN，因此可以保证不同事务的LSN不会重复。那么假设三个事务Trx1,Trx2和Trx3的日志的最大LSN分别为LSN1,LSN2,LSN3(LSN1<lsn2<lsn3)，它们同时进行提交，那么如果trx3日志先获取到log_mutex进行落盘，它就可以顺便把[lsn1---lsn3]这段日志也刷了，这样trx1和trx2就不用再次请求磁盘io。组提交的基本流程如下： </lsn2<lsn3)，它们同时进行提交，那么如果trx3日志先获取到log_mutex进行落盘，它就可以顺便把[lsn1---lsn3]这段日志也刷了，这样trx1和trx2就不用再次请求磁盘io。组提交的基本流程如下：<>

获取 log_mutex
若flushed_to_disk_lsn>=lsn，表示日志已经被刷盘,跳转5
若 current_flush_lsn>=lsn，表示日志正在刷盘中，跳转5后进入等待状态
将小于LSN的日志刷盘(flush and sync)
退出log_mutex

备注：lsn表示事务的lsn，flushed_to_disk_lsn和current_flush_lsn分别表示已刷盘的LSN和正在刷盘的LSN。

redo log 组提交优化

我们知道，在开启binlog的情况下，prepare阶段，会对redo log进行一次刷盘操作(innodb_flush_log_at_trx_commit=1)，确保对data页和undo 页的更新已经刷新到磁盘；commit阶段，会进行刷binlog操作(sync_binlog=1),并且会对事务的undo log从prepare状态设置为提交状态(可清理状态)。通过两阶段提交方式(innodb_support_xa=1)，可以保证事务的binlog和redo log顺序一致。二阶段提交过程中，mysql_binlog作为协调者，各个存储引擎和mysql_binlog作为参与者。故障恢复时，扫描最后一个binlog文件(进行rotate binlog文件时，确保老的binlog文件对应的事务已经提交)，提取其中的xid；重做检查点以后的redo日志，读取事务的undo段信息，搜集处于prepare阶段的事务链表，将事务的xid与binlog中的xid对比，若存在，则提交，否则就回滚。

通过上述的描述可知，每个事务提交时，都会触发一次redo flush动作，由于磁盘读写比较慢，因此很影响系统的吞吐量。淘宝童鞋做了一个优化，将prepare阶段的刷redo动作移到了commit(flush-sync-commit)的flush阶段之前，保证刷binlog之前，一定会刷redo。这样就不会违背原有的故障恢复逻辑。移到commit阶段的好处是，可以不用每个事务都刷盘，而是leader线程帮助刷一批redo。如何实现，很简单，因为log_sys->lsn始终保持了当前最大的lsn，只要我们刷redo刷到当前的log_sys->lsn，就一定能保证，将要刷binlog的事务redo日志一定已经落盘。通过延迟写redo方式，实现了redo log组提交的目的，而且减少了log_sys->mutex的竞争。目前这种策略已经被官方mysql5.7.6引入。

两阶段提交

在单机情况下，redo log组提交很好地解决了日志落盘问题，那么开启binlog后，binlog能否和redo log一样也开启组提交？首先开启binlog后，我们要解决的一个问题是，如何保证binlog和redo log的一致性。因为binlog是Master-Slave的桥梁，如果顺序不一致，意味着Master-Slave可能不一致。MYSQL通过两阶段提交很好地解决了这一问题。Prepare阶段，innodb刷redo log，并将回滚段设置为Prepared状态，binlog不作任何操作；commit阶段，innodb释放锁，释放回滚段，设置提交状态，binlog刷binlog日志。出现异常，需要故障恢复时，若发现事务处于Prepare阶段，并且binlog存在则提交，否则回滚。通过两阶段提交，保证了redo log和binlog在任何情况下的一致性。

binlog的组提交

回到上节的问题，开启binlog后，如何在保证redo log-binlog一致的基础上，实现组提交。因为这个问题，5.6以前，mysql在开启binlog的情况下，无法实现组提交，通过一个臭名昭著的prepare_commit_mutex，将redo log和binlog刷盘串行化，串行化的目的也仅仅是为了保证redo log-Binlog一致，但这种实现方式牺牲了性能。这个情况显然是不能容忍的，因此各个mysql分支，mariadb，facebook，perconal等相继出了补丁改进这一问题，mysql官方版本5.6也终于解决了这一问题。由于各个分支版本解决方法类似，我主要通过分析5.6的实现来说明实现方法。

binlog组提交的基本思想是，引入队列机制保证innodb commit顺序与binlog落盘顺序一致，并将事务分组，组内的binlog刷盘动作交给一个事务进行，实现组提交目的。binlog提交将提交分为了3个阶段，FLUSH阶段，SYNC阶段和COMMIT阶段。每个阶段都有一个队列，每个队列有一个mutex保护，约定进入队列第一个线程为leader，其他线程为follower，所有事情交由leader去做，leader做完所有动作后，通知follower刷盘结束。binlog组提交基本流程如下：

FLUSH 阶段

1) 持有Lock_log mutex [leader持有，follower等待]

2) 获取队列中的一组binlog(队列中的所有事务)

3) 将binlog buffer到I/O cache

4) 通知dump线程dump binlog

SYNC阶段

1) 释放Lock_log mutex，持有Lock_sync mutex[leader持有，follower等待]

2) 将一组binlog 落盘(sync动作，最耗时，假设sync_binlog为1)

COMMIT阶段

1) 释放Lock_sync mutex，持有Lock_commit mutex[leader持有，follower等待]

2) 遍历队列中的事务，逐一进行innodb commit

3) 释放Lock_commit mutex

4) 唤醒队列中等待的线程

说明：由于有多个队列，每个队列各自有mutex保护，队列之间是顺序的，约定进入队列的一个线程为leader，因此FLUSH阶段的leader可能是SYNC阶段的follower，但是follower永远是follower。

通过上文分析，我们知道MYSQL目前的组提交方式解决了一致性和性能的问题。通过二阶段提交解决一致性，通过redo log和binlog的组提交解决磁盘IO的性能。下面我整理了Prepare阶段和Commit阶段的框架图供各位参考。

如何理解MYSQL-GroupCommit 和 2pc提交

参考文档

http://mysqlmusings.blogspot.com/2012/06/binary-log-group-commit-in-mysql-56.html

http://www.lupaworld.com/portal.php?mod=view&aid=250169&page=all

http://www.oschina.net/question/12_89981

http://kristiannielsen.livejournal.com/12254.html

http://blog.chinaunix.net/uid-26896862-id-3432594.html

http://www.csdn.net/article/2015-01-16/2823591

MySQL的事务提交逻辑主要在函数ha_commit_trans中完成。事务的提交涉及到binlog及具体的存储的引擎的事务提交。所以MySQL用2PC来保证的事务的完整性。MySQL的2PC过程如下：

如何理解MYSQL-GroupCommit 和 2pc提交

T@4 : | | | | >trans_commit
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:  T@4 : | | | | | debug: stmt.unsafe_rollback_flags: 
T@4 : | | | | | debug: all.unsafe_rollback_flags: 
T@4 : | | | | | >trans_check
T@4 : | | | | | <trans_check 49 T@4 : | | | | | info: clearing SERVER_STATUS_IN_TRANS
T@4    : | | | | | >ha_commit_trans T@4 : | | | | | | info: all=1 thd->in_sub_stmt=0 ha_info=0x0 is_real_trans=1 T@4 : | | | | | | >MYSQL_BIN_LOG::commit T@4 : | | | | | | | enter: thd: 0x2b9f4c07beb0, all: yes, xid: 0, cache_mngr: 0x0 T@4 : | | | | | | | >ha_commit_low
T@4 : | | | | | | | | >THD::st_transaction::cleanup
T@4 : | | | | | | | | | >free_root
T@4 : | | | | | | | | | | enter: root: 0x2b9f4c07d660 flags: 1 T@4 : | | | | | | | | | <free_root 396 T@4 : | | | | | | | | <thd::st_transaction::cleanup 2521 T@4 : | | | | | | | <ha_commit_low 1535 T@4 : | | | | | | <mysql_bin_log::commit 6383 T@4 : | | | | | | >THD::st_transaction::cleanup
T@4 : | | | | | | | >free_root
T@4 : | | | | | | | | enter: root: 0x2b9f4c07d660 flags: 1 T@4 : | | | | | | | <free_root 396 T@4 : | | | | | | <thd::st_transaction::cleanup 2521 T@4 : | | | | | <ha_commit_trans 1458 T@4 : | | | | | debug: reset_unsafe_rollback_flags
T@4 : | | | | <trans_commit 233</ha_commit_trans<> T@4 : | | | | >MDL_context::release_transactional_locks
T@4 : | | | | | >MDL_context::release_locks_stored_before
T@4 : | | | | | <mdl_context::release_locks_stored_before 2771 T@4 : | | | | | >MDL_context::release_locks_stored_before
T@4 : | | | | | <mdl_context::release_locks_stored_before 2771 T@4 : | | | | <mdl_context::release_transactional_locks 2926 T@4 : | | | | >set_ok_status
T@4 : | | | | <set_ok_status 446 T@4 : | | | | THD::enter_stage: /usr/src/mysql-5.6.28/sql/sql_parse.cc:4996 T@4 : | | | | >PROFILING::status_change
T@4 : | | | | <profiling::status_change 354 T@4 : | | | | >trans_commit_stmt
T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:  T@4 : | | | | | enter: stmt.ha_list: , all.ha_list:  T@4 : | | | | | debug: stmt.unsafe_rollback_flags: 
T@4 : | | | | | debug: all.unsafe_rollback_flags: 
T@4 : | | | | | debug: add_unsafe_rollback_flags: 0 T@4 : | | | | | >MYSQL_BIN_LOG::commit

如何理解MYSQL-GroupCommit 和 2pc提交

(1)先调用binglog_hton和innobase_hton的prepare方法完成第一阶段，binlog_hton的papare方法实际上什么也没做，innodb的prepare将事务状态设为TRX_PREPARED，并将redo log刷磁盘 (innobase_xa_prepare à trx_prepare_for_mysql à trx_prepare_off_kernel)。

(2)如果事务涉及的所有存储引擎的prepare都执行成功，则调用TC_LOG_BINLOG::log_xid将SQL语句写到binlog，此时，事务已经铁定要提交了。否则，调用ha_rollback_trans回滚事务，而SQL语句实际上也不会写到binlog。

(3)最后，调用引擎的commit完成事务的提交。实际上binlog_hton->commit什么也不会做(因为(2)已经将binlog写入磁盘)，innobase_hton->commit则清除undo信息，刷redo日志，将事务设为TRX_NOT_STARTED状态(innobase_commit à innobase_commit_low à trx_commit_for_mysql à trx_commit_off_kernel)。

//ha_innodb.cc

static

int

innobase_commit(

/*============*/

/* out: 0 */

THD* thd, /* in: MySQL thread handle of the user for whom

the transaction should be committed */

bool all) /* in: TRUE - commit transaction

FALSE - the current SQL statement ended */

{

...

trx->mysql_log_file_name = mysql_bin_log.get_log_fname();

trx->mysql_log_offset =

(ib_longlong)mysql_bin_log.get_log_file()->pos_in_file;

...

}

函数innobase_commit提交事务，先得到当前的binlog的位置，然后再写入事务系统PAGE(trx_commit_off_kernel à trx_sys_update_mysql_binlog_offset)。

InnoDB将MySQL binlog的位置记录到trx system header中：

//trx0sys.h

/* The offset of the MySQL binlog offset info in the trx system header */

#define TRX_SYS_MYSQL_LOG_INFO (UNIV_PAGE_SIZE - 1000)

#define TRX_SYS_MYSQL_LOG_MAGIC_N_FLD 0 /* magic number which shows

if we have valid data in the

MySQL binlog info; the value

is ..._MAGIC_N if yes */

#define TRX_SYS_MYSQL_LOG_OFFSET_HIGH 4 /* high 4 bytes of the offset

within that file */

#define TRX_SYS_MYSQL_LOG_OFFSET_LOW 8 /* low 4 bytes of the offset

within that file */

#define TRX_SYS_MYSQL_LOG_NAME 12 /* MySQL log file name */

5.3.2 事务恢复流程

Innodb在恢复的时候，不同状态的事务，会进行不同的处理(见trx_rollback_or_clean_all_without_sess函数)：

<1>对于TRX_COMMITTED_IN_MEMORY的事务，清除回滚段，然后将事务设为TRX_NOT_STARTED；

<2>对于TRX_NOT_STARTED的事务，表示事务已经提交，跳过；

<3>对于TRX_PREPARED的事务，要根据binlog来决定事务的命运，暂时跳过;

<4>对于TRX_ACTIVE的事务，回滚。

MySQL在打开binlog时，会检查binlog的状态(TC_LOG_BINLOG::open)。如果binlog没有正常关闭(LOG_EVENT_BINLOG_IN_USE_F为1)，则进行恢复操作，基本流程如下：

如何理解MYSQL-GroupCommit 和 2pc提交

<1>扫描binlog，读取XID_EVENT事务，得到所有已经提交的XA事务列表(实际上事务在innodb可能处于prepare或者commit)；

<2>对每个XA事务，调用handlerton::recover，检查存储引擎是否存在处于prepare状态的该事务(见innobase_xa_recover)，也就是检查该XA事务在存储引擎中的状态；

<3>如果存在处于prepare状态的该XA事务，则调用handlerton::commit_by_xid提交事务；

<4>否则，调用handlerton::rollback_by_xid回滚XA事务。

5.3.3 几个参数讨论

(1)sync_binlog

Mysql在提交事务时调用MYSQL_LOG::write完成写binlog，并根据sync_binlog决定是否进行刷盘。默认值是0，即不刷盘，从而把控制权让给OS。如果设为1，则每次提交事务，就会进行一次刷盘；这对性能有影响(5.6已经支持binlog group)，所以很多人将其设置为100。

bool MYSQL_LOG::flush_and_sync()

{

int err=0, fd=log_file.file;

safe_mutex_assert_owner(&LOCK_log);

if (flush_io_cache(&log_file))

return 1;

if (++sync_binlog_counter >= sync_binlog_period && sync_binlog_period)

{

sync_binlog_counter= 0;

err=my_sync(fd, MYF(MY_WME));

}

return err;

}

(2) innodb_flush_log_at_trx_commit

该参数控制innodb在提交事务时刷redo log的行为。默认值为1，即每次提交事务，都进行刷盘操作。为了降低对性能的影响，在很多生产环境设置为2，甚至0。

如何理解MYSQL-GroupCommit 和 2pc提交

 trx_flush_log_if_needed_low( /*========================*/ lsn_t    lsn) /*!< in: lsn up to which logs are to be
            flushed. */ { switch (srv_flush_log_at_trx_commit) { case 0: /* Do nothing */ break; case 1: /* Write the log and optionally flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP,
                srv_unix_file_flush_method != SRV_UNIX_NOSYNC); break; case 2: /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE); break; default:
        ut_error;
    }
}

如何理解MYSQL-GroupCommit 和 2pc提交

If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues.

The default value of 1 is required for full ACID compliance. You can achieve better performance by setting the value different from 1, but then you can lose up to one second worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, only an operating system crash or a power outage can erase the last second of transactions.

(3) innodb_support_xa

用于控制innodb是否支持XA事务的2PC，默认是TRUE。如果关闭，则innodb在prepare阶段就什么也不做；这可能会导致binlog的顺序与innodb提交的顺序不一致(比如A事务比B事务先写binlog，但是在innodb内部却可能A事务比B事务后提交)，这会导致在恢复或者slave产生不同的数据。

int

innobase_xa_prepare(

/*================*/

/* out: 0 or error number */

THD* thd, /* in: handle to the MySQL thread of the user

whose XA transaction should be prepared */

bool all) /* in: TRUE - commit transaction

FALSE - the current SQL statement ended */

{

…

if (!thd->variables.innodb_support_xa) {

return(0);

}

如何理解MYSQL-GroupCommit 和 2pc提交

ver  mysql 5.7

bool trans_xa_commit(THD *thd)
{ bool res= TRUE; enum xa_states xa_state= thd->transaction.xid_state.xa_state;
  DBUG_ENTER("trans_xa_commit"); if (!thd->transaction.xid_state.xid.eq(thd->lex->xid))
  { /* xid_state.in_thd is always true beside of xa recovery procedure.
      Note, that there is no race condition here between xid_cache_search
      and xid_cache_delete, since we always delete our own XID
      (thd->lex->xid == thd->transaction.xid_state.xid).
      The only case when thd->lex->xid != thd->transaction.xid_state.xid
      and xid_state->in_thd == 0 is in the function
      xa_cache_insert(XID, xa_states), which is called before starting
      client connections, and thus is always single-threaded. */ XID_STATE *xs= xid_cache_search(thd->lex->xid);
    res= !xs || xs->in_thd; if (res)
      my_error(ER_XAER_NOTA, MYF(0)); else {
      res= xa_trans_rolled_back(xs);
      ha_commit_or_rollback_by_xid(thd, thd->lex->xid, !res);
      xid_cache_delete(xs);
    }
    DBUG_RETURN(res);
  } if (xa_trans_rolled_back(&thd->transaction.xid_state))
  {
    xa_trans_force_rollback(thd);
    res= thd->is_error();
  } else if (xa_state == XA_IDLE && thd->lex->xa_opt == XA_ONE_PHASE)
  { int r= ha_commit_trans(thd, TRUE); if ((res= MY_TEST(r)))
      my_error(r == 1 ? ER_XA_RBROLLBACK : ER_XAER_RMERR, MYF(0));
  } else if (xa_state == XA_PREPARED && thd->lex->xa_opt == XA_NONE)
  {
    MDL_request mdl_request; /* Acquire metadata lock which will ensure that COMMIT is blocked
      by active FLUSH TABLES WITH READ LOCK (and vice versa COMMIT in
      progress blocks FTWRL).

      We allow FLUSHer to COMMIT; we assume FLUSHer knows what it does. */ mdl_request.init(MDL_key::COMMIT, "", "", MDL_INTENTION_EXCLUSIVE,
                     MDL_TRANSACTION); if (thd->mdl_context.acquire_lock(&mdl_request,
                                      thd->variables.lock_wait_timeout))
    {
      ha_rollback_trans(thd, TRUE);
      my_error(ER_XAER_RMERR, MYF(0));
    } else {
      DEBUG_SYNC(thd, "trans_xa_commit_after_acquire_commit_lock"); if (tc_log)
        res= MY_TEST(tc_log->commit(thd, /* all */ true)); else res= MY_TEST(ha_commit_low(thd, /* all */ true)); if (res)
        my_error(ER_XAER_RMERR, MYF(0));
    }
  } else {
    my_error(ER_XAER_RMFAIL, MYF(0), xa_state_names[xa_state]);
    DBUG_RETURN(TRUE);
  }

  thd->variables.option_bits&= ~OPTION_BEGIN;
  thd->transaction.all.reset_unsafe_rollback_flags();
  thd->server_status&=
    ~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
  DBUG_PRINT("info", ("clearing SERVER_STATUS_IN_TRANS"));
  xid_cache_delete(&thd->transaction.xid_state);
  thd->transaction.xid_state.xa_state= XA_NOTR;

  DBUG_RETURN(res);
}

如何理解MYSQL-GroupCommit 和 2pc提交

5.3.4 安全性/性能讨论

上面3个参数不同的值会带来不同的效果。三者都设置为1(TRUE)，数据才能真正安全。sync_binlog非1，可能导致binlog丢失(OS挂掉)，从而与innodb层面的数据不一致。innodb_flush_log_at_trx_commit非1，可能会导致innodb层面的数据丢失(OS挂掉)，从而与binlog不一致。

关于性能分析，可以参考

http://www.mysqlperformanceblog.com/2011/03/02/what-is-innodb_support_xa/

http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/

如何理解MYSQL-GroupCommit 和 2pc提交

在事务提交时innobase会调用ha_innodb.cc 中的innobase_commit，而innobase_commit通过调用trx_commit_complete_for_mysql（trx0trx.c)来调用log_write_up_to（log0log.c),也就是当innobase提交事务的时候就会调用log_write_up_to来写redo log
innobase_commit中 if (all              # 如果是事务提交 || (!thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) {
通过下面的代码实现事务的commit串行化 if (innobase_commit_concurrency > 0) {
                        pthread_mutex_lock(&commit_cond_m);
                        commit_threads++; if (commit_threads > innobase_commit_concurrency) {
                                commit_threads--;
                                pthread_cond_wait(&commit_cond, &commit_cond_m);
                                pthread_mutex_unlock(&commit_cond_m); goto retry;
                        } else {
                                pthread_mutex_unlock(&commit_cond_m);
                        }
                }

                trx->flush_log_later = TRUE;   # 在做提交操作时禁止flush binlog 到磁盘
                innobase_commit_low(trx);
                trx->flush_log_later = FALSE;
先略过innobase_commit_low调用 ,下面开始调用trx_commit_complete_for_mysql做write日志操作
trx_commit_complete_for_mysql(trx);  #开始flush  log
trx->active_trans = 0;
在trx_commit_complete_for_mysql中，主要做的是对系统参数srv_flush_log_at_trx_commit值做判断来调用
log_write_up_to，或者write redo log file或者write&&flush to disk if (!trx->must_flush_log_later) { /* Do nothing */ } else if (srv_flush_log_at_trx_commit == 0) {  #flush_log_at_trx_commit=0，事务提交不写redo log /* Do nothing */ } else if (srv_flush_log_at_trx_commit == 1) { #flush_log_at_trx_commit=1,事务提交写log并flush磁盘,如果flush方式不是SRV_UNIX_NOSYNC （这个不是很熟悉） if (srv_unix_file_flush_method == SRV_UNIX_NOSYNC) { /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
                } else { /* Write the log to the log files AND flush them to
                        disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, TRUE);
                }
        } else if (srv_flush_log_at_trx_commit == 2) {   #如果是2，则只write到redo log /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE);
        } else {
                ut_error;
        }
那么下面看log_write_up_to if (flush_to_disk              #如果flush到磁盘，则比较当前commit的lsn是否大于已经flush到磁盘的lsn && ut_dulint_cmp(log_sys->flushed_to_disk_lsn, lsn) >= 0) {

                mutex_exit(&(log_sys->mutex)); return;
        } if (!flush_to_disk       #如果不flush磁盘则比较当前commit的lsn是否大于已经写到所有redo log file的lsn,或者在只等一个group完成条件下是否大于已经写到某个redo file的lsn && (ut_dulint_cmp(log_sys->written_to_all_lsn, lsn) >= 0 || (ut_dulint_cmp(log_sys->written_to_some_lsn, lsn) >= 0 && wait != LOG_WAIT_ALL_GROUPS))) {

                mutex_exit(&(log_sys->mutex)); return;
        }
#下面的代码判断是否log在write,有的话等待其完成 if (log_sys->n_pending_writes > 0) { if (flush_to_disk    # 如果需要刷新到磁盘，如果正在flush的lsn包括了commit的lsn，只要等待操作完成就可以了 && ut_dulint_cmp(log_sys->current_flush_lsn, lsn) >= 0) { goto do_waits;
                } if (!flush_to_disk  # 如果是刷到redo log file的那么如果在write的lsn包括了commit的lsn,也只要等待就可以了 && ut_dulint_cmp(log_sys->write_lsn, lsn) >= 0) { goto do_waits;
                }
      ...... if (!flush_to_disk  # 如果在当前IO空闲情况下 ，而且不需要flush到磁盘，那么 如果下次写的位置已经到达buf_free位置说明wirte操作都已经完成了，直接返回 && log_sys->buf_free == log_sys->buf_next_to_write) {
                             mutex_exit(&(log_sys->mutex)); return;
               }
下面取到group,设置相关write or flush相关字段，并且得到起始和结束位置的block号
log_sys->n_pending_writes++;

        group = UT_LIST_GET_FIRST(log_sys->log_groups);
        group->n_pending_writes++; /* We assume here that we have only
                                        one log group! */ os_event_reset(log_sys->no_flush_event);
        os_event_reset(log_sys->one_flushed_event);

        start_offset = log_sys->buf_next_to_write;
        end_offset = log_sys->buf_free;

        area_start = ut_calc_align_down(start_offset, OS_FILE_LOG_BLOCK_SIZE);
        area_end = ut_calc_align(end_offset, OS_FILE_LOG_BLOCK_SIZE);

        ut_ad(area_end - area_start > 0);

        log_sys->write_lsn = log_sys->lsn; if (flush_to_disk) {
                log_sys->current_flush_lsn = log_sys->lsn;
        }
log_block_set_checkpoint_no调用设置end_offset所在block的LOG_BLOCK_CHECKPOINT_NO为log_sys中下个检查点号
        log_block_set_flush_bit(log_sys->buf + area_start, TRUE);   # 这个没看明白
        log_block_set_checkpoint_no(
                log_sys->buf + area_end - OS_FILE_LOG_BLOCK_SIZE,
                log_sys->next_checkpoint_no);
保存不属于end_offset但在其所在的block中的数据到下一个空闲的block
ut_memcpy(log_sys->buf + area_end,
                  log_sys->buf + area_end - OS_FILE_LOG_BLOCK_SIZE,
                  OS_FILE_LOG_BLOCK_SIZE);
对于每个group调用log_group_write_buf写redo log buffer while (group) {
                log_group_write_buf(
                        group, log_sys->buf + area_start,
                        area_end - area_start,
                        ut_dulint_align_down(log_sys->written_to_all_lsn,
                                             OS_FILE_LOG_BLOCK_SIZE),
                        start_offset - area_start);

                log_group_set_fields(group, log_sys->write_lsn);   # 计算这次写的lsn和offset来设置group->lsn和group->lsn_offset

                group = UT_LIST_GET_NEXT(log_groups, group);
        }
...... if (srv_unix_file_flush_method == SRV_UNIX_O_DSYNC) {  # 这个是什么东西 /* O_DSYNC means the OS did not buffer the log file at all:
                so we have also flushed to disk what we have written */ log_sys->flushed_to_disk_lsn = log_sys->write_lsn;

        } else if (flush_to_disk) {

                group = UT_LIST_GET_FIRST(log_sys->log_groups);

                fil_flush(group->space_id);         # 最后调用fil_flush执行flush到磁盘
                log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
        }

接下来看log_group_write_buf做了点什么

在log_group_calc_size_offset中,从group中取到上次记录的lsn位置（注意是log files组成的1个环状buffer),并计算这次的lsn相对于上次的差值
# 调用log_group_calc_size_offset计算group->lsn_offset除去多个LOG_FILE头部长度后的大小，比如lsn_offset落在第3个log file上，那么需要减掉3*LOG_FILE_HDR_SIZE的大小
gr_lsn_size_offset = (ib_longlong)
       log_group_calc_size_offset(group->lsn_offset, group);
group_size = (ib_longlong) log_group_get_capacity(group); # 计算group除去所有LOG_FILE_HDR_SIZE长度后的DATA部分大小

# 下面是典型的环状结构差值计算 if (ut_dulint_cmp(lsn, gr_lsn) >= 0) {

                difference = (ib_longlong) ut_dulint_minus(lsn, gr_lsn);
        } else {
                difference = (ib_longlong) ut_dulint_minus(gr_lsn, lsn);

                difference = difference % group_size;

                difference = group_size - difference;
        }

        offset = (gr_lsn_size_offset + difference) % group_size;
# 最后算上每个log file 头部大小，返回真实的offset return(log_group_calc_real_offset((ulint)offset, group));
接着看

# 如果需要写的内容超过一个文件大小 if ((next_offset % group->file_size) + len > group->file_size) {

                write_len = group->file_size                # 写到file末尾 - (next_offset % group->file_size);
        } else {
                write_len = len;                              # 否者写len个block
        }

# 最后真正的内容就是写buffer了，如果跨越file的话另外需要写file log file head部分 if ((next_offset % group->file_size == LOG_FILE_HDR_SIZE) && write_header) { /* We start to write a new log file instance in the group */ log_group_file_header_flush(group,
                                            next_offset / group->file_size,
                                            start_lsn);
                srv_os_log_written+= OS_FILE_LOG_BLOCK_SIZE;
                srv_log_writes++;
        }

# 调用fil_io来执行buffer写 if (log_do_write) {
                log_sys->n_log_ios++;

                srv_os_log_pending_writes++;

                fil_io(OS_FILE_WRITE | OS_FILE_LOG, TRUE, group->space_id,
                       next_offset / UNIV_PAGE_SIZE,
                       next_offset % UNIV_PAGE_SIZE, write_len, buf, group);

                srv_os_log_pending_writes--;

                srv_os_log_written+= write_len;
                srv_log_writes++;

如何理解MYSQL-GroupCommit 和 2pc提交

然而我们考虑如下序列（Copy from worklog…）

    Trx1 ------------P----------C-------------------------------->
                                |
    Trx2 ----------------P------+---C---------------------------->
                                |   |
    Trx3 -------------------P---+---+-----C---------------------->
                                |   |     |
    Trx4 -----------------------+-P-+-----+----C----------------->
                                |   |     |    |
    Trx5 -----------------------+---+-P---+----+---C------------->
                                |   |     |    |   |
    Trx6 -----------------------+---+---P-+----+---+---C---------->
                                |   |     |    |   |   |
    Trx7 -----------------------+---+-----+----+---+-P-+--C------->
                                |   |     |    |   |   |  |

在之前的逻辑中，trx5 和 trx6是可以并发执行的，因为他们拥有相同的序列号；Trx4无法和Trx5并行，因为他们的序列号不同。同样的trx6和trx7也无法并行。当发现一个无法并发的事务时，就需要等待前面的事务执行完成才能继续下去，这会影响到备库的TPS。

但是理论上trx4应该可以和trx5和trx6并行，因为trx4先于trx5和trx6 prepare，如果trx5 和trx6能进入Prepare阶段，证明其和trx4是没有冲突的。

解决方案：

0.增加两个全局变量：

/* Committed transactions timestamp */
Logical_clock max_committed_transaction;
/* "Prepared" transactions timestamp */
Logical_clock transaction_counter;

每个事务对应两个counter：last_committed 及 sequence_number

1.每次rotate或打开新的binlog时

MYSQL_BIN_LOG::open_binlog:

max_committed_transaction.update_offset(transaction_counter.get_timestamp());
transaction_counter.update_offset(transaction_counter.get_timestamp());

—>更新max_committed_transaction和transaction_counter的offset为当前的state值（或者说，为上个Binlog文件最大的transaction counter值）

2.每执行一条DML语句完成时，更新当前会话的last_committed= mysql_bin_log.max_committed_transaction

参考函数： binlog_prepare （参数all为false）

3. 事务提交时，写入binlog之前

binlog_cache_data::flush:

trn_ctx->sequence_number= mysql_bin_log.transaction_counter.step();

其中transaction_counter递增1

4.写入binlog

将sequence_number 和 last_committed写入binlog

MYSQL_BIN_LOG::write_gtid

记录binlog文件的seq number 和last committed会减去max_committed_transaction.get_offset()，也就是说，每个Binlog文件的序列号总是从(last_committed, sequence_number)=(0,1)开始

5.引擎层提交每个事务前更新max_committed_transaction

如果当前事务的sequence_number大于max_committed_transaction，则更新max_committed_transaction的值

MYSQL_BIN_LOG::process_commit_stage_queue –> MYSQL_BIN_LOG::update_max_committed

6.备库并发检查

函数：Mts_submode_logical_clock::schedule_next_event

假设初始状态下transaction_counter=1, max_committed_transaction=1，以上述流程为例，每个事务的<last_committed, sequence_number>序列为:

Trx1 prepare: last_commited = max_committed_transaction = 1;

Trx2 prepare: last_commited = max_committed_transaction = 1;

Trx3 prepare: last_commited = max_committed_transaction = 1;

Trx1 commit: sequence_number=++transaction_counter = 2, (transaction_counter=2, max_committed_transaction=2), write(1,2)

Trx4 prepare: last_commited =max_committed_transaction = 2;

Trx2 commit: sequence_number=++transaction_counter= 3, (transaction_counter=3, max_committed_transaction=3), write(1,3)

Trx5 prepare: last_commited = max_committed_transaction = 3;

Trx6 prepare: last_commited = max_committed_transaction = 3;

Trx3 commit: sequence_number=++transaction_counter= 4, (transaction_counter=4, max_committed_transaction=4), write(1,4)

Trx4 commit: sequence_number=++transaction_counter= 5, (transaction_counter=5, max_committed_transaction=5), write(2, 5)

Trx5 commit: sequence_number=++transaction_counter= 6, (transaction_counter=6, max_committed_transaction=6), write(3, 6)

Trx7 prepare: last_commited = max_committed_transaction = 6;

Trx6 commit: sequence_number=++transaction_counter= 7, (transaction_counter=7, max_committed_transaction=7), write(3, 7)

Trx7 commit: sequence_number=++transaction_counter= 8, (transaction_counter=8, max_committed_transaction=8), write(6, 8)

并发规则：

因此上述序列中可以并发的序列为：

trx1 1…..2

trx2 1………….3

trx3 1…………………….4

trx4 2………………………….5

trx5 3………………………………..6

trx6 3………………………………………………7

trx7 6……………………..8

备库并行规则：当分发一个事务时，其last_committed 序列号比当前正在执行的事务的最小sequence_number要小时，则允许执行。

因此，

a)trx1执行，last_commit<2的可并发，trx2, trx3可继续分发执行

b)trx1执行完成后，last_commit < 3的可以执行， trx4可分发

c)trx2执行完成后，last_commit < 4的可以执行， trx5, trx6可分发

d)trx3、trx4、trx5完成后，last_commit < 7的可以执行，trx7可分发

关于如何理解MYSQL-GroupCommit 和 2pc提交就分享到这里了，希望以上内容可以对大家有一定的帮助，可以学到更多知识。如果觉得文章不错，可以把它分享出去让更多的人看到。

如何理解MYSQL-GroupCommit 和 2pc提交

相关阅读