How killing a session is implemented in MySQL

Published: 2021-08-31 16:48:28  Author: chen
Source: Yisu Cloud

This article explains how killing a session is implemented inside MySQL.

1. A brief walkthrough of the process, with an example

First, let's briefly outline the life cycle of a statement's execution.

Take the following execution plan as an example:

mysql> desc select * from t1 where name='gaopeng';
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   14 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (1.67 sec)

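A user starts a mysql client process, which identifies exactly one mysqld server process by its socket address (IP:PORT).
The mysqld server process assigns a thread to handle network communication with this mysql client.
The mysql client sends a command to the mysqld server using the MySQL network protocol.
The mysqld server thread unpacks the packet and extracts the command sent by the client.
The mysqld server thread performs privilege checks and syntactic and semantic parsing, then applies logical and physical optimization to produce an execution plan.
loop:
The server thread executes the statement according to this plan. First, the InnoDB layer scans out one row and returns it to the MySQL layer for filtering, i.e. checking whether it satisfies name='gaopeng'.
If the row matches, it is returned to the mysql client; if not, the loop continues.
This repeats until the loop ends and all data has been returned.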
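The loop above can be sketched in C++ as follows. This is a toy model built on my own assumptions, not MySQL code. `Row`, `TableScan` and `run_full_scan` are hypothetical names. `TableScan::next()` stands in for the InnoDB layer handing back one row per call, and the `if` stands in for the MySQL-layer filter name='gaopeng'.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a table row; only the `name` column matters here.
struct Row {
    int id;
    std::string name;
};

// Plays the "InnoDB layer": returns one row per call, or nullptr when done.
class TableScan {
public:
    explicit TableScan(const std::vector<Row>& rows) : rows_(rows) {}
    const Row* next() { return pos_ < rows_.size() ? &rows_[pos_++] : nullptr; }

private:
    const std::vector<Row>& rows_;
    std::size_t pos_ = 0;
};

// Plays the "MySQL layer" loop: fetch a row, apply WHERE name = ?, send matches.
std::vector<int> run_full_scan(const std::vector<Row>& table, const std::string& wanted) {
    std::vector<int> sent;                  // ids "sent to the client"
    TableScan scan(table);
    while (const Row* r = scan.next()) {    // loop until the scan is exhausted
        if (r->name == wanted) {
            sent.push_back(r->id);          // a match goes back to the client
        }                                   // a non-match just continues the loop
    }
    return sent;
}
```

In this model the kill question becomes: how does another thread stop this `while` loop partway through?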

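This involves one mysql client process and one mysqld server thread, which communicate over a socket. To kill a session, we normally start a new mysql client process and connect it to mysqld. The server then needs another server thread to serve that client and respond to the kill command. The situation looks like this: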

[Figure: thread 2, serving the client that issues KILL, acts on thread 1, serving the session to be killed]

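As the figure shows, the question is how thread 2 actually acts on thread 1. Sharing memory between threads is easy, because threads share an address space by design. In MySQL, the shared variable is THD::killed: both thread 1 and thread 2 can access it. Kill works because the code checks THD::killed at certain places. Roughly, killing a session in this case goes as follows:

Thread 2 sets THD::killed.
While thread 1 scans rows in the InnoDB layer, after each row it checks whether its own THD has been set to KILL_CONNECTION.
If it has, thread 1 carries out the corresponding termination process.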
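A minimal C++ sketch of this shared-flag idea follows. The `KillState` enum and the `FakeThd` struct, which holds only the flag, are hypothetical simplifications. I use `std::atomic` so the sketch is free of data races; that choice is mine and does not necessarily match how THD declares the member. Thread 1 only notices the kill at its per-row check; thread 2 only writes the flag.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

// Simplified stand-in for the kill states stored in THD::killed (hypothetical).
enum class KillState { NOT_KILLED, KILL_QUERY, KILL_CONNECTION };

struct FakeThd {
    std::atomic<KillState> killed{KillState::NOT_KILLED};
};

// Thread 1: scans rows and checks the shared flag after every row.
// Returns how many rows it processed before noticing the kill (or finishing).
std::size_t scan_rows(FakeThd& thd, std::size_t total_rows) {
    std::size_t done = 0;
    for (; done < total_rows; ++done) {
        if (thd.killed.load() == KillState::KILL_CONNECTION) {
            break;  // check point reached: start the termination path
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // per-row "work"
    }
    return done;
}

// Thread 2: the killer only writes the shared variable.
void kill_connection(FakeThd& thd) {
    thd.killed.store(KillState::KILL_CONNECTION);
}
```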

2. Different kill scenarios

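The above describes how a select statement is killed, but not every case works this way. I have summarized the possible situations:

Executing a command, such as the select above (no InnoDB row lock wait).
Executing a command that is waiting, such as DML waiting on an InnoDB row lock; the InnoDB layer must wake it before the code can continue.
Executing a command that waits in the MySQL layer, such as the sleep function; the MySQL layer must wake it before the code can continue.
Idle, waiting for the next command to arrive.

Note that these are the possible states of the thread being killed. The thread issuing the command has only one path, which is calling kill_one_thread. I describe each case in detail below. For background on wake-up operations, see the appendix; here I assume readers already know how they work.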

3. The thread issuing the kill command

Here is its stack:

#0  THD::awake (this=0x7ffe7800e870, state_to_set=THD::KILL_CONNECTION) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:2206
#1  0x00000000015d5430 in kill_one_thread (thd=0x7ffe7c000b70, id=18, only_kill_query=false) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:6859
#2  0x00000000015d5548 in sql_kill (thd=0x7ffe7c000b70, id=18, only_kill_query=false) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_parse.cc:6887

kill_one_thread
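This is a key function. Using the my_thread_id of the session to be killed (the value that follows KILL), it looks up that session's THD structure and then calls THD::awake:

tmp= Global_THD_manager::get_instance()->find_thd(&find_thd_with_id); // get the THD of the session to be killed
tmp->awake(only_kill_query ? THD::KILL_QUERY : THD::KILL_CONNECTION); // call THD::awake; in our case the state is THD::KILL_CONNECTION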
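To make the lookup step concrete, here is a hedged sketch of what kill_one_thread does at this point. `Thd`, `ThdManager` and `kill_one_thread_sketch` are hypothetical simplifications of THD, Global_THD_manager::find_thd() and kill_one_thread(). The real function also checks privileges and performs locking and error reporting, all of which the sketch omits.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>

enum class KillState { NOT_KILLED, KILL_QUERY, KILL_CONNECTION };

// Hypothetical, heavily simplified THD.
struct Thd {
    unsigned long id = 0;
    KillState killed = KillState::NOT_KILLED;
    void awake(KillState s) { killed = s; }  // the real awake() does much more
};

// Stand-in for Global_THD_manager: a lock-protected id -> THD map.
class ThdManager {
public:
    void add(std::shared_ptr<Thd> t) {
        std::lock_guard<std::mutex> g(mu_);
        unsigned long id = t->id;
        thds_[id] = std::move(t);
    }
    std::shared_ptr<Thd> find(unsigned long id) {
        std::lock_guard<std::mutex> g(mu_);
        auto it = thds_.find(id);
        return it == thds_.end() ? nullptr : it->second;
    }

private:
    std::mutex mu_;
    std::map<unsigned long, std::shared_ptr<Thd>> thds_;
};

// Sketch of kill_one_thread: find the session by id, then call awake().
// Returns false if no session has that id (MySQL reports "Unknown thread id").
bool kill_one_thread_sketch(ThdManager& mgr, unsigned long id, bool only_kill_query) {
    std::shared_ptr<Thd> tmp = mgr.find(id);
    if (!tmp) return false;
    tmp->awake(only_kill_query ? KillState::KILL_QUERY : KillState::KILL_CONNECTION);
    return true;
}
```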

THD::awake
This is also a key function. It sets the target session's THD::killed to THD::KILL_CONNECTION and then shuts down its socket connection, so the client process receives an error similar to this:

ERROR 2013 (HY000): Lost connection to MySQL server during query

Next, it interrupts the target if it is waiting inside the storage engine (InnoDB). Finally, it performs a wake-up operation; the reason for the wake-up is explained later. The code:

killed= state_to_set; // set THD::killed to KILL_CONNECTION
vio_cancel(active_vio, SHUT_RDWR); // shut down the socket; the client connection closes as a result
/* Interrupt target waiting inside a storage engine. */
if (state_to_set != THD::NOT_KILLED)
  ha_kill_connection(this); // leads to lock_trx_handle_wait in InnoDB
mysql_mutex_lock(current_mutex);
mysql_cond_broadcast(current_cond); // the wake-up operation
mysql_mutex_unlock(current_mutex);
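The last three lines only work because the victim has published the mutex and condition variable it is about to block on (THD::enter_cond, shown later with sleep). Below is a minimal C++ model of that hand-off, built on my own assumptions. `MiniThd` and `wait_until_killed` are hypothetical. vio_cancel and the storage-engine step are not modeled. Unlike MySQL, which calls enter_cond while already holding the mutex, this sketch publishes the pointers before taking it; that avoids a lock-order inversion with `data_lock`.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical model of the parts of THD that THD::awake() touches.
struct MiniThd {
    std::atomic<bool> killed{false};
    std::mutex data_lock;                             // guards the two pointers below
    std::mutex* current_mutex = nullptr;              // published by enter_cond()
    std::condition_variable* current_cond = nullptr;

    // Victim side: record which mutex/cond it is about to block on.
    void enter_cond(std::condition_variable* c, std::mutex* m) {
        std::lock_guard<std::mutex> g(data_lock);
        current_cond = c;
        current_mutex = m;
    }
    void exit_cond() {
        std::lock_guard<std::mutex> g(data_lock);
        current_cond = nullptr;
        current_mutex = nullptr;
    }

    // Killer side: set the flag first, then wake whatever the victim waits on.
    void awake() {
        killed.store(true);
        std::lock_guard<std::mutex> g(data_lock);
        if (current_cond != nullptr) {
            // Holding the victim's mutex while notifying means the victim is
            // either before its "killed?" check or already inside wait().
            std::lock_guard<std::mutex> m(*current_mutex);
            current_cond->notify_all();
        }
    }
};

// Victim: publish the cond, then block until killed.
void wait_until_killed(MiniThd& thd, std::condition_variable& cond, std::mutex& m) {
    thd.enter_cond(&cond, &m);
    {
        std::unique_lock<std::mutex> lk(m);
        cond.wait(lk, [&] { return thd.killed.load(); });  // re-checks the flag
    }
    thd.exit_cond();
}
```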

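4. Target thread is executing a command, such as the select above (no InnoDB row lock wait)

This case is handled by checking return values at suitable places in the code, as in the following stack: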

#0  convert_error_code_to_mysql (error=DB_INTERRUPTED, flags=33, thd=0x7ffe74012f30)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:2064
#1  0x00000000019d651e in ha_innobase::general_fetch (this=0x7ffe7493c960, buf=0x7ffe7493cea0 "\377", direction=1, match_mode=0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:9907
#2  0x00000000019d658b in ha_innobase::index_next (this=0x7ffe7493c960, buf=0x7ffe7493cea0 "\377")
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:9929

The corresponding code in ha_innobase::general_fetch is:

default:
        error = convert_error_code_to_mysql(ret, m_prebuilt->table->flags, m_user_thd);

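If ret equals DB_INTERRUPTED, the thread enters its exit logic, which we will look at later.

DB_INTERRUPTED represents the "killed" termination state. It is set by the following code (a so-called "check point"):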

        if (trx_is_interrupted(prebuilt->trx)) {
            ret = DB_INTERRUPTED;

trx_is_interrupted is very simple:

return(trx && trx->mysql_thd && thd_killed(trx->mysql_thd));
And thd_killed is:

extern "C" int thd_killed(const MYSQL_THD thd)
{
  if (thd == NULL)
    return current_thd != NULL ? current_thd->killed : 0;
  return thd->killed; // returns THD::killed
}

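thd->killed is exactly the THD::killed that the killing thread set to THD::KILL_CONNECTION earlier. The error is then propagated up layer by layer until the handle_connection loop ends and the thread enters the termination process.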
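The propagation path can be modeled in a few lines of C++. Every name here is hypothetical. `EngineErr::INTERRUPTED` plays the role of DB_INTERRUPTED, `fetch_next` plays a fetch with the trx_is_interrupted check point, and `convert_error` plays convert_error_code_to_mysql. The enum values are placeholders, not MySQL's real error codes.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

enum class EngineErr { SUCCESS, END_OF_INDEX, INTERRUPTED };  // hypothetical
enum class ServerErr { OK, END_OF_FILE, QUERY_INTERRUPTED };   // hypothetical

struct Session { std::atomic<bool> killed{false}; };

// "InnoDB layer": the check point sits right before a row is handed back.
EngineErr fetch_next(Session& s, const std::vector<int>& rows, std::size_t& pos, int& out) {
    if (s.killed.load()) return EngineErr::INTERRUPTED;  // like ret = DB_INTERRUPTED
    if (pos >= rows.size()) return EngineErr::END_OF_INDEX;
    out = rows[pos++];
    return EngineErr::SUCCESS;
}

// "Handler layer": translate engine codes into server-level codes.
ServerErr convert_error(EngineErr e) {
    switch (e) {
        case EngineErr::SUCCESS:      return ServerErr::OK;
        case EngineErr::END_OF_INDEX: return ServerErr::END_OF_FILE;
        case EngineErr::INTERRUPTED:  return ServerErr::QUERY_INTERRUPTED;
    }
    return ServerErr::QUERY_INTERRUPTED;  // not reached; keeps compilers quiet
}

// "MySQL layer": read until EOF, or until an error bubbles up.
// `kill_after` simulates a KILL arriving after that many rows.
ServerErr read_all(Session& s, const std::vector<int>& rows,
                   std::size_t kill_after, std::size_t& read) {
    std::size_t pos = 0;
    read = 0;
    for (;;) {
        if (read == kill_after) s.killed = true;
        int v = 0;
        ServerErr err = convert_error(fetch_next(s, rows, pos, v));
        if (err == ServerErr::END_OF_FILE) return ServerErr::OK;
        if (err != ServerErr::OK) return err;  // propagated upward, ends the statement
        ++read;
    }
}
```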

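5. Target thread is executing a command that is waiting, such as DML waiting on an InnoDB row lock; the InnoDB layer must wake it before the code can continue

This case is similar to the previous one: the thread's THD::killed must be checked for THD::KILL_CONNECTION. However, a thread blocked in pthread_cond_wait continues only after another thread wakes it; otherwise it never reaches the check. First, the waiting stack: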

#0  0x00007ffff7bca68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000001ab1d35 in os_event::wait (this=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/include/os0event.h:156
#2  0x0000000001ab167d in os_event::wait_low (this=0x7ffe74011f18, reset_sig_count=2)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/os/os0event.cc:131
#3  0x0000000001ab1aa6 in os_event_wait_low (event=0x7ffe74011f18, reset_sig_count=0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/os/os0event.cc:328
#4  0x0000000001a7305f in lock_wait_suspend_thread (thr=0x7ffe74005190) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:387
#5  0x0000000001b391fc in row_mysql_handle_errors (new_err=0x7fffec091c4c, trx=0x7fffd78045f0, thr=0x7ffe74005190, savept=0x0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/row/row0mysql.cc:1312
#6  0x0000000001b7c2ea in row_search_mvcc (buf=0x7ffe74010160 "\377", mode=PAGE_CUR_G, prebuilt=0x7ffe74004a20, match_mode=0, direction=0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/row/row0sel.cc:6318
#7  0x00000000019d5443 in ha_innobase::index_read (this=0x7ffe7400e280, buf=0x7ffe74010160 "\377", key_ptr=0x0, key_len=0, find_flag=HA_READ_AFTER_KEY)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/handler/ha_innodb.cc:9536

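So some thread has to wake it. Note that this wake-up happens in the InnoDB layer and is a different mechanism from the MySQL-layer wake-up mentioned above (described later). Who wakes it? Set breakpoints on:

os_event::broadcast
os_event::signal
and you can catch the thread that does it. It turns out InnoDB has a dedicated internal thread for this job. Its stack: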

#0  os_event::broadcast (this=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/include/os0event.h:166
#1  0x0000000001ab1be8 in os_event::set (this=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/include/os0event.h:61
#2  0x0000000001ab1a3a in os_event_set (event=0x7ffe74011f18) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/os/os0event.cc:277
#3  0x0000000001a73460 in lock_wait_release_thread_if_suspended (thr=0x7ffe70013360)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:491
#4  0x0000000001a6a80d in lock_cancel_waiting_and_release (lock=0x30b1938) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0lock.cc:6896
#5  0x0000000001a736a6 in lock_wait_check_and_cancel (slot=0x7fff0060a2a0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:539
#6  0x0000000001a7383d in lock_wait_timeout_thread (arg=0x0) at /root/mysqlc/percona-server-locks-detail-5.7.22/storage/innobase/lock/lock0wait.cc:599
#7  0x00007ffff7bc6aa1 in start_thread () from /lib64/libpthread.so.0
#8  0x00007ffff6719bcd in clone () from /lib64/libc.so.6

A quick look at the code of lock_wait_check_and_cancel shows:

if (trx_is_interrupted(trx)
    || (slot->wait_timeout < 100000000
        && (wait_time > (double) slot->wait_timeout
            || wait_time < 0))) {

        /* Timeout exceeded or a wrap-around in system
        time counter: cancel the lock request queued
        by the transaction and release possible
        other transactions waiting behind; it is
        possible that the lock has already been
        granted: in that case do nothing */

        lock_mutex_enter();
        trx_mutex_enter(trx);

        if (trx->lock.wait_lock != NULL && !trx_is_high_priority(trx)) {
            ut_a(trx->lock.que_state == TRX_QUE_LOCK_WAIT);
            lock_cancel_waiting_and_release(trx->lock.wait_lock);
        }
        lock_mutex_exit();
        trx_mutex_exit(trx);
    }

The key point is that trx_is_interrupted checks for THD::KILL_CONNECTION. This same thread also handles the wake-up when an InnoDB row lock wait times out. The thread is visible as follows:

|     35 |         4036 | innodb/srv_lock_timeout_thread  |    NULL | BACKGROUND | NULL   | NULL         |
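Here is a simplified C++ model of this division of labor: the killer only sets a flag, and a background checker wakes the waiter. `WaitSlot`, `suspend_thread` and `timeout_checker` are hypothetical stand-ins for the lock-wait slot, lock_wait_suspend_thread and lock_wait_timeout_thread/lock_wait_check_and_cancel. The 10 ms polling interval is an arbitrary choice for the sketch, not InnoDB's interval.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of a lock-wait slot and its os_event.
struct WaitSlot {
    std::mutex m;
    std::condition_variable cv;
    bool signalled = false;                 // like the os_event "set" state
    bool cancelled = false;                 // lock request cancelled (kill or timeout)
    std::atomic<bool> trx_killed{false};    // what trx_is_interrupted() would see
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    std::chrono::milliseconds timeout{60000};

    void set_event() {                      // roughly os_event_set(): wake the waiter
        std::lock_guard<std::mutex> g(m);
        signalled = true;
        cv.notify_all();
    }
};

// Victim: suspended until someone sets the event.
// Returns true if the wait ended because the lock request was cancelled.
bool suspend_thread(WaitSlot& slot) {
    std::unique_lock<std::mutex> lk(slot.m);
    slot.cv.wait(lk, [&] { return slot.signalled; });
    return slot.cancelled;
}

// Background checker: polls the slot and cancels the wait if the
// transaction was killed or the wait timed out.
void timeout_checker(WaitSlot& slot, std::atomic<bool>& stop) {
    while (!stop.load()) {
        auto waited = std::chrono::steady_clock::now() - slot.start;
        if (slot.trx_killed.load() || waited > slot.timeout) {
            {
                std::lock_guard<std::mutex> g(slot.m);
                slot.cancelled = true;
            }
            slot.set_event();
            return;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

Without the checker thread, setting `trx_killed` alone would leave `suspend_thread` blocked forever, which is exactly why the kill path relies on srv_lock_timeout_thread here.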

If a statement was executing and needs to be rolled back, the rollback happens afterwards:

    if (thd->is_error() || (thd->variables.option_bits & OPTION_MASTER_SQL_ERROR))
      trans_rollback_stmt(thd);

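6. Summary of when the InnoDB kill logic is triggered

Overall, InnoDB uses what Ding Qi calls "check points" to determine whether a thread has been killed. A check point checks whether the thread's THD::killed is THD::KILL_CONNECTION. These checks happen only at intervals; checking after every line of code is impossible. Roughly, the check points are located here:

Whenever a row is returned to the MySQL layer.
When an InnoDB row lock wait is parked in pthread_cond_wait, the srv_lock_timeout_thread thread must first wake it with a broadcast.

In fact, searching the whole code base for ret = DB_INTERRUPTED; reveals where the InnoDB-layer check points are.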

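7. Target thread is idle, waiting for a command

This case is simpler. When idle, the target thread is blocked reading from its socket. The killing thread shuts that socket down, so the target thread detects it immediately. Here is an excerpt from net_read_raw_loop: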

  /* On failure, propagate the error code. */
  if (count)
  {
    /* Socket should be closed. */
    net->error= 2;

    /* Interrupted by a timeout? */
    if (!eof && vio_was_timeout(net->vio))
      net->last_errno= ER_NET_READ_INTERRUPTED;
    else
      net->last_errno= ER_NET_READ_ERROR;

#ifdef MYSQL_SERVER
    my_error(net->last_errno, MYF(0)); // the error is raised here
#endif
  }

The handle_connection loop then ends and the termination process begins. In this case, the rollback is done by clean_up inside release_resources.
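To see why closing the channel is enough for an idle thread, here is a small POSIX sketch (tested mentally against Linux behavior). One thread blocks in read() on an idle socket, standing in for net_read_raw_loop. Another thread calls shutdown(fd, SHUT_RDWR), and the blocked read() returns 0. I am assuming that vio_cancel(active_vio, SHUT_RDWR) amounts to a shutdown() on the underlying socket; the Unix-domain socketpair and the function name are my own choices for a self-contained demo.

```cpp
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <thread>

// Returns the result of a read() that was blocked on an idle socket when
// another thread shut the socket down: 0 means "connection closed".
ssize_t idle_read_interrupted_by_shutdown() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -2;

    ssize_t n = -3;
    std::thread idle([&] {
        char buf[16];
        n = read(fds[0], buf, sizeof buf);  // blocks: nobody ever writes
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    shutdown(fds[0], SHUT_RDWR);            // the "killer" cancels the channel
    idle.join();

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Even if shutdown() happens before the reader reaches read(), read() still returns 0 immediately, so the sketch has no timing dependency.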

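8. Target thread is executing a command that waits in the MySQL layer, such as sleep; the MySQL layer must wake it before the code can continue

Remember that THD::awake, called by the killing thread, ends with a wake-up? As with an InnoDB row lock wait, the code cannot proceed without a wake-up, so it never reaches a check point. I'll use sleep as the example. First, the sleep logic: Item_func_sleep::val_int contains this code: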

  timed_cond.set_timeout((ulonglong) (timeout * 1000000000.0)); // store the sleep duration (in nanoseconds) in timed_cond
  mysql_cond_init(key_item_func_sleep_cond, &cond); // pthread_cond_init: initialize cond
  mysql_mutex_lock(&LOCK_item_func_sleep); // pthread_mutex_lock on the LOCK_item_func_sleep mutex
  thd->ENTER_COND(&cond, &LOCK_item_func_sleep, &stage_user_sleep, NULL); // #define ENTER_COND(C, M, S, O) enter_cond(C, M, S, O, __func__, __FILE__, __LINE__)
  // This step publishes cond into the THD, so other threads can get it and wake this one; KILL uses this condition variable to wake it up
  DEBUG_SYNC(current_thd, "func_sleep_before_sleep");
  error= 0;
  thd_wait_begin(thd, THD_WAIT_SLEEP);
  while (!thd->killed)
  {
    error= timed_cond.wait(&cond, &LOCK_item_func_sleep); // the actual sleep: a timed wait built on pthread_cond_timedwait that can also be woken through the condition variable
    if (error == ETIMEDOUT || error == ETIME)
      break;
    error= 0;
  }

Let's verify this. Here is the stack of the sleeping thread:

[Switching to Thread 0x7fffec064700 (LWP 4738)]
#0  THD::enter_cond (this=0x7ffe70000950, cond=0x7fffec061510, mutex=0x2e4d6a0, stage=0x2d8b630, old_stage=0x0, src_function=0x1f2598c "val_int", 
    src_file=0x1f232e8 "/root/mysqlc/percona-server-locks-detail-5.7.22/sql/item_func.cc", src_line=6057)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.h:3395
#1  0x00000000010265d8 in Item_func_sleep::val_int (this=0x7ffe70006210) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/item_func.cc:6057
#2  0x0000000000fafea5 in Item::send (this=0x7ffe70006210, protocol=0x7ffe70001c68, buffer=0x7fffec0619b0)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/item.cc:7564
#3  0x000000000156b10c in THD::send_result_set_row (this=0x7ffe70000950, row_items=0x7ffe700055d8)
    at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:5026
#4  0x0000000001565708 in Query_result_send::send_data (this=0x7ffe700063a8, items=...) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:2932

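Note the address of the condition variable, cond=0x7fffec061510. It is stored in the THD, so other threads can obtain it as well. Now look at the address of the condition variable that THD::awake broadcasts on: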

[Switching to Thread 0x7fffec0f7700 (LWP 4051)]
Breakpoint 2, THD::awake (this=0x7ffe70000950, state_to_set=THD::KILL_CONNECTION) at /root/mysqlc/percona-server-locks-detail-5.7.22/sql/sql_class.cc:2206
......
(gdb) n
2288          mysql_cond_broadcast(current_cond);
(gdb) p current_cond
$6 = (mysql_cond_t * volatile) 0x7fffec061510

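It is also 0x7fffec061510: the same condition variable. This shows that THD::awake is what wakes our sleep. The code then continues, reaches a check point, and finally the handle_connection loop ends and the termination process begins.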
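The sleep loop and its interaction with THD::awake can be reproduced with standard C++ primitives. This is a hedged model, not MySQL code. `interruptible_sleep` and `awake` are hypothetical names. `std::condition_variable::wait_until` stands in for timed_cond.wait (pthread_cond_timedwait), and the atomic flag stands in for thd->killed.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Model of the Item_func_sleep::val_int loop: wait with a deadline and
// re-check the kill flag after every wake-up.
// Returns true if the sleep ended because of a kill, false on normal timeout.
bool interruptible_sleep(std::atomic<bool>& killed, std::mutex& m,
                         std::condition_variable& cond, std::chrono::milliseconds d) {
    auto deadline = std::chrono::steady_clock::now() + d;
    std::unique_lock<std::mutex> lk(m);
    while (!killed.load()) {                                   // like while (!thd->killed)
        if (cond.wait_until(lk, deadline) == std::cv_status::timeout) {
            break;                                             // like ETIMEDOUT: normal end
        }
        // woken (possibly spuriously): the loop re-checks the flag
    }
    return killed.load();
}

// Killer side, like the tail of THD::awake(): set the flag, then broadcast
// on the same condition variable while holding the same mutex.
void awake(std::atomic<bool>& killed, std::mutex& m, std::condition_variable& cond) {
    killed.store(true);
    std::lock_guard<std::mutex> g(m);
    cond.notify_all();
}
```

Usage: a 10-second `interruptible_sleep` returns `true` as soon as another thread calls `awake` on the same mutex and condition variable. Without the `notify_all`, it would only notice the flag when the deadline expired.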

9. Termination of the target thread

Finally, the loop in handle_connection reaches its exit condition and the connection termination logic runs:

    {
      while (thd_connection_alive(thd))
      {
        if (do_command(thd))
          break;
      }
      end_connection(thd);
    }
    close_connection(thd, 0, false, false);
    thd->get_stmt_da()->reset_diagnostics_area();
    thd->release_resources();
.....
    thd_manager->remove_thd(thd); // removes the THD from the THD list; only after this is the KILLED thread gone
    Connection_handler_manager::dec_connection_count(extra_port_connection);
....
    delete thd;

    if (abort_loop) // Server is shutting down so end the pthread.
      break;

    channel_info= Per_thread_connection_handler::block_until_new_connection();
    if (channel_info == NULL)
      break;
    pthread_reused= true;

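The thread goes through end_connection, get_stmt_da()->reset_diagnostics_area() and release_resources before it reaches thd_manager->remove_thd(thd). After that, the thread can be reused for a new connection. In fact, the entry disappears from show processlist only after release_resources has finished. To verify this, modify the code to add sleep(10) before and after release_resources: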

sleep(10);
thd->release_resources();
sleep(10);

The test results were:

mysql> show processlist ; kill 31;kill 33;kill 35;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
|  7 | root | localhost | NULL | Query   |    0 | starting | show processlist |         0 |             0 |
| 31 | root | localhost | NULL | Sleep   |   35 |          | NULL             |         1 |             0 |
| 33 | root | localhost | NULL | Sleep   |   32 |          | NULL             |         1 |             0 |
| 35 | root | localhost | NULL | Sleep   |   29 |          | NULL             |         1 |             0 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
mysql> show processlist ;
+----+------+-----------+------+---------+------+-------------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State       | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+-------------+------------------+-----------+---------------+
|  7 | root | localhost | NULL | Query   |    0 | starting    | show processlist |         0 |             0 |
| 31 | root | localhost | NULL | Killed  |   44 | cleaning up | NULL             |         1 |             0 |
| 33 | root | localhost | NULL | Killed  |   41 | cleaning up | NULL             |         1 |             0 |
| 35 | root | localhost | NULL | Killed  |   38 | cleaning up | NULL             |         1 |             0 |
+----+------+-----------+------+---------+------+-------------+------------------+-----------+---------------+
4 rows in set (0.02 sec)
mysql> show processlist ;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
|  7 | root | localhost | NULL | Query   |    0 | starting | show processlist |         0 |             0 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+

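The Killed entries disappeared after about 10 seconds and never lasted 20 seconds. This confirms that a Killed thread disappears from show processlist once release_resources has completed.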
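The ordering that this experiment exposes can be modeled as follows. The session stays listed while cleanup runs and vanishes only when it is finally removed. `ProcessList` and `handle_connection_sketch` are hypothetical. `cleanup` stands for the end_connection/close_connection/release_resources sequence, and `pl.remove` for thd_manager->remove_thd. The sketch simplifies by making removal the moment of disappearance, whereas the experiment above indicates the entry is already gone once release_resources completes.

```cpp
#include <functional>
#include <mutex>
#include <set>

// Hypothetical stand-in for the THD list that SHOW PROCESSLIST walks.
class ProcessList {
public:
    void add(int id)     { std::lock_guard<std::mutex> g(mu_); ids_.insert(id); }
    void remove(int id)  { std::lock_guard<std::mutex> g(mu_); ids_.erase(id); }
    bool visible(int id) { std::lock_guard<std::mutex> g(mu_); return ids_.count(id) != 0; }

private:
    std::mutex mu_;
    std::set<int> ids_;
};

// One pass of a per-thread connection handler: run commands while the
// connection is alive, clean up, and only then drop the session from the list.
// Returns how many commands were executed.
int handle_connection_sketch(ProcessList& pl, int id,
                             const std::function<bool()>& connection_alive,
                             const std::function<bool()>& do_command,
                             const std::function<void()>& cleanup) {
    pl.add(id);
    int commands = 0;
    while (connection_alive()) {   // like thd_connection_alive(thd)
        ++commands;
        if (do_command()) break;   // true = stop (error, kill, quit)
    }
    cleanup();                     // still listed (e.g. as "Killed") while this runs
    pl.remove(id);                 // like thd_manager->remove_thd(thd)
    return commands;
}
```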

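10. Summary

A kill is one thread acting on another; the bridge between them is the shared variable THD::killed.
If the target is waiting on a row lock in the InnoDB layer, the srv_lock_timeout_thread thread wakes it, after which the code continues.
A wait in the MySQL layer also needs a wake-up, which the thread issuing the kill performs; the code then continues.
A Killed thread is removed from show processlist only after all of its work, such as rollback, has finished.
A kill takes effect only at certain preset check points; if a check point is never reached, the thread stays in the Killed state.
Even after a check point is reached, if the code then gets stuck on some other mutex and cannot exit, the Killed state persists, as in the following example (a bug?):
MySQL: a case of kill and show global status hanging https://www.jianshu.com/p/70614ae01046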
