How should you deal with a bug in Kafka itself that is triggered by a network failure? Many people without much experience are at a loss when it happens, so this article summarizes why the problem occurs and how to fix it. Hopefully it will help you resolve the same issue.
2019-09-03 17:06:25: the data-center network fluctuated for about a minute; a switch problem caused the brokers in the Kafka cluster to intermittently lose contact with each other.
The Kafka logs looked like this:
[2019-09-03 17:06:25,610] WARN Attempting to send response via channel for which there is no open connection, connection id xxxxx (kafka.network.Processor)
[2019-09-03 17:06:31,906] INFO Unable to read additional data from server sessionid 0x46b0xxxx027, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:32,076] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:32,609] INFO Opening socket connection to server xxxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] WARN Client session timed out, have not heard from server in 1796ms for sessionid 0x46bxxxx40027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] INFO Client session timed out, have not heard from server in 1796ms for sessionid 0x46b03bxxx027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:34,942] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO [Partition opStaffCancelPost-18 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:06:36,059] WARN Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx0027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,382] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:37,305] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:38,507] WARN Client session timed out, have not heard from server in 2135ms for sessionid 0x46bxxxx0027 (org.apache.zookeeper.ClientCnxn)
The network blip lasted about a minute and then the network recovered. For a highly available Kafka cluster this should have been something it could recover from on its own, but that is not what happened.
Next, the kafka-manager monitoring started to look wrong: a large number of topics showed no brokers at all, and the message offsets stopped moving. Connecting manually with a Kafka client was just as strange: no messages could be seen arriving or being consumed, even though the applications were producing and consuming normally.
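For reference, a check like the one below is enough to see whether the log-end offsets of a topic are actually advancing on the brokers. It is only a minimal sketch using the kafka-python library; the broker addresses are placeholders, and the topic is simply one taken from the logs above, not necessarily what we ran at the time.

# Minimal sketch using the kafka-python library (pip install kafka-python).
# The broker addresses are placeholders; the topic is one taken from the logs above.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=["broker1:9092", "broker2:9092"])
topic = "opStaffCancelPost"
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

# end_offsets() asks the brokers for the current log-end offset of each partition.
# Run this twice a few seconds apart: if producers claim to be writing but these
# numbers never move, the brokers and the monitoring view disagree.
print(consumer.end_offsets(partitions))
consumer.close()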
The Kafka logs contained the following:
[2019-09-03 17:11:37,572] INFO [Partition opStaffCancelPost-18 broker=180] Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Cached zkVersion [48] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition __consumer_offsets-42 broker=180] Shrinking ISR from 180,181,182 to 180,181 (kafka.cluster.Partition)
[2019-09-03 17:11:37,576] INFO [Partition __consumer_offsets-42 broker=180] Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
The next day the monitoring still looked the same, which was clearly not right (at this point the applications were still producing to and consuming from Kafka without problems, yet every monitoring metric said the cluster was unhealthy), and the cluster remained in this broken state. Shortly after 11 a.m. we started restarting the nodes by hand, and at that point the real outage hit: the whole Kafka cluster became unreachable and no messages could be produced. The Kafka logs looked like this:
[2019-09-04 11:59:12,862] INFO [Partition routeStaffPostQueue-15 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition routeStaffPostQueue-15 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,865] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Cached zkVersion [41] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,866] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition routeStaffCancelQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,870] INFO [Partition routeStaffCancelQueue-5 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
Root cause:
When the network problem occurred, the Kafka controller's ZooKeeper session expired and it lost the controllership, but for a short while this zombie controller kept updating ZooKeeper and kept sending LeaderAndIsrRequests to the brokers. When that happens, the other brokers are left with stale leader and ISR information, so their later attempts to update it fail against ZooKeeper (the zkVersion they cached no longer matches the one in ZooKeeper).
The Kafka project has confirmed this bug, and KAFKA-5642 fixed it by handling ZooKeeper session-expiry events properly.
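The "Cached zkVersion [...] not equal to that in zookeeper, skip updating ISR" lines fall out of how ISR updates are written to ZooKeeper: the broker issues a conditional write carrying the znode version it last saw, and ZooKeeper rejects the write once somebody else (here, the zombie controller) has modified the znode in between. The sketch below only illustrates that mechanism with the kazoo Python client; the hosts, znode path and payloads are made up for illustration and are not Kafka's real layout.

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # placeholder ZooKeeper hosts
zk.start()

path = "/demo/partition-state"        # illustrative znode, not Kafka's real path layout
zk.ensure_path(path)
zk.set(path, b"isr=[180,182,183]")

_, stat = zk.get(path)
cached_version = stat.version         # the version a broker would cache

# A second writer (the zombie controller in this incident) modifies the znode,
# which bumps its version behind the first broker's back.
zk.set(path, b"isr=[180]")

try:
    # Conditional write with the stale cached version -> rejected by ZooKeeper.
    zk.set(path, b"isr=[180,183]", version=cached_version)
except BadVersionError:
    print("stale zkVersion, update skipped")   # what Kafka reports as "skip updating ISR"

zk.stop()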
The Kafka version we were running was 1.0.0, and the official fix shipped in 1.1.0. So if you are still on a Kafka version below 1.1.0, be aware of this issue: either give the ZooKeeper connection a few more seconds before it times out, or upgrade Kafka.
zookeeper.connection.timeout.ms=10000
zookeeper.session.timeout.ms=10000
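Both are broker-side settings in config/server.properties (Kafka 1.0 defaults the session timeout to 6000 ms), and a rolling restart of the brokers is needed for the new values to take effect.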