How should you deal with a bug in Kafka itself that is triggered by a network failure? Many people without much experience are at a loss when it happens, so this article summarizes why the problem occurs and how to fix it. Hopefully it will help you resolve the same issue.
2019-09-03 17:06:25: the data-center network fluctuated for about a minute; a switch problem caused the brokers in the Kafka cluster to intermittently lose contact with each other.
The Kafka logs looked like this:
[2019-09-03 17:06:25,610] WARN Attempting to send response via channel for which there is no open connection, connection id xxxxx (kafka.network.Processor)
[2019-09-03 17:06:31,906] INFO Unable to read additional data from server sessionid 0x46b0xxxx027, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:32,076] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:32,609] INFO Opening socket connection to server xxxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] WARN Client session timed out, have not heard from server in 1796ms for sessionid 0x46bxxxx40027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] INFO Client session timed out, have not heard from server in 1796ms for sessionid 0x46b03bxxx027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:34,942] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO [Partition opStaffCancelPost-18 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:06:36,059] WARN Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] INFO Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx0027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,382] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:37,305] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:38,507] WARN Client session timed out, have not heard from server in 2135ms for sessionid 0x46bxxxx0027 (org.apache.zookeeper.ClientCnxn)
The network blip lasted about a minute and then the network recovered. For a highly available Kafka cluster this should have been something it could recover from on its own, but that is not what happened.
Next, the kafka-manager monitoring started to look wrong: a large number of topics showed no brokers at all, and the message offsets stopped moving. Connecting manually with a Kafka client was just as strange: no messages could be seen arriving or being consumed, even though the applications were producing and consuming normally.
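For reference, a check like the one below is enough to see whether the log-end offsets of a topic are actually advancing on the brokers. It is only a minimal sketch using the kafka-python library; the broker addresses are placeholders, and the topic is simply one taken from the logs above, not necessarily what we ran at the time.

# Minimal sketch using the kafka-python library (pip install kafka-python).
# The broker addresses are placeholders; the topic is one taken from the logs above.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers=["broker1:9092", "broker2:9092"])
topic = "opStaffCancelPost"
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

# end_offsets() asks the brokers for the current log-end offset of each partition.
# Run this twice a few seconds apart: if producers claim to be writing but these
# numbers never move, the brokers and the monitoring view disagree.
print(consumer.end_offsets(partitions))
consumer.close()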
The Kafka logs contained the following:
[2019-09-03 17:11:37,572] INFO [Partition opStaffCancelPost-18 broker=180] Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Cached zkVersion [48] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition __consumer_offsets-42 broker=180] Shrinking ISR from 180,181,182 to 180,181 (kafka.cluster.Partition)
[2019-09-03 17:11:37,576] INFO [Partition __consumer_offsets-42 broker=180] Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
The next day the monitoring still looked the same, which was clearly not right (at this point the applications were still producing to and consuming from Kafka without problems, yet every monitoring metric said the cluster was unhealthy), and the cluster remained in this broken state. Shortly after 11 a.m. we started restarting the nodes by hand, and at that point the real outage hit: the whole Kafka cluster became unreachable and no messages could be produced. The Kafka logs looked like this:
[2019-09-04 11:59:12,862] INFO [Partition routeStaffPostQueue-15 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition routeStaffPostQueue-15 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,865] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Cached zkVersion [41] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,866] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,867] INFO [Partition routeStaffCancelQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,870] INFO [Partition routeStaffCancelQueue-5 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
Root cause:
When the network problem occurred, the Kafka controller's ZooKeeper session expired and it lost the controllership, but for a short while this zombie controller kept updating ZooKeeper and kept sending LeaderAndIsrRequests to the brokers. When that happens, the other brokers are left with stale leader and ISR information, so their later attempts to update it fail against ZooKeeper (the zkVersion they cached no longer matches the one in ZooKeeper).
The Kafka project has confirmed this bug, and KAFKA-5642 fixed it by handling ZooKeeper session-expiry events properly.
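The "Cached zkVersion [...] not equal to that in zookeeper, skip updating ISR" lines fall out of how ISR updates are written to ZooKeeper: the broker issues a conditional write carrying the znode version it last saw, and ZooKeeper rejects the write once somebody else (here, the zombie controller) has modified the znode in between. The sketch below only illustrates that mechanism with the kazoo Python client; the hosts, znode path and payloads are made up for illustration and are not Kafka's real layout.

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")   # placeholder ZooKeeper hosts
zk.start()

path = "/demo/partition-state"        # illustrative znode, not Kafka's real path layout
zk.ensure_path(path)
zk.set(path, b"isr=[180,182,183]")

_, stat = zk.get(path)
cached_version = stat.version         # the version a broker would cache

# A second writer (the zombie controller in this incident) modifies the znode,
# which bumps its version behind the first broker's back.
zk.set(path, b"isr=[180]")

try:
    # Conditional write with the stale cached version -> rejected by ZooKeeper.
    zk.set(path, b"isr=[180,183]", version=cached_version)
except BadVersionError:
    print("stale zkVersion, update skipped")   # what Kafka reports as "skip updating ISR"

zk.stop()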
The Kafka version we were running was 1.0.0, and the official fix shipped in 1.1.0. So if you are still on a Kafka version below 1.1.0, be aware of this issue: either give the ZooKeeper connection a few more seconds before it times out, or upgrade Kafka.
zookeeper.connection.timeout.ms=10000
zookeeper.session.timeout.ms=10000
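Both are broker-side settings in config/server.properties (Kafka 1.0 defaults the session timeout to 6000 ms), and a rolling restart of the brokers is needed for the new values to take effect.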