Handling Rebalance Problems

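In a recent project, the following problem showed up every so often: offsets stopped being committed, yet the logs showed that messages were still being consumed.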

Attempt to heartbeat failed since group is rebalancing

A consumer rebalance happens in the following situations:

A new member joins the group
A member leaves the group
A member of the group crashes

Members joining and leaving are usually things we trigger ourselves, so this symptom most likely means that a member crashed.

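Possible causes of a member being treated as crashed:

The consumer's heartbeat times out, which triggers a rebalance.
The consumer takes too long to process one batch, which triggers a rebalance.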

Consumer heartbeat timeout

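If the heartbeats of a group member cannot reach the cluster, the cluster triggers a rebalance.
Related parameters:

session.timeout.ms: sets the session timeout; default 10s (45s since Kafka 3.0)
heartbeat.interval.ms: sets the heartbeat interval; default 3s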

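If no heartbeat is received within one session window, a rebalance is triggered when the session expires. A common practice is to set session.timeout.ms to about three times heartbeat.interval.ms, so that roughly three heartbeats can be sent within one session window. This helps avoid frequent rebalances even when the network is unstable.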
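As a minimal sketch of the rule of thumb above, the following Python helper checks that the heartbeat interval is at most one third of the session timeout. The function name check_heartbeat_config and the plain-dict configuration are assumptions made for illustration; only the two parameter names come from Kafka.

```python
def check_heartbeat_config(config):
    """Return how many whole heartbeat intervals fit into one session window.

    Raises ValueError when heartbeat.interval.ms is larger than one third of
    session.timeout.ms, i.e. when the rule of thumb is violated.
    """
    session_ms = config["session.timeout.ms"]
    heartbeat_ms = config["heartbeat.interval.ms"]
    if session_ms <= 0 or heartbeat_ms <= 0:
        raise ValueError("both timeouts must be positive")
    if heartbeat_ms * 3 > session_ms:
        raise ValueError(
            f"heartbeat.interval.ms={heartbeat_ms} is more than one third "
            f"of session.timeout.ms={session_ms}"
        )
    return session_ms // heartbeat_ms


# Values quoted in the text: 10 s session timeout, 3 s heartbeat interval.
consumer_config = {"session.timeout.ms": 10_000, "heartbeat.interval.ms": 3_000}
heartbeats_per_session = check_heartbeat_config(consumer_config)
```

Running a check like this when the consumer starts is one way to catch a misconfigured pair of timeouts before it shows up as repeated rebalances.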

Consumer processing takes too long

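If a consumer spends too long processing the records from a single poll, the cluster considers the consumer dead and starts a rebalance.
Related parameters:

max.poll.interval.ms: maximum time allowed between two poll() calls, i.e. the time budget for processing one batch; default 5m
max.poll.records: maximum number of records returned by one poll; default 500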

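If processing takes longer than max.poll.interval.ms, the consumer is considered failed and the group rebalances. For long-running tasks, lower max.poll.records to reduce the amount of work per poll, and if needed also raise max.poll.interval.ms to avoid the processing timeout.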
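The trade-off can be made concrete with a little arithmetic. The sketch below picks a max.poll.records value from an estimated per-record processing time. The helper name suggest_max_poll_records, the 50% safety factor, and the 2 s per-record figure are all assumptions for illustration, not values from the project described here.

```python
def suggest_max_poll_records(per_record_ms, max_poll_interval_ms=300_000,
                             safety_factor=0.5, ceiling=500):
    """Suggest a max.poll.records value that keeps one batch inside the budget.

    per_record_ms: estimated worst-case processing time of a single record.
    safety_factor: fraction of max.poll.interval.ms we allow a batch to use.
    ceiling:       never suggest more than this (500 is the Kafka default).
    """
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    budget_ms = max_poll_interval_ms * safety_factor
    records = int(budget_ms // per_record_ms)
    # At least one record must be fetched, and we never exceed the ceiling.
    return max(1, min(ceiling, records))


# With the defaults, 500 records at 2 s each would need 1,000,000 ms,
# far more than the 300,000 ms allowed by max.poll.interval.ms.
worst_case_batch_ms = 500 * 2_000
suggested = suggest_max_poll_records(per_record_ms=2_000)
```

Here the suggestion is 75 records per poll, since half of the 300,000 ms budget divided by 2,000 ms per record gives 75.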

For reference, the Kafka documentation describes max.poll.interval.ms as follows:

This places an upper bound on the amount of time that the consumer can be idle before fetching more records.
If poll() is not called before expiration of this timeout, 
then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
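One way to notice this before the group rebalances is to measure the time between consecutive poll() calls in the consumer loop. The sketch below is a hypothetical helper, not part of any Kafka client: the class name PollIntervalMonitor, the 80% warning threshold, and the injectable clock are assumptions for illustration.

```python
import time


class PollIntervalMonitor:
    """Tracks the gap between poll() calls against max.poll.interval.ms."""

    def __init__(self, max_poll_interval_ms=300_000, warn_ratio=0.8,
                 clock=time.monotonic):
        self.max_poll_interval_ms = max_poll_interval_ms
        self.warn_ratio = warn_ratio
        self._clock = clock          # returns seconds; injectable for testing
        self._last_poll = None

    def record_poll(self):
        """Call right before each poll(); returns 'ok', 'warn' or 'exceeded'."""
        now = self._clock()
        previous, self._last_poll = self._last_poll, now
        if previous is None:
            return "ok"              # first poll, nothing to compare against
        elapsed_ms = (now - previous) * 1000
        if elapsed_ms > self.max_poll_interval_ms:
            return "exceeded"
        if elapsed_ms > self.max_poll_interval_ms * self.warn_ratio:
            return "warn"
        return "ok"
```

In a real consumer loop, record_poll() would be called just before each poll(), and a "warn" result logged so that max.poll.records or max.poll.interval.ms can be adjusted before the consumer is dropped from the group.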

