Kafka 消费迟滞监控工具 Burrow
Kafka 官方对于自身的 LAG 监控并没有太好的方法,虽然Kafka broker 自带有 kafka-topic.sh, kafka-consumer-groups.sh, kafka-console-consumer.sh 等脚本,但是对于大规模的生产集群上,使用脚本采集是非常不可靠的。
Burrow 简介
LinkedIn 公司的数据基础设施Streaming SRE团队正在积极开发Burrow
,该软件由Go语言编写,在Apache许可证下发布,并托管在 GitHub Burrow 上。
它收集集群消费者群组的信息,并为每个群组计算出一个单独的状态,告诉我们群组是否运行正常,是否落后,速度是否变慢或者是否已经停止工作,以此来完成对消费者状态的监控。它不需要通过监控群组的进度来获得阈值,不过用户仍然可以从中获得消息的延时数量。
Burrow 的设计框架
Burrow
自动监控所有消费者和他们消费的每个分区。它通过消费特殊的内部Kafka主题来消费者偏移量。然后,Burrow将消费者信息作为与任何单个消费者分开的集中式服务提供。消费者状态通过评估滑动窗口中的消费者行为来确定。
这些信息被分解成每个分区的状态,然后转化为Consumer的单一状态。消费状态可以是OK,或处于WARNING状态(Consumer正在工作但消息消费落后),或处于ERROR状态(Consumer已停止消费或离线)。此状态可通过简单的HTTP请求发送至Burrow获取状态,也可以通过Burrow 定期检查并使用通知其通过电子邮件或单独的HTTP endpoint接口(例如监视或通知系统)发送出去。
Burrow能够监控Consumer消费消息的延迟,从而监控应用的健康状况,并且可以同时监控多个Kafka集群。用于获取关于Kafka集群和消费者的信息的HTTP上报服务与滞后状态分开,对于在无法运行Java Kafka客户端时有助于管理Kafka集群的应用程序非常有用。
Burrow 安装及版本
Burrow 是基于 Go 语言开发,当前 Burrow 的 v1.1 版本已经release。
Burrow 也提供用于 docker 镜像。
Burrow_1.2.2_checksums.txt 297 Bytes
Burrow_1.2.2_darwin_amd64.tar.gz 4.25 MB
Burrow_1.1.0_linux_amd64.tar.gz 3.22 MB (CentOS 6)
Burrow_1.2.2_linux_amd64.tar.gz 4.31 MB (CentOS 7 Require GLIBC >= 2.14)
Burrow_1.2.2_windows_amd64.tar.gz 4 MB
本发行版包含针对初始1.0.0发行版中发现的问题的一些重要修复,其中包括:
还有一些小的功能更新
- 存储最近的代理偏移环以避免停止的分区出现虚假警报
- 添加可配置的通知间隔
- 通过环境变量添加对配置的支持
- 支持存储模块中可配置的队列深度
Changelog - version 1.2[d244fce922] - Bump sarama to 1.20.1 (Vlad Gorodetsky)[793430d249] - Golang 1.9.x is no longer supported (Vlad Gorodetsky)[735fcb7c82] - Replace deprecated megacheck with staticcheck (Vlad Gorodetsky)[3d49b2588b] - Link the README to the Compose file in the project (Jordan Moore)[3a59b36d94] - Tests fixed (Mikhail Chugunkov)[6684c5e4db] - Added unit test for v3 value decoding (Mikhail Chugunkov)[10d4dc39eb] - Added v3 messages protocol support (Mikhail Chugunkov)[d6b075b781] - Replace deprecated MAINTAINER directive with a label (Vlad Gorodetsky)[52606499a6] - Refactor parseKafkaVersion to reduce method complexity (gocyclo) (Vlad Gorodetsky)[b0440f9dea] - Add gcc to build zstd (Vlad Gorodetsky)[6898a8de26] - Add libc-dev to build zstd (Vlad Gorodetsky)[b81089aada] - Add support for Kafka 2.1.0 (Vlad Gorodetsky)[cb004f9405] - Build with Go 1.11 (Vlad Gorodetsky)[679a95fb38] - Fix golint import path (golint fixer)[f88bb7d3a8] - Update docker-compose Readme section with working url. (Daniel Wojda)[3f888cdb2d] - Upgrade sarama to support Kafka 2.0.0 (#440) (daniel)[1150f6fef9] - Support linux/arm64 using Dup3() instead of Dup2() (Mpampis Kostas)[1b65b4b2f2] - Add support for Kafka 1.1.0 (#403) (Vlad Gorodetsky)[74b309fc8d] - code coverage for newly added lines (Clemens Valiente)[279c75375c] - accidentally reverted this (Clemens Valiente)[192878c69c] - gofmt (Clemens Valiente)[33bc8defcd] - make first regex test case a proper match everything (Clemens Valiente)[279b256b27] - only set whitelist / blacklist if it's not empty string (Clemens Valiente)[b48d30d18c] - naming (Clemens Valiente)[7d6c6ccb03] - variable naming (Clemens Valiente)[4e051e973f] - add tests (Clemens Valiente)[545bec66d0] - add blacklist for memory store (Clemens Valiente)[07af26d2f1] - Updated burrow endpoint in README : #401 (Ratish Ravindran)[fecab1ea88] - pass custom headers to http notifications. (#357) (vixns)Changelog - version 1.1fecab1e pass custom headers to http notifications. (#357)7c0b8b1 Add minimum-complete config for the evaluator (#388)dc4cb84 Fix mail template (#369)e2216d7 Fetch goreleaser via curl instead of 'go get' as compilation only works in 1.10 (#387)f3659d1 Add a send-interval configuration parameter (#364)3e488a2 Allow env vars to be used for configuration (#363)b7428c9 Fix typo in slack close (#361)5b546cc Create the broker offset rings earlier (#360)61f097a Metadata refresh on detecting a deleted topic must not be for that topic (#359)b890885 Make inmemory module request channel's size configurable (#352)9911709 Update sarama to support 10.2.1 too. (#345)a1bdcde Adjusting docker build to be self-contained (#344)a91cf4d Fix an incorrect cast from #338 and add a test to cover it (#340)389ef47 Store broker offset history (#338)1a60efe Fix alert closing (#334)b75a6f3 Fix typo in Cluster referencecacf05e Reject offsets that are older than the group expiration time (#330)b6184ff Fix typo in the config checked for TLS no-verify #316 (#329)3b765ea Sync Gopkg.lock with Gopkg.toml (#312)e47ec4c Fix ZK watch problem (#328)846d785 Assume backward-compatible consumer protocol version (fix #313) (#327)e3a1493 Update sarama to support Kafka 1.0.0 (#306)946a425 Fixing requests for StorageFetchConsumersForTopic (#310)52e3e5d Update burrow.toml (#300)3a4372f Upgrade sarama dependency to support Kafka 0.11.0 (#297)8993eb7 Fix goreleaser condition (#299)d088c99 Add gitter webhook to travis config (#296)08e9328 Merge branch 'gitter-badger-gitter-badge'76db0a9 Fix positioningdddd0ea Add Gitter badge
安装方法可以选用源码编译,和使用官方提供的二进制包等方法。
这里推荐使用二进制包的方式。
Burrow 是无本地状态存储的,CPU密集型,网络IO密集型应用。
安装方法
# wget https://github.com/linkedin/Burrow/releases/download/v1.1.0/Burrow_1.1.0_linux_amd64.tar.gz # mkdir burrow# tar -xf Burrow_1.1.0_linux_amd64.tar.gz -C burrow# cp burrow/burrow /usr/bin/# mkdir /etc/burrow# cp burrow/config/* /etc/burrow/# chkconfig --add burrow# /etc/init.d/burrow start
配置文件
[general]pidfile="/var/run/burrow.pid"stdout-logfile="/var/log/burrow.log"access-control-allow-origin="mysite.example.com"[logging]filename="/var/log/burrow.log"level="info"maxsize=512maxbackups=30maxage=10use-localtime=trueuse-compression=true[zookeeper]servers=[ "test1.localhost:2181","test2.localhost:2181" ]timeout=6root-path="/burrow"[client-profile.prod]client-id="burrow-lagchecker"kafka-version="0.10.0"[cluster.production]class-name="kafka"servers=[ "test1.localhost:9092","test2.localhost:9092" ]client-profile="prod"topic-refresh=180offset-refresh=30[consumer.production_kafka]class-name="kafka"cluster="production"servers=[ "test1.localhost:9092","test2.localhost:9092" ]client-profile="prod"start-latest=falsegroup-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"group-whitelist=""[consumer.production_consumer_zk]class-name="kafka_zk"cluster="production"servers=[ "test1.localhost:2181","test2.localhost:2181" ]#zookeeper-path="/"# If specified, this is the root of the Kafka cluster metadata in the Zookeeper ensemble. If not specified, the root path is used.zookeeper-timeout=30group-blacklist="^(console-consumer-|python-kafka-consumer-|quick-|test).*$"group-whitelist=""[httpserver.default]address=":8000"[storage.default]class-name="inmemory"workers=20intervals=15expire-group=604800min-distance=1#[notifier.default]#class-name="http"#url-open="http://127.0.0.1:1467/v1/event"#interval=60#timeout=5#keepalive=30#extras={ api_key="REDACTED", app="burrow", tier="STG", fabric="mydc" }#template-open="/etc/burrow/default-http-post.tmpl"#template-close="/etc/burrow/default-http-delete.tmpl"#method-close="DELETE"#send-close=false##send-close=true#threshold=1
启动脚本(RHEL6 CentOS6):
#!/bin/bash## Comments to support chkconfig# chkconfig: - 98 02# description: Burrow is kafka lag check_program by LinkedIn, Inc. ## Source function library.. /etc/init.d/functions### Default variablesprog_name="burrow"prog_path="/usr/bin/${prog_name}"pidfile="/var/run/${prog_name}.pid"options="-config-dir /etc/burrow/"# Check if requirements are met[ -x "${prog_path}" ] || exit 1RETVAL=0start(){ echo -n $"Starting $prog_name: " #pidfileofproc $prog_name #killproc $prog_path PID=$(pidofproc -p $pidfile $prog_name) #daemon $prog_path $options if [ -z $PID ]; then $prog_path $options > /dev/null 2>&1 & [ ! -e $pidfile ] && sleep 1 fi [ -z $PID ] && PID=$(pidof ${prog_path}) if [ -f $pidfile -a -d "/proc/$PID" ]; then #RETVAL=$? RETVAL=0 #[ ! -z "${PID}" ] && echo ${PID} > ${pidfile} echo_success [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name else RETVAL=1 echo_failure fi echo return $RETVAL}stop(){ echo -n $"Shutting down $prog_name: " killproc -p ${pidfile} $prog_name RETVAL=$? echo [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name return $RETVAL}restart() { stop start}case "$1" in start) start ;; stop) stop ;; restart) restart ;; status) status $prog_path RETVAL=$? ;; *) echo $"Usage: $0 {start|stop|restart|status}" RETVAL=1esacexit $RETVAL
systemd 服务脚本(RHEL7 CentOS7):
[Unit]Description=Burrow - Kafka consumer LAG MonitorAfter=network.target[Service]Type=simpleRestartSec=20sExecStart=/usr/bin/burrow --config-dir /etc/burrowPIDFile=/var/run/burrow/burrow.pidUser=burrowGroup=burrowRestart=on-abnormal[Install]WantedBy=multi-user.target
默认配置文件为 burrow.toml
使用方法
获取消费者列表
GET /v3/kafka/(cluster)/consumer
Burrow 返回额接口均为 json 对象格式,所以非常方便用于二次采集处理。
获取指定消费者的状态 或 消费延时
GET /v3/kafka/(cluster)/consumer/(group)/statusGET /v3/kafka/(cluster)/consumer/(group)/lag
消费组健康状态的接口含义如下:
NOTFOUND – 消费组未找到OK – 消费组状态正常WARN – 消费组处在WARN状态,例如offset在移动但是Lag不停增长。 the offsets are moving but lag is increasingERR – 消费组处在ERR状态。例如,offset停止变动,但Lag非零。 the offsets have stopped for one or more partitions but lag is non-zeroSTOP – 消费组处在ERR状态。例如offset长时间未提交。the offsets have not been committed in a log period of timeSTALL – 消费组处在STALL状态。例如offset已提交但是没有变化,Lag非零。the offsets are being committed, but they are not changing and the lag is non-zero
获取 topic 列表
GET /v3/kafka/(cluster)/topic
获取指定 topic offsets 信息
GET /v3/kafka/(cluster)/topic/(topic)