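Abnormal STP on a Third-Party Device Causes a Broadcast Storm and Network Outage
Problem Description
1. During the network fault, the S9700 logged MAC address flapping.
2. The NMS detected that the S9700 CPU usage exceeded the threshold.
3. The S9700 acts as the gateway, and downstream users could not ping it.
Alarm Information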
Jul 23 2017 07:14:24+08:00 9700_01 L2IFPPI/4/MFLPVLANALARM:OID 1.3.6.1.4.1.2011.5.25.160.3.7 MAC move detected, VlanId = 853, MacAddress = 0080-87db-a942, Original-Port = GE11/0/44, Flapping port = Eth-Trunk1. Please check the network accessed to flapping port.
Jul 23 2017 07:14:28+08:00 9700_01 L2IFPPI/4/MFLPVLANALARM:OID 1.3.6.1.4.1.2011.5.25.160.3.7 MAC move detected, VlanId = 850, MacAddress = 78e7-d1d2-de12, Original-Port = GE12/0/44, Flapping port = Eth-Trunk1. Please check the network accessed to flapping port.
Jul 23 2017 07:17:16+08:00 9700_01 %%01DEFD/6/CPCAR_DROP_LPU(l)[85839]:Rate of packets to cpu exceeded the CPCAR limit on the LPU in slot 8. (Protocol=vrrp, ExceededPacketCount=011850)
Jul 23 2017 07:17:16+08:00 9700_01 %%01DEFD/6/CPCAR_DROP_LPU(l)[85840]:Rate of packets to cpu exceeded the CPCAR limit on the LPU in slot 8. (Protocol=dhcp-server, ExceededPacketCount=042)
Jul 23 2017 07:17:16+08:00 9700_01 %%01DEFD/6/CPCAR_DROP_LPU(l)[85841]:Rate of packets to cpu exceeded the CPCAR limit on the LPU in slot 9. (Protocol=vrrp, ExceededPacketCount=014381)
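When many such alarms arrive at once, it can help to extract the key fields instead of reading each line. The following is a minimal sketch, assuming the alarm text has exactly the layout shown above; the function names `parse_mac_move` and `parse_cpcar_drop` are hypothetical helpers for illustration, not part of any device or NMS tool.

```python
import re

# Patterns written against the two alarm layouts quoted above (assumed stable).
MAC_MOVE_RE = re.compile(
    r"MAC move detected, VlanId = (?P<vlan>\d+), "
    r"MacAddress = (?P<mac>[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}), "
    r"Original-Port = (?P<original_port>[^,\s]+), "
    r"Flapping port = (?P<flapping_port>[^.\s]+)\."
)
CPCAR_RE = re.compile(
    r"CPCAR limit on the LPU in slot (?P<slot>\d+)\. "
    r"\(Protocol=(?P<protocol>[\w-]+), ExceededPacketCount=(?P<count>\d+)\)"
)


def parse_mac_move(line):
    """Return VLAN, MAC and both ports from a MAC move alarm, or None."""
    match = MAC_MOVE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["vlan"] = int(fields["vlan"])
    return fields


def parse_cpcar_drop(line):
    """Return slot, protocol and dropped-packet count from a CPCAR alarm, or None."""
    match = CPCAR_RE.search(line)
    if match is None:
        return None
    return {
        "slot": int(match["slot"]),
        "protocol": match["protocol"],
        "count": int(match["count"]),  # the log pads the count with a leading zero
    }


if __name__ == "__main__":
    mac_line = (
        "MAC move detected, VlanId = 853, MacAddress = 0080-87db-a942, "
        "Original-Port = GE11/0/44, Flapping port = Eth-Trunk1. "
        "Please check the network accessed to flapping port."
    )
    cpcar_line = (
        "Rate of packets to cpu exceeded the CPCAR limit on the LPU in slot 8. "
        "(Protocol=vrrp, ExceededPacketCount=011850)"
    )
    print(parse_mac_move(mac_line))
    print(parse_cpcar_drop(cpcar_line))
```

Grouping the parsed records by flapping port or by protocol then shows quickly which interface and which protocols are affected.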
Handling Process
1. Analyze the logs of S9700_01 and S9700_02.
The logs show that, before the MAC flapping occurred, port GE9/0/42 on S9700_01 was set to forwarding. It moved from discarding to learning after a 15s timeout, and then to forwarding after another 15s. This indicates that the peer was not sending STP BPDUs at that time: either STP had been disabled on the peer port, or something else was preventing the peer from sending them correctly. The peer device, AS_23, is analyzed next to confirm this.
Jul 23 2017 07:13:10+08:00 S9700_01 %%01MSTP/6/SET_PORT_DISCARDING(l)[85836]:In MSTP process 0 instance 0, MSTP set port GigabitEthernet9/0/42 state as discarding. --- port 9/0/42 enters discarding
Jul 23 2017 07:13:25+08:00 S9700_01 %%01MSTP/6/SET_PORT_LEARNING(l)[85837]:In process 0 instance 0, MSTP set port GigabitEthernet9/0/42 state as learning. --- port enters learning
Jul 23 2017 07:13:40+08:00 S9700_01 %%01MSTP/6/SET_PORT_FORWARDING(l)[85838]:In MSTP process 0 instance 0, MSTP set port GigabitEthernet9/0/42 state as forwarding. --- port enters forwarding
Jul 23 2017 07:14:24+08:00 S9700_01 L2IFPPI/4/MFLPVLANALARM:OID 1.3.6.1.4.1.2011.5.25.160.3.7 MAC move detected, VlanId = 853, MacAddress = 0080-87db-a942, Original-Port = GE11/0/44, Flapping port = Eth-Trunk1. Please check the network accessed to flapping port. --- MAC flapping occurs
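The 15s intervals in step 1 can be checked directly from the log timestamps. The sketch below is only an illustration: it assumes the timestamp prefix format shown in the S9700_01 logs, and `parse_ts` and `transition_intervals` are hypothetical helper names.

```python
from datetime import datetime

# State changes of GigabitEthernet9/0/42, copied from the S9700_01 logs above.
EVENTS = [
    ("discarding", "Jul 23 2017 07:13:10+08:00"),
    ("learning", "Jul 23 2017 07:13:25+08:00"),
    ("forwarding", "Jul 23 2017 07:13:40+08:00"),
]


def parse_ts(stamp):
    """Parse the log timestamp; the +08:00 offset is dropped because
    every entry comes from the same device and time zone."""
    return datetime.strptime(stamp.split("+", 1)[0], "%b %d %Y %H:%M:%S")


def transition_intervals(events):
    """Return (from_state, to_state, seconds) for each consecutive pair."""
    result = []
    for (prev_state, prev_ts), (next_state, next_ts) in zip(events, events[1:]):
        seconds = (parse_ts(next_ts) - parse_ts(prev_ts)).total_seconds()
        result.append((prev_state, next_state, seconds))
    return result


if __name__ == "__main__":
    for prev_state, next_state, seconds in transition_intervals(EVENTS):
        print(f"{prev_state} -> {next_state}: {seconds:.0f}s")
    # discarding -> learning: 15s
    # learning -> forwarding: 15s
```

Both intervals equal the 15s timeout described in step 1, which is consistent with the port advancing on timers alone rather than in response to received BPDUs.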
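2. Analyze the logs of the third-party device AS_23.
Port GE9/0/42 on S9700_01 connects to port E1/0/47 on the third-party switch AS_23.
According to the logs of this device, before the MAC flapping occurred, one of its ports suddenly entered the learning state, moved to Forwarding 4s later, and the device elected itself as the root bridge. When a higher-priority device exists on the network, a device normally elects itself as root only because link congestion or a board fault prevents BPDUs from being received or delivered to the CPU, so STP times out and sets the port to Forwarding. However, according to the standard, the transition from learning to Forwarding takes 15s. Here it took only 4s, and the timers had not been changed in the configuration, which indicates that STP on this device had malfunctioned.
%Jul 23 07:14:23 2017 <warnings> MODULE_L2_MSTP[tMstp]:MSTP set port = 48, mst = 0 to LEARNING! --- port enters learning
%Jul 23 07:14:27 2017 <warnings> MODULE_L2_MSTP[tMstp]:MSTP set port = 48, mst = 0 to FORWARDING! --- port enters forwarding 4s later
%Jul 23 07:14:28 2017 <warnings> MODULE_L2_MSTP[tMstp]:Root bridge has been changed. PRS is triggered: INSTANCE=0 --- root bridge changes
%Jul 23 07:14:28 2017 <warnings> MODULE_L2_MSTP[tMstp]:CistRoot = 32768.00:01:7a:f8:00:45, CistRegRoot = 32768.00:01:7a:f8:00:45 --- this device becomes the root bridge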
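The 4s figure can be reproduced from the AS_23 timestamps and compared with the 15s value the text cites. This is a rough sketch that assumes the vendor log prefix format shown above; `EXPECTED_FORWARD_DELAY_S`, `parse_vendor_ts`, and `learning_to_forwarding_seconds` are illustrative names only.

```python
from datetime import datetime

# Learning-to-forwarding time cited in the text for unmodified timers.
EXPECTED_FORWARD_DELAY_S = 15

LEARNING_LINE = (
    "%Jul 23 07:14:23 2017 <warnings> MODULE_L2_MSTP[tMstp]:"
    "MSTP set port = 48, mst = 0 to LEARNING!"
)
FORWARDING_LINE = (
    "%Jul 23 07:14:27 2017 <warnings> MODULE_L2_MSTP[tMstp]:"
    "MSTP set port = 48, mst = 0 to FORWARDING!"
)


def parse_vendor_ts(line):
    """Parse the leading '%Mon DD HH:MM:SS YYYY' stamp of an AS_23 log line."""
    stamp = " ".join(line.lstrip("%").split()[:4])
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y")


def learning_to_forwarding_seconds(learning_line, forwarding_line):
    """Seconds the port spent in learning before it was set to forwarding."""
    delta = parse_vendor_ts(forwarding_line) - parse_vendor_ts(learning_line)
    return delta.total_seconds()


if __name__ == "__main__":
    observed = learning_to_forwarding_seconds(LEARNING_LINE, FORWARDING_LINE)
    shortfall = EXPECTED_FORWARD_DELAY_S - observed
    print(f"observed {observed:.0f}s, {shortfall:.0f}s short of the expected delay")
    # observed 4s, 11s short of the expected delay
```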
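When the AS_23 port suddenly entered Forwarding, AS_23 sent BPDUs to S9700_01 that advertised itself as the root bridge. Port 9/0/42 on S9700_01 received these BPDUs, whose priority was lower than its own, entered the discarding state (see the logs in step 1), and sent higher-priority BPDUs to AS_23. After receiving them, AS_23 accepted S9700_01 as the root bridge again (see the logs below) and stopped sending BPDUs to S9700_01. Port 9/0/42 on S9700_01 therefore received no further BPDUs, entered learning after a 15s timeout, and entered forwarding after another 15s (see the logs in step 1).
%Jul 23 08:57:58 2017 <warnings> MODULE_L2_MSTP[tMstp]:Root bridge has been changed. PRS is triggered: INSTANCE=0 --- root bridge changes
%Jul 23 08:57:58 2017 <warnings> MODULE_L2_MSTP[tMstp]:CistRoot = 0.d8:49:0b:8c:e4:a0, CistRegRoot = 32768.00:01:7a:f8:00:45 --- root bridge reverts to the original S9700_01
This shows that AS_23 was at fault: its port went to forwarding, which caused the port state on S9700_01 to flap once and triggered the MAC flapping.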
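The root bridge comparison behind these logs can be worked through with the two bridge IDs that appear in the CistRoot fields. In spanning tree, the bridge with the numerically lowest ID (priority first, then MAC address) wins the election. The sketch below assumes the `priority.MAC` text format printed by AS_23; `parse_bridge_id` and `elect_root` are hypothetical helper names.

```python
def parse_bridge_id(text):
    """Split 'priority.aa:bb:cc:dd:ee:ff' into a comparable (priority, mac) tuple."""
    priority, mac = text.split(".", 1)
    return int(priority), int(mac.replace(":", ""), 16)


def elect_root(bridge_ids):
    """Return the bridge ID that wins the election: lowest priority value,
    with the lowest MAC address as the tie-breaker."""
    return min(bridge_ids, key=parse_bridge_id)


if __name__ == "__main__":
    as_23 = "32768.00:01:7a:f8:00:45"   # CistRoot while AS_23 claimed the root role
    s9700_01 = "0.d8:49:0b:8c:e4:a0"    # CistRoot after the root reverted to S9700_01
    print(elect_root([as_23, s9700_01]))
    # 0.d8:49:0b:8c:e4:a0
```

Priority 0 beats 32768 regardless of MAC address, so AS_23 could hold the root role only while it was not processing the BPDUs from S9700_01.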
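3. After the uplink port on AS_23 that connects to S9700_01 was shut down, the loop was broken and the network recovered.
4. Because the network uses VRRP+RSTP, shutting down the AS_23 uplink to S9700_01 was only a temporary workaround. The customer later replaced AS_23, and testing passed.
Root Cause
The device reported MAC address flapping, its CPU usage was high, and downstream users could not ping the gateway. Together, these symptoms usually mean that a loop in the network is causing a broadcast storm, and the loop must be broken promptly. In this case, the loop was traced to the STP fault on the third-party switch AS_23.
Suggestions and Summary
If a device simultaneously reports MAC address flapping, high CPU usage, and CPCAR drops of protocol packets, a loop in the network is the likely cause. Loops seriously affect LAN stability and must be broken promptly.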
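The rule of thumb above can be turned into a simple log check. The following sketch flags a suspected loop when a MAC move alarm and a CPCAR drop alarm occur close together. The 5-minute window is an arbitrary assumption, `suspect_loop` is a hypothetical helper, and CPU usage is not covered because it does not appear in these log lines.

```python
from datetime import datetime, timedelta


def timestamp(line):
    """Parse the leading 'Mon DD YYYY HH:MM:SS' stamp of an S9700 log line."""
    return datetime.strptime(line.split("+", 1)[0], "%b %d %Y %H:%M:%S")


def suspect_loop(lines, window=timedelta(minutes=5)):
    """Return True if any MAC move alarm and any CPCAR drop alarm
    are no more than `window` apart (window size is an assumption)."""
    mac_moves = [timestamp(line) for line in lines if "MFLPVLANALARM" in line]
    cpcar_drops = [timestamp(line) for line in lines if "CPCAR_DROP" in line]
    return any(
        abs(move - drop) <= window for move in mac_moves for drop in cpcar_drops
    )


if __name__ == "__main__":
    # Shortened copies of two alarm lines from the Alarm Information section.
    logs = [
        "Jul 23 2017 07:14:24+08:00 9700_01 L2IFPPI/4/MFLPVLANALARM:"
        "OID 1.3.6.1.4.1.2011.5.25.160.3.7 MAC move detected, VlanId = 853",
        "Jul 23 2017 07:17:16+08:00 9700_01 %%01DEFD/6/CPCAR_DROP_LPU(l)[85839]:"
        "Rate of packets to cpu exceeded the CPCAR limit on the LPU in slot 8.",
    ]
    print(suspect_loop(logs))
    # True
```

A positive result is only a hint to start looking for a loop, for example by checking which port the MAC addresses flap to, as was done in this case.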