LGWR waits for event ‘DLM cross inst call completion’ 故障排除

客户一套Oracle 19c Dataguard的数据库环境,Standby端总是会间隔性出现较大GAP,同时DB alert log日志出现LGWR (ospid: 105521) waits for event 'DLM cross inst call completion’ for N secs. 的现象,Standby端并未对外提供查询,同时也禁用了多实例日志应用,操作系统资源空闲,LMS进程个数正常,如果关闭其它节点只留apply log并不存在该问题。DLM是Distributed Lock Manager属于RAC架构中核心机制,实现多节点资源共享调度,通过interconnect Network传递请求。下面简单记录一下这个案例:
  • db alert log

PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_3_seq_13586.1479.1077669291
2021-07-12T20:25:29.643687+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_2_seq_14361.1072.1077669019
2021-07-12T20:29:38.183656+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 1 secs.
2021-07-12T20:29:48.137737+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:31:21.952345+08:00
rfs (PID:113884): Selected LNO:26 for T-2.S-14456 dbid 3902007743 branch 1037635587
2021-07-12T20:31:21.987333+08:00
rfs (PID:114704): Error ORA-235 occurred during an un-locked control file
rfs (PID:114704): transaction.  This error can be ignored.  The control
rfs (PID:114704): file transaction will be retried.
2021-07-12T20:31:43.532600+08:00
ARC2 (PID:106404): Archived Log entry 9591 added for T-2.S-14455 ID 0xe894b1bf LAD:1
2021-07-12T20:31:47.151671+08:00
rfs (PID:113882): Selected LNO:31 for T-3.S-13731 dbid 3902007743 branch 1037635587
2021-07-12T20:31:49.116049+08:00
rfs (PID:113880): Selected LNO:22 for T-1.S-13006 dbid 3902007743 branch 1037635587
2021-07-12T20:31:53.393547+08:00
ARC3 (PID:106408): Archived Log entry 9592 added for T-1.S-13005 ID 0xe894b1bf LAD:1
2021-07-12T20:32:02.346585+08:00
ARC2 (PID:106404): Archived Log entry 9593 added for T-3.S-13730 ID 0xe894b1bf LAD:1
2021-07-12T20:33:13.805344+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:33:13.805470+08:00
LGWR (ospid: 105521) is hung in an acceptable location (inwait 0x1.ffff).
2021-07-12T20:33:21.196764+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:33:31.310737+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:33:41.223781+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 1 secs.
2021-07-12T20:33:51.205776+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:34:01.307770+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:34:25.440231+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_2_seq_14362.1867.1077670807
2021-07-12T20:34:44.864009+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_3_seq_13587.691.1077670845
2021-07-12T20:34:45.204773+08:00
PR00 (PID:109603): Media Recovery Log +ARCH/anbob1/ARCHIVELOG/2021_07_12/thread_1_seq_12934.1156.1077670917
2021-07-12T20:36:09.378685+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 2 secs.
2021-07-12T20:36:19.341635+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:36:28.416573+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 0 secs.
2021-07-12T20:36:38.375742+08:00
LGWR (ospid: 105521) waits for event 'DLM cross inst call completion' for 1 secs.

  • LGWR trace

*** 2021-07-12T20:33:43.793041+08:00 ((4))
Received ORADEBUG command (#235) 'dump KSTDUMPCURPROC 1' from process '105470'
-------------------------------------------------------------------------------
Trace Bucket Dump Begin: default bucket for process 47 (osid: 105521, LGWR)
CDB_NAME(CON_ID):CON_UID:TIME(*=approx):SEQ:COMPONENT:FILE@LINE:FUNCTION:SECT/DUMP:SID:SERIAL#: [EVENT#:PID] DATA
-------------------------------------------------------------------------------
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1957:kjci_complete():4466:40278: freeing request 0x20fd651e8 (inst|inc|reqid)=(1|88|823031) with opcode=146 and completion status [DONE]
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1089:kjci_initreq():4466:40278: request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with group (type|id)=(1|1), opcode=146, flags=0x0, msglen=56, where=[kqlmClusterMessage] to target instances=
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1091:kjci_initreq():4466:40278:    1 2
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d4e8 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with opcode=146 from callee (inst|pid|psn)=(1|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d718 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with opcode=146 from callee (inst|pid|psn)=(2|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.784 :KJCI:kjci.c@1957:kjci_complete():4466:40278: freeing request 0x20fd651e8 (inst|inc|reqid)=(1|88|823032) with opcode=146 and completion status [DONE]
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1089:kjci_initreq():4466:40278: request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with group (type|id)=(1|1), opcode=146, flags=0x0, msglen=56, where=[kqlmClusterMessage] to target instances=
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1091:kjci_initreq():4466:40278:    1 2
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d4e8 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with opcode=146 from callee (inst|pid|psn)=(1|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1618:kjci_processcrq():4466:40278: processing reply 0x2cff2d718 for request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with opcode=146 from callee (inst|pid|psn)=(2|36|1)
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1957:kjci_complete():4466:40278: freeing request 0x20fd651e8 (inst|inc|reqid)=(1|88|823033) with opcode=146 and completion status [DONE]
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1089:kjci_initreq():4466:40278: request 0x20fd651e8 (inst|inc|reqid)=(1|88|823034) with group (type|id)=(1|1), opcode=146, flags=0x0, msglen=56, where=[kqlmClusterMessage] to target instances=
IRMSDB(4):3247498417:2021-07-12 20:33:42.785 :KJCI:kjci.c@1091:kjci_initreq():4466:40278:    1 2

**KJCJ ** ==> (kjci)_processcrq – kernel lock management communication cross instance call

跨节点的通信,MOS中不存在已知BUG。先分析网络问题,也可以从进程blocker做SSD或查看hangmgr Trace。Oracle 19c CRS中AHF框架自带了OSW。

  • OSW netstat data

zzz ***Tue Jul 13 00:59:51 CST 2021
...
#kernel
IpInReceives                    1456201695         0.0
IpInHdrErrors                   0                  0.0
IpInAddrErrors                  0                  0.0
IpForwDatagrams                 0                  0.0
IpInUnknownProtos               0                  0.0
IpInDiscards                    0                  0.0
IpInDelivers                    1085210966         0.0
IpOutRequests                   1007206469         0.0
IpOutDiscards                   5280               0.0
IpOutNoRoutes                   8                  0.0
IpReasmTimeout                  6333500            0.0
IpReasmReqds                    408470736          0.0
IpReasmOKs                      37504539           0.0
IpReasmFails                    8651478            0.0
IpFragOKs                       29029579           0.0

Note:
当前存在较高的IP重组失败包,这是一个累计值,下面可以查看日常变化。

  • 查看日常IP重组失败情况

awk '/zzz/{d=$3"/"$4" "$5}/IpReasmFails/{curr=$2;diff=curr-prev;if(diff>5)print d,diff,prev,curr;prev=curr}' *.dat
Jul/13 00:00:16 8620039  8620039
Jul/13 00:00:46 185 8620039 8620224
Jul/13 00:01:16 242 8620224 8620466
Jul/13 00:01:46 324 8620466 8620790
Jul/13 00:02:16 279 8620790 8621069
Jul/13 00:02:46 325 8621069 8621394
Jul/13 00:03:16 325 8621394 8621719
Jul/13 00:03:46 247 8621719 8621966
Jul/13 00:04:16 246 8621966 8622212
Jul/13 00:04:46 210 8622212 8622422
Jul/13 00:05:16 327 8622422 8622749
Jul/13 00:05:46 247 8622749 8622996
Jul/13 00:06:16 238 8622996 8623234
Jul/13 00:06:46 219 8623234 8623453
Jul/13 00:07:16 262 8623453 8623715
Jul/13 00:07:46 254 8623715 8623969
Jul/13 00:08:16 179 8623969 8624148
Jul/13 00:08:46 294 8624148 8624442

Note:
可见平时也存在较高的IP重组失败,下面尝试使用ping验证网络

  • 使用ping验证

— on node1

ping -s 4000 {node2-privateIP}
Note:

这里忘记保留历史输出,发现有12% package loss,说明当前和心跳网络并不健康,不过使用的是两个网卡做的BOND,当前是Active-Backup主备模式,可以尝试切换另一个网卡。

  • 网卡切换

cat /proc/net/bonding/bond0

Note:

检查到当前主卡是ens9f0,切换到备卡ens9f1

ifenslave -c bond0 ens9f1

做了主备网卡切换后:ping正常、IP重组失败消失、DLM cross inst call completion未在出现、DG同步正常、问题得到解决。

(0)

相关推荐

  • 3dmax软件缺少ColorCorrect.dlm文件怎么办?

    3dmax软件缺少ColorCorrect.dlm文件怎么办? 在3dmax导入文件模型的时候,遇到提示ColorCorrect.dlm缺少,遇到这种问题一般是因为在我们使用这个插件,在以往老版本的3 ...

  • 国际红十字运动金曲:Carry Me(The Red Cross Song)

    今天是"5·8世界红十字日",和大家分享这首国际红十字运动金曲:Carry Me(The Red Cross Song). "Carry me"(带上我)是红十 ...

  • 浏览器的 Event Loop

    前端技术优选 昨天 以下文章来源于掘金开发者社区 ,作者小蘑菇哥哥 掘金开发者社区 掘金,一个帮助开发者成长的技术社区 前端技术优选 为你精选前端领域优质技术博文,欢迎关注. 57篇原创内容 公众号 ...

  • TPAMI 2021 :基于 event stream 的步态识别,准确率高达90%!

    作者丨张贤同学 审稿丨邓富城 编辑丨极市平台 极市导读 本文介绍了作者被TPAMI接收的工作,基于 event stream 的两种不同表示形式,提出了一种新的基于 event stream 的步态识 ...

  • 2020年最后一天,“跨年”英文怎么说?可别说“cross the year”!

    圣诞节刚过, 跨年也到跟前了 2020是个不平凡的一年 终于把它过完了 想约上好友一起跨年么? 发个短信拽一句英文问问 才发现 我不会表达"跨年"啊 其实 英语中并没有和" ...

  • Hot Cross Buns 热十字面包 - 英语儿歌听唱跳128

    (歌词翻译:武太白) Hot cross buns. Hot cross buns. One a penny, two a penny. Hot cross buns. 热十字面包. 热十字面包. 一 ...

  • LIT.EVENT|世界读书日,故事从一杯咖啡开始

    文穴线上&线下双活动 4.23世界读书日 4.24文穴杭州书友面基日 活动一:4.23世界读书日知识竞赛 为了让大家的阅读生活更加有趣,文穴联合新民说推出了线上读书日知识竞赛活动,在阅读群内以 ...

  • Do Not Cross CrossFit

    作者:BURT HELM 来源:7/8月刊   INC. MAGAZINE 2013 译者:贾斯汀 *图片来源于网络 本文是我在查阅Glassman资料时无意中翻到的,里面详细讲述了创始人的发家史以及 ...

  • python笔记12-python多线程之事件(Event)

    前言 小伙伴a,b,c围着吃火锅,当菜上齐了,请客的主人说:开吃!,于是小伙伴一起动筷子,这种场景如何实现 一. Event(事件) Event(事件):事件处理的机制:全局定义了一个内置标志Flag ...

  • 贵,就对了——大阳巧客CROSS旗舰版53800元起

    翘首以盼的大阳巧客"四代机"巧客CROSS终于正式亮相了.昨天在洛阳,面向全国600多名经销商召开了一场盛大的品鉴会,巧客CROSS即将全面上市. 泰国正大集团易初工业集团领导谢傑 ...