案例解读:利用12c渐进式DASH分析"ON CPU"

墨天轮主页:https://www.modb.pro/u/3895

背景

6日下午17:20接到告警CPU使用率高, 环境 Oracle 12cR2 3-nodes RAC on Linux。登录系统查看是实例1活动会话接近200。这是一个<90 cpu的主机。Session status 全是on cpu。当然从我们的监控上很直观定位2条select SQL 占用了50% 左右的db time。首先可能怀疑是SQL 执行量增长或执行计划改变等效率变差了, 这是一个1类系统,需要快速恢复应用,客户要求先kill部分会话, 查看了sql执行计划并未改变, 可是在KILL 部分会话后还在负载逐渐增长,没有多长时间,17:29 主机ping不通了,并且没有重启。

应用还好有配置TAF自动failover到了其它节点,并且运行良好,就当时的负载能把主机压死确实不多见,难道X86就如此脆弱? 下一步如何分析?我们的AWR 周期是30分钟,AWR中关于SQL stat的信息还没收集。当然事后会说为什么不收systemdump , 为什么不收hanganalyze, 为什么不手动收AWR, 为什么不把v$ash 物化下来, 可往往故障就是这么猝不及防,如何根据当前的信息进行分析?

如果此时有套监控采集信息或许能够提供部分信息, 如分析sysstat中sql executes是不是增长了?奇怪这该时间段我们的监控sysstat数据也没有采集到。

当时KILL 部分会话后收集

-- Session Snapper v4.14 BETA - by Tanel Poder ( http://blog.tanelpoder.com ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)
----------------------------------------------------------------------------------------------------Active% | INST | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS---------------------------------------------------------------------------------------------------- 3500% | 1 | 56twj7s93jaz4 | 0 | ON CPU | ON CPU 1700% | 1 | dp8fnkzqdt3km | 0 | ON CPU | ON CPU 1300% | 1 | 05qszn0ufs4ff | 0 | ON CPU | ON CPU 1200% | 1 | 1b9zsamawq6mh | 0 | ON CPU | ON CPU 1100% | 1 | a7kcm21nbngvx | 0 | ON CPU | ON CPU 1100% | 1 | 2v7njuhdw0pgm | 1 | ON CPU | ON CPU 1000% | 1 | 13nf2mwh3xmsh | 1 | ON CPU | ON CPU 900% | 1 | | 0 | ON CPU | ON CPU 700% | 1 | 60t0pum7f1pbm | 1 | ON CPU | ON CPU 600% | 1 | 60t0pum7f1pbm | 2 | ON CPU | ON CPU
--  End of ASH snap 1, end=2021-01-06 17:24:43, seconds=9, samples_taken=1

可以确认当时全时ON CPU, 从上图看负载是从16:58左右开始, 分析这类问题DASH是常用的分析工具,之前总结过ASH相关的BLOG《Know more about Oracle ASH》,前不久同事还问我ASH是何时从memory刷到disk中的, 1,默认AWR的快照间隔;2,如果ASH buffer 已满时MMNL进程都会负责这事。如果17:00 到 17:30的AWR 没有形成,那17:00前那2分钟的DASH(dba_hist_active_sess_history)可以做为分析的入口。在统计top SQL 历史快照时发现实例1其实16:30 – 17:00 AWR snapshot也没形成。问题时间段跨越的2个AWR snapshot都没生成,那DASH 真没有数据了吗?11G前这情况是没有DASH了,从12c开始不是的。

12c ASH 从memroy刷到disk形为改变

检查DASH数据是否存在?

SQL> select to_char(sample_time,'yyyymmdd hh24:mi'),count(*) --not * 10from dba_hist_active_sess_history where sample_time >to_date('2021-01-06 17','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24') and instance_number=1 group by to_char(sample_time,'yyyymmdd hh24:mi') order by 1;
TO_CHAR(SAMPLE COUNT(*)-------------- ----------20210106 17:00 50320210106 17:01 27320210106 17:02 57220210106 17:03 60120210106 17:04 60720210106 17:05 29320210106 17:06 57120210106 17:07 55920210106 17:08 55320210106 17:09 28620210106 17:10 51020210106 17:11 45820210106 17:12 44320210106 17:13 43420210106 17:14 45520210106 17:15 71020210106 17:16 45720210106 17:17 43320210106 17:18 48120210106 17:19 50320210106 17:20 25920210106 17:21 52720210106 17:22 53020210106 17:23 52920210106 17:24 209
25 rows selected.

Note:
实例1 crash(17:28)之前的ASH数据基本上都在,这也正是oracle ASH即使在系统负载很高时也可以很好的工作。检查了db alert log没有提示ASH buffer 不足的问题, 那ASH是什么刷新频率刷到DISK上的呢?

SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1;
MAX(SAMPLE_TIME) SYSDATE--------------------------------------------------------------------------- -----------------14-JAN-21 10.54.32.234 PM 20210114 22:58:23
SQL> select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1;
MAX(SAMPLE_TIME) SYSDATE--------------------------------------------------------------------------- -----------------14-JAN-21 10.54.32.234 PM 20210114 22:59:52
SQL> r 1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1
MAX(SAMPLE_TIME) SYSDATE--------------------------------------------------------------------------- -----------------14-JAN-21 10.59.54.686 PM 20210114 23:00:56
SQL> r 1* select max(sample_time),sysdate from dba_hist_active_sess_history where instance_number=1
MAX(SAMPLE_TIME) SYSDATE--------------------------------------------------------------------------- -----------------14-JAN-21 11.04.26.622 PM   20210114 23:07:58

Note:
非AWR snapshot flush时间,DASH也在逐渐的更新最新数据,基本上可以判断,当前的ASH是大概5分钟更新到DASH(dba_hist_active_sess_history)中, 那继续查找一下相关的ASH参数。

SQL> select 2 n.indx 3 , to_char(n.indx, 'XXXX') i_hex 4 , n.ksppinm pd_name 5 , c.ksppstvl pd_value 6 , n.ksppdesc pd_descr 7 from sys.x$ksppi n, sys.x$ksppcv c 8 where n.indx=c.indx 9 and 10 lower(n.ksppinm) || ' ' || lower(n.ksppdesc) like lower('%\_ash%') 11 escape '\' ;
INDX I_HEX NAME VALUE DESCRIPTION---------- ----- ------------------------------------ ----------- ---------------------------------------------------------------------- 4546 11C2 _ash_sampling_interval 1000 Time interval between two successive Active Session samples in millisecs
4547 11C3 _ash_size 1048618 To set the size of the in-memory Active Session History buffers 4548 11C4 _ash_enable TRUE To enable or disable Active Session sampling and flushing 4549 11C5 _ash_disk_write_enable TRUE To enable or disable Active Session History flushing 4550 11C6 _ash_disk_filter_ratio 10 Ratio of the number of in-memory samples to the number of samples actually written to disk
4551 11C7 _ash_eflush_trigger 66 The percentage above which if the in-memory ASH is full the emergency flusher will be triggered
4552 11C8 _ash_sample_all FALSE To enable or disable sampling every connected session including ones waiting for idle waits
4553 11C9 _ash_dummy_test_param 0 Oracle internal dummy ASH parameter used ONLY for testing! 4554 11CA _ash_min_mmnl_dump 90 Minimum Time interval passed to consider MMNL Dump 4555 11CB _ash_compression_enable TRUE To enable or disable string compression in ASH 4556 11CC _ash_progressive_flush_interval 300 ASH Progressive Flush interval in secs
11 rows selected.

Note:
_ash_progressive_flush_interval 这个隐藏参数值300秒,描述也说明了它是控制ASH渐进式刷新频率秒数,该参数11G中并不存在,与之相关的TIPs几乎没有, 不过我们可以判断从12c除了上面提到的2种ASH flush形为,还有第3种,每300秒渐进式ASH也会flush到disk. 不得不感叹ORACLE一直在悄悄改善的更加完美, 这么实用的功能没有得到广泛的宣传。

从DASH 分析的SQL效率

先从现有的dba_hist_sqlstat分析TOP SQL的执行情况

脚本sql_hist.sql

Summary Execution Statistics Over Time of SQL_ID:dp8fnkzqdt3km Avg AvgSnapshot Avg LIO Avg PIO CPU (secs) Elapsed (secs)Beg Time INSTANCE_NUMBER Execs Per Exec Per Exec Per Exec Per Exec------------ --------------- ------------ ------------------- --------------- --------------- -------------------06-JAN 14:00 1 824 362,726.48 3.82 1.56 1.5706-JAN 14:00 2 841 0.00 6.04 1.58 1.5906-JAN 14:30 1 1,084 358,591.06 0.80 1.52 1.5306-JAN 14:30 2 725 239,373.03 10.00 1.35 1.3606-JAN 15:00 1 1,303 335,387.17 0.36 1.56 1.5606-JAN 15:00 2 903 254,055.09 2.25 1.37 1.3806-JAN 15:30 2 850 315,445.81 18.58 1.63 1.6606-JAN 15:30 1 1,238 335,970.25 1.11 1.50 1.5006-JAN 16:00 1 1,048 339,324.59 1.46 1.54 1.5506-JAN 16:00 2 765 310,440.36 1.66 1.43 1.4406-JAN 16:30 2 703 273,389.33 21.77 1.48 1.5006-JAN 17:00 2 595 238,034.54 67.25 1.33 1.4606-JAN 17:30 2 450 313,110.14 215.90 1.55 1.7206-JAN 18:00 2 408 312,342.91 59.53 1.50 1.5806-JAN 18:30 1 64 305,746.14 3,045.39 2.78 4.1506-JAN 18:30 2 152 395,698.88 224.32 1.94 2.0606-JAN 19:00 1 75 378,465.49 249.27 2.08 2.1606-JAN 19:00 2 122 346,031.39 33.33 1.88 1.9006-JAN 19:30 2 31 237,640.52 30.10 1.59 1.6106-JAN 19:30 1 77 193,193.10 37.27 1.49 1.5106-JAN 20:00 1 88 317,937.30 347.97 1.67 1.7606-JAN 20:00 2 26 253,282.27 6.77 1.41 1.4206-JAN 20:30 1 20 430,854.10 1.65 2.42 2.4306-JAN 20:30 2 18 344,005.61 75.72 2.42 2.4706-JAN 21:00 1 5 306,552.00 25.60 1.53 1.5606-JAN 21:00 2 13 342,130.31 2.77 2.31 2.3206-JAN 21:30 1 35 6,480.89 2.60 0.06 0.0607-JAN 07:30 1 12 396,023.42 2,950.17 2.50 3.9607-JAN 07:30 2 17 2,123,654.59 3,804.82 6.46 8.3507-JAN 08:00   1 144 225,576.83 489.54   1.35 1.76

Note:
当然没有问题时间段的实例1 sql的执行情况,但是可以判断基本上每次执行20-40万的逻辑读,耗费时间为1-2秒,另外每天7:00和实例1刚启动时因为第一次物理读,单次执行也是在4-8秒之间。

看一下具体的负载趋势

SQL> create table dbmt.dash0106 tablespace users as select * from dba_hist_active_sess_history where sample_time >to_date('2021-01-06 16','yyyy-mm-dd hh24') and sample_time <to_date('2021-01-06 18','yyyy-mm-dd hh24') ; Table created. SQL> select * from ( 2 select etime,nvl(event,'on cpu') events,sql_id, dbtime, cnt,first_time,end_time, 3 round(100*ratio_to_report(dbtime) OVER (partition by etime ),2) pct,row_number() over(partition by etime order by dbtime desc) rn 4 from ( 5 select to_char(SAMPLE_TIME,'yyyymmdd hh24:mi') etime,event,sql_id,count(*)*10 dbtime,count(*) cnt, 6 to_char(min(SAMPLE_TIME),'hh24:mi:ss') first_time,to_char(max(SAMPLE_TIME),'hh24:mi:ss') end_time 7 from dbmt.dash0106 8 --where sample_time between to_date('2015-4-1 16:00','yyyy-mm-dd hh24:mi') and to_date('2015-4-1 17:00','yyyy-mm-dd hh24:mi') 9 where INSTANCE_NUMBER=1 10 group by to_char(SAMPLE_TIME,'yyyymmdd hh24:mi'),event,sql_id 11 ) 12 ) where rn<=5;
ETIME EVENTS SQL_ID DBTIME CNT FIRST_TI END_TIME PCT RN-------------- ---------- ------------- ---------- ---------- -------- -------- ---------- ----------
20210106 16:30 on cpu d506gkxjgw7xr 40 4 16:30:04 16:30:54 7.41 1 on cpu 4vk65sy477mxm 40 4 16:30:24 16:30:54 7.41 2 on cpu 1wcsvyq7xshqz 40 4 16:30:24 16:30:54 7.41 3 on cpu 40 4 16:30:04 16:30:44 7.41 4 on cpu 60dw4cw904b83 30 3 16:30:14 16:30:34 5.56 5
20210106 16:31 on cpu 56twj7s93jaz4 120 12 16:31:24 16:31:24 15.19 1 on cpu b15j01vphnb60 100 10 16:31:14 16:31:34 12.66 2 on cpu dp8fnkzqdt3km 100 10 16:31:04 16:31:54 12.66 3 db file se fmtaf5hf9tm7s 60 6 16:31:04 16:31:54 7.59 4 quential r ead on cpu gm9rz10b0kb50 60 6 16:31:14 16:31:44 7.59 5
20210106 16:32 on cpu dp8fnkzqdt3km 50 5 16:32:15 16:32:45 9.43 1 on cpu 4vk65sy477mxm 40 4 16:32:25 16:32:55 7.55 2 on cpu gvv9dcnsc1jf9 40 4 16:32:25 16:32:55 7.55 3 on cpu d506gkxjgw7xr 40 4 16:32:04 16:32:45 7.55 4 on cpu 0hpuufdajtzch 30 3 16:32:04 16:32:25 5.66 5
20210106 16:33 on cpu b15j01vphnb60 60 6 16:33:05 16:33:55 16.22 1 on cpu dp8fnkzqdt3km 40 4 16:33:15 16:33:45 10.81 2 on cpu 4vk65sy477mxm 40 4 16:33:25 16:33:55 10.81 3 db file se fmtaf5hf9tm7s 30 3 16:33:05 16:33:55 8.11 4 quential r ead gc cr gran fmtaf5hf9tm7s 20 2 16:33:15 16:33:25 5.41 5 t 2-way.........
20210106 16:54 on cpu d506gkxjgw7xr 70 7 16:54:05 16:54:55 15.22 1 on cpu 40 4 16:54:25 16:54:45 8.7 2 gc cr gran fmtaf5hf9tm7s 30 3 16:54:05 16:54:45 6.52 3 t 2-way on cpu 1wcsvyq7xshqz 30 3 16:54:05 16:54:25 6.52 4 on cpu gvv9dcnsc1jf9 30 3 16:54:25 16:54:55 6.52 5
20210106 16:55 on cpu d506gkxjgw7xr 70 7 16:55:05 16:55:45 12.07 1 on cpu 50 5 16:55:05 16:55:45 8.62 2 db file se fmtaf5hf9tm7s 50 5 16:55:05 16:55:55 8.62 3 quential r ead on cpu gm9rz10b0kb50 40 4 16:55:05 16:55:55 6.9 4 on cpu gvv9dcnsc1jf9 40 4 16:55:05 16:55:55 6.9 5
20210106 16:56 on cpu dp8fnkzqdt3km 100 10 16:56:05 16:56:56 17.24 1 on cpu c8t8f3rps66d5 50 5 16:56:05 16:56:46 8.62 2 on cpu gvv9dcnsc1jf9 40 4 16:56:16 16:56:46 6.9 3 on cpu d506gkxjgw7xr 40 4 16:56:05 16:56:56 6.9 4 gc cr gran fmtaf5hf9tm7s 30 3 16:56:16 16:56:56 5.17 5 t 2-way
20210106 16:57 on cpu 56twj7s93jaz4 920 92 16:57:23 16:57:44 27.71 1 on cpu dp8fnkzqdt3km 350 35 16:57:08 16:57:44 10.54 2 on cpu 310 31 16:57:08 16:57:44 9.34 3 on cpu d506gkxjgw7xr 210 21 16:57:08 16:57:44 6.33 4 on cpu gvv9dcnsc1jf9 160 16 16:57:08 16:57:44 4.82 5
20210106 16:58 on cpu 56twj7s93jaz4 1970 197 16:58:08 16:58:57 36.41 1 on cpu dp8fnkzqdt3km 1100 110 16:58:08 16:58:57 20.33 2 on cpu 390 39 16:58:08 16:58:57 7.21 3 on cpu d506gkxjgw7xr 250 25 16:58:08 16:58:57 4.62 4 on cpu gvv9dcnsc1jf9 250 25 16:58:08 16:58:57 4.62 5
20210106 16:59 on cpu 56twj7s93jaz4 1370 137 16:59:25 16:59:53 32.39 1 on cpu dp8fnkzqdt3km 1220 122 16:59:25 16:59:53 28.84 2 on cpu 240 24 16:59:25 16:59:53 5.67 3 on cpu d506gkxjgw7xr 160 16 16:59:25 16:59:53 3.78 4 on cpu 05qszn0ufs4ff 140 14 16:59:25 16:59:53 3.31 5
20210106 17:00 on cpu dp8fnkzqdt3km 1650 165 17:00:25 17:00:56 32.8 1 on cpu 56twj7s93jaz4 1360 136 17:00:25 17:00:56 27.04 2 on cpu 270 27 17:00:25 17:00:56 5.37 3 on cpu d500bxxqf6dbs 160 16 17:00:25 17:00:56 3.18 4 on cpu 05qszn0ufs4ff 150 15 17:00:25 17:00:56 2.98 5
20210106 17:01 on cpu dp8fnkzqdt3km 900 90 17:01:30 17:01:30 32.97 1 on cpu 56twj7s93jaz4 620 62 17:01:30 17:01:30 22.71 2 on cpu 60t0pum7f1pbm 160 16 17:01:30 17:01:30 5.86 3 on cpu 2v7njuhdw0pgm 110 11 17:01:30 17:01:30 4.03 4 on cpu 100 10 17:01:30 17:01:30 3.66 5
20210106 17:02 on cpu dp8fnkzqdt3km 1890 189 17:02:03 17:02:37 33.04 1 on cpu 56twj7s93jaz4 970 97 17:02:03 17:02:37 16.96 2 on cpu 60t0pum7f1pbm 510 51 17:02:03 17:02:37 8.92 3 on cpu 2v7njuhdw0pgm 460 46 17:02:03 17:02:37 8.04 4 on cpu 05qszn0ufs4ff 190 19 17:02:03 17:02:37 3.32 5
20210106 17:03 on cpu dp8fnkzqdt3km 2090 209 17:03:12 17:03:47 34.78 1 on cpu 56twj7s93jaz4 730 73 17:03:12 17:03:47 12.15 2 on cpu 2v7njuhdw0pgm 660 66 17:03:12 17:03:47 10.98 3 on cpu 60t0pum7f1pbm 470 47 17:03:12 17:03:47 7.82 4 on cpu 320 32 17:03:12 17:03:47 5.32 5
20210106 17:04 on cpu dp8fnkzqdt3km 2100 210 17:04:19 17:04:58 34.6 1 on cpu 2v7njuhdw0pgm 550 55 17:04:19 17:04:58 9.06 2 on cpu 56twj7s93jaz4 550 55 17:04:19 17:04:58 9.06 3 on cpu 60t0pum7f1pbm 360 36 17:04:19 17:04:58 5.93 4 on cpu 260 26 17:04:19 17:04:58 4.28 5
20210106 17:05 on cpu dp8fnkzqdt3km 990 99 17:05:33 17:05:33 33.79 1 on cpu 2v7njuhdw0pgm 240 24 17:05:33 17:05:33 8.19 2 on cpu 56twj7s93jaz4 230 23 17:05:33 17:05:33 7.85 3 on cpu 60t0pum7f1pbm 190 19 17:05:33 17:05:33 6.48 4 on cpu a7kcm21nbngvx 180 18 17:05:33 17:05:33 6.14 5
20210106 17:06 on cpu dp8fnkzqdt3km 1980 198 17:06:05 17:06:40 34.68 1 on cpu 2v7njuhdw0pgm 380 38 17:06:05 17:06:40 6.65 2 on cpu 56twj7s93jaz4 380 38 17:06:05 17:06:40 6.65 3 on cpu a7kcm21nbngvx 380 38 17:06:05 17:06:40 6.65 4 on cpu 60t0pum7f1pbm 320 32 17:06:05 17:06:40 5.6 5
20210106 17:07 on cpu dp8fnkzqdt3km 1870 187 17:07:14 17:07:47 33.45 1 on cpu 2v7njuhdw0pgm 400 40 17:07:14 17:07:47 7.16 2 on cpu a7kcm21nbngvx 320 32 17:07:14 17:07:47 5.72 3 on cpu f018b0x00auxp 310 31 17:07:14 17:07:47 5.55 4 on cpu 60t0pum7f1pbm 300 30 17:07:14 17:07:47 5.37 5
20210106 17:08 on cpu dp8fnkzqdt3km 1820 182 17:08:18 17:08:51 32.91 1 on cpu 2v7njuhdw0pgm 420 42 17:08:18 17:08:51 7.59 2 on cpu f018b0x00auxp 400 40 17:08:18 17:08:51 7.23 3 on cpu 13nf2mwh3xmsh 390 39 17:08:18 17:08:51 7.05 4 on cpu 60t0pum7f1pbm 330 33 17:08:18 17:08:51 5.97 5
20210106 17:09 on cpu dp8fnkzqdt3km 910 91 17:09:26 17:09:26 31.82 1 on cpu 2v7njuhdw0pgm 310 31 17:09:26 17:09:26 10.84 2 on cpu f018b0x00auxp 220 22 17:09:26 17:09:26 7.69 3 on cpu 60t0pum7f1pbm 180 18 17:09:26 17:09:26 6.29 4 on cpu 13nf2mwh3xmsh 150 15 17:09:26 17:09:26 5.24 5
20210106 17:10 on cpu dp8fnkzqdt3km 1830 183 17:10:02 17:10:35 35.88 1 on cpu 2v7njuhdw0pgm 530 53 17:10:02 17:10:35 10.39 2 on cpu 13nf2mwh3xmsh 290 29 17:10:02 17:10:35 5.69 3 on cpu f018b0x00auxp 280 28 17:10:02 17:10:35 5.49 4 on cpu 270 27 17:10:02 17:10:35 5.29 5
...
20210106 17:23 on cpu dp8fnkzqdt3km 1390 139 17:23:09 17:23:40 26.28 1 on cpu 56twj7s93jaz4 770 77 17:23:09 17:23:40 14.56 2 on cpu 60t0pum7f1pbm 330 33 17:23:09 17:23:40 6.24 3 on cpu 05qszn0ufs4ff 260 26 17:23:09 17:23:40 4.91 4 on cpu 250 25 17:23:09 17:23:40 4.73 5
20210106 17:24 on cpu 56twj7s93jaz4 390 39 17:24:07 17:24:07 18.66 1 on cpu 160 16 17:24:07 17:24:07 7.66 2 on cpu 60t0pum7f1pbm 140 14 17:24:07 17:24:07 6.7 3 on cpu 05qszn0ufs4ff 130 13 17:24:07 17:24:07 6.22 4 on cpu a7kcm21nbngvx 110 11 17:24:07 17:24:07 5.26 5

425 rows selected.

Note:
从16:57开始的, DBtime 呈上升趋势,并且wait event: ON CPU, 也能看到TOP SQL。

正常时间段DASH中TOP sql

SQL> select session_id,to_char(sample_time,'yyyymmdd hh24:mi:ss') etime,event,sql_exec_id,SQL_EXEC_START,session_state,TIME_WAITED,IN_SQL_EXECUTION,TM_DELTA_CPU_TIME from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:45','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 16:54','yyyy-mm-dd hh24:mi') and sql_id='dp8fnkzqdt3km' order by 1,2;
SESSION_ID ETIME EVENT SQL_EXEC_ID SQL_EXEC_START SESSION TIME_WAITED I TM_DELTA_CPU_TIME---------- ----------------- ---------- ----------- ----------------- ------- ----------- - ----------------- 12 20210106 16:49:22 20337054 20210106 16:49:20 ON CPU 0 Y 5824114 12 20210106 16:49:53 20337078 20210106 16:49:51 ON CPU 0 Y 2731150 467 20210106 16:45:06 41876133 20210106 16:45:05 ON CPU 0 Y 12821930 921 20210106 16:52:14 20337167 20210106 16:52:13 ON CPU 0 Y 9083661 1229 20210106 16:48:48 41876201 20210106 16:48:45 ON CPU 0 Y 28963774 1532 20210106 16:45:10 20336897 20210106 16:45:06 ON CPU 0 Y 12105818 1532 20210106 16:51:43 20337139 20210106 16:51:43 ON CPU 0 Y 9087716 1535 20210106 16:47:52 20337014 20210106 16:47:51 ON CPU 0 Y 6957901 1535 20210106 16:48:02 20337023 20210106 16:48:01 ON CPU 0 Y 6100204 1537 20210106 16:47:11 20336989 20210106 16:47:11 ON CPU 0 Y 10426243 1559 20210106 16:48:12 20337027 20210106 16:48:11 ON CPU 0 Y 20086346 1559 20210106 16:48:22 20337032 20210106 16:48:17 ON CPU 0 Y 5404968 1559 20210106 16:50:13 20337087 20210106 16:50:12 ON CPU 0 Y 5066286 2292 20210106 16:49:02 20337039 20210106 16:49:00 ON CPU 0 Y 6094836 2435 20210106 16:46:16 41876161 20210106 16:46:15 ON CPU 0 Y 16064789 2741 20210106 16:49:02 20337040 20210106 16:49:01 ON CPU 0 Y 5135847 2746 20210106 16:45:31 20336918 20210106 16:45:30 ON CPU 0 Y 559010   2746 20210106 16:46:21   20336951 20210106 16:46:20 ON CPU   0 Y   442342

TIP:
主要是看同一session 同一SQL的执行情况,通过session_id(session_serial#)和sql_exec_id来确认,sql_exec_id在同一SESSION执行sql会累增,sql_exec_id不变再看SQL_EXEC_START SQL开始执行的时间。可见正常时间SQL持续运行的时间都是<2秒的。

问题时间段的DASH中TOP sql

SQL> select session_id,--SESSION_SERIAL#, (used to uniquely identify a session's objects) to_char(sample_time,'yyyymmdd hh24:mi:ss') etime,event,sql_exec_id,SQL_EXEC_START,session_state,TIME_WAITED,IN_SQL_EXECUTION,TM_DELTA_CPU_TIME from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:58','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 17:10','yyyy-mm-dd hh24:mi') and sql_id='dp8fnkzqdt3km' order by 1,2

SESSION_ID ETIME EVENT SQL_EXEC_ID SQL_EXEC_START SESSION TIME_WAITED I TM_DELTA_CPU_TIME---------- ----------------- ---------- ----------- ----------------- ------- ----------- - ----------------- 9 20210106 17:02:03 20337526 20210106 17:02:01 ON CPU 0 Y 3050637 9 20210106 17:02:37 20337526 20210106 17:02:01 ON CPU 0 Y 8411155 9 20210106 17:03:12 20337526 20210106 17:02:01 ON CPU 0 Y 7732513 9 20210106 17:03:47 20337526 20210106 17:02:01 ON CPU 0 Y 8855886 9 20210106 17:04:19 20337526 20210106 17:02:01 ON CPU 0 Y 6742950 9 20210106 17:04:58 20337526 20210106 17:02:01 ON CPU 0 Y 8861440 9 20210106 17:05:33 20337526 20210106 17:02:01 ON CPU 0 Y 7035218 9 20210106 17:06:05 20337526 20210106 17:02:01 ON CPU 0 Y 7703591 9 20210106 17:06:40 20337526 20210106 17:02:01 ON CPU 0 Y 8466669 12 20210106 16:58:08 20337396 20210106 16:57:38 ON CPU 0 Y 7636384 12 20210106 16:58:31 20337396 20210106 16:57:38 ON CPU 0 Y 6840769 12 20210106 16:58:57 20337396 20210106 16:57:38 ON CPU 0 Y 6953856 12 20210106 16:59:25 20337396 20210106 16:57:38 ON CPU 0 Y 6760293 12 20210106 16:59:53 20337396 20210106 16:57:38 ON CPU 0 Y 6476894 12 20210106 17:00:25 20337396 20210106 16:57:38 ON CPU 0 Y 6974848 12 20210106 17:00:56 20337396 20210106 16:57:38 ON CPU 0 Y 5540165 12 20210106 17:01:30 20337396 20210106 16:57:38 ON CPU 0 Y 6374697 12 20210106 17:02:03 20337396 20210106 16:57:38 ON CPU 0 Y 6177705 12 20210106 17:02:37 20337396 20210106 16:57:38 ON CPU 0 Y 5990692 12 20210106 17:03:12 20337396 20210106 16:57:38 ON CPU 0 Y 6111413 12 20210106 17:03:47 20337396 20210106 16:57:38 ON CPU 0 Y 6873905 12 20210106 17:04:19 20337396 20210106 16:57:38 ON CPU 0 Y 5680029 12 20210106 17:04:58 20337396 20210106 16:57:38 ON CPU 0 Y 6661646 12 20210106 17:05:33 20337396 20210106 16:57:38 ON CPU 0 Y 5920078 12 20210106 17:06:05 20337396 20210106 16:57:38 ON CPU 0 Y 6712145 12 20210106 17:06:40 20337396 20210106 16:57:38 ON CPU 0 Y 6765375 12 20210106 17:07:14 20337396 20210106 16:57:38 ON CPU 0 Y 5585508 12 20210106 17:07:47 20337396 20210106 16:57:38 ON CPU 0 Y 6002971 12 20210106 17:08:18 20337396 20210106 16:57:38 ON CPU 0 Y 6159024 12 20210106 17:08:51 20337396 20210106 16:57:38 ON CPU 0 Y 6594718 12 20210106 17:09:26 20337396 20210106 16:57:38 ON CPU 0 Y 6444770 40 20210106 17:03:47 20337544 20210106 17:03:13 ON CPU 0 Y 8750964 40 20210106 17:04:19 20337544 20210106 17:03:13 ON CPU 0 Y 5033203 40 20210106 17:04:58 20337544 20210106 17:03:13 ON CPU 0 Y 6892512 40 20210106 17:05:33 20337544 20210106 17:03:13 ON CPU 0 Y 6081303 40 20210106 17:06:05 20337544 20210106 17:03:13 ON CPU 0 Y 6165369 40 20210106 17:06:40 20337544 20210106 17:03:13 ON CPU 0 Y 6525653 40 20210106 17:07:14 20337544 20210106 17:03:13 ON CPU 0 Y 5790222 40 20210106 17:07:47 20337544 20210106 17:03:13 ON CPU 0 Y 6261831 40 20210106 17:08:18 20337544 20210106 17:03:13 ON CPU 0 Y 5841272 40 20210106 17:08:51 20337544 20210106 17:03:13 ON CPU 0 Y 6215711 40 20210106 17:09:26 20337544 20210106 17:03:13 ON CPU 0 Y 6987163 44 20210106 17:00:56 20337496 20210106 17:00:34 ON CPU 0 Y 7026321 44 20210106 17:01:30 20337496 20210106 17:00:34 ON CPU 0 Y 8437444 44 20210106 17:02:03 20337496 20210106 17:00:34 ON CPU 0 Y 8608695 44 20210106 17:02:37 20337496 20210106 17:00:34 ON CPU 0 Y 8561660 44 20210106 17:03:12 20337496 20210106 17:00:34 ON CPU 0 Y 8350169 181 20210106 16:58:08 20337392 20210106 16:57:35 ON CPU 0 Y 10671584...

Note:
这个时段的sql个别会话单次执行时间4分钟,10分钟以上。且这时间段没有I/O 类(物理读)wait event 全是ON CPU,业务也确认该时间没有太多数据变化, 从业务failover到其它节点后的执行效率也基本上是2秒内和之前一样,也能判断不是数据变化问题。

当然2个时间段的SQL 执行计划也一致

SQL> select to_char(sample_time,'yyyymmdd hh24:mi') etime,sql_id,SQL_PLAN_HASH_VALUE,count(*) from dbmt.dash0106 where sample_time >to_date('2021-01-06 16:40','yyyy-mm-dd hh24:mi') and sample_time <to_date('2021-01-06 17:10','yyyy-mm-dd hh24:mi') and sql_id in('dp8fnkzqdt3km','56twj7s93jaz4') group by to_char(sample_time,'yyyymmdd hh24:mi'),sql_id,SQL_PLAN_HASH_VALUE order by 2,1;
ETIME SQL_ID SQL_PLAN_HASH_VALUE COUNT(*)-------------- ------------- ------------------- ----------20210106 16:40 dp8fnkzqdt3km 2900077901 1120210106 16:41 dp8fnkzqdt3km 2900077901 420210106 16:42 dp8fnkzqdt3km 2900077901 720210106 16:43 dp8fnkzqdt3km 2900077901 9...20210106 16:51 dp8fnkzqdt3km 2900077901 320210106 16:52 dp8fnkzqdt3km 2900077901 620210106 16:53 dp8fnkzqdt3km 2900077901 920210106 16:54 dp8fnkzqdt3km 2900077901 620210106 16:55 dp8fnkzqdt3km 2900077901 620210106 16:56 dp8fnkzqdt3km 2900077901 1120210106 16:57 dp8fnkzqdt3km 2900077901 4720210106 16:58 dp8fnkzqdt3km 2900077901 11120210106 16:59 dp8fnkzqdt3km 2900077901 12320210106 17:00 dp8fnkzqdt3km 2900077901 16820210106 17:01 dp8fnkzqdt3km 2900077901 9520210106 17:02 dp8fnkzqdt3km 2900077901 19320210106 17:03 dp8fnkzqdt3km 2900077901 213...20210106 17:08 dp8fnkzqdt3km 2900077901 18620210106 17:09 dp8fnkzqdt3km 2900077901 95

其它SQL也存在这种现象。执行计划一致,数据无变化,短时间内单次执行时间从秒级放大分钟级,并且session state 也是在ON CPU, 这段时间在buffer gets做运算,所以怀疑当时buffer gets出现系统调用问题。

主机什么原因CRASH

主机上的messages log服务问题没有生成日志,但是硬件工程师确认是因为物理内存损坏了,时间点也基本和数据库负载突增吻合, 所以有理由怀疑是在主机CRASH前,内存异常,导致数据库同一类SQL的内存内数据buffer get异常,单次SQL执行时间变长,最终导致负载的积压。

本次数据库负载异常或故障突然CRASH,而AWR snapshot没有形成时,在12c后中的ASH每5分钟逐渐式flush disk,不会刷新太频繁而增加系统负载,也不会等到AWR SNAPSHOT时间大粒度间隔而突然重启而ASH数据缺失无法分析。本次就是利用DASH中SQL两个时间段的SQL执行持续时间判断SQL变慢而导致的业务积压,而非SQL执行量增加,或执行计划变化。

作者

张维照:云和恩墨技术总监,Oracle ACE-A。2006年起从事数据库管理工作,2009年起从事ORACLE DBA维护工作,十余年来专注于Database 技术和架构的研究,热衷于oracle数据库故障诊断、性能优化、内部原理、新特性的学习与分享,在BLOG分享大量的学习和案例经验。从事过多套TB级省级工商、医疗、交通、人社、政府、电信运营商等行业数据库项目从业经验。个人网站 www.anbob.com

墨天轮原文链接:https://www.modb.pro/db/43956(复制到浏览器中打开或者点击“阅读原文”立即查看)

- end-

(0)

相关推荐