oracle中hanganalyze怎么用

发布时间：2021-11-11 14:06:03 作者：小新
来源：亿速云阅读：175

这篇文章给大家分享的是有关oracle中hanganalyze怎么用的内容。小编觉得挺实用的，因此分享给大家做个参考，一起跟随小编过来看看吧。

hanganalyze有两种使用方法：
a.alter session set events 'immediate trace name hanganalyze level <level>';
会话级别，单实例和rac都可以使用。
b.--for 单实例
oradebug hanganalyze <level>

--for RAC
oradebug setmypid --跟踪当前会话
oradebug setinst all --rac环境
oradebug -g def hanganalyze <level>
oradebug unlimit   --取消trace文件大小限制
oradebug tracefile_name --查看trace文件名及位置

建议oradebug hanganalyze这一条命令多执行几次，这样可以比较进程的变化情况，查看是真的hang了，还是很慢
--------------------------------------------
此外oredebug工具常见的用处如下：
oradebug setospid --跟踪系统进程
oradebug setorapid --跟踪ORACLE进程
-------------------------------------------

level等级和概念：
1-2:只有hanganalyze输出，不dump任何进程
3：level2+dump出在in_hang状态的进程
4：level3+dump出在等待链里面的blockers(状态为leaf/leaf_nw/ign_dmp)
5：level4+dump出所有在等待链中的进程（状态为nleaf）
Oracle官方建议不要超过level 3。level 3 对于分析问题足够了。超过level 3会给系统带来额外负担。

了解了用法以后，我用library cache lock,library cache pin以及TX锁通过上面的方法得到一个trace文件，通过trace文件分析问题以及顺便解释一下如何读懂这个trace文件。
首先产生一个TX锁：
第一个session：
02:05:40 SQL> select sid,serial# from v$session where sid=(select userenv('SID') from dual);

       SID    SERIAL#
---------- ----------
        32      45444

1 row selected.
02:05:55 SQL> create table t1 (id int);

Table created.
02:07:20 SQL> insert into t1 values(1);

1 row created.
02:07:36 SQL> commit;

Commit complete.
02:07:39 SQL> update t1 set id=2 where id=1;
1 row updated.
第二个session去更新同一行，hang住：
02:05:51 SQL> select sid,serial# from v$session where sid=(select userenv('SID') from dual);

       SID    SERIAL#
---------- ----------
        40      17698

1 row selected.

Elapsed: 00:00:00.00
02:06:07 SQL> update t1 set id=3 where id=1;

此时产生的等待事件：
   INST_ID     EVENT# EVENT                                                              COUNT(*)
---------- ---------- ---------------------------------------------------------------- ----------
         1        341 SQL*Net message to client                                                 1
         1        237 enq: TX - row lock contention                                             1

下面模拟library cache pin的产生：
建立两个存储过程：
create or replace procedure aproc
is
begin
   null;
end;
/

create or replace procedure bproc
is
begin
   aproc;
   dbms_lock.sleep(3000);
end;
/

第三个session：
02:06:13 SQL> select sid,serial# from v$session where sid=(select userenv('SID') from dual);

       SID    SERIAL#
---------- ----------
         1         18

1 row selected.

执行存储过程：
exec bproc;

第四个session，编译aproc，hang住：
02:06:24 SQL> select sid,serial# from v$session where sid=(select userenv('SID') from dual);

       SID    SERIAL#
---------- ----------
        17         24

1 row selected.

Elapsed: 00:00:00.00
02:06:26 SQL> alter procedure aproc compile;

此时产生了等待事件：
   INST_ID     EVENT# EVENT                                                              COUNT(*)
---------- ---------- ---------------------------------------------------------------- ----------
         1        237 enq: TX - row lock contention                                             1
         1        341 SQL*Net message to client                                                 1
         1        280 library cache pin                                                         1

第五个session执行aproc：
02:06:51 SQL> select sid,serial# from v$session where sid=(select userenv('SID') from dual);

       SID    SERIAL#
---------- ----------
        44         48

1 row selected.

02:39:45 SQL> alter procedure aproc compile;

这个时候产生了library cache lock
   INST_ID     EVENT# EVENT                                                              COUNT(*)
---------- ---------- ---------------------------------------------------------------- ----------
         1        237 enq: TX - row lock contention                                             1
         1        341 SQL*Net message to client                                                 1
         1        281 library cache lock                                                        1
         1        280 library cache pin                                                         1
这里大概讲一下pin和lock的oracle内部的过程。
比如会话1执行了bproc，会话2执行了编译aproc。此时，会话1持有了bproc和aproc的共享型（share）pin，aproc和bproc的lock未被持有；会话2编译时，需要获得aproc的独占型（exclusive）pin，因为lock的优先级比pin高，所以获得pin之前要先获得aproc的handle的独占型lock，这个过程没有问题，但是在获得aproc独占型pin的时候，会被会话1的共享型pin阻塞。此时aproc的独占型lock被会话2持有，共享型pin被会话1持有；bproc的共享型pin被会话1持有，bproc的独占型lock被会话2持有。
会话1：aproc共享型pin，bproc共享型pin
会话2：aproc独占型lock，等待aproc独占pin（被会话1的aproc共享型pin阻塞）
这个时候如果session3再去编译aproc，会话3先需要获得aproc的独占型lock，后获得独占型pin，获得lock时会被会话2持有的aproc独占型lock阻塞，产生library cache lock等待；
如果会话3去编译bproc，会话3需要先获得bproc的独占型lock，然后获得独占型pin，lock获得没问题，但是在获得独占型pin的时候会被会话1持有bproc的共享型pin阻塞，差生library cache pin等待；
如果会话3去运行aproc，需要获得aproc共享型的lock，但是会被会话2持有的aproc独占型lock阻塞，从而产生过library cache lock等待。
如果会话3去运行bproc，需要获得bproc和aproc共享型pin，aproc共享型pin的获得需要排队，而且排在会话2后面，bproc共享型pin会被会话1持有的bproc独占型pin阻塞，从而产生library cache pin等待。这时如果杀掉会话1，aproc的独占型pin会被会话2获得，会话3还是要等待会话2。

实验验证，完全正确！
当然以上是我自己对pin和lock的理解，如果有异议，欢迎留言。

-------------------------------------------------------------
library cache lock和library cache pin的作用：
Oracle利用Library cache lock和Library cache pin来实现并发控制，Library cache lock是在handle上获取的，而Library cache pin则是在data heap上获取。访问对象时，首先必须获取handle上的lock，然后将访问的数据pin在内存中。lock的作用是控制进程间的并发访问，而pin的作用是保证数据一致性，防止数据在访问时被交换出去
参考：http://blog.csdn.net/tianlesoftware/article/details/6641440
----------------------------------------------------------------


现在我们总结一下阻塞情况：
(32.45444)阻塞(40.17698),(40.17698)上产生TX锁。
(1.18)阻塞(17.24)阻塞(44.48)，第一个阻塞产生pin，第二个阻塞产生lock。

下面收集system hanganalyze的trace file；
顺便也收集一下system statedump;

system hanganalyze:
SQL> oradebug hanganalyze 3;
Hang Analysis in /tpsys/app/oracle/diag/rdbms/ogg1/ogg1/trace/ogg1_ora_22133.trc

system statedump:
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug dump systemstate 266
Statement processed.
SQL> oradebug dump systemstate 266
Statement processed.
SQL> oradebug tracefile_name
/tpsys/app/oracle/diag/rdbms/ogg1/ogg1/trace/ogg1_ora_22313.trc

hanganalyze产生的trace文件还是比较容易读懂的，前面都是一些主机和库的信息，我这里忽略了。
这一部分罗列了最有可能导致system hang的chains，发现有两条；当trace文件很多的时候，可以搜索关键字Signature:
more ogg1_ora_22133.trc|grep Signature

Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'
     Chain 1 Signature Hash: 0x38c48850
[b] Chain 2 Signature: 'PL/SQL lock timer'<='library cache pin'<='library cache lock'
     Chain 2 Signature Hash: 0x38f0a7dd

下面详细解释其中一条阻塞链：
-------------------------------------------------------------------------------
Chain 2:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (ogg1.ogg1)
                   os id: 17826
              process id: 33, oracle@oggtest1 (TNS V1-V3)
              session id: 44
        session serial #: 48
    }
    is waiting for 'library cache lock' with wait info:   -----会话44在经历等待，等待信息如下：
    {
                      p1: 'handle address'=0xaa5ad9d8
                      p2: 'lock address'=0xaa605b38
                      p3: '100*mode+namespace'=0x12c3700010003
            time in wait: 3 min 2 sec
           timeout after: never
                 wait id: 45
                blocking: 0 sessions
             current sql: alter procedure aproc compile
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semtimedop()+1
0<-skgpwwait()+156<-ksliwat()+1821<-kslwaitctx()+162<-__PGOSF85_ksfwaitctx()+14<-kglLockWait()+867<-kgllkal()+955<-k
glLock()+989<-kglget()+323<-kkdllk0()+431<-kkdlGetCodeObject()+307<-kkpalt()+353<-opiexe()+14929<-opiosq0()+3942<-kp
ooprx()+295<-kpoal8()+779<-opiodr()+1149<-ttcpip()+1251<-opitsk()+1633<-opiino()+958<-opiodr()+1149<-opidrv()+570<-s
ou2o()+103<-opimai_real()+133<-ssthrdmain()+214<-main()+201<-__libc_start_main()+
            wait history:                              ----------会话44曾经经历过的等待事件，但是不在阻塞链中。
              * time between current wait and wait #1: 0.007729 sec
              1.       event: 'asynch descriptor resize'
                 time waited: 0.000017 sec
                     wait id: 44              p1: 'outstanding #aio'=0x0
                                              p2: 'current aio limit'=0x82
                                              p3: 'new aio limit'=0x92
              * time between wait #1 and #2: 0.001617 sec
              2.       event: 'SQL*Net message from client'
                 time waited: 21.388190 sec
                     wait id: 43              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000012 sec
              3.       event: 'SQL*Net message to client'
                 time waited: 0.000001 sec
                     wait id: 42              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }
    and is blocked by                      --------阻塞会话44的会话信息如下：
=> Oracle session identified by:
    {
                instance: 1 (ogg1.ogg1)
                   os id: 17748
              process id: 32, oracle@oggtest1 (TNS V1-V3)
              session id: 17
        session serial #: 24
    }
    which is waiting for 'library cache pin' with wait info:      -----会话17阻塞了会话44，但是同时会话17也被其他会话阻塞
    {                                                              -----阻塞信息如下：
                      p1: 'handle address'=0xaa5ad9d8
                      p2: 'pin address'=0xaa5f2fd0
                      p3: '100*mode+namespace'=0x12c3700010003
            time in wait: 6 min 25 sec
           timeout after: 8 min 34 sec
                 wait id: 30
                blocking: 1 session
             current sql: alter procedure aproc compile
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semtimedop()+1
0<-skgpwwait()+156<-ksliwat()+1821<-kslwaitctx()+162<-__PGOSF85_ksfwaitctx()+14<-kglpnal()+1882<-kglpin()+978<-kkdll
k0()+967<-kkdlGetCodeObject()+307<-kkpalt()+353<-opiexe()+14929<-opiosq0()+3942<-kpooprx()+295<-kpoal8()+779<-opiodr
()+1149<-ttcpip()+1251<-opitsk()+1633<-opiino()+958<-opiodr()+1149<-opidrv()+570<-sou2o()+103<-opimai_real()+133<-ss
thrdmain()+214<-main()+201<-__libc_start_main()+253<-_start()+36
            wait history:                              ----------会话17曾经经历过的等待事件，但是不在阻塞链中。
              * time between current wait and wait #1: 0.009959 sec
              1.       event: 'Disk file operations I/O'
                 time waited: 0.000028 sec
                     wait id: 29              p1: 'FileOperation'=0x2
                                              p2: 'fileno'=0xc9
                                              p3: 'filetype'=0x2
              * time between wait #1 and #2: 0.002386 sec
              2.       event: 'asynch descriptor resize'
                 time waited: 0.000003 sec
                     wait id: 28              p1: 'outstanding #aio'=0x0
                                              p2: 'current aio limit'=0x80
                                              p3: 'new aio limit'=0x82
              * time between wait #2 and #3: 0.008495 sec
              3.       event: 'SQL*Net message from client'
                 time waited: 30 min 18 sec
                     wait id: 27              p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
    }
    and is blocked by                        -------阻塞会话17的会话信息如下：
=> Oracle session identified by:
    {
                instance: 1 (ogg1.ogg1)
                   os id: 6780
              process id: 19, oracle@oggtest1 (TNS V1-V3)
              session id: 1
        session serial #: 18
    }
    which is waiting for 'PL/SQL lock timer' with wait info: ----会话1正在经历等待，等待信息如下：
    {
                      p1: 'duration'=0x493e0
            time in wait: 9 min 58 sec
           timeout after: 40 min 1 sec
                 wait id: 135
                blocking: 2 sessions
             current sql: BEGIN bproc; END;
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1782<-sspuser()+112<-__restore_rt()<-semtimedop()+1
0<-skgpwwait()+156<-ksliwat()+1821<-kslwaitctx()+162<-kslwait()+141<-psdwat()+112<-pevm_icd_call_common()+421<-pfrin
str_ICAL()+169<-pfrrun_no_tool()+63<-pfrrun()+1030<-plsql_run()+774<-peicnt()+301<-kkxexe()+525<-opiexe()+14801<-kpo
al8()+2288<-opiodr()+1149<-ttcpip()+1251<-opitsk()+1633<-opiino()+958<-opiodr()+1149<-opidrv()+570<-sou2o()+103<-opi
mai_real()+133<-ssthrdmain()+214<-main()+201<-__libc_start_main()+253<-_start()+3
            wait history:                                      -------会话1曾经经历过的等待事件，但是不在阻塞链中
              * time between current wait and wait #1: 0.000952 sec
              1.       event: 'SQL*Net message from client'
                 time waited: 17.746952 sec
                     wait id: 134             p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #1 and #2: 0.000015 sec
              2.       event: 'SQL*Net message to client'
                 time waited: 0.000000 sec
                     wait id: 133             p1: 'driver id'=0x62657100
                                              p2: '#bytes'=0x1
              * time between wait #2 and #3: 0.000002 sec
              3.       event: 'SQL*Net break/reset to client'
                 time waited: 0.000088 sec
                     wait id: 132             p1: 'driver id'=0x62657100
                                              p2: 'break?'=0x0
    }
---------到此为止，发现会话1没有被其他会话阻塞了，说明这条阻塞链的源头已经找到了，就是会话1.与我们之前已知的阻塞情形是完全吻合的。会话1在等待的'PL/SQL lock timer'就是存储过程dbms_lock.sleep造成的。
Chain 2 Signature: 'PL/SQL lock timer'<='library cache pin'<='library cache lock'
Chain 2 Signature Hash: 0x38f0a7dd
-------------------------------------------------------------------------------

简化一下就是：
-------------------------------------------------------------------------------
Chain 2:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                被阻塞会话信息
    }
    is waiting for 'library cache lock<等待事件名称>' with wait info:
    {
                      等待事件的信息
            wait history:
              该被阻塞会话历史等待事件，这里与阻塞链没关系，只是罗列出来供我们分析关联性。
    }
    and is blocked by                 ----阻塞源信息：
=> Oracle session identified by:
    {
                阻塞会话信息
    }
    which is waiting for 'library cache pin<等待事件名称>' with wait info:   ----该阻塞会话同时也被其他会话阻塞，也在等待
    {
                      该会话经历等待的事件的详细情况
            wait history:
              该会话还在经历其他不在阻塞链中的等待事件
    }
    and is blocked by                 -----上面会话的阻塞源
=> Oracle session identified by:
    {
                阻塞源信息
    }
    which is waiting for 'PL/SQL lock timer' with wait info:
    {
                      阻塞事件详情
            wait history:
              该会话还在经历其他不在阻塞链中的等待事件
    }

Chain 2 Signature: 'PL/SQL lock timer'<='library cache pin'<='library cache lock'   -----阻塞链
Chain 2 Signature Hash: 0x38f0a7dd
-------------------------------------------------------------------------------
我们可以用下面nl ogg1_ora_22133_hang.trc |grep "which is waiting for"
    88      which is waiting for 'SQL*Net message from client' with wait info:
   169      which is waiting for 'library cache pin' with wait info:
   208      which is waiting for 'PL/SQL lock timer' with wait info:
然后根据行数去找到大概位置。
比如88和169,208相隔比较远，猜测可能是另外一条阻塞链，这里88应该是TX那条阻塞链。

下面看一下阻塞链的简化信息。我自己是把这部分信息看做上面阻塞链只保留了会话基本信息以后的阻塞关系。
State of ALL nodes
([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):
[0]/1/1/18/0xa8bc3930/6780/LEAF/
[16]/1/17/24/0xa8b95230/17748/NLEAF/[0]
[31]/1/32/45444/0xa8b699a0/17703/LEAF/
[39]/1/40/17698/0xa8b52620/17704/NLEAF/[31]
[43]/1/44/48/0xa8b46c60/17826/NLEAF/[16]

每一行从后往前读的时候
把leaf看做被阻塞
nleaf看做阻塞。
先找到leaf，leaf一般是阻塞的源头。这里[0]是一个源头，然后再其他行结尾找[0]，但是每一行倒着读的时候leaf和nleaf的含义也要相反。
[0](即会话1) nleaf（阻塞） [16](即会话17)：0阻塞16；
找到[16]那一行，同理：
16阻塞了43
没有找到[43]那一行，说明[43]是阻塞链的末端。
综上：会话1 blocking 会话17 blocking 会话44；
如果要杀的话，先杀1，在杀17。

还有一个leaf：相关两行：
[31]/1/32/45444/0xa8b699a0/17703/LEAF/
[39]/1/40/17698/0xa8b52620/17704/NLEAF/[31]
会话32阻塞了会话40，即我们之前产生的TX锁。

完全正确！

下面是一些解释：
[nodenum]：阻塞链中session的序列号
cnode：节点number，针对rac
sid:session sid
sess_srno:session serial#
session:会话地址
ospid:操作系统进程id
state:节点状态
[adjlist]:blocker node session nodenum
predecessor：waiter node session nodenum

--------------------------------------------------------------------------------------------------
state：
IN_HANG ：该状态是一个非常危险的状态，通常表现该会话陷入了死循环或挂起(hang)。
          一般来说出现这种情况，该节点的临近节点(adjlist)也是一样的状态.adjlist 其实就是表示session id.
LEAF    ：通常是被认为blockers的重点对象。可以根据后面的predecesor来判断该session是不是blocker或者是waiter。
LEAF_NW ：跟leaf类似不过可能会占用cpu
NLEAF   ：该状态的session通常被认为 “stuck” session。即其他session所需要的资源正被其holding。
IGN     ：该状态的session通常是处理IDLE状态，除非其adjlist存在，如果是，那么该session正在等待其他session。
IGN_DMP ：跟IGN类似。
SINGLE_NODE，SINGLE_NODE_NW 可以认为跟LEAF，LEAF_NW类似。
参考：http://blog.csdn.net/xujinyang/article/details/6858086
        IGN     ignore
        LEAF    A waiting leaf node
        LEAF_NW A running (using CPU?) leaf node
        NLEAF   An element in a chain but not at the end (not a leaf)
参考：EVENT: HANGANALYZE - Reference Note (文档 ID 130874.1)
-----------------------------------------------------------------------------------------------------

trace文件中还有一个non-intersecting chains，相对应的是intersecting chains，区别：
从字面意思来看，intersect是交叉，即两条阻塞链中的会话都不相同即为non-intersecting chains；如果两条chains中有相同的会话，比如类似‘X’有一个交叉点会话，即为intersecting chains。

感谢各位的阅读！关于“oracle中hanganalyze怎么用”这篇文章就分享到这里了，希望以上内容可以对大家有一定的帮助，让大家可以学到更多知识，如果觉得文章不错，可以把它分享出去让更多的人看到吧！

oracle中hanganalyze怎么用

相关阅读