This post walks through how a GoldenGate (GG) failure on a production RAC node was diagnosed and handled.
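Problem description:
Our production GG went live last Wednesday evening, April 19, configured on node 3 of the RAC. When node 3 was rebooted, an oversight meant that only 24 GB of its original 32 GB of memory was recognized after startup, and nobody noticed at the time. After a few days of running we found that once or twice a day concurrency on node 3 spiked: the OS load average jumped from single digits to 50 or 60, and the database briefly appeared to hang. At exactly those moments the GG extract process was at the top of top, and free memory on the OS was down to a few tens of KB. We first suspected the NFS mounts, but testing showed nothing wrong there, so we decided to fix node 3's memory as an emergency. The details follow: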
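The missing 8 GB went unnoticed for days, so a small check after every reboot would have helped. The following is a minimal sketch, not what was run at the time: it reads MemTotal from /proc/meminfo-style text and compares it with the installed size. The function names and the 1.5 GiB tolerance are assumptions for illustration (the kernel always reports slightly less than the installed amount).

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 * 1024)


def memory_shortfall(meminfo_text: str, expected_gib: float,
                     tolerance_gib: float = 1.5) -> bool:
    """True when recognized memory is more than tolerance_gib below expected.

    tolerance_gib is an assumed allowance for memory the kernel reserves.
    """
    return expected_gib - mem_total_gib(meminfo_text) > tolerance_gib
```

On the node itself one would pass the contents of /proc/meminfo and the expected size, for example memory_shortfall(open("/proc/meminfo").read(), 32).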
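After work at 6 pm nothing was done right away, because the website and BOSS are still fairly busy between 6 and 9 pm. At 9 pm I asked the operations staff to stop all Tomcat instances on node 3. I then stopped GG, unmounted NFS, shut down all database processes on node 3, and finally powered the machine off. The operations were as follows: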
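(Stop order as recorded in the errlog below: the extract first, then the data pump, which is itself an extract group, then the manager.)
GGSCI (rac3) 21> stop extract xxxx
GGSCI (rac3) 21> stop extract xxxx
GGSCI (rac3) 21> stop mgr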
While the processes were stopping, the errlog showed the following:
2012-04-26 20:57:39 INFO OGG-00497 Oracle GoldenGate Capture for Oracle, extksr1.prm: Writing DDL operation to extract trail file.
2012-04-26 21:01:36 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop extksr1.
2012-04-26 21:01:38 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, extksr1.prm: Command received from GGSCI: STOP.
2012-04-26 21:01:39 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, extksr1.prm: EXTRACT EXTKSR1 stopped normally.
2012-04-26 21:01:41 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop dpksr1.
2012-04-26 21:01:43 INFO OGG-01021 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Command received from GGSCI: STOP.
2012-04-26 21:01:43 INFO OGG-00991 Oracle GoldenGate Capture for Oracle, dpksr1.prm: EXTRACT DPKSR1 stopped normally.
2012-04-26 21:01:47 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): stop mgr.
2012-04-26 21:01:49 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from GGSCI on host 10.1.8.49 (STOP).
2012-04-26 21:01:49 WARNING OGG-00938 Oracle GoldenGate Manager for Oracle, mgr.prm: Manager is stopping at user request.
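Before unmounting anything it is worth confirming that every group really stopped cleanly. As a rough sketch, the log lines above can be scanned for the OGG-00991 "stopped normally" message. The regular expressions are assumptions based only on the line layout shown in this excerpt, and groups_stopped_normally is a hypothetical helper name.

```python
import re

# Layout assumed from the excerpt: timestamp, severity, message code, text.
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(INFO|WARNING|ERROR)\s+(OGG-\d+)\s+(.*)$"
)
STOPPED = re.compile(r"EXTRACT (\S+) stopped normally")


def groups_stopped_normally(lines):
    """Return the extract group names reported as stopped normally, in log order."""
    stopped = []
    for line in lines:
        parsed = LOG_LINE.match(line.strip())
        if parsed is None or parsed.group(3) != "OGG-00991":
            continue
        hit = STOPPED.search(parsed.group(4))
        if hit:
            stopped.append(hit.group(1))
    return stopped
```

Fed the excerpt above, the result should list both EXTKSR1 and DPKSR1; a group missing from the list would be a reason to look at the log before going further.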
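With the GG processes stopped, I unmounted NFS: nodes 1 and 2 and the shared storage were all umounted. The exact commands are simple and omitted here. One point worth mentioning: unmounting the shared storage can fail because the resource is busy, and adding the -l option (lazy unmount) gets around that. Also, once all GG processes on the main site are stopped, the GG process on the target side still shows as RUNNING, but its errlog reports that the extract has stopped: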
2012-04-26 20:54:38 INFO OGG-00484 Oracle GoldenGate Delivery for Oracle, repksr1.prm: Executing DDL operation.
2012-04-26 20:54:38 INFO OGG-00483 Oracle GoldenGate Delivery for Oracle, repksr1.prm: DDL operation successful.
2012-04-26 20:54:38 INFO OGG-01408 Oracle GoldenGate Delivery for Oracle, repksr1.prm: Restoring current schema for DDL operation to [OGG].
2012-04-26 20:58:41 INFO OGG-01735 Oracle GoldenGate Collector: Synchronizing /home/oracle/ggs/trails/t1000239 to disk.
2012-04-26 20:58:41 INFO OGG-01670 Oracle GoldenGate Collector: Closing /home/oracle/ggs/trails/t1000239.
2012-04-26 20:58:41 INFO OGG-01675 Oracle GoldenGate Collector: Terminating because extract is stopped.
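After these steps I stopped the database processes and services on node 3 (omitted), shut the machine down, and notified the colleague standing by in the data center, who then started on the memory problem. About 30 minutes later the memory was fixed and the server was back up, so I began the follow-up work:
First, start the portmap and nfs services on node 3 (omitted).
Then mount nodes 1 and 2 and the shared storage. Starting the mgr process after that fails with the following errors: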
2012-04-26 21:50:18 ERROR OGG-01117 Oracle GoldenGate Command Interpreter for Oracle: Received signal: Program interrupt (2).
2012-04-26 21:50:18 ERROR OGG-01668 Oracle GoldenGate Command Interpreter for Oracle: PROCESS ABENDING.
2012-04-26 21:51:43 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (oracle): start mgr.
2012-04-26 21:52:13 ERROR OGG-01454 Oracle GoldenGate Manager for Oracle, mgr.prm: Unable to lock file "/share_disk/ggs/dirpcs/MGR.pcm" (error 37, No locks available).
2012-04-26 21:52:13 ERROR OGG-01668 Oracle GoldenGate Manager for Oracle, mgr.prm: PROCESS ABENDING.
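The OGG-01454 line above reports error 37, which on Linux is ENOLCK ("No locks available"). One way to check whether file locking works on a mount before starting mgr is to try a POSIX lock on a scratch file there. This is a minimal sketch under that assumption: can_lock is a hypothetical helper, and the probe file path is up to the operator (it should not be a real GG file such as MGR.pcm).

```python
import errno
import fcntl
import os


def can_lock(path: str) -> bool:
    """Try to take and release an exclusive POSIX lock on path.

    Returns False only for ENOLCK, the error seen in OGG-01454.
    Any other failure (for example the file being locked by another
    process) is re-raised so it is not mistaken for a lock-service problem.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOLCK:
            return False
        raise
    finally:
        os.close(fd)
```

A False result from a probe file on the shared storage would point at the NFS lock service rather than at GoldenGate itself.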
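The OGG-01454 error above (shown in red in the original post) means roughly that the mgr process cannot obtain a lock on the shared storage, which blocks every later step. The fix is simple: start the nfslock service on node 3 and then start mgr again. Once mgr was up, the extract process turned out to have abended, and the errlog showed the following extract errors: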
2012-04-26 21:54:34 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Rolling over remote file /home/oracle/ggs/trails/t1000240.
2012-04-26 21:54:34 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Recovery completed for target file /home/oracle/ggs/trails/t1000240, at RBA 1022.
2012-04-26 21:54:34 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, dpksr1.prm: Recovery completed for all targets.
2012-04-26 21:54:35 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, extksr1.prm: Could not find archived log for sequence 16857 thread 3 under alternative destinations. SQL <SELECT MAX(sequence#) FROM v$log WHERE thread# = :ora_thread>. Last alternative log tried /arch/rac3/3_16857_744833311.dbf, error retrieving redo file name for sequence 16857, archived = 1, use_alternate = 0Not able to establish initial position for sequence 16857, rba 1529360.
2012-04-26 21:54:35 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, extksr1.prm: PROCESS ABENDING.
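The OGG-00446 line above already names what is missing: the thread, the sequence number, and the last path the extract tried. A rough sketch for pulling those fields out of the message follows. The regular expression is an assumption fitted to the wording of this one message, and missing_archive is a hypothetical helper name.

```python
import re

# Pattern fitted to the OGG-00446 text shown above; other versions may word it differently.
OGG_00446 = re.compile(
    r"Could not find archived log for sequence (?P<sequence>\d+) thread (?P<thread>\d+)"
    r".*?Last alternative log tried (?P<path>[^,\s]+)"
)


def missing_archive(message: str):
    """Return sequence, thread and last tried path from an OGG-00446 message, or None."""
    found = OGG_00446.search(message)
    if found is None:
        return None
    return {
        "sequence": int(found.group("sequence")),
        "thread": int(found.group("thread")),
        "path": found.group("path"),
    }
```

The returned path shows which directory the extract expects the file in, and the thread and sequence identify which archived log to look for elsewhere.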
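The cause is straightforward. When node 3 was shut down, its VIP failed over to another node, and the archived logs that would normally be written on node 3 ended up on other nodes. When GG tried to read node 3's archived logs it could not find the required file in the expected directory, so the extract abended. With the cause clear, the fix was easy: copy node 3's archived logs over from the other nodes and start the extract again: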
2012-04-26 21:57:22 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, extksr1.prm: EXTRACT EXTKSR1 started.
2012-04-26 21:57:22 INFO OGG-01055 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery initialization completed for target file /share_disk/ggs/trails/s1000239, at RBA 24518902.
2012-04-26 21:57:22 INFO OGG-01478 Oracle GoldenGate Capture for Oracle, extksr1.prm: Output file /share_disk/ggs/trails/s1 is using format RELEASE 10.4/11.1.
2012-04-26 21:57:23 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, extksr1.prm: Position of first record processed for Thread 1, Sequence 29645, RBA 18568720, SCN 18.122009990, Apr 26, 2012 9:01:24 PM.
2012-04-26 21:57:23 INFO OGG-01517 Oracle GoldenGate Capture for Oracle, extksr1.prm: Position of first record processed for Thread 2, Sequence 28161, RBA 12794496, SCN 18.122010368, Apr 26, 2012 9:01:32 PM.
2012-04-26 21:57:24 INFO OGG-01026 Oracle GoldenGate Capture for Oracle, extksr1.prm: Rolling over remote file /share_disk/ggs/trails/s1000239.
2012-04-26 21:57:24 INFO OGG-01053 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery completed for target file /share_disk/ggs/trails/s1000240, at RBA 1019.
2012-04-26 21:57:24 INFO OGG-01057 Oracle GoldenGate Capture for Oracle, extksr1.prm: Recovery completed for all targets.
GG source (primary) side:
GGSCI (rac3) 20> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING
EXTRACT     RUNNING     DPKSR1      00:00:00      00:00:00
EXTRACT     RUNNING     EXTKSR1     00:00:00      00:00:04
GG target (standby) side:
GGSCI (rptdb) 7> info all
Program     Status      Group       Lag           Time Since Chkpt
MANAGER     RUNNING
REPLICAT    RUNNING     REPKSR1     00:00:00      00:00:00
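For the follow-up monitoring, the info all output above can be checked mechanically. The sketch below assumes the whitespace-separated layout shown here (program, status, optional group name) and simply reports anything that is not RUNNING. Both function names are hypothetical, and how the GGSCI output is captured is left out.

```python
def parse_info_all(output: str):
    """Parse GGSCI 'info all' text into dicts with program, status and group."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the prompt, the header and blank lines; keep only process rows.
        if len(fields) >= 2 and fields[0] in ("MANAGER", "EXTRACT", "REPLICAT"):
            rows.append({
                "program": fields[0],
                "status": fields[1],
                "group": fields[2] if len(fields) > 2 else None,
            })
    return rows


def not_running(output: str):
    """Return the rows whose status is anything other than RUNNING."""
    return [row for row in parse_info_all(output) if row["status"] != "RUNNING"]
```

An empty list from not_running corresponds to the healthy state shown on both sides above.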
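After watching for a while, neither the main site nor GG showed any further problems. The whole operation took about an hour, and monitoring will continue over the coming week.
Noted here for the record.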