An Oracle RAC cluster ran into a problem and reported these errors:
	
		CRS-4535: Cannot communicate with Cluster Ready Services
		
		CRS-4534: Cannot communicate with Event Manager
		
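Problem analysis: because the website moved to the cloud, one of our Oracle RAC clusters was brought back from the IDC data center to the company office. The database was shut down before the move, but my manager did it by hand: su - oracle followed by shutdown immediate, which stops only the database instances and leaves the ASM instances running. After the move we plugged the network cables back into their original ports and tried to start everything. I brought up only the instance on ora101 and then left it alone. Later, when the database was needed, I finally looked at ora102 and realized that neither its database instance nor its ASM instance was running. I tried to start them and got the errors shown at the top of this post.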
	
 
 
	First, a word on Oracle RAC: when the servers need to be rebooted, the Oracle resources are shut down as follows:
 
	Method 1:
	1) Stop the database instances
	[grid@ora102 ~]$ srvctl  stop database  -d ORCL
	2) Stop the ASM instances
	[grid@ora102 ~]$ srvctl  stop asm  -n ora102
	[grid@ora102 ~]$ srvctl  stop asm  -n ora101
	If that reports an error, force the stop, as shown below
	[root@ora101 bin]# ./srvctl   stop asm
	PRCR-1065 : Failed to stop resource ora.asm
	CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
	CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
	Adding the force option fixes it:
	[grid@ora101 ~]$ srvctl stop asm -f
	[grid@ora101 ~]$ srvctl status asm
	ASM is not running.
	3) Finally, stop CRS as well
	[root@ora101 bin]# ./crsctl stop cluster   -all
	Method 2:
	1) Stop the database instance; run this on both nodes
	su - oracle
	sqlplus / as sysdba
	shu immediate
	2) Stop the ASM instance; run this on both nodes
	su - grid
	sqlplus / as sysasm
	shu immediate
	If that fails, force the shutdown with shutdown abort in sqlplus:
	[grid@ora101 ~]$ sqlplus / as sysasm
	SQL> shu abort
	ASM instance shutdown
	3) Finally, stop CRS as well
	[root@ora101 bin]# ./crsctl stop cluster   -all
	Check the status of the database, the ASM instances, and CRS
	[grid@ora101 ~]$ srvctl status asm
	ASM is running on ora101,ora102
	[grid@ora101 ~]$ srvctl status database -d ORCL
	Instance orcl1 is not running on node ora101
	Instance orcl2 is not running on node ora102
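The Method 1 order (database, then ASM, then the cluster stack) can be collected in a small script. This is only a sketch: the Grid home path and the database name are the ones used in this post, while `run`, `rac_shutdown`, and the `DRY_RUN` switch are hypothetical helpers introduced for illustration. With `DRY_RUN=1` (the default here) nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Sketch of the Method 1 shutdown order: database, ASM, cluster stack.
# GRID_HOME and DB_NAME default to the values used in this post.
GRID_HOME="${GRID_HOME:-/u01/app/11.2.0/grid}"
DB_NAME="${DB_NAME:-ORCL}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rac_shutdown() {
    run "$GRID_HOME/bin/srvctl" stop database -d "$DB_NAME" || return 1
    # -f is needed while disk groups such as ora.DATA.dg still depend on ASM
    run "$GRID_HOME/bin/srvctl" stop asm -f || return 1
    run "$GRID_HOME/bin/crsctl" stop cluster -all
}
```

Calling `rac_shutdown` with the defaults prints the three commands in order; setting `DRY_RUN=0` as root would execute them.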
	Now, back to the problem at hand.
	[root@ora102 ~]# su - grid
	[grid@ora102 ~]$ sqlplus / as sysasm
	SQL*Plus: Release 11.2.0.4.0 Production on Wed Nov 29 22:28:20 2017
	Copyright (c) 1982, 2013, Oracle.  All rights reserved.
	Connected to:
	Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
	With the Real Application Clusters and Automatic Storage Management options
	SQL> startup
	This fails with an error.
	Checking the cluster services on node ora102 fails as well:
	[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crs_stat  -t
	CRS-0184: Cannot communicate with the CRS daemon.
	The error above shows that the problem lies with CRS.
	Trying to start it also fails (note that this must be run as root):
	
	[root@ora102 ~]# /u01/app/11.2.0/grid/bin/crsctl start crs
	CRS-4640: Oracle High Availability Services is already active
	CRS-4000: Command Start failed, or completed with errors.
	The normal output is:
	[root@ora102 bin]# /u01/app/11.2.0/grid/bin/crsctl start crs
	CRS-4123: Oracle High Availability Services has been started.
	Checking the CRS services shows the problem:
	[grid@ora102 ~]$ crsctl check crs
	CRS-4638: Oracle High Availability Services is online
	CRS-4535: Cannot communicate with Cluster Ready Services
	CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
	CRS-4534: Cannot communicate with Event Manager
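For scripting, the output of `crsctl check crs` can be reduced to a single verdict. The sketch below assumes that a healthy node prints the four "online" messages that appear later in this post (CRS-4638, CRS-4537, CRS-4529, CRS-4533); the function name `crs_stack_state` is made up for this example.

```shell
# Reads `crsctl check crs` output on stdin. Prints "healthy" only when all
# four "online" message codes are present; otherwise names the first one missing.
crs_stack_state() {
    css_out=$(cat)
    for css_code in CRS-4638 CRS-4537 CRS-4529 CRS-4533; do
        case "$css_out" in
            *"$css_code"*) ;;
            *) echo "unhealthy: missing $css_code"; return 1 ;;
        esac
    done
    echo "healthy"
}
```

A possible use on a node is `crsctl check crs | crs_stack_state`. For the failing output above it would report CRS-4537 (Cluster Ready Services) as the first missing message.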
	
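	Next I looked at the IP addresses on node ora102. Its VIP and the SCAN IP were gone, and the VIP was now on node ora101, so node ora102 had dropped out of the cluster.
	The IP configuration: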
	[root@ora102 ~]# cat /etc/hosts
	127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
	::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
	192.168.0.44     ora101
	192.168.0.45     ora102
	192.168.0.46     ora101-vip
	192.168.0.47     ora102-vip
	192.168.0.48     ora-cluster-scan
	172.168.56.101   ora101-priv
	172.168.56.102   ora102-priv
	The addresses on the node show that only the physical IP (192.168.0.45) is left.
	[root@ora102 ~]# ip a
	1: lo:  mtu 65536 qdisc noqueue state UNKNOWN qlen 1
	link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
	inet 127.0.0.1/8 scope host lo
	valid_lft forever preferred_lft forever
	inet6 ::1/128 scope host
	valid_lft forever preferred_lft forever
	2: enp11s0f0:  mtu 1500 qdisc mq state UP qlen 1000
	link/ether 5c:f3:fc:e6:63:40 brd ff:ff:ff:ff:ff:ff
	inet 192.168.0.45/24 brd 192.168.0.255 scope global enp11s0f0
	valid_lft forever preferred_lft forever
	inet6 fe80::f451:31ab:4b4a:b224/64 scope link
	valid_lft forever preferred_lft forever
	3: enp11s0f1:  mtu 1500 qdisc mq state UP qlen 1000
	link/ether 5c:f3:fc:e6:63:42 brd ff:ff:ff:ff:ff:ff
	inet 172.168.56.102/24 brd 172.168.56.255 scope global enp11s0f1
	valid_lft forever preferred_lft forever
	inet 169.254.20.215/16 brd 169.254.255.255 scope global enp11s0f1:1
	valid_lft forever preferred_lft forever
	inet6 fe80::7ee2:d8da:d7fa:12d5/64 scope link
	valid_lft forever preferred_lft forever
	4: enp0s29f0u2:  mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
	link/ether 5e:f3:fc:de:63:43 brd ff:ff:ff:ff:ff:ff
	5: virbr0:  mtu 1500 qdisc noqueue state DOWN qlen 1000
	link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
	inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
	valid_lft forever preferred_lft forever
	6: virbr0-nic:  mtu 1500 qdisc pfifo_fast master virbr0 state DOWN qlen 1000
	link/ether 52:54:00:f5:11:c7 brd ff:ff:ff:ff:ff:ff
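The manual comparison of /etc/hosts with the `ip a` output can be scripted. A minimal sketch, assuming the VIP names follow the `<node>-vip` convention seen in the hosts file above; `hosts_lookup` and `has_ipv4` are hypothetical helper names.

```shell
# Print the address that a hosts file maps to NAME.
# usage: hosts_lookup NAME [HOSTS_FILE]
hosts_lookup() {
    awk -v name="$1" '
        $1 !~ /^#/ {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }' "${2:-/etc/hosts}"
}

# Succeed when the `ip a` output on stdin contains the IPv4 address ADDRESS.
# usage: ip a | has_ipv4 ADDRESS
has_ipv4() {
    grep -qF "inet $1/"
}
```

On ora102, something like `ip a | has_ipv4 "$(hosts_lookup ora102-vip)" || echo "VIP is not on this node"` would have flagged the missing VIP.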
	Fixing the problem
	First, try restarting CRS on node 2.
	Stop CRS:
	[root@ora102 bin]# ./crsctl stop  crs
	or
	[root@ora102 bin]# ./crsctl stop cluster
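	Then start the clusterware again.
	The difference between Method 1 and Method 2: crsctl start/stop crs manages the clusterware stack on the local node only, including the Oracle High Availability Services daemon, and cannot manage remote nodes. crsctl start/stop cluster can manage the local clusterware stack or the whole cluster, and leaves Oracle High Availability Services running.
	With -all it starts the clusterware on every node, that is, the whole cluster; with -n it starts the clusterware on the named nodes.
	Method 1: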
	[root@ora102 bin]# ./crsctl start crs
	or
	Method 2:
	[root@ora102 bin]# ./crsctl start cluster
	CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'ora102'
	CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'ora102' succeeded
	CRS-2679: Attempting to clean 'ora.asm' on 'ora102'
	CRS-2681: Clean of 'ora.asm' on 'ora102' succeeded
	CRS-2672: Attempting to start 'ora.asm' on 'ora102'
	CRS-2676: Start of 'ora.asm' on 'ora102' succeeded
	CRS-2672: Attempting to start 'ora.crsd' on 'ora102'
	CRS-2676: Start of 'ora.crsd' on 'ora102' succeeded
	If that still does not help, wipe the clusterware configuration on node 2 and then rerun root.sh:
	[root@ora102 trace]$ /u01/app/11.2.0/grid/crs/install/rootcrs.pl -verbose -deconfig -force
	[root@ora102 ~]#  /u01/app/11.2.0/grid/crs/install/roothas.pl  -verbose -deconfig -force
	[root@ora102 bin]# /u01/app/11.2.0/grid/root.sh
	Then check whether the status is normal. If it is not, restart CRS once more, which fixed it here.
	The status check now comes back normal:
	[root@ora102 bin]# ./crs_stat -t
	Name           Type           Target    State     Host
	------------------------------------------------------------
	ora.DATA.dg    ora....up.type ONLINE    ONLINE    ora101
	ora.FRA.dg     ora....up.type ONLINE    ONLINE    ora101
	ora....ER.lsnr ora....er.type ONLINE    ONLINE    ora101
	ora....N1.lsnr ora....er.type ONLINE    ONLINE    ora101
	ora.OCR.dg     ora....up.type ONLINE    ONLINE    ora101
	ora.asm        ora.asm.type   ONLINE    ONLINE    ora101
	ora.cvu        ora.cvu.type   ONLINE    ONLINE    ora101
	ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
	ora....network ora....rk.type ONLINE    ONLINE    ora101
	ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    ora101
	ora.ons        ora.ons.type   ONLINE    ONLINE    ora101
	ora....SM1.asm application    ONLINE    ONLINE    ora101
	ora....01.lsnr application    ONLINE    ONLINE    ora101
	ora.ora101.gsd application    OFFLINE   OFFLINE
	ora.ora101.ons application    ONLINE    ONLINE    ora101
	ora.ora101.vip ora....t1.type ONLINE    ONLINE    ora101
	ora....SM2.asm application    ONLINE    ONLINE    ora102
	ora....02.lsnr application    ONLINE    ONLINE    ora102
	ora.ora102.gsd application    OFFLINE   OFFLINE
	ora.ora102.ons application    ONLINE    ONLINE    ora102
	ora.ora102.vip ora....t1.type ONLINE    ONLINE    ora102
	ora.orcl.db    ora....se.type ONLINE    ONLINE    ora101
	ora.scan1.vip  ora....ip.type ONLINE    ONLINE    ora101
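Scanning this table by eye gets tedious, so the lines worth attention can be filtered out: resources whose Target is ONLINE but whose State is not. A small sketch, assuming the five-column `crs_stat -t` layout shown above (`crs_stat_problems` is a made-up name). Note that `crs_stat` is deprecated in 11.2 in favour of `crsctl status resource -t`, whose layout differs and would need a different filter.

```shell
# Reads `crs_stat -t` output on stdin and prints "name state" for every
# resource that should be ONLINE (Target column) but is not (State column).
# Resources with Target OFFLINE, such as ora.gsd, are ignored on purpose.
crs_stat_problems() {
    awk '$3 == "ONLINE" && $4 != "ONLINE" { print $1, $4 }'
}
```

Usage would be `./crs_stat -t | crs_stat_problems`; empty output means every resource that is meant to be online is online.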
	Check the OCR status
	[grid@ora101 ~]$ ocrcheck
	Status of Oracle Cluster Registry is as follows :
	Version                  :          3
	Total space (kbytes)     :     262120
	Used space (kbytes)      :       2948
	Available space (kbytes) :     259172
	ID                       :   87127720
	Device/File Name         :       +OCR
	Device/File integrity check succeeded
	Device/File not configured
	Device/File not configured
	Device/File not configured
	Device/File not configured
	Cluster registry integrity check succeeded
	Logical corruption check bypassed due to non-privileged user
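The space figures in this output are easy to turn into a usage percentage for monitoring. A sketch that assumes the "Total space" and "Used space" lines look like the ones above; `ocr_usage` is a hypothetical name and the result is truncated to a whole percent.

```shell
# Reads `ocrcheck` output on stdin and prints the used space as a whole
# percentage of the total, e.g. "ocr_used_pct=1".
ocr_usage() {
    awk -F: '
        /Total space/ { total = $2 + 0 }
        /Used space/  { used  = $2 + 0 }
        END {
            if (total > 0) printf "ocr_used_pct=%d\n", used * 100 / total
            else { print "ocr_used_pct=unknown"; exit 1 }
        }'
}
```

With the figures above (2948 of 262120 kbytes), `ocrcheck | ocr_usage` would print `ocr_used_pct=1`.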
	Check the CRS status: everything is normal.
	[grid@ora101 ~]$ crsctl check crs
	CRS-4638: Oracle High Availability Services is online
	CRS-4537: Cluster Ready Services is online
	CRS-4529: Cluster Synchronization Services is online
	CRS-4533: Event Manager is online
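After a restart the stack needs some time to come up, so in a script it helps to poll this check instead of running it once. A minimal sketch: `wait_for` is a generic retry helper and `crs_online` a hypothetical check that counts the three "online" messages for CRS, CSS, and EVM; the Grid home path is the one used in this post.

```shell
# wait_for TRIES DELAY COMMAND...
# Rerun COMMAND until it succeeds or TRIES attempts are used up.
wait_for() {
    wf_tries=$1; wf_delay=$2; shift 2
    wf_i=1
    while [ "$wf_i" -le "$wf_tries" ]; do
        if "$@"; then return 0; fi
        [ "$wf_i" -lt "$wf_tries" ] && sleep "$wf_delay"
        wf_i=$((wf_i + 1))
    done
    return 1
}

# Hypothetical check: true when CRS-4537, CRS-4529 and CRS-4533 are all reported.
crs_online() {
    [ "$("${GRID_HOME:-/u01/app/11.2.0/grid}/bin/crsctl" check crs 2>/dev/null |
        grep -Ec 'CRS-45(37|29|33)')" -eq 3 ]
}
```

For example, `wait_for 30 10 crs_online || echo "CRS did not come up"` would check every 10 seconds for up to 30 attempts; the numbers are arbitrary.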
	
	Aside
	I. Errors when stopping the ASM instance
	[root@ora101 bin]# ./srvctl   stop asm
	PRCR-1065 : Failed to stop resource ora.asm
	CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
	CRS-2529: Unable to act on 'ora.asm' because that would require stopping or relocating 'ora.DATA.dg', but the force option was not specified
	Adding the force option fixes it:
	[grid@ora101 ~]$ srvctl stop asm -f
	[grid@ora101 ~]$ srvctl status asm
	ASM is not running.
	Or force the shutdown with shutdown abort in sqlplus:
	[grid@ora101 ~]$ sqlplus / as sysasm
	SQL> shu abort
	ASM instance shutdown
	CRS at this point:
	[grid@ora101 ~]$ crsctl check crs
	CRS-4638: Oracle High Availability Services is online
	CRS-4537: Cluster Ready Services is online
	CRS-4529: Cluster Synchronization Services is online
	CRS-4533: Event Manager is online
	crsctl stop crs stops CRS and, along with it, ASM and its disk groups.
	The stop log also shows the VIP failing over:
	[root@ora101 bin]# ./crsctl stop crs
	CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ora101'
	CRS-2673: Attempting to stop 'ora.crsd' on 'ora101'
	CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'ora101'
	CRS-2673: Attempting to stop 'ora.OCR.dg' on 'ora101'
	CRS-2673: Attempting to stop 'ora.DATA.dg' on 'ora101'
	CRS-2673: Attempting to stop 'ora.FRA.dg' on 'ora101'
	CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'ora101'
	CRS-2677: Stop of 'ora.LISTENER.lsnr' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.ora101.vip' on 'ora101'
	CRS-2677: Stop of 'ora.FRA.dg' on 'ora101' succeeded
	CRS-2677: Stop of 'ora.DATA.dg' on 'ora101' succeeded
	CRS-2677: Stop of 'ora.ora101.vip' on 'ora101' succeeded
	CRS-2672: Attempting to start 'ora.ora101.vip' on 'ora102'
	CRS-2676: Start of 'ora.ora101.vip' on 'ora102' succeeded   ----- the VIP fails over here
	CRS-2677: Stop of 'ora.OCR.dg' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
	CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.ons' on 'ora101'
	CRS-2677: Stop of 'ora.ons' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.net1.network' on 'ora101'
	CRS-2677: Stop of 'ora.net1.network' on 'ora101' succeeded
	CRS-2792: Shutdown of Cluster Ready Services-managed resources on 'ora101' has completed
	CRS-2677: Stop of 'ora.crsd' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.ctssd' on 'ora101'
	CRS-2673: Attempting to stop 'ora.evmd' on 'ora101'
	CRS-2673: Attempting to stop 'ora.asm' on 'ora101'
	CRS-2673: Attempting to stop 'ora.mdnsd' on 'ora101'
	CRS-2677: Stop of 'ora.evmd' on 'ora101' succeeded
	CRS-2677: Stop of 'ora.mdnsd' on 'ora101' succeeded
	CRS-2677: Stop of 'ora.ctssd' on 'ora101' succeeded
	CRS-2677: Stop of 'ora.asm' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'ora101'
	CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.cssd' on 'ora101'
	CRS-2677: Stop of 'ora.cssd' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.crf' on 'ora101'
	CRS-2677: Stop of 'ora.crf' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.gipcd' on 'ora101'
	CRS-2677: Stop of 'ora.gipcd' on 'ora101' succeeded
	CRS-2673: Attempting to stop 'ora.gpnpd' on 'ora101'
	CRS-2677: Stop of 'ora.gpnpd' on 'ora101' succeeded
	CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ora101' has completed
	CRS-4133: Oracle High Availability Services has been stopped.
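In a long stop log like this one the VIP failover is easy to miss. The sketch below pulls it out, assuming the message format shown above (a CRS-2676 "Start of '...vip' on '...' succeeded" line); `vip_relocations` is a made-up name.

```shell
# Reads a crsctl stop/start log on stdin and prints one line per VIP that
# was started on a node, e.g. "ora.ora101.vip now on ora102".
vip_relocations() {
    sed -n "s/.*CRS-2676: Start of '\(ora\.[^']*\.vip\)' on '\([^']*\)' succeeded.*/\1 now on \2/p"
}
```

Usage would be along the lines of `./crsctl stop crs 2>&1 | tee stop.log | vip_relocations`, where `stop.log` is an arbitrary file name.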
	To start ASM, start the CRS services first
	[root@ora101 bin]# ./crsctl start crs
	[root@ora101 bin]# ./crsctl check crs
	CRS-4638: Oracle High Availability Services is online
	CRS-4537: Cluster Ready Services is online
	CRS-4529: Cluster Synchronization Services is online
	CRS-4533: Event Manager is online
	Start the RAC instances and the database
	[grid@ora102 ~]$ srvctl start asm 
	PRCC-1014 : asm was already running
	[root@ora101 bin]# ./srvctl start database -d ORCL 
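	II. A brief overview of the CRS architecture:
	1) Cluster Synchronization Services (CSS): manages cluster membership, tracking which nodes are members, which join, and which leave, and notifies the members.
	2) Cluster Ready Services (CRS): the main program for high-availability operations in the cluster. Everything CRS manages is treated as a resource, including databases, instances, services, listeners, VIP addresses, and application processes. The CRS daemon manages cluster resources according to the configuration stored in the OCR, handling start, stop, monitoring, and failover. When a resource changes state, the CRS daemon generates an event. Once RAC is installed, the CRS daemon monitors the resources and automatically restarts one that fails; it typically retries 5 times and then gives up.
	3) Event Management (EVM): a background process that publishes the events generated by CRS.
	4) Oracle Notification Service (ONS): a publish-and-subscribe service for communicating FAN (Fast Application Notification) messages.
	5) RACG: extends the clusterware to support Oracle-specific requirements and complex resources.
	6) Process Monitor Daemon (OPROCD): locked in memory, it monitors the cluster and performs I/O fencing. It works like a hang checker: it repeatedly sleeps and wakes up, and if it wakes up later than expected, it reboots the node.
	Note:
	The CRS stack starts automatically with the operating system by default. For maintenance it is sometimes useful to turn this off. As root, the first command below disables autostart and the second enables it again.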
	[root@rac1 bin]# ./crsctl disable crs
	[root@rac1 bin]# ./crsctl enable crs
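	This command actually changes the content of the file /etc/oracle/scls_scr/raw/root/crsstart.
	CRS consists of three services, CRS, CSS, and EVM, and each service is made up of a number of modules. crsctl can trace each module and write the trace to the log.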
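Before a maintenance window it is worth confirming which autostart state is in effect. On 11.2, `crsctl config crs` reports it; the sketch below assumes its output contains the phrase "autostart is enabled" or "autostart is disabled", so treat the exact wording as an assumption and check it on your release. `autostart_state` is a hypothetical name.

```shell
# Reads `crsctl config crs` output on stdin and prints enabled, disabled,
# or unknown, based on the (assumed) wording of the message.
autostart_state() {
    case "$(cat)" in
        *"autostart is enabled"*)  echo enabled ;;
        *"autostart is disabled"*) echo disabled ;;
        *)                         echo unknown ;;
    esac
}
```

Usage would be `./crsctl config crs | autostart_state`, run from the Grid home bin directory as root.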
	[root@rac1 bin]# ./crsctl lsmodules css
	[root@rac1 bin]# ./crsctl lsmodules evm
	- Trace the CSSD module (must be run as root):
	[root@rac1 bin]# ./crsctl debug log css "CSSD:1"
	Configuration parameter trace is now set to 1.
	Set CRSD Debug Module: CSSD Level: 1
	- View the trace log
	[root@rac1 cssd]# pwd
	/u01/app/oracle/product/crs/log/rac1/cssd
	[root@rac1 cssd]# more ocssd.log
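	IV. Oracle Cluster Registry (OCR):
	The OCR stores the configuration of the Oracle clusterware and the Oracle RAC database, much like the Windows registry. This also includes the Oracle Local Registry (OLR), which exists on every node and holds that node's clusterware configuration. Oracle Clusterware keeps the configuration of the whole cluster on shared storage, known as the OCR disk. Within the cluster only one node, called the master node, may read and write the OCR disk. Every node keeps a copy of the OCR in memory, and an OCR process on each node reads from that copy. When the OCR content changes, the OCR process on the master node synchronizes the change to the OCR processes on the other nodes.
	ocrcheck:
	The ocrcheck command checks the consistency of the OCR content. While it runs it writes a log file named ocrcheck_pid.log under $CRS_HOME/log/nodename/client. The command takes no arguments.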
	[root@rac1 bin]#./ocrcheck
	V. Finally, check the state of the database:
	1) Check the database instances:
	[root@ora102 bin]# ./srvctl status  database -d ORCL
	Instance orcl1 is running on node ora101
	Instance orcl2 is running on node ora102
	2) Check the ASM instances:
	[root@ora102 bin]# ./srvctl status  asm
	ASM is running on ora101,ora102
	3) Check CRS; the output below is the healthy state
	[root@ora102 bin]# ./crsctl check crs
	CRS-4638: Oracle High Availability Services is online
	CRS-4537: Cluster Ready Services is online
	CRS-4529: Cluster Synchronization Services is online
	CRS-4533: Event Manager is online
	- Check the daemons one by one
	[root@rac1 bin]# ./crsctl check cssd
	CSS appears healthy
	[root@rac1 bin]# ./crsctl check crsd
	CRS appears healthy
	[root@rac1 bin]# ./crsctl check evmd
	EVM appears healthy
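	Summary: an Oracle RAC cluster is a single unit, and its nodes need to be started and stopped together. If you start only one node, the other node's VIP fails over to it and the voting disk votes a node out of the cluster, which is a split-brain situation. The basic fix is to restart CRS on the evicted node (crsctl stop crs, then crsctl start crs). If that does not work, wipe the configuration on node 2, rerun root.sh, and then start CRS with crsctl start crs.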
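The two-stage recovery from the summary can be written down as one flow. This is a sketch under stated assumptions, not a tested procedure: paths are the ones used in this post, `run`, `recover_node`, and `DRY_RUN` are hypothetical, and the health check is passed in as a command so that any check can be plugged in. The second stage is destructive (it wipes the node's clusterware configuration), so `DRY_RUN` defaults to 1 and only prints the commands.

```shell
#!/bin/sh
GRID_HOME="${GRID_HOME:-/u01/app/11.2.0/grid}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only, 0 = execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# recover_node CHECK_COMMAND...
# CHECK_COMMAND must succeed when CRS on this node is healthy. In real use it
# should poll, because crsctl start crs returns before the stack is up.
recover_node() {
    # Stage 1: restart CRS on the evicted node.
    run "$GRID_HOME/bin/crsctl" stop crs
    run "$GRID_HOME/bin/crsctl" start crs || return 1
    if "$@"; then
        echo "recovered by restarting CRS"
        return 0
    fi
    # Stage 2 (destructive): deconfigure the node as in this post, rerun root.sh.
    run "$GRID_HOME/crs/install/rootcrs.pl" -verbose -deconfig -force || return 1
    run "$GRID_HOME/crs/install/roothas.pl" -verbose -deconfig -force
    run "$GRID_HOME/root.sh" || return 1
    run "$GRID_HOME/bin/crsctl" start crs
}
```

In dry-run mode, `recover_node false` prints the full escalation path and `recover_node true` stops after the restart stage.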