Oracle 11g控制文件损坏问题分析


对于Oracle 11g版本以下数据库当控制文件损坏后,我们在mount数据库时,会有很明显的ora-600错误,这样就很容易知道控制文件损坏的错误,但是对于oracle 11g R2就不是很明显了,

当时是一个ORACLE 11g 的RAC系统,出现问题时数据库实例可以nomount打开但是在mount控制文件时就会出现如下告警:

ORA-3113 "end of file on communication channel"

然后整个sqlplus连接终止,需要重新连接,当然我们知道通常mount阶段无法进行,问题就出在控制文件本身的存在损坏的问题,但是对于专业的人员来说,如果仅仅满足这样的心态,显然是不行的,所以需要对其进行进一步分析:

但是在ASM日志中我们可以看到如下信息:

Tue Mar 27 13:35:11 2012
NOTE: client PROD1:PROD registered, osid 6726, mbr 0x1
Tue Mar 27 13:35:24 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_6726.trc
Tue Mar 27 13:40:35 2012
NOTE: client PROD1:PROD registered, osid 7477, mbr 0x1
Tue Mar 27 13:41:45 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_7477.trc
Tue Mar 27 13:41:47 2012
NOTE: client PROD1:PROD registered, osid 7736, mbr 0x1
Tue Mar 27 13:42:01 2012
NOTE: ASM client PROD1:PROD disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_7736.tr

对于生成的trace文件我们仅能够看到如下些信息:

2012-03-27 13:41:08.022438 :802EEFE8:KFNS:kfn.c@702:kfnDispatch(): calling server stub for KFNOP=5
2012-03-27 13:41:13.027006 :802EF0F4:KFNU:kfns.c@1924:kfnsBackground(): kfnsBackground completed in 5 seconds (KFNPM=0)
2012-03-27 13:41:13.027012 :802EF0F5:KFNS:kfn.c@729:kfnDispatch(): completed KFNOP=5
2012-03-27 13:41:13.027122 :802EF0F6:KFNS:kfn.c@702:kfnDispatch(): calling server stub for KFNOP=5

对于此问题显然没什么用处,并且问题应该还是在数据库方面。

所以对数据库实例的alert告警检查,当执行alter database mount状态时的日志如下:

Tue Mar 27 11:42:01 2012
alter database mount
This instance was first to mount
Tue Mar 27 11:42:01 2012
NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Loaded library: System
Tue Mar 27 11:42:01 2012
SUCCESS: diskgroup PRODDATA was mounted
Tue Mar 27 11:42:01 2012
NOTE: dependency between database PROD and diskgroup resource ora.PRODDATA.dg is established
USER (ospid: 26774): terminating the instance
Tue Mar 27 11:42:07 2012
System state dump requested by (instance=1, osid=26774), summary=[abnormal instance termination].
System State dumped to trace file /d01/oracle/11.2.0/admin/PROD1_db01/diag/rdbms/prod/PROD1/trace/PROD1_diag_26656.trc
Dumping diagnostic data in directory=[cdmp_20120327114207], requested by (instance=1, osid=26774), summary=[abnormal instance termination].
Instance terminated by USER, pid = 26774

还是不明显的日志提示,检查告警trace文件:/d01/oracle/11.2.0/admin/PROD1_db01/diag/rdbms/prod/PROD1/trace/PROD1_diag_26656.trc也无明细的信息

后来采用10046事件来跟踪mount这个过程,才看到了比较明细的提示,

alter session set events='10046 trace name context forever,level 12';
Trace file /d01/oracle/11.2.0/admin/PROD1_db01/diag/rdbms/prod/PROD1/trace/PROD1_ora_7764.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /d01/oracle/11.2.0
System name:    Linux
Node name:      db01.clc.com
Release:        2.6.18-238.el5
Version:        #1 SMP Sun Dec 19 14:22:44 EST 2010
Machine:        x86_64
Instance name: PROD1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 31
Unix process pid: 7764, image: oracle@db01.clc.com (TNS V1-V3)


*** 2012-03-27 13:41:55.101
*** SESSION ID:(1751.3) 2012-03-27 13:41:55.101
*** CLIENT ID:() 2012-03-27 13:41:55.101
*** SERVICE NAME:() 2012-03-27 13:41:55.101
*** MODULE NAME:(oraagent.bin@db01.clc.com (TNS V1-V3)) 2012-03-27 13:41:55.101
*** ACTION NAME:() 2012-03-27 13:41:55.101

Error: kccpb_sanity_check_2
Control file sequence number mismatch!
fhcsq: 312916 bhcsq: 313137 cfn 0


*** 2012-03-27 13:41:55.101
Submitting synchronized dump request [268435460]. summary=[Controlfile header dump (kccpbsc)].

*** 2012-03-27 13:41:57.102
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+461<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+53<-ksuitm()+1325<-kccpb_sanity_check()+341<-kccbmp_get()+309<-kccsed_rbl()+111<-kccocx()+1154<-kccocf()+136<-kcfcmb()+1025<-kcfmdb()+54<-adbdrv()+63122<-opiexe()+18173<-opiosq0()+3993<-kpooprx()+274
<-kpoal8()+800<-opiodr()+910<-ttcpip()+2289<-opitsk()+1670
----- End of Abridged Call Stack Trace -----

*** 2012-03-27 13:41:57.141
USER (ospid: 7764): terminating the instance
ksuitm: waiting up to [5] seconds before killing DIAG(7652)

如上红色字段可以看到,是控制文件中序列号不匹配造成控制文件一致性验证损坏,而无法正常mount数据库。

这样问题就明了了,可以修改或重建控制文件方式来打开数据库。

更多Oracle相关信息见Oracle 专题页面 http://www.bkjia.com/topicnews.aspx?tid=12

相关内容