Friday, December 14, 2018

Datafile Block Corruption: "Completely zero block found during backing up datafile"


One fine day we found a lot of corruption messages in alert.log:

--------------------------------------------------------------
Hex dump of (file 951, block 3733504) in trace file /u01/app/oracle/diag/rdbms/spur/spur1/trace/spur1_ora_256147.trc

Corrupt block relative dba: 0xedf8f800 (file 951, block 3733504)
Completely zero block found during backing up datafile

Trying mirror side DATAC1_CD_03_MRCELADM05.
Reread of blocknum=3733504, file=+DATAC1/SPUR/DATAFILE/forts_java_data.984.962454999. found same corrupt data
Reread of blocknum=3733504, file=+DATAC1/SPUR/DATAFILE/forts_java_data.984.962454999. found valid data
--------------------------------------------------------------
Hex dump of (file 951, block 3733505) in trace file /u01/app/oracle/diag/rdbms/spur/spur1/trace/spur1_ora_256147.trc

Corrupt block relative dba: 0xedf8f801 (file 951, block 3733505)
Completely zero block found during backing up datafile

Trying mirror side DATAC1_CD_03_MRCELADM05.
Reread of blocknum=3733505, file=+DATAC1/SPUR/DATAFILE/forts_java_data.984.962454999. found same corrupt data
Reread of blocknum=3733505, file=+DATAC1/SPUR/DATAFILE/forts_java_data.984.962454999. found valid data
--------------------------------------------------------------
Hex dump of (file 951, block 3733506) in trace file /u01/app/oracle/diag/rdbms/spur/spur1/trace/spur1_ora_256147.trc

Corrupt block relative dba: 0xedf8f802 (file 951, block 3733506)
Completely zero block found during backing up datafile

Trying mirror side DATAC1_CD_03_MRCELADM05.
Reread of blocknum=3733506, file=+DATAC1/SPUR/DATAFILE/forts_java_data.984.962454999. found same corrupt data
Reread of blocknum=3733506, file=+DATAC1/SPUR/DATAFILE/forts_java_data.984.962454999. found valid data
--------------------------------------------------------------
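As a cross-check on the log lines above, the relative dba can be decoded by hand. The sketch below assumes a smallfile tablespace, where the top 10 bits of the value hold the relative file number and the low 22 bits hold the block number. The function name decode_rdba is made up for this illustration. Note that the relative file number can differ from the absolute file number in databases with very many datafiles; here both happen to be 951.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a relative dba into file and block numbers.
# Assumes a smallfile tablespace (10-bit relative file#, 22-bit block#).
decode_rdba() {
    local rdba=$(( $1 ))                 # bash arithmetic accepts 0x-prefixed hex
    local file=$(( rdba >> 22 ))         # top 10 bits: relative file number
    local block=$(( rdba & 0x3FFFFF ))   # low 22 bits: block number
    echo "file $file block $block"
}

# Usage:
#   decode_rdba 0xedf8f800
```

For 0xedf8f800 this gives file 951 and block 3733504, matching what alert.log prints in parentheses.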

These messages raised several questions:
·         If these messages report block corruption, where is ORA-01578? Why is it absent?
·         Is this a hardware error or not? Is it a bad disk, bad flash, or a crashing offload server? We ran the cell checks, and all disks and flash devices are OK. We rebooted the cell, but the corruptions kept appearing.
·         select * from v$database_block_corruption returns "no rows selected".
·         We ran RMAN> validate database and it found 0 corruptions.
·         We also tried RMAN> validate database check logical, and RMAN again found 0 errors.
·         We installed 18.1.9 (the new Exadata image), but the corruptions kept appearing.


Analysis of alert.log shows 16 affected files:

 $ cat alert_spur1.log|grep "Corrupt block relative dba"|awk '{print $6,$7}'|sort|uniq
(file 303,
(file 360,
(file 374,
(file 375,
(file 377,
(file 379,
(file 799,
(file 800,
(file 826,
(file 863,
(file 937,
(file 938,
(file 951,
(file 952,
(file 953,
(file 962,
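The one-liner above prints awk's raw fields, so the output keeps the "(" and "," characters and does not show how many blocks are affected in each file. A minimal sketch of a variant that prints the file number and the count of distinct corrupt blocks is given below. The function name corrupt_blocks_per_file is made up for this illustration, and the sketch assumes the alert.log line format shown earlier in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct corrupt blocks per datafile.
# Assumes lines of the form:
#   Corrupt block relative dba: 0x... (file N, block M)
corrupt_blocks_per_file() {
    grep 'Corrupt block relative dba' "$1" |
        # Keep only "N M" (file number, block number) from each line.
        sed -E 's/.*\(file ([0-9]+), block ([0-9]+)\).*/\1 \2/' |
        # Drop repeated reports of the same block (plain -u, not -n -u,
        # so that different blocks of one file are not merged).
        sort -u |
        # Count the remaining blocks per file number.
        awk '{ count[$1]++ } END { for (f in count) print f, count[f] }' |
        sort -n
}

# Usage:
#   corrupt_blocks_per_file alert_spur1.log
```

Each output line is a file number followed by the number of distinct blocks reported as corrupt in that file.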

We opened an SR, but Oracle recommended checking the hardware: Physical Corrupted Blocks consisting of all Zeroes indicate a problem with OS, HW or Storage (Doc ID 1545366.1)

For clarity:
This is a virtualized Exadata with 4 DB nodes, 7 X7-2 cells and 9 VMs. Each VM is spread over all cells and all 84 disks, and each disk is shared by all 9 VMs.
Only one VM is affected by the corruption issue. Should we suspect a software issue?

The reason:
The zeroed blocks were caused by corruption in the ASM metadata. As I understand it, when the database requests a block from the datafile, ASM points to an empty area of the disk.


Solution:
Drop the affected datafiles and copy them from the other Data Guard instance.

After the datafiles were dropped and recreated, the corruption disappeared.
