Oracle Redolog corruption on Azure

After updating one of our Oracle systems in Azure to Oracle Linux 7.9, we noticed that a new UEK kernel had been installed: UEKR6 (5.4.17-2102.202.5.el7uek.x86_64). That in itself is nothing new, as 7.6 already brought us the UEKR5 kernel in the past.

Since then we have been getting redo log corruption on our databases whenever they are under heavy load:
ORA-16038: log 2 sequence# 125 cannot be archived
ORA-00354: corrupt redo log block header
ORA-00312: online log 2 thread 1: '+DATA/TST/ONLINELOG/group_2.324.1079171683'
ORA-00312: online log 2 thread 1: '+DATA/TST/ONLINELOG/group_2.323.1079171685'

After some debugging and rolling back changes, we were able to pinpoint the problem to the kernel + ASM combination. We couldn’t reproduce the corruption on any other kernel version we tried; it only occurred on this particular version.

The specs of our system are:

  • Standard DS13 v2 (8 vCPUs, 56 GiB memory)
  • Oracle Linux 7.9, kernel 5.4.17-2102.202.5
  • ASM 19c (not using ASMLib or AFD)
  • RDBMS 19c

We tried a lot of things to see whether they were related to the problem, but the corruption still occurred after each of them:

  • Changed the size of the disks in Azure/ASM
  • Removed any sort of caching on the ASM disks in the Azure portal
  • Set caching in the Azure portal to read-only
  • Upgraded the ASM and database versions
  • Created a new database
  • Created diskgroups with a 4096 sector size instead of 512, as the disks have a 4096 physical and 512 logical sector size

The only things that solve the problem are:

  • Downgrade the UEK kernel
  • Use the Red Hat Compatible Kernel (RHCK)
  • Place the redo logs on a filesystem instead of ASM
  • Create 4K redo log files (added 09/08/2021, see the update below)

So it’s the combination of ASM, kernel and Azure that is causing the corruption.
This is currently still being investigated by Oracle and Azure.

PS. This issue looks a lot like a past issue with the Red Hat kernel which was solved in 2019: https://access.redhat.com/solutions/3114361

Update 09/08/2021

It seems the kernel + ASM combo has problems with the emulated mode of the Azure disks.

Azure uses Advanced Format for its disks, which means they all have a physical sector size of 4K. For backwards compatibility emulated mode is used, which means each physical sector is split into 8 logical sectors of 512 bytes.

Every logical sector that gets read or written then requires I/O on the whole physical sector, not just the logical one. For most database files that’s not an issue, as the block size is a multiple of 4K anyway. But the redo logs are written in 512-byte blocks here, so this can cause a lot of extra I/O.
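
To see what this looks like on a given system, the sector sizes ASM reports and the block size of the existing redo logs can be queried. This is only a minimal sketch: it assumes ASM 12.2 or later, where as far as I know V$ASM_DISKGROUP exposes a LOGICAL_SECTOR_SIZE column. The first query is meant for the ASM instance, the second for the database instance.

```sql
-- On the ASM instance: sector sizes per diskgroup
-- (assumes ASM 12.2+ for the LOGICAL_SECTOR_SIZE column).
SELECT name, sector_size, logical_sector_size
  FROM v$asm_diskgroup;

-- On the database instance: block size of each online redo log group.
SELECT group#, thread#, bytes/1024/1024 AS size_mb, blocksize, status
  FROM v$log
 ORDER BY group#;
```

A BLOCKSIZE of 512 in V$LOG on disks with a 4K physical sector would match the situation described above.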

The solution is to create the redo log files with a 4K block size. This has been possible since 11.2 but is seldom used.

  • Allow a 4K sector size even if the underlying OS reports an emulated 512-byte logical sector size
    SQL> ALTER SYSTEM SET "_DISK_SECTOR_SIZE_OVERRIDE"=TRUE;
  • Create 2 new redo log groups with a 4K block size (drop the old ones after switching over to the new groups)
    SQL> ALTER DATABASE ADD LOGFILE GROUP 3 ('+DATA','+DATA') SIZE 200M BLOCKSIZE 4096;
    SQL> ALTER DATABASE ADD LOGFILE GROUP 4 ('+DATA','+DATA') SIZE 200M BLOCKSIZE 4096;
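
The switch-over and cleanup mentioned above could look roughly like the sketch below. The group numbers 1 and 2 for the old 512-byte groups are an assumption for illustration; check V$LOG for the actual numbers on your system. A group can only be dropped once its status is INACTIVE (and it has been archived when running in archivelog mode), so the switch and checkpoint may need to be repeated.

```sql
-- Check block size and status of every group before touching anything.
SELECT group#, blocksize, archived, status
  FROM v$log
 ORDER BY group#;

-- Move the current log away from the old groups and flush dirty buffers
-- so the old groups can become INACTIVE. Repeat if needed.
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;

-- Drop the old groups (1 and 2 are assumed numbers) once they show
-- STATUS = 'INACTIVE' in the query above.
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
```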

Once we did this, the corruption no longer occurred, so I’ve added it to the list of fixes for the issue.
It is still an Oracle issue though, as older kernels and Red Hat kernels have no problem with it.

Update 27/08/2021

The problem seems to be solved in the latest UEKR6 kernel (5.4.17-2102.204.4.4).