Root Cause for High Disk Utilization on Exadata

Exadata X2 quarter rack
Image : 12.1.2.3.3
GI + DB : 11.2.0.4.161018

1. The filesystem was growing heavily, and the following processes & files were responsible for the continuous growth.

-bash-3.2$ cat lsof | grep /var/log/cellos/qd.log

tgtd        2069        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
tgtd        2071        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
iscsiuio    3513        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
iscsid      3525        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
iscsid      3526        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
multipath   4118        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
sleep      96506        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sleep      97538        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sleep     101866        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sh        135003        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sh        332657        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sh        333478        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
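The "(deleted)" entries above are the key: the log file was removed from the directory tree, but the listed processes still hold open file descriptors to it, so the filesystem cannot release the space. A minimal sketch (generic Linux, not Exadata-specific) of how that situation arises and how the space can be reclaimed without restarting the holding process, by truncating through /proc:

```shell
# Create a file, keep fd 4 open on it, then delete it -- the same
# "open but deleted" situation shown in the lsof output above.
tmp=$(mktemp)
exec 4>"$tmp"
rm -f "$tmp"                          # directory entry gone, space still held

# dd inherits fd 4, so /proc/self/fd/4 inside dd is the deleted file:
dd if=/dev/zero of=/proc/self/fd/4 bs=1024 count=100 2>/dev/null
held=$(wc -c < "/proc/$$/fd/4")       # 102400 bytes held by a deleted file

ls -l "/proc/$$/fd/4"                 # link target shows "... (deleted)"

: > "/proc/$$/fd/4"                   # truncate through /proc: space reclaimed
remaining=$(wc -c < "/proc/$$/fd/4")
echo "bytes still held after truncate: $remaining"
exec 4>&-                             # close the descriptor
```

This is why a growing deleted log does not show up in `du` but does show up in `df`, and why the space only comes back when the holders close the file (or the file is truncated as above).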

Process descriptions:

tgtd / tgtadm – tgtd is the Linux SCSI target daemon; tgtadm is used to monitor and modify everything about the Linux SCSI target software: targets, volumes, etc.
iscsiuio – the userspace I/O driver used by the iSCSI stack.
iscsid – establishes and manages iSCSI connections.
multipathd – provides device-mapper multipath functionality.

These services are ON:

[root@myexadb01 sysconfig]# chkconfig --list multipathd
multipathd     0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@myexadb01 sysconfig]# chkconfig --list iscsid
iscsid         0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@ersdrdb01 sysconfig]# chkconfig --list tgtd
tgtd           0:off 1:off 2:on 3:on 4:on 5:on 6:off
File descriptions:

/var/log/cellos/qd.log
/var/log/cellos/qd.trc

Here "qd" refers to the Quorum Disk manager daemon/process.

+ Explanation of this logging: the Quorum Disk Manager utility was introduced in Oracle Exadata release 12.1.2.3.0. It enables you to create iSCSI quorum disks on two of the database nodes and store voting files on those quorum disks. This allows you to configure a high-redundancy disk group in ASM with only three storage cells: high redundancy requires five failure groups for the voting files, and the two quorum disks supply the fourth and fifth.

+ The services/processes mentioned above are required to support this feature.

+ During the WebEx session, I noticed that all of the above services were running on the problem node.

[root@ersdrdb01 cellos]# service multipathd status
multipathd (pid  4118) is running...
[root@ersdrdb01 cellos]# service iscsid status
iscsid (pid  3526) is running...
[root@ersdrdb01 cellos]# service tgtd status
tgtd (pid 2071 2069) is running...

+ I also verified that the quorum disk environment file is present on this node:

[root@myexadb01 cellos]# more /opt/oracle.cellos/quorumdisk.env

export EXD_SCSI_TARGET_PREFIX="iqn.2015-05.com.oracle:"
export EXD_VOLUME_GROUP_PATH="/dev/VGExaDb/"
export EXD_VOLUME_PREFIX="LVDbVd"
export EXD_IFACE_PREFIX="exadata_"
export EXD_UDEV_RULE_DIR=/etc/udev/rules.d/
export EXD_ISCSI_NAME_RULE_PATH=$EXD_UDEV_RULE_DIR/98-exadata-openiscsi.rules
export EXD_DEVICE_OWNER_RULE_PATH=$EXD_UDEV_RULE_DIR/99-exadata-asmdevices.rules
export EXD_DEVICE_DIR="/dev/exadata_quorum/"

[root@myexadb01 cellos]#

+ However, this configuration is not enabled and is not currently in use:

[root@myexadb01 cellos]# /opt/oracle.SupportTools/quorumdiskmgr --list --config

[Failure] Failed to list config because configuration doesn't exist

No quorum disk is found at the OS level:

[root@myexadb01 cellos]# ls -l /dev/exadata_quorum
ls: cannot access /dev/exadata_quorum: No such file or directory
[root@myexadb01 cellos]#

No quorum disk is found at the ASM level:

SQL> select label, path from v$asm_disk where path like '/dev%';
no rows selected
SQL>

+ Further, when the incident happened, some of these services were terminated at the OS level and kept respawning:

/var/log/messages

Jan 24 14:42:35 myexadb01 init: exadata-multipathmon main process (5835) terminated with status 1
Jan 24 14:42:35 myexadb01 init: exadata-multipathmon main process ended, respawning
Jan 24 14:43:05 myexadb01 init: exadata-iscsimon main process (5827) terminated with status 1
Jan 24 14:43:05 myexadb01 init: exadata-iscsimon main process ended, respawning

+ This caused the quorum disk monitor daemons to keep verifying the status of all the above services continuously; for some reason, these checks had been failing before 2017-01-24 14:42:35.

In the logs below, check all entries containing "[CMD: service" -> all services are reported as up & running.

[1485247355][2017-01-24 14:42:35 +0600][TRACE][/opt/oracle.cellos/imageLogger - 990][imageLogger_init][]
Log Path: /var/log/cellos
Log file: qd.log
Trace File: qd.trc
SILENT MODE
[1485247355][2017-01-24 14:42:35 +0600][INFO][0-0][/dev/fd/9 - 31][main][]  BEGIN: Arguments
[1485247355][2017-01-24 14:42:35 +0600][INFO][0-0][/dev/fd/9 - 52][main][]  multipathd monitor started
[1485247355][2017-01-24 14:42:35 +0600][CMD][/dev/fd/9 - 55][main][]  [CMD: which multipath || true] [CMD_STATUS: 0]
----- START STDOUT -----
/sbin/multipath
----- END STDOUT -----
[1485247355][2017-01-24 14:42:35 +0600][CMD][/dev/fd/9 - 63][main][]  [CMD: service multipathd status || multipathd_status=1; true] [CMD_STATUS: 0]
----- START STDOUT -----
multipathd (pid  4118) is running...
----- END STDOUT -----
[1485247385][2017-01-24 14:43:05 +0600][TRACE][/opt/oracle.cellos/imageLogger - 990][imageLogger_init][]
Log Path: /var/log/cellos
Log file: qd.log
Trace File: qd.trc
SILENT MODE
[1485247385][2017-01-24 14:43:05 +0600][INFO][0-0][/dev/fd/9 - 30][main][]  BEGIN: Arguments
[1485247385][2017-01-24 14:43:05 +0600][INFO][0-0][/dev/fd/9 - 51][main][]  iscsid monitor started
[1485247385][2017-01-24 14:43:05 +0600][CMD][/dev/fd/9 - 54][main][]  [CMD: which iscsiadm || true] [CMD_STATUS: 0]
----- START STDOUT -----
/sbin/iscsiadm
----- END STDOUT -----
[1485247385][2017-01-24 14:43:05 +0600][CMD][/dev/fd/9 - 62][main][]  [CMD: service iscsid status || iscsid_status=1; true] [CMD_STATUS: 0]
----- START STDOUT -----
iscsid (pid  3526) is running...
----- END STDOUT -----
[1485247355][2017-01-24 14:47:35 +0600][CMD][/dev/fd/9 - 73][main][]  [CMD: sleep 300 || true] [CMD_STATUS: 0]
[1485247355][2017-01-24 14:52:35 +0600][CMD][/dev/fd/9 - 55][main][]  [CMD: which multipath || true] [CMD_STATUS: 0]
----- START STDOUT -----
/sbin/multipath
----- END STDOUT -----
[1485247355][2017-01-24 14:52:35 +0600][CMD][/dev/fd/9 - 63][main][]  [CMD: service multipathd status || multipathd_status=1; true] [CMD_STATUS: 0]
----- START STDOUT -----
multipathd (pid  4118) is running...
----- END STDOUT -----
[1485247385][2017-01-24 14:48:05 +0600][CMD][/dev/fd/9 - 78][main][]  [CMD: sleep 300 || true] [CMD_STATUS: 0]
[1485247385][2017-01-24 14:53:05 +0600][CMD][/dev/fd/9 - 54][main][]  [CMD: which iscsiadm || true] [CMD_STATUS: 0]
----- START STDOUT -----
/sbin/iscsiadm
----- END STDOUT -----
[1485247385][2017-01-24 14:53:06 +0600][CMD][/dev/fd/9 - 62][main][]  [CMD: service iscsid status || iscsid_status=1; true] [CMD_STATUS: 0]

[1485291550][2017-01-25 02:59:10 +0600][CMD][/dev/fd/9 - 60][main][]  [CMD: service tgtd status || tgtd_status=1; true] [CMD_STATUS: 0]
----- START STDOUT -----
tgtd (pid 2071 2069) is running...
----- END STDOUT -----
[1485291550][2017-01-25 02:59:10 +0600][CMD][/dev/fd/9 - 67][main][]  [CMD: sleep 30 || true] [CMD_STATUS: 0]
[1485291550][2017-01-25 02:59:40 +0600][CMD][/dev/fd/9 - 52][main][]  [CMD: which tgtadm || true] [CMD_STATUS: 0]
----- START STDOUT -----
/usr/sbin/tgtadm
----- END STDOUT -----
[1485291550][2017-01-25 02:59:40 +0600][CMD][/dev/fd/9 - 60][main][]  [CMD: service tgtd status || tgtd_status=1; true] [CMD_STATUS: 0]
----- START STDOUT -----
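Reading the CMD entries, each monitor appears to be a simple loop: locate the binary with which, check the service with "service ... status", log the result, sleep, then repeat, and every pass appends to qd.log. A rough, hedged reconstruction of one such pass (the real scripts live under /opt/oracle.cellos and certainly differ in detail; monitor_once, the ./qd.log path, and the injectable service command are illustrative stand-ins so the sketch runs anywhere):

```shell
LOG=./qd.log                          # stand-in for /var/log/cellos/qd.log

# One monitor pass, modeled on the qd.log CMD entries above.
monitor_once() {
    svc=$1                            # e.g. multipathd, iscsid, tgtd
    service_cmd=$2                    # injectable so the sketch is runnable
    status=0
    "$service_cmd" "$svc" status >/dev/null 2>&1 || status=1
    printf '[%s][CMD: service %s status] [CMD_STATUS: %s]\n' \
        "$(date +%s)" "$svc" "$status" >> "$LOG"
}

# The real daemons would loop forever with a sleep between passes, e.g.:
#   while true; do monitor_once multipathd service; sleep 300; done
# Simulate two passes, one "up" and one "down", using true/false:
: > "$LOG"
monitor_once multipathd true          # pretend the service check succeeded
monitor_once multipathd false         # pretend the service check failed
lines=$(wc -l < "$LOG")
echo "qd.log grew to $lines lines"    # one line appended per pass
```

The important point is the last one: the log grows on every pass, succeed or fail, so a monitor stuck respawning and re-checking keeps writing, and since the file had already been deleted, the writes consumed space invisibly.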

+ Conclusion: all of these services were terminated before 2017-01-24 14:42:35 and kept failing for some time afterwards, so this log kept growing, which led to the high disk consumption. In addition, because the processes listed earlier still held the deleted qd.log open, the space it occupied could not be released by the filesystem.

+ Action plan:

Since all services are currently up & running without any issue, these logs are not growing much now, and hence there is no further FS growth.

If you are not using the quorum disk configuration, please delete it so that these daemons will not keep checking and verifying the service status whenever an OS issue happens. Refer to the following document for removing the quorum disk configuration:

https://docs.oracle.com/cd/E50790_01/doc/doc.121/e51951/db_server.htm#CCHDIIGC

And stop these services:

service multipathd stop
service iscsid stop
service tgtd stop

To keep them from starting again after a reboot, also disable them with chkconfig (for example, chkconfig tgtd off).


Author: jee

Oracle Engineered Guy working for Oracle Corp. Techno-addict for Exadata, SuperCluster, ODA, RAC, ASM, HA. Fusion music enthusiast, son, husband, father, information/news freak, optimist, humanist. Interests: technology, innovation, sharing interesting content. Views expressed on this account are my own and don't necessarily reflect the views of Oracle & its affiliates.