Exadata Storage Snapshots

Reblogged from My Oracle Notes.

This post describes how to implement Oracle Database snapshot technology on an Exadata machine.

Because the Exadata storage cell smart features, Storage Indexes, IORM and Network Resource Manager work only at the level of the ASM volume manager (and not on top of the ACFS cluster file system), the implementation of the snapshot technology is different from any non-Exadata environment.

For this purpose Oracle developed a new type of ASM disk group called a SPARSE disk group. It uses ASM sparse grid disks, based on thin provisioning, to store the database snapshot copies and the associated metadata, and it supports both non-CDB and PDB snapshot copies.

The implementation requires the following minimum software versions:

  • Exadata Storage Software version 12.1.2.1.0.
  • Oracle Database version 12.1.0.2 with bundle patch 5.
One major restriction applies to Exadata storage snapshots compared to ACFS:
the source database must be a shared copy opened read-only…
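
For illustration only, here is a minimal sketch of the sparse setup described above (names, sizes and redundancy below are placeholders; follow the original post and the Exadata documentation for the supported procedure):

# On each storage cell, create thin-provisioned (sparse) grid disks
# (size = physical space reserved, virtualsize = size exposed to ASM):
cellcli -e "create griddisk all harddisk prefix=SPARSE, size=56G, virtualsize=560G"

# On a database node, as the Grid user, create the sparse ASM disk group:
sqlplus / as sysasm <<'SQL'
CREATE DISKGROUP SPARSE NORMAL REDUNDANCY
  DISK 'o/*/SPARSE_*'
  ATTRIBUTE 'compatible.asm'          = '12.1.0.2',
            'compatible.rdbms'        = '12.1.0.2',
            'cell.smart_scan_capable' = 'TRUE',
            'cell.sparse_dg'          = 'allsparse',
            'au_size'                 = '4M';
SQL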


Megacli Command fails on X5 Extreme Flash

# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0
User specified controller is not present.
Failed to get CpController object.

Exit Code: 0x01

The MegaCli commands do not work because the X5 Extreme Flash storage servers use NVMe flash cards instead of hard disks behind a RAID controller, so there is no disk controller to query.

You can verify this with the storcli64 command as well:

# /opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show
Controller = 0
Status = Failure
Description = Controller 0 not found
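
To confirm the hardware, you can look for the NVMe devices directly. A minimal sketch (exact output varies by system, and the cellcli line applies on storage cells only):

# NVMe block devices and PCIe controllers:
ls /dev/nvme*
lspci | grep -i nvme

# On a storage cell, the physical disks report a flash disk type:
cellcli -e list physicaldisk attributes name, diskType, status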

Root Cause for High Disk Utilization on Exadata

Exadata X2 Quarter Rack
Image : 12.1.2.3.3
GI + DB : 11.2.0.4.161018

1. The filesystem was growing heavily, and the following processes and files were responsible for the continuous growth.

-bash-3.2$ cat lsof | grep /var/log/cellos/qd.log

tgtd        2069        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
tgtd        2071        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
iscsiuio    3513        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
iscsid      3525        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
iscsid      3526        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
multipath   4118        0    4w      REG              252,0   34524433    1441871 /var/log/cellos/qd.log (deleted)
sleep      96506        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sleep      97538        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sleep     101866        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sh        135003        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sh        332657        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)
sh        333478        0    4w      REG              252,0   11322945    1443121 /var/log/cellos/qd.log (deleted)

Process description:

tgtd/tgtadm – tgtadm is used to monitor and modify everything about the Linux SCSI target software (tgtd): targets, volumes, etc.
iscsiuio – the userspace I/O driver for iSCSI
iscsid – establishes iSCSI connections
multipath – provides multipath functionality

These services are on:
[root@myexadb01 sysconfig]# chkconfig --list multipathd
multipathd     0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@myexadb01 sysconfig]#
[root@myexadb01 sysconfig]# chkconfig --list iscsid
iscsid         0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@myexadb01 sysconfig]#
[root@ersdrdb01 sysconfig]# chkconfig --list tgtd
tgtd           0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@ersdrdb01 sysconfig]#
File descriptions:

/var/log/cellos/qd.log
/var/log/cellos/qd.trc

Here "qd" refers to the Quorum Disk Manager daemon/process.

+ Explanation of this logging: the Quorum Disk Manager utility was introduced in Oracle Exadata release 12.1.2.3.0. The utility enables you to create an iSCSI quorum disk on two of the database nodes and store a voting file on each of those quorum disks. This allows you to configure a high redundancy disk group in ASM even with only three storage cells.

+ The above-mentioned services/processes are required to support this feature.

+ During the WebEx session, I noticed that all of the above-mentioned services were running on the problem node.

[root@ersdrdb01 cellos]# service multipathd status
multipathd (pid  4118) is running…
[root@ersdrdb01 cellos]# service iscsid status
iscsid (pid  3526) is running…
[root@ersdrdb01 cellos]# service tgtd status
tgtd (pid 2071 2069) is running…

+ Also verified that the quorum disk environment file is present on this node:

[root@myexadb01 cellos]# more /opt/oracle.cellos/quorumdisk.env

export EXD_SCSI_TARGET_PREFIX="iqn.2015-05.com.oracle:"
export EXD_VOLUME_GROUP_PATH="/dev/VGExaDb/"
export EXD_VOLUME_PREFIX="LVDbVd"
export EXD_IFACE_PREFIX="exadata_"
export EXD_UDEV_RULE_DIR=/etc/udev/rules.d/
export EXD_ISCSI_NAME_RULE_PATH=$EXD_UDEV_RULE_DIR/98-exadata-openiscsi.rules
export EXD_DEVICE_OWNER_RULE_PATH=$EXD_UDEV_RULE_DIR/99-exadata-asmdevices.rules
export EXD_DEVICE_DIR="/dev/exadata_quorum/"

[root@myexadb01 cellos]#

+ But this configuration is not enabled and is not currently in use:

[root@myexadb01 cellos]# /opt/oracle.SupportTools/quorumdiskmgr --list --config

[Failure] Failed to list config because configuration doesn't exist

No quorum disk is found at the OS level:

[root@myexadb01 cellos]# ls -l /dev/exadata_quorum
ls: cannot access /dev/exadata_quorum: No such file or directory
[root@myexadb01 cellos]#

No quorum disk is found at the ASM level:

SQL> select label, path from v$asm_disk where path like '/dev%';
no rows selected
SQL>

+ Further, when the incident happened, some of these services were terminated at the OS level and kept respawning.

/var/log/messages

Jan 24 14:42:35 myexadb01 init: exadata-multipathmon main process (5835) terminated with status 1
Jan 24 14:42:35 myexadb01 init: exadata-multipathmon main process ended, respawning
Jan 24 14:43:05 myexadb01 init: exadata-iscsimon main process (5827) terminated with status 1
Jan 24 14:43:05 myexadb01 init: exadata-iscsimon main process ended, respawning

+ This caused the quorum disk daemon to verify the status of all the above services continuously, but for some reason it had failed to perform these checks before 2017-01-24 14:42:35.

In the logs below, check all the entries containing "[CMD: service"; all of them report the services as up and running.

[1485247355][2017-01-24 14:42:35 +0600][TRACE][/opt/oracle.cellos/imageLogger – 990][imageLogger_init][]
Log Path: /var/log/cellos
Log file: qd.log
Trace File: qd.trc
SILENT MODE
[1485247355][2017-01-24 14:42:35 +0600][INFO][0-0][/dev/fd/9 – 31][main][]  BEGIN: Arguments
[1485247355][2017-01-24 14:42:35 +0600][INFO][0-0][/dev/fd/9 – 52][main][]  multipathd monitor started
[1485247355][2017-01-24 14:42:35 +0600][CMD][/dev/fd/9 – 55][main][]  [CMD: which multipath || true] [CMD_STATUS: 0]
—– START STDOUT —–
/sbin/multipath
—– END STDOUT —–
[1485247355][2017-01-24 14:42:35 +0600][CMD][/dev/fd/9 – 63][main][]  [CMD: service multipathd status || multipathd_status=1;
true] [CMD_STATUS: 0]
—– START STDOUT —–
multipathd (pid  4118) is running…
—– END STDOUT —–
[1485247385][2017-01-24 14:43:05 +0600][TRACE][/opt/oracle.cellos/imageLogger – 990][imageLogger_init][]
Log Path: /var/log/cellos
Log file: qd.log
Trace File: qd.trc
SILENT MODE
[1485247385][2017-01-24 14:43:05 +0600][INFO][0-0][/dev/fd/9 – 30][main][]  BEGIN: Arguments
[1485247385][2017-01-24 14:43:05 +0600][INFO][0-0][/dev/fd/9 – 51][main][]  iscsid monitor started
[1485247385][2017-01-24 14:43:05 +0600][CMD][/dev/fd/9 – 54][main][]  [CMD: which iscsiadm || true] [CMD_STATUS: 0]
—– START STDOUT —–
/sbin/iscsiadm
—– END STDOUT —–
[1485247385][2017-01-24 14:43:05 +0600][CMD][/dev/fd/9 – 62][main][]  [CMD: service iscsid status || iscsid_status=1; true] [
CMD_STATUS: 0]
—– START STDOUT —–
iscsid (pid  3526) is running…
—– END STDOUT —–
[1485247355][2017-01-24 14:47:35 +0600][CMD][/dev/fd/9 – 73][main][]  [CMD: sleep 300 || true] [CMD_STATUS: 0]
[1485247355][2017-01-24 14:52:35 +0600][CMD][/dev/fd/9 – 55][main][]  [CMD: which multipath || true] [CMD_STATUS: 0]
—– START STDOUT —–
/sbin/multipath
—– END STDOUT —–
[1485247355][2017-01-24 14:52:35 +0600][CMD][/dev/fd/9 – 63][main][]  [CMD: service multipathd status || multipathd_status=1;
true] [CMD_STATUS: 0]
—– START STDOUT —–
multipathd (pid  4118) is running…
—– END STDOUT —–
[1485247385][2017-01-24 14:48:05 +0600][CMD][/dev/fd/9 – 78][main][]  [CMD: sleep 300 || true] [CMD_STATUS: 0]
[1485247385][2017-01-24 14:53:05 +0600][CMD][/dev/fd/9 – 54][main][]  [CMD: which iscsiadm || true] [CMD_STATUS: 0]
—– START STDOUT —–
/sbin/iscsiadm
—– END STDOUT —–
[1485247385][2017-01-24 14:53:06 +0600][CMD][/dev/fd/9 – 62][main][]  [CMD: service iscsid status || iscsid_status=1; true] [
CMD_STATUS: 0]

[1485291550][2017-01-25 02:59:10 +0600][CMD][/dev/fd/9 – 60][main][]  [CMD: service tgtd status || tgtd_status=1; true] [CMD_
STATUS: 0]
—– START STDOUT —–
tgtd (pid 2071 2069) is running…
—– END STDOUT —–
[1485291550][2017-01-25 02:59:10 +0600][CMD][/dev/fd/9 – 67][main][]  [CMD: sleep 30 || true] [CMD_STATUS: 0]
[1485291550][2017-01-25 02:59:40 +0600][CMD][/dev/fd/9 – 52][main][]  [CMD: which tgtadm || true] [CMD_STATUS: 0]
—– START STDOUT —–
/usr/sbin/tgtadm
—– END STDOUT —–
[1485291550][2017-01-25 02:59:40 +0600][CMD][/dev/fd/9 – 60][main][]  [CMD: service tgtd status || tgtd_status=1; true] [CMD_
STATUS: 0]
—– START STDOUT —–

+ Conclusion: these services had been terminated before 2017-01-24 14:42:35 and kept failing for some time, and during that period this log kept growing, which led to the high disk consumption.

+ Action Plan :

Since all the services are currently up and running without any issue, these logs are not growing much now and hence there is no further filesystem growth.

If you are not using the quorum disk configuration, delete it so that these daemons will not keep checking and verifying the status of the services whenever an OS issue happens. Refer to the following document for removing the quorum disk configuration:

https://docs.oracle.com/cd/E50790_01/doc/doc.121/e51951/db_server.htm#CCHDIIGC
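
As a rough sketch of the cleanup (the --delete options below belong to the quorumdiskmgr utility; verify the exact sequence and options against the document above for your image version), run on each database node that hosts a quorum disk:

# Check what is configured, then remove devices, targets and the configuration:
/opt/oracle.SupportTools/quorumdiskmgr --list --config
/opt/oracle.SupportTools/quorumdiskmgr --delete --device
/opt/oracle.SupportTools/quorumdiskmgr --delete --target
/opt/oracle.SupportTools/quorumdiskmgr --delete --config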

And stop these services:

service multipathd stop

service iscsid stop

service tgtd stop
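
Note that the space held by the deleted-but-still-open qd.log is only released once the holding processes stop or their open file descriptors are truncated. A rough sketch (the PID and fd number 4 come from the lsof output above and will differ on your system; the chkconfig lines are only for keeping the services off across reboots):

# List files that are deleted but still held open (link count < 1):
lsof +L1 | grep qd.log

# Optionally keep the services from starting again at boot:
chkconfig multipathd off
chkconfig iscsid off
chkconfig tgtd off

# Truncate the deleted log through the holding process's file descriptor
# to release the space immediately (replace <pid> with a PID from lsof):
> /proc/<pid>/fd/4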


Useful Queries on 12c CDB PDB Datapatch

I found some useful queries for checking datapatch status in a 12c CDB/PDB (you may already be aware of these).

In the CDB root:

SQL> ALTER SESSION SET container = cdb$root;

In a PDB:

SQL> alter session set container=pdb3;

SQL> select owner, directory_name, directory_path from dba_directories where directory_name like 'OPATCH%' order by 2;
SQL> set serverout on
SQL> exec dbms_qopatch.get_sqlpatch_status;

Inventory information:

SQL> set pagesize 0

SQL> set long 1000000

SQL> select xmltransform(dbms_qopatch.get_opatch_install_info, dbms_qopatch.get_opatch_xslt) "Home and Inventory" from dual;

Check whether a patch has been applied or not.

Let's check for the latest PSU.

SQL> select patch_id, patch_uid, version, status, description from dba_registry_sqlpatch where bundle_series = 'PSU';

SQL> select xmltransform(dbms_qopatch.is_patch_installed('21359755'), dbms_qopatch.get_opatch_xslt) "Patch installed?" from dual;

The equivalent of opatch lsinventory -detail …

SQL> select xmltransform(dbms_qopatch.get_opatch_lsinventory, dbms_qopatch.get_opatch_xslt) from dual;

set heading off long 50000 pages 9999 lines 180 trims on tab off
select xmltransform(dbms_qopatch.get_opatch_count, dbms_qopatch.get_opatch_xslt) from dual;
select xmltransform(dbms_qopatch.get_opatch_list, dbms_qopatch.get_opatch_xslt) from dual;
select xmltransform(dbms_qopatch.get_pending_activity, dbms_qopatch.get_opatch_xslt) from dual;

set serverout on
exec dbms_qopatch.get_sqlpatch_status;
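
For reference, the dba_registry_sqlpatch rows queried above are populated by datapatch itself. A minimal sketch of running it against a CDB (with the environment already set for the patched Oracle home); it applies the SQL changes to CDB$ROOT and all open PDBs:

# Run datapatch from the patched home; -verbose prints per-PDB progress.
cd $ORACLE_HOME/OPatch
./datapatch -verbose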

Hope this helps


Exadata Complete Outage Due to RDS Misconfiguration.

Exadata Full Rack (8 DB nodes + 14 cells).

All database and cell nodes were up, but CRS failed to start, throwing the following errors:

GPNP:1239021056: clsgpnp_Term: [at clsgpnp0.c:1347] GPnP cli=clsuGpnpg
SKGFD:1239021056: ERROR: -8(OS Error -1 (open,sskgxplp,Invalid protocol requested (2) or protocol not loaded.,Error 0))
SKGFD:1239021056: ERROR: -10(OSS Operation oss_initialize failed with error 4 [Network initialization failed]

ocssd call stack:

CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/myexadbadm01/crs/incident/incdir_43/

[OCSSD(32372)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 32372 experienced fatal signal or exception code 6

CSSD:1755981568: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization

CLSFA initialization – points to an OS misconfiguration during initialization of the CRS stack.
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorDirect [diag_dde]
[03]: clsdAdrPostError []<– Signaling
[04]: clsbSigErrCB []
[05]: skgesig_sigactionHandler []

Later, while running a complete health check, I noticed that rds-ping did not work on any of the nodes. That forced me to check the RDS driver, and I found this:

# modprobe rds_rdma

FATAL: Error inserting rds_rdma (/lib/modules/2.6.39-400.281.1.el6uek.x86_64/kernel/net/rds/rds_rdma.ko): Unknown symbol in module, or unknown parameter (see dmesg)

Basically, the real problem on the OS side is that the rds_rdma kernel module cannot be loaded, as shown by the error above.

The same is confirmed by the OS logs:

kernel: rds_rdma: Unknown symbol rds_send_get_message (err 0)
kernel: rds_rdma: Unknown symbol rds_for_each_conn_info (err 0)
kernel: rds_rdma: Unknown symbol rds_message_add_rdma_dest_extension (err 0)
kernel: rds_rdma: Unknown symbol rds_wq (err 0)
kernel: rds_rdma: Unknown symbol rds_atomic_send_complete (err 0)
kernel: rds_rdma: Unknown symbol rds_conn_connect_if_down (err 0)
kernel: rds_rdma: Unknown symbol rds_conn_destroy (err 0)
kernel: rds_rdma: Unknown symbol rds_rdma_send_complete (err 0)
kernel: rds_rdma: Unknown symbol rds_send_drop_acked (err 0)
kernel: rds_rdma: Unknown symbol rds_send_xmit (err 0)
kernel: rds_rdma: Unknown symbol rds_stats_info_copy (err 0)
kernel: rds_rdma: Unknown symbol rds_inc_put (err 0)
kernel: rds_rdma: Unknown symbol rds_message_add_extension (err 0)
kernel: rds_rdma: Unknown symbol rds_info_register_func (err 0)
kernel: rds_rdma: Unknown symbol rds_page_remainder_alloc (err 0)
kernel: rds_rdma: Unknown symbol rds_inc_init (err 0)
kernel: rds_rdma: Unknown symbol rds_recv_incoming (err 0)

Each Exadata node normally has the following configuration file:

/etc/modprobe.d/network.conf, whose contents are:

install vfat /bin/true
options ipv6 disable=1
install rds /bin/true <<<<<<<<<<<<<<<<<<<

rds_rdma depends on rds, and the highlighted entry was explicitly preventing the rds module from being loaded during kernel module operations. It is similar to blacklisting the module, but with much higher precedence.

(The line actually means: "when you want to install rds, invoke this program instead of the default insmod".)

This isn't a bug; it is something introduced by customized environments.

Solution :

Commenting out the following line restored RDS communication, and the CRS stack came up without any issue:

# install rds /bin/true
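
After commenting the line out, a quick verification sketch (the peer InfiniBand IP below is a placeholder):

# Reload the module and confirm it is present:
modprobe rds_rdma
lsmod | grep rds

# Verify RDS connectivity to another node over InfiniBand:
rds-ping -c 3 192.168.10.2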


ODA Patching from 12.1.2.6.0 to 12.1.2.7.0 failed.

+ ODA VM (virtualized platform) upgrade from 12.1.2.6.0 to 12.1.2.7.0

+ Node0 patched successfully, but node1 patching failed

+ The following did not help:
# oakcli update -patch 12.1.2.7.0 --clean
# oakcli update -patch 12.1.2.7.0 --server

# oakcli update -patch 12.1.2.7.0 --clean
# oakcli update -patch 12.1.2.7.0 --server --local

+ Later I checked the logs and found:

INFO: 2016-09-22 14:11:22: It may take upto 15 mins. Please wait…
ERROR: 2016-09-22 14:17:20: Unable to apply gi patch on the following Homes : /u01/app/12.1.0.2/grid
SUCCESS: 2016-09-22 14:17:20: Successfully started the Database consoles
SUCCESS: 2016-09-22 14:17:20: Successfully started the EM Agents
ERROR: 2016-09-22 14:17:21: Unable to apply the GRID patch
ERROR: 2016-09-22 14:17:21: Failed to patch server (grid) component
INFO: local patching code END
ERROR: 2016-09-22 14:17:21:  Unable to apply the patch on /u01/app/12.1.0.2/grid

File does not exist: /tmp/PatchSummary_1_20160922140336.xml at /opt/oracle/oak/pkgrepos/System/12.1.2.7.0/bin/pkg_install.pl line 2063
ERROR: Unable to apply the patch <2>

+ Verified all patch levels, applied patches and the rolling state; everything looked good.

+ Noticed that even though CRS was stopped, 2 active orarootagent.bin processes were still running.

+ Killing those processes allowed the ODA patching to proceed (see the sketch below).

+ One possible cause could be the following, but that was not the case here:

ODA: Grid Infrastructure (GI) Patching Fails On The Second Node (Doc ID 2118723.1)
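
A rough sketch of clearing the leftover agents before retrying the patch (the PIDs are placeholders taken from the ps output; make sure CRS really is down first):

# Confirm CRS is stopped, then look for stale agent processes:
crsctl check crs
ps -ef | grep orarootagent.bin | grep -v grep

# Kill the stale agents using the PIDs reported above:
kill -9 <pid1> <pid2>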

Recreate Cell Disk XML Files in Case of Duplicate/Incorrect Entries Reported by the MS Process

The Exadata grid disks were giving errors after the Exadata upgrade to 12.1.2.2.0.150917 on cell 06.

CellCLI> list griddisk

CELL-04633: PD-CD mapping has changed. Cell disk CD_09_x2ksc30cel06_duplicate_name (b76e17c2-385e-4fdd-8fb2-e88881ed5538)
used to be on physical disk null and LUN null, but now is on physical disk 20:9 (EXNDHX) and LUN 0_9.

CellCLI> list celldisk

CELL-04633: PD-CD mapping has changed. Cell disk CD_09_x2ksc30cel06_duplicate_name (b76e17c2-385e-4fdd-8fb2-e88881ed5538)
used to be on physical disk null and LUN null, but now is on physical disk 20:9 (EXNDHX) and LUN 0_9.

The problem is a logical mismatch that is pushed into the file cell_disk_config.xml.

Solution
=========

Recreate cell_disk_config.xml file.

1.1 Stop the services on the cell. Validate that the services can be stopped:
cellcli -e list griddisk attributes asmdeactivationoutcome
cellcli -e alter cell shutdown services all
1.2 Take a backup of the files by moving them to a different location:
# cd $OSSCONF
# mkdir origfiles
# mv cell_disk_config.xml* origfiles
# ls -l origfiles/*
There are up to 3 files: one master and two copies.
Example:
ls -l $OSSCONF/cell_disk_config.xml*
-rw-r--r-- 1 celladmin root 160120 Feb 13 14:01 /opt/oracle/cell11.2.3.3.0_LINUX.X64_131014.1/cellsrv/deploy/config/cell_disk_config.xml
-rw-r----- 1 celladmin root 160120 Feb 13 14:01 /opt/oracle/cell11.2.3.3.0_LINUX.X64_131014.1/cellsrv/deploy/config/cell_disk_config.xml_
-rw-r--r-- 1 root celladmin 160120 Feb 13 17:26 /opt/oracle/cell11.2.3.3.0_LINUX.X64_131014.1/cellsrv/deploy/config/cell_disk_config.xml__
2. Restart the services on the cell:
# cellcli -e alter cell startup services all

The issue was fixed after re-creating the XML files as above.
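
To confirm the PD-CD mapping is clean after the restart, a quick check (a sketch; the attributes used are standard CellCLI attributes):

# The CELL-04633 errors should be gone and the disks should report normal states:
cellcli -e list celldisk attributes name, status
cellcli -e list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome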