Exadata Complete Outage Due to RDS Misconfiguration.

Environment: Exadata Full Rack (8 DB nodes + 14 cells).

All DB nodes and cells were up, but CRS failed to start and threw the following errors.

GPNP:1239021056: clsgpnp_Term: [at clsgpnp0.c:1347] GPnP cli=clsuGpnpg
SKGFD:1239021056: ERROR: -8(OS Error -1 (open,sskgxplp,Invalid protocol requested (2) or protocol not loaded.,Error 0))
SKGFD:1239021056: ERROR: -10(OSS Operation oss_initialize failed with error 4 [Network initialization failed]

ocssd call stack

CRS-8503 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/grid/diag/crs/myexadbadm01/crs/incident/incdir_43/

[OCSSD(32372)]CRS-8503: Oracle Clusterware OCSSD process with operating system process ID 32372 experienced fatal signal or exception code 6

CSSD:1755981568: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization

The CLSFA initialization message points to an OS misconfiguration during initialization of the CRS stack.
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorDirect [diag_dde]
[03]: clsdAdrPostError []<– Signaling
[04]: clsbSigErrCB []
[05]: skgesig_sigactionHandler []

A complete health check later showed that rds-ping did not work on any of the nodes. That led me to check the RDS driver, where I found this:

# modprobe rds_rdma

FATAL: Error inserting rds_rdma (/lib/modules/2.6.39-400.281.1.el6uek.x86_64/kernel/net/rds/rds_rdma.ko): Unknown symbol in module, or unknown parameter (see dmesg)

So the real problem on the OS side is that the rds_rdma kernel module cannot be loaded, because of the error above.
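The rds-ping check that first exposed the problem can be repeated across all private InfiniBand addresses with a small loop. This is only a sketch: the function name `check_rds_peers` and the `RDS_PING` override variable are hypothetical, and it assumes `rds-ping` accepts `-c <count>` followed by the remote address.

```shell
#!/bin/bash
# Hypothetical helper: run rds-ping against each address given as an argument
# and report which ones fail. RDS_PING can be set to another command so the
# loop can be exercised on a machine without InfiniBand (an assumption made
# for testing, not part of rds-ping itself).
check_rds_peers() {
    local ping_cmd="${RDS_PING:-rds-ping}"
    local failed=0 ip
    for ip in "$@"; do
        if "$ping_cmd" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    # Non-zero return if at least one peer did not answer.
    return "$failed"
}
```

In an outage like this one, every address would be expected to show FAIL, since the RDS modules are not loaded on any node.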

The OS logs confirmed the same:

kernel: rds_rdma: Unknown symbol rds_send_get_message (err 0)
kernel: rds_rdma: Unknown symbol rds_for_each_conn_info (err 0)
kernel: rds_rdma: Unknown symbol rds_message_add_rdma_dest_extension (err 0)
kernel: rds_rdma: Unknown symbol rds_wq (err 0)
kernel: rds_rdma: Unknown symbol rds_atomic_send_complete (err 0)
kernel: rds_rdma: Unknown symbol rds_conn_connect_if_down (err 0)
kernel: rds_rdma: Unknown symbol rds_conn_destroy (err 0)
kernel: rds_rdma: Unknown symbol rds_rdma_send_complete (err 0)
kernel: rds_rdma: Unknown symbol rds_send_drop_acked (err 0)
kernel: rds_rdma: Unknown symbol rds_send_xmit (err 0)
kernel: rds_rdma: Unknown symbol rds_stats_info_copy (err 0)
kernel: rds_rdma: Unknown symbol rds_inc_put (err 0)
kernel: rds_rdma: Unknown symbol rds_message_add_extension (err 0)
kernel: rds_rdma: Unknown symbol rds_info_register_func (err 0)
kernel: rds_rdma: Unknown symbol rds_page_remainder_alloc (err 0)
kernel: rds_rdma: Unknown symbol rds_inc_init (err 0)
kernel: rds_rdma: Unknown symbol rds_recv_incoming (err 0)
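Every unresolved symbol above starts with rds_, which hints that the base rds module is the missing piece. A rough way to pull that list out of a log is sketched below; the function name `unknown_symbols` is hypothetical, and the log path in the usage comment is an assumption about the syslog setup.

```shell
#!/bin/bash
# Hypothetical helper: list the unresolved symbols reported for one module.
# Reads kernel log text on stdin; the module name is the first argument.
unknown_symbols() {
    local module="$1"
    # Lines look like: "kernel: rds_rdma: Unknown symbol rds_wq (err 0)"
    sed -n "s/.*${module}: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p" | sort -u
}

# Possible use on an affected node (log path is an assumption):
#   unknown_symbols rds_rdma < /var/log/messages
# The modules that rds_rdma depends on can be listed with:
#   modinfo -F depends rds_rdma
```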

Each Exadata node normally has the following configuration file:

/etc/modprobe.d/network.conf, with these contents:

install vfat /bin/true
options ipv6 disable=1
install rds /bin/true <<<<<<<<<<<<<<<<<<<

rds_rdma depends on rds, and the highlighted line explicitly prevents the rds module from loading. The effect is similar to blacklisting the module, but it takes much higher precedence. Because rds never loads, the symbols it exports are missing, which is why rds_rdma fails with the "Unknown symbol" errors.

(The line actually means, “when you want to install rds, invoke this program instead of the default insmod”.)
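To spot this kind of override on other nodes, the modprobe configuration directory can be searched for active `install` lines. A minimal sketch, assuming the overrides live in `*.conf` files under `/etc/modprobe.d`; the function name `find_install_overrides` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print every active "install <module> ..." override
# found in a modprobe configuration directory (default /etc/modprobe.d).
find_install_overrides() {
    local module="$1"
    local conf_dir="${2:-/etc/modprobe.d}"
    # Only uncommented lines match; -H and -n prefix each hit with the
    # file name and line number.
    grep -HnE "^[[:space:]]*install[[:space:]]+${module}([[:space:]]|\$)" \
        "$conf_dir"/*.conf 2>/dev/null
}

# Possible use:
#   find_install_overrides rds
```

Any hit whose command is /bin/true means modprobe will report success for that module without actually inserting it.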

This isn’t a bug; it is the result of a customization in this environment.

Solution:

Commenting out the following line restored RDS communication, and the CRS stack came up without any issue.

# install rds /bin/true
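The change can also be scripted so that it is applied the same way on every node. The sketch below is one possible approach, not the exact procedure used here: the function name `comment_out_install` is hypothetical, it relies on GNU sed, and the follow-up commands in the comments are assumptions about a typical verification sequence.

```shell
#!/bin/bash
# Hypothetical helper: comment out "install <module> ..." lines in one file,
# keeping a .bak copy of the original next to it (GNU sed assumed).
comment_out_install() {
    local module="$1"
    local conf_file="$2"
    sed -i.bak -E \
        "s/^([[:space:]]*install[[:space:]]+${module}([[:space:]].*)?)\$/# \1/" \
        "$conf_file"
}

# Assumed follow-up on each affected node, run as root:
#   comment_out_install rds /etc/modprobe.d/network.conf
#   modprobe rds_rdma      # should no longer fail with "Unknown symbol"
#   lsmod | grep rds       # confirm rds and rds_rdma are loaded
#   crsctl start crs       # then bring the CRS stack back up
```

Keeping the backup file makes it easy to see exactly what was changed if the customization has to be revisited later.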

 

 

 

Author: jee

Oracle Engineered Guy, works for Oracle Corp. Techno-addict for Exadata, SuperCluster, ODA, RAC, ASM, HA. Fusion Music Enthusiast, Son, Husband, Father, Information-news-Freak, Optimist, Humanist. Interests: Technology, Innovation, Sharing interesting content. Views expressed on this account are my own and don't necessarily reflect the views of Oracle & its affiliates.