...
Type of Failure | Recovery |
---|---|
Disk Failure | Storage volumes are on RAID 6 redundant aggregates. They can tolerate up to 2 disk failures. |
NetApp Head Failure | The secondary head takes over as primary. The transition happens in seconds and is transparent to the application. |
VM Host failure | VMware moves the VM to a different host. This is a cold failover (the VM is restarted), but it happens in seconds. |
VM Host saturation | DRS moves the VM to a different VM host, transparently to the applications. |
VM network card failure | VM network connectivity is redundant and can tolerate a single failure. |
Storage network card failure | The storage connection to the storage subnet is completely redundant and handles a single failure transparently. |
Storage subnet failure | The storage subnet is a completely redundant switched network. |
Database instance failure | Recovery is manual: operators notice the failure in the monitoring tool and then call the on-call DBA. |
Database failure | Recovery is manual: operators notice the failure in the monitoring tool and then call the on-call DBA. |

Options to make database HA automatic:
a) RAC
RAC means we run a >3 node cluster of active-active Oracle instances that all mount the same database. RAC runs a heartbeat across the cluster and associates a virtual IP (VIP) with each node in the cluster. If any host or database instance in the cluster fails, the VIP of that node fails over to a surviving node. Because the VIP has failed over, clients that were connected to the failed node get an immediate rejection when they retry the connection, instead of waiting for a TCP timeout, and can then get a new connection from a surviving node right away. This failover happens within seconds. RAC can be implemented on the current infrastructure, on VMs over NFS with NetApp.
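The client-side effect of VIP failover can be illustrated with a minimal Python sketch. It is not how the Oracle client libraries implement connect-time failover; it only shows the behaviour described above: a rejected connection fails fast, so the client can move on to the next node at once. The host names are hypothetical, `connect_first_available` is an illustrative helper, and 1521 is assumed as the listener port.

```python
import socket
from typing import Any, Callable, List, Sequence, Tuple

Address = Tuple[str, int]


def connect_first_available(
    addresses: Sequence[Address],
    connect: Callable[[Address, float], Any] = socket.create_connection,
    timeout: float = 3.0,
) -> Tuple[Address, Any]:
    """Try each node address in order and return the first that accepts.

    A refused connection (the failed-over VIP case) raises immediately,
    so the loop proceeds to the next node without waiting for `timeout`.
    """
    errors: List[Tuple[Address, OSError]] = []
    for addr in addresses:
        try:
            return addr, connect(addr, timeout)
        except OSError as exc:  # covers refused connections and timeouts
            errors.append((addr, exc))
    raise ConnectionError(f"no cluster node reachable: {errors}")


# Hypothetical VIP names for a two-node example (assumption, not from the doc).
RAC_VIPS = [("rac-node1-vip", 1521), ("rac-node2-vip", 1521)]
```

Calling `connect_first_available(RAC_VIPS)` would attempt real TCP connections. The `connect` parameter exists so the fail-fast behaviour can be exercised with a stub instead of a live cluster.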
b) Data Guard with FSFO (Fast-Start Failover)
Data Guard is an active-passive setup in which a primary node ships its transaction logs to a standby that is in constant recovery. With an Active Data Guard license, the standby can be opened read-only. A third node, called the observer, keeps a heartbeat with both the primary and the standby. If the observer loses its connection to the primary but can still talk to the standby, it makes the standby the new primary. The IPs are not switched, however, so there is a TCP timeout involved when a client of the failed primary retries its connection. After the failover, if the old primary can still be contacted, it is turned into a standby database for the new primary.
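The observer's decision rule described above can be sketched as a small Python function. This is a simplified model of the paragraph, not Oracle's implementation: the names `Observation`, `Action` and `decide` are hypothetical, and `threshold_s` is an assumed grace period, loosely analogous to the broker's `FastStartFailoverThreshold` property.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    FAILOVER = "failover"


@dataclass
class Observation:
    primary_reachable: bool            # observer heartbeat to primary succeeded
    standby_reachable: bool            # observer heartbeat to standby succeeded
    seconds_since_primary_seen: float  # time since last good primary heartbeat


def decide(obs: Observation, threshold_s: float = 30.0) -> Action:
    """Return the action the observer would take for one observation."""
    if obs.primary_reachable:
        return Action.NONE
    if not obs.standby_reachable:
        # The observer itself may be isolated, so it must not act.
        return Action.NONE
    if obs.seconds_since_primary_seen < threshold_s:
        # Primary only just disappeared; wait out the grace period.
        return Action.NONE
    return Action.FAILOVER
```

The second check is the reason for running the observer on a third node: it fails over only when it can still reach the standby, so a network problem local to the observer does not trigger a role change.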