Infrastructure setup:
The database is hosted on the virtual machine vm-iamprddb-01.its.yale.internal. Its storage is on NFS, served from the NetApp storage system wcnas03d. The VM runs Red Hat Enterprise Linux 6.4 with a single instance of Oracle 11.2.0.3.
The VM infrastructure is fully redundant, with no single point of failure. The storage system and the network infrastructure are also fully redundant. The Oracle instance is not redundant and is a single point of failure.
Storage HA:
The storage volumes reside in RAID 6 aggregates, i.e. they can survive two disk failures and keep functioning with reduced performance. The volumes are served to clients by a host called a head. There are two heads, a primary and a secondary. The primary serves the disks to the clients; if the primary head fails, the secondary takes over the primary's role for those disks and starts serving the data. This failover completes within a few seconds and is more or less transparent to the client. The storage heads are connected to the redundant storage subnet through redundant bonded Ethernet cards, i.e. the network connection between the storage heads and the storage subnet has no single point of failure.
A daily snapshot is scheduled on the production volumes. In case of a logical failure we can restore from the last snapshot and then recover, using archive logs, to any point in time between the snapshot time and the current time. We keep 8 days' worth of archive logs on the primary volume at all times.
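The recovery window described above can be sketched as a small check. This is only an illustration: the function name, its parameters and the example timestamps are hypothetical, and the only value taken from this document is the 8-day archive log retention.

```python
from datetime import datetime, timedelta

# From the text: 8 days of archive logs are kept on the primary volume.
ARCHIVE_LOG_RETENTION = timedelta(days=8)


def can_recover_to(target, last_snapshot, now, retention=ARCHIVE_LOG_RETENTION):
    """Return True if `target` can be reached by restoring `last_snapshot`
    and then applying archive logs (hypothetical helper, not a real tool)."""
    # The target must lie between the snapshot time and the current time.
    if not (last_snapshot <= target <= now):
        return False
    # The retained archive logs must reach back to the snapshot itself,
    # otherwise there is a gap between the snapshot and the oldest log.
    return now - last_snapshot <= retention


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0)
    snapshot = datetime(2024, 1, 10, 2, 0)
    print(can_recover_to(datetime(2024, 1, 10, 8, 0), snapshot, now))
```

With a daily snapshot the second condition is normally satisfied; it only matters if an older snapshot has to be used.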
Storage subnet:
The storage subnet is a fully redundant switched network, i.e. there is no single point of failure in the storage subnet.
VMware HA:
HA: VMware runs a heartbeat between all its hosts. When a host misses heartbeats for a specified amount of time or a specified number of times, VMware restarts all the VMs that were running on that host on different hosts in the cluster. This is a cold failover, since the VM is rebooted. The VM keeps its identity on the network, however, and all databases on it come up transparently. This happens within minutes.
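The heartbeat rule above can be sketched roughly as follows. This is a simplified illustration and not how VMware implements it: the class, the function, the host names and the limit of three missed heartbeats are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Counts consecutive missed heartbeats per host (hypothetical model)."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # assumed limit, not a VMware default
        self.missed = {}

    def record(self, host, heartbeat_received):
        """Record one heartbeat interval; return True if the host is now
        considered failed."""
        if heartbeat_received:
            self.missed[host] = 0
        else:
            self.missed[host] = self.missed.get(host, 0) + 1
        return self.missed[host] >= self.max_missed


def plan_failover(failed_host, placement, healthy_hosts):
    """Return a new VM -> host mapping in which every VM of the failed
    host is restarted on a healthy host, assigned round-robin."""
    if not healthy_hosts:
        raise ValueError("no healthy host available for failover")
    targets = sorted(healthy_hosts)
    new_placement = dict(placement)
    index = 0
    for vm in sorted(placement):
        if placement[vm] == failed_host:
            new_placement[vm] = targets[index % len(targets)]
            index += 1
    return new_placement


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_missed=3)
    placement = {"db": "esx1", "app": "esx2", "web": "esx1"}
    for _ in range(3):
        failed = monitor.record("esx1", heartbeat_received=False)
    if failed:
        print(plan_failover("esx1", placement, ["esx2", "esx3"]))
```

The VM names stay the same in the new mapping, which mirrors the point that a VM keeps its identity on the network after the restart.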
DRS: This mechanism tries to keep all hosts in the cluster equally busy. If a host crosses a utilization threshold, VMware automatically moves some VMs to different hosts. This move is completely online and transparent to the application.
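The threshold idea behind DRS can be sketched as below. This is a toy model under stated assumptions, not the actual DRS algorithm: the function name, the 0.8 threshold, and the load figures are hypothetical.

```python
def suggest_migration(host_load, vm_load, placement, threshold=0.8):
    """Suggest one VM move as (vm, source_host, target_host), or None.

    host_load: host -> utilization between 0 and 1
    vm_load:   vm -> share of utilization the VM contributes
    placement: vm -> host the VM currently runs on
    """
    busiest = max(host_load, key=host_load.get)
    if host_load[busiest] <= threshold:
        return None  # no host is over the threshold
    idlest = min(host_load, key=host_load.get)
    if idlest == busiest:
        return None  # only one host, nowhere to move to
    candidates = [vm for vm, host in placement.items() if host == busiest]
    # Try the smallest VM first; only move it if the target host
    # stays at or below the threshold afterwards.
    for vm in sorted(candidates, key=vm_load.get):
        if host_load[idlest] + vm_load[vm] <= threshold:
            return (vm, busiest, idlest)
    return None


if __name__ == "__main__":
    host_load = {"esx1": 0.9, "esx2": 0.3}
    vm_load = {"db": 0.6, "web": 0.3, "app": 0.3}
    placement = {"db": "esx1", "web": "esx1", "app": "esx2"}
    print(suggest_migration(host_load, vm_load, placement))
```

In this example the smaller VM is moved and the db VM stays where it is, but the real scheduler weighs many more factors.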
VMOTION: The db host can also be manually vMotioned to a different host, completely online. This is done when the VM team needs to do maintenance on the host running the db host.
Database HA:
There is no HA mechanism built into the infrastructure for the database itself. If a database instance or database fails, the monitoring tool informs the operators, who then call the on-call DBA. The DBA assesses the failure and takes appropriate action. It is difficult to give a time range for this action, since it varies with the type of failure. The options range from simply restarting the instance, to recovering the database from a snapshot, to getting the storage and systems teams involved.