Improving CAS Availability
Background
CAS is an open source project developed by a working group of the Apereo consortium of universities. The developers periodically release updates that fix bugs and new releases that add new features. We are currently running CAS 4 and had to add Duo Multi-Factor Authentication to it ourselves, but CAS 5 has been released and Duo support is part of the standard feature set.
CAS is written as a core component that provides the base required functions, plus optional libraries that add additional features. Yale does not make any changes to the standard CAS distribution, so we begin with a standard Apereo release. We then add our own customizations:
- We add our own version of some Web pages so CAS has a Yale look and feel and users see the same CAS login page from release to release.
- We add some Yale-specific libraries. In particular, we add the Yale "expired password change" module to make users change their password each year.
- We add files that configure the Yale environment. In CAS 4, configuration used a mixture of XML elements to select options and property files to configure those options. For example, the XML told CAS to use an LDAP query to validate the password, and a set of properties pointed that LDAP access at Active Directory, using our network F5 front end as the LDAP service endpoint. In CAS 5, many of the XML configuration options can be removed: if a CAS 5 properties file contains configuration for LDAP, then LDAP support is turned on, and if there are no such properties, the support that the XML would have configured remains dormant.
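As an illustration of the CAS 5 style, a properties fragment along the following lines is enough by itself to turn on LDAP password validation against Active Directory. The property names follow the CAS 5 reference documentation; the host name, base DN, and domain shown here are placeholders, not Yale's actual values.

```
# Presence of these properties activates LDAP authentication in CAS 5;
# deleting them leaves the LDAP support dormant. All values are placeholders.
cas.authn.ldap[0].type=AD
cas.authn.ldap[0].ldapUrl=ldaps://ad-vip.its.yale.edu:636
cas.authn.ldap[0].useSsl=true
cas.authn.ldap[0].baseDn=DC=yale,DC=edu
cas.authn.ldap[0].dnFormat=%s@yale.edu
```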
Generally speaking, each upgrade from one release to another allows us to retire some Yale custom code. A migration from CAS 4 to CAS 5 might allow us to remove all (or at least most) of our Duo and password expiration code. The closer we are to standard, the easier it will be to use Unicon (a company that supports CAS under contract) as a backup for in-house expertise.
Configuration Options
The simplest version of CAS availability is what we do now. We run two identically configured CAS VMs. While the first CAS VM is functioning, all requests are sent to it. Every time someone logs in, the Single Sign-On session is replicated to the backup VM using the open source "ehcache" component and one of the optional CAS modules described above. If the first VM goes down, requests are sent to the backup VM.
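For reference, here is a minimal sketch of how that replication is wired in a CAS 5 overlay, assuming the ehcache ticket registry module is on the classpath. The exact property names and the peer configuration should be verified against the CAS 5 documentation for the release we actually deploy.

```
# Illustrative only; verify names against the CAS 5 ehcache ticket registry documentation.
cas.ticket.registry.ehcache.replicatePuts=true
cas.ticket.registry.ehcache.replicateUpdates=true
cas.ticket.registry.ehcache.replicateRemovals=true
# The addresses of the two peer VMs come from the ehcache XML referenced here.
cas.ticket.registry.ehcache.configLocation=classpath:/ehcache-replicated.xml
```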
If both go down, then we have a more interesting problem. At this point our objectives change:
- It is no longer necessary to maintain the Single Sign-On behavior. If users are forced to reenter their userid and password on the next login, they will hardly notice the change; a lot of services force CAS to re-authenticate anyway. So "ehcache" replication can be dropped, CAS keeps no memory of existing sessions, and a backup instance can come up cold.
- Password expiration (have you changed your password in the last year?) need not be checked in recovery mode. The check adds extra dependencies that might not be available during a data center failure, and you cannot actually change your password until that component comes back up anyway; "change your password" is not a Tier 0 function. We will resume checking when we are back to normal.
- Duo multi-factor authentication is a security measure. Although it depends on offsite network access, it should still be attempted. Turning off Duo is an Information Security call and should be manual rather than automatic during a failure.
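Under those assumptions, the recovery property set is mostly a matter of leaving things out. Here is a minimal sketch; the Duo property names follow the CAS 5 documentation, the values are placeholders, and the Yale password-expiration module is simply excluded from the recovery build.

```
# Recovery sketch: no ehcache properties are present, so CAS falls back to its
# default in-memory ticket registry, nothing is replicated, and a cold start is fine.
# The Yale password-expiration module is left out of this build entirely.

# Duo stays on. Placeholder values only; the real keys are not shown here.
cas.authn.mfa.duo[0].duoApiHost=api-XXXXXXXX.duosecurity.com
cas.authn.mfa.duo[0].duoIntegrationKey=PLACEHOLDER
cas.authn.mfa.duo[0].duoSecretKey=PLACEHOLDER
cas.authn.mfa.duo[0].duoApplicationKey=PLACEHOLDER
```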
In this situation, the only dependency for CAS is the mechanism to validate passwords. CAS does not keep passwords itself, and while it can validate a password using many different protocols, at Yale your Netid password is stored in Active Directory and then replicated out to Azure AD for Office 365 purposes. So CAS could use the on-premises AD, or the Domain Controller currently hosted in the AWS cloud, or it could check the password in Azure AD using a native Microsoft programming interface (although that might require some customization pending support in a future CAS release). That is the only thing that the CAS login function requires outside CAS itself.
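To make that dependency concrete: the choice of password-validation source is essentially the choice of one LDAP endpoint. The host names below are placeholders rather than real Yale addresses, and the Azure AD option is shown only as a comment because it is not a stock CAS 5 capability.

```
# Option 1: on-premises AD reached through the F5 front end (what production uses today).
cas.authn.ldap[0].ldapUrl=ldaps://ad-vip.its.yale.edu:636

# Option 2: the Domain Controller hosted in AWS, addressed directly.
# cas.authn.ldap[0].ldapUrl=ldaps://aws-dc.its.yale.edu:636

# Option 3: check the password in Azure AD through a native Microsoft interface;
# not a stock CAS 5 option, so it would need custom code or a future CAS release.
```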
It makes no particular sense to have gradations of CAS fail-over; there just aren't enough important CAS options to justify a "gradual" approach. If the two virtual machines here at Yale fail, then I propose we switch unconditionally to the CAS in the virtual bomb shelter, wherever it is. It will handle the basic logins for critical services. There can be more than one such backup (which, if you will excuse me, I would refer to as the "CAS-tastrophe" configuration), but we need one that kicks in automatically or with very little manual intervention to minimize disruption; any additional backup resource would be manually enabled and therefore come up less quickly.
This represents such a small requirement that it can be handled many different ways. CAS does not care where it runs (on premises, in the Cloud), what it runs on (RHEL, Ubuntu, Windows Server), or whether it is on a VM or a container. Given the most recent CAS failure (caused by monthly maintenance), I suggest only one hard rule: the Recovery configuration must not duplicate any feature of the primary on-premises CAS, even if that complicates the staffing charts. Thus, because the production CAS runs on an RHEL VM, the Recovery CAS has to run on a container or a Windows Server VM, but never on anything that could share the same failure trigger.
The network piece that routes requests to CAS is actually a bigger problem than CAS itself. You can put CAS directly on the network, or behind the F5 or an F5 replacement. You can direct traffic by changing the DNS mapping of the name or the Cisco routing of the IP address. Any program that uses CAS expects that https://secure.its.yale.edu/cas gets to it and does not care how that works (although whatever is at the other end does need the secure.its.yale.edu certificate and key to do SSL). There are a few applications that have been "misconfigured" (possibly due to bad advice from a programmer no longer with us) to use "https://auth.yale.edu/cas" instead; we should either reconfigure them or maintain both names during major failures.
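Whichever instance ends up answering that URL also has to be configured to believe it is that URL. In CAS 5 terms this is just the server name properties shown below (real CAS 5 property names, with the URLs taken from this document), plus the secure.its.yale.edu certificate and key on whatever terminates SSL.

```
# The recovery instance must present itself under the same public name
# that every application already uses.
cas.server.name=https://secure.its.yale.edu
cas.server.prefix=https://secure.its.yale.edu/cas
```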
The Recovery CAS starts, as all CAS instances do, with the core Apereo release code, but there are reasons to consider a separate set of Yale configuration files rather than trying to reuse the standard configuration from normal operation. The Recovery configuration selects and configures a subset of the options, chosen either to have fewer dependencies or to use a different version of the same dependency with a different failure mode (for example, using the AWS Domain Controller directly rather than the on-premises Domain Controller accessed through the F5).
Decisions
We have to make a bunch of choices, but I believe there are neither technical nor business criteria that force a particular choice. More importantly, it should take at most a day, and probably just a few hours, to move the Recovery CAS from one set of choices to another. Since getting Information Security approval is usually a sticking point, I suggest putting together a quick set of options and asking for their approval of a subset. Then we can focus on whatever makes it through that bottleneck, rather than making our own choice and then beating our heads against a brick wall to get it approved.