CAS is a Single Sign-On solution. Internally the function of CAS is to create, update, and delete a set of objects it calls "Tickets" (a word borrowed from Kerberos). A Logon Ticket (TGT) object is created to hold the Netid when a user logs on to CAS. A partially random string is generated to be the logon ticket id; it is sent back to the browser as a Cookie and is also used as a "key" to locate the logon ticket object in a table. Similarly, CAS creates Service Tickets (ST) to identify a logged-on user to an application that uses CAS authentication.
Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of the ticket replication mechanism. Red Hat failed to diagnose or fix the problem. We considered replacing JBoss Cache with Ehcache, but there is a more fundamental problem here. It should not be possible for any failure of the data replication mechanism to crash all of the CAS servers at once. Another choice of cache might be more reliable, but it would suffer from the same fundamental structural problem.
Of course, finding the ticket is not helpful unless you use a feature that has always been part of CAS configuration but was previously not particularly useful. Each server can put a specific identifier on the end of every ticketid it creates. This is the "suffix" of the ticket in the configuration parameters, but typically it has been left as the default string "-CAS". If the suffix is configured meaningfully, and if it is set to a value the Front End can use to identify the node, then combined with the previous three steps the Front End can be configured to route ticket requests preferentially to the node that created the ticket and therefore holds it in memory, without depending first on cluster replication.
Of course, tickets still have to be replicated for recovery purposes, but that means that tickets can be replicated in seconds instead of milliseconds, and they can be queued and replicated periodically instead of synchronously (while the request waits). This makes the clustering mechanism much easier and more reliable.
Of course, the Front End is owned by the Networking staff, and they are not always responsive to the needs of the CAS administrator. Although it is obviously more efficient to program the Front End, the CushyFrontEndFilter can be added to the Servlet configuration of the CAS server to do in Java the same thing the Front End should be doing, at least until your network administrators adopt a more enlightened point of view.
"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.
There are two big ideas.
- As soon as a CAS request can be processed, look at the URL and Headers, find the ticketid, and based on the suffix route or forward this request to the server that created the ticket and is guaranteed to have it in memory.
- The number of CAS tickets in a server is small enough that it is possible to think in terms of replicating the entire registry instead of just individual tickets. As long as you do this occasionally (every few minutes) the overhead is reasonable. Java's writeObject statement can just as easily process an entire Collection as it can a single ticket object.
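The first idea, routing by suffix, can be illustrated with a minimal Java sketch. It assumes the suffix is whatever follows the last hyphen of the ticketid and that each node has been configured with a distinct suffix. The class name SuffixRouter, the sample suffixes, and the example.edu URLs are hypothetical and are not taken from the Cushy source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: finds the node that created a ticket from the ticketid suffix. */
final class SuffixRouter {
    // Assumed mapping from configured ticket suffix to the base URL of that node.
    private final Map<String, String> nodeUrlBySuffix;

    SuffixRouter(Map<String, String> nodeUrlBySuffix) {
        this.nodeUrlBySuffix = new HashMap<>(nodeUrlBySuffix);
    }

    /** Assumes the suffix is the text after the last '-' in the ticketid. */
    static String suffixOf(String ticketId) {
        int dash = ticketId.lastIndexOf('-');
        return dash < 0 ? "" : ticketId.substring(dash + 1);
    }

    /** Returns the owning node's URL, or empty when the suffix is unknown (e.g. the default "CAS"). */
    Optional<String> nodeFor(String ticketId) {
        return Optional.ofNullable(nodeUrlBySuffix.get(suffixOf(ticketId)));
    }
}
```

When the result is empty, a router built this way would fall back to whatever node selection it used before, so an unrecognized or default suffix never blocks a request.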
Given these two ideas, the initial Cushy design was obvious and the code could be written in a couple of days. For objects to be replicated from one node to another, programs use the Java writeObject statement to "Serialize" the object to a stream of bytes that can be transmitted over the network and then restored in the receiving JVM. Ehcache and the other ticket replication systems operate on individual tickets. However, writeObject can operate just as well on the entire contents of the TicketRegistry. This is very simple to code, it is guaranteed to work, but it might not be efficient enough to use. Still, once you have the idea the code starts to write itself.
Start with the DefaultTicketRegistry source that CAS uses to hold tickets in memory on a single CAS standalone server. Then add the writeObject statement (surrounded by the code to open and close the file) to create a checkpoint copy of all the tickets, and a corresponding readObject and surrounding code to restore the tickets to memory. The first thought was to do the writeObject to a network socket, because that was what all the other TicketRegistry implementations were doing. Then it became clear that it was simpler, and more generally useful, and a safer design, if the data was first written to a local disk file. The disk file could then optionally be transmitted over the network in a completely independent operation. Going first to disk created code that was useful for both standalone and clustered CAS servers, and it guaranteed that the network operations were completely separated from the Ticket objects and therefore the main basic CAS function.
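As a rough illustration of the checkpoint idea, the sketch below keeps tickets in an in-memory map and writes or restores the whole collection with a single writeObject or readObject call. It is not the Cushy source: CheckpointSketch and DemoTicket are hypothetical stand-ins, and the real CAS ticket classes carry more state.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry sketch: an in-memory table plus checkpoint and restore. */
final class CheckpointSketch {
    /** Stand-in for a CAS ticket (assumption: real tickets are also Serializable). */
    static final class DemoTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final String netid;
        DemoTicket(String id, String netid) { this.id = id; this.netid = netid; }
    }

    // Tickets keyed by ticket id string.
    private final ConcurrentHashMap<String, DemoTicket> tickets = new ConcurrentHashMap<>();

    void add(DemoTicket ticket) { tickets.put(ticket.id, ticket); }
    DemoTicket get(String id) { return tickets.get(id); }
    int size() { return tickets.size(); }

    /** Writes every ticket to the file with one writeObject call. */
    void checkpoint(Path file) throws IOException {
        ArrayList<DemoTicket> snapshot = new ArrayList<>(tickets.values());
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(snapshot);
        }
    }

    /** Reads the collection back with one readObject call and reloads the table. */
    void restore(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            @SuppressWarnings("unchecked")
            Collection<DemoTicket> restored = (Collection<DemoTicket>) in.readObject();
            for (DemoTicket ticket : restored) {
                tickets.put(ticket.id, ticket);
            }
        }
    }
}
```

Nothing in this sketch touches the network, which mirrors the design choice described above: the file on disk is the only interface between the ticket objects and whatever later transmits them.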
This was so simple and modular that it had to work. It was not clear that it would be good enough to be useful, but it was worth a few days of coding to get some preliminary results. The first benchmarks turned out to be even better than had been expected, and that justified further work to complete the system.
CushyTicketRegistry and the Standalone Server
For a single CAS server, the standard choice is the DefaultTicketRegistry class, which keeps the tickets in an in-memory Java table keyed by the ticket id string. Suppose you change the name of the Java class in the Spring ticketRegistry.xml file from DefaultTicketRegistry to CushyTicketRegistry (and add a few required parameters described later). Cushy was based on the DefaultTicketRegistry source code, so everything works the same as it did before, until you have to restart CAS for any reason. Since the DefaultTicketRegistry only has an in-memory table, all the ticket objects are lost when CAS restarts and users all have to log in again. Cushy detects the shutdown and, using a single Java writeObject statement, saves all the ticket objects in the Registry to a file on disk (called the "checkpoint" file). When CAS restarts, Cushy reloads all the tickets from that file into memory and restores all the CAS state from before the shutdown (although a few tickets may have expired). No user even notices that CAS restarted unless they tried to access CAS during the restart.
The number of tickets CAS holds grows during the day and shrinks overnight. At Yale there are fewer than 20,000 ticket objects in CAS memory, and Cushy can write all those tickets to disk in less than a second, generating a file around 3 megabytes in size. Other numbers of tickets scale proportionately (you can run a JUnit test and generate your own numbers). This is such a small amount of overhead that Cushy can be proactive.
So to take the next logical step, start with the previous ticketRegistry.xml configuration and duplicate the XML elements that currently call a function in the RegistryCleaner every few minutes. In the new copy of the XML elements, call the "timerDriven" function in the (Cushy)ticketRegistry bean every few minutes. Now Cushy will not wait for shutdown but will back up the ticket objects regularly just in case the CAS machine crashes without shutting down normally. When CAS restarts after a crash, it can load a fairly current copy of the ticket objects, which will satisfy the 99.9% of the users who did not log in during the last minutes before the crash.
What about disaster recovery? The checkpoint and incremental files are ordinary sequential binary files on disk. When Cushy writes a new file it creates a temporary name, fills the file with new data, closes it, and then swaps the new file for the old one, so other programs authorized to access the directory can safely open or copy the files while CAS is running. Feel free to write a shell script or Perl or Python program to use SFTP or any other program or protocol to back up the data offsite or to the cloud.
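The write-then-swap step can be sketched in Java with java.nio.file. This is only an illustration of the technique under stated assumptions, not the Cushy code: the class name AtomicFileSwap and its method are hypothetical, and whether an atomic move replaces an existing target is file system dependent, so the sketch falls back to a plain replacing move.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: replace a file's contents without exposing a half-written file. */
final class AtomicFileSwap {
    static void replace(Path target, byte[] data) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Temporary name in the same directory, so the final rename stays on one file system.
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            Files.write(tmp, data); // fills and closes the temporary file
            try {
                // On POSIX file systems this is a rename that replaces the old file in one step.
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback when the file system cannot move atomically.
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }
}
```

A backup script that opens the target at any moment sees either the complete old file or the complete new one, which is what makes copying the files safe while CAS is running.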
Before you configure a cluster, remember that today a server is typically a virtual machine that is not bound to any particular physical hardware. Ten years ago moving a service to a backup machine involved manual work that took time. Today there is VM infrastructure and automated monitoring and control tools. A failed server can be migrated and restarted automatically or with a few commands. If you can get the CAS server restarted fast enough that almost nobody notices, then you have solved the problem that clustering was originally designed to solve without adding a second running node.
You may still want a cluster.
If you use the JPATicketRegistry, then you configure CAS to know about the database in which tickets are stored. None of the nodes knows about the cluster as a whole. The "cluster" is simply one or more CAS servers all configured to back up tickets into the same database.
If you use Ehcache or one of the other object replication "cache" technologies, then there is typically an option to use an automatic node discovery mechanism based on multicast messages. That would be a good solution if you have only the one production CAS cluster, but it becomes harder to configure if you have separate Test and Development clusters that each need their own multicast configuration (unless they are completely isolated by being run as non-routed virtual machines on a single host). The alternative is to configure each node in each cluster separately, but that is hard to maintain. You want to test and validate a single CAS WAR artifact.
It seems to be more reliable to configure each node to know the name and URL of all the other machines in the same cluster. However, a node-specific configuration file on each machine is difficult to maintain and install. You do not want to change the CAS WAR file when you distribute it to each machine, and Production Services wants to churn out identical server VMs with minimal differences. Where do you keep and introduce the node-specific configuration? How do you make sure that changes are correctly implemented on every machine?
CushyClusterConfiguration (CCC) provides an alternative approach to cluster configuration, and while it was originally designed for CushyTicketRegistry it also works for Ehcache. Instead of defining the point of view of each individual machine, the administrator defines all of the CAS servers in all of the clusters in the organization: Production, Functional Test, Load Test, Integration Test, down to the developer's desktop or laptop "Sandbox" machines.
CCC is a Spring Bean that is specified in the CAS Spring XML. It only has a function during initialization. It reads in the complete set of clusters, uses DNS (or the hosts file) to obtain information about each CAS machine referenced in the configuration, uses Java to determine the IP addresses assigned to the current machine, and then tries to match one of the configured machines to the current computer. When it finds a match, that configuration defines this CAS, and the other machines in the same cluster definition can be used to manually configure Ehcache or CushyTicketRegistry.
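The matching step can be approximated with the standard java.net classes. The sketch below is an assumption about how such matching could work, not the CCC implementation: ClusterMatchSketch, its method names, and the shape of the cluster map (cluster name to list of host names) are all hypothetical.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: find the configured cluster that contains this machine. */
final class ClusterMatchSketch {
    /** Collects every IP address bound to a network interface on the current machine. */
    static Set<InetAddress> localAddresses() throws SocketException {
        Set<InetAddress> result = new HashSet<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // some JVMs report "no interfaces" as null
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            result.addAll(Collections.list(nic.getInetAddresses()));
        }
        return result;
    }

    /** Returns the name of the first cluster with a host that resolves to a local address. */
    static Optional<String> findMyCluster(Map<String, List<String>> clusters,
                                          Set<InetAddress> local) {
        for (Map.Entry<String, List<String>> cluster : clusters.entrySet()) {
            for (String host : cluster.getValue()) {
                try {
                    // Resolves through DNS or the hosts file; IP literals need no lookup.
                    for (InetAddress address : InetAddress.getAllByName(host)) {
                        if (local.contains(address)) {
                            return Optional.of(cluster.getKey());
                        }
                    }
                } catch (UnknownHostException e) {
                    // Host is not resolvable from this machine; try the next one.
                }
            }
        }
        return Optional.empty();
    }
}
```

A caller would typically pass localAddresses() as the second argument; taking the set as a parameter keeps the matching logic testable without depending on the network setup of the machine running it.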
CCC exports the information it has gathered and the decisions it has made by defining a number of properties that can be referenced using the "Spring EL" language in the configuration of properties and constructor arguments for other Beans. This obviously includes the TicketRegistry, but the ticketSuffix property can also be used to define a node-specific value at the end of the unique ticketids generated by beans configured by the uniqueIdGenerators.xml file.
There is a separate page to explain the design and syntax of CushyClusterConfiguration.
Front End or CushyFrontEndFilter