Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of the ticket replication mechanism. A technology designed to improve availability should not be the primary source of outages.

We considered switching to another technology, but the problem was not entirely in the choice of caching library. CAS is doing some things wrong, and there are certain poorly understood consequences of CAS behavior and ticket replication mechanisms that have been tolerated up to this point because they produced acceptable results, but which might not meet the needs of CAS 4.

It is not possible to solve all the problems in the TicketRegistry component of CAS, but if we stop treating the TicketRegistry as a magic black box that we imagine must do everything correctly because it is Off the Shelf software, then perhaps we will understand the remaining problems and finally fix them.

CushyTicketRegistry is a much extended version of the traditional DefaultTicketRegistry module used by standalone CAS servers. Suppose that all you do is switch the class names in the ticketRegistry.xml Spring configuration file (and add a required parameter). Then Cushy works the same as Default until you shut down and restart the server. Default restarts with an empty table, so everyone has to log in again. Cushy saves all the tickets from memory to a disk file using a single Java writeObject statement. When CAS starts up again, Cushy reloads the tickets from the disk file with a single readObject statement, and CAS resumes where it left off without affecting any users who did not try to use CAS while it was restarting.

CAS is a Single SignOn solution. Internally, the function of CAS is to create, update, and delete a set of objects it calls "Tickets". A Login Ticket is created whenever a user logs in to CAS. It is used to remember the userid, and a generated string used to identify and locate the ticket is written back as a cookie to the logged-in browser. When the browser uses this login to access an application, CAS issues a temporary Service Ticket that ties the application URL to the Login Ticket. These ticket objects are stored in a plugin component called a TicketRegistry. A standalone server stores tickets in memory, but a cluster of CAS servers has to share the tickets by replicating copies of them from the server that created the ticket to other servers in the cluster.

We were disappointed that the ticket replication mechanism, nominally designed to improve availability, should be a source of failure. We considered switching from JBoss Cache to an alternate library performing essentially the same service, but it was not clear that any other option would solve this problem.

General object replication systems are necessary for shopping cart applications that handle thousands of concurrent users spread across a number of machines. That is not really the CAS problem. CAS has a relatively light load that could probably be handled by a single server, but it needs to be available all the time, even during disaster recovery when there may be unexpected network communication problems. It also turns out that CAS tickets violate some of the restrictions that general object replication systems place on applications.

CushyTicketRegistry is a much simpler solution for clustering than any of the other available systems, but it also adds useful availability features to single standalone CAS servers. It solves the specific requirements of CAS. It is as simple as possible because it doesn't have to be a general purpose tool, and because CAS is a high value system with low resource utilization that can afford simplicity over optimization.

CAS is based on the Spring Framework, which means that internal components are selected and connected to each other using XML text files. The ticketRegistry.xml file has to configure some object that implements the TicketRegistry interface. An administrator chooses one of at least 5 alternatives. The simplest, which keeps tickets in memory on a single standalone server, is called DefaultTicketRegistry. CushyTicketRegistry is a much enhanced version of that code.

Suppose that you start with the standard single server configuration and simply change the class name in the XML file from DefaultTicketRegistry to CushyTicketRegistry (and add a few required parameters described later). Everything works the same as before, until you shut down the CAS server. The old TicketRegistry loses all the tickets, and therefore everyone has to log in again. Cushy detects the shutdown and saves all the ticket objects to a file on disk, using a single Java writeObject statement. When CAS restarts, unless that file has been deleted, Cushy loads all the tickets from it into memory (although if CAS was down for a long time some may have timed out), and CAS picks up where it left off. Users do not have to log in again, and no user notices that CAS rebooted unless they tried to access CAS while it was down.
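The save and reload steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Cushy source: `SketchTicket`, `CheckpointSketch`, and their methods are hypothetical names, the ticket carries far less state than a real CAS ticket, and the sketch does not discard tickets that timed out while the server was down.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a CAS ticket; real tickets hold more state.
final class SketchTicket implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String userid;

    SketchTicket(String id, String userid) {
        this.id = id;
        this.userid = userid;
    }
}

// Hypothetical in-memory registry with a disk checkpoint (assumed design).
final class CheckpointSketch {
    private final ConcurrentHashMap<String, SketchTicket> tickets = new ConcurrentHashMap<>();
    private final File checkpointFile;

    CheckpointSketch(File checkpointFile) {
        this.checkpointFile = checkpointFile;
    }

    void addTicket(SketchTicket t) { tickets.put(t.id, t); }
    SketchTicket getTicket(String id) { return tickets.get(id); }
    int size() { return tickets.size(); }

    // Save every ticket with a single writeObject call.
    void checkpoint() throws IOException {
        // Copy the values first so one stable list is serialized.
        ArrayList<SketchTicket> snapshot = new ArrayList<>(tickets.values());
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(checkpointFile)))) {
            out.writeObject(snapshot);
        }
    }

    // Reload every ticket with a single readObject call.
    @SuppressWarnings("unchecked")
    void restore() throws IOException, ClassNotFoundException {
        if (!checkpointFile.exists()) {
            return; // No checkpoint on disk: start empty, like DefaultTicketRegistry.
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(checkpointFile)))) {
            List<SketchTicket> saved = (List<SketchTicket>) in.readObject();
            for (SketchTicket t : saved) {
                tickets.put(t.id, t);
            }
        }
    }
}
```

In this sketch a server would call `restore()` once at startup and `checkpoint()` at shutdown. Serializing a copied list rather than the live map is a design choice of the sketch, made so that concurrent ticket changes do not alter the collection while it is being written.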

With this single modification in place it was possible to measure the cost of the operation. It turns out that the largest number of tickets normally encountered at Yale could be written to disk in less than a second, and the file was only around 3MB, which is pretty small by modern standards. That is such a small cost that you don't have to wait until shutdown. So the next step was to duplicate the existing configuration of the RegistryCleaner in the standard CAS Spring XML file to create a second timer driven operation. Every so often (2-5 minutes for example) Cushy does the same writeObject statement that it does at normal shutdown. Now, if the CAS machine crashes without a shutdown, it can come back up with almost the same state it had at the crash (except for changes between the last timer driven backup and the crash).
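The timer driven backup can be illustrated with a small sketch. Cushy itself is described as reusing the RegistryCleaner style of timer configuration in the Spring XML file; the version below swaps in a plain `ScheduledExecutorService` from the JDK so that it stays self-contained. `PeriodicCheckpointSketch` is a hypothetical name, and the checkpoint action is passed in as a `Runnable` rather than tied to any real registry class.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that repeats a checkpoint action on a timer.
final class PeriodicCheckpointSketch implements AutoCloseable {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "checkpoint-timer");
                t.setDaemon(true); // Do not keep the JVM alive just for backups.
                return t;
            });

    PeriodicCheckpointSketch(Runnable checkpoint, long interval, TimeUnit unit) {
        timer.scheduleWithFixedDelay(() -> {
            try {
                checkpoint.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log and continue.
                System.err.println("checkpoint failed: " + e);
            }
        }, interval, interval, unit);
    }

    // Stop scheduling further checkpoints.
    @Override
    public void close() {
        timer.shutdown();
    }
}
```

With an interval in the 2-5 minute range mentioned above, a crash would lose at most the ticket changes made since the last completed run. A fixed delay between runs (rather than a fixed rate) is a choice of this sketch, so that a slow write never causes checkpoints to pile up.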

...