In 2010 Yale upgraded to CAS 3.4.2 and implemented a "High Availability" CAS cluster using the JBoss Cache option (because Yale Production Services had standardized on JBoss for "clustering"). Unfortunately, the mechanism designed to improve CAS reliability ended up being the cause of most CAS failures. If you insist that Service Tickets be replicated across the cluster, so that any CAS node can validate any Service Ticket, then replication has to complete before the ST can be passed back to the user. But if CAS has to wait for cache activity, then a network problem or some sickness on one CAS node propagates back to all the nodes and CAS stops working. We considered changing to another option, but none of the alternatives has a spotless reputation for reliability.
There is much to be said for "off the shelf" COTS software. After all, if something is widely used and written to handle much more complicated problems, then it should be able to handle CAS. Unfortunately, all of these packages are designed to support application software, while at Yale CAS is a Tier 0 system component (in Disaster Recovery planning): it has to come back up first, with as few dependencies as possible. Application software comes up much later, when the databases are back up and the network is stable again.
So CushyTicketRegistry was written to solve the CAS Ticket problem and pretty much nothing else. It does not require a database, or any additional complex network configuration with multicast addresses and timeouts. It relies on the observation that CAS is actually a fairly small component with limited hardware demands, so a slightly less "efficient" but rock-solid, dead-simple approach can solve the problem.
Rather than trying to push in-memory tickets through network message queues and multicast protocols, Cushy periodically uses the standard Java writeObject statement to write a copy of the entire ticket cache to disk. At Yale we have fewer than 20,000 tickets at any time, and this operation takes less than a second of elapsed time (on one core) and uses about 3.2MB of disk. So you can certainly do it once every few minutes. In between full checkpoints, every few seconds we write a much smaller file of changes since the last full checkpoint.
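As a rough sketch of the idea (this is not the actual Cushy source; the class and field names here are made up for illustration), a full checkpoint is little more than one writeObject call on a snapshot of the ticket map, and a restore is the matching readObject:

    import java.io.*;
    import java.util.HashMap;
    import java.util.concurrent.ConcurrentHashMap;

    public class CheckpointSketch {
        // In the real registry the map holds CAS Ticket objects keyed by ticket id.
        private final ConcurrentHashMap<String, Serializable> tickets = new ConcurrentHashMap<>();

        /** Write a full checkpoint of every current ticket to disk with plain Java serialization. */
        public void checkpoint(File checkpointFile) throws IOException {
            // Copy into an ordinary serializable map first, then write it with one writeObject call.
            HashMap<String, Serializable> snapshot = new HashMap<>(tickets);
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(checkpointFile)))) {
                out.writeObject(snapshot);
            }
        }

        /** Reload a checkpoint, for example after a restart or when restoring a failed peer's tickets. */
        @SuppressWarnings("unchecked")
        public void restore(File checkpointFile) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream(checkpointFile)))) {
                tickets.putAll((HashMap<String, Serializable>) in.readObject());
            }
        }
    }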
Now you have to replicate the files to the other CAS nodes, but CAS runs on a Web server, and Web servers are pretty good at transferring files over the network. The data has to be secure, but CAS already uses HTTPS for the rest of its traffic, and it works just as well for the ticket backup files. The CAS servers also have to authenticate themselves to each other, so Cushy uses the same combination of Service Tickets and server certificates (as in the Proxy Callback) that CAS already uses for server authentication. In short, CAS had already solved all of these problems and just needed to reapply its existing solutions to its own housekeeping.
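To make the transfer concrete, here is a minimal sketch of one node pulling a peer's checkpoint file with an ordinary HTTPS GET. The URL path and the ticket parameter are illustrative assumptions, not the actual Cushy endpoint:

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import javax.net.ssl.HttpsURLConnection;

    public class FetchCheckpointSketch {
        /** Pull a peer's checkpoint file over HTTPS, presenting a Service Ticket to prove we are a CAS node. */
        public static void fetch(String peerHost, String serviceTicket, Path localCopy) throws Exception {
            // Hypothetical endpoint: the real path and parameter names belong to the Cushy implementation.
            URL url = new URL("https://" + peerHost + "/cas/cluster/checkpoint?ticket=" + serviceTicket);
            HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (InputStream in = conn.getInputStream()) {
                // Save the peer's tickets locally; the file just sits on disk until that peer fails.
                Files.copy(in, localCopy, StandardCopyOption.REPLACE_EXISTING);
            } finally {
                conn.disconnect();
            }
        }
    }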
The "off the shelf" object replication technologies are all enormous black boxes with lots of complex code that no CAS programmer has ever read. As a result, CAS is not written correctly to use most of these packages. They have restrictions and CAS doesn't conform to those restrictions. The chance of failure is small, but if you run 24x7 it will eventually trigger a problem.
Cushy is a small amount of source code that any Java programmer should be able to understand. This means it can be customized to handle any special CAS requirements, which is something you cannot do with a generic object cache library. It can also be customized to address local network configurations, failure patterns, disaster recovery plans, or availability profiles.

At Yale, once ticket replication was in place, the only CAS crashes were caused by failures of the ticket replication mechanism itself. A technology designed to improve availability should not be the primary source of outages.
We considered switching to another technology, but the problem was not entirely in the choice of caching library. CAS itself does some things wrong, and there are poorly understood consequences of the interaction between CAS behavior and ticket replication that have been tolerated up to this point because they produced acceptable answers, but which may not meet the needs of CAS 4.
It is not possible to solve all of these problems inside the TicketRegistry component of CAS, but if we stop treating the TicketRegistry as a magic black box that must do everything correctly simply because it is off-the-shelf software, then perhaps we can understand the remaining problems and finally fix them.
CushyTicketRegistry is an extension of the traditional DefaultTicketRegistry module that CAS has used for standalone servers. It is a big extension, but if all you do is replace DefaultTicketRegistry with CushyTicketRegistry in the XML (without adding the optional timer-driven call), then CAS writes a checkpoint file containing all of its current tickets when it shuts down, and when it restarts it reloads those tickets into memory and picks up where it left off, preserving all existing user logons. That alone is a minor but useful improvement over the default behavior.
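For a concrete picture of that swap, here is a minimal sketch of the bean change, assuming the standard CAS layout where the registry bean lives in WEB-INF/spring-configuration/ticketRegistry.xml. The Cushy package name shown is only a placeholder; use whatever package the class actually ships in:

    <!-- Before: the standalone, in-memory registry -->
    <bean id="ticketRegistry"
          class="org.jasig.cas.ticket.registry.DefaultTicketRegistry" />

    <!-- After: same bean id, Cushy implementation instead
         (the package below is a placeholder, not necessarily the real one) -->
    <bean id="ticketRegistry"
          class="edu.yale.its.tp.cas.ticket.registry.CushyTicketRegistry" />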
The next step up is to turn on the timer-driven activity and configure a cluster of CAS servers. CushyTicketRegistry then periodically writes a disk checkpoint of its current set of tickets and, more frequently, a smaller file of changes since the last full checkpoint, and it uses HTTP GET requests to exchange these files between servers. The files simply sit on disk until a CAS server crashes; when that happens, the other servers have a fairly recent copy of the failed node's data and can handle requests for that node until it comes back up.
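A rough sketch of that timer-driven cycle in Java follows. The intervals, method of scheduling, and task names are illustrative assumptions, not the actual Cushy configuration:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ReplicationTimerSketch {
        private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        /** Start the periodic cycle; the two Runnables stand in for the registry's checkpoint and incremental logic. */
        public void start(Runnable fullCheckpoint, Runnable incremental) {
            // A full checkpoint of every ticket every few minutes (5 here, purely illustrative).
            timer.scheduleAtFixedRate(() -> runQuietly(fullCheckpoint), 0, 5, TimeUnit.MINUTES);
            // A much smaller file of changes since the last checkpoint every few seconds (10 here).
            timer.scheduleAtFixedRate(() -> runQuietly(incremental), 10, 10, TimeUnit.SECONDS);
        }

        private void runQuietly(Runnable task) {
            try {
                task.run();
            } catch (RuntimeException e) {
                // A replication hiccup must never take CAS down; log it and try again at the next interval.
                e.printStackTrace();
            }
        }
    }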
Cushy does not use a database, and it requires no network configuration beyond the Web server CAS is already running on. It is "inefficient" in the sense that it periodically copies all the tickets instead of just the changed ones, but that ends up using less than 1% of one core, and on modern hardware that is a trivial cost.
Cushy depends on a modern programmable network front end to distribute requests to the various CAS servers in the cluster. In exchange, all the code for this type of clustering is contained in one medium-sized Java source file that is fairly easy to understand. There is no giant magic black box, but that still leaves two CAS mistakes exposed:
1) Any system that replicates tickets has a concurrency problem if multiple threads (like the request threads maintained by any Web server) can change the contents of an object while another thread is replicating it. CAS has collections in its TicketGrantingTicket object that can be modified while the object is being serialized, and that can cause the replication to throw a ConcurrentModificationException.
2) Object replication systems work best on standalone objects. CAS, however, chains Proxy and Service Tickets to the Ticket Granting Ticket that issued them. Under the covers this has always meant that other CAS nodes receive duplicate copies of the Ticket Granting Ticket object (as the sketch after this list illustrates), but in the past that did not matter because the copies were all identical. CAS 4 allows meaningful changes to be made to the Ticket Granting Ticket after logon, and then the copies are no longer identical. Whether that is acceptable depends on how you use multi-factor authentication and Proxy tickets.
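To make the second problem concrete, here is a small, self-contained sketch (simplified stand-in classes, not the real CAS Ticket classes) showing that serializing a Service Ticket also serializes the Ticket Granting Ticket it points to, so the receiving node ends up with its own private copy of the TGT. It also hints at the first problem: if another request thread were adding to the services list while writeObject was running, the write could fail with a ConcurrentModificationException.

    import java.io.*;
    import java.util.ArrayList;
    import java.util.List;

    // Simplified stand-ins for CAS tickets, only to show the serialization behavior.
    class FakeTgt implements Serializable {
        String user = "someone";
        List<String> services = new ArrayList<>();   // mutable state that CAS 4 can change after logon
    }

    class FakeSt implements Serializable {
        FakeTgt grantingTicket;                      // the chain back to the TGT
        FakeSt(FakeTgt tgt) { this.grantingTicket = tgt; }
    }

    public class ChainedTicketDemo {
        public static void main(String[] args) throws Exception {
            FakeTgt tgt = new FakeTgt();
            FakeSt st = new FakeSt(tgt);

            // "Replicate" the ST the way a cache or a Cushy checkpoint would: serialize, then deserialize.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(st);                 // the TGT is written too, because the ST points to it
            }
            FakeSt copy;
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy = (FakeSt) in.readObject();
            }

            // The receiving side now has its own private TGT object.
            System.out.println(copy.grantingTicket != tgt);          // true: a duplicate copy of the TGT
            tgt.services.add("changed after replication");           // a CAS 4 style post-logon change
            System.out.println(copy.grantingTicket.services.size()); // 0: the two copies have diverged
        }
    }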
Cushy cannot solve problems that require changes to the Ticket classes themselves, but because it is a small amount of source you can quickly find the writeObject and readObject statements and develop a strategy to fix any problem with coordinated changes to the Ticket classes and the TicketRegistry. For now it is enough that no type of network problem, cluster failure, or replication failure can crash a single CAS node, let alone crash all the CAS servers simultaneously, as happened with the previous alternatives.
Executive Summary
This is a quick introduction for those in a hurry.
...