- Recover tickets after a reboot without JPA, a separate server, or a cluster (works on a standalone server)
- Recover tickets after a crash, except for the last few seconds of activity that did not get to disk.
- No dependency on any large external libraries. Pure Java using only the standard Java SE runtime and a few Apache Commons utilities.
- All source in one class. A Java programmer can read it and understand it.
- Can also be used to cluster CAS servers
- Cannot crash CAS ever, no matter what is wrong with the network or other servers.
- A completely different and simpler approach to the TicketRegistry. Easier to work with and extend.
- Probably uses more CPU and network I/O than other TicketRegistry solutions, but it has a constant predictable overhead you can verify is trivial.
...
Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of JBoss Cache. Red Hat failed to diagnose or fix the problem. As we tried to diagnose the problem ourselves we discovered both bugs and design problems in the structure of Ticket objects and the use of the TicketRegistry solutions that contributed to the failure. We considered replacing JBoss Cache with Ehcache, but there is a more fundamental problem here. It should not be possible for any failure of the data replication mechanism to crash all of the CAS servers at once. Another choice of cache might be more reliable, but it would suffer from the same fundamental structural problem.
Having been burned by software so complicated that its configuration files were almost impossible to understand, Yale developed Cushy to accomplish the same thing in a way so simple it could not possibly fail.
The existing CAS TicketRegistry solutions must be configured to replicate tickets to the other nodes and to wait for this activity to complete, so that any node can validate a Service Ticket that was just generated a few milliseconds ago. Waiting for the replication to complete is what makes CAS vulnerable to a crash if the replication begins but never completes. Synchronous ticket replication is a standard service provided by JBoss Cache and Ehcache, but is it the right way to solve the Service Ticket validation problem? A few minutes spent crunching the math suggested there was a better way.
It is easier and more efficient to send the request to the node that already has the ticket and can process it than to struggle to get the ticket to every other node in advance of the next request.
It turns out that most modern network Front End devices that distribute requests to members of a cluster can be programmed with a relatively simple amount of CAS protocol knowledge to locate the ticketid for a request. If you activate the CAS feature that puts a node identifier on the end of the ticketid, then requests can be routed to the node that created the ticket. If you cannot get your network administrator to program your Front End device, then CushyFrontEndFilter does the same thing: not as efficiently as it could be done in the network box, but more efficiently than processing requests with "naked" Ehcache without the Filter.
With programming in the Front End or the Filter, CAS nodes can "own" the tickets they create. Those tickets have to be replicated to other nodes so there is recovery when a node crashes, but that replication can be lazy and happen over seconds instead of milliseconds, and no request has to wait for the replication to complete.
That is a better way to run Ehcache or any of the traditional TicketRegistry solutions, but it also opens up the possibility of doing something entirely different: periodically replicating the entire TicketRegistry instead of just the individual tickets.
In the current TicketRegistry implementations, any request in a cluster to create a Service Ticket must replicate that ticket to at least one other computer (the database server with JPA, or one or more nodes with Ehcache or any other ticket replication mechanism) before the Service Ticket ID is returned to the browser. This ensures that the Service Ticket can be validated by whichever node the application's validation request is directed to. After validation, there is a second network transaction to delete the ticket. So every Service Ticket involves two synchronous backend operations.
However, it has always been part of CAS that every ticketid has a suffix that, at least on paper, can contain the node name of the CAS server that created the ticket. Using this feature in practice requires some node configuration methodology. Once that is done, any validate request (for example, any call to /cas/serviceValidate) contains a ticket= parameter in the query string of the URL, and the end of that parameter's value identifies the node that created the ticket. Today you can program most modern network front end devices to extract this information from the request and route the validate request to the node that created the ticket and is guaranteed to have it in memory. If you cannot program your front end device, or if you cannot convince your network administrators to do the work for you, then CushyFrontEndFilter accomplishes the same thing by scanning requests as they arrive at a CAS server and forwarding requests like validation to the server that created the ticket. If you have two servers and requests are randomly assigned to them, then 50% of the time the request goes to the right server and there is no network transaction, and 50% of the time the request has to be forwarded by the Filter to the other server, which validates the ST, deletes it, and returns the response. So with the Filter you expect, on average, one network transaction half the time, instead of the two network transactions every time that current JPA or cache technology requires. When the cluster has more than two nodes the Filter comes out even further ahead, because a misrouted request still costs only one forwarded transaction while replication has to copy every ticket to every additional node.
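As a sketch of that routing decision (assuming the usual CAS ticketid layout in which the node identifier follows the last "-", and using made-up class and node names rather than the actual CushyFrontEndFilter code), the logic a front end device or filter has to perform amounts to a few lines of string handling:

```java
import javax.servlet.http.HttpServletRequest;

// Hypothetical sketch of the routing decision a front end (or a servlet filter
// like CushyFrontEndFilter) has to make. It assumes ticket ids end with a node
// identifier after the last "-", e.g. "ST-42-a1b2c3d4e5-casnode2".
public class TicketRouting {

    /** Name of this CAS node; in a real filter this would come from configuration. */
    private final String localNodeName;

    public TicketRouting(String localNodeName) {
        this.localNodeName = localNodeName;
    }

    /** Extract the node suffix from a ticket id, or null if there is no suffix. */
    static String nodeSuffix(String ticketId) {
        if (ticketId == null) return null;
        int dash = ticketId.lastIndexOf('-');
        return (dash < 0) ? null : ticketId.substring(dash + 1);
    }

    /**
     * Decide whether a validate request (e.g. /cas/serviceValidate?ticket=ST-...)
     * can be handled locally or must go to the node that created the ticket.
     */
    public String targetNode(HttpServletRequest request) {
        String suffix = nodeSuffix(request.getParameter("ticket"));
        if (suffix == null || suffix.equals(localNodeName)) {
            return localNodeName;   // handle it here, no extra network transaction
        }
        return suffix;              // forward the request to the owning node
    }
}
```

With two nodes and random assignment, targetNode() returns the local node about half the time; only the other half requires a forwarded request.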
CushyFrontEndFilter works with Ehcache or CushyTicketRegistry. When added to Ehcache, you can change the cache configuration so that the Service Ticket cache does not use synchronous replication, or better yet turn off replication for that cache entirely: a Service Ticket is either used and discarded or times out within about 10 seconds, so there is no point in replicating Service Tickets at all if the front end or the Filter routes requests properly.
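For illustration only, here is a hedged sketch of what a non-replicated Service Ticket cache might look like if it were set up programmatically with the Ehcache 2.x API. The cache name, sizes, and timeouts are placeholders, not the actual Yale configuration, which would normally live in ehcache.xml:

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;

// Hypothetical sketch: with front end or filter routing in place, the Service
// Ticket cache can be a purely local cache with no replication listener attached.
public class ServiceTicketCacheSetup {
    public static Cache createLocalServiceTicketCache(CacheManager manager) {
        CacheConfiguration config = new CacheConfiguration("serviceTicketsCache", 20000)
                .timeToLiveSeconds(10)   // STs are used or expire within seconds anyway
                .eternal(false);
        Cache cache = new Cache(config); // no cacheEventListenerFactory, no RMI replication
        manager.addCache(cache);
        return cache;
    }
}
```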
However, once you come up with the idea of using front end routing to avoid synchronous ticket replication (which was the source of the JBoss Cache crashes at Yale), some more radical changes to the TicketRegistry become possible. In addition to the various validate requests, you can route the /proxy request to the node that owns the Proxy Granting Ticket, and you can route new Service Ticket requests to the node that issued the Ticket Granting Ticket (based on the suffix of the CASTGC cookie). At that point a basic principle of all the existing ticket registry designs is no longer necessary: CAS Ticket objects do not have to be stored in what appears to be a common shared pool. Tickets can be segregated into separate collections based on the identity of the node that created and "owns" the ticket.
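The same suffix trick works for the CASTGC cookie. As a hypothetical companion to the validate-routing sketch above (again with made-up names, not actual CushyFrontEndFilter code), a request that will create a new Service Ticket can be routed by the suffix of the TGT id carried in that cookie:

```java
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

// Hypothetical sketch: requests that create new Service Tickets carry the CASTGC
// cookie, whose value is a TGT id ending in the same kind of node suffix, so they
// can be routed to the node that issued (and owns) the TGT.
public class TgtRouting {

    /** Return the node suffix of the CASTGC cookie, or null if the cookie is absent. */
    static String tgtOwner(HttpServletRequest request) {
        Cookie[] cookies = request.getCookies();
        if (cookies == null) return null;
        for (Cookie c : cookies) {
            if ("CASTGC".equals(c.getName())) {
                String tgtId = c.getValue();   // e.g. "TGT-17-x9y8z7-casnode1"
                int dash = tgtId.lastIndexOf('-');
                return (dash < 0) ? null : tgtId.substring(dash + 1);
            }
        }
        return null;
    }
}
```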
"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.
For objects to be replicated from one node to another, libraries use the Java writeObject statement to "serialize" the object to a stream of bytes that can be transmitted over the network and then restored in the receiving JVM. Ehcache and JBoss Cache use writeObject on individual tickets (although it turns out they also end up serializing copies of all the other objects the ticket points to, including the TGT, when attempting to replicate a ST). However, writeObject can operate just as well on the entire contents of the TicketRegistry. Making a "checkpoint" copy of the entire collection of tickets to disk (at shutdown, for example) and then restoring that collection (after a restart) is very simple to code. Since Java does all the work, it is guaranteed to behave correctly. By itself it is a useful additional function. However, you can be more aggressive in the use of this approach, and that suggests the design of an entirely different type of TicketRegistry.
Start with the DefaultTicketRegistry source that CAS uses to hold tickets in memory on a single standalone server. Then add the writeObject statement (surrounded by the code to open and close the file) to create a checkpoint copy of all the tickets, and a corresponding readObject with its surrounding code to restore the tickets to memory. The first thought was to do the writeObject to a network socket, because that was what all the other TicketRegistry implementations were doing. Then it became clear that it was simpler, more generally useful, and a safer design if the data was first written to a local disk file. The disk file could then optionally be transmitted over the network in a completely independent operation. Going to disk first produced code that is useful for both standalone and clustered CAS servers, and it guaranteed that the network operations were completely separated from the Ticket objects and therefore from the basic CAS function.
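A minimal sketch of that checkpoint-and-restore idea, assuming the registry keeps its tickets in a serializable ConcurrentHashMap the way DefaultTicketRegistry does (the class and method names here are illustrative, not the actual CushyTicketRegistry source):

```java
import java.io.*;
import java.util.concurrent.ConcurrentHashMap;
import org.jasig.cas.ticket.Ticket;

// Sketch only: checkpoint the whole ticket collection to a local disk file and
// read it back after a restart. Java serialization does all the real work.
public class CheckpointSketch {

    private ConcurrentHashMap<String, Ticket> tickets = new ConcurrentHashMap<String, Ticket>();

    /** Write the entire collection of tickets to a local disk file ("checkpoint"). */
    public void checkpoint(File file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(tickets);   // serializes every ticket reachable from the map
        } finally {
            out.close();
        }
    }

    /** Read the tickets back into memory, for example after a reboot. */
    @SuppressWarnings("unchecked")
    public void restore(File file) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        try {
            tickets = (ConcurrentHashMap<String, Ticket>) in.readObject();
        } finally {
            in.close();
        }
    }
}
```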
...
CAS is a very important application, but on modern hardware it is awfully small and cheap to run. Since it was first developed there have been at least five generations of new chip technology applied to what was never a big application to begin with.
...