- Recover tickets after reboot without JPA, or a separate server, or a cluster (works on a standalone server)
- Recover tickets after a crash, except for the last few seconds of activity that did not get to disk.
- No dependency on any external library. Pure Java using only the standard Java SE runtime.
- All source in one class. A Java programmer can read it and understand it.
- Can also be used to cluster CAS servers
- Cannot crash CAS ever, no matter what is wrong with the network or other servers.
- A completely different and simpler approach to the TicketRegistry. Easier to work with and extend.
- Probably uses more CPU and network I/O than other TicketRegistry solutions, but it has a constant, predictable overhead that you can measure and verify is trivial.
CAS is a Single Sign-On solution. Internally the function of CAS is to create, update, and delete a set of objects it calls "Tickets" (a word borrowed from Kerberos). A Logon Ticket (TGT) object is created to hold the Netid when a user logs on to CAS. A partially random string is generated to be the login ticket-id; it is sent back to the browser as a Cookie and is also used as a "key" to locate the logon ticket object in a table. Similarly, CAS creates Service Tickets (ST) to identify a user to an application that uses CAS authentication.
...
Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of JBoss Cache. Red Hat failed to diagnose or fix the problem. We considered replacing JBoss Cache with Ehcache, but there is a more fundamental problem here. It should not be possible for any failure of the data replication mechanism to crash all of the CAS servers at once. Another choice of cache might be more reliable, but it would suffer from the same fundamental structural problem.
All of the previous CAS cluster solutions create a common pool of tickets shared by all of the cluster members. They are designed and configured so that the Front End can distribute requests in a round-robin approach and any server can handle any request. However, once the Service Ticket is returned by one server, the request to validate the ST comes back in milliseconds. So JPA must write the ST to the database, and Ehcache must synchronously replicate the ST to all the other servers, before the ST ID is passed back to the browser. Synchronous replication was the option that exposed CAS to crashing if the replication system had problems, and it imposed a severe performance constraint that requires all the CAS servers to be connected by very high speed networking.
Disaster recovery and very high availability suggest that at least one CAS server should be kept at a distance, independent of the machine room, its power supply, and its support systems. So there is tension between performance considerations, which keep servers close, and recovery considerations, which keep them distant.
CAS was designed with the ability to add a node identifier at the end of every generated ticketid. This capability is not widely used, because it has no particular purpose and is rather difficult to configure. CushyClusterConfiguration makes it easy to configure, and modern Front End programming or the CushyFrontEndFilter use this capability to improve CAS performance and increase reliability.
Ten years ago, when CAS was being designed, the Front End that distributed requests to members of the cluster was typically an ordinary computer running simple software. Today networks have become vastly more sophisticated, and Front End devices are specialized machines with powerful software. They are designed to detect and fend off Denial of Service Attacks. They improve application performance by offloading SSL/TLS processing. They can do "deep packet inspection" to understand the traffic passing through and route requests to the most appropriate server (called "Layer 5-7 Routing" because requests are routed based on higher level protocols rather than just IP address or TCP session). Although this new hardware is widely deployed, CAS clustering has not changed and has no explicit option to take advantage of it.
Front End devices know many protocols and a few common server conventions. For everything else they expose a simple programming language. While CAS performs a Single Sign On function, the logic is actually designed to create, read, update, and delete tickets. The ticketid is the center of each CAS operation. In different requests there are only three places to find the ticketid that defines this operation:
- In the ticket= parameter at the end of the URL for validation requests.
- In the pgt= parameter for a proxy request.
- In the CASTGC Cookie for browser requests.
Programming the Front End to recognize that "/validate", "/serviceValidate", and two other strings in the URL path mean case 1, that "/proxy" means case 2, and that everything else is case 3 is pretty simple. If you cannot program your Front End, then CushyFrontEndFilter does the coding in Java, although this will occasionally add an extra network hop.
The existing CAS TicketRegistry solutions must be configured to replicate tickets to the other nodes and to wait for this activity to complete, so that any node can validate a Service Ticket that was generated a few milliseconds ago. Waiting for the replication to complete is what makes CAS vulnerable, but it seemed like an obvious solution because Ehcache and JBoss Cache promised that it would work. Once it didn't work, someone was bound to ask whether there was another way to do this. If there was, then the entire TicketRegistry strategy could be reconsidered.
It is easier and more efficient to send the request to the node that already has the ticket and can process it rather than struggling to get the ticket to every other node in advance of the next request.
It turns out that most modern network Front End devices that distribute requests to members of a cluster can be programmed with the small amount of CAS protocol needed to locate the ticketid for a request. If you activate the CAS feature that puts a node identifier on the end of the ticketid, then requests can be routed to the node that created the ticket. If you cannot get your network administrator to program your Front End device, then CushyFrontEndFilter does the same thing, not as efficiently as the network box could, but more efficiently than processing requests using "naked" Ehcache without the Filter.
With programming in the Front End or the Filter, CAS nodes can "own" the tickets they create. Those tickets have to be replicated to other nodes so there is recovery when a node crashes, but that replication can be lazy and happen over seconds instead of milliseconds, and no request has to wait for the replication to complete.
That is a better way to run Ehcache or any of the traditional TicketRegistry solutions, but it also opens up the possibility of doing something entirely different: periodically replicating the whole TicketRegistry instead of just the individual tickets.
"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.
For objects to be replicated from one node to another, programs use the Java writeObject method (of ObjectOutputStream) to "Serialize" the object to a stream of bytes that can be transmitted over the network and then restored in the receiving JVM. Ehcache and the other ticket replication systems operate on individual tickets. However, writeObject can operate just as well on the entire contents of the TicketRegistry. This is very simple to code and it is guaranteed to work, but it might not be efficient enough to use. Still, once you have the idea the code starts to write itself.
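To make the idea concrete, here is a minimal sketch of serializing a whole ticket collection with a single writeObject call and restoring it with readObject. It is not the CushyTicketRegistry source: DemoTicket, SnapshotRegistry, and the sample ticket ids are hypothetical stand-ins, and the bytes go to a byte array where a real registry would presumably write a disk file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a CAS ticket; real tickets carry much more state. */
final class DemoTicket implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String netid;

    DemoTicket(String id, String netid) {
        this.id = id;
        this.netid = netid;
    }
}

/** Hypothetical registry that checkpoints all of its tickets in one operation. */
final class SnapshotRegistry {
    private final Map<String, DemoTicket> tickets = new ConcurrentHashMap<>();

    void addTicket(DemoTicket ticket) { tickets.put(ticket.id, ticket); }
    DemoTicket getTicket(String id) { return tickets.get(id); }
    int size() { return tickets.size(); }

    /** Serializes the entire collection of tickets to a byte array. */
    byte[] checkpoint() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            // Copy first so the snapshot is not affected by concurrent updates.
            out.writeObject(new HashMap<>(tickets));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds a registry from bytes produced by checkpoint(). */
    static SnapshotRegistry restore(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            @SuppressWarnings("unchecked")
            Map<String, DemoTicket> saved = (Map<String, DemoTicket>) in.readObject();
            SnapshotRegistry registry = new SnapshotRegistry();
            registry.tickets.putAll(saved);
            return registry;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same byte array could be written to a file after a reboot or shipped to another node; only the destination of the stream changes.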
...
Front End or CushyFrontEndFilter
If the Front End can be programmed to understand CAS protocol, to locate the ticketid, to extract the node identifying suffix from the ticketid, and to route requests to the CAS server that generated the ticket, then CAS does not have to wait for each Service Ticket ID to be replicated around the cluster. This is much simpler and more efficient, and the Cushy design started by assuming that everyone would see that this is an obviously better idea.
Unfortunately, it became clear that people in authority frequently had a narrow view of what the Front End should do, and that was frequently limited to the set of things the vendor pre-programmed into the device. Furthermore, there was some reluctance to depend on the correct functioning of something new no matter how simple it might be.
So with another couple of days of programming (much of it spent understanding the multithreaded SSL session pooling support in the latest Apache HttpClient code), CushyFrontEndFilter was created. The idea was to code in Java the exact same function that is better performed by an iRule in the BIG-IP F5 device, so that someone would be able to run all the Cushy programs even if he was not allowed to change his own F5. Front End devices know many protocols and a few common server conventions. For everything else they expose a simple programming language. The Filter contains the same logic written in Java.
We begin by assuming that the CAS cluster has been configured by CushyClusterConfiguration or its equivalent, and that one part of configuring the cluster was to create a unique ticket suffix for every node and feed that value to the beans configured in the uniqueIdGenerators.xml file.
After login, the other CAS requests all operate on tickets. They generate Service Tickets and Proxy Granting Tickets, validate tickets, and so on. The first step is to find the ticket that is important to this request. There are only three places to find the ticketid that defines an operation:
- In the ticket= parameter at the end of the URL for validation requests.
- In the pgt= parameter for a proxy request.
- In the CASTGC Cookie for browser requests.
A validate request is identified by having a particular "servletPath" value ("/validate", "/serviceValidate", "/proxyValidate", "/samlValidate"). The Proxy request has a different path ("/proxy"). Service Ticket create requests come from a browser that has a CASTGC cookie. If none of the servletPath values match and there is no cookie, then this request is not related to a particular ticket and can be handled by any CAS server.
If you program this into the Front End, then the request goes directly to the right server without any additional overhead. With only the Filter, a request goes to some randomly chosen CAS Server which may have to forward the request to another server, forward back the response, and handle failure if the preferred server goes down.
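The routing decision described above can be sketched in a few lines of Java. This is an illustration, not the CushyFrontEndFilter source: TicketRouter and its method names are hypothetical, the caller is assumed to have already extracted the ticket= and pgt= parameters and the CASTGC cookie value, and the node suffix is assumed to be everything after the last hyphen in the ticketid.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper that decides which node "owns" the ticket in a request. */
final class TicketRouter {
    private static final Set<String> VALIDATE_PATHS = new HashSet<>(Arrays.asList(
            "/validate", "/serviceValidate", "/proxyValidate", "/samlValidate"));

    /** Picks the ticketid that defines the request, or empty if there is none. */
    static Optional<String> findTicketId(String servletPath, String ticketParam,
                                         String pgtParam, String castgcCookie) {
        if (VALIDATE_PATHS.contains(servletPath)) {
            return Optional.ofNullable(ticketParam);   // case 1: ticket= parameter
        }
        if ("/proxy".equals(servletPath)) {
            return Optional.ofNullable(pgtParam);      // case 2: pgt= parameter
        }
        return Optional.ofNullable(castgcCookie);      // case 3: CASTGC cookie
    }

    /** Assumption: the node suffix is the text after the last '-' in the ticketid. */
    static Optional<String> nodeSuffix(String ticketId) {
        int dash = ticketId.lastIndexOf('-');
        if (dash < 0 || dash == ticketId.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(ticketId.substring(dash + 1));
    }

    /** Empty result means the request can be handled by any CAS server. */
    static Optional<String> ownerNode(String servletPath, String ticketParam,
                                      String pgtParam, String castgcCookie) {
        return findTicketId(servletPath, ticketParam, pgtParam, castgcCookie)
                .flatMap(TicketRouter::nodeSuffix);
    }
}
```

For example, with the made-up ticketid "ST-7-xyz-casvm02", ownerNode("/serviceValidate", "ST-7-xyz-casvm02", null, null) yields the suffix "casvm02", which the Front End or Filter would then map to a server address. If node names can themselves contain hyphens, the suffix rule would need a different delimiter.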
CushyTicketRegistry and a CAS Cluster
...
Cushy models its design on two 40-year-old concepts. A common strategy for backing disks up to tape was to do a full backup of all the files once a week, and then during the week to do an incremental backup of the files changed since the last backup. The term "checkpoint" derives from a disk file into which an application periodically saved all its important data so it could restore that data and pick up where it left off after a system crash. These strategies work because they are too simple to fail. More sophisticated algorithms may accomplish the same result with less processing and I/O, but the more complex the logic, the more vulnerable you become if a software, hardware, or network failure occurs in a way that the complex software did not anticipate.
Ehcache is a large library of complex code designed to merge changes to shared data across multiple hosts. Cushy is a single source file of pure Java written to be easily understood.
Replicating the entire TicketRegistry instead of just replicating individual tickets is less efficient. The amount of overhead is predictable and you can verify that the extra overhead is trivial. However, remember this is simply the original Cushy 1.0 design which was written to prove a point and is aggressively "in your face" pushing the idea of "simplicity over efficiency". After we nail down all the loose ends, it is possible to add a bit of extra optimization to get arbitrarily close to Ehcache in terms of efficiency. You can do that when the code base is this small.
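The full-backup-plus-incremental model can be illustrated with a small sketch. It is not taken from the Cushy source: IncrementalTracker, Incremental, and restore are hypothetical names, tickets are reduced to string payloads, and the sketch assumes each incremental is cumulative since the last full checkpoint, so a recovering node needs only the latest checkpoint and the latest incremental.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker for tickets changed since the last full checkpoint. */
final class IncrementalTracker {
    /** Changes since the last full checkpoint: new tickets and deleted ids. */
    static final class Incremental {
        final Map<String, String> added;
        final Set<String> deleted;

        Incremental(Map<String, String> added, Set<String> deleted) {
            this.added = added;
            this.deleted = deleted;
        }
    }

    private final Map<String, String> tickets = new HashMap<>();
    private final Set<String> addedIds = new HashSet<>();
    private final Set<String> deletedIds = new HashSet<>();

    synchronized void add(String id, String payload) {
        tickets.put(id, payload);
        addedIds.add(id);
        deletedIds.remove(id);
    }

    synchronized void delete(String id) {
        tickets.remove(id);
        addedIds.remove(id);
        // Recording an id that was never checkpointed is harmless on restore.
        deletedIds.add(id);
    }

    /** Full copy of every ticket; resets the change tracking. */
    synchronized Map<String, String> fullCheckpoint() {
        addedIds.clear();
        deletedIds.clear();
        return new HashMap<>(tickets);
    }

    /** Everything that changed since the last full checkpoint. */
    synchronized Incremental incremental() {
        Map<String, String> added = new HashMap<>();
        for (String id : addedIds) {
            added.put(id, tickets.get(id));
        }
        return new Incremental(added, new HashSet<>(deletedIds));
    }

    /** Rebuilds the current ticket set from a checkpoint plus one incremental. */
    static Map<String, String> restore(Map<String, String> checkpoint, Incremental inc) {
        Map<String, String> result = new HashMap<>(checkpoint);
        result.keySet().removeAll(inc.deleted);
        result.putAll(inc.added);
        return result;
    }
}
```

Making the incremental cumulative costs a little more I/O as the interval grows, but it means a lost or late incremental never has to be replayed in order, which fits the "too simple to fail" goal.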
Basic Principles
- CAS is very important, but it is also small and cheap to run.
- Emphasize simplicity over efficiency as long as the cost remains trivial.
- The Front End gets the request first and it can be told what to do to keep the rest of the work simple. Let it do its job.
- Hardware failure doesn't have to be completely transparent. We can allow one or two users to get a bad message if everything works for the other 99.9% of the users. Trying to do better than this is the source of most 100% system failures.
...