...
All of the previous CAS cluster solutions create a common pool of tickets shared by all of the cluster members. They are designed and configured so that the Front End can distribute requests in a round-robin approach and any server can handle any request. However, once a Service Ticket is returned by one server, the request to validate the ST comes back in milliseconds. So JPA must write the ST to the database, and Ehcache must synchronously replicate the ST to all the other servers, before the ST ID is passed back to the browser. Synchronous replication exposed CAS to crashing if the replication system had problems, and it imposed a severe performance constraint requiring all the CAS servers to be connected by very high speed networking.
Disaster recovery and very high availability suggest that at least one CAS server should be kept at a distance, independent of the machine room, its power supply, and its support systems. So there is tension between performance considerations, which push the servers close together, and recovery considerations, which push them apart.
Ten years ago, when CAS was being designed, the Front End that distributed requests to members of the cluster was typically an ordinary computer running simple software. Today networks have become vastly more sophisticated, and Front End devices are specialized machines with powerful software. They are designed to detect and fend off Denial of Service Attacks. They improve application performance by offloading SSL/TLS processing. They can do "deep packet inspection" to understand the traffic passing through and route requests to the most appropriate server (called "Layer 5-7 Routing" because requests are routed based on higher level protocols rather than just IP address or TCP session). Although this new hardware is widely deployed, CAS clustering has not changed and has no explicit option to take advantage of it.
...
"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.
There are two big ideas.
- As soon as a CAS request can be processed, before it has been turned over to Spring or any of the main CAS code, look at the URL and headers, find the ticket id, and, based on its suffix, route or forward the request to the server that created the ticket and is guaranteed to have it in memory.
- The number of CAS tickets in a server is small enough that it is possible to think in terms of replicating the entire registry instead of just individual tickets. As long as you do this only occasionally (every few minutes), the overhead is reasonable. Java's writeObject statement can process an entire Collection just as easily as it can a single ticket object.
Given these two ideas, the basic initial Cushy design was obvious and the code could be written in a couple of days. Start with the DefaultTicketRegistry source code that CAS uses to hold tickets in memory on a single standalone CAS server. Then add the writeObject statement (surrounded by the code to open and close the file) to create a checkpoint copy of all the tickets, and a corresponding readObject and surrounding code to restore the tickets to memory. The first thought was to do the writeObject to a network socket, because that was what all the other TicketRegistry implementations were doing, but then it became clear that it was simpler, more generally useful, and a safer design to first write the data to a local disk file. The file could then optionally be transmitted over the network in a completely independent operation. Going to disk first created code that was useful for both standalone and clustered CAS servers, and it guaranteed that the network operations were completely separated from the main CAS function.
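As a rough illustration of that idea (not the actual Cushy source; the class, field, and method names here are invented for the sketch), the whole checkpoint mechanism reduces to one writeObject and one readObject around the collection of tickets:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

import org.jasig.cas.ticket.Ticket;

// Illustrative sketch only: the real CushyTicketRegistry also handles
// incremental files, timestamps, and per-node file names.
public class CheckpointSketch {

    private final ConcurrentHashMap<String, Ticket> cache =
            new ConcurrentHashMap<String, Ticket>();

    /** Write every ticket in memory to a local checkpoint file with one writeObject. */
    public void writeCheckpoint(File checkpointFile) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(checkpointFile));
        try {
            // Copy the live map into a plain ArrayList so the object graph
            // being serialized is stable while it is written.
            out.writeObject(new ArrayList<Ticket>(cache.values()));
        } finally {
            out.close();
        }
    }

    /** Restore the entire ticket collection from a checkpoint file with one readObject. */
    @SuppressWarnings("unchecked")
    public void readCheckpoint(File checkpointFile) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(checkpointFile));
        try {
            List<Ticket> tickets = (List<Ticket>) in.readObject();
            cache.clear();
            for (Ticket t : tickets) {
                cache.put(t.getId(), t);
            }
        } finally {
            in.close();
        }
    }
}
```

Because tickets chain to each other (an ST points to the TGT that created it), serializing the whole collection in one call also preserves those object references, which is harder to get right when tickets are replicated one at a time.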
...
Unfortunately, it became clear that people in authority frequently had a narrow view of what the Front End should do, often limited to the set of things the vendor pre-programmed into the device. Furthermore, there was some reluctance to depend on the correct functioning of something new, no matter how simple it might be.
So with another couple of days' worth of programming (much of it spent understanding the multithreaded SSL session pooling support in the latest Apache HttpClient code), CushyFrontEndFilter was created. The idea was to code in Java the exact same function that is better performed by an iRule in the BIG-IP F5 device, so that someone could run all the Cushy programs even if he was not allowed to change his own F5.
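A minimal sketch of what such a filter does is shown below, reduced to the Service Ticket validation case. The class name, the suffix-to-node map, and the use of HttpURLConnection for forwarding are simplifications for illustration; the real CushyFrontEndFilter uses the pooled Apache HttpClient mentioned above and handles more than this one request type.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch only: route a validate request to the node whose suffix appears
// at the end of the ticket id, otherwise let CAS handle it locally.
public class TicketSuffixRoutingFilterSketch implements Filter {

    // Hypothetical map from ticket-id suffix to node URL,
    // e.g. "casvm01" -> "https://casvm01.example.edu/cas"
    private Map<String, String> suffixToNodeUrl = new HashMap<String, String>();
    private String mySuffix;

    public void init(FilterConfig config) throws ServletException {
        // In a real deployment the map and local suffix come from configuration.
    }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) resp;

        // Validate requests carry the Service Ticket in the "ticket" parameter.
        String ticket = request.getParameter("ticket");
        String suffix = (ticket == null) ? null : ticket.substring(ticket.lastIndexOf('-') + 1);

        if (suffix == null || suffix.equals(mySuffix) || !suffixToNodeUrl.containsKey(suffix)) {
            chain.doFilter(req, resp);   // our ticket (or unknown suffix): handle locally
            return;
        }

        // Forward the GET to the node that created the ticket and copy back its reply.
        URL target = new URL(suffixToNodeUrl.get(suffix)
                + request.getServletPath() + "?" + request.getQueryString());
        HttpURLConnection conn = (HttpURLConnection) target.openConnection();
        int status = conn.getResponseCode();
        response.setStatus(status);
        InputStream in = (status >= 400) ? conn.getErrorStream() : conn.getInputStream();
        if (in != null) {
            OutputStream out = response.getOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            in.close();
        }
    }

    public void destroy() {
    }
}
```

The point is that everything the F5 iRule would do can be expressed in a few dozen lines of servlet Filter code, at the cost of one extra network hop when the Front End guesses the wrong node.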
CushyTicketRegistry and a CAS Cluster
Picking up where we left off in the Standalone Server discussion, the names of the checkpoint and incremental files are built from the unique node name of each server in the cluster, so they can all coexist in the same disk directory. The simplest Cushy communication option is "SharedDisk". When this is chosen, Cushy expects that the other nodes are writing their full backup and incremental files to the same disk directory it is using. If Cushy receives a request that the Front End should have sent to another node, it assumes that some node or network failure has occurred, loads the other node's tickets into memory from that node's last checkpoint and incremental file in the shared directory, and then processes the request on behalf of the other node.
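In SharedDisk mode the recovery step is essentially the readObject path from the earlier sketch pointed at the other node's files. Continuing that sketch class (the file-naming convention and the applyIncremental helper are assumptions for illustration, not the actual Cushy code):

```java
// Sketch: when a request arrives for a ticket whose suffix names another
// node, load that node's last full backup (and incremental, if present)
// from the shared directory.
public void recoverNode(String nodeName, File sharedDirectory)
        throws IOException, ClassNotFoundException {
    File checkpoint = new File(sharedDirectory, nodeName + ".checkpoint");    // assumed name
    File incremental = new File(sharedDirectory, nodeName + ".incremental");  // assumed name
    if (checkpoint.exists()) {
        readCheckpoint(checkpoint);        // full set of tickets at the last checkpoint
    }
    if (incremental.exists()) {
        applyIncremental(incremental);     // changes since the checkpoint
    }
}

// Placeholder: the real code would read the small file of tickets added and
// ticket ids deleted since the last checkpoint and apply them to the cache.
private void applyIncremental(File incremental) throws IOException, ClassNotFoundException {
    // omitted in this sketch
}
```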
...
Thanks to HTTPS, this request will not transmit the parameters unless the server first proves its identity with its SSL certificate. The request is then sent encrypted, so the dummyServiceTicketId is protected. Although this is a GET, there is no response data. It is essentially a "RESTful" Web Service request that sends its data as parameters.
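As a rough sketch of the sending side (the endpoint path and the filename parameter are assumptions for illustration; only dummyServiceTicketId comes from the description above), the call is just an HTTPS GET whose response body is ignored:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch of the Notify call: an HTTPS GET that carries its data as query
// parameters and ignores any response body.
public final class NotifySketch {

    public static void notifyNode(String nodeBaseUrl, String dummyServiceTicketId,
                                  String checkpointFileName) throws IOException {
        String query = "dummyServiceTicketId=" + URLEncoder.encode(dummyServiceTicketId, "UTF-8")
                + "&filename=" + URLEncoder.encode(checkpointFileName, "UTF-8");
        URL url = new URL(nodeBaseUrl + "/cluster/notify?" + query);   // hypothetical path

        // Because the URL is https, the other node must present a valid SSL
        // certificate before any of these parameters leave this server.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();   // only the status matters; the body is not used
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Notify to " + url.getHost() + " failed with HTTP " + status);
        }
        conn.disconnect();
    }
}
```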
Notify does three things:
...
Since these node-to-node communication calls are modeled on the existing CAS Service Ticket validation and Proxy Callback requests, they are configured into CAS in the same place (in the Spring MVC configuration; details are provided below).
Note: Yes, this sort of thing can be done with GSSAPI, but after looking into configuring Certificates or adding Kerberos, it made sense to keep it simple and stick with the solutions that CAS was already using to solve the same sort of problems in other contexts.
...