CAS is a Single Sign-On solution. Internally, the function of CAS is to create, update, and delete a set of objects it calls "Tickets" (a word borrowed from Kerberos). A Logon Ticket object is created to hold the Netid when a user logs on to CAS. A partially random string is generated to be the login ticket id; it is sent back to the browser as a Cookie and is also used as the "key" to locate the logon ticket object in a table. Similarly, CAS creates Service Tickets to identify a logged-on user to an application that uses CAS authentication.
CAS data or "state" only consists of its configuration parameters, its saved tickets, and a temporary "flow" state maintained between the time it writes the login form to the screen and the time the user submits a userid and password. Ticket objects are the only important data to save, restore, or share.A standalone CAS server stores ticket objects in memory, but when you add a second CAS server to the network for reliability, then you have to share the ticket objects either by sharing tables in a database or by configuring one of several packages that replicate Java objects over a networkstores its tickets in a puggable component called a TicketRegistry. CAS has one version of TicketRegistry for standalone servers, and at least four alternatives that can be used to share tickets among several CAS servers operating in a network cluster. This document describes a new approach that is simple, provides added function for the standalone server, and yet also operates in clustered configurations.
Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of the ticket replication mechanism. We considered replacing JBoss Cache, but there is a more fundamental problem here. No failure of the replication mechanism should be able to crash CAS, yet given the design of all the existing replication technologies, CAS cannot function properly if they fail. So replacing one magic black box of code with another, hoping the second is more reliable, is less desirable than fixing the original design problem.
CAS depends on replication because it assumes that the network Front End device that distributes Web requests will send them randomly among the CAS servers. Therefore, any server must be able to handle any request. That was necessary given the network technology of 10 years ago, but today these devices are much smarter. At Yale, and I suspect at many institutions, the Front End is a BIG-IP F5 device. It can be programmed with iRules, and it is fairly simple to create iRules that understand the basic CAS protocol. If the CAS servers are properly configured and the F5 is programmed, then requests from the browser for a new Service Ticket, or requests from an application to validate a Service Ticket ID, can be routed to the CAS server that created the ticket and not just to some random server in the cluster. After that, ticket replication becomes a much simpler and much more reliable process.
General object replication systems are necessary for shopping cart applications that handle thousands of concurrent users spread across a number of machines. E-Commerce applications don't have to worry about running when the network is generally sick or there is no database in which to record the transactions, but CAS is a critical infrastructure component that has to be up at all times.
Like everything else, Front End devices have gotten smarter and more flexible. Today they routinely handle complicated problems like routing requests to a specific Java Web server based on the JSESSIONID parameter generated by that server.
The CAS protocol isn't already built into any Front End, but it is fairly easy to define. First, you start with a cluster configuration process that uses a feature that has always been in CAS: appending to every Ticket ID string a value identifying the node that issued the ticket (for example, a Service Ticket ID might end in "-casvm01", where casvm01 names the issuing node). Then the path part of the URL ("/cas/serviceValidate"), one of the parameters in the query string (ticket= or pgt=), or a Cookie header ("CASTGC") supplies the ticket id that matters for this particular request. The string at the end of that ticket id selects the node to which the request should be routed, unless that node is down. At that point the only remaining need for ticket replication is to recover from node failure.
The CAS component that holds and replicates ticket objects is called the TicketRegistry. CushyTicketRegistry ("Cushy") is a new option you can configure in a CAS server. Cushy does useful things for a single standalone server, but it can also be configured to support a cluster of servers behind a modern programmable Front End device. It is explicitly not a general purpose object replication system; it handles CAS tickets. Because it is an easy to understand single Java source file with no external dependencies, it can be made incrementally smarter about ticket objects and how to manage them optimally when parts of the network fail and when full service is restored.
Unlike other TicketRegistry options, where calls to the object replication library are made directly from CAS Single Sign-On processing, Cushy completely separates standard CAS function from the replication function and, therefore, from any network activity. If there is a problem, Cushy periodically retries replication until the problem is fixed, but that happens entirely outside normal CAS request processing. Cushy can never crash a single CAS node, let alone the entire CAS system, as other technologies have been known to do.
"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.
...
Cushy writes tickets to a disk file. Some organizations install a CAS option that stores passwords in the ticket. If the ticket is written to disk, then you have to protect the disk file and directory. If this is unsatisfactory, use another TicketRegistry option.
If you are happy with your current TicketRegistry, then don't change. There are a lot of other people who seem to be unhappy and want something different. Cushy was created to offer an alternative, not to "close a sale" with any existing satisfied customer.
The Standalone Server
For a simple single standalone CAS server, the standard choice is the DefaultTicketRegistry class, which keeps the tickets in an in-memory Java table keyed by the ticket id string. Suppose you simply change the name of the Java class in the Spring XML file from DefaultTicketRegistry to CushyTicketRegistry (and add a few required parameters described later). Cushy was based on the DefaultTicketRegistry code, so everything works the same as before until you have to restart CAS for any reason. Since the DefaultTicketRegistry has only an in-memory table, all the ticket objects are lost when the application restarts and users have to log in again. Cushy detects the shutdown and saves all the ticket objects to a file on disk, using a single Java writeObject statement on the entire collection. Unless that file is deleted while CAS is down, when CAS restarts Cushy reloads all the tickets from that file into memory and restores all the CAS state from before the shutdown. No user even notices that CAS restarted unless they tried to access CAS during the restart.
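As a rough sketch of what that swap looks like in ticketRegistry.xml (the Cushy package name and the property shown below are placeholders for this illustration, not the real parameter list, which is described later):

<!-- ticketRegistry.xml: the stock in-memory registry -->
<!--
<bean id="ticketRegistry"
      class="org.jasig.cas.ticket.registry.DefaultTicketRegistry" />
-->

<!-- Swap in Cushy; the package and property here are illustrative only -->
<bean id="ticketRegistry"
      class="edu.yale.cas.ticket.registry.CushyTicketRegistry">
    <property name="checkpointDirectory" value="/var/cache/cas" />
</bean>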
...
So to take the next logical step, start with the previous ticketRegistry.xml configuration and duplicate the statements that currently run the RegistryCleaner every few minutes. The new statements call the "timerDriven" method of the ticketRegistry object (Cushy) every few minutes instead of the RegistryCleaner. Now Cushy does not wait for shutdown but backs up the ticket objects regularly, just in case the CAS machine crashes without shutting down normally. When CAS restarts, it can load a fairly current copy of the ticket objects, which will satisfy the 99.9% of users who did not log in during the last few minutes before the crash.
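For example, the stock ticketRegistry.xml schedules the RegistryCleaner through a Quartz MethodInvokingJobDetailFactoryBean and a trigger bean; the duplicated statements for Cushy would look roughly like the sketch below (the bean ids and intervals are illustrative, and the exact trigger class name varies with the Spring version in your CAS release):

<bean id="jobDetailTicketRegistryTimer"
      class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean">
    <!-- call ticketRegistry.timerDriven() instead of ticketRegistryCleaner.clean() -->
    <property name="targetObject" ref="ticketRegistry" />
    <property name="targetMethod" value="timerDriven" />
</bean>

<bean id="triggerTicketRegistryTimer"
      class="org.springframework.scheduling.quartz.SimpleTriggerFactoryBean">
    <property name="jobDetail" ref="jobDetailTicketRegistryTimer" />
    <property name="startDelay" value="60000" />       <!-- one minute after startup -->
    <property name="repeatInterval" value="300000" />  <!-- every five minutes -->
</bean>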
The next step should be obvious: can we turn "last few minutes" into "last few seconds"? Creating a complete backup of the entire set of tickets is not terribly expensive, but it is not something you want to do continuously. A full checkpoint takes a few seconds to build; an incremental takes a few milliseconds. So Cushy can be configured to create "incremental" files between full checkpoint backups. An incremental file contains all the changes accumulated since the last full checkpoint, so you do not have a bunch of files to process in order; there is only one incremental to apply.
Starting with the previous configuration, set the XML to call timerDriven every 10 seconds but configure CushyTicketRegistry to create a full checkpoint file only every (say) 5 minutes. Every 10 seconds, when it is not making a full checkpoint, Cushy instead writes an incremental file containing all the changes since the last checkpoint. If the system crashes at any point, Cushy reloads the last full checkpoint and then applies the latest incremental on top of it, restoring the tickets to within a few seconds of the crash. Now that 99.99% of the users are happy, it is time to quit.
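Concretely, and again with hypothetical property names standing in for the real Cushy parameters, the combination is just a faster trigger plus a checkpoint interval on the registry bean:

<!-- fire timerDriven() every 10 seconds -->
<bean id="triggerTicketRegistryTimer"
      class="org.springframework.scheduling.quartz.SimpleTriggerFactoryBean">
    <property name="jobDetail" ref="jobDetailTicketRegistryTimer" />
    <property name="startDelay" value="10000" />
    <property name="repeatInterval" value="10000" />
</bean>

<!-- but write a full checkpoint only every 5 minutes; the other ticks
     produce small incremental files (property names are hypothetical) -->
<bean id="ticketRegistry"
      class="edu.yale.cas.ticket.registry.CushyTicketRegistry">
    <property name="checkpointDirectory" value="/var/cache/cas" />
    <property name="checkpointIntervalSeconds" value="300" />
</bean>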
What about disaster recovery? The checkpoint and incremental files are ordinary sequential binary files on disk. When Cushy writes a new file it writes under a temporary name and then swaps it for the old file, so other programs authorized to access the directory can safely open or copy the files while CAS is running. This is useful because occasionally a computer that crashes cannot simply reboot. Since the two files on disk represent all the data that needs to be saved and restored, if you want to prepare for disaster recovery you can periodically copy the files to a location far away, in another data center or in the cloud. Cushy doesn't do this itself, but since they are normal files you can write a shell script or Perl or Python program that uses SFTP or any other file utility to copy them offsite, though this may be more DR planning than is really required.
Rethink the Cluster Design
Before you configure a cluster, remember that today a server is typically a virtual machine that is not bound to any particular physical hardware. Ten years ago moving a service to a backup machine involved manual work that took time. Today there is VM infrastructure and there are automated monitoring and control tools. A failed server can be migrated and restarted automatically or with a few commands. If you can get the CAS server restarted fast enough that almost nobody notices, then you have solved the problem that clustering was originally designed to solve without adding a second running node. All you need is Cushy's ability to save and restore the tickets, although today some people use the JPA TicketRegistry for the same purpose.
However, VMs are also cheap and you may prefer to run more than one CAS server. In this case, Cushy offers an entirely different approach to CAS clustering. This new approach is driven by new technology that has been added to machine rooms since the original CAS cluster design was developed.
The cluster will still run in a modern VM infrastructure. This means that individual CAS node outages should be measured in minutes instead of hours.
In any clustered application, all requests go to a single network address ("https://secure.its.yale.edu/cas") that points to a Front End machine. Ten years ago that Front End was dumb and simply distributed the requests round-robin across the set of back end servers. Today, Front End machines such as the BIG-IP F5 are much smarter, and they can be programmed with enough understanding of the CAS protocol that they only round-robin the initial login of new users. After that, when a request arrives at the CAS virtual IP address, the login ticket id is in the CASTGC Cookie HTTP header, the Service Ticket ID is in the ticket= parameter of the query string of a validate request, or the Proxy ticket ID is in the pgt= parameter of the query string of a /cas/proxy request. CAS has always had the ability to identify the node that created a ticket by a suffix added to all ticket ID strings; Cushy adds a formal methodology to configure that suffix and enforce the routing.
Cushy can be configured node by node, but Yale Production Services did not want to configure machines individually. So Cushy adds a configuration class in which you describe the cluster. Actually, you configure every CAS cluster you have in your enterprise (desktop sandbox, development, test, stress test, production, ...). When CAS starts up, the configuration class figures out which cluster this machine is a member of, and it configures that cluster and this machine. It also feeds a "ticket ID suffix" string to the CAS components that generate ticket IDs so that the Front End will route requests properly.
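Purely as an illustration of the idea (the class name, property layout, and host names below are invented for this sketch; the real configuration is documented later), the cluster definition might look like a single bean listing every cluster, from which each node works out its own identity and ticket suffix:

<!-- One bean describes every CAS cluster in the enterprise. At startup it
     determines which cluster this machine belongs to and derives the ticket
     ID suffix for this node. All names here are illustrative only. -->
<bean id="clusterConfiguration"
      class="edu.yale.cas.ticket.registry.CushyClusterConfiguration">
    <property name="clusterDefinition">
        <list>
            <!-- development cluster -->
            <list>
                <value>https://casdev01.example.edu:8443/cas/</value>
                <value>https://casdev02.example.edu:8443/cas/</value>
            </list>
            <!-- production cluster -->
            <list>
                <value>https://casprod01.example.edu:8443/cas/</value>
                <value>https://casprod02.example.edu:8443/cas/</value>
            </list>
        </list>
    </property>
</bean>

The suffix derived here would then be wired into the ticket ID generator beans (uniqueIdGenerators.xml in CAS 3.x/4.0) so that every ticket ID ends with a value the Front End can route on.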
How does Cushy handle clustering? At startup, it creates a "secondary" TicketRegistry for each of the other nodes in the cluster, which will hold a shadow copy of that node's tickets. However, as long as the network and the nodes are healthy, Cushy only needs access to, or a copy of, the full checkpoint and incremental files for each node. It does not open the files and restore the tickets until there is a failure.
...
The other CAS clustering techniques (JBoss Cache, Ehcache, Memcached) all use complex mechanisms to detect failure, to manage the outage, and to merge results when communication is reestablished. How exactly do they work? What will they do in every possible failure scenario? These systems are so complex and powerful that you have to assume they will eventually do everything right, because you cannot plausibly understand how they work. If the problem really were that big, there would be no other choice.
However, CAS tickets aren't really that complex. The requirements can be met by two simple steps: convert the objects to a file on disk, then transmit the file from node to node using HTTPS GET. There is no magic black box here that claims to solve all your problems as long as you don't look under the covers. This is a solution you can understand, own, and plan for. Yes, it is a little less efficient than the more sophisticated packages, but the problem is so small that efficiency is not required and simplicity is more valuable. This document fills in the remaining detail, and a moderately skilled Java programmer can read the source.
Cushy was created based on two principles:
Today the CAS ticket problem isn't that complicated and CAS can get along with the predictable behavior that results from simple code.
Tomorrow (after 4.0) the CAS Tickets will become more complicated, and then it will be helpful to have CAS-specific logic built into the replication mechanism that can merge data and repair tickets.
CAS Ticket Objects Need to be Fixed
...