CAS is a Single Sign-On solution. Internally, the function of CAS is to create, update, and delete a set of objects it calls "Tickets" (a word borrowed from Kerberos). A Login Ticket is created when a user presents a netid and password to CAS. The ticket object in CAS memory holds the netid, and a generated "ticket id string" that identifies and locates the ticket is sent back to the browser as a Cookie, associating the logged-in browser with the identity. Each time the browser uses this login to access an application, CAS generates a temporary Service Ticket that ties the application URL to the Login Ticket. The Service Ticket id string is passed through the browser to the application, which validates it back to CAS and receives the authenticated netid in return. These ticket objects are stored in a plugin component called the TicketRegistry.
A standalone CAS server stores ticket objects in memory. When you add a second CAS server to the network for reliability, you have to share the ticket objects, either by storing them in tables in a shared database or by configuring one of several packages that replicate Java objects over the network from the server that created the ticket to the other servers in the cluster.
Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of the ticket replication mechanism. We were disappointed that a mechanism nominally designed to improve availability should be a source of failure. We considered switching from JBoss Cache to an alternate library performing essentially the same service, but it was not clear that any of the alternative packages would solve all the problems.
General object replication systems are designed for applications, such as shopping carts, that handle thousands of concurrent users spread across a number of machines. That is more than CAS really needs. CAS has a relatively light load that could probably be handled by a single server, but it needs to be available all the time, even during disaster recovery when there may be unexpected network communication problems. It also turns out that CAS tickets violate some of the restrictions that general object replication systems place on application objects.
CushyTicketRegistry is a new option you can plug into the TicketRegistry component of CAS, designed specifically to support CAS in a modern network environment. It adds features that make a single standalone CAS server more available without configuring a cluster, and if you decide to add additional servers it provides an entirely different, much simpler, and more reliable approach to clustering two or more CAS servers and replicating tickets between them. It is simple because it is designed to the requirements of CAS and no other application.
The TicketRegistry is the component of CAS that stores the ticket objects. There are at least five versions of TicketRegistry to choose from (Default, JPA, JBoss, Ehcache, Memcached), and Cushy simply adds one more choice. While traditional applications are assembled at build time, CAS is based on the Spring Framework, which creates logical sockets for components that are plugged in at application startup, driven by XML configuration files. The ticketRegistry.xml file configures some object that implements the TicketRegistry interface. For a simple single standalone CAS server, the standard choice is the DefaultTicketRegistry class, which keeps the tickets in an in-memory Java table keyed by the ticket id string.
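For orientation, the plug-in point looks roughly like the sketch below. This is a simplified illustration, not the actual CAS source: the real TicketRegistry interface and DefaultTicketRegistry class live in the CAS codebase and have more methods and different package names.

    import java.io.Serializable;
    import java.util.Collection;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified sketch of the plug-in point; the real CAS classes differ.
    interface Ticket extends Serializable {
        String getId();
    }

    interface TicketRegistry {
        void addTicket(Ticket ticket);
        Ticket getTicket(String ticketId);
        boolean deleteTicket(String ticketId);
        Collection<Ticket> getTickets();
    }

    // Roughly what DefaultTicketRegistry does: an in-memory map
    // keyed by the ticket id string.
    class InMemoryTicketRegistry implements TicketRegistry {
        private final ConcurrentHashMap<String, Ticket> cache = new ConcurrentHashMap<>();

        public void addTicket(Ticket ticket) { cache.put(ticket.getId(), ticket); }
        public Ticket getTicket(String ticketId) { return cache.get(ticketId); }
        public boolean deleteTicket(String ticketId) { return cache.remove(ticketId) != null; }
        public Collection<Ticket> getTickets() { return cache.values(); }
    }

Whatever class the XML names is the one that owns the tickets; Cushy simply becomes that class.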
The Standalone Server
Suppose you simply change the class name in ticketRegistry.xml from DefaultTicketRegistry to CushyTicketRegistry (and add a few required parameters described later). Cushy was based on the DefaultTicketRegistry code, so while CAS runs everything works exactly the same. The difference appears when you shut CAS down. Since DefaultTicketRegistry only has an in-memory table, all the tickets are lost when the application restarts and everyone has to log in again. Cushy, however, detects the shutdown and saves all the ticket objects to a file on disk, using a single Java writeObject statement on the entire collection. Unless that file is deleted while CAS is down, when CAS restarts Cushy reloads all the tickets from the file into memory and CAS picks up where it left off. Users do not have to log in again, and nobody even notices that CAS restarted unless they tried to access CAS while it was down.
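The mechanism is nothing more than standard Java serialization. A minimal sketch of the checkpoint and restore steps, assuming the tickets implement java.io.Serializable (the class, file handling, and collection type here are illustrative, not the actual Cushy code):

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Map;

    // Illustrative checkpoint/restore; the core is one writeObject and
    // one readObject on the whole collection of tickets.
    class TicketCheckpoint {

        // Write every ticket object to one file in a single writeObject call.
        static void checkpoint(Map<String, ? extends Serializable> tickets, File file)
                throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(new ArrayList<Serializable>(tickets.values()));
            }
        }

        // Read the collection back after a restart.
        @SuppressWarnings("unchecked")
        static ArrayList<Serializable> restore(File file)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(file))) {
                return (ArrayList<Serializable>) in.readObject();
            }
        }
    }

On shutdown Cushy does something equivalent to checkpoint(); on startup it does something equivalent to restore() and rebuilds the in-memory table keyed by ticket id.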
The number of tickets CAS holds grows during the day and shrinks overnight, with the largest number occurring late in the day. It turns out that the largest number of tickets normally encountered at Yale can be written to disk in less than a second, and the resulting file is smaller than 3 megabytes. This is such a small amount of overhead that you don't have to wait for shutdown; Cushy can be proactive. The standard ticketRegistry.xml file already contains a few XML elements that configure a timer-driven component called the RegistryCleaner, which runs periodically to delete expired tickets. If you duplicate those XML elements so the timer also calls the "timerDriven" method of CushyTicketRegistry, then at the regular interval you select (every few minutes) Cushy will write a complete copy of the tickets to disk "just in case". That way, if CAS crashes instead of shutting down normally, it can still restore the most recently written set of tickets when it comes back up, missing only the ones created between the last backup and the crash.
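Outside of the Spring XML, the effect of that timer configuration can be sketched in plain Java; the timerDriven callback below stands in for the Cushy method of that name, and the five-minute interval is only an example.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative stand-in for the Spring timer configuration:
    // call the registry's periodic method every few minutes.
    class CheckpointTimer {
        static void start(Runnable timerDriven) {
            ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
            // Every 5 minutes, write a fresh full copy of the tickets.
            timer.scheduleAtFixedRate(timerDriven, 5, 5, TimeUnit.MINUTES);
        }
    }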
Writing a copy of all the tickets to disk doesn't use a lot of processing time, but it is still expensive enough that you would probably not do it every few seconds. So between full checkpoints, Cushy can be configured to write a second file, called the "incremental", containing only the tickets added or deleted since the last full checkpoint was written. The incremental takes only a few milliseconds to generate, so you can write it as frequently as you need. After a crash, Cushy uses one readObject to load the last full checkpoint and another to apply the last incremental, restoring the state as it was a few seconds before the crash.
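A minimal sketch of the incremental idea, with made-up class and field names (the real Cushy code differs): track the tickets added and the ids deleted since the last checkpoint, serialize just that delta, and replay it over the restored checkpoint.

    import java.io.*;
    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative incremental tracking; names are hypothetical.
    class IncrementalTracker implements Serializable {
        // Tickets created since the last full checkpoint, keyed by id.
        Map<String, Serializable> added = new HashMap<>();
        // Ids of tickets deleted since the last full checkpoint.
        Set<String> deletedIds = new HashSet<>();

        void writeTo(File file) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(this);   // a few milliseconds for a small delta
            }
        }

        // Replay the delta over a cache restored from the full checkpoint.
        void applyTo(ConcurrentHashMap<String, Serializable> cache) {
            cache.putAll(added);
            deletedIds.forEach(cache::remove);
        }
    }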
Occasionally the cause of a crash is a problem that prevents bringing CAS back up on the same host computer. The checkpoint and incremental files are plain ordinary disk files. They can be stored on the local disk of the CAS machine, on a file server, or on a SAN, NAS, or some other highly available disk technology. The farther away they are, the safer they are for disaster recovery, but the elapsed time to write the 3 megabytes becomes a few milliseconds longer. So rather than slowing CAS down, you can let CAS write the files to local disk and then have a shell script, or a program written in any language, wake up a second or so after CAS writes a file and copy it from local disk over the network to a safer remote location. If the CAS machine is unbootable, you can bring up a copy of CAS from the remote backup on other hardware.
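That external copy step can be as small as the sketch below; both paths are placeholders for this example, and a shell script using SFTP would do just as well.

    import java.nio.file.*;

    // Trivial external copy step: push the latest checkpoint to a
    // safer location. Both paths are placeholders.
    public class CopyCheckpoint {
        public static void main(String[] args) throws Exception {
            Path local  = Paths.get("/var/cache/cas/ticket-checkpoint.ser");
            Path remote = Paths.get("/mnt/backup/cas/ticket-checkpoint.ser");
            Files.copy(local, remote, StandardCopyOption.REPLACE_EXISTING);
        }
    }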
Rethink the Cluster Design
Before you configure a cluster, remember that today a server is typically a virtual machine that is not bound to any particular physical hardware. Ten years ago moving a service to a backup machine involved manual work that took time. Today there is VM infrastructure and automated monitoring and control tools. A failed server can be migrated and restarted automatically or with a few commands. If you can get the CAS server restarted fast enough that almost nobody notices, then you have solved the problem that clustering was originally designed to solve. All you need is Cushy's ability to save and restore the tickets.
However, VMs are also cheap and you may prefer to run more than one CAS server. In this case, Cushy offers an entirely different approach to CAS clustering. This new approach is driven by new technology that has been added to machine rooms since the original CAS cluster design was developed.
The cluster will still run in a modern VM infrastructure. This means that CAS server outages should be measured in minutes instead of hours.
In any clustered application, all requests go to a single network address ("https://secure.its.yale.edu/cas") that points to a Front End machine. Ten years ago that Front End was dumb and simply distributed the requests round-robin across the set of back end servers, so CAS clustering had to replicate ticket status almost immediately before the next request came in and was randomly assigned to an unpredictable node. Today, Front End machines, such as the BIG-IP F5, are much smarter and can be programmed with enough understanding of the CAS protocol that they only round robin the initial login of new users. After that, they should be able to route requests for your tickets, whether from the browser or from applications validating a ticket, to the node you logged into, which is also the node that created the ticket. Assuming a modern Front End, tickets only have to be replicated to protect against system failure, which allows replication to be measured in seconds instead of milliseconds (hence the term Lazy Replication).
The first draft of Cushy configured each server individually. At Yale, production services did not like the idea of maintaining a unique configuration file for each machine, so on request the CushyClusterConfiguration class was written to simplify things. With this class you configure all the CAS clusters you have anywhere in the network: sandbox clusters on a developer's laptop, and the formal development, functional test, load test, and production clusters in the machine room. When CAS comes up, CushyClusterConfiguration examines the machine it is running on and matches it to one of the machines in one of the cluster definitions. It then makes sure that every server has a unique name, that all members of the cluster know the names and network locations of the other members, and that a version of the server name is appended to every ticket id so the Front End can route requests properly. Finally, it assigns values to properties that configure the CushyTicketRegistry and other parts of the CAS Spring configuration, especially the unique ticket ID generation objects.
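To see why the ticket id suffix matters, note that a routing rule, whether written as an F5 iRule or sketched here in Java for clarity, only has to look at the text after the last '-' in the ticket id to know which node created the ticket. The ticket id format and node names below are illustrative.

    import java.util.Map;

    // Illustrative routing decision based on the node-name suffix that
    // CushyClusterConfiguration arranges to have appended to ticket ids.
    // Example ticket id: "ST-42-a1b2c3d4-casnode01"
    class TicketRouter {
        // Hypothetical mapping from node name to its network address.
        private final Map<String, String> nodeUrls;

        TicketRouter(Map<String, String> nodeUrls) { this.nodeUrls = nodeUrls; }

        String routeFor(String ticketId) {
            String nodeName = ticketId.substring(ticketId.lastIndexOf('-') + 1);
            // Fall back to round-robin (not shown) if the node is unknown or down.
            return nodeUrls.get(nodeName);
        }
    }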
When CushyTicketRegistry comes up with a cluster definition, it creates a secondary registry object for every other node in the cluster. Each secondary object manages communication with its node and, should that node fail, loads and holds a copy of the tickets that node created immediately before the failure.
The simplest communication option is SharedDisk. When it is chosen, Cushy expects that the other nodes are writing their full checkpoint and incremental files to the same disk directory it is using, and the secondary objects simply sit idle until another CAS server in the cluster fails and the Front End starts routing that server's requests to the remaining members of the cluster. When Cushy receives a request for a ticket that belongs to another node, it assumes some failure has occurred, loads the other node's tickets from the shared directory into the secondary object for that node, and processes the request on behalf of the failed server. The details are explained below.
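A rough sketch of that failover decision, using hypothetical class names and file layout (the real Cushy code differs):

    import java.io.File;
    import java.util.Map;

    // Hypothetical sketch of SharedDisk failover: if a request carries a
    // ticket created by another node, load that node's latest checkpoint
    // and incremental from the shared directory into its secondary registry.
    class SharedDiskFailover {
        private final File sharedDir;                       // e.g. an NFS mount
        private final Map<String, SecondaryRegistry> secondaries;

        SharedDiskFailover(File sharedDir, Map<String, SecondaryRegistry> secondaries) {
            this.sharedDir = sharedDir;
            this.secondaries = secondaries;
        }

        Object getTicket(String ticketId, String myNodeName) {
            String owner = ticketId.substring(ticketId.lastIndexOf('-') + 1);
            if (owner.equals(myNodeName)) {
                return null; // handled by the primary in-memory registry (not shown)
            }
            // The Front End normally never sends us this ticket, so the owner
            // is presumed down: restore its tickets from the shared files.
            SecondaryRegistry secondary = secondaries.get(owner);
            secondary.restoreFrom(new File(sharedDir, owner + "-checkpoint.ser"),
                                  new File(sharedDir, owner + "-incremental.ser"));
            return secondary.getTicket(ticketId);
        }

        // Minimal placeholder for the per-node secondary registry.
        interface SecondaryRegistry {
            void restoreFrom(File checkpoint, File incremental);
            Object getTicket(String ticketId);
        }
    }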
Of course you are free to implement SharedDisk with an actual file server or NAS, but technically Cushy doesn't know or care how the files got to the hard drive. If you don't like real shared disk technology, you can write a shell script that wakes up periodically and copies the files between machines using SFTP or whatever file transfer mechanism you like. You could even put the 3 megabytes on the Enterprise Service Bus if you prefer architecture to simplicity.
However, Cushy also provides a built-in data transfer solution based on simple HTTPS GET requests. CAS runs on a Web server, so the one thing CAS can reasonably assume is available is HTTP. Web servers are very good at sending the current copy of small files over the network to clients, the checkpoint file is small, the incremental file is smaller, and everyone understands how an HTTP GET works. So unless you configure SharedDisk, Cushy running in cluster mode uses HTTP GET to retrieve a copy of the most recent full checkpoint or incremental file from every other node in the cluster and puts the copy on the local hard disk of the machine.
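A minimal sketch of such a fetch; the URL path and file names are invented for this example and are not the actual Cushy endpoints.

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.*;

    // Illustrative fetch of another node's latest checkpoint over HTTPS.
    // The "/cas/cluster/checkpoint" path and host name are hypothetical.
    public class FetchCheckpoint {
        public static void main(String[] args) throws Exception {
            URL source = new URL("https://casnode02.example.edu/cas/cluster/checkpoint");
            Path local = Paths.get("/var/cache/cas/casnode02-checkpoint.ser");
            try (InputStream in = source.openStream()) {
                Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }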
You may now have realized that you do not actually need to use either real SharedDisk or Cushy HTTPS. Every 10 seconds or so Cushy writes one of the two files to a directory on local disk. You can write your own program in any language to wake up every 10 seconds; the time stamps on the files tell you exactly when Cushy last wrote them, and after a file has changed your program can copy it somewhere the other nodes can find it, using anything from FTP on the simple end to an Enterprise Service Bus on the more exotic end. These are just files, and figuring out how to distribute them around the network is fairly routine.
When a CAS node crashes, the other nodes use the most recent file they received to load up tickets and handle requests for the failed node. They do not care how the files got to them.
Everything that can go wrong will go wrong. It is easy to plan for a server crashing, but suppose a router breaks the connection between the Front End and one of the CAS servers, or a fiber optic connection between redundant data centers goes down for an hour before the traffic can be rerouted. Everything is up, but some machines cannot talk to each other. The Front End may believe a CAS server is down while the other CAS servers can still reach it, or the Front End may be able to talk to all the servers while they cannot talk to each other. And what about disaster recovery? With magic black box technology that is just supposed to take care of all these problems, you don't really know exactly what is going to happen. Cushy, by contrast, can be explained completely in terms of HTTP, a shared disk technology of your choice, or a file transfer program you decide to write.
The other CAS clustering techniques (JBoss Cache, Ehcache, Memcached) all use complex mechanisms to detect failure, to manage the outage, and to merge results when communication is reestablished. How exactly do they work? What will they do in every possible failure scenario? These systems are so complex and powerful that you have to assume they will eventually do everything right because you cannot plausibly understand how they work. If the problem really was that big, there would be no other choice.
However, CAS tickets aren't really that complex. The requirements can be met by two simple steps: convert the objects to a file on disk, then transmit the file from node to node using HTTPS GET. There is no magic black box here that claims to solve all your problems as long as you don't look under the covers. This is a solution you can understand, own, and plan for. Yes, it is a little less efficient than the more sophisticated packages, but the problem is so small that efficiency is not required and simplicity is more valuable. This document still has to fill in a little more detail, and a moderately skilled Java programmer can read the source. With Cushy you will know exactly how it works and therefore exactly what it will do in any failure situation.
CAS Ticket Objects Need to be Fixed
Now the bad news. Current CAS has some bugs. It was not written "properly" to work with the various ticket replication mechanisms. It has worked well enough in the past, but CAS 4 introduces new features, and in the future it may not behave as expected. It is not possible to fix everything in the TicketRegistry; a few changes may need to be made in the CAS Ticket classes. So Cushy does not fix these bugs itself, but it does eliminate the false reliance on "the magic black box of off the shelf software" that people imagined would do more than it could reasonably be expected to do.
...