View Source

In 2010 Yale upgraded to CAS 3.4.2 and implemented "High Availability" CAS clustering using the JBoss Cache option. Unfortunately, the mechanism designed to improve CAS relilability ended up as the cause of most CAS failures. If JBoss Cache (or any other clustering option) fails due to some unspecified network layer problem, requests back up in memory and eventually CAS stops running on all members of the cluster. None of the other available CAS clustering options have been reported to work flawlessly.

There is much to be said for "off the shelf" software solutions, even when they are designed to handle much more complicated problems. However, there is also something to be said for much, much simpler solutions that just solve the one problem you need to solve.

So CushyTicketRegistry was written to hold CAS tickets in memory and to replicate them to other CAS servers so they can take over if one server fails. It turns out that it is trivial (both in code and overhead) to snapshot the entire collection of tickets to a disk file using the Java writeObject operation. The resulting fairly small file can then be transferred between CAS servers using an HTTPS GET, because CAS runs on a Web Server so you might just as well use it. This approach may not be as efficient as the more sophisticated technology, but it is so dead flat simple that you can understand it, customize it, and arrange that it can never cause problems. More importantly, if it uses less than 5% of one core on a modern multicore commodity server, do you really need to be more efficient?

Every cluster of any type requires a network front end to route requests, detect failure, and maybe load balance. Cushy assumes that this front end is programmable, as most modern front ends are, and depends on routing rules that are entirely reasonable with today's devices.

Executive Summary

This is a quick introduction for those in a hurry.

CAS is a Single SignOn solution. Internally, it creates a set of objects called Tickets. There is a ticket for every logged on user, and short term Service Tickets that exist while a user is being authenticated to an application. The Business Layer of CAS creates tickets by, for example, validating your userid and password in a back end system like Active Directory. The tickets are stored in a plug in component called a Ticket Registry.

For a single CAS server, the Ticket Registry is just a in memory table of tickets (a Java "Map" object) keyed by the ticket ID string. When more than one CAS server is combined to form a cluster, then an administrator chooses one of several optional Ticket Registry solutions that allow the CAS servers to share the tickets.

One clustering option is to use JPA, the standard Java service to map objects to tables in a relational database. All the CAS servers share a database, which means that any CAS node can fail but the database has to stay up all the time or CAS stops working. Other solutions use generic object "caching" solutions (Ehcache, JBoss Cache, Memcached) where CAS puts the tickets into what appears to be a common container of Java objects and, under the covers, the cache technology ensures that the tickets are copied to all the other nodes.

JPA makes CAS dependent on a database. It doesn't really use the database for any real SQL stuff, so you could you almost any database system. However, the database is a single point of failure, so you need it to be reliable. If you already have a 24x7x365 database managed by professionals who can guarantee availability, this is a good solution. If not, then this is an insurmountable prerequisite for bringing up an application like CAS that doesn't really need database.

The various cache solutions should solve the problem. Unfortunately, they too have massively complex configuration parameters with multicast network addresses and timeouts, and while they are designed to work across complete node failure, experience suggests that they are not designed to work when a CAS machine is "sick". That is, if the machine is down and does not respond to any network requests the technology recovers, but if the node is up and receives messages but just doesn't process them correctly then queues start to clog up, they back up into CAS itself and then CAS stops working simultaneously on all nodes. There is also a problem with the "one big bag of objects" model if a router fails that connects two machine rooms, two CAS nodes are separated, and now there are separate versions of what the system is designed to believe is a single cohesive collection.

If you understand the problem CAS is solving and the way the tickets fit together, then each type of failure presents specific problems. Cushy is designed to avoid the big problems and provide transparent service to 99.9% of the CAS users. If one or two people experience an error message due to a CAS crash, and CAS crashes only once a year, then that is good enough especially when the alternative technologies can cause the entire system to stop working for everyone.

Cushy is a cute word that roughly stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale".

The name explains what it does.Java has a built in operation called writeObject that writes a binary version of Java objects to disk. If you use it on a complex object, like a list of all the tickets in the Registry, then it creates a disk file with all the tickets in the list. Later on you use readObject to turn the disk file back into a copy of the original list. Java calls this mechanism "Serialization". Using just one statement and letting Java do all the work and handle all the complexity makes this easy.

The other mechanisms (JPA or the cache technologies) operate on single tickets. They write individual tickets to the database or replicate them across the network. Obviously this is vastly more efficient than periodically copying all the tickets to disk. Except that at Yale, the entire Registry of tickets can be written to a disk file in 1 second and it produces a file about 3 megabytes in size. Those numbers are so trivial that writing a copy of the entire Registry to disk once every 5 minutes, or even once a minute, is trivial on a modern server. Given the price of hardware, being more efficient than that is unnecessary.

Once you have a file on disk it should not take very long to figure out how to get a copy of that file from one Web Server to another. An HTTP GET is the obvious solution, though if you had shared disk there are other solutions.

Going to an intermediate disk file was not the solution that first comes to mind. If the tickets are in memory on one machine and they have to be copied to memory on another machine, some sort of direct network transfer is going to be the first thing you think about. However, the intermediate disk file is useful to restore tickets to memory if you have to restart your CAS server for some reason. Mostly, it means that the network transmission is COMPLETELY separate from the process of creating, validating, and deleting tickets. If the network breaks down you cannot transfer the files, but CAS continues to operate normally and it can even generate new files with newer copies of all the tickets. When the network comes back the file transfer resumes independent of the main CAS services. So replication problems can never interfere with CAS operation.

Cushy is based on four basic design principles:

CAS is very important, but it is small and cheap to run.
Emphasize simplicity over efficiency as long as the cost to run remains trivial.
Assume the network front end is programmable.
Trying for perfection is the source of most total system failures. Allow one or two users to get a temporary error message when a CAS server fails.

How it works

Cushy is simple enough it can be explained to anyone, but if you are in a rush you can stop here.

Back in the 1960's a "checkpoint" was a copy of the important information from a program written on disk so if the computer crashed the program could start back at almost the point it left off. If a CAS server saves its tickets to a disk file, reboots, and then reads the tickets from the file back into memory it is back to the same state it had before rebooting. If you transfer the file to another computer and bring CAS up on that machine, it have moved the CAS server from one machine to another. Java writeObject and readObject guarantee the state and data are completely saved and restored.

JPA and the cache technologies try to maintain the image of a single big common bucket of shared tickets. This is a very simple view, but it is very hard to maintain and rather fragile. Cushy maintains a separate TicketRegistry for each CAS server, but replicates a copy of each TicketRegistry to all the other servers in the cluster.

Given the small cost of making a complete checkpoint, you could configure Cushy to generate one every 10 seconds and run the cluster on full checkpoints. It is probably inefficient, but using 1 second of one core and transmitting 3 megabytes of data to each node every 10 seconds is not a big deal on modern equipment. This was the first Cushy code milestone and it lasted for about a day before it was extended with a little extra code.

The next milestone (a day later) was to add an "incremental" file that contains all the tickets added or ticket ids of tickets deleted since the last full checkpoint. Creating multiple increments and transmitting only the changes the other node has not yet seen was considered, but it would require more code and complexity. If you generate checkpoints every few minutes, then the incremental file grows as more changes are made but it never gets really large. It is well know that the overhead of creating and opening a file or establishing a network connection is so great that the difference between reading or writing 5K or 100K is trivial.

In Cushy you configure a timer in XML. If you set the timer to 10 seconds, then Cushy writes a new incremental file every 10 seconds. Separately you configure the time between full checkpoints. When the timer goes off, if enough time has passed since the last checkpoint then instead of writing an incremental file, this time it writes a new full Checkpoint.

Only a small number of tickets are added, but lots of Service Tickets have been created and deleted and there is no good way to keep the list of expired Service Tickets from making the incremental file larger. So if you tried to separate full checkpoints by an unreasonable amount of time you would find the incremental file had grown to be larger than the checkpoint file and you have made things worse rather than better. So the expectation is you do a full checkpoint somewhere between every 1-10 minutes and you do an incremental somewhere between every 5 -15 seconds, but test it and make your own decisions.

A Service Ticket is created and then is immediately validated and deleted. Trying to replicate Service Tickets to the other nodes before the validation request comes in is an enormous problem that screws up the configuration and timing parameters for all the other Ticket Registry solutions. Cushy doesn't try to do replication at this speed. Instead, it has CAS configuration elements that ensure that each Ticket ID contains an identifier of the node that created it, and it depends on a front end smart enough to route any of the ticket validation requests to the node that created the ticket and already has it in memory. Then replication only is needed for crash recover.

Note: If the front end is not fully programmable it is a small programming exercise to be considered in Cushy 2.0 to forward the validation request from any CAS node to the node that owns the ticket and then pass back the results of the validation to the app.

Ticket Names

As with everything else, CAS has a Spring bean configuration file (uniqueIdGenerators.xml) to configure how ticket ids are generated. If you accept the defaults, then tickets have the following format:

type - num - random - nodename

where type is "TGT" or "ST", num is a ticket sequence number, random is a large random string like "dmKAsulC6kggRBLyKgVnLcGfyDhNc5DdGKT", and the suffix at the end of the ticket is identified as a nodename.

In vanilla CAS the nodename typically comes from the cas.properties file, but Cushy requires every node in the cluster to have a unique name and even when you are using real clustering many CAS locations leave the "nodename" suffix on the ticket id to its default value of "-CAS". Cushy adds a smarter configuration bean described below and enforces the rule that the end of the ticket really identifies the node that created it and therefore owns it.

How it Fails (Nicely)

The Primary + Warm Spare Cluster

One common cluster model is to have a single master CAS server that normally handles all the requests, and a normally idle backup server (a "warm spare") that does nothing until the master goes down. Then the backup server handles requests while the master is down.

During normal processing the master server is generating tickets, creating checkpoints and increments, and sends them to the backup server. The backup server is generating empty checkpoints with no tickets because it has not yet received a request.

Then the master is shut down or crashes. The backup server has a copy in memory of all the tickets generated by the master, except for the last few seconds before the crash. It can handle new logins and it can issue Service Tickets against logins previously processed by the master, using its copy of the master's registry.

Now the master comes back up and, for this example, let us assume that it resumes its role as master (there are configurations where the backup becomes the new master and so when the old master comes back it becomes the new backup. This is actually easier for Cushy).

The master restores from disk a copy of its old registry and over the network it fetches a copy of the registry from the backup. It now has access to all the login or proxy tickets created by the backup while it was down, and it can issue Service Tickets based on those logins.

However, the failure has left some minor issues that are not important enough to be problems. Because each server is the owner of its own tickets and registry, each has Read-Only access to the tickets of the other server. (Strictly speaking that is not true. You can temporarily change tickets in your copy of the other node's registry, but when the other node comes back up and generates its first checkpoint, whatever changes you made will be replaced by a copy of the old unmodified ticket). So the master is unaware of CAS logouts that occurred while it was down and although it can process a logout for a user that logged into the backup while it was down, it really has no way to actually delete the login ticket. Since no browser has the TGT ID in a cookie any more, nobody will actually be able to use the zombie TGT, but the ticket is going to sit around in memory until it times out.

There are a few more consequences to Single SignOut that will be explained in the next section.

A Smart Front End

A programmable front end is configured to send Validate requests to the CAS server that generated the Service Ticket, /proxy requests to the CAS server that generated the PGT, other requests of logged on users to the CAS server they logged into, and login requests based on standard load balancing or similar configurations. Each ticket has a suffix that indicates which CAS server node generated it.

If the URL "path" is a validate request (/cas/validate, /cas/serviceValidate, etc.) then route to the node indicated by the suffix on the value of the ticket= parameter.
If the URL is a /proxy request, route to the node indicated by the suffix of the pgt= parameter.
If the request has a CASTGC cookie, then route to the node indicated by the suffix of the TGT that is the cookie's value.
Otherwise, or if the node selected by 1-3 is down, choose a CAS node using whatever round robin or master-backup algorithm previously configured.

So normally all requests go to the machine that created and therefore owns the ticket, no matter what type of ticket it is. When a CAS server fails, requests for its tickets are assigned to one of the other servers. Most of the time the CAS server recognizes this as a ticket from another node and looks in the current shadow copy of that node's ticket registry.

As in the previous example, a node may not have a copy of tickets issued in the last few seconds, so one or two users may see an error.

If someone logged into the failed node needs a Service Ticket, the request is routed to any backup node which creates a Service Ticket (in its own Ticket Registry with its own node suffix which it will own) chained to the copy of the original Login Ticket in the appropriate shadow Ticket Registry. When that ticket is validated, the front end routes the request based on the suffix to this node which returns the Netid from the Login Ticket in the shadow registry.

Again, the rule that each node owns its own registry and all the tickets it created and the other nodes can't successfully change those tickets has certain consequences.

If you use Single SignOff, then the Login Ticket maintains a table of Services to which you have logged in so that when you logout or when your Login Ticket times out in the middle of the night then each Service gets a call from CAS on a published URL with the Service Ticket ID you used to login so the application can log you off if it has not already done so. In failover mode a backup server can issue Service Tickets for a failed nodes TGT, but it cannot successfully update the Service table in the TGT, because when the failed node comes back up it will restore the old Service table along with the old TGT.
If the user logs out and the Services are notified by the backup CAS server, and then the node that owned the TGT is restored along with the now undead copy of the obsolete TGT, then in the middle of the night that restored TGT will timeout and the Services will all be notified of the logoff a second time. It seems unlikely that anyone would ever write a service logout so badly that a second logoff would be a problem. Mostly it will be ignored.

You have probably guessed by now that Yale does not use Single SignOut, and if we ever enabled it we would only indicate that it is supported on a "best effort" basis.

CAS Cluster

In this document a CAS "cluster" is just a bunch of CAS server instances that are configured to know about each other. The term "cluster" does not imply that the Web servers are clustered in the sense that they share Session information. Nor does it depend on any other type of communication between machines. In fact, a CAS cluster could be created from a CAS running under Tomcat and one running under JBoss.

To the outside world, the cluster typically shares a common virtual URL simulated by the Front End device. At Yale, CAS is "https://secure.its.yale.edu/cas" to all the users and applications. The "secure.its.yale.edu" DNS name is associated with an IP address managed by the BIG-IP F5 device. It terminates the SSL, then examines requests and based on programming called iRules it forwards requests to any of the configured CAS virtual machines.

Each virtual machine has a native DNS name and URL. It is these "native" URLs that define the cluster because each CAS VM has to use the native URL to talk to another CAS VM. At Yale those URLs follow a pattern of "https://vm-foodevapp-01.web.yale.internal:8080/cas".

Internally, Cushy configuration takes a list of URLs and generates a cluster definition with three pieces of data for each cluster member: a nodename like "vmfoodevapp01" (the first element of the DNS name with dashes removed), the URL, and the ticket suffix that identifies that node (at Yale the F5 likes the ticket suffix to be an MD5 hash of the DNS name).

Sticky Browser Sessions

An F5 can be configured to have "sticky" connections between a client and a server. The first time the browser connects to a service name it is assigned any available backend server. For the next few minutes, however, subsequently requests to the same service go back to whichever server the F5 assigned to handle the first request.

Intelligent routing is based on tickets that exist only after you have logged in. CAS was designed (for better or worse) to use Spring Webflow which keeps information in the Session object during the login process. For Webflow to work, one of two things must happen:

The browser has to POST the Userid/Password form back to the CAS server that sent it the form (which means the front end has to use sticky sessions based on IP address or JSESSIONID value).
You have to use real Web Server clustering so the Web Servers all exchange Session objects based on JSESSIONID.

Option 2 is a fairly complex process of container configuration, unless you have already solved this problem and routinely generate JBoss cluster VMs using some canned script. Sticky sessions in the front end are somewhat easier to configure and obviously they are less complicated than routing request by parsing the ticket ID string.

Yale made a minor change to the CAS Webflow to store extra data in hidden fields of the login form, and an additonal check so if the Form POSTs back to another server the other server can handle the rest of the login without requiring Session data.

What is a Ticket Registry

CAS provides a Single SignOn function, but internally it is an application that creates, validates, and deletes objects called Tickets. One type of ticket (the Ticket Granting Ticket or TGT) is created when the user logs on to CAS by presenting a userid and password or other credential. Then when CAS is asked to authenticate a user to a Web application, it creates a temporary Service Ticket (ST) that is transmitted to the application and which the application validates by talking to CAS directly.

In terms of traditional "layered" application design, the user interface of CAS is provided by Spring MVC and Spring Web Flow. This layer obtains the userid and password from the user or accepts and processes the ticket validation request from the application. In the middle, the Business Logic of CAS verifies the userid and password against some back end system, frequently Active Directory, and it generates ticket IDs and creates ticket objects.

The Ticket Registry is the back end component. It fits into the layered architecture where normal applications have a database layer, but although CAS can optionally store tickets in a database it more commonly keeps them in memory. The TicketRegistry interface is a pluggable component into which you can insert (using Spring XML configuration) any of several JASIG modules. The CushyTicketRegistry is one implementation of the interface.

The JPA Ticket Registry depends on Java ticket objects and object they point to being annotated with references to tables and a column name and type for every object field. JPA then generates and "weaves" into your application a lot of invisible support code to track these objects and to notice when they have been modified.

The "cache" versions of JASIG TicketRegistry modules use what is informally called Plain Old Java Objects. Like Cushy, they require these objects to be serializable to a bunch of portable bytes. With that restriction, they work on any type of object, but they work best with objects that are not directly connected (by references) to other objects and especially not to Collections of objects.

Unfortunately, CAS tickets were not designed to serialize nicely. The problem requires a short discussion about ticket relationships.

When a CAS users logs in, CAS creates a Login TGT. The TGT has two interesting chains of objects. It has an "Authentication" that is connected to a "Principal" that points to the Netid string. To support Single SignOut, it also contains a table of used Service Ticket IDs and references to the Service object that contains the URL CAS contacts to tell the Service that the user has logged out.

When the user authenticates to other applications, the short lived Service Ticket points to the Login TGT. When the ST is validated, CAS follows the pointer from the ST to the TGT and then to the Authentication to the Principal to the Netid string, which it returns as part of the validation message.

A special kind of Service is allowed to Proxy. Such a service gets its own Proxy Granting Ticket which acts like a TGT in the sense that it generates Service Tickets and the ST points back to it, but a PGT does not have the Netid itself. Rather it points back to the real TGT that contains the Authentication data. So when validating a proxy ST, CAS follows the pointer in the ST to the PGT, then follows the pointer from the PGT to the Login TGT and finds the Netid there.

Unfortunately, neither this chain of tickets nor the Single SignOut Service table is the sort of thing you want to have when you use a standard cache mechanism to replicate Plain Old Java Objects. It will do it, but with consequences that may or may not be important depending on how you use them.

When Cushy does a full checkpoint of all the tickets, it doesn't matter how the tickets are chained together. Under the covers of the writeObject statement, Java does all the work of following the chains and understanding the structure, then it write out a blob of bytes that will recreate the exact same structure.

The potential for problems arises when you try to serialize a single ticket, as Cushy does during incrementals and as the standard JASIG cache solutions do all the time for all the tickets.

Serialization makes a copy not just of the object you are trying to serialize, but also of all the objects that it points to directly or indirectly.

If you serialize a TGT, then you also make a copy of the Authentication, the Principal, the Netid string. But then you also generate a copy of the Single SignOut service table and, therefore, all the Services.

If you serialize a ST you get all its data, but because it points to a TGT you also get a copy of the TGT and all its stuff.

Service Tickets last for such a short period of time that it really doesn't matter how they serialize. Proxy Granting Tickets, however, also point to a TGT and therefore they serialize and generate a copy of the TGT they point to, and they live for a very long time.

Now it is possible to describe two problems:

When you serialize a collection, Java must internally obtain an iterator and step one by one through the objects in the collection. Unless the collection is inherently thread safe, serialization can break if another thread adds or deletes elements to the collection while serialization is trying to iterate through it. For example, suppose someone is logged into CAS and in a browser presses the "Open All In Tabs" button to create several tabs simultaneously. This can launch two CAS logon requests at the same time for the same user (and therefore the same TGT). One request is handled first, and it creates a Service Ticket, adds an entry in the Service table of the TGT. Now if you use one of the JASIG "cache" solutions all of them are going to try and Serialize the ST to copy it to the other nodes, but to do that they also have to make a copy of the TGT. Meanwhile, there was another tab that generated another request under another thread that is trying to create a second Service Ticket. Following the same process it will at some point try to add a new Service to the Service Table in the TGT, and every so often this will collide with Java Serialization trying to iterate through the objects in the TGT Service Table to turn them into a blob of bytes. The second thread adds a entry in the Service table that can, on occasion, invalidate the iterator and cause the serialization operation to throw an exception.

If you do not user Single SignOn you can, like Yale, simply disable use of the Service table in the TGT. Otherwise you might solve the problem by editing TicketGrantingTicketImpl.java and changing the type of "services" from HashMap which isn't thread safe to Hashtable which is.

The second problem is a "feature" that you should understand. A Ticket Registry stores ticket objects and retrieves them using their ticket Id string as a key. So if you go to the Ticket Registry with the ID of a TGT you get a TGT object. If you create a new Service Ticket using that TGT, then the ST points to that TGT object and both end up stored in the Registry on the node that created them.

Now if you are using one of the JASIG cache mechanisms or during the period of time while Cushy is generating incremental files instead of a full checkpoint, any attempt to serialize the Service Ticket as a single object produces a blob of bytes that when it is turned back into objects on any other node produces an ST pointing to a copy of the original TGT on the other node.

This is not a problem because the CAS business logic only cares about the data in the TGT, it doesn't care whether it has the official TGT object or an exact copy of it. Besides, Service Tickets are validated or time out and they are gone a few seconds after they are created, so this is never going to be a long term problem.

Except for Proxy Granting Tickets that also point to a TGT. They continue to exist for hours, and if the PGT is serialized and replicated to another node it will, on that node, have its own private copy of the TGT that is a point in time snapshot of what was in the TGT at the time the proxy was created.

Cushy solves this problem as soon as the next full checkpoint is written and restored. By serializing the entire collection of tickets on the node that owns them, Cushy create an exact duplicate object structure on all the other nodes. As for other cache solutions, the information in a TGT does change in any way that could affect the processing of a Proxy ticket. Some time down the line when we start to use mulitfactor authentication and we begin to add additional authentications dynamically to an existing logged on user it may be important for a PGT to reference the current TGT with all its current information rather than a copy of the TGT with only the information it had when the user first connected to the middleware application.

Usage Pattern

Users start logging into CAS at the start of the business day. The number of TGTs begins to grow.

Users seldom log out of CAS, so TGTs typically time out instead of being explicitly deleted.

Users abandon a TGT when they close the browser. They then get a new TGT and cookie when they open a new browser window.

Therefore, the number of TGTs can be much larger than the number of real CAS users. It is a count of browser windows and not of people or machines.

At Yale around 3 PM a typical set of statistics is:

Unexpired-TGTs: 13821
Unexpired-STs: 12
Expired TGTs: 30
Expired STs: 11

So you see that a Ticket Registry is overwhelmingly a place to keep TGTs.

After work, and then again after the students go to sleep, the TGTs from earlier in the day start to time out and the Registry Cleaner deletes them.

So generally the pattern is a slow growth of TGTs while people are using the network application, followed by a slow reduction of tickets while they are asleep, with a minimum probably reached each morning before 8 AM.

If you display CAS statistics periodically during the day you will see a regular pattern and a typical maximum number of tickets in use "late in the day".

Translated to Cushy, the cost of the full checkpoint and the size of the checkpoint file grow over time along with the number of active tickets, and then the file shrinks over night. During any period of intense login activity the incremental file may be unusually large. The worst possible configuration of Cushy would be to generate a checkpoint at 8 AM when the number of tickets is at a minimum, and then to have a period of hours between checkpoints when you might get a lot of activity from people waking up and arriving at work that loads up the incremental file with more and more stuff.

This suggests that the Ticket Registry mechanism should be designed to expect that after any configured period of use, most of the tickets that were there at the beginning of the period are probably still there, that the ones that aren't there are much more likely to have expired (which you can determine from the timestamp) than to have been manually deleted (by explicit CAS logoff), and that there is a block of incremental new tickets accumulated during the period.

This is a fairly predictable pattern that could, over time, produce increasingly efficient and highly optimized replication strategies. The current CushyTicketRegistry was written in about 3 days of coding and is good enough to be usable as is. As the following section will show, the problem is not really all that big.

Some Metrics

At Yale there are typically more than 10,000 and fewer than 20,000 Login tickets. Because Service Tickets expire when validated and after a short timeout, there are only several dozen unexpired Service Tickets at any given time.

Java can serialize a collection of 20,000 Login tickets to disk in less than a second (one core of a Sandy Bridge processor).Cushy has to block normal CAS processing just long enough to get a list of references to all the tickets, and the all the rest of the work occurs under a separate thread unrelated to any CAS operation that does not interfere with CAS processing.

Of course, Cushy also has to deserialize tickets from the other nodes. However, remember that if you are currently using any other Ticket Registry the number of tickets reported out in the statistics page is the total number combined across all nodes, while Cushy serializes only the tickets that the current node owns and it deserializes the tickets for the other nodes. So generally you can apply the 20K tickets = 1 second rule of thumb to estimate the overhead of converting to Cushy and the number does seem to scale. Serializing 200,000 tickets takes 9 seconds.

Incrementals are trivial (.1 to .2 seconds).

CushyTicketRegistry (the code)

CushyTicketRegistry is a medium sized Java class that does all the work. It began with the standard JASIG DefaultTicketRegistry code that stores the tickets in memory (in a ConcurrentHashMap). Then on top of that base, it adds code to serialize tickets to disk and to transfer the disk files between nodes using HTTP.

Unlike the JASIG TicketRegistry implementations, CushyTicketRegistry does not create a single big cache of tickets lumped together from all the nodes. Each node is responsible for the tickets it creates. The TicketRegistry on each node is transferred over the network to the other nodes. Therefore, on each node there is an instance of CushyTicketRegistry for the locally created tickets and other instances of the class for tickets owned by the other nodes.

This is a custom solution designed for the specific CAS requirements. It is not a general object caching mechanism. It is really a strategy for the use of standard Java collections, serialization, and network I/O in a relatively small amount of code. Because the code is so small, it was convenient to put everything in a single class source file.

Configuration

In JASIG CAS, the administrator selects one of the several TicketRegistry optional implementations and configures it using a Spring Bean XML file located in WEB-INF/spring-configuration/ticketRegistry.xml. With CushyTicketRegistry this file creates the first "Primary" object instance that manages the Tickets created and owned by the local nodes. That object examines the configuration and creates additional "Secondary" object instances for every other node configured in the cluster.

The Cluster

Cluster configuration requirements became complex enough that they were moved into their own YaleClusterConfiguration class. This Bean is defined in front of the CushyTicketRegistry in the Spring ticketRegistry.xml file.

Why is this complicated? We prefer a single "cas.war" artifact that works everywhere. It has to work on standalone or clustered environments, in a desktop sandbox with or without virtual machines, but also in official DEV (development), TEST, and PROD (production) servers. Changing the WAR file for each environment is undesirable because we do not want to change the artifact between Test and Production. The original idea was to configure things at the container level (JBoss), but Yale Production Services did not want to be responsible for managing all that configuration stuff.

So YaleClusterConfiguration adds Java logic instead of just a static cluster configuration file. During initialization on the target machine it can determine all the IP addresses assigned to the machine and the machine's primary HOSTNAME. This now allows two strategies.

First, you can configure the configurations of all your clusters (sandbox, dev, test, prod, ...). Then at run time the bean determines what machine it is on, and looks for a configuration that includes that machine. If every computer is in at most one cluster, then it will select the right configuration.

If that does not work, then starting with its own HOSTNAME it can create a simple cluster configuration if the name matches a pattern. At Yale, the DEV, TEST, and PROD machines are all part of a two machine cluster where the HOSTNAME contains a "-01" or "-02" suffix. So by finding the current HOSTNAME it can say that if this machine has "-01" in its name, the other machine in the cluster is "-02" and the reverse.

Sounds easy, but as always the actual code implies some rules you need to know.

First, you can define the YaleClusterConfiguration bean with or without a "clusterDefinition" property. If you provide the property, it is a List of Lists of Strings:

<bean id="clusterConfiguration" class="edu.yale.its.tp.cas.util.YaleClusterConfiguration"
       p:md5Suffix="yes" >
      <property name="clusterDefinition">
       <list>
               
           <list>
               <value>http://foo.yu.yale.edu:8080/cas/</value>
               <value>http://bar.yu.yale.edu:8080/cas/</value>
           </list>
               
           <list>
               <value>https://casdev1.yale.edu:8443/cas/</value>
               <value>https://casdev2.yale.edu:8443/cas/</value>
           </list>
       </list>
      </property>
    </bean>

In spring, the <value> tag generates a String, so this is what Java calls a List<List<String>> (List of Listd of Strings). As noted, the top List has two elements. The first element is a list with two strings for the machines foo and bar. The second element is another List with two strings for casdev1 and casdev2.

There is no good way to determine all the DNS names that point to my server. However, it is relatively easy in Java to find all the IP addresses of all the LAN interfaces on the current machine. This list may be longer than you think. Each LAN adapter can have IPv4 and IPv6 addresses, and then there can be multiple real LANs and a bunch of virtual LAN adapters for VMWare or Virtualbox VMs you host or tunnels to VPN connections. Of course, there is always the loopback address.

This is a caution because what Cushy is going to do is to get all the IP addresses for the current machine and then start to lookup every server DNS name in each cluster defined in the list. In this example, it will first look for the IP address of "foo.yu.yale.edu". It will then compare this address with all the addresses on the current machine.

Cushy cannot use a cluster that does not contain the current machine. So it continues its scan until it finds a cluster definition that the current machine is actually in, and uses the first cluster where the addresses match.

Restrictions:

You cannot create clusters that have the same IP address but different ports. Alternately, two Tomcats on the same machine cannot be members of different clusters. Cluster identity is defined by IP address, not port number. If you need to test on a single host, Virtualbox is free so use VMs.

Be careful of any generic address where the same IP address is used on different machines for different purposes. The Loopback address 127.0.0.1 is on every machine. The private network address of 192.168.1.1 may be used on many dummy networks that connect virtual machines to each other and to their host.

In a desktop sandbox or test environment, you may want to define names in the cluster definition using the local hosts file. If you don't then the computer name has to be found in the real DNS server.

Suppose you omit the clusterDefinition property entirely or the current machine is not associated with any IP address of any URL in any defined cluster. The YaleClusterConfiguration will autoconfigure the cluster. The supplied code is based on simple rules that work in the Yale environment. If you need something different, you have to change the source of YaleClusterConfiguration, but if you know any Java it is not hard. The rules for the supplied code are:

A cluster has at most two machines, because CAS is a very simple application that uses very little resources and can be hosted on any multi-core server today. The second machine is just to recover immediately from a node failure (although you could load balance across the two machines if you want).
The two machines have names that end in "-01" and "-02" (example foo-01.yu.yale.edu and foo-02.yu.yale.edu). Note that this is a three character sequence and requires the dash and the leading "0". It won't match foo1 and foo2 (although as shown in the example above you can configure such names explicitly.
The name the cluster wants to use (the one that ends in "-01") is the primary HOSTNAME configured to the OS and not some additional name added to a machine that has some other primary name. This code can enumerate all the possible IP addresses, but generally there is only one HOSTNAME that the operating system (and Java) returns when you ask for it.

Then YaleClusterConfiguration will use Java to find the full hostname of the current machine, it will find the "-01" or "-02" in the name, and it will autogenerate a cluster with one additional machine with the same name swapping "-01" and "-02".

If none of the above applies:

there is no clusterDefinition or none of the URLs in the clusterDefinition match the current machine and
the HOSTNAME has no "-01" or "-02" in it.

Then YaleClusterConfiguration will generate a standalone CAS server with no other machines in the cluster. CushyTicketRegistry will generate checkpoint and incremental files on local disk, and will use these files to reload tickets after a CAS server reboot, but it will not communicate over the network to any other CAS server.

CushyTicketRegistry uses the cluster configuration created by the YaleClusterConfiguration bean to create one Primary CushyTicketRegistry object for the local server and then one Secondary instance of the CushyTicketRegistry class for every other node in the cluster.

Other Parameters

The nodeName, nodeNameToUrl, and suffixToNodeName parameters link back to properties generated as a result of the logic in the YaleClusterConfiguration bean (id="clusterConfiguration").

The cacheDirectory is a work directory on disk to which it has read/write privileges. The default is "/var/cache/cas" which is Unix syntax but can be created as a directory structure on Windows. In this example we use the Java system property for the JBoss /data subdirectory when running CAS on JBoss.

The checkpointInterval is the time in seconds between successive full checkpoints. Between checkpoints, incremental files will be generated.

YaleClusterConfiguration exposes a md5Suffix="yes" parameter which causes it to generate a ticketSuffix that is the MD5 hash of the computer host instead of using the nodename as a suffix. The F5 likes to refer to computers by their MD5 hash and using that as the ticket suffix simplifies the F5 configuration even though it makes the ticket longer.

How Often?

"Quartz" is the standard Java library for timer driven events. There are various ways to use Quartz, including annotations in modern containers, but JASIG CAS uses a Spring Bean interface to Quartz where parameters are specified in XML. All the standard JASIG TicketRegistry configurations have contained a Spring Bean configuration that drives the RegistryCleaner to run and delete expired tickets every so often. CushyTicketRegistry requires a second Quartz timer configured in the same file to call a method that replicates tickets. The interval configured in the Quartz part of the XML sets a base timer that determines the frequency of the incremental updates (typically every 5-15 seconds). A second parameter to the CushyTicketRegistry class sets a much longer period between full checkpoints of all the tickets in the registry (typically every 5-10 minutes).

A full checkpoint contains all the tickets. If the cache contains 20,000 tickets, it takes about a second to checkpoint, generates a 3.2 megabyte file, and then has to be copied across the network to the other nodes. An incremental file contains only the tickets that were added or deleted since the last full checkpoint. It typically takes a tenth of a second an uses very little disk space or network. However, after a number of incrementals it is a good idea to do a fresh checkpoint just to clean things up. You set the parameters to optimize your CAS environment, although either operation has so little overhead that it should not be a big deal.

Based on the usage pattern, at 8:00 AM the ticket registry is mostly empty and full checkpoints take no time. Late in the afternoon the registry reaches its maximum size and the difference between incrementals and full checkpoints is at its greatest.

Although CAS uses the term "incremental", the actual algorithm is a differential between the current cache and the last full checkpoint. So between full checkpoints, the incremental file size increases as it accumulates all the changes. Since this also includes a list of all the Service Ticket IDs that were deleted (just to be absolutely sure things are correct), if you made the period between full checkpoints unusually long it is possible for the incremental file to become larger than the checkpoint and since it is transferred so frequently this would be much, much worse to performance than setting the period for full checkpoints to be a reasonable number.

Nodes notify each other of a full checkpoint. Incrementals occur so frequently that it would be inefficient to send messages around. A node picks up the other incrementals from the other nodes each time it generates its own incremental.

Methods and Fields

In addition to the ConcurrentHashMap named "cache" that CushyTicketRegistry borrowed from the JASIG DefaultTicketRegistry code to index all the tickets by their ID string, CushyTicketRegistry adds two collections:

addedTickets - a reference to the tickets that were added to the registry since the last full ticket backup to disk.
deletedTickets - a collection of ticketids for the tickets that were deleted.

These two collections are maintained by the implementations of the addTicket and deleteTicket methods of the TicketRegistry interface.

The following methods are added to the CushyTicketRegistry class:

checkpoint() - Called from the periodic quartz thread. Serializes all tickets in the Registry to the nodename file in the work directory on disk. Makes a point in time thread safe copy of references to all the current tickets in "cache" and clearsthe added and deleted ticket collections. Builds an ArrayList of the non-expired tickets. Serializes the ArrayList (and therefore all the non-expired tickets) to /var/cache/cas/CASVM1. Generates a Service Ticket ID that will act as a password until the next checkpoint call. Notifies the other nodes, in this example by calling the /cas/cache/notify service of CASVM2 passing the password ticketid.
restore() - Empty the current cache and de-serialize the /var/cache/cas/nodename file to a list of tickets, then add all the unexpired tickets in the list to rebuild the cache. Typically this only happens once on the primary object at CAS startup where the previous checkpoint of the local cache is reloaded from disk to restore this node to the state it was in at last shutdown. However, secondary caches (of CASVM2 in this example) are loaded all the time in response to a /cas/cache/notify call from CASVM2 that it has taken a new checkpoint.
writeIncremental() - Called by the quartz thread between checkpoints. Serializes point in time thread safe copies of the addedTickets and deletedTickets collections to create the nodename-incremental file in the work directory.
readIncremental() - De-serialize two collections from the nodename-incremental file in the work directory. Apply one collection to add tickets to the current cache collection and then apply the second collection to delete tickets. After the update, the cache contains all the non-expired tickets from the other node at the point the incremental file was created.
getRemoteCache - Generate an https: request to read the nodename or nodename-incremental file from another node and store it in the work directory.
notifyNodes() - calls the /cas/cluster/notify restful service on each other node after a call to checkpoint() generates a full backup. Passes the generated dummy ServiceTicketId to the node which acts as a password in any subsequent getRemoteCache() call.

Unlike conventional JASIG Cache mechanisms, the CushyTicketRegistry does not combine tickets from all the nodes. It maintains shadow copies of the individual ticket caches from other nodes. If a node goes down, then the F5 starts routing requests for that node to the other nodes that are still up. The other nodes can recognize that these requests are "foreign" (for tickets issued by another node and therefore in the shadow copy of that node's tickets) and they can handle such requests temporarily until the other node is brought back up.

Flow

During normal CAS processing, the addTicket() and deleteTicket() methods lock the registry for just long enough to add an item to the end of the appropriate incremental collection. This is a fairly trivial use of locking and it cannot deadlock or be blocked by the other code that synchronizes on the same lock.

Quartz maintains a pool of threads independent of the threads used by JBoss or Tomcat to handle HTTP requests. Periodically a timer event is triggered, Quartz assigns a thread from the pool to handle it, the thread calls the timerDriven() method of the primary CushyTicketRegistry object, and for the purpose of this example, let us assume that it is time for a new full checkpoint.

Java provides a complex built in class called ConcurrentHashMap that handles the coordination of requests to the cache of tickets. Using that build in service, the primary CushyTicketRegistry object generates a separate point-in-time snapshot of references to all the Tickets in the cache. This does not copy the tickets, it only creates a list of pointers to the tickets that existed at that time. Subsequent adds or deletes to the real cache do not affect the snapshot. This separate collection can then be serialized to a disk file in a single Java writeObject() call where all the work is done automatically by Java. After that, we just have to close the file.

However, before returning to Quartz, we call the Notify logic that sends an HTTP GET request to the /cluster/notify URL on each other CAS node. This step occurs under the Quartz thread, and it has to wait until each node returns from the GET request.

On the other node, a standard HTTP request arrives which is processed by the container (JBoss or Tomcat) just like any other request. Spring routes the /cluster/notify suffix to the CacheNotifyController class which is added by the CushyTicketRegistry component. CacheNotifyController calls the primary CushyTicketRegistry object on that node which in turn selects the appropriate secondary object corresponding to the node that sent the Notify request.

Now there are two ways to handle the next step. The simplest logic, but not necessarily the best performance, is for the secondary object to issue the /cluster/getCheckpoint request back to pick up a copy of the just generated checkpoint file, and then restore the data from that file back into the cache memory, before returning from the call. That means, however, that the GET request doesn't end until the data has been completely processed, and remember that the Quartz thread on the node that took the checkpoint is waiting for the response from this node before it can call Notify on the next node. If the network is particularly slow or the file is particularly large or there are a lot of CAS nodes, then sequentially processing each node, and waiting for each node to fetch and load the data before going on to the next node, may generate an unreasonable delay.

If this is an issue, there is an option in the CushyTicketRegistry class. Set "useThread" to true and each secondary CushyTicketRegistry object contains its own thread (the NotifyProcessThread). Then the /cluster/notify GET returns immediately, and the rest of the processing (to fetch the file over the network and load the data into memory) runs under the NotifyProcessThread AFTER the Notify processing appears to have completed from the point of view of the other nodes.

Between full checkpoints, the Quartz timer thread generates the incremental file on disk and then fetches incremental files from the other nodes of the cluster. There is no Notify, and incremental files are so small that the time it takes to write them to disk or read them over the network is not enough to optimize. So everything happens under the Quartz timer thread and the thread can be expected to end well before the next timer tick.

Security

The collection of tickets contains sensitive data. With access to the TGT ID values, a remote user could impersonate anyone currently logged in to CAS. So when checkpoint and incremental files are transferred between nodes of the cluster, we need to be sure the data is encrypted and goes only to the intended CAS servers.

There are sophisticated solutions based on Kerberos or GSSAPI. However, they add considerable new complexity to the code. At the same time, we do not want to introduce anything substantially new because then it has to pass a new security review. So CushyTicketRegistry approaches security by using the existing technology CAS already uses, just applied in a new way.

CAS is based on SSL and uses the X.509 Certificate of the CAS server to verify the identity of machines. If that is good enough to identity a CAS server to the client and to the application that uses CAS, then it should be good enough to identity one CAS server to another.

CAS uses the Service Ticket as a one time randomly generated temporary password. It is large enough that you cannot guess it nor can you brute force match it in the short period of time it remains valid before it times out. The ticket is added onto the end of a URL with the "ticket=..." parameter, and the URL and all the other data in the exchange is encrypted with SSL.

Now apply the same design to CushyTicketRegistry.

Each time a node generates a new full checkpoint file it uses the standard Service Ticket ID generation code to generate a new Service Ticket ID. This ticket id serves in place of a password to fetch files from that node until the next full checkpoint. When a node generates a checkpoint it calls the "https://servername/cas/cluster/notify?ticket=..." URL on the other nodes in the cluster passing this generated dummy Service Ticket ID. SSL validates the X.509 Certificate on the other CAS server before it lets this request pass through, so the ticketid is encrypted and can only go to the real named server at the URL configured to CAS when it starts up.

When a node gets a /cluster/notify request from another node, it responds with an "https://servername/cas/cluster/getChekpoint?ticket=..." request to obtain a copy of the newly generated full checkpoint file. Again, SSL encrypts the data and the other node X.509 certificate validates its identity. If the other node sends the data as requested, then the Service Ticket ID sent in the notify is valid and it is stored in the secondary YaleServiceRegistry object associated with that node. Between checkpoints the same ticketId is used as a password to fetch incremental files, but when the next checkpoint is generated there is a new Notify with a new ticketid and the old ticketid is no longer valid. There is not enough time to brute force the ticketid before it expires and you have to start over.

Behavior

Normal Operation

A CAS node starts up.The Spring configuration loads the primary YaleTicketRepository object, and it creates secondary objects for all the other configured nodes. Each object is configured with a node name, and secondary objects are configured with the external node URL.

If there is a checkpoint file and perhaps an incremental file for any node in the work directory then the primary and secondary objects will use these files to restore at least the unexpired tickets from the previous time the node was up. This is called a "warm start" and it makes sense if CAS has not been down for long and when you are restarting the same version of CAS.

However, there may be times when you want CAS to start with an empty ticket registry, or when you are upgrading from one version of CAS to another and the Ticket objects may not be compatible. When this is true, any files in the work directory should be deleted before restarting CAS. This is a "cold start".

While each node has a copy of its own files, all the other nodes in the cluster have replicated copies of the same files. So if a node fails hard and you lose the disk with the work directory, you can recover the files for the failed node from any other running CAS node in the cluster. Unlike the ehcache or memcached systems where the cache is automatically populated over the network when any node comes up, copying files from one CAS node to another is not an automatic feature. You have to do it manually or else automate it with scripts you write based on your own network configuration.

Remember, every CAS node owns its own Registry and every other CAS node accepts whatever a node says about itself. So if you bring up a node with an empty work directory, then it creates a Registry without tickets, and then it will shortly send an empty checkpoint file to all the other nodes where they will replace any old file with the new empty file and empty their secondary Registry objects. So if you want a warm start, you need to make sure the work directory is populated before you start a CAS node or you will lose all copies of its previous tickets.

If you intend a cold start, it is best to shut down all CAS nodes, empty their work directories, and then bring them back up. You can cold start one CAS node at a time, but it may be confusing if some nodes have no tickets while at the same time other nodes are running with their old ticket population.

During normal processing CAS creates and deletes tickets. It is up to the front end (the F5) to route requests to route browser requests to the node to which the user logged in, and to route validation requests to the node that generated the ticket.

Node Failure

Detecting a node failure is the job of the front end. CAS discovers a failure when a CAS node receives a request that should have been routed to another node. CAS needs no logic to probe the cluster to determine what nodes are up or down. If a node is down then all /cluster/verify and /custer/getIncremental requests will time out, but CAS simply waits the appropriate time and then makes the next request until eventually the node comes back up.

During failure, the most common event is that a browser or Proxy that logged on to another node makes a request to a randomly assigned CAS node to generate a new Service Ticket against the existing login.

Had we been using a JASIG TicketRegistry, then all the tickets from all the nodes would have been stored in a great big virtual bucket. Then any node could find the TGT and issue an ST. So the Business Logic layer does not know or care which node issued a Ticket when creating new tickets or validating existing tickets. Furthermore, when using Ehcache, JBoss Cache, or Memcached the tickets replicated to another node using serialization may be chained to their own private copy of the TGT that was sent from the other node by the Java serialization mechanism. So CAS doesn't really look to carefully at the source of the objects it processes.

The big difference with CushyTicketRegistry is that it keeps separate Registry objects for each node, and it likes to treat the secondary Registry as read-only, at least from the business logic layer of this node. When the other node has failed, then when the business logic layer calls the Registry to find a TGT, then that TGT will be found in one of the secondary Registries. That is the only hint we have that there has been a node failure.

We still preserve the rule that new tickets are created in the primary (local) registry and are indentified with a string that ends with this node's name. That part happens automatically in the business logic when it calls the locally configured Service Ticket unique ID generator, and when it calls addTicket() on the primary object.

In node failure mode, however, the new Service Ticket will have a Granting Ticket field that points to the TGT in the secondary object, that is in the Registry holding a copy of the tickets of the failed node.

If you have been paying attention, you will realize this is no big deal. Any serialized Service Ticket transmitted to another node with one of the JASIG Registry solutions will also be stored on the other node with its own private pointer to a copy of the original TGT, and the fact that the Granting Ticket field in the ST object points to an odd ticket that isn't really "in" the cache has never been a problem. The ST will still validate normally.

Of course, there is a chase condition if after a ST is issued for a TGT in a secondary Registry then the previously failed node starts back up and sends a Notify and the secondary Registry gets refreshed with a new bunch of tickets before the ST is validated. Java will recognize that while the old TGT is no longer in the ConcurrentHashMap of the secondary Registry, the ST still has a valid reference to it. A particularly aggressive cycle of Garbage Collection might delete all the other tickets from the snapshot of the old registry object, but it will leave that one TGT around as long as the ST points to it. When the ST is validated and is deleted, then the old copy of the TGT is released and can be destroyed when Java gets around to it. Again, this arrangement when the ST points to a TGT that is no longer in any Registry HashMap is normal behavior for JASIG replication so it must pose no problem.

The most interesting behavior occurs when a TGT in the secondary Registry of a failed node is used to create a Proxy Granting Ticket. Then the Proxy ticket is issued by and belongs to node that issued it, and the proxying application communicates to that node to get Service Tickets.

The thing that really changes is the handling of CAS Logoff. Fortunately, in normal practice nobody ever logs off of CAS. They just close their browser and let the TGT timeout. However, if someone were to call /cas/logoff during a node failure when they were logged in to another node, then the Business Logic layer will delete the CAS Cookie and report a successful logoff, but we cannot guarantee that CAS will do everything perfectly in the way it would have operated without the node failure.

Sorry but if node failure screws up the complete correct processing of Single SignOut, that is simply a problem you will have to accept. Unless a node stays up to control its TGT and correctly fill in the Service collection with all the Services that the user logged into, then a decentralized recovery system like this cannot globally manage the services.

There is another problem that probably doesn't matter but which should be mentioned. If a node tries to handle Single SignOut on its own during a node failure, and then the failed node comes back, the failed node will restore the TGT that the user just logged out of. Many hours later that TGT will time out, and now the original node will try to notify all the Services that a logout has occurred. So services may get two logout messages from CAS for the same login. It is almost impossible to image a service that will be bothered by this behavior.

Node Recovery

After a failure, the node comes back up and restores its registry from the files in the work directory.

At some point the front end notices the node is back and starts routing requests to it based on the node name in the suffix of CAS Cookies. The node picks up where it left off. It does not know and can not learn about any Service Tickets issued on behalf of its logged in users by other nodes during the failure. It does not know about users who logged out of CAS during the failure.

This does not appear to be a problem.