...
CushyTicketRegistry is an extension of the traditional DefaultTicketRegistry module that CAS has used for standalone servers. It is a big extension, but if all you do is replace DefaultTicketRegistry with CushyTicketRegistry in the XML (and you don't add the optional timer-driven call), then when you shut down CAS it writes a file containing a checkpoint copy of all its current tickets, and when you restart CAS it reloads those tickets into memory and picks up where it left off, preserving all the existing user logons. This turns out to be an extremely useful new feature when you have to reboot a CAS server and do not want to disrupt the Single Sign-On experience of your users, but it is only a minor improvement over the default behavior.
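The core of that behavior is just Java serialization of the registry's ticket collection. A minimal sketch of the idea, using hypothetical class, method, and file names (the Ticket import assumes the org.jasig package naming of CAS 3 and 4; the real source adds housekeeping around this):

```java
import java.io.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.jasig.cas.ticket.Ticket;

// Hypothetical sketch of the shutdown/startup checkpoint idea.
public class CheckpointSketch {

    private Map<String, Ticket> tickets = new ConcurrentHashMap<String, Ticket>();

    // At shutdown (or on a timer): write every current ticket to one file.
    public void writeCheckpoint(File file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
            // Copy first so the write sees a stable snapshot of the map.
            out.writeObject(new ConcurrentHashMap<String, Ticket>(tickets));
        } finally {
            out.close();
        }
    }

    // At startup: reload the saved tickets so existing logons survive the restart.
    @SuppressWarnings("unchecked")
    public void readCheckpoint(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) {
            return; // first boot, nothing to restore
        }
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
        try {
            tickets = (Map<String, Ticket>) in.readObject();
        } finally {
            in.close();
        }
    }
}
```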
The next step up is to turn on the timer-driven activity and configure a cluster of CAS servers. CushyTicketRegistry then periodically generates a disk checkpoint of its current set of tickets and, more frequently, a list of changes since the last full checkpoint, and it uses HTTP GET requests to exchange these files between servers. The files sit on disk until a CAS server crashes; when that happens, the other servers have a fairly recent copy of the failed node's data and can handle requests on its behalf until it comes back up.
Cushy does not use a database and it requires no network configuration beyond the Web server CAS is already running on. It is "inefficient" in the sense that it periodically copies all the tickets instead of just the changed tickets, but that ends up using less than 1% of one core on your server, and on modern hardware that is a trivial cost.
Cushy depends on a modern programmable network front end distributing requests to the various CAS servers in the cluster. In exchange, all the code for this type of clustering is contained in one medium-sized Java source file that is fairly easy to understand. There is no giant magic black box, although, as described below, there are still three problems in CAS itself that remain exposed. And the checkpoint at shutdown is only the start.
It turns out that writing all the tickets to disk takes only about a second, and the resulting file is, by modern standards, relatively small. With modern servers you could certainly afford to do this more often. So the next obvious step is to configure the Spring framework under which CAS runs to call CushyTicketRegistry periodically, generating a checkpoint file every 5 minutes or so. That way, if the CAS server crashes instead of shutting down normally, you can reboot and recover all the user logons except for the ones that occurred since the last checkpoint. Again, this is a useful feature for a standalone server.
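In CAS the timer itself is wired up in the Spring XML, but its effect is simply a periodic call into the registry. A rough plain-Java equivalent of that wiring, reusing the hypothetical CheckpointSketch above:

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointTimerSketch {

    // Every 5 minutes, write a full checkpoint of all current tickets to disk.
    public static void start(final CheckpointSketch registry, final File checkpointFile) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    registry.writeCheckpoint(checkpointFile);
                } catch (Exception e) {
                    // A failed checkpoint should never take CAS down; log it and try again next time.
                    e.printStackTrace();
                }
            }
        }, 5, 5, TimeUnit.MINUTES);
    }
}
```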
Users tend to login to CAS and then remain active for hours at a time. If you compare the contents of two consecutive checkpoints, you will find that the difference between them consists almost entirely of the tickets for the users who logged in during the period. So the next step is to keep track of these new logins and periodically write an "incremental" or "differential" file that contains the changes since the last checkpoint. This is a much smaller file and it typically takes only a few milliseconds to write. If you write this file every 10 seconds or so, and apply these differences to the last checkpoint file when you have to restore data, then you lose only the CAS activity from the last few seconds before the crash.
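A sketch of the bookkeeping this implies, again with hypothetical names: record the tickets added and the ticket ids deleted since the last full checkpoint, and periodically serialize just those two small collections.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.jasig.cas.ticket.Ticket;

// Hypothetical sketch of the "incremental" file: only changes since the last full checkpoint.
public class IncrementalSketch {

    private final ConcurrentLinkedQueue<Ticket> addedTickets = new ConcurrentLinkedQueue<Ticket>();
    private final ConcurrentLinkedQueue<String> deletedTicketIds = new ConcurrentLinkedQueue<String>();

    // Called by the registry whenever a ticket is added or deleted.
    public void recordAdd(Ticket t) { addedTickets.add(t); }
    public void recordDelete(String id) { deletedTicketIds.add(id); }

    // Called every 10 seconds or so; this file is tiny compared to the full checkpoint.
    public void writeIncremental(File file) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
        try {
            out.writeObject(new ArrayList<Ticket>(addedTickets));
            out.writeObject(new ArrayList<String>(deletedTicketIds));
        } finally {
            out.close();
        }
    }

    // A new full checkpoint makes the accumulated changes obsolete.
    public void reset() {
        addedTickets.clear();
        deletedTicketIds.clear();
    }
}
```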
The checkpoint and incremental files are just plain ordinary files on disk, so they can be stored in a SAN or on a file server, with RAID and all the other techniques that protect files. They can be copied to another machine, and because they are ordinary files you can copy them by hand or with any timer-driven scripting technology; you don't have to do it from CAS or even from the CAS machine. So if the CAS server crashes so hard that it cannot be restarted immediately on the old hardware, you can bring up a backup server elsewhere in the network, give it a copy of the two files most recently written by the dead server, and the new server comes up with the saved state from just before the crash. With modern Virtual Machine infrastructure it may be possible to make this type of failover to a new machine happen very quickly, and that may provide enough availability on its own.
So there may be no need for a cluster. Certainly the CAS function is small enough and simple enough that it doesn't need to be load balanced, and it is certainly possible today to configure a server with enough processing power to handle the largest CAS load on one machine.
However, if you insist on building a CAS cluster, then consider the specific problems that any CAS cluster must solve:
- Because CAS as currently written uses Spring Web Flow to store data between the time the browser's initial GET returns the logon form and the time the user submits the userid and password, either you must arrange for the form data to be posted back to the same CAS server that wrote the form, or you need your Application Servers clustered. The latter is something like JBoss clustering and has nothing to do with CAS itself.
- Once a user logs on, then either the browser has to come back to the CAS server that processed the logon or the logon ticket has to be replicated to every server.
- Once the user visits a Portal or other middleware that uses the CAS Proxy feature, requests from that middleware have to go to the same CAS server or the Proxy ticket has to be replicated to every server.
- Once the user gets a Service Ticket for an application, then the subsequent request from that application to validate the ticketid and obtain the userid must go to the CAS server that issued it, or else the ticket has to be replicated to every server.
Since CAS 3 first came out, the assumption has been that tickets have to be replicated to every server. That may have been necessary with the network Front End options available at the time, but today the Front End devices (such as the BIG-IP F5) are programmable, and it is fairly easy to solve the clustering problem by routing each request to the CAS server that issued the ticket instead of depending on the much more complicated "tickets must be replicated to every server" solution. This is particularly important for the fourth case above, since only milliseconds separate the generation and validation of Service Tickets. Even if you use conventional CAS 3 clustering technology, it makes sense to program the Front End this way, if only as a backup.
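On the front end the rule is written in the device's own scripting language (an F5 iRule, for example), but the decision itself is trivial. Assuming the usual CAS convention that each node appends its own configured suffix to the ticket ids it generates, the logic looks roughly like this sketch:

```java
// Sketch of the routing decision a programmable front end must make.
// Assumes each CAS node appends a distinct suffix to the ticket ids it issues,
// e.g. "ST-42-aBcDeFg-casvm01" was issued by the node whose suffix is "casvm01".
public class FrontEndRoutingSketch {

    public static String chooseNode(String ticketParameter, String anyNode) {
        if (ticketParameter == null) {
            return anyNode; // no ticket in the request (e.g. initial logon GET): any node will do
        }
        int lastDash = ticketParameter.lastIndexOf('-');
        if (lastDash < 0) {
            return anyNode;
        }
        // Validation and proxy requests must go back to the node that issued the ticket.
        return ticketParameter.substring(lastDash + 1);
    }

    public static void main(String[] args) {
        System.out.println(chooseNode("ST-42-aBcDeFg-casvm01", "casvm02")); // prints casvm01
    }
}
```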
Routing at the Front End squeezes the value of traditional CAS "ticket replication" clustering from two sides. Programming a modern network Front End to send requests to the correct server means that you can distribute the CAS workload across several servers without replicating tickets, as long as the network and servers operate normally. The CushyTicketRegistry checkpoint and incremental files provide new ways to recover from the failure of one or more servers. A little more function can close the remaining gap.
So instead of running CushyTicketRegistry in standalone server mode, you can configure it in clustered mode. When each server boots up it creates a shadow TicketRegistry for every other server in the cluster. If you decide to put the checkpoint and incremental files in a directory in some highly available shared disk technology, then all you have to do is to point all the CAS servers to the same shared directory. If any server crashes, all the other servers can use its last checkpoint and incremental file to load their shadow registries with the tickets that the failed server last saved, and then while it is down they can process requests on behalf of that server until it comes back up or is replaced by a backup machine.
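A sketch of that arrangement, reusing the hypothetical CheckpointSketch above and treating the shared directory path and file naming as assumptions:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each node keeps a shadow registry for every peer,
// loaded on demand from the peer's files in a shared directory.
public class ShadowRegistrySketch {

    private final File sharedDir = new File("/mnt/cas-shared"); // assumption: highly available shared disk
    private final Map<String, CheckpointSketch> shadows = new HashMap<String, CheckpointSketch>();

    // Called at startup for every *other* node named in the cluster configuration.
    public void addPeer(String nodeName) {
        shadows.put(nodeName, new CheckpointSketch());
    }

    // Called when the front end starts sending us a failed peer's requests: load the
    // peer's last checkpoint (the real code would also apply its last incremental file).
    public void restorePeer(String nodeName) throws Exception {
        File peerCheckpoint = new File(sharedDir, nodeName + "-checkpoint.ser");
        shadows.get(nodeName).readCheckpoint(peerCheckpoint);
    }
}
```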
However, if you don't have highly available shared disk, or you don't trust it, there is a very simple low-tech alternative. CAS runs on Web Servers, and Web Servers are very good at sending the current copy of small files over the network to clients. The checkpoint file is small, the incremental file is smaller, and everyone understands how an HTTP GET works. So in its normal mode CushyTicketRegistry has each CAS server issue an HTTP GET for the latest copy of the checkpoint and incremental file of every other server in the cluster. Since one CAS server can easily handle the entire load, there is never any need for more than two servers in a cluster, but you can have more if you want.
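Fetching a peer's file is nothing more exotic than the following sketch; the URL and file paths shown are assumptions, not the actual Cushy endpoint:

```java
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: pull a peer's latest checkpoint (or incremental) file with a plain HTTP GET.
public class PeerFetchSketch {

    public static void fetchToFile(String peerUrl, File destination) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(peerUrl).openConnection();
        conn.setRequestMethod("GET");
        InputStream in = conn.getInputStream();
        OutputStream out = new FileOutputStream(destination);
        try {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } finally {
            out.close();
            in.close();
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // URL and local path are illustrative only.
        fetchToFile("https://casvm01.example.edu/cas/cluster/casvm01-checkpoint.ser",
                new File("/var/cas/casvm01-checkpoint.ser"));
    }
}
```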
Cushy is inefficient. Traditional CAS cluster solutions write individual changed tickets from node to node, or from a node to a database server. On the other hand, the cost of Cushy's inefficiency is capped (about one second of elapsed time to write a full checkpoint of all tickets every 5 minutes or so), and with modern hardware that cost is trivial. In exchange for a trivial amount of extra processing, you get a CAS that has no dependencies on external libraries, services, databases, or complex network configuration. You have CAS, some files, and either a shared disk or some HTTP GET requests. If you want, you can read one Java source file and know exactly what it is going to do at any time, instead of depending on a complex black box that is difficult to configure.
Now the bad news. Current CAS has some bugs. It was not written "properly" to work with the various ticket replication mechanisms. It has worked well enough in the past, but CAS 4 introduces new features that may not behave as expected. It is not possible to fix everything in the TicketRegistry; a few changes may need to be made in the CAS Ticket classes. So Cushy does not fix these bugs itself, but it does eliminate the false reliance on "the magic black box of off-the-shelf software" that people imagined would do more than it could reasonably be expected to do.
1) Any system that seeks to replicate tickets has a concurrency problem if there are multiple threads (like the request threads maintained by any Web Server) that can change the content of an object while another thread has triggered replication of that object. CAS has some collections in its TicketGrantingTicket object that can be changed by one Web request while another request is serializing the ticket for replication to another system. Very infrequently, this results in a ConcurrentModificationException thrown somewhere in the middle of the black-box ticket replication library.
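A complete fix requires synchronization inside the Ticket classes, but one possible mitigation at the point where the registry serializes tickets (purely illustrative, not something the CAS code does for you) is simply to retry when the exception is thrown:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ConcurrentModificationException;

// Hypothetical mitigation: retry a few times if another request modifies a
// ticket's internal collections while we are in the middle of writing it out.
public class SerializeWithRetrySketch {

    public static byte[] serialize(Object tickets) throws IOException {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                ObjectOutputStream out = new ObjectOutputStream(bytes);
                out.writeObject(tickets);
                out.close();
                return bytes.toByteArray();
            } catch (ConcurrentModificationException e) {
                // A web request changed a ticket mid-write; the next attempt will usually succeed.
            }
        }
        throw new IOException("Gave up after repeated concurrent modifications");
    }
}
```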
2) Object replication systems work best on standalone objects. CAS, however, chains Proxy and Service Tickets to the Ticket Granting Ticket: if CAS had been designed for ticket replication, the Service and Proxy Tickets would hold the ID string of the logon ticket, but instead they hold direct references to the TicketGrantingTicket object. Replication mechanisms that use object serialization therefore generate a copy of the TGT whenever they replicate a Service Ticket, and the receiving CAS node ends up with a Service Ticket connected to its own private snapshot copy of the TGT. This has always happened, but in the past it didn't matter because the copies were all identical. CAS 4 allows meaningful changes to be made to the Ticket Granting Ticket after logon, and then the copies are no longer identical, so the CAS node that created the ticket may behave differently from a node that only has a copy of it. Whether this is acceptable depends on how you use multi-factor authentication and Proxy tickets.
3) If you try to fix problem 2 in the TicketRegistry, where you know you have received a copy of a ticket, you run into the restriction that the Ticket classes are all locked down and do not expose a method to correct an invalid or inappropriate TGT pointer; the broken references created by serialization and deserialization cannot be fixed from the outside. Cushy cannot solve problems that require changes to the Ticket classes, but since it is a small amount of source you can quickly find the writeObject and readObject calls and develop a strategy to fix any problems through coordinated changes to the Tickets and the TicketRegistry. For now it is enough that no type of network problem or failure of the cluster or replication process can crash one CAS node, let alone crash all the CAS servers simultaneously as can happen with the previous alternatives.
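If the Ticket classes were opened up, one coordinated strategy (purely illustrative; neither CAS nor Cushy currently does this) would be to serialize the ID string of the granting ticket instead of the object, and let the receiving TicketRegistry re-attach its own shared copy of the TGT after deserialization:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative only: a ServiceTicket-like class that serializes the id of its
// granting ticket rather than a reference to the ticket object, so that no
// private copy of the TGT is dragged along during replication.
public class ServiceTicketSketch implements Serializable {

    private transient Object grantingTicket;  // the live TGT reference, never serialized
    private String grantingTicketId;          // what actually travels between nodes

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();             // writes grantingTicketId, skips the transient field
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // grantingTicket is left null here; the receiving TicketRegistry looks up
        // grantingTicketId in its own collection and re-attaches the shared TGT.
    }
}
```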
If you are willing to fix the Ticket classes to address these problems, then Cushy exposes 100% of the replication code, so there is a place to apply the fixes. With any of the black-box libraries everything happens under the covers and there is no place to add the fix (although you could rewrite the TicketRegistries that use them to double-check every getTicket call, detect any broken links, and fix them before passing the ticket back to the caller).
Executive Summary
This is a quick introduction for those in a hurry.
...