CAS is a Single Sign-On solution. Internally, the function of CAS is to create, update, and delete a set of objects it calls "Tickets" (a word borrowed from Kerberos). A Logon Ticket object is created to hold the Netid when a user logs on to CAS. A partially random string is generated to be the logon ticket id; it is sent back to the browser as a Cookie and is also used as the key to locate the logon ticket object in a table. Similarly, CAS creates Service Tickets to identify a logged-on user to an application that uses CAS authentication.
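As a rough sketch of the idea (the class and method names here are illustrative, not the actual CAS code), a ticket registry is at heart a table mapping partially random ticket-id strings to ticket objects:

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch, not the real CAS classes: tickets live in an in-memory
// table keyed by a partially random ticket-id string.
class SimpleTicketTable {
    private final Map<String, Object> tickets = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    // Create a logon ticket holding a Netid; return the generated ticket id,
    // which is both the browser Cookie value and the table lookup key.
    String addLogonTicket(String netid) {
        String id = "TGT-" + new BigInteger(64, random).toString(16);
        tickets.put(id, netid);
        return id;
    }

    Object lookup(String ticketId) {
        return tickets.get(ticketId);
    }
}
```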

...

Four years ago Yale implemented a "High Availability" CAS cluster using JBoss Cache to replicate tickets. After that, the only CAS crashes were caused by failures of the ticket replication mechanism. We considered replacing JBoss Cache, but there is a more fundamental problem here. It should be a structural feature of the system that problems in the ticket replication mechanism cannot crash CAS. However, given the design of all the existing replication technologies, CAS cannot function properly if they fail. So replacing one magic black box of code with another, hoping the second one works better, misses the point: even if the replacement is more reliable, that is less desirable than fixing the original design problem.

CAS depends on replication because it makes no assumptions about the network Front End device that distributes Web requests among servers. That made sense 10 years ago, but today these devices are much smarter. At Yale, and I suspect at many institutions, the Front End is a BIG-IP F5 device. It can be programmed with iRules, and it is fairly simple to create iRules that understand the basic CAS protocol. If the CAS servers are properly configured and the F5 is programmed, then requests from the browser for a new Service Ticket, or requests from an application to validate a Service Ticket id, can be routed to the CAS server that created the ticket rather than to some random server in the cluster. After that, ticket replication becomes a much simpler and much more reliable process.
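CAS ticket ids can be configured to carry a suffix identifying the node that created them, and that suffix is what a front-end rule keys on. A sketch of the parsing logic the Front End applies (the trailing "-nodename" id format shown is an assumption about how the servers are configured, not a guarantee):

```java
// Sketch: extract the node suffix from a ticket id such as
// "ST-42-a1b2c3d4-casserver1" so a request carrying that id can be
// routed back to the server that created the ticket.
// The "-nodename" suffix convention is a configuration assumption.
class TicketRouter {
    static String nodeFor(String ticketId) {
        int dash = ticketId.lastIndexOf('-');
        return (dash < 0) ? null : ticketId.substring(dash + 1);
    }
}
```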

General object replication systems are necessary for shopping cart applications that handle thousands of concurrent users spread across a number of machines. E-Commerce applications don't have to worry about running when the network is generally sick or when there is no database in which to record transactions, but CAS is a critical infrastructure component that has to be up whenever it is at all possible for it to be running.

CushyTicketRegistry cannot ever crash CAS. No matter what goes wrong with the network, it keeps periodically retrying until it can connect, but in the meanwhile CAS runs normally. This is possible because it is designed for the network infrastructure of today, not the one commonly deployed a decade ago. It solves just the CAS problem, so it is a single medium-sized Java source file that someone can read and understand, instead of a complex black box of code designed to solve a much larger general problem.


The CAS component that holds and replicates ticket objects is called the TicketRegistry. CushyTicketRegistry ("Cushy") is a new option you can configure into a CAS server. Cushy does useful things for a single standalone server, but it can also be configured to support a cluster of servers behind a modern programmable Front End device. It is explicitly not a general-purpose object replication system; it handles CAS tickets. Because it is an easy-to-understand single Java source file with no external dependencies, it can be made incrementally smarter about ticket objects and how to optimally manage them when parts of the network fail and when full service is restored.

Cushy completely separates ticket backup from mainline CAS function. If there is a problem, it will periodically retry replication until the problem is fixed, but that activity is completely separate from the rest of CAS processing.

"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.The TicketRegistry is the component of that stores the objects CAS uses to remember what it has done. There are at least 5 different versions of TicketRegistry that you can choose (Default, JPA, JBoss, Ehcache, Memcached) and Cushy simply adds one additional choice. CAS uses the Spring Framework to essentially create logical sockets for components that are plugged in at application startup time driven by XML configuration files. The ticketRegistry.xml file configures whichever registry option you choose. of files between servers, written by Yale". This summarizes what it is and how it works.

The Standalone Server

For a simple single standalone CAS server, the standard choice is the DefaultTicketRegistry class which keeps the tickets in an in memory Java table keyed by the ticket id string.


Suppose you simply change the class name from DefaultTicketRegistry to CushyTicketRegistry (and add a few required parameters described later). Cushy was based on the DefaultTicketRegistry code, so while CAS runs everything works exactly as before. The change occurs when you restart CAS for any reason. Since DefaultTicketRegistry only has an in-memory table, all the ticket objects are lost when the application restarts and users have to log in again. Cushy detects the shutdown and saves all the ticket objects to a file on disk, using a single Java writeObject statement on the entire collection. Unless that file is deleted while CAS is down, Cushy reloads all the tickets from it into memory when CAS restarts, restoring the CAS state from before the shutdown. Users do not have to log in again, and no user even notices that CAS restarted unless they tried to access CAS while it was down.
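A sketch of that shutdown checkpoint, assuming, as described above, that the whole collection is written with one writeObject call; the class and method names here are illustrative, not Cushy's actual API:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

// Illustrative sketch: save the entire ticket table with a single
// writeObject call, and restore it later with a single readObject call.
class TicketCheckpoint {
    static void save(HashMap<String, ? extends Serializable> tickets, File file)
            throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(tickets); // one call serializes the whole collection
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<String, Serializable> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, Serializable>) in.readObject();
        }
    }
}
```

Because the tickets form one serializable object graph, the cost is one sequential write of a few megabytes rather than thousands of per-ticket operations.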

The number of tickets CAS holds grows during the day and shrinks overnight; the largest number occurs late in the day. At Yale there are fewer than 20,000 ticket objects in CAS memory, and Cushy can write all those tickets to disk in less than a second, generating a file smaller than 3 megabytes. This is such a small amount of overhead that Cushy can afford to be proactive.

So to take the next logical step, start with the previous ticketRegistry.xml configuration and duplicate the XML elements that currently run the RegistryCleaner every few minutes, pointing the new copies at the "timerDriven" method of the ticketRegistry (Cushy) bean. Now, on a regular basis (every few minutes), Cushy will generate a disk copy of all the tickets "just in case". That way, if CAS crashes instead of shutting down normally, it can restore a fairly recent copy of all the tickets when it comes back up, satisfying the 99.9% of users who did not log in during the last few minutes before the crash.

At this point, the next step should be obvious: can we turn "last few minutes" into "last few seconds"? Creating a complete backup of the entire set of tickets is not terribly expensive, but it is not something you want to do every few seconds. So Cushy can be configured to write "incremental" files between full checkpoint backups. Each incremental file contains all the changes accumulated since the last full checkpoint, so after a crash you do not have a series of files to process in order: Cushy loads the last full checkpoint and then applies the single most recent incremental on top of it, restoring the state from a few seconds before the crash.

The full checkpoint takes a few seconds to build; an incremental takes a few milliseconds. So you might run a full checkpoint every (say) 5 minutes and an incremental every (say) 10 seconds.

The checkpoint and incremental files are ordinary sequential binary files on disk. Cushy writes each new file and then swaps it for the old one, so other programs authorized to access the directory can freely open or copy the files while CAS is running. This matters because occasionally the problem that caused a crash also prevents bringing CAS back up on the same host computer. The files can be stored on local disk on the CAS machine, on a file server, or on a SAN, NAS, or some other highly available disk technology. The farther away they are, the safer they are for disaster recovery, but the longer the 3 megabytes take to write. So rather than slowing CAS down, you can let CAS write to local disk, then add your own shell script, or a small program written in Perl, Python, or any other language, that wakes up a second or so after CAS writes the file and copies it over the network to a more secure remote location, perhaps in another data center or in the cloud. Cushy doesn't do this itself, but since these are normal files you can copy them with SFTP or any other file utility.
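The restore rule, apply the last full checkpoint and then overlay the single most recent incremental, can be sketched as follows. The incremental's structure here is an assumption for illustration (a map of added or updated tickets plus a list of deleted ids), not Cushy's actual file format:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of crash recovery: the incremental holds everything that changed
// since the last full checkpoint, so a single overlay is enough.
class TicketRestore {
    static Map<String, String> restore(Map<String, String> checkpoint,
                                       Map<String, String> addedOrUpdated,
                                       List<String> deletedIds) {
        Map<String, String> tickets = new HashMap<>(checkpoint);
        tickets.putAll(addedOrUpdated);      // tickets created or changed since checkpoint
        deletedIds.forEach(tickets::remove); // tickets expired or deleted since checkpoint
        return tickets;
    }
}
```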

Rethink the Cluster Design

...