...
With this single modification in place it was possible to measure the cost of the operation. It turns out that all the tickets used at Yale could be written to disk in less than a second, and the file was around 3MB, which is pretty small by modern standards. That is such a small cost that you don't have to wait until shutdown. So the next step was to duplicate the existing configuration of the RegistryCleaner in the standard CAS Spring XML file to create a second periodic timer-driven operation. Every so often (every 2-5 minutes, for example) Cushy does the same writeObject statement that it does at normal shutdown. Now, if the CAS machine crashes without a shutdown, it can come back up with almost the same state it had at the crash (except for changes between the last timer-driven backup and the crash).
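In the real configuration the timer comes from duplicating the RegistryCleaner trigger definition in the Spring XML; the sketch below uses a plain ScheduledExecutorService only to show the shape of the operation, and the class name, file path, and interval are made up for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: dump the whole ticket collection to disk with a single
// writeObject call, on a timer, exactly as the registry would do at shutdown.
public class CheckpointSketch {

    // Stand-in for the registry's ticket cache (the real registry holds Ticket objects).
    private final Map<String, Serializable> tickets = new ConcurrentHashMap<>();

    // The full "checkpoint": one writeObject serializes every ticket.
    public void checkpoint(String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(new HashMap<>(tickets));
        }
    }

    public static void main(String[] args) {
        CheckpointSketch registry = new CheckpointSketch();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Every 3 minutes, do the same thing the registry does at normal shutdown.
        timer.scheduleAtFixedRate(() -> {
            try {
                registry.checkpoint("/var/cache/cas/checkpoint.ser");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, 3, 3, TimeUnit.MINUTES);
    }
}
```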
Recovering everything up to the last checkpoint turns out to be well over 99% of what you want. Users log in to CAS all the time, so in any three-minute period there are 10,000 unchanged logons from earlier in the day and perhaps a hundred new ones. So the next step is to pick up the remaining 1% without any measurable increase in cost. More frequently (say every 10 seconds) Cushy can write an "incremental" or "differential" file that contains the changes since the last checkpoint (mostly the new logins, but there is a chance of a logoff). This is a much smaller file and it typically takes only a few milliseconds to write. Then if CAS crashes it uses one readObject to read the last full checkpoint, and a second readObject to read the last incremental and add back the tickets from the last few seconds before the crash.
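A minimal sketch of the incremental cycle and of crash recovery, continuing the made-up names from the previous sketch (the real file layout and class names differ):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: between checkpoints only the changes are written, and
// recovery is one readObject for the checkpoint plus one pass over the incremental.
public class IncrementalSketch {

    // Tickets added since the last checkpoint, keyed by ticket id, and the ids
    // of tickets deleted since then (a logoff, for example).
    private final Map<String, Serializable> addedSinceCheckpoint = new HashMap<>();
    private final Set<String> deletedSinceCheckpoint = new HashSet<>();

    // Every ~10 seconds: a much smaller file than the full checkpoint.
    public void writeIncremental(String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(new HashMap<>(addedSinceCheckpoint));
            out.writeObject(new HashSet<>(deletedSinceCheckpoint));
        }
    }

    // After a crash: restore the last checkpoint, then replay the last incremental.
    @SuppressWarnings("unchecked")
    public static Map<String, Serializable> recover(String checkpointPath, String incrementalPath)
            throws IOException, ClassNotFoundException {
        Map<String, Serializable> tickets;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(checkpointPath))) {
            tickets = (Map<String, Serializable>) in.readObject();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(incrementalPath))) {
            tickets.putAll((Map<String, Serializable>) in.readObject()); // logins from the last few seconds
            ((Set<String>) in.readObject()).forEach(tickets::remove);    // tickets deleted since the checkpoint
        }
        return tickets;
    }
}
```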
The checkpoint and incremental files are plain old ordinary disk files. They can be stored on local disk on the CAS machine, or on a file server, a SAN, a NAS, or some other highly available disk technology. The farther away they are, the safer they are in terms of disaster recovery, but the elapsed time to write the 3 megabytes may be a few milliseconds longer. Once on disk you can back them up, copy them to another machine, deposit them in the cloud, or do anything else you might do with a shell script or something written in Python, Perl, or PowerShell. These two files are the state of the CAS server, and wherever you put them you can bring up a backup version of the server if the datacenter is knocked offline; all you need is a machine ready to start up CAS and read in a copy of the files. So rather than slowing CAS down by writing directly to a remote location, let it write the files to local disk, and let a shell script or other program copy them to a safer remote location.
So there may be no need for a cluster. The CAS function is small enough and simple enough that it doesn't need to be load balanced, and it is certainly possible to configure a server with enough processing power to handle the largest CAS load on a single machine.
...
So now Cushy with its checkpoint and incremental files makes it possible to recover from a CAS crash without a cluster, and the modern Front End makes it possible to create a cluster without ticket replication. At this point one can configure the servers so that traditional ticket replication is unnecessary (although a little extra code may still provide useful options). Although a cluster is not strictly necessary, consider building one anyway.
The CushyClusterConfiguration class makes it simple to configure more than one CAS server in a cluster. It makes sure that every server has a unique name, that all members of the cluster know the names and network locations of the other members, and that some version of these names is appended to every ticket id so the Front End can route requests properly. It then feeds this cluster information to the CushyTicketRegistry object.
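For illustration only, this is roughly how the owning node can be recovered from the suffix on a ticket id; the node names and the exact id layout are assumptions, not Cushy's actual convention:

```java
// Hypothetical sketch: decide which cluster member owns a ticket by looking at
// the suffix appended to the ticket id. The node names and the id layout are
// illustrative; Cushy derives the real values from CushyClusterConfiguration.
public class TicketRouting {

    private static final String[] NODES = { "casvm01", "casvm02", "casvm03" };

    // CAS ticket ids generally look like "TGT-<counter>-<random>-<suffix>";
    // here we treat everything after the last '-' as the node suffix.
    public static String ownerOf(String ticketId) {
        String suffix = ticketId.substring(ticketId.lastIndexOf('-') + 1);
        for (String node : NODES) {
            if (node.equals(suffix)) {
                return node;
            }
        }
        return null; // unknown suffix: the ticket was not created by this cluster
    }

    public static void main(String[] args) {
        System.out.println(ownerOf("TGT-27-AbCdEf123456-casvm02")); // prints casvm02
    }
}
```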
With cluster data, the TicketRegistry comes up as before, but at boot time it also creates a secondary registry object for every other node in the cluster. These objects allow it to read in and use the tickets belonging to another member of the cluster should that server fail. With the simplest option (SharedDisk), the various CAS servers write their checkpoint and incremental files to a common directory on some type of shared disk, and the secondary objects simply sit idle until one of the other CAS servers fails and the Front End starts routing requests belonging to the failed server to other members of the cluster. During normal processing the Front End routes each request to the server that created the ticket, but if that server crashes it sends the request to one of the other servers. When a server receives a request for a ticket that belongs to another server, it restores the most recent checkpoint and incremental files left behind by the failed server into the secondary object associated with that cluster member, and then processes the request on the failed server's behalf. The trick here is that any new tickets (Proxy or Service) created during this processing become owned by the backup server that handled the request, not the failed server that handled the original logon. Subsequent requests for those tickets come back to the backup server, even after the original logon server comes back up. The details will be explained below.
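A sketch of that SharedDisk failover path, reusing the hypothetical helpers from the earlier sketches; the real CushyTicketRegistry API is different, and this only shows the lazy restore-on-first-request idea:

```java
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of SharedDisk failover: keep one empty secondary registry
// per other cluster member, and fill it from that member's checkpoint and
// incremental files only when a request for one of its tickets actually arrives.
public class FailoverSketch {

    private final String myNodeName;
    private final String sharedDir;
    // One secondary ticket map per other cluster member, loaded lazily.
    private final Map<String, Map<String, Serializable>> secondaries = new ConcurrentHashMap<>();
    private final Map<String, Serializable> myTickets = new ConcurrentHashMap<>();

    public FailoverSketch(String myNodeName, String sharedDir) {
        this.myNodeName = myNodeName;
        this.sharedDir = sharedDir;
    }

    public Serializable getTicket(String ticketId) {
        String owner = TicketRouting.ownerOf(ticketId);   // suffix routing, as sketched above
        if (owner == null || owner.equals(myNodeName)) {
            return myTickets.get(ticketId);               // the normal case
        }
        // The Front End only sends us another node's ticket when that node is down:
        // restore its tickets from the files it left behind on the shared directory.
        Map<String, Serializable> restored = secondaries.computeIfAbsent(owner, node -> {
            try {
                return IncrementalSketch.recover(
                        sharedDir + "/" + node + "-checkpoint.ser",
                        sharedDir + "/" + node + "-incremental.ser");
            } catch (Exception e) {
                throw new RuntimeException("cannot restore tickets for " + node, e);
            }
        });
        return restored.get(ticketId);
    }
}
```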
If you don't want to use shared disk, there are two alternatives. Cushy provides an HTTPS solution. After all, CAS runs on web servers, and web servers are very good at sending the current copy of small files over the network to clients. The checkpoint file is small, the incremental file is smaller, and everyone understands how an HTTP GET works. So unless you configure SharedDisk, Cushy running in cluster mode periodically uses HTTP GET to retrieve a copy of the most recent full checkpoint or incremental file from every other node in the cluster and puts the copy on the local hard disk of the machine. No complex multicast addresses, timeouts, port numbers, or datagrams. Cushy simply reuses the HTTPS that every CAS server already has in place.
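What such a fetch amounts to is ordinary HTTP client code. In the sketch below the URL path, file names, and local directory are assumptions rather than the actual Cushy endpoints, and the other node's certificate is assumed to be trusted by this JVM:

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of the HTTP replication option: periodically GET the most
// recent checkpoint or incremental file from another cluster member and drop the
// copy on local disk.
public class HttpFetchSketch {

    public static void fetch(String otherNodeUrl, String fileName, String localDir) throws Exception {
        URL source = new URL(otherNodeUrl + "/cluster/" + fileName); // served by the other CAS web server
        Path target = Paths.get(localDir, fileName);
        try (InputStream in = source.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws Exception {
        // Pull casvm02's latest incremental onto this node's disk (names are made up).
        fetch("https://casvm02.example.edu:8443/cas", "casvm02-incremental.ser", "/var/cache/cas");
    }
}
```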
The second alternative is to realize that you do not actually need either real shared disk or Cushy HTTPS. Every 10 seconds or so Cushy writes one of the two files to a specific directory on local disk, and the timestamp on the file tells you exactly when it last did so. You can turn off Cushy's own replication and write your own program, in any language you prefer, that wakes up on the same schedule and copies each new file wherever you want it to go, using anything from FTP on the simple end to an Enterprise Service Bus on the exotic end. These are just files, and figuring out how to distribute them around the network is fairly routine. Cushy does not care how a file gets where it needs to go, as long as it gets there before the node crashes.
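A do-it-yourself copier can be as small as the sketch below; the paths and schedule are illustrative, and an FTP client, rsync call, or ESB publish would slot in where the Files.copy is:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the do-it-yourself option: watch the timestamp on the
// file Cushy writes and copy each new version to a destination of your choice.
public class CopyOnChangeSketch {

    public static void main(String[] args) {
        Path source = Paths.get("/var/cache/cas/casvm01-incremental.ser");
        Path target = Paths.get("/mnt/backup/casvm01-incremental.ser");
        FileTime[] lastSeen = { FileTime.fromMillis(0) };

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                FileTime modified = Files.getLastModifiedTime(source);
                if (modified.compareTo(lastSeen[0]) > 0) {           // Cushy wrote a new version
                    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                    lastSeen[0] = modified;
                }
            } catch (IOException e) {
                e.printStackTrace();                                  // the file may not exist yet
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}
```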
When a CAS node crashes, the other nodes use the most recent file they received to load up tickets and handle requests for the failed node. They do not care how the files got to them.
So what happens if a router breaks the connection between the Front End and one of the CAS servers? Suppose a fiber optic connection between data centers goes down for an hour before the traffic can be rerouted, separating one CAS server from another? With magic black box technology that is just supposed to take care of all the problems, you don't really know exactly what is going to happen. Cushy can be explained completely in terms of HTTP, a shared disk technology of your choice, or a file transfer program you decide to write. This document still has to fill in a little more detail, and a moderately skilled Java programmer can read the source, but at this point if you sit down and think it over you can figure out how Cushy will react to and recover from any type of failure. With Cushy you know exactly how it works, and therefore exactly what it will do in any failure situation.
Now the bad news. Current CAS has some bugs. It was not written "properly" to work with the various ticket replication mechanisms. It has worked well enough in the past, but CAS 4 introduces new features, and in the future it may not behave as expected. It is not possible to fix everything in the TicketRegistry; a few changes may need to be made in the CAS Ticket classes. So Cushy does not fix these bugs itself, but it does eliminate the false reliance on "the magic black box of off-the-shelf software" that people imagined would do more than it could reasonably be expected to do.
...