...
JPA and the cache technologies try to maintain the image of a single big common bucket of shared tickets. This seems like a fairly simple view, but it is hard to maintain, rather fragile, and it creates synchronization problems between servers while serving no useful CAS purpose. Cushy maintains a separate TicketRegistry for each CAS server, but replicates a copy of each server's TicketRegistry to all the other servers in the cluster.
You could configure a Cushy cluster to make only full checkpoints of all the tickets. The cost of a complete checkpoint is small enough that you could generate one every 10 seconds and run the cluster on full checkpoints alone. It is somewhat inefficient, but using one second of one core and transmitting 3 megabytes of data to each node every 10 seconds is not a big deal on modern multicore servers. This was the first Cushy code milestone, and it lasted for about a day.
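To make the idea concrete, here is a minimal sketch of what a full checkpoint amounts to. The class, field, and file names are illustrative assumptions, not the actual Cushy source; only the Ticket interface is the standard CAS type.

```java
// A full checkpoint is just Java serialization of every ticket this node owns.
// CheckpointWriter, ticketMap, and the path are illustrative assumptions.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.concurrent.ConcurrentHashMap;

import org.jasig.cas.ticket.Ticket;

public class CheckpointWriter {

    private final ConcurrentHashMap<String, Ticket> ticketMap = new ConcurrentHashMap<>();

    public void writeCheckpoint(String path) throws IOException {
        // Copy to a plain list first so the file is a point-in-time snapshot
        // and nothing blocks the live map while the file is written.
        ArrayList<Ticket> snapshot = new ArrayList<>(ticketMap.values());
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(snapshot); // one writeObject call; TGT chains come along with the STs
        }
    }
}
```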
The next milestone (a day later) added an "incremental" file that contains all the tickets added, and the ticket ids of tickets deleted, since the last full checkpoint. Incrementals are cumulative, so they grow between full checkpoints, but you can always apply the most recent incremental you received without worrying about any previous ones. Again, this is slightly inefficient, but trivially so, and it emphasizes simplicity.
CAS already runs the RegistryCleaner off a timer configured in Spring XML to call it every so often. Cushy adds a second timer to the same configuration file that signals the TicketRegistry frequently; for this example, say it makes the call every 10 seconds. Every 10 seconds Cushy generates an incremental file and then contacts all the other nodes to get their most recent incremental files. Separately, Cushy is configured with the time between full checkpoints (say every 5 minutes), so when enough time has passed that a new checkpoint is due, it creates a full checkpoint instead of an incremental.
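A rough sketch of that timer-driven decision might look like the following; timerDriven(), the interval fields, and the write/fetch helpers are assumed names, not the real Cushy API.

```java
// Assumed names throughout: timerDriven() is what the Spring timer calls every
// 10 seconds; most calls write an incremental, but once the configured checkpoint
// interval has elapsed a full checkpoint is written instead.
public class TimerDrivenRegistry {

    private long lastCheckpointMillis = 0L;
    private int checkpointIntervalSeconds = 300; // e.g. a full checkpoint every 5 minutes

    public synchronized void timerDriven() {
        long now = System.currentTimeMillis();
        if (now - lastCheckpointMillis >= checkpointIntervalSeconds * 1000L) {
            writeFullCheckpoint();          // replaces the previous checkpoint file
            lastCheckpointMillis = now;     // the incremental delta starts over from here
        } else {
            writeIncremental();             // everything added or deleted since the last checkpoint
        }
        fetchIncrementalsFromOtherNodes();  // then pull the latest file from each other node
    }

    private void writeFullCheckpoint() { /* serialize all tickets, as sketched earlier */ }
    private void writeIncremental() { /* serialize the cumulative delta */ }
    private void fetchIncrementalsFromOtherNodes() { /* e.g. HTTP GET /cluster/getIncremental on each node */ }
}
```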
The cost of a full checkpoint is small, but it is large enough that you might be reluctant to schedule checkpoints frequently enough to provide the best protection. So between full checkpoints, Cushy creates and transmits a sequence of "incremental" change files that each contain all the changes since the last full checkpoint. In the Spring XML configuration file you set the time between incrementals and the time between checkpoints. The choice is up to you, but a reasonable suggestion is to exchange incrementals every 5-15 seconds and checkpoints every 3-15 minutes. That is not even a recommendation, but we will use it as a rough estimate in the rest of this discussion.
Each incremental has a small number of new Login (TGT) tickets and maybe a few unclaimed service tickets. However, because we do not know whether any previous incremental was or was not processed, it is necessary to transmit the list of every ticket that was deleted since the last full checkpoint, and that will contain the ID of lots of Service Tickets that were created, validated, and deleted within a few milliseconds. That list is going to grow, and its size is limited by the fact that we can start over again after each full checkpoint.
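The following sketch, using assumed names rather than the Cushy source, shows why that deleted-ID list keeps growing until a checkpoint resets it.

```java
// IncrementalTracker and its method names are assumptions for illustration only.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.jasig.cas.ticket.Ticket;

public class IncrementalTracker {

    private final ConcurrentHashMap<String, Ticket> addedSinceCheckpoint = new ConcurrentHashMap<>();
    private final Set<String> deletedSinceCheckpoint = ConcurrentHashMap.newKeySet();

    public void recordAdd(Ticket t) {
        addedSinceCheckpoint.put(t.getId(), t);
    }

    public void recordDelete(String id) {
        // A Service Ticket created, validated, and deleted between two incrementals
        // still leaves its id here, because an earlier incremental that contained it
        // may or may not have been processed by the other nodes.
        addedSinceCheckpoint.remove(id);
        deletedSinceCheckpoint.add(id);
    }

    public void onFullCheckpoint() {
        // The checkpoint captures the whole registry, so the delta can start over.
        addedSinceCheckpoint.clear();
        deletedSinceCheckpoint.clear();
    }
}
```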
...
A CAS node starts up. The Spring configuration loads the primary YaleTicketRepository object, which creates secondary objects for all the other configured nodes. Each object is configured with a node name, and the secondary objects are also configured with the external node URL.
If there is a checkpoint file and perhaps an incremental file for any node in the work directory then the primary and secondary objects will use these files to restore at least the unexpired tickets from the previous time the node was up. This is called a "warm start" and it makes sense if CAS has not been down for long and when you are restarting the same version of CAS.
CAS will have taken a final checkpoint if it shut down normally; if it crashed, there should still be a last checkpoint and perhaps a last incremental file. The tickets in these files are restored to memory, so CAS resumes in the state it was in just before the crash or shutdown.
However, there may be times when you want CAS to start with an empty ticket registry. If you are upgrading from one version of CAS to another, the Ticket classes may not be compatible, or you may simply want a clean slate after some serious outage. In those cases, delete the files in the work directory before restarting CAS and it will come up with an empty Ticket Registry. This is a "cold start". When the CushyTicketRegistry discovers that it has no prior checkpoint file it enters the "Cold Start Quiet Period". For 10 minutes (you can change this in the source) the node will not communicate with any other node in the cluster: it will not send or process notifications and it will not read or return checkpoint or incremental files. This gives machine room operators time to shut down all the CAS servers, delete the files, replace the CAS WAR, and start the new version of CAS with a clean slate. If operations cannot complete this process within the Quiet Period then CAS will continue to function, but it may log I/O error messages from the readObject statement if a node tries to restore a checkpoint or incremental file that contains incompatible versions of Ticket objects created by a different version of the CAS code. As soon as all the nodes have been migrated to the new code the error messages go away, and Cushy will not have been affected by them.
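A minimal sketch of that Quiet Period test, with assumed names, might look like this:

```java
// QuietPeriod and its fields are illustrative names, not the Cushy source.
public class QuietPeriod {

    private static final long QUIET_PERIOD_MILLIS = 10 * 60 * 1000L; // 10 minutes by default

    private final boolean coldStart;      // true only if no prior checkpoint file was found
    private final long coldStartTime;

    public QuietPeriod(boolean noPriorCheckpointFound) {
        this.coldStart = noPriorCheckpointFound;
        this.coldStartTime = System.currentTimeMillis();
    }

    /** While true, ignore Notify calls and neither send nor fetch checkpoint or incremental files. */
    public boolean inQuietPeriod() {
        return coldStart && (System.currentTimeMillis() - coldStartTime) < QUIET_PERIOD_MILLIS;
    }
}
```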
While each node has a copy of its own files, all the other nodes in the cluster have replicated copies of the same files. So if a node fails hard and you lose the disk with the work directory, you can recover the files for the failed node from any other running CAS node in the cluster. Unlike the ehcache or memcached systems where the cache is automatically populated over the network when any node comes up, copying files from one CAS node to another is not an automatic feature. You have to do it manually or else automate it with scripts you write based on your own network configuration.
Remember, every CAS node owns its own Registry, and every other CAS node accepts whatever a node says about itself. So if you bring up a node with an empty work directory, it creates a Registry without tickets and shortly sends an empty checkpoint file to all the other nodes, which replace any old file with the new empty file and empty their secondary Registry objects for that node. So if you want a warm start, make sure the work directory is populated before you start a CAS node, or you will lose all copies of its previous tickets.
If you intend a cold start, it is best to shut down all CAS nodes, empty their work directories, and then bring them back up. You can cold start one CAS node at a time, but it may be confusing if some nodes have no tickets while at the same time other nodes are running with their old ticket population.
During normal processing CAS creates and deletes tickets. It is up to the front end (the F5) to route browser requests to the node the user logged in to, and to route validation requests to the node that generated the ticket.
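As a hedged illustration of that routing convention (the real rule lives in the F5 configuration, not in Java), a ticket id or cookie value ending in the node name can be mapped to its owner like this:

```java
// FrontEndRouting is purely illustrative; it assumes ticket ids end in "-" + node name.
public class FrontEndRouting {

    /** Returns the node-name suffix of an id like "ST-42-xyz-casvm01". */
    public static String owningNode(String ticketId) {
        int lastDash = ticketId.lastIndexOf('-');
        return (lastDash >= 0) ? ticketId.substring(lastDash + 1) : null;
    }

    public static void main(String[] args) {
        // Browser requests would be routed by the suffix of the CAS cookie value,
        // validation requests by the suffix of the "ticket=" parameter.
        System.out.println(owningNode("TGT-17-abcdef-casvm02")); // casvm02
        System.out.println(owningNode("ST-42-xyz-casvm01"));     // casvm01
    }
}
```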
Node Failure
Detecting a node failure is the job of the front end. CAS discovers a failure when a CAS node receives a request that should have been routed to another node. CAS needs no logic to probe the cluster to determine which nodes are up or down. If a node is down then all /cluster/verify and /cluster/getIncremental requests to it will time out, but CAS simply waits the appropriate time and then makes the next request, until eventually the node comes back up.
During failure, the most common event is that a browser or Proxy that logged on to another node makes a request to a randomly assigned CAS node to generate a new Service Ticket against the existing login.
Had we been using a JASIG TicketRegistry, all the tickets from all the nodes would have been stored in one great big virtual bucket, and any node could find the TGT and issue an ST. So the Business Logic layer does not know or care which node issued a Ticket when creating new tickets or validating existing tickets. Furthermore, when using Ehcache, JBoss Cache, or Memcached, tickets replicated to another node using serialization may be chained to their own private copy of the TGT that was sent from the other node by the Java serialization mechanism. So CAS doesn't really look too carefully at the source of the objects it processes.
The big difference with CushyTicketRegistry is that it keeps separate Registry objects for each node, and it likes to treat the secondary Registries as read-only, at least from the business logic layer of this node. When another node has failed and the business logic layer calls the Registry to find a TGT, that TGT will be found in one of the secondary Registries. That is the only hint we have that there has been a node failure.
We still preserve the rule that new tickets are created in the primary (local) registry and are identified with a string that ends with this node's name. That part happens automatically in the business logic when it calls the locally configured Service Ticket unique ID generator, and when it calls addTicket() on the primary object.
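As an illustration of that naming convention (this is a simplified stand-in, not the CAS UniqueTicketIdGenerator implementation), a generator that appends the node name might look like:

```java
// NodeSuffixTicketIdGenerator is a simplified stand-in, not the CAS implementation.
import java.security.SecureRandom;
import java.util.Base64;

public class NodeSuffixTicketIdGenerator {

    private final String nodeName; // e.g. "casvm01"
    private final SecureRandom random = new SecureRandom();

    public NodeSuffixTicketIdGenerator(String nodeName) {
        this.nodeName = nodeName;
    }

    public String newTicketId(String prefix) { // prefix is "ST", "TGT", "PGT", ...
        byte[] bytes = new byte[20];
        random.nextBytes(bytes);
        String randomPart = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        return prefix + "-" + randomPart + "-" + nodeName; // the suffix routes requests back to this node
    }
}
```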
In node failure mode, however, the new Service Ticket will have a Granting Ticket field that points to the TGT in the secondary object, that is, in the Registry holding a copy of the tickets of the failed node.
If you have been paying attention, you will realize this is no big deal. Any serialized Service Ticket transmitted to another node with one of the JASIG Registry solutions will also be stored on the other node with its own private pointer to a copy of the original TGT, and the fact that the Granting Ticket field in the ST object points to an odd ticket that isn't really "in" the cache has never been a problem. The ST will still validate normally.
Of course, there is a race condition: after an ST is issued for a TGT in a secondary Registry, the previously failed node may start back up, send a Notify, and refresh the secondary Registry with a new batch of tickets before the ST is validated. Java will recognize that, while the old TGT is no longer in the ConcurrentHashMap of the secondary Registry, the ST still has a valid reference to it. A particularly aggressive cycle of Garbage Collection might delete all the other tickets from the snapshot of the old registry object, but it will leave that one TGT around as long as the ST points to it. When the ST is validated and deleted, the old copy of the TGT is released and can be destroyed when Java gets around to it. Again, this arrangement where the ST points to a TGT that is no longer in any Registry HashMap is normal behavior for JASIG replication, so it must pose no problem.
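A tiny stand-alone demonstration of that point (ordinary Java, not Cushy code): removing the TGT from the map does not invalidate the reference the ST still holds.

```java
// Plain Java demonstration; Tgt and St are toy classes, not CAS tickets.
import java.util.concurrent.ConcurrentHashMap;

public class ReferenceRetentionDemo {

    static class Tgt { final String id; Tgt(String id) { this.id = id; } }
    static class St  { final String id; final Tgt grantedBy;
                       St(String id, Tgt grantedBy) { this.id = id; this.grantedBy = grantedBy; } }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Tgt> secondaryRegistry = new ConcurrentHashMap<>();
        Tgt tgt = new Tgt("TGT-1-node2");
        secondaryRegistry.put(tgt.id, tgt);

        // A new ST is issued against the TGT found in the secondary registry.
        St st = new St("ST-1-node1", secondaryRegistry.get(tgt.id));

        // The failed node comes back and its registry snapshot is replaced.
        secondaryRegistry.clear();

        // The ST still validates against its private snapshot of the TGT;
        // the old TGT cannot be garbage collected while the ST references it.
        System.out.println(st.grantedBy.id); // prints TGT-1-node2
    }
}
```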
The most interesting behavior occurs when a TGT in the secondary Registry of a failed node is used to create a Proxy Granting Ticket. The Proxy Granting Ticket is issued by and belongs to the node that created it, and the proxying application communicates with that node to get Service Tickets.
The thing that really changes is the handling of CAS Logoff. Fortunately, in normal practice nobody ever logs off of CAS. They just close their browser and let the TGT time out. However, if someone were to call /cas/logoff during a node failure when they were logged in to another node, the Business Logic layer will delete the CAS Cookie and report a successful logoff, but we cannot guarantee that CAS will do everything perfectly in the way it would have operated without the node failure.
Sorry, but if node failure screws up the complete, correct processing of Single SignOut, that is simply a problem you will have to accept. Unless a node stays up to control its TGT and correctly fill in the Service collection with all the Services the user logged in to, a decentralized recovery system like this cannot globally manage the services.
There is another problem that probably doesn't matter but should be mentioned. If a node handles Single SignOut on its own during a node failure, and then the failed node comes back, the failed node will restore the TGT that the user just logged out of. Many hours later that TGT will time out, and the original node will then try to notify all the Services that a logout has occurred. So services may get two logout messages from CAS for the same login. It is almost impossible to imagine a service that will be bothered by this behavior.
During normal processing all the CAS servers are generating checkpoint and incremental files and they are exchanging these files over the network. The file exchange is required because you never know when a node is going to fail. However, once the file has been transmitted, the tickets in the file are not actually needed if the front end is routing requests properly and the other nodes are up. So during the 99% of the time when there is no failure, CAS saves a small amount of processing time by waiting until there is an actual request (after a node failure) that requires access to tickets from another node before it deserializes the data in the file. This is an optimization called "Just In Time Deserialization".
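A sketch of Just In Time Deserialization under assumed names: the secondary registry holds only the file until a real request forces it to read the tickets.

```java
// SecondaryRegistry here is an illustration, not the real Cushy class.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

import org.jasig.cas.ticket.Ticket;

public class SecondaryRegistry {

    private final String checkpointPath; // the file most recently received from the other node
    private final ConcurrentHashMap<String, Ticket> tickets = new ConcurrentHashMap<>();
    private volatile boolean deserialized = false;

    public SecondaryRegistry(String checkpointPath) {
        this.checkpointPath = checkpointPath;
    }

    public Ticket getTicket(String id) throws IOException, ClassNotFoundException {
        if (!deserialized) {
            restoreFromFile(); // only happens when a real request needs the other node's tickets
        }
        return tickets.get(id);
    }

    @SuppressWarnings("unchecked")
    private synchronized void restoreFromFile() throws IOException, ClassNotFoundException {
        if (deserialized) {
            return;
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(checkpointPath))) {
            for (Ticket t : (List<Ticket>) in.readObject()) {
                tickets.put(t.getId(), t);
            }
        }
        deserialized = true;
    }
}
```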
However, network failure is judged from the perspective of every node on the network. A CAS node will start to get requests belonging to another node if the Front End thinks the other node is down (mostly because it cannot contact it). However, if the failure is caused by a single switch or router between the Front End and the other node, the other CAS nodes may still be able to talk to that node even though the Front End cannot get to it. If the other node were really down you would not get Notifications or incrementals, but Cushy does not assume that, just because it is getting requests it should only receive if the other node were down, file updates from that node will actually stop. It manages the two things separately.
Cushy has a "healthy" flag for each node. If it receives a timeout trying to contact a node, or it gets a network I/O error reading the data, then it marks the other node "unhealthy". At that point it stops asking for incrementals and waits for a Notify.
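A minimal sketch of that per-node health tracking, with assumed names:

```java
// NodeHealth and SecondaryNodeClient are assumed names for illustration.
import java.io.IOException;

public class NodeHealth {

    private volatile boolean healthy = true;

    public void fetchIncremental(SecondaryNodeClient client) {
        if (!healthy) {
            return; // stop polling this node; wait for its next Notify
        }
        try {
            client.getIncremental(); // e.g. HTTP GET .../cluster/getIncremental
        } catch (IOException timeoutOrIoError) {
            healthy = false; // mark unhealthy on a timeout or network I/O error
        }
    }

    public void onNotifyReceived() {
        healthy = true; // a Notify announces a new checkpoint and a clean slate
    }

    interface SecondaryNodeClient { void getIncremental() throws IOException; }
}
```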
Once Just In Time Deserialization has been triggered by a request for a ticket owned by another node, then any subsequent incrementals received from the other node are immediately applied to keep the in memory collection of tickets up to date. This gets revisited when the other node generates a Notify.
Notify is in part an "I am up and functioning" message as well as an "I have a new checkpoint" message. The first thing a node does after booting up is to send a new Notify to all the other nodes. If there is a temporary network failure between nodes, then other activity may stop but the nodes will all try to send a Notify with each new checkpoint (say every 5 minutes) trying to reestablish contacts.
Getting a Notify from a node and reading its new checkpoint file clears the flags that say tickets have been "just in time" deserialized and that the node is unhealthy. It provides an opportunity, if nothing else is wrong anywhere, for things to go back to completely normal behavior (at least for that node). If more requests arrive then the Just In Time Deserialization happens again, and if network I/O errors reappear the node will be marked unhealthy again, but after a Notify we give a node a chance to start with a clean slate.
Node Failure
Detecting a node failure is the job of the Front End. CAS discovers a failure when a CAS node receives a request that should have been routed to another node. The tickets for that node are restored into the Secondary Registry for that node.
Anyone who signed in to the failed node in the last few seconds will lose his TGT. Any Service Ticket issued but not validated by the failed node will be lost and validation requests will fail. The Cushy design is to support the 99.99% of traffic that deals with people who logged in longer than 10 seconds ago.
New logins have no node affiliation and therefore nothing to do with node failure.
During node failure, the three interesting activities are:
- Issuing a new Service Ticket on behalf of a TGT owned by another node.
- Issuing a new Proxy Ticket connected to a TGT owned by another node.
- Logging a user off if his TGT is owned by another node.
In the first two cases, the current node creates a new Ticket. The Ticket is owned by this node even if it points to a Granting Ticket that is in the Registry of another node. The Ticket gets the local node suffix and is put in the local (Primary) CushyTicketRegistry. The Front End will route all requests for this ticket to this node. The Business Logic layer of CAS does not know that the TGT belongs to another node because the Business Logic layer is used to all the other TicketRegistries where all the tickets are jumbled up together in a big common collection. So this is business as usual.
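A sketch of that lookup order, using assumed names rather than the real CushyTicketRegistry methods: the primary registry is consulted first, and the secondaries only matter when the front end has failed a request over from another node.

```java
// ClusterTicketLookup and TicketSource are assumed names, not the Cushy classes.
import java.util.List;

import org.jasig.cas.ticket.Ticket;

public class ClusterTicketLookup {

    private final TicketSource primary;           // this node's own registry
    private final List<TicketSource> secondaries; // one mirror per other cluster node

    public ClusterTicketLookup(TicketSource primary, List<TicketSource> secondaries) {
        this.primary = primary;
        this.secondaries = secondaries;
    }

    public Ticket getTicket(String id) {
        Ticket t = primary.getTicket(id); // normal case: the front end routed the request correctly
        if (t != null) {
            return t;
        }
        for (TicketSource secondary : secondaries) {
            t = secondary.getTicket(id);  // found here only when another node has failed
            if (t != null) {
                return t;                 // e.g. a TGT used to issue a new, locally owned ST
            }
        }
        return null;
    }

    public interface TicketSource { Ticket getTicket(String id); }
}
```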
There is one consequence that should be understood. Although the TGT is currently in the Secondary Registry, that collection of tickets is logically and perhaps physically replaced when the node comes back up, issues a Notify, and a new checkpoint is received. At that point the ST (and more importantly the PGT because it lives longer) will point to the same sort of "private copy of a TGT that is a point in time snapshot of the login status when the secondary ticket was created" that you get all the time when ST and PGT objects are serialized and transmitted between nodes by any of the "cache" replication technologies. Cushy has been able up to this point to avoid unconnected private copies of TGT's, but it cannot do so across a node failure and restart.
This brings us to Logoff. Not many people logoff from CAS. When they do, the Business Logic layer of CAS will try to handle Single Sign Out by notifying all the applications that registered a logoff URL that the user has logged out. Again, since the Business Layer works fine in existing "cache" based object replication systems, the fact that Cushy is holding the TGT in a Secondary object has absolutely no effect on the processing. The only difference occurs when the Business Logic goes to delete the TGT.
The problem here is that we don't own the TGT. The other node owns it. Furthermore, the other node probably has a copy of it in its last checkpoint file, and as soon as it starts up it will restore that file to memory including this TGT. So while we could delete the object in the Secondary Registry, it is just going to come back again later on.
This probably doesn't matter. The cookie has been deleted in the browser. Any Single Sign Out processing has been done. The TGT may sit around all day unused, and then eventually it times out. At this point we get the only actual difference in behavior. When it times out the Business Logic is going to repeat the Single Sign Out processing. It is almost inconceivable that any application would be written in such a way that it would notice or care if it gets a second logout message for someone who already logged out, but it has to be noted.
Node Recovery
After a failure, the node comes back up and restores its registry from the files in the work directory. It issues a Notify which tells the other nodes it is coming back up.
At some point the front end notices the node is back and starts routing requests to it based on the node name in the suffix of CAS Cookies. The node picks up where it left off. It does not know, and cannot learn, about any Service Tickets issued on behalf of its logged-in users by other nodes during the failure. It does not know about users who logged out of CAS during the failure.
Cushy defines its support of Single SignOff to be a "best effort" that does not guarantee perfect behavior across node failure.
Every time the node generates a new checkpoint and issues another Notify, the other nodes clear any flags indicating failover status and attempt to go back to normal processing. This may not happen the first time if the Front End takes a while to react, but if not the first then probably the second Notify will return the entire cluster to normal processing.