...

  1. After receiving a Service Ticket ID, an application opens its own HTTPS session to CAS and presents the ticket id in a "validate" request. If the id is valid, CAS passes back the Netid and, for certain types of request, can pass back additional attributes. The suffix on the ticket= parameter identifies the CAS server that created the ticket and holds it in memory, so the request can be routed to that server without requiring any high speed replication.
  2. When a middleware server like a Portal has obtained a CAS Proxy Granting Ticket, it requests CAS to issue a Service Ticket by making a /proxy call. Since the middleware is not a browser, it does not have a Cookie to hold the PGT. So it passes it explicitly in the pgt= parameter.
  3. After a user logs in, CAS creates a Login TGT that points to the Netid and attributes and writes the ticket id of the TGT to the browser as a Cookie. The Cookie is scoped to the URL of the CAS application as seen from the browser's point of view. At Yale this is "https://secure.its.yale.edu/cas", so whenever the browser sees a subsequent URL that begins with this string, it appends the CASTGC Cookie containing the TGT ID. CAS uses this to find the TGT object and knows that the user has already logged in. This rule sends a browser back to the CAS node the user is logged into.
  4. If the first three tests fail, this request is not associated with an existing logged in user. CAS has a bug/feature: it depends on Spring Web Flow, which stores data during login in Web Flow storage, which in turn depends on the HttpSession object maintained by the Web Server (Tomcat, JBoss, ...). You can cluster JBoss or Tomcat servers to share HttpSession objects over the network, but it is simpler to program the Front End so that, if the user responds in a reasonable amount of time, the login form with the userid and password is sent back to the Web Server that wrote the form to the browser in response to the browser's original HTTP GET. This is called a "sticky session", and the F5 does it automatically if you just check a box. You don't need to write code.
  5. Otherwise, if this is a brand new login request to CAS, or if the CAS Server selected by one of the previous steps has failed and is not responding to the Front End, then send the request to any available CAS server. (A sketch of these routing rules follows the list.)
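The five rules above amount to a small routing decision. The sketch below is a hypothetical illustration in Java; a real deployment would express the same logic as F5 iRule configuration rather than application code, and the node names, the sticky-session map, and the helper method are assumptions made up for this example, not part of Cushy or the F5.

    import java.util.List;
    import java.util.Map;

    /**
     * Hypothetical sketch of the Front End routing rules described above.
     * The sticky-session map stands in for the F5 persistence table; node
     * names and helper methods are invented for illustration only.
     */
    public class FrontEndRouter {

        private final List<String> liveNodes;              // nodes currently responding
        private final Map<String, String> stickySessions;  // JSESSIONID -> node that wrote the form

        public FrontEndRouter(List<String> liveNodes, Map<String, String> stickySessions) {
            this.liveNodes = liveNodes;
            this.stickySessions = stickySessions;
        }

        public String chooseNode(Map<String, String> queryParams, Map<String, String> cookies) {
            // 1. Validate requests: the suffix on ticket= names the node that issued the ST.
            String node = nodeFromSuffix(queryParams.get("ticket"));
            if (node != null && liveNodes.contains(node)) return node;

            // 2. /proxy requests: the suffix on pgt= names the node that issued the PGT.
            node = nodeFromSuffix(queryParams.get("pgt"));
            if (node != null && liveNodes.contains(node)) return node;

            // 3. Browser with a CASTGC cookie: go back to the node holding the login TGT.
            node = nodeFromSuffix(cookies.get("CASTGC"));
            if (node != null && liveNodes.contains(node)) return node;

            // 4. Sticky session: send the login form POST back to the node that wrote the form.
            String session = cookies.get("JSESSIONID");
            node = (session == null) ? null : stickySessions.get(session);
            if (node != null && liveNodes.contains(node)) return node;

            // 5. New login, or the preferred node is down: use any available node.
            return liveNodes.isEmpty() ? null : liveNodes.get(0);
        }

        /** Ticket ids end with the name of the node that issued them, e.g. "ST-42-xyz-casnode1". */
        private String nodeFromSuffix(String ticketId) {
            if (ticketId == null) return null;
            int dash = ticketId.lastIndexOf('-');
            return (dash < 0) ? null : ticketId.substring(dash + 1);
        }
    }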

How it Fails (Nicely)

The Primary + Warm Spare Cluster

One common cluster model is to have a single master CAS server that normally handles all the requests and a normally idle backup server (a "warm spare"). When the master goes down, the backup server handles requests until the master returns.

During normal processing the master server is generating tickets, creating checkpoint and incremental files, and sharing or sending them to the backup server. The backup server is generating empty checkpoints with no tickets because it has not yet received a request.

Then the master is shut down or crashes. The backup server has a copy of all the tickets generated by the master, except for those from the last few seconds before the crash. When new users log in, it creates new Login Tickets in its own Ticket Registry. When it gets a request for a new Service Ticket for a user who logged into the master, it creates the ST in its own registry (with its own nodename suffix) but connects the ST to the Login Ticket in its copy of the master's Ticket Registry.

Remember that the CAS Business Logic is accustomed to a Ticket Registry that maintains what appears to be a large collection of tickets shared by all the nodes. So the Business Logic is quite happy with a Service Ticket created by one node pointing to a Login Ticket created by another node.

Now the master comes back up and, for this example, let us assume that it resumes its role as master (there are configurations where the backup becomes the new master, so when the old master comes back it becomes the new backup; Cushy works either way).

What happens next depends on how smart the Front End is. If it has been programmed to route requests based on the suffix of the ticket in the login cookie, then users who logged into the backup server during the failure continue to use the backup server, while new and older users all go back to the master. If the Front End is programmed to route all requests to the master as long as the master is up, then when the master comes back up the backup server effectively "fails over" to the master.

When the master comes up it reloads its old copy of its ticket registry from before the crash, and it gets a copy of the tickets generated by the backup server while it was down. When it subsequently gets requests from users who logged into the backup server, it resolves those requests using its copy of the backup server's TGTs.

This leaves a few residual "issues" that are not really big problems and are deferred until Cushy 2.0. Because each server is the owner of its own tickets, and its Ticket Registry is the authoritative source of status on its own tickets, other nodes cannot make permanent changes to another node's tickets during a failover.

This means that the master is unaware of things the backup server did while it was down that should have modified its tickets. For example, if a user logs out of CAS while the backup server is in control, then the Cookie gets deleted and all the normal CAS logoff processing is done, but the Login Ticket (the TGT) cannot really be deleted. That ticket belongs to the master, and when the master comes back up again it will be in the restored Registry. However, it turns out that CAS doesn't really have to delete the ticket. Since the cookie has been deleted, nobody is going to try and use it. It will simply sit around until it times out and is deleted later on.

A more serious problem occurs with Single Sign Out for people who logged into the backup server while the master is down, in systems where the Front End processor is not programmed to route requests intelligently. When the master reboots and starts handling all new requests, those users have a TGT that is "frozen" to the state it was in when the master rebooted. The master can subsequently create new Service Tickets from that TGT, but Single Sign Out will not know to log them off from those services when the user logs off. The current solution is to use Front End programming. Cushy 2.0 may add intelligent TGT migration and merging after a CAS server reboots.

A Load Balanced Cluster

When a CAS server fails, requests for its tickets are assigned to one of the other servers.

When a CAS server receives a request for a ticket owned by another node, it fully activates the other node's shadow Ticket Registry from a copy of the other node's checkpoint and incremental files. It then looks up the ticket in that registry and returns it to the CAS Business Logic. A node may not have a copy of tickets issued in the last few seconds, so one or two users may see an error.
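A minimal sketch of this lookup path is shown below, assuming one local registry plus a lazily loaded shadow registry per peer node. The class, interface, and method names are invented for illustration; they are not Cushy's actual API, and the real code will differ in detail.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Illustrative sketch only: look a ticket up locally first, and fall back
     * to the shadow registry of the node named in the ticket suffix, loading
     * that registry from the peer's checkpoint and incremental files on demand.
     */
    class ClusterTicketLookup {

        interface Ticket { String getId(); }

        interface ShadowRegistry {
            boolean isLoaded();
            void loadFromCheckpointFiles();   // deserialize the peer's checkpoint + increments
            Ticket get(String ticketId);
        }

        private final Map<String, Ticket> localRegistry = new ConcurrentHashMap<>();
        private final Map<String, ShadowRegistry> shadowsByNode = new ConcurrentHashMap<>();

        Ticket getTicket(String ticketId) {
            // Normal case: this node issued the ticket, so it is in the local table.
            Ticket local = localRegistry.get(ticketId);
            if (local != null) {
                return local;
            }
            // Failover case: the suffix names another node; activate its shadow copy.
            String owner = ticketId.substring(ticketId.lastIndexOf('-') + 1);
            ShadowRegistry shadow = shadowsByNode.get(owner);
            if (shadow == null) {
                return null;                  // unknown node, or the ticket never existed
            }
            synchronized (shadow) {
                if (!shadow.isLoaded()) {
                    shadow.loadFromCheckpointFiles();   // Just In Time restore
                }
            }
            // May still be null for tickets issued in the last few seconds before
            // the failure, which never made it into a checkpoint or increment.
            return shadow.get(ticketId);
        }
    }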

Cushy can issue a Service Ticket that points to a Login Ticket owned by the failed node. More interestingly, it can issue a Proxy Granting Ticket pointing to the Login Ticket on the failed node. In both cases the new ticket carries the suffix of, and is owned by, the node that created it, not the node that owns the login.

Again, the rule that each node owns its own registry and all the tickets it created, and that the other nodes cannot permanently change those tickets, has certain consequences.

  • If you use Single Sign Off, then the Login Ticket maintains a table of Services to which you have logged in. When you log out, or when your Login Ticket times out in the middle of the night, each Service gets a call from CAS on a published URL with the Service Ticket ID you used to login, so the application can log you off if it has not already done so. In fail-over mode a backup server can issue Service Tickets for a failed node's TGT, but it cannot successfully update the Service table in the TGT, because when the failed node comes back up it will restore the old Service table along with the old TGT.
  • If the user logs out and the Services are notified by the backup CAS server, and then the node that owned the TGT is restored along with the now undead copy of the obsolete TGT, then in the middle of the night that restored TGT will time out and the Services will all be notified of the logoff a second time. It seems unlikely that anyone would ever write a service logout so badly that a second logoff would be a problem. Mostly it will be ignored.

...

What Cushy Does at Failure

There is little to explain about how Cushy runs normally: it is based on DefaultTicketRegistry and stores the tickets in a table in memory. If you have a cluster, each node in the cluster operates as if it were a standalone server and depends on the Front End to route requests to the node that can handle them.

Separately from the CAS function, Cushy periodically writes some files to a directory on disk. They are ordinary files. They are protected with ordinary operating system security.

In a cluster, the files can be written to a shared disk, or they can be copied to a shared location or from node to node by an independent program that has access to the directories. Or, Cushy will replicate the files itself using HTTPS GET requests.
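A rough sketch of that checkpoint-and-replicate cycle follows, assuming the tickets are serializable and that each node exposes its files at some HTTPS URL. The file names, the /cluster/ URL path, and the class itself are assumptions made up for this sketch, not Cushy's actual layout.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.ArrayList;
    import java.util.Collection;

    /**
     * Illustration only: periodically serialize this node's tickets to an
     * ordinary file on disk, and pull the peers' checkpoint files over plain
     * HTTPS GET so that every node ends up holding a copy of every file.
     */
    class CheckpointReplicator {

        private final Path workDir;

        CheckpointReplicator(Path workDir) {
            this.workDir = workDir;
        }

        /** Write a full checkpoint of this node's tickets to an ordinary file. */
        void writeCheckpoint(String nodeName, Collection<? extends Serializable> tickets)
                throws IOException {
            Path file = workDir.resolve(nodeName + "-checkpoint.ser");
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(new ArrayList<>(tickets));
            }
        }

        /** Fetch a peer's checkpoint file over HTTPS GET and store it locally. */
        void fetchPeerCheckpoint(String peerNode, String peerBaseUrl) throws IOException {
            // Hypothetical URL; the real path would be whatever the CAS web application exposes.
            URL url = new URL(peerBaseUrl + "/cluster/" + peerNode + "-checkpoint.ser");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (InputStream in = conn.getInputStream()) {
                Files.copy(in, workDir.resolve(peerNode + "-checkpoint.ser"),
                        StandardCopyOption.REPLACE_EXISTING);
            } finally {
                conn.disconnect();
            }
        }
    }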

A failure is detected when a request is routed by the Front End to a node other than the node that created the ticket.

Because CAS is a relatively small application that can easily run on a single machine, a "cluster" can be configured in either of two ways:

  • A Primary server gets all the requests until it fails. Then a Backup "warm spare" server gets the requests. If the Primary comes back up relatively quickly, then Cushy will work best if the Front End resumes routing all requests to the Primary as soon as it becomes available again.
  • Users are assigned to CAS Servers on a round-robin or load balanced basis.

Each CAS server in the cluster has a shadow object representing the TicketRegistry of each of the other nodes. In normal operation, that object contains no ticket objects. There is no need to read the files from the other node until a failure occurs and a request for one of those tickets arrives. Then Cushy restores the tickets from the file into memory (Just In Time) and processes requests on behalf of the failed node.

However, every new ticket Cushy creates belongs to the node that created it. A new Service Ticket gets the suffix of the current node even if the Login TGT has the suffix of the failed node. A new Proxy Granting Ticket can also be created on this node for middleware even though the user logged into the failed node.

This allows the Front End to do the right thing in the few seconds after the failed node reappears on the network. Requests that depend on the newly created tickets generated by the backup servers go back to the servers that created them, but as soon as the login node reappears, new requests from the user's browser go back to the login server, where new Service Tickets and PGTs are then created where we would prefer them to be.
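To make the ownership rule concrete, here is a hedged sketch of a ticket id generator that always appends the local node's suffix. The format shown (type, counter, random part, node suffix) follows the familiar CAS ticket id convention, but the class and the node names in the comments are illustrations, not Cushy's actual code.

    import java.security.SecureRandom;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Illustration of the ownership rule: every ticket id ends with the suffix
     * of the node that created it, even when a Service Ticket or Proxy Granting
     * Ticket points to a login TGT that carries a different (failed) node's suffix.
     */
    class SuffixedTicketIdGenerator {

        private static final char[] ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789".toCharArray();

        private final String nodeSuffix;                  // e.g. "casnode2"
        private final AtomicLong counter = new AtomicLong();
        private final SecureRandom random = new SecureRandom();

        SuffixedTicketIdGenerator(String nodeSuffix) {
            this.nodeSuffix = nodeSuffix;
        }

        /** prefix is "ST", "TGT", or "PGT"; the suffix is always the local node. */
        String newId(String prefix) {
            StringBuilder randomPart = new StringBuilder();
            for (int i = 0; i < 20; i++) {
                randomPart.append(ALPHABET[random.nextInt(ALPHABET.length)]);
            }
            return prefix + "-" + counter.incrementAndGet() + "-" + randomPart + "-" + nodeSuffix;
        }
    }

    // Example: a backup node "casnode2" issues a Service Ticket for a TGT whose id
    // ends in "-casnode1" (the failed login node). The new ST ends in "-casnode2",
    // so the Front End routes its validation back to the node that created it.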

Service Tickets are created and then deleted a few milliseconds later when the application validates them, or they time out after a few seconds or minutes. They do not exist long enough to raise any issues.

Proxy Granting Tickets, however, can remain around for hours. So the one long term consequence of a failure is that the login TGT can be on one server, but a PGT can be on a different server that created it while the login server was temporarily unavailable. This requires some thought, but you should quickly realize that everything will work correctly today. In future CAS releases there will be an issue if a user adds additional credentials (factors of authentication) to an existing login after a PGT is created. Without the failure, the PGT sees the new credentials immediately. With current Cushy logic, the PGT on the backup server is bound to a point in time snapshot of the original TGT and will not see the additional credentials. Remember, this only occurs after a CAS failure. It only affects the users who got the Proxy ticket during the failure. It can be "corrected" if the end user logs out and then logs back into the middleware server.

Cushy 2.0 will consider addressing this problem automatically.

There is also an issue with Single Sign Out. If a user logs out during a failure of his login server, then a backup server processes the Single Log Out normally. Then when the login server is restored to operation, the Login TGT is restored from the checkpoint file into memory. Of course, no browser now has a Cookie pointing to that ticket, so it sits unused all day and then in the evening it times out and a second Single Sign Out process is triggered, and all the applications that previously were told the user logged out are contacted a second time with the same logout information. It is almost unimaginable that any application would be written so badly it would care about this, but it should be mentioned.

While the login server is down, new Service Tickets can be issued, but they cannot be meaningfully added to the "services" table in the TGT that drives Single Sign Out. After the login server is restored, if the user logs out of CAS the only applications that will be notified of the logout will be applications that received their Service Tickets from the login server. Cushy regards Single Sign Out as a "best effort" service and cannot at this time guarantee processing for STs issued during a node or network failure.

Again, Cushy 2.0 will address this problem.

CAS Cluster

In this document a CAS "cluster" is just a bunch of CAS server instances that are configured to know about each other. The term "cluster" does not imply that the Web servers are clustered in the sense that they share Session information, nor does it depend on any other type of communication between machines. In fact, a CAS cluster could be created from one CAS instance running under Tomcat on Windows and another running under JBoss on Linux.
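As an illustration of "configured to know about each other", the sketch below models the cluster as nothing more than a list of node names and base URLs. The class, the method names, and the example hostnames are assumptions invented for this sketch, not Cushy's actual configuration syntax.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Illustration only: a CAS "cluster" here is just each node knowing its own
     * name plus the names and base URLs of its peers. No shared sessions, no
     * shared cache, and no requirement that the nodes run the same container or OS.
     */
    class ClusterDefinition {

        private final String localNodeName;
        private final Map<String, String> nodeUrls = new LinkedHashMap<>();

        ClusterDefinition(String localNodeName) {
            this.localNodeName = localNodeName;
        }

        void addNode(String name, String baseUrl) {
            nodeUrls.put(name, baseUrl);
        }

        /** Peers are every configured node except this one. */
        Map<String, String> peers() {
            Map<String, String> peers = new LinkedHashMap<>(nodeUrls);
            peers.remove(localNodeName);
            return peers;
        }
    }

    // Example (hypothetical hostnames): two nodes that differ in container and OS.
    //   ClusterDefinition cluster = new ClusterDefinition("casnode1");
    //   cluster.addNode("casnode1", "https://cas-node1.example.edu/cas");  // Tomcat on Windows
    //   cluster.addNode("casnode2", "https://cas-node2.example.edu/cas");  // JBoss on Linux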

...