...

Then the master is shut down or crashes. The backup server has a copy in memory of all the tickets generated by the master, except for the last few seconds before the crash. When new users log in, it creates new Login Tickets in its own Ticket Registry. When it gets a request for a new Service Ticket for a user who logged into the master, it creates the ST in its own registry (with its own nodename suffix) but connects the ST to the Login Ticket in its copy of the master's Ticket Registry.

Remember that the CAS Business Logic is accustomed to a Ticket Registry that maintains what appears to be one large collection of tickets shared by all the nodes. So the Business Logic is quite happy with a Service Ticket created by one node pointing to a Login Ticket created by another node.

Now the master comes back up and, for this example, let us assume that it resumes its role as master. (There are configurations where the backup becomes the new master, so that when the old master comes back it becomes the new backup; Cushy works either way.)

What happens next depends on how smart the Front End is. If it has been programmed to route requests based on the suffix of the tickets in the login cookie, then users who logged into the backup server during the failure continue to use the backup server, while new users all go back to the master. If the Front End is programmed to route all requests to the master as long as the master is up, then it appears that when the master came up the backup server "failed over to the master".

In a network, "failure" is a matter of perspective. The Front End thinks a node has failed if it cannot connect to the node, and it then routes all requests to other nodes. The node may really be down, or there may have been a problem in a network switch or router between the Front End and the node in question. The network path between the Front End and this node may be different from the path between CAS servers. So it is possible for the Front End to believe a node has failed while, in fact, the node is really up and continues to exchange checkpoint and incremental files with the other CAS nodes. That is the sense in which a master server taking over all subsequent requests behaves (from the Front End point of view) as if the backup server "failed over".

Since the backup server is really up, the master rather quickly gets a copy of the final set of tickets owned by the backup server. From that point on it handles requests for those tickets just as if the backup server had actually failed.

This leaves a few residual "issues" that are not really problems. Because each server is the owner of its own tickets and its Ticket Registry is the authoritative source of status on those tickets, the other nodes cannot make permanent changes to another node's tickets during a failover. (Strictly speaking, a node can temporarily change tickets in its copy of the other node's registry, but when the other node generates its next checkpoint, whatever changes were made are replaced by a copy of the old, unmodified ticket.)
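As a rough sketch of why those temporary changes evaporate, assume (this is an illustration, not the actual Cushy source; the class and method names are invented) that a checkpoint is simply a serialized snapshot of the owning node's ticket map and an incremental carries recently added tickets and deleted ticket ids:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Illustration only: edits made to another node's tickets do not stick,
     * because the next checkpoint from the owning node replaces the entire
     * shadow copy of that node's registry.
     */
    public class CheckpointRestoreSketch {

        /** suffix of another node -> shadow copy of that node's tickets */
        private final Map<String, Map<String, Object>> shadowRegistries = new ConcurrentHashMap<>();

        /** Called when a full checkpoint file arrives from another node. */
        public void applyCheckpoint(String nodeSuffix, Map<String, Object> deserializedTickets) {
            // Wholesale replacement: any local modification made to the old
            // shadow copy is silently discarded at this point.
            shadowRegistries.put(nodeSuffix, new HashMap<>(deserializedTickets));
        }

        /** Called when an incremental file arrives (assumed to list adds and deletes). */
        public void applyIncremental(String nodeSuffix, Map<String, Object> addedTickets,
                                     Iterable<String> deletedTicketIds) {
            Map<String, Object> shadow = shadowRegistries.computeIfAbsent(nodeSuffix, k -> new HashMap<>());
            shadow.putAll(addedTickets);
            for (String id : deletedTicketIds) {
                shadow.remove(id);
            }
        }
    }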

Because of this read-only rule, the master is unaware of things the backup server did while it was down that should have modified the master's tickets. For example, if a user logs out of CAS while the backup server is in control, the cookie gets deleted and all the normal CAS logoff processing is done, but the Login Ticket (the TGT) cannot really be deleted. That ticket belongs to the master, and when the master comes back up it will be in the restored Registry. However, it turns out that CAS doesn't really have to delete the ticket. Since the cookie has been deleted, nobody is going to try to use it. It will simply sit around until it times out and is deleted later on.

The same thing happens in reverse for users who logged into the backup server while the master was down. When the master comes up and handles the logout, it cannot really delete the Login Ticket that belongs to the backup server, so the ticket sits on the backup server unused until it times out. This is an example of the principle that Cushy is designed to handle the specific problems of CAS Tickets in a way that is simple and adequate to CAS needs, but not a general solution for other applications.

A Smart Front End Cluster

A programmable Front End is adequate to Cushy's needs if it can route requests based on four rules (a sketch of the decision logic follows the list):

  1. If the URL "path" is a validate request (/cas/validate, /cas/serviceValidate, etc.) then route to the node indicated by the suffix on the value of the ticket= parameter.
  2. If the URL is a /proxy request, route to the node indicated by the suffix of the pgt= parameter.
  3. If the request has a CASTGC cookie, then route to the node indicated by the suffix of the TGT that is the cookie's value.
  4. Otherwise, or if the node selected by rules 1-3 is down, choose any CAS node using whatever round robin or master-backup algorithm was previously configured.
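As a rough illustration of those four rules (this is not the actual F5 iRule used at Yale; the class, method, and field names below are invented for the sketch, and the node suffix is assumed to be the final dash-delimited component of a ticket id), the decision logic might look like this in Java:

    import java.util.List;
    import java.util.Map;

    /**
     * Hypothetical sketch of the four routing rules a programmable Front End
     * would apply. RoutingRequest, pickDefaultNode, etc. are invented names,
     * not part of CAS or of any real F5 configuration.
     */
    public class FrontEndRouterSketch {

        /** Minimal view of an incoming request; all fields are assumptions. */
        public static class RoutingRequest {
            public String path;                  // e.g. "/cas/serviceValidate"
            public Map<String, String> params;   // query parameters
            public Map<String, String> cookies;  // cookie name -> value
        }

        /** Map from ticket suffix to the node that owns tickets with that suffix. */
        private final Map<String, String> suffixToNode;
        private final List<String> allNodes;

        public FrontEndRouterSketch(Map<String, String> suffixToNode, List<String> allNodes) {
            this.suffixToNode = suffixToNode;
            this.allNodes = allNodes;
        }

        public String chooseNode(RoutingRequest req) {
            String node = null;

            // Rule 1: validate requests route on the suffix of the ticket= parameter.
            if (req.path.contains("/validate") || req.path.contains("Validate")) {
                node = nodeForTicket(req.params.get("ticket"));
            }
            // Rule 2: /proxy requests route on the suffix of the pgt= parameter.
            if (node == null && req.path.endsWith("/proxy")) {
                node = nodeForTicket(req.params.get("pgt"));
            }
            // Rule 3: requests carrying a CASTGC cookie route on the TGT suffix.
            if (node == null) {
                node = nodeForTicket(req.cookies.get("CASTGC"));
            }
            // Rule 4: otherwise, or if the selected node is down, fall back to the
            // previously configured round robin or master-backup selection.
            if (node == null || !isUp(node)) {
                node = pickDefaultNode();
            }
            return node;
        }

        /** Assume the owning node's suffix follows the last '-' in the ticket id. */
        private String nodeForTicket(String ticketId) {
            if (ticketId == null) return null;
            int dash = ticketId.lastIndexOf('-');
            if (dash < 0) return null;
            return suffixToNode.get(ticketId.substring(dash + 1));
        }

        private boolean isUp(String node) {
            return true;   // placeholder: a real Front End has its own health checks
        }

        private String pickDefaultNode() {
            return allNodes.get(0);   // placeholder for round robin / master-backup
        }
    }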

So normally all requests go to the machine that created, and therefore owns, the ticket, no matter what type of ticket it is. When a CAS server fails, requests for its tickets are assigned to one of the other servers.

When a CAS server receives a request for a ticket owned by another node, it recognizes the foreign suffix and fully activates that node's shadow Ticket Registry. It then looks up the ticket in that registry and returns it to the CAS Business Logic. A node may not have a copy of tickets issued in the last few seconds before the failure, so one or two users may see an error.
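A minimal sketch of that lookup, assuming each node keeps its own ticket map plus one shadow map per other node (ShadowRegistrySketch and its methods are invented names for illustration, not the actual Cushy classes):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Illustrative sketch only: a node owns the tickets it created (ids ending
     * in its own suffix) and holds read-only shadow copies of the other nodes'
     * registries, restored from their checkpoint and incremental files.
     */
    public class ShadowRegistrySketch {

        private final String ownSuffix;
        /** Tickets this node created and owns. */
        private final Map<String, Object> localTickets = new ConcurrentHashMap<>();
        /** One shadow map per other node, keyed by that node's suffix. */
        private final Map<String, Map<String, Object>> shadowRegistries = new ConcurrentHashMap<>();

        public ShadowRegistrySketch(String ownSuffix) {
            this.ownSuffix = ownSuffix;
        }

        /** Store a ticket this node just created (its id ends in ownSuffix). */
        public void addTicket(String ticketId, Object ticket) {
            localTickets.put(ticketId, ticket);
        }

        public Object getTicket(String ticketId) {
            String suffix = suffixOf(ticketId);
            if (ownSuffix.equals(suffix)) {
                return localTickets.get(ticketId);   // one of our own tickets
            }
            // The ticket was created by another node: consult its shadow registry.
            // A ticket issued in the last few seconds before a crash may not be in
            // the latest checkpoint or incremental, so this can return null.
            Map<String, Object> shadow = shadowRegistries.get(suffix);
            return (shadow == null) ? null : shadow.get(ticketId);
        }

        private static String suffixOf(String ticketId) {
            int dash = ticketId.lastIndexOf('-');
            return (dash < 0) ? "" : ticketId.substring(dash + 1);
        }
    }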

If someone logged into the failed node needs a Service Ticket, the request is routed to one of the surviving nodes, which creates the Service Ticket in its own Ticket Registry (with its own node suffix, so the new ticket is owned by the node that created it and not by the node that owns the login) chained to the copy of the original Login Ticket in the appropriate shadow Ticket Registry. When that ticket is validated, the Front End routes the request, based on the suffix, to this node, which returns the Netid from the Login Ticket in the shadow registry. More interestingly, Cushy can also issue a Proxy Granting Ticket pointing to the Login Ticket on the failed node.
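Continuing the same sketch, issuing a Service Ticket for a login owned by the failed node could look roughly like this (again, every name here is an invented placeholder; real CAS uses its own ticket factory and expiration policies):

    /**
     * Sketch continued from the registry above: the surviving node creates the
     * Service Ticket in its OWN registry, with its OWN suffix, but chains it to
     * the Login Ticket found in the failed node's shadow registry.
     */
    public class CrossNodeGrantSketch {

        public static String grantServiceTicket(ShadowRegistrySketch registry,
                                                String ownSuffix,
                                                String tgtId,
                                                String serviceUrl) {
            // The Login Ticket may live in another node's shadow registry.
            Object loginTicket = registry.getTicket(tgtId);
            if (loginTicket == null) {
                throw new IllegalStateException("Login Ticket not found: " + tgtId);
            }
            // The new id ends in THIS node's suffix, so the Front End will route
            // the later validate request back to this node, which owns the ST.
            String stId = "ST-" + System.nanoTime() + "-" + ownSuffix;
            // A real implementation stores a ServiceTicket object pointing to the
            // Login Ticket; a bare array stands in for that here.
            registry.addTicket(stId, new Object[] { loginTicket, serviceUrl });
            return stId;
        }
    }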

Again, the rule that each node owns its own registry and all the tickets it created, and that the other nodes cannot permanently change those tickets, has certain consequences.

...

You have probably guessed by now that Yale does not use Single Sign Out, and if we ever enabled it we would only indicate that it is supported on a "best effort" basis in the event of a CAS node crash.

CAS Cluster

In this document a CAS "cluster" is just a bunch of CAS server instances that are configured to know about each other. The term "cluster" does not imply that the Web servers are clustered in the sense that they share Session information. Nor does it depend on any other type of communication between machines. In fact, a CAS cluster could be created from a CAS running under Tomcat on Windows and one running under JBoss on Linux.

To the outside world, the cluster typically shares a common virtual URL simulated by the Front End device. At Yale, CAS is "https://secure.its.yale.edu/cas" to all the users and applications. The "secure.its.yale.edu" DNS name is associated with an IP address managed by the BIG-IP F5 device. It terminates the SSL, examines each request, and, based on programming called iRules, forwards it to one of the configured CAS virtual machines.

Each virtual machine has a native DNS name and URL. It is these "native" URLs that define the cluster, because each CAS VM has to use the native URL to talk to another CAS VM. At Yale those URLs follow a pattern of "https://vm-foodevapp-01.web.yale.internal:8443/cas".

Internally, Cushy configuration takes a list of URLs and generates a cluster definition with three pieces of data for each cluster member: a nodename like "vmfoodevapp01" (the first element of the DNS name with dashes removed), the URL, and the ticket suffix that identifies that node (at Yale the F5 likes the ticket suffix to be an MD5 hash of the DNS name).
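A sketch of how that derivation might be coded (the class and method names are invented for this example and may not match the actual Cushy configuration code):

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.List;

    /**
     * Illustrative sketch of deriving a cluster member definition from a native
     * CAS node URL: a nodename, the URL itself, and a ticket suffix.
     */
    public class ClusterDefinitionSketch {

        public static class Member {
            public final String nodeName;   // e.g. "vmfoodevapp01"
            public final String url;        // e.g. "https://vm-foodevapp-01.web.yale.internal:8443/cas"
            public final String suffix;     // e.g. an MD5 hash of the DNS name
            Member(String nodeName, String url, String suffix) {
                this.nodeName = nodeName;
                this.url = url;
                this.suffix = suffix;
            }
        }

        public static List<Member> fromUrls(List<String> urls) throws Exception {
            List<Member> members = new ArrayList<>();
            for (String url : urls) {
                String host = URI.create(url).getHost();        // vm-foodevapp-01.web.yale.internal
                String firstLabel = host.split("\\.")[0];       // vm-foodevapp-01
                String nodeName = firstLabel.replace("-", "");  // vmfoodevapp01
                String suffix = md5Hex(host);                   // F5-friendly ticket suffix
                members.add(new Member(nodeName, url, suffix));
            }
            return members;
        }

        private static String md5Hex(String s) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }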

...

An F5 can be configured to have "sticky" connections between a client and a server. The first time the browser connects to a service name it is assigned any available back-end server. For the next few minutes, subsequent requests from that client to the same service are forwarded to whichever server the F5 assigned to handle the first request.

Intelligent routing is based on tickets, and tickets exist only after you have logged in. While a user is logging in to CAS with the form that takes userid and password, or any other credentials, there is no Ticket: no cookie, no ticket= parameter, none of the features that would trigger the first three rules of the programmable Front End. CAS was designed (for better or worse) to use Spring Web Flow, which keeps information in the Session object during the login process. For Web Flow to work, one of two things must happen:

...

Option 2 is a fairly complex process of container configuration, unless you have already solved this problem and routinely generate JBoss cluster VMs using some canned script. Sticky sessions in the front end are somewhat easier to configure, and obviously they are less complicated than routing requests by parsing the ticket ID string, but any sticky session rule MUST apply only after the first three rules (ticket= suffix, pgt= suffix, or CASTGC suffix) have been tested and found not to apply.

There is another solution, but it involves a CAS modification. Yale made a minor change to the CAS Web Flow to store the data that Web Flow saves in the Session object also in hidden fields of the login form (because it is not secure information). Then there is a check at the beginning of the Web Flow for a POST arriving at the start of the flow, allowing it to jump forward to the step of the Flow that handles the Form submission. That way, if the Form POSTs back to a different server, that server can handle the rest of the login without requiring Session data.
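As a very rough, hypothetical illustration of the general pattern, not the actual Yale modification and not the Spring Web Flow API (all class, method, and field names are invented), the idea is that the non-sensitive flow state travels with the form itself, so any node can resume the login on POST:

    import java.util.Map;

    /**
     * Hypothetical illustration of the "state in hidden form fields" pattern.
     * The non-sensitive state normally kept in the Session is also rendered
     * into the login form, so a POST can be handled by any node.
     */
    public class StatelessLoginFormSketch {

        /** Render the login form, embedding flow state as hidden fields. */
        public static String renderLoginForm(String serviceUrl, String flowState) {
            // A real implementation would HTML-escape these values.
            return "<form method='POST' action='/cas/login'>"
                 + "<input type='hidden' name='service' value='" + serviceUrl + "'/>"
                 + "<input type='hidden' name='flowState' value='" + flowState + "'/>"
                 + "<input type='text' name='username'/>"
                 + "<input type='password' name='password'/>"
                 + "<input type='submit' value='Login'/>"
                 + "</form>";
        }

        /**
         * At the entry point of the flow: if the request is already a form POST
         * (possibly arriving at a different node than the one that rendered the
         * form), jump forward to credential processing instead of expecting the
         * state to be in this node's Session.
         */
        public static String entryPoint(String httpMethod, Map<String, String> params) {
            if ("POST".equalsIgnoreCase(httpMethod) && params.containsKey("username")) {
                return "processCredentials";   // skip ahead in the flow
            }
            return "renderLoginForm";          // normal first visit
        }
    }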

What is a Ticket Registry

...

When the user logs in, CAS creates a ticket that the user can use to create other tickets (a Ticket Granting Ticket or TGT, although a more friendly name for it is the "Login Ticket"). The TGT points to the userid/netid/principal and any attributes of the user. In theory the user can log off, but typically the TGT simply times out and is then discarded.

You can usually get away with believing that the TGT contains the login "Netid", but there is really a chain of objects: the TGT points to an Authentication that points to a Principal that points to the Netid. The TGT can also contain a collection of Attributes used to generate SAML responses. This chain of objects doesn't matter unless you are coding CAS Business Logic or writing a Unit Test.
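A stripped-down sketch of that chain (the real CAS classes have more state and behavior; these are simplified stand-ins written only to show the pointer structure ST -> TGT -> Authentication -> Principal -> Netid):

    import java.util.Map;

    /**
     * Simplified stand-ins for the chain of objects behind a login. The real
     * CAS interfaces carry much more, but the pointer structure is the point:
     * ServiceTicket -> TicketGrantingTicket -> Authentication -> Principal -> netid.
     */
    public class TicketChainSketch {

        public static class Principal {
            public final String id;                        // the netid
            public final Map<String, Object> attributes;   // used to build SAML responses
            Principal(String id, Map<String, Object> attributes) {
                this.id = id;
                this.attributes = attributes;
            }
        }

        public static class Authentication {
            public final Principal principal;
            Authentication(Principal principal) { this.principal = principal; }
        }

        /** The "Login Ticket" (Ticket Granting Ticket). */
        public static class TicketGrantingTicket {
            public final String id;                 // e.g. "TGT-...-nodesuffix"
            public final Authentication authentication;
            TicketGrantingTicket(String id, Authentication authentication) {
                this.id = id;
                this.authentication = authentication;
            }
        }

        /** Short-lived ticket issued for one service and validated once. */
        public static class ServiceTicket {
            public final String id;                 // e.g. "ST-...-nodesuffix"
            public final String service;            // the service= URL
            public final TicketGrantingTicket grantingTicket;
            ServiceTicket(String id, String service, TicketGrantingTicket tgt) {
                this.id = id;
                this.service = service;
                this.grantingTicket = tgt;
            }
        }

        /** Walking the chain to recover the netid during validation. */
        public static String netidOf(ServiceTicket st) {
            return st.grantingTicket.authentication.principal.id;
        }
    }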

With CAS 4 things threaten to become more complicated. With multiple factors of authentication and the possibility of adding new factors to an existing login, the TGT becomes a more interesting and active object.

When the logged in user accesses an application that uses CAS, the application redirects the browser back to CAS to get a Service Ticket. The ST contains the service= URL of the application and it points to the Login TGT.

The Service then connects to CAS using its own HTTP session and validates the ST. The CAS Business Logic finds the ST in the Registry, then follows the pointer in the ST to the TGT, where it finds the Authentication, Principal, and Attributes.
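A minimal sketch of that validation path, reusing the stand-in classes from the earlier sketches (the real Business Logic also enforces expiration, single use, and other checks, and the real /serviceValidate response is built by the CAS view layer):

    /**
     * Simplified validation path built on the illustrative classes above.
     * Expiration, single-use enforcement, and proper XML generation are
     * deliberately omitted; only the registry lookup and chain walk remain.
     */
    public class ValidateSketch {

        public static String validate(ShadowRegistrySketch registry, String ticketId, String service) {
            Object raw = registry.getTicket(ticketId);
            if (!(raw instanceof TicketChainSketch.ServiceTicket)) {
                return failure("INVALID_TICKET", ticketId + " not recognized");
            }
            TicketChainSketch.ServiceTicket st = (TicketChainSketch.ServiceTicket) raw;
            if (!st.service.equals(service)) {
                return failure("INVALID_SERVICE", "service does not match");
            }
            // Follow ST -> TGT -> Authentication -> Principal to get the netid.
            String netid = TicketChainSketch.netidOf(st);
            return "<cas:authenticationSuccess><cas:user>" + netid
                 + "</cas:user></cas:authenticationSuccess>";
        }

        private static String failure(String code, String message) {
            return "<cas:authenticationFailure code='" + code + "'>" + message
                 + "</cas:authenticationFailure>";
        }
    }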

Web applications are traditionally defined in three layers. The User Interface generates the Web pages, displays data, and processes user input. The Business Logic validates requests, verifies inventory, approves the credit card, and so on. The back-end "persistence" layer talks to a database. CAS doesn't sell anything, but it has roughly the same three layers.

...