CAS is a Single Sign-On solution. Internally, the function of CAS is to create, update, and delete a set of objects it calls "Tickets" (a word borrowed from Kerberos). When a user presents a netid and password to log on to CAS, a Login Ticket is created to hold the netid. A partially random string is generated to be the login ticket id; it is sent back to the browser as a cookie to associate the browser with the identity, and it is also used as the key under which the Login Ticket object is saved in a table. Similarly, CAS creates Service Tickets to identify a logged-on user to an application that uses CAS authentication: CAS generates a Service Ticket and passes its id string through the browser to the application, which then presents the ticket id to CAS for validation and receives back the validated netid.
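As a concrete illustration, here is a minimal sketch of that ticket bookkeeping in Java. The class, method, and table names are invented for this example; real CAS has its own TicketGrantingTicket and ServiceTicket classes and registry API.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the ticket lifecycle described above.
public class TicketSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // The table that maps a generated ticket id string to the netid it represents.
    private final Map<String, String> ticketTable = new ConcurrentHashMap<>();

    // Generate a partially random id, e.g. "TGT-<random>-<nodename>".
    private String newTicketId(String prefix, String nodeName) {
        byte[] bytes = new byte[24];
        RANDOM.nextBytes(bytes);
        String random = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        return prefix + "-" + random + "-" + nodeName;
    }

    // Login: store the netid under a new Login Ticket id; the id goes
    // back to the browser as the value of the CASTGC cookie.
    public String login(String netid, String nodeName) {
        String tgtId = newTicketId("TGT", nodeName);
        ticketTable.put(tgtId, netid);
        return tgtId;
    }

    // Service Ticket: an id passed through the browser to the application,
    // which presents it back to CAS for validation.
    public String grantServiceTicket(String tgtId, String nodeName) {
        String stId = newTicketId("ST", nodeName);
        ticketTable.put(stId, ticketTable.get(tgtId));
        return stId;
    }

    // Validation: the application trades the ST id for the validated netid.
    public String validate(String stId) {
        return ticketTable.remove(stId); // service tickets are single-use
    }
}
```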
A standalone CAS server stores ticket objects in memory, but when you add a second CAS server to the network for reliability, then you have to share the ticket objects either by sharing tables in a database or by configuring one of several packages that replicate Java objects over a network.
...
During normal processing the master server is generating tickets, creating checkpoint and increment files, and sharing or sending them to the backup server. The backup server is generating empty checkpoints with no tickets because it has not yet received a request.
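The checkpoint and increment mechanics can be sketched roughly as follows; the file names and the use of plain Java serialization are assumptions for illustration, not Cushy's actual implementation.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of the checkpoint/increment cycle.
public class CheckpointSketch {
    private final List<Serializable> addedSinceCheckpoint = new ArrayList<>();

    // Full checkpoint: serialize every live ticket to one file.
    public void writeCheckpoint(Collection<? extends Serializable> allTickets, File file)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(allTickets));
        }
        addedSinceCheckpoint.clear(); // increments restart after each checkpoint
    }

    // Track tickets created since the last full checkpoint.
    public void recordNewTicket(Serializable ticket) {
        addedSinceCheckpoint.add(ticket);
    }

    // Increment: serialize only the tickets added since the last checkpoint.
    public void writeIncrement(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(addedSinceCheckpoint));
        }
    }
}
```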
Then the master is shut down or crashes. The backup server has in memory a copy of all the tickets generated by the master, except for those from the last few seconds before the crash. When new users log in, it creates new Login Tickets in its own Ticket Registry. When it gets a request for a new Service Ticket for a user who logged into the master, it creates the ST in its own registry (with its own nodename suffix) but connects the ST to the Login Ticket in its copy of the master's Ticket Registry.
...
What happens next depends on how smart the Front End is. If it has been programmed to route requests based on the suffix of the ticket in the login cookie, then users who logged into the backup server during the failure continue to use the backup server, while new users and users who logged in before the crash all go back to the master. If the Front End is programmed to route all requests to the master as long as the master is up, then when the master comes back up the backup server effectively "fails over to the master".
When the master comes up, it reloads its old copy of its ticket registry from before the crash, and it gets a copy of the tickets generated by the backup server while it was down. When it subsequently gets requests from users who logged into the backup server, it resolves those requests using its copy of the backup server's TGT.
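Reloading is the mirror image of writing a checkpoint. A minimal sketch, again assuming plain Java serialization rather than Cushy's actual code:

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of the restart path: reload this node's own
// pre-crash checkpoint, then load the other node's fetched files the same way.
public class RestartSketch {
    @SuppressWarnings("unchecked")
    public List<Serializable> loadCheckpoint(File checkpointFile)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new FileInputStream(checkpointFile))) {
            return (List<Serializable>) in.readObject();
        }
    }
}
```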
This leaves a few residual "issues" that are not really big problems and are deferred until Cushy 2.0. Because each server is the owner of its own tickets, and its Ticket Registry is the authoritative source of status on its own tickets, other nodes cannot make permanent changes to another node's tickets during a failover.
...
A more serious problem occurs for Single Sign Out of users who logged into the backup server while the master was down, in systems where the Front End processor is not programmed to route requests intelligently. When the master reboots and starts handling all new requests, those users have a TGT that is "frozen" to the state it was in when the master rebooted. The master can subsequently create new Service Tickets from that TGT, but Single Sign Out will not know to log the users off from those services when they log off. The current solution is to use Front End programming. Cushy 2.0 may add intelligent TGT migration and merging after a CAS server reboots.
A Smart Front End Cluster
A programmable Front End is adequate for Cushy's needs if it can route requests based on four rules (a sketch of the logic follows the list):
- If the URL "path" is a validate request (/cas/validate, /cas/serviceValidate, etc.) then route to the node indicated by the suffix on the value of the ticket= parameter.
- If the URL is a /proxy request, route to the node indicated by the suffix of the pgt= parameter.
- If the request has a CASTGC cookie, then route to the node indicated by the suffix of the TGT that is the cookie's value.
- Otherwise, or if the node selected by rules 1-3 is down, route to any available CAS node.
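Here is a sketch of those four rules in Java. A real deployment would express this logic in the Front End's own configuration or scripting language (for example, an F5 iRule); the class and helper names, and the assumption that ticket ids end in "-nodename", are illustrative.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the four routing rules.
public class CasRouterSketch {
    private static final Set<String> VALIDATE_PATHS =
            Set.of("/cas/validate", "/cas/serviceValidate", "/cas/proxyValidate");

    private final Set<String> liveNodes;

    public CasRouterSketch(Set<String> liveNodes) {
        this.liveNodes = liveNodes;
    }

    public String route(String path, Map<String, String> query, String castgcCookie) {
        String node = null;
        if (VALIDATE_PATHS.contains(path)) {
            node = nodeSuffix(query.get("ticket"));      // rule 1
        } else if (path.equals("/cas/proxy")) {
            node = nodeSuffix(query.get("pgt"));         // rule 2
        } else if (castgcCookie != null) {
            node = nodeSuffix(castgcCookie);             // rule 3
        }
        if (node != null && liveNodes.contains(node)) {
            return node;
        }
        return liveNodes.iterator().next();              // rule 4: any CAS node
    }

    // The owning node's name is the text after the last "-" in the ticket id.
    private static String nodeSuffix(String ticketId) {
        int dash = (ticketId == null) ? -1 : ticketId.lastIndexOf('-');
        return (dash < 0) ? null : ticketId.substring(dash + 1);
    }
}
```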
So normally all requests go to the machine that created and therefore owns the ticket, no matter what type of ticket it is.
A Load Balanced Cluster
When a CAS server fails, requests for its tickets are assigned to one of the other servers.
When a CAS server receives a request for a ticket owned by another node, it fully activates the other node's shadow Ticket Registry from a copy of that node's checkpoint and incremental files. It then looks up the ticket in that registry and returns it to the CAS Business Logic. A node may not have a copy of tickets issued in the last few seconds, so one or two users may see an error.
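A sketch of that lookup path, with invented interface and class names standing in for Cushy's actual ones:

```java
import java.util.Map;

// Hypothetical sketch: serve a request for a ticket owned by another node
// by activating that node's shadow registry on first use.
interface ShadowRegistry {
    void activateIfNotLoaded();      // deserialize checkpoint + increment files
    Object lookup(String ticketId);  // null if the ticket is too recent to be here
}

public class ClusterLookupSketch {
    private final String myNodeName;
    private final Map<String, ShadowRegistry> shadowsByNode;

    public ClusterLookupSketch(String myNodeName, Map<String, ShadowRegistry> shadows) {
        this.myNodeName = myNodeName;
        this.shadowsByNode = shadows;
    }

    public Object getTicket(String ticketId) {
        // The id suffix names the node that created and owns the ticket.
        String owner = ticketId.substring(ticketId.lastIndexOf('-') + 1);
        if (owner.equals(myNodeName)) {
            return lookupLocally(ticketId);
        }
        ShadowRegistry shadow = shadowsByNode.get(owner);
        shadow.activateIfNotLoaded();
        return shadow.lookup(ticketId);
    }

    private Object lookupLocally(String ticketId) {
        return null; // stands in for the primary registry lookup
    }
}
```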
...
When a node gets a /cluster/notify request from another node, it responds with an "https://servername/cas/cluster/getCheckpoint?ticket=..." request to obtain a copy of the newly generated full checkpoint file. Again, SSL encrypts the data, and the other node's X.509 certificate validates its identity. If the other node sends the data as requested, then the Service Ticket id sent in the notify is valid, and it is stored in the secondary YaleServiceRegistry object associated with that node. Between checkpoints the same ticket id is used as a password to fetch incremental files, but when the next checkpoint is generated there is a new notify with a new ticket id, and the old ticket id is no longer valid. There is not enough time to brute-force the ticket id before it expires and an attacker has to start over.
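The callback side of a notify can be sketched as follows; the class and method names are invented, and only the URL pattern comes from the description above.

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.HttpsURLConnection;

// Hypothetical sketch of the receiving side of a notify: call back to the
// notifying node over HTTPS and fetch the new checkpoint, presenting the
// Service Ticket id from the notify as a one-time password.
public class NotifyCallbackSketch {
    public byte[] fetchCheckpoint(String otherNodeCasUrl, String oneTimeTicketId)
            throws Exception {
        String ticket = URLEncoder.encode(oneTimeTicketId, StandardCharsets.UTF_8);
        URL url = new URL(otherNodeCasUrl + "/cluster/getCheckpoint?ticket=" + ticket);
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        // SSL encrypts the transfer, and the server's X.509 certificate
        // proves we are talking to the real node before we trust the data.
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        }
    }
}
```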
...
Normal Operation
A CAS node starts up. The Spring configuration loads the primary YaleTicketRepository object and creates secondary objects for all the other configured nodes. Each object is configured with a node name, and each secondary object is also configured with the external URL of its node.
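A hypothetical sketch of that startup wiring as Java-based Spring configuration (the real configuration may be Spring XML, and the class and node names here are invented stand-ins for the primary and secondary registry objects):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Minimal stand-in for a primary or secondary registry object.
class RegistryNode {
    final String nodeName;
    final String externalUrl; // null for the primary (local) node

    RegistryNode(String nodeName, String externalUrl) {
        this.nodeName = nodeName;
        this.externalUrl = externalUrl;
    }
}

@Configuration
public class TicketRegistryConfig {
    @Bean
    public RegistryNode primary() {
        // The primary object is configured with this node's name only.
        return new RegistryNode("casnode1", null);
    }

    @Bean
    public RegistryNode casnode2Secondary() {
        // One secondary object per other configured node, with its external URL.
        return new RegistryNode("casnode2", "https://casnode2.example.edu/cas");
    }
}
```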
...