In 2010 Yale upgraded to CAS 3.4.2 and implemented "High Availability" CAS clustering using the JBoss Cache option (because Yale Production Services had standardized on JBoss for "clustering"). Unfortunately, the mechanism designed to improve CAS reliability ended up as the cause of most CAS failures. If you insist that Service Tickets be replicated through the cluster, so that any CAS node can validate any Service Ticket, then replication has to complete before the ST can be passed back to the user. But if CAS has to wait for cache activity, then network problems or some sickness on one of the CAS nodes propagates back to all the nodes and CAS stops working. We considered changing to another option, but none of the alternatives has a spotless reputation for reliability.
...
When the user logs in, CAS creates a ticket that the user can use to create other tickets (a Ticket Granting Ticket or TGT, although a more friendly name for it is the "Login Ticket"). The TGT points to the userid/netid/principal and any attributes of the user. In theory the user can log off, but typically the TGT simply times out and is then discarded.
When the user tries to access a CAS-ified application, CAS generates a Service Ticket (ST). The ST points to the TGT from login and identifies the service (from the service= parameter that contains the application URL). STs have a very short timeout and they are typically used and discarded less than a second after they are created.
Web applications are traditionally defined in three layers. The User Interface generates the Web pages, displays data, and processes user input. The Business Logic validates requests, verifies inventory, approves the credit card, and so on. The backend "persistence" layer talks to a database. CAS doesn't sell anything, but it has roughly the same three layers.
In CAS the "User Interface" layer has two jobs. The part that talks to real users handles login requests through the Spring Web Flow services. However, CAS also accepts Web requests from the applications that are trying to validate a Service Ticket and get information about the user. This is also part of the UI layer, and it is handled by the Spring MVC framework. Cushy extends this second part of the UI so that node to node communication within the cluster also flows through MVC.
The Business Logic layer of CAS verifies the userid and password (typically against an Active Directory) or any other credentials, and it creates and deletes the tickets, typically TGT and ST objects. It also validates Service Tickets and deletes them after use.
In the simple case of a single CAS server, the tickets remain in memory and do not need to be written to disk or sent over the network, so CAS doesn't need any backend services. Nevertheless, the Ticket Registry is positioned logically where the database interface would be in any other application program, and sometimes CAS actually uses a database. CAS defines the TicketRegistry interface, which makes it possible to add database or network function for clustering support.
CAS uses the Spring Framework and Spring XML to configure optional components. You have to provide some object that implements the TicketRegistry interface and configure its classname (and its parameters) in a documented Spring XML file which, not surprisingly, is named "ticketRegistry.xml". JASIG CAS provides at least five alternative Ticket Registries. The DefaultTicketRegistry class implements the interface using just a table in memory. Other TicketRegistry alternatives use JPA or the various "cache" object replication technologies. Given this modular plug-in design, Cushy is just one more option you can configure with this file.
JPA is the current technique for creating Java objects from a database query, updating objects, and committing changes back to the database. To support JPA, the TGT and ST Java objects have "annotations" to define the names of tables and columns that correspond to each object and field. If you don't use the JPA Ticket Registry, these annotations are ignored. If you do use it, JPA uses the annotations to generate, and then weave into these objects, invisible support code that detects when something has changed and tracks connections from one object to the next.
The "cache" versions (Ehcache, JBoss Cache, Memcached) of JASIG TicketRegistry modules have no annotations and few expectations. They use ordinary objects (sometimes called Plain Old Java Objects or POJOs). They require the objects to be serializable because, like Cushy, they use the Java writeObject method to turn any object into a stream of bytes that can be held in memory, stored on disk, or sent over the network.
CAS tickets are all serializable, but they are not designed to be very nice about it. This is the "dirty secret" of CAS. It has always expected tickets to be serialized, but it breaks some of the rules and, as a result, can generate failures. They don't happen often, but CAS runs 24x7 and anything that can go wrong will go wrong. With one of the caching solutions, when it goes wrong it is deep inside a huge black box of "off the shelf" code that may or may not recover from the error.
The purpose of this section is to describe in more detail than you find in other CAS documentation just what is going on here, how Cushy avoids problems, and how Cushy would recover even if something went wrong.
In simple terms, the Login ticket (the TGT) "contains" your Netid (username, principal, whatever you call it). In more detail the TGT points to an Authentication object that points to a Principal object that contains the Netid. Currently, when a user logs on, the TGT, Netid, and any attributes are all determined once and that part of the TGT never changes. In the future, CAS may add higher levels of authentication (secondary "factors") and that might change the important part of the TGT, but that is not a problem now.
However, if you use Single SignOut then CAS also maintains a "services" table in the TGT that associates old used ServiceTicket ID strings with a reference to a Service object that contains the URL that CAS should call to notify a service that a user previously authenticated by CAS has logged out. The services table changes through the day as users log in to applications.
CAS also generates Service Tickets. However, the ST is used and discarded in a few milliseconds during normal use, or if it is never claimed it times out after a default period of 10 seconds. When the ST is validated by the application, CAS returns the Netid, but CAS does not store the Netid in the ST. Instead, it points the ST to the TGT and the TGT "contains" the Netid. When the application validates the ST, CAS goes from the ST to the TGT, gets the Netid, deletes the ST, and returns the Netid to the application.
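The validation path described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: ValidationSketch, LoginTicket, ServiceTicket and their fields are hypothetical stand-ins, not the real CAS classes, and the 10 second constant simply mirrors the default timeout mentioned in the text.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class ValidationSketch {
    // Mirrors the default ST timeout mentioned in the text (assumed value).
    static final long TIMEOUT_MILLIS = 10_000;

    /** Hypothetical login ticket: the only place the Netid is stored. */
    static final class LoginTicket {
        final String id;
        final String netid;
        LoginTicket(String id, String netid) { this.id = id; this.netid = netid; }
    }

    /** Hypothetical service ticket: it points to the TGT instead of holding the Netid. */
    static final class ServiceTicket {
        final String id;
        final String serviceUrl;
        final LoginTicket grantingTicket;
        final long createdMillis;
        ServiceTicket(String id, String serviceUrl, LoginTicket grantingTicket, long createdMillis) {
            this.id = id;
            this.serviceUrl = serviceUrl;
            this.grantingTicket = grantingTicket;
            this.createdMillis = createdMillis;
        }
        boolean isExpired(long nowMillis) { return nowMillis - createdMillis > TIMEOUT_MILLIS; }
    }

    private final Map<String, ServiceTicket> tickets = new ConcurrentHashMap<>();

    void add(ServiceTicket st) { tickets.put(st.id, st); }

    /** Follows ST -> TGT -> Netid and deletes the ST, so it can be used only once. */
    Optional<String> validate(String stId, String serviceUrl, long nowMillis) {
        ServiceTicket st = tickets.remove(stId);
        if (st == null || st.isExpired(nowMillis) || !st.serviceUrl.equals(serviceUrl)) {
            return Optional.empty();
        }
        return Optional.of(st.grantingTicket.netid);
    }

    public static void main(String[] args) {
        ValidationSketch registry = new ValidationSketch();
        LoginTicket tgt = new LoginTicket("TGT-1", "abc123");
        registry.add(new ServiceTicket("ST-1", "https://app.example.edu/", tgt, 0L));
        System.out.println(registry.validate("ST-1", "https://app.example.edu/", 500L));
        System.out.println(registry.validate("ST-1", "https://app.example.edu/", 600L));
    }
}
```

The time is passed in as a parameter only to keep the sketch easy to exercise; a real registry would read a clock.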
So the ST is around for such a short period of time that you would not think it has an important effect on the structure of the Ticket Registry. There are, however, two impacts:
- First, whenever you ask Java writeObject to serialize an object to bytes, Java not only turns that object into bytes but it also makes a copy of any other object it points to. Cushy, Ehcache, JBoss Cache, and Memcached all serialize objects, but only here will you find anyone explaining what that means. When you think you are serializing an ST what you are really getting is an ST, the TGT it points to, the Authentication and Principal objects the TGT points to, and then the Service objects for all the services that the TGT is remembering for Single SignOut. In reality, the only thing the ST needs is the Netid, but because CAS is designed with many layers of abstraction you get this entire mess whether you like it or not.
- Second, if you do not assume that the Front End is smart enough to route validation requests to the right host, then there is a race condition between the cache based ticket replication systems copying the ST to the other nodes and the possibility that the front end will route the ST validation request to one of those other nodes. The only way to make sure this will never happen is to configure the cache replication systems to copy the ST to all the other nodes before returning to the CAS Business Layer to confirm the ST is stored. However, if that network I/O is synchronous and it fails, then CAS stops running as a result.
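The first point can be observed directly. The sketch below uses hypothetical stand-in classes (LoginTicket and ServiceTicket are not the real CAS types, and the services map is simplified to strings) to show that the bytes produced for one ST grow as the TGT it points to accumulates services.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

final class SerializationReachDemo {
    /** Hypothetical TGT: holds the Netid and the Single SignOut services table. */
    static final class LoginTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String netid;
        final HashMap<String, String> services = new HashMap<>();
        LoginTicket(String netid) { this.netid = netid; }
    }

    /** Hypothetical ST: holds only an id and a reference to the TGT. */
    static final class ServiceTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final LoginTicket grantingTicket;
        ServiceTicket(String id, LoginTicket grantingTicket) {
            this.id = id;
            this.grantingTicket = grantingTicket;
        }
    }

    static byte[] serialize(Serializable obj) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);   // follows every reference reachable from obj
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LoginTicket tgt = new LoginTicket("abc123");
        ServiceTicket st = new ServiceTicket("ST-1", tgt);
        int before = serialize(st).length;
        for (int i = 0; i < 50; i++) {
            tgt.services.put("ST-old-" + i, "https://app" + i + ".example.edu/logout");
        }
        int after = serialize(st).length;
        System.out.println("ST bytes with empty services table: " + before);
        System.out.println("ST bytes after 50 services:         " + after);
    }
}
```

The ST itself never changed between the two measurements; only the TGT behind it did.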
A special kind of Service is allowed to "Proxy", to act on behalf of the user. Such a service gets its own Proxy Granting Ticket (PGT) which acts like a TGT in the sense that it generates Service Tickets and the ST points back to it. However, a PGT does not "contain" the Netid. Rather the PGT points to the TGT which does contain the Netid.
When Cushy does a full checkpoint of all the tickets, it doesn't matter how the tickets are chained together. Under the covers of the writeObject method, Java does all the work of following the chains and understanding the structure, then it writes out a blob of bytes that will recreate the exact same structure when you read it back in.
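A small experiment illustrates the difference between writing the whole collection in one writeObject call and writing tickets one at a time. The classes below are hypothetical stand-ins rather than CAS or Cushy code, and an ArrayList is assumed as the container purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;

final class CheckpointDemo {
    static final class LoginTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String netid;
        LoginTicket(String netid) { this.netid = netid; }
    }

    static final class ServiceTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final LoginTicket grantingTicket;
        ServiceTicket(String id, LoginTicket grantingTicket) {
            this.id = id;
            this.grantingTicket = grantingTicket;
        }
    }

    /** Serializes obj to bytes and reads it straight back, as a remote node would. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** One writeObject call over the whole collection keeps the shared TGT shared. */
    static boolean checkpointPreservesSharing() {
        LoginTicket tgt = new LoginTicket("abc123");
        ArrayList<Serializable> registry = new ArrayList<>();
        registry.add(tgt);
        registry.add(new ServiceTicket("ST-1", tgt));
        registry.add(new ServiceTicket("ST-2", tgt));
        ArrayList<Serializable> restored = roundTrip(registry);
        LoginTicket restoredTgt = (LoginTicket) restored.get(0);
        ServiceTicket st1 = (ServiceTicket) restored.get(1);
        ServiceTicket st2 = (ServiceTicket) restored.get(2);
        return st1.grantingTicket == restoredTgt && st2.grantingTicket == restoredTgt;
    }

    /** Tickets written one at a time each get a private copy of the TGT. */
    static boolean individualTicketsShareTgt() {
        LoginTicket tgt = new LoginTicket("abc123");
        ServiceTicket st1 = roundTrip(new ServiceTicket("ST-1", tgt));
        ServiceTicket st2 = roundTrip(new ServiceTicket("ST-2", tgt));
        return st1.grantingTicket == st2.grantingTicket;
    }

    public static void main(String[] args) {
        System.out.println("full checkpoint keeps sharing: " + checkpointPreservesSharing());
        System.out.println("individual tickets share TGT:  " + individualTicketsShareTgt());
    }
}
```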
...
The Serialization Problem of Current CAS
Java Serialization turns an object into a bunch of bytes. Since Java can handle all the ordinary types of data, it can automatically serialize any simple Java class. Ticket objects are declared to be serializable, but there is a problem. The problem has always existed though it has not been well documented.
A Web server handles lots of different HTTP requests from clients at the same time. It assigns a thread to each request. The threads run concurrently, and on modern multicore processors they can run simultaneously.
If an object has a collection (a table or list of objects) that can be updated by these requests, then it has to take some step to make sure that no two requests try to update the collection at the same time. The TGT has a collection of Services to which the user has authenticated (for Single Sign Out) and in CAS 4 it also has a List of Supplemental Authentications. CAS 3 was sloppy about this, but CAS 4 adds "synchronized" methods to protect against concurrent access to these tables by different Web request threads.
Unfortunately, serialization accesses the object and its internal collections without going through any of the synchronized methods. It has to iterate through all the members of the table or the list, and in general it cannot do this in a thread safe manner. Because serialization occurs when some external component (Ehcache, JBoss Cache, ...) decides to do it, and that decision is made deep inside what amounts to a giant black box of code, there is no way to externally guarantee that something won't go wrong.
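The hazard can be reproduced deterministically in a small test. In this sketch, which is only an illustration, a hypothetical Interloper element stands in for a Web request thread that adds to the collection while the collection is being serialized. It relies on java.util.ArrayList, whose serialization checks for modification and throws ConcurrentModificationException; other collection types may not detect the change at all and could write an inconsistent stream instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

final class UnsafeSerializationDemo {
    /**
     * Hypothetical element used only to force the timing: while it is being
     * written, another thread adds an entry to the list that contains it.
     */
    static final class Interloper implements Serializable {
        private static final long serialVersionUID = 1L;
        private final transient List<Object> owner;

        Interloper(List<Object> owner) { this.owner = owner; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            // Stand-in for a concurrent request that updates the table.
            Thread request = new Thread(() -> owner.add("late arrival"));
            request.start();
            try {
                request.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException(e);
            }
        }
    }

    /** Returns true if serialization noticed that the list changed underneath it. */
    static boolean serializationDetectsConcurrentChange() {
        ArrayList<Object> services = new ArrayList<>();
        services.add("https://app.example.edu/");
        services.add(new Interloper(services));
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(services);   // no synchronized method of any ticket is involved
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("concurrent change detected: " + serializationDetectsConcurrentChange());
    }
}
```

In production the timing is not forced like this, which is why such failures are rare and hard to reproduce.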
One solution (that CAS has not implemented yet) is to create a custom serialization function that is synchronized between threads. The code is standard and simple:
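    // Holds this object's lock while its fields are written,
    // so synchronized update methods cannot run at the same time.
    private synchronized void writeObject(ObjectOutputStream s) throws IOException {
        s.defaultWriteObject();
    }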
However, you will quickly find many people warning you that this is, in general, a very dangerous thing to just add to any class. The problem it creates is a threat of Deadlock.
Deadlock occurs when I own object A and need to acquire ownership of object B, while you own object B and request ownership of object A. Neither of us can get what we want, and neither of us will give up the thing the other wants. Any synchronized mechanism is exposed to deadlock unless you can enforce rules on your code to make sure it never happens.
The simplest solution is to prohibit any code from obtaining exclusive ownership of more than one object at a time. If that doesn't work, then the objects have to be obtained in a specific order by universal agreement.
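The "specific order by universal agreement" rule might look like the following sketch. Everything here is hypothetical (the Ticket class, its id field, and the touchBoth helper are invented for illustration), and ordering by ticket id is just one possible agreement. Two threads ask for the same pair of objects in opposite order, but because both sort by id before locking, neither can end up holding the lock the other needs.

```java
final class LockOrderingSketch {
    /** Hypothetical ticket with a unique id used to decide lock order. */
    static final class Ticket {
        final String id;
        int uses;
        Ticket(String id) { this.id = id; }
    }

    /** Locks both tickets in id order, regardless of the order of the arguments. */
    static void touchBoth(Ticket a, Ticket b) {
        Ticket first = a.id.compareTo(b.id) <= 0 ? a : b;
        Ticket second = (first == a) ? b : a;
        synchronized (first) {
            synchronized (second) {
                a.uses++;
                b.uses++;
            }
        }
    }

    /** Runs two threads that request the same pair in opposite order. */
    static int runOpposingThreads(int iterations) {
        Ticket x = new Ticket("TGT-1");
        Ticket y = new Ticket("TGT-2");
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) touchBoth(x, y);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) touchBoth(y, x);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return x.uses + y.uses;
    }

    public static void main(String[] args) {
        System.out.println("total updates: " + runOpposingThreads(10_000));
    }
}
```

If touchBoth locked its arguments in the order given, the same two threads could each take one lock and wait forever for the other.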
CAS only acquires ownership of one object at a time. Serialization would only acquire objects one at a time. Cushy would only acquire ownership of one object at a time. However, who knows what Ehcache, JBoss Cache, Memcached, or other systems do? It is regarded as very bad practice to do disk or network I/O or to use complex services like serialization while holding exclusive ownership of an object. These systems are probably safe, but I lack the resources to prove they are safe.
This leads to a second problem with the "cache" replication mechanisms and CAS 4. In previous versions of CAS the TGT was essentially unmodified after you log in. However, support for multiple factors of authentication creates a need for the supplementalAuthentications array to which additional Authentication objects can be added.
If you serialize the entire registry of tickets, as Cushy does during a full checkpoint, then when you deserialize it you get an exact copy with all the same connections and structure. However, if you serialize an individual ticket, as Cushy does during an incremental and as all the "cache" based object replication systems do for everything, then something happens under the covers.
When you serialize an object, Java also serializes a copy of everything the object points to. This is necessary for simple references to String or Date objects, but it is also done for more complex fields. Every Service Ticket points to a Granting Ticket, and all Proxy Granting Tickets (used by middleware applications like a Portal) point back to the TGT. So when one of these tickets is individually serialized it captures a moment in time copy of the Login TGT. When that ticket is then transmitted to another node in the cluster and is deserialized, it ends up with its own private copy of the TGT on the other node.
Until CAS 4 (and maybe even on CAS 4) this did not matter to the CAS Business Logic layer. When you reference a TGT from a Service Ticket, the only use you make of it is to extract the Netid to send back in the response from an ST validate request. The fact that TGTs don't meaningfully change during the day meant that the copy was as good as the original.
However, once TGTs have modifiable data (such as Supplemental Authentications) then there is a difference in behavior between the node that creates a secondary ticket and any other node that simply receives a copy of that ticket from the replication mechanism. On the original node, the secondary ticket points to the live TGT and it sees immediately any changes to it. The other nodes see only a copy of the original TGT and they will never see any changes to it no matter how long they wait.
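The difference between the live ticket and a replicated copy can be shown with a short sketch. LoginTicket and ProxyTicket here are hypothetical stand-ins, and the list of strings is a simplified placeholder for the Supplemental Authentications held by a real TGT.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;

final class StaleCopyDemo {
    /** Hypothetical TGT with a list that can grow after login. */
    static final class LoginTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String netid;
        final ArrayList<String> supplementalAuthentications = new ArrayList<>();
        LoginTicket(String netid) { this.netid = netid; }
    }

    /** Hypothetical secondary ticket that points back to the TGT. */
    static final class ProxyTicket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        final LoginTicket grantingTicket;
        ProxyTicket(String id, LoginTicket grantingTicket) {
            this.id = id;
            this.grantingTicket = grantingTicket;
        }
    }

    /** Simulates replication of a single ticket to another node. */
    static ProxyTicket replicate(ProxyTicket ticket) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(ticket);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ProxyTicket) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns {live ticket sees the change, replica sees the change}. */
    static boolean[] whoSeesLaterChange() {
        LoginTicket tgt = new LoginTicket("abc123");
        ProxyTicket live = new ProxyTicket("PGT-1", tgt);
        ProxyTicket replica = replicate(live);          // moment-in-time copy of the TGT
        tgt.supplementalAuthentications.add("second-factor");
        return new boolean[] {
            live.grantingTicket.supplementalAuthentications.contains("second-factor"),
            replica.grantingTicket.supplementalAuthentications.contains("second-factor")
        };
    }

    public static void main(String[] args) {
        boolean[] seen = whoSeesLaterChange();
        System.out.println("live ticket sees change: " + seen[0]);
        System.out.println("replica sees change:     " + seen[1]);
    }
}
```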
This is probably not a problem for Service Tickets, because they time out and are deleted so quickly that changes to the TGT don't matter. It could become a problem for Proxy Tickets if it is important for the middleware to obtain information in the Supplemental Authentications without re-login.
Cushy automatically solves this problem every time it takes a full checkpoint. The other nodes obtain a fresh exact copy of all the tickets on the originating node, connected together exactly as they are on that node, with the very latest information.
When Cushy generates an incremental file between full checkpoints, then all the added Tickets in the incremental file are individually serialized, producing the same result as the caching solutions. With Cushy, however, every 5 minutes the full checkpoint comes along and cleans it all up.
...