...

What is a Ticket Registry

This is a rather detailed description of one CAS component, but it does not assume any prior knowledge.

CAS provides a Single SignOn function. It acts as a system component, but internally it is an application structured like most other Web applications. Internally it creates, validates, and deletes objects called Tickets. The Ticket Registry is the component that holds the tickets while CAS is running.

When the user logs on to CAS by presenting a userid and password or other credential, CAS creates a Ticket Granting Ticket (TGT), a ticket the user can use to create other tickets. Then when CAS is asked to authenticate the user to a Web application, it creates a temporary Service Ticket (ST) that is transmitted to the application, and the application validates it by talking to CAS directly.

In terms of traditional "layered" application design, the user interface of CAS is provided by Spring MVC and Spring Web Flow. This layer obtains the userid and password from the user or accepts and processes the ticket validation request from the application. In the middle, the Business Logic of CAS verifies the userid and password against some back end system, frequently Active Directory, and it generates ticket IDs and creates ticket objects.

The Ticket Registry is the back end component. It fits into the layered architecture where normal applications have a database layer, but although CAS can optionally store tickets in a database it more commonly keeps them in memory. The TicketRegistry interface is a pluggable component into which you can insert (using Spring XML configuration) any of several JASIG modules. The CushyTicketRegistry is one implementation of the interface.

The JPA Ticket Registry depends on Java ticket objects, and the objects they point to, being annotated with references to tables and a column name and type for every object field. JPA then generates and "weaves" into your application a lot of invisible support code to track these objects and to notice when they have been modified.

The "cache" versions of JASIG TicketRegistry modules use what are informally called Plain Old Java Objects. Like Cushy, they require these objects to be serializable to a bunch of portable bytes. With that restriction, they work on any type of object, but they work best with objects that are not directly connected (by references) to other objects and especially not to Collections of objects.

Unfortunately, CAS tickets were not designed to serialize nicely. The problem requires a short discussion about ticket relationships.

When a CAS user logs in, CAS creates a Login TGT. The TGT has two interesting chains of objects. It has an "Authentication" that is connected to a "Principal" that points to the Netid string. To support Single SignOut, it also contains a table of used Service Ticket IDs and references to the Service objects that contain the URL CAS contacts to tell each Service that the user has logged out.

When the user authenticates to other applications, the short lived Service Ticket points to the Login TGT. When the ST is validated, CAS follows the pointer from the ST to the TGT and then to the Authentication to the Principal to the Netid string, which it returns as part of the validation message.

A special kind of Service is allowed to Proxy. Such a service gets its own Proxy Granting Ticket which acts like a TGT in the sense that it generates Service Tickets and the ST points back to it, but a PGT does not have the Netid itself. Rather it points back to the real TGT that contains the Authentication data. So when validating a proxy ST, CAS follows the pointer in the ST to the PGT, then follows the pointer from the PGT to the Login TGT and finds the Netid there.

Unfortunately, neither this chain of tickets nor the Single SignOut Service table is the sort of thing you want to have when you use a standard cache mechanism to replicate Plain Old Java Objects. The mechanism will still work, but with consequences that may or may not be important depending on how you use them.

When Cushy does a full checkpoint of all the tickets, it doesn't matter how the tickets are chained together. Under the covers of the writeObject statement, Java does all the work of following the chains and understanding the structure, then it writes out a blob of bytes that will recreate the exact same structure.

The potential for problems arises when you try to serialize a single ticket, as Cushy does during incrementals and as the standard JASIG cache solutions do all the time for all the tickets.

Serialization makes a copy not just of the object you are trying to serialize, but also of all the objects that it points to directly or indirectly.

If you serialize a TGT, then you also make a copy of the Authentication, the Principal, and the Netid string. But then you also generate a copy of the Single SignOut service table and, therefore, all the Services.

If you serialize a ST you get all its data, but because it points to a TGT you also get a copy of the TGT and all its stuff.
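A tiny sketch illustrates the point. The `Tgt` and `St` classes here are simplified, hypothetical stand-ins for the real CAS ticket classes: serializing the ST drags the TGT along, and deserializing produces a private copy of it.

```java
import java.io.*;

// Simplified, hypothetical stand-ins for the real CAS ticket classes.
class Tgt implements Serializable {
    String netid;
    Tgt(String netid) { this.netid = netid; }
}

class St implements Serializable {
    Tgt grantingTicket;            // the back-pointer that drags the TGT along
    St(Tgt tgt) { this.grantingTicket = tgt; }
}

public class SerializeDemo {
    // The standard serialize-then-deserialize deep-copy idiom.
    static Object roundTrip(Object o) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(o);
        oos.flush();
        return new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        Tgt tgt = new Tgt("user1");
        St st = new St(tgt);
        St copy = (St) roundTrip(st);
        // The copy carries its own private TGT, not a reference to the original.
        System.out.println(copy.grantingTicket == tgt); // false
        System.out.println(copy.grantingTicket.netid);  // user1
    }
}
```

The same round trip is exactly what happens when a cache solution replicates a single ticket to another node.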

Service Tickets last for such a short period of time that it really doesn't matter how they serialize. Proxy Granting Tickets, however, also point to a TGT and therefore they serialize and generate a copy of the TGT they point to, and they live for a very long time.

Now it is possible to describe two problems:

When you serialize a collection, Java must internally obtain an iterator and step one by one through the objects in the collection. Unless the collection is inherently thread safe, serialization can break if another thread adds or deletes elements while serialization is iterating. For example, suppose someone is logged into CAS and presses the browser's "Open All In Tabs" button to create several tabs simultaneously. This can launch two CAS logon requests at the same time for the same user (and therefore the same TGT). One request is handled first: it creates a Service Ticket and adds an entry to the Service table of the TGT. If you use one of the JASIG "cache" solutions, each of them will then try to serialize the ST to copy it to the other nodes, and to do that it also has to make a copy of the TGT. Meanwhile, another tab has generated another request under another thread that is trying to create a second Service Ticket. Following the same process, that thread will at some point try to add a new Service to the Service table in the TGT, and every so often this will collide with Java serialization trying to iterate through the objects in that table to turn them into a blob of bytes. The second thread adds an entry that can, on occasion, invalidate the iterator and cause the serialization operation to throw an exception.

If you do not use Single SignOut you can, like Yale, simply disable use of the services table in the TGT. Otherwise you might solve the problem by editing TicketGrantingTicketImpl.java and changing the type of "services" from HashMap, which is not thread safe, to Hashtable, which is.
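The failure mode is easy to reproduce even single-threaded, because HashMap iterators are fail-fast: any structural modification during iteration throws ConcurrentModificationException, which is exactly what serialization hits when another thread adds a service mid-iteration. (Hashtable avoids this because its serialization runs under the table's lock, so a concurrent put simply waits.) This sketch simulates the collision; the keys and URLs are made up for illustration.

```java
import java.util.*;

public class CmeDemo {
    public static void main(String[] args) {
        // Stand-in for the TGT's Single SignOut services table.
        Map<String, String> services = new HashMap<>();
        services.put("ST-1", "https://app1.example.edu/logout");
        services.put("ST-2", "https://app2.example.edu/logout");
        try {
            for (String key : services.keySet()) {
                // Simulates a second request thread adding a service
                // while serialization is iterating the same map.
                services.put("ST-3", "https://app3.example.edu/logout");
            }
            System.out.println("no error");
        } catch (ConcurrentModificationException e) {
            // The fail-fast iterator detects the structural change.
            System.out.println("ConcurrentModificationException");
        }
    }
}
```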

The second problem is a "feature" that you should understand. A Ticket Registry stores ticket objects and retrieves them using their ticket Id string as a key. So if you go to the Ticket Registry with the ID of a TGT you get a TGT object. If you create a new Service Ticket using that TGT, then the ST points to that TGT object and both end up stored in the Registry on the node that created them.

Now if you are using one of the JASIG cache mechanisms, or during the period while Cushy is generating incremental files between full checkpoints, any attempt to serialize the Service Ticket as a single object produces a blob of bytes that, when it is turned back into objects on another node, produces an ST pointing to a copy of the original TGT.

This is not a problem because the CAS business logic only cares about the data in the TGT; it doesn't care whether it has the official TGT object or an exact copy of it. Besides, Service Tickets are validated or time out and they are gone a few seconds after they are created, so this is never going to be a long term problem.

Except for Proxy Granting Tickets that also point to a TGT. They continue to exist for hours, and if the PGT is serialized and replicated to another node it will, on that node, have its own private copy of the TGT that is a point in time snapshot of what was in the TGT at the time the proxy was created.
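The snapshot behavior can be sketched in a few lines. The classes here are hypothetical stand-ins, and the mutable `authLevel` field is purely illustrative (today's TGT does not actually change after login): a change made to the original TGT after replication is invisible to the copy.

```java
import java.io.*;

// Hypothetical, simplified stand-ins; real CAS tickets are more complex.
class LoginTicket implements Serializable {
    String authLevel = "password";
}

class ProxyTicket implements Serializable {
    LoginTicket tgt;               // the PGT points back to the real TGT
    ProxyTicket(LoginTicket tgt) { this.tgt = tgt; }
}

public class SnapshotDemo {
    static byte[] toBytes(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(o);
        oos.flush();
        return bos.toByteArray();
    }

    static Object fromBytes(byte[] b) throws Exception {
        return new ObjectInputStream(new ByteArrayInputStream(b)).readObject();
    }

    public static void main(String[] args) throws Exception {
        LoginTicket tgt = new LoginTicket();
        ProxyTicket pgt = new ProxyTicket(tgt);

        byte[] replicated = toBytes(pgt);   // "replicated to another node"
        tgt.authLevel = "password+mfa";     // TGT changes later (hypothetical)

        ProxyTicket onOtherNode = (ProxyTicket) fromBytes(replicated);
        // The copy still sees the point-in-time snapshot, not the update.
        System.out.println(onOtherNode.tgt.authLevel); // password
    }
}
```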

Cushy solves this problem as soon as the next full checkpoint is written and restored. By serializing the entire collection of tickets on the node that owns them, Cushy creates an exact duplicate object structure on all the other nodes. As for the other cache solutions, today the information in a TGT does not change in any way that could affect the processing of a Proxy ticket. Some time down the line, when we start to use multifactor authentication and begin to add additional authentications dynamically to an existing logged on user, it may be important for a PGT to reference the current TGT with all its current information rather than a copy of the TGT with only the information it had when the user first connected to the middleware application.

Usage Pattern

Users start logging into CAS at the start of the business day. The number of TGTs begins to grow.

Users seldom log out of CAS, so TGTs typically time out instead of being explicitly deleted.

Users abandon a TGT when they close the browser. They then get a new TGT and cookie when they open a new browser window.

Therefore, the number of TGTs can be much larger than the number of real CAS users. It is a count of browser windows and not of people or machines.

At Yale around 3 PM a typical set of statistics is:

Unexpired-TGTs: 13821
Unexpired-STs: 12
Expired TGTs: 30
Expired STs: 11

So you see that a Ticket Registry is overwhelmingly a place to keep TGTs (in these statistics TGTs and PGTs are combined).

After work, and then again after the students go to sleep, the TGTs from earlier in the day start to time out and the Registry Cleaner deletes them.

So generally the pattern is a slow growth of TGTs while people are using the network application, followed by a slow reduction of tickets while they are asleep, with a minimum probably reached each morning before 8 AM.

If you display CAS statistics periodically during the day you will see a regular pattern and a typical maximum number of tickets in use "late in the day".

Translated to Cushy, the cost of the full checkpoint and the size of the checkpoint file grow over time along with the number of active tickets, and then the file shrinks over night. During any period of intense login activity the incremental file may be unusually large. The worst possible configuration of Cushy would be to generate a checkpoint at 8 AM when the number of tickets is at a minimum, and then to have a period of hours between checkpoints when you might get a lot of activity from people waking up and arriving at work that loads up the incremental file with more and more stuff.

This suggests that the Ticket Registry mechanism should be designed to expect that after any configured period of use, most of the tickets that were there at the beginning of the period are probably still there, that the ones that aren't there are much more likely to have expired (which you can determine from the timestamp) than to have been manually deleted (by explicit CAS logoff), and that there is a block of incremental new tickets accumulated during the period.

This is a fairly predictable pattern that could, over time, produce increasingly efficient and highly optimized replication strategies. The current CushyTicketRegistry was written in about 3 days of coding and is good enough to be usable as is. As the following sections will show, the problem is not really all that big.

Web applications are traditionally defined in three layers. The User Interface generates the Web pages, displays data, and processes user input. The Business Logic validates requests, verifies inventory, approves the credit card, and so on. The backend "persistence" layer talks to a database. CAS doesn't sell anything, but it has roughly the same three layers.

The CAS User Interface uses Spring MVC and Spring Web Flow to log a user on and to process requests from other Web applications. The Business Logic validates the userid and password (typically against an Active Directory), and it creates and deletes the tickets. CAS tickets, however, typically remain in memory and do not need to be written to a database or disk file. Nevertheless, the Ticket Registry is positioned logically where the database interface would be in any other application program, and sometimes CAS actually uses a database.

CAS was written to use the Spring Java Framework to configure its options. CAS requires some object that implements the TicketRegistry function. JASIG CAS provides at least five alternative Ticket Registries. You pick one and then insert its name (and configure its parameters) using a documented Spring XML file which not surprisingly is named "ticketRegistry.xml". Given this modular plug-in design, Cushy is just one more option you can optionally configure with this file.
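A ticketRegistry.xml bean definition along these lines selects the implementation. This is a sketch only: the bean id follows the usual CAS convention, but the class name and property shown here are illustrative placeholders; check the actual package and parameters of the registry you deploy.

```xml
<!-- Sketch: class and property names are illustrative, not verified values. -->
<bean id="ticketRegistry"
      class="edu.yale.cas.ticket.registry.CushyTicketRegistry">
    <property name="checkpointInterval" value="300"/>
</bean>
```

Swapping registries is then just a matter of replacing this one bean definition.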

When you have a regular Web application that sells things, the objects in the application (products, inventory, orders) would be stored in a database and the most modern way to do this is with JPA. To support the JASIG JPA Ticket Registry, all the Java source for tickets and things that tickets contain or point to are annotated with references to database tables and the names and data types of the columns in the table that each data field maps to. If you don't use the JPA Ticket Registry these annotations are ignored. JPA uses the annotations to generate and then weave into these objects invisible support code to detect when something has changed and track connections from one object to the next.

The "cache" versions (Ehcache, JBoss Cache, Memcached) of JASIG TicketRegistry modules have no annotations and few expectations. They use ordinary objects (sometimes called Plain Old Java Objects or POJOs). They require the objects to be serializable because, like Cushy, they use the Java writeObject statement to turn any object into a stream of bytes that can be held in memory, stored on disk, or sent over the network.

CAS tickets are all serializable, but they are not designed to be very nice about it. This is the "dirty secret" of CAS. It has always expected tickets to be serialized, but it breaks some of the rules and, as a result, can generate failures. They don't happen often, but CAS runs 24x7 and anything that can go wrong will go wrong. With one of the caching solutions, when it goes wrong it is deep inside a huge black box of "off the shelf" code that may or may not recover from the error.

The purpose of this section is to describe in more detail than you find in other CAS documentation just what is going on here, how Cushy avoids problems, and how Cushy would recover even if something went wrong.

In simple terms, the Login ticket (the TGT) "contains" your Netid (username, principal, whatever you call it). In more detail, the TGT points to an Authentication object that points to a Principal object that contains the Netid. Currently, when a user logs on, the TGT, Netid, and any attributes are all determined once and that part of the TGT never changes. In the future, CAS may add higher levels of authentication (secondary "factors") and that might change the important part of the TGT, but that is not a problem now.

However, if you use Single SignOut then CAS also maintains a "services" table in the TGT that associates old used ServiceTicket ID strings with a reference to a Service object containing the URL that CAS should call to notify a service that a user previously authenticated by CAS has logged out. The services table changes through the day as users log in to applications.

CAS also generates Service Tickets. However, the ST is used and discarded in a few milliseconds during normal use, or if it is never claimed it times out after a default period of 10 seconds. When the ST is validated by the application, CAS returns the Netid, but CAS does not store the Netid in the ST. Instead, it points the ST to the TGT and the TGT "contains" the Netid. When the application validates the ST, CAS goes from the ST to the TGT, gets the Netid, deletes the ST, and returns the Netid to the application.
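The validation path just described can be sketched in a few lines. These tiny classes stand in for the real CAS ticket API and are not its actual names: look up the ST, delete it (an ST is single-use), follow the pointer to the TGT, and return the Netid found there.

```java
import java.util.*;

// Illustrative stand-ins for the real CAS ticket classes.
class GrantingTicket {
    String netid;
    GrantingTicket(String netid) { this.netid = netid; }
}

class ServiceTicket {
    GrantingTicket grantor;        // ST points to the TGT, not to the Netid
    ServiceTicket(GrantingTicket t) { this.grantor = t; }
}

public class ValidateDemo {
    // Validate: remove the ST from the registry (single-use), then follow
    // the chain ST -> TGT -> Netid and return the Netid.
    static String validate(Map<String, ServiceTicket> registry, String stId) {
        ServiceTicket st = registry.remove(stId);
        return (st == null) ? null : st.grantor.netid;
    }

    public static void main(String[] args) {
        Map<String, ServiceTicket> registry = new HashMap<>();
        registry.put("ST-42", new ServiceTicket(new GrantingTicket("user1")));
        System.out.println(validate(registry, "ST-42")); // user1
        System.out.println(validate(registry, "ST-42")); // null: already used
    }
}
```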

So the ST is around for such a short period of time that you would not think it has an important effect on the structure of the Ticket Registry. There are, however, two impacts:

  1. First, whenever you ask Java writeObject to serialize an object to bytes, Java not only turns that object into bytes but it also makes a copy of any other object it points to. Cushy, Ehcache, JBoss Cache, and Memcached all serialize objects, but only here will you find anyone explaining what that means. When you think you are serializing an ST what you are really getting is an ST, the TGT it points to, the Authentication and Principal objects the TGT points to, and then the Service objects for all the services that the TGT is remembering for Single SignOut. In reality, the only thing the ST needs is the Netid, but because CAS is designed with many layers of abstraction you get this entire mess whether you like it or not.
  2. If you do not assume that the Front End is smart enough to route validation requests to the right host, then there is a race condition between the cache based ticket replication systems copying the ST to the other nodes and the possibility that the front end will route the ST validation request to one of those other nodes. The only way to make sure this never happens is to configure the cache replication system to copy the ST to all the other nodes before returning to the CAS Business Layer to confirm the ST is stored. However, that makes the network I/O synchronous, so if replication fails then CAS stops running as a result.

A special kind of Service is allowed to "Proxy", to act on behalf of the user. Such a service gets its own Proxy Granting Ticket (PGT) which acts like a TGT in the sense that it generates Service Tickets and the ST points back to it. However, a PGT does not "contain" the Netid. Rather the PGT points to the TGT which does contain the Netid.

When Cushy does a full checkpoint of all the tickets, it doesn't matter how the tickets are chained together. Under the covers of the writeObject statement, Java does all the work of following the chains and understanding the structure, then it writes out a blob of bytes that will recreate the exact same structure when you read it back in.

The caching solutions never serialize the entire Registry. They write single tickets one at a time, except that as we have seen, a single ST or PGT points to a TGT that points to a lot of junk and all that gets written out every time you think you are serializing a "single ticket".

When Cushy generates an incremental file between full checkpoints, then all the added Tickets in the incremental file are individually serialized, producing the same result as the caching solutions. With Cushy, however, every 5 minutes the full checkpoint comes along and cleans it all up.

The reason why CAS can tolerate this sloppy serialization is that it doesn't affect the Business Logic. Suppose an ST is serialized on one node and is sent to another node where it is validated. Validation follows the chain from the ST to the TGT and then gets the Netid (and maybe the attributes). The result is the same whether you obtain the Netid from the "real" TGT or a copy of the real TGT made a few seconds ago. Once the ST is validated it is deleted, and that also discards all the other objects chained off the ST by the caching mechanism. If it isn't validated, then the ST times out and is deleted anyway.

If you have a PGT that points to a TGT, and if the PGT is serialized and copied to another node, and if after it is copied the TGT is changed (which cannot happen today but might be something CAS does in a future release with multifactor support), then the copy of the PGT points to the old copy of the TGT with the old info while the original PGT points to the original TGT with the new data. This problem would have to be solved before you introduce any new CAS features that meaningfully change the TGT.

Cushy solves this currently non-existent problem every time it does a full checkpoint. Between checkpoints, only for the tickets added since the last checkpoint, Cushy creates copies of TGTs from the individually serialized STs and PGTs just like the caching systems. It creates a lot fewer of them and they last only a few minutes.

Now for the real problem that CAS has not solved.

When you serialize a collection, Java must internally obtain an "iterator" and step one by one through the objects in the collection. An iterator knows how to find the next or previous object in the collection. However, the iterator can break if while it is dealing with one element in the collection another thread is adding a new element to the collection "between" the object that serialization is currently processing and the object that the iterator expects to be next. When this happens, serialization stops and throws an error exception.

So if you are going to use a serialization based replication mechanism (like Ehcache, JBoss Cache, or Memcached) then it is a really, really bad idea to have a non-threadsafe collection in your tickets, such as the services table in the TGT used for Single SignOut. Collisions don't happen all that often, but as it turns out a very common user behavior can make them much more likely.

Someone presses the "Open All In Tabs" button of the browser to create several tabs simultaneously. Two tabs reference CAS aware applications that redirect the browser to CAS. The user is already logged on, so each tab only needs a Service Ticket. The problem is that both Service Tickets point to the same TGT, and both go into the services table for Single SignOut, and the first one to get generated can start to be serialized while the second one is about to add its new entry in the services table.

Yale does not use Single SignOut, so we simply disabled the services table. If you want to solve this problem then at least Cushy gives you access to all the code, so you can come up with a solution if you understand Java threading.


Some Metrics

At Yale there are typically more than 10,000 and fewer than 20,000 Login tickets. Because Service Tickets expire when validated and after a short timeout, there are only several dozen unexpired Service Tickets at any given time.

...

Of course, Cushy also has to deserialize tickets from the other nodes. Remember, however, that if you are currently using any other Ticket Registry, the number of tickets reported on the statistics page is the total combined across all nodes, while Cushy serializes only the tickets that the current node owns and deserializes the tickets for the other nodes. So you can generally apply the rule of thumb that 20K tickets = 1 second to estimate the overhead of converting to Cushy, and the number does seem to scale: serializing 200,000 tickets has been measured to take 9 seconds. If you convert a common pool of 20K tickets to Cushy, then each node will serialize the 10K tickets it owns and deserialize the 10K tickets from the other node (load balanced); or, in a master-backup configuration, the master will serialize 20K tickets and deserialize 0, while the backup serializes 0 and deserializes 20K. You come to the same number no matter how you slice it.

Incrementals are trivial (.1 to .2 seconds).

...