
Any cluster of Web Servers requires some sort of Front End device to screen and forward network traffic. Ten years ago this was a computer with some fairly simple programming. Common strategies for distributing requests:

  • Round Robin - A list of the servers is maintained. Each request is assigned to the next computer in the list. When you get to the end of the list, you wrap back to the beginning.
  • Master/Spare - All requests go to the first computer until it fails. Then requests go to the backup computer until the Master comes back up.
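A minimal sketch of the Round Robin strategy in Java (the class name and the list of back-end addresses are made up for illustration; a real Front End implements this inside the device, not in application code):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinSelector {
    private final List<String> servers;              // back-end addresses, e.g. "10.0.0.11:8080"
    private final AtomicInteger next = new AtomicInteger(0);

    public RoundRobinSelector(List<String> servers) {
        this.servers = servers;
    }

    // Each request gets the next server in the list, wrapping back to the start at the end.
    public String pick() {
        return servers.get(Math.floorMod(next.getAndIncrement(), servers.size()));
    }
}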

There is a problem when different requests from the same client are randomly distributed to different servers. The servers may be retaining some information about the client (like a "shopping basket" in eCommerce). This information is tied to a Session object, and so some clustering technology is designed to replicate the Session object and all its data to all the other servers that run the same application. JBoss servers do this when you turn on their clustering feature.

However, designers of Front End devices quickly learned that they would have a competitive advantage if they learned a little bit about the common application and Web servers and made more intelligent decisions that optimized the performance of the servers and the customer applications they support. For example, if the Front End knows about the JSESSIONID parameter used by Java Web servers to identify Session data, then the Front End can send subsequent requests back to the same server that handled previous requests for the same transaction. Now the Web server doesn't have to replicate Session information frantically in order to handle the next Web page request, although it may back up data to recover from a server crash.

As chips and memory became less expensive, Front End devices became more powerful and added even more features to protect and optimize applications. Today, these devices track repetitive activity from very active IP addresses to detect and suppress Denial of Service attacks. With "deep packet inspection" they can identify attempts to hack a system. They can remove HTTP Headers regarded as inappropriate and add Headers of their own. Vendors may provide fairly complex services for very widely used programs, and local network administrators can add their own code in some high-level language for applications that are locally important.

For example, users at Yale know that CAS is "https://secure.its.yale.edu/cas". In reality, DNS resolves secure.its.yale.edu to 130.132.35.49 and that is a Virtual IP address (a VIP) on the BIG-IP F5. The VIP requires configuration, because the F5 has to hold the SSL Certificate for "secure.its.yale.edu" and manage the SSL protocol.

Yale decided to make other security applications appear to run on the secure.its.yale.edu machine, even though each application has its own pool of VMs. So the F5 has to examine the URL to determine whether the server name is followed by "/cas", and therefore goes to the pool of CAS VMs, or by "/idp", and therefore goes to the Shibboleth pool of VMs. Because it is the F5 that is really talking to the browser, it has to create a special Header carrying the browser's IP address in case that information is important to the Web server.
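Sketched in Java rather than in the F5's own rule language, that decision looks roughly like this (the pool names are made up for illustration; X-Forwarded-For is the conventional header for passing along the browser's address):

import java.util.Map;

public class FrontEndRouting {
    // Pick a pool of VMs from the path part of the URL.
    static String choosePool(String path) {
        if (path.startsWith("/cas")) return "cas-vm-pool";          // hypothetical pool name
        if (path.startsWith("/idp")) return "shibboleth-vm-pool";   // hypothetical pool name
        return "default-pool";
    }

    // The F5, not the browser, is the machine the Web server sees, so the browser's
    // real IP address is passed along in a header before the request is forwarded.
    static void addClientAddress(Map<String, String> forwardedHeaders, String browserIp) {
        forwardedHeaders.put("X-Forwarded-For", browserIp);
    }
}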

The amount of logic in the Front End can be substantial. Suppose CAS wants to use X.509 User Certificates installed in the browser as an authentication credential. CAS has an entire X509 support module, but that depends on the browser talking to the CAS server directly. If the Front End is terminating the SSL/TLS session itself, then it has to be configured with all the standard information needed by any Web component that handles user certificates.

There has to be a special list of "Trusted" Certificate Authorities from which User Certificates will be accepted. The Front End has to tell the browser that certificates are required or optional. The signature in the submitted Certificate has to be validated against the Trusted CA list. The Certificate has to be ASN.1 decoded, and then the DN and/or one or more subjectAltNames have to be extracted and turned into new HTTP headers that can be forwarded to the application. All this is fairly standard stuff, and it is typically part of the built-in code for any Front End device.
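In Java terms, the last step (turning certificate fields into headers) looks roughly like the sketch below. The header names here are invented for illustration, since each Front End product has its own conventions:

import java.security.cert.CertificateParsingException;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;

public class CertificateToHeaders {
    // Turn the fields of an already-validated user certificate into name/value header pairs.
    static List<String[]> extractHeaders(X509Certificate cert) throws CertificateParsingException {
        List<String[]> headers = new ArrayList<>();
        // The subject DN, e.g. "CN=Jane Doe,OU=Faculty,O=Yale University,C=US"
        headers.add(new String[] { "X-Client-DN", cert.getSubjectX500Principal().getName() });
        // Each subjectAltName is a (type, value) pair; the value is at index 1.
        if (cert.getSubjectAlternativeNames() != null) {
            for (List<?> san : cert.getSubjectAlternativeNames()) {
                headers.add(new String[] { "X-Client-SubjectAltName", String.valueOf(san.get(1)) });
            }
        }
        return headers;
    }
}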

Modern Front End systems can select specific servers from the pool based on data in the URL or Headers or based on the recent history of requests from that client device. Requests from phones could go to a different pool of servers than requests from PCs. If the Front End is going to all the trouble of decoding the X.509 User Certificate, then it could select servers based on organizational unit or geographic information that it contains. The application can write out a Cookie that the Front End can subsequently use to select specific servers for the next request.

So teaching a Front End about the CAS protocol is not that big a deal. Commercial sites do this all the time, but they don't run CAS. Universities probably spend less time optimizing eCommerce applications, so they may not normally think about Front End devices this way.

First, however, we need to understand the format of CAS ticketids because that is where the routing information comes from:

type - num - random - suffix

where type is "TGT" or "ST", num is a ticket sequence number, random is a large random string like "dmKAsulC6kggRBLyKgVnLcGfyDhNc5DdGKT", and the suffix at the end is configured in the uniqueIdGenerators.xml file.
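Given that format, pulling the routing suffix out of a ticket id is just a matter of taking everything after the last "-", as in this small sketch (the example ticket id is made up):

public class TicketSuffix {
    // "ST-42-dmKAsulC6kggRBLyKgVnLcGfyDhNc5DdGKT-casvm01" -> "casvm01"
    static String suffixOf(String ticketId) {
        int dash = ticketId.lastIndexOf('-');
        return (dash < 0) ? "" : ticketId.substring(dash + 1);
    }
}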

A typical XML configuration for a particular type of ticket (when you use Cushy) looks like this:

<bean id="ticketGrantingTicketUniqueIdGenerator" class="org.jasig.cas.util.DefaultUniqueTicketIdGenerator">
    <!-- maximum length of the random part of the ticket id -->
    <constructor-arg index="0" type="int" value="50" />
    <!-- node suffix, supplied by the CushyClusterConfiguration bean -->
    <constructor-arg index="1" value="#{clusterConfiguration.getTicketSuffix()}" />
</bean>

The suffix value, which is the index="1" argument to the Java object constructor, is obtained with a Spring "EL" expression as the ticketSuffix property of the bean named clusterConfiguration. That is the CushyClusterConfiguration object, which scans the configured cluster definitions to determine which cluster the server is running in and what name and IP address it uses. By feeding the output of clusterConfiguration directly into the Ticket ID Generator, this approach keeps the configuration simple and ensures that all the machines come up configured properly. There is special logic in Cushy for an F5, which, for some reason, likes to identify hosts by the MD5 hash of the character representation of their decimal/dotted IPv4 address.
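For the F5 case, that hash can be computed along these lines (a sketch only; whether your F5 wants upper or lower case hex, and of exactly which string, is something to confirm against its configuration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class F5Suffix {
    // MD5 of the dotted-decimal IPv4 address string, rendered as lower-case hex,
    // e.g. md5Of("130.132.35.49").
    static String md5Of(String dottedIpAddress) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(dottedIpAddress.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}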

Every CAS request except the initial login comes with one or more tickets located in different places in the request. There is a sequence of tests and you stop at the first match:

  1. If the Path part of the URL is a validate request (/cas/validate, /cas/serviceValidate, /cas/proxyValidate, or /cas/samlValidate) then look at the ticket= parameter in the query string part of the URL
  2. Otherwise, if the Path part of the URL is a /cas/proxy request, then look at the pgt= parameter in the query string.
  3. Otherwise, if the request has a CASTGC cookie, then look at the cookie value.
  4. Otherwise, the request is probably in the middle of login so use any built in Front End support for JSESSIONID.
  5. Otherwise, or if the node selected by 1-4 is down, choose any CAS node from the pool.
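Expressed as a minimal Java sketch (using the servlet API for readability; a real Front End writes the same tests in its own rule language), the tests look like this:

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

public class CasTicketLocator {
    // Returns the ticket id whose suffix identifies the owning CAS node, or null if the
    // request should fall through to ordinary JSESSIONID or pool routing (cases 4 and 5).
    static String locateTicket(HttpServletRequest request) {
        String path = request.getRequestURI();

        // 1. Validate requests carry the ticket id in the ticket= parameter.
        if (path.endsWith("/validate") || path.endsWith("/serviceValidate")
                || path.endsWith("/proxyValidate") || path.endsWith("/samlValidate")) {
            return request.getParameter("ticket");
        }
        // 2. A /proxy call carries the Proxy Granting Ticket in the pgt= parameter.
        if (path.endsWith("/proxy")) {
            return request.getParameter("pgt");
        }
        // 3. Otherwise the browser's CASTGC cookie holds the TGT id.
        if (request.getCookies() != null) {
            for (Cookie cookie : request.getCookies()) {
                if ("CASTGC".equals(cookie.getName())) {
                    return cookie.getValue();
                }
            }
        }
        return null;
    }
}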

That is the routing logic; now here is the explanation of each case:

  1. After receiving a Service Ticket ID from the browser, an application opens its own HTTPS session to CAS and presents the ticket id in a "validate" request. If the id is valid, CAS passes back the Netid, and in certain requests can pass back additional attributes. This request is best handled by the server that issued the Service Ticket.
  2. When a middleware server like a Portal has obtained a CAS Proxy Granting Ticket, it requests CAS to issue a Service Ticket by opening its own HTTPS connection to CAS to make a /proxy call. Since the middleware is not a browser, it does not have a Cookie to hold the PGT. So it passes that ticketid explicitly in the pgt= parameter. This request is best handled by the server that created the Proxy Granting Ticket.
  3. After a user logs in, CAS creates a Login TGT that points to the Netid and attributes and writes the ticket id of the TGT to the browser as a Cookie. The Cookie is sent back from the browser in any request to "https://secure.its.yale.edu/cas". After initial login, all requests with cookies are requests to issue a Service Ticket for a new application using the existing CAS login. This is best handled by the server that created the TGT.
  4. If there is no existing ticket, then the user is logging into CAS. This may be the GET that returns the login form, or the POST that submits the Userid and Password. Vanilla CAS code works only if the POST goes back to the same server that handled the GET. This is the only part of CAS that actually has an HttpSession.
  5. Otherwise, if there is no JSESSIONID then this is the initial GET for the login form. Assign it to any server.

Only cases 1-3 actually involve special CAS protocol logic. Steps 4 and 5 are standard options that will be programmed into any Front End device already, so after test 3 you basically fall through to whatever the Front End did before you added special code.

Cases 1-3 are only meaningful for a GET request. The only CAS POST is the login form.

CAS uses SSL/TLS sessions, and they are rather expensive to set up. So after going to all the trouble to create one of these sessions, the client and the server reuse it over and over again. If the secure session was set up with client identification information, like an X.509 Certificate or SPNEGO (Windows integrated AD login), then the SSL session belongs to just one user. Otherwise the session is anonymous and can carry many unrelated requests from many users.

The Front End will maintain a pool of already established SSL connections to the CAS servers. If it uses the logic of cases 1-3 to route a request to a specific server, that has nothing to do with what Web servers normally describe as a "session". Consecutive ST validation requests from the same application will go to different CAS servers based on the different suffix values in the different ST ids, and they will reuse entirely different idle SSL sessions from the pool.

If you would like to see how this sort of programming is done, but you don't want to learn about real Front End devices, the same program logic is defined in the CushyFrontEndFilter.

In Java, an application can specify one or more Filters. A Filter is a Java class that can scan the Request as it comes in to the application (CAS in this case) and can scan the Response generated by the application. The CushyFrontEndFilter scans incoming Requests and applies the logic of cases 1-3 to forward requests that are best handled by another node in the cluster.
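A stripped-down skeleton of such a Filter might look like the following. This is not the actual CushyFrontEndFilter source, just a sketch of where the routing decision sits in the Filter lifecycle; it reuses the TicketSuffix and CasTicketLocator sketches from earlier, and the proxying helper is left hypothetical:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FrontEndRoutingFilterSketch implements Filter {

    public void init(FilterConfig config) { }
    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) resp;

        // Cases 1-3: find a ticket id in the request and read its node suffix.
        String ticketId = CasTicketLocator.locateTicket(request);
        String suffix = (ticketId == null) ? null : TicketSuffix.suffixOf(ticketId);

        if (suffix == null || suffix.equals(localNodeSuffix())) {
            // No ticket, or this node owns the ticket: let CAS handle the request here.
            chain.doFilter(req, resp);
        } else {
            // Another node issued the ticket: replay the request to that node and
            // copy its answer back to the client.
            proxyToOwningNode(suffix, request, response);
        }
    }

    private String localNodeSuffix() {
        return "casvm01";   // in reality taken from the cluster configuration
    }

    private void proxyToOwningNode(String suffix, HttpServletRequest request,
                                   HttpServletResponse response) throws IOException {
        // Hypothetical helper: open an HTTP connection to the node the suffix names,
        // forward the method, headers, and body, and stream the reply back.
    }
}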

If the programming is added to the Front End device, then it forwards each request to the CAS server that issued the ticket and holds it in memory. However, without intelligent Front End programming the request arrives at a randomly chosen CAS node. At this point we have only two possible strategies:

  • The traditional idea is to configure the TicketRegistry to replicate ticket information to all the nodes, or to the database server, before returning the ST to the browser. So 100% of the time there is a network transaction from one server to all the other servers, and the delay for that operation is added to the end of Service Ticket generation. Remember also that the ST is deleted when it is validated, and since you cannot configure Ehcache or any of the other technologies to do addTicket synchronously but deleteTicket asynchronously, 100% of the time you get a second network operation at the end of ST validation.
  • With CushyFrontEndFilter the Service Ticket generation returns immediately. When the ST validation request comes in, there is a 1/n chance it will be routed to the right server by chance (that is, in a two-node cluster it goes to the right place 50% of the time). Only if it goes to the wrong server is there a network operation, and then it is from the server that received the request to the server that owns the ticket. There is only the one exchange instead of two operations and two delays.

So with Ehcache or any other synchronous replication strategy there is a 100% probability of two network transactions and their delays, while with the Filter (in a two-node cluster) there is a 50% chance of no operation or delay and a 50% chance of one operation and delay. Roughly speaking, the Filter has 25% of the overhead of the traditional mechanism.

But wait, you say, tickets still have to be replicated to other nodes for recovery if a node crashes. Yes, that is right, at least for Login TGTs and PGTs. However, Service Tickets are created, validated, and deleted so quickly that if you replicate tickets asynchronously, whether with CushyTicketRegistry or with Ehcache's default replication every 10 seconds, most of the time the ST doesn't hang around long enough to get replicated at all.

So while programming the Front End is the best solution, using CushyFrontEndFilter should clearly be more efficient than the current system of synchronous Service Ticket replication.
