Any cluster of Web Servers requires some sort of Front End device to screen and forward network traffic. Ten years ago this was a computer with some fairly simple programming. Common strategies for distributing requests:
- Round Robin - A list of the servers is maintained. Each request is assigned to the next computer in the list. When you get to the end of the list, you wrap back to the beginning.
- Master/Spare - All requests go to the first computer until it fails. Then requests go to the backup computer until the Master comes back up.
However, if each Web request can go to a different server, then all the data about the current session either has to be saved back to the browser (as a cookie) or saved to a database that every server in the cluster can access, and that shared database became a performance bottleneck. Companies building network equipment realized that they would have a competitive advantage if the Front End learned a little more about servers and protocols, so it could optimize request routing and make things simpler and more efficient for the applications. Then they added features to enhance security. Today these devices track repetitive activity from very active IP addresses to detect and suppress Denial of Service attacks. With "deep packet inspection" they can identify attempts to hack a system. They can remove HTTP Headers regarded as inappropriate, and add Headers of their own. Vendors may provide fairly complex services for the most widely used programs, and local network administrators can add their own coding in a high level language for applications that are locally important.
For example, users at Yale know that CAS is "https://secure.its.yale.edu/cas". In reality, DNS resolves secure.its.yale.edu to 130.132.35.49, and that is a Virtual IP address (a VIP) on the BIG-IP F5. The VIP requires configuration, because the F5 has to hold the SSL Certificate for "secure.its.yale.edu" and manage the SSL protocol. Yale decided to make other security applications appear to run on the secure.its.yale.edu machine, even though each application has its own pool of VMs. So the F5 has to examine the URL to determine whether the server name is followed by "/cas", in which case the request goes to the pool of CAS VMs, or by "/idp", in which case it goes to the Shibboleth pool of VMs. Because it is the F5 that is really talking to the browser, it also has to create a special Header with the browser IP address in case that address is important to the Web server.
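Conceptually the F5's decision is just string matching on the path of the URL. The following Java fragment is only a sketch of that decision; the real logic lives in the F5 configuration, and the pool names here are invented:

// Conceptual sketch only: the real decision is made in the F5 configuration.
// The pool names are hypothetical.
public class PathRouter {

    /** Choose a back-end pool based on the path portion of the request URL. */
    public static String choosePool(String requestPath) {
        if (requestPath.startsWith("/cas")) {
            return "cas-pool";      // pool of CAS VMs
        }
        if (requestPath.startsWith("/idp")) {
            return "idp-pool";      // pool of Shibboleth VMs
        }
        return "default-pool";      // anything else configured on the VIP
    }
}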
The amount of logic in the Front End can be substantial. Suppose CAS wants to use X.509 User Certificates installed in the browser as an authentication credential. CAS has an entire X509 support module, but that depends on the browser talking to the CAS server directly. If the Front End is terminating the SSL/TLS session itself, then it has to be configured with all the standard information needed by any Web component that handles user certificates. There has to be a list of "Trusted" Certificate Authorities from which User Certificates will be accepted. The Front End has to tell the browser whether certificates are required or optional. The signature in the submitted Certificate has to be validated against the Trusted CA list. The Certificate has to be ASN.1 decoded, and then the DN and/or one or more subjectAltNames have to be extracted and turned into new HTTP headers that can be forwarded to the application. All of this is fairly standard stuff, and it is typically part of the built-in code of any Front End device.
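When the Java container terminates TLS itself, the servlet API exposes the validated certificate chain through the standard javax.servlet.request.X509Certificate request attribute; a Front End that terminates TLS has to do the equivalent extraction and pass the result along as headers. A minimal server-side sketch of the extraction step (not part of CAS itself):

import java.security.cert.X509Certificate;
import javax.servlet.http.HttpServletRequest;

public class CertificateInfo {

    /**
     * Extract the subject DN from a client certificate when the Java container
     * terminated the TLS session. A Front End that terminates TLS instead would
     * do this work itself and forward the DN in an HTTP header of its choosing.
     */
    public static String subjectDn(HttpServletRequest request) {
        X509Certificate[] chain =
            (X509Certificate[]) request.getAttribute("javax.servlet.request.X509Certificate");
        if (chain == null || chain.length == 0) {
            return null;  // no client certificate was presented
        }
        // chain[0] is the user certificate; the rest of the chain is the CA path
        return chain[0].getSubjectX500Principal().getName();
    }
}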
So teaching a Front End about the CAS protocol is not that big a deal. Commercial sites do this sort of thing all the time, but they don't run CAS. Universities probably spend less time optimizing eCommerce applications, so they may not normally think about Front End devices this way.
First, however, we need to understand the format of CAS ticket ids, because that is where the routing information comes from:
type - num - random - suffix
where type is "TGT" or "ST", num is a ticket sequence number, random is a large random string like "dmKAsulC6kggRBLyKgVnLcGfyDhNc5DdGKT", and suffix is configured in the uniqueIdGenerators.xml file.
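Because the suffix is always the last dash-delimited field, a Front End or filter needs only trivial string handling to pick it out. A minimal sketch, assuming the ticket id follows the format above:

public class TicketSuffix {

    /**
     * Extract the routing suffix from a CAS ticket id of the form
     * type-num-random-suffix, e.g. "ST-5-dmKAsulC...-casvm01".
     * Returns null if the string does not look like a CAS ticket id.
     */
    public static String suffixOf(String ticketId) {
        if (ticketId == null) {
            return null;
        }
        int lastDash = ticketId.lastIndexOf('-');
        if (lastDash < 0 || lastDash == ticketId.length() - 1) {
            return null;
        }
        return ticketId.substring(lastDash + 1);
    }
}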
A typical XML configuration for a particular type of ticket (when you use CushyClusterConfiguration) looks like this:
<bean id="ticketGrantingTicketUniqueIdGenerator" class="org.jasig.cas.util.DefaultUniqueTicketIdGenerator">
<constructor-arg index="0" type="int" value="50" />
<constructor-arg index="1" value="#{clusterConfiguration.getTicketSuffix()}" />
</bean>
The suffix value, which is the index="1" argument to the Java object constructor, is obtained through a Spring EL expression as the TicketSuffix property of the bean named clusterConfiguration. Feeding the output of clusterConfiguration directly into the Ticket ID Generator keeps the configuration simple and ensures that all the machines come up configured consistently. There is special logic in Cushy for an F5, which, for some reason, likes to identify hosts by the MD5 hash of the character representation of their decimal/dotted IPv4 address.
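For reference, that F5-style suffix can be reproduced in a few lines of Java. This is a sketch of the computation as just described (MD5 over the dotted-decimal text of the node's IPv4 address, rendered as hex), not the actual Cushy source, and the exact output formatting the F5 expects may differ:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class F5Suffix {

    /** MD5 hash of the dotted-decimal IPv4 address text, e.g. "130.132.35.49", as hex. */
    public static String ticketSuffixFor(String dottedIpAddress) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(dottedIpAddress.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}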
Every CAS request except the initial login arrives with one or more tickets located in different places in the request. There is a sequence of tests, and you stop at the first match (a sketch of this logic follows the list):
- If the Path part of the URL is a validate request (/cas/validate, /cas/serviceValidate, /cas/proxyValidate, or /cas/samlValidate) then look at the ticket= parameter in the query string part of the URL
- Otherwise, if the Path part of the URL is a /cas/proxy request, then look at the pgt= parameter in the query string.
- Otherwise, if the request has a CASTGC cookie, then look at the cookie value.
- Otherwise, the request is probably in the middle of login, so use any built-in Front End support for JSESSIONID.
- Otherwise, or if the node selected by 1-4 is down, choose any CAS node from the pool.
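Here is a hedged Java sketch of those five tests, written as the kind of helper CushyFrontEndFilter (or a scriptable Front End) could use. The class and method names are illustrative, not the actual Cushy source:

import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletRequest;

public class CasRequestRouter {

    /**
     * Return the ticket id whose suffix identifies the preferred CAS node,
     * or null if the request can go to any node (cases 4 and 5).
     * The tests are applied in order; the first match wins.
     */
    public static String routingTicket(HttpServletRequest request) {
        String path = request.getRequestURI();

        // Case 1: validation requests carry the Service Ticket in the ticket= parameter.
        if (path.endsWith("/validate") || path.endsWith("/serviceValidate")
                || path.endsWith("/proxyValidate") || path.endsWith("/samlValidate")) {
            return request.getParameter("ticket");
        }

        // Case 2: proxy requests carry the Proxy Granting Ticket in the pgt= parameter.
        if (path.endsWith("/proxy")) {
            return request.getParameter("pgt");
        }

        // Case 3: a logged-in browser presents the TGT id in the CASTGC cookie.
        Cookie[] cookies = request.getCookies();
        if (cookies != null) {
            for (Cookie cookie : cookies) {
                if ("CASTGC".equals(cookie.getName())) {
                    return cookie.getValue();
                }
            }
        }

        // Cases 4 and 5: mid-login (JSESSIONID stickiness) or first contact; any node will do.
        return null;
    }
}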
That is the routing logic; now here is the explanation:
- After receiving a Service Ticket ID from the browser, an application opens its own HTTPS session to CAS and presents the ticket id in a "validate" request. If the id is valid, CAS passes back the Netid, and in certain requests can pass back additional attributes. This request is best handled by the server that issued the Service Ticket.
- When a middleware server like a Portal has obtained a CAS Proxy Granting Ticket, it asks CAS to issue a Service Ticket by opening its own HTTPS connection to CAS and making a /proxy call. Since the middleware is not a browser, it does not have a Cookie to hold the PGT, so it passes that ticket id explicitly in the pgt= parameter. This request is best handled by the server that created the Proxy Granting Ticket.
- After a user logs in, CAS creates a Login TGT that points to the Netid and attributes and writes the ticket id of the TGT to the browser as a Cookie. The Cookie is sent back from the browser in any request to "https://secure.its.yale.edu/cas". After initial login, all requests with cookies are requests to issue a Service Ticket for a new application using the existing CAS login. This is best handled by the server that created the TGT.
- If there is no existing ticket, then the user is logging into CAS. This may be the GET that returns the login form, or the POST that submits the Userid and Password. Vanilla CAS code works only if the POST goes back to the same server that handled the GET. This is the only part of CAS that actually uses an HttpSession.
- Otherwise, if there is no JSESSIONID then this is the initial GET for the login form. Assign it to any server.
Only cases 1-3 actually involve special CAS protocol logic. Cases 4 and 5 are standard options that will already be programmed into any Front End device, so after test 3 you basically fall through to whatever the Front End did before you added special code.
Cases 1-3 are only meaningful for a GET request. The only CAS POST is the login form and it is case 4.
CushyFrontEndFilter
It is not always possible to convince the network administrators who own the Front End device to add programming for CAS protocol. It turns out, however, that this sort of programming is a good idea even if you have to wait until the request hits the application servers.
CushyFrontEndFilter is a Java Servlet Filter that you can insert using WEB-INF/web.xml in the CAS WAR file. It is typically configured by CushyClusterConfiguration to know the ticket suffix and URL of the other nodes in a CAS cluster. It builds a small pool of reusable SSL sessions with each member of the cluster and uses them to forward GET requests that the case 1-3 analysis indicates would be better handled by the node that generated the ticket. It sends back to the browser whatever reply the other node generated. In short, it does what we would prefer the Front End to do, but it waits until the request hits the processing stack of one of the CAS servers.
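The forwarding itself is ordinary HTTP client work. The sketch below shows the general shape of forwarding one GET and copying the reply back to the browser; it is illustrative rather than the actual filter source, which pools connections and copies headers and error responses more carefully:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.servlet.http.HttpServletResponse;

public class ForwardingSketch {

    /**
     * Forward a GET to the CAS node that owns the ticket and copy its reply
     * back to the browser. nodeUrl would come from the cluster configuration,
     * e.g. "https://casvm02.its.yale.edu" (hypothetical host name), and
     * uriWithQuery is the original path plus query string.
     */
    public static void forwardGet(String nodeUrl, String uriWithQuery,
                                  HttpServletResponse response) throws Exception {
        HttpURLConnection connection =
            (HttpURLConnection) new URL(nodeUrl + uriWithQuery).openConnection();
        connection.setRequestMethod("GET");

        response.setStatus(connection.getResponseCode());
        if (connection.getContentType() != null) {
            response.setContentType(connection.getContentType());
        }
        try (InputStream in = connection.getInputStream();
             OutputStream out = response.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);  // copy the other node's reply body to the browser
            }
        }
    }
}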
This adds some overhead and some network delay due to the request forwarding. However, it turns out that you can save more overhead and reduce more delay than the Filter consumes.
Start with an existing configuration that uses Ehcache for the TicketRegistry. The standard configuration creates a Service Ticket cache that replicates ticket operations "synchronously" (the node that creates the ticket has to wait until an RMI call has copied the ticket to the caches on all the other nodes before it can return the ticket id to the browser). If you have a two-node cluster, then every Service Ticket that is created has to wait for one RMI call to the other node, and because all operations are synchronous, every ST validation that deletes the ticket also has to wait for an RMI call to the other node. So the overhead is two network operations, each synchronous, delaying the return of the response to the client.
Now consider the same configuration with CushyFrontEndFilter inserted in front of the stack, which lets you reconfigure the ticketRegistry.xml file so that Ehcache uses delayed (asynchronous) replication for Service Tickets with the same parameters it uses for Login Tickets.
The browser submits a request to generate a Service Ticket. The Front End randomly chooses one of the two servers, with a 50% chance of selecting the one that generated the Login ticket. This is the case 3 situation (the CASTGC Cookie ticket), but since Ehcache replicates Login TGTs to all the nodes, there is no particular reason to route this request; any node can handle it. If you do decide to route to the login server, you expect one forwarded network transaction half the time. If you disable CASTGC routing, then there is never a synchronous request or delay. The addTicket call does queue a ticket replication to occur eventually, but it is batched up, happens in the background, and delays nobody.
The application submits a request to validate the ST. Again there is a 50% chance the request will go to the node that issued the ST and require no network communication. In the other half of the cases, there is one synchronous network transaction.
So with Ehcache configured as it is now, with synchronous replication, there are always two synchronous operations and two delays per Service Ticket. With the Filter and asynchronous replication, there are two steps that each have a 50% chance of requiring a forwarded network operation, for an expected total of one synchronous operation instead of two. If you disable CASTGC routing, that drops to a 50% chance of a single network operation, an expected total of one half. Thus the Filter reduces the expected number of synchronous operations by at least 50% and perhaps 75%.
If you use CushyTicketRegistry instead of Ehcache, then you have no choice. Either the Front End has to do the routing or you have to use CushyFrontEndFilter.