...

Although the default configuration of Ehcache uses synchronous replication for Service Tickets, if you program the Front End (or add the CushyFrontEndFilter) to a CAS using Ehcache in the same way described for CushyTicketRegistry, then ST validation requests will go to the CAS server that created the ST. You can then use for Service Tickets the same lazy asynchronous replication that Ehcache is normally configured to use for Logon Tickets (TGTs).

So the main difference between the two is that every 10 seconds or so Ehcache replicates all the tickets that have changed in the last 10 seconds, while Cushy transmits a file with all of the ticket changes since the last full checkpoint. Then every few minutes it generates a full checkpoint. So Ehcache transmits a lot less data. However, the cost of transmitting the extra data is so low that this may not matter if Cushy provides extra function.
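The difference between a full checkpoint and a cumulative incremental can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the actual CushyTicketRegistry code: the class name CheckpointSketch, its methods, and the use of plain strings in place of ticket objects are all hypothetical. It assumes the incremental carries the tickets added since the last checkpoint plus the IDs of tickets deleted since then.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of checkpoint plus cumulative incremental; not the real Cushy code. */
class CheckpointSketch {
    // Live tickets keyed by ticket ID (a String stands in for the ticket object).
    private final Map<String, String> tickets = new HashMap<>();
    // Changes accumulated since the last full checkpoint.
    private final Set<String> addedIds = new HashSet<>();
    private final Set<String> deletedIds = new HashSet<>();

    void addTicket(String id, String ticket) {
        tickets.put(id, ticket);
        addedIds.add(id);
        deletedIds.remove(id);
    }

    void deleteTicket(String id) {
        if (tickets.remove(id) != null) {
            addedIds.remove(id);  // a create followed by a delete cancels out
            deletedIds.add(id);   // other nodes may already hold a copy
        }
    }

    /** Full checkpoint: a copy of every live ticket. Resets the incremental state. */
    Map<String, String> checkpoint() {
        addedIds.clear();
        deletedIds.clear();
        return new HashMap<>(tickets);
    }

    /** Incremental: every change since the last checkpoint (cumulative, never reset here). */
    Incremental incremental() {
        Map<String, String> added = new HashMap<>();
        for (String id : addedIds) {
            added.put(id, tickets.get(id));
        }
        return new Incremental(added, new HashSet<>(deletedIds));
    }

    /** The content of one incremental: added tickets and deleted ticket IDs. */
    static final class Incremental {
        final Map<String, String> added;
        final Set<String> deleted;

        Incremental(Map<String, String> added, Set<String> deleted) {
            this.added = added;
            this.deleted = deleted;
        }
    }
}
```

Because each incremental is cumulative, a node that misses one incremental loses nothing: the next one still contains every change since the last checkpoint. That is the price paid in extra data for the simpler recovery logic.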

Ehcache is a closed system that operates inside the CAS servers and exposes no external features. Cushy generates checkpoint and incremental files that are regular files on disk that can be accessed using any standard commands, scripts, or utilities. This provides new disaster recovery options.

Ehcache is designed to be a "cache". That is, it is designed to be a high speed, in-memory or local disk, copy of some data that has a persistent copy off on some server. That is why it has a lot of configuration for "LRU" and object eviction, because it assumes that lost objects are reloaded from persistent storage. You can use it as a replicated in-memory table, but the documentation makes clear that this is not its original design.

Ehcache replicates data transparently inside a large black box library. Cushy is a single source file of pure Java written to be easily understood. It is specifically designed to manage Tickets. Furthermore, there are specific points in the code where files arrive and where they are processed. These are places in the code where additional CAS specific logic can be added to handle special or future requirements.

Two examples in the form of a fable -

Suppose the Rapture happens and all your users have been good users and they are all transported to Heaven, leaving their laptops and tablets behind. Activity ceases on the network, so all the other TicketRegistry systems have nothing to do. Cushy, however, is driven by the number of tickets in the Registry and not as much by the amount of activity. So it continues to generate and exchange checkpoint files until 8 hours after the Rapture, when the logins all time out.

Suppose (and this one we have all seen) someone doesn't really understand how applications are supposed to work with CAS, and they write their code so they get a new Service Ticket for every Web page the user accesses. CAS now sees a stream of requests to create and validate new Service Tickets. The other TicketRegistry systems replicate the Service Ticket and then immediately send a message to all nodes to delete the ticket they just created. Cushy instead just wakes up after 10 seconds and finds that all this create and delete ticket activity has mostly cancelled out. The incremental file will contain an increasing number of deleted Ticket IDs, until the next checkpoint resets it to empty and it starts growing again. If you turn on the option to ignore Service Tickets altogether (because you don't really need to replicate them if you have programmed your Front End or added the Filter), Cushy can ignore this activity entirely.

Cushy is designed to be a CAS TicketRegistry. That is the only thing it does, and it is very carefully designed to do that job correctly.

Cushy models its design on two 40-year-old concepts. A common strategy for backing disks up to tape was to do a full backup of all the files once a week, and then during the week to do an incremental backup of the files changed since the last backup. The term "checkpoint" derives from a disk file into which an application saved all its important data periodically so it could restore that data and pick up where it left off after a system crash. These strategies work because they are too simple to fail. More sophisticated algorithms may accomplish the same result with less processing and I/O, but the more complex the logic, the more vulnerable you become if a software, hardware, or network failure occurs in a way that the sophisticated software did not anticipate.

Ehcache is a large library of complex code designed to merge changes to shared data across multiple hosts. Cushy is a single source file of pure Java written to be easily understood.

Basic Principles

  1. CAS is very important, but it is also small and cheap to run.
  2. Emphasize simplicity over efficiency as long as the cost remains trivial.
  3. The Front End gets the request first and it can be told what to do to keep the rest of the work simple. Let it do its job.
  4. Hardware failure doesn't have to be completely transparent. We can allow one or two users to get a bad message if everything works for the other 99.9% of the users. Trying to do better than this is the source of most 100% system failures.

...

Yale decided to make other security applications appear to run on the secure.its.yale.edu machine, even though each application has its own pool of VMs. So the F5 has to examine the URL to determine if it begins with "/cas" and therefore goes to the pool of CAS VMs, or if it references a different application and begins with "/idp" and therefore goes to the Shibboleth pool. The F5 also has to inspect and generate HTTP Headers if the real client IP address is to be passed on to a Web Server for processing.
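The path test the F5 performs can be sketched in Java as follows. This is only an illustration of the decision, not F5 iRule code; the class name PathRouter and the pool names are hypothetical, and the sketch assumes a prefix match should stop at a path segment boundary so that "/castle" is not mistaken for "/cas".

```java
/** Hypothetical sketch of URL-prefix routing; the pool names are made up for illustration. */
class PathRouter {
    static final String CAS_POOL = "cas-pool";
    static final String SHIBBOLETH_POOL = "shibboleth-pool";
    static final String DEFAULT_POOL = "default-pool";

    /** Returns the name of the VM pool that should receive a request for this URL path. */
    static String choosePool(String path) {
        if (hasPrefix(path, "/cas")) {
            return CAS_POOL;
        }
        if (hasPrefix(path, "/idp")) {
            return SHIBBOLETH_POOL;
        }
        return DEFAULT_POOL;
    }

    // True if path is exactly the prefix or continues with "/" or a query string.
    private static boolean hasPrefix(String path, String prefix) {
        if (!path.startsWith(prefix)) {
            return false;
        }
        if (path.length() == prefix.length()) {
            return true;
        }
        char next = path.charAt(prefix.length());
        return next == '/' || next == '?';
    }
}
```

For example, PathRouter.choosePool("/cas/login") selects the CAS pool and PathRouter.choosePool("/idp/profile") selects the Shibboleth pool.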

...

Routing requests to particular servers based on the content of the request line and the headers is part of what generic Front End devices (not just the F5) call "Layer 4-7 routing". The internet routes messages between computers using Layer 3 routing (IP), but Front End devices select the last hop to the specific VM based on data and an understanding of the higher-level protocols. For example, if a large university divided its CAS servers up by physically separated campuses, then people who normally go to one campus could be given an OU= in the DN of their X.509 User Certificate that would preferentially route CAS requests to the server or pool of servers for the home campus. Servers at other campus locations then provide offsite backup.

After the first request is randomly assigned to a Java J2EE server, subsequent requests can be sent back to the same server if the Front End understands the JSESSIONID protocol. The Java server places a parameter called JSESSIONID in the first response to the browser, and the browser sends it back as a Cookie or as part of the URL. The F5 has built-in programming to handle JSESSIONID, but that requires tables and is a lot more complex than CAS.

In the previous example the F5 obtained the User X.509 Certificate from the browser and decoded it. The full "DN" name in the certificate provides not only the user's identity but also the organizational unit to which he is assigned and the geographical location of his home campus or office. An F5 running on one campus of a large statewide university might use this information to route a CAS login request to a server on the specific home campus where the user has his specific local account and data.

In an eCommerce application, an F5 may be programmed to route the user's first request to a server chosen at random, but then to send all subsequent requests back to the same server. This can be done based on the client IP address or, for Java servers, based on the value of a parameter named "JSESSIONID" commonly used by Java servers to manage sessions. Of course, Java servers also have complex clustering technologies to exchange session information between requests, but it is better to avoid that problem altogether.
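The "random first, then sticky" behavior described above amounts to a lookup table in the Front End. The following Java sketch shows the idea under simplifying assumptions: the class name StickyRouter is hypothetical, and a single string key stands in for whichever value the Front End uses (the client IP address or the JSESSIONID value). A real device would also have to expire table entries and cope with servers that go down.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of sticky routing backed by a table; not how any real F5 is coded. */
class StickyRouter {
    private final List<String> servers;
    private final Random random;
    // Remembers which server each sticky key (client IP or JSESSIONID value) was assigned.
    private final Map<String, String> table = new HashMap<>();

    StickyRouter(List<String> servers, Random random) {
        this.servers = new ArrayList<>(servers);
        this.random = random;
    }

    /** First request for a key picks a server at random; later requests reuse that choice. */
    String route(String stickyKey) {
        return table.computeIfAbsent(
                stickyKey, k -> servers.get(random.nextInt(servers.size())));
    }
}
```

The table is the part that makes this approach stateful: the Front End has to remember something about every active session in order to route it.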

First, however, we need to understand the format of CAS ticketids because that is where the routing information comes from:

...

If you cannot convince your network administrators to do the programming in the Front End where it belongs, you can get the same result slightly less efficiently using the CushyFrontEndFilter. Just add it as a Servlet Filter in the WEB-INF/web.xml file. It examines incoming requests and executes the same logic described above, just coded in Java instead of F5 iRule syntax.

What Cushy Does at Failure

...