...

It will always be necessary to copy data to a backup location in order to recover from a crash. The previous TicketRegistry alternatives had to copy data to all servers very, very fast because CAS made no assumptions about the intelligence of the network Front End device. Today it is possible to use the capability of modern Front End equipment to simplify and relax the data replication requirements.

Since CAS has to use SSL, any Front End that distributes requests among CAS servers has to hold the SSL Certificate and manage the SSL protocol. This is more complicated than you may think, because if you want to accept User Certificates installed in the browser (or in plug-in devices based on the User Certificate protocol), then it is the Front End and not CAS that has to manage the trusted Certificate Authorities, the signature validation, the ASN.1 decode, and the Credentials To Principal Resolution functions that CAS handles when it talks directly to browsers. Compared to that, intelligent routing of requests based on the ticket ID suffix is fairly simple.

Although we think of CAS from the end user's point of view as a Single Sign On system with "logged in users", under the covers CAS processes HTTP Web requests to create, read, update, and delete Tickets. If you read the CAS protocol carefully, you realize that every CAS request (except the initial login) is a Ticket operation in which the primary piece of data is a Ticket ID string. It can be the value of the CASTGC cookie, or the ticket= parameter on a /cas/serviceValidate, or the pgt= parameter sent by a proxy. The ticket string is in different places, but it is always there.

CAS generates each new Ticket ID string in a dedicated class. This is important because CAS is not secure unless the Ticket ID is large enough and random enough that it cannot be guessed or brute forced. This code has the ability to put a "node name" identifier suffix on the end of each Ticket ID, although in many CAS installations nobody changes the default, which is to end the ticket with "-CAS". However, if every CAS server generates a unique Ticket ID suffix value that the Front End can use, then programming the Front End to find the ticket in one of the three places it can be, extract the string following the third "-" character in the ID, and use that value to select a particular server is fairly simple.

Do not make the mistake of assuming this is a "CAS Session". For the Logon CASTGC Cookie value it acts as a substitute for a session, but an application validates Service Tickets as they come in from browsers, and although the application thinks it is talking to a single CAS URL (which actually points to the Front End), each request may go to a different server because each Service Ticket has its own specific issuing node.

Even if you don't use Cushy, automating the configuration of the Ticket ID suffix and adding intelligent routing to your Front End can vastly simplify the configuration of all the other TicketRegistry solutions. For other solutions, this is a simplification. For Cushy it is a requirement. However, if you assume that the ticket validation request from the application will be randomly distributed to any CAS server in a cluster, then the Service Ticket has to be copied to all the other servers really, really fast.

Modern network front end devices have become very smart. They have to be smart, because this is a competitive market and you can't afford to sell a box that is less capable than the competition at accelerating applications, or fending off DOS attacks, or intelligently routing requests. These boxes already know about JSESSIONID and will send a user back to the particular Web Server in the cluster that he previously connected to.

While CAS performs a Single Sign On function, the logic is actually designed to create, read, update, and delete tickets. The ticket is the center of each CAS operation. In different requests there are only three places to find the key ticket that defines this operation:

  1. In the ticket= parameter at the end of the URL for validation requests.
  2. In the pgt= parameter for a proxy request.
  3. In the CASTGC Cookie for browser requests.

Programming the front end to recognize that "/validate", "/serviceValidate", and two other strings in the URL path mean case 1, that "/proxy" means case 2, and that everything else is case 3 is pretty simple.

Of course, finding the ticket is not helpful unless you use the feature that has always been part of CAS to put the nodename, or any other identifier for the CAS server, at the end of every ticket generated by that server. Once you use that option, an intelligent front end can send every request to the node that created the ticket and is therefore best able to process the request.
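
As a concrete illustration (not part of CAS or of any front end product), here is a rough Java sketch of the routing decision described above. The node names, backend addresses, and the exact list of validation paths are assumptions; a real front end would express the same logic in its own configuration or scripting language.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only: pick a backend CAS node from the ticket ID suffix.
    public class TicketRouterSketch {

        // Hypothetical node names mapped to hypothetical backend addresses.
        private final Map<String, String> nodes = new HashMap<String, String>();

        public TicketRouterSketch() {
            nodes.put("casvm01", "https://casvm01.example.edu:8443");
            nodes.put("casvm02", "https://casvm02.example.edu:8443");
        }

        // path is the URL path; ticketParam and pgtParam are query parameters;
        // castgcCookie is the value of the CASTGC cookie. Any of them may be null.
        public String route(String path, String ticketParam, String pgtParam, String castgcCookie) {
            String ticket;
            if (path.endsWith("/validate") || path.endsWith("/serviceValidate")
                    || path.endsWith("/proxyValidate") || path.endsWith("/samlValidate")) {
                ticket = ticketParam;   // case 1: ticket= parameter
            } else if (path.endsWith("/proxy")) {
                ticket = pgtParam;      // case 2: pgt= parameter
            } else {
                ticket = castgcCookie;  // case 3: CASTGC cookie, if any
            }
            if (ticket == null) {
                return null;            // no ticket; any node can handle the request
            }
            // Ticket IDs end with a node suffix, for example "ST-1234-randomstring-casvm01";
            // the suffix is the string after the third "-".
            String[] parts = ticket.split("-", 4);
            String suffix = (parts.length == 4) ? parts[3] : null;
            return nodes.get(suffix);   // null means fall back to default routing
        }
    }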

Adding some intelligent routing to your network front end reduces the burden on replication for all the ticket registry solutions. It means that you can relax the timing parameters, and make some replication asynchronous instead of synchronous, and still the validation request will never arrive at a server before the Service Ticket being validated arrives. However, while this may be a good thing to do for all the other Ticket Registry solutions, it is required by Cushy. Cushy doesn't have a very, very fast replication mode.

"Cushy" stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale". This summarizes what it is and how it works.

...

Although HTTP is a "stateless" protocol, an SSL connection is frequently kept open and reused so that a session effectively stays alive between requests. The SSL connects the browser or application to the Front End, and there is probably a separate SSL connection from the Front End to the CAS VM. A common Front End option is to notice a long running SSL connection and route all of its requests to the same backend VM node. Be sure you do not select this option with Cushy and CAS. For Service Ticket validation requests to work, the routing decision has to be made separately for each request, because different tickets have to be routed to different CAS VMs even though they come from the same application.

From a practical point of view, the biggest problem with Cushy deployment will probably be that Front End programming is typically the responsibility of one group, CAS is typically the responsibility of another group, and the CAS group may not be able to get its needs prioritized by the other group, or even convince them that this is a good idea.

What Cushy Does at Failure

It is not necessary to explain how Cushy runs normally. It is based on DefaultTicketRegistry. It stores the tickets in a table in memory. If you have a cluster, each node in the cluster operates as if it was a standalone server and depends on the Front End to route requests to the node that can handle them.
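
For orientation, the in-memory table amounts to a map keyed by the Ticket ID string, in the spirit of DefaultTicketRegistry. This is a minimal sketch rather than Cushy's actual code; the real registry holds CAS Ticket objects, which are replaced here by Object to keep the sketch self-contained.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Minimal sketch of an in-memory ticket table; not the actual Cushy code.
    public class InMemoryTicketTableSketch {

        // Tickets live in ordinary JVM memory, keyed by Ticket ID string.
        private final ConcurrentMap<String, Object> tickets = new ConcurrentHashMap<String, Object>();

        public void addTicket(String id, Object ticket) {
            tickets.put(id, ticket);
        }

        public Object getTicket(String id) {
            return tickets.get(id);
        }

        public boolean deleteTicket(String id) {
            return tickets.remove(id) != null;
        }
    }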

...

"Quartz" is the standard Java library for timer driven events. There are various ways to use Quartz, including annotations in modern containers, but JASIG CAS uses a Spring Bean interface to Quartz where parameters are specified in XML. All the standard JASIG TicketRegistry configurations have contained a Spring Bean configuration that drives the RegistryCleaner to run and delete expired tickets every so often. CushyTicketRegistry requires a second Quartz timer configured in the same file to call a method that replicates tickets. The interval configured in the Quartz part of the XML sets a base timer that determines the frequency of the incremental updates (typically every 5-15 seconds). A second parameter to the CushyTicketRegistry class sets a much longer period between full checkpoints of all the tickets in the registry (typically every 5-10 minutes).

A full checkpoint contains all the tickets. If the cache contains 20,000 tickets, it takes about a second to checkpoint, generates a 3.2 megabyte file, and then has to be copied across the network to the other nodes. An incremental file contains only the tickets that were added or deleted since the last full checkpoint. It typically takes a tenth of a second and uses very little disk space or network. However, after a number of incrementals it is a good idea to do a fresh checkpoint just to clean things up. You set the parameters to optimize your CAS environment, although either operation has so little overhead that it should not be a big deal.

Based on the usage pattern, at 8:00 AM the ticket registry is mostly empty and full checkpoints take no time. Late in the afternoon the registry reaches its maximum size and the difference between incrementals and full checkpoints is at its greatest.

Although Cushy uses the term "incremental", the actual algorithm is a differential between the current cache and the last full checkpoint, so between full checkpoints the incremental file grows as it accumulates all the changes. The incremental also includes a list of all the Service Ticket IDs that have been deleted (just to be absolutely sure things are correct). If you made the period between full checkpoints unusually long, the incremental file could end up larger than the checkpoint itself, and because it is transferred so frequently that would hurt performance far more than simply setting a reasonable period for full checkpoints.
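
To make the differential idea concrete, here is an illustrative sketch (not Cushy's code) of how the content of an incremental could be derived: the tickets added since the last full checkpoint plus the IDs of the tickets deleted since then.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch of the differential ("incremental") computation.
    public class IncrementalSketch {

        // Tickets present now that were not in the last full checkpoint.
        static Map<String, Object> addedSince(Map<String, Object> current, Set<String> checkpointIds) {
            Map<String, Object> added = new HashMap<String, Object>();
            for (Map.Entry<String, Object> e : current.entrySet()) {
                if (!checkpointIds.contains(e.getKey())) {
                    added.put(e.getKey(), e.getValue());
                }
            }
            return added;
        }

        // Ticket IDs that were in the last full checkpoint but are gone now.
        static Set<String> deletedSince(Map<String, Object> current, Set<String> checkpointIds) {
            Set<String> deleted = new HashSet<String>(checkpointIds);
            deleted.removeAll(current.keySet());
            return deleted;
        }
    }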

Nodes notify each other of a full checkpoint. Incrementals occur so frequently that it would be inefficient to send messages around, so a node picks up the incrementals from the other nodes each time it generates its own.

...

In the ticketRegistry.xml configuration, the two Quartz beans look like this:

    <bean id="jobBackupRegistry" class="org.springframework.scheduling.quartz.MethodInvokingJobDetailFactoryBean"

        p:targetObject-ref="ticketRegistry" p:targetMethod="timerDriven" />

    <bean id="triggerBackupRegistry" class="org.springframework.scheduling.quartz.SimpleTriggerBean"
      p:jobDetail-ref="jobBackupRegistry" p:startDelay="60000" p:repeatInterval="15000" />

The first bean tells Spring to call method "timerDriven" in the object configured with Spring bean name "ticketRegistry". The second bean tells Spring that after the first minute (letting things start up), make the call indicated in the first bean every 15 seconds. Since this is standard Spring stuff, the interval is coded in milliseconds.

The time interval configured here is the time between incrementals. The checkpointInterval parameter on the ticketRegistry bean sets the time (in seconds) between full checkpoints:

p:checkpointInterval="300"

So with these parameters, Cushy writes an incremental every 15 seconds and a checkpoint every 5 minutes. Feel free to set these values as you choose. Shorter intervals mean more overhead, but the cost is already so low that longer intervals don't really save much.

See the sample ticketRegistry.xml file for the complete configuration context.

Special Rules

Cushy stores tickets in an in-memory table. It writes tickets to a disk file with a single writeObject Java statement. It transfers files from machine to machine using an HTTPS GET. So far, everything seems to be rather simple. Cushy started that way, but then it became clear that there were a small number of optimizations that really needed to be made even if they added a slight amount of complexity to the code.
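
The "single writeObject statement" idea looks roughly like the following sketch. The helper class, the file name argument, and the use of a plain List are assumptions for illustration; they are not Cushy's actual classes.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    // Rough sketch of checkpointing tickets to disk and reading them back.
    public class CheckpointSketch {

        static void writeCheckpoint(List<? extends Serializable> tickets, String fileName)
                throws IOException {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName));
            try {
                // One writeObject call serializes the whole collection, including
                // the object graph that each ticket references.
                out.writeObject(new ArrayList<Serializable>(tickets));
            } finally {
                out.close();
            }
        }

        @SuppressWarnings("unchecked")
        static List<Serializable> readCheckpoint(String fileName)
                throws IOException, ClassNotFoundException {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName));
            try {
                return (List<Serializable>) in.readObject();
            } finally {
                in.close();
            }
        }
    }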

...

Cushy avoids this problem because the periodic checkpoint file captures all the tickets with all their relationships. Limited examples of this problem can occur around node failure, but for all the other TicketRegistry solutions (except JPA) this happens all the time to all the tickets during normal processing.

JUnit Testing

Cushy includes a JUnit test that runs all the same cases that the DefaultTicketRegistry JUnit test runs. Testing a cluster on a single machine without a Web server is complicated enough that the strategies require some documentation. If you create an instance of CushyTicketRegistry without any parameters, it believes that it is a Primary object. You can then set properties and simulate Spring configuration. There is an alternate constructor with four parameters that is used only from test cases.

It is not possible to configure enough of a Java Servlet Web server to test the HTTP Notify and file transfer. You have to test that on a real server. JUnit tests run in SharedDisk mode, where two objects representing the TicketRegistry objects on two different nodes in the cluster both write and read files from the same disk directory.

The trick here is to create two Primary CushyTicketRegistry instances with two compatible but opposite configurations. Typically one Primary object believes that it is node "casvm01" and that the cluster consists of a second node named "casvm02", while the other Primary object believes that it is node "casvm02" in a cluster with "casvm01". The next thing you need is to make sure that both objects are using the same work directory. That way the first object will create a checkpoint file named for "casvm01" and the other will create a checkpoint file named for "casvm02". Without a Web server, the files cannot be exchanged over the network, so you cannot unit test the HTTP part. For the rest, once both nodes have checkpointed their tickets to the same directory, each node can be programmed to skip the HTTPS GET and simply restore the file named for the other node from disk into its Secondary object for that node. Neither Primary object knows that the file for the other node was written directly to disk by another object in the same JVM rather than being fetched over the network.

There are two test classes with entirely different strategies.

...