...

Cushy depends on a modern programmable network front end distributing requests to the various CAS servers in the cluster. In exchange, all the code for this type of clustering is contained in one medium-sized Java source file that is fairly easy to understand. There is no giant magic black box, but that still leaves the three CAS mistakes exposed:

1) Any system that seeks to replicate tickets has a concurrency problem if multiple threads (like the request threads maintained by any Web server) can change the content of an object while another thread has triggered replication of that object. CAS has some collections in its TicketGrantingTicket object that can be changed while the object is being replicated, and that can cause the replication to throw a ConcurrentModificationException (see the sketch after this list).

2) Object replication systems work best on standalone objects. CAS, however, chains Proxy and Service Tickets to the Ticket Granting Ticket. Under the covers this has always resulted in other CAS nodes receiving duplicate copies of the Ticket Granting Ticket object, but in the past that didn't matter because the copies were all identical. CAS 4 allows meaningful changes to be made to the Ticket Granting Ticket after logon, and then the copies are no longer identical. This may or may not be acceptable depending on how you use multi-factor authentication and Proxy tickets.

3) If you try to fix problem 2 in the TicketRegistry where you know the problem is occurring, you cannot do anything because the Ticket classes are all locked down and do not expose a method to correct an invalid or inappropriate TGT pointer.
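
To make problem 1 concrete, here is a small stand-alone sketch of the race. FakeTgt is a made-up stand-in for the real TicketGrantingTicket, not CAS code; it just shows that default Java serialization of an ordinary collection can fail with ConcurrentModificationException when another thread changes that collection mid-write.

    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for a TicketGrantingTicket: a Serializable object
    // holding an ordinary (unsynchronized) collection, like the collections
    // the real ticket keeps for its granted services.
    class FakeTgt implements Serializable {
        final List<String> services = new ArrayList<>();
    }

    public class ReplicationRace {
        public static void main(String[] args) throws Exception {
            FakeTgt tgt = new FakeTgt();
            for (int i = 0; i < 100_000; i++) {
                tgt.services.add("ST-" + i);
            }

            // "Request thread": keeps adding entries, as a Web server thread
            // would when the logged-on user visits another application.
            Thread requestThread = new Thread(() -> {
                int i = 0;
                while (true) {
                    try {
                        tgt.services.add("ST-late-" + i++);
                    } catch (RuntimeException ignored) {
                        // an unsynchronized list can fail under contention;
                        // keep mutating, that is the whole point of the demo
                    }
                }
            });
            requestThread.setDaemon(true);
            requestThread.start();

            // "Replication thread": serializes the ticket while it is being
            // changed. ArrayList.writeObject detects the concurrent change, so
            // this typically throws ConcurrentModificationException (the exact
            // failure can vary, because this is a data race).
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(tgt);
            }
            System.out.println("Serialized without error this time (it is a race).");
        }
    }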

Cushy cannot solve problems that require changes to the Ticket classes, but since it is a small amount of source you can quickly find the writeObject and readObject statements and develop a strategy to fix any problems through coordinated changes to the Tickets and the TicketRegistry. For now it is enough that there is no type of network problem, or possible failure of the cluster or replication process, that can crash one CAS node, let alone crash all the CAS servers simultaneously as happens with the previous alternatives.
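
For problems 2 and 3, the behavior is just what default Java serialization does with chained objects. The sketch below uses made-up Tgt and St classes (not the real CAS Ticket classes) to show that serializing only a Service Ticket drags a full copy of its Ticket Granting Ticket along, and that the receiving side ends up with a duplicate TGT object it cannot re-point without a setter on the ticket class.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Hypothetical stand-ins for the ticket classes, just to show what
    // default Java serialization does with the ST -> TGT chain.
    class Tgt implements Serializable {
        final String id;
        Tgt(String id) { this.id = id; }
    }

    class St implements Serializable {
        final String id;
        final Tgt grantingTicket;   // the chain that drags the TGT along
        St(String id, Tgt tgt) { this.id = id; this.grantingTicket = tgt; }
    }

    public class ChainDemo {
        public static void main(String[] args) throws Exception {
            Tgt tgt = new Tgt("TGT-1");
            St st = new St("ST-1", tgt);

            // Serialize only the Service Ticket...
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(st);
            }

            // ...but deserialization produces a brand new copy of the TGT too.
            St copy;
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                copy = (St) in.readObject();
            }

            System.out.println(copy.grantingTicket.id);      // TGT-1
            System.out.println(copy.grantingTicket == tgt);  // false: a duplicate copy
            // A receiving node that already holds its own TGT-1 would want to
            // point "copy" back at that object, but the real Ticket classes
            // expose no such setter, which is problem 3.
        }
    }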

...

JPA and the various "cache" technologies try to write individual tickets to the database or from node to node. They may also unintentionally copy additional tickets connected to the ticket you intended to replicate. Obviously it is more efficient to write only the changed tickets rather than all the tickets. Cushy also writes changed tickets periodically from server to server, but less frequently (every 5 to 15 minutes would be a suggested schedule) it writes a complete copy of the entire collection of interconnected tickets.
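
A rough sketch of the two timers just described (the class and method names are made up, not Cushy's actual API): one frequent pass for the changed tickets and one infrequent pass for the full checkpoint.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative scheduler only; the intervals are the suggested values
    // from the text, with 10 minutes chosen arbitrarily from the 5-15 range.
    public class CheckpointScheduler {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

        public void start(Runnable writeChangedTickets, Runnable writeFullCheckpoint) {
            // Frequent: write only the tickets added or deleted since the last pass.
            timer.scheduleAtFixedRate(writeChangedTickets, 10, 10, TimeUnit.SECONDS);
            // Infrequent: write the entire interconnected ticket collection.
            timer.scheduleAtFixedRate(writeFullCheckpoint, 10, 10, TimeUnit.MINUTES);
        }

        public void stop() {
            timer.shutdownNow();
        }
    }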

At Yale (a typical medium-sized university), the entire Registry of less than 20,000 tickets can be written to a disk file in 1 second (elapsed time) and it produces a file about 3 megabytes in size. That is a trivial use of modern multi-core server hardware, and copying 3 megabytes of data over the network every 5 minutes, or even every minute, is a trivial use of network bandwidth. So Cushy is less efficient, but in a way that is predictable and insignificant, in exchange for code that is simple and easy to understand.
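
To see why that write is so cheap, note that the checkpoint is just ordinary Java serialization of the registry's ticket collection to a single file. The sketch below assumes the registry keeps its tickets in a Map keyed by ticket id; the names are illustrative, not Cushy's actual code.

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Map;

    public class CheckpointWriter {
        // Write every ticket in the registry map to one serialized file.
        public static void writeCheckpoint(Map<String, ? extends Serializable> tickets,
                                           Path file) throws IOException {
            // Snapshot the values into a plain list so the top-level collection
            // cannot change while it is being written (the individual tickets
            // still can, which is problem 1 above).
            ArrayList<Serializable> snapshot = new ArrayList<>(tickets.values());
            try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(snapshot);
            }
        }
    }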

Once the tickets are a file on disk, the Web server provides an obvious way (HTTPS GET) to transfer them from one server to another. Instead of using complex multicast sockets with complex error recovery, you are using a simple Web protocol everyone understands to accomplish a trivial function. You can immediately understand the consequences of any network failure and eventual network recovery.
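
A minimal sketch of that transfer, assuming the standard Java 11 HttpClient and a made-up /cas/cluster/checkpoint URL on the peer node (the real endpoint name and error handling are assumptions):

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class CheckpointFetcher {
        private static final HttpClient CLIENT = HttpClient.newHttpClient();

        // Pull a peer node's checkpoint file with a plain HTTPS GET and save it.
        public static void fetch(String peerNode, Path saveTo)
                throws IOException, InterruptedException {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://" + peerNode + "/cas/cluster/checkpoint"))
                    .GET()
                    .build();
            HttpResponse<Path> response =
                    CLIENT.send(request, HttpResponse.BodyHandlers.ofFile(saveTo));
            if (response.statusCode() != 200) {
                // A failed GET just means this pass is skipped; the next
                // scheduled transfer retries, which is the whole error-recovery
                // story for this design.
                throw new IOException("Checkpoint fetch failed: " + response.statusCode());
            }
        }
    }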

...