...

JPA makes CAS dependent on a database. CAS doesn't use the database for any serious SQL work, so you could use almost any database system. However, the database becomes a single point of failure, so it has to be reliable. If you already have a 24x7x365 database managed by professionals who can guarantee availability, this is a good solution. If not, then standing up such a database is an insurmountable prerequisite for bringing up an application like CAS that doesn't really need one.

The various cache (in-memory object replication) solutions also work. Unfortunately, they have massively complex configuration parameters involving multicast network addresses and timeouts, and while they are designed to survive complete node failure, experience suggests they are not designed to cope with a CAS machine that is "sick". If a machine is down and does not respond to any network requests, the technology recovers; but if a node is up and receives messages yet never actually processes them, queues start to clog, the backlog spreads into CAS itself, and then CAS stops working simultaneously on all nodes. The "one big bag of objects" model also has a partitioning problem: while node failure is well defined, the status of objects is ambiguous if a router connecting two machine rooms fails, the two CAS nodes operate independently for a while, and there are now two divergent versions of what the system is designed to treat as a single cohesive collection. And because these technologies operate entirely in memory, at least one node has to remain up while the others reboot in order to preserve the content of the cache.

If you understand the problem CAS is solving and the way the tickets fit together, then each type of failure presents specific problems. Cushy is designed to avoid the big problems and provide transparent service to 99.9% of the CAS users. If one or two people see an error message due to a CAS crash, and CAS crashes only once a year, that is good enough, especially when the alternative technologies can cause the entire system to stop working for everyone.

Since Cushy is specifically designed to handle the CAS Ticket problem, you will not understand it without a more detailed discussion of how CAS Tickets connect to and depend on each other. Some CAS design problems cannot be solved at the TicketRegistry layer. Cushy doesn't fix them, but neither does any of the cache solutions. This document identifies them and suggests how to fix them elsewhere in the CAS code.

Cushy is a cute word that roughly stands for "Clustering Using Serialization to disk and Https transmission of files between servers, written by Yale".
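The "serialization to disk" half of that name can be sketched in a few lines of plain Java. This is a hypothetical illustration, not the actual Cushy code: the `Ticket` class, file layout, and method names here are stand-ins, and the real registry serializes actual CAS ticket objects and ships the resulting file to the other nodes over HTTPS.

```java
import java.io.*;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: periodically write the whole in-memory ticket map
// to a file with standard Java serialization, so another node can fetch the
// file and restore the tickets after a failure. Not the real Cushy API.
public class CheckpointSketch {

    // Stand-in for a CAS ticket; the only requirement is Serializable.
    static class Ticket implements Serializable {
        private static final long serialVersionUID = 1L;
        final String id;
        Ticket(String id) { this.id = id; }
    }

    // Write a snapshot of the ticket map to disk.
    static void checkpoint(ConcurrentHashMap<String, Ticket> tickets, File file)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(tickets);
        }
    }

    // Read a snapshot back, e.g. on another node after a crash.
    @SuppressWarnings("unchecked")
    static ConcurrentHashMap<String, Ticket> restore(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (ConcurrentHashMap<String, Ticket>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ConcurrentHashMap<String, Ticket> tickets = new ConcurrentHashMap<>();
        tickets.put("TGT-1", new Ticket("TGT-1"));

        File file = File.createTempFile("registry", ".ser");
        checkpoint(tickets, file);

        ConcurrentHashMap<String, Ticket> copy = restore(file);
        System.out.println(copy.containsKey("TGT-1")); // prints "true"
    }
}
```

Because the snapshot is an ordinary file, transferring it between nodes needs nothing more exotic than an HTTP GET, which is the point of the approach: no multicast, no replication timeouts, no cache cluster to configure.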

...