We have had a small number of Out of Memory problems in Shibboleth. To detect actual memory problems, someone responsible for Shibboleth administration needs to understand Shibboleth's normal memory use patterns. These patterns show up in the gc.log file in the Tomcat logs directory.
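
The exact options vary by installation, but a gc.log in the one-line format shown below is typically produced by Java 8 era JVM flags along these lines (the 3072m sizes match the heap seen in the log; the log path is an assumption, so check the actual Tomcat service configuration):

JAVA_OPTS="$JAVA_OPTS -Xms3072m -Xmx3072m -XX:+UseG1GC \
    -Xloggc:$CATALINA_BASE/logs/gc.log -XX:+PrintGCDateStamps"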

Traditional programming languages like C have explicit memory management calls: you allocate memory with one routine and free it with another. The advantage is that memory is immediately available for reuse, but the problem is that a programming bug can free memory prematurely while it is still being used by another part of the program. So Java and other modern application programming languages use Garbage Collection (GC). A program does not free memory explicitly; it simply discards every reference to an object. When free memory gets low enough, Java GC automatically audits objects to find those that are no longer referenced and are safe to discard.
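
A minimal Java illustration of the difference:

// There is no free() in Java. An object becomes eligible for collection
// as soon as the last reference to it is discarded.
public class DiscardExample {
    public static void main(String[] args) {
        byte[] buffer = new byte[16 * 1024 * 1024]; // allocate 16MB
        buffer[0] = 1;                              // use it
        buffer = null; // discard the only reference; GC may reclaim it later
    }
}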

Garbage Collection has strategies. For example, because most objects are short-lived, GC divides objects into groups by age. It frequently checks recently allocated objects, which have a high probability of turning out to be unreferenced and discardable. Less frequently it checks middle-aged objects, and even less frequently it checks objects that have been in use for a very long time.
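
A typical request-processing loop shows why this pays off: the temporaries it churns through die almost immediately, which is exactly what frequent young collections are optimized to reclaim.

// Illustration of the "most objects die young" assumption: every
// temporary String built in the loop is unreferenced as soon as the
// iteration ends, so young collections reclaim it cheaply.
public class YoungGarbage {
    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 1_000_000; i++) {
            String scratch = "request-" + i; // short-lived allocation
            total += scratch.length();       // the value survives, the object does not
        }
        System.out.println(total);
    }
}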

These strategies turn out to be a poor match for Shibboleth, which has one really big memory use. Once a day it needs to read in the InCommon metadata, which contains tens of thousands of configuration objects for various partner sites. Shibboleth parses the entire XML file, building a complete set of objects to describe the new configuration while keeping the old configuration objects in memory. When everything has been processed, it switches the active Metadata pointer to the new set of objects, making all the old objects unreferenced. Because those objects were in use for a day, they are very old and will not be freed until GC gets around to checking the set of very old objects. The design and use pattern of Shibboleth and InCommon just happens to be the complete opposite of the standard GC strategy.
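
A hypothetical sketch of this reload pattern (the class and method names are illustrative, not actual Shibboleth code):

import java.io.InputStream;

final class MetadataRefresher {
    static final class Metadata { /* tens of thousands of parsed entity descriptors */ }

    private volatile Metadata active; // the pointer request threads read

    // Runs once a day. The new object graph is built in full while the old
    // one is still referenced, so heap use roughly doubles during the parse.
    void reload(InputStream inCommonXml) {
        Metadata fresh = parse(inCommonXml); // build the complete new configuration
        active = fresh; // swap the pointer; the old graph is now unreferenced,
                        // but it is tenured and waits for an old-generation GC
    }

    private Metadata parse(InputStream xml) { return new Metadata(); } // stand-in for the XML parser

    Metadata current() { return active; }
}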

In the gc.log file, there is normally a long string of cycles that process new (young) objects:

2018-07-05T16:22:16.838-0400: 1112.377: [GC pause (G1 Evacuation Pause) (young) 2755M->1380M(3072M), 0.0141264 secs]
At 16:22 there was a standard cycle: the heap, which is fixed at 3072M by these JVM parameters, had reached 2755M of allocated space, and by deleting "young" objects the collection reduced heap use to 1380M.
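
For monitoring or graphing, the before->after(total) figures can be pulled out of these lines mechanically. A small sketch, assuming the Java 8 one-line format shown above (other log formats will not match):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLineParser {
    // Matches the "2755M->1380M(3072M)" heap figures in a one-line G1 record.
    private static final Pattern HEAP = Pattern.compile("(\\d+)M->(\\d+)M\\((\\d+)M\\)");

    public static void main(String[] args) {
        String line = "2018-07-05T16:22:16.838-0400: 1112.377: "
                + "[GC pause (G1 Evacuation Pause) (young) 2755M->1380M(3072M), 0.0141264 secs]";
        Matcher m = HEAP.matcher(line);
        if (m.find()) {
            System.out.printf("before=%sM after=%sM total=%sM%n",
                    m.group(1), m.group(2), m.group(3));
        }
    }
}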

Because a standard cycle over recently allocated objects misses older objects, heap use after each collection grows over time until it hits a threshold that Java chooses based on the 3072M total heap size. (For G1 this is the initiating heap occupancy threshold, 45% of the heap by default; 45% of 3072M is about 1382M, which matches the roughly 1385M trigger seen below.) Then it does a more comprehensive cycle. The pattern looks like:

2018-07-05T16:28:48.297-0400: 1503.835: [GC pause (G1 Evacuation Pause) (young) 2756M->1384M(3072M), 0.0217201 secs]
2018-07-05T16:29:10.732-0400: 1526.271: [GC pause (G1 Evacuation Pause) (young) 2756M->1385M(3072M), 0.0159584 secs]
2018-07-05T16:29:35.574-0400: 1551.113: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2756M->1386M(3072M), 0.0195857 secs]
2018-07-05T16:29:35.594-0400: 1551.133: [GC concurrent-root-region-scan-start]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-root-region-scan-end, 0.0187561 secs]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-mark-start]
2018-07-05T16:29:37.340-0400: 1552.878: [GC concurrent-mark-end, 1.7271823 secs]
2018-07-05T16:29:37.340-0400: 1552.879: [GC remark, 0.0415657 secs]
2018-07-05T16:29:37.382-0400: 1552.921: [GC cleanup 1393M->1002M(3072M), 0.0087241 secs]
2018-07-05T16:29:37.391-0400: 1552.930: [GC concurrent-cleanup-start]
2018-07-05T16:29:37.392-0400: 1552.931: [GC concurrent-cleanup-end, 0.0007280 secs]
2018-07-05T16:30:04.776-0400: 1580.314: [GC pause (G1 Evacuation Pause) (young) 2364M->998M(3072M), 0.0227124 secs]
2018-07-05T16:30:05.045-0400: 1580.583: [GC pause (G1 Evacuation Pause) (mixed) 1143M->816M(3072M), 0.0179329 secs]
2018-07-05T16:30:38.213-0400: 1613.752: [GC pause (G1 Evacuation Pause) (young) 2652M->818M(3072M), 0.0254017 secs]
2018-07-05T16:31:17.326-0400: 1652.865: [GC pause (G1 Evacuation Pause) (young) 2653M->819M(3072M), 0.0215341 secs]

So the first two normal cycles shrink the heap from 2756M to 1384M and then 1385M, and 1385M (right at the 45% occupancy threshold) is apparently the trigger to do something more. Several lines of enhanced GC follow: a concurrent marking cycle runs alongside normal work, and in the third line from the end a (mixed) collection of slightly older objects brings used space down to 816M. The final two lines show the new normal. Java still starts a GC when the heap fills up (now at 2652M instead of the 2756M before the cleanup), but a standard young collection now drops used space all the way down to 816M. That post-GC size will grow slowly again, and when it climbs back to about 1385M another large GC will be triggered.

However, once a day there is a special pattern. Things are going along normally with a sequence of (G1 Evacuation Pause) lines, and then:

2018-07-10T22:03:32.296-0400: 453587.834: [GC pause (G1 Evacuation Pause) (young) 2758M->1371M(3072M), 0.0122018 secs]
2018-07-10T22:03:54.692-0400: 453610.230: [GC pause (G1 Evacuation Pause) (young) 2757M->1371M(3072M), 0.0110118 secs]
2018-07-10T22:04:27.034-0400: 453642.573: [GC pause (G1 Humongous Allocation) (young) (initial-mark) 2730M->1375M(3072M), 0.0225954 secs]
2018-07-10T22:04:27.057-0400: 453642.595: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-root-region-scan-end, 0.0144775 secs]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-mark-start]
2018-07-10T22:04:29.448-0400: 453644.987: [GC concurrent-mark-end, 2.3771710 secs]
2018-07-10T22:04:29.449-0400: 453644.988: [GC remark, 0.0869320 secs]
2018-07-10T22:04:29.538-0400: 453645.077: [GC cleanup 2599M->2588M(3072M), 0.0133322 secs]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-start]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-end, 0.0000754 secs]
2018-07-10T22:04:30.240-0400: 453645.778: [GC pause (G1 Evacuation Pause) (young)-- 2924M->2554M(3072M), 5.2851767 secs]
2018-07-10T22:04:35.585-0400: 453651.123: [GC pause (G1 Evacuation Pause) (mixed) 2568M->2470M(3072M), 0.6164123 secs]
2018-07-10T22:04:36.537-0400: 453652.076: [GC pause (G1 Evacuation Pause) (mixed) 2617M->2251M(3072M), 0.1800178 secs]
2018-07-10T22:04:36.998-0400: 453652.537: [GC pause (G1 Evacuation Pause) (mixed) 2384M->2240M(3072M), 0.1866582 secs]
2018-07-10T22:04:37.460-0400: 453652.999: [GC pause (G1 Evacuation Pause) (young) 2373M->2296M(3072M), 0.1456074 secs]
2018-07-10T22:04:38.340-0400: 453653.879: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2429M->2309M(3072M), 0.0795252 secs]
2018-07-10T22:04:38.420-0400: 453653.959: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-root-region-scan-end, 0.0193931 secs]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-mark-start]
2018-07-10T22:04:38.797-0400: 453654.336: [GC pause (G1 Evacuation Pause) (young) 2450M->2310M(3072M), 0.0376059 secs]
2018-07-10T22:04:39.345-0400: 453654.884: [GC pause (G1 Evacuation Pause) (young) 2462M->2310M(3072M), 0.0146538 secs]
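
This is the daily metadata refresh. The trigger this time is a (G1 Humongous Allocation): G1 allocates any object bigger than half a region directly in the old generation, and reading the very large InCommon XML file presumably produces such objects. Note the follow-on evacuation pause of 5.28 seconds, hundreds of times longer than the roughly 20 millisecond pauses of the normal pattern, and the post-GC heap size that stays above 2200M through the end of the excerpt while G1 works through the old configuration. Since this pattern is normal once a day, a simple way to spot real trouble is to scan gc.log for pauses longer than expected. A sketch (the default path and the one-second threshold are assumptions to tune for your installation):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongPauseFinder {
    // Matches the trailing ", 5.2851767 secs]" pause time on each record.
    private static final Pattern SECS = Pattern.compile("(\\d+\\.\\d+) secs\\]");

    public static void main(String[] args) throws IOException {
        Path log = Paths.get(args.length > 0 ? args[0] : "gc.log");
        double threshold = 1.0; // seconds; anything longer is worth a look
        for (String line : Files.readAllLines(log)) {
            Matcher m = SECS.matcher(line);
            if (m.find() && Double.parseDouble(m.group(1)) > threshold) {
                System.out.println(line);
            }
        }
    }
}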

