
Shibboleth Memory Utilization

Heap and Garbage Collection

We have had a small number of Out of Memory problems in Shibboleth. In order to detect actual memory problems, someone responsible for Shibboleth administration needs to understand the normal Shibboleth memory use patterns. These show up in the Tomcat logs/gc.log file.
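
The exact JVM options configured on the server are not recorded in the log itself, but output in the format quoted below is what the Java 8 G1 collector writes when GC logging is enabled with options roughly like the following (the log path here is illustrative, not a statement about our startup scripts):

-XX:+UseG1GC -Xmx3072m -Xloggc:logs/gc.log -XX:+PrintGCDateStamps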

Traditional programming languages like C have explicit memory management service calls. You allocate memory by calling one routine, then free it with another. The advantage is that memory is immediately freed for reuse, but the problem is that programming bugs can call the memory free routine prematurely while the memory in question is actually still being used by another part of the program. So Java and other more modern application programming languages use Garbage Collection (GC). A program does not free memory explicitly, but simply discards any reference to an object. When free memory gets low enough, Java GC automatically audits objects to find those that are no longer referenced and are safe to discard.
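
As a minimal illustration of the difference (a toy example, not Shibboleth code): a Java program never calls a free routine, it just drops its references and lets GC find the garbage.

public class GcDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[10 * 1024 * 1024];   // roughly what a C program would malloc()
        buffer[0] = 1;                                 // ... use the memory ...
        buffer = null;     // discard the only reference; nothing is freed at this moment
        System.gc();       // only a hint; the JVM reclaims the array whenever GC actually runs
    }
}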

Garbage Collection uses strategies based on object age. Because most objects are short-lived, GC divides objects into groups by age. It frequently checks recently allocated objects, which have a high probability of turning out to be unreferenced and discardable. Less frequently it checks objects that are moderately old, and even less frequently it checks objects that have been in use for a very long time.

These strategies turn out to be a poor match for Shibboleth. It has one really, really big memory use. Once a day it needs to read in the InCommon metadata, which contains tens of thousands of configuration objects for various partner sites. Shibboleth parses the entire XML file, building a complete set of objects to describe the new configuration while keeping the old configuration objects in memory. Then, when everything has been processed, it switches the active Metadata pointer to the new set of objects, making all the old objects unreferenced. Because those objects were in use for a day, they are very old and will not be freed until GC gets around to checking the set of very old objects. It just happens that the design and use pattern of Shibboleth and InCommon is the complete opposite of the standard GC strategy.
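
A minimal sketch of that reload-and-swap pattern (the class and method names here are invented for illustration; this is not Shibboleth's actual code):

import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class MetadataHolder {
    private final AtomicReference<List<Object>> active = new AtomicReference<>();

    List<Object> current() {
        return active.get();
    }

    void reload(List<Object> freshlyParsedMetadata) {
        // While a reload runs, both the old and the new object graphs are live on the heap.
        List<Object> old = active.getAndSet(freshlyParsedMetadata);
        // 'old' is now unreferenced, but it is a day old (tenured), so only an
        // old-generation/mixed collection will actually reclaim it.
    }
}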

In the gc.log file, there is normally a long string of cycles that process new (young) objects:

2018-07-05T16:22:16.838-0400: 1112.377: [GC pause (G1 Evacuation Pause) (young) 2755M->1380M(3072M), 0.0141264 secs]
At 16:22 there was a standard cycle: the heap, whose total size is always 3072M given these JVM parameters, had reached 2755M of allocated space. By deleting "young" objects, heap use was reduced to 1380M.
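
When scanning a long gc.log by hand gets tedious, the before/after/total figures can be pulled out of each evacuation-pause line with a small parser. This is a convenience sketch, not part of Shibboleth or Tomcat:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLineParser {
    // Matches the tail of a pause line, e.g. "... 2755M->1380M(3072M), 0.0141264 secs]"
    private static final Pattern PAUSE =
            Pattern.compile("(\\d+)M->(\\d+)M\\((\\d+)M\\), ([0-9.]+) secs");

    public static void main(String[] args) {
        String line = "2018-07-05T16:22:16.838-0400: 1112.377: "
                + "[GC pause (G1 Evacuation Pause) (young) 2755M->1380M(3072M), 0.0141264 secs]";
        Matcher m = PAUSE.matcher(line);
        if (m.find()) {
            System.out.printf("before=%sM after=%sM total=%sM pause=%ss%n",
                    m.group(1), m.group(2), m.group(3), m.group(4));
        }
    }
}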

Because a standard cycle over recently allocated objects misses older objects, the size of the heap after each GC grows over time until it hits a threshold that Java chooses based on the 3072M total heap size. Then it does a more comprehensive cycle. The pattern looks like:

2018-07-05T16:28:48.297-0400: 1503.835: [GC pause (G1 Evacuation Pause) (young) 2756M->1384M(3072M), 0.0217201 secs]
2018-07-05T16:29:10.732-0400: 1526.271: [GC pause (G1 Evacuation Pause) (young) 2756M->1385M(3072M), 0.0159584 secs]
2018-07-05T16:29:35.574-0400: 1551.113: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2756M->1386M(3072M), 0.0195857 secs]
2018-07-05T16:29:35.594-0400: 1551.133: [GC concurrent-root-region-scan-start]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-root-region-scan-end, 0.0187561 secs]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-mark-start]
2018-07-05T16:29:37.340-0400: 1552.878: [GC concurrent-mark-end, 1.7271823 secs]
2018-07-05T16:29:37.340-0400: 1552.879: [GC remark, 0.0415657 secs]
2018-07-05T16:29:37.382-0400: 1552.921: [GC cleanup 1393M->1002M(3072M), 0.0087241 secs]
2018-07-05T16:29:37.391-0400: 1552.930: [GC concurrent-cleanup-start]
2018-07-05T16:29:37.392-0400: 1552.931: [GC concurrent-cleanup-end, 0.0007280 secs]
2018-07-05T16:30:04.776-0400: 1580.314: [GC pause (G1 Evacuation Pause) (young) 2364M->998M(3072M), 0.0227124 secs]
2018-07-05T16:30:05.045-0400: 1580.583: [GC pause (G1 Evacuation Pause) (mixed) 1143M->816M(3072M), 0.0179329 secs]
2018-07-05T16:30:38.213-0400: 1613.752: [GC pause (G1 Evacuation Pause) (young) 2652M->818M(3072M), 0.0254017 secs]
2018-07-05T16:31:17.326-0400: 1652.865: [GC pause (G1 Evacuation Pause) (young) 2653M->819M(3072M), 0.0215341 secs]

So the first two normal cycles shrink the heap from 2756M to 1384M and then 1385M, but apparently 1385M is the trigger to do something more. We then go through several lines of enhanced GC, ending at the third-to-last line where a (mixed) cleanup of slightly older objects brings the used space down to 816M. The final two lines represent the new normal: Java starts a GC when the heap gets up to 2652M (instead of the old 2756M before the cleanup), but now a standard simple cleanup of recently created objects drops the used space all the way back down to the 816-819M range. That post-GC size will begin to grow slowly again and will continue growing until it gets back to 1385M and another large GC is triggered.

However, once a day there is a special pattern. Things are going along normally with a sequence of (G1 Evacuation Pause) lines, and then:

2018-07-10T22:03:32.296-0400: 453587.834: [GC pause (G1 Evacuation Pause) (young) 2758M->1371M(3072M), 0.0122018 secs]
2018-07-10T22:03:54.692-0400: 453610.230: [GC pause (G1 Evacuation Pause) (young) 2757M->1371M(3072M), 0.0110118 secs]
2018-07-10T22:04:27.034-0400: 453642.573: [GC pause (G1 Humongous Allocation) (young) (initial-mark) 2730M->1375M(3072M), 0.0225954 secs]
2018-07-10T22:04:27.057-0400: 453642.595: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-root-region-scan-end, 0.0144775 secs]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-mark-start]
2018-07-10T22:04:29.448-0400: 453644.987: [GC concurrent-mark-end, 2.3771710 secs]
2018-07-10T22:04:29.449-0400: 453644.988: [GC remark, 0.0869320 secs]
2018-07-10T22:04:29.538-0400: 453645.077: [GC cleanup 2599M->2588M(3072M), 0.0133322 secs]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-start]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-end, 0.0000754 secs]
2018-07-10T22:04:30.240-0400: 453645.778: [GC pause (G1 Evacuation Pause) (young)-- 2924M->2554M(3072M), 5.2851767 secs]
2018-07-10T22:04:35.585-0400: 453651.123: [GC pause (G1 Evacuation Pause) (mixed) 2568M->2470M(3072M), 0.6164123 secs]
2018-07-10T22:04:36.537-0400: 453652.076: [GC pause (G1 Evacuation Pause) (mixed) 2617M->2251M(3072M), 0.1800178 secs]
2018-07-10T22:04:36.998-0400: 453652.537: [GC pause (G1 Evacuation Pause) (mixed) 2384M->2240M(3072M), 0.1866582 secs]
2018-07-10T22:04:37.460-0400: 453652.999: [GC pause (G1 Evacuation Pause) (young) 2373M->2296M(3072M), 0.1456074 secs]
2018-07-10T22:04:38.340-0400: 453653.879: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2429M->2309M(3072M), 0.0795252 secs]
2018-07-10T22:04:38.420-0400: 453653.959: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-root-region-scan-end, 0.0193931 secs]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-mark-start]
2018-07-10T22:04:38.797-0400: 453654.336: [GC pause (G1 Evacuation Pause) (young) 2450M->2310M(3072M), 0.0376059 secs]
2018-07-10T22:04:39.345-0400: 453654.884: [GC pause (G1 Evacuation Pause) (young) 2462M->2310M(3072M), 0.0146538 secs]
2018-07-10T22:04:39.645-0400: 453655.183: [GC pause (G1 Evacuation Pause) (young) 2462M->2309M(3072M), 0.0077577 secs]
2018-07-10T22:04:40.072-0400: 453655.611: [GC pause (G1 Evacuation Pause) (young) 2461M->2311M(3072M), 0.0140397 secs]
2018-07-10T22:04:40.684-0400: 453656.223: [GC pause (G1 Evacuation Pause) (young) 2461M->2314M(3072M), 0.0273577 secs]
2018-07-10T22:04:42.569-0400: 453658.108: [GC concurrent-mark-end, 4.1299612 secs]
2018-07-10T22:04:42.570-0400: 453658.109: [GC remark, 0.0637912 secs]
2018-07-10T22:04:42.635-0400: 453658.173: [GC cleanup 2366M->2363M(3072M), 0.0119712 secs]
2018-07-10T22:04:42.647-0400: 453658.186: [GC concurrent-cleanup-start]
2018-07-10T22:04:42.647-0400: 453658.186: [GC concurrent-cleanup-end, 0.0000295 secs]
2018-07-10T22:04:50.229-0400: 453665.768: [GC pause (G1 Evacuation Pause) (young) 2577M->2262M(3072M), 0.0268132 secs]
2018-07-10T22:04:50.497-0400: 453666.035: [GC pause (G1 Evacuation Pause) (mixed) 2408M->2013M(3072M), 0.1205804 secs]
2018-07-10T22:04:50.869-0400: 453666.407: [GC pause (G1 Evacuation Pause) (mixed) 2160M->1826M(3072M), 0.2268685 secs]
2018-07-10T22:05:01.420-0400: 453676.959: [GC pause (G1 Evacuation Pause) (mixed) 1973M->1654M(3072M), 0.2596414 secs]
2018-07-10T22:05:02.019-0400: 453677.557: [GC pause (G1 Evacuation Pause) (mixed) 1800M->1568M(3072M), 0.2675552 secs]
2018-07-10T22:05:02.593-0400: 453678.132: [GC pause (G1 Evacuation Pause) (young) 1715M->1568M(3072M), 0.0514932 secs]
2018-07-10T22:05:07.913-0400: 453683.452: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 1715M->1568M(3072M), 0.0257594 secs]
2018-07-10T22:05:07.939-0400: 453683.478: [GC concurrent-root-region-scan-start]
2018-07-10T22:05:07.958-0400: 453683.497: [GC concurrent-root-region-scan-end, 0.0193304 secs]
2018-07-10T22:05:07.958-0400: 453683.497: [GC concurrent-mark-start]
2018-07-10T22:05:11.194-0400: 453686.732: [GC concurrent-mark-end, 3.2352939 secs]
2018-07-10T22:05:11.194-0400: 453686.733: [GC remark, 0.2785849 secs]
2018-07-10T22:05:11.476-0400: 453687.015: [GC cleanup 1626M->1626M(3072M), 0.0129068 secs]
2018-07-10T22:05:13.011-0400: 453688.550: [GC pause (G1 Evacuation Pause) (young) 1714M->1568M(3072M), 0.0219448 secs]
2018-07-10T22:05:13.277-0400: 453688.816: [GC pause (G1 Evacuation Pause) (mixed) 1714M->1529M(3072M), 0.0537970 secs]
2018-07-10T22:05:13.588-0400: 453689.127: [GC pause (G1 Evacuation Pause) (young) 1676M->1529M(3072M), 0.0523330 secs]
2018-07-10T22:05:13.897-0400: 453689.436: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 1676M->1529M(3072M), 0.0235064 secs]

On line 3 we see a "Humongous Allocation". To determine the cause, check the /apps/shibboleth-idp/logs/idp-process.log at the same time and find:

2018-07-10 22:04:41,094 - INFO [org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver:306] - Metadata Resolver FileBackedHTTPMetadataResolver ICMD: Next refresh cycle for metadata provider 'http://md.incommon.org/InCommon/InCommon-metadata.xml' will occur on '2018-07-11T20:04:30.318Z' ('2018-07-11T16:04:30.318-04:00' local time)

So we are not shocked that at 22:04 Shibboleth started to process the InCommon Metadata and its memory allocation started to struggle.
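
When correlating the two logs, a quick way to find the refresh entries is to filter idp-process.log for the resolver message (the path and message text are taken from the entry above; this is just a convenience sketch):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FindMetadataRefresh {
    public static void main(String[] args) throws IOException {
        // Print every metadata refresh announcement so its timestamps can be
        // lined up against the Humongous Allocation entries in gc.log.
        Files.lines(Paths.get("/apps/shibboleth-idp/logs/idp-process.log"))
             .filter(l -> l.contains("AbstractReloadingMetadataResolver"))
             .filter(l -> l.contains("Next refresh cycle"))
             .forEach(System.out::println);
    }
}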

After that we get a sequence of GC operations that accomplish relatively little (they recover 150 to 200M for each operation rather than the typical 1250M).

From this we hypothesize that the heap size of 3072M is just barely big enough to hold two copies of the InCommon metadata at the same time, but we are so close to the maximum amount of memory that GC is struggling to provide enough work space to process the InCommon Metadata objects and prepare for the swap. This lasts only about a minute, so by line 36 we are back down to 1568M after GC (although that is after a more aggressive GC).

The cycles of more aggressive GC continue about once a minute for almost half an hour before things finally return to normal:

2018-07-10T22:35:12.111-0400: 455487.649: [GC pause (G1 Evacuation Pause) (mixed) 1605M->1452M(3072M), 0.0130478 secs]
2018-07-10T22:35:44.463-0400: 455520.002: [GC pause (G1 Evacuation Pause) (young) 2759M->1454M(3072M), 0.0158096 secs]
2018-07-10T22:36:06.669-0400: 455542.208: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2758M->1455M(3072M), 0.0121612 secs]
2018-07-10T22:36:06.681-0400: 455542.220: [GC concurrent-root-region-scan-start]
2018-07-10T22:36:06.685-0400: 455542.223: [GC concurrent-root-region-scan-end, 0.0036414 secs]
2018-07-10T22:36:06.685-0400: 455542.223: [GC concurrent-mark-start]
2018-07-10T22:36:08.384-0400: 455543.922: [GC concurrent-mark-end, 1.6988816 secs]
2018-07-10T22:36:08.384-0400: 455543.923: [GC remark, 0.0463513 secs]
2018-07-10T22:36:08.432-0400: 455543.970: [GC cleanup 1667M->1164M(3072M), 0.0151417 secs]
2018-07-10T22:36:08.447-0400: 455543.986: [GC concurrent-cleanup-start]
2018-07-10T22:36:08.448-0400: 455543.987: [GC concurrent-cleanup-end, 0.0016228 secs]
2018-07-10T22:36:28.910-0400: 455564.449: [GC pause (G1 Evacuation Pause) (young) 2255M->952M(3072M), 0.0115036 secs]
2018-07-10T22:36:39.180-0400: 455574.719: [GC pause (G1 Evacuation Pause) (mixed) 1102M->890M(3072M), 0.0423113 secs]

So in the first two lines we are still doing GCs that end around 1452-1454M, and then, after what is roughly the 30th aggressive GC, we finally drop down to 890M, which is what we expect to see in the normal state.

So when Shibboleth aggregates the InCommon Metadata, its heap working set jumps to just below 2500M, leaving little room for the small transient objects needed to process it. Within about half a minute that drops to a more comfortable 1500M, but Shibboleth GC appears to be left in a "panicked" state for a while.

I believe that there is another memory use that gets bumped up during the 30-minute period but does not show up in the GC log because it is a type of memory not managed by GC. That would explain why GC is being triggered at much lower starting numbers (1800M) whereas before it was triggered at 2750M. Interestingly, after 30 minutes we jump back from the starting value of 1605M in line 1 to the more normal 2759M in line 2, and after the next full cycle the ending number has returned to 890M. That means that something happened to Java memory around 22:35, but whatever it is, it doesn't show up as heap or GC.

Linux VM

Shibboleth runs on a 6G VM.

The Java+Tomcat+Shibboleth process uses about 3.6G of active data resident in real memory, but the process address space is 6.25G of virtual memory (the extra space being memory that the process has reserved by system calls but is not really using and, based on observation, will never actually use).

The Linux system seems to keep 0.6G free and 1.4G used as data cache, or a combined 2G of memory not allocated to programs and available for assignment.

Java has a 3G heap. The 3.6G process memory use suggests that in addition to the 3G heap (which is fully used because Java fills it up before starting GC), there is about 0.6G of code, classes, connections, and other objects used by Java, Tomcat, Shib, and their libraries.

In newer versions of Java there is an additional type of Java memory pool used for NIO (new I/O). It holds "direct" I/O buffers. Each buffer in this area is connected to an object on the heap, and when the object on the heap is freed by GC the corresponding direct memory buffer is also freed. However, while Tomcat and other Java web servers use NIO for modern high-performance web I/O to browsers, Shibboleth itself does nothing this exotic, so there is no need to change the parameter that limits this pool (-XX:MaxDirectMemorySize).
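
For reference, this is what a direct NIO buffer looks like from the Java side; only a small tracking object sits on the heap, and the off-heap memory is released when GC collects that tracker (an illustrative snippet, unrelated to Shibboleth's own code):

import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024 * 1024);  // the 64MB lives outside the heap
        direct.put(0, (byte) 1);   // used like any other buffer
        direct = null;             // the off-heap memory comes back only after GC collects the tracker object
        System.gc();
    }
}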

Increasing the heap delays GC until the extra space is full, so it increases real memory use. Minimally, if we want to bump the heap size from its current 3G to 4G and do not want to take any chances, we would also bump the VM by 1G. However, memory is cheap and it would be useful to see how all the numbers we are seeing change if we also increase system memory.

So the proposal for now is to bump the VM from 6G to 8G and the heap from 3G to 4G and then watch the numbers. We expect that each routine (young) GC will run a bit longer but GC will happen less frequently. The users will not see a difference. 
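
The heap change itself is a one-line JVM option change; exactly where it is set depends on how Tomcat is started, but on a typical Tomcat install it would go in bin/setenv.sh (that location is an assumption for illustration, not a statement about our deployment):

JAVA_OPTS="$JAVA_OPTS -Xms4096m -Xmx4096m"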
