Heap and Garbage Collection
We have had a small number of Out of Memory problems in Shibboleth. To recognize an actual memory problem, someone responsible for Shibboleth administration needs to understand Shibboleth's normal memory use patterns. These show up in the Tomcat logs/gc.log file.
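For reference, this kind of GC log is produced by JVM options passed to Tomcat at startup. The exact options on our server are not reproduced here; a representative Java 8 fragment for Tomcat's setenv.sh would look something like this (illustrative only, the real file may differ):
Code Block
# Illustrative setenv.sh fragment (Java 8 flag syntax); actual flags on the server may differ
JAVA_OPTS="$JAVA_OPTS -Xms3072m -Xmx3072m -XX:+UseG1GC"
JAVA_OPTS="$JAVA_OPTS -Xloggc:$CATALINA_BASE/logs/gc.log -XX:+PrintGCDateStamps"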
...
Because a standard (young) collection cycle only looks at recently allocated objects and misses older ones, the occupied heap grows over time until it hits a threshold that Java chooses based on the 3072M total heap size. Then it does a more comprehensive cycle. The pattern looks like:
Code Block
2018-07-05T16:28:48.297-0400: 1503.835: [GC pause (G1 Evacuation Pause) (young) 2756M->1384M(3072M), 0.0217201 secs]
2018-07-05T16:29:10.732-0400: 1526.271: [GC pause (G1 Evacuation Pause) (young) 2756M->1385M(3072M), 0.0159584 secs]
2018-07-05T16:29:35.574-0400: 1551.113: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2756M->1386M(3072M), 0.0195857 secs]
2018-07-05T16:29:35.594-0400: 1551.133: [GC concurrent-root-region-scan-start]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-root-region-scan-end, 0.0187561 secs]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-mark-start]
2018-07-05T16:29:37.340-0400: 1552.878: [GC concurrent-mark-end, 1.7271823 secs]
2018-07-05T16:29:37.340-0400: 1552.879: [GC remark, 0.0415657 secs]
2018-07-05T16:29:37.382-0400: 1552.921: [GC cleanup 1393M->1002M(3072M), 0.0087241 secs]
2018-07-05T16:29:37.391-0400: 1552.930: [GC concurrent-cleanup-start]
2018-07-05T16:29:37.392-0400: 1552.931: [GC concurrent-cleanup-end, 0.0007280 secs]
2018-07-05T16:30:04.776-0400: 1580.314: [GC pause (G1 Evacuation Pause) (young) 2364M->998M(3072M), 0.0227124 secs]
2018-07-05T16:30:05.045-0400: 1580.583: [GC pause (G1 Evacuation Pause) (mixed) 1143M->816M(3072M), 0.0179329 secs]
2018-07-05T16:30:38.213-0400: 1613.752: [GC pause (G1 Evacuation Pause) (young) 2652M->818M(3072M), 0.0254017 secs]
2018-07-05T16:31:17.326-0400: 1652.865: [GC pause (G1 Evacuation Pause) (young) 2653M->819M(3072M), 0.0215341 secs]
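When scanning a large gc.log by hand gets tedious, a couple of quick filters can pick out the comprehensive cycles and any unusually slow collections (the file path and the one-second threshold below are illustrative, not a prescribed procedure):
Code Block
# show only the comprehensive-cycle markers (initial-mark and mixed collections)
grep -E "initial-mark|\(mixed\)" gc.log
# show any GC event that took a full second or more
grep -E ", [1-9][0-9]*\.[0-9]+ secs\]" gc.log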
...
However, once a day there is a special pattern. Things go along normally with a sequence of (G1 Evacuation Pause) lines, and then:
Code Block
2018-07-10T22:03:32.296-0400: 453587.834: [GC pause (G1 Evacuation Pause) (young) 2758M->1371M(3072M), 0.0122018 secs]
2018-07-10T22:03:54.692-0400: 453610.230: [GC pause (G1 Evacuation Pause) (young) 2757M->1371M(3072M), 0.0110118 secs]
2018-07-10T22:04:27.034-0400: 453642.573: [GC pause (G1 Humongous Allocation) (young) (initial-mark) 2730M->1375M(3072M), 0.0225954 secs]
2018-07-10T22:04:27.057-0400: 453642.595: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-root-region-scan-end, 0.0144775 secs]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-mark-start]
2018-07-10T22:04:29.448-0400: 453644.987: [GC concurrent-mark-end, 2.3771710 secs]
2018-07-10T22:04:29.449-0400: 453644.988: [GC remark, 0.0869320 secs]
2018-07-10T22:04:29.538-0400: 453645.077: [GC cleanup 2599M->2588M(3072M), 0.0133322 secs]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-start]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-end, 0.0000754 secs]
2018-07-10T22:04:30.240-0400: 453645.778: [GC pause (G1 Evacuation Pause) (young)-- 2924M->2554M(3072M), 5.2851767 secs]
2018-07-10T22:04:35.585-0400: 453651.123: [GC pause (G1 Evacuation Pause) (mixed) 2568M->2470M(3072M), 0.6164123 secs]
2018-07-10T22:04:36.537-0400: 453652.076: [GC pause (G1 Evacuation Pause) (mixed) 2617M->2251M(3072M), 0.1800178 secs]
2018-07-10T22:04:36.998-0400: 453652.537: [GC pause (G1 Evacuation Pause) (mixed) 2384M->2240M(3072M), 0.1866582 secs]
2018-07-10T22:04:37.460-0400: 453652.999: [GC pause (G1 Evacuation Pause) (young) 2373M->2296M(3072M), 0.1456074 secs]
2018-07-10T22:04:38.340-0400: 453653.879: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2429M->2309M(3072M), 0.0795252 secs]
2018-07-10T22:04:38.420-0400: 453653.959: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-root-region-scan-end, 0.0193931 secs]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-mark-start]
2018-07-10T22:04:38.797-0400: 453654.336: [GC pause (G1 Evacuation Pause) (young) 2450M->2310M(3072M), 0.0376059 secs]
2018-07-10T22:04:39.345-0400: 453654.884: [GC pause (G1 Evacuation Pause) (young) 2462M->2310M(3072M), 0.0146538 secs]
2018-07-10T22:04:39.645-0400: 453655.183: [GC pause (G1 Evacuation Pause) (young) 2462M->2309M(3072M), 0.0077577 secs]
2018-07-10T22:04:40.072-0400: 453655.611: [GC pause (G1 Evacuation Pause) (young) 2461M->2311M(3072M), 0.0140397 secs]
2018-07-10T22:04:40.684-0400: 453656.223: [GC pause (G1 Evacuation Pause) (young) 2461M->2314M(3072M), 0.0273577 secs]
2018-07-10T22:04:42.569-0400: 453658.108: [GC concurrent-mark-end, 4.1299612 secs]
2018-07-10T22:04:42.570-0400: 453658.109: [GC remark, 0.0637912 secs]
2018-07-10T22:04:42.635-0400: 453658.173: [GC cleanup 2366M->2363M(3072M), 0.0119712 secs]
2018-07-10T22:04:42.647-0400: 453658.186: [GC concurrent-cleanup-start]
2018-07-10T22:04:42.647-0400: 453658.186: [GC concurrent-cleanup-end, 0.0000295 secs]
2018-07-10T22:04:50.229-0400: 453665.768: [GC pause (G1 Evacuation Pause) (young) 2577M->2262M(3072M), 0.0268132 secs]
2018-07-10T22:04:50.497-0400: 453666.035: [GC pause (G1 Evacuation Pause) (mixed) 2408M->2013M(3072M), 0.1205804 secs]
2018-07-10T22:04:50.869-0400: 453666.407: [GC pause (G1 Evacuation Pause) (mixed) 2160M->1826M(3072M), 0.2268685 secs]
2018-07-10T22:05:01.420-0400: 453676.959: [GC pause (G1 Evacuation Pause) (mixed) 1973M->1654M(3072M), 0.2596414 secs]
2018-07-10T22:05:02.019-0400: 453677.557: [GC pause (G1 Evacuation Pause) (mixed) 1800M->1568M(3072M), 0.2675552 secs]
2018-07-10T22:05:02.593-0400: 453678.132: [GC pause (G1 Evacuation Pause) (young) 1715M->1568M(3072M), 0.0514932 secs]
2018-07-10T22:05:07.913-0400: 453683.452: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 1715M->1568M(3072M), 0.0257594 secs]
2018-07-10T22:05:07.939-0400: 453683.478: [GC concurrent-root-region-scan-start]
2018-07-10T22:05:07.958-0400: 453683.497: [GC concurrent-root-region-scan-end, 0.0193304 secs]
2018-07-10T22:05:07.958-0400: 453683.497: [GC concurrent-mark-start]
2018-07-10T22:05:11.194-0400: 453686.732: [GC concurrent-mark-end, 3.2352939 secs]
2018-07-10T22:05:11.194-0400: 453686.733: [GC remark, 0.2785849 secs]
2018-07-10T22:05:11.476-0400: 453687.015: [GC cleanup 1626M->1626M(3072M), 0.0129068 secs]
2018-07-10T22:05:13.011-0400: 453688.550: [GC pause (G1 Evacuation Pause) (young) 1714M->1568M(3072M), 0.0219448 secs]
2018-07-10T22:05:13.277-0400: 453688.816: [GC pause (G1 Evacuation Pause) (mixed) 1714M->1529M(3072M), 0.0537970 secs]
2018-07-10T22:05:13.588-0400: 453689.127: [GC pause (G1 Evacuation Pause) (young) 1676M->1529M(3072M), 0.0523330 secs]
2018-07-10T22:05:13.897-0400: 453689.436: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 1676M->1529M(3072M), 0.0235064 secs]
On the third line of the log above (the 22:04:27 entry) we see a "Humongous Allocation". To determine the cause, check /apps/shibboleth-idp/logs/idp-process.log at the same time, where we find:
2018-07-10 22:04:41,094 - INFO [org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver:306] - Metadata Resolver FileBackedHTTPMetadataResolver ICMD: Next refresh cycle for metadata provider 'http://md.incommon.org/InCommon/InCommon-metadata.xml' will occur on '2018-07-11T20:04:30.318Z' ('2018-07-11T16:04:30.318-04:00' local time)
So we are not shocked that at 22:04 Shibboleth started to process the InCommon Metadata and its memory allocation started to struggle.
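To confirm the correlation, one can line up the timestamps in the two logs. A quick check along these lines would do it (the gc.log path is a placeholder for wherever the local Tomcat writes it):
Code Block
# find the Humongous Allocation events in the GC log
grep "Humongous" /path-to-tomcat/logs/gc.log
# find the InCommon metadata refresh messages in the IdP log
grep "InCommon-metadata" /apps/shibboleth-idp/logs/idp-process.log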
After that we get a sequence of GC operations that accomplish relatively little (they recover 150 to 200M for each operation rather than the typical 1250M).
From this we hypothesize that the 3072M heap is just barely big enough to hold two copies of the InCommon metadata at the same time, but we are so close to the maximum that GC struggles to provide enough working space to process the InCommon Metadata objects and prepare for the swap. This lasts only about a minute, so by 22:05 (near the end of the log above) we are back down to 1568M after GC (although that is after a more aggressive GC).
The cycles of more aggressive GC continue about once a minute for almost half an hour before things finally return to normal:
Code Block
2018-07-10T22:35:12.111-0400: 455487.649: [GC pause (G1 Evacuation Pause) (mixed) 1605M->1452M(3072M), 0.0130478 secs]
2018-07-10T22:35:44.463-0400: 455520.002: [GC pause (G1 Evacuation Pause) (young) 2759M->1454M(3072M), 0.0158096 secs]
2018-07-10T22:36:06.669-0400: 455542.208: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2758M->1455M(3072M), 0.0121612 secs]
2018-07-10T22:36:06.681-0400: 455542.220: [GC concurrent-root-region-scan-start]
2018-07-10T22:36:06.685-0400: 455542.223: [GC concurrent-root-region-scan-end, 0.0036414 secs]
2018-07-10T22:36:06.685-0400: 455542.223: [GC concurrent-mark-start]
2018-07-10T22:36:08.384-0400: 455543.922: [GC concurrent-mark-end, 1.6988816 secs]
2018-07-10T22:36:08.384-0400: 455543.923: [GC remark, 0.0463513 secs]
2018-07-10T22:36:08.432-0400: 455543.970: [GC cleanup 1667M->1164M(3072M), 0.0151417 secs]
2018-07-10T22:36:08.447-0400: 455543.986: [GC concurrent-cleanup-start]
2018-07-10T22:36:08.448-0400: 455543.987: [GC concurrent-cleanup-end, 0.0016228 secs]
2018-07-10T22:36:28.910-0400: 455564.449: [GC pause (G1 Evacuation Pause) (young) 2255M->952M(3072M), 0.0115036 secs]
2018-07-10T22:36:39.180-0400: 455574.719: [GC pause (G1 Evacuation Pause) (mixed) 1102M->890M(3072M), 0.0423113 secs]
So at the start of this excerpt (22:35) the GC cycles still end around 1454M, and then after what is roughly the 30th aggressive GC we finally drop down to 890M, which is what we expect to see in the normal state.
So when Shibboleth processes the InCommon Metadata aggregate, its heap working set jumps to just below 2500M, leaving little room for the small transient objects created while processing the metadata. Within half a minute that drops to a more comfortable 1500M, but Shibboleth GC appears to be left in a "panicked" state for a while.
I believe there is another memory use that gets bumped up during the 30 minute period but does not show up in the GC log because it is a type of memory not managed by GC. That would explain why GC is being triggered at much lower starting numbers (around 1800M) where before it was triggered near 2750M. Interestingly, after 30 minutes the starting value jumps back from 1605M (the 22:35:12 line) to the more normal 2759M (the 22:35:44 line), and after the next full cycle the ending number has returned to 890M. That means something happened to Java memory around 22:35, but whatever it is, it doesn't show up as heap or GC.
Linux VM
Shibboleth runs on a 6G VM
The Java+Tomcat+Shibboleth process uses about 3.6G of active data resident in real memory, but the process address space is 6.25G of virtual memory (the extra space being memory that the process has reserved by system calls but is not really using and, based on observation, will never actually use).
The Linux system seems to keep 0.6G free and 1.4G used as data cache, or a combined 2G of memory not allocated to programs and available for assignment.
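These numbers come from ordinary Linux tools; roughly speaking (commands shown for orientation, not exact output from this server):
Code Block
free -m                          # free (~0.6G) and cached (~1.4G) memory for the whole VM
ps -o pid,rss,vsz,comm -C java   # RSS ~3.6G resident and VSZ ~6.25G virtual for the Tomcat JVM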
Java has a 3G heap. The 3.6G process memory use suggests that in addition to the 3G heap (which is fully used because Java fills it up before starting GC), there is about 0.6G of code, classes, connections, and other objects used by Java, Tomcat, Shib, and their libraries.
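If we ever want to break that ~0.6G of non-heap memory down further, the JDK can report it directly, provided Native Memory Tracking was enabled at startup (a hypothetical procedure; the PID is whatever Tomcat's is):
Code Block
# add this to JAVA_OPTS and restart (small overhead):
#   -XX:NativeMemoryTracking=summary
# then ask the running JVM for a breakdown of heap, classes, threads, code cache, GC, etc.:
jcmd <tomcat-pid> VM.native_memory summary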
In newer versions of Java there is an additional memory pool used for NIO (New I/O) "direct" I/O buffers. Each buffer in this area is connected to an object on the heap, and when the heap object is freed by GC the corresponding direct memory buffer is also freed. However, while Tomcat and other Java web servers use NIO for modern high-performance web I/O to browsers, Shibboleth does nothing this exotic, so there is no need to change the parameter that limits this pool.
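For reference, the limit on that direct-buffer pool is a separate JVM option; if it ever did need to be raised (and we do not believe it does), the change would look like:
Code Block
# illustrative only; Shibboleth does not need this
JAVA_OPTS="$JAVA_OPTS -XX:MaxDirectMemorySize=256m"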
Increasing the heap delays GC until the extra space is full, so it increases real memory use. At a minimum, if we bump the heap from its current 3G to 4G and do not want to take any chances, we should also bump the VM by 1G. However, memory is cheap, and it would be useful to see how all the numbers change if we also increase system memory.
So the proposal for now is to bump the VM from 6G to 8G and the heap from 3G to 4G, then watch the numbers. We expect each routine (young) GC to run a bit longer but happen less frequently. Users will not see a difference.
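A sketch of what the proposed change amounts to, assuming the usual Tomcat setenv.sh convention for setting heap options (the variable name and exact values are illustrative):
Code Block
# before: -Xms3072m -Xmx3072m on a 6G VM
# after:  4G heap on an 8G VM
JAVA_OPTS="$JAVA_OPTS -Xms4096m -Xmx4096m"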