...

Because a standard cycle collects only recently allocated objects and misses older ones, the occupied heap grows over time until it crosses a threshold that Java chooses based on the 3072M total heap size. Then G1 does a more comprehensive cycle. The pattern looks like:

Code Block
linenumberstrue
2018-07-05T16:28:48.297-0400: 1503.835: [GC pause (G1 Evacuation Pause) (young) 2756M->1384M(3072M), 0.0217201 secs]
2018-07-05T16:29:10.732-0400: 1526.271: [GC pause (G1 Evacuation Pause) (young) 2756M->1385M(3072M), 0.0159584 secs]
2018-07-05T16:29:35.574-0400: 1551.113: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2756M->1386M(3072M), 0.0195857 secs]
2018-07-05T16:29:35.594-0400: 1551.133: [GC concurrent-root-region-scan-start]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-root-region-scan-end, 0.0187561 secs]
2018-07-05T16:29:35.613-0400: 1551.151: [GC concurrent-mark-start]
2018-07-05T16:29:37.340-0400: 1552.878: [GC concurrent-mark-end, 1.7271823 secs]
2018-07-05T16:29:37.340-0400: 1552.879: [GC remark, 0.0415657 secs]
2018-07-05T16:29:37.382-0400: 1552.921: [GC cleanup 1393M->1002M(3072M), 0.0087241 secs]
2018-07-05T16:29:37.391-0400: 1552.930: [GC concurrent-cleanup-start]
2018-07-05T16:29:37.392-0400: 1552.931: [GC concurrent-cleanup-end, 0.0007280 secs]
2018-07-05T16:30:04.776-0400: 1580.314: [GC pause (G1 Evacuation Pause) (young) 2364M->998M(3072M), 0.0227124 secs]
2018-07-05T16:30:05.045-0400: 1580.583: [GC pause (G1 Evacuation Pause) (mixed) 1143M->816M(3072M), 0.0179329 secs]
2018-07-05T16:30:38.213-0400: 1613.752: [GC pause (G1 Evacuation Pause) (young) 2652M->818M(3072M), 0.0254017 secs]
2018-07-05T16:31:17.326-0400: 1652.865: [GC pause (G1 Evacuation Pause) (young) 2653M->819M(3072M), 0.0215341 secs]
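
For reference, log lines in this format come from the pre-Java-9 GC logging options, and the "threshold" above is G1's initiating heap occupancy: the concurrent cycle starts once occupancy crosses -XX:InitiatingHeapOccupancyPercent (default 45, and 45% of 3072M is about 1382M, which lines up with initial-mark appearing once the post-GC level creeps up to roughly 1385M in the log above). Below is a minimal sketch of a start-script fragment that would produce this logging; the exact flags on our servers may differ:

Code Block
# sketch only -- assumed flags, not copied from the server config
-Xms3072m -Xmx3072m
-XX:+UseG1GC
-Xloggc:/apps/shibboleth-idp/logs/gc.log
-XX:+PrintGCDateStamps
# default is 45; lowering it starts the comprehensive (concurrent) cycle earlier
-XX:InitiatingHeapOccupancyPercent=45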

...

However, once a day there is a special pattern. Things are going along normally with a sequence of (G1 Evacuation Pause) lines, and then:

Code Block
linenumberstrue
2018-07-10T22:03:32.296-0400: 453587.834: [GC pause (G1 Evacuation Pause) (young) 2758M->1371M(3072M), 0.0122018 secs]
2018-07-10T22:03:54.692-0400: 453610.230: [GC pause (G1 Evacuation Pause) (young) 2757M->1371M(3072M), 0.0110118 secs]
2018-07-10T22:04:27.034-0400: 453642.573: [GC pause (G1 Humongous Allocation) (young) (initial-mark) 2730M->1375M(3072M), 0.0225954 secs]
2018-07-10T22:04:27.057-0400: 453642.595: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-root-region-scan-end, 0.0144775 secs]
2018-07-10T22:04:27.071-0400: 453642.610: [GC concurrent-mark-start]
2018-07-10T22:04:29.448-0400: 453644.987: [GC concurrent-mark-end, 2.3771710 secs]
2018-07-10T22:04:29.449-0400: 453644.988: [GC remark, 0.0869320 secs]
2018-07-10T22:04:29.538-0400: 453645.077: [GC cleanup 2599M->2588M(3072M), 0.0133322 secs]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-start]
2018-07-10T22:04:29.554-0400: 453645.092: [GC concurrent-cleanup-end, 0.0000754 secs]
2018-07-10T22:04:30.240-0400: 453645.778: [GC pause (G1 Evacuation Pause) (young)-- 2924M->2554M(3072M), 5.2851767 secs]
2018-07-10T22:04:35.585-0400: 453651.123: [GC pause (G1 Evacuation Pause) (mixed) 2568M->2470M(3072M), 0.6164123 secs]
2018-07-10T22:04:36.537-0400: 453652.076: [GC pause (G1 Evacuation Pause) (mixed) 2617M->2251M(3072M), 0.1800178 secs]
2018-07-10T22:04:36.998-0400: 453652.537: [GC pause (G1 Evacuation Pause) (mixed) 2384M->2240M(3072M), 0.1866582 secs]
2018-07-10T22:04:37.460-0400: 453652.999: [GC pause (G1 Evacuation Pause) (young) 2373M->2296M(3072M), 0.1456074 secs]
2018-07-10T22:04:38.340-0400: 453653.879: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2429M->2309M(3072M), 0.0795252 secs]
2018-07-10T22:04:38.420-0400: 453653.959: [GC concurrent-root-region-scan-start]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-root-region-scan-end, 0.0193931 secs]
2018-07-10T22:04:38.439-0400: 453653.978: [GC concurrent-mark-start]
2018-07-10T22:04:38.797-0400: 453654.336: [GC pause (G1 Evacuation Pause) (young) 2450M->2310M(3072M), 0.0376059 secs]
2018-07-10T22:04:39.345-0400: 453654.884: [GC pause (G1 Evacuation Pause) (young) 2462M->2310M(3072M), 0.0146538 secs]
2018-07-10T22:04:39.645-0400: 453655.183: [GC pause (G1 Evacuation Pause) (young) 2462M->2309M(3072M), 0.0077577 secs]
2018-07-10T22:04:40.072-0400: 453655.611: [GC pause (G1 Evacuation Pause) (young) 2461M->2311M(3072M), 0.0140397 secs]
2018-07-10T22:04:40.684-0400: 453656.223: [GC pause (G1 Evacuation Pause) (young) 2461M->2314M(3072M), 0.0273577 secs]
2018-07-10T22:04:42.569-0400: 453658.108: [GC concurrent-mark-end, 4.1299612 secs]
2018-07-10T22:04:42.570-0400: 453658.109: [GC remark, 0.0637912 secs]
2018-07-10T22:04:42.635-0400: 453658.173: [GC cleanup 2366M->2363M(3072M), 0.0119712 secs]
2018-07-10T22:04:42.647-0400: 453658.186: [GC concurrent-cleanup-start]
2018-07-10T22:04:42.647-0400: 453658.186: [GC concurrent-cleanup-end, 0.0000295 secs]
2018-07-10T22:04:50.229-0400: 453665.768: [GC pause (G1 Evacuation Pause) (young) 2577M->2262M(3072M), 0.0268132 secs]
2018-07-10T22:04:50.497-0400: 453666.035: [GC pause (G1 Evacuation Pause) (mixed) 2408M->2013M(3072M), 0.1205804 secs]
2018-07-10T22:04:50.869-0400: 453666.407: [GC pause (G1 Evacuation Pause) (mixed) 2160M->1826M(3072M), 0.2268685 secs]
2018-07-10T22:05:01.420-0400: 453676.959: [GC pause (G1 Evacuation Pause) (mixed) 1973M->1654M(3072M), 0.2596414 secs]
2018-07-10T22:05:02.019-0400: 453677.557: [GC pause (G1 Evacuation Pause) (mixed) 1800M->1568M(3072M), 0.2675552 secs]
2018-07-10T22:05:02.593-0400: 453678.132: [GC pause (G1 Evacuation Pause) (young) 1715M->1568M(3072M), 0.0514932 secs]
2018-07-10T22:05:07.913-0400: 453683.452: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 1715M->1568M(3072M), 0.0257594 secs]
2018-07-10T22:05:07.939-0400: 453683.478: [GC concurrent-root-region-scan-start]
2018-07-10T22:05:07.958-0400: 453683.497: [GC concurrent-root-region-scan-end, 0.0193304 secs]
2018-07-10T22:05:07.958-0400: 453683.497: [GC concurrent-mark-start]
2018-07-10T22:05:11.194-0400: 453686.732: [GC concurrent-mark-end, 3.2352939 secs]
2018-07-10T22:05:11.194-0400: 453686.733: [GC remark, 0.2785849 secs]
2018-07-10T22:05:11.476-0400: 453687.015: [GC cleanup 1626M->1626M(3072M), 0.0129068 secs]
2018-07-10T22:05:13.011-0400: 453688.550: [GC pause (G1 Evacuation Pause) (young) 1714M->1568M(3072M), 0.0219448 secs]
2018-07-10T22:05:13.277-0400: 453688.816: [GC pause (G1 Evacuation Pause) (mixed) 1714M->1529M(3072M), 0.0537970 secs]
2018-07-10T22:05:13.588-0400: 453689.127: [GC pause (G1 Evacuation Pause) (young) 1676M->1529M(3072M), 0.0523330 secs]
2018-07-10T22:05:13.897-0400: 453689.436: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 1676M->1529M(3072M), 0.0235064 secs]

On line 3 we see a "Humongous Allocation". To determine the cause, check /apps/shibboleth-idp/logs/idp-process.log at the same timestamp, where we find:

Code Block
2018-07-10 22:04:41,094 - INFO [org.opensaml.saml.metadata.resolver.impl.AbstractReloadingMetadataResolver:306] - Metadata Resolver FileBackedHTTPMetadataResolver ICMD: Next refresh cycle for metadata provider 'http://md.incommon.org/InCommon/InCommon-metadata.xml' will occur on '2018-07-11T20:04:30.318Z' ('2018-07-11T16:04:30.318-04:00' local time)

So we are not shocked that at 22:04 Shibboleth started to process the InCommon metadata and its memory allocation began to struggle.
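
For context on the trigger itself: G1 carves the heap into fixed-size regions, by default roughly heap/2048 rounded down to a power of two between 1MB and 32MB, and any single allocation of at least half a region is classed as "humongous" and goes straight to the old generation. Here is a back-of-the-envelope sketch of that arithmetic for our heap (my reading of G1's default ergonomics, not something measured on the server):

Code Block
// Rough G1 region math (assumes default ergonomics; the region size can
// also be pinned explicitly with -XX:G1HeapRegionSize).
public class G1RegionMath {
    public static void main(String[] args) {
        long heap = 3072L << 20;                  // -Xmx3072m, in bytes
        long target = heap / 2048;                // G1 aims for ~2048 regions -> 1.5MB
        long region = Long.highestOneBit(target); // round down to a power of two -> 1MB
        region = Math.min(Math.max(region, 1L << 20), 32L << 20); // clamp to [1MB, 32MB]
        System.out.println("region size  ~ " + (region >> 20) + "MB");
        System.out.println("humongous at ~ " + ((region / 2) >> 10) + "KB");
    }
}

So with 1MB regions, any single object of about 512KB or more, such as a buffer holding a large piece of the downloaded InCommon XML, is a humongous allocation, which would explain why the refresh announces itself this way in the GC log.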

After that we get a sequence of GC operations that accomplish relatively little (each recovers 150 to 200M rather than the typical 1250M).

From this we hypothesize that the heap size of 3072M is just barely big enough to hold two copies of the InCommon metadata at the same time, but we are so close to the maximum that GC struggles to provide enough working space to process the InCommon metadata objects and prepare for the swap. This lasts only about a minute; by line 36 we are back down to 1568M after GC (although only after a more aggressive GC).
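
The "two copies" part of that hypothesis follows from how a reloading metadata resolver typically works: the newly fetched document is parsed into a fresh object graph while the old one is still live, and only then is a reference swapped. A hypothetical illustration of the pattern (not actual Shibboleth source):

Code Block
import java.util.function.Function;

// Hypothetical double-buffer pattern, for illustration only. The old tree
// stays reachable while the new one is built, so peak heap demand during a
// refresh is roughly double the steady-state metadata footprint.
class ReloadingResolver<T> {
    private volatile T current;

    void refresh(byte[] fetchedXml, Function<byte[], T> parser) {
        T fresh = parser.apply(fetchedXml); // new copy built; old copy still referenced
        current = fresh;                    // atomic swap; old copy becomes garbage
    }

    T metadata() {
        return current;
    }
}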

The cycles of more aggressive GC continue about once a minute for almost half an hour before things finally return to normal:

Code Block
linenumberstrue
2018-07-10T22:35:12.111-0400: 455487.649: [GC pause (G1 Evacuation Pause) (mixed) 1605M->1452M(3072M), 0.0130478 secs]
2018-07-10T22:35:44.463-0400: 455520.002: [GC pause (G1 Evacuation Pause) (young) 2759M->1454M(3072M), 0.0158096 secs]
2018-07-10T22:36:06.669-0400: 455542.208: [GC pause (G1 Evacuation Pause) (young) (initial-mark) 2758M->1455M(3072M), 0.0121612 secs]
2018-07-10T22:36:06.681-0400: 455542.220: [GC concurrent-root-region-scan-start]
2018-07-10T22:36:06.685-0400: 455542.223: [GC concurrent-root-region-scan-end, 0.0036414 secs]
2018-07-10T22:36:06.685-0400: 455542.223: [GC concurrent-mark-start]
2018-07-10T22:36:08.384-0400: 455543.922: [GC concurrent-mark-end, 1.6988816 secs]
2018-07-10T22:36:08.384-0400: 455543.923: [GC remark, 0.0463513 secs]
2018-07-10T22:36:08.432-0400: 455543.970: [GC cleanup 1667M->1164M(3072M), 0.0151417 secs]
2018-07-10T22:36:08.447-0400: 455543.986: [GC concurrent-cleanup-start]
2018-07-10T22:36:08.448-0400: 455543.987: [GC concurrent-cleanup-end, 0.0016228 secs]
2018-07-10T22:36:28.910-0400: 455564.449: [GC pause (G1 Evacuation Pause) (young) 2255M->952M(3072M), 0.0115036 secs]
2018-07-10T22:36:39.180-0400: 455574.719: [GC pause (G1 Evacuation Pause) (mixed) 1102M->890M(3072M), 0.0423113 secs]

So at lines 1 and 2 we are still seeing GCs that end around 1452 to 1454M, and then, after what is roughly the 30th aggressive GC, we finally drop down to 890M on the last line, which is what we expect to see in the normal state.

So when Shibboleth aggregates the InCommon metadata its heap working set jumps to just below 2500M, leaving little room for the small transient objects created to process it. Within half a minute that drops to a more comfortable 1500M, but the garbage collector appears to be left in a "panicked" state for a while.

I believe that there is another memory use that gets bumped up during the 30-minute period but does not show up in the GC log because it is a type of memory not managed by GC. That would explain why GC is being triggered at much lower starting occupancies (around 1800M) where before it was triggered at 2750M. Interestingly, after 30 minutes the starting value jumps back from 1605M in line 1 to the more normal 2759M in line 2, and after the next full cycle the ending number has returned to 890M. That means something happened to Java memory around 22:35, but whatever it is does not show up as heap in the GC log.
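
One way to test that hypothesis would be to turn on the JVM's Native Memory Tracking and snapshot the non-heap categories around 22:04 and again around 22:35. A sketch (NMT adds a small overhead and requires a restart; <pid> is the IdP's Java process id):

Code Block
# add to the JVM options at startup
-XX:NativeMemoryTracking=summary

# then, around the event, sample non-heap usage (thread stacks, GC
# bookkeeping, internal buffers, metaspace) for the running process
jcmd <pid> VM.native_memory summary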

I propose that we increase Java memory in general as well as the heap size. It seems clear that we need more than a 3G heap to process InCommon.
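
As a concrete starting point, and these sizes are a proposal to validate rather than measured requirements, that could mean something like:

Code Block
# current
-Xms3072m -Xmx3072m
# proposed: headroom for two copies of the InCommon metadata plus GC working space
-Xms4096m -Xmx4096m

We would also want to grow the VM's total memory by at least the same amount so the larger heap does not push the process into swap.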