In this blog, we look at a memory leak in the SOLR QueryResultCache, how the root cause analysis (RCA) was carried out, and the fix that resolved the issue.
In the application under test, SOLR 7.5 was used as the component to store, search, and retrieve content. During performance testing, it was observed that CPU usage on the SOLR Slave server increased with every test, as shown below:
- Test 1 – 40% CPU usage
- Test 2 – 60% CPU usage
- Test 3 – 80% CPU usage
The tests only retrieved content; nothing was written to the SOLR Slave server. The server was not restarted between the tests, since it would not be restarted between runs in production either.
The CPU usage drill-down view in Dynatrace did not show any specific evidence of where the time was actually spent. The GC graph, however, showed that the old generation memory was growing and that the time spent on young GC was somewhat high. The old generation grew to 9 GB. Minor GCs were frequent, averaging around 850 milliseconds per collection, with a few spikes of up to 1.3 seconds. Since the old generation kept growing, it was suspected that some objects were accumulating in memory.
The SOLR cache view in Dynatrace showed only inserts into the QueryResultCache and no evictions. The tests were repeated after restarting SOLR, and similar behavior was observed. This time, the QueryResultCache was monitored closely via the SOLR console, and the cache size increased as shown below:
- Test 1 – 190K elements in cache
- Test 2 – 350K elements in cache
- Test 3 – 520K elements in cache
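The console check described above can also be scripted. The following is a minimal sketch in Python, assuming the cache statistics are available as a flat dictionary of metric names to values. The helper names `find_stat` and `check_cache`, the metric key names, and the `inserts` value in the sample are hypothetical and should be adapted to whatever the SOLR instance actually reports.

```python
def find_stat(stats, name):
    """Return the value whose key equals `name` or ends with '.' + name."""
    for key, value in stats.items():
        if key == name or key.endswith("." + name):
            return value
    raise KeyError(name)


def check_cache(stats, configured_size):
    """Flag a cache that grows past its configured size without evicting."""
    size = find_stat(stats, "size")
    inserts = find_stat(stats, "inserts")
    evictions = find_stat(stats, "evictions")
    return {
        "size": size,
        "inserts": inserts,
        "evictions": evictions,
        "over_limit": size > configured_size,
        "never_evicts": inserts > 0 and evictions == 0,
    }


# Hypothetical snapshot shaped like the Test 3 observation:
# 520K entries in the cache and no evictions. Key names are assumed.
sample = {
    "CACHE.searcher.queryResultCache.size": 520000,
    "CACHE.searcher.queryResultCache.inserts": 520000,
    "CACHE.searcher.queryResultCache.evictions": 0,
}
report = check_cache(sample, configured_size=5000)
print(report)
```

Running a check like this after each test makes it obvious when a cache keeps accepting inserts but never evicts.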
A heap dump was taken after these tests, and it showed that around 8 GB of memory was occupied by FastLRUCache and its contents. The SOLR configuration had the following settings for the QueryResultCache:
<queryResultCache class="solr.FastLRUCache" size="5000" initialSize="512" maxRamMB="1048" autowarmCount="0"/>
In this case, the QueryResultCache honored neither the ‘size’ nor the ‘maxRamMB’ parameter. The cache grew to 520K entries against a configured size of 5K, and it occupied more than 8 GB in the heap dump against a configured limit of about 1 GB. Instead of using both parameters to limit the cache, it was decided to restrict the cache using only the ‘size’ parameter.
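To put the overshoot in numbers, here is a short back-of-the-envelope calculation in Python using only the figures reported above. It treats "around 8 GB" as exactly 8192 MB and assumes the whole heap-dump footprint belongs to cache entries, so the per-entry figure is a rough estimate rather than a measurement.

```python
configured_size = 5_000        # 'size' from the cache configuration
configured_ram_mb = 1_048      # 'maxRamMB' from the cache configuration

observed_entries = 520_000     # entries in the cache after Test 3
observed_heap_mb = 8 * 1024    # "around 8 GB" in the heap dump, taken as 8192 MB

entry_overshoot = observed_entries / configured_size
ram_overshoot = observed_heap_mb / configured_ram_mb

# Rough average footprint per cached entry (assumes the heap figure is all entries).
bytes_per_entry = observed_heap_mb * 1024 * 1024 / observed_entries

print(f"entries: {entry_overshoot:.0f}x the configured size")
print(f"memory:  {ram_overshoot:.1f}x the configured maxRamMB")
print(f"approx. {bytes_per_entry / 1024:.1f} KiB per cached entry")
```

An estimate like the per-entry footprint is useful when choosing a new ‘size’ value, because it gives a feel for how much heap a given number of entries is likely to need.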
The following settings were applied in SOLR:
<queryResultCache class="solr.FastLRUCache" size="150000" initialSize="512" autowarmCount="0"/>
The three tests were repeated, and the CPU usage of SOLR stayed constant at around 40% in all of them. The QueryResultCache grew to 150K entries, after which evictions kept the cache within the defined limit. The old generation grew to 3 GB and then stayed consistent across the tests, and the time spent on young generation garbage collection also came down: the average minor GC time was around 650 milliseconds, without any spikes. The tests completed successfully without any CPU issues in SOLR.
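The behavior seen after the fix is what a size-bounded LRU cache is expected to do: grow to its limit, then evict one old entry for every new insert. The toy Python class below illustrates that pattern. It is not SOLR's FastLRUCache implementation, and the class name `BoundedLRUCache` and the stream of 200K distinct queries are made up for the illustration.

```python
from collections import OrderedDict


class BoundedLRUCache:
    """Toy size-bounded LRU cache; not Solr's FastLRUCache implementation."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.inserts = 0
        self.evictions = 0

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        else:
            self.inserts += 1
        self.entries[key] = value
        while len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1


cache = BoundedLRUCache(max_size=150_000)
for i in range(200_000):  # hypothetical stream of distinct queries
    cache.put(f"q={i}", i)

print(len(cache.entries), cache.inserts, cache.evictions)
# prints: 150000 200000 50000
```

Once the limit is reached, the entry count stays flat while inserts and evictions rise together, which matches the pattern observed in the repeated tests.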
Based on the root cause analysis, the SOLR QueryResultCache appears to leak memory when both ‘size’ and ‘maxRamMB’ are set, at least in the SOLR 7.5 setup tested here. Setting only the ‘size’ parameter for the QueryResultCache avoids this memory leak.