How AWS helped us optimize memory usage in Tigase HTTP API

Moving to AWS

Recently we moved our xmpp.cloud (formerly branded as sure.im) installation from a hosting provider where we used dedicated servers to Amazon AWS cloud-based hosting. The benefits of this move are described in this article. During the migration we chose the smallest AWS instances that would be adequate for hosting the xmpp.cloud services: t2.medium. The installation went through without any issues, and test runs showed that the systems were operating properly. Should we need to scale the installation, another cluster node could always be started.

The Crashes

However, after some time we started to experience crashes on the new installation. The JVMs running Tigase XMPP Server were being terminated by the Linux kernel due to memory allocation issues. In a typical situation we would receive OutOfMemoryError exceptions from the JVM, notifying us that something was wrong and that the JVM memory settings needed adjustment. This time, however, that was not the case. Instead, the JVM was being shut down with a single entry written to the tigase-console.log file and an hs_err_pid file being created. The following entries were written to these files:

 There is insufficient memory for the Java Runtime Environment to continue.
 Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
 Possible reasons:
   The system is out of physical RAM or swap space
   In 32 bit mode, the process size limit was hit
 Possible solutions:
   Reduce memory load on the system
   Increase physical memory or swap space
   Check if swap backing store is full
   Use 64 bit Java on a 64 bit OS
   Decrease Java heap size (-Xmx/-Xms)
   Decrease number of Java threads
   Decrease Java thread stack sizes (-Xss)
   Set larger code cache with -XX:ReservedCodeCacheSize=
 This output file may be truncated or incomplete.

  Out of Memory Error (os_linux.cpp:2640), pid=3633, tid=0x00007fe49c4c4700

 JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
 Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode linux-amd64 compressed oops)
 Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

The failed native memory allocation suggested that the issue was not with the Java HEAP size but rather with an insufficient amount of free memory on our AWS instance. After verifying the JVM memory settings, we found that there was still plenty of free memory on the instance, so this error should not have occurred. However, the crashes kept happening roughly once a day and needed to be fixed. Since we are running Java 8, the JVM memory is divided into:

  • HEAP
  • MetaSpace
  • DirectMemory

We only had limits set for HEAP, so the issue had to be MetaSpace or DirectMemory growing without any limit other than the amount of free RAM on our AWS instance. Additionally, the Tigase XMPP Servers at the xmpp.cloud installation were processing a lot of SPAM messages when the crashes were happening. Because of that, we suspected a leak in server-to-server (S2S) connection buffers, as a lot of those connections were being created and many of them were saturated by the high volume of incoming messages, most of which were SPAM. Knowing that, we decided to limit the amount of memory allowed for MetaSpace and DirectMemory to 128MB.
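
For reference, on the HotSpot JVM these limits are controlled by standard command-line options; a minimal sketch (assuming here that the 128MB limit was applied to each of the two areas) could look like this:

  java -Xmx<heap size> \
       -XX:MaxMetaspaceSize=128m \
       -XX:MaxDirectMemorySize=128m \
       <remaining Tigase startup options>

In a typical Tigase installation such options are not passed by hand but added to the JVM options used by the startup scripts (e.g. the JAVA_OPTIONS setting in etc/tigase.conf).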

Isolating the cause of the issue

The following day a second crash occurred; the crashes usually happened between 24 and 32 hours after server startup. This time, before the crash happened, we received quite a few OutOfMemoryError errors related to MetaSpace; the errors no longer pointed to generic native memory being depleted. All of the MetaSpace-related OutOfMemoryError errors were thrown from the code responsible for handling HTTP requests, but they did not point to any particular location within that code. Since we had no tests in place to measure the MetaSpace requirements of the Tigase XMPP Server installed at xmpp.cloud, we proceeded with a simple fix: MetaSpace was configured to use 256MB of space, and Tigase HTTP API was reconfigured to use the Jetty-based HTTP server. Using Jetty instead of the Java embedded server reduces the amount of DirectMemory required for handling HTTP requests. Unfortunately, after 27 hours the Tigase XMPP Servers at the xmpp.cloud installation were down again, and the analysis still pointed to the Tigase HTTP API. This component was still the sole source of OutOfMemoryError errors related to MetaSpace, which is allocated out of native memory (non-HEAP).
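
As a side note, when chasing this kind of problem it helps to keep an eye on MetaSpace and class loading of the running JVM; the standard JDK tools can do that, for example (the <pid> below is a placeholder for the Tigase JVM process id):

  jstat -gcmetacapacity <pid> 10s   # MetaSpace capacity and usage, sampled every 10 seconds
  jmap -clstats <pid>               # per-ClassLoader statistics: loaded classes and their size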

Analysis of the memory usage

Having isolated the issue, we decided to replicate it in a controlled environment, measure memory usage, and take a few memory dumps to compare memory usage over different periods. Knowing that the issue was directly related to the HTTP API, we focused on testing that component. We began with REST API requests, which we had started to use internally for one of the new features yet to be introduced to users of xmpp.cloud. During those tests HEAP memory was almost empty and MetaSpace usage increased only slightly during the first part of the test. This behavior is expected, as Groovy scripts are compiled and loaded into memory. Later on, MetaSpace usage fluctuated but remained more or less stable. Only the CodeCache space kept changing, due to the JVM recompiling code to optimize its execution time.

As direct calls to the REST API were working fine, we turned to accessing the HTTP API component from a web browser. This meant testing the REST API, Admin UI, and other modules which are accessible from a web browser. Right after the first tests using the browser, MetaSpace memory usage increased with each request and kept growing until the configured limit was reached; then OutOfMemoryError errors began to be thrown. Thanks to the memory dumps taken during those tests, we were able to identify which classes were using memory and which were allocated during each request. There we found a lot of classes with GStringTemplateScript and getTemplate in their names. Each class was named GStringTemplateScript followed by a number, indicating many generated instances. As we use Groovy's GStringTemplateEngine to create HTML output for web browsers, we surmised that this template engine was somehow leaking memory by generating new classes for each request and not unloading the older classes when they were no longer needed. Our memory leak had been found.
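
The pattern that triggers this is easy to reproduce outside of Tigase. The following is a minimal Groovy sketch (not the actual Tigase HTTP API code) of the problematic usage: a single long-lived GStringTemplateEngine that re-parses a template for every request keeps each generated GStringTemplateScript class referenced through its internal ClassLoader, so MetaSpace grows with every request:

  import groovy.text.GStringTemplateEngine

  // one engine shared by all requests, as in the leaking setup
  def engine = new GStringTemplateEngine()
  def templateText = '<p>Hello, ${user}!</p>'   // stand-in for a template from scripts/rest/

  // simulate many HTTP requests; each createTemplate() call compiles a new
  // GStringTemplateScript<N> class which stays referenced by the engine's ClassLoader
  10000.times {
      def template = engine.createTemplate(templateText)
      template.make([user: 'admin']).toString()
  }

With a small -XX:MaxMetaspaceSize set, a loop like this quickly reproduces the MetaSpace OutOfMemoryError described above.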

Fixing the Issue

To fix the issue we started with code analysis to find the usage patterns of GStringTemplateEngine which led to the leak. In our case, the leak was caused by the automatic reload mechanism for GString templates. These are stored in files under the `scripts/rest/` directory of the Tigase XMPP Server installation and are loaded when needed. To make development and customization of those templates easier, the requested templates were reloaded on each HTTP request, and to keep this fast we kept a single GStringTemplateEngine instance (per servlet) which handled every request. Previously this mechanism caused only a slow increase of MetaSpace usage, and we had never before experienced such a rapid acceleration of memory use. However, this instance of GStringTemplateEngine had its own ClassLoader and internally kept a reference to each class created during parsing of a GString template. This led to ever-increasing MetaSpace usage, OutOfMemoryError errors, and eventually to the crashes. Having pinpointed the real cause of the issue, we reviewed the usage of GStringTemplateEngine in our code and changed it to make sure that:

  • We load all templates at once using a single GStringTemplateEngine and cache the generated templates. There is no more automatic reloading of templates.
  • When a manual reload of templates is initiated, we release the old instance of GStringTemplateEngine and parse the templates using a new one.

This way we can still use GStringTemplateEngine and our GString templates while keeping MetaSpace usage stable. As the template instances are now cached, responses to HTTP requests from web browsers are also faster. A manual reload will still generate new classes; however, since we release the old GStringTemplateEngine instance together with its internal ClassLoader, the classes that ClassLoader loaded into MetaSpace become eligible for garbage collection. After extensive testing, we were able to confirm that memory usage is now stable.
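
A minimal Groovy sketch of this approach (a simplified illustration, not the actual Tigase code; the TemplateCache class and its method names are made up for this example) could look as follows:

  import groovy.text.GStringTemplateEngine
  import groovy.text.Template

  // hypothetical holder for the cached templates used by a servlet
  class TemplateCache {

      private GStringTemplateEngine engine = new GStringTemplateEngine()
      private Map<String, Template> templates = [:]

      // parse all templates once and cache the compiled Template objects
      synchronized void loadAll(File templatesDir) {
          templatesDir.eachFileMatch(~/.*\.html/) { File f ->
              templates[f.name] = engine.createTemplate(f)
          }
      }

      // manual reload: drop the old engine (and with it the ClassLoader holding
      // all previously generated classes) and parse everything with a new one
      synchronized void reload(File templatesDir) {
          engine = new GStringTemplateEngine()
          templates = [:]
          loadAll(templatesDir)
      }

      // render a cached template for a request; no parsing, no new classes
      String render(String name, Map binding) {
          templates[name].make(binding).toString()
      }
  }

The important detail is that a reload replaces the whole engine instead of reusing it, so the old ClassLoader and every GStringTemplateScript class it created can be garbage collected once the cached Template references are dropped.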

What about Amazon Web Services?

As previously mentioned, we recently moved to AWS from our old hosting provider and enabled a new feature for our users. This feature is based on the Tigase HTTP API and REST API, and uses both extensively. We needed to expose this HTTP-based API to our other services, and to do that we decided to use AWS’s Elastic Load Balancer to transparently forward HTTP requests to each of Tigase’s cluster nodes. This way it automatically switches to a different node if one of them is overloaded or offline. Amazon’s Elastic Load Balancer executes a health check request every few seconds to detect whether the destination host is up and running fine. In our case it was hitting the REST API, which generated responses formatted in HTML just as it would for a normal web browser. This caused GStringTemplateEngine to be used for handling each health check, and each request created new classes in memory, forcing the JVM to use more and more memory until it ran out.

Thanks to AWS for helping us optimize memory usage in Tigase HTTP API