Auto MID selection causes excessive metrics to accumulate when connection issues occur

Description

During auto-MID-selection, the MID Server can experience excessive memory allocation and CPU spikes when agents connect over an unreliable network. As part of auto-MID-selection, agents make REST calls to the MID Server to query the number of connected agents, which the MID Server uses for load balancing and to enforce the maximum number of agent connections allowed. Each of these calls generates a metric instance for tracking; these instances accumulate and can block other threads calling into the MID web server that hosts the REST endpoints. As a result, the MID Server experiences heavy CPU spikes and excessive memory retention while agents attempt to find the correct MID Server to connect to. Combined with network interruptions, this can cause the MID Server to run out of memory.

Steps to Reproduce

Several threads were observed blocking on the REST service call /api/mid/mon, which is used only by the agent during auto-MID-selection. Analysis of the code indicates that the RateCounter is retrieved via a sync() call on the existing object to ensure data is available when toString() is called. Each sync() places a new instance of the data into memory, which accumulates and is not released for 24 hours.

To reproduce, perform the following:
- Create a default MID Server and enable auto-MID-selection.
- Install and configure an agent using auto-MID-selection; Linux is easiest for testing.
- On the agent host, create a cron job that restarts the agent every 15 seconds (cron fires at most once per minute, so use a minutely entry that restarts and sleeps in a loop).
- Take a heap snapshot of the MID Server JVM to record baseline usage and note the number of TimeSlot objects allocated.
- Stop the MID Server.
- Repeat the steps above on a build containing the fix.
- Compare the number of TimeSlot instances; the count should be lower, if not eliminated entirely.

Workaround

This problem has been fixed.
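The retention pattern described above can be sketched as follows. The class and method names mirror those mentioned in the analysis (RateCounter, sync(), TimeSlot), but this is a hypothetical illustration of the leak mechanism, not the MID Server's actual implementation: each sync() allocates a fresh snapshot that is retained until it ages out 24 hours later, so an agent reconnecting every 15 seconds steadily accumulates instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snapshot object; stands in for the TimeSlot instances
// seen accumulating in the heap dump.
class TimeSlot {
    final long timestampMs;
    final long count;
    TimeSlot(long timestampMs, long count) {
        this.timestampMs = timestampMs;
        this.count = count;
    }
}

// Hypothetical rate counter: sync() snapshots current data so that
// toString() has consistent values to report, but every snapshot is
// retained for 24 hours before it is pruned.
class RateCounter {
    private static final long RETENTION_MS = 24L * 60 * 60 * 1000;
    private final List<TimeSlot> slots = new ArrayList<>();
    private long connections = 0;

    synchronized void recordConnection() { connections++; }

    synchronized RateCounter sync(long nowMs) {
        slots.add(new TimeSlot(nowMs, connections)); // new instance per call
        slots.removeIf(s -> nowMs - s.timestampMs > RETENTION_MS); // 24h prune
        return this;
    }

    synchronized int retainedSlots() { return slots.size(); }

    @Override public synchronized String toString() {
        return "connections=" + connections + " slots=" + slots.size();
    }
}

public class LeakSketch {
    public static void main(String[] args) {
        RateCounter counter = new RateCounter();
        // Simulate one hour of an agent restarting every 15 seconds:
        // 240 calls to /api/mid/mon, each triggering sync().
        long now = 0;
        for (int i = 0; i < 240; i++) {
            counter.recordConnection();
            counter.sync(now);
            now += 15_000;
        }
        // Nothing has aged past 24 hours, so all 240 snapshots remain.
        System.out.println(counter.retainedSlots());
    }
}
```

With the synchronized methods, many blocked agent threads pile up behind the same lock while the list grows, which matches the thread-blocking and memory-retention symptoms described in the problem.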
If an upgrade is possible, refer to the Fixed In section to determine the latest version with a permanent fix that the instance can be upgraded to. As a workaround, disable auto-MID-selection in the acc.yml files of all agent installs. A better alternative to Auto MID Selection (AMS) is to implement a load balancer within the customer's network between the agent installs and the set of ACC-endpoint-enabled MID Servers. Each agent's acc.yml is configured with the URL of the load balancer, and the load balancer is configured with the URLs of the MID Servers. This works because connections are always agent-to-MID-Server over REST, and it does not matter which MID Server an agent ends up connecting to. The MID Servers and the instance keep track of which MID Server each agent is currently connected to and ensure that any check ecc_queue outputs are sent to the correct MID Server. Where there are multiple network environments and several sets of MID Servers covering all agents, a load balancer can be implemented for each environment.

Related Problem: PRB1966095
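The load-balancer workaround amounts to a one-line change in each agent's acc.yml: point the agent at the load balancer's address instead of at individual MID Servers. The fragment below is an illustrative sketch only; the hostname, port, and placeholder key are assumptions, and the exact key names should be verified against the Agent Client Collector documentation for your version.

```yaml
# acc.yml sketch (illustrative; verify key names for your ACC version).
# The agent connects to the load balancer VIP, which fronts the set of
# ACC-endpoint-enabled MID Servers, instead of using auto-MID-selection.
api-url: "wss://acc-lb.example.com:8086/ws/events"  # load balancer, not an individual MID Server
api-key: "<your-api-key>"
```

Because the instance tracks which MID Server each agent ends up on, no further agent-side configuration is needed; ecc_queue outputs are routed to whichever MID Server the load balancer selected.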