MID Server Distributed Cluster doesn't do Load Balancing, or work with more than one MID Server

Issue

This article aims to clear up confusion about "Distributed Metric Cache Clusters" and the support cases that confusion leads to.

Table of Contents
- It isn't a cluster of MID Servers
- Error seen if a 2nd MID Server is added to a Distributed Cluster
  - Solution
- It doesn't do Load Balancing for the connected ACC Agents
  - Solution
- Related info: KB0830864 MID Server Cluster for ITOM Event Management Connector Instances

It isn't a cluster of MID Servers

A Distributed Metric Cache Cluster is not really a MID Server cluster at all. It re-uses the existing tables used by Load Balancing and Failover clusters of multiple MID Servers, which have been in the platform for many years, but it is actually a metric cache for an individual MID Server. Each cluster record should have exactly one MID Server, and one cluster node, linked to it.

The "MID Server and MID Server distributed cluster for Metric Intelligence" documentation does say this, in a roundabout way:

"Using Metric Intelligence requires at least one MID Server distributed cluster which contains a single MID Server that is configured for Metric Intelligence."
"To support the specified throughput, create a distributed cluster with a single MID Server that meets the Metric Intelligence MID Server requirements."

Many customers have assumed this works like a Load Balancing cluster, and have then had issues with an overloaded MID Server while other MID Servers sat practically idle. The fix for PRB1637583 in Vancouver aimed to reduce this confusion by renaming 'Distributed Cluster' to 'Distributed Cache Cluster' to highlight its function, but the name still contains the word 'Cluster', and so still gives the impression that load balancing might be involved.
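The one-MID-Server-per-cluster rule can be checked mechanically. Below is a minimal sketch, assuming cluster membership has been exported as (cluster, MID Server) pairs — for example from the ecc_agent_cluster_member_m2m table; the export format and names here are illustrative, not a ServiceNow API:

```python
from collections import defaultdict

def find_overloaded_clusters(memberships):
    """Given (cluster_name, mid_server_name) pairs, return the metric cache
    clusters that wrongly contain more than one MID Server."""
    members = defaultdict(set)
    for cluster, mid_server in memberships:
        members[cluster].add(mid_server)
    return {c: sorted(m) for c, m in members.items() if len(m) > 1}

# Example data: 'metric-cache-2' breaks the one-MID-Server rule and would
# need a new cluster record created for its extra MID Server.
rows = [
    ("metric-cache-1", "mid-a"),
    ("metric-cache-2", "mid-b"),
    ("metric-cache-2", "mid-c"),
]
print(find_overloaded_clusters(rows))  # {'metric-cache-2': ['mid-b', 'mid-c']}
```

Any cluster this flags is a candidate for the fix described below: move each extra MID Server into its own new Distributed Metric Cache Clusters record.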
Error seen if a 2nd MID Server is added to a Distributed Cluster

If you add more than one MID Server to a Distributed Metric Cache Cluster, the Metric Intelligence extension on the additional MID Server won't start. The ecc_agent_ext_context_metric record will show the error:

Failed to start pipeline, class org.apache.ignite.IgniteCheckedException: Failed to serialize object [typeName=com.service_now.metric.cache.MetricRegistrationGenericCache$1]

The MID Server agent log has the full error:

2024-06-04 09:30:14 ERROR (ECCQueueMonitor.1) [AnalyticsLog:178] (125) com.service_now.metric.PipelineManager - Failed to start pipeline, Exception Message: CacheException: class org.apache.ignite.IgniteCheckedException: Failed to serialize object [typeName=com.service_now.metric.cache.MetricRegistrationGenericCache$1], Stack Trace: javax.cache.CacheException: class org.apache.ignite.IgniteCheckedException: Failed to serialize object [typeName=com.service_now.metric.cache.MetricRegistrationGenericCache$1]
at org.apache.ignite.internal.processors.cache.GridCacheUtils.convertToCacheException(GridCacheUtils.java:1272)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.next(GridCacheQueryFutureAdapter.java:167)
at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$4.onHasNext(GridCacheDistributedQueryManager.java:628)
at org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:56)
at org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.getAll(QueryCursorImpl.java:126)
at com.service_now.metric.cache.MetricRegistrationGenericCache.contains(MetricRegistrationGenericCache.java:109)
at com.service_now.metric.cache.handler.MetricRegistrationCacheHandler.clearAndInitiateTableSync(MetricRegistrationCacheHandler.java:215)
at com.service_now.metric.cache.handler.MetricRegistrationCacheHandler.initialize(MetricRegistrationCacheHandler.java:87)
at com.service_now.metric.PipelineManager.startPipeline(PipelineManager.java:323)
at com.service_now.metric.PipelineManager.start(PipelineManager.java:171)
at com.service_now.mid.extension.container.ExtensionContainer.start(ExtensionContainer.java:392)
at com.service_now.mid.extension.container.ExtensionMessageRouter.routeMessage(ExtensionMessageRouter.java:54)
at com.service_now.mid.message_executors.ExtensionExecutor.execute(ExtensionExecutor.java:41)
at com.service_now.monitor.ECCQueueMonitor.executeMessage(ECCQueueMonitor.java:338)
at com.service_now.monitor.ECCQueueMonitor.processMessage(ECCQueueMonitor.java:256)
at com.service_now.monitor.ECCQueueMonitor.processMessages(ECCQueueMonitor.java:282)
at com.service_now.monitor.ECCQueueMonitor.run(ECCQueueMonitor.java:216)
at com.snc.midserver.monitor.internal.MonitorRunner$MonitorTask.execute(MonitorRunner.java:235)
at com.snc.midserver.monitor.internal.AMonitorTask.run(AMonitorTask.java:29)
at java.base/java.util.TimerThread.mainLoop(Timer.java:566)
at java.base/java.util.TimerThread.run(Timer.java:516)
Caused by: class org.apache.ignite.IgniteCheckedException: Failed to serialize object [typeName=com.service_now.metric.cache.MetricRegistrationGenericCache$1]

Solution

The solution is to create a new Distributed Metric Cache Clusters record and link the extra MID Server to that instead. Restart the MID Server, and the Metric Intelligence extension should then start.

It doesn't do Load Balancing for the connected ACC Agents

Because it's not really a cluster, you cannot expect it to distribute the ACC Agents evenly across several MID Servers. The only feature that load-balances Agents collecting metrics across the MID Servers they work through is the ACC Auto MID Select feature.
See:
- Optimize distribution of agents to MID Servers documentation
- Enable automatic MID Server selection documentation

On Agent startup, if the instance system property sn_agent.enable_auto_mid_selection is true, the Agent connects to one of the MID Server URLs in its acc.yml config file. It then:

1. Requests a list of available MID Servers that have the Web Server and ACC Endpoint running, including the number of Agents currently connected to each, and runs network latency tests against those MID Servers.
2. Switches over to the MID Server likely to offer the best performance based on those results.
3. Does not repeat this and switch to a different MID Server unless that MID Server goes Down, or 'Try redistributing connected agents' is clicked on that MID Server's form in the instance.

From that you can see that if a MID Server is restarted, all of its Agents will go elsewhere and stay with those other MID Servers, and the restarted one will be left with very few Agents connected to it for a while. If the Agents are mostly servers, possibly a very long time. If the Agents are mostly computers, this will naturally even out again after a couple of days.

Solution

The KB1122613 ITOM Agent Client Collector documentation material has a PDF attached: "Recommendations on deploying and maintaining ServiceNow Agent Client Collector for Customers - 8.pdf". It includes notes on setting up a load balancer between the Agents and the MID Servers to provide true load balancing. This idea also works for Push Event Management Connectors. In summary:

- A load balancer is installed by the customer in front of the MID Servers. It is configured with a pool of all the relevant MID Servers, and configured for load balancing.
- The ACC Agent install, or the event/metric push connector source, is configured with the load balancer URL instead of a MID Server URL.

The default architecture is the left third of this diagram. The ideal load balancing solution is the right third.
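The load-balancer-in-front arrangement described in the summary above could look like the following sketch. This is a hypothetical nginx TCP-level (stream) configuration, not taken from the attached PDF; host names, port, and balancing method are example assumptions (8433 being the default ACC WebSocket port):

```nginx
# Hypothetical sketch: TCP load balancing in front of MID Server ACC endpoints.
stream {
    upstream acc_mid_servers {
        least_conn;                    # route new agents to the least-loaded member
        server mid1.example.com:8433;  # example MID Server hosts
        server mid2.example.com:8433;
        server mid3.example.com:8433;
    }
    server {
        listen 8433;
        proxy_pass acc_mid_servers;
    }
}
```

The Agents' acc.yml (or the push connector source) would then be configured with the load balancer's URL rather than any individual MID Server URL, so new connections are spread across the pool even after a MID Server restart.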
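The selection steps above can be sketched as a scoring function over the candidate MID Servers. This is an illustrative model only, not the actual ACC implementation; the weighting between latency and connected-agent count is an assumption:

```python
def pick_mid_server(candidates, latency_weight=1.0, load_weight=1.0):
    """candidates: list of dicts with 'url', 'connected_agents', 'latency_ms'.
    Returns the URL of the MID Server with the best (lowest) combined score.
    The linear scoring formula is a hypothetical stand-in for ACC's logic."""
    def score(c):
        return latency_weight * c["latency_ms"] + load_weight * c["connected_agents"]
    return min(candidates, key=score)["url"]

# mid1 is closer but heavily loaded; mid2 wins on combined score.
mids = [
    {"url": "wss://mid1:8433", "connected_agents": 900, "latency_ms": 5},
    {"url": "wss://mid2:8433", "connected_agents": 20,  "latency_ms": 8},
]
print(pick_mid_server(mids))  # wss://mid2:8433
```

Note that because the score is only evaluated once at connection time (step 3 above), a freshly restarted MID Server reports zero connected Agents but is never re-scored by already-connected Agents, which is why it stays nearly idle until agents reconnect or redistribution is triggered.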
Related info

KB0830864 MID Server Cluster for ITOM Event Management Connector Instances

KB summary: Don't use clusters for Event Management Connectors either. Add multiple MID Servers to the Connector instead.