Discovery MID Servers at 100% CPU - Good or Bad, Why and then what?


Description

You are probably reading this because your event monitoring flagged up a MID Server computer for running at very high or 100% CPU for an extended period of time. This article aims to discuss this at a high level, and deliberately not get into specifics about individual probes/patterns running against specific Makes/Models, as those are usually dealt with individually in other knowledge articles and may have a problem ticket linked.

100% CPU is Good

Horizontal Discovery aims to scan your whole IP range and record all it can in the CMDB, so your operations people can do their jobs effectively using that up-to-date information. When a MID Server uses CPU time running Discovery, it is executing probes that fetch information from target hardware and pre-process it before sending it back to the instance, and the more CPU time it can use the more it can do. The MID Server application and the Discovery probes it runs are multi-threaded, and if the CPU cores and time slices are given to it by the JVM, the computer's Operating System and the virtual Hypervisor, then it will use them. More IPs can be scanned, and more often, giving you more value from the Discovery feature.

100% CPU doesn't break the computer, but simply means the processing resources available to the Operating System are being used to the full while running Discovery. Operating systems are designed to operate at 100% CPU, making sure processes that need the CPU at a higher priority get it, and so is the hardware under it, which will throttle the CPU to keep within thermal limits. When you run out of storage or memory you have a problem, but running at 100% CPU just determines the rate at which jobs are processed. It is cost effective to fully utilize your assets.

100% CPU is to be expected, and desired, for the above reasons. This is a deliberate design of Horizontal Discovery. In Quebec release testing it was demonstrated that a MID Server would use 100% CPU consistently at the start of 10k device Discovery Schedules, then fluctuate a little through the schedule, and drop back to idle at the end of the schedule. Having most threads blocked waiting for responses from targets, when they could be doing something, would slow down Discovery, so there are more threads than cores, and most of them will actually be processing something at any point in time. This testing also showed that doubling the CPUs available from 4 to 8 made very little difference to the average CPU used through the schedule, but the schedule was able to complete in a shorter time window.
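The reasoning about blocked threads can be illustrated with a rough estimate. This is only a sketch: the function name and the 75% wait figure are assumptions made for illustration, not measurements from the Quebec tests, and real probes do not block at a constant rate.

```python
def estimated_cpu_utilization(threads: int, cores: int, wait_fraction: float) -> float:
    """Rough fraction of total CPU kept busy by a pool of worker threads.

    wait_fraction is the assumed share of time each thread spends blocked
    waiting for a target to respond (a hypothetical figure).
    """
    if not 0.0 <= wait_fraction <= 1.0:
        raise ValueError("wait_fraction must be between 0 and 1")
    # On average, only the threads that are not blocked compete for CPU time.
    runnable_threads = threads * (1.0 - wait_fraction)
    # Utilization cannot exceed 100% of the cores available.
    return min(1.0, runnable_threads / cores)


# One thread per core leaves most of the CPU idle while threads wait.
print(estimated_cpu_utilization(threads=4, cores=4, wait_fraction=0.75))   # 0.25
# The default 25 standard threads keep 4 cores saturated.
print(estimated_cpu_utilization(threads=25, cores=4, wait_fraction=0.75))  # 1.0
```

Under these assumptions, the extra threads are what turn idle waiting time into useful work, which is why the CPU graph sits at 100% while a schedule runs.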

A reported 100% CPU by the OS may even be a lie, if it is a VM. Physical CPU usage is the only true indicator, and that may have to come from the hypervisor monitoring tools rather than at an OS level.

100% CPU is Bad

Alerts from Monitoring tools can be an issue. It is unusual for an application server to have idle periods and then CPU spikes to the extent that you see with Discovery running. To have the minimum impact on the devices being discovered, network traffic, and on the ServiceNow instance itself, the recommendation has tended to be to run Discovery schedules out-of-hours. That often ends up with an 8-10 hour window on week nights for all the important devices to be scanned, and longer windows at the weekends to catch up on the less important devices. The resulting square wave graph, with hardly any CPU when Discovery isn't running, to almost 100% CPU when it is, gets monitoring tools worried.  

On a dedicated host, or a VM with dedicated resources, 100% CPU will not have any effect on any other computer. Dedicated resources have always been the recommendation, and in that case these are false alerts. Tuning the thresholds, alert rules, and time windows for CPU alerts of these specific hosts in the monitoring tools is the correct solution in this case.

However, changes in virtualization and cloud technology now mean dedicated hardware is unlikely to be possible for you, and the MID Server hosts may affect other hosts running on the same cloud or ESX hardware, sharing the actual hardware resources and CPU time with other VMs and applications.

Most VMware ESX Servers are over-committed, with dynamic allocation of CPU and Memory resources as and when it is needed by each VM. This is based on the idea that minor fluctuations in CPU usage will even out across the several VMs, meaning it is possible to get away with fewer physical resources, and is a key selling point of virtual and cloud technology. The pattern of MID Server resource usage, of being mostly idle, and then a big CPU demand, and then back to idle for the rest of the day will mean other VMs that were using those resources suddenly find them being taken away from them. Multiple MID Server hosts in a Discovery load-balancing cluster may make that effect larger if they happen to be on the same hardware at the time.

It is the main function of the Operating system to allocate resources to applications that want them while at the same time managing the OS as a whole, and it is the job of the virtual Hypervisor to manage the demand on the physical resources. Tuning the hypervisor VM settings for CPU is usually the correct solution in this case.

Why would Discovery use so much CPU, and why is it trending upwards?

The main aim of Horizontal Discovery is to discover as much as possible in the shortest time, and use all the CPU available to do that, and development of the product continues to make changes in order to do more and do it faster with every release. We have not revised the minimum hardware recommendations for a MID Server host in general for many years, although Discovery does now use more. The MID Server, Discovery, Cloud Management and Service Mapping Release Notes with each new major version summarise these changes, which often will result in higher CPU usage.

Changes in Windows Discovery: PowerShell/WinRM is used instead of WMI from mid-Madrid/New York, automatically selecting the most recent management technologies available on the target Windows Server since Paris. Each Windows probe or pattern will now shell out to PowerShell, so for each Windows-related MID Server thread there will also be a powershell.exe session running. These do contribute a lot to CPU (and memory) usage, especially when compared to a probe running native WMI requests. This change was necessary to continue supporting Discovery of the latest Windows versions due to the changes in Microsoft's management technologies over the years.

Multi-threaded Shazzam: The port scanner probe is still a single execution of the probe per MID Server at any one time, but is now multi-threaded within the probe. This defaults to 5 threads since Orlando, compared to the previous 1, and within each thread 100 scanners run. Other optimizations were added to increase the rate at which IPs can be scanned. This is how a single Discovery Schedule is able to scan millions of IPs in only hours.
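As a quick illustration of what those defaults mean for concurrency, here is a minimal arithmetic sketch. The function name is hypothetical, and it assumes the number of scanners per thread was the same before Orlando, which this article does not state.

```python
def concurrent_scanners(threads: int, scanners_per_thread: int) -> int:
    """Total scanners a single Shazzam probe can run at the same time."""
    return threads * scanners_per_thread


# Assumption: 100 scanners per thread both before and after Orlando.
before_orlando = concurrent_scanners(threads=1, scanners_per_thread=100)
since_orlando = concurrent_scanners(threads=5, scanners_per_thread=100)

print(before_orlando)  # 100
print(since_orlando)   # 500
```

Roughly five times as many scanners running at once means roughly five times the demand for CPU while the port scan phase is running.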

Probes to Pattern Migration, and a general offloading of processing to the MID Server: Patterns were introduced in London, and available for older customers to migrate to since New York. These include potentially hundreds of steps to explore a target and process the data to extract the attributes for the CMDB, the related CIs, connections and relationships. Patterns execute almost all the logic and data processing in the MID Server, which reduces the delays caused by going backwards and forwards to the instance for every decision of what to do next. The equivalent Probes' Sensors ran only a handful of commands on the target, and used to do all the processing in the instance. For Cloud discovery, and large network devices, this is a huge amount of data and processing. Parsing those large payloads in order to use data within them can use a lot of CPU, for many minutes each. For most patterns, including the large ones, all that's left to do in the instance is for the CMDB Identification and Reconciliation processing to insert or update CIs. Some Probes have also moved processing to a post-processing script, which runs in the MID server.

Regular new Patterns: Monthly out-of-band releases, delivered by the "Discovery and Service Mapping Patterns" store app between major upgrades, add new functionality in order to discover new cloud environments, new hardware and new applications. Often this will result in more of the IPs already being scanned, being explored, and more deeply, or more applications on existing discovered servers being explored, including the cloud services and virtualization resources they run on. 

Java's memory Garbage Collection can become more significant as more and larger patterns run. After a Quebec upgrade, which upgraded Java 8 to Java 11, and before the fix for PRB1462926 was implemented, the MID Server would spend a lot of CPU clearing up the memory heap after threads had finished with it. This was due to the default Garbage-First Garbage Collector (G1GC) in Java 11 not being as efficient as the Parallel Collector on smaller JVMs like the MID Server. Not many customers are likely to see this, due to upgrading straight to the fixed version that reverts this to parallelGC, but this does highlight that Java does its own memory management, which is included in the CPU usage of the java process, and is not hidden in the operating system's CPU usage.

Instructions

What could be done about this? Should anything be done about this? The following will help you decide.

Prevent the Alerting

The correct and recommended thing to do is to configure your monitoring tools to expect the behaviour of a MID Server when running Discovery at the times you have set it up to run, and being idle at others, and not create alerts for high CPU events during the Discovery Schedule time windows. 
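How this is configured depends entirely on your monitoring tool, but the logic amounts to something like the following sketch. The function names, the 90% threshold and the 22:00 to 06:00 window are all hypothetical examples, not settings from any particular product.

```python
from datetime import datetime, time


def in_discovery_window(moment: datetime, start: time, end: time) -> bool:
    """True if moment falls inside a daily window, which may cross midnight."""
    now = moment.time()
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


def should_alert(cpu_percent: float, moment: datetime,
                 start: time, end: time, threshold: float = 90.0) -> bool:
    """Raise a high-CPU alert only outside the expected Discovery window."""
    return cpu_percent > threshold and not in_discovery_window(moment, start, end)


window_start, window_end = time(22, 0), time(6, 0)  # hypothetical schedule window
print(should_alert(100.0, datetime(2024, 1, 1, 23, 30), window_start, window_end))  # False
print(should_alert(100.0, datetime(2024, 1, 1, 12, 0), window_start, window_end))   # True
```

High CPU outside the window is still reported, so a genuinely runaway process during the day would not be hidden.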

Unfortunately the patterns of the CPU graphs of MID Server hosts do look fundamentally 'bad' to the sort of automated monitoring tools now used. A >90% CPU usage for an hour or 2, when the rest of the day is idle is a classic thing to look out for by these tools, which for an application server may indicate a problem, but this is exactly what Discovery is designed to do.

In the Paris release the Event Management Self-Health Monitoring feature creates alerts based on MID Server issues records, and if enabled, those issues records are created by default if a MID Server stays at an average of >95% CPU over a period of 30 minutes (PRB1458352, MID Server resource threshold alerts). That's fine for a general use MID Server, doing the odd integration here and there, often small jobs randomly triggered by user updates, but not appropriate for Horizontal Discovery Schedules of thousands of IPs.
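To show why a rule of this kind will always fire during a large schedule, here is a minimal sketch of a rolling-average check. It is not the product's implementation: the class name and the one-sample-per-minute interval are assumptions, and only the 95% and 30 minute figures come from the description above.

```python
from collections import deque


class CpuAverageMonitor:
    """Flags a host whose average CPU over a full window exceeds a threshold."""

    def __init__(self, window_samples: int = 30, threshold: float = 95.0):
        # Assumption: one sample per minute, so 30 samples is about 30 minutes.
        self.samples = deque(maxlen=window_samples)
        self.threshold = threshold

    def add(self, cpu_percent: float) -> bool:
        """Record one sample; return True once a full window averages above threshold."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and sum(self.samples) / len(self.samples) > self.threshold


# A short integration job: 10 busy minutes, then idle. Never flagged.
short_job = CpuAverageMonitor()
print(any(short_job.add(cpu) for cpu in [100.0] * 10 + [10.0] * 20))  # False

# A Discovery schedule: pinned at 100% for 30 minutes. Flagged.
discovery = CpuAverageMonitor()
print(any(discovery.add(100.0) for _ in range(30)))  # True
```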

Avoid Overcommit of Physical CPUs (vCPU > pCPU)

From searching and reading through posts on the VMware community and blogs, the best practice recommendation is generally to be conservative and not overcommit the physical CPU cores.

If CPU cores are dedicated to MID Server hosts, then their CPU usage cannot affect other VMs running on the ESX server.

Change the CPU resources of the host

Increase the MID Server CPU resources

A MID Server running at 100% CPU would use more if you gave it more. In general, you would be less likely to have peaks, and you would stay at those peaks for a shorter time. As stated above for the Quebec tests, doubling the CPUs from the minimum recommended 4 cores to 8 made very little difference to the average CPU used through the schedule, with the CPU graph shape basically identical, and the only effect was to compress it into a shorter time.
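One way to picture that result is to treat a schedule as a fixed amount of CPU work. The sketch below is an idealized model: the function name, the 24 core-hours of work and the 75% average utilization are hypothetical, and real schedules will not scale perfectly linearly with core count.

```python
def schedule_duration_hours(work_core_hours: float, cores: int,
                            avg_utilization: float) -> float:
    """Idealized schedule length for a fixed amount of CPU work."""
    return work_core_hours / (cores * avg_utilization)


# Same work and same average utilization, only the core count changes.
print(schedule_duration_hours(24.0, cores=4, avg_utilization=0.75))  # 8.0
print(schedule_duration_hours(24.0, cores=8, avg_utilization=0.75))  # 4.0
```

The average CPU percentage is unchanged in both cases, which is why adding cores does not make a high-CPU alert go away. It only shortens the period for which the alert condition holds.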

If you want to discover more in a shorter time, this, together with MID Server load-balanced clusters, is the way to go.

However, in overcommitted virtual environments, this means you will take even more resources away from other VMs than you did before, having a relatively greater impact on their CPU performance.

Reduce the MID Server CPU resources

Yes this is counter-intuitive, but bear with me. The number of CPUs could be reduced, or a cap put on the percentage of those CPUs that the VM can be given. This is also a way of throttling the MID Server, and flattening the CPU usage graph. It will remain at 100% CPU from the Operating System's point of view, because it is fooled by the Hypervisor into thinking it is at 100% CPU, but actually it is only using a fraction of the physical CPU resources. Most Discovery MID Servers tend to be installed on Windows VMs on ESX Servers, and so the allocated CPUs and the dynamic allocation rules can be changed without reinstalling the OS and MID Server.
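The gap between what the guest OS reports and what the hardware is actually doing can be sketched with some simple arithmetic. This is a simplified model with hypothetical names and numbers. Real hypervisors express CPU limits and shares in their own units, so treat the cap fraction here as an illustration only.

```python
def physical_cpu_percent(guest_cpu_percent: float, vcpus: int,
                         cap_fraction: float, host_cores: int) -> float:
    """Approximate share of the host's physical CPU consumed by one capped VM."""
    # The guest sees its vCPUs as fully busy, but each vCPU only receives
    # cap_fraction of a physical core's time from the hypervisor.
    return guest_cpu_percent * cap_fraction * vcpus / host_cores


# Hypothetical: a 4 vCPU MID Server VM capped at 50%, on a 16 core host.
print(physical_cpu_percent(100.0, vcpus=4, cap_fraction=0.5, host_cores=16))  # 12.5
```

In this example the OS-level monitoring would still report 100% CPU, while the host as a whole is only giving the VM an eighth of its physical CPU capacity.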

This isn't usually necessary if you are able to run all the jobs and Discovery schedules that you need to, and within the time windows you need. 

We document a 4 core 2GHz CPU per MID Server install as a minimum recommendation. With that specification, a MID Server will run at 100% CPU for a good proportion of a Discovery Schedule, right from the start. If that causes too much impact on other VMs sharing the hardware, then give it less.

Flatten the curve

A flatter CPU usage graph across the week, with more consistent CPU usage and fewer peaks and troughs, and just as many VMs and as much hardware purchased as you need, is the ideal from the cost and provisioning point of view. If that can be done while still meeting the requirements of the IT Operations people using the ServiceNow products then you are good. Below are several ways of doing that.

Discover less

A down-side of Horizontal Discovery is the temptation to discover everything possible in minute detail. It can do that, but do you actually need that? Who is using the data, and what data do they need? How up to date does that data need to be?

For example, does anyone use the L2 Mapping data of network connections between network adapters of servers and switch ports? Is the L3 data for TCP level connections between servers at the application level enough for your needs, and do you even need that? A lot of data is involved in this, which needs extracting from servers, routers and switches using Discovery probes running in the MID Servers, and this can be turned off with properties.

Service-focused Discovery using Service Mapping

A particular service may be made up of only 10-50 main CIs representing the load balancers, applications, servers, databases and storage involved. If you have an incident from an alert or outage reported, then that's all you need to know to sort it out. Because so much less is discovered, it can be re-discovered every few hours, rather than nightly or weekly.  Everything else not directly related to your important services could be discovered a lot less often, putting a lot less load on the MID Servers.

Reschedule Discovery throughout the day/week

Try to move away from the 'one big nightly schedule' way of doing things. If you have broken up your schedules, you are probably still running them directly after each other, either using the 'Run After' feature, or by deliberately setting the start times up that way. Most ServiceNow customers are global, or even if their market is local have employees and hardware all over the world, so there is no real down-time any more.

Break down the schedules into datacenters or regions, and stagger those schedules' start times throughout the week. You can still expect 100% CPU when those smaller schedules run, but the graph will now be more of a saw tooth, rather than one big square.
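As a sketch of what staggering might look like when planning, the following spreads a list of schedules evenly from a first start time. The function name, the region names and the 8 hour gap are hypothetical. The actual start times would still be entered on each Discovery Schedule in the instance.

```python
from datetime import datetime, timedelta


def staggered_starts(names, first_start: datetime, gap: timedelta) -> dict:
    """Assign each schedule a start time, spaced evenly from first_start."""
    return {name: first_start + i * gap for i, name in enumerate(names)}


# Hypothetical regional schedules, spaced 8 hours apart.
plan = staggered_starts(
    ["EMEA datacenters", "APAC datacenters", "AMER datacenters"],
    first_start=datetime(2024, 1, 1, 0, 0),
    gap=timedelta(hours=8),
)
for name, start in plan.items():
    print(f"{start:%a %H:%M}  {name}")
```

Choosing a gap longer than the time each schedule normally takes leaves idle time between the peaks, which gives the saw tooth shape described above.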

Consider a host per MID Server

Installing multiple MID Servers on the same host, perhaps one for production and one for sub-production instances, will mean the MID Servers can affect each other. If a MID Server is reporting High CPU in the MID Server dashboard, the figures are for the whole host, and it may not even be that MID Server using the CPU, or even a MID Server at all if the host is being used for other applications too. A host dedicated to MID Servers is better than a shared one, and 1 host for 1 MID Server is better still.

Throttling the MID Server application

It is possible to deliberately throttle the MID Server and Discovery throughput, to reduce the average CPU demand by reducing the throughput and rate of probes it can process, however this will not eliminate all 100% usage. It deliberately introduces bottlenecks and blockages, while jobs wait for a thread or connection pool. 

We don't recommend that you do any of these though, because the MID Server is designed to run as much as it can in the quickest time in order to avoid potentially problematic backlogs in the ECC queue for the features that need a quick response or high rate of jobs. A few long running jobs could block all other queued jobs from executing for a long time, including non-Discovery integrations, and the more limited the MID Server becomes, the greater the chance of this happening.

Doing this will prolong Discovery Schedule times, such that you may not be able to scan all the devices your operations people need in the CMDB, or update them as regularly as they need.

Worker Threads

The number of threads available in a MID Server to run probes can be set at the MID Server level with these MID Server Parameters. A probe might be a Shazzam port probe for a block of IPs, or a Windows Classify probe or a Network Switch Pattern, each for a specific target IP.

Thread Group   Value (default)   Parameter                 Notes
Standard       25                threads.max               This can be reduced to as low as 5
Expedited      20                threads.expedited.max     This can be reduced to as low as 5
Interactive    10                threads.interactive.max   Don't touch this.

Don't touch the interactive property. This is usually only used by the MID Server's own system commands, such as the Heartbeat probe that shows the MID Server is still Up, and so that should not be restricted in any way.
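Before changing these parameters, the guidance above can be turned into a simple sanity check. This is a hypothetical helper, not part of the MID Server. It only encodes the defaults and lower limits listed in this section.

```python
# Defaults and lower limits taken from the thread group list in this section.
DEFAULTS = {"threads.max": 25, "threads.expedited.max": 20, "threads.interactive.max": 10}
MINIMUMS = {"threads.max": 5, "threads.expedited.max": 5}


def check_thread_settings(proposed: dict) -> list:
    """Return a list of problems found in proposed thread parameter values."""
    problems = []
    for name, value in proposed.items():
        if name not in DEFAULTS:
            problems.append(f"{name}: not one of the thread parameters in this section")
        elif name == "threads.interactive.max" and value != DEFAULTS[name]:
            problems.append(f"{name}: leave at the default of {DEFAULTS[name]}")
        elif name in MINIMUMS and value < MINIMUMS[name]:
            problems.append(f"{name}: {value} is below the suggested minimum of {MINIMUMS[name]}")
    return problems


print(check_thread_settings({"threads.max": 10, "threads.expedited.max": 5}))  # []
for problem in check_thread_settings({"threads.max": 3, "threads.interactive.max": 5}):
    print(problem)
```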

For more on MID Server Worker Thread Pools, see: KB0743566 MID Server Max Threads, Worker Groups , Priority and Queues

MID Server Resource Reservation

This feature, added in Paris, can be used to do various things, including:

  • limit the number of "large" probes being executed by a MID Server at the same time
  • throttle Discovery (to minimize impact on the instance)

See MID Server Resource Reservation in the docs for more details.

Throttling Discovery Probes

Shazzam

The MID Server will only run 1 Shazzam port scanning probe at once, however since the Orlando version this has been enhanced to run multiple threads in parallel, and each of those running multiple scanners concurrently, which will naturally increase CPU usage, and would use many times more than 100% CPU if it could.

The number of threads and scanners per thread can be lowered. See mid.shazzam.threads and mid.shazzam.max_scanners_per_thread parameters in MID Server properties docs.

Session Pools

SSH and Powershell probes connect to targets via session pools. 

For Windows Powershell, see mid.powershell_api.session_pool.max_size and mid.powershell_api.session_pool.target.max_size in the MID Server parameters for PowerShell or Windows Discovery parameters docs.

For SSH, see mid.ssh.pool_thread_ratio in the MID Server Parameters, and mid.ssh_connections_per_host in the SSH Discovery parameters.

Reducing these session pool sizes could cause probes to time out, and block threads while waiting for the session pool. Using MID Server Resource Reservation instead avoids that queue blocking.