Discovery should never run at Interactive priority (especially for Retry Discovery from Discovery Home) - Known Error

Description

Discovery should never run at Interactive priority. Discovery probes can occupy every interactive thread for a long time, which delays cancel_discovery system commands and leaves the MID Server's own HeartbeatProbe stuck in the queue, so that the MID Server is marked as Down. It also makes it impossible to debug the MID Server while it is in this state.

Currently a normal Discovery schedule runs at Standard priority, and Service Mapping runs at Expedited priority in the ECC Queue because its CI data tends to need to be more up to date. Pattern Designer runs at Interactive priority, but it only runs 1 probe at a time, so only 1 thread is in use, and the user is watching the screen waiting for it to refresh, so it does need to jump the queue.

However, Discovery Now, Quick Discovery, and the Retry Discovery of batches of IPs from the Discovery Home also run at the highest priority, Interactive, and should use at most Expedited instead.

When "Retry Discovery" is triggered from the Discovery Home page, a huge number of Interactive priority outputs are put in the ecc_queue, effectively blocking it. The action triggers a separate schedule, at priority=0, for every failed discovery in the home page report, all at once, so 20k errors would create 20k Interactive priority discovery schedules.

By default, there are only 10 threads available in the interactive thread pool in the MID Server platform. These need to be kept free for important system commands such as Cancel Discovery, Grab logs, Credential reload, Restart, or Upgrade, which _must_ run straight away. In this situation we need to be able to cancel Discovery so that no more of these interactive thread jobs run for that Discovery status. If the cancel command is queued behind the actual discovery jobs, it will be too late by the time it runs.
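To make the queuing effect concrete, here is a toy model in plain JavaScript (runnable with Node.js, not on the instance). It is only a sketch under stated assumptions: the batch size of 200 jobs and the 60 second job duration are invented for illustration, the function and variable names are hypothetical, and the model assumes jobs in the interactive pool are served strictly first-in, first-out. Only the pool size of 10 comes from the description above.

```javascript
// Toy model: FIFO jobs served by a fixed pool of worker threads.
// Returns the time (in seconds) at which each job starts.
function fifoStartTimes(durations, threads) {
    var freeAt = [];
    for (var i = 0; i < threads; i++) {
        freeAt.push(0); // every worker is idle at t = 0
    }
    return durations.map(function (duration) {
        // Pick the worker that becomes free first.
        var worker = 0;
        for (var j = 1; j < freeAt.length; j++) {
            if (freeAt[j] < freeAt[worker]) {
                worker = j;
            }
        }
        var start = freeAt[worker];
        freeAt[worker] = start + duration;
        return start;
    });
}

var INTERACTIVE_THREADS = 10; // default pool size stated above
var RETRY_JOBS = 200;         // ASSUMPTION: size of the retry batch
var JOB_SECONDS = 60;         // ASSUMPTION: duration of each discovery job

// Case 1: retries share the interactive pool; the cancel command is queued last.
var sharedQueue = [];
for (var k = 0; k < RETRY_JOBS; k++) {
    sharedQueue.push(JOB_SECONDS);
}
sharedQueue.push(1); // the cancel command itself
var cancelWaitShared = fifoStartTimes(sharedQueue, INTERACTIVE_THREADS)[RETRY_JOBS];

// Case 2: retries run in another pool, so the interactive pool holds only the cancel.
var cancelWaitAlone = fifoStartTimes([1], INTERACTIVE_THREADS)[0];

console.log('Cancel waits ' + cancelWaitShared + ' s behind retries, ' +
    cancelWaitAlone + ' s with a free pool');
```

With these assumed numbers the cancel command waits 1200 seconds (20 minutes) when it shares the pool with the retries, and starts immediately when the interactive pool is kept free.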

Discovery often runs multiple probes and patterns in parallel as part of a single schedule, even when re-discovering a single CI or IP address, and these can be long running, for example for large switches or vCenters. All threads can be blocked for minutes at a time even when everything is working fine. If particular threads get stuck for a very long time, for example because of target connection issues, this makes it a lot worse.

The main symptom that a customer would report is the MID Server being marked as Down:

The MID Server platform must also have an interactive thread available every 5 minutes to run the HeartbeatProbe, or the MID Server will be marked as Down and, if clusters are involved, all pending Discovery jobs will be reassigned to a different MID Server.

In this situation, the MID Server will continue to send back "queue.stats" topic inputs at 10 minute intervals, listing which threads are currently in use. These confirm whether all Worker-Interactive threads are running discovery probes/patterns.
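One way to pull up the most recent of these inputs is a background script along the following lines. This is a minimal sketch, not an official procedure: the MID Server name PROD_DISCO_MID is a placeholder, the helper name buildQueueStatsQuery is hypothetical, and the script simply prints the raw payload rather than assuming anything about its layout. The GlideRecord part is guarded so that the query-building helper can also be checked outside the instance.

```javascript
// Hypothetical helper: encoded query for queue.stats inputs from one MID Server.
function buildQueueStatsQuery(midServerName) {
    return [
        'queue=input',
        'topic=queue.stats',
        'agent=mid.server.' + midServerName
    ].join('^');
}

// ASSUMPTION: PROD_DISCO_MID is a placeholder; use the affected MID Server's name.
var queueStatsQuery = buildQueueStatsQuery('PROD_DISCO_MID');

// Only runs on the instance, where GlideRecord and gs are defined.
if (typeof GlideRecord !== 'undefined') {
    var statsGr = new GlideRecord('ecc_queue');
    statsGr.addEncodedQuery(queueStatsQuery);
    statsGr.orderByDesc('sys_created_on'); // newest input first
    statsGr.setLimit(1);
    statsGr.query();
    if (statsGr.next()) {
        gs.info(statsGr.getValue('sys_created_on') + '\n' + statsGr.getValue('payload'));
    }
}
```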

More on priorities and thread pools can be found in https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB0743566

Steps to Reproduce

1. Run some big discovery schedules that are bound to have a lot of errors, perhaps because credentials are missing.
2. Open Discovery Home.
3. Click through to a set of errors, e.g. Windows authentication failed.
4. In the ACTION ON ALL section, click Retry Discovery.

This is going to trigger potentially thousands of discovery schedules with the name "Retry Discovery x.x.x.x", all at Interactive priority. These outputs in the ecc_queue may prevent HeartbeatProbes from running and cause the MID Server to be set as Down.

Workaround

This problem is currently under review and targeted to be fixed in a future release. Subscribe to this Known Error article to receive notifications when more information becomes available.

To prevent this, avoid running too many Discovery Now, Quick Discovery, or Retry Discovery jobs for large batches of IPs from the Discovery Home within a short period of time.

If you have triggered a huge number of Interactive priority discovery schedules using the Retry Discovery feature of the Discovery Home reports, and the MID Servers are being reported as Down because of this, you can reprioritise these jobs so that they still run, but without impacting the MID Server platform:

Set the Discovery Status records for Retry Discovery to standard priority. You could use this background script:

var statusGr = new GlideRecord('discovery_status');
statusGr.addEncodedQuery('priority=0^ORpriority=^state=Active^descriptionSTARTSWITHRetry Discovery');
statusGr.setValue('priority', 1);
statusGr.updateMultiple();

Find all ecc_queue outputs, apart from SystemCommand and HeartbeatProbe, that are currently in Ready or Processing state and at Interactive priority. Set the priority to Expedited instead. You could use this background script, although you will need to modify the query string for the specific MID Servers involved, and you might wish to exclude other probe topics as well.

var eccGr = new GlideRecord('ecc_queue');
eccGr.addEncodedQuery('queue=output^state=ready^ORstate=processing^priority=0^ORpriority=^agentINmid.server.PROD_DISCO_MID^topic!=HeartbeatProbe^ORtopic=NULL^topic!=SystemCommand^ORtopic=NULL');
eccGr.setValue('priority', 1);
eccGr.updateMultiple();
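Because the query string has to be adapted per environment, it may help to build it from a list of MID Servers and excluded topics, and to count the matching rows before running the update. The following is a sketch only: the helper name buildReprioritiseQuery and the example MID Server name are hypothetical, and the query clauses simply reproduce the pattern used in the script above. The count uses GlideAggregate and is guarded so that it only runs on the instance.

```javascript
// Hypothetical helper: rebuilds the encoded query used above for any set of
// MID Servers and any list of topics that must keep their priority.
function buildReprioritiseQuery(midServerNames, excludedTopics) {
    var agents = midServerNames.map(function (name) {
        return 'mid.server.' + name;
    });
    var parts = [
        'queue=output',
        'state=ready^ORstate=processing',
        'priority=0^ORpriority=',
        'agentIN' + agents.join(',')
    ];
    excludedTopics.forEach(function (topic) {
        parts.push('topic!=' + topic + '^ORtopic=NULL');
    });
    return parts.join('^');
}

// ASSUMPTION: replace PROD_DISCO_MID with the MID Servers actually affected.
var reprioritiseQuery = buildReprioritiseQuery(
    ['PROD_DISCO_MID'],
    ['HeartbeatProbe', 'SystemCommand']
);

// Dry run: count the matching outputs before changing anything.
if (typeof GlideAggregate !== 'undefined') {
    var countGa = new GlideAggregate('ecc_queue');
    countGa.addEncodedQuery(reprioritiseQuery);
    countGa.addAggregate('COUNT');
    countGa.query();
    if (countGa.next()) {
        gs.info('Matching ecc_queue outputs: ' + countGa.getAggregate('COUNT'));
    }
}
```

If the count looks plausible, the same query string can be passed to addEncodedQuery in the update script.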

If the MID Server does not come back Up soon, you might have some very long-running or permanently stuck threads in the MID Server. From the host computer, restart the MID Server service. This clears the stuck threads, and the MID Server will then attempt to re-run the jobs, but at a lower priority this time, so if they do get stuck again they won't cause a Down status for the MID Server. Note: you cannot do this from the instance, because new restart system commands will also be blocked until the MID Server is restarted manually.

To prevent future incidents, please do the following:

1. Go to the DiscoveryStatus script include.
2. Find the _getPriorityFromSource function.
3. Move the "case 'Quick_Discovery':" part down so that it sits just above or below "case 'ServiceWatch':", so that these types of discoveries become Expedited priority.

Related Problem: PRB1624010
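The script include change works because of switch fall-through: a case label with no body shares the return value of the next case that has one. The sketch below illustrates the idea only. It is NOT the real _getPriorityFromSource code: the function names, the 'Example_Interactive_Source' label, and the string return values are all hypothetical, and only the 'Quick_Discovery' and 'ServiceWatch' labels come from the steps above.

```javascript
// HYPOTHETICAL sketch of the layout before the change.
function getPriorityBefore(source) {
    switch (source) {
        case 'Quick_Discovery':            // falls through to the interactive branch
        case 'Example_Interactive_Source': // hypothetical label
            return 'interactive';
        case 'ServiceWatch':
            return 'expedited';
        default:
            return 'standard';
    }
}

// HYPOTHETICAL sketch of the layout after moving the case label.
function getPriorityAfter(source) {
    switch (source) {
        case 'Example_Interactive_Source': // hypothetical label
            return 'interactive';
        case 'Quick_Discovery':            // now shares the ServiceWatch branch
        case 'ServiceWatch':
            return 'expedited';
        default:
            return 'standard';
    }
}

console.log(getPriorityBefore('Quick_Discovery')); // interactive
console.log(getPriorityAfter('Quick_Discovery'));  // expedited
```

Check the actual function on your instance before editing, since its real structure may differ from this sketch.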