A discovery schedule can be configured with a "Max run time" value. The discovery is cancelled if it takes longer to complete than the configured max run time.
A discovery schedule that used to complete successfully may start being cancelled more and more often. Over time, the number of devices discovered by a schedule can increase considerably, and this growth may be the root cause of the schedule exceeding its configured max run time.
If the number of CIs discovered by a discovery schedule increases, it is necessary to increase either the max run time or the resources allocated to the discovery schedule, such as the number of MID servers, MID server threads, and/or MID server memory.
However, in some cases the discovery may start to exceed the max run time because a single probe, or a few probes, take too long to complete or get stuck. This article focuses on finding troublesome probes that get stuck on the MID server and thereby cause the discovery schedule to be cancelled. A separate investigation is needed to determine why the probe is hanging, as there are many different probes and the investigation steps depend on the probe.
MID Server configuration
The documentation pages below can provide guidelines for increasing MID server resources and configuring threads and memory usage:
Check the ECC Queue related list for records where Processed is (empty)
Note: Depending on the instance version, the processed field may not be empty.
When a discovery is cancelled, all ecc_queue records which have not been processed have their status changed to processed. However, the "processed" field, which is the time when the processing of the probe/sensor was complete, is left empty. Therefore, we can determine which probes were still running when a discovery was cancelled by finding the records where processed is (empty).
In the following screenshot, the ECC Queue records for the discovery job are ordered from newest to oldest by the "updated" field. The "updated" field is not shown by default on the ECC Queue related list. Adding the "updated" column is also helpful because it shows how long the probe took to complete (Updated - Processed = the time an output probe spent on the MID server, which includes both the time queued in the MID internal queue waiting for a free thread and the execution time of the probe).
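The check described above can also be run offline against an export of the ECC Queue related list. The following is a minimal sketch, assuming the rows have been exported to dictionaries; the key names ("name", "source", "processed") and the sample values are hypothetical and may not match the column names of an actual export.

```python
def unprocessed_records(ecc_records):
    """Return the rows whose 'processed' timestamp is empty or missing."""
    return [r for r in ecc_records if not (r.get("processed") or "").strip()]


# Made-up sample rows standing in for an exported ECC Queue list.
rows = [
    {"name": "probe_a", "source": "10.0.0.5", "processed": "2024-01-01 10:05:00"},
    {"name": "probe_b", "source": "10.0.0.9", "processed": ""},
    {"name": "probe_c", "source": "10.0.0.7", "processed": None},
]

still_running = unprocessed_records(rows)
print([r["name"] for r in still_running])  # ['probe_b', 'probe_c']
```

As the note above points out, on some instance versions the processed field may not be empty, so an empty result from this filter does not by itself rule out a stuck probe.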
Note: the highlighted ECC Queue record was created an hour before the discovery was cancelled and had still not been processed.
This is a good indication that this probe caused the discovery above to be cancelled.
The example above is a simple case in which only one probe was not processed, so the culprit is easy to identify. However, in some cases multiple probes may hang and together cause the discovery to time out. In such cases it is important to analyze the ECC Queue for the cancelled discovery and look for patterns, such as many records with processed (empty) that share the same "Topic" or "Probe Name". In general, to find the ECC Queue records which caused the discovery to be cancelled, look for records with a large gap between the "created" and "updated" fields.
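When many records are involved, the pattern search can be done on an export rather than by eye. The sketch below assumes rows exported as dictionaries with hypothetical keys ("topic", "name", "source", "created", "updated", "processed") and an assumed timestamp format; it counts unprocessed records per Topic/Probe Name pair and ranks them by the gap between created and updated.

```python
from collections import Counter
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format of the export


def gap_seconds(record):
    """Seconds between the 'created' and 'updated' timestamps of a row."""
    created = datetime.strptime(record["created"], TS_FORMAT)
    updated = datetime.strptime(record["updated"], TS_FORMAT)
    return (updated - created).total_seconds()


def summarize_stuck(records):
    """Count unprocessed rows per (topic, name) and rank them by gap."""
    stuck = [r for r in records if not r.get("processed")]
    by_probe = Counter((r["topic"], r["name"]) for r in stuck)
    slowest = sorted(stuck, key=gap_seconds, reverse=True)
    return by_probe.most_common(), slowest


# Made-up sample rows; topics and probe names are placeholders.
records = [
    {"topic": "TopicA", "name": "probe_x", "source": "10.0.0.5",
     "created": "2024-01-01 09:00:00", "updated": "2024-01-01 10:00:00", "processed": ""},
    {"topic": "TopicA", "name": "probe_x", "source": "10.0.0.6",
     "created": "2024-01-01 09:30:00", "updated": "2024-01-01 10:00:00", "processed": ""},
    {"topic": "TopicB", "name": "probe_y", "source": "10.0.0.7",
     "created": "2024-01-01 08:59:00", "updated": "2024-01-01 09:01:00",
     "processed": "2024-01-01 09:01:00"},
]

counts, slowest = summarize_stuck(records)
print(counts)                # [(('TopicA', 'probe_x'), 2)]
print(slowest[0]["source"])  # 10.0.0.5
```

A Topic/Probe Name pair that dominates the counts, or a handful of sources at the top of the ranking, is a reasonable place to start the follow-up investigation.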
When analyzing the ECC Queue for a cancelled discovery, filter for "queue = output". Inputs can usually be filtered out because inputs have a default maximum processing time of twenty minutes and therefore will usually not cause a discovery to be cancelled.
Once the probe/IP causing the discovery to be cancelled has been identified, a separate investigation is needed to troubleshoot why the probe is not fully processed. In the meantime, the IP address can be added as an exception to the discovery schedule so that the discovery completes successfully until a solution is found for the probe.
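The same filter can be expressed as an encoded query, for example to pull the records through the Table API instead of the related list. This is only a sketch under assumptions: it assumes that the ecc_queue field names are queue, processed and sys_updated_on, and that agent_correlator holds the sys_id of the discovery status record. Verify these against the instance before relying on it. The sys_id used below is a placeholder.

```python
from urllib.parse import urlencode


def stuck_output_query(status_sys_id):
    """Encoded query for unprocessed output records of one discovery status.

    Assumption: agent_correlator stores the discovery status sys_id.
    """
    conditions = [
        f"agent_correlator={status_sys_id}",
        "queue=output",
        "processedISEMPTY",
    ]
    # Newest first by last update, matching the ordering used in the article.
    return "^".join(conditions) + "^ORDERBYDESCsys_updated_on"


def table_api_path(status_sys_id, limit=100):
    """Relative Table API path for the query above (no request is sent)."""
    params = {
        "sysparm_query": stuck_output_query(status_sys_id),
        "sysparm_limit": limit,
    }
    return "/api/now/table/ecc_queue?" + urlencode(params)


print(stuck_output_query("PLACEHOLDER_SYS_ID"))
print(table_api_path("PLACEHOLDER_SYS_ID"))
```

The functions only build strings, so they can be checked without touching an instance; the resulting path would still need to be combined with the instance host and suitable credentials.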
Check the Devices related list for devices where the "Scan status" is not "Completed"
The Devices related list is not updated when the discovery is cancelled. Therefore, any devices where probes did not complete will not have a "Scan status" of "Completed". The following image shows devices which were still being scanned when the discovery was cancelled. From this related list we can gather the Source, which is the IP address of the device, and then look for the probes which did not complete for that address under the "ECC Queue" related list.
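The cross-reference between the two related lists can be sketched as follows, assuming both lists have been exported to dictionaries. The key names ("source", "scan_status", "name") and the non-completed status value shown are hypothetical placeholders.

```python
def incomplete_sources(devices):
    """IP addresses of devices whose scan status is not 'Completed'."""
    return {d["source"] for d in devices if d.get("scan_status") != "Completed"}


def ecc_rows_for_sources(ecc_records, sources):
    """ECC Queue rows whose source matches one of the given IP addresses."""
    return [r for r in ecc_records if r.get("source") in sources]


# Made-up sample data standing in for the two exported related lists.
devices = [
    {"source": "10.0.0.5", "scan_status": "Completed"},
    {"source": "10.0.0.9", "scan_status": "In progress"},
]
ecc_records = [
    {"name": "probe_a", "source": "10.0.0.5"},
    {"name": "probe_b", "source": "10.0.0.9"},
    {"name": "probe_c", "source": "10.0.0.9"},
]

pending = incomplete_sources(devices)
suspects = ecc_rows_for_sources(ecc_records, pending)
print(sorted(pending))                 # ['10.0.0.9']
print([r["name"] for r in suspects])   # ['probe_b', 'probe_c']
```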
Note: both "Source" and "CMDB CI" were cleared in the above screenshot.
Show Discovery Timeline
The "Show Discovery Timeline" button is helpful for reviewing the timeline of smaller discoveries. There is a default limit of 300 ecc_queue records. This limit is controlled by glide.discovery.timeline.max_entries. It is best to keep this limit at 300 and to use the timeline only when troubleshooting smaller discoveries.
Discovery performance metrics
Starting in Madrid, performance metrics are collected when a device discovery is complete. To view the performance metrics, navigate to "Discovery > Discovery Performance Metrics > Probe/Sensor (Individual)". Metrics are not collected for devices which did not complete a discovery. Therefore, the metrics are useful for determining which devices and device types take the longest, but less useful for determining which device caused a discovery to be cancelled (because the device which pushed the discovery past the max run time threshold did not complete, and thus no performance information is collected for it).
The metrics can be filtered by discovery status and then by "Probe processing time" or "Sensor processing time" in order to find the longest running ones. As an example:
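The same ranking can be reproduced on exported metric rows. The sketch below assumes rows exported as dictionaries; the key names ("device", "probe_processing_time", "sensor_processing_time") and the sample numbers are hypothetical, and the durations are treated as plain numbers in whatever unit the export uses.

```python
def longest_running(metrics, field="probe_processing_time", top=3):
    """Return the `top` rows with the largest value in `field`."""
    return sorted(metrics, key=lambda m: m[field], reverse=True)[:top]


# Made-up sample rows for a single discovery status.
metrics = [
    {"device": "10.0.0.5", "probe_processing_time": 120, "sensor_processing_time": 15},
    {"device": "10.0.0.6", "probe_processing_time": 950, "sensor_processing_time": 40},
    {"device": "10.0.0.7", "probe_processing_time": 300, "sensor_processing_time": 700},
]

by_probe = longest_running(metrics, top=2)
by_sensor = longest_running(metrics, field="sensor_processing_time", top=1)
print([m["device"] for m in by_probe])   # ['10.0.0.6', '10.0.0.7']
print([m["device"] for m in by_sensor])  # ['10.0.0.7']
```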
See Discovery performance metrics for more information.