Discovery Performance | Troubleshooting

This article provides a guideline for troubleshooting slow performance with ServiceNow Discovery. It is written for ServiceNow customers, partners, or developers who are trying to get Discovery to go faster. The first section of this article, the troubleshooting tree, is intended to be read as a step-by-step diagnostic process. However, while troubleshooting trees are a nice starting place, troubleshooting is not something that can be completely reduced to a step-by-step process. Hopefully, the principles discussed here can help you develop a structured approach and avoid going in circles or too far down the wrong path.

Table of Contents
- Troubleshooting tree
- Configuration
- Resource constraints on the MID Server
- Discovery components
- Phases of Discovery
- Setting up a schedule
- Additional Troubleshooting Tools: Impact Instance Observer ECC Queue Metrics
- Performance benchmarks
- Logs
- Resource constraints
- Common issues
- Historical issues

Troubleshooting tree

(Note: The routes in this troubleshooting tree are not necessarily mutually exclusive. For example, you could have both a sensor issue and a probe issue.)

Use the ECC queue to identify the slowest input (sensor) / output (probe) records.

You can see the overall input vs. output latency by looking at the JRobin Discovery metrics. Open the homepage titled "ServiceNow Performance". Use these graphs as a starting point for your investigation to get an overall sense of where and when most of the time is being spent during discovery.

Using the graphs as an example, note that the Probe Runtime graph shows that most of the latency is on the MID Server side, with millions of milliseconds being spent on probe processing. By comparison, the sensor processing time is generally much faster - between 20 and 30 thousand milliseconds. Given that information, the most likely place to find performance improvement is on the MID Server side. However, you still need to determine whether the probe runtime is time that can actually be improved, and do not discount sensor issues as the cause just because there is more probe time than sensor time.

Check the difference between the processed and sys_updated_on dates of the ECC queue records. This tells you the processing time of the probe or sensor, since the sys_updated_on field is the last thing that gets touched when all the work is done. Here is an overview of the life cycle of a single output probe and its corresponding input sensor:

1 - sys_created_on (output): When the ecc_queue output record is inserted.
2 - processed (output): When the MID Server has picked up the ecc_queue record, prior to actually running the probe.
3 - sys_updated_on (output): After the probe has finished executing on the MID Server, the probe response is returned and the input record is being created.
4 - sys_created_on (input): When the probe response is returned and the input ecc_queue record is created (should be close to output.sys_updated_on). At this time the state of the output probe is set to "processed".
5 - processed (input): Set after the scheduled Async Sensor processing job has begun, but just prior to actually running the sensor logic.
6 - sys_updated_on (input): When the Async Scheduled Job has finished executing the sensor.

Filter the ecc_queue using the agent_correlator field: agent_correlator = <sys_id of the Discovery Status>.
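As an alternative to exporting records and calculating offline (described below), a background script along the following lines can compute the processing time for each record in a single Discovery Status and print the slowest ones. This is a minimal sketch: the agent_correlator value is only a placeholder, and for a very large Discovery Status you should narrow the query (for example, by sys_created_on) before looping over the results.

```javascript
// Sketch: list the 50 slowest ecc_queue records for one Discovery Status.
// The gap between "processed" and "sys_updated_on" approximates how long the
// probe (output) or sensor (input) actually ran.
var statusSysId = '45ee5fb34f674ac0f6d82ae6f110c7d8'; // placeholder: sys_id of your Discovery Status

var gr = new GlideRecord('ecc_queue');
gr.addQuery('agent_correlator', statusSysId);
gr.addNotNullQuery('processed'); // skip records that have not been picked up yet
gr.query();

var rows = [];
while (gr.next()) {
    var processedMs = new GlideDateTime(gr.getValue('processed')).getNumericValue();
    var updatedMs = new GlideDateTime(gr.getValue('sys_updated_on')).getNumericValue();
    rows.push({
        queue: gr.getValue('queue'),   // 'output' = probe, 'input' = sensor
        name: gr.getValue('name'),
        source: gr.getValue('source'),
        seconds: (updatedMs - processedMs) / 1000
    });
}

rows.sort(function (a, b) { return b.seconds - a.seconds; });
for (var i = 0; i < Math.min(50, rows.length); i++) {
    var r = rows[i];
    gs.print(r.seconds + 's\t' + r.queue + '\t' + r.name + '\t' + r.source);
}
```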
If you are a customer, it is probably best to export the ecc_queue records to CSV once you query them so that you can do the calculation on your local machine. You need to check the difference in time between the processed and sys_updated_on fields.

For ServiceNow employees querying the database directly: if your schedule spans two ecc_queue shards, you might need to run two different queries. You can use a query like this to get details about the 50 slowest ecc_queue transactions in a single Discovery Status (for example, a single run of a schedule):

SELECT sys_id, queue, sys_created_on, processed, sys_updated_on, agent, topic, name, source, TIMEDIFF(sys_updated_on, processed) AS tdiff FROM ecc_queue0001 WHERE agent_correlator = '45ee5fb34f674ac0f6d82ae6f110c7d8' ORDER BY TIMEDIFF(sys_updated_on, processed) DESC LIMIT 50;

Customers will need to browse through the list view of the UI. Grab a small slot of time when the issues are expected to have occurred and limit the result set to an affected Discovery Schedule. You can use the following URL as a starting point for building your query: it renders the list with the filter only, without any results. The ecc_queue table can be quite large and slow to query if no filter is used to reduce the results.

<yourInstanceName>/ecc_queue_list.do?sysparm_filter_only=true

If the gap was on the "output" queue, then you have slow probes. This points to some slow operation on the MID Server side. (Go to "Troubleshoot slow probes".)

If the gap was on the "input" queue, then you have slow sensors. Sensors are processed in the scheduler worker threads on the ServiceNow instance. (Go to "Troubleshoot slow sensors".)

If the issue is not solved by addressing the gap between the processed and sys_updated_on fields, then continue to (C) and check the difference between sys_created_on and processed.

Check the difference between the sys_created_on and processed dates of the ECC queue records. This tells you the time spent waiting to be processed. Once again, here is the life cycle of a probe and sensor:

1 - sys_created_on (output): When the ecc_queue output record is inserted.
2 - processed (output): When the MID Server has picked up the ecc_queue record, prior to actually running the probe.
3 - sys_updated_on (output): After the probe has finished executing on the MID Server, the probe response is returned and the input record is being created.
4 - sys_created_on (input): When the probe response is returned and the input ecc_queue record is created (should be close to output.sys_updated_on). At this time the state of the output probe is set to "processed".
5 - processed (input): Set after the scheduled Async Sensor processing job has begun, but just prior to actually running the sensor logic.
6 - sys_updated_on (input): When the Async Scheduled Job has finished executing the sensor.

If there is lag between the ecc_queue.sys_created_on and ecc_queue.processed fields, this indicates that the mechanism that processes ecc_queue records fell behind - as opposed to the mechanism that processes the code in the probes and sensors. In other words, either the scheduler workers that process the input queue or the SOAP requests that process the output queue were all busy and fell behind in picking up records from the ecc_queue.
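Before walking through the possible causes below, it can help to quantify this backlog. The following background-script sketch (illustrative only; the agent_correlator value is a placeholder) summarizes the sys_created_on-to-processed wait per queue for a single Discovery Status:

```javascript
// Sketch: average and maximum wait (sys_created_on -> processed) per queue for
// one Discovery Status. A large wait on "output" points at the instance-to-MID
// handoff; a large wait on "input" points at the scheduler workers.
var statusSysId = '45ee5fb34f674ac0f6d82ae6f110c7d8'; // placeholder: sys_id of your Discovery Status
var stats = {
    input:  { count: 0, totalSec: 0, maxSec: 0 },
    output: { count: 0, totalSec: 0, maxSec: 0 }
};

var gr = new GlideRecord('ecc_queue');
gr.addQuery('agent_correlator', statusSysId);
gr.addNotNullQuery('processed');
gr.query();
while (gr.next()) {
    var createdMs = new GlideDateTime(gr.getValue('sys_created_on')).getNumericValue();
    var processedMs = new GlideDateTime(gr.getValue('processed')).getNumericValue();
    var waitSec = (processedMs - createdMs) / 1000;
    var bucket = stats[gr.getValue('queue')];
    if (!bucket)
        continue;
    bucket.count++;
    bucket.totalSec += waitSec;
    if (waitSec > bucket.maxSec)
        bucket.maxSec = waitSec;
}

for (var queueName in stats) {
    var s = stats[queueName];
    var avg = s.count ? Math.round(s.totalSec / s.count) : 0;
    gs.print(queueName + ': ' + s.count + ' records, avg wait ' + avg + 's, max wait ' + Math.round(s.maxSec) + 's');
}
```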
If the gap was on "output" queue records, then it might be due to the following: SOAP semaphores being maxed out (go to Troubleshooting SOAP semaphores)A network issue (go to Troubleshooting a network issue between the instance and MID Server)MID Server going down (go to Troubleshooting a dying MID Server) If the gap was on "input" queue records, then it might be due to one of the following: The Sensors (or other input jobs) might be running slowly, causing a scarcity of available scheduler workers.Some other jobs besides Sensors are running slowly, causing a scarcity of available scheduler workers.(In either case, go to Troubleshooting 'input' queue not getting processed.) If the issue has not been solved by addressing the gap between sys_created_on and processed fields for either the output or input queues, continue to (D). Check for gaps in ECC queue.Sometimes a certain probe takes an excessive time to return and may still be running well after all the other probes have sent back a response. If this is happening, you should figure out and focus on that specific probe (go to Troubleshooting a Specific Probe).Check the IP Ranges to see if subnet masks include redundant IPs.Suffice to say that 10.15.0.0 with a /16 subnet mask (255.255.0.0) includes any host with a 10.15.x.x IP address. To find out more, go to Google and search for the phrase "subnet mask CIDR table". Use the most specific IP Ranges possible to avoid excessive Shazzam probe times. Whenever possible eliminate /16 subnets or lower. Troubleshoot slow probes.If you found probes with a large gap between processed and sys_updated_on in step 1.1 then this indicates slowness on the MID Server side. There are a few reasons that this could happen and ways to troubleshoot: Troubleshoot undersized MID Server.There are two key aspects of MID Server sizing, threads, and memory. If you increase the number of threads from the default, 25, then you will probably need to increase memory as well. As long as you have available resources on the server where the MID Server is running, you can increase the memory (generally up to 2GB but some have gone higher).You will need to test to see what works best for your situation. Sometimes increasing the size of the MID Server will not improve the speed of discovery because there is some other resources constraint.It is helpful to increase the size of the MID Server if the MID is running low on memory or all available threads are in use and therefore creating delay due to processor backlog. If all threads are in use processing a probe, then other probes start to back up and wait for an available thread. If you see the MID Server running out of threads, then you should find out what particular probes are taking the longest to process and determine why. Just because there is a large gap between the processed and updated time in the ECC record does not mean the probe was actually running that long.Troubleshoot Slow Shazzam probes.Slow Shazzam probes in particular are a limiting factor since only one Shazzam will run per MID service at a time. By using a cluster of MID services, you can increase the number of simultaneous Shazzam probes that can be run. So, for example, suppose you had a cluster with 5 MID Servers, and there is a schedule that uses this cluster with enough IPs that it takes 30 Shazzam probes. If each Shazzam probe takes 45 minutes to process (not unusual), then it will take five hours ((45*30)/5) just to get through the Shazzam probes. 
One way to increase the number of simultaneous Shazzam probes is to add more MID Servers to your cluster. One note: while it is physically possible to run more than one MID Server on a single physical server at a time, it is not recommended due to troubleshooting complications.

Other things you can do to troubleshoot slow Shazzam probes:
- Add Behaviors so that only the ports you care about are scanned
- Put MID Servers close to the targets
- Check for large ranges: do you really need to use a /16 subnet mask? Ideally, use more targeted IP scans.

Troubleshoot other types of slow probes.

Other types of slow probes can be a problem too; however, they are not bound the way Shazzam probes are. After each Shazzam probe completes, it sparks hundreds or thousands of other probes that can all run simultaneously. For example, if 50 UNIX - Classify MultiProbes take 45 minutes each, and you have at least two MID Servers in your cluster with 25 threads each, they could all be processed in 45 minutes, because each of the 25 threads in the two MID Servers could be simultaneously processing one of the probes.

Troubleshoot a specific probe.

If you have isolated a specific probe or probe type as being particularly troublesome, one approach is to re-run the associated IP in a sub-production environment. If the issue can be reproduced there, you can narrow down exactly what is wrong with that probe. Check the wrapper and agent logs while it runs, and add debug statements or make code changes to the postprocessor script if necessary. Alternatively, you can run the command that the probe runs locally from the command line, using an account with the same permissions as the credentials the schedule uses, and get more details that way.

Troubleshoot network issues between the MID Server and customer infrastructure.

Determine whether there are network issues between the MID Server and the devices being discovered. Remember, all troubleshooting should be done from the MID Server host for a direct comparison to how the MID Server is gathering the information. Run tests to validate network connectivity and confirm your port scanning results:
- Use ping to see if you can "see" the host on the network (ping <host>)
- If there is no ping response, use traceroute to see where traffic might be stopped (traceroute <host>)
- Use telnet to see if you can connect to any of the TCP ports (telnet <host> <port>)
- Use an SNMP scanning tool to see if a potential network device is responsive

Troubleshoot slow sensors.

Slow sensors might mean a broader system resource issue with the ServiceNow instance. It might be caused by Discovery or by something else. When there are sensor issues, one of your first steps should be to review the performance of the ServiceNow instance. Check the following common areas (see KB0547412 - "Tips for troubleshooting general performance issues" and KB0516495 - "Performance Troubleshooting Guide"):
- Scheduler worker queue length - are the scheduler workers falling behind?
- JVM heap space
- Application server memory/CPU
- Database server memory/CPU
- Database server slow SQL queries

To check the slowest queries related to Discovery sensors, review the slow query pattern logs in the application:

https://<instanceName>/sys_query_pattern_list.do?sysparm_query=url%3DASYNC:%20Discovery%20-%20Sensors%5EfirstONToday%40javascript:gs.daysAgoStart(0)%40javascript:gs.daysAgoEnd(0)&sysparm_first_row=1&sysparm_view=

If the issue seems to be related to specific types of sensors, there are two key properties that add insight into slow sensor operations. If you have identified latency in the processing of the input queue - as determined in step 1.1B above - you can use the following properties to get more information about what is causing the slowness. Note that most of the information that can be gleaned from the sensor logs can also be inferred from the ecc_queue, as described above in "Use the ECC queue to identify the slowest input (sensor) / output (probe) records".

- glide.discovery.log_sensor_metrics: Collects information about Discovery sensor processing in the [discovery_metric] table. Turn this property on and then review the [discovery_metric] table to see the following:
  - Queue time: the time between when a sensor was created in the ecc_queue and when a scheduler worker marked it as "processed" - in other words, the scheduler worker queue latency.
  - SQL count: the number of SQL queries executed during the processing of a single sensor.
  - Processing time: the time the JVM spent processing a single sensor, inclusive of both script and SQL processing.
- glide.discovery.sensors.debugging: Causes SQL debug information related to sensor processing to be written to the localhost logs and the syslog table.

Note: Always remember to turn these properties back off when you are done collecting diagnostic data.

To identify slow sensors, review the localhost logs covering the time of the discovery. Run the following grep to determine which sensors are taking the longest (time in ms):

grep "Processed sensors in" <logfile> | cut -d " " -f9-13 | sort

Example: grep "Processed sensors in" localhost_log.2015-09-25.txt | cut -d " " -f9-13 | sort

Troubleshoot "input" queue message not getting processed.

"Input" messages are put into the ECC queue by the MID Server to be processed on the instance. 99% of all "input" messages contain the results from a previous probe. The most likely reason for latency in the "input" queue is that all the scheduler workers are busy, which can happen for a number of different reasons. For information about troubleshooting the scheduler worker queue, see the Community article "Troubleshooting Events Processor and Scheduler Worker".

Troubleshoot "output" queue message not getting processed.

If ECC "output" messages are not being processed as soon as they are created, this could be due to SOAP requests not processing in a timely manner.
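To gauge whether output messages are piling up right now, you can count the records that are still waiting to be picked up. This is a rough sketch (not an official diagnostic) that counts ecc_queue output records in the "ready" state per MID Server:

```javascript
// Sketch: current output backlog per MID Server. A large or steadily growing
// count of "ready" output records suggests messages are not being handed to
// the MID Servers in a timely manner.
var ga = new GlideAggregate('ecc_queue');
ga.addQuery('queue', 'output');
ga.addQuery('state', 'ready'); // not yet picked up by a MID Server
ga.addAggregate('COUNT');
ga.groupBy('agent');
ga.query();
while (ga.next()) {
    gs.print(ga.getValue('agent') + ': ' + ga.getAggregate('COUNT') + ' output messages waiting');
}
```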
Troubleshoot SOAP semaphores.

After the fact, you can check whether the max queue depth for the SOAP semaphore queue was spiking. While the issue is happening, you can check stats.do to see what transactions are running in the SOAP semaphores. (Note: customers can only see the stats.do page for the node they are currently logged in to; however, there is a workaround: https://community.servicenow.com/message/821191#821191)

If the SOAP requests are running quickly and it is a matter of too much volume, find out whether an out-of-control integration is spamming the instance. If the SOAP requests are getting stuck or running slowly, you need to find out why. Click the link of the slow transaction on the stats.do page to see the stack trace of the thread while it is running. This shows you exactly what code is being run and helps ServiceNow support diagnose the issue.

Troubleshoot a network issue between the instance and MID Server.

This is essentially the same as troubleshooting any network connection between ServiceNow and a remote host.

Troubleshoot a dying MID Server.

- Work with the customer's server admin to check the resource constraints on the server (see the Resource constraints section below).
- Confirm the MID Server is correctly configured (check conf.xml and wrapper.conf).
- Check the agent and wrapper logs on the MID Server for error messages and warnings.

Configuration

MID host
- Virtual host
  - 4 GB RAM
  - 40 GB disk allocation
  - Multi-CPU share
  - Ensure that the virtual environment has the capacity to provide for this allocation.
- Operating System
  - Current Windows Server OS (32- or 64-bit)
  - Provisioned according to the customer's local policies around patches and security
- Network
  - 100 Mbps or greater connection
  - External internet access on port 443 to your ServiceNow instance
  - All ports and protocol access to targets within your environment

Items to consider around MID Server placement include:
- Available bandwidth
- Geographic location
- IP access to targets (DMZs)
- Should be installed on your own hosts

MID Server application
- Install in a unique folder structure as described in the product documentation topic "MID Server installation".
- Tuning the application:
  - /agent/conf.xml: threads.max - can be raised to increase thread parallelism on the MID Server (defaults to 25)
  - /agent/conf/wrapper.conf: wrapper.java.maxmemory=512 - the amount of memory the MID Server can allocate to itself from the host (increase in conjunction with the thread count as needed)

Resource constraints on the MID Server

Host CPU and memory utilization
- Available through the Scalar Metrics table [ecc_agent_scalar_metric].
- It is recommended to keep your host at about 80% total utilization at full scale.
- Using a base-configured MID Server host with 4 GB of local memory, you should have plenty of available resources to "give" to the MID Server application.

MID Server application CPU and memory utilization
- Available through the Memory Metrics table [ecc_agent_memory_metric].
- Max available bytes - the configured memory allocation for the MID Server as defined in the wrapper configuration file
- Max allocated bytes - what the MID Server application has allocated to itself from the configured available bytes
- Max used bytes - memory resources actively being used by the MID Server Java process

Network capacity
- How is network latency (ping/traceroute) during the heaviest Discovery work?

Discovery components

Classifiers
- The first classifier is the actual launch of a Discovery and the Shazzam probe.

Probes
- Instance-side cache: probe results that do not change from discovery to discovery can be cached (the probe's "Cache Results" field).
- To confirm payloads are cached, check the payload of the queue records: it says "processed" if the payload matches the cached result. (How does this actually work?)
- Are there custom probes? Do the probes bring back data that has already been collected? How important is that data?
- Test probes by providing an IP address and MID Server from a dev environment.

Sensors
- Handle the value returned by a probe and execute the next stage in Discovery.

Phases of Discovery

Scanning: Shazzam scans a range of IPs and returns answers about which ports responded.

Classification: For each device found in the scanning phase, Discovery sends a series of probes with credentials for authentication:
- Classifier probes return a device class (SSH, SNMP, WMI)
- Classifier sensors determine the identifier probes

Identification: Probes find out what type of device this is:
- A "Multiple" CI message indicates the CMDB has duplicates.
- Identifier sensors use Identifiers that query the CMDB and look for a matching device (de-duplication).

Exploration: Attempts to pull additional information from the devices:
- When a Discovery Status has different Started and Completed values, the discovery did not complete.
- Exploration probes look for further details, depending on whether Discovery should continue.

Setting up a schedule

- Stagger schedules (to avoid overlapping) - overlapping a little bit is okay. How long does it take to scan a /16 or /24? That depends on how many devices there are and how long each device takes.
- MID cluster - Set up your MID Servers as a cluster (load balancing allows parallel processing; failover only takes effect if one MID Server fails).
- Behaviors - Speed up Shazzam by remembering the ports that a certain device responds to.
- Software Discovery - For performance reasons, have this run only periodically. Limit, through a business rule (http://community.service-now.com/forum/7807), on which days this property is set to true.
- Shazzam chunking - Batch size: 5,000 IPs at a time return their results and start the Classification phase, so the subsequent phases of Discovery are not waiting for the entire range to be scanned. Cluster support: if set to true and you have a cluster of 3 MID Servers, the schedule range is broken into 3 pieces and one batch is sent to each server.

Additional Troubleshooting Tools: Impact Instance Observer ECC Queue Metrics

Instance Observer (IO) is an observability and performance monitoring tool that helps you keep track of your instance health and performance in near real time, while also providing historical insights. IO gives Instance Administrators, Platform Owners, and DevOps teams greater visibility into instance performance. Instance Observer includes a series of helpful reports for viewing the historical queue size and lag times of ECC Queue processing:
- ECC Queue message count (Input)
- ECC Queue message count (Output)
- Avg waiting time (Input)
- Avg waiting time (Output)
- Processed message delta (Input)
- Processed message delta (Output)

Performance benchmarks

Estimate per network

You can expect a single MID Server to discover:
- Single system - ~90 seconds
- Class 'C' network (254 IPs) - ~13 minutes
- Class 'B' network (16k IPs) - ~6 hours*

* These estimates vary depending on the total number of devices in each subnet. See the bandwidth graph in the product documentation topic "Discovery resource utilization".

Estimate per probe

A MID Server can process about one probe per minute per thread, and there are about 7 probes per device.
This varies from device to device (for example, Windows servers take many more probes than routers). Doing an estimate:

(<number of devices> * <probes per device (7)>) / (<threads.max> * <number of MIDs in cluster>) = <minutes to complete discovery>

So, for example, if you had a Discovery schedule that discovered 9810 devices and were processing the schedule with a 5-node MID cluster, it would look like this:

(9810 devices * 7 probes) / (25 threads * 5 MIDs in the cluster) = ~549 minutes, or about a 9.2-hour discovery schedule (give or take).

Logs

- agent log: includes details about probe launches and the operations that take place
- wrapper log: includes JVM information (memory, etc.)

Resource constraints

Instance-side bottlenecks:
- DB server: I/O, CPU
- DB process: memory, query cache
- App server: CPU, memory, I/O
- Java: heap, DB connections, Tomcat threads
- Worker threads: sensor processing limitations. If there are 8 sensors running on each node, that becomes a bottleneck.
- SOAP threads for insert/read of the ecc_queue
- ecc_queue DB reads/writes
- Sensors: cmdb and log DB reads/writes

MID Server-side bottlenecks:
- App: heap (default 512 MB, max 2 GB), threads (default 25, max 100)
- Server: CPU, I/O, memory
- Network: bandwidth, latency

Common issues

- If a certain device consistently fails, exclude it from the discovery.
- If threads hang and discovery doesn't finish, open an incident - that's a bug.
- If the Software Asset Management plugin is enabled, discovered software goes into the Software Installations table and a 1:* related table for software instances; otherwise it goes into the "Software Installed" related list and the software packages table (which will have as many records as there are instances of software).

There are a number of large tables related to Discovery that are regularly purged. If these tables grow large, they can slow down the overall speed of your sensor processing. Are there any slow DB queries associated with these tables?
- Debugging tip: What are the counts of the following tables?
  - discovery_device_history
  - discovery_log
  - discovery_network_track
  - ecc_queue
  - cmdb_tcp
  - cmdb_running_process
- Proposed solution: What is the retention period for these tables? If it is 30 days, recommend trying to drop it down to 7 days.

Is UNIX or Windows ADM taking a long time?
- Debugging tip: If so, verify the value of glide.discovery.auto_adm. This system property (sys_properties table) automatically creates process classifiers for application dependency mapping: when Discovery finds processes that are communicating over the network, it automatically generates "Pending Process" classifiers.
- Proposed solution: Do they require pending process classifiers? If not, disable them.
- Debugging tip: Can they test without ADM enabled? Disable the probes from the classifier.
- Proposed solution: If the issue goes away when ADM is disabled, it could be related to a known problem: PRB630274 - Application Dependency Mapping (ADM) performance is poor (fixed in Fuji Patch 5), or PRB646422 - Sensors for ADM taking illogical amounts of time to complete.

Is the Installed Software sensor taking time?
- Debugging tip: Does an index exist on cmdb_sam_sw_discovery_model (display_name, version)?
- Proposed solution: Add the index cmdb_sam_sw_discovery_model (display_name, version).

Are SNMP probes [classify, routing, switching] taking time?
- If there are thousands of OIDs, it is normal for the MID Server to take minutes to completely probe the device, and for the sensor to process it.
- Proposed solution: What type of device is it? If it is an edge router or core switch, it shouldn't need to be scanned every day. Exclude the device from the primary schedule and place it in a different schedule that can run weekly or bi-weekly.

Are there any custom probes?
- Proposed solution: Recommend that the customer examine each custom probe and see if there is an opportunity to speed it up.
- Proposed solution: Can any of these probes be cached if the information they gather is generally static?

Certain aspects of Discovery configuration are inherently slow. Ask the customer whether they really need the following configurations. If they say they do, challenge them to ask themselves: for what? They may decide it is not worth the overhead.
- Storage Server Discovery
- Application Dependency Mapping (ADM)

Historical issues

- Hung SSH probes: Eureka moved from J2SSH to SNCSSH, reducing the likelihood of stuck threads (old installations may need to use the MID Server config parameter use.snc.ssh).
- Fuji was dedicated to performance improvements: faster sensor processing, and 3PL offloading work to the MID Servers.