Performance Troubleshooting GuideIssue This article is an introduction to troubleshooting performance issues on the ServiceNow platform. By no means does it capture all the knowledge necessary to track down any problem that may present itself. Hopefully, the reader is able to come away with a basic framework for thinking about performance in a web application and gains tools to help track known performance issues in the ServiceNow application. Table of Contents Introduction Business Metrics: Focusing on the things that matterTroubleshooting Methodology Scoping the ProblemIdentifying Resource Constraints Performance GraphsTransaction LogsFile System Logs and Unix UtilitiesWhat if neither memory nor the database look constrained? CPUNetworkCache FlushingBlockingSemaphores and Concurrent Transactions Determining the Cause JVM Memory Sudden scarcity of JVM memoryJVM Memory LeaksGeneral Scarcity of JVM Memory Database Slowness Sudden Degradation in Database PerformanceDatabase Performance Tuning Remediation Threads Thread DumpsTracing Threads in the Filesystem Log Reproducing the Problem / Single Page Performance Debugging Session Debugging Examining performance problems from the perspective of what changed Introduction It is helpful to have as much understanding as possible of the application stack and system behavior before trying to solve a performance issue. The more a technician knows how the ServiceNow platform operates, the more likely they will be to find the root cause on performance-type issues. Performance issues can manifest anywhere in the product. Instead of knowing everything, there is to know about the ServiceNow platform, applications, how our customers use it, and so on, a technician will have to spend a considerable amount of time digging into new concepts. Performance troubleshooting can be time-consuming and while asking a lot of questions helps, ultimately everyone must direct and focus on their learning path as well as make a conscious effort to continue to learn in order to be successful. This section reviews some abstract concepts that provide a good foundation for thinking about the performance of the ServiceNow platform. Business Metrics: Focusing on the things that matter Business metrics are direct measurements of user experience and usage. The main business metric we are concerned with in regards to performance and outages is the response time. Another example of a business metric is transactions per minute. We focus on business metrics because they provide a direct line of sight to the user experience. This is in contrast to server metrics, such as memory in use and the input/output (I/O) wait, which provide insight into causes of service degradations but are not significant beyond their influence on a business metric. Up-time is another example of a business metric that the customer can review. Since the vast majority of outages are caused by slow performance, reviewing the transaction response time provides a better picture of user experience both in terms of precision and inclusiveness. Troubleshooting Methodology It's not brain surgery, but it needs to be scientific… Gathering information: Have a clear understanding of the issue you are trying to resolve. This should trace back to user observed symptoms or business metrics. Before trying to determine why a problem is occurring, it is important to understand what the problem is; who is experiencing slowness? What were they doing? Where was the problem seen? And when exactly did it occur?Developing a theory: Develop a theory of what may have caused the issue. This is an investigation with a purpose; if you do not know what you are looking for, then you probably will not find it.When troubleshooting a performance issue there are some theories that can be universally applied. They may not always be true, but they are always worth reviewing: Something changed: If performance was good and now it's bad, it is often faster to figure out what changed rather than trying to debug the application from scratch.Known issues: If a customer is having an issue, it is safe to presume someone else has encountered the same issue before. Our methodology for vetting customer symptoms against known issues is to identify what resource constraint caused poor performance, then vet the problem against a list of known issues for the specific type of resource contention. Testing the theory: Confirm your theory through observation and forensic analysis or by reproducing the problem in a controlled manner.Refine theory and repeat: Look to refine your theory until you reach a precise understanding of the issue.Developing a solution: When helping a customer with service degradation, there may be one or more solutions delivered over a period of time. It is important to keep in mind: A customer expects a resolution based on an understanding of the root cause and thus will prevent future occurrences. If what is being provided is a workaround or band-aid solution, then this should be communicated in an up-front manner.Time is often of the essence, so if there is a short-term solution available, we should offer it while we continue to investigate. It is easy to get caught up in investigating the root cause and lose sight of the big picture, a customer in trouble. Scoping the Problem The first thing to do when trying to diagnose a poor performance issue is to answer the following questions: Who? What? and When? The focus should always be to understand the cause of abnormalities in business metrics or actual customer reports of the problem. Performance graphs provide a good overview of transaction response time that can be used to help narrow down the timing of the problem. Alternatively, transaction response time data can be found in the transaction logs and the file system logs. If the transaction timing data captured by the system does not corroborate end-user reports of performance, this brings up a red flag and you should proceed to answer the following questions: Are you looking at the correct time zone?Is this a network-related performance issue? If the client transaction log is turned on, it displays data to help confirm a performance issue caused by network slowness.If the client transaction logs are not turned on, pointed interviewing of the affected end-user can help determine if the problem is network-related. Is the issue isolated to a particular ServiceNow application node? Performance graph data displays on a per-node basis and offers the option to toggle among nodes.If the issue only affects a single node, use this as a clue to identify the cause since it immediately eliminates any shared resources, such as the database. Identifying Resource Constraints Most performance issues are the result of resource contention. Identifying the specific resource constraint enables narrowing the root cause to be one of several known causes of contention or an unknown issue. The most common resource scarcities that cause performance degradation are JVM memory and database throughput (I/O). When evaluating resource utilization, we look at time-series data like the performance graph presented below. A useful analogy for how resource contention causes performance problems is highway traffic. Until a certain point, an additional car on the freeway has little impact on the speed of traffic, but when a threshold is reached, adding more cars causes traffic to slow to a crawl. Similarly, when available resources in the hosting environment are exceeded, the performance of a ServiceNow instance will slow down to a crawl. Performance Graphs Review the performance graphs to investigate if memory is high. If so, this indicates the cause of the issue is probably high memory utilization. Otherwise, the cause is possibly database (DB) slowness. The threads graph on the ServiceNow Performance homepage provides quick insight into what resource is constrained, especially in the case of DB slowness. Transaction Logs In lieu of the performance graphs, the transaction logs can also provide insight on which resource was constrained to cause a performance issue. The reasons why you might use the transaction logs rather than the performance graphs include: The performance graphs are not working.You need to examine a segment of time in greater detail than the transaction logs can provide.The performance graphs alone were inconclusive. To validate poor performance during a specific time and identify possible causes: Review the transaction log.In the Systems Logs application, use the Transactions modules to query between specified times of slowness or moments before the outage.Note: Verify the correct time zone is referenced in the transaction log. Personalize list calculations to display the average response time and average SQL time.Sort data in the Response Time column by (z to a) to show the highest response times at the top of the list.For both averages and outliers, determine the percentage of total transaction time classified as SQL time.If the SQL time makes up: Approximately 90% of the overall transaction time, the issue is probably caused by I/O wait on the server (slowness reading or writing the disk) or some database lock caused by an alteration of some table for example.Approximately 50% of the overall transaction time, the issue is probably caused by a low memory condition. SQL times display an increase in these cases however, everything else will also be slow and this is reflected in the percentage of time SQL takes relative to overall transaction time.Approximately 10% of the overall transaction time, then the incident may have been caused by something other than the above-mentioned. For example, CPU. File System Logs and Unix Utilities If you are able to verify long transaction times, perform the following steps: Look at logs for the time period when slow performance was experienced.Search for the time stamp that matches the time when the incident was reported in the log.Searching or using the grep command for Memory will show memory utilization. If memory is low, then it is probably not a memory problem. Unless proven otherwise, it is safe to assume the issue is caused by slowness at a database level.If there is a preponderance of slow SQL warnings, then this indicates there is an issue with disk I/O.If there is a preponderance of slow evaluations and compiles this points to an issue with memory but can also indicate cache flushing. If the instance was restarted by the triage-instance.sh script, then review the triage_*.txt file. What if neither memory nor the database look constrained? Roughly 80% of performance issues are related to JVM memory contention or I/O contention (database slowness). Other resources where scarcity can result in slowness include CPU, network bandwidth, and physical memory, this will manifest as database slowness as the system goes into swap. In addition, a system can be slow or unavailable as a result of problems other than resource contention. CPU Although it is rare to see performance issues caused by CPU, it is not unheard of. A common cause of high CPU utilization is a loop in the script. At most, a single execution of a looping script can consume a core. Since our servers have at least eight cores, it would require a similar number of threads be executing the looping code simultaneously. The other reason consists of a CPU rise on one of the application servers, which means there is a node on that server where garbage collecting is running aggressively. The ps command provides the total CPU time of each process on a server, and this data is written to the triage log. If you are able to catch this in the act, the top command can show the affected process. Network To an end-user, a network issue is experienced differently than any other type of performance issues. If only the server is slow, it will take a long time for the page to be served, but once it is, it will quickly load in the browser. With a network issue, it takes longer from the time the page begins to render in the browser to when it completes. A network performance issue is also distinct from the fact that server transaction times remain good. If the client transaction log is on, it is generally straightforward to identify a network performance issue. In the context of system performance, network refers to the time spent to send a request from the application server and receive a response back to the browser of an end-user. Network performance can be measured in terms of round-trip delay (ping time) and bandwidth (bytes per second). In the context of the ServiceNow application, not a lot of data is shipped after the first visit. As such, ServiceNow tends to be more concerned with round-trip delay than bandwidth. Round trip delays vary, below are some general expectations of what you can expect to see in the real world: SD → Boston: 80msSD → London: 150msSD → Sydney: 250ms You can test these yourself by pinging servers in our various datacenters around the world. University websites are usually a good target for comparisons since they tend to be self-hosted and do not use content delivery networks (CDN's). Cache Flushing Flushing the cache has an impact on performance, as data that is kept in JVM memory must be retrieved from the database. Cache flushes are logged in a table [sys_cache_flush] in the database. In a clustered system, the system log is written at the time of a cache flush to all nodes except the node that triggered the cache flush. Note: Selects against the sys_cache_flush table do not show cache flushing is occurring. Blocking Aside from resource constraints, the most prevalent cause of performance issues and outages is blocking. Blocking occurs when running synchronization at the application layer or in the database or because of artificial constraints on parallel activity. Occasionally, blocking is rooted in resource contention. However, there are cases where no resource is being monopolized, and a blocking can cause an outage. Semaphores and Concurrent Transactions Semaphores protect shared resources by controlling the amount of user transactions that can be run in parallel. It is common for engineers who are new to troubleshooting performance issues to confuse cause and effect in regard to semaphores and session synchronizers. Semaphores gate the number of concurrent transactions that an application node will process. If the system is slow, the semaphores are consumed by running transaction that takes longer than usual. To understand how blocking issues manifest, you must first review the path a transaction takes through the application: The user-executed request is received by the application server.A HTTP thread picks up the action of processing the request.Note: There are a finite number of these, typically 50 per node.The first action the thread performs is to review the following: Is there an available semaphore? If so, the system artificially gates the number of concurrent transactions. The basic premise is, if there are 100 transactions that need to be processed, it is faster to process the transactions in batches rather than trying to process them all at once. Past a certain threshold (relative to the number of CPUs), the work of moving between jobs is overhead that increases the total amount of time to complete all the transactionsIf there is an available semaphore, then the thread claims it and begins processing.If there are no semaphores available, then the thread waits until there is one. Does the user already have another transaction running? Session synchronizer prevents two or more transactions from being processed in a single session. In this model, notice how an outage can occur without a resource being over-utilized if all the semaphores get caught in an extended wait state. For example, a system that sends a Simple Object Access Protocol (SOAP) request to verify details regarding the affected configuration item to another system every time an incident is submitted. The external system for unknown reasons starts to take one minute to respond to requests. If there are eight semaphores and eight incident submission requests come in ten seconds, then the system will be down for 50 seconds. Determining the Cause So far, the focus has been to gain a basic understanding of an issue that is undergoing the troubleshooting process by determining which resource is constrained. Before jumping into the methodology for tracking down the root cause of a performance issue, a brief discussion on what constitutes (and what does not constitute) the root cause is in order. These issues happen because of a code written by either ServiceNow or the customer that is affecting the system behavior. It is possible the code is just bad. More often, the way in which the customer is using the system is unexpected or over the top. These issues can be exceedingly difficult to track down and require a lot of patience and tenacity. The following section covers how to track down and fix various types of issues from scratch. Before jumping into how this is done, it's worth pointing out notable alternatives to a deep dive. Check if you are dealing with a known issue that has already been found and is known to be present in the version the customer is running. Known performance issues are organized into categories based on resource constraints. Knowing this can dramatically cut down the list of known issues we need to check against a problem.Look at what changed. If a customer's system was running fine for months, and they suddenly start having problems, it may be worth reviewing what has recently changed on the system. An inspection of recent changes can quickly narrow down the number of possible causes. The closer the correlation between the start of the incidents related to a problem and the change the more likely it is to be the cause. JVM Memory The Java Virtual Machine (JVM) memory is used by the ServiceNow application to perform tasks. This resource is specific to a node, but a customer instance may constitute multiple nodes and therefore, multiple pools of JVM memory. A JVM performing a task uses memory. That memory remains in use as long as the objects using it remain in scope (something else in scope references that object). After an object is referenced, it is available to be collected, at which point it becomes available for use elsewhere. Garbage collection is an operation with a significant performance impact. As available memory decreases, the JVM spends more time in garbage collection resulting in further degrading performance. When a laptop is low on memory, it resorts to moving data from the memory out to the disk; this option does not apply with the JVM memory. Instead, the JVM memory is finite and if exhausted, an OutOfMemory (OOM) error is thrown. The OOM errors kill the currently running operation and can result in a full system outage. Sometimes, an instance will not experience an outage after an OOM error. This does not mean it can be ignored, in all cases an instance should be restarted after receiving an OOM error. Failure to do so can result in erratic behavior or even data loss. Performance issues relating to JVM memory generally fall into the following categories: The sudden scarcity of JVM memory A sudden shortage of JVM memory is caused by the system having to move a lot of data around at once. They are also bugs. The platform has many protections against this type of issue, but the nature of an open ServiceNow platform will continue to allow customers to find unexpected ways to run low or out of memory. This is an event that impacts performance and suddenly occurs as the result of a significant decrease in the amount of available memory and is usually caused by a single thread. An increase in memory can happen as quickly as seconds but more often takes minutes, less than an hour. Performance suffers as the number of free memory decreases, as the JVM has to spend more time running the garbage collection process. Finding the trigger thread Determining the cause of this type of memory issue starts with finding the thread that is using up all the memory. Since there is no logging available to show memory utilization by thread, it is necessary to correlate the start of the event, when memory utilization starts to rise, with the start of a transaction or thread. There are several different places to find this information. Since none of the methods are guaranteed to work for all incidents its important to be familiar with them all. OutOfMemory Exception This type of sudden run-up in memory utilization may result in an OutOfMemory (OOM) exception The thread that throws the OOM exception is usually the one that causes it. If multiple threads throw an OOM exception, then the last one to do so is most likely the trigger thread. Using stats.do to find trigger thread If stats.do is run when the memory is low, it will provide a list of all semaphores and scheduled workers, as well as how long they have been running for. Sometimes it can be somewhat obvious which thread is the likely offender. Semaphore Sets Default Available semaphores: 0 In use semaphores: 0: 43BE29DA354C2040AC091D1C01CE130B: /incident.do (http16004-Processor43) (0:12:43.506) 1: 0BBFE15E354C2040AC091D1C01CE13E3: /incident.do (http16004-Processor36) (0:00:12.604) 2: D6BE611E354C2040AC091D1C01CE13A8: /problem.do (http16004-Processor11) (0:00:39.520) 3: 9343D0CA35C82800AC091D1C01CE1317: /xmlhttp.do (http16004-Processor45) (0:00:12.612) 4: DC976FC6354C6C00AC091D1C01CE13BE: /home.do (http16004-Processor28) (0:00:51.521) 5: FBBE69DA354C2040AC091D1C01CE13F3: /incident.do (http16004-Processor15) (0:02:09.041) 6: 07BF255E354C2040AC091D1C01CE13C5: /home.do (http16004-Processor10) (0:00:39.589) 7: D2962BCE350C6C00AC091D1C01CE131B: /xmlhttp.do (http16004-Processor38) (0:22:11.013) 8: 6A3E499235C42040AC091D1C01CE135A: /home.do (http16004-Processor49) (0:01:56.384) 9: 91BE211E354C2040AC091D1C01CE135E: /incident.do (http16004-Processor30) (0:00:44.420) 10: BBE9BC9E35402040AC091D1C01CE1315: /home.do (http16004-Processor9) (0:01:02.787) 11: A6BF6D1E354C2040AC091D1C01CE13D9: /home.do (http16004-Processor23) (0:00:27.099) 12: B583E09E35C8EC00AC091D1C01CE136C: /sc_task.do (http16004-Processor18) (0:00:10.503) 13: 64EBB22135002000AC091D1C01CE1317: /ecc_queue.do (http16004-Processor41) (0:00:00.012) 14: 49BFE55E354C2040AC091D1C01CE1397: /navpage.do (http16004-Processor39) (0:00:02.644) 15: E72B8AC23544A800AC091D1C01CE1313: /home.do (http16004-Processor3) (0:00:37.070) In the image above, semaphore #7 is of great interest since it has been running much longer than the other transactions. Using transaction log to find trigger thread The transaction logs are another place to search for the threads that might be responsible. This only works if the instance is recovered without having to restart. If the thread died when the node was shut down, an entry for the node will not be written to the transaction log. For more information about reading transaction log entries, refer to section 3.2 Transaction Logs. Using file system based log to find trigger thread The filesystem-based log is the most exhaustive source of information about what happens on a system and it practically always contains a clue when searching for a thread that might have triggered a low memory condition. Starting with the Berlin release, a node will dump the equivalent of threads.do to the filesystem log at the time of a shutdown. This includes execution time for running semaphores and scheduled workers. If the long-running thread happens to be a semaphore, the start and end of every transaction is logged to the file system log as seen below: #Note that the number following the pound sign is a sequential UID for that transaction. 2012-02-09 00:04:51 (362) http-16 64EBB22135002000AC091D1C01CE1317 *** Start #238,599, path: /ecc_queue.do, user: TECMIDSRV 2012-02-09 00:04:51 (409) http-16 64EBB22135002000AC091D1C01CE1317 *** End #238,599, path: /ecc_queue.do, user: TECMIDSRV, time: 0:00:00.051, render: 0:00:00.051, network: 0:00:00.000, chars: 257, SQL time: 25 (count: 5), business rule: 0 (count: 0) The following image is an example of a log message that may provide a clue(s) as to which thread has triggered a sudden scarcity of memory: #Compacting large row block: indicates we are retrieving a large chunk of data from the database. 2012-02-09 16:13:50 (829) http-43 43BE29DA354C2040AC091D1C01CE130B [0:00:00.078] Compacting large row block #Long running #Large Result set Root cause and remediation Once the thread that triggered the sudden scarcity of JVM memory is determined, the next step is to understand what the thread was doing at the exact time it ran the system low on memory. If a thread dump was taken, it makes investigation much easier. A thread dump may be found in any of the following places: Attached to the incident or problem by the responding technician.In the triage log created when the system was restarted. This varies depending on Gen1/Gen2.In Berlin and subsequent releases: In the localhost log taken when the system was restarted. Note: Thread dumps are further discussed in topic 5.1.1 Thread Dumps. JVM Memory Leaks A memory leak is a gradual decrease in available memory that left alone, culminates in poor performance and eventually a OutOfMemory error. The signature of a leak is a slow increase in post garbage collection JVM memory utilization over days, weeks, or months. The following graph shows an example of a leak; notice the memory utilization increases in week 47, 48, and 49 then is disrupted by a restart of the system that frees memory. In week 50, notice the leak has been fixed and memory utilization returns to a regular and consistent pattern. A leak in the JVM memory consists of a slow decrease in accessible memory. In this case, a full garbage collection and cache flush does not make the conversion to available memory. Memory leaks are difficult to track down from observation of the system, inspection of the logs etc. Determining the cause of a memory leak requires taking a JVM heap dump and using off-the-shelf debugging tools such as YourKit to analyze it. Before resorting to this method of debugging, it is recommended to first check if there is a known problem that can be vetted out. The list of known memory leak problems has the verification steps where possible that usually consist of killing a thread where the leak is located. If it is not possible to confirm a leak as a known issue, then a dump should be taken and analyzed. General Scarcity of JVM Memory A node is experiencing a general scarcity of JVM memory if: It is experiencing slowness as a result of high memory usage.There is no evidence of a leak and no gradual increase in post-GC memory floor.A forced cache flush and garbage collection are successful when trying to reduce the amount of memory in use back to a normal range, roughly 1/3 of the total might be considered normal. Ultimately, we will only get away with doing so much work (in parallel) with a given amount of memory before things slow down. The maximum amount of memory that can be allocated to a given node is 2 gigs. JVM memory is scaled horizontally by adding additional nodes therefore the solution is to increase the amount of memory allocated to the node if it currently is less than 2 gigs or to add more nodes. There is variability in regard to how many transactions can be processed with a given amount of memory due to customizations. General sizing guideline: One node sized to 2 gigs should be able to handle it.At least 150 transactions per minute (full pages).Almost definitely going to be pinched above 300 transactions.Scheduled job processing can skew these numbers. Database Slowness The actual resource constraint investigated in the context of the database is I/O. I/O refers to reading and writing to the disk. In the context of ServiceNow, I/O consists almost exclusively of executing SQL statements. Unlike JVM memory, I/O constraints are felt server-wide. Since all I/O is SQL, we can look at the slow SQL statements to determine what is consuming this resource. When querying from the database, it is important to understand the concept of caching. MySQL has a database cache, the InnoDB buffer pool, which caches frequently accessed portions of the database (DB). The InnoDB buffer pool accounts for the major part of memory utilization on our servers, especially dedicated DB servers. The database generally performs well as long as it rarely processes queries of data that are not in the buffer pool. Sudden Degradation in Database Performance Database performance can suddenly degrade. When it does, it is usually caused by a locking contention or a highly expensive maintenance operation. Examples include truncates, drops alters, create indexes, and so on. Database-level locks occur because an operation requires exclusive access. For MySQL version 5 run in our data centers, the following locking behaviour exists: Alter table/Create index: Table lock on the table that is being altered or indexed.Drop/Truncate: Global lock on the entire DB server (all catalogs). Database Performance Tuning Troubleshooting slowness at the DB level is often a tuning exercise. Over time, the load customers place on the DB increases as the size of the table increases and the additional functionality is implemented. A code for functionality written by a customer may be ineffective. When tuning database performance, the practicable way to improve performance is to find the most egregiously inefficient query pattern/source and fix it. In other words, it is recommended to start by profiling the SQL transactions before optimizing. Profiling SQL Transactions When profiling database transactions, it is safe to presume the cost of a query will vary proportionally to the amount of time it takes to execute the query. The following is a helpful model used to profile SQL statements: Work = Query_execution_time * Query_execution_frequency The ServiceNow platform logs all queries to the localhost log, which takes longer than 100 ms. By filtering the log entries for Slow SQL using the grep command, it allows you to acquire a picture of which queries are probably having the biggest impact on the database performance. The following examples illustrate how to use the grep command to profile SQL statements: #Example 1 #Print all slow SQL statements for all nodes on a server that happened on a given day egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt #Example 2 #Count all slow SQL statements for all nodes on a server that happened on a given day egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt | wc -l #Example 3 #Count all slow SQL statements for all nodes on a server that happened on a given day egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt | wc -l #Example 4 #Print all slow SQL statements for all nodes on a server that happened on a given day #and took at least 1 minute to execute egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt | grep -v "0:00" #Example 5 #Print all slow SQL statements for all nodes on a server that happened on a given day #and took at least 10 minutes to execute egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt | grep -v "0:0" #Example 6 #Print slow statements against a particular table egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt | egrep "sysapproval_approver.*WHERE" #Example 7 #Print slow statements against a particular table with a specific where clause. #note we're replacing single quotes with single char wild card "." egrep " Time: .* for: " /glide/instances/*/logs/localhost_log.2012-02-10.txt | egrep "sysapproval_approver.*WHERE sysapproval_approver0..state. != .not_required.*" As the first step, it is recommended to divide the dataset by execution time using a series of commands such as: grep -v "0:00" | wc -l, grep -v "0:00". Doing so will help narrow the focus to a subset of SQL statements where optimization will have the greatest impact. The table below illustrates how to divide the data and direct focus: Count of slow SQLThresholdSum10000>10 seconds2:46:401000> 1 minute16:40:00200> 10 Minutes33:20:00 For this example, it is recommendable to focus first on the SQL statements that take longer than ten minutes to execute. It is worth coming back to the > 1-minute slice later. Sub-divide the data set based on the query pattern. When doing this, search for the largest block that can be optimized. The easiest way to accomplish this is to manually inspect the data searching for a pattern of similar queries (on a single table that contains the same or similar Where clause).Filter the log file using the grep command, then count them and note the count. For example, #Example7 shown above piped to wc -l).Invert the matching on the grep (append -v) and again manually search the output to find another common pattern repeating steps above to obtain the count for the pattern and filter out.After you have narrowed down the list to queries without a repeating pattern, have a list of patterns and counts similar to the following: PatternCountsysapproval_approver WHERE approver=70fx_price WHERE id=30sys_user WHERE u_custom= AND u_one_off20... Note: This serves as a roughly prioritized list of queries to optimize. Optimizing SQL Queries There are multiple ways to optimize slow SQL transactions in a ServiceNow environment. Add a covering index to the queryChange the query to use an existing index.Eliminate the query altogether. If a query is redundant, this can be accomplished without having to remove functionality. A simple example of adding an index The following is a simple example of how to optimize a query by adding an index. Adding an index can cause performance issues and usually requires a maintenance window. Given the following query: SELECT count(*) AS recordcount FROM sysapproval_approver sysapproval_approver0 WHERE sysapproval_approver0.`approver` = 'f42c36d40a0a3c19006435056742f298' AND sysapproval_approver0.`state` = 'requested' First, look to see what indexes already exist on sysapproval_approver: mysql> use swissre_16004 Database changed mysql> snow mysql-showindexes from sysapproval_approver; +----------------------+------------+------------------------------+--------------+-----------------+-----------+------------- | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality +----------------------+------------+------------------------------+--------------+-----------------+-----------+------------- | sysapproval_approver | 0 | PRIMARY | 1 | sys_id | A | 1655741 | sysapproval_approver | 1 | sysapproval_approver_ref0 | 1 | sysapproval | A | 551913 | sysapproval_approver | 1 | sysapproval_approver_ref1 | 1 | approver | A | 7998 | sysapproval_approver | 1 | sysapproval_approver_ref4 | 1 | state_binding | A | 14 | sysapproval_approver | 1 | sysapproval_approver_ref5 | 1 | process_step | A | 14 | sysapproval_approver | 1 | sysapproval_approver_ref | 1 | u_requested_for | A | 42454 | sysapproval_approver | 1 | sysapproval_approver_ref2 | 1 | u_location | A | 14 | sysapproval_approver | 1 | sysapproval_approver_ref3 | 1 | u_change | A | 48698 | sysapproval_approver | 1 | sysapproval_approver_nor1 | 1 | document_id | A | 275956 +----------------------+------------+------------------------------+--------------+-----------------+-----------+------------- 9 rows in set (2.27 sec) mysql> Next, create a covering index: mysql> snow mysql-createindex jml_cov1 on sysapproval_approver (state, approver); Query OK, 0 rows affected (0.21 sec) Records: 0 Duplicates: 0 Warnings: 0 Finally, test the index is actually getting used by running the snow mysql-explain command: mysql> snow mysql-explain SELECT count(*) AS recordcount FROM sysapproval_approver sysapproval_approver0 -> WHERE -> sysapproval_approver0.`approver` = 'f42c36d40a0a3c19006435056742f298' -> AND sysapproval_approver0.`state` = 'requested'; +----+-------------+-----------------------+------+-----------------------------------+---------+---------+-------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------------------+------+-----------------------------------+---------+---------+-------------+------+--------------------------+ | 1 | SIMPLE | sysapproval_approver0 | ref | sysapproval_approver_ref1,jmlcov1 | jmlcov1 | 220 | const,const | 27 | Using where; Using index | +----+-------------+-----------------------+------+-----------------------------------+---------+---------+-------------+------+--------------------------+ 1 row in set (0.02 sec) mysql> Example of changing code to use an existing index Given the following query: SELECT fx_price0.`sys_id` FROM fx_price fx_price0 WHERE fx_price0.`id` IN ('eb887c570a0a3c2c00a6ebe792a1ebbf' , 'eb887ee80a0a3c2c010e614649768988' , 'eb8881400a0a3c2c013580448779a010' , 'eb8883d80a0a3c2c0097ab2facf1047f') AND fx_price0.`field` = 'price' mysql> snow mysql-showindexes from fx_price; +----------+------------+-----------------+--------------+--------------------+-----------+------------- | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality +----------+------------+-----------------+--------------+--------------------+-----------+------------- | fx_price | 0 | PRIMARY | 1 | sys_id | A | 1196877 | fx_price | 1 | fx_price_index1 | 1 | table | A | 17 | fx_price | 1 | fx_price_index1 | 2 | id | A | 1196877 | fx_price | 1 | fx_price_index1 | 3 | field | A | 1196877 | fx_price | 1 | fx_price_ref3 | 1 | currency | A | 17 | fx_price | 1 | fx_price_ref5 | 1 | reference_currency | A | 17 | fx_price | 1 | fx_price_ref8 | 1 | parent | A | 17 | fx_price | 1 | jmlcov1 | 1 | id | A | 1196877 | fx_price | 1 | jmlcov1 | 2 | field | A | 1196877 +----------+------------+-----------------+--------------+--------------------+-----------+------------- 9 rows in set (0.94 sec) mysql> The following code is determined to issue this query: *******BEFORE********************************************************************** protected GlideRecord getGlideRecord(int first, int count, String table) { GlideRecordKeySetReader reader = new GlideRecordKeySetReader(fGlideRecord); ArrayList al = reader.getKeys(first, count); GlideRecord gr = new GlideRecord(table); gr.addQuery(ID, al); gr.addQuery(FIELD, fElementNames); gr.setQueryReferences(true); gr.query(); return gr; } ************************************************************************************ Rewriting the query to an additional query on the table allows us to use the index, fx_price_index1. *******AFTER********************************************************************** protected GlideRecord getGlideRecord(int first, int count, String table) { GlideRecordKeySetReader reader = new GlideRecordKeySetReader(fGlideRecord); ArrayList al = reader.getKeys(first, count); GlideRecord gr = new GlideRecord(table); gr.addQuery(TABLE, DBObjectManager.get().getAllExtensions(fGlideRecord.getTableName())); gr.addQuery(ID, al); gr.addQuery(FIELD, fElementNames); gr.setQueryReferences(true); gr.query(); return gr; } ************************************************************************************ Remediation Most performance issues are resolved by tracing an issue back to a known problem in the system. The known issue can be very specific, for example, "Load a large number of data into table x and visit a particular URL. Twenty minutes later the system is unavailable". They can also be very general, for example, "Customers can create their own tables and fields and can run ad-hoc queries but cannot add indexes. As a result, their database queries are poorly optimized." For straightforward defects in the product, we have patches and workarounds that can be applied. For a general architectural challenge, we have standard operating procedures that determine how to deal with such problems. If the root cause of the problem is not a known issue, root cause analysis will require deeper investigation. Threads Threads are streams of execution in the application server. The threads used can be grouped into two categories: background and foreground. Foreground threads respond to HTTP requests to the application server. They process regular user requests to serve pages and serve AJAX requests and SOAP. Background threads work at some interval and not directly in response to a request to the application server. The work done by the application server takes place on a thread, therefor the performance that impacts the code is also executed by a thread. Thread Dumps Thread dumps show exactly which code path is being executed by each thread at a certain point in time. Thread dumps can help take a lot of the guessing away from figuring out which specific code is causing the performance issue. Each line in a thread dump shows the exact line of code called by the previous line of code. main,glide.scheduler.worker.0,4,ASYNC: Monitoring - Process SNC queue (592 ms) org.mozilla.javascript.regexp.NativeRegExp.executeREBytecode(NativeRegExp.java:1988) org.mozilla.javascript.regexp.NativeRegExp.MatchRegExp(NativeRegExp.java:2443) org.mozilla.javascript.regexp.NativeRegExp.executeRegExp(NativeRegExp.java:2515) org.mozilla.javascript.regexp.RegExpImpl.matchOrReplace(RegExpImpl.java:182) org.mozilla.javascript.regexp.RegExpImpl.match(RegExpImpl.java:66) org.mozilla.javascript.NativeString.execMethod(NativeString.java:238) org.mozilla.javascript.IdFunction.call(IdFunction.java:78) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c911.call(sys_script_include.3ceff5790a0a0bae00e49c7f3c7ced38:91) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c912.call(sys_script_include.3ceff5790a0a0bae00e49c7f3c7ced38:87) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c907.call(sys_script_include.3ceff5790a0a0bae00e49c7f3c7ced38:17) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c729.call(sys_script_include.d2426c9ec0a8016501958bf2ac79c775:132) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1550) com.glide.script.RhinoObject.callFunction(RhinoObject.java:110) com.glide.script.RhinoObject.callFunction_ScriptableObject(RhinoObject.java:147) com.glide.script.GlideRhinoHelper.jsonToObject(GlideRhinoHelper.java:169) com.glide.glideobject.GlideElementNameValue.setFirstTime(GlideElementNameValue.java:129) com.glide.glideobject.GlideElementNameValue.setInitialValue(GlideElementNameValue.java:39) com.glide.script.GlideElement.newGlideElement(GlideElement.java:93) com.glide.script.GlideRecord.loadRow0(GlideRecord.java:2935) com.glide.script.GlideRecord.loadRow(GlideRecord.java:2817) com.glide.script.GlideRecord.next(GlideRecord.java:5055) com.glide.script.GlideRecord.jsFunction_next(GlideRecord.java:4549) inv5.invoke() org.mozilla.javascript.FunctionObject.doInvoke(FunctionObject.java:565) org.mozilla.javascript.FunctionObject.call(FunctionObject.java:480) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3033.call(sys_script_include.21ad66a10a0a0b7c00180f6799f51780:27) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3027.call(sys_script_include.a85367d3c0a801020000e98b08941455:11) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.BaseFunction.applyOrCall(BaseFunction.java:431) org.mozilla.javascript.BaseFunction.execMethod(BaseFunction.java:176) org.mozilla.javascript.IdFunction.call(IdFunction.java:78) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c377.call(sys_script_include.d22e7bdbc0a8016500a18e024bfc9aa3:4) org.mozilla.javascript.BaseFunction.construct(BaseFunction.java:217) org.mozilla.javascript.ScriptRuntime.newObject(ScriptRuntime.java:1270) org.mozilla.javascript.gen.c3022.call(sys_script_include.5ff0c9ac0a0a0b0a00a0c2dddbdd36d5:40) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3025.call(sys_script_include.5ff0c9ac0a0a0b0a00a0c2dddbdd36d5:88) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3021.call(sys_script_include.5ff0c9ac0a0a0b0a00a0c2dddbdd36d5:17) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3013.call(sys_script_include.03fafe6bc0a8010300fdf160ba2b9f84:213) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3372.call(sys_script_include.95b27b39c0a8016500c26d339e44be7c:73) org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1196) org.mozilla.javascript.gen.c3367.call(sys_trigger.33d43c625c482000957ba8a53a015553:12) org.mozilla.javascript.optimizer.OptRuntime.callSimple(OptRuntime.java:275) org.mozilla.javascript.gen.c3368.call(sys_trigger.33d43c625c482000957ba8a53a015553:2) org.mozilla.javascript.gen.c3368.exec(sys_trigger.33d43c625c482000957ba8a53a015553) com.glide.script.ScriptEvaluator.execute(ScriptEvaluator.java:156) com.glide.script.ScriptEvaluator.evaluateString(ScriptEvaluator.java:56) com.glide.script.Evaluator.evaluateString(Evaluator.java:82) com.glide.script.Evaluator.evaluatePossiblePrefixedString(Evaluator.java:152) com.glide.job.RunScriptJob.evaluateScript(RunScriptJob.java:129) com.glide.job.RunScriptJob.runScript(RunScriptJob.java:107) com.glide.job.RunScriptJob.execute(RunScriptJob.java:76) com.glide.schedule.JobExecutor.execute(JobExecutor.java:72) com.glide.schedule.GlideScheduleWorker.executeJob(GlideScheduleWorker.java:158) com.glide.schedule.GlideScheduleWorker.process(GlideScheduleWorker.java:125) com.glide.schedule.GlideScheduleWorker.run(GlideScheduleWorker.java:55) From the previous stack, dump an example of how we can follow the execution path of a thread: ... Tracing Threads in the Filesystem Log When treating an issue that does not have a known root cause, use remediation to find the responsible code. Before this stage in the troubleshooting process, you have likely identified evidence in the log file related to the issue. By understanding the most common types of executing threads and knowing what data is written in the logs, it is possible to gain further insight into the cause. Foreground Transactions Foreground transactions are executed on HTTP threads and occur in response to the get or post commands.Most foreground transactions are gated by semaphores and session synchronizers. Transaction Processors There are several different types of foreground transactions and thus rules on how a request on a particular URL is handled: For any table in the system, [table_name].do resolves a form for that record and [table_name]_list.do resolves a list of records in the table.The UI Pages can be called using [ui_page.name].do.Jelly templates in the file system such as wizard_view.xml can be called by replacing .xml with .do.Use processors to run an HTTP request if either the path matches or the parameters match.Note: The most notable of all processors is xmlhttp, which handles AJAX requests.There are multiple hardcoded URLs such as stats.do, threads.do, and so on. Typical Logging To start a typical transaction, process a get for the following URL: /incident.do?sys_id=c92bc7718f4020001c519cfde0f92304&sysparm_view=&sysparm_record_list=active%3Dtrue%5EORDERBYDESCnumber 2012-02-13 12:59:55 (890) http-1 61A9E5A38F4420001C519CFDE0F923ED Parameters ------------------------- sys_id=c92bc7718f4020001c519cfde0f92304 sysparm_record_list=active=true^ORDERBYDESCnumber sysparm_view= 2012-02-13 12:59:55 (890) http-1 61A9E5A38F4420001C519CFDE0F923ED *** Start #355, path: /incident.do, user: jared.laethem 2012-02-13 11:40:19 (925) is the date and time stamp for the transaction, down to the millisecond. Notice the time zone is Pacific across all our data centers (on-prem can vary).UI transactions are processed by HTTP threads which are numbered sequentially, where 1 - Max. In this case, we have http-15. Every line written to the log by this thread will contain this tag.Each UI transaction log entry will also be tagged with session ID: BE56D9678F4420001C519CFDE0F923C9.The transaction is given a sequential number, in this case #355. This is transaction number 355 which has been processed by this node since it was last restarted.At the start of each transaction we log the URL, incident.do, and parameters. Notice the parameters are the deconstructed URL parameters from the get request. Once the process begins, log additional lines for a particular transaction as follows: 2012-02-13 12:59:56 (500) http-1 61A9E5A38F4420001C519CFDE0F923ED [0:00:00.178] slow evaluate for: var gl = Packages.com.glide.sys.GlideLocale.get(); var g_tz_offset = new Packages.com.glide.glideobject.GlideDateTime().getTZOffset(); var g_max_table_length = Packages.com.glide.db.DBUtil.getMaxTableLength(); 2012-02-13 12:59:57 (096) http-1 61A9E5A38F4420001C519CFDE0F923ED [0:00:00.362] Call of: actions2.xml 2012-02-13 12:59:57 (517) http-1 61A9E5A38F4420001C519CFDE0F923ED [0:00:00.283] Call of: list2_action_banner_buttons.xml Notice each line contains the thread name and session ID. Other information that might be found in the transaction log between the start, and end depends on the transaction specifics, however, there are some general rules: Log things that are slow. This includes, but is not limited to: Slow SQL statementsSlow business rule executionsSlow form renders Exceptions and warnings: These dump the stack, which is gold when trying to debug anything.When troubleshooting performance, exceptions and warnings can be a red herring. If there is not a good reason to think that the exception is causing a performance bottleneck it, probably is not a good reason to presume they are even contributing. 2012-02-13 12:59:57 (956) http-1 61A9E5A38F4420001C519CFDE0F923ED *** End #355, path: /incident.do, user: jared.laethem, time: 0:00:03.246, render: 0:00:03.246, network: 0:00:00.000, chars: 39,020, SQL time: 165 (count: 115), business rule: 0 (count: 5) 2012-02-13 12:59:57 (958) http-1 61A9E5A38F4420001C519CFDE0F923ED /incident.do -- transaction time: 0:00:03.331, source: 0:0:0:0:0:0:0:1%0 At the end of each transaction, log the following: A breakdown of execution time, same numbers that appear in the regular (not client) transaction log.The transaction number again. Background processing Scheduled jobs are defined in sys_trigger, A sys_trigger entry can be assigned to a particular node or otherwise will run on any available node. A node can be set to run any scheduled job or only jobs assigned specifically to it. Other worker threads Progress workers extensions of WorkerThread.java, Localhost MidServer, among others.Note: Eclipse search comes in handy here. Reproducing the Problem / Single Page Performance Debugging This section covers troubleshooting steps given the following scenarios: General performance issue: Up to this point, this troubleshooting article has discussed general performance issues that affect all users and all pages on a system. If you are lucky to capture a good stack dump that shows what the system was doing, this may have enough information to identify root cause of a problem. If not, the next step is to try to reproduce the issue. If you are not able to reproduce the issue, then try to reproduce the problem on a ServiceNow instance that has not been customized. To gain insight into the necessary to reproduce, for example, configuration changes to be made, and so on. It may be necessary to obtain additional information than what is commonly written to the log at the time of execution. In other words, assume you know how (review section on logging of transactions if its unclear how to obtain this assumption) to trigger the problem on a customer's system but we do not know what it is about their system that makes performing action X cause a performance problem or outage. Individual page performance: Another performance troubleshooting scenario, aside from all users are experiencing slow response times, is when a particular page in the instance is slow. Troubleshooting steps are similar, so they are combined in the following section. Note: Before reproducing, move to working if at all possible on a copy of the system that had the original issue. Session Debugging In the system, there are several session-specific debugging options that can be activated. When the debugging options are enabled, they allow detailed information regarding the page processing to be written to the bottom of each page when a full-page transition occurs. Session level debugging options include: Business rule debugging.SQL debugging.Security debugging. It is best to use the Enable all module to avoid missing any details. What you are looking for is to identify what is causing the system to take so long. Each line has a precise timestamp as well as a timer to show how long a particular item took. When reading session debugging data, it is important to know that the debug output for a given page may include more than just debugging information related to the rendering of that page. All processing related to a session goes into the session debugging buffer but is not written until the next full page is loaded. This includes most notably processing of AJAX requests, processing of form submission, or other actions. Take-away from this is that session debugging is useful for more than just debugging page loads. Each separate transaction is separated by a delimiter ">>>" as shown below: If this is the case, debugging the output related to the current pages rendering can be found by scrolling to the last instance of >>> Preceding lines from previous transaction in the debug output and continue to read from there. Session debugging can only be used for debugging foreground transactions. If while debugging a background transaction you need the same level of detail on what is occurring as offered by session debugging, it is usually possible to execute the equivalent of what the background job is doing but on a HTTP thread by executing script in Scripts - Background. What exactly the script is varies but the most common example is running a scheduled job of type "Run script job" in the foreground. Doing this is as simple as pulling the script out of the sys_trigger entry for the job and executing. Examining performance problems from the perspective of what changed One of the troubleshooting theories or processes that can expedite root cause analysis and remediation is to start by looking at what has changed. While it never hurts to ask, it is very rare for a customer to know or admit that something changed to cause their performance problem. The following list describes the places to search to help identify what may have changed on a system ordered by likelihood to matter: Update Sets: Update Sets are defined in two table sys_update_set and sys_update_xml. Look for update sets installed on some date or updates where the update set they came over in was installed on a particular date. The following screenshots provide an example of sorting update set changes by install date. It is also worth looking at recent changes in the default update set if case a change was made directly to the production system. Change Requests: If a ServiceNow employee makes a change to a customer system, it should be recorded in a change request. The absence of a change request should not be interpreted as a guarantee that no change occurred, we are generally good about documenting changes in this record. Transaction volume: A sudden spike in users can explain degradation in performance. Performance graphs and transaction logs are the places to refer to in order to identify trends of transaction volume.