How the ECC Queue works - from output ready, to input processed


Details

This article explains how ECC Queue (ecc_queue) table records get processed in general, and the code involved in that process. It deliberately does not go into detail about what code or features create the jobs, what the jobs actually are, or exactly how the results are then used.

1. Job is added to Output Queue

A feature or application that wants to run an integration via the ECC Queue inserts a record into the ecc_queue table. A description of the fields from the Discovery point of view can be found in the MID Server ECC Queue documentation. More generally, the fields mean the following (a minimal insert sketch follows the field descriptions):

Agent

What will run the job. If the Agent value starts with "mid.server." then it's a job for a MID Server, unless it is "mid.server.NODE_AGENT", which means the instance itself. e.g. the NODE_AGENT value is used by RESTProbe when the instance is doing the request directly.

If Agent is "mid.server.*", then the "Probe - All MIDs" Business Rule (BR) will insert copies of the record specific to every individual MID Server (regardless of their state). This is normally only used for system commands where all MID Servers need re-syncing with a change in the instance.

Queue

Output is the Job, Input is the result, from the point of view of the instance.

State

Ready means waiting/queued. Processed or Error means the job is finished. In between it is Processing, which doesn't mean the job has actually started to run yet, because it may still be in the MID Server's internal queue.
Topic

The 'Probe' that is to run for the job. This equates to the name of a Java object in the source code (not to be confused with the 'Discovery Probe' name).

Name/Source

Specific to the Topic, but usually the main probe parameters.

Payload

Additional parameters for the Probe, in XML format. The ECC output parameters are also included in the ECC input, together with the result and errors.

Agent Correlator

A reference back to the code/task/schedule that created the job, so that the sensor can pass the results back to it.

Priority

2=Standard (default), 1=Expedited, 0=Interactive. Almost everything uses Standard. Only system commands, or things where users are waiting for a response after clicking something, use higher priorities.

Sequence

The queue order. It is a combination of the Unix epoch timestamp in milliseconds and a counter, in case there are multiple records within the same millisecond.

Response To

For Inputs, this references the corresponding Output.

Processed

The time the State changed from Ready to Processing. For Outputs, it is when the queue.processing Input was inserted. For Inputs it is set by the sensor, and not all sensors set it.
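
Taken together, a minimal server-side sketch of queuing a job could look like the following. This is illustrative only; the topic, name, source and payload values here are assumptions for the example, not a recipe for any particular feature:

// Queue a job for a MID Server by inserting an Output record (illustrative values)
var job = new GlideRecord('ecc_queue');
job.initialize();
job.agent = 'mid.server.my_mid_server';   // which MID Server should run it
job.topic = 'SSHCommand';                 // the Probe (Java object) to run
job.name = 'ls -l';                       // main probe parameter, Topic-specific
job.source = '10.0.0.1';                  // Topic-specific, e.g. the target host
job.queue = 'output';                     // a job, not a result
job.state = 'ready';                      // waiting to be picked up
job.payload = '<parameters><parameter name="skip_sensor" value="true"/></parameters>';
var jobSysId = job.insert();              // Priority defaults to 2 (Standard)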

2. MID Server (or instance code) picks up the job

Note: These MID Server specific steps are not going to be involved if the Instance is running the jobs itself.

The MID Server needs to be Up and Validated in order to pick up jobs.

If the MID Server AMB Channel is working, the MID Server knows there is something new in the queue for it instantly. If not, the periodic ECCQueueMonitor thread runs to check.

MID Server queries the instance ecc_queue table using the normal SOAP Table API, via the app node's API_INT semaphores. It looks for any output records with itself as the Agent, in Ready state. Only a certain number are fetched at a time, proportional to the MID Server thread count. On MID Server startup, it also queries for Processing state outputs, on the assumption that the MID Server may have died while processing them and needs to run them again.

The oldest records with the highest priority are fetched first.
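
As a sketch, the fetch is roughly equivalent to the following server-side query. This is an approximation: the MID Server actually does this via the SOAP Table API, and the real batch size and ordering logic live in the MID Server code:

// Approximation of the MID Server's job-fetch query (illustrative MID name)
var out = new GlideRecord('ecc_queue');
out.addQuery('agent', 'mid.server.my_mid_server');
out.addQuery('queue', 'output');
out.addQuery('state', 'ready');
out.orderBy('priority');   // 0=Interactive before 1=Expedited before 2=Standard
out.orderBy('sequence');   // oldest first within each priority
out.setLimit(20);          // actual limit is proportional to the MID thread count
out.query();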

KB0743566 MID Server Max Threads, Worker Groups, Priority and Queues - Has more on the ECC Sender folders, the priority of sys_trigger jobs, and how many outputs are fetched at a time.

3. MID Server notifies the Instance that it has picked up the job

MID Server writes a topic=queue.processing input to the ecc_queue to tell the instance what it has picked up. The payload looks like:

<queue.processing output_message_count="1"><sys_id state="processing">2401af2c1b8b1010254542e7cc4bcb5b</sys_id></queue.processing>

This may include multiple sys_ids, because the MID Server queries for new jobs in batches. That input triggers sensor Business Rule "ECC Queue - mark outputs state" to update those output records from Ready to Processing state. 
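
A simplified sketch of what that business rule effectively does, assuming current is the queue.processing Input record (the OOTB script differs in detail):

// Mark each Output listed in the payload as Processing
var doc = new XMLDocument2();
doc.parseXML(current.getValue('payload'));
var it = doc.getDocumentElement().getChildNodeIterator(); // the <sys_id> elements
while (it.hasNext()) {
    var node = it.next();
    var out = new GlideRecord('ecc_queue');
    if (out.get(node.getTextContent())) {
        out.setValue('state', node.getAttribute('state')); // "processing"
        out.update();
    }
}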


4. MID Server executes the job

In the MID Server, the fetched job is held in an in-memory queue until there is a free Worker Thread (the thread pool used depends on the priority); it is then passed to the Java probe code specific to the Topic and run in that thread.

Given that, an ecc_queue output in Processing state doesn't mean the job is actually running. It just means it is now in the MID Server's internal queue and will run at some point. The agent log entries include the sys_id of the ECC output record, confirming when it did run. The output record's created, processed and updated timestamps therefore give an idea of the wait time and processing time, but are not accurate because of that internal queue wait time.
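
For example, a rough timing breakdown for one Output record can be sketched like this (the sys_id is illustrative; remember the interval after pickup includes the MID Server's internal queue wait):

var out = new GlideRecord('ecc_queue');
if (out.get('2401af2c1b8b1010254542e7cc4bcb5b')) {  // illustrative sys_id
    var created   = new GlideDateTime(out.getValue('sys_created_on'));
    var processed = new GlideDateTime(out.getValue('processed'));
    var updated   = new GlideDateTime(out.getValue('sys_updated_on'));
    gs.info('Wait before pickup: ' + GlideDateTime.subtract(created, processed).getDisplayValue());
    gs.info('Pickup to last update: ' + GlideDateTime.subtract(processed, updated).getDisplayValue());
}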

While running the job, the code may make additional calls to the instance APIs using AMB/REST/SOAP directly, which may not go via the ECC Queue. For example, IntegrationHub outputs simply reference a Flow Designer context, and the Probe then needs to query that context in the instance to know what actual steps it is to execute.

The probe will log to the MID Server's agent/logs/agent0.log.0 file as it runs. Adding Probe Parameters and MID Server parameters will cause additional debug to be written to the log. Each feature's probe is likely to have different probe parameters and levels of logging, and may require more than just setting the mid.log.level MID Server parameter to debug.

5. The probe result is written to the ECC Sender queue

The thread running the job writes the result of running the probe to an XML file in one of the ECC Sender output folders in agent\work\monitors\ECCSender. There are several folders, depending on the priority or whether the job is sequential. These folders act as a send queue, and there are limits.

6. The MID Server returns the result to the instance

The ECCSender thread inserts an ecc_queue input for each XML file, then deletes the XML file. The highest priority files are sent first, oldest first within each priority, until all the folders are empty.

The SOAP Table API is used for the insert, just like any other inbound integration inserting into a table via a SOAP message, and will run in an API_INT semaphore. The instance Transaction Log will show it as:

/ecc_queue.do?redirectSupported=true&SOAP&displayvalue=all

The "ECC Queue - mark outputs processed" Business rule runs for each Input record inserted, which sets the Output record state to Processed.

KB0743566 MID Server Max Threads, Worker Groups, Priority and Queues - Has more on the ECC Sender folders, the priority of sys_trigger jobs, and how many outputs are fetched at a time.

7. Sensors run in the instance

Business rules run for each input record insert. They usually run synchronously, as before or after insert business rules, as part of the API_INT semaphore SOAP transaction. One (or more) of these business rules will be what we call the "Sensor" for the job.

They run according to the Order field, and are skipped based on conditions, like any other business rule on record insert. Some will run condition checks to see if the input is for them, and if not then they leave the record as is, allowing the next BR in the order to have a go.

The following list groups the ecc_queue business rules by Package, helping you see which acts as the sensor for any particular job. The Condition field can usually be used to easily see which jobs a sensor is for.
https://<instance name>.service-now.com/sys_script_list.do?sysparm_query=collection%3Decc_queue%5Eactive%3Dtrue%5EGROUPBYsys_package

Most sensor business rules pass the work on to a scheduled job by inserting a 'run once' sys_trigger record, which the Background Scheduler Worker threads will then pick up and run to do the actual sensor processing. For Discovery, this inherits the priority from the ECC Queue record. If a business rule spends too much time processing as part of the SOAP insert, it risks blocking the API_INT semaphores. That's bad, as there are only a few per app node, and there may be many MID Servers sharing those. It is the job of any specific sensor to deal with scheduler worker priority, and each feature/app may do it differently.
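
A hedged sketch of that pattern is below: do a cheap condition check synchronously, then hand the heavy work to a 'run once' sys_trigger so the API_INT semaphore is released quickly. The topic, script include and field values are assumptions for illustration; compare with a sys_trigger created by an OOTB async business rule on your instance before relying on the exact fields:

if (current.getValue('topic') == 'MyProbe') {       // hypothetical topic
    var trig = new GlideRecord('sys_trigger');
    trig.initialize();
    trig.name = 'ASYNC: MyProbe sensor for ' + current.getUniqueValue();
    trig.trigger_type = 0;                          // run once
    trig.next_action = new GlideDateTime();         // as soon as a worker is free
    trig.priority = current.getValue('priority');   // inherit the job priority
    trig.job_id.setDisplayValue('RunScriptJob');    // sys_job that runs the script field
    trig.script = 'new MySensorProcessor().process("' + current.getUniqueValue() + '");'; // hypothetical script include
    trig.insert();
}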

The specific sensor code will usually set the input to Processing state when it starts working on it, and to Error or Processed when it ends. The sensor will usually populate the Error string field on Error. Each feature/app may do this differently in their sensors.

The payload field may end up in an attachment, which is done by the instance's SOAP Table API (see glide.soapprocessor.large_field_patch_max). It is the responsibility of each individual sensor to then deal with that. The attachment needs to be smaller than the system property com.glide.attachment.max_get_size. Discovery aims to keep everything below 5MB, but other features may not.
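
A sensor that has to handle both cases might use something like this sketch (the assumption that the attachment is named "payload" is mine; verify against a real large Input on your instance):

// Return the full payload, whether inline or moved to an attachment
function getFullPayload(input) {
    var att = new GlideRecord('sys_attachment');
    att.addQuery('table_name', 'ecc_queue');
    att.addQuery('table_sys_id', input.getUniqueValue());
    att.addQuery('file_name', 'payload');  // assumed attachment name
    att.query();
    if (att.next())
        return new GlideSysAttachment().getContent(att);
    return input.getValue('payload');
}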

Don't always expect one input per output. Some features split large results into multiple input records. JDBCProbe for import set data sources splits the result into separate ecc_queue inputs, one per 200 rows of data, and only continues with the transform once all inputs have arrived back in the ECC Queue. This probe uses the output_s ECCSender folder, so the inputs arrive back at the instance sequentially. Discovery patterns can use multi-page inputs, where each data sub-set input is usually processed as it comes in, without waiting for the full set of data.

There may not be a sensor for the input at all. Those inputs remain in Ready state. e.g. RESTProbe outputs created by RESTMessageV2 that execute synchronously, where result/error handling is done by the same script that created the job, often won't need a sensor BR. Generally there should always be a sensor, in order to at least do error handling and set the input to Processed, to avoid the false impression of a queue backlog.

The "Discovery - Sensors" BR will usually run for the input, even if it has nothing to do with discovery. This all goes back to when MID Server was part of Discovery and only used by Discovery. So long as the output parameters includes skip_sensor=true, then that's fine, and the discovery sensor know to skip it. Every non-Discovery job should include that, but many don't, including some OOTB features.  If the input doesn't have that parameter, the discovery sensor will try to process it, and may cause an Error. e.g. RESTProbe inputs without skip_sensor will get set as state=Error and Error string="No sensors defined". That parameter is in the payload, which may be huge, or even an attachment, and the discovery sensor has to process that to read the parameter, which in itself can take a while or use a lot of memory. When this sensor runs for non-Discovery jobs, you can tell because the app node logs will have "Get for non-existent record: discovery_status:..., Initializing" because the Agent Correlator field value isn't for one of these records.

The sensor will usually use the Agent Correlator value to make sure the results end up back with the specific job/record/workflow/Discovery status that created the job. Orchestration values all start with "rba." and IntegrationHub with "ihub.", for example.
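
For example, a sensor or polling script might locate its results like this (the correlator value is illustrative):

var res = new GlideRecord('ecc_queue');
res.addQuery('queue', 'input');
res.addQuery('agent_correlator', 'ihub.0123456789abcdef');  // illustrative value
res.query();
while (res.next()) {
    // hand res.payload to the feature's own result handling
}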

Examples

Example 1 - Heartbeat Probe

The output is created by a sys_trigger job, and the sensor is a BR. It is a slightly bad example, in that the sensor runs everything synchronously rather than offloading the actual processing to a sys_trigger job.
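
A minimal sketch of queuing such a job, assuming the standard HeartbeatProbe topic (the MID Server name is illustrative):

var hb = new GlideRecord('ecc_queue');
hb.initialize();
hb.agent = 'mid.server.my_mid_server';  // illustrative MID name
hb.topic = 'HeartbeatProbe';            // assumed standard heartbeat topic
hb.queue = 'output';
hb.state = 'ready';
hb.insert();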

Example 2 - Synchronous RESTMessageV2

Unusual in that it executes synchronously in the instance, and waits for the response, literally sleeping and blocking the thread in the process.

In this case the initial thread sleeps, and regularly polls the ECC Queue. The stack trace of the thread looks like:

main,glide.scheduler.worker.4,4,<thread name - often a business rule or scheduled job name> (82899 ms)
java.lang.Thread.sleep(Native Method)
com.glide.ecc.ECCResponsePoller.poll(ECCResponsePoller.java:50)
com.glide.rest.outbound.ecc.ECCRESTResponse.fetchAndProcessEccResponse(ECCRESTResponse.java:232)
com.glide.rest.outbound.ecc.ECCRESTResponse.getBody(ECCRESTResponse.java:124)
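
The pattern that produces that stack trace can be reproduced with a script like this (endpoint and MID Server name are illustrative). execute() blocks the calling thread, polling the ECC Queue until the Input arrives; executeAsync() with waitForResponse() is the documented alternative:

var rm = new sn_ws.RESTMessageV2();
rm.setHttpMethod('get');
rm.setEndpoint('https://example.com/api');  // illustrative endpoint
rm.setMIDServer('my_mid_server');           // forces the request via the ECC Queue
var resp = rm.execute();                    // sleeps in ECCResponsePoller until the result returns
gs.info(resp.getStatusCode() + ': ' + resp.getBody());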

For more information on this example, and why it is potentially rather bad, see:
KB0727028 Why is State "Error" and "No sensors defined" in the ECC Queue for outbound REST/SOAP Message via MID Server? - Expands on the skip_sensor parameter.
KB0694711 Outbound REST Web Services RESTMessageV2 and SOAPMessageV2 execute() vs executeAsync() Best Practices 
KB0716391 Best practices for RESTMessageV2 and SOAPMessageV2
PRB1305586 RESTMessageV2/SOAPMessageV2 APIs allow running via MID Servers Synchronously, without a Sensor ECC business rule, causing blocked instance threads while sleeping, and Scheduler overload
KB0749289 Synchronous Outbound Web Service Calls Timing Out After 30 seconds Following Upgrade

Example 3 - Discovery Sensors

For more information on this example, see:
KB0718653 ECC Queue Processing and debugging, with "Discovery - Sensors" used as an example - Has details on how specifically Discovery sensors process to explain the Discovery example in more detail.

Debugging

The above should help you understand at what stage in the process you have a problem. If you know the 'where', you are halfway to solving it. The following KBs will help:

KB0718589 Why are my MID Server-related Jobs stuck and ECC Queue inputs still in Ready State? - Suggests possible causes of Sensor processing delay, and gives troubleshooting tips and solutions.
KB0727132 How to link an ECC Queue record back to a specific Feature or Job - Lists the Topic/Name/Source/Agent Correlator values used by many different features, and what the values mean.