Best Practices for usage of the Agent Correlator field of the ECC Queue

Table of Contents
- Introduction
- Why this needs to be done better
- Why not just use Topic
- Why prefix the sys_id
- Would changing the business rule Order help
- Supportability
- A good example - Orchestration
- The worst example - Discovery
- Everything else

Disclaimer: These are the thoughts and views of David Piper, a Support Engineer. This is not 'official' (yet?), but aims to get the ball rolling towards a more formal, centralised approach to ECC Queue processing. We need to understand the problem before a solution can be designed.

Introduction

The ECC Queue and MID Servers are shared platform resources, used by many different applications and custom integrations. A Sensor is a business rule that reacts to the insert of an ECC Queue Input (and sometimes Output) to process the data/result. Sensors are feature and probe specific, often for a specific record or process in the instance. There is now a big challenge when designing a Sensor business rule's conditions to make sure:

1. It has a condition that means it runs ONLY for its own ECC Queue records.
2. It DOES NOT RUN for any other feature's records.

Those are two different things, but a well-designed system of Agent Correlator values solves 1, and helps prevent 2.
If all features did 1, then 2 wouldn't be a problem. But not all features do.

Why this needs to be done better

It should not be necessary to parse the XML in the Payload field to answer either of the above points until the payload is actually processed, whether on a feature's own records to confirm it owns them, or on others' to confirm it doesn't. That is bad because parsing many megabytes of XML payload takes CPU time and a lot of app node memory.

If only a bare sys_id is put in the Agent Correlator field, with no positive confirmation of the feature or table it refers to, then a query against the expected table has to be done to confirm whether that sys_id is a record in that table, before the sensor knows whether it should process the ECC Queue record. That is bad because those queries could be against large or slow tables. It means code from one feature is impacting the performance of another feature, just to work out that an ECC Queue record is not its record.

The other main problem with this lack of standardization and consistency is that the wrong sensors do get run against other features' records. A wrong sensor can run first, and by updating the state could prevent the proper sensor from running, causing data loss in the process. The classic examples are the Discovery sensors running for a non-Discovery input, and setting it to Error state with "No sensors defined". Or that same Discovery sensor using a lot of time and app node memory parsing a massive XML payload to see if it is for Discovery, only for it to be an unrelated REST Message response from an integration. Most features have had to have Problems opened to add code to protect their records from Discovery's sensors, when ideally every feature should know which records are its own, immediately, from the simple field values, without having to do any costly parsing or database queries, and then not touch anyone else's records.
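As a sketch of what that positive identification could look like: assuming a feature writes a correlator of the form prefix plus sys_id (the "myapp." prefix below is hypothetical, not a real platform value), a sensor condition can confirm ownership from the field value alone.

```javascript
// Sketch only: "myapp." is a hypothetical feature prefix.
// In a sensor business rule condition this would be called with
// current.agent_correlator, giving positive confirmation with no
// payload parsing and no database query.
function isMyFeaturesRecord(agentCorrelator) {
    var prefix = 'myapp.';
    if (!agentCorrelator || agentCorrelator.indexOf(prefix) !== 0)
        return false;
    // Whatever follows the prefix must look like a 32-character sys_id.
    var sysId = agentCorrelator.substring(prefix.length);
    return /^[0-9a-f]{32}$/.test(sysId);
}
```

A bare sys_id, or another feature's prefix, fails the check immediately, so the sensor never has to touch the payload or the database to decide.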
Why not just use Topic

The Topic field is often used in sensor business rule conditions to link the ECC Queue input record back to a specific feature. However, that only works when:

1. That Probe/Topic is only used by that feature.
2. That Probe/Topic is the only probe used by that feature.

When the same probe/topic is used by many features, or a probe initially written for e.g. Discovery is then re-used by another feature, as is the case with the Powershell and SSHCommand probes, Topic is not going to give you positive confirmation of which feature the record is for.

However, Topic can be useful for features that don't have any easy positive confirmation that a record is for them, and instead have to work out whether the record is definitely not for them. They can exclude all records with a list of Topics known to be specific to other features, but that is obviously a list that keeps growing and needs constant maintenance of the sensor conditions.

The following are some common features which definitely do have their own Probe, and a main common Sensor business rule specifically for that probe.

MonitoringProbe

This is only used by Agent Client Collector. There is a common business rule "AgentNowResponseProcessor", which passes the payload on to the script includes specific to the higher-level feature, using the Check Type value included in the payload.

Note: That general idea could make a good basis for a future Sensor Framework in the platform, where probe types and features are registered in a central table, which allows positively identifying the probe/feature, and defines which code runs as the sensor.

IPaaSActionProbe

This is only used by Integration Hub Flows, which has a set of common business rules. These don't use any agent correlator prefix or other clues in the main ecc_queue record fields. They rely on Topic, and so any future feature re-using these probes may cause the existing sensors problems.
Adding a breadcrumb to Agent Correlator as well would future-proof these, and also help other features' probes ignore agent correlator values that are not simply 32-character sys_ids, or their own prefix. That would also mean a reduction in the chance of someone else's feature breaking yours.

See KB0727132 How to link an ECC Queue record back to a specific Feature or Job for other Topic values that you could exclude in your sensor conditions, to make sure your sensor is not running for those other features.

Why prefix the sys_id

A sys_id will usually refer to a record, but it doesn't tell us anything about which table it is in, or which feature that table is from. The sensor business rule needs to know that, without having to guess or double-check. There isn't really any other field to use except Agent Correlator. Concatenating multiple values into one field is simple, and equally simple to split up again with a couple of lines of javascript in a sensor business rule condition. The Name/Source fields are usually used for input to the probe, often relating to the endpoint URL or IP address, or the source record in the instance. More detail could be put in the Payload XML, but parsing that is costly to performance, as we know from Discovery's sensor searching payloads for the skip_sensor parameter.

For more details on each of the ECC Queue fields, and their purpose, see: KB0855595 How the ECC Queue table records get processed: from output ready to input processed

Would changing the business rule Order help

Yes, if you can get your sensor to run first, before others can interfere. If a good Sensor can run before a bad sensor, and can set the Status to Processing (or Processed/Error) during the insert, before other sensor business rules run, then those other business rules would usually skip the record because it is no longer in Ready state.
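The negative, Topic-based exclusion described under "Why not just use Topic" might be sketched like this. The list here is illustrative and deliberately incomplete; keeping it current is exactly the maintenance burden noted earlier.

```javascript
// Sketch only: an illustrative, incomplete list of Topics known to belong
// to other features. A real list keeps growing and needs maintenance.
var TOPICS_OWNED_BY_OTHERS = ['HeartbeatProbe', 'MonitoringProbe', 'IPaaSActionProbe'];

// Negative confirmation only: true means "not known to be someone else's",
// it does NOT mean "definitely ours".
function topicNotExcluded(topic) {
    return TOPICS_OWNED_BY_OTHERS.indexOf(topic) === -1;
}
```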
However, the actual input is usually not processed as part of the ecc_queue input insert transaction, as that would block the few API_INT semaphore threads available in the instance, and could potentially block or break other integrations, and even block MID Servers from connecting to the instance. So setting the status on insert is only possible if processing the input is really quick and there are few of them, or if the state is set to Processed by the insert business rule before moving the actual processing of the sensor and data to a scheduler worker thread (like Discovery), or via a system event (like DEX).

The Discovery - Sensors business rule runs at the default order of 100, and perhaps shouldn't. Running your sensor at order 99 would allow you to get to the record first.

Supportability

The lack of a unique value that lets customers and support engineers easily link a feature's tasks/records to the specific ECC Queue records for that task is a problem in itself. If the correct ECC Queue output record for a feature/integration can't be found quickly, or at all, amongst all the other similar records around the same time, then the sys_id of the output cannot be used to find the logs of the thread in the MID Server that actually executed the probe, which may be the only place the full errors can be seen. This slows down the debugging process when things are going wrong.

e.g. This problem exists where the only identifying information is hidden in a base64-encoded parameter in the payload field, making it impossible to search the ECC Queue for the records related to a specific Check URL:
PRB1950919 Synthetic Monitoring checks via MID Server locations don't set a value in the Agent Correlator of ecc_queue records

Logging in the instance can also help with this. A good example would be the Discovery Logs, where the log records reference the ecc_queue output record directly.
The sensor scheduled jobs include the ecc_queue record sys_id in the job name, and it is also logged to the app node localhost logs as the sensor runs. This makes it simple to tie the instance-side code/logs/jobs to that specific thing running on the MID Server side logs.

A good example - Orchestration

Orchestration activities in Legacy Workflows have a good system whereby the agent correlator value is the prefix "rba." (for Runbook Automation) followed by the sys_id of the Workflow Context that the ECC Queue record relates to. Records that are not for Orchestration will never have an "rba." prefix in the agent correlator, so the 'Automation - Sensors' business rule will never run for them.

The worst example - Discovery

Firstly, the Discovery feature is not completely at fault here. It designed the MID Server, for Discovery. It wrote most of the main Probes. Then other apps and features came along and used it too, and re-used a lot of those Probes. The solution that allowed other features to use Discovery's MID Server was the skip_sensor parameter. However, that is buried within the Payload field's XML, which could be huge, depending on the feature. Loading, parsing, and extracting the skip_sensor parameter from that takes a lot of CPU and memory in the app node, delaying the feature whose job it actually is. The sensor's condition script is able to filter some records out based on Topic name, or where prefixes in the agent_correlator are known, but usually has to create a scheduled job to run the sensor anyway to take a closer look.

The Discovery - Sensors business rule has a condition script that calls a discovery function in script include 'AutomationEccSensorConditions', which uses a function '_commonSkipConditions' that lists some Topic values that are definitely not Discovery; however, it is not a long list.
(Zurich has just: HeartbeatProbe, config.file, SystemCommand, ConnectorProbe, IPaaSActionProbe, MonitoringProbe, Syslog, DataInputMarkerUpdate, DataInputConnectorPortCheckProbe, DataInputExamplesProbe)

That function also:
- checks if the record is already Processed
- checks it is not a MID Server related record
- checks it is not a queue.stats/queue.processing input
- checks it is not for a MID Extension
- excludes anything with an Agent Correlator starting with Orchestration's and IntegrationHub's prefixes
- decrypts the whole Payload field if necessary
- updates Discovery Status counts
- queries the ecc_queue output record that this input is in response to
- double-checks it is a Ready state input

So even before the script of the business rule runs, a lot has happened, for records that are probably not even for Discovery. But by this point it may still be unable to know if a record is for Discovery or not, so it executes the business rule script, which schedules an 'ASYNC: Discovery - Sensors' job in sys_trigger anyway.

That scheduled job, set to run once, now, calls the script include DiscoverySensorJob's runSensor function, which runs the SensorProcessor Java class:
- When SensorProcessor initializes, it queries the discovery_status table to see if the agent_correlator value is a discovery_status record sys_id. If it is, then that is confirmation this is a Discovery input.
- It then gets the Payload, either from the field or from a large (>500k, up to ~50MB) attachment, decrypts it if necessary, and looks for a skip_sensor=true parameter. If that is found, it knows the record is NOT for Discovery.
- If it gets past all of that, it then starts to run the actual Discovery sensor code.

From that you can see a lot of processing might have to be done just to work out that a job is NOT for Discovery, and it involves executing a scheduled job to do that.
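The sequence above shows the cost problem: cheap field checks and expensive payload work are mixed together. A minimal sketch of the ordering a sensor condition should aim for, using hypothetical record and helper names (the payload parse is injected so that it only runs as a last resort):

```javascript
// Sketch only: 'record' stands in for an ecc_queue GlideRecord, and
// parsePayloadForSkipSensor stands in for the expensive XML parse.
function shouldRunSensor(record, myPrefix, excludedTopics, parsePayloadForSkipSensor) {
    // 1. Cheapest: only Ready inputs should be picked up at all.
    if (record.queue !== 'input' || record.state !== 'ready')
        return false;
    // 2. Cheap: exclude Topics known to belong to other features.
    if (excludedTopics.indexOf(record.topic) !== -1)
        return false;
    // 3. Cheap: positive confirmation from an agent_correlator prefix.
    if (record.agent_correlator && record.agent_correlator.indexOf(myPrefix) === 0)
        return true;
    // 4. Expensive last resort: parse the (possibly huge) payload.
    return !parsePayloadForSkipSensor(record.payload);
}
```

With a good correlator scheme, step 3 answers the question and step 4 never runs.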
When ACC launches a Discovery Pattern, it uses Discovery's PatternLauncher, like a serverless pattern, but doesn't populate the Agent Correlator, because there is no Discovery Status record; it was triggered from process classifiers after an ACC Enhanced Discovery check instead. Even within Discovery it's not consistent.

And often this still doesn't get it right. That's why I call this the worst example. The following known problems had to be opened for non-Discovery features, so that the skip_sensor parameter could be added to prevent Discovery sensors running for their records. In a fair world, they should never have needed to do that.

- PRB1918478 Employee Document Management: EmployeeFileImport/StreamPipeline probes fail to include the skip_sensor=true parameter in the output payload, causing "No sensors defined" errors
- PRB1855384 Continuous Integration/Continuous Delivery (CI/CD) API: 'Discovery - Sensors' BR may process the ecc_queue record before the 'Source Control Response' BR
- PRB1821920 Source Control Engine: ApplyChangesVCSHandler terminates the process following an ecc_queue error
- PRB1579551 Software Asset Management: SAMSoapHandler fails to prevent Discovery sensors running for its ecc_queue inputs, by missing the skip_sensor=true RESTMessagev2/SOAPMessagev2 request parameter (Error: No Sensors Defined)
- PRB1507976 Security Incident Response integration with McAfee ePO: 'No sensors defined' for McAfee EPO
- PRB1507964 Security Incident Response: 'No sensors defined' for the Splunk Sighting configuration tile
- PRB1434384 System Export Sets: 'No sensors defined' error for the export set ExportSetResult ECC Queue input records
- PRB1379065 Service Mapping: Service Mapping sysauto_script "Update Query Based Services" spawns RESTProbes which all get "No sensors defined" errors
- PRB1418114 Extrahop: We see an entry with state "Error" in the ECC Queue for a 'GET' call. Upon clicking the record we see Error Message: "No sensors defined".
- ...and many more

Other problems have had to be opened with Discovery to tighten up these conditions:

- PRB1456619 The "No Sensors Defined" error and Status=Error in ECC Queue inputs is misleading, leading to confusion and delay
- PRB1113671 Discovery - Sensors business rule runs for non-Discovery ecc_queue inputs
- ...and there will be more as topic names are added to the exclude list

Everything else

Most features put a sys_id value in, but just that. No indication is given of which table or feature it refers to. Where a Probe/Topic is only used by a single feature, sensor conditions usually rely entirely on that. REST/SOAP sensors tend to rely on the URL in the Name field in the condition first, then use the Agent Correlator to link to specific records.

How most features populate the Agent Correlator field is documented here: KB0727132 How to link an ECC Queue record back to a specific Feature or Job
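The REST/SOAP sensor pattern of matching the Name field first, then using the Agent Correlator, might be sketched like this; the endpoint URL here is hypothetical.

```javascript
// Sketch only: hypothetical endpoint URL. REST/SOAP response sensors often
// match the request URL in the Name field first, then use the Agent
// Correlator to link back to the specific source record.
function matchesMyIntegration(name, agentCorrelator) {
    if (!name || name.indexOf('https://api.example.com/') !== 0)
        return false;
    // The correlator should carry the sys_id of the source record.
    return /^[0-9a-f]{32}$/.test(agentCorrelator || '');
}
```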