CMDB Health Correctness Orphan jobs can cause instance outage/syslog size issues due to infinite looping when there is some problem with the TPP structure of the CMDB and records cannot be queried - Known Error

CMDB Health Correctness Orphan jobs can cause instance outage/syslog size issues due to infinite looping when there is some problem with the TPP structure of the CMDB and records cannot be queried Description CMDB Health Orphan jobs can cause instance outage/database size issues due to infinite looping when there is some problem with the TPP structure of the CMDB and records cannot be queried

The "CMDB Health Dashboard - Correctness Score Calculation" launches child scheduled job "ASYNC: Script Job", which executes SNC.MetricProcessorScript.orphanProcessor(0). If for any reason CI records can't be queried, the following symptoms occur:

The " ASYNC: Script Job " job runs forever, and each day another is added. These require killing manually to free up the app node scheduler workers. The logs are spammed with " CMDBHealth : Failed to find orphans by condition for class $s with error $s ". These log tables require truncation to avoid log query performance and database disk space issues. Eventually app node becomes unusable due to scheduler worker exhaustion, or database size, and instance outage occurs. This has been seen to be caused by:

Orphan records in the CMDB where the sys_id is missing from the "cmdb" or "cmdb$par1" TPP partitions of the CMDB. This has been caused by demo data removal. Wrongly created CMDB tables due to PRB1406482, where the storage table value in sys_storage_alias is the class name instead of one of the partition tables. This is usually a table added by CMDB CI Class Models or Discovery and Service Mapping Patterns. Steps to Reproduce

Add an Orphan Rule for a CI class in CI Class Manager Manually break a CI record for that class by deleting the sys_id from cmdb, but leaving it in cmdb$par1 Run the "CMDB Health Dashboard - Correctness Score Calculation" scheduled job Alternatively try and cause PRB1406482, where a table is added badly, however the steps to do that on a clean instance are not known.

The app node logs, and the syslog table, will show a lot of this error:

SEVERE *** ERROR *** CMDBHealth : Failed to find orphans by condition for class $s with error $s Logging will only stop once the job gets to "Maximum per transaction log statements (200000) reached. Suppressing further logging".

When viewing the running "ASYNC: Script Job" processes in "All Active Transactions", the DB query will indicate the table or records causing the problem. In this case ESX Server class, and a sys_id 59b6dbfadbc65f003b929334ca96195f, which when opened as /cmdb_ci_esx_server.do?sys_id=59b6dbfadbc65f003b929334ca96195f gave "record not found" because the sys_id was missing from "cmdb" but present in "cmdb$par1"

SELECT cmdb_par10.`sys_id` FROM ((cmdb$par1 cmdb_par10 LEFT JOIN cmdb cmdb1 ON cmdb_par10.`a_ref_8` = cmdb1.`sys_id` ) LEFT JOIN cmdb$par1 cmdb_par12 ON cmdb1.`sys_id` = cmdb_par12.`sys_id` ) WHERE cmdb_par10.`sys_class_path` LIKE '/!!/!D/!!/!$/!2/!#/!!%' AND cmdb_par10.`sys_class_name` = 'cmdb_ci_esx_server' AND cmdb_par10.`operational_status` != 9 AND cmdb_par10.`operational_status` != 9 AND cmdb_par12.`a_ref_8` IS NULL ORDER BY cmdb_par10.`sys_id` limit 0,100000 /* 023, gs:glide.scheduler.worker.2, tx:a25560b51b58341051de3f05464bcb88 */

SELECT cmdb0.`a_str_21` AS `firewall_status`, cmdb0.`a_int_2` AS `os_address_width`, cmdb_par10.`attested_date`, cmdb0.`opera ... sys_id` = cmdb_par10.`sys_id` ) WHERE cmdb0.`sys_class_path` LIKE '/!!/!D/!!/!$/!2/!#/!!%' AND cmdb0.`sys_id` = '59b6dbfadbc65f003b929334ca96195f' /* 023, gs:glide.scheduler.worker.2, tx:a25560b51b58341051de3f05464bcb88 */ Where PRB1406482 was involved, the queries showed the table name with the problem, and a missing field error (which was a false error):

2021-05-26 01:17:47 (008) worker.1 worker.1 txid=0a28bce21b94 SEVERE *** ERROR *** CMDBHealth : Failed to find orphans by condition for class $s with error $s 2021-05-26 01:17:47 (011) worker.1 worker.1 txid=0a28bce21b94 SEVERE *** ERROR *** FAILED TRYING TO EXECUTE ON CONNECTION glide.2 (connpid=339176): SELECT cmdb0.`sys_id` FROM cmdb cmdb0 WHERE cmdb0.`u_active` = 1 AND cmdb0.`u_orphan` = 1 ORDER BY cmdb0.`sys_id` limit 0,100000 /* ininhosteddev014, gs:glide.scheduler.worker.1, tx:0a28bce21b9434d4ab0898ac0a4bcb7a */ Syntax Error or Access Rule Violation detected by database ((conn=339176) Unknown column 'cmdb0.u_orphan' in 'where clause')

com.glide.db.GlideSQLException: FAILED TRYING TO EXECUTE ON CONNECTION glide.2 (connpid=339176): SELECT cmdb0.`sys_id` FROM cmdb cmdb0 WHERE cmdb0.`u_active` = 1 AND cmdb0.`u_orphan` = 1 ORDER BY cmdb0.`sys_id` limit 0,100000 /* ininhosteddev014, gs:glide.scheduler.worker.1, tx:0a28bce21b9434d4ab0898ac0a4bcb7a */ Syntax Error or Access Rule Violation detected by database ((conn=339176) Unknown column 'cmdb0.u_orphan' in 'where clause')

at com.glide.db.DBIError.wrapAndThrow(DBIError.java:153) at com.glide.db.DBIError.handleException(DBIError.java:86) at com.glide.db.DBI.executeStatement(DBI.java:1107) at com.glide.db.DBQueryExecutor.executeStatement(DBQueryExecutor.java:217) at com.glide.db.DBQueryExecutor.executeStatement(DBQueryExecutor.java:207) at com.glide.db.DBQueryExecutor.execute(DBQueryExecutor.java:54) at com.glide.db.DBQuery.executeAsResultSet0(DBQuery.java:353) at com.glide.db.DBQuery.executeAndReturnTable(DBQuery.java:330) at com.glide.db.DBAction.executeNormal(DBAction.java:256) at com.glide.db.DBAction.executeAndReturnException(DBAction.java:210) at com.glide.db.RDBMSQueryContext.executeQuery(RDBMSQueryContext.java:42) at com.glide.db.DBQuery.execute(DBQuery.java:2702) at com.snc.cmdb.health.OrphanProcessorUtil.getOrphansByCondition(OrphanProcessorUtil.java:153) at com.snc.cmdb.health.OrphanProcessorUtil.getAllOrphan(OrphanProcessorUtil.java:786) at com.snc.cmdb.health.OrphanProcessor.processRecords(OrphanProcessor.java:45) at com.snc.cmdb.health.MetricProcessor.process(MetricProcessor.java:106) at com.snc.cmdb.health.MetricProcessorScript.execute(MetricProcessorScript.java:35) at com.snc.cmdb.health.MetricProcessorScript.jsStaticFunction_orphanProcessor(MetricProcessorScript.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:138) at org.mozilla.javascript.FunctionObject.doInvoke(FunctionObject.java:670) at org.mozilla.javascript.FunctionObject.call(FunctionObject.java:614) at org.mozilla.javascript.ScriptRuntime.doCall(ScriptRuntime.java:2609) at org.mozilla.javascript.optimizer.OptRuntime.call1(OptRuntime.java:32) at org.mozilla.javascript.gen.sys_trigger_135336e61bc0b450ab0898ac0a4bcbff_733._c_script_0(sys_trigger.135336e61bc0b450ab0898ac0a4bcbff:1) ...

In these examples, the fields mentioned in those queries happen to be custom fields, but that fact is not relevant. They were involved because they were part of the condition filter of the CMDB Orphan Rule that had been correctly defined.

Workaround This problem is currently under review. You can contact ServiceNow Technical Support or subscribe to this Known Error article by clicking the Subscribe button at the top right of this article to be notified when more information becomes available.

Note: This problem is not the root cause, except in that the CMDB jobs don't handle corrupted tables/records well . Another known problem or cause of the corrupted tables or data in the CMDB needs identifying and linking with support cases. e.g. PRB1406482

The workaround is to:

Kill the stuck "ASYNC: Script Job" worker threads on each app node of the instance, where it has been confirmed from the stack trace or via the running sys_trigger records that these are for the orphanProcessor funtion. Deactivate the "CMDB Health Dashboard - Correctness Score Calculation" job via the scheduled job tab of the "CMDB View" dashboard. Identify and fix the underlying table/record corruption. This is likely to require a Support Case with ServiceNow Technical Support . From the current database queries being done by the stuck job, identify the table or record sys_id currently being run. You will find the exact same queries running hours later, showing we are stuck on that table/record. Identify if three are any TPP related issues with the table or record, checking sys_db_object and sys_storage_alias are as they should be, and match what is actually in the SQL level tables/columns If the 'storage table' name in sys_storage_alias for any field on the table are not cmdb, cmdb$par1, ..par2, then PRB1406482 is probably the cause: PRB1406482 During CMDB CI Class Models and Discovery and Service Mapping Patterns application installation/update, some tables under TPP hierarchy are created in TPC structure From the SQL level, check that the number of records in cmdb equals the number of records in cmdb$par1. Any sys_ids that only exist in one partition will probably need deleting. Then once the tables are fixed, re-activate the "CMDB Health Dashboard - Correctness Score Calculation" job If no corruption is found, and the job does progress with different queries and sys_ids appearing in those queries, the jobs may be long running due to the size of the CMDB, where particular classes may have >100k records. If the instance is Paris Patch 4 or earlier, please discount he following first: PRB1424208 The orphanProcessor job does not limit the number of records returned from the getOrphansByCondition or getOrphansByRelationRules functions causing nodes to run out of memory