CMDB Health Correctness Orphan jobs can cause instance outage/syslog size issues due to infinite looping when there is some problem with the TPP structure of the CMDB and records cannot be queried


Description

CMDB Health Orphan jobs can cause instance outage/database size issues due to infinite looping when there is some problem with the TPP structure of the CMDB and records cannot be queried

The "CMDB Health Dashboard - Correctness Score Calculation" launches child scheduled job "ASYNC: Script Job", which executes SNC.MetricProcessorScript.orphanProcessor(0). If for any reason CI records can't be queried, the following symptoms occur:

This has been seen to be caused by:

Steps to Reproduce

Alternatively try and cause PRB1406482, where a table is added badly, however the steps to do that on a clean instance are not known.

The app node logs, and the syslog table, will show a lot of this error:

SEVERE *** ERROR *** CMDBHealth : Failed to find orphans by condition for class $s with error $s

Logging will only stop once the job gets to "Maximum per transaction log statements (200000) reached. Suppressing further logging".

When viewing the running "ASYNC: Script Job" processes in "All Active Transactions", the DB query will indicate the table or records causing the problem. In this case ESX Server class, and a sys_id 59b6dbfadbc65f003b929334ca96195f, which when opened as /cmdb_ci_esx_server.do?sys_id=59b6dbfadbc65f003b929334ca96195f gave "record not found" because the sys_id was missing from "cmdb" but present in "cmdb$par1"

SELECT cmdb_par10.`sys_id` FROM ((cmdb$par1 cmdb_par10 LEFT JOIN cmdb cmdb1 ON cmdb_par10.`a_ref_8` = cmdb1.`sys_id` ) LEFT JOIN cmdb$par1 cmdb_par12 ON cmdb1.`sys_id` = cmdb_par12.`sys_id` ) WHERE cmdb_par10.`sys_class_path` LIKE '/!!/!D/!!/!$/!2/!#/!!%' AND cmdb_par10.`sys_class_name` = 'cmdb_ci_esx_server' AND cmdb_par10.`operational_status` != 9 AND cmdb_par10.`operational_status` != 9 AND cmdb_par12.`a_ref_8` IS NULL ORDER BY cmdb_par10.`sys_id` limit 0,100000 /* <node>023, gs:glide.scheduler.worker.2, tx:a25560b51b58341051de3f05464bcb88 */

SELECT cmdb0.`a_str_21` AS `firewall_status`, cmdb0.`a_int_2` AS `os_address_width`, cmdb_par10.`attested_date`, cmdb0.`opera ... sys_id` = cmdb_par10.`sys_id` ) WHERE cmdb0.`sys_class_path` LIKE '/!!/!D/!!/!$/!2/!#/!!%' AND cmdb0.`sys_id` = '59b6dbfadbc65f003b929334ca96195f' /* <node>023, gs:glide.scheduler.worker.2, tx:a25560b51b58341051de3f05464bcb88 */

Where PRB1406482 was involved, the queries showed the table name with the problem, and a missing field error (which was a false error):

2021-05-26 01:17:47 (008) worker.1 worker.1 txid=0a28bce21b94 SEVERE *** ERROR *** CMDBHealth : Failed to find orphans by condition for class $s with error $s
2021-05-26 01:17:47 (011) worker.1 worker.1 txid=0a28bce21b94 SEVERE *** ERROR *** FAILED TRYING TO EXECUTE ON CONNECTION glide.2 (connpid=339176): SELECT cmdb0.`sys_id` FROM cmdb cmdb0 WHERE cmdb0.`u_active` = 1 AND cmdb0.`u_orphan` = 1 ORDER BY cmdb0.`sys_id` limit 0,100000 /* ininhosteddev014, gs:glide.scheduler.worker.1, tx:0a28bce21b9434d4ab0898ac0a4bcb7a */
Syntax Error or Access Rule Violation detected by database ((conn=339176) Unknown column 'cmdb0.u_orphan' in 'where clause')

com.glide.db.GlideSQLException: FAILED TRYING TO EXECUTE ON CONNECTION glide.2 (connpid=339176): SELECT cmdb0.`sys_id` FROM cmdb cmdb0 WHERE cmdb0.`u_active` = 1 AND cmdb0.`u_orphan` = 1 ORDER BY cmdb0.`sys_id` limit 0,100000 /* ininhosteddev014, gs:glide.scheduler.worker.1, tx:0a28bce21b9434d4ab0898ac0a4bcb7a */
Syntax Error or Access Rule Violation detected by database ((conn=339176) Unknown column 'cmdb0.u_orphan' in 'where clause')

at com.glide.db.DBIError.wrapAndThrow(DBIError.java:153)
at com.glide.db.DBIError.handleException(DBIError.java:86)
at com.glide.db.DBI.executeStatement(DBI.java:1107)
at com.glide.db.DBQueryExecutor.executeStatement(DBQueryExecutor.java:217)
at com.glide.db.DBQueryExecutor.executeStatement(DBQueryExecutor.java:207)
at com.glide.db.DBQueryExecutor.execute(DBQueryExecutor.java:54)
at com.glide.db.DBQuery.executeAsResultSet0(DBQuery.java:353)
at com.glide.db.DBQuery.executeAndReturnTable(DBQuery.java:330)
at com.glide.db.DBAction.executeNormal(DBAction.java:256)
at com.glide.db.DBAction.executeAndReturnException(DBAction.java:210)
at com.glide.db.RDBMSQueryContext.executeQuery(RDBMSQueryContext.java:42)
at com.glide.db.DBQuery.execute(DBQuery.java:2702)
at com.snc.cmdb.health.OrphanProcessorUtil.getOrphansByCondition(OrphanProcessorUtil.java:153)
at com.snc.cmdb.health.OrphanProcessorUtil.getAllOrphan(OrphanProcessorUtil.java:786)
at com.snc.cmdb.health.OrphanProcessor.processRecords(OrphanProcessor.java:45)
at com.snc.cmdb.health.MetricProcessor.process(MetricProcessor.java:106)
at com.snc.cmdb.health.MetricProcessorScript.execute(MetricProcessorScript.java:35)
at com.snc.cmdb.health.MetricProcessorScript.jsStaticFunction_orphanProcessor(MetricProcessorScript.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.mozilla.javascript.MemberBox.invoke(MemberBox.java:138)
at org.mozilla.javascript.FunctionObject.doInvoke(FunctionObject.java:670)
at org.mozilla.javascript.FunctionObject.call(FunctionObject.java:614)
at org.mozilla.javascript.ScriptRuntime.doCall(ScriptRuntime.java:2609)
at org.mozilla.javascript.optimizer.OptRuntime.call1(OptRuntime.java:32)
at org.mozilla.javascript.gen.sys_trigger_135336e61bc0b450ab0898ac0a4bcbff_733._c_script_0(sys_trigger.135336e61bc0b450ab0898ac0a4bcbff:1)
...

In these examples, the fields mentioned in those queries happen to be custom fields, but that fact is not relevant. They were involved because they were part of the condition filter of the CMDB Orphan Rule that had been correctly defined.

Workaround

This problem is currently under review. You can contact ServiceNow Technical Support or subscribe to this Known Error article by clicking the Subscribe button at the top right of this article to be notified when more information becomes available.

Note: This problem is not the root cause, except in that the CMDB jobs don't handle corrupted tables/records well. Another known problem or cause of the corrupted tables or data in the CMDB needs identifying and linking with support cases. e.g. PRB1406482

The workaround is to:

If no corruption is found, and the job does progress with different queries and sys_ids appearing in those queries, the jobs may be long running due to the size of the CMDB, where particular classes may have >100k records. If the instance is Paris Patch 4 or earlier, please discount he following first:
PRB1424208 The orphanProcessor job does not limit the number of records returned from the getOrphansByCondition or getOrphansByRelationRules functions causing nodes to run out of memory


Related Problem: PRB1501230