CMDB Baseline creation job causing app node out of memory when a CI has a huge number of records referencing it - Known Error

Description

The CMDB Baseline creation job can drive an app node out of memory when a CI has a huge number of records referencing it. The referencing records might be Tasks, or records in any other table with a reference field to Configuration Item. When baseline creation reaches a CI with that many relations, the app node goes OOM and restarts. After the restart the same job usually runs again, and the cycle repeats on every restart.

The symptom is very poor performance for any users logged into that app node, followed by a restart that may cause any other transactions running at the time to fail. The new baseline being created will not finish. Because the job re-runs many times, there will be duplicate cmdb_baseline_entry records for the same CI sys_ids.
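The duplicate entries can give a rough idea of how often the job has re-run. The following is a minimal sketch, assuming the cmdb_baseline_entry rows have already been exported to plain objects. The function name and the property names `baseline` and `ci` are hypothetical placeholders for whichever columns hold the baseline and CI sys_ids on your instance.

```javascript
// Count how often each (baseline, CI) pair occurs in a list of exported
// cmdb_baseline_entry rows and return the pairs seen more than once.
// The property names "baseline" and "ci" are assumptions, not confirmed
// column names.
function findDuplicateEntries(rows) {
    var counts = {};
    for (var i = 0; i < rows.length; i++) {
        var key = rows[i].baseline + '|' + rows[i].ci;
        counts[key] = (counts[key] || 0) + 1;
    }
    var duplicates = [];
    for (var k in counts) {
        if (Object.prototype.hasOwnProperty.call(counts, k) && counts[k] > 1) {
            var parts = k.split('|');
            duplicates.push({ baseline: parts[0], ci: parts[1], count: counts[k] });
        }
    }
    return duplicates;
}
```

A CI that appears several times under the same baseline suggests the job restarted that many times before reaching the problem CI.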

Steps to Reproduce

1. Create a lot of task records (e.g. 1 million), all referencing the same Configuration Item.
2. Create a new CMDB baseline for that CI. You could add just that CI sys_id in the baseline filter before saving. Saving the cmdb_baseline record triggers the "SNC Create Baseline" business rule, which creates an "ASYNC: Script Job" scheduled job using SNC.CMDBUtil.baselineSchedule for the new record.
3. Watch memory usage.
4. Take a heap dump, and you will see a huge amount of memory used for the related task data.

One example seen was for 3 million tasks referring to a single application CI. That caused >1.8GB memory to be used by the job. In another, lots of "conflict" table records were referencing the same CI.
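Before creating a baseline, it may help to check which CIs are referenced by an unusually large number of records. The sketch below counts rows per CI with GlideAggregate. The function name, the threshold, and the constructor parameter (present only so the function can be exercised outside an instance) are assumptions. Other tables with a reference field to Configuration Item would need the same check with their own table and field names.

```javascript
// Return the CIs referenced by more than `threshold` rows of `table`
// through the reference field `field`.
// AggregateCtor is expected to behave like ServiceNow's GlideAggregate.
function findHeavilyReferencedCIs(AggregateCtor, table, field, threshold) {
    var ga = new AggregateCtor(table);
    ga.addNotNullQuery(field);   // ignore rows with no CI set
    ga.addAggregate('COUNT');    // count rows in each group
    ga.groupBy(field);           // one group per referenced CI
    ga.query();

    var result = [];
    while (ga.next()) {
        var count = parseInt(ga.getAggregate('COUNT'), 10);
        if (count > threshold) {
            result.push({ ci: String(ga.getValue(field)), count: count });
        }
    }
    return result;
}
```

In a background script this could be called as `findHeavilyReferencedCIs(GlideAggregate, 'task', 'cmdb_ci', 100000)`. Grouping a very large table is itself an expensive query, so it is probably best run off-peak.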

The job name is just "ASYNC: Script Job", but it can be identified as CMDB Baseline creation from the job context script, which contains "new SncBaselineCMDB" and ".create()":

2022-10-07 03:37:08 (997) worker.7 worker.7 txid=70694383db92 *** Start Background transaction - system, user: system
2022-10-07 03:37:09 (004) worker.7 worker.7 txid=70694383db92 Starting: ASYNC: Script Job. , Trigger Type: Once, Priority: 100, Upgrade Safe: false, Repeat:
2022-10-07 03:37:09 (004) worker.7 worker.7 txid=70694383db92 Name: ASYNC: Script Job
Job Context:
#Fri Sep 30 18:30:00 PDT 2022
Script: var base = new SncBaselineCMDB(" "); base.create();base.notify(" ");
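Because the job restarts after every OOM, the same script shows up repeatedly in the node log. The sketch below scans log text for the "new SncBaselineCMDB(" marker and reports the most recent timestamp seen before each occurrence. The function name is hypothetical, and the timestamp pattern is inferred only from the excerpt above.

```javascript
// Return the log timestamp that most recently preceded each occurrence of
// the baseline creation script in the given node log text.
function findBaselineJobStarts(logText) {
    // Alternative 1 captures "YYYY-MM-DD HH:MM:SS" followed by "(mmm)";
    // alternative 2 matches the script marker from the job context.
    var pattern = /(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\)|new SncBaselineCMDB\(/g;
    var lastTimestamp = null;
    var starts = [];
    var match;
    while ((match = pattern.exec(logText)) !== null) {
        if (match[1]) {
            lastTimestamp = match[1];
        } else {
            starts.push(lastTimestamp);
        }
    }
    return starts;
}
```

Several start times spaced roughly one restart apart are consistent with the re-run loop described in this article.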

The stack trace at the time of high memory usage will look something like:

glide.scheduler.worker.7
  at java.util.Arrays.copyOfRange([BII)[B (Arrays.java:4030)
  at java.lang.StringCoding.decodeUTF8([BIIZ)Ljava/lang/StringCoding$Result; (StringCoding.java:723)
  at java.lang.StringCoding.decode(Ljava/nio/charset/Charset;[BII)Ljava/lang/StringCoding$Result; (StringCoding.java:257)
  at java.lang.String.<init>([BIILjava/nio/charset/Charset;)V (String.java:507)
  at org.mariadb.jdbc.internal.com.read.resultset.rowprotocol.TextRowProtocol.getInternalString(Lorg/mariadb/jdbc/internal/com/read/resultset/ColumnInformation;Ljava/util/Calendar;Ljava/util/TimeZone;)Ljava/lang/String; (TextRowProtocol.java:250)
  at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.getString(I)Ljava/lang/String; (SelectResultSet.java:1010)
  at com.glide.db.meta.StorageUtils.getString(Ljava/sql/ResultSetMetaData;ILjava/sql/ResultSet;)Ljava/lang/String; (StorageUtils.java:147)
  at com.glide.db.meta.StorageUtils.getObject(Ljava/sql/ResultSetMetaData;ILjava/sql/ResultSet;)Ljava/lang/Object; (StorageUtils.java:58)
  at com.glide.db.meta.RowStorage.<init>(Lcom/glide/db/PositionMap;Ljava/sql/ResultSet;)V (RowStorage.java:40)
  at com.glide.db.meta.RowFactory.store(Ljava/sql/ResultSet;Lcom/glide/db/PositionMap;)Lcom/glide/db/meta/IRow; (RowFactory.java:62)
  at com.glide.data.access.internal.CachedTable.query()V (CachedTable.java:105)
  at com.snc.cmdb.BaselineCMDB.getRelatedRecords(Lcom/glide/script/GlideRecord;Ljava/lang/String;)Lcom/glide/data/access/ITable; (BaselineCMDB.java:347)
  at com.snc.cmdb.BaselineCMDB.getRelation(Ljava/util/HashMap;Ljava/lang/String;Lcom/glide/script/GlideRecord;Z)V (BaselineCMDB.java:581)
  at com.snc.cmdb.BaselineCMDB.getRelations(Lcom/glide/script/GlideRecord;Z)Ljava/util/HashMap; (BaselineCMDB.java:565)
  at com.snc.cmdb.BaselineCMDB.addRelations(Lcom/glide/script/GlideRecord;)V (BaselineCMDB.java:279)
  at com.snc.cmdb.BaselineCMDB.setBaseLine(Lcom/glide/script/GlideRecord;)V (BaselineCMDB.java:265)
  at com.snc.cmdb.BaselineCMDB.processCIs()V (BaselineCMDB.java:190)
  at com.snc.cmdb.BaselineCMDB.create()V (BaselineCMDB.java:152)
  ...

At the point of the OOM restart, the app node wrapper log will show something like:

INFO | jvm 24 | 2022/10/07 00:40:06.603 | #
INFO | jvm 24 | 2022/10/07 00:40:06.604 | # java.lang.OutOfMemoryError: Java heap space
INFO | jvm 24 | 2022/10/07 00:40:06.604 | # -XX:OnOutOfMemoryError="../scripts/kill_jvm_only.sh"
INFO | jvm 24 | 2022/10/07 00:40:06.604 | # Executing /bin/sh -c "../scripts/kill_jvm_only.sh"...
ERROR | wrapper | 2022/10/07 00:40:51.457 | JVM exited unexpectedly.
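To see how often the node has hit this condition, the wrapper log can be scanned for the OutOfMemoryError line. This is a small sketch with a hypothetical function name. The expected line layout is taken only from the excerpt above, so the pattern may need adjusting for other wrapper configurations.

```javascript
// Return the timestamps of wrapper log lines that report
// "java.lang.OutOfMemoryError".
function findOomEvents(wrapperLog) {
    // Captures "YYYY/MM/DD HH:MM:SS.mmm" immediately before the "| #" column
    // that carries the JVM's OutOfMemoryError message.
    var pattern = /(\d{4}\/\d{2}\/\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s*\|\s*#\s*java\.lang\.OutOfMemoryError/g;
    var events = [];
    var match;
    while ((match = pattern.exec(wrapperLog)) !== null) {
        events.push(match[1]);
    }
    return events;
}
```

The `-XX:OnOutOfMemoryError=...` line is deliberately not counted, so each OOM event is reported once.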

Workaround

This problem is currently under review. You can contact ServiceNow Technical Support or subscribe to this Known Error article by selecting Subscribe in the options at the top of this article to be notified when more information becomes available.

To avoid the problem, identify the CI that has the very large number of referencing records, and exclude that CI from the filter of the cmdb_baseline record.
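One way to express the exclusion is as an extra condition on the CI sys_id. The sketch below builds such a condition as an encoded query string. It assumes the baseline filter is stored as an encoded query on the CI table, which should be verified on your instance, and the function name is hypothetical.

```javascript
// Append a "sys_id is not one of ..." condition to an existing encoded query.
// Assumes the existing filter is a simple AND chain: a filter containing
// ^OR or ^NQ would need the exclusion added to each branch by hand.
function appendCiExclusion(existingFilter, ciSysIds) {
    if (!ciSysIds || ciSysIds.length === 0) {
        return existingFilter || '';
    }
    var clause = 'sys_idNOT IN' + ciSysIds.join(',');
    return existingFilter ? existingFilter + '^' + clause : clause;
}
```

The same condition can usually be built in the condition builder on the form instead, which avoids editing the query string directly.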

From a heap dump, it is possible to identify the table and key field values from the RowStorage objects, within the huge arrays of these objects that take up most of the memory. If you can find an example of one of those records, you can see which CI it links to.

The cmdb_baseline record may have been inserted manually from a form, but it is also common to have a scheduled job that inserts a cmdb_baseline record periodically to create new baselines. Search the script field of the sysauto_script table for "cmdb_baseline" to find it. Deactivating this job, and killing any existing "ASYNC: Script Job" transactions that show "com.snc.cmdb.BaselineCMDB.getRelatedRecords" in their stack trace, will provide relief.
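The search of sysauto_script described above can be sketched as a GlideRecord query. The function name and the constructor parameter (present only so the function can be exercised outside an instance) are assumptions. The function only lists matching active jobs and deactivates nothing, so each result can be reviewed first.

```javascript
// List active scheduled script jobs whose script mentions "cmdb_baseline".
// RecordCtor is expected to behave like ServiceNow's GlideRecord.
function findBaselineSchedules(RecordCtor) {
    var gr = new RecordCtor('sysauto_script');
    gr.addActiveQuery();                                 // active=true only
    gr.addQuery('script', 'CONTAINS', 'cmdb_baseline');  // script text match
    gr.query();

    var jobs = [];
    while (gr.next()) {
        jobs.push({
            sys_id: String(gr.getUniqueValue()),
            name: String(gr.getValue('name'))
        });
    }
    return jobs;
}
```

In a background script this could be called as `findBaselineSchedules(GlideRecord)`.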