Best Practices for Designing Resilient Custom Scheduled Jobs

Overview

This article describes the architectural patterns and best practices required to build resiliency into custom ServiceNow scheduled jobs. It addresses the platform's behavior during unexpected node restarts and provides strategies for ensuring long-running scripts can resume progress rather than restarting from the beginning.

Purpose

The purpose of this article is to enable ServiceNow developers and architects to design "re-entrant" scripts. As a result of using this article, readers will be able to implement checkpointing mechanisms and batching logic that protect data integrity and save processing time when a background transaction is interrupted.

Scope

The scope of this article is limited to the ServiceNow platform, specifically targeting server-side scripting within Scheduled Jobs and long-running background scripts. It applies to all current ServiceNow releases.

Understanding Platform Behavior during Interruptions

Node Restart Impact

In a ServiceNow environment, scheduled jobs are executed by worker threads on specific nodes. If a node restarts due to an error, maintenance, or an unexpected failure, any active transaction on that node is immediately terminated.
The platform does not natively track the internal state of a custom script. Consequently:

- Transaction Termination: The script execution stops instantly.
- State Reset: The job record state is moved from Running back to Ready.
- Redundancy: When the job is picked up again (on the same or a different node), it executes from line one of the script, unaware of any work completed prior to the restart.

Job State Before Restart | Behavior After Restart
Running | State reset to Ready; the script starts over from the beginning.
Queued | Remains in Ready state; no impact on eventual execution.

Strategic Resiliency Patterns

Implementation Flexibility

It is important to note that the strategies outlined below are high-level architectural patterns. Because every business use case involves different data volumes and performance requirements, there is no single "correct" way to implement these features. Different users may choose to store state in different ways, such as custom logging tables, system properties, or scratchpad variables, depending on their specific organizational standards and the complexity of the task.

1. The Checkpoint Pattern

To allow a job to "pick up where it left off," the script must store its progress in the database (which persists across node restarts).

- Mechanism: Periodically save a unique identifier (such as a sys_id or a sequence number) of the last successfully processed record.
- Recovery: At the start of the script, query this stored value and modify the GlideRecord query to fetch only records with an ID greater than the last processed value.

2. Batching and Chunking

Processing very large datasets in a single loop increases the risk of significant work loss. Dividing the workload into smaller batches ensures that if a failure occurs, the amount of lost progress is minimized.
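The batching and chunking idea can be sketched as a helper that processes at most one bounded chunk per invocation and reports whether more work remains, so the caller can persist the checkpoint and run the next chunk as a fresh transaction. This is an illustrative sketch, not a drop-in implementation; the table name u_custom_table matches the example later in this article, and the chunk size is an assumption to tune for your data volume.

```javascript
// Sketch of the chunking pattern: process one bounded chunk of records
// newer than lastId, then return the new checkpoint plus a flag telling
// the caller whether another chunk likely remains.
function processChunk(lastId, chunkSize) {
    var gr = new GlideRecord('u_custom_table');
    gr.addQuery('sys_id', '>', lastId);
    gr.orderBy('sys_id');
    gr.setLimit(chunkSize);
    gr.query();

    var count = 0;
    var newLastId = lastId;
    while (gr.next()) {
        // --- business logic for one record goes here ---
        newLastId = gr.getUniqueValue();
        count++;
    }

    return {
        lastId: newLastId,
        // A full chunk suggests more records may remain; a short chunk means done
        moreWork: count === chunkSize
    };
}
```

If moreWork is true, the job can save lastId to its persistent store and either loop again or raise an event so the next chunk runs later, keeping any single transaction short and limiting how much progress a node restart can erase.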
3. Idempotency

A resilient script should be idempotent, meaning that if the same record is processed twice (for example, when the node fails after the record is updated but before the checkpoint is saved), the end result remains the same and does not create duplicate data or errors.

Implementation Example

The following example demonstrates a common method using a System Property to track progress. Users may adapt this to use a custom table if a more detailed audit trail is required.

Re-entrant Script Pattern

JavaScript

// Retrieve the last processed ID from a persistent location
var lastId = gs.getProperty('custom.job.last_processed_id', '0');
var processedCount = 0;
var lastProcessedId = lastId; // tracks the latest checkpoint candidate

var gr = new GlideRecord('u_custom_table');
gr.addQuery('sys_id', '>', lastId); // Only get records we haven't finished
gr.orderBy('sys_id');
gr.setLimit(1000); // Process in a manageable chunk
gr.query();

while (gr.next()) {
    try {
        // --- Insert Business Logic Here ---
        lastProcessedId = gr.getUniqueValue();
        processedCount++;

        // Update the checkpoint every 100 records
        if (processedCount % 100 === 0) {
            gs.setProperty('custom.job.last_processed_id', lastProcessedId);
        }
    } catch (e) {
        gs.error("Error at record " + gr.getUniqueValue() + ": " + e.message);
    }
}

// Final checkpoint update for the remaining records in the batch.
// (Tracking the ID in a variable is more reliable than calling
// gr.getUniqueValue() after the cursor has been exhausted.)
if (processedCount > 0) {
    gs.setProperty('custom.job.last_processed_id', lastProcessedId);
}

Key Considerations

- Resetting State: Ensure the script or a sub-process resets the "Last Processed" value once the entire job is truly finished, so that the next scheduled run starts from the beginning.
- Performance: Checkpointing too frequently (for example, on every single record) can cause database overhead. Checkpointing every 100–500 records typically provides a healthy balance between safety and performance.

Additional Resources
KB1922043 - What happens to scheduled jobs when a node restarts

Revision Log (Last updated: dd-mmm-202y)

Version | Published | Summary of Changes
1.0 | dd-mmm-yyyy | Initial version