MID Server are wrongly set as Down if the instance app node clocks/time/NTP are slightly out of sync


MID Server are repeatedly set as Down if the Heartbeat probes respond almost instantly, and the instance app nodes clocks are slightly out of sync.

The clue is a HeartbeatProbe ecc_queue output with a timestamp in the future compared to the ecc_queue input that is subsequently created. e.g. An input record sys_created_on 1s earlier than the output record has been seen.

5 minutes after that happens, when the 'MID Server Monitor' scheduled job runs again, the code is confused by that and the MID Server is marked as Down, although only momentarily because the 'MID - Heartbeat' sensors for this round of heartbeat probes will mark it back up again soon after. This flapping causes many false alert notifications.

The sequence on events:
1/ "MID Server Monitor" scheduled job runs, sets this app nodes current time in 'mid.monitor.heartbeat_sent' sys_status record, and sends heartbeat probes
2/ "MID - Heartbeat" sensor runs and updates ecc_agent.last_refreshed. This process is running on an app node that has a clock relatively slow compared to the app node that ran 1/.
Note: At this point sys_status value is 1 second after ecc_agent.last_refreshed, due to minor clock differences of app nodes, that have been rounded to the nearest second.
3/ 5 minutes later the "MID Server Monitor" scheduled job runs again. It compares sys_status value and ecc_agent.last_refreshed and sets MID Servers Down.

If sys_status is identical to ecc_agent.last_refreshed, or if sys_status less than ecc_agent.last_refreshed, then the MID Stays up. Only if sys_status greater than ecc_agent.last_refreshed is the MID marked down.

Steps to Reproduce

We've seen this with on-premise instances, and hosted instances due to PRB1370286.

This is caused by the node that runs the 5 minute scheduled job being slightly ahead of the node that processes the ecc_queue inputs from that's job's heartbeat probes. It's therefore hard to force to happen on properly set up servers.

This screenshot shows log statement output from the scheduled job and sensor. A sensor "apparently" runs before the heartbeat probe was sent. This is impossible, which means the timestamps are lying. An app node's clock must be wrong. App nodes use the clock of the server they are running on, so an app node's clock must be wrong relative to another app node.

If the MID Server took no time to run the heartbeat probe (which is unreal, but easier to explain), the unix epoch timestamps suggest the app node that ran the sensor's clock is running >816ms early.

5 minutes later, "last_refreshed" < "sent", so the MID is marked Down.

616ms later it is marked back Up again by the heartbeart sensor, but by then mid_server.down events and alert notifications have gone out.


This problem is currently under review. You can contact ServiceNow Technical Support or subscribe to this Known Error article by clicking the Subscribe button at the top right of this form to be notified when more information will become available.

Server clocks are not perfect, and time servers do need to keep them synchronised. Make sure your app nodes are synced to the same time source, and update regularly.

VM migration can also cause clocks to get out of sync, and so if the MID Server is running as a VM then be sure to re-synchronise the clocks.

If this is still an issue then if would be possible to make a modification to the 'MID - Heartbeat' sensor business rule:

Change this line: 

gr.setValue('last_refreshed', gs.nowNoTZ()); 


// The app node running this may think 'now' is actually before the time the output for this probe was sent. That's probably due to app node clock synchronisation. 
// This workaround uses the later(future) timestamp from the output, instead of 'now' 
// This will avoid the MID Server being marked Down in 5 minutes time by the next run of the 'MID Server Monitor' job. 
if (current.sys_created_on < current.response_to.sys_created_on) { 
gr.setValue('last_refreshed', current.response_to.sys_created_on); // use created timestamp of output 

else { 
gr.setValue('last_refreshed', gs.nowNoTZ()); // original line of code 


Related Problem: PRB1354777