False MID Server Down events caused by a combination of scheduler worker backlog and AMB channel down

Issue

A MID Server can be wrongly marked as Down, which fires a mid_server.down event and notification. A few seconds later, the MID Server is marked Up again.

Cause

This can happen when the following three conditions coincide:

1. The MID Server AMB channel is not up, so the MID Server is polling the instance for new jobs every 40 seconds.
2. A backlog of sys_trigger jobs on the instance's scheduler workers has prevented the "MID Server Monitor" job from running at its normal 5 minute interval.
3. The delayed MID Server Monitor job runs less than 40 seconds before the next normally scheduled run time.

At the end of the backlog period, when the long-delayed MID Server Monitor job finally runs, it creates a new Heartbeat probe. Because the MID Server is polling only every 40 seconds while the AMB channel is down, it can take up to 40 seconds for the MID Server to execute the probe and send the response back to the instance.

Within that window of less than 40 seconds, the next MID Server Monitor job may run on its normal schedule, before the response has arrived, and it is this second run that marks the MID Server as Down. As soon as the Heartbeat probe sent by that second job is executed and its sensor runs, the MID Server is marked Up again, so it is likely to show as Down for only a few seconds.
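The timing described above can be sketched as a small simulation. This is a simplified model built only from the description in this article, not the actual MID Server Monitor logic: it assumes the MID Server polls at a fixed 40 second cadence, picks up the Heartbeat probe on its first poll at or after the probe is created, and answers it instantly. The function and parameter names (`false_down`, `poll_phase_s`, and so on) are hypothetical.

```python
import math

POLL_INTERVAL_S = 40  # polling interval while the AMB channel is down (from the article)


def next_poll_at_or_after(t_s: int, poll_phase_s: int, poll_interval_s: int = POLL_INTERVAL_S) -> int:
    """Return the time of the first poll at or after t_s.

    Polls are assumed to happen at poll_phase_s, poll_phase_s + 40, ... seconds.
    """
    polls_needed = math.ceil((t_s - poll_phase_s) / poll_interval_s)
    return poll_phase_s + polls_needed * poll_interval_s


def false_down(delayed_run_s: int, next_scheduled_run_s: int, poll_phase_s: int) -> bool:
    """Return True if the scheduled monitor run fires before the Heartbeat
    probe created by the delayed run has been picked up and answered."""
    if next_scheduled_run_s <= delayed_run_s:
        return False  # outside this model: the scheduled run is not after the delayed one
    # Assumption: the response reaches the instance at the moment of the poll.
    answered_at_s = next_poll_at_or_after(delayed_run_s, poll_phase_s)
    return next_scheduled_run_s < answered_at_s
```

For example, with polls at 30, 70, ..., 270, 310 seconds, a delayed run at 290 seconds followed by a scheduled run at 300 seconds gives `false_down(290, 300, 30) == True`, because the probe is not answered until 310 seconds. A delayed run at 200 seconds is answered at 230 seconds, so the scheduled run at 300 seconds sees a healthy MID Server.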

Resolution

The MID Server Monitor job runs at priority 100, so any other jobs with a priority value of 100 or lower that are added to the sys_trigger queue between its 5 minute runs can delay it.

To avoid this, the cause of the instance backlog needs to be identified and resolved.

In one case, a backlog of an hour was caused by the third-party "JDBC File Loader" app. Its "ASYNC: JDBCFileLoaderSensor" jobs also run at priority 100, and large numbers of them were regularly being added to the queue. The solution in that case was to change the priority of the JDBCFileLoaderSensor business rule to 110, so that its jobs no longer block critical jobs such as MID Server Monitor.
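The effect of that priority change can be illustrated with a toy queue. This sketch assumes that a single worker picks due jobs by lowest priority value first and then by earliest next-action time. That ordering is an assumption made for illustration, consistent with the fix described above, and the real scheduler has multiple workers and other rules. The names `run_order` and `monitor_position` are hypothetical.

```python
import heapq
from typing import List, Tuple

Job = Tuple[str, int, int]  # (name, priority value, next action time in seconds)


def run_order(jobs: List[Job]) -> List[str]:
    """Return job names in the order a single worker would pick them,
    assuming every job is already due."""
    # The index breaks ties so that equal jobs keep their insertion order.
    heap = [(priority, next_action, index, name)
            for index, (name, priority, next_action) in enumerate(jobs)]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[3])
    return order


def monitor_position(flood_priority: int, flood_size: int = 1000) -> int:
    """Return how many flood jobs run before MID Server Monitor (priority 100)."""
    jobs: List[Job] = [
        (f"ASYNC: JDBCFileLoaderSensor #{i}", flood_priority, i)
        for i in range(flood_size)
    ]
    # The monitor job becomes due after the flood jobs were queued.
    jobs.append(("MID Server Monitor", 100, flood_size))
    return run_order(jobs).index("MID Server Monitor")
```

Under these assumptions, `monitor_position(100)` returns 1000 (the monitor waits behind the whole flood), while `monitor_position(110)` returns 0 (the monitor runs first).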

As well as causing the false Down event, the backlog can mask a real MID Server outage. The MID Server's status is not monitored while the MID Server Monitor job is delayed, so a MID Server could go down without the instance knowing.
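One way to reason about these unmonitored periods is to look for gaps between consecutive MID Server Monitor runs that exceed the normal 5 minute interval. The sketch below is a minimal illustration that assumes the run times have already been collected as a sorted list of seconds. How they are obtained from the instance is not covered here, and the function name and the `tolerance_s` parameter are hypothetical.

```python
from typing import List, Tuple


def monitoring_gaps(run_times_s: List[int],
                    expected_interval_s: int = 300,
                    tolerance_s: int = 30) -> List[Tuple[int, int]]:
    """Return (start, end) pairs between consecutive monitor runs where the
    gap was longer than the expected interval plus a tolerance."""
    gaps = []
    for previous, current in zip(run_times_s, run_times_s[1:]):
        if current - previous > expected_interval_s + tolerance_s:
            gaps.append((previous, current))
    return gaps
```

With run times of 0, 300, 600, 4200 and 4240 seconds, `monitoring_gaps` returns `[(600, 4200)]`: one hour during which a real outage would not have been detected, followed by two runs only 40 seconds apart, which is the pattern that can produce the false Down event.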