MID Server upgrade process - What actually happens when a MID Server upgrades itself?


Description

How does the MID server upgrade process work? Knowing this will help you debug if it goes wrong, and identify exactly where it went wrong.

This article describes the process that takes place when the MID Server auto-upgrades, which should happen immediately after an instance upgrade or patch finishes.

This is based on a Windows host and Paris version. The Linux process is basically the same, but with different temp folder locations and using shell scripts instead of batch files.

MID Server Upgrade Process on a Windows host

  1. The Upgrade check is triggered in one of these ways:
    • By the StartupSequencer thread when the MID Server starts up.
    • At the end of an Instance patch or upgrade, when the instance sends a topic=SystemCommand, source=autoUpgrade job to all MID Servers that are UP at the time via the ECC Queue.
      • Other MID Server related plugins may also send Restart requests to the MID Server at this time.1.1
    • Every hour, when the MID Server's "AutoUpgrade.3600" thread runs. This check cannot be disabled.
    • If the "Upgrade MID" related link is clicked on the MID Server form.
  2. "Checking to see if MID server needs to upgrade." will be written to the MID Server agent log, and the instance is queried to find out what version the MID Server should be:
    • The MID Server knows the buildstamp of the installed 'core' and 'jre' packages from the contents of the agent\package\meta\mid-core.meta.properties and mid-jre.meta.properties files. 'core' is the MID Server application and is upgraded every time a MID Server upgrades. The 'jre' is only likely to be upgraded with each major release, so it is likely to have an older buildstamp than the core.2.3
    • It asks the instance what is Assigned using the "MIDAssignedPackages" Scripted SOAP Service. The request includes os, architecture and JVM version, and the corresponding core, jre and upgrade package names/URLs are returned:
      • The assigned packages are derived from the MID Buildstamp of the instance version in the mid.buildstamp system property. That should always match the current instance version and be updated automatically with each instance upgrade. The instance stats.do page will also report the MID Buildstamp for manual confirmation.
      • The glide.war and glide.war.assigned system properties are used to see if an instance rollback has occurred.2.1
      • If the specific instance app node that handles the SOAP request has not upgraded and restarted yet for some reason, the pre-upgrade version may still be returned.2.2
      • A MID Server's mid.pinned.version parameter will override the instance version. This is the only method that will prevent a MID Server upgrade when an instance upgrades, and is planned to be deprecated and removed from the documentation, due to the dangers of running a mismatched MID Server.2.4
      • The assigned Java JRE version is hard-coded as a variable in the script of the SOAP web service. If this already matches the JRE installed in the MID Server, then the JRE upgrade will be skipped. This means that record must be the out-of-box version from the recent instance upgrade, and reverted if necessary. This script should never be customised or it will cause it to be skipped in upgrades.
      • The Rome JRE upgrade will also be skipped if the Linux host is 32bit or if glibc version is <v2.17.
    • 'Installed' is compared to 'Assigned', with the 'Missing' packages being the difference, and what will later be downloaded and installed in step 5.
    • If the assigned version is older than the installed version, then a downgrade is attempted. In some cases this will cause problems: the older instance version may not have the newer code/APIs that the MID Server expects when it starts the downgrade process, because the MID Server is still running the newer code.

The MID Server agent log will report something like this. 'Assigned' will always include the mid-upgrade as well as the mid-core package. 'Assigned' and 'Missing' will include a mid-jre entry only if that needs to be upgraded, and in this example it didn't. An upgrade from Paris to Quebec, or Quebec to Rome, would include a JRE upgrade, but usually not for subsequent patches and hotfixes within the major version.

Current packages:
  Installed: [mid-core.quebec-12-09-2020__patch0-hotfix3-01-20-2021_01-21-2021_0905.universal.universal.zip, mid-jre.quebec-12-09-2020__patch0-hotfix1-01-04-2021_01-06-2021_1339.windows.x86-64.zip]
  Assigned: [mid-upgrade.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip, mid-core.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip]
  Missing: [mid-upgrade.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip, mid-core.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip]
Downloaded: []
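
The Installed/Assigned/Missing comparison can be sketched in a few lines of Python. This is only an illustration, not the MID Server's own code: the function names parse_packages, missing_packages and split_package_name are hypothetical, and the filename layout (package, buildstamp, OS, architecture and extension separated by dots) is inferred solely from the sample log above.

```python
import re

# Sample copied from the agent log excerpt above.
SAMPLE_LOG = """Current packages:
  Installed: [mid-core.quebec-12-09-2020__patch0-hotfix3-01-20-2021_01-21-2021_0905.universal.universal.zip, mid-jre.quebec-12-09-2020__patch0-hotfix1-01-04-2021_01-06-2021_1339.windows.x86-64.zip]
  Assigned: [mid-upgrade.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip, mid-core.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip]
  Missing: [mid-upgrade.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip, mid-core.quebec-12-09-2020__patch1-02-18-2021_03-01-2021_1225.universal.universal.zip]
Downloaded: []
"""


def parse_packages(log_text):
    """Map each label (Installed, Assigned, ...) to its list of ZIP names."""
    pattern = r"^[ \t]*(Installed|Assigned|Missing|Downloaded):[ \t]*\[(.*?)\][ \t]*$"
    return {
        label: [name.strip() for name in body.split(",") if name.strip()]
        for label, body in re.findall(pattern, log_text, flags=re.MULTILINE)
    }


def missing_packages(installed, assigned):
    """Assigned packages that are not already installed, in assigned order."""
    return [name for name in assigned if name not in installed]


def split_package_name(filename):
    """Split a package ZIP name into its parts (layout assumed from the sample)."""
    parts = filename.split(".")
    if len(parts) < 5:
        raise ValueError(f"unexpected package name: {filename}")
    return {
        "package": parts[0],
        "buildstamp": ".".join(parts[1:-3]),
        "os": parts[-3],
        "arch": parts[-2],
    }


packages = parse_packages(SAMPLE_LOG)
for name in missing_packages(packages["Installed"], packages["Assigned"]):
    print(split_package_name(name))
```

Comparing the buildstamp of the installed mid-core package with the assigned one is a quick way to confirm which version the MID Server believes it should be moving to.
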
  3. "Setting mid status to Upgrading" will be written to the MID Server agent log
    • The MID Server record will be set as Status=Upgrading3.1, and will be paused, and not take any more new jobs (except system commands).
    • The MID Server Upgrade History records in the instance will be updated.
  4. "Performing pre-upgrade validation tests" will be written to the MID Server agent log
    • For on-premise instances, isolated environments, or Regulated Market datacenter instances, it may not be possible to pass this test, requiring a workaround.4.1,4.8
    • A "mid-upgrade...preUpgradeCheck.zip" file is downloaded from https://install.service-now.com
    • Since Quebec, a certificate check for host install.service-now.com is done. This should be fine, as the certificate is usually the same as the one used for the connection to the instance.
    • If the signed preUpgradeCheck.zip file fails Verifying digital signature, the upgrade will fail.4.4
    • The contents are extracted to the TEMP folder. Since Rome, this is within agent\work. Quebec and earlier is the OS/Service User temp folder.
    • In Windows, this includes a test where a simple PowerShell script is run to check the PowerShell version and user permissions.4.2,4.3,4.7
    • Some known configurations that cause upgrades to fail are also checked, including that Application Experience is running4.6.
    • The temp files are deleted.
    • If all is well then "Pre-upgrade validation tests successful. Continuing with upgrade process" is logged. Otherwise, specific errors or non-blocking warnings will be added to the agent log and to the MID Server Issue table [ecc_agent_issue].4.5
      Note: These pre-checks will not be run if MID Server configuration parameter mid.upgrade.run_precheck=false
  5. Download the missing package ZIP files
    • mid-upgrade...zip then mid-core...zip will always need downloading, and possibly also mid-jre...zip, if the Java Runtime also needs upgrading. The specific filenames needed are listed under the "Missing: [...]" line of the Current Packages check above.
    • ZIP files are saved in the \agent\package\incoming\ folder.
    • If the ZIP file contains a META-INF folder, Signatures are checked to make sure the ZIP file is not tampered with.4.5
    • If all were logged as "Package was successfully downloaded" then we continue.
    • If instance system property mid.download.through.instance=true, then ZIP files will be downloaded via the instance, and not directly from install.service-now.com. That should now be set false by default due to causing blocked semaphores in the instance. 5.1
    • If the file is incomplete, perhaps due to a socket timeout, the file is deleted and the download retried. If the maximum number of retries is reached, or there is a problem deleting the file, this needs resolving manually.
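
The delete-and-retry behaviour for incomplete downloads can be pictured with the following minimal sketch. It is not the MID Server implementation. Everything named here is an assumption for illustration: the download_with_retry function, the caller-supplied fetch callable, the use of an expected file size to detect an incomplete file, and the default of 3 attempts (the article does not state the real retry limit).

```python
import os
import tempfile


def download_with_retry(fetch, dest, expected_size, max_retries=3):
    """Call fetch(dest) until dest has the expected size; return attempts used.

    fetch is a hypothetical callable that writes the ZIP to dest. A partial
    file is deleted before each retry. If os.remove itself fails, the error
    propagates, which mirrors the "needs resolving manually" case.
    """
    for attempt in range(1, max_retries + 1):
        try:
            fetch(dest)
        except OSError:
            # Socket timeouts are OSError subclasses in Python 3.
            pass
        if os.path.isfile(dest) and os.path.getsize(dest) == expected_size:
            return attempt
        if os.path.exists(dest):
            os.remove(dest)
    raise RuntimeError(f"gave up on {dest} after {max_retries} attempts")


# Demo: the first attempt writes a truncated file, the second a complete one.
calls = []


def flaky_fetch(path):
    calls.append(path)
    size = 5 if len(calls) == 1 else 10
    with open(path, "wb") as handle:
        handle.write(b"x" * size)


demo_dest = os.path.join(tempfile.mkdtemp(), "mid-core.example.zip")
attempts_used = download_with_retry(flaky_fetch, demo_dest, expected_size=10)
print(attempts_used)
```

When debugging a real failure, the equivalent manual action is to check agent\package\incoming\ for a leftover partial ZIP that could not be deleted.
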
  6. "Upgrading MID server" will be written to the MID Server agent log, once we have everything we need.
    • Extracts the ZIP files to the temp folder. Anti-virus/security software can block creating those temp files, breaking the upgrade6.5.
    • Prior to Rome, including an upgrade to Rome or later from an earlier version, the OS temp folder is used, unless overridden in wrapper-override.conf6.1. This is a random folder name like C:\Windows\TEMP\<random 13 digit number>-0\ .
    • From Rome, a temporary folder under agent/work is used. We have full control of permissions in that folder, and so will have fewer problems than we had with the OS shared temp folder. This is the default behaviour for any upgrade from Rome, once the instance has already been upgraded to Rome or later. mid.upgrade.use_os_temp_folder controls that behaviour, and defaults to false.
    • Since Quebec, writes MID and wrapper process IDs to agent\conf\pids for checking they are stopped in step 7.
    • "Stopping MID server. Bootstrapping upgrade." will be written to the MID Server agent log.
    • The temp folder name is written to agent\work\upgrade.info
    • A new process is started, which executes the upgrade binaries that are now in the TEMP folder.6.1 The <TEMP>\<random number>-0\<mid buildstamp>\upgrade-wrapper\bin\glide-dist-upgrade.bat file is run to do that.
    • Prior to Orlando, the Process was instead a new Windows Service named "ServiceNow Platform Distribution Upgrade (<MID Server name>)". The MID Server service needed a 'logon as' user that is a member of the local Administrators group, or it will not be able to create and start the temporary upgrade service.
    • "Setting mid status to Down" is logged at the start of the shutdown process. The wrapper log will log "Stopping the ServiceNow MID Server_xxx service..." at this point.
    • "MIDServer MID Server stopped" is logged, however that does not mean that all threads have been killed or that the JVM has stopped yet. There will still be probes that have not finished, and those still have to end, or may crash with exceptions. The last agent log entry expected is "Thread-0 Main.handleStop() after shutdown, OperationalState=UPGRADING"
    • During this time wrapper log shows several "Waiting to stop..." logs, and will continue to repeat that every 5 seconds until all running threads/probes have ended.
    • Finally a log of "<-- Wrapper Stopped" in the wrapper log shows the JVM has shut down. There should now be no files locked for the java application, or wrapper service. This will take >2 minutes longer than normal in Madrid6.2 , and if there are other stuck probes this can take 15 minutes or more6.3,7.12.
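
To find where an upgrade stopped, the agent log messages quoted so far can be checked in order. The sketch below is a hypothetical helper, not a ServiceNow tool: the name upgrade_progress and the milestone list are assumptions built only from the messages quoted in this article, and the sample log lines are simplified (real entries carry timestamps and thread names).

```python
# Agent log messages quoted in this article, in the order they are expected.
AGENT_LOG_MILESTONES = [
    "Checking to see if MID server needs to upgrade.",
    "Setting mid status to Upgrading",
    "Performing pre-upgrade validation tests",
    "Pre-upgrade validation tests successful",
    "Package was successfully downloaded",
    "Upgrading MID server",
    "Stopping MID server. Bootstrapping upgrade.",
    "Setting mid status to Down",
    "MID Server stopped",
    "Main.handleStop() after shutdown, OperationalState=UPGRADING",
]


def upgrade_progress(agent_log_text, milestones=AGENT_LOG_MILESTONES):
    """Return (milestones reached in order, first milestone not found or None)."""
    reached = []
    position = 0
    for message in milestones:
        index = agent_log_text.find(message, position)
        if index == -1:
            return reached, message
        reached.append(message)
        position = index + len(message)
    return reached, None


# Simplified sample: the log ends after the packages were downloaded.
sample_log = "\n".join([
    "Checking to see if MID server needs to upgrade.",
    "Setting mid status to Upgrading",
    "Performing pre-upgrade validation tests",
    "Pre-upgrade validation tests successful. Continuing with upgrade process",
    "Package was successfully downloaded",
])

reached, first_missing = upgrade_progress(sample_log)
print(f"{len(reached)} milestones reached; next expected: {first_missing}")
```

If mid.upgrade.run_precheck=false is set, the two pre-upgrade validation messages will be absent, so they would need to be removed from the list before using a check like this.
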
  7. Meanwhile, the separate Upgrade process starts, and will do the following:
    • This will start immediately after the "Bootstrapping upgrade" log above, before the MID Server has finished shutting down, which may take some minutes to complete.
    • This process logs to glide-dist-upgrade.log, within the TEMP folder. Only later is this copied into the main wrapper.log. Any errors or warnings will be logged to this file. If the upgrade fails during this step, then this file may be the only clue as to what happened.
    • "The ServiceNow Platform Distribution Upgrade (xxx) service is not installed - The specified service does not exist as an installed service" is logged in Orlando and Paris6.4. Ignore that.
    • "com.snc.dist.mid_upgrade.UpgradeMain$1 start" is logged once the java wrapper has started the upgrader application.
    • This upgrade service (or process, for non-admin 'log on as' users) should wait until the MID Server service has fully shut down before continuing7.9. It will first query the Tanuki wrapper every few seconds (using "agent\bin\mid.bat status") until it returns "Running: No", and those results are written directly to the wrapper.log.
    • Since Quebec, it checks the MID and wrapper process previously recorded in file agent\conf\pids are truly stopped. Expect 2 entries like "Process (pid=xxx) is not running." once both are stopped.
    • Files in the agent\bin and agent\lib folder are deleted from the MID Server installation. It will retry every second if the file is still locked, so the fact the MID Server might still be shutting down should not be an issue, assuming the MID Server does eventually cleanly shut down. "com.snc.dist.mid_upgrade.UpgradeMain wipeDirs" is logged.
    • If the files are still locked after 10 minutes7.1,7.12 the upgrade will fail. The upgrade service stops, and the MID Server is not started, and remains Down. From Paris, a check that the actual processes have stopped is added 7.14, in addition to logging a list of currently running processes, which will probably confirm the java and wrapper processes were still running. It does not do a stack dump, or list running services, so the information to match up the process IDs (PID) with the installs/services when multiple MID Servers are running is not easy from this log. From Rome, this logging is much improved.
    • If the agent log shows "MID Server stopped" and "Main.handleStop() after shutdown, OperationalState=UPGRADING" it doesn't mean the JVM and wrapper have actually stopped. You need to also see "<-- Wrapper Stopped" in the wrapper.log to confirm the MID Server has shut down. Note: these log entries from the main service are not going to be in chronological order with the upgrade log entries, due to the upgrade log being copied into the wrapper log later.
    • It is possible that file lock errors happen before the 10 minute timeout, after the MID Server truly has shut down. For example, due to Application Experience7.3, and Anti-Virus software such as Cisco AMP7.4 and Dell SecureWorks Red Cloak7.10. There are other causes not yet nailed down7.5. These other non-MID Server processes are momentarily keeping a lock on the files as the upgrade service tries to delete them. Code is being added to the MID Server to create Issue records when known causes such as these are identified, and Application Experience is already checked for4.6. Anti-virus deleting suspicious files, such as InjectorService.exe, while the upgrade is also trying to delete them causes exceptions as well.7.11 The upgrade service stops, and the MID Server is not started, and remains Down.
    • From Rome, only the files that need to be replaced are replaced. Prior to Rome the whole contents of the bin and lib folders were deleted and replaced even if only a few files had actually changed. For example, the Tanuki Wrapper executable bin\wrapper-windows-x86-64.exe is rarely changed, and so by not touching that many upgrade issues are avoided.
    • If 2 services incorrectly use the same install folder, the copies may fail due to file locks by the other running service.7.2 Checks on MID Server startup should now prevent that possibility.
    • After the deletes are done, "com.snc.dist.mid_upgrade.UpgradeMain migrateToTarget" and "Copying files to MID server installation path" are logged. New files previously extracted into the temp folder are copied into the MID Server installation folder, to replace those just deleted. Any existing files will be overwritten, so they must not be locked either.
    • Then "Correcting file permissions for directory" is logged for the correction and enforcement of file and folder permissions, before logging "Finished copying files".
    • If the Java JRE is also upgraded7.6, then the agent\jre folder is deleted and replaced. Customised files within agent/jre such as "cacerts" will be overwritten.7.7,7.13
    • "Upgrade complete" is logged
    • A crash or exception around this point could mean no further steps happen. This may be recoverable by allowing the ServiceNow Platform Distribution Upgrade to run again.7.8
    • The main MID Server service is Started using agent\start.bat7.15.
    • "Unable to install the ServiceNow MID Server_xxx service - The specified service already exists. (0x431)" is logged - ignore that, because this is the same start.bat that is used as part of a manual install, and tries to create a service just in case one is not already there. "wrapperm | Waiting to start..." may be logged a few times, and then "ServiceNow MID Server_xxx service started".
    • The log file is copied into the MID Server's wrapper log, in an << UPGRADE LOG BEGIN >>...<< UPGRADE LOG END >> section.
    • This upgrade process then shuts itself down.
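
The retry-then-give-up handling of locked files described above (retry every second, fail after 10 minutes) follows a common pattern, sketched here. This is an illustration under assumptions, not the upgrader's code: delete_with_retry and its parameters are hypothetical, the path in the demo is made up, and the injectable remove, sleep and clock arguments exist only so the example can run without real locked files.

```python
import os
import time


def delete_with_retry(path, timeout_s=600.0, interval_s=1.0,
                      remove=os.remove, sleep=time.sleep, clock=time.monotonic):
    """Try to delete path until it succeeds or timeout_s elapses.

    Returns True once the file is gone, False if it stayed locked for the
    whole timeout. On Windows a locked file typically raises PermissionError.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            remove(path)
            return True
        except FileNotFoundError:
            return True  # Already gone, nothing left to delete.
        except PermissionError:
            if clock() >= deadline:
                return False
            sleep(interval_s)


# Demo: simulate a file that is locked for the first two attempts.
attempts = {"count": 0}


def locked_twice(path):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise PermissionError("file is locked by another process")


deleted = delete_with_retry("agent/lib/example.jar",
                            remove=locked_twice, sleep=lambda seconds: None)
print(deleted, attempts["count"])
```

A momentary lock from anti-virus software fits the first case (a retry succeeds), while a MID Server JVM that never shuts down fits the second (the timeout expires and the upgrade fails).
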
  8. The MID Server Starts
    • The upgrade check on startup should confirm that the MID Server Installed version is now the Assigned version. If not, it will attempt to upgrade again.
    • The previous ServiceNow Platform Distribution Upgrade Windows service is uninstalled, and the TEMP folder is deleted, even if it had crashed and not finished.8.1
    • The TEMP folder is deleted, and the glide-dist-upgrade.log file with it.
    • The Instance Certificate will be validated, by checking for revocation with OCSP8.2, and the certificate chain and root certificate are also checked, which can cause problems when self-signed certificates of a proxy/firewall are involved.8.10
    • The Tanuki Wrapper will verify the start parameters, and the Certificates of the wrapper executables are valid.8.3,8.4,8.5
    • Passwords in config.xml may be re-encrypted if security-related code has changed.8.6
    • The PowerShell version of the host is checked.
    • Cortex XDR has been seen to kill the MID Server Service during startup.8.12
    • A PowerShell script to enforce stricter Windows file permissions is run8.7. MID Server parameter mid.windows_host.file_permissions.enforce=false disables this. In Quebec this can time out and prevent startup.8.11
    • A check is done to make sure the Service name in wrapper-override.conf matches the actual running service name, and if not shuts down the MID Server to avoid the chance of 2 services for the same install running. 8.8,8.9
    • A check is done for any other MID Server records in the instance with the same name.
    • The Version field of the MID Server record [ecc_agent.version] is updated by the "MID - Process XMLStats" Business rule sensor, in response to the topic=queue.stats input sent by the MID Server's StatusMonitor thread, which runs on startup (and every 10 minutes) and gets the version number from the agent/package/meta/mid-core.meta file.
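
Because the TEMP folder and glide-dist-upgrade.log are deleted on startup, the copy of the upgrade log inside wrapper.log is usually what remains to be analysed. The sketch below pulls those sections out so they can be read separately from the interleaved service entries. It assumes the marker text appears exactly as quoted in this article; the function name extract_upgrade_sections and the simplified sample lines are hypothetical.

```python
BEGIN_MARKER = "<< UPGRADE LOG BEGIN >>"
END_MARKER = "<< UPGRADE LOG END >>"


def extract_upgrade_sections(wrapper_log_text):
    """Return one list of lines per BEGIN/END section found in the wrapper log.

    A section that is opened but never closed is ignored in this sketch.
    """
    sections = []
    current = None
    for line in wrapper_log_text.splitlines():
        if BEGIN_MARKER in line:
            current = []
        elif END_MARKER in line and current is not None:
            sections.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return sections


# Simplified sample; real wrapper.log lines carry timestamps and prefixes.
sample_wrapper_log = "\n".join([
    "<-- Wrapper Stopped",
    "<< UPGRADE LOG BEGIN >>",
    "com.snc.dist.mid_upgrade.UpgradeMain wipeDirs",
    "com.snc.dist.mid_upgrade.UpgradeMain migrateToTarget",
    "Upgrade complete",
    "<< UPGRADE LOG END >>",
    "ServiceNow MID Server_xxx service started",
])

upgrade_sections = extract_upgrade_sections(sample_wrapper_log)
for section in upgrade_sections:
    print("\n".join(section))
```

Reading the extracted section on its own avoids being misled by the fact that upgrade entries are not in chronological order with the surrounding wrapper.log entries.
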

Additional Information

Footnotes: