Event Management - Impact calculation explained

Table of Contents

- Introduction
- What is Impact Calculation
- How does Impact Calculation work
  1. Copying Alert data from the Alert table to the alert history table
  2. Using the copied data for the impact calculation
- Troubleshooting Impact Calculation issues
  1. The current Alert status is not reflected on the Operator Workspace
  2. A service group job consumes high memory
  3. Large Impact graph & Impact status tables
  4. Slow queries
  5. Invalid customization
- Troubleshooting Flow
- References

Introduction

This article helps TSEs understand the Impact Calculation functionality and how to troubleshoot it.

What is Impact Calculation

Impact calculation shows the magnitude of an outage on CIs, services, alerts, and alert groups. This makes it essential for customers to understand the overall business impact of an outage.

Impact calculation is part of Event Management. To see the current status of services, navigate to Event Management ➔ Operator Workspace. Below is a sample example.

Here we can see the normal and impacted services. When you open an impacted service, you get a clear view of the node(s) because of which the entire service is impacted.

How does Impact Calculation work

Impact Calculation uses factors such as impact rules and CI relationships to calculate the severity of a generated alert. The severity appears on the impact tree, application service maps, and dashboards.

There are two main steps involved:

1. Copying Alert data from the Alert table to the alert history table.
2. Using the copied data for the impact calculation.

1. Copying Alert data from the Alert table to the alert history table

Alert History [em_alert_history]: Out of the box, the scheduled job "Event Management - Impact Calculator Trigger" copies data from the Alert table to the alert history table and populates the VT_end field of the records.

Unlike the Alert table, you are expected to see multiple entries for the same alert number in the alert history table (for example, Alert0010056). Whenever an alert is inserted or updated, a record is copied into the alert history table. The latest entry always has a future Alert Valid Time End date. So if there is an entry for an alert with the state Open and a future end date, it means there is an active impact caused by that alert. In short, VT_end shows the alert's validity.

Read the image above from bottom to top:

- The first (bottom) record has Alert Valid Time Start 2022-09-08 09:18:49 and Alert Valid Time End 2022-09-08 09:19:08. This is its current state as per the screenshot. However, when it was the only record in the table, its Alert Valid Time End was "8994-08-17 00:12:55", because the impact end time was unknown. The impact for the alert started at 2022-09-08 09:18:49, and since we do not know how long it will last, the end date is set to a far-future date [~6977 years from now] to keep the impact active.
- Some time later, the alert record was updated and the copy job copied the updated version of the alert to the alert history table at 2022-09-08 09:19:08. This is the second record from the bottom. The previous copy of the alert is no longer valid, so the End time of the previous record [first from the bottom] is set to 2022-09-08 09:19:08, and the same value becomes the Start time of the second record, and so on.
- Looking at the topmost record, its Start time is 2022-09-08 09:19:08, and since we do not know how long it will stay active, its End time is set to 8994-08-17 00:12:55.

Do not get confused by the future date.
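The VT_start/VT_end chaining described above can be sketched with a small in-memory model (a hypothetical Python illustration of the behavior, not the actual ServiceNow copy job):

```python
from datetime import datetime

# Far-future sentinel used while the impact end time is unknown.
FAR_FUTURE = datetime(8994, 8, 17, 0, 12, 55)

def copy_alert_to_history(history, alert_number, state, now):
    """Model of the copy job: close the previous open history record for
    this alert and open a new one with an unknown (far-future) end time."""
    for rec in history:
        if rec["number"] == alert_number and rec["vt_end"] == FAR_FUTURE:
            rec["vt_end"] = now  # previous copy is no longer valid
    history.append({"number": alert_number, "state": state,
                    "vt_start": now, "vt_end": FAR_FUTURE})

history = []
copy_alert_to_history(history, "Alert0010056", "Open", datetime(2022, 9, 8, 9, 18, 49))
copy_alert_to_history(history, "Alert0010056", "Open", datetime(2022, 9, 8, 9, 19, 8))
# The first record now ends exactly where the second starts, and the
# latest record stays open with the far-future end date.
```

This mirrors the screenshot walkthrough: each record's VT_end becomes the VT_start of the next copy, and only the newest copy keeps the 8994 end date.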
This future date is expected behavior.

This alert history data is used to calculate the impact of the CI on specific services. Note that impact is shown only for alerts that are not in a maintenance state; if the alert is already in maintenance, no impact is displayed.

Data in the alert history table is kept for 3 months and is cleared by the "Event Management - Clean Alert History Table" scheduled job, which runs every 30 minutes. The clean-up is based on the vt_start date; records with a vt_end in 8994 indicate the impact is still active, so we would not expect them to be cleaned.

Note - The service status on the Operator Workspace reflects the status value from the alert history table, not from the Alert table.

2. Using the copied data for the impact calculation

Impact Graph [em_impact_graph]

Impact calculation involves 3 important tables: Impact Graph [em_impact_graph], Impact Status [em_impact_status], and Hashes [sa_hash].

The Impact Tree Builder job ["Event Management - Impact Tree Builder"] runs every 11 seconds, handles all services with changes from the em_impact_changes table, and rebuilds their impact trees. Records are inserted into em_impact_graph as part of building the impact tree. The impact graph has no dependency on the Alert table; it is built from the Service Mapping topology and the impact rules.

The impact tree shows the result of applying impact rules to CI parent-child relationships. The tree represents a service map with CIs from the Service Configuration Item Associations [svc_ci_assoc] table.

For example, in the example below for service_v1_40, you will find a service record with 2 CIs pointing to it; the status of all the records is Valid, and "Is service" is true for the service record.
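As a rough illustration of what these graph rows represent, here is a hypothetical Python model (the CI names and field names are illustrative, not the actual em_impact_graph schema):

```python
# Minimal model of em_impact_graph rows for a service like service_v1_40:
# each row links an element to the service whose impact tree it belongs to.
impact_graph = [
    {"element": "service_v1_40", "service": "service_v1_40",
     "is_service": True,  "status": "valid"},
    {"element": "ci_v1_40_1",    "service": "service_v1_40",
     "is_service": False, "status": "valid"},
    {"element": "ci_v1_40_2",    "service": "service_v1_40",
     "is_service": False, "status": "valid"},
]

def members_of(service, graph):
    """Return the CIs (non-service nodes) in a service's impact tree."""
    return [row["element"] for row in graph
            if row["service"] == service and not row["is_service"]]

members = members_of("service_v1_40", impact_graph)
```

The point of the model is that the graph is purely structural: it records which CIs belong to which service's tree, independently of any alert data.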
When the system rebuilds the service, or you manually change the service's Operational Status from Operational to Non-Operational and back to Operational, all the records associated with the service are deleted and new records are created with Status as Valid.

Impact Status [em_impact_status]

Once the service graph is populated, 6 jobs perform the alert impact calculation. They are based on em_alert_history and em_impact_graph; they calculate the service severity in the em_impact_status table and update its VT_end field. Below are the job details:

- Event Management - Impact Calculator for BS [Bucket 0-3] - calculates the impact for services in buckets 0-3.
- Event Management - Impact Calculator for BS [Bucket 2] - alert groups and SLA.
- Event Management - Impact For Service Group - calculates the impact on service groups.

These jobs use the alert history records and create the entries in em_impact_status.

em_impact_status - data is stored for 3 months by default and reflects the severity status of a CI and how it affects the service the CI is part of. This table is based on em_alert_history and em_impact_graph.

Troubleshooting Impact Calculation issues

There are 5 top scenarios observed with the impact jobs:

1. The current Alert status is not reflected on the Operator Workspace.
2. A service group job consumes high memory.
3. Large Impact graph & Impact status tables.
4. Slow queries.
5. Invalid customization.

1. The current Alert status is not reflected on the Operator Workspace

When we see a discrepancy in the alert status on the Operator Workspace, we first need to check the below tables and validate where exactly the issue is:

- Alert History [em_alert_history]
- Alert Impact Status [em_impact_status]
- Hashes [sa_hash]

Alert History

Use Case 1 - Alert data is not getting copied:

We need to check whether alerts are copied to the table or not.
If they are not copied, something is wrong with the copy job "Event Management - Impact Calculator Trigger". Check the application node logs and system logs to verify the job execution and see if any error is reported. The job needs to be fixed so that it starts copying data to the Alert History table.

Use Case 2 - Alert data is getting copied, but not correctly:

Sometimes the alert is copied but the correct status of the record is not. This can happen when slow business rules do not commit data to the database before the copy job executes. To handle such cases, where the update to the alert record is delayed by slow custom business rules, check the following properties:

- evt_mgmt.max_objs_in_alert_query: Always make sure this property is set to 1, not 500. Out of the box, the value is 1. If set to 500, nothing is committed to the database until all 500 events in the transaction are processed, which delays the process. We therefore recommend processing events one by one so that processing completes faster.
- evt_mgmt.impact_calculation.alert_copy_delay: By default this property is set to 2 milliseconds and is not visible out of the box. During job execution, check whether any slow business rules run on the alert tables; if so, consider raising the evt_mgmt.impact_calculation.alert_copy_delay value above the default of 2 milliseconds. This makes the job consider only records updated before [now - n], where n is the property value.

Note - After making the above changes, new records will be handled correctly, but existing records will still be in a broken state. To take care of existing data, follow step 3 in KB0826649 or execute the "emSupport-touchOpenAlertsHistory.txt" script attached to the KA.
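The effect of alert_copy_delay can be sketched as a simple cutoff filter (a hypothetical Python model; the property name is from this article, the selection logic is simplified):

```python
from datetime import datetime, timedelta

def alerts_ready_to_copy(alerts, now, alert_copy_delay_ms=2):
    """Model of the copy job's delay window: only alerts last updated
    before (now - delay) are picked up, which gives slow business rules
    time to commit their changes first."""
    cutoff = now - timedelta(milliseconds=alert_copy_delay_ms)
    return [a for a in alerts if a["sys_updated_on"] <= cutoff]

now = datetime(2022, 9, 8, 9, 19, 8)
alerts = [
    {"number": "Alert0010056", "sys_updated_on": now - timedelta(seconds=5)},
    {"number": "Alert0010057", "sys_updated_on": now},  # updated this instant
]
# With a larger delay, the freshly updated alert is skipped this cycle and
# picked up on a later run, after its business rules have committed.
ready = alerts_ready_to_copy(alerts, now, alert_copy_delay_ms=1000)
```

Raising the delay trades a little copy latency for correctness: the history table then receives the committed state of the alert rather than a half-updated one.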
Another issue is an incorrect VT_end date. The copy job updates this value; if there is an issue with backfill, this field is not updated. This can happen when the backfill hash value is out of sync in the sa_hash table.

Impact Status

If the impact status does not reflect what the user expects, check the records in the impact graph table to make sure the CI is part of the service (the details are explained above). Apart from this, we recommend the below basic checks in the impact status table to validate the issue.

Compare VT dates: For the reported impacted service, make sure the VT Start and VT End dates are populated correctly.

Reading the first record in the above screenshot, CI "ci_v1_179_1_1" is an element identifier pointing to the "service_v1_179" business service. The impact started at "2022-09-05 23:13:44" and will continue until 8994-08-17 00:12:55. The second record shows that, for the same CI, an impact started at "2022-09-05 23:13:37" and ended at "2022-09-05 23:13:44". Comparing the two, the second record confirms that an impact ended on 2022-09-05 at 23:13:44, while the impact in the first record is still active.

The above screenshot is a working example. If we see discrepancies in this behavior, the problem is with the impact calculation. In that case, as basic troubleshooting, you can switch the service from Operational to Non-Operational and back to Operational. This deletes all existing impact graph records and builds new ones, and the newly built entries are then used for the calculation.

Note - The above step should only be followed after identifying the root cause. If the service Operational Status is changed without finding the root cause, we may end up in a similar situation again in the future.

Hashes [sa_hash]

Another problem area is the Impact Calculator jobs for business services. If a job is stuck, this is also reflected in the sa_hash table. The main issue we see with the sa_hash table is broken sync: impact hashes are not synced with the other hashes. There is a hash value for the copy job and hash values for the impact calculation jobs. The impact calculation jobs progress based on the copy job's progress, so the hash values for the impact jobs normally stay close to the copy job's. The problem appears when there is a big gap between the copy job hash and the impact job hashes; this is seen when the impact jobs for BS are stuck.

Some of the common cases with sa_hash are:

A. last_impact_batch_copy_job has a lower value: Here the copy job hash is lower and the impact job hashes are higher, so the impact calculation jobs cannot progress. Check for PRBs and verify whether the customer is on the fixed version. If not, the below steps can be followed to stabilize the system:

1. Change the 'last_impact_batch_copy_job' hash value to match the others (the highest of the calc hashes). In the above example, set it to 224464 or above.
2. Update the vt_end of old alert history records to the vt_start value [attached - emSupport-UpdateVTEndOfOldAlertHistoryRecordsToCurrent.txt].
3. Touch all open alerts [attached - emSupport-TouchAllOpenAlerts.txt].
4. Touch all services:
   - For instance versions below Washington, run the attached script [attached - emSupport-TriggerServices.txt].
   - For Washington and above, execute the scheduled job "Event Management - Manual Impact Calculator Trigger".

B. last_impact_batch_calc_job-2 is in scientific form: This impacts alert groups and SLA (the job name ends with -2). This is a known issue, PRB1564953, and is already taken care of.
To stabilize the system and fix the issue:

1. Deactivate the "Event Management - Impact Calculator for Alert Groups and SLA" job.
2. Change the hash value to match the others (a value lower than the copy job hash value).
3. Reactivate the "Event Management - Impact Calculator for Alert Groups and SLA" job.

C. Calc job hashes/backfill out of sync: This is caused when one of the impact calculation jobs is out of sync. This is just one of the reasons; there can be other causes for sa_hash going out of sync. Check for PRBs and verify whether the customer is on the fixed version. If not, the below steps can be followed to stabilize the system:

1. Change the 'last_impact_batch_copy_job3' hash value to match the others. In the above example, set it to 314085 or lower.
2. Update the vt_end of old alert history records to the vt_start value [attached - emSupport-UpdateVTEndOfOldAlertHistoryRecordsToCurrent.txt].
3. Touch all open alerts [attached - emSupport-TouchAllOpenAlerts.txt].
4. Touch all services:
   - For instance versions below Washington, run the attached script [attached - emSupport-TriggerServices.txt].
   - For Washington and above, execute the scheduled job "Event Management - Manual Impact Calculator Trigger".

D. Duplicate entries for the hash name: Sometimes we see multiple records with the same hash name in the sa_hash table; we call this a "duplicate hash issue". In this case, follow the below steps:

1. For the specific impact hash, locate the most recently updated record.
2. Delete all duplicate hash records except the most recently updated one.
3. After deletion, verify that all impact hashes are in sync. If not, follow the KB steps to fix sync issues for impact hashes.

As a proactive measure, we can enable Self-Health Monitoring. The Self-Health Monitoring >> Duplicate Impact Hashes Monitor script monitors this behavior and raises an alert to report it.
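The duplicate-hash cleanup in case D can be sketched as follows (a hypothetical in-memory Python model of sa_hash rows; the values shown are illustrative, and this is not the actual cleanup script):

```python
from datetime import datetime

def duplicates_to_delete(hash_rows):
    """For each hash name, keep only the most recently updated row and
    return the remaining duplicates, which are safe to delete."""
    latest = {}
    for row in hash_rows:
        name = row["name"]
        if name not in latest or row["sys_updated_on"] > latest[name]["sys_updated_on"]:
            latest[name] = row
    return [row for row in hash_rows if row is not latest[row["name"]]]

sa_hash = [
    {"name": "last_impact_batch_copy_job", "value": 224464,
     "sys_updated_on": datetime(2022, 9, 8, 9, 0)},
    {"name": "last_impact_batch_copy_job", "value": 198000,
     "sys_updated_on": datetime(2022, 9, 7, 9, 0)},  # stale duplicate
]
stale = duplicates_to_delete(sa_hash)
```

Keeping only the most recently updated row matches step 1 and step 2 above; the sync verification in step 3 still has to be done afterwards.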
At the platform level, we are working on a solution that will surface the risk of duplicate hashes so that we can prevent them and avoid duplicate hash issues.

2. A service group job consumes high memory

This issue is mainly observed with the default Service Group [cmdb_ci_service_group] "All". If there are many services and all of them are part of this "All" service group, the calculation requires more memory, which impacts instance performance. Certain performance improvements were made for this in Tokyo. To deal with this issue on an earlier version, manually backport the fix by importing the attached "EvtMgmtCalculateImpactForGroups" script.

Note - Revert the script once the customer upgrades to Tokyo.

3. Large Impact graph & Impact status tables

Possible causes and solutions:

- Very large services (exceeding the recommended default maximum number of CIs). For dynamic CI groups, the default is "sa.qbs.max_num_of_cis" = 10000.
- Many service rebuilds: CI changes may affect dynamic CI group filters -> rebuild -> Impact graph grows -> Impact status grows. To fix this, consider changing the filters of the dynamic CI group to be less sensitive to service changes so there are fewer rebuilds.
- Network Storage Paths may be involved. To fix this, consider excluding network paths from application services and VMs from dynamic CI groups by setting the below properties:
  - Set 'evt_mgmt.network_path_excluded' to true.
  - Set 'evt_mgmt.enrich_topology_dynamic_cis' to false.
- Other rebuilds with frequent changes. To fix this, use CI exclusion: KB0952256: Service Recomputation: exclude specific CI types from recomputation to reduce performance issues, and the exclusion list for CI changes per field in svc_baseline_exclusion.
- Issues with cleanup jobs: Check that the cleanup jobs run properly and frequently, and validate that the data is getting deleted. Also, if your impact-related data is very large, consider lowering the cleanup properties (under Event Management properties) below 3 months (but more than a day).

4. Slow queries

If we observe performance challenges such as slow queries during impact calculations, check the following:

- Active transactions: Is there a specific impact job that runs longer than usual? Look for large services with the same bucket ID.
- Find the query and validate whether new indexes can be added.
- Validate the impact table sizes, and control them using the cleanup job properties mentioned in the "Large Impact graph & Impact status tables" section.

5. Invalid customization

Sometimes, as part of business requirements, end users change the "Run as" of a scheduled job, which prevents it from running for other domains. In some cases, end users customize the Operational Status field values, which causes the service not to be listed under the impacted services of the alert.

We have also seen the impact calculation log error "Predefined 'All' group was not found". This occurs because the Service Groups table does not contain the 'All' record (the Service Group Responsibilities table referencing the 'All' record is affected).

Troubleshooting Flow

The flowchart below explains the troubleshooting flow.

References

Alert impact calculation