Troubleshooting guide for Service Mapping ML Connection Suggestions / Traffic Based Discovery

Connection suggestions and traffic-based discovery are two features that create smart suggestions by using TCP connection information.

Contents:
- Traffic Based connections prerequisites
- Overview and architecture
- Connection Suggestions feature prerequisites
- Troubleshooting
  - Basic troubleshooting flow
  - Problem: There is an old traffic-based connection on the map
    - Solution: Check aging time
  - Problem: Missing AFP due to stuck job / failed trainer / trainer availability
    - Solution: Check clustering solution
  - Problem: Missing connection suggestion
    - Solution: Validate the existence of relevant traffic info
    - Solution: Check sa_ml_process_to_process is populated and ready for ML training
    - Solution: Check there are no missing Source/Target AFP in sa_ml_process_to_process
    - Solution: Check version + if the CI is a load-balancer
    - Solution: Check AFP ML solution capability type
  - Problem: Confidence level field in suggestion records is missing or showing "N/A"
    - Solution: Validate traffic connection was collected
    - Solution: Check there are no missing Source/Target AFP in sa_ml_process_to_process
  - Problem: Missing Confidence (or Missing AFP) due to Predictive Intelligence solution is stuck in "waiting for training" / "waiting for updating" or no movement in progress percentage
    - Solution: Contact the owner team of the flow for support
    - Solution: Full cleanup if the training issue wasn't resolved
  - Problem: There is a message in ML readiness page: "Check connection suggestion scheduled job", and "Estimated Time in Hours" is NA
  - Problem: Connection suggestion decision is 'Added by rule' or 'Added manually' but the connection is not showing on the map
    - Solution: Target process classified as the source CI
    - Solution: Check discovery messages log
- Recovery procedure for SM ML-based feature after trainer availability failure

Traffic Based connections prerequisites

- System property sa_ml.connection_suggestions.active = true
- The traffic-based discovery flag in the service form must be active.
- Map Display – Show Traffic-Based CIs must be active.
- The cmdb_tcp table can't be empty. If it is empty, run horizontal discovery for all the relevant hosts and then run top-down discovery for the relevant business services. If it is not empty, run top-down discovery for the relevant business services.

ML Connection Suggestions Overview and Architecture

The Connection Suggestions feature enables a new approach to traffic-based service mapping using a new ML algorithm and a new UI. The algorithm classifies connections into several confidence levels of relation to the application service, based on the source and target AFPs (Application Fingerprints) of the connection.

The feature is enabled through these system properties:
- glide.app.p2p.conn.map.enabled=true
- process.clustering.appfingerprint.enabled=true
(OOTB, these properties are not present; the default is true.)

Once the feature is enabled, traffic-based connections are not added automatically to the maps. The user can see the list of connection suggestions with their classifications and decide which connections to add to the map.

The connection suggestion ML model depends on the trained Application Fingerprint model. Both models rely on successful horizontal discovery. Connection suggestions also rely on top-down discovery: the table "sa_ml_connection_suggestion" is populated during top-down discovery.

ML Connection Suggestions prerequisites

- The Predictive Intelligence (PI) plugin must be installed.
- The admin user must have the ml_admin, sm_admin, and discovery_admin roles.
- The admin user application scope id should be global.
- The App Fingerprints table cmdb_process_groups must be populated.
  If it isn't, follow the guidelines in this section before proceeding: App Fingerprints prerequisites. If the problem is within this section but you cannot resolve it, reassign the case to the Dev-SW-SmurAI team (the team name is case sensitive).
- System property "sa_ml.connection_suggestions.active" must either exist and be set to "true", or not exist at all.

Troubleshooting

Basic troubleshooting flow:

[1] Is traffic-based discovery / connection suggestions enabled for the CI? Check through the map view (it should be enabled).

[2] Finding the relevant PIDs:
- Navigate to the relevant CI on the map and open the discovery log.
- Search for the following pattern in the discovery log:
  "$$$process_data$$$###pid=4###parent_pid=0###command_line=c:\\windows\\system32\\inetsrv\\w3wp.exe###executable_path=C:\\Windows\\System32\\ntoskrnl.exe###process_data###"
- Extract the following information: PID, PPID, command_line, executable path.
- Navigate to the cmdb_running_process table.
- For a Windows host, search for records that match the following:
  - Computer Sys Id = the host id
  - Command = the executable path from the discovery log
  - Parameters = the command line from the discovery log
- For a UNIX host, split the command line from the discovery log into the command and its parameters, and search for records that match the following:
  - Computer Sys Id = the host id
  - Command = the complete command (with path), but without the command's parameters
  - Parameters = the command's parameters (if any)
- Collect the PIDs of the records found (pid field in the cmdb_running_process table).
- If no PIDs are found in the cmdb_running_process table, collect the PIDs directly from the discovery log.
- Add child processes: navigate to the cmdb_running_process table and search for records that match the following:
  - Computer Sys Id = the host id
  - PPID = the list of PIDs collected
  If such records are found, get the PIDs of those records (pid field in the cmdb_running_process table) and add them to the PIDs list.
- Add the PPIDs from the discovery log to the PIDs list.

[3] Increase the max traffic based connections allowed property - sa.traffic_based_discovery.max_connections (default is 100). This defines the maximum number of traffic-based connections from a single CI.

Problem: There is an old traffic-based connection on the map

Solution: Check aging time
This is the expected behavior during the grace period of a traffic-based connection. The grace period is the aging period, in hours, until traffic-based connections are removed from a CI. It is defined by the system property 'sa.traffic_based_discovery.conn_aging_time'.

Problem: Missing AFP due to stuck job / failed trainer / trainer availability

Solution: Check clustering solution
Review ML clustering solution troubleshooting:
- Check that the solution has completed.
- In Vancouver patch 4 and above: verify the capability type.
- In case further problems are encountered, please contact the Dev-SW-SmurAI team (the team name is case sensitive).

Problem: Missing connection suggestion

Solution: Validate existence of relevant traffic info
Review the table cmdb_tcp and filter to an IP on the same port, as per the example below. In the example:
- Green is what we have in order to establish a connection: both sides of the connection to an IP on the same port.
- Red is what is missing: the process ID for both connection sides.
If other traffic-related issues appear, use the following traffic troubleshooting guide to make sure the relevant traffic info is collected:
https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB0832798
If none of the above provides an answer, check whether the source / target AFP is part of the "sa_ml.connection_suggestions.bad.afp.list" system property. If it is, this AFP is not stable and connection suggestion is not supported for it.

Solution: Check sa_ml_process_to_process is populated and ready for ML training
- The table sa_ml_process_to_process gathers connections between two processes using TCP data collected on discovery.
- It depends on the tables cmdb_tcp and cmdb_process_groups.
- Review the table sa_ml_process_to_process for existing records.
- Check that all records have "Is Complete"=true. The "Is Complete" column indicates whether the row is ready to be sent to the trainer. False means that one of the AFPs is still missing.
- To train the data and populate the confidence between processes, a minimum number of records (10,000 by default) in sa_ml_process_to_process with "Is Complete"=true is required. The customer can change this minimum using the property 'glide.platform_ml.api.csv_min_line', with a value greater than 0. If sa_ml_process_to_process does not contain this number of records, the confidence in connection suggestions won't be populated and application service candidate creation won't begin.

Solution: Check there are no missing Source/Target AFP in sa_ml_process_to_process
If AFPs are missing in the Source/Target AFP columns, check whether the process ID has a related AFP by running the background script below:

    var afpProviderJs = new AFPProviderJS();
    var afpSolutionVersion = afpProviderJs.getAFPSolutionVersion();
    // Placeholder: set this to the process ID being checked before running.
    var processId = '';
    // Look up the AFP group for that process in the current solution version.
    var sourceGroupID = afpProviderJs.getAFPIDForProcessVersion(processId, afpSolutionVersion);
    gs.log("sourceGroupID=" + sourceGroupID);

If the script shows null for the process ID (that is, it logs sourceGroupID=null), the solution needs to be restarted by following this procedure: Recovery procedure for SM ML-based feature after trainer availability failure

Solution: Check version + if the CI is a load-balancer
Connection suggestion for load balancers is supported only from the San Diego version. Prior to this, only servers are supported.
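
The checks on sa_ml_process_to_process described in the two solutions above (row completeness and missing Source/Target AFPs) can be sketched as a small helper. This is only a minimal sketch in plain JavaScript, not code from the product: it assumes the rows were already exported into plain objects, and the names summarizeProcessToProcess, isComplete, sourceAfp and targetAfp are hypothetical stand-ins for the real table columns. The default threshold of 10,000 is the one stated in this guide.

```javascript
// Hypothetical helper: summarizes exported sa_ml_process_to_process rows.
// Field names (isComplete, sourceAfp, targetAfp) are assumptions for illustration.
function summarizeProcessToProcess(rows, minCompleteRows) {
  // 10,000 is the default minimum of complete rows mentioned in this guide.
  var threshold = minCompleteRows === undefined ? 10000 : minCompleteRows;
  var summary = {
    complete: 0,          // rows with "Is Complete" = true
    missingSource: [],    // indexes of incomplete rows without a source AFP
    missingTarget: [],    // indexes of incomplete rows without a target AFP
    readyForTraining: false
  };
  rows.forEach(function (row, index) {
    if (row.isComplete) {
      summary.complete++;
      return;
    }
    if (!row.sourceAfp) summary.missingSource.push(index);
    if (!row.targetAfp) summary.missingTarget.push(index);
  });
  // Training needs at least the minimum number of complete rows.
  summary.readyForTraining = summary.complete >= threshold;
  return summary;
}
```

Rows reported in missingSource or missingTarget are the ones whose process IDs would be worth checking with the background script above.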

Solution: Check AFP ML solution capability type
In Vancouver patch 4 and above, validate the AFP ML solution capability type.

Problem: Confidence level field in suggestion records is missing or showing "N/A"

Solution: Verify ML solution update is complete
Verify that the active ml_solution record doesn't have errors in the "update state" field (add this field to the view). In case of an error in "update state", check the resolution for the stuck ML solution.

Solution: Validate traffic connection was collected
- Review the table cmdb_tcp and filter to the relevant IPs/hosts that are known to be part of the relevant service.
- If traffic info is missing, review the following troubleshooting guide for further steps to validate that traffic is collected:
  https://support.servicenow.com/kb?id=kb_article_view&sysparm_article=KB0832798

Solution: Check if Source or Target AFP is not present in the bad AFP list
The system property sa_ml.connection_suggestions.bad.afp.list contains a comma-separated list of AFPs for which confidence is not calculated when they appear as the source or target AFP. If required, the customer can remove the AFP from this list and re-run top-down discovery on the application, and the confidence will be populated.

Problem: Missing Confidence (or Missing AFP) due to Predictive Intelligence solution is stuck in "waiting for training" / "waiting for updating" / "uploading trained solution error" or no movement in progress percentage

Solution: Contact the owner team of the flow for support
- Check the table sa_ml_solution_failure_stats and verify it isn't empty.
- Alternatively, check the table sa_hash for the existence of a record with the name "cancel_retry_stuck_app_fingerprint_solution_name".
- If either of the above conditions is met, you are using the legacy code flow. Please contact the 'Dev-SW-SmurAI' team (the team name is case sensitive).
- Otherwise, contact the 'ATG - Infrastructure' team for further assistance (the team name is case sensitive).
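
The routing rule in the solution above can be written down as a short decision function. This is an illustrative sketch only: the function name pickSupportTeam and its two parameters are hypothetical, and the caller is assumed to have already looked up the row count of sa_ml_solution_failure_stats and the presence of the sa_hash record by hand.

```javascript
// Hypothetical helper mirroring the escalation rule described above.
// failureStatsRowCount: number of rows found in sa_ml_solution_failure_stats.
// hasCancelRetryHashRecord: true if sa_hash has a record named
//   "cancel_retry_stuck_app_fingerprint_solution_name".
function pickSupportTeam(failureStatsRowCount, hasCancelRetryHashRecord) {
  // Either condition indicates the legacy code flow.
  var legacyFlow = failureStatsRowCount > 0 || hasCancelRetryHashRecord === true;
  // Team names are case sensitive, so they are returned exactly as written in this guide.
  return legacyFlow ? 'Dev-SW-SmurAI' : 'ATG - Infrastructure';
}
```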

Solution: Full cleanup if the training issue wasn't resolved
If the instance that you are troubleshooting is a cloned instance, and the solutions present in it came from cloning records from the base instance, a full cleanup is required. Training cannot proceed on the cloned instance without it, because the solutions were created on the base instance.
NOTICE: on a cloned instance, if you have overridden the property glide.servleturl, make sure that the instance name in it is updated to the cloned instance after the clone.
On a cloned instance, the training needs to start from scratch. Perform: Recovery procedure for SM ML-based feature after trainer availability failure

Problem: There is a message in ML readiness page: "Check connection suggestion scheduled job", and "Estimated Time in Hours" is NA.

Solution: The code checks that a run period is defined for the "Service Mapping - Traffic Process to Process" and "All Applications" jobs. If one of those jobs doesn't have a run period, this message is shown.
The numbers on the chart (trained/not trained) are taken from the sa_ml_connection_suggestion table. The total is the number of all records in the sa_ml_connection_suggestion table, and "not trained" is the number of records with "Confidence=N/A".

Problem: Connection suggestion decision is 'Added by rule' or 'Added manually' but the connection is not showing on the map.

Solution: Target process classified as the source CI.
1. If the discovery and pattern process classifies the target process as the same CI as the source CI, no connection will be added to the map.
2. To confirm this use case, check the source CI discovery log: there will be at least two endpoints that discover the same CI (the source CI endpoint and the connection suggestion endpoint).

Solution: Check discovery messages log
Sometimes an error will be shown in the discovery log but not on the service map.
- Open the discovery messages log.
- Check whether there is an error with the same IP and port as the connection suggestion that was added but is not showing on the map.

Recovery procedure for SM ML-based feature after trainer availability failure

1. Delete Application Fingerprints solution
- Enter "ml_capability_definition_base.list" from the main navigation menu.
- Filter solution name by: "*global_application_suggestion"
- Select all appearances and choose "delete" from the action menu.

2. Delete the following entries from sa_hash:

3. Delete Connection Suggestions solution
- Enter "ml_capability_definition_base.list" from the main navigation menu.
- Filter solution name by: "*global_proc_conn"
- Select all appearances and choose "delete" from the action menu.
- Delete the following entries from sa_hash: "*conn" and "*p2p"

4. Delete Data From Tables
- Navigate to "system definition->tables" from the main navigation menu.
- Enter the cmdb_process_groups table.
- Delete all records.
- Do the same for sa_ml_process_to_process.
If you need to delete connection suggestions, delete entries from the sa_ml_connection_suggestion table. Note that this will impact Application Service maps. Avoid deleting records from this table if such impact is not desired.

5. Validate Application Fingerprints prerequisites
- Validate that the PI plugin is installed.
- Check that the system property "process.clustering.appfingerprint.enabled" exists and equals "true", or does not exist.

6. Run App Fingerprints scheduled job
- Enter "sysauto_script.list" from the main navigation menu.
- Filter name to: "Applications suggestion - ITOM Autodiscovery"
- Click Execute Now.
- Check that the solution has completed: go to ml_solution.list and filter solution name by "*global_application_suggestion". Verify that the state is "solution complete" and 100%.
- Verify that the capability type is Clustering. If the capability type is not Clustering, you need to upgrade the Visibility Content application to version 6.15.2 or above and restart the recovery process.
- Check that the cmdb_process_groups table has records.

7. Run Connection Suggestion scheduled job
- Enter "sysauto_script.list" from the main navigation menu.
- Filter name by: "Service Mapping - Traffic Process to Process"
- Click Execute Now.

8. Post validation, a few hours after running the sysauto_script jobs
- Check that the data analysis solution under ml_solution exists and is in complete state.
- Check that the solution is connected to the trainer and has finalized successfully: a recent results record should exist in the table ml_connection_analysis_result.
- Check that the table sa_ml_process_to_process has records. The minimum number of rows for showing results is 10K, so there should be more than 10K rows with is_complete = true in the table.
- Check that the table ml_data_analysis_sol_update has records.
- The solution that will be created can be found in the ml_solution table and will contain the text proc_conn.
- After the initial solution is created, the updates will be shown in the table ml_data_analysis_sol_update.
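
The post-validation checks in step 8 can be collected into a simple checklist function. The sketch below is plain JavaScript written for illustration, not product code: it assumes the values were gathered manually from the tables named above, and the function name failedPostValidationChecks and the snapshot fields (solutionComplete, analysisResultCount, completeProcessRows, solutionUpdateCount) are all hypothetical.

```javascript
// Hypothetical checklist for the post-validation step.
// The snapshot object is assumed to be filled in by hand from the instance tables.
function failedPostValidationChecks(snapshot) {
  var checks = [
    // ml_solution: data analysis solution exists and is in complete state
    ['data analysis solution exists and is complete', snapshot.solutionComplete === true],
    // ml_connection_analysis_result: a recent results record exists
    ['recent record in ml_connection_analysis_result', snapshot.analysisResultCount > 0],
    // sa_ml_process_to_process: more than 10K rows with is_complete = true
    ['more than 10K complete rows in sa_ml_process_to_process', snapshot.completeProcessRows > 10000],
    // ml_data_analysis_sol_update: table has records
    ['ml_data_analysis_sol_update has records', snapshot.solutionUpdateCount > 0]
  ];
  // Return only the descriptions of the checks that did not pass.
  return checks
    .filter(function (check) { return !check[1]; })
    .map(function (check) { return check[0]; });
}
```

An empty result means every check passed; a field left out of the snapshot is treated as a failed check.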