[Predictive Intelligence] Clustering - Frequently asked questions (FAQ)

Summary

Clustering solutions can give you further insights into your data where conventional reporting tools fall short. However, because machine learning results depend on the dataset provided for training, you will have to try different algorithms and parameters to determine the best results for your dataset. We therefore suggest first using a smaller sample size of 10k records, which reduces the time it takes to train the Clustering solution, and trying many different combinations to see what works best with your data. In the later stages of testing, increase the sample size to the number of records that you expect to use in the final solution to report on, up to 300k records.

We also recommend reading the following Community Articles on Clustering solutions:
Tuning Predictive Intelligence Models (part 4) - Picking the right clustering algorithm
Improve your Predictive Intelligence clustering results with DBSCAN!
Using purity fields to better understand your clusters
Predictive Intelligence - Using the Cluster Insight Table to improve your analysis
Predictive Intelligence for HR – Find patterns using clustering (Article 3)

Instructions

1. Is the clustering visualization tab the only method for reporting purposes on a trained clustering solution?
You can also analyze a cluster with a data source. When you enable the "Create ClusterInsight table" checkbox on the Clustering solution, it creates a table on the ServiceNow Platform that lets you analyze the clusters of a trained solution in more depth with ServiceNow Platform reporting tools, such as Reporting and Performance Analytics. For example, the out-of-box Knowledge Demand Insights for Incidents dashboard uses a Clustering solution and turns it into a Pareto chart that shows the Candidate Knowledge Gaps for Incidents or Cases. Each bar is a cluster of similar incidents without knowledge.
It really depends on what you want to do with the Clustering solution and on the actions you take based on the data that it reports.

2. What are the relevant tables on the ServiceNow Platform that hold the cluster visualization data?
The ClusterInsight table has the same name as the Clustering "Solution Name" with an "ml_ci_" prefix. However, if you did not enable the "Create ClusterInsight table" checkbox on the Clustering solution, the data for the Clustering Visualization is held in tables [ml_cluster_summary] and [ml_cluster_detail].

3. How do you cross reference two fields using one clustering solution? For example, how do you cross reference the incident description field with the solution in the close notes field, so that you can see the issue and solution in one cluster or in a few different clusters?
To see the "Description" and "Close Notes" fields in the Insights table, add both fields to the Clustering Definition as Input fields. The data in both fields is then cross referenced for similarity, and clusters are formed based on the data in both Input fields.

4. When I use the pre-trained Word Corpus in the Clustering solution definition, it throws the error "SE0060:Solution Training Failed". Why?
The pre-trained GloVe word corpus, which can provide a larger capacity for your word corpus content, is only available for Similarity solutions, as per the Note on the Word Corpus field in the "Create and train a similarity solution" documentation. The Word Corpus definitions are in table [ml_word_vector_corpus]. All AI capabilities, including the Word Corpus, are uploaded as attachments into table [ml_model_artifact]. If you exclude the records with an empty Word Corpus column in table [ml_model_artifact], it lists every trained Word Corpus. Each one has 2 records, and both contain an attachment with the trained 'wordvector' solution. They are trained once every 180 days, as controlled via the system property [glide.wordvector.upgrade_time_frame].
You cannot train a Word Corpus yourself, as they are only trained via a linked Predictive Intelligence Solution.
Note: With the Washington release, a word corpus is not required, because a pre-trained model (GUSE) is used instead. The Word Corpus field is not visible in the definition form for pre-trained models.

5. Do I need to train the stopwords or is it processed along with the Clustering solution training?
It is processed alongside the Clustering solution training. The default out-of-box stopword list is not an exhaustive list of meaningless words, so we recommend checking the words generated in the cluster concept after training the solution. If any of them are generic or do not add value, add them to your custom stopword list. When you next train the solution, these unwanted words are excluded from the process that generates the cluster concept after the Clustering solution training has completed.

6. Why does it show 'null' as a word in the cluster concept?
Words shown as 'null' are in the stopword list. This only occurs when the data for the Input fields in the Clustering solution has a limited number of words, and most of these words are in the stopword list. To resolve this, add additional Input fields containing more vocabulary to the Clustering solution definition, or remove the word shown as 'null' from the stopword list.

7. Why does it not recognize "F5" as a vocabulary word?
The cluster concept is generated only from recognised vocabulary words, so it ignores words such as "F5". This has been raised as an idea for an enhancement.

8. How does the system decide how much data to cover and how does it calculate Coverage?
Coverage is calculated as "number of records clustered / number of records in the training dataset". So, if 5,000 records were clustered from a training dataset containing 10,000 records, the coverage is 0.5, which you multiply by 100 to get 50%. How much data the Clustering solution covers depends on many factors, including the data itself, the minimum number of records per cluster, and other parameters used by the algorithms. You can control some of these in the Advanced Solution Parameters, based on the chosen algorithm.

9. How can I increase the amount of data it can cover, to increase the Coverage when training the Clustering solution?
In the clustering definition, go to Advanced Settings and set the "Target Solution Coverage" in "Solution Parameters" to 100 to capture all the variations in the data.

10. For most of the smaller clusters, the cluster concept gives a good idea of the issue, but many of the bigger clusters containing thousands of records also include records that are not necessarily related to the cluster concept. How can you break down these larger clusters?
Unfortunately, as each dataset is different, you will need to experiment with the different algorithms and their Advanced Solution Parameters to determine the best results for your data. The default K-means algorithm has no Advanced Solution Parameters, whereas DBSCAN and HDBSCAN do have additional advanced parameters.

11. I can see there are clusters created with very similar subjects, so how can I train the system to place them in the same cluster?
Separate clusters are only created when there is enough variance in the data between the two similar clusters. Again, you will need to experiment with the different algorithms and their Advanced Solution Parameters to determine the best results for your data.

12. When using DBSCAN, the default Advanced Solution Parameters are 0.5 for epsilon and 5 for min_neighbours.
What happens when I change these default values?
Keeping epsilon constant at the default of 0.5 and increasing the value of min_neighbours:
Result: The number of clusters decreases as you increase min_neighbours.
Keeping min_neighbours constant at the default of 5 and reducing the value of epsilon:
Result: The number of clusters increases as you reduce epsilon.

13. When using DBSCAN, the Clustering solution definition has a "Minimum number of records per cluster" set to 20, and the Minimum Neighbors [min_neighbours] is set to 5, yet there are clusters with fewer than 5 records. Why?
The minimum number of records in a cluster is not supported when using the DBSCAN algorithm, because the algorithm itself has no such feature. There are 3 different parameters that relate to the minimum number of records in a cluster, as follows:
Minimum number of records per cluster: Used internally only in K-Means (default) and HDBSCAN, to ensure that the minimum number of records is maintained in a cluster. DBSCAN does not support this parameter.
Advanced Solution Parameters:
min_samples: Used internally only in HDBSCAN.
min_neighbours: Used internally in DBSCAN, but it is not used to set the minimum number of records in a cluster.

14. How does the system define the rank of a record in a cluster, which can range from 0 to 1000+?
Each cluster generated has a centroid, which can be thought of as the multi-dimensional average of the cluster. The higher the rank of a record in the cluster, the closer it is to the cluster centroid. We will possibly look at normalising these values in a future release.

15. When filtering by Service Category, the clustering visualization shows the sys_id and not the service name. Why?
If you use "Group By" on a referenced field, select the "Name" field on the referenced table in the Clustering solution definition and then retrain the solution.
You should now see the "Name" and not the sys_id in the "Group By" filter drop-down list on the Clustering Visualization tab.

16. To determine a suitable word corpus, how can we check what the most frequently used words are in the Input field data?
You can create a text widget in Performance Analytics that uses the Word cloud visualization to show the frequency of words and phrases. This relies on Text analytics, which is only available with the licensed version of Performance Analytics.
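
As a worked example of the Coverage calculation described in question 8, the following Python sketch divides the number of records clustered by the number of records in the training dataset and converts the ratio to a percentage. The function name coverage_percent and its argument names are hypothetical, chosen only for illustration. They are not part of any ServiceNow API, and the platform calculates Coverage for you during training.

```python
def coverage_percent(records_clustered: int, records_in_training_dataset: int) -> float:
    """Return coverage as a percentage.

    Illustrative helper only (hypothetical name, not a ServiceNow API):
    coverage = records clustered / records in the training dataset.
    """
    if records_in_training_dataset <= 0:
        raise ValueError("the training dataset must contain at least one record")
    if not 0 <= records_clustered <= records_in_training_dataset:
        raise ValueError("clustered records must be between 0 and the dataset size")

    ratio = records_clustered / records_in_training_dataset  # e.g. 0.5
    return ratio * 100                                        # e.g. 50.0


# Example from question 8: 5,000 of 10,000 records were clustered.
print(coverage_percent(5_000, 10_000))  # 50.0
```

A result well below 100 means that many records in the training dataset were not assigned to any cluster, which is the situation question 9 addresses.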