# Administrating and Troubleshooting CNO for Visibility (Cloud-Native-Operations)

## Troubleshooting Deployment Issues

Following the deployment, check whether the informer pod is running. Replace INSTANCE_NAME and NAMESPACE in the commands below and run them:

```
export POD_NAME=$(kubectl get pods --namespace k8s-informer-INSTANCE_NAME -l "app=k8s_informer-INSTANCE_NAME" -o jsonpath="{.items[0].metadata.name}")
kubectl get pod $POD_NAME --namespace NAMESPACE
```

If the pod is not running, either Kubernetes failed to pull the informer image, or the secret holding the instance credentials was not found. If Kubernetes failed to pull the image, the pod will be in the ErrImagePull or ImagePullBackOff state. If the secret does not exist, the pod will be in the ContainerCreating state. Run `kubectl get events -n NAMESPACE` to find the exact reason.

The informer expects a secret named k8s-informer-cred-INSTANCE_NAME in the relevant namespace. If this secret does not exist, the pod will stay in the ContainerCreating state and you will see an event similar to this:

```
51s  Warning  FailedMount  pod/k8s-informer-cnotal5-67bd77784d-bzb7k  MountVolume.SetUp failed for volume "credentials" : secret "k8s-informer-cred-cnotal5"
```

If the informer pod is running, run the following commands to see the informer logs. Remember to replace INSTANCE_NAME and NAMESPACE:

```
export POD_NAME=$(kubectl get pods --namespace k8s-informer-INSTANCE_NAME -l "app=k8s_informer-INSTANCE_NAME" -o jsonpath="{.items[0].metadata.name}")
kubectl logs $POD_NAME --namespace NAMESPACE
```

## Failure to Connect to the Kubernetes API Server

Failure to connect to the API server may be the result of a misconfigured network policy or another networking component in the cluster. The symptom is that the informer pod restarts multiple times and its status is CrashLoopBackOff.
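To confirm a restart loop, you can also read the restart count and last termination reason straight from the pod status (a sketch using standard Kubernetes pod status fields; replace NAMESPACE, with $POD_NAME set as in the commands above):

```shell
# Restart count of the informer container.
kubectl get pod $POD_NAME --namespace NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'

# Reason the container last terminated (for example Error or OOMKilled).
kubectl get pod $POD_NAME --namespace NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```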
The informer log, and the status message seen in the pod YAML, will be:

```
Failed to get the kube-system namespace Get "https://172.20.0.1:443/api/v1/namespaces/kube-system": dial tcp 172.20.0.1:443 i/o timeout
```

To verify that the problem is not with the ServiceNow pod, you can use a temporary pod that runs the curl command. Replace NAMESPACE with the namespace of the ServiceNow pod and run:

```
kubectl run mycurlpod --image=curlimages/curl -i -n NAMESPACE --tty -- sh
```

After getting a shell prompt, run the following command:

```
curl -k -I https://kubernetes.default.svc
```

If you get the same "i/o timeout" error, or the command hangs, there is indeed network blocking between the namespace and the API server. Following this experiment, make sure to delete the experimental pod:

```
kubectl delete pod mycurlpod -n NAMESPACE
```

## Failure to Access Kubernetes Resources

If the deployment uses a custom ClusterRole rather than the one provided with the Helm chart, the custom ClusterRole must have list, get, and watch access to all relevant resources. When it does not have sufficient rights, you will see messages like this in the logs:

```
W1126 07:25:17.368897  1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.0/tools/cache/reflector.go:169: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:k8s-informer-cnotal5:servicenow-cnotal5" cannot list resource "endpoints" in API group "" at the cluster scope
```

In that case, fix the ClusterRole. There is no need to restart the informer pod.

## Failure on DNS Resolution of the ServiceNow Instance Name

By default, the informer is installed with a dnsPolicy of Default. For more information, see the Kubernetes documentation on the Pod dnsPolicy parameter.
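You can check which dnsPolicy the informer pod is actually running with by querying the pod spec (a sketch; replace NAMESPACE, with $POD_NAME set as in the earlier commands):

```shell
# Prints the pod's DNS policy, for example "Default" or "ClusterFirst".
kubectl get pod $POD_NAME --namespace NAMESPACE -o jsonpath='{.spec.dnsPolicy}{"\n"}'
```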
If DNS resolution fails, you will see messages like this:

```
2023/11/24 13:41:20 Failed to send ecc message with payload size: 0 Post https://cnotal5.service-now.com/api/now/table/ecc_queue: dial tcp: lookup cnotal5.service-now.com on 10.60.41.190:53: read udp 10.225.11.24:58409->10.60.41.190:53: i/o timeout
```

In that case, we recommend that you redeploy with dnsPolicy=ClusterFirst. If you install with Helm, add the command-line argument `--set dnsPolicy=ClusterFirst`. If you are using the k8s_informer.yaml file, change the dnsPolicy value in the file.

## Incorrect Secret Keys

The informer expects a secret named k8s-informer-cred-INSTANCE_NAME in the relevant namespace, and this secret must contain the keys ".user" and ".password". If the secret is available but one of the expected keys is missing, you will see messages like this in the logs:

```
2024/02/25 07:54:28 Failed to read secret from /etc/credentials/.user open /etc/credentials/.user: no such file or directory
2023/11/26 07:55:58 Failed to send to SN instance Missing credentials. Will not try to get ECC messages
```

If you see these messages, follow the documentation to create the secret, then delete the informer pod; Kubernetes will restart it. Make sure the secret keys are ".user" and ".password".

## Failure Connecting to the ServiceNow Instance

In many cases, traffic from the cluster to the internet is blocked by a firewall or a network policy. In those cases, the pod logs will show an error on attempts to connect.
For example:

```
Failed to get ecc output messages Get https://myinstance.service-now.com/api/now/table/ecc_queue?sysparm_query=sys_created_onRELATIVEGT%40minute%40ago%4030%5Equeue=output%5Eagent=k8s_informer_2e105171-98c4-4780-9c10-675e12d8bb20%5Etopic=k8s_informer%5Estate=ready&sysparm_fields=sys_id,payload&sysparm_limit=1: read tcp 10.1.2.3:52914->148.139.125.122:443: read: connection reset by peer
```

Or:

```
Failed to get ecc output messages Get https://myinstance.service-now.com/api/now/table/ecc_queue?sysparm_query=sys_created_onRELATIVEGT%40minute%40ago%4030%5Equeue=output%5Eagent=k8s_informer_2e105171-98c4-4780-9c10-675e12d8bb20%5Etopic=k8s_informer%5Estate=ready&sysparm_fields=sys_id,payload&sysparm_limit=1: read tcp 10.1.2.3:52914->148.139.125.122:443: read: EOF
```

Consult your Kubernetes and network administrators in this case. To verify that the problem is not with the ServiceNow pod or instance, you can use a temporary pod that runs the curl command. Replace NAMESPACE with the namespace of the ServiceNow pod and run:

```
kubectl run mycurlpod --image=curlimages/curl -i -n NAMESPACE --tty -- sh
```

After getting a shell prompt, replace INSTANCE with your instance name and run:

```
curl -I https://INSTANCE.service-now.com
```

If you get an error, or the HTTP response status is not 200, it is clear that traffic to the ServiceNow instance is blocked. Following this experiment, make sure to delete the experimental pod:

```
kubectl delete pod mycurlpod -n NAMESPACE
```

## Turning off the TLS Certificate Validation

In some cases, due to network policies, the informer might fail to connect to the ServiceNow instance and report an x509 error in the logs.
For example:

```
2023/09/26 15:29:16 Failed to get ecc output messages Get https://myinstance.service-now.com/api/now/table/ecc_queue?sysparm_query=sys_created_onRELATIVEGT%40minute%40ago%4030%5Equeue=output%5Eagent=k8s_informer_2e105171-98c4-4780-9c10-675e12d8bb20%5Etopic=k8s_informer%5Estate=ready&sysparm_fields=sys_id,payload&sysparm_limit=1: x509: certificate signed by unknown authority
```

If the network issue cannot be resolved, we advise turning off the TLS certificate validation:

- With the Helm chart: add the command-line argument `--set skipTLSCertificateValidation=true` to your helm install command.
- With k8s_informer.yaml: place the value "true" in the line under SKIP_TLS_CERT_VALIDATION.

## Invalid Credentials

If the credentials provided are invalid, the messages in the logs will look like this:

```
2023/11/26 08:02:20 Failed to send to SN instance 401 Unauthorized. Invalid credentials
```

If you see this message, verify that you have the correct user and password, delete and recreate the secret correctly, and delete the informer pod. Kubernetes will restart the pod. If you believe the credentials are correct, check the user configuration on the ServiceNow instance:

- Verify the "Active" checkbox is checked.
- Verify the "Locked out" checkbox is unchecked.
- Verify the "Password needs reset" checkbox is unchecked.

## Credentials with Insufficient Role

If the credentials are valid but the user does not have the discovery_admin role, the messages in the logs will look like this:

```
2023/11/26 08:09:39 Failed to send to SN instance 403 Forbidden. User has insufficient roles
```

If you see this message, add the discovery_admin role to the user being used on the ServiceNow instance. There is no need to restart the pod.
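To distinguish a credentials problem (401) from a role problem (403) without redeploying, you can probe the ECC queue endpoint directly, for example from the mycurlpod shell described earlier (a sketch; INSTANCE, USER, and PASSWORD are placeholders for your instance name and the informer's credentials):

```shell
# Prints only the HTTP status code of a minimal ECC queue read:
# 200 means the credentials and role are fine, 401 matches the
# "Invalid credentials" log, 403 matches "insufficient roles".
curl -s -o /dev/null -w '%{http_code}\n' \
  -u 'USER:PASSWORD' \
  'https://INSTANCE.service-now.com/api/now/table/ecc_queue?sysparm_limit=1'
```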
## Failure of the Helm Command

Some Helm versions may throw this error when running helm install:

```
Error: template: k8s-informer-chart/templates/aws_secrets.yaml:1:6: executing "k8s-informer-chart/templates/aws_secrets.yaml" at <eq .Values.secretProvider "aws">: error calling eq: incompatible types for comparison
```

If you see this error, add the following parameter to the helm command:

```
--set secretProvider=""
```

## Post-Deployment Administration and Troubleshooting

### Grabbing Logs

To grab the logs from the "CNO for Visibility" pod running in a given cluster, navigate to "CNO for Visibility/Home" and click the row of the cluster. Then click "Grab Informer Logs" in the "Related Links" section. Wait up to two minutes and reload the page. The system grabs the last 1 MB of data from the most recent log and adds it as an attachment to the current record. Whenever you grab logs again, the system replaces the attachment with the newer log.

### Running On-Demand Full Discovery

To start an on-demand full discovery, navigate to "CNO for Visibility/Home" and click the row associated with the cluster you want to discover. Then click "Full Discovery" in the "Related Links" section. The "Full Discovery Status" field changes to "In Progress". Once the discovery is done, the field value changes to "Completed". The time needed to complete the full discovery depends on the cluster size and the load on the instance.

### Pausing and Resuming the Informer

To pause or resume the informer, navigate to "CNO for Visibility/Home", click the row associated with the relevant cluster, then click Pause or Resume in the "Related Links" section. When you click Pause, the status field first changes to "Pausing" and, after up to one minute, to "Paused". When you click Resume, the status first changes to "Resuming" and, after up to one minute, to "Up".
### Restarting the Informer

If you need to restart the informer pod, navigate to "CNO for Visibility/Home", click the row associated with the relevant cluster, then click "Restart Informer" in the "Related Links" section. The informer program exits and Kubernetes restarts it.
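If the instance UI is not available, the same effect can be achieved from the cluster side: as noted in the troubleshooting steps above, deleting the informer pod causes Kubernetes to restart it (a sketch; replace INSTANCE_NAME and NAMESPACE as in the earlier commands):

```shell
# Delete the informer pod; its controller recreates it automatically.
export POD_NAME=$(kubectl get pods --namespace k8s-informer-INSTANCE_NAME \
  -l "app=k8s_informer-INSTANCE_NAME" -o jsonpath="{.items[0].metadata.name}")
kubectl delete pod $POD_NAME --namespace NAMESPACE
```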