# Administrating and Troubleshooting CNO for Visibility (Cloud-Native-Operations)

## Troubleshooting Deployment Issues

Following the deployment, check whether the informer pod is running. Replace INSTANCE_NAME and NAMESPACE in the commands below and run them:

```
export POD_NAME=$(kubectl get pods --namespace k8s-informer-INSTANCE_NAME -l "app=k8s_informer-INSTANCE_NAME" -o jsonpath="{.items[0].metadata.name}")
kubectl get pod $POD_NAME --namespace NAMESPACE
```

If the pod is not running, either Kubernetes failed to pull the informer image, or the secret holding the instance credentials was not found. If Kubernetes failed to pull the image, the pod will be in the ErrImagePull or ImagePullBackOff state. If the secret does not exist, the pod will be in the ContainerCreating state. Run `kubectl get events -n NAMESPACE` to find the exact reason.

The informer expects a secret named k8s-informer-cred-INSTANCE_NAME in the relevant namespace. If this secret does not exist, the pod will stay in the ContainerCreating state and you will see an event similar to this:

```
51s  Warning  FailedMount  pod/k8s-informer-cnotal5-67bd77784d-bzb7k  MountVolume.SetUp failed for volume "credentials" : secret "k8s-informer-cred-cnotal5"
```

If the informer pod is running, run the following commands to see the informer logs. Remember to replace INSTANCE_NAME and NAMESPACE:

```
export POD_NAME=$(kubectl get pods --namespace k8s-informer-INSTANCE_NAME -l "app=k8s_informer-INSTANCE_NAME" -o jsonpath="{.items[0].metadata.name}")
kubectl logs $POD_NAME --namespace NAMESPACE
```

## Failure to Connect to the Kubernetes API Server

Failure to connect to the API server may be the result of a misconfigured network policy or another networking component in the cluster. The symptom is that the informer pod restarts multiple times and its status is CrashLoopBackOff.
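To confirm a restart loop, you can also read the restart count and last termination reason straight from the pod status (a sketch using standard Kubernetes pod status fields; replace NAMESPACE, with $POD_NAME set as in the commands above):

```shell
# Restart count of the informer container.
kubectl get pod $POD_NAME --namespace NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'

# Reason the container last terminated (for example Error or OOMKilled).
kubectl get pod $POD_NAME --namespace NAMESPACE \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```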
The informer log, and the status message seen in the pod YAML, will be:

```
Failed to get the kube-system namespace Get "https://172.20.0.1:443/api/v1/namespaces/kube-system": dial tcp 172.20.0.1:443 i/o timeout
```

To verify that the problem is not with the ServiceNow pod, you can use a temporary pod that runs the curl command. Replace NAMESPACE with the namespace of the ServiceNow pod and run:

```
kubectl run mycurlpod --image=curlimages/curl -i -n NAMESPACE --tty -- sh
```

After getting a shell prompt, run the following command:

```
curl -k -I https://kubernetes.default.svc
```

If you get the same "i/o timeout" error, or the command hangs, there is indeed network blocking between the namespace and the API server. Following this experiment, make sure to delete the experimental pod:

```
kubectl delete pod mycurlpod -n NAMESPACE
```

## Failure to Access Kubernetes Resources

If the deployment uses a custom ClusterRole rather than the one provided with the Helm chart, the custom ClusterRole must have list, get, and watch access to all relevant resources. When it does not have sufficient rights, you will see messages like this in the logs:

```
W1126 07:25:17.368897  1 reflector.go:424] pkg/mod/k8s.io/client-go@v0.26.0/tools/cache/reflector.go:169: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:k8s-informer-cnotal5:servicenow-cnotal5" cannot list resource "endpoints" in API group "" at the cluster scope
```

In that case, fix the ClusterRole. There is no need to restart the informer pod.

## Failure on DNS Resolution of the ServiceNow Instance Name

By default, the informer is installed with a dnsPolicy of Default. For more information, see the Kubernetes documentation on the Pod dnsPolicy parameter.
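You can check which dnsPolicy the informer pod is actually running with by querying the pod spec (a sketch; replace NAMESPACE, with $POD_NAME set as in the earlier commands):

```shell
# Prints the pod's DNS policy, for example "Default" or "ClusterFirst".
kubectl get pod $POD_NAME --namespace NAMESPACE -o jsonpath='{.spec.dnsPolicy}{"\n"}'
```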
If DNS resolution fails, you will see messages like this:

```
2023/11/24 13:41:20 Failed to send ecc message with payload size: 0 Post https://cnotal5.service-now.com/api/now/table/ecc_queue: dial tcp: lookup cnotal5.service-now.com on 10.60.41.190:53: read udp 10.225.11.24:58409->10.60.41.190:53: i/o timeout
```

In that case, we recommend that you redeploy with dnsPolicy=ClusterFirst. If you install with Helm, add the command-line argument `--set dnsPolicy=ClusterFirst`. If you are using the k8s_informer.yaml file, change the dnsPolicy value in the file.

## Incorrect Secret Keys

The informer expects a secret named k8s-informer-cred-INSTANCE_NAME in the relevant namespace, and this secret must contain the keys ".user" and ".password". If the secret is available but one of the expected keys is missing, you will see messages like this in the logs:

```
2024/02/25 07:54:28 Failed to read secret from /etc/credentials/.user open /etc/credentials/.user: no such file or directory
2023/11/26 07:55:58 Failed to send to SN instance Missing credentials. Will not try to get ECC messages
```

If you see these messages, follow the documentation to create the secret, then delete the informer pod; Kubernetes will restart it. Make sure the secret keys are ".user" and ".password".

## Failure Connecting to the ServiceNow Instance

In many cases, traffic from the cluster to the internet is blocked by a firewall or a network policy. In those cases, the pod logs will show an error on attempts to connect.
For example:

```
Failed to get ecc output messages Get https://myinstance.service-now.com/api/now/table/ecc_queue?sysparm_query=sys_created_onRELATIVEGT%40minute%40ago%4030%5Equeue=output%5Eagent=k8s_informer_2e105171-98c4-4780-9c10-675e12d8bb20%5Etopic=k8s_informer%5Estate=ready&sysparm_fields=sys_id,payload&sysparm_limit=1: read tcp 10.1.2.3:52914->148.139.125.122:443: read: connection reset by peer
```

Or:

```
Failed to get ecc output messages Get https://myinstance.service-now.com/api/now/table/ecc_queue?sysparm_query=sys_created_onRELATIVEGT%40minute%40ago%4030%5Equeue=output%5Eagent=k8s_informer_2e105171-98c4-4780-9c10-675e12d8bb20%5Etopic=k8s_informer%5Estate=ready&sysparm_fields=sys_id,payload&sysparm_limit=1: read tcp 10.1.2.3:52914->148.139.125.122:443: read: EOF
```

Consult your Kubernetes and network administrators in this case. To verify that the problem is not with the ServiceNow pod or instance, you can use a temporary pod that runs the curl command. Replace NAMESPACE with the namespace of the ServiceNow pod and run:

```
kubectl run mycurlpod --image=curlimages/curl -i -n NAMESPACE --tty -- sh
```

After getting a shell prompt, replace INSTANCE with your instance name and run:

```
curl -I https://INSTANCE.service-now.com
```

If you get an error, or the HTTP response status is not 200, it is clear that traffic to the ServiceNow instance is blocked. Following this experiment, make sure to delete the experimental pod:

```
kubectl delete pod mycurlpod -n NAMESPACE
```

## Turning off the TLS Certificate Validation

In some cases, due to network policies, the informer might fail to connect to the ServiceNow instance and report an x509 error in the logs.
For example:

```
2023/09/26 15:29:16 Failed to get ecc output messages Get https://myinstance.service-now.com/api/now/table/ecc_queue?sysparm_query=sys_created_onRELATIVEGT%40minute%40ago%4030%5Equeue=output%5Eagent=k8s_informer_2e105171-98c4-4780-9c10-675e12d8bb20%5Etopic=k8s_informer%5Estate=ready&sysparm_fields=sys_id,payload&sysparm_limit=1: x509: certificate signed by unknown authority
```

If the network issue cannot be resolved, we advise turning off the TLS certificate validation:

- With the Helm chart: add the command-line argument `--set skipTLSCertificateValidation=true` to your helm install command.
- With k8s_informer.yaml: place the value "true" in the line under SKIP_TLS_CERT_VALIDATION.

## Invalid Credentials

If the credentials provided are invalid, the messages in the logs will look like this:

```
2023/11/26 08:02:20 Failed to send to SN instance 401 Unauthorized. Invalid credentials
```

If you see this message, verify that you have the correct user and password, delete and recreate the secret correctly, and delete the informer pod. Kubernetes will restart the pod. If you believe the credentials are correct, check the user configuration on the ServiceNow instance:

- Verify the "Active" checkbox is checked.
- Verify the "Locked out" checkbox is unchecked.
- Verify the "Password needs reset" checkbox is unchecked.

## Credentials with Insufficient Role

If the credentials are valid but the user does not have the discovery_admin role, the messages in the logs will look like this:

```
2023/11/26 08:09:39 Failed to send to SN instance 403 Forbidden. User has insufficient roles
```

If you see this message, add the discovery_admin role to the user being used on the ServiceNow instance. There is no need to restart the pod.
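To distinguish a credentials problem (401) from a role problem (403) without redeploying, you can probe the ECC queue endpoint directly, for example from the mycurlpod shell described earlier (a sketch; INSTANCE, USER, and PASSWORD are placeholders for your instance name and the informer's credentials):

```shell
# Prints only the HTTP status code of a minimal ECC queue read:
# 200 means the credentials and role are fine, 401 matches the
# "Invalid credentials" log, 403 matches "insufficient roles".
curl -s -o /dev/null -w '%{http_code}\n' \
  -u 'USER:PASSWORD' \
  'https://INSTANCE.service-now.com/api/now/table/ecc_queue?sysparm_limit=1'
```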
## Failure of the Helm Command

Some Helm versions may throw this error when running helm install:

```
Error: template: k8s-informer-chart/templates/aws_secrets.yaml:1:6: executing "k8s-informer-chart/templates/aws_secrets.yaml" at <eq .Values.secretProvider "aws">: error calling eq: incompatible types for comparison
```

If you see this error, add the following parameter to the helm command:

```
--set secretProvider=""
```

## Post-Deployment Administration and Troubleshooting

### Grabbing Logs

To grab the logs from the "CNO for Visibility" pod running in a given cluster, navigate to "CNO for Visibility/Home" and click the row of the cluster. Then click "Grab Informer Logs" in the "Related Links" section. Wait up to two minutes and reload the page. The system grabs the last 1 MB of data from the most recent log and adds it as an attachment to the current record. Whenever you grab logs again, the system replaces the attachment with the newer log.

### Running On-Demand Full Discovery

To start an on-demand full discovery, navigate to "CNO for Visibility/Home" and click the row associated with the cluster you want to discover. Then click "Full Discovery" in the "Related Links" section. The "Full Discovery Status" field changes to "In Progress". Once the discovery is done, the field value changes to "Completed". The time needed to complete the full discovery depends on the cluster size and the load on the instance.

### Pausing and Resuming the Informer

To pause or resume the informer, navigate to "CNO for Visibility/Home", click the row associated with the relevant cluster, then click Pause or Resume in the "Related Links" section. When you click Pause, the status field first changes to "Pausing" and, after up to one minute, to "Paused". When you click Resume, the status first changes to "Resuming" and, after up to one minute, to "Up".
### Restarting the Informer

If you need to restart the informer pod, navigate to "CNO for Visibility/Home", click the row associated with the relevant cluster, then click "Restart Informer" in the "Related Links" section. The informer program exits and Kubernetes restarts it.
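If the instance UI is not available, the same effect can be achieved from the cluster side: as noted in the troubleshooting steps above, deleting the informer pod causes Kubernetes to restart it (a sketch; replace INSTANCE_NAME and NAMESPACE as in the earlier commands):

```shell
# Delete the informer pod; its controller recreates it automatically.
export POD_NAME=$(kubectl get pods --namespace k8s-informer-INSTANCE_NAME \
  -l "app=k8s_informer-INSTANCE_NAME" -o jsonpath="{.items[0].metadata.name}")
kubectl delete pod $POD_NAME --namespace NAMESPACE
```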