Add snap troubleshooting how-to
berkayoz committed Jan 13, 2025
1 parent a725e45 commit 3efcac8
Showing 2 changed files with 189 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/src/snap/howto/index.md
@@ -26,6 +26,7 @@ two-node-ha
Set up Enhanced Platform Awareness <epa>
contribute
Get support <support>
troubleshooting
```

---
188 changes: 188 additions & 0 deletions docs/src/snap/howto/troubleshooting.md
@@ -0,0 +1,188 @@
# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your {{product}} cluster.

## Common issues

Your issue may already have a documented solution. Check the [troubleshooting reference][snap-troubleshooting-reference] page for a list of common issues and their solutions. Otherwise, continue with this guide to troubleshoot your {{product}} cluster.

## Verify that the cluster status is ready

Verify that the cluster status is ready by running the following command:

```
sudo k8s status
```

You should see output similar to the following:
```
cluster status: ready
control plane nodes: 10.94.106.249:6400 (voter), 10.94.106.208:6400 (voter), 10.94.106.99:6400 (voter)
high availability: yes
datastore: k8s-dqlite
network: enabled
dns: enabled at 10.152.183.106
ingress: disabled
load-balancer: disabled
local-storage: enabled at /var/snap/k8s/common/rawfile-storage
gateway: enabled
```


## Verify that the API server is healthy

Verify that the API server is healthy and reachable by running the following command on a control-plane node:

```
sudo k8s kubectl get all
```

This command lists resources that exist under the default namespace. You should see output similar to the following if the API server is healthy:
```
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 29m
```

A typical error message may look like this if the API server cannot be reached:
```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```

A failure could mean that the API server on that particular node is unhealthy. The status of the API server service can be checked by running the following command:
```
sudo systemctl status snap.k8s.kube-apiserver
```

The logs of the API server service can be accessed by running the following command:
```
sudo journalctl -xe -u snap.k8s.kube-apiserver
```

If you are trying to reach the API server from a host that is not a control-plane node, a failure could mean that:
* The API server is not reachable due to network issues or firewall limitations (a quick reachability check is shown below)
* The API server is failing on the control-plane node that's being reached
* The control-plane node that's being reached is down
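
If network issues or firewall limitations are suspected, a quick reachability check from the remote host can narrow this down. This is a sketch: replace `<control-plane-ip>` with the address of the node you are trying to reach, and adjust the port if your API server is not listening on the default `6443`:

```
nc -zv <control-plane-ip> 6443
```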

```{warning}
Running `sudo k8s config` on a control-plane node retrieves a kubeconfig file that uses that node's IP address.
```

Try reaching the API server on a different control-plane node by updating the IP address that's used in the kubeconfig file.
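
For example, assuming a second control-plane node is reachable at `10.94.106.208` (an address borrowed from the sample status output above), the kubeconfig can be pointed at it. This is a sketch; `--kubeconfig` is a standard `kubectl` flag:

```
# Save the kubeconfig retrieved from the current node
sudo k8s config > cluster.kubeconfig
# Point the server field at the other control-plane node
sed -i 's|server: https://.*:6443|server: https://10.94.106.208:6443|' cluster.kubeconfig
# Test the connection through the updated kubeconfig
sudo k8s kubectl --kubeconfig ./cluster.kubeconfig get nodes
```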

## Verify that the cluster nodes are healthy

Verify that the nodes in the cluster are healthy by running the following command and checking if all of the nodes have the `Ready` status:

```
sudo k8s kubectl get nodes
```
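
Healthy output looks similar to the following (node names, roles, ages and versions here are illustrative):

```
NAME     STATUS   ROLES                  AGE   VERSION
node-1   Ready    control-plane,worker   29m   v1.32.0
node-2   Ready    control-plane,worker   28m   v1.32.0
```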

## Troubleshooting an unhealthy node

Each node in a {{product}} cluster runs a set of required services. Which services are required depends on the type of node.

Services running on both types of nodes:
* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on control-plane nodes:
* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on worker nodes:
* `k8s-apiserver-proxy`

Check the status of these services on the failing node by running the following command:

```
sudo systemctl status snap.k8s.<service>
```
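
To check all of the required services in one pass, a small loop can be used. The following is a sketch for a control-plane node; on a worker node, trim the list to the worker services above:

```
for svc in k8sd kubelet containerd kube-proxy \
           kube-apiserver kube-controller-manager kube-scheduler k8s-dqlite; do
    systemctl is-active --quiet "snap.k8s.$svc" && echo "$svc: active" || echo "$svc: INACTIVE"
done
```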

The logs of a failing service can be checked by running the following command:

```
sudo journalctl -xe -u snap.k8s.<service>
```
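
For example, to look at only the error-level messages from the `kubelet` service over the last hour:

```
sudo journalctl -u snap.k8s.kubelet -p err --since "1 hour ago"
```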

If the issue indicates a problem with the configuration of the services on the node, examine the arguments used to run these services.

The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/<service>`.
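
For example, to review the arguments used by the API server on a control-plane node:

```
sudo cat /var/snap/k8s/common/args/kube-apiserver
```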

## Verify that the system pods are healthy

Verify that the cluster's pods are healthy by running the following command and checking if all of the pods are `Running` and `Ready`:

```
sudo k8s kubectl get pods -n kube-system
```

The pods in the `kube-system` namespace belong to {{product}} features such as `network`. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.
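
To surface only the pods that are not currently in the `Running` phase, a field selector can help (note that this also filters out successfully completed pods):

```
sudo k8s kubectl get pods -n kube-system --field-selector 'status.phase!=Running'
```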

## Troubleshooting a failing pod

Events on a failing pod can be checked by running the following command:

```
sudo k8s kubectl describe pod <pod-name> -n <namespace>
```
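
Events for the whole namespace, sorted by time, can also reveal scheduling or image-pull problems:

```
sudo k8s kubectl get events -n <namespace> --sort-by=.lastTimestamp
```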

Logs on a failing pod can be checked by running the following command:

```
sudo k8s kubectl logs <pod-name> -n <namespace>
```
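
If the pod is crash-looping, the logs of the previous container instance are often the most useful:

```
sudo k8s kubectl logs <pod-name> -n <namespace> --previous
```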

## Using the built-in inspection script

{{product}} ships with a script that compiles a complete report on {{product}} and the system on which it is running. It is an essential tool for bug reports and for investigating whether a system is (or isn't) working, collecting all the relevant data in one place.

Run the inspection script by entering the following command (admin privileges are required to collect all the data):

```
sudo /snap/k8s/current/k8s/scripts/inspect.sh
```

The command output is similar to the following:
```
Collecting service information
Running inspection on a control-plane node
INFO: Service k8s.containerd is running
INFO: Service k8s.kube-proxy is running
INFO: Service k8s.k8s-dqlite is running
INFO: Service k8s.k8sd is running
INFO: Service k8s.kube-apiserver is running
INFO: Service k8s.kube-controller-manager is running
INFO: Service k8s.kube-scheduler is running
INFO: Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO: Copy service args to the final report tarball
Collecting k8s cluster-info
INFO: Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO: Copy SBOM to the final report tarball
Collecting system information
INFO: Copy uname to the final report tarball
INFO: Copy snap diagnostics to the final report tarball
INFO: Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO: Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /root/inspection-report-20250109_132806.tar.gz
```

This script confirms that the services are running, and the resulting report file can be viewed to get a detailed look at every aspect of the system.
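
To look inside the report, list or extract the tarball (the file name will match the timestamp printed by the script):

```
sudo tar -tzf /root/inspection-report-20250109_132806.tar.gz
sudo tar -xzf /root/inspection-report-20250109_132806.tar.gz -C /tmp
```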

## Reporting a bug

If you cannot solve your issue and believe the fault may lie in {{product}}, please [file an issue on the project repository][].

Help us deal effectively with issues by including the report obtained from the inspection script, any additional logs, and a summary of the issue.

<!-- Links -->

[file an issue on the project repository]: https://github.com/canonical/k8s-snap/issues/new/choose
[snap-troubleshooting-reference]: ../reference/troubleshooting
