Add snap troubleshooting how-to
berkayoz committed Jan 13, 2025
1 parent a725e45 commit 3af6ea1
Showing 2 changed files with 189 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/src/snap/howto/index.md
@@ -26,6 +26,7 @@ two-node-ha
Set up Enhanced Platform Awareness <epa>
contribute
Get support <support>
troubleshooting
```

---
188 changes: 188 additions & 0 deletions docs/src/snap/howto/troubleshooting.md
@@ -0,0 +1,188 @@
# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your {{product}} cluster.

## Common issues

Your issue may already have a known solution. Check the [troubleshooting reference][snap-troubleshooting-reference] page for a list of common issues and their solutions. Otherwise, continue with this guide to troubleshoot your {{product}} cluster.

## Verify that the cluster status is ready

Verify that the cluster status is ready by running the following command:

```
sudo k8s status
```

You should see output similar to the following:
```
cluster status: ready
control plane nodes: 10.94.106.249:6400 (voter), 10.94.106.208:6400 (voter), 10.94.106.99:6400 (voter)
high availability: yes
datastore: k8s-dqlite
network: enabled
dns: enabled at 10.152.183.106
ingress: disabled
load-balancer: disabled
local-storage: enabled at /var/snap/k8s/common/rawfile-storage
gateway enabled
```
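
For scripted checks, the `cluster status` line of this output can be tested directly. The following is a minimal sketch, not part of {{product}}: the helper name `cluster_is_ready` is hypothetical, and it assumes the output format shown above.

```shell
# Hypothetical helper: reads `k8s status` output on stdin and succeeds
# only if the "cluster status" line reports "ready".
cluster_is_ready() {
  grep -Eq '^cluster status:[[:space:]]+ready[[:space:]]*$'
}

# Example usage on a cluster node:
#   sudo k8s status | cluster_is_ready && echo "cluster is ready"
```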


## Verify that the API server is healthy

Verify that the API server is healthy and reachable by running the following command on a control-plane node:

```
sudo k8s kubectl get all
```

This command lists the resources in the `default` namespace. If the API server is healthy, you should see output similar to the following:
```
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m
```

If the API server cannot be reached, a typical error message looks like this:
```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```

A failure could mean that the API server on that node is unhealthy. Check the status of the API server service by running the following command:
```
sudo systemctl status snap.k8s.kube-apiserver
```

The logs of the API server service can be accessed by running the following command:
```
sudo journalctl -xe -u snap.k8s.kube-apiserver
```

If you are trying to reach the API server from a host that is not a control-plane node, a failure could mean that:
* The API server is not reachable due to network issues or firewall limitations
* The API server is failing on the control-plane node that's being reached
* The control-plane node that's being reached is down

```{warning}
When you run `sudo k8s config` on a control-plane node, you retrieve a kubeconfig file that uses this node's IP address.
```

Try reaching the API server on a different control-plane node by updating the IP address that's used in the kubeconfig file.
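
One way to do this without editing the file by hand is sketched below. The helper name `retarget_kubeconfig` is hypothetical, and the sketch assumes that the kubeconfig contains a line of the form `server: https://<IPv4 address>:<port>`; check your own file before relying on it.

```shell
# Hypothetical helper: reads a kubeconfig on stdin and writes a copy whose
# server address points at another control-plane node.
# $1: IPv4 address of the control-plane node to try instead.
# Assumption: the server line looks like "server: https://<IPv4>:<port>".
retarget_kubeconfig() {
  new_ip="$1"
  sed -E "s#(server:[[:space:]]*https://)[^:/]+#\1${new_ip}#"
}

# Example usage (file names are placeholders):
#   retarget_kubeconfig 10.94.106.208 < kubeconfig > kubeconfig-alt
```

The port and the rest of the file are left unchanged, so the original kubeconfig stays available for comparison.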

## Verify that the cluster nodes are healthy

Verify that the nodes in the cluster are healthy by running the following command and checking if all of the nodes have the `Ready` status:

```
sudo k8s kubectl get nodes
```
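
On larger clusters it can help to print only the nodes that need attention. The sketch below assumes the default column layout of `kubectl get nodes`, where the second column is the status; the helper name `unready_nodes` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose status is not Ready.
# A cordoned node ("Ready,SchedulingDisabled") still counts as Ready here.
unready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}

# Example usage:
#   sudo k8s kubectl get nodes --no-headers | unready_nodes
```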

## Troubleshooting an unhealthy node

Each node of a {{product}} cluster runs a set of required services. Which services are required depends on the type of node.

Services running on both types of nodes:
* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on control-plane nodes:
* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on worker nodes:
* `k8s-apiserver-proxy`

Check the status of these services on the failing node by running the following command:

```
sudo systemctl status snap.k8s.<service>
```

The logs of a failing service can be checked by running the following command:

```
sudo journalctl -xe -u snap.k8s.<service>
```

If the issue indicates a problem with the configuration of the services on the node, it could be helpful to examine the arguments used to run these services.

The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/<service>`.
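
The per-service checks above can be looped over the service lists from this section. This is a minimal sketch under the assumption that those lists are complete for your version; the function names `required_services` and `check_services` are hypothetical.

```shell
# Hypothetical helper: prints the required services for a node type,
# using the lists given in this guide.
# $1: "control-plane" or "worker"
required_services() {
  common="k8sd kubelet containerd kube-proxy"
  case "$1" in
    control-plane)
      echo "$common kube-apiserver kube-controller-manager kube-scheduler k8s-dqlite" ;;
    worker)
      echo "$common k8s-apiserver-proxy" ;;
    *)
      echo "unknown node type: $1" >&2
      return 1 ;;
  esac
}

# Hypothetical helper: reports every required service that systemd does not
# consider active on the current node.
check_services() {
  for svc in $(required_services "$1"); do
    systemctl is-active --quiet "snap.k8s.${svc}" || echo "not active: snap.k8s.${svc}"
  done
}

# Example usage on a worker node:
#   check_services worker
```

Any service reported as not active is a candidate for the `systemctl status` and `journalctl` commands shown above.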

## Verify that the system pods are healthy

Verify that the pods that are a part of the cluster are healthy by running the following command and checking if all of the pods are `Running` and `Ready`:

```
sudo k8s kubectl get pods -n kube-system
```

The pods under the `kube-system` namespace belong to {{product}} features such as `network` that provide important functionality. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.
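
To narrow the list down to pods that need attention, the output can be filtered. The sketch below assumes the default column layout of `kubectl get pods` (name, ready count, status); the helper name `unhealthy_pods` is hypothetical. It also flags pods in the `Completed` state, which may be expected for one-off jobs.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the name of every pod that is not Running or not fully Ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY column, for example "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example usage:
#   sudo k8s kubectl get pods -n kube-system --no-headers | unhealthy_pods
```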

## Troubleshooting a failing pod

Events on a failing pod can be checked by running the following command:

```
sudo k8s kubectl describe pod <pod-name> -n <namespace>
```

Logs on a failing pod can be checked by running the following command:

```
sudo k8s kubectl logs <pod-name> -n <namespace>
```
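
The `describe` output is long, and the events are usually the most relevant part. The following sketch prints only that part; it assumes that the events appear in a final section starting with a line that begins with `Events:`, and the helper name `pod_events` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and prints
# everything from the "Events:" line to the end of the output.
pod_events() {
  sed -n '/^Events:/,$p'
}

# Example usage:
#   sudo k8s kubectl describe pod <pod-name> -n <namespace> | pod_events
```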

## Using the built-in inspection script

{{product}} ships with a script that compiles a complete report on {{product}} and the system it is running on. This report is essential for bug reports, but it is also a useful way to confirm whether the system is working and to collect all the relevant data in one place.

To run the inspection script, enter the following command. Admin privileges are required to collect all the data:

```
sudo /snap/k8s/current/k8s/scripts/inspect.sh
```

You should see output similar to the following:
```
Collecting service information
Running inspection on a control-plane node
INFO: Service k8s.containerd is running
INFO: Service k8s.kube-proxy is running
INFO: Service k8s.k8s-dqlite is running
INFO: Service k8s.k8sd is running
INFO: Service k8s.kube-apiserver is running
INFO: Service k8s.kube-controller-manager is running
INFO: Service k8s.kube-scheduler is running
INFO: Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO: Copy service args to the final report tarball
Collecting k8s cluster-info
INFO: Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO: Copy SBOM to the final report tarball
Collecting system information
INFO: Copy uname to the final report tarball
INFO: Copy snap diagnostics to the final report tarball
INFO: Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO: Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /root/inspection-report-20250109_132806.tar.gz
```

This confirms the services that are running, and the resulting report file can be viewed to get a detailed look at every aspect of the system.
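
When the script is run from automation, the path of the report can be picked out of the final line. This is a minimal sketch that assumes the `SUCCESS:` line has the wording shown in the example output above; the helper name `report_path` is hypothetical.

```shell
# Hypothetical helper: reads the inspection script output on stdin and prints
# the path of the report tarball taken from the final SUCCESS line.
report_path() {
  sed -n 's/^SUCCESS: Report tarball is at //p'
}

# Example usage:
#   report="$(sudo /snap/k8s/current/k8s/scripts/inspect.sh | report_path)"
#   sudo tar -tzf "$report"   # list the contents of the report
```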

## Reporting a bug

If you cannot solve your issue and believe the fault may lie in {{product}}, please [file an issue on the project repository][].

To help us deal with issues effectively, please include the report produced by the inspection script, any additional logs, and a summary of the issue.

<!-- Links -->

[file an issue on the project repository]: https://github.com/canonical/k8s-snap/issues/new/choose
[snap-troubleshooting-reference]: ../reference/troubleshooting
