Add how-to troubleshoot for charm deployments
berkayoz committed Jan 13, 2025
1 parent 2ce6865 commit f185610
Showing 4 changed files with 230 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/src/charm/howto/index.md
@@ -29,6 +29,7 @@ custom-registry
Upgrade patch version <upgrade-patch>
Upgrade minor version <upgrade-minor>
Validate the cluster <validate>
troubleshooting
```

---
224 changes: 224 additions & 0 deletions docs/src/charm/howto/troubleshooting.md
@@ -0,0 +1,224 @@
# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your {{product}} cluster.

## Common issues

Your issue may already have a known solution. Check the [troubleshooting reference][charm-troubleshooting-reference] page for a list of common issues and their solutions. Otherwise, continue with this guide to troubleshoot your {{product}} cluster.

## Verify that the cluster status is ready

Verify that the cluster status is ready by running the following command:

```
juju status
```

You should see output similar to the following:
```
Model        Controller           Cloud/Region         Version  SLA          Timestamp
k8s-testing  localhost-localhost  localhost/localhost  3.6.1    unsupported  09:06:50Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
k8s         1.32.0   active      1  k8s         1.32/beta  179  no       Ready
k8s-worker  1.32.0   active      1  k8s-worker  1.32/beta  180  no       Ready

Unit           Workload  Agent  Machine  Public address  Ports     Message
k8s-worker/0*  active    idle   1        10.94.106.154             Ready
k8s/0*         active    idle   0        10.94.106.136   6443/tcp  Ready

Machine  State    Address        Inst id        Base               AZ  Message
0        started  10.94.106.136  juju-380ff2-0  [email protected]      Running
1        started  10.94.106.154  juju-380ff2-1  [email protected]      Running
```
This output tells us several things. The `Workload` column shows the status of a given service, and the `Message` column shows the health of that service in the cluster. During deployment and maintenance, these workload statuses update to reflect what a given node is doing. For example, the workload may read `maintenance` while the message describes that maintenance as `Ensuring snap installation`.


During normal operation the Workload should read `active`, the Agent column (which reflects what the Juju agent is doing) should read `idle`, and the messages will either say `Ready` or another descriptive term.
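
On a model with many units it can help to filter the status output down to the units that have not settled. The following is a minimal sketch, not part of {{product}}: the function name `unsettled_units` is made up for illustration, and it assumes the default tabular layout of `juju status`, where the unit rows sit between the `Unit` header row and either a blank line or the `Machine` header row.

```shell
# Print units whose workload is not "active" or whose agent is not "idle".
# Reads the default tabular `juju status` output on stdin.
# Assumption: the Unit section starts at the "Unit" header row and ends at a
# blank line or at the "Machine" header row.
unsettled_units() {
  awk '
    $1 == "Unit"               { in_units = 1; next }  # start of the Unit section
    NF == 0 || $1 == "Machine" { in_units = 0 }        # end of the Unit section
    in_units && ($2 != "active" || $3 != "idle") { print $1, $2, $3 }
  '
}

# Usage (needs a Juju client pointed at the model):
#   juju status | unsettled_units
```

An empty result means every unit reports `active` and `idle`. Any line printed names a unit worth a closer look, together with its workload and agent status.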

## Verify that the API server is healthy

Fetch the kubeconfig file for a control-plane node (unit) in the cluster by running the following command:

```
juju run k8s/0 get-kubeconfig
```

```{warning}
Running `juju run k8s/0 get-kubeconfig` returns a kubeconfig file that uses the IP address of that specific unit, so `kubectl` will contact the API server on that unit.
```
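
To use the retrieved kubeconfig with `kubectl`, it needs to be saved to a file. The sketch below shows one way to do that. It rests on two assumptions that you should check against your environment: that `yq` (version 4) is installed, and that the action result is YAML with the file contents under a `kubeconfig` key. The function name `fetch_kubeconfig` and the destination path are illustrative only.

```shell
# Save the kubeconfig from one unit to a file.
# Assumptions: `yq` v4 is installed, and the action result is YAML with the
# file contents stored under a `kubeconfig` key.
fetch_kubeconfig() {
  unit="$1"   # for example k8s/0
  dest="$2"   # for example ./kubeconfig-k8s-0
  juju run "$unit" get-kubeconfig | yq eval '.kubeconfig' > "$dest"
}

# Usage:
#   fetch_kubeconfig k8s/0 ./kubeconfig-k8s-0
#   KUBECONFIG=./kubeconfig-k8s-0 kubectl get all
```

Writing to a separate file per unit, rather than overwriting `~/.kube/config`, makes it easy to compare how the API server behaves on different units.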

Verify that the API server is healthy and reachable by running the following command:

```
kubectl get all
```

This command lists resources that exist under the default namespace. You should see output similar to the following if the API server is healthy:

```
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 29m
```

A typical error message may look like this if the API server cannot be reached:

```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```

The status of the API server service can be checked by running the following command:

```
juju exec --unit k8s/0 -- systemctl status snap.k8s.kube-apiserver
```

The logs of the API server service can be accessed by running the following command:

```
juju exec --unit k8s/0 -- journalctl -u snap.k8s.kube-apiserver
```
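
When the deployment has several control-plane units, `juju exec` can target a whole application instead of a single unit. The following sketch checks the API server service on every unit in one pass. It assumes the control-plane application is named `k8s`, as in the status output shown earlier; the function name `apiserver_state` is illustrative.

```shell
# Report the state of the kube-apiserver service on every unit of an
# application. Assumption: the control-plane application is named "k8s"
# unless another name is passed as the first argument.
apiserver_state() {
  app="${1:-k8s}"
  juju exec --application "$app" -- systemctl is-active snap.k8s.kube-apiserver
}

# Usage:
#   apiserver_state            # all units of the "k8s" application
#   apiserver_state my-k8s     # a differently named application
```

`systemctl is-active` prints a one-word state such as `active` or `failed` for each unit, which is quicker to scan than the full `systemctl status` output.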

A failure could mean that:
* The API server is not reachable due to network issues or firewall limitations
* The API server on the particular node is unhealthy
* The control-plane node that's being reached is down

Try reaching the API server on a different unit by retrieving the kubeconfig file with `juju run <k8s/unit#> get-kubeconfig`.

## Verify that the cluster nodes are healthy

Verify that the nodes in the cluster are healthy by running the following command and checking if all of the nodes have the `Ready` status:

```
kubectl get nodes
```
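
In a larger cluster, a small filter can pick out the nodes that need attention. This is a minimal sketch with an illustrative function name; it assumes the standard column order of `kubectl get nodes`, where the second column is the status.

```shell
# Print the name and status of every node whose STATUS column is not exactly
# "Ready". Reads `kubectl get nodes --no-headers` output on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage (needs a working kubeconfig):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Note that a cordoned node reports `Ready,SchedulingDisabled` and is therefore listed as well, which is usually what you want when looking for nodes that cannot take workloads.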

## Troubleshooting an unhealthy node

Every node in a {{product}} cluster runs a set of required services. Which services are required depends on the type of node.

Services running on both types of nodes:
* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on control-plane nodes:
* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on worker nodes:
* `k8s-apiserver-proxy`

SSH to the unhealthy node by running the following command:

```
juju ssh <k8s/unit#>
```

Check the status of these services on the failing node by running the following command:

```
sudo systemctl status snap.k8s.<service>
```

The logs of a failing service can be checked by running the following command:

```
sudo journalctl -xe -u snap.k8s.<service>
```
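
Checking the services one at a time is tedious, so the following sketch loops over the services listed above for a given node type. It is meant to be run on the node itself, after `juju ssh`. The function name `check_services` and its `control-plane` and `worker` arguments are illustrative and not part of {{product}}.

```shell
# Print the state of each required service for a node type.
# Run this on the node itself. The service names come from the lists above;
# the function name and its arguments are illustrative.
check_services() {
  services="k8sd kubelet containerd kube-proxy"
  case "$1" in
    control-plane)
      services="$services kube-apiserver kube-controller-manager kube-scheduler k8s-dqlite" ;;
    worker)
      services="$services k8s-apiserver-proxy" ;;
    *)
      echo "usage: check_services control-plane|worker" >&2
      return 2 ;;
  esac
  for svc in $services; do
    # `systemctl is-active` prints the state and exits non-zero unless the
    # unit is active, so the exit code is deliberately ignored here.
    state=$(systemctl is-active "snap.k8s.$svc" 2>/dev/null) || true
    echo "snap.k8s.$svc: ${state:-unknown}"
  done
}

# Usage:
#   check_services control-plane
#   check_services worker
```

Any service that is not reported as `active` is a good candidate for the `journalctl` command shown above.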

If the issue indicates a problem with the configuration of the services on the node, it could be helpful to examine the arguments used to run these services.

The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/<service>`.
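
When only one setting is of interest, a small helper can pull it out of the arguments file. This is a sketch under an assumption you should verify on your node: that the file lists one `--flag=value` entry per line. The function name `service_arg` and the flag in the usage example are illustrative.

```shell
# Print the line that sets one flag in a service's arguments file.
# Assumption: the file lists one `--flag=value` entry per line.
service_arg() {
  args_file="$1"   # for example /var/snap/k8s/common/args/kube-apiserver
  flag="$2"        # for example --secure-port
  grep -e "^${flag}=" "$args_file"
}

# Usage (on the node; reading the file may require sudo):
#   service_arg /var/snap/k8s/common/args/kube-apiserver --secure-port
```

The helper prints nothing and returns a non-zero exit code when the flag is not set in the file.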

## Verify that the system pods are healthy

Verify that the pods that are a part of the cluster are healthy by running the following command and checking if all of the pods are `Running` and `Ready`:

```
kubectl get pods -n kube-system
```

The pods under the `kube-system` namespace belong to {{product}} features such as `network` that provide important functionality. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.
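
To narrow the list down to pods that need attention, the output can be filtered on the `READY` and `STATUS` columns. The sketch below uses an illustrative function name and assumes the standard column order of `kubectl get pods` (`NAME`, `READY`, `STATUS`, and so on).

```shell
# Print pods that are not Running or whose READY count is incomplete.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY looks like "1/2": ready containers / total
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage (needs a working kubeconfig):
#   kubectl get pods -n kube-system --no-headers | unhealthy_pods
```

Pods from finished jobs show up with a `Completed` status and are listed too, so treat the result as a list of candidates rather than a list of failures.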

## Troubleshooting a failing pod

Events on a failing pod can be checked by running the following command:

```
kubectl describe pod <pod-name> -n <namespace>
```

Logs on a failing pod can be checked by running the following command:

```
kubectl logs <pod-name> -n <namespace>
```
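
For a pod that keeps restarting, the logs of the current container are often empty, and the useful error is in the previous container instance. The following sketch gathers events plus current and previous logs for one pod. The function name `pod_diagnostics` is illustrative; the `kubectl` flags used are standard ones.

```shell
# Gather events and logs for one pod, including logs from the previous
# container instance, where a crash-looping pod usually left its error.
pod_diagnostics() {
  pod="$1"
  ns="$2"
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod"
  kubectl logs "$pod" -n "$ns" --all-containers
  # --previous fails when no container has restarted, so that is not fatal.
  kubectl logs "$pod" -n "$ns" --all-containers --previous || true
}

# Usage (needs a working kubeconfig):
#   pod_diagnostics <pod-name> <namespace>
```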

## Using the built-in inspection script

{{product}} ships with a script to compile a complete report on {{product}} and the system which it is running on. This is essential for bug reports, but is also a useful way of confirming the system is (or isn’t) working and collecting all the relevant data in one place.

The inspection script can be executed on a specific unit by running the following commands:

```
juju exec --unit <k8s/unit#> -- sudo /snap/k8s/current/k8s/scripts/inspect.sh /home/ubuntu/inspection-report.tar.gz
juju scp <k8s/unit#>:/home/ubuntu/inspection-report.tar.gz ./
```

You should see output similar to the following:
```
Collecting service information
Running inspection on a control-plane node
INFO: Service k8s.containerd is running
INFO: Service k8s.kube-proxy is running
INFO: Service k8s.k8s-dqlite is running
INFO: Service k8s.k8sd is running
INFO: Service k8s.kube-apiserver is running
INFO: Service k8s.kube-controller-manager is running
INFO: Service k8s.kube-scheduler is running
INFO: Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO: Copy service args to the final report tarball
Collecting k8s cluster-info
INFO: Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO: Copy SBOM to the final report tarball
Collecting system information
INFO: Copy uname to the final report tarball
INFO: Copy snap diagnostics to the final report tarball
INFO: Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO: Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /home/ubuntu/inspection-report.tar.gz
```

This confirms the services that are running, and the resulting report file can be viewed to get a detailed look at every aspect of the system.
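
Once the tarball has been copied to your machine, it can be unpacked and browsed locally. This is a minimal sketch using standard `tar` and `find`; the function name `unpack_report` and the default destination directory are illustrative, and the layout inside the tarball is not assumed.

```shell
# Unpack an inspection report into a directory and list the files it contains.
unpack_report() {
  tarball="$1"
  dest="${2:-./inspection-report}"
  mkdir -p "$dest"
  tar -xzf "$tarball" -C "$dest"
  find "$dest" -type f | sort
}

# Usage:
#   unpack_report ./inspection-report.tar.gz
```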

## Collecting debug information

To collect comprehensive debug output from your {{product}} cluster, install and run [juju-crashdump][] on a computer that has the Juju client installed, with the current controller and model pointing at your {{product}} deployment.

```
sudo snap install juju-crashdump --classic --channel edge
juju-crashdump -a debug-layer -a config
```

Running the `juju-crashdump` script will generate a tarball of debug information that includes systemd unit status and logs, Juju logs, charm unit data, and Kubernetes cluster information. It is recommended that you include this tarball when filing a bug.

## Reporting a bug

If you cannot solve your issue and believe the fault may lie in {{product}}, please [file an issue on the project repository][].

To help us deal effectively with issues, it is very useful to include the report obtained from the inspect script, the tarball obtained from `juju-crashdump`, as well as any additional logs, and a summary of the issue.

<!-- Links -->

[file an issue on the project repository]: https://github.com/canonical/k8s-operator/issues/new/choose
[charm-troubleshooting-reference]: ../reference/troubleshooting
[juju-crashdump]: https://github.com/juju/juju-crashdump
1 change: 1 addition & 0 deletions docs/src/charm/reference/index.md
@@ -18,6 +18,7 @@ architecture
Ports and Services <ports-and-services>
charm-configurations
Community <community>
troubleshooting
```

4 changes: 4 additions & 0 deletions docs/src/charm/reference/troubleshooting.md
@@ -0,0 +1,4 @@
# Troubleshooting

This page provides techniques for troubleshooting common {{product}}
issues.
