# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your {{product}} cluster.

## Common issues

Maybe your issue has already been solved? Check out the [troubleshooting reference][charm-troubleshooting-reference] page for a list of common issues and their solutions. Otherwise, continue with this guide to troubleshoot your {{product}} cluster.

## Verify that the cluster status is ready

Verify that the cluster status is ready by running the following command:

```
juju status
```

You should see output similar to the following:
```
Model        Controller           Cloud/Region         Version  SLA          Timestamp
k8s-testing  localhost-localhost  localhost/localhost  3.6.1    unsupported  09:06:50Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
k8s         1.32.0   active      1  k8s         1.32/beta  179  no       Ready
k8s-worker  1.32.0   active      1  k8s-worker  1.32/beta  180  no       Ready

Unit           Workload  Agent  Machine  Public address  Ports     Message
k8s-worker/0*  active    idle   1        10.94.106.154             Ready
k8s/0*         active    idle   0        10.94.106.136   6443/tcp  Ready

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.94.106.136  juju-380ff2-0  ubuntu@24.04      Running
1        started  10.94.106.154  juju-380ff2-1  ubuntu@24.04      Running
```

This example output provides useful information. The `Workload` column shows the status of a given service, and the `Message` column describes its health. During deployment and maintenance these workload statuses update to reflect what a given node is doing. For example, the workload may read `maintenance` while the message describes this maintenance as `Ensuring snap installation`.

During normal operation the workload should read `active`, the agent column (which reflects what the Juju agent is doing) should read `idle`, and the message will either say `Ready` or another descriptive term.
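
To follow the status while a deployment or an upgrade settles, you can ask Juju to refresh the output periodically, for example every five seconds:

```
juju status --watch 5s
```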

## Verify that the API server is healthy

Fetch the kubeconfig file for a control-plane node (unit) in the cluster by running the following command:

```
juju run k8s/0 get-kubeconfig
```

```{warning}
When you run `juju run k8s/0 get-kubeconfig`, the kubeconfig file you retrieve points at that specific unit's IP address.
```
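
To use the retrieved kubeconfig with `kubectl`, save it to a file. A minimal sketch, assuming `yq` is installed locally and that the action returns the file contents under a `kubeconfig` result key:

```
juju run k8s/0 get-kubeconfig | yq '.kubeconfig' > ./k8s-kubeconfig
kubectl --kubeconfig ./k8s-kubeconfig get nodes
```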

Verify that the API server is healthy and reachable by running the following command:

```
kubectl get all
```

This command lists the resources that exist under the default namespace. If the API server is healthy, you should see output similar to the following:

```
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m
```

A typical error message may look like this if the API server cannot be reached:

```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```

The status of the API server service can be checked by running the following command:

```
juju exec --unit k8s/0 -- systemctl status snap.k8s.kube-apiserver
```

The logs of the API server service can be accessed by running the following command:

```
juju exec --unit k8s/0 -- journalctl -u snap.k8s.kube-apiserver
```

A failure could mean that:

* The API server is not reachable due to network issues or firewall limitations
* The API server on the particular node is unhealthy
* The control-plane node that's being reached is down

Try reaching the API server on a different unit by retrieving the kubeconfig file with `juju run <k8s/unit#> get-kubeconfig`.
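
You can also probe the API server's health endpoint directly, using a unit address and port shown in `juju status`. A quick check (`-k` skips certificate verification; Kubernetes' default RBAC rules allow unauthenticated access to `/livez`, provided anonymous authentication has not been disabled):

```
curl -k https://10.94.106.136:6443/livez
```

A healthy API server answers with `ok`.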

## Verify that the cluster nodes are healthy

Verify that the nodes in the cluster are healthy by running the following command and checking if all of the nodes have the `Ready` status:

```
kubectl get nodes
```
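
If a node is not `Ready`, its conditions and recent events usually indicate the cause, for example a failing kubelet or resource pressure:

```
kubectl describe node <node-name>
```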

## Troubleshooting an unhealthy node

Each node in a {{product}} cluster runs a set of required services. Which services are required depends on the type of node.

Services running on both types of nodes:

* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on control-plane nodes:

* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on worker nodes:

* `k8s-apiserver-proxy`

SSH to the unhealthy node by running the following command:

```
juju ssh <k8s/unit#>
```

Check the status of these services on the failing node by running the following command:

```
sudo systemctl status snap.k8s.<service>
```
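
To check all the required services in one pass, here is a small sketch you can run on a control-plane node (trim the list of services for worker nodes):

```
# Print the state of each required k8s snap service on this node.
for svc in k8sd kubelet containerd kube-proxy \
           kube-apiserver kube-controller-manager kube-scheduler k8s-dqlite; do
    echo "snap.k8s.$svc: $(sudo systemctl is-active snap.k8s.$svc)"
done
```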

The logs of a failing service can be checked by running the following command:

```
sudo journalctl -xe -u snap.k8s.<service>
```

If the issue indicates a problem with the configuration of the services on the node, examine the arguments used to run these services.

The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/<service>`.
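
For example, to review the arguments of the API server on a control-plane node:

```
cat /var/snap/k8s/common/args/kube-apiserver
```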

## Verify that the system pods are healthy

Verify that the cluster's pods are healthy by running the following command and checking if all of the pods are `Running` and `Ready`:

```
kubectl get pods -n kube-system
```

The pods in the `kube-system` namespace belong to {{product}}'s features such as `network`. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.
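
To surface only the pods that are not currently running, you can filter on the pod phase:

```
kubectl get pods -n kube-system --field-selector=status.phase!=Running
```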

## Troubleshooting a failing pod

Events on a failing pod can be checked by running the following command:

```
kubectl describe pod <pod-name> -n <namespace>
```

Logs on a failing pod can be checked by running the following command:

```
kubectl logs <pod-name> -n <namespace>
```
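
If a pod has restarted, the logs of the previous container instance often contain the actual error:

```
kubectl logs <pod-name> -n <namespace> --previous
```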

## Using the built-in inspection script

{{product}} ships with a script that compiles a complete report on {{product}} and the system on which it is running. This is an essential tool for bug reports and for investigating whether a system is (or isn't) working, as it collects all the relevant data in one place.

The inspection script can be executed on a specific unit by running the following commands:

```
juju exec --unit <k8s/unit#> -- sudo /snap/k8s/current/k8s/scripts/inspect.sh /home/ubuntu/inspection-report.tar.gz
juju scp <k8s/unit#>:/home/ubuntu/inspection-report.tar.gz ./
```

The command output is similar to the following:
```
Collecting service information
Running inspection on a control-plane node
INFO: Service k8s.containerd is running
INFO: Service k8s.kube-proxy is running
INFO: Service k8s.k8s-dqlite is running
INFO: Service k8s.k8sd is running
INFO: Service k8s.kube-apiserver is running
INFO: Service k8s.kube-controller-manager is running
INFO: Service k8s.kube-scheduler is running
INFO: Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO: Copy service args to the final report tarball
Collecting k8s cluster-info
INFO: Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO: Copy SBOM to the final report tarball
Collecting system information
INFO: Copy uname to the final report tarball
INFO: Copy snap diagnostics to the final report tarball
INFO: Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO: Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /home/ubuntu/inspection-report.tar.gz
```

This script confirms that the services are running, and the resulting report file can be viewed to get a detailed look at every aspect of the system.
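
To browse the collected data locally, unpack the report tarball:

```
tar -xzf inspection-report.tar.gz
```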

## Collecting debug information

To collect comprehensive debug output from your {{product}} cluster, install and run [juju-crashdump][] on a computer that has the Juju client installed, with the current controller and model pointing at your {{product}} deployment.

```
sudo snap install juju-crashdump --classic --channel edge
juju-crashdump -a debug-layer -a config
```

Running the `juju-crashdump` script will generate a tarball of debug information that includes systemd unit status and logs, Juju logs, charm unit data, and Kubernetes cluster information. It is recommended that you include this tarball when filing a bug.

## Reporting a bug

If you cannot solve your issue and believe the fault may lie in {{product}}, please [file an issue on the project repository][].

Help us deal effectively with issues by including the report obtained from the inspection script, the tarball obtained from `juju-crashdump`, any additional logs, and a summary of the issue.

<!-- Links -->

[file an issue on the project repository]: https://github.com/canonical/k8s-operator/issues/new/choose
[charm-troubleshooting-reference]: ../reference/troubleshooting
[juju-crashdump]: https://github.com/juju/juju-crashdump

# Troubleshooting

This page provides techniques for troubleshooting common {{product}} issues.