Anton Eyntrop

July 28, 2023 ・ Kubernetes

Kubernetes troubleshooting

Striving for things to “just work” is one of the reasons Kubernetes exists. Its design lets you, the administrator, declare how the architecture is supposed to look, then sit back and watch Kubernetes set it up for uninterrupted service.

However, even if all the parts fit together well, something is always bound to break, unless the only thing running inside your Kubernetes cluster is a bash loop that prints the current date once a minute.

Let's take a look at the techniques you can use to see what's going on inside your cluster.

External reasons

When self-hosting Kubernetes instead of offloading the task to a cloud provider of choice, you also take on some extra risks and responsibilities. This is not to discourage you from having your own bare-metal setup, but to stress the importance of keeping a good eye on overall infrastructure health.

Depending on how many nodes you have in the cluster, how much they differ in the hardware/OS/kernel department (hopefully, not much), how these nodes communicate with each other, and how sensitive your workloads are to all of those things, the real cause of the problem you're dealing with might actually be unrelated to Kubernetes.

For example, if you're having network connectivity issues between two deployments running on two different nodes, it might be quicker to set up a test case that simulates the same behaviour between these two machines, running natively on the host OS instead of inside Kubernetes.
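
As a minimal sketch, a quick way to check raw throughput and packet loss between two nodes is iperf3, assuming it's installed on both of them (the IP address below is a placeholder for the first node's address):

iperf3 --server                 # run on the first node
iperf3 --client 10.0.0.12       # run on the second node, pointing at the first node's IP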

Secondly, and this continues the previous point in a way, Kubernetes exposes a lot of hooks for monitoring systems you can use to peek at its insides. This provides a firm basis for your understanding of how the cluster operates, and it's not to be overlooked. It's always better to react to an increase in I/O delays on a node than to a severely degraded service half an hour later.
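
Even without a full monitoring stack, kubectl gives you a rough picture of node health. The first command below assumes the metrics-server add-on is installed:

kubectl top nodes          # CPU/memory usage per node (requires metrics-server)
kubectl get nodes -o wide  # node readiness, versions and addresses at a glance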

What does it mean to troubleshoot Kubernetes

To put it simply, Kubernetes is a set of components that tell other parts of your system what to do. Kubernetes itself won't be running the application containers or redirecting network traffic; it uses existing system facilities for that: Docker and iptables, for example. This gives you an opportunity to bypass Kubernetes altogether when it doesn't give you enough information, and look directly at the lower level stuff.

You are going to be looking at the lower level stuff all the time.
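
As a sketch of what that looks like on a Docker-based node (the container name filter and namespace here are just this article's examples), you can check what the container runtime and kube-proxy are actually doing:

docker ps --filter name=python-app           # what the runtime is really running on this node
sudo iptables-save | grep our-production     # NAT rules kube-proxy programmed, if it runs in iptables mode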

View Kubernetes logs and events

There’s a bunch of different GUI tools that can be used to manage and troubleshoot a cluster, some of them Web-based and free and even running out-of-the-box, like Rancher Web UI, some of them are native desktop utilities you have to pay for.

However, the most direct, fastest and easiest-to-automate way is, as always, the command line.

For the purposes of this guide, we will use the kubectl utility to communicate with our cluster.

There’s two things you will use: logs and events.

Logs are simple: they're the data written to stdout and stderr by containers running inside your cluster. By trimming your containerized applications' log output to a practical minimum, you'll make future troubleshooting easier.

Events are Kubernetes' internal record of everything happening within the cluster. Every object, be it a deployment or a namespace, has a set of events associated with it: status changes, replication issues, object updates and so on.

Right now, we’re having problems with one of the workloads. First, we need to get a look over everything running inside the namespace:

kubectl get all --namespace our-production

This gives you a look at the state of deployments, pods and services. If some of them are failing, those are the ones you'll be paying attention to.
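
If the namespace is large, you can narrow the output down to pods that aren't running cleanly; field selectors on status.phase are one way to do that:

kubectl get pods --namespace our-production --field-selector status.phase!=Running,status.phase!=Succeeded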

It’s important to remember that “kubectl get all” is actually a bit of a misnomer: it returns commonly known things like workloads and services, but won’t return any CRD objects, even explicitly associated with a namespace you’re querying.

Looking at the “kubectl get all” output, we see that one of the deployments is failing, and its pod has problems starting up. Better get a full description of the pod:

kubectl describe --namespace our-production pod/my-pod-zl6m6

Now you’re looking at a human-readable representation of your Kubernetes object with some additional metadata. This includes latest events for the object, most likely at the very bottom of the “kubectl describe”.

For a general overview of what’s wrong, events are a good start. We’ll get events for a single pod:

kubectl get event --namespace our-production --field-selector involvedObject.name=my-pod-zl6m6

You can also query all the events in the cluster (‘--watch’ keeps following new events in real time):

kubectl get event --all-namespaces --watch
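
Events aren't guaranteed to come back in chronological order, so sorting them by timestamp often helps:

kubectl get event --all-namespaces --sort-by=.lastTimestamp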

…but once you've identified the problematic pod, it might be more convenient to view only its events:

kubectl get event --namespace our-production --field-selector involvedObject.name=python-app-x84m2

And dig into its logs, too:

kubectl logs --namespace our-production pod/python-app-x84m2

…and if there’s multiple containers inside the pod, single out one:

kubectl logs --namespace our-production pod/python-app-x84m2 --container python3-service
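
A couple of flags worth knowing here: ‘--previous’ shows the logs from the container's last terminated run, which is what you want for a crash-looping pod, while ‘--tail’ and ‘--since’ keep the output manageable:

kubectl logs --namespace our-production pod/python-app-x84m2 --container python3-service --previous
kubectl logs --namespace our-production pod/python-app-x84m2 --tail=100 --since=15m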

All of this basically gives you direct access to the application logs, as if there were no Kubernetes involved at all. But what should we do if the logs are lacking, and we need to execute something within the container?

Using kubectl exec to start a shell inside a container, we can descend into the container's environment and operate it like an ordinary system:

kubectl exec --namespace our-production -it pod/python-app-x84m2 -- /bin/bash

That’s assuming you have bash inside the container. You might have sh instead or some other, less common, shell, or no shell at all — this is something that depends on the container.

Another route, which works even when you can't or don't want to touch the pod itself, is nsenter: a tool that lets you execute host binaries inside a container's namespaces. It can only be used on the system that's actually running the container, so first you need to find the node that currently hosts the container, along with the container ID:

kubectl describe --namespace our-production pod/python-app-x84m2 | egrep 'Node:|Container ID:'

Then, SSH into the node and get the container's PID:

docker inspect -f '{{.State.Pid}}' 34716b0076603fa8d1789ed5c51587f4e7f93e297610bf6fa9639b210f39eb0d
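
That's for a Docker-based node. If the node runs containerd instead, the same information is available through crictl; this is a sketch assuming crictl is installed on the node and <container-id> is the ID found above:

crictl inspect --output go-template --template '{{.info.pid}}' <container-id>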

Now you can use this PID to enter the container's namespaces with nsenter. For example, this is how you would run netstat against only the network namespace of the container whose process has PID 4753:

nsenter -t 4753 -n netstat

‘-n’ is important here because it tells nsenter to enter the target's network namespace. If the utility you're executing needs a different namespace, you'll have to pass a different flag: ‘-m’ for mount, ‘-p’ for PID, ‘-u’ for UTS, and so on.
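
The same pattern works for other host utilities; for example, inspecting the container's network interfaces and listening sockets with the host's own ip and ss binaries:

nsenter -t 4753 -n ip addr      # interfaces as the container sees them
nsenter -t 4753 -n ss -tlnp     # listening TCP sockets inside the container's network namespace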

This way, you can do pretty much anything you want with a container's environment, although it's still a good idea to have basic tools baked into the containers you deploy.

Conclusion

To make your life easier, and your technical backend stronger, you have to be prepared before the hammer strikes. There's no such thing as a healthy Kubernetes cluster on top of shaky TCP/IP connectivity, faulty storage, or a severe lack of resources. Invest your time into a monitoring solution beforehand.

With this covered, you'll have a much better experience troubleshooting your Kubernetes applications. It becomes a mostly predictable path: looking over individual container logs to piece together how things started to break, or, when that fails, stripping systems away one by one and getting “under the hood”, so to say, although in this case you open the hood only to find another one underneath.

This is a good thing, though: the more systems you get to look into, the more useful information you can extract. What Kubernetes is unable to give you, you can retrieve from Docker, or whatever else you might be using as a container engine.

Remember, there’s always more places to dig.
