Kubernetes PreStop hook for container crash troubleshooting
16 Dec 2021 by dzlab

Debugging container crashes on Kubernetes can be frustrating, especially those caused by out-of-memory issues. Kubernetes will kill a container that fails to respond to health checks and will probably start a new one (depending on your restart policy). This can happen very quickly, leaving no time to detect the crash and capture any troubleshooting information needed to understand the root cause of the initial crash (e.g. Out Of Memory).
Luckily, Kubernetes provides Container Lifecycle Hooks that can be used to run custom logic on specific events. In our case, we can leverage the PreStop hook to capture troubleshooting information, such as a heap profile, after a container crashes (e.g. a Spark executor crashing) and save it for later analysis before the container disappears.
Note that detecting the actual crash may not be straightforward and will probably depend on the runtime. For instance, JVM applications create .core and .dump files after a crash, so we could simply check for the presence of those files to determine that a crash happened. Furthermore, the JVM provides the HeapDumpOnOutOfMemoryError and HeapDumpPath options, see documentation - link.
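For example, the JVM itself can be asked to write a heap dump when it runs out of memory. A minimal sketch using those standard flags (app.jar is just a placeholder for your application):
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/azure -jar app.jar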
This article focuses on storing a heap profile, captured by a PreStop hook after a container crash, into Azure ADLS using Azure File volumes.
The buggy Application
As a toy example, we will use an application that exposes an API that we can hit to cause a real crash.
spec:
  containers:
  - name: java-k8s-playground
    image: dmetzler/java-k8s-playground
    # Health probes (1)
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /q/health/live
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 10
    readinessProbe:
      failureThreshold: 15
      httpGet:
        path: /q/health/ready
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
As you can see from the manifest, this application exposes the following APIs:
- a liveness probe at /q/health/live
- a readiness probe at /q/health/ready
- an API to cause a crash at /shoot
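Once the pod is running, we can quickly sanity-check the health endpoints, for instance by port-forwarding to the deployment (assuming it is named java-k8s-playground, as in the complete manifest later in this article):
kubectl port-forward deploy/java-k8s-playground 8080:8080
# in another terminal
curl localhost:8080/q/health/live
curl localhost:8080/q/health/ready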
Option 1: Static
The first option is to create an Azure file share manually and use it as a volume in the application pod.
Azure file share
To store the troubleshooting information we need to create an ADLS storage account and a file share using the Azure CLI.
1- Create environment variables to make life easier
STORAGE_ACCOUNT_NAME=myadls
RESOURCE_GROUP=my-azrg
LOCATION=westus
STORAGE_SHARE_NAME=myshare
Note: you may need to create a resource group before continuing if you don’t have one.
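If needed, a resource group can be created along these lines:
az group create -n $RESOURCE_GROUP -l $LOCATION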
2- Create the ADLS storage account
az storage account create -n $STORAGE_ACCOUNT_NAME -g $RESOURCE_GROUP -l $LOCATION --sku Standard_LRS
3- Export the connection string to ADLS as an environment variable; it will be used when creating the Azure file share
export STORAGE_CONNECTION_STRING=$(az storage account show-connection-string -n $STORAGE_ACCOUNT_NAME -g $RESOURCE_GROUP -o tsv)
4- Create the Azure file share
az storage share create -n $STORAGE_SHARE_NAME --connection-string $STORAGE_CONNECTION_STRING
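Optionally, we can double-check that the share exists before moving on, for example:
az storage share exists -n $STORAGE_SHARE_NAME --connection-string $STORAGE_CONNECTION_STRING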
5- Get the storage account key so we can store it later in Kubernetes as a secret
STORAGE_KEY=$(az storage account keys list --resource-group $RESOURCE_GROUP --account-name $STORAGE_ACCOUNT_NAME --query "[0].value" -o tsv)
Kubernetes configuration
Create a Kubernetes secret to store the storage account name and the previously retrieved storage key $STORAGE_KEY
kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=$STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY
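We can verify that the secret was created (the values are base64-encoded):
kubectl get secret azure-secret -o yaml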
Note: If your Kubernetes cluster is not running on Azure (i.e. you are not using AKS) you may need to set up an AzureFile storage class for your cluster.
Now, we can use the configured Azure file share as a volume in any pod. For instance, we can create a read/write volume and mount it on /mnt/azure like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    . . .
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: mycontainer
        . . .
        volumeMounts:
        - mountPath: "/mnt/azure"
          name: volume
      volumes:
      - name: volume
        azureFile:
          secretName: azure-secret
          shareName: myshare
          readOnly: false
After defining the volume and mount path, we can configure our PreStop hook to store the heap dump into /mnt/azure with something like this:
spec:
  template:
    spec:
      containers:
      - name: mycontainer
        . . .
        lifecycle:
          preStop:
            exec:
              command:
              - sh
              - -c
              - "jmap -dump:live,format=b,file=/mnt/azure/$(hostname).hprof 1"
Note how we save the heap dump with:
jmap -dump:live,format=b,file=/mnt/azure/$(hostname).hprof 1
Here 1 is the process ID of the JVM, which typically runs as PID 1 inside the container, and $(hostname) makes the dump file name unique per pod.
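As noted earlier, we may only want to act when a crash actually produced a dump. A possible variant of the hook (assuming the JVM was started with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp) simply copies whatever dump files exist instead of calling jmap:
lifecycle:
  preStop:
    exec:
      command:
      - sh
      - -c
      # copy any heap dumps produced by the JVM; do nothing (and don't fail) if there are none
      - "cp /tmp/*.hprof /mnt/azure/ 2>/dev/null || true"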
Complete example
Now, we can put together the application definition, the volume configuration, and the PreStop hook into a deployment manifest that looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: java-k8s-playground
  name: java-k8s-playground
spec:
  replicas: 1
  selector:
    matchLabels:
      app: java-k8s-playground
  template:
    metadata:
      labels:
        app: java-k8s-playground
    spec:
      containers:
      - name: java-k8s-playground
        image: dmetzler/java-k8s-playground
        # Health probes (1)
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /q/health/live
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 10
        readinessProbe:
          failureThreshold: 15
          httpGet:
            path: /q/health/ready
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 3
        # We ask to run the troubleshoot script when stopping (2)
        lifecycle:
          preStop:
            exec:
              command:
              - sh
              - -c
              - "jmap -dump:live,format=b,file=/mnt/azure/$(hostname).hprof 1"
        volumeMounts:
        - mountPath: "/mnt/azure"
          name: myvolume
      volumes:
      - name: myvolume
        azureFile:
          secretName: azure-secret
          shareName: myshare
          readOnly: false
      terminationGracePeriodSeconds: 30
After deploying with kubectl apply -f manifest.yaml, open a shell in the java-k8s-playground container and run the following command to trigger a crash by simply calling the crash API:
curl -XPUT localhost:8080/shoot
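Alternatively, if you prefer not to open an interactive shell, the same call can be made in one shot with kubectl exec (assuming curl is available in the image):
kubectl exec deploy/java-k8s-playground -- curl -X PUT localhost:8080/shoot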
After the container crashes, the heap profile will be available in the storage space we configured.
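For instance, the dump can be listed and downloaded from the file share with the Azure CLI (the file name depends on the pod's hostname, so check the listing first):
az storage file list -s $STORAGE_SHARE_NAME --connection-string $STORAGE_CONNECTION_STRING -o table
az storage file download -s $STORAGE_SHARE_NAME -p <pod-hostname>.hprof --connection-string $STORAGE_CONNECTION_STRING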
Option 2: Dynamic
Instead of manually creating an Azure file share and referencing it directly in the application pod, we can have it provisioned dynamically and linked to the pod. For details on this approach see the documentation - link.
First, create a storage class of type kubernetes.io/azure-file and define optional parameters (e.g. Azure SKU name, mount options):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: my-azurefile
provisioner: kubernetes.io/azure-file
mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - actimeo=30
parameters:
  skuName: Standard_LRS
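The storage class can be applied and checked like any other resource (assuming the manifest above is saved as storageclass.yaml):
kubectl apply -f storageclass.yaml
kubectl get storageclass my-azurefile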
Second, define a PersistentVolumeClaim that uses the previous storage class; it will provision a storage account in the same resource group as the Azure Kubernetes cluster.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-managed-disk
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: my-azurefile
  resources:
    requests:
      storage: 5Gi
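After applying the claim (assuming it is saved as pvc.yaml), we can check that it becomes Bound, which means a share was provisioned behind it:
kubectl apply -f pvc.yaml
kubectl get pvc pvc-managed-disk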
Finally, we define a volume that will use this claim as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    . . .
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: mycontainer
        . . .
        volumeMounts:
        - mountPath: "/mnt/azure"
          name: volume
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: pvc-managed-disk
After applying the deployment we can simulate a crash and verify that the heap profile is stored, as done in the previous section.
Resources
Here are additional resources for alternative ways to capture troubleshooting information from crashes
- How to do a Java/JVM heap dump in Kubernetes - link
- How to get a heap dump from Kubernetes k8s pod? - link
- Take Thread-Dump or Heap-Dump of K8s Pod · Issue #12 · aws-samples/kubernetes-for-java-developers - link
- Is there a way to dump Crash data on the Crashing POD before it dies. - link
- How to Dump OOMKilled Process on Kubernetes - link
- Troubleshooting a Java Application In Kubernetes - link
- How To Get A Heap Dump From Kubernetes K8S Pod - link
- 7 Ways to Capture Java Heap Dumps - link