Runtime Pod Restarting Frequently

Written by Albert Shau

Problem

The runtime pod restarts frequently because it runs out of memory (OutOfMemory errors). This can manifest as pipeline run failures in which the pipeline throws an exception, such as an IOException with a 502 response code, when it tries to talk to the runtime service.

For example:

java.io.IOException: Failed to send message for program run program_run:Altipal_DataLake.SQLSERVER_CARGA_MINUTOS.-SNAPSHOT.workflow.DataPipelineWorkflow.266d15ac-2bab-11ec-bdc4-42cf72c2cfe8 to https://[cdf-uri]:443/v3Internal/runtime/namespaces/[ns]/apps/[pipeline]/versions/-SNAPSHOT/workflows/DataPipelineWorkflow/runs/[runid]/topics/metrics8. Respond code: 502. Error: unknown error
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeClient.throwIfError(RuntimeClient.java:209) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeClient.sendMessages(RuntimeClient.java:115) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeClientService$TopicRelayer.processMessages(RuntimeClientService.java:234) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeClientService$TopicRelayer.publishMessages(RuntimeClientService.java:200) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeClientService.runTask(RuntimeClientService.java:103) ~[na:na]

The OutOfMemory issues are due in part to a buildup of historical run information on the runtime pod. To verify that this is the case, open a shell on the pod with kubectl exec and check the size of the contents of the /data/ldb directory:

kubectl exec --stdin --tty cdap-[instance name]-runtime-0 -- /bin/bash
root@cdap-coaltipaldfprd-runtime-0:/data/ldb# du -shc *
20K	cdap_system.entity.registry
16G	cdap_system.entity.store.d
12K	cdap_system.entity.store.i
16G	total

We have observed problems when the entity store (cdap_system.entity.store.d) grows to many gigabytes in size, as with the 16 GB shown above.
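The same check can be scripted without an interactive shell. The following is a minimal sketch, not a supported tool: the function names and the 5 GB default threshold are assumptions made for illustration (the observations above only establish that many gigabytes is a problem), and the kubectl call assumes the current context and namespace already point at the instance.

```shell
#!/bin/sh
# Hypothetical threshold in GiB; not an official limit. Tune it to your environment.
THRESHOLD_GB="${THRESHOLD_GB:-5}"

# Succeeds (exit 0) when a size given in KiB is larger than a threshold given in GiB.
exceeds_threshold() {
  size_kb="$1"
  threshold_gb="$2"
  [ "$size_kb" -gt $((threshold_gb * 1024 * 1024)) ]
}

# Prints the size of /data/ldb on the runtime pod in KiB.
# $1 is the instance name, the [instance name] part of the pod name.
ldb_size_kb() {
  kubectl exec "cdap-$1-runtime-0" -- du -sk /data/ldb | cut -f1
}
```

For example, `exceeds_threshold "$(ldb_size_kb my-instance)" "$THRESHOLD_GB" && echo "ldb directory is large"` would flag an instance like the one above, where my-instance is a placeholder name.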

See CDAP-18553 (Runtime pod goes OOM after every few mins) for more information, including which versions the fix will be released in.

Solution(s)

The temporary solution is to clean up the cdap_system.entity.* directories in the /data/ldb directory and then restart the pod. Coordinate with the customer on a time for this, because it causes a minor service interruption and can potentially cause pipeline failures.
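The cleanup could look roughly like the sketch below. It only prints the commands so they can be reviewed before anything is run. The function names are hypothetical. It also assumes that removing the directories with rm -rf is an acceptable way to clean them up, that the pod is managed by a controller that recreates it after deletion (so that kubectl delete pod acts as a restart), and that the current kubectl context and namespace point at the instance.

```shell
#!/bin/sh
# Builds the runtime pod name from the instance name, following the
# cdap-[instance name]-runtime-0 pattern used earlier in this article.
runtime_pod() {
  printf 'cdap-%s-runtime-0' "$1"
}

# Prints (does not run) the two cleanup commands for the given instance.
cleanup_commands() {
  pod="$(runtime_pod "$1")"
  # sh -c makes the glob expand inside the pod rather than on the local machine.
  printf '%s\n' "kubectl exec ${pod} -- sh -c 'rm -rf /data/ldb/cdap_system.entity.*'"
  # Assumption: deleting the pod causes its controller to start a fresh one.
  printf '%s\n' "kubectl delete pod ${pod}"
}
```

Running `cleanup_commands my-instance` (my-instance is a placeholder) prints the two commands. Run them by hand, in order, during the agreed maintenance window.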