Problem
Pipeline execution fails when the execution engine is set to Spark, with errors of the following form in the pipeline logs:
“Lost task x.y in stage z.xx”
Symptom(s)
Out of memory in the Spark executor,
Wrangler bug, and
Task getting killed.
Solution(s)
Out of memory in spark executor
When the Spark executors run out of memory, the JVM spends most of its time in GC pauses, which causes heartbeat timeouts that terminate the executors. In these scenarios the logs will contain the following:
Lost task 0.0 in stage 14.0 (TID 16, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-8.c.vf-pt-ngbi-dev-gen-03.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125847 ms
Remediation:
Check the executor resources allocated for the pipeline; if the memory is too low (for example, 2 GB), increase it to a higher value (for example, 8 GB), as in the sketch after this list.
If the executor memory already exceeds 32 GB for join/aggregation use cases and the pipeline still fails, ensure that the join best practices are being followed.
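For reference, the executor resource settings ultimately map to standard Spark properties. A minimal PySpark sketch of those properties follows; in Data Fusion the values are set through the pipeline's resource configuration rather than in code, and the values shown here are illustrative only.

# Sketch: Spark-level equivalents of the pipeline's executor memory
# settings. In CDF/CDAP these are configured in the pipeline resource
# config; this standalone PySpark session is only an illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("executor-memory-sketch")
    .config("spark.executor.memory", "8g")          # raise heap from a low value such as 2g
    .config("spark.executor.memoryOverhead", "1g")  # headroom for off-heap allocations
    .getOrCreate()
)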
Wrangler bug
Due to a Wrangler bug, the pipeline configs get overwritten, which results in the following error:
Lost task 48.0 in stage 17.0 (TID 78, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-2.c.vf-pt-ngbi-dev-gen-03.internal, executor 9): io.cdap.cdap.api.plugin.InvalidPluginConfigException: Unable to create plugin config.
Remediation:
Set the executor vcore to 1 for CDF versions below 6.1.3 (see the sketch below).
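At the Spark level, one executor vcore corresponds to spark.executor.cores=1. A minimal sketch, assuming a standalone PySpark session; in CDF the value is set in the pipeline's executor resource settings, not in code.

# Sketch: Spark-level equivalent of executor vcore = 1.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("single-vcore-sketch")
    .config("spark.executor.cores", "1")  # one task slot per executor
    .getOrCreate()
)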
Task killed
The Spark framework will kill executor tasks, which results in the following log message:
Lost task 78.0 in stage 17.0 (TID 112, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-6.c.vf-pt-ngbi-dev-gen-03.internal, executor 6): TaskKilled (Stage cancelled)
Note: TaskKilled is not an actual error; it is the result of the Spark framework cancelling executors during shutdown. The root cause of the failure will come from a different executor's logs.
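Because TaskKilled entries are noise, one way to triage is to skip them and surface the first substantive failure. A minimal sketch, assuming the pipeline log has been downloaded to a local file named pipeline.log (hypothetical name):

# Sketch: print the first "Lost task" entry that is not a benign
# TaskKilled cancellation; that entry usually points at the root cause.
with open("pipeline.log") as log:
    for line in log:
        if "Lost task" in line and "TaskKilled" not in line:
            print(line.rstrip())
            break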