Lost task error in Spark

Problem

Pipeline execution fails when the execution engine is set to Spark, with errors of the following form in the pipeline logs:

“Lost task x.y in stage z.xx”

Symptom(s)

  • Out of memory in Spark executor

  • Wrangler bug

  • Task killed

Solution(s)

Out of memory in Spark executor

When a Spark executor runs out of memory, the JVM spends a large amount of time in garbage collection (GC) pauses. The executor then fails to send heartbeats within the timeout and is terminated. In this scenario the logs contain messages like the following:

Lost task 0.0 in stage 14.0 (TID 16, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-8.c.vf-pt-ngbi-dev-gen-03.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125847 ms

Remediation:

  • Check the executor resources allocated for the pipeline. If the memory is too low (for example, 2 GB), increase it to a higher value such as 8 GB (see the sketch after this list).

  • If the executor memory is already over 32 GB for join/aggregation use cases and the pipeline still fails, ensure that the join best practices are being followed.
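
For reference, the executor memory setting corresponds to the Spark property spark.executor.memory. In Data Fusion the value is normally set through the pipeline's Resources configuration rather than in code, but the following minimal PySpark sketch shows the equivalent setting for a standalone Spark job (the application name is hypothetical):

  # Minimal PySpark sketch, assuming a standalone Spark job rather than a
  # Data Fusion pipeline; in a pipeline, set the value through the
  # Resources configuration instead.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("executor-memory-example")     # hypothetical application name
      .config("spark.executor.memory", "8g")  # raise from a low value such as 2g
      .getOrCreate()
  )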

Wrangler bug

Due to a Wrangler bug, the pipeline configs get overwritten, which results in the following error:

Lost task 48.0 in stage 17.0 (TID 78, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-2.c.vf-pt-ngbi-dev-gen-03.internal, executor 9): io.cdap.cdap.api.plugin.InvalidPluginConfigException: Unable to create plugin config.

Remediation:

  • For CDF versions below 6.1.3/6.2.0, set the executor vCores to 1 (see the sketch after this list).
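
The vCore count corresponds to the Spark property spark.executor.cores; with it set to 1, each executor runs only one task at a time, which is presumably why the workaround avoids the config overwrite. A minimal PySpark sketch of the equivalent setting, assuming a standalone Spark job:

  # Minimal PySpark sketch, assuming a standalone Spark job; in a Data Fusion
  # pipeline, set Executor CPU to 1 in the Resources configuration instead.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("single-core-executor-example")  # hypothetical application name
      .config("spark.executor.cores", "1")      # one concurrent task per executor
      .getOrCreate()
  )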

Task killed

The Spark framework can kill executor tasks, which results in the following log message:

Lost task 78.0 in stage 17.0 (TID 112, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-6.c.vf-pt-ngbi-dev-gen-03.internal, executor 6): TaskKilled (Stage cancelled)

Note: TaskKilled is not an actual error; it is a result of the Spark framework canceling tasks during shutdown. The root cause of the failure will be in the logs of a different executor (see the sketch below).
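
To locate the originating failure in an aggregated log, skip the TaskKilled entries and look at the first Lost task entry that reports a different reason. A hypothetical Python sketch:

  # Hypothetical sketch: find the first "Lost task" entry whose reason is not
  # TaskKilled; that entry usually points at the executor with the real failure.
  import re
  import sys

  def first_real_failure(log_path):
      pattern = re.compile(r"Lost task \S+ in stage \S+")
      with open(log_path) as fh:
          for line in fh:
              if pattern.search(line) and "TaskKilled" not in line:
                  return line.strip()
      return None

  if __name__ == "__main__":
      # Usage: python find_failure.py pipeline.log
      failure = first_real_failure(sys.argv[1])
      print(failure or "No non-TaskKilled task failures found")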