Lost task error in Spark
Problem
Pipeline execution fails when the execution engine is set to Spark, with errors of the following form in the pipeline logs:
“Lost task x.y in stage z.xx”
Symptom(s)
Out of memory in Spark executor
Wrangler bug
Task killed
Solution(s)
Out of memory in Spark executor
When a Spark executor runs out of memory, the JVM spends most of its time in GC pauses, the executor heartbeat times out, and the executor is terminated. In this scenario the logs contain a message like the following:
Lost task 0.0 in stage 14.0 (TID 16, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-8.c.vf-pt-ngbi-dev-gen-03.internal, executor 1): \
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) \
Reason: Executor heartbeat timed out after 125847 ms
Remediation:
Check the executor resources allocated to the pipeline. If the executor memory is too low (for example, 2 GB), increase it to a higher value (for example, 8 GB); the underlying Spark properties are shown in the sketch below.
If the executor memory is already over 32 GB for join/aggregation use cases and the pipeline still fails, ensure that the join best practices are being followed.
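For reference, a minimal sketch of the underlying Spark executor memory settings, assuming a standalone PySpark job (in a Data Fusion pipeline these values are normally set through the pipeline's executor resource configuration rather than in code):

    from pyspark.sql import SparkSession

    # Raise executor memory well above the default so wide joins/aggregations
    # do not push the JVM into long GC pauses and heartbeat timeouts.
    spark = (
        SparkSession.builder
        .appName("executor-memory-example")
        .config("spark.executor.memory", "8g")          # heap per executor
        .config("spark.executor.memoryOverhead", "1g")  # off-heap headroom
        .getOrCreate()
    )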
Wrangler bug
Due to a Wrangler bug, the pipeline configuration gets overwritten, which results in the following error:
Lost task 48.0 in stage 17.0 (TID 78, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-2.c.vf-pt-ngbi-dev-gen-03.internal, executor 9): io.cdap.cdap.api.plugin.InvalidPluginConfigException: \
Unable to create plugin config.
Remediation:
For CDF versions below 6.1.3/6.2.0, set the executor vcores to 1 (the equivalent Spark property is shown in the sketch below).
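A minimal sketch of the corresponding Spark property, again assuming a standalone PySpark job (in a Data Fusion pipeline the vcores value is set in the pipeline's executor resource configuration):

    from pyspark.sql import SparkSession

    # Workaround: limit each executor to one concurrent task by giving it a single core.
    spark = (
        SparkSession.builder
        .appName("single-core-executor-workaround")
        .config("spark.executor.cores", "1")
        .getOrCreate()
    )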
Task killed
The Spark framework may kill executor tasks, which results in the following log message:
Lost task 78.0 in stage 17.0 (TID 112, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-6.c.vf-pt-ngbi-dev-gen-03.internal, executor 6): TaskKilled \
(Stage cancelled)
Note: TaskKilled is not an actual error. It is the result of the Spark framework canceling executor tasks during shutdown. The root cause of the failure will be reported by a different executor.
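To find that root cause, it can help to scan the full pipeline log while ignoring the benign TaskKilled entries. A minimal sketch, assuming the pipeline logs have been exported to a local file named pipeline.log (a hypothetical name):

    # Print "Lost task" messages that are not the benign TaskKilled shutdown noise,
    # so the first real executor failure is easier to spot.
    def real_lost_task_lines(path="pipeline.log"):
        with open(path, encoding="utf-8") as log:
            for line in log:
                if "Lost task" in line and "TaskKilled" not in line:
                    yield line.rstrip()

    if __name__ == "__main__":
        for line in real_lost_task_lines():
            print(line)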