Problem
Pipeline execution fails when the execution engine is set to Spark, with errors of the following form in the pipeline logs:
“Lost task x.y in stage z.xx”
Solution(s)
A lost task in Spark can occur for the following reasons:
1. Out of memory in the Spark executor
When the Spark executors run out of memory, the JVM spends most of its time in GC pauses, the executor heartbeat times out, and the executors are terminated. In this scenario the logs contain entries such as the following:
Lost task 0.0 in stage 14.0 (TID 16, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-8.c.vf-pt-ngbi-dev-gen-03.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 125847 ms
Remediation:
Check the executor resources allocated for the pipeline; if the memory is too low (e.g. 2 GB), increase it to a higher value (e.g. 8 GB), as shown in the sketch after this list
If the executor memory is already above 32 GB for join/aggregation use cases and the pipeline still fails, ensure that the join best practices are being followed
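In Data Fusion the executor memory is normally raised through the pipeline's resource configuration; the minimal PySpark sketch below only illustrates the equivalent standard Spark properties (the app name and the overhead value are illustrative assumptions, not CDF defaults):

from pyspark.sql import SparkSession

# Minimal sketch: executor sizing expressed as standard Spark properties.
# In CDF/Data Fusion these values come from the pipeline's resource
# configuration rather than from code.
spark = (
    SparkSession.builder
    .appName("executor-memory-sizing-example")      # illustrative name
    .config("spark.executor.memory", "8g")          # raise executor heap, e.g. from 2g to 8g
    .config("spark.executor.memoryOverhead", "1g")  # off-heap overhead; assumed value, tune as needed
    .getOrCreate()
)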
2. Wrangler bug
Due to a Wrangler bug, the pipeline configs get overwritten, which results in the following error:
Lost task 48.0 in stage 17.0 (TID 78, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-2.c.vf-pt-ngbi-dev-gen-03.internal, executor 9): io.cdap.cdap.api.plugin.InvalidPluginConfigException: Unable to create plugin config.
Remediation:
For CDF versions below 6.1.3, set the executor vCores to 1 (see the sketch below)
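The corresponding standard Spark property is spark.executor.cores; the short PySpark sketch below is only an illustration of the equivalent setting, not how CDF applies it (in CDF the value is set in the pipeline's executor resources):

from pyspark.sql import SparkSession

# Minimal sketch: one vCore per executor, expressed as the standard
# Spark property.
spark = (
    SparkSession.builder
    .appName("single-vcore-executor-example")  # illustrative name
    .config("spark.executor.cores", "1")       # one task slot per executor
    .getOrCreate()
)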
3. Task killed
The Spark framework may kill executor tasks, which results in the following log message:
Lost task 78.0 in stage 17.0 (TID 112, cdap-mock2dwh2-3ededd25-5837-11ea-b33b-1ad7eaaa4723-w-6.c.vf-pt-ngbi-dev-gen-03.internal, executor 6): TaskKilled (Stage cancelled)
Note: TaskKilled is not an actual error; it is a result of the Spark framework cancelling executor tasks during shutdown. The root cause of the failure will come from a different executor.
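When triaging such failures it can help to look past the TaskKilled entries and find the first genuine task failure. The small Python sketch below assumes the pipeline logs have been downloaded to a local file (the path handling and line format are assumptions for illustration):

import sys
from typing import Optional

def first_real_failure(log_path: str) -> Optional[str]:
    # Return the first "Lost task" line that is not a TaskKilled,
    # which usually points at the executor carrying the real error.
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            if "Lost task" in line and "TaskKilled" not in line:
                return line.strip()
    return None

if __name__ == "__main__":
    failure = first_real_failure(sys.argv[1])  # path to a downloaded pipeline log (assumption)
    print(failure or "No non-TaskKilled task failures found")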