JSchException during pipeline execution

Problem

Your pipeline fails with a JSchException caused by either a java.net.ConnectException: Connection timed out error or an Auth fail error. In both cases, the pipeline doesn’t run because Cloud Data Fusion can't open an SSH connection to the Cloud Dataproc cluster’s master node.

Symptom

Pipelines are configured by default to run on a remote Cloud Dataproc cluster. When you run your pipeline, Cloud Data Fusion connects over SSH to the cluster’s master node and launches a Hadoop job from that node. If Cloud Data Fusion can't connect to the master node, because of missing network connectivity or an authentication failure, the pipeline run fails and a JSchException appears in the pipeline logs.

There are two common cases in which you might get a JSchException:

  • java.net.ConnectException: Connection timed out error:

java.io.IOException: com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
    at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
    at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
    at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

  • Auth fail error:

java.io.IOException: com.jcraft.jsch.JSchException: Auth fail
    at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
    at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
    at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
Caused by: com.jcraft.jsch.JSchException: Auth fail
    at com.jcraft.jsch.Session.connect(Session.java:519) ~[com.jcraft.jsch-0.1.54.jar:na]
    at com.jcraft.jsch.Session.connect(Session.java:183) ~[com.jcraft.jsch-0.1.54.jar:na]
    at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:79) ~[na:na]
    ... 7 common frames omitted
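
When scanning exported pipeline logs, it can help to tell the two cases apart automatically. The following is a minimal Python sketch that matches on the message strings shown in the traces above. The function name classify_jsch_failure and its return labels are hypothetical, chosen only for illustration; they are not part of Cloud Data Fusion or CDAP.

```python
from typing import Optional


def classify_jsch_failure(log_text: str) -> Optional[str]:
    """Label the JSchException variant found in log_text.

    Returns None when no JSchException is present. The labels are
    illustrative assumptions, not values produced by Cloud Data Fusion.
    """
    if "com.jcraft.jsch.JSchException" not in log_text:
        return None
    # Case 1: the SSH connection to the master node timed out.
    if "java.net.ConnectException: Connection timed out" in log_text:
        return "connection-timed-out"
    # Case 2: the SSH connection was reached but authentication failed.
    if "JSchException: Auth fail" in log_text:
        return "auth-fail"
    # A JSchException with some other cause.
    return "other"
```

For example, passing the first line of either trace above returns the label for that case, which tells you which part of the Solution section applies.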

Solution

If the error message contains java.net.ConnectException: Connection timed out, the likely cause is that the firewall rules in your project don't allow ingress connections on port 22. New projects start with a default network that is pre-populated with a firewall rule, default-allow-ssh, which allows ingress connections on port 22 from any source to any instance in the network. If the network used by your Cloud Data Fusion instance has no such rule, create one, then rerun your pipeline.
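
As a rough sanity check after changing firewall rules, you can test whether a TCP connection to port 22 can be opened at all. The following is a minimal Python sketch using only the standard library. The helper name can_open_tcp is hypothetical, and the check reflects connectivity only from the machine where you run it, which is an assumption to keep in mind: Cloud Data Fusion connects from its own environment, so a successful result here does not guarantee that the pipeline can connect.

```python
import socket


def can_open_tcp(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Call can_open_tcp with the master node's address as the host. A False result that takes the full timeout to return is consistent with a firewall silently dropping traffic on port 22.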

If the error message contains Auth fail, the likely cause is a known issue that was resolved on May 23, 2019. A Cloud Data Fusion instance created before that date might not have the fix. To resolve the error, create a new instance.
