Problem
You get a JSchException caused by a java.net.ConnectException: Connection timed out error or an Auth fail error. In these cases, your pipeline doesn't run because Cloud Data Fusion is unable to SSH to the Cloud Dataproc cluster's master node.
Symptom
By default, pipelines are configured to run on a remote Cloud Dataproc cluster, which is the default profile. When you run your pipeline, Cloud Data Fusion SSHes to the cluster's master node and then launches a Hadoop job from that node. If Cloud Data Fusion is unable to SSH to the master node due to lack of network connectivity or authentication failure, the pipeline run fails and a JSchException appears in the pipeline logs.
There are two common cases in which you might get a JSchException. The first is a java.net.ConnectException: Connection timed out error:

    java.io.IOException: com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
        at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

The second is an Auth fail error:

    java.io.IOException: com.jcraft.jsch.JSchException: Auth fail
        at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
    Caused by: com.jcraft.jsch.JSchException: Auth fail
        at com.jcraft.jsch.Session.connect(Session.java:519) ~[com.jcraft.jsch-0.1.54.jar:na]
        at com.jcraft.jsch.Session.connect(Session.java:183) ~[com.jcraft.jsch-0.1.54.jar:na]
        at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:79) ~[na:na]
        ... 7 common frames omitted

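When scanning a large pipeline log, it can help to tell the two cases apart automatically. The following is a minimal sketch that matches only the message strings shown in the two stack traces above. The function name and the returned labels are hypothetical and are not part of Cloud Data Fusion or CDAP.

```python
from typing import Optional

# Hypothetical labels used only in this sketch.
CONNECTION_TIMED_OUT = "connection_timed_out"
AUTH_FAIL = "auth_fail"


def classify_jsch_error(log_text: str) -> Optional[str]:
    """Return which of the two JSchException cases a log excerpt shows.

    Returns None when the text contains no JSchException, or a
    JSchException with a message other than the two covered here.
    """
    if "JSchException" not in log_text:
        return None
    # Case 1: the SSH connection to the master node never opened.
    if "java.net.ConnectException: Connection timed out" in log_text:
        return CONNECTION_TIMED_OUT
    # Case 2: the connection opened, but authentication was rejected.
    if "JSchException: Auth fail" in log_text:
        return AUTH_FAIL
    return None
```

Pass the text of the pipeline log (or a single stack trace) to `classify_jsch_error` and use the result to decide which part of the Solution section applies.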
Solution
If the error message you get contains java.net.ConnectException: Connection timed out, this is likely because the firewall rules in your project don't allow ingress connections on port 22. New projects start with a default network that is pre-populated with a firewall rule, default-allow-ssh. This firewall rule allows ingress connections on TCP port 22 from any source to any instance in the network. If such a firewall rule doesn't exist in the network used by your Cloud Data Fusion instance, create one, and then rerun your pipeline.
If the error message you get contains Auth fail, this is likely because of a known issue, CDAP-15369, which occurs when OS Login is enabled in your project and which was resolved on May 23, 2019. If you're getting this error, the Cloud Data Fusion instance you're running was probably created before that date and therefore doesn't have the fix for this bug. Create a new instance.
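For the Connection timed out case, a quick way to confirm whether TCP port 22 is reachable is to attempt a plain TCP connection with a timeout. The following is a minimal sketch using only the Python standard library. The function name is hypothetical, and the result reflects reachability only from the machine where you run it, which may differ from what Cloud Data Fusion sees, so treat it as a hint rather than proof.

```python
import socket


def can_reach_ssh(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout.

    This checks network reachability only. It performs no SSH handshake
    and no authentication, so it cannot detect the Auth fail case.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # refused connections) when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `can_reach_ssh("203.0.113.10")` with the master node's address in place of the placeholder returns False if the connection is blocked or times out. Running the check from a VM in the same network as the Cloud Dataproc cluster is assumed to give the most representative result.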