Problem

You get a JSchException, caused by a java.net.ConnectException: Connection timed out error or an Auth fail error. In these cases, your pipeline doesn’t run because Cloud Data Fusion is unable to SSH to the Dataproc cluster’s master node.

Symptom

When you run your pipeline, Cloud Data Fusion runs the pipeline on a Dataproc cluster by SSHing to the cluster’s master node and then launching a Hadoop job from the node. If Cloud Data Fusion is unable to SSH to the master node due to lack of network connectivity or authentication failure, the pipeline run will fail and a JSchException will appear in the pipeline logs.
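To see how a blocked port differs from a reachable one at the TCP level, the following is a minimal sketch of a connection probe. The function name `probe_tcp` and the example address are hypothetical. The result depends on where the probe runs, so it only approximates what Cloud Data Fusion sees when it tries to reach the master node.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Attempt a plain TCP connection and report how the attempt ended."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No reply within the timeout: packets are often dropped by a firewall.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


# Example call with a hypothetical master node address:
# probe_tcp("10.128.0.5", 22)
```

A "timeout" result is the TCP-level counterpart of the Connection timed out message. An "open" result means the port is reachable from the probing machine, which says nothing about whether SSH authentication would succeed.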

There are two common cases in which you might get a JSchException:

  • java.net.ConnectException: Connection timed out error:

java.io.IOException: com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
        at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
        at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]

...

  • Auth fail error:

java.io.IOException: com.jcraft.jsch.JSchException: Auth fail
	at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
Caused by: com.jcraft.jsch.JSchException: Auth fail
	at com.jcraft.jsch.Session.connect(Session.java:519) ~[com.jcraft.jsch-0.1.54.jar:na]
	at com.jcraft.jsch.Session.connect(Session.java:183) ~[com.jcraft.jsch-0.1.54.jar:na]
	at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:79) ~[na:na]
	... 7 common frames omitted
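The two cases above can be told apart by the text that follows JSchException in the log. The following is a minimal sketch, assuming you have the pipeline log as a string. The function name `classify_jsch_failure` and its return labels are hypothetical, and the marker strings are copied from the two stack traces above.

```python
from typing import Optional

# Marker substrings taken from the two stack traces shown above.
CONNECT_TIMEOUT = "java.net.ConnectException: Connection timed out"
AUTH_FAIL = "JSchException: Auth fail"


def classify_jsch_failure(log_text: str) -> Optional[str]:
    """Return a label for the JSchException case found in log_text, if any."""
    if "JSchException" not in log_text:
        return None  # Not one of the failures described in this article.
    if CONNECT_TIMEOUT in log_text:
        return "connect_timeout"  # Network connectivity problem.
    if AUTH_FAIL in log_text:
        return "auth_fail"  # SSH authentication problem.
    return "other"  # A JSchException with a different cause.
```

For example, passing the first line of the first stack trace returns "connect_timeout", and passing the first line of the second returns "auth_fail".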

Solution

If the error message you get contains java.net.ConnectException: Connection timed out, this is likely because the firewall rules in your project don't allow ingress connections on port 22. New projects start with a default network. The default network is pre-populated with a firewall rule, default-allow-ssh. This firewall rule allows ingress connections on TCP port 22 from any source to any instance in the network. If such a firewall rule doesn't exist in the network used by your Cloud Data Fusion instance, you need to create such a rule. Then rerun your pipeline.

If the error message you get contains Auth fail, this is likely due to a known issue (CDAP-15369) that occurs when OS Login is enabled in your project and that was resolved on May 23, 2019. If you're getting this error, your Cloud Data Fusion instance was probably created before that date and therefore doesn't have the fix. Instances created after May 23, 2019 have the fix, so the resolution is to create a new instance.