Problem
Data Fusion is unable to SSH to the Dataproc cluster’s master node in order to launch the pipeline.
Symptom
Pipelines configured to run on a remote Dataproc cluster (the default profile) are executed by SSHing to the cluster's master node and launching the Hadoop job from there. If Data Fusion cannot SSH to the master node, whether due to a lack of network connectivity or an authentication failure, the pipeline run fails and a JSchException appears in the pipeline logs.
There are two common scenarios for a JSchException.
The first is a “java.net.ConnectException: Connection timed out”:
java.io.IOException: com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
	at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
The second is an “Auth fail”:
java.io.IOException: com.jcraft.jsch.JSchException: Auth fail
	at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:82) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillPreparer.lambda$start$0(RemoteExecutionTwillPreparer.java:429) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillRunnerService$ControllerFactory.lambda$create$0(RemoteExecutionTwillRunnerService.java:519) ~[na:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_212]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_212]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_212]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212]
Caused by: com.jcraft.jsch.JSchException: Auth fail
	at com.jcraft.jsch.Session.connect(Session.java:519) ~[com.jcraft.jsch-0.1.54.jar:na]
	at com.jcraft.jsch.Session.connect(Session.java:183) ~[com.jcraft.jsch-0.1.54.jar:na]
	at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:79) ~[na:na]
	... 7 common frames omitted
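The two failure modes can be told apart mechanically from the exception message in the pipeline logs. As a minimal sketch (the helper function name and its log-snippet argument are hypothetical, for illustration only):

```shell
# classify_jsch_error: hypothetical helper that maps a JSchException
# message from the pipeline logs onto the two remediation paths.
classify_jsch_error() {
  case "$1" in
    *"Auth fail"*)            echo "auth"    ;;  # OS Login / CDAP-15369 path
    *"Connection timed out"*) echo "timeout" ;;  # firewall / port 22 path
    *)                        echo "unknown" ;;
  esac
}

classify_jsch_error "com.jcraft.jsch.JSchException: Auth fail"
# → auth
```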
Solution
For the case where the error message contains “Auth fail”, this is likely a known, already-resolved issue that occurs when OS Login is enabled in the customer’s project: CDAP-15369. Encountering this error likely means the Data Fusion instance does not have the fix for this bug; instances created after May 23, 2019 include the fix. The resolution is to create a new instance.
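To confirm this diagnosis, one can check whether OS Login is enabled via project metadata and when the Data Fusion instance was created. A sketch using the gcloud CLI (the INSTANCE_ID and REGION placeholders are assumptions to fill in; the `beta data-fusion` command group may vary by gcloud version):

```shell
# Look for the enable-oslogin key in project-wide Compute Engine metadata.
gcloud compute project-info describe \
    --format="value(commonInstanceMetadata.items)" | grep -i oslogin

# Check the instance's creation time: instances created after
# May 23, 2019 include the fix for CDAP-15369.
gcloud beta data-fusion instances describe INSTANCE_ID \
    --location=REGION \
    --format="value(createTime)"
```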
For the case where the error message contains “java.net.ConnectException: Connection timed out”, the firewall rules in the customer’s project likely do not allow ingress connections on port 22. As documented on the public Data Fusion docs page, new projects start with a default network. The default network is pre-populated with a firewall rule, default-allow-ssh, that allows ingress connections on TCP port 22 from any source to any instance in the network. If no such rule exists in the network used by the Cloud Data Fusion instance, one needs to be created.
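The check-and-create steps might look like the following sketch, assuming the cluster runs on the default network (the rule and network names are assumptions; in a locked-down environment the source range should be narrowed rather than left open to any source):

```shell
# Inspect the existing firewall rules on the network the cluster uses;
# verify that some rule allows ingress on tcp:22.
gcloud compute firewall-rules list --filter="network=default"

# If no such rule exists, create one equivalent to default-allow-ssh.
# 0.0.0.0/0 mirrors the default rule; restrict it where possible.
gcloud compute firewall-rules create default-allow-ssh \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:22 \
    --source-ranges=0.0.0.0/0
```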