CDF Component Documentation
Written by Mo Eseifan
This article contains a summary of the various components in Cloud Data Fusion (CDF). Please leave feedback in the comments if a section is missing or unclear.
Tenant Project
GKE Cluster
These are the components that reside inside the GKE cluster in the tenant project. These components do not map one-to-one to pods in the cluster. For example, in newer versions of CDF there are 10 pods for preview, but since they all do the same thing they are listed as a single component here.
CDAP UI
This component is responsible for hosting the user interface that is presented to customers. It runs a NodeJS server that handles all incoming requests. In older versions of CDAP (<6.1.4), the UI pod was smaller, so it was possible to crash the UI with too many open tabs/connections. Since then, the polling from the browser has been reduced and the pod has been scaled up to prevent crashes.
In general, a request from the browser takes the following path:
UI (browser) → User Interface Agent → UI (NodeJS) → AppFabric
If the UI is slow or unresponsive, the issue can be in any one of these steps. The most common cause is AppFabric being down or overwhelmed; if that is not the issue, then the UI pod is the likely culprit. It is uncommon for the User Interface Agent or the UI in the browser to cause an issue.
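A quick way to confirm whether AppFabric is the culprit is to check its status via the CDAP Monitor REST API. As a sketch (the service id shown is the standard one, but it may vary by version):
GET /v3/system/services/appfabric/status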
User Interface Agent
This is the Inverting Proxy Agent that handles all incoming requests to the GKE cluster. This agent is the second half of a two-part system that allows the pods in the GKE cluster to communicate with the internet. The other half is the Inverting Proxy Server, which is hosted on Google-internal infrastructure (similar to how the CLH is hosted).
The logs of this pod can be useful for checking whether the UI/AppFabric is behaving as expected. If the UI or AppFabric is down, you may see many timeout errors in the logs of this agent. The code of the agent does not change very often, so it rarely causes issues.
CDAP Router
The router is responsible for routing HTTP requests between the pods within the cluster. The router exposes a Kubernetes service that all other pods use to communicate with it. For example, when making an API request to AppFabric, the user would use the following URL:
<router-service-name>:11015/v3/...
The router would then redirect the request to the AppFabric pod. The router acts as a single point of entry for HTTP requests into the cluster so users do not have to be aware of the GKE configuration when they attempt to interact with the REST API.
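For example, to list the pipelines/apps deployed in the default namespace through the router (an illustrative request; the service name depends on the cluster configuration):
GET <router-service-name>:11015/v3/namespaces/default/apps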
CDAP logs
This is the CDAP logging service. It collects logs from all CDAP programs and services and stores them in GCS. Pipelines are considered CDAP programs, so this service is used to save and fetch the logs for pipeline runs. The user can query the logs service directly using the REST APIs to get logs for anything happening within CDAP.
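As a sketch, the logs for a batch pipeline run can be fetched with a request along these lines (the path assumes the pipeline was deployed with the standard DataPipelineWorkflow; the namespace, pipeline name, and run id are placeholders):
GET /v3/namespaces/<namespace>/apps/<pipeline-name>/workflows/DataPipelineWorkflow/runs/<run-id>/logs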
AppFabric
This is the core of CDAP. It is responsible for creating, deploying, updating, and monitoring the pipelines that run within CDF. This pod communicates with almost every other pod in the cluster at some point during its tasks. If AppFabric is not working as expected, there is a good chance one of the other components is down or not responding and causing AppFabric to misbehave.
The logs from the AppFabric pod are usually the most revealing when diagnosing an issue with a pipeline or instance. Another useful tool is a thread dump, which shows which threads are blocked and where they are blocked. Together, these two methods give a clear indication of where the problem might be and a starting point for the investigation.
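One way to take a thread dump is to exec into the AppFabric pod and run jstack against the JVM. A sketch, assuming access to the tenant project's cluster, that the JVM runs as PID 1 in the container, and that jstack is available in the image:
kubectl exec <appfabric-pod-name> -- jstack 1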
Preview
This component is responsible for handling pipeline previews from the Studio. In older versions of CDAP, this was a single pod that would run the pipelines locally and present the results, which caused scalability issues when many users attempted to run previews at once. From 6.2 onwards, the architecture was changed to a “manager” node that accepts and allocates jobs and many “runner” nodes that execute the pipelines and return the results.
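Previews can also be driven through the REST API. A sketch of the flow, assuming the standard CDAP preview endpoints (the request body of the first call is the pipeline config JSON, and the first call returns the preview id):
POST /v3/namespaces/<namespace>/previews
GET /v3/namespaces/<namespace>/previews/<preview-id>/status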
Metrics
The metrics processor is responsible for saving and processing the metrics generated by CDAP. These metrics include user metrics (records in/out) as well as system metrics (number of API requests to a given service). The metrics pod accepts all incoming metrics from the various CDAP components and stores them in a format that is easy to query. The metrics querying ability is not directly exposed in the UI, but it is used to generate the graphs and visualizations that are presented after a run.
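For example, the records-out count for a pipeline stage can be queried with something like the following (a sketch; the tags and metric name follow the usual CDAP metric naming, but the exact names here are assumptions):
POST /v3/metrics/query?tag=namespace:<namespace>&tag=app:<pipeline-name>&tag=run:<run-id>&metric=user.<stage-name>.records.out&aggregate=true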
Metadata
This component is responsible for recording metadata emitted by other CDAP components. CDAP emits various metadata on artifacts, plugins, datasets, etc. that are deployed within the instance. This metadata is primarily used by other CDAP components; however, it can also be queried by the user in the UI (in OSS CDAP only) or using the REST APIs (both CDF and CDAP).
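As a sketch, a metadata search within a namespace looks like this (the query term is a placeholder):
GET /v3/namespaces/<namespace>/metadata/search?query=<search-term>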
ElasticSearch
ElasticSearch is used to power the querying of metadata. The user does not interact with ElasticSearch directly. CDAP uses ES in the backend to execute queries efficiently, but the REST API is independent of the underlying query engine.
Messaging
This component is vital to the operation of CDAP. All the different components use this messaging service to communicate with each other. An issue with the messaging service leads to odd CDAP behavior; in the past we’ve seen issues like pipelines getting stuck, logs not appearing correctly, pipeline deployments failing, etc.
The messaging service is fairly reliable in newer versions of CDAP, but there are some known memory issues in versions 6.1.x. Checking the logs or taking a thread dump will quickly confirm whether the messaging service is experiencing issues. Restarting the service will usually fix the problem, but it will probably recur, so we usually recommend customers upgrade to a newer version for better reliability.
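The health of the messaging service can be checked with the Monitor REST API, as in this sketch (the service id shown is an assumption and may differ between versions):
GET /v3/system/services/messaging.service/status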
Dataprep (Wrangler)
The dataprep service is responsible for everything seen in the Wrangler tab in the UI. Note that the Wrangler service is different from the Wrangler plugin: the service powers the interactive tab where users apply directives to their data and see a preview of the results in real time, while the plugin lives in a pipeline, only processes data once the pipeline runs, and does not rely on the service at all.
Wrangler is deployed as an app within CDAP, which means the user can stop/start it at any time using the APIs listed below. If the app goes down for some reason (visible in the System Admin tab), the user can attempt to start it again using the API. This can happen occasionally in older versions of CDAP; starting the app again manually usually resolves the issue.
POST /v3/namespaces/system/apps/dataprep/services/service/start
POST /v3/namespaces/system/apps/dataprep/services/service/stop
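The current state of the app can also be checked before restarting it, using the same URL pattern:
GET /v3/namespaces/system/apps/dataprep/services/service/status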
Pipeline Studio
The pipeline studio is primarily responsible for validating pipelines. Running a pipeline does not depend on the pipeline studio, but deploying one does, since CDAP always validates a pipeline before deploying it. If the pipeline studio is down, the user will see an error message when attempting to deploy or validate a pipeline. Starting in CDAP 6.3.0, the pipeline studio is also used to manage drafts, so listing, updating, and deleting pipeline drafts will not work if the pipeline studio is down.
The pipeline studio is deployed as an app within CDAP, which means the user can stop/start it at any time using the APIs listed below. If the app goes down for some reason (visible in the System Admin tab), the user can attempt to start it again using the API. This can happen occasionally in older versions of CDAP; starting the app again manually usually resolves the issue.
POST /v3/namespaces/system/apps/pipeline/services/studio/start
POST /v3/namespaces/system/apps/pipeline/services/studio/stop
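Deploy-time validation can also be exercised directly through the studio service’s methods. A sketch based on the stage-validation endpoint (the request body is the JSON of the stage being validated):
POST /v3/namespaces/system/apps/pipeline/services/studio/methods/v1/contexts/<namespace>/validations/stage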
Monitoring Agent
The monitoring agent is responsible for collecting metrics from CDAP and forwarding them to Stackdriver. These Stackdriver metrics are internal only, so they are not visible to the user in their customer project. The monitoring agent has no user-facing impact on the cluster. Some older versions of the agent may show a high pod restart count due to error-handling issues; this high restart count has no impact on the instance’s performance or function.
Runtime
The runtime service is responsible for receiving requests from the Dataproc cluster to gather metadata for pipelines: logs, program status, metrics, etc. This pod was introduced as part of the runtime revamp in CDAP 6.2+ (released May 2020). If this pod is not functioning correctly, pipelines will appear stuck in the STARTING/RUNNING state. There will sometimes be 400 errors in the logs of this pod; those could be transient issues which may be fixed with a restart.
Cloud Storage
Cloud Storage is used to store the artifacts that CDAP needs to run a pipeline. When a pipeline runs, CDAP copies the artifacts from the GCS bucket to the Dataproc cluster (or whichever provisioner the user is using). We never need to interact with this bucket directly, and there have been very few instances where it caused an issue.
Cloud SQL
Each CDF instance has its own Cloud SQL PostgreSQL database, which is used to store all information for the instance. This includes run records, pipeline drafts, provisioner configs, deployed pipelines, namespaces, etc. Cloud SQL is the main store for CDF, so any connection issues between the GKE cluster and Cloud SQL will result in a misbehaving instance; the AppFabric logs will have clear error messages if that is the case.
Customer Project
CDF does not create many resources in the customer project. This project contains any data sources that the customer would like to read from or write to. It is also possible to use data sources in other projects, but that requires additional IAM configuration.
Dataproc
By default, an ephemeral Dataproc cluster is used to run the pipelines. The user can also create a static Dataproc cluster and configure CDF to use it. In either case, the user can instead specify another project to run the pipelines in (with some additional IAM configuration).
Cloud Storage
A Cloud Storage bucket is created in the customer project; the bucket name has a ‘df-’ prefix. This bucket is only used for storing snapshots for streaming pipelines. It is possible for the customer to delete this bucket, but doing so will cause errors whenever they attempt to run a streaming pipeline. It is strongly recommended that the user not interact with this bucket directly.