
Currently, processing Avro files at scale using Data Fusion requires a few manual steps. This article describes an effective way to parse Avro files on GCS, transform the data, and load the results back into GCS.

Before you begin

  1. Make sure you have an instance of Data Fusion.

  2. Download a sample Avro file, and upload it to a directory on GCS.
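Before uploading, you can sanity-check that your sample file really is an Avro object container: every such file begins with the four-byte magic `Obj` followed by the version byte `0x01`. A minimal stdlib sketch (the file name `sample.avro` is an assumption):

```python
# Check that a local file is an Avro object container file (OCF).
# Every OCF begins with the 4-byte magic: b"Obj" followed by version byte 0x01.

AVRO_MAGIC = b"Obj\x01"

def is_avro_container(path: str) -> bool:
    """Return True if the file starts with the Avro OCF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == AVRO_MAGIC

# Example usage with a hypothetical local sample file:
# print(is_avro_container("sample.avro"))
```

This only verifies the file header; it does not validate the embedded schema or the data blocks.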

Parsing the Avro file using Wrangler

  1. Open Wrangler

  2. Using the GCS connection, navigate to the directory where you have stored the sample Avro file. Select the file.

  3. The file should be shown in Wrangler with a single column, body, of type byte[].

  4. Now apply the directive Parse → Avro on the body column.

  5. The data should be split into multiple columns.

  6. Click the More link towards the top right, and select View Schema.

  7. In the Schema modal that appears, click the download button in the title bar to download the schema of the Avro file. Save this file to a known location on your computer.
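The downloaded file is a standard Avro record schema in JSON. As a hedged illustration, a minimal schema might look like the snippet below; the record name and field names here are hypothetical, and your download will reflect the columns Wrangler derived from your own file:

```python
import json

# A minimal example of the kind of Avro record schema the download produces.
# The record name "etlSchemaBody" and the fields below are illustrative only.
schema_json = """
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": ["string", "null"]},
    {"name": "price", "type": ["double", "null"]}
  ]
}
"""

schema = json.loads(schema_json)
print(schema["type"])                          # record
print([f["name"] for f in schema["fields"]])   # ['id', 'name', 'price']
```

Union types such as ["string", "null"] mark a field as nullable, which is common for columns produced by Wrangler.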

Applying transformations

  1. You can continue to perform more transformations on this data as needed. For reference, you can use these transformations -
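As an illustration, a Wrangler recipe at this stage might contain a few directives like the following; the column names are hypothetical, so adapt them to your own data:

```
parse-as-avro-file :body
set-type :price double
fill-null-or-empty :name 'unknown'
drop :unused_column
```

The first directive is the one added by the Parse → Avro step above; the rest are examples of common cleanup directives.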

Creating a pipeline

  1. Once you’ve applied your transformations, click Create Pipeline. This will bring you into the Studio, where you can see the GCS source and Wrangler nodes, and can create the rest of your pipeline.

  2. Now we need to perform some manual steps:

    1. First, open the GCS source. Click the Actions button in the Output Schema section, and choose Import.

    2. Specify the schema file that you downloaded earlier. This is the schema of your Avro file.

    3. Change the Format property of the GCS source from text to avro.

    4. Now, open the Wrangler node, and remove the first directive (parse-as-avro-file) in the Recipe property.

  3. Now you can build the rest of the pipeline. You can add additional transformations, aggregations, sinks, error collectors as required. Attached is a sample pipeline for your reference.
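To illustrate the Recipe edit in step 2 above: it amounts to deleting the first line of the recipe, since the GCS source (now reading avro format with the imported schema) already delivers parsed records. The directives shown are illustrative only:

```
Before:
  parse-as-avro-file :body
  set-type :price double
  drop :unused_column

After (first directive removed):
  set-type :price double
  drop :unused_column
```

If the parse-as-avro-file directive is left in place, the Wrangler node would attempt to re-parse already-structured records and the pipeline would fail.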

Running your pipeline

  1. You can run your pipeline in Preview mode to verify your processing logic.

  2. Once it succeeds, deploy the pipeline and run it as usual.
