BigQuery Batch Sink: Schema Management

Introduction

Some sources can have various types of schema changes, currently BigQuery Batch Sink plugin has limited support for such changes. This page documents an approach to handle schema changes in BigQuery Batch Sink plugin.

Use case

RequirementUse CaseApproach
Add new column

BigQuery schema: a, b, c

New schema: a, b, c, d

Write data for column d.

ALLOW_FIELD_ADDITION value can be set to SchemaUpdateOptions in JobConfigurationLoad, will always add nullable field.
Remove existing column

BigQuery schema: a int, b int not null

New schema: a int

BigQuery does not allow deleting existing columns, only full table reload can be used in this case.
Change column mode to less restrictive

BigQuery schema: a int not null

New schema: a int

ALLOW_FIELD_RELAXATION value can be set to SchemaUpdateOptions in JobConfigurationLoad.
Change column mode to more restrictive

BigQuery schema: a int

New schema: a int not null

BigQuery does not allow column mode change to more restrictive, only full table reload can be used in this case.
Change column type

BigQuery schema: a int

New schema: a double

BigQuery does not allow changing column type, only full table reload can be used in this case.
Full table reloadRemove existing data, load new data, possibly change schemaWRITE_TRUNCATE value can be set to WriteDisposition in JobConfigurationLoad. This option will always overwrite the schema along with the existing data.
Use custom default for missing columns

BigQuery schema: a int, b int

New schema: a int

Add b column with custom default.

BigQuery Schema does not support default notion for columns, null value is written if value for column is missing. This can be handled using transformation plugin.
Add custom default to new column for already existing dataBigQuery has data in columns a and b. New column c is added. Need to insert data into column c for already existing data.BigQuery does not support default notion for columns. If new column is added, its value in existing rows will be null. In order to change it, only full table reload can be used.

Managing BigQuery table schema - https://cloud.google.com/bigquery/docs/managing-table-schemas

JobConfigurationLoad - https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html

Plugin Type

This is not a new plugin, but enhancements to existing plugin.

Design

Approach(s)

Properties

Add new property in Advanced section for BigQuery Batch Sink plugin properties:


User Facing NameTypeDefaultDescription
Allow Schema Relaxationbooleanfalse

Specifies the action that occurs

  • if input field is missing in the destination table;
  • if the original schema has required field which is nullable in the input schema.

False: Job fails if input data contains extra field or original schema has required field which is nullable in the input schema.

True: New nullable field is added to the schema if it is absent in destination table or required field in the original schema will be relaxed to nullable if in the input schema its nullable.

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2



Table of Contents

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature