Document AI Batch Source

Introduction

The Document AI plugin lets users run Document AI processors to process invoices, parse forms, extract key-value pairs, and more. Users can also use this plugin to get predictions from AutoML custom models that are exposed as Document AI processors.

NOTE: These plugins will incur additional cost.

https://cloud.google.com/document-ai/docs
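
To make the processor call concrete, here is a minimal sketch of how a request for one PDF might be assembled. It assumes the public Document AI v1 REST shape (a `:process` call on a processor with a base64-encoded `rawDocument`). The plugin itself may use a client library or a different API version, and the function name and its parameters are hypothetical, not part of this spec.

```python
import base64
import json


def build_process_request(project, location, processor_id, pdf_bytes):
    """Return (url, json_body) for a synchronous Document AI process call.

    Assumption: the v1 REST endpoint layout. project, location and
    processor_id are placeholders supplied by the caller.
    """
    url = (
        f"https://{location}-documentai.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"processors/{processor_id}:process"
    )
    body = {
        "rawDocument": {
            # The REST API expects the file content as base64 text.
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        }
    }
    return url, json.dumps(body)
```

The sketch only builds the request. Authentication, sending it, and retry handling are left out because the design sections of this page do not yet specify them.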

Use case(s)

  1. As a user, I would like to parse my invoices and form/key-value-pair documents in PDF format to extract entities, using Data Fusion pipelines that orchestrate the end-to-end journey from a data source (GCS) to a data sink (BigQuery).
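
The entity-extraction step in this use case could be sketched as follows: a processed document is flattened into one record per entity, ready for a tabular sink such as BigQuery. The sketch assumes the v1 `Document` JSON layout, where each item in `entities` carries `type`, `mentionText`, and `confidence`. The output column names and the function name are hypothetical, since the plugin's output schema is not defined on this page.

```python
def entities_to_rows(source_uri, document):
    """Flatten a Document AI response into one dict per extracted entity.

    Assumption: `document` is the parsed JSON of a v1 Document.
    The output keys are illustrative column names, not a fixed schema.
    """
    rows = []
    for entity in document.get("entities", []):
        rows.append({
            "source_uri": source_uri,            # which input file this came from
            "entity_type": entity.get("type"),
            "mention_text": entity.get("mentionText"),
            "confidence": entity.get("confidence"),
        })
    return rows
```

Keeping the source URI on every row lets a downstream table be joined back to the original file in GCS. Missing fields become `None` rather than raising, so one sparse entity does not fail the whole record.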

User Stories

  • As a data pipeline developer, I should be able to 

Plugin Type

  • Batch Source
  • Batch Sink 
  • Real-time Source
  • Real-time Sink
  • Action
  • Post-Run Action
  • Aggregate
  • Join
  • Spark Model
  • Spark Compute

Configuration

Invoice API

https://cloud.google.com/document-understanding/alpha/docs/quickstart-invoice

User Facing Name | Type | Description | Default value | Notes

Table Parsing API

https://cloud.google.com/document-ai/docs/process-tables

User Facing Name | Type | Description | Default value | Notes

Form Parsing or KV API

https://cloud.google.com/document-ai/docs/process-forms

User Facing Name | Type | Description | Default value | Notes

Design / Implementation Tips

Design - To be filled in later

Approach(es)

Properties

Security

Limitation(s)

Future Work

Test Case(s) - To be filled in later

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2

References

  • Documentation Links go here