[Draft] Process Mainframe/EBCDIC files using COBOL copybook

Mainframes store data in EBCDIC, whereas most modern systems use ASCII. In addition, mainframe data structures can be complex and are generally specified through COBOL copybook records. Nested fields, packed decimals, arrays, and REDEFINES clauses make translation code complex and difficult to write. This how-to guide explains how to process fixed-length, variable-length, and variable-block EBCDIC files.
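
To make the encoding differences concrete, here is a minimal Python sketch of decoding an EBCDIC text field and a packed-decimal (COMP-3) field. It assumes code page 037 (Python's built-in cp037 codec); the code page of your files may differ. The helper names decode_text and unpack_comp3 are hypothetical and are not part of any plugin.

```python
from decimal import Decimal


def decode_text(raw: bytes, codepage: str = "cp037") -> str:
    """Decode an EBCDIC text field. The code page is an assumption."""
    return raw.decode(codepage)


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field.

    Each byte holds two decimal digits, except the last byte, which holds
    one digit followed by a sign nibble (0xD or 0xB means negative).
    `scale` is the number of implied decimal places from the copybook.
    """
    if not raw:
        raise ValueError("empty packed-decimal field")
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign = raw[-1] & 0x0F
    if any(d > 9 for d in digits) or sign < 0x0A:
        raise ValueError("not a valid packed-decimal field")
    value = int("".join(str(d) for d in digits))
    if sign in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)


print(decode_text(bytes.fromhex("C8C5D3D3D6")))        # HELLO
print(unpack_comp3(bytes.fromhex("12345C"), scale=2))  # 123.45
```

The same bytes read as ASCII would be unreadable, which is why both the code page and the copybook layout are needed to interpret a record.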

This how-to does not cover how to export data from mainframes. There are various ways to bring datasets out of DB2 or IMS. This article assumes that you have used a DBMS export or another utility (such as FTP) to bring the datasets into flat files.

Before You Start

Store EBCDIC data on GCS

EBCDIC data exported from mainframes can be stored on GCS in the same way as any other file. The files are binary and can only be interpreted with a COBOL copybook, which makes it harder to debug issues with the data or the system.
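
Because the files are binary, one way to inspect them while debugging is a side-by-side hex and EBCDIC dump of the first few bytes. The sketch below is illustrative only: it works on bytes you have already read from the file, assumes code page 037, and uses a hypothetical helper name, ebcdic_dump.

```python
def ebcdic_dump(data: bytes, width: int = 16, codepage: str = "cp037") -> str:
    """Return a hex dump with the EBCDIC-decoded text next to each row."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = chunk.hex(" ")  # separator argument needs Python 3.8+
        # Replace non-printable characters with "." so columns stay aligned.
        text = "".join(
            ch if ch.isprintable() else "." for ch in chunk.decode(codepage)
        )
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)


# Five EBCDIC letters followed by a binary zero byte.
print(ebcdic_dump(bytes.fromhex("C8C5D3D3D600")))
```

Readable text in the right-hand column usually means the code page guess is plausible; packed and binary fields will still appear as dots or odd characters.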

Access to COBOL Copybook

To process the EBCDIC data, you need a copybook that defines the structure of the data within the EBCDIC file.

More than one copybook might be required to process all the data in a file, because records with different structures can exist within the same file.

Understanding how file(s) were exported from the mainframe

Datasets (or dataset partitions) can be exported from DB2 or IMS in various ways. It’s important to understand:

Whether the file is text or binary,
What RECFM was specified when the data was transferred via FTP or copied from the mainframe,
What the code page is,
Whether the file exported from the mainframe is big-endian (as on IBM mainframes) or little-endian.
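
Byte order matters for binary (COMP) fields: the same bytes decode to different numbers depending on the assumed endianness. A minimal Python sketch, using a hypothetical helper name decode_comp:

```python
def decode_comp(raw: bytes, byteorder: str = "big", signed: bool = True) -> int:
    """Decode a binary (COMP) integer field from its raw bytes.

    IBM mainframes store binary integers big-endian, so "big" is the default.
    """
    return int.from_bytes(raw, byteorder=byteorder, signed=signed)


raw = bytes.fromhex("00000100")
print(decode_comp(raw))            # 256
print(decode_comp(raw, "little"))  # 65536
```

If decoded numeric fields look implausibly large or small, a wrong byte-order assumption is one thing to check.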

Note that this article only covers parsing mainframe files with a single copybook record.

Steps

Following are the steps to set up a pipeline that processes EBCDIC mainframe files.

Deploy Mainframe Plugin from Hub

Create a Pipeline

To build a pipeline that reads EBCDIC data and moves it into BigQuery or any other target system, you need to use the “Mainframe Record Reader” plugin.

Select the plugin named “Mainframe Record Reader” from the left drawer under the section header “Source”.
Move it onto the canvas.
Select the target of your choice from the section with the header “Sink”.
Connect the “Mainframe Record Reader” source to the sink you selected with an edge.

It’s that simple: you have now created a pipeline that reads data from EBCDIC files and moves it into the target. The next step is to configure the “Mainframe Record Reader”. This article does not cover configuring the target; we assume that you are already familiar with it.

Configure Mainframe Record Reader Plugin

The “Mainframe Record Reader” has four sections that might require setup, depending on how the file was exported and transferred. The four sections are:

General,
Record Associations,
Record Selector, and
Experimental

General

In the General section, a few important configurations need to be set up correctly. If the configuration does not match the attributes of the file being processed, processing will fail. Such failures can be hard to debug because the input file is binary.

Record Format (RECFM) - Record format specifies the type of records in the file. Records from a mainframe can be Fixed Length, Variable Length, or Variable Block. Select the configuration that matches the file or group of files you are processing.
- (RECFM=F) Fixed Length record files have records that are all the same size (in bytes). There are no EOL or CTRL-M characters indicating the end of a line; the file is just a stream of bytes.
- (RECFM=V) Variable Length record files have records of varying sizes. Different sizes typically indicate that there are different copybooks; each copybook could be associated with one record size.
- (RECFM=VB) Variable Block record files also contain variable-length records, but the records are grouped into blocks. Such files are easy to process in parallel.
Code Page
Dialect
Copybook
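
The three record formats described above can be sketched in Python as follows. This is not the plugin's implementation. It assumes that, for RECFM=V and RECFM=VB, the transfer preserved the 4-byte record descriptor word (RDW) and block descriptor word (BDW), each starting with a 2-byte big-endian length that includes the descriptor itself. Whether the descriptors are present depends on how the file was transferred. The function names are hypothetical.

```python
import struct
from typing import Iterator


def iter_fixed(data: bytes, record_length: int) -> Iterator[bytes]:
    """RECFM=F: cut the byte stream into equal-sized records."""
    if record_length <= 0 or len(data) % record_length:
        raise ValueError("data is not a whole number of records")
    for offset in range(0, len(data), record_length):
        yield data[offset:offset + record_length]


def _descriptor_length(data: bytes, offset: int) -> int:
    """Read a 4-byte descriptor word and return its length field."""
    if len(data) - offset < 4:
        raise ValueError("truncated descriptor word")
    length, _reserved = struct.unpack_from(">HH", data, offset)
    if length < 4 or offset + length > len(data):
        raise ValueError("descriptor length is out of range")
    return length


def iter_variable(data: bytes) -> Iterator[bytes]:
    """RECFM=V: each record is preceded by an RDW (assumed present)."""
    offset = 0
    while offset < len(data):
        length = _descriptor_length(data, offset)
        yield data[offset + 4:offset + length]  # payload without the RDW
        offset += length


def iter_variable_blocked(data: bytes) -> Iterator[bytes]:
    """RECFM=VB: each block starts with a BDW and contains RDW records."""
    offset = 0
    while offset < len(data):
        block_length = _descriptor_length(data, offset)
        yield from iter_variable(data[offset + 4:offset + block_length])
        offset += block_length


# Two variable-length records, then the same records wrapped in one block.
records = struct.pack(">HH", 7, 0) + b"abc" + struct.pack(">HH", 5, 0) + b"z"
block = struct.pack(">HH", 4 + len(records), 0) + records
print(list(iter_variable(records)))        # [b'abc', b'z']
print(list(iter_variable_blocked(block)))  # [b'abc', b'z']
```

Because each block carries its own length, blocks can be located without parsing every record.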

Record Associations

Record Selector

Partitioning for parallel processing
