Mainframes store data in EBCDIC, which presents a challenge because most modern systems use ASCII. In addition, the structure of mainframe data can be complex and is generally specified through COBOL copybook records. The complications come from nested fields, packed decimals, arrays, and REDEFINES, which make code translation complex and difficult to achieve. In this how-to guide, we provide information on how to process fixed-length, variable-length, and variable-block EBCDIC files.
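As a concrete illustration of why these files cannot be read as plain text, the following minimal Python sketch decodes a single hypothetical 14-byte record containing an EBCDIC text field and a COMP-3 (packed-decimal) amount. The record layout, field names, and values are invented for this example; real layouts come from your copybook.

    import codecs

    # Hypothetical 14-byte record (layout invented for this example):
    #   05 CUST-NAME  PIC X(10).           -> 10 bytes of EBCDIC text
    #   05 CUST-BAL   PIC S9(5)V99 COMP-3. -> 4 bytes packed decimal, 2 implied decimals
    record = bytes.fromhex("D1C1D5C540C4D6C540401234567C")

    # Text fields: decode with an EBCDIC code page (cp037 is common for US data).
    name = codecs.decode(record[0:10], "cp037").rstrip()

    # COMP-3 fields: each half-byte holds a digit, and the last half-byte is the sign.
    digits = "".join(f"{b:02X}" for b in record[10:14])  # "1234567C"
    value, sign = int(digits[:-1]), digits[-1]
    balance = (-value if sign == "D" else value) / 100   # apply sign and implied decimals

    print(name, balance)  # JANE DOE 12345.67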
This how-to does not cover how to export data from mainframes. There are various ways to bring datasets out of DB2 or IMS; this article assumes that you have used a DBMS export or another utility (like FTP) to bring the datasets into flat files.
Before You Start
Store EBCDIC data on GCS
EBCDIC data exported from mainframes can be stored on GCS in the same way as any other file. The files are binary and require a COBOL copybook to interpret them, which makes debugging issues with the data or the system challenging.
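Because the files are opaque binary, a quick way to sanity-check an exported file is to dump the first few bytes in hex alongside their cp037 interpretation. This is only a debugging sketch; the file path is a placeholder for data you have already downloaded from GCS, and cp037 is assumed as the code page.

    import codecs

    # Placeholder path; substitute the EBCDIC file you are inspecting.
    with open("customer.ebcdic.dat", "rb") as f:
        chunk = f.read(64)

    for offset in range(0, len(chunk), 16):
        row = chunk[offset:offset + 16]
        hex_part = " ".join(f"{b:02X}" for b in row)
        # Show the cp037 decoding, masking non-printable characters.
        txt_part = "".join(c if c.isprintable() else "." for c in codecs.decode(row, "cp037"))
        print(f"{offset:04X}  {hex_part:<47}  {txt_part}")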
Access to COBOL Copybook
To process the EBCDIC data, you need a copybook that defines the structure of the data within the EBCDIC file.
Records with different structures can exist within the same file, so more than one copybook might be required to process all the data in the file.
Understanding how the file(s) were exported from the mainframe
Datasets (or dataset partitions) can be exported in various ways from DB2 or IMS. It’s important to understand:
Whether the file is text or binary,
What RECFM was specified when the data was FTPed or copied from the mainframe,
What the code page is,
Whether the file exported from the mainframe is big-endian (IBM mainframes) or little-endian
Note that this article only covers how to parse one copybook record for mainframe files.
Steps
Use the following steps to set up a pipeline that will process EBCDIC mainframe files.
Deploy Mainframe Plugin from Hub
Create a Pipeline
To build a pipeline that can read EBCDIC data and move the data into BigQuery or any other target system, you need to use the ‘Mainframe Record Reader’ plugin.
Select the plugin named “Mainframe Record Reader” from the left drawer, under the section header “Source”.
Move the Mainframe Record Reader source into the canvas.
Select the target of your choice from the section with the header “Sink”.
Connect the “Mainframe Record Reader” source to the sink you selected with an edge.
It’s that simple: you have now created a basic pipeline that can read data from EBCDIC files and move it into the target. The next step is to configure the “Mainframe Record Reader”. In this article, we will not cover configuring the target (we assume that you are well versed there).
Configure Mainframe Record Reader Plugin
The “Mainframe Record Reader” has four sections that might require setup depending on how the file was exported and transferred. The setup is grouped into the following four sections:
General,
Record Associations,
Record Selector, and
Experimental
General
In the General section, there are a few important configurations that need to be set up correctly. If the configuration does not match the attributes of the file being processed, processing will fail, and failures can be hard to debug due to the binary nature of the input file.
Record Format (RECFM) - The record format specifies the type of records in the file. Records from the mainframe can be Fixed Length, Variable Length, or Variable Block. Select the right configuration based on your knowledge of the file or group of files you are processing.
(RECFM=F) Fixed-length record files have all records of the same size (in bytes). There are no EOL or CTRL-M characters indicating the end of a record; the file is just a stream of bytes.
(RECFM=V) Variable-length record files have records of varying sizes. Different sizes typically indicate different copybooks; each copybook could be associated with one record size.
(RECFM=VB) Variable-block record files also contain variable-length records, but the records are grouped into blocks. Such files are easier to process in parallel.
If you do not know the record format of the file, start with RECFM=V; if there is a mismatch, processing will fail. It is very difficult to detect whether the records within the file are variable length or fixed length, as everything in the file is just a stream of bytes. The sketch below shows how variable-length records carry their own lengths.
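The following Python sketch walks the records of a RECFM=V file, assuming the transfer preserved the 4-byte Record Descriptor Word (RDW) in front of each record; whether the RDW is present depends on the options used when the file was exported, so treat that as an assumption. The function name and file path are ours, not part of the plugin.

    import struct

    def iter_variable_records(path):
        """Yield the payload of each record in a RECFM=V file whose records
        are still prefixed with a 4-byte Record Descriptor Word (RDW)."""
        with open(path, "rb") as f:
            while True:
                rdw = f.read(4)
                if len(rdw) < 4:
                    break  # end of file
                # The first two RDW bytes hold the record length (big-endian),
                # and that length includes the 4-byte RDW itself.
                (length,) = struct.unpack(">H", rdw[:2])
                yield f.read(length - 4)

    # Example: print the length of each record in a hypothetical exported file.
    for i, rec in enumerate(iter_variable_records("claims.ebcdic.dat")):
        print(i, len(rec))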
Code Page - The code page defines the character encoding that associates a unique number with each printable and control character. Mainframes defined different code pages for different regions, so the code page should be chosen based on the origin or character set of the mainframe. For example, mainframes in the US typically use code page cp037.
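Choosing the wrong code page does not make processing fail; it silently produces wrong characters. The small sketch below decodes the same bytes with three EBCDIC code pages that ship with Python (cp037, cp500, cp1140). Letters and digits agree across them, but the special-character area diverges, which is where the corruption shows up.

    # The same EBCDIC bytes decoded with different code pages.
    raw = bytes.fromhex("C1 C2 C3 4A 5A F1 F2 F3")  # letters, two special bytes, digits

    for codepage in ("cp037", "cp500", "cp1140"):
        print(codepage, raw.decode(codepage))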
Dialect - Specifies the endianness of the mainframe. IBM mainframes default to Mainframe (Big-Endian), while Intel- and Fujitsu-based mainframes have a different endianness.
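What the Dialect setting controls can be shown with a two-byte binary (COMP) field: the same bytes read as big-endian and as little-endian yield different numbers. A minimal sketch:

    import struct

    raw = bytes.fromhex("0102")  # a two-byte binary (COMP) field

    (big,)    = struct.unpack(">h", raw)  # Mainframe (Big-Endian) dialect
    (little,) = struct.unpack("<h", raw)  # little-endian dialect

    print(big, little)  # 258 513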
Copybook - Specifies the COBOL copybook that contains the structure of the data files. The copybook contains only the fields and datatypes used in the COBOL file. The plugin can directly import COBOL copybooks (.cpy files) as definitions for generating the target schema. The schema definition is based on analyzing the entire copybook, including REDEFINES and OCCURS. The schema can be simple or complex, and various types of copybooks are currently supported.
How to split copybook - Depending on how you want to interpret the records within the file, you can either choose not to split an individual record into different records (Do not split) or choose to split the record at REDEFINE.
If you have selected ‘Do not split’ and the file has various record types, then you should configure the Record Selector. If the entire file is of a single record type, then the Record Selector is not required.
If you have selected ‘Split on REDEFINE’, you should configure Record Associations. This allows the different records that are split at REDEFINE to be associated with record types.
Output Schema - Specifies whether the output schema represented in the target is a flat structure or the nested structure defined in the copybook. See the Field Names and Output Schema Types section for how the fields from the copybook are translated into the target schema.
If you have specified the Output Schema type as ‘Flat’, then the copybook split has no effect on the record being read. This option matters when the Output Schema type is ‘Hierarchical’.
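The difference between the two output schema types can be pictured with a small example. The copybook fragment, field values, and resulting layouts below are illustrative only; the exact schema the plugin emits follows the rules described in the Field Names and Output Schema Types section.

    # Illustrative copybook fragment:
    #   05 CUSTOMER.
    #      10 CUST-NAME  PIC X(10).
    #      10 CUST-PHONE OCCURS 2 PIC X(12).

    # Flat output: groups are collapsed and OCCURS are expanded with _<n> suffixes.
    flat_record = {
        "CustName": "JANE DOE",
        "CustPhone_1": "555-000-0001",
        "CustPhone_2": "555-000-0002",
    }

    # Hierarchical output: the copybook nesting is kept as records and arrays.
    nested_record = {
        "Customer": {
            "CustName": "JANE DOE",
            "CustPhone": ["555-000-0001", "555-000-0002"],
        }
    }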
Record Associations
Record Selector
When to split Copybook?
There are various scenarios.
Field Names and Output Schema Types
Field Names
As modern target storage systems do not support hyphens (-) in schema field names, the 'Mainframe Record Reader' translates the copybook record field names into a CamelCase representation. For example, if the name of the field in the copybook is CL-MED-PRE-AUTHO-DT-CC, it is translated to ClMedPreAuthoDtCc in the target system as well as in the pipeline schema.
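A minimal Python sketch of that translation, turning hyphen-separated copybook names into the CamelCase form used in the pipeline schema (the function name is ours, not part of the plugin):

    def to_schema_name(copybook_field: str) -> str:
        """Translate a copybook field name such as CL-MED-PRE-AUTHO-DT-CC
        into the CamelCase form used in the target schema."""
        return "".join(part.capitalize() for part in copybook_field.split("-"))

    print(to_schema_name("CL-MED-PRE-AUTHO-DT-CC"))  # ClMedPreAuthoDtCc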
Flat Structure
In a flat structure, OCCURS are expanded and the field name is suffixed with _<n> (or _<n>_<m>) at the end. For example, if you have a field in the copybook declared as WS-FIN-AMT OCCURS 5 PIC S9(9)V99 COMP-3, then in the output schema you will find WS-FIN-AMT_1, WS-FIN-AMT_2, WS-FIN-AMT_3, WS-FIN-AMT_4, and WS-FIN-AMT_5. With names translated to be compatible with the target system, the field names are WsFinAmt_1, WsFinAmt_2, WsFinAmt_3, WsFinAmt_4, and WsFinAmt_5.
Nested structures (like two-dimensional arrays) are represented with two indexes separated by an underscore (_).
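To make the naming concrete, this sketch generates the flattened names for the WS-FIN-AMT OCCURS 5 example above, plus a hypothetical two-dimensional field (WS-RATE) to show the double-index form; both the helper and the WS-RATE field are ours, not from the plugin.

    def to_schema_name(copybook_field: str) -> str:
        return "".join(part.capitalize() for part in copybook_field.split("-"))

    # One-dimensional OCCURS: WS-FIN-AMT OCCURS 5 PIC S9(9)V99 COMP-3
    print([f"{to_schema_name('WS-FIN-AMT')}_{n}" for n in range(1, 6)])
    # ['WsFinAmt_1', 'WsFinAmt_2', 'WsFinAmt_3', 'WsFinAmt_4', 'WsFinAmt_5']

    # Nested OCCURS (a hypothetical 2 x 3 two-dimensional array) uses two indexes.
    print([f"{to_schema_name('WS-RATE')}_{n}_{m}"
           for n in range(1, 3) for m in range(1, 4)])
    # ['WsRate_1_1', 'WsRate_1_2', 'WsRate_1_3', 'WsRate_2_1', 'WsRate_2_2', 'WsRate_2_3']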
Nested Schema
A nested schema maps the nested structure of the copybook into the target system. The target system should support the following types in order to use a nested schema: RECORD or STRUCT, and ARRAY or UNION. If you are not sure, we recommend using the flat structure.
Partitioning for parallel processing
This is an experimental feature. It allows a single variable-length file to be split into multiple smaller files so that they can be processed in parallel. While splitting a variable-length file is advantageous, it introduces initial overhead to ensure that the file is split at correct record boundaries.
The initial startup time of pipeline execution will add an additional few seconds to tens of minutes, depending on the number of files and the number of records in each file.
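A rough sketch of what that startup work involves, assuming RECFM=V data with RDWs as in the earlier example: record boundaries have to be walked sequentially before the file can be cut into independently processable pieces, and that walk is where the extra time goes. The function, chunk size, and path are illustrative assumptions, not the plugin's implementation.

    import struct

    def split_points(path, target_chunk_bytes=64 * 1024 * 1024):
        """Walk RDW-prefixed variable-length records and return byte offsets
        at which the file can be cut without breaking a record."""
        points, offset, last_cut = [0], 0, 0
        with open(path, "rb") as f:
            while True:
                rdw = f.read(4)
                if len(rdw) < 4:
                    break
                (length,) = struct.unpack(">H", rdw[:2])
                f.seek(length - 4, 1)       # skip the record payload
                offset += length
                if offset - last_cut >= target_chunk_bytes:
                    points.append(offset)   # a safe cut: a record boundary
                    last_cut = offset
        return points

    print(split_points("claims.ebcdic.dat"))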