Mainframes store data in the EBCDIC format, which presents a challenge because most modern systems use ASCII. In addition, mainframe data structures can be complex and are generally specified through COBOL copybook records. The complications come from nested fields, packed decimals, arrays, and REDEFINES, which make translation complex and difficult to achieve. This how-to guide explains how to process variable length/block or fixed length/block EBCDIC files.
This how-to guide does not cover how to export data from mainframes. There are various ways to bring datasets from DB2 or IMS. This article assumes that you have used a DBMS export or other utilities (like FTP) to bring datasets into flat files.
Before You Begin
Store EBCDIC data on GCS
EBCDIC data exported from mainframes can be stored on GCS in the same way as any other file. The files are binary and require a COBOL copybook to interpret them. This makes it harder to debug issues with the data or the system.
...
Whether the file is text or binary,
What RECFM was specified when the data was transferred by FTP or copied from the mainframe,
What is the code page,
Whether the file exported from mainframe was Big-endian (IBM mainframes) or Little-endian
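When checking these attributes, it can help to look at the raw bytes directly. The sketch below assumes a variable-length file that was transferred in binary mode with its record descriptor words (RDWs) preserved and without block descriptor words; whether that holds depends on how the transfer was done. Each RDW is four bytes: a two-byte big-endian length that includes the RDW itself, followed by two reserved bytes. The function name `split_rdw_records` is illustrative and is not part of the plugin.

```python
import struct
from typing import List


def split_rdw_records(data: bytes) -> List[bytes]:
    """Split a byte string into record payloads using 4-byte RDW headers."""
    records = []
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError(f"truncated RDW at offset {offset}")
        # ">H" reads an unsigned 16-bit integer in big-endian byte order.
        (length,) = struct.unpack_from(">H", data, offset)
        if length < 4 or offset + length > len(data):
            raise ValueError(f"invalid record length {length} at offset {offset}")
        # The length counts the RDW itself, so the payload starts 4 bytes in.
        records.append(data[offset + 4:offset + length])
        offset += length
    return records


# Two hand-built records: "ABC" and "12" encoded in EBCDIC (cp037).
sample = bytes.fromhex("00070000C1C2C3") + bytes.fromhex("00060000F1F2")
for payload in split_rdw_records(sample):
    print(payload.decode("cp037"))
```

If the lengths come out implausibly large, the file was probably not transferred with RDWs, or the length bytes are being read with the wrong byte order.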
...
To build a pipeline that can read EBCDIC data and move the data into BigQuery or any other target system, you need to use the ‘Mainframe Record Reader’ source plugin.
...
Select the source called “Mainframe Record Reader” from the left drawer with section header “Source”.
Move the Mainframe Record Reader source into the canvas.
Select the target of your choice from the section with the header “Sink”.
Connect the “Mainframe Record Reader” source to the sink you have selected.
It’s that simple. You have now created a simple pipeline that can read data from EBCDIC files and move data into the target. The next step is to configure the “Mainframe Record Reader”. In this article, we will not talk about configuring the target. (We are assuming that you are well versed there.)
...
The “Mainframe Record Reader” has four areas that might require setup, depending on how the file was exported and transferred:
General,
Record Associations,
Record Selector, and
Experimental
General
...
In the General section, there are a few important configurations that need to be set up correctly. If the configurations do not match the attributes of the file being processed, processing will fail. Such failures can be hard to debug because the input file is binary.
...
Code Page - The code page defines the character encoding that associates a unique number with each printable and control character. Mainframes define different code pages for different regions, so the code page should be set according to the origin or character set of the mainframe. For example, mainframes in the US typically use code page cp037.
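To see why the code page matters, the following minimal sketch decodes the same bytes with Python's built-in `cp037` codec. It only illustrates the idea and is not how the plugin is implemented; the helper name `decode_ebcdic` is hypothetical.

```python
def decode_ebcdic(raw: bytes, code_page: str = "cp037") -> str:
    """Decode EBCDIC bytes to a Python string using the given code page."""
    return raw.decode(code_page)


# "HELLO 123" as it would appear in a cp037-encoded file.
raw = bytes.fromhex("C8C5D3D3D640F1F2F3")
print(decode_ebcdic(raw))  # HELLO 123

# The same character can map to different bytes in different EBCDIC
# code pages, so a wrong code page silently corrupts some characters.
print("[".encode("cp037").hex(), "[".encode("cp500").hex())
```

Letters and digits tend to agree across EBCDIC code pages, so a wrong setting often shows up only in punctuation and national characters.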
...
Copybook - Specifies the COBOL copybook that contains the structure of the data files. The copybook contains only the fields and data types used in the COBOL file. The plugin can directly import COBOL copybooks (.cpy files) as definitions for generating the target schema. The schema definition is based on analyzing the entire copybook, including REDEFINES and OCCURS. The schema can be simple or complex. Various types of copybooks are currently supported.
Output Schema type - Specifies whether the output schema that gets represented in the target is a Flat structure or a Nested structure as defined in the copybook. See the Field names and Output Schema types section to learn how the fields from a copybook are translated into the target schema.
...
“Do not split” will consider the entire copybook as a single record. Any sub-records defined in the copybook also become sub-records in the target system.
For the copybook in the example above, a single record named ‘Root’ is created, and within it are the two records ‘WS-RECORD-A’ and ‘WS-RECORD-B’. The file contains both record types, and because any given record holds either WS-RECORD-A or WS-RECORD-B, both sub-records are automatically made Nullable.
When “Do not split” is configured, there is often a need to separate records based on some selection criteria. To select only the records that can be processed with the configured copybook, we recommend that you use “Record Selector”. For example, when a single EBCDIC file contains multiple records defined by multiple copybooks, the pipeline has to select only the records that match the structure defined by the copybook you are configuring.
In the example above, we strongly recommend using the ‘Split on REDEFINE’ option instead of the ‘Do not split’ option.
“Split on REDEFINE” will split the copybook at every REDEFINE (top level) into separate records.
If you have selected ‘Split on REDEFINE’, you must configure Record Association. This allows the different records that are split at REDEFINE to be associated with record types.
...
Record association provides a way to associate records based on the value of a field in the record. For example, if WS-REC-TYPE is 00001, then WS-RECORD-A should be populated, and if WS-REC-TYPE is 00002, then WS-RECORD-B should be populated. This section allows you to define those associations easily. To do so, you use one field in the record (note that the field must be present in all the records) whose value determines the record to be picked. For the above example, this section of the plugin is configured as follows:
...
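Conceptually, a record association is a lookup from the value of the discriminating field to the record that should be populated. The Python sketch below illustrates that idea using the WS-REC-TYPE values from the example; it is not the plugin's configuration syntax, and the names `RECORD_ASSOCIATIONS` and `associate_record` are hypothetical.

```python
from typing import Optional

# Value of WS-REC-TYPE -> record to populate (values taken from the example).
RECORD_ASSOCIATIONS = {
    "00001": "WS-RECORD-A",
    "00002": "WS-RECORD-B",
}


def associate_record(rec_type: str) -> Optional[str]:
    """Return the record associated with a WS-REC-TYPE value, or None."""
    return RECORD_ASSOCIATIONS.get(rec_type)


for rec_type in ("00001", "00002", "00009"):
    print(rec_type, "->", associate_record(rec_type))
```

A value with no association returns `None` here; how the plugin treats unmatched values is not covered by this sketch.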
Record selector allows filtering COBOL records from the file being read. COBOL field names generally contain dashes (-). To make it easy to specify expressions, field mappings are necessary. The field mappings associate COBOL field names with the titled names used in the expression. Expressions can be used to process only the records you are interested in. The rest of the records will be discarded or sent to error, depending on configuration. In the case of REDEFINES, please use the redefined COBOL field name. For the above example, this section of the plugin is configured as follows:
...
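The interaction between field mappings and the selection expression can be sketched in Python as follows. This is only an illustration of the concept: a plain Python callable stands in for the plugin's expression, and the mapping `WS-REC-TYPE` to `RecType`, as well as the function names, are assumptions made for the example.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical mapping from a COBOL field name to the titled name
# that the expression refers to.
FIELD_MAPPINGS = {"WS-REC-TYPE": "RecType"}


def rename_fields(record: Record, mappings: Dict[str, str]) -> Record:
    """Expose mapped COBOL fields under their titled names."""
    return {mappings.get(name, name): value for name, value in record.items()}


def select_records(
    records: Iterable[Record],
    mappings: Dict[str, str],
    predicate: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Split records into (selected, rejected) using the predicate."""
    selected, rejected = [], []
    for record in records:
        view = rename_fields(record, mappings)
        (selected if predicate(view) else rejected).append(record)
    return selected, rejected


records = [{"WS-REC-TYPE": "00001"}, {"WS-REC-TYPE": "00002"}]
keep, drop = select_records(records, FIELD_MAPPINGS, lambda r: r["RecType"] == "00001")
print(len(keep), "selected,", len(drop), "rejected")
```

The rejected list corresponds to the records that would be discarded or sent to error, depending on configuration.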
As modern target storage systems do not support schema field names that include underscore (_), the 'Mainframe Record Reader' translates the copybook record field names into Hungarian representation. For example, if the name of the field in the copybook is CL-MED-PRE-AUTHO-DT-CC, then it is translated to ClMedPreAuthoDtCc in the target system as well as in the pipeline schema.
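The translation in this example can be reproduced with a few lines of Python. This is a minimal sketch inferred from the single example above, not the plugin's actual code; the function name `to_target_name` is hypothetical, and the plugin may treat edge cases differently.

```python
def to_target_name(cobol_name: str) -> str:
    """Drop the dashes and capitalize each dash-separated part."""
    # str.capitalize() upper-cases the first character and lower-cases the rest.
    return "".join(part.capitalize() for part in cobol_name.split("-") if part)


print(to_target_name("CL-MED-PRE-AUTHO-DT-CC"))  # ClMedPreAuthoDtCc
```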
...
In flat structure, OCCURS are expanded and the field name is suffixed with _<n> (or _<n>_<m>) at the end. For example, if you have a field in the copybook WS-FIN-AMT OCCURS 5 PIC S9(9)V99 COMP-3, then, in the output schema, you will find WS-FIN-AMT_1, WS-FIN-AMT_2, WS-FIN-AMT_3, WS-FIN-AMT_4, and WS-FIN-AMT_5. With names translated to be compatible with the target system, the field names are as follows: WsFinAmt_1, WsFinAmt_2, WsFinAmt_3, WsFinAmt_4, and WsFinAmt_5.
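To make the flat expansion concrete, the sketch below decodes five packed-decimal (COMP-3) values and assigns them to the suffixed field names. It assumes the standard packed-decimal layout, in which each byte holds two decimal digits and the last nibble holds the sign, so PIC S9(9)V99 COMP-3 occupies 6 bytes. The helper names `unpack_comp3` and `flatten_occurs` are hypothetical and do not reflect the plugin's internals.

```python
from decimal import Decimal
from typing import Dict


def unpack_comp3(data: bytes, scale: int) -> Decimal:
    """Decode one packed-decimal field with `scale` implied decimal places."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # the last nibble is the sign
    if any(n > 9 for n in nibbles):
        raise ValueError("invalid digit nibble in packed decimal")
    value = Decimal(int("".join(str(n) for n in nibbles))).scaleb(-scale)
    # 0xD and 0xB mean negative; 0xC, 0xF, 0xA and 0xE mean positive.
    return -value if sign in (0x0B, 0x0D) else value


def flatten_occurs(buf: bytes, name: str, count: int, size: int, scale: int) -> Dict[str, Decimal]:
    """Expand an OCCURS array into name_1 .. name_<count> fields."""
    return {
        f"{name}_{i + 1}": unpack_comp3(buf[i * size:(i + 1) * size], scale)
        for i in range(count)
    }


# Five 6-byte values: 12345.67, -1.00, and three zeros.
buf = bytes.fromhex("00001234567C" "00000000100D" + "00000000000C" * 3)
for field, amount in flatten_occurs(buf, "WsFinAmt", 5, 6, 2).items():
    print(field, amount)
```

Using `Decimal` rather than `float` keeps the implied two decimal places exact, which matters for financial amounts.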
...
Info |
---|
The initial startup time for pipeline execution might increase by anywhere from 10 seconds to a few minutes, depending on the number of files and the number of records in each file. |
...