...
Datasets (or dataset partitions) can be exported in various ways from DB2 or IMS. It’s important to understand:
...
Info |
---|
Note that this article only covers how to parse one copybook record for mainframe files. |
Steps
...
To build a pipeline that can read EBCDIC data and move it into BigQuery or any other target system, you need to use the ‘Mainframe Record Reader’ plugin.
...
Select the source plugin named “Mainframe Record Reader” from the left drawer, under the section header “Source”.
Move the Mainframe Record Reader source onto the canvas.
Select the target of your choice from the section with the header “Sink”.
Connect the “Mainframe Record Reader” source and the sink you have selected with an edge.
It’s that simple: you have now created a simple pipeline that can read data from EBCDIC files and move it into the target. The next step is to configure the “Mainframe Record Reader”. In this article, we will not talk about configuring the target. (We are assuming that you are well versed there.)
...
Record Format (RECFM) - The record format specifies the type of records in the file. Records from a mainframe can be Fixed Length, Variable Length, or Variable Block. Select the right configuration depending on your knowledge of the file or group of files you are processing. You have to understand how the file was exported and/or how the file was transferred via FTP from the mainframe system into Data Fusion.
(RECFM=F) Fixed Length record files have all records of the same size (in bytes), with no EOL or CTRL-M characters indicating the end of a record. They are just a stream of bytes.
(RECFM=V) Variable Length record files have records of varying sizes. Typically, different sizes might indicate that there are different copybooks. Each copybook could be associated with one record size.
(RECFM=VB) Variable Block record files have variable record length files, but the variable length records are grouped in blocks. Such files are easy to process in parallel.
Info |
---|
If you do not know the record format of the file, start with RECFM=V. If there is a mismatch, the processing will fail. It is very difficult to detect whether the records within the file are variable length or fixed length, as everything in the file is just a stream of bytes. |
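The plugin handles record boundaries internally, but the difference between the formats can be sketched in Python. This is an illustrative sketch only: it assumes RECFM=V records carry the standard 4-byte Record Descriptor Word (RDW) prefix commonly used when such files are exported, and the function names are hypothetical.

```python
import io
import struct

def read_fixed_records(data: bytes, lrecl: int):
    """Yield fixed-length (RECFM=F) records: the file is a plain byte
    stream with no delimiters, so records are sliced every LRECL bytes."""
    for offset in range(0, len(data), lrecl):
        yield data[offset:offset + lrecl]

def read_variable_records(data: bytes):
    """Yield variable-length (RECFM=V) records, assuming each record is
    prefixed with a 4-byte Record Descriptor Word (RDW): a big-endian
    2-byte length (which includes the RDW itself) plus 2 reserved bytes."""
    stream = io.BytesIO(data)
    while True:
        rdw = stream.read(4)
        if len(rdw) < 4:
            break
        (length,) = struct.unpack(">H", rdw[:2])
        yield stream.read(length - 4)

# Two 5-byte fixed records vs. two RDW-prefixed variable records.
fixed = b"AAAAABBBBB"
print(list(read_fixed_records(fixed, 5)))         # [b'AAAAA', b'BBBBB']
variable = b"\x00\x07\x00\x00ABC" + b"\x00\x06\x00\x00XY"
print(list(read_variable_records(variable)))      # [b'ABC', b'XY']
```

Note why auto-detection is hard: without the RDW assumption, nothing in the byte stream distinguishes a fixed-length file from a variable-length one.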
Code Page - The code page defines the character encoding that associates a unique number with each printable and control character. Mainframes defined different code pages for different regions, so the right code page should be chosen depending on the origin or character set of the mainframe. For example, a mainframe in the US will typically use code page cp037.
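Python’s standard codecs include the IBM EBCDIC code pages, so the effect of choosing a code page can be demonstrated directly. This snippet only illustrates cp037 decoding; it is not part of the plugin configuration.

```python
# Python ships with the IBM EBCDIC code pages, including cp037 (US/Canada).
ebcdic_bytes = b"\xc8\x85\x93\x93\x96"   # "Hello" encoded in cp037
print(ebcdic_bytes.decode("cp037"))      # Hello
print("Hello".encode("cp037").hex())     # c885939396
```

Decoding the same bytes with the wrong code page yields garbage characters rather than an error, which is why the code page must match the file’s origin.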
Dialect - Specifies the endianness of the mainframe. IBM mainframes default to Mainframe (Big-Endian), but there are Intel and Fujitsu mainframes that have different endianness.
Copybook - Specifies the COBOL copybook that contains the structure of the data files. A copybook contains only the fields and datatypes used in the COBOL file. The plugin can directly import COBOL copybooks (.cpy files) as definitions for generating the target schema. The schema definition is based on analyzing the entire copybook, including REDEFINES and OCCURS. The schema can be simple or complex. Various types of copybooks are currently supported.
Output Schema type - Specifies whether the output schema represented in the target is a Flat structure or a Nested structure as defined in the copybook. See the Field names and Output Schema types section for how the fields from a copybook are translated into the target schema.
How to split copybook (applicable only when Output Schema type is Nested Structure) - Depending on how you want to interpret the records within the file, you can either choose not to split an individual record into different records (Do not split) or you can choose to split the record at REDEFINE.
Info |
---|
If you have specified the Output Schema type as ‘Flat’, then the copybook split has no effect on the record being read. This option is important only when the Output Schema type is ‘Nested’. |
...
“Do not split” will consider the entire copybook as a single record. Any sub-records defined in the copybook also become sub-records in the target system.
For the copybook in the example above, a single record named ‘Root’ is created, and within it are the two records ‘WS-RECORD-A’ and ‘WS-RECORD-B’. The file contains both record types, but in any extracted record either WS-RECORD-A is present or WS-RECORD-B is present. Both sub-records are therefore automatically made Nullable.
When “Do not split” is configured, there is often a need to separate records based on some selection criteria. For example, in cases where you have a single EBCDIC file that contains multiple records defined by multiple copybooks, the copybook you are configuring in the pipeline will have to select only the records that match its structure. In that case, you have to configure ‘Record Selector’.
In the example above, it’s highly recommended not to use the ‘Do not split’ option, but to go with the ‘Split on REDEFINE’ option.
“Split on REDEFINE” will split the copybook at every REDEFINE (top level) into separate records.
If you have selected ‘Split on REDEFINE’, you must configure Record Association. This allows the different records that are split at REDEFINE to be associated with record types.
...
Record association provides a way to associate records based on the value of a field in the record. For example, if WS-REC-TYPE is 00001, then WS-RECORD-A should be populated, and when WS-REC-TYPE is 00002, then WS-RECORD-B should be populated. This section allows you to define those associations easily. To do so, you use one field in the record (note that the field should be present in all the records) whose value determines the record to be picked. For the above example, this section of the plugin is configured as follows:
...
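The record association logic described above can be sketched in Python. This is only an illustration of what the plugin does internally with the WS-REC-TYPE example; the `associate` function and the dictionary layout are invented for this sketch.

```python
# Map of the selector field's value to the sub-record that should be
# populated, matching the WS-REC-TYPE example from this article.
ASSOCIATIONS = {
    "00001": "WS-RECORD-A",
    "00002": "WS-RECORD-B",
}

def associate(record: dict) -> dict:
    """Keep only the sub-record matching WS-REC-TYPE; null out the other."""
    target = ASSOCIATIONS.get(record["WS-REC-TYPE"])
    return {
        "WS-REC-TYPE": record["WS-REC-TYPE"],
        "WS-RECORD-A": record["WS-RECORD-A"] if target == "WS-RECORD-A" else None,
        "WS-RECORD-B": record["WS-RECORD-B"] if target == "WS-RECORD-B" else None,
    }

row = {"WS-REC-TYPE": "00001",
       "WS-RECORD-A": {"AMT": 10},
       "WS-RECORD-B": {"QTY": 2}}
print(associate(row)["WS-RECORD-B"])  # None
```

Because only one association can match a given record, the non-matching sub-record ends up null, which is why both sub-records must be Nullable in the schema.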
Record selector allows filtering COBOL records from the file being read. COBOL field names generally contain dashes (-); to make it easy to specify expressions, field associations are necessary. The field mappings provide an association between COBOL field names and the titled names used in the expression. Expressions can be used to process only the records you are interested in. The rest of the records will be discarded, or sent to error, depending on configuration. In the case of REDEFINES, use the redefined COBOL field name. For the above example, this section of the plugin is configured as follows:
...
As modern target storage systems do not support schema field names that include dashes (-), the 'Mainframe Record Reader' translates the copybook record field names into a CamelCase representation. For example, if the name of the field in the copybook is CL-MED-PRE-AUTHO-DT-CC, then it is translated to ClMedPreAuthoDtCc in the target system as well as the pipeline schema.
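The translation rule can be sketched in Python; `to_target_name` is a hypothetical helper for illustration, not the plugin’s actual code.

```python
def to_target_name(cobol_name: str) -> str:
    """Translate a dashed COBOL field name into the CamelCase form used
    in the pipeline schema and the target system: split on dashes and
    capitalize each part."""
    return "".join(part.capitalize() for part in cobol_name.split("-"))

print(to_target_name("CL-MED-PRE-AUTHO-DT-CC"))  # ClMedPreAuthoDtCc
print(to_target_name("WS-FIN-AMT"))              # WsFinAmt
```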
...
In a flat structure, OCCURS are expanded and the field name is suffixed with _<n> (or _<n>_<m>) at the end. For example, if you have a field in the copybook WS-FIN-AMT OCCURS 5 PIC S9(9)V99 COMP-3, then, in the output schema, you will find WS-FIN-AMT_1, WS-FIN-AMT_2, WS-FIN-AMT_3, WS-FIN-AMT_4, and WS-FIN-AMT_5. With names translated to be compatible with the target system, the field names are as follows: WsFinAmt_1, WsFinAmt_2, WsFinAmt_3, WsFinAmt_4, and WsFinAmt_5.
In the case of nested structures (like two-dimensional arrays), they are represented with two indexes separated by an underscore (_).
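The flattening of OCCURS into suffixed names can be sketched as follows; `expand_occurs` is a hypothetical helper for illustration only.

```python
def expand_occurs(name, occurs, inner=None):
    """Generate the flattened field names for an OCCURS clause. A nested
    (two-dimensional) OCCURS yields two indexes joined by underscores."""
    if inner is None:
        return [f"{name}_{i}" for i in range(1, occurs + 1)]
    return [f"{name}_{i}_{j}"
            for i in range(1, occurs + 1)
            for j in range(1, inner + 1)]

# WS-FIN-AMT OCCURS 5, after name translation:
print(expand_occurs("WsFinAmt", 5))
# ['WsFinAmt_1', 'WsFinAmt_2', 'WsFinAmt_3', 'WsFinAmt_4', 'WsFinAmt_5']

# A nested OCCURS 2 within OCCURS 2 (two-dimensional):
print(expand_occurs("WsAmt", 2, 2))
# ['WsAmt_1_1', 'WsAmt_1_2', 'WsAmt_2_1', 'WsAmt_2_2']
```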
...
Nested schema maps the nested structure of the copybook into the target system. The target system must support the following types in order to use a nested schema: RECORD or STRUCT, and ARRAY or UNION. If you are not sure, we recommend using the flat structure.
...
This is an experimental feature. It allows you to split a single variable-length file into multiple smaller files so that they can be processed in parallel. While splitting a variable-length file is advantageous, ensuring that the file is split at the correct record boundaries does increase the initial startup time of a pipeline.
Info |
---|
The initial startup time of the pipeline execution might increase anywhere from 10 seconds to tens of minutes, depending on the number of files and the number of records in each file. |
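The startup cost exists because splits must fall on record boundaries, so the file has to be scanned record by record before processing begins. A hypothetical sketch of that scan, assuming RDW-prefixed variable-length records (the `split_points` function and chunk-size parameter are invented for illustration):

```python
import struct

def split_points(data: bytes, target_chunk: int):
    """Walk RDW-prefixed records and collect byte offsets where the file
    can be split without cutting a record in half. Each RDW starts with a
    big-endian 2-byte length that includes the 4-byte RDW itself."""
    points, offset, last = [0], 0, 0
    while offset + 4 <= len(data):
        (length,) = struct.unpack(">H", data[offset:offset + 2])
        offset += length
        # Emit a split point once this chunk has grown large enough.
        if offset - last >= target_chunk:
            points.append(offset)
            last = offset
    return points

# Three records of 7 bytes each (4-byte RDW + 3-byte payload).
data = b"\x00\x07\x00\x00ABC" * 3
print(split_points(data, 7))  # [0, 7, 14, 21]
```

Every record's length field must be read before the next boundary is known, which is why the scan time grows with the number of records per file.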
...