Info |
---|
This article is posted on the CDAP Doc wiki and will be maintained here: https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1165393956/Parsing+CSV+Files |
Overview
This document collects best practices for parsing and cleansing CSV files with Wrangler.
General Tips
- Avoid automatic header detection with the `parse-as-csv` directive; pass `false` as the last argument (`parse-as-csv :col '\t' false`). On large files that are distributed across multiple partitions, the header line, which is the first line of the CSV, is not present in every partition. This results in either a failure or lost records.
- If you have to use header detection with `parse-as-csv`, make sure the files are smaller than 128 MB (lowest data block).
- Sometimes a file looks fine but contains non-printable ASCII characters that usually don't belong in CSV files. These can be hard to track down. Use the `find-and-replace` directive to remove them: `find-and-replace :col 's/[\000-\007\013-\037]//g'`.
- To remove CTRL-M from the end of each line, again use the `find-and-replace` directive: `find-and-replace :col 's/\r$//g'`. Make sure you apply this directive before applying `parse-as-csv`.
- Records that do not have the specified number of columns can be filtered with `send-to-error` or `filter-row`. For example, `send-to-error exp: { this.length() < 4 }` sends all records that have fewer than 4 columns (0..3) to error, and `filter-row exp: { this.length() < 4 } true` filters records that have fewer than 4 columns out of the main set.
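To check what the cleanup patterns and the column-count condition from the tips above do before adding them to a recipe, it can help to try them on a few sample lines outside Wrangler. The following is a minimal Python sketch, not Wrangler code. The function names `clean_line` and `partition_by_width` are hypothetical, and Python's `re` and `csv` modules only approximate how Wrangler applies the same patterns.

```python
import csv
import re

# Same character class as the find-and-replace tip: NUL..BEL (\000-\007)
# and VT..US (\013-\037). Tab (\011) and line feed (\012) are left alone.
CONTROL_CHARS = re.compile(r"[\000-\007\013-\037]")

# CTRL-M (carriage return) at the end of a line.
TRAILING_CR = re.compile(r"\r$")


def clean_line(line):
    """Remove a trailing CTRL-M, then any remaining control characters."""
    line = TRAILING_CR.sub("", line)
    return CONTROL_CHARS.sub("", line)


def partition_by_width(lines, min_columns=4, delimiter=","):
    """Clean and parse each line, then split rows by column count.

    Returns (kept, rejected). Rows with fewer than min_columns columns go
    to rejected, mirroring the condition `this.length() < 4`.
    """
    kept, rejected = [], []
    for line in lines:
        # Cleaning happens before parsing, as the tips recommend.
        row = next(csv.reader([clean_line(line)], delimiter=delimiter))
        (rejected if len(row) < min_columns else kept).append(row)
    return kept, rejected
```

Calling `partition_by_width` on a handful of raw lines from a problem file shows which records would be routed to error and which would stay in the main set.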