Info |
---|
This article is posted on the CDAP Doc wiki and will be maintained here: https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1165393956/Parsing+CSV+Files |
Overview
This document collects best practices for parsing and cleansing CSV files with Wrangler.
General Tips
Avoid using automatic header detection with parse-as-csv
Avoid the automatic header determination directive (`parse-as-csv :col '\t' false`). On large files that are distributed across multiple partitions, the header line, which is the first line of the CSV file, is not present in every partition, so the header cannot be set in the other partitions. This either results in a failure or causes records to be lost. If you have to use the `parse-as-csv` directive, make sure the files are smaller than 128 MB (the lowest data block).
A file that looks fine can still contain non-printable ASCII characters that usually don't belong in CSV files. These can be hard to track down. Use the `find-and-replace` directive to remove them:
`find-and-replace 's/\000-\007\013-\037\177-\377//g'`
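To make the non-printable character tip concrete, here is a minimal Python sketch of the same cleanup done outside Wrangler. It assumes the octal ranges in the `find-and-replace` expression are meant as a single character class. The function names `strip_non_printable` and `find_non_printable` are hypothetical helpers for illustration and are not part of Wrangler or CDAP.

```python
import re

# Assumption: the octal ranges from the find-and-replace tip form one
# character class: \000-\007 (NUL..BEL), \013-\037 (VT..US, which includes
# carriage return) and \177-\377 (DEL and everything above it).
NON_PRINTABLE = re.compile(r"[\000-\007\013-\037\177-\377]")


def strip_non_printable(line):
    """Remove characters in the ranges above; tab and newline are kept."""
    return NON_PRINTABLE.sub("", line)


def find_non_printable(line):
    """Return (offset, escaped character) pairs to help locate bad characters."""
    return [(m.start(), repr(m.group())) for m in NON_PRINTABLE.finditer(line)]


if __name__ == "__main__":
    sample = "id\tname\x00\x1f\n"
    print(find_non_printable(sample))
    print(repr(strip_non_printable(sample)))
```

The upper range also removes every character from `\200` to `\377`, so accented Latin-1 letters are stripped as well. Running a locator such as `find_non_printable` on a sample of the file first shows what would be dropped before the cleanup is applied.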