This article is posted on the CDAP Doc wiki and will be maintained here: https://cdap.atlassian.net/wiki/spaces/DOCS/pages/1165393956/Parsing+CSV+Files

Overview

This document is a collection of best practices for parsing and cleansing CSV files with Wrangler.

General Tips

  • Sometimes a file looks fine but contains non-printable ASCII characters that usually don't belong in CSV files. These can be hard to track down. Use the find-and-replace directive to strip them, for example `find-and-replace :col 's/[\000-\007\013-\037\177-\377]//g'`.

  • Avoid using automatic header detection with the parse-as-csv directive; pass false as the header argument instead (parse-as-csv :col '\t' false). Large files are distributed across multiple partitions, and the header line, which is the first line of the CSV, is present in only one of them, so it is not available to be set in the other partitions. This will either result in failure or records will be lost.

  • If you have to use header detection with the parse-as-csv directive, make sure each file is smaller than 128 MB (the smallest data block size), so that the file is not split across partitions.
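
To check what the non-printable-character cleanup from the tips above removes, the same character ranges can be tried outside Wrangler. The following is a minimal Python sketch, not Wrangler code. The function name `strip_nonprintable` is hypothetical, and the sketch assumes the ranges are meant as a single character class (octal 000-007, 013-037, and 177-377), which leaves tab and newline intact.

```python
import re

# Assumption: the ranges from the find-and-replace tip form one character
# class. Octal 011 (tab) and 012 (newline) fall outside the ranges and are
# kept; carriage return (015) falls inside 013-037 and is removed.
NONPRINTABLE = re.compile(rb"[\000-\007\013-\037\177-\377]")


def strip_nonprintable(raw: bytes) -> bytes:
    """Return raw with every byte in the ranges above removed."""
    return NONPRINTABLE.sub(b"", raw)


if __name__ == "__main__":
    sample = b"id\tname\x00\n1\tAl\x1fice\xff\n"
    print(strip_nonprintable(sample))
```

Running a sample of a suspect file through a check like this before building the recipe shows whether hidden bytes are present and whether removing them changes the column layout.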
