CSV Wrangler

This article is posted on the CDAP Doc wiki and will be maintained here: Parsing CSV Files (archived)

Overview

This document collects best practices for parsing and cleansing CSV files with Wrangler.

General Tips

  • Avoid automatic header detection with the parse-as-csv directive; pass false as the last argument instead (parse-as-csv :col '\t' false). When a large file is distributed across multiple partitions, the header line, which is the first line of the CSV, is not present in every partition. This results in either a failure or lost records.

  • If you have to use the parse-as-csv directive, make sure the files are smaller than 128 MB (the lowest data block size).
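
The effect described in the tips above can be illustrated with a small simulation. This is only a sketch in Python, not Wrangler's actual implementation: the helper names split_into_partitions and parse_with_header_detection are hypothetical, the sample data is made up, and the partitioning is simplified to whole lines grouped by a byte budget. It shows that when each partition treats its own first line as the header, only the first partition gets the real column names, and one record per later partition is consumed as a bogus header.

```python
import csv
import io


def split_into_partitions(text, block_size):
    """Group whole lines into partitions of at most block_size bytes (simplified model)."""
    partitions, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        line_bytes = len(line.encode("utf-8"))
        if current and size + line_bytes > block_size:
            partitions.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        partitions.append("".join(current))
    return partitions


def parse_with_header_detection(partition, delimiter="\t"):
    """Treat the partition's first line as the header, as automatic detection would."""
    rows = list(csv.reader(io.StringIO(partition), delimiter=delimiter))
    header, records = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in records]


# Made-up tab-separated sample: one header line followed by four records.
data = "id\tname\n1\tada\n2\tbob\n3\tcy\n4\tdee\n"

# Tiny block size: the file is spread across several partitions.
small = split_into_partitions(data, block_size=16)
parsed_small = [parse_with_header_detection(p) for p in small]
headers_small = [header for header, _ in parsed_small]
records_small = sum(len(records) for _, records in parsed_small)

# Block size larger than the file: a single partition that contains the header.
large = split_into_partitions(data, block_size=1024)
parsed_large = [parse_with_header_detection(p) for p in large]
records_large = sum(len(records) for _, records in parsed_large)

print("partitions (small blocks):", len(small))
print("headers seen per partition:", headers_small)
print("records kept (small blocks):", records_small)
print("records kept (single block):", records_large)
```

In this model, a file that fits inside one block yields a single partition, so header detection sees the real header and no records are lost, which is the reasoning behind keeping files under the block size when header detection cannot be avoided.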

 
