Overview

This document collects best practices for parsing and cleansing CSV files with Wrangler.

General Tips

  • Avoid automatic header detection with the parse-as-csv directive; pass false as the last argument instead (parse-as-csv :col '\t' false). When a large file is distributed across multiple partitions, the header line, which is the first line of the CSV, is present only in the first partition. Header detection on the other partitions will either fail or lose records.

  • If you have to use the parse-as-csv directive, make sure the files are smaller than 128 MB (the smallest data block size).

  • Sometimes a file looks fine but contains non-printable ASCII characters that usually don't belong in CSV files, and these can be hard to track down. Remove them with the find-and-replace directive: find-and-replace :col 's/[\000-\007\013-\037]//g'.

  • To remove CTRL-M (carriage return) from the end of each line, use the find-and-replace directive again: find-and-replace :col 's/\r$//g'. Make sure you apply this directive before applying parse-as-csv.

  • Records that do not have the expected number of columns can be filtered with send-to-error or filter-row. For example, send-to-error exp : { this.length() < 4 } sends all records that have fewer than 4 columns (0 to 3) to error, and filter-row exp: { this.length() < 4 } true removes records with fewer than 4 columns from the main set.
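
To try the cleansing patterns on a few sample lines outside a pipeline, the tips above can be approximated in plain Python. This is only a sketch and not Wrangler itself: the names cleanse, split_records and MIN_COLUMNS and the sample data are hypothetical, the control-character pattern is written as a bracketed character class, and fields are split naively on the delimiter with no handling of quoted values.

```python
import re

# The patterns from the find-and-replace tips, in Python's re syntax.
# Control characters except backspace, tab and line feed (this range also covers CR).
CONTROL_CHARS = re.compile(r"[\000-\007\013-\037]")
# CTRL-M (carriage return) at the end of the line.
TRAILING_CR = re.compile(r"\r$")

MIN_COLUMNS = 4  # assumed threshold, matching the send-to-error example


def cleanse(line: str) -> str:
    """Strip a trailing CR, then any remaining non-printable characters."""
    line = TRAILING_CR.sub("", line)
    return CONTROL_CHARS.sub("", line)


def split_records(lines, delimiter="\t"):
    """Cleanse each line before splitting it, then separate short rows.

    Returns (good, errors): rows with at least MIN_COLUMNS fields, and the rest.
    """
    good, errors = [], []
    for raw in lines:
        fields = cleanse(raw).split(delimiter)
        (good if len(fields) >= MIN_COLUMNS else errors).append(fields)
    return good, errors


sample = [
    "id\tname\tcity\tzip\r",      # Windows line ending
    "1\tAda\x01\tLondon\tN1\r",   # stray control character inside a field
    "2\tBob\tParis\r",            # only three columns
]
good, errors = split_records(sample)
```

Here good holds the two four-column rows with the control character and carriage returns removed, and errors holds the three-column row, mirroring what send-to-error would route away from the main set.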
