Overview

This document collects best practices for parsing and cleansing CSV files with Wrangler.

General Tips

  • When using parse-as-csv, avoid automatic header determination by passing false as the header argument (parse-as-csv :col '\t' false). Large files are distributed across multiple partitions, and the header row is not available to set column names in every partition.

  • A file that looks fine can still contain non-printable ASCII characters that usually don’t belong in CSV files, and these can be hard to track down. Use the find-and-replace directive to strip them: find-and-replace :col 's/[\000-\007\013-\037\177-\377]//g'.

  • To remove CTRL-M (carriage return) from the end of each line, use the find-and-replace directive again: find-and-replace :col 's/\r$//g'. Apply this directive before parse-as-csv.
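
The order of operations in the tips above can be tried outside Wrangler with a small Python sketch. This is only an analogy, not how Wrangler is implemented: the function names, sample rows, and column names (id, name) are made up for illustration. The regular expressions mirror the ones in the tips, with the octal ranges written as a bracketed character class, and column names are supplied explicitly instead of being inferred from a header row.

```python
import csv
import io
import re

# Octal ranges from the tip above, as a character class: NUL-BEL,
# VT through US (which includes CR), DEL, and code points 128-255.
# Tab (\011) and line feed (\012) are deliberately left alone.
CONTROL_CHARS = re.compile(r"[\000-\007\013-\037\177-\377]")
TRAILING_CR = re.compile(r"\r$")


def clean_line(line: str) -> str:
    """Strip a trailing CTRL-M, then any remaining non-printable characters."""
    line = TRAILING_CR.sub("", line)
    return CONTROL_CHARS.sub("", line)


def parse_tsv(lines, column_names):
    """Clean every line first, then parse as tab-separated data.

    Column names are passed in explicitly (hypothetical names here),
    so no line is ever treated as a header.
    """
    cleaned = [clean_line(line) for line in lines]
    reader = csv.reader(io.StringIO("\n".join(cleaned)), delimiter="\t")
    return [dict(zip(column_names, row)) for row in reader]


# Made-up sample input: embedded NUL and BEL bytes plus trailing CTRL-M.
raw = ["1\tAl\x00ice\r", "2\tBob\x07\r"]
rows = parse_tsv(raw, ["id", "name"])
print(rows)
```

In this Python version the \177-\377 range also removes Latin-1 letters such as é, so it is worth checking whether the data legitimately contains such characters before applying the pattern; the tips above do not say how Wrangler's regex engine treats that range.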