Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

Users can use the command-line tool DistCp to copy files between clusters that run the same or different filesystems. CDAP currently does not support such file operations. We want to add a Hydrator plugin that lets users perform whole-file copies between different types of filesystems or databases from the CDAP UI.

Goals

According to this user request, the new plugin should ideally have the following features:

  1. Support file copying between the following filesystems:
    1. Local directories
    2. HDFS
    3. Amazon S3
    4. FTP
  2. Support failover: after a restart or failure, a copy should resume from where it left off.
  3. Provide a UI that shows copy progress.
  4. Report metrics for each copy process: number of files copied, total size, and elapsed time.
  5. Check network bandwidth and display an estimated completion time.
  6. Preserve each file's timestamp from the source.
  7. Allow path filters to be specified on the fly through the UI.
  8. Support configuring file permissions.
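
To make goals 2, 4, 6, and 7 concrete, here is a minimal sketch of a resumable whole-file copy. It is an illustration under stated assumptions, not the plugin's design: it uses `java.nio.file` on local directories as a stand-in for the filesystems listed above, and the names `WholeFileCopySketch`, `CopyStats`, and `copyTree` are hypothetical. Resume is approximated by skipping any destination file whose size and last-modified time already match the source, so a rerun after a failure only copies what is missing or incomplete.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch, not the plugin's API: whole-file copy between two directories. */
public class WholeFileCopySketch {

  /** Simple counters standing in for the per-copy metrics listed in the goals. */
  public static final class CopyStats {
    public long filesCopied;
    public long filesSkipped;
    public long bytesCopied;
    public long elapsedMillis;
  }

  /**
   * Copies every regular file under sourceDir whose file name matches the glob
   * (for example "*.csv") into destDir, keeping the relative directory layout.
   */
  public static CopyStats copyTree(Path sourceDir, Path destDir, String glob) throws IOException {
    PathMatcher filter = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    CopyStats stats = new CopyStats();
    long start = System.nanoTime();

    // Collect the matching files first so the directory stream is closed before copying.
    List<Path> files;
    try (Stream<Path> walk = Files.walk(sourceDir)) {
      files = walk.filter(Files::isRegularFile)
          .filter(p -> filter.matches(p.getFileName()))
          .collect(Collectors.toList());
    }

    for (Path src : files) {
      Path dest = destDir.resolve(sourceDir.relativize(src).toString());
      if (alreadyCopied(src, dest)) {
        stats.filesSkipped++; // finished in an earlier run, nothing to redo
        continue;
      }
      Files.createDirectories(dest.getParent());
      // COPY_ATTRIBUTES carries the source's last-modified time over to the copy.
      Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.COPY_ATTRIBUTES);
      stats.filesCopied++;
      stats.bytesCopied += Files.size(src);
    }
    stats.elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    return stats;
  }

  /** Treats a file as done when size and last-modified time (to the millisecond) match. */
  private static boolean alreadyCopied(Path src, Path dest) throws IOException {
    return Files.exists(dest)
        && Files.size(dest) == Files.size(src)
        && Files.getLastModifiedTime(dest).toMillis() == Files.getLastModifiedTime(src).toMillis();
  }
}
```

Calling `copyTree` a second time with the same arguments should report the files as skipped instead of copying them again. A real plugin would need a filesystem abstraction that also covers HDFS, Amazon S3, and FTP, and a size-and-timestamp check is only one possible way to detect completed files.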

User Stories 

  • As a cluster administrator, I want to migrate all my files and preserve file structures when upgrading to a newer cluster. 
  • As a data analyst, I want to retrieve data files from a remote FTP location and store them in my cluster, which runs HDFS.
  • As a cluster administrator, I want to copy only files with specific file names to another location.
  • As a pipeline developer, I want to organize files by path and file name and route them to different destinations.

Design

Cover details on assumptions made, design alternatives considered, high level design 

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Response Code | Response
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors |

Deprecated REST API

Path | Method | Description
/v3/apps/<app-id> | GET | Returns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What is the impact on authorization, and how does the design address it?

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of failures in downstream components such as YARN or HBase) and how the design handles these cases

Test Scenarios

Test ID | Test Description | Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work