External File Sets

Motivation

This is a design proposal for https://issues.cask.co/browse/CDAP-2758.

Currently, when a file set is created, we can give a property that specifies a base path in the file system, to which all other paths within the file set are relative (and if that property is not given, it is derived from the name of the file set). This path is treated as a relative path under the data directory of the namespace where the dataset is created. Every file under that base path is considered to "belong" to the file set, and when the file set is deleted or truncated, all these files are removed.

That has two limitations:

  • we cannot use a directory that is outside of the namespace's data dir
  • we cannot use a directory that is managed (and owned) by some other process

Together, these two limitations prevent us from processing files that existed on HDFS before the file set was created, and that were placed there by some other process. This issue addresses these two limitations.

User Stories

  1. A developer needs to write an application that reads existing data on HDFS. This data is managed by some other process outside of CDAP, for example a Flume HDFS sink. Still this data is required by the CDAP application.
  2. A developer needs to make the output of their application available to users outside of CDAP. These users will only read the data, but they should not need to understand CDAP directory structures to access it. Instead, that data may have to be organized according to some corporate conventions in a standard location.

Requirements

Derived from these user stories, we get the following requirements:

  • A CDAP file set must be able to use a base directory that is not inside the CDAP HDFS space. 
  • A CDAP file set must be able to use an existing location on HDFS as its base path.
  • A CDAP file set must be able to prevent writes to a dataset that is managed by an external process.
  • The same requirements apply to PartitionedFileSet and TimePartitionedFileSet.

Design

The first question is how the dataset is configured to use an outside location or an externally managed location. One straightforward approach would be this:

  • If the dataset properties specify no base path, then the dataset is located inside the CDAP file system space, at a location derived from the dataset name.
  • If the dataset properties specify a relative base path (one that does not start with "/"), then the dataset is located inside the CDAP file system space, at the specified relative location.
  • If the dataset properties specify an absolute base path (that starts with "/"), then the dataset is located outside the CDAP space, at the exact absolute path in the file system.

This would be reasonable and intuitive. However, current CDAP (3.0) treats absolute base paths the same as relative base paths: it appends the base path to the CDAP base directory for the namespace of the dataset, which effectively makes absolute paths relative to the namespace root. Following the above approach would therefore break backward compatibility. However, we can also treat the fact that the current code ignores the leading "/" of the path as a bug and fix it in 3.1. We would provide specific documentation, for users who have used an absolute base path in existing applications, on what the new behavior is. We would also need an upgrade procedure for existing file sets that have an absolute base path: this procedure would update the dataset properties to remove the leading "/" from the base path, so that they reflect the actual location of the file set under the fixed semantics.
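The three resolution rules and the upgrade step could be sketched roughly as follows. This is only an illustration, not CDAP code: the class BasePathResolver, its method names, and the namespace data directory used in the usage note are hypothetical, and the sketch assumes that the default location is simply the dataset name appended to the namespace's data dir.

```java
/** Hypothetical helper illustrating the first approach; not part of the CDAP API. */
final class BasePathResolver {

  /**
   * Proposed semantics: an absolute base path is used as-is, anything else
   * is placed under the namespace's data directory.
   */
  static String resolveProposed(String namespaceDataDir, String datasetName, String basePath) {
    if (basePath == null || basePath.isEmpty()) {
      // No base path given: derive the location from the dataset name (assumed: name used verbatim).
      return namespaceDataDir + "/" + datasetName;
    }
    if (basePath.startsWith("/")) {
      // Absolute base path: located outside the CDAP space, at exactly this path.
      return basePath;
    }
    // Relative base path: located inside the namespace's data directory.
    return namespaceDataDir + "/" + basePath;
  }

  /**
   * Upgrade step for file sets created under the old behavior: strip the leading
   * slash(es) so that the property reflects where the data actually lives.
   */
  static String upgradeBasePath(String basePath) {
    if (basePath == null) {
      return null;
    }
    int start = 0;
    while (start < basePath.length() && basePath.charAt(start) == '/') {
      start++;
    }
    return basePath.substring(start);
  }
}
```

With a hypothetical namespace data dir of "/cdap/namespaces/default/data", a file set created in 3.0 with base path "/data/flume" would be upgraded to "data/flume", which the proposed rules resolve to "/cdap/namespaces/default/data/data/flume", the location where 3.0 actually placed it.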

As an alternative, we could say:

  • If the dataset properties specify as the base path a full URL of the form "hdfs:/some/path", then that absolute URL is used as the base path without modification. 
  • Otherwise the dataset is located within the CDAP namespace's file system space.

However, that hard-wires this dataset to an HDFS location. It can therefore not be used in a standalone CDAP or in an in-memory unit test. For these environments, the developer would have to configure the dataset differently, for example "file:/some/path" for standalone. That, however, contradicts the premise of being able to deploy applications without change between different environments. 
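To make the environment dependency concrete, here is a minimal sketch of how a fully qualified base path could be told apart from a plain path using the standard java.net.URI class. The class BasePathScheme and its methods are hypothetical names chosen for this illustration only.

```java
import java.net.URI;

/** Hypothetical helper illustrating the second approach; not part of the CDAP API. */
final class BasePathScheme {

  /** Returns the URI scheme of the base path, or null if it is a plain path without a scheme. */
  static String schemeOf(String basePath) {
    return URI.create(basePath).getScheme();
  }

  /** A base path with a scheme would be used without modification under this alternative. */
  static boolean isFullyQualified(String basePath) {
    return schemeOf(basePath) != null;
  }
}
```

Here "hdfs:/some/path" yields the scheme "hdfs" and "file:/some/path" yields "file", so the same dataset properties cannot serve both a distributed and a standalone deployment.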

Note: In the future, we may want to add this behavior independently of this design discussion. That will be required to configure a dataset to use a different HDFS file system than the default. 

The third alternative is to use explicit properties:

  • A dataset property "fileset.absolute.path" can be used to specify whether the base path is relative to the namespace's data dir or an absolute path in the file system; the default is "false" (relative).

This approach is complementary to the second approach: if we ever decide to support fully qualified URLs for the base path, then such a URL will imply that both "fileset.absolute.path" and "data.external" are "true".

Note that:

  • All approaches naturally apply to PartitionedFileSet and TimePartitionedFileSet through their embedded FileSets.
  • This has no impact on explore: whether and how a file set is explorable is an orthogonal configuration.
  • A dataset property "data.external" can be used to specify whether the file set is managed by an external process, default is "false". If this is true, then CDAP will not:
    • attempt to create the base path 
    • recursively delete the base path and all its contents when the file set is deleted or truncated
    • allow adding files to the file set (the file set is read-only)
  • External file sets cannot have a location within the CDAP file system space (and therefore "data.external" is implicitly true, and it must not be specified as false)
  • If an absolute path is given, it may not be inside the CDAP file system space (because it might collide with other CDAP data)
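The restrictions implied by "data.external" and the location check for absolute paths could be enforced along these lines. This is a sketch under stated assumptions: the class ExternalFileSetGuard and its methods are hypothetical, properties are modeled as a plain string map, and the containment check is a simple prefix comparison that does not normalize ".." segments or symbolic links.

```java
import java.util.Map;

/** Hypothetical guard illustrating the "data.external" semantics; not part of the CDAP API. */
final class ExternalFileSetGuard {

  static final String DATA_EXTERNAL = "data.external";

  /** The property defaults to false when absent (Boolean.parseBoolean(null) is false). */
  static boolean isExternal(Map<String, String> properties) {
    return Boolean.parseBoolean(properties.get(DATA_EXTERNAL));
  }

  /** CDAP creates the base path only for file sets it manages itself. */
  static boolean mayCreateBasePath(Map<String, String> properties) {
    return !isExternal(properties);
  }

  /** CDAP deletes the base path on delete or truncate only for file sets it manages itself. */
  static boolean mayDeleteContents(Map<String, String> properties) {
    return !isExternal(properties);
  }

  /** External file sets are read-only: reject any attempt to add files. */
  static void checkWritable(Map<String, String> properties) {
    if (isExternal(properties)) {
      throw new UnsupportedOperationException("File set is external and therefore read-only");
    }
  }

  /** True if the absolute path equals or lies below the (assumed) CDAP root directory. */
  static boolean isInsideCdapSpace(String absolutePath, String cdapRoot) {
    String root = cdapRoot.endsWith("/") ? cdapRoot : cdapRoot + "/";
    String path = absolutePath.endsWith("/") ? absolutePath : absolutePath + "/";
    return path.startsWith(root);
  }
}
```

The trailing-slash normalization keeps a sibling directory such as "/cdapother" from being mistaken for a path inside a CDAP root of "/cdap".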

Conclusion

We have presented three approaches, of which approach 1 is the most elegant and intuitive:

  • User story 1 is fulfilled by specifying data.external=true, and an absolute base path. 
  • User story 2 is fulfilled by specifying data.external=false, and an absolute base path.
  • This approach is not backwards-compatible with existing CDAP file sets, but we consider this a bug and provide a fix in 3.1; we will have an upgrade tool and precise docs for existing developers.
  • This approach allows future extension for using data on different HDFS file systems (or even non-HDFS). 
