Secure Impersonation - 4.1 RC Documentation

Dataset Improvements

Namespace Permissions

To impersonate an application whose owner principal (user) differs from the namespace owner, the application owner must be able to create the following entities in the namespace.

Directories in HDFS under <namespace_root>/data, to create file sets, and under <namespace_root>/tmp, as a staging location for launching programs:
- If you create the <namespace_root> beforehand and pass it in as a custom root directory when creating the namespace, you must ensure that all users that impersonate applications have write permissions in this directory.
- If you do not specify a custom location for the namespace, you can specify a group name for the namespace.
  - via the group-name option in the CLI:
    create namespace <name> principal <name/host@domain> group-name <group> keytab-URI <keytab-location>
  - via the groupName property in the namespace configuration when using the REST API: http://docs.cask.co/cdap/4.1.0-SNAPSHOT/en/reference-manual/http-restful-api/namespace.html#namespace-configurations
  As a best practice, we recommend making every application owner, as well as the namespace owner, a member of that group. CDAP will create the directories mentioned above with that group ownership and rwx permissions for that group.

Tables in the corresponding HBase namespace to create Table-based datasets
- If you provide a custom HBase namespace when creating the namespace, it is your responsibility to ensure that every application principal can create tables in this namespace.
  - in the HBase shell: grant '<user>', 'AC', '@<namespace>'
  - or grant '@<group>', 'AC', '@<namespace>'
- If you let CDAP create the namespace, it will use the group name specified in the namespace configuration to issue the grant '@<group>', 'AC', '@<namespace>'. In this case it is necessary that all application owners are in that group.

Tables in the namespace's Hive database, to enable Explore for datasets. Depending on the Hive authorization settings:
- The application user must be privileged to create tables in the database
- Hive must be configured to grant all privileges to the user that creates a table (depending on Hive configuration, this may not be the case)
- For any sharing between applications that requires additional permissions, these must be granted manually.

Dataset Permissions

CDAP 4.1 adds the capability to configure dataset permissions through dataset properties, so that users other than the owner of the dataset can access its data.

For filesets, by default, all files and directories are created with the file system's default umask, and with the group of the parent directory. This can be overridden by dataset properties. For example, this configures read, write and execute for the owner and the group "etl":
```
PartitionedFileSetProperties.builder()
  ...
  .setFilePermissions("770")
  .setFileGroup("etl")
  .build();
```
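To make the meaning of the "770" value concrete, here is a minimal sketch that expands a three-digit octal mode into the familiar rwx notation. The class FilePermissions and its method are hypothetical helpers for illustration only; they are not part of the CDAP API.

```java
// Hypothetical helper, not part of CDAP: expands an octal mode into rwx notation.
final class FilePermissions {

  /** Converts a 3-digit octal mode such as "770" into symbolic form such as "rwxrwx---". */
  static String toSymbolic(String octal) {
    if (octal == null || !octal.matches("[0-7]{3}")) {
      throw new IllegalArgumentException("Expected three octal digits, got: " + octal);
    }
    StringBuilder symbolic = new StringBuilder();
    // One digit each for owner, group, and others; each digit is a 3-bit mask.
    for (char digit : octal.toCharArray()) {
      int bits = digit - '0';
      symbolic.append((bits & 4) != 0 ? 'r' : '-');
      symbolic.append((bits & 2) != 0 ? 'w' : '-');
      symbolic.append((bits & 1) != 0 ? 'x' : '-');
    }
    return symbolic.toString();
  }
}
```

With this sketch, FilePermissions.toSymbolic("770") yields "rwxrwx---": full access for the owner and the group, and none for other users.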
For tables, additional permissions can be granted as part of the table creation. For example, this allows read and write for the user "joe" and read only for all members of the group "etl":
```
TableProperties.builder()
  ...
  .setTablePermissions(ImmutableMap.of("joe", "RW", "@etl", "R"))
  .build();
```
Note that this is also needed for PartitionedFileSets, because their partition metadata is stored in an HBase table.
Explore permissions in Hive must be granted manually outside of CDAP.
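
One way to picture the table permissions map shown above is as a set of HBase shell grant statements, one per entry. The sketch below is only an illustration: the class TableGrants and the table name passed to it are hypothetical, and whether CDAP issues exactly these statements internally is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of CDAP: renders a permissions map as HBase shell grants.
final class TableGrants {

  /** Builds one "grant" statement per map entry; keys starting with '@' denote groups. */
  static List<String> toGrantStatements(String table, Map<String, String> permissions) {
    List<String> statements = new ArrayList<>();
    for (Map.Entry<String, String> entry : permissions.entrySet()) {
      String actions = entry.getValue();
      // HBase ACL actions: Read, Write, eXecute, Create, Admin.
      if (!actions.matches("[RWXCA]+")) {
        throw new IllegalArgumentException("Unknown HBase action in: " + actions);
      }
      statements.add(String.format("grant '%s', '%s', '%s'", entry.getKey(), actions, table));
    }
    return statements;
  }
}
```

For the map in the example above and a hypothetical table name, this produces grant 'joe', 'RW', '<table>' and grant '@etl', 'R', '<table>'.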

Custom Hive Database/Table

By default, the Explore table for a dataset is in the enclosing namespace's database and is named dataset_<name>. In CDAP 4.1, you can configure a custom Hive database and table name as follows:

PartitionedFileSetProperties.builder()
  ...
  .setExploreDatabaseName("my_database")
  .setExploreTableName("clicks_gold")
  .build();

Note that the database must already exist; CDAP will not attempt to create it.

Reuse Existing Storage Location

New dataset properties for (Partitioned)FileSets allow configuring an existing, possibly non-empty location for the dataset's files and an existing Hive table. Use:

- FileSetProperties.setUseExisting(true) (or DATA_USE_EXISTING / "data.use.existing") to reuse an existing location and Hive table. The dataset assumes that it does not own the existing data in that location and Hive table; therefore, when you delete or truncate the dataset, the data is not deleted.
- FileSetProperties.setPossessExisting(true) (or DATA_POSSESS_EXISTING / "data.possess.existing") to assume ownership of an existing location and Hive table. The dataset assumes that it owns the existing data in that location and Hive table; therefore, when you delete or truncate the dataset, all data is deleted, including the previously existing data and Hive partitions.

Note that in both cases, the existing partitions in the Hive table are not known to CDAP and therefore only accessible via Hive, not through PartitionedFileSet APIs.
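
The difference between the two properties can be summarized in a small decision sketch. The class ExistingDataPolicy is hypothetical and models only the two property keys described above. Rejecting the case where both are set is an assumption of this sketch, not documented CDAP behavior.

```java
import java.util.Map;

// Hypothetical model of the two properties described above; not CDAP code.
final class ExistingDataPolicy {
  static final String USE_EXISTING = "data.use.existing";
  static final String POSSESS_EXISTING = "data.possess.existing";

  /** Returns true if deleting or truncating the dataset also deletes pre-existing data. */
  static boolean deletesExistingData(Map<String, String> properties) {
    // Boolean.parseBoolean(null) is false, so an absent property counts as "not set".
    boolean useExisting = Boolean.parseBoolean(properties.get(USE_EXISTING));
    boolean possessExisting = Boolean.parseBoolean(properties.get(POSSESS_EXISTING));
    if (useExisting && possessExisting) {
      // Assumption of this sketch: the two modes are treated as mutually exclusive.
      throw new IllegalArgumentException("Set only one of the two properties");
    }
    // Only a dataset that possesses the existing data removes it.
    return possessExisting;
  }
}
```

In this model, a dataset created with "data.use.existing" leaves the existing files and Hive table in place when it is deleted, whereas one created with "data.possess.existing" removes them.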

Cluster Configuration and Setup

To use application-level impersonation in CDAP, you will need to adjust some of your cluster's configuration. Below is a list of changes you might have to make to ensure that your cluster can support application-level impersonation in CDAP. Some of these settings might already exist in your environment, in which case you can skip them.

Enable HBase Authorization (if needed)

Add the following to your hbase-site.xml:

hbase-site.xml

<property>
	<name>hbase.security.exec.permission.checks</name>
	<value>true</value>
</property>
<property>
	<name>hbase.coprocessor.master.classes</name>
	<value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
	<name>hbase.coprocessor.region.classes</name>
	<value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController</value>
</property>

You will need to restart HBase after the above configuration changes.

Configure CDAP for App level impersonation

Application-level impersonation lets applications, datasets, and streams each have their own owner; operations that CDAP performs on an entity then impersonate that entity's owner. To support this, CDAP must have access to the owner principals and their associated keytabs. The owner principal of an entity is provided when the entity is created (see the REST APIs in the Operational APIs section).

To access users' keytabs, CDAP uses the following conventions:

- All keytabs must be present on the local filesystem of the host on which CDAP Master is running.
- The keytabs must be located under a path in one of the following formats, and CDAP must have read access to all of them:
  1. /<dir1>/<dir2>/${name}.keytab
  2. /<dir1>/<dir2>/${name}/${name}.keytab
- The path is provided to CDAP as a configuration parameter in cdap-site.xml, for example:

cdap-site.xml
```
<property>
	<name>security.keytab.path</name>
	<value>/etc/security/keytabs/${name}.keytab</value>
</property>
```
Here, CDAP replaces ${name} with the short user name of the Kerberos principal that it is impersonating.
Note: You will need to restart CDAP for the configuration changes to take effect.
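
The ${name} substitution can be sketched as follows. This is a simplified illustration, not CDAP's implementation: the class KeytabPaths is hypothetical, and it derives the short name by taking the part of the principal before the first "/" or "@", whereas a real cluster may apply Kerberos auth_to_local rules instead.

```java
// Hypothetical illustration of the ${name} substitution; not CDAP's actual code.
final class KeytabPaths {

  /** Simplified short name: the part of the principal before the first '/' or '@'. */
  static String shortName(String principal) {
    int end = principal.length();
    int at = principal.indexOf('@');
    if (at >= 0) {
      end = at;
    }
    int slash = principal.indexOf('/');
    if (slash >= 0 && slash < end) {
      end = slash;
    }
    return principal.substring(0, end);
  }

  /** Replaces every occurrence of ${name} in the configured path with the short name. */
  static String resolve(String keytabPathTemplate, String principal) {
    return keytabPathTemplate.replace("${name}", shortName(principal));
  }
}
```

Under these assumptions, the template /etc/security/keytabs/${name}.keytab and a principal of the form rsinha/<host-name>@<realm> resolve to /etc/security/keytabs/rsinha.keytab.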

Enable Hive SQL-based authorization (if needed):

Add the following to your hive-site.xml and restart Hive:

hive-site.xml

<property>
	<name>hive.server2.enable.doAs</name>
	<value>false</value>
</property>
<property>
	<name>hive.users.in.admin.role</name>
	<value>hive,cdap</value>
</property>
<property>
	<name>hive.security.authorization.manager</name>
	<value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdHiveAuthorizerFactory</value>
</property>
<property>
	<name>hive.security.authorization.enabled</name>
	<value>true</value>
</property>
<property>
	<name>hive.security.authenticator.manager</name>
	<value>org.apache.hadoop.hive.ql.security.ProxyUserAuthenticator</value>
</property>

Note that your hive-site.xml should also be configured to support modifying properties at runtime. Specifically, you will need the following configuration in your hive-site.xml:

hive-site.xml

<property>
	<name>hive.security.authorization.sqlstd.confwhitelist.append</name>
	<value>explore.*|mapreduce.job.queuename|mapreduce.job.complete.cancel.delegation.tokens|spark.hadoop.mapreduce.job.complete.cancel.delegation.tokens|mapreduce.job.credentials.binary|hive.exec.submit.local.task.via.child|hive.exec.submitviachild|hive.lock.sleep.*</value>
</property>
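
To check which property names the whitelist value above would cover, the following sketch treats the value as a single Java regular expression that must match the whole property name. That matching rule is an assumption about how Hive evaluates the setting, the class ConfWhitelist is a hypothetical helper, and Hive's built-in whitelist (to which this value is appended) is ignored here.

```java
import java.util.regex.Pattern;

// Hypothetical helper: tests a property name against a whitelist regex.
final class ConfWhitelist {

  /** Returns true if the whole property name matches the pipe-separated whitelist regex. */
  static boolean isAllowed(String whitelistRegex, String propertyName) {
    // matches() requires the entire name to match one of the alternatives.
    return Pattern.compile(whitelistRegex).matcher(propertyName).matches();
  }
}
```

With the value shown above, a name such as mapreduce.job.queuename or anything starting with "explore." would be accepted, while an unrelated name such as hive.exec.parallel would not.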

Hive Proxy Users

To enable Hive to impersonate other users, set the following in hive-site.xml:

hive-site.xml

<property>
	<name>hive.server2.enable.doAs</name>
	<value>true</value>
</property>

Make sure that Hive is configured to impersonate users who can create or access entities in CDAP. This can be done by adding the following properties to your core-site.xml. The first property allows Hive to impersonate users belonging to "group1" and "group2"; the second allows Hive to impersonate from any host.

core-site.xml

<property>
	<name>hadoop.proxyuser.hive.groups</name>
	<value>group1,group2</value>
</property>

<property>
	<name>hadoop.proxyuser.hive.hosts</name>
	<value>*</value>
</property>

See http://www.cloudera.com/documentation/enterprise/5-2-x/topics/cdh_sg_hive_metastore_security.html for details.

CDAP Authorization (if needed):

Additionally, you might want to enable CDAP authorization. For details on how to enable authorization in CDAP and manage privileges please refer to our documentation here: http://docs.cask.co/cdap/current/en/admin-manual/security/authorization.html?highlight=authorization

Note

Please note that the above cluster configuration is not a comprehensive guide for enabling authorization and/or impersonation on a Hadoop cluster. You might need to add or remove configuration depending on your environment.

Operational APIs

Namespaces

Creating a Namespace

Creating a namespace from the CLI

create namespace testns principal rsinha/<host-name>@<realm> group-name deployers keytab-URI /etc/security/keytabs/rsinha.keytab

Application Lifecycle

Loading an artifact:

Loading an artifact from the CLI

load artifact SportResults-4.1.0-SNAPSHOT.jar

Creating application from an existing artifact:

Creating an application with the REST API

curl -v -X PUT http://hostname.net:11015/v3/namespaces/{namespace-id}/apps/{app-id} -d '{"artifact":{"name":"{artifact-name}","version":"{artifact-version}","scope":"USER"},"principal":"someuser/somehost.net@SOMEKDC.NET"}' -H "Authorization: Bearer your_access_token"
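
The same request can be assembled programmatically. The sketch below builds the request with the JDK's java.net.http API (Java 11 or later) without sending it. The class CreateAppRequest is a hypothetical helper, the endpoint path and JSON fields are taken from the curl command above, and the body builder does no JSON escaping, so it is only suitable for simple values.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper mirroring the curl command above; it builds but does not send the request.
final class CreateAppRequest {

  /** Builds the JSON body; values are inserted verbatim, without JSON escaping. */
  static String body(String artifactName, String artifactVersion, String principal) {
    return String.format(
        "{\"artifact\":{\"name\":\"%s\",\"version\":\"%s\",\"scope\":\"USER\"},\"principal\":\"%s\"}",
        artifactName, artifactVersion, principal);
  }

  /** Builds a PUT request for /v3/namespaces/{namespace-id}/apps/{app-id}. */
  static HttpRequest build(String baseUrl, String namespace, String app,
                           String jsonBody, String accessToken) {
    URI uri = URI.create(baseUrl + "/v3/namespaces/" + namespace + "/apps/" + app);
    return HttpRequest.newBuilder(uri)
        .header("Authorization", "Bearer " + accessToken)
        .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
        .build();
  }
}
```

The resulting HttpRequest can be passed to HttpClient.send. The streams and datasets endpoints shown below follow the same pattern, each with its own path and body.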

Querying application detail for owner information:

Existing REST API. Please see: http://docs.cask.co/cdap/current/en/reference-manual/http-restful-api/lifecycle.html#details-of-a-deployed-application

Streams

Creating a stream with an owner:

Creating a stream with the REST API

curl -X PUT -v http://somehost.net:11015/v3/namespaces/{namespace-id}/streams/{stream-name} -d '{ "ttl": 1, "principal": "someuser/somehost.net@SOMEKDC.NET" }' -H "Authorization: Bearer your_access_token"

Querying stream properties for owner information:

Existing REST API. Please see: http://docs.cask.co/cdap/current/en/reference-manual/http-restful-api/stream.html#getting-and-setting-stream-properties

Datasets

Creating a dataset with owner:

Creating a dataset with the REST API

curl -v -X PUT http://somehost.net:11015/v3/namespaces/{namespace-id}/data/datasets/{dataset-id} -d '{ "typeName": "table", "properties": {}, "principal": "someuser/somehost.net@SOMEKDC.NET" }' -H "Authorization: Bearer your_access_token"

Querying dataset properties for owner information:

Querying a dataset with the REST API

curl -v http://hostname.net:11015/v3/namespaces/{namespace-id}/data/datasets/{dataset-name} -H "Authorization: Bearer your_access_token"
