Goals

Performance improvements (caching authorization policies)
Authorization of dataset and stream access
Authorization for listing and viewing entities

Checklist

User stories documented (Bhooshan)
User stories reviewed (Nitin)
Design documented (Bhooshan)
Design reviewed (Andreas/Terence)
Feature merged (Bhooshan)
Documentation (Bhooshan)
Blog post

User Stories

As a CDAP security admin, I want all operations on datasets/streams to be governed by my configured authorization system.
As a CDAP system, I want list operations for all CDAP entities to only return entities that the logged-in user is authorized to view.
As a CDAP system, I want view operations for a CDAP entity to only succeed if the logged-in user is authorized to view that entity.

Scenarios

Scenario #1

Derek is an IT Operations Extraordinaire at a corporation that uses CDAP to manage datasets with varying degrees of sensitivity. He would like to implement authorization policies for all data stored in CDAP across datasets and streams, so only authorized users have access to such data. He would like to control both read as well as write access.

Scenario #2

Derek would like to be able to use external authorization systems like Apache Sentry to manage authorization policies. Given that Apache Sentry could be installed in a different environment from CDAP, he would like to minimize the performance impact of verifying authorization while accessing data. Derek expects that performance improvements do not result in security breaches. For example, if authorization policies are cached in CDAP, Derek expects them to be refreshed regularly at configurable time intervals.

Scenario #3

In this organization, CDAP is used to store data belonging to various business units. These business units are potentially completely disparate, and do not share information. Some of their data or applications may be extremely sensitive. As a security measure, Derek would also like to enforce authorization for operations that list CDAP entities, so that a user can only see the entities that they are authorized to read or write.

Design

Authorizing Dataset and Stream operations

The most critical requirement to address in 3.5 is to authorize dataset and stream operations. These operations can be categorized into data access (read/write) and admin (create, update, delete). Admin operations can be presumed to occur less often than data access operations, and are not in the data path. As a result, even though performance is important, it is less critical for admin operations compared to data access operations. For data access operations, it is not practical to communicate with an external authorization system like Apache Sentry for every single operation, since that would lead to major performance degradation. As a result, authorization policies need to be cached in CDAP potentially for all operations, but especially for data access operations.

One of the major concerns about caching is freshness, or invalidation. It is especially important in a security/authorization context, because a stale cache could result in security breaches. For example, suppose we've cached all authorization policies. An update in the external authorization system, especially a revocation of privileges, should result in an immediate refresh of the cache; otherwise, the revoked privileges remain usable until the next refresh takes place.

For such an authorization policy cache, the major design goals are:

Minimal refresh time
1. The refresh operation should be fast. The time taken for the operation should certainly be less than the refresh interval.
2. It should make minimal RPC calls. If there is a way to load the entire snapshot of ACLs in a single RPC call, that should be preferred.
3. It should transfer only necessary data.
Configurable refresh interval
1. The refresh operation should happen at configurable time intervals so users can tune it per their requirement.

Approach 1:

To satisfy these goals, the data structure that should be cached can be defined as follows:

PrivilegeCache

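// TODO: Explore using Guava Cache
import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;
import java.util.HashSet;
import java.util.NoSuchElementException;
import java.util.Set;

// Principal, EntityId and Action are the CDAP security types (imports omitted here).
class PrivilegeCache {
  // Not final: updateSnapshot() and reset() replace the whole table.
  // HashBasedTable is not thread-safe, so access must be synchronized externally
  // if a background refresh thread updates the cache.
  private Table<Principal, EntityId, Set<Action>> privileges = HashBasedTable.create();

  // Adds the given actions to whatever the principal already holds on the entity.
  public void addPrivileges(Principal principal, EntityId entityId, Set<Action> actionsToAdd) {
    Set<Action> actions = privileges.get(principal, entityId);
    if (actions == null) {
      actions = new HashSet<>();
    }
    actions.addAll(actionsToAdd);
    privileges.put(principal, entityId, actions);
  }

  // Removes the given actions; fails if the principal has no entry for the entity.
  public void revokePrivileges(Principal principal, EntityId entityId, Set<Action> actionsToRemove) {
    Set<Action> actions = privileges.get(principal, entityId);
    if (actions == null) {
      throw new NoSuchElementException();
    }
    actions.removeAll(actionsToRemove);
    privileges.put(principal, entityId, actions);
  }

  // Replaces the cached privileges with a copy of the given snapshot.
  public void updateSnapshot(Table<Principal, EntityId, Set<Action>> privilegeSnapshot) {
    privileges = HashBasedTable.create(privilegeSnapshot);
  }

  // Drops all cached privileges.
  public void reset() {
    privileges = HashBasedTable.create();
  }
}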

The above cache would be re-populated asynchronously from the configured Authorization Provider (Apache Sentry/Apache Ranger, etc) at a configurable time interval, using an AbstractScheduledService. Instead of querying these authorization providers every time an authorization check is required, various CDAP sub-components will instead query this cache.
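
The scheduled refresh described above could look roughly like the following. This is only a sketch under stated assumptions: it uses the JDK's ScheduledExecutorService as a stand-in for Guava's AbstractScheduledService, and the class name SnapshotRefresher, the string-keyed privilege map and the Supplier-based provider are hypothetical simplifications, not CDAP APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch: privileges are keyed by "principal|entity" and map to action names.
class SnapshotRefresher implements AutoCloseable {
  private final Supplier<Map<String, Set<String>>> provider; // assumed: one call returns all ACLs
  private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
  private volatile Map<String, Set<String>> snapshot = Collections.emptyMap();

  SnapshotRefresher(Supplier<Map<String, Set<String>>> provider) {
    this.provider = provider;
  }

  // Starts the periodic refresh; the interval is the configurable knob.
  void start(long interval, TimeUnit unit) {
    executor.scheduleWithFixedDelay(this::refreshOnce, 0, interval, unit);
  }

  // Loads a full snapshot and swaps it in atomically via the volatile field.
  void refreshOnce() {
    try {
      snapshot = Collections.unmodifiableMap(new HashMap<>(provider.get()));
    } catch (RuntimeException e) {
      // Keep the previous snapshot. Letting the exception escape would
      // cancel all future runs of the scheduled task.
    }
  }

  // Authorization checks read only the cached snapshot, never the provider.
  boolean isAuthorized(String principal, String entity, String action) {
    Set<String> actions = snapshot.get(principal + "|" + entity);
    return actions != null && actions.contains(action);
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

Swapping a whole immutable snapshot keeps readers lock-free, at the cost of re-transferring all ACLs on every refresh.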

Cache Freshness

As mentioned above, the policy cache in CDAP can be made consistent with authorization providers at regular scheduled intervals. However, this has the following race: Suppose Alice and Bob have been given READ access to Dataset1, and this state is consistent in both the external system (e.g. Apache Sentry) and the cache. Now, ACLs are updated to remove Alice's permissions. Until the refresh thread mentioned above runs, the cache will be inconsistent with the external system, and CDAP will still think that both Alice and Bob have READ access to Dataset1. The severity of this may vary depending on the situation, but it is a security loophole nonetheless. There are two possible ways in which this situation may arise:

User uses CDAP (CLI/REST APIs) to update ACLs: In this scenario, we can have a callback to the revoke APIs in CDAP to also update the cache. As long as the store and the cache are updated transactionally, there would not be an inconsistency between the external system and the CDAP cache.
User uses an external interface (e.g. Hue, Apache Ranger UI) to update ACLs: In this scenario, we may have to depend upon the external system providing a callback mechanism. Even if such a mechanism is provided, the interface for the cache to be updated (e.g. from a message queue), will have to be built in CDAP. The external system can then add events to such an interface, and the cache could keep itself up-to-date by consuming from this interface. In the first release, however, it is likely that there may be an inconsistency if this method is chosen to update ACLs.

Handling cache refresh failures

Since the sub-components of CDAP will now just use the authorization policy cache to check for ACLs, there would be a problem if the cache refresh keeps failing (for example, because the authorization backend is down). If such failures persist over a period of time, the cache could remain stale for a long time. This could lead to serious security loopholes, and hence there should be a way to invalidate the cache when such consistent failures occur. This could be done by having a configurable retry limit for failures. When this limit is reached, the cache would be cleared, and until the next successful refresh, any operation in CDAP will result in an authorization failure. Although this would render CDAP unusable, it will reduce the chances of such a security breach. In such a case, admins will have to fix the communication between CDAP and the authorization backend before CDAP can be used again.
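
The retry limit described above can be sketched as follows. This is a minimal illustration, not a proposed implementation: the class name FailClosedCache, the maxConsecutiveFailures parameter and the string-keyed privilege map are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: clears the cache after too many consecutive refresh failures.
class FailClosedCache {
  private final Supplier<Map<String, Set<String>>> provider; // assumed: throws when the backend is down
  private final int maxConsecutiveFailures;                  // assumed: the configurable retry limit
  private Map<String, Set<String>> snapshot = Collections.emptyMap();
  private int consecutiveFailures;

  FailClosedCache(Supplier<Map<String, Set<String>>> provider, int maxConsecutiveFailures) {
    this.provider = provider;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  // One refresh attempt. A success resets the failure count; reaching the limit empties the cache.
  synchronized void refresh() {
    try {
      snapshot = Collections.unmodifiableMap(new HashMap<>(provider.get()));
      consecutiveFailures = 0;
    } catch (RuntimeException e) {
      consecutiveFailures++;
      if (consecutiveFailures >= maxConsecutiveFailures) {
        // Fail closed: with an empty cache every check below is denied.
        snapshot = Collections.emptyMap();
      }
    }
  }

  synchronized boolean isAuthorized(String principal, String entity, String action) {
    Set<String> actions = snapshot.get(principal + "|" + entity);
    return actions != null && actions.contains(action);
  }
}
```

Below the limit, the stale snapshot keeps serving requests; at the limit, every check is denied until a refresh succeeds again.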

Alternative Caching Approach (Approach 2)

An alternative caching approach would be for the CDAP sub-components to query the cache for a privilege. The cache returns the privilege if there is a hit, and goes back to the authorization provider if there is a miss.

Pros

Can have individual privilege level cache expiry, making the refresh process more streamlined
No need for an asynchronous cache refresh thread, that refreshes all policies (resulting in asynchronous, but longer refresh process)

Cons

The major drawback of this approach is that it could make the majority access pattern slow, because it requires a call to the authorization provider every time a privilege (a combination of a principal, an entity and an action) is not found in the cache. Since a majority of these combinations are unlikely to be in the cache at a given point in time, this approach is likely to cause a lot of cache misses. In the normal flow, an operation is then likely to be slow because it has to make a call to the authorization provider, whereas in the earlier approach, the slowness only occurs while the cache is being updated.

Hybrid Approach (Approach 3)

Since both the approaches above have definite drawbacks, we could use a hybrid approach. In this approach, the cache would be keyed by a principal. When there is a cache miss for a principal, the requested ACL for the principal will be fetched from the authorization provider and the cache would be updated. Along with this, a background thread will update the cache with all the ACLs for the requested principal, so any further requests for this principal can be fulfilled by the cache. Each entry in the cache will have a configurable expiry, thereby ensuring freshness without needing a long refresh time. This approach still does not avoid security loopholes, since a privilege could be updated before the cache is refreshed, but it seems like a reasonable middle ground. Guaranteeing security would need a more sophisticated mechanism, in which the authorization provider publishes a message to a queue that the cache listens to whenever an ACL is updated, but that could be future work.
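
A rough sketch of the principal-keyed cache with per-entry expiry follows. It simplifies the description above: on a miss or an expired entry it loads all ACLs for the principal synchronously, rather than fetching the single ACL first and filling the rest from a background thread. The class name PrincipalPrivilegeCache, the loader function and the injectable clock are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch: one cache entry per principal, each with its own expiry time.
class PrincipalPrivilegeCache {
  private static final class Entry {
    final Map<String, Set<String>> privileges; // entity -> actions
    final long expiresAtMillis;

    Entry(Map<String, Set<String>> privileges, long expiresAtMillis) {
      this.privileges = privileges;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();
  private final Function<String, Map<String, Set<String>>> loader; // assumed: all ACLs of a principal
  private final long ttlMillis;                                    // assumed: the configurable expiry
  private final LongSupplier clock;                                // injectable so expiry can be tested

  PrincipalPrivilegeCache(Function<String, Map<String, Set<String>>> loader,
                          long ttlMillis, LongSupplier clock) {
    this.loader = loader;
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  boolean isAuthorized(String principal, String entity, String action) {
    long now = clock.getAsLong();
    Entry entry = entries.get(principal);
    if (entry == null || now >= entry.expiresAtMillis) {
      // Miss or expired: go back to the authorization provider for this principal only.
      entry = new Entry(new HashMap<>(loader.apply(principal)), now + ttlMillis);
      entries.put(principal, entry);
    }
    Set<String> actions = entry.privileges.get(entity);
    return actions != null && actions.contains(action);
  }
}
```

In a real deployment the clock would typically be System::currentTimeMillis. As noted above, a revocation still goes unnoticed until the affected principal's entry expires.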

Caching in Apache Sentry

Apache Sentry has some active work going on to enable client-side caching as part of SENTRY-1229. It will likely suffer from the same drawbacks mentioned above regarding cache freshness. There is a case for re-using this (and other such) caching from authorization providers in CDAP. However, we will choose to implement a cache in CDAP independently because of the following reasons:

We would like a cache in CDAP that works independently of authorization providers. For example, we would like the same caching mechanism to be available irrespective of the configured authorization backend (Apache Sentry, the Dataset-backed Authorization backend or Apache Ranger in future).
This is active work in progress in Apache Sentry, and there are no timelines yet as to when this change will make it to a CDH distro (currently marked for Apache Sentry 1.8.0).

Turning caching off

For certain use cases where caching of security policies may not be acceptable even at the cost of a significant performance hit, a configuration knob should be provided to turn caching off. By default though, caching will be enabled.

Intercepting Dataset calls

Since authorization policies must be applied to custom datasets as well, it is non-trivial to decide where dataset calls should be intercepted to add authorization checks. The right approach for this would depend on the design of the new Dataset APIs in Datasets Revamp. One option for doing this is to only intercept the getDataset call, which would get a dataset for READ, WRITE, READ_WRITE, etc, and then apply the corresponding authorization policy. With this approach, the actual read/write calls would not be intercepted. This approach has the obvious drawback that getDataset calls may be cached. Even if they aren't, it is unclear what happens if a principal's privilege on a dataset is revoked after the principal has successfully executed a getDataset call.

Note: The approach here is TBD, it would depend on the new Dataset APIs and will be finalized during implementation.

Discussion with Andreas 06/13

Refresh rate of cache as a dataset property, since only a dataset can tell if it is sensitive
Ability to turn off caching for a single dataset
No caching for admin operations

Authorizing Service Requests

With Secure Impersonation - Security 3.5, user services will be started as the logged-in user. However, service endpoints for accessing datasets can be called by any user. Hence, it is necessary to make sure that any dataset access via such endpoints is authorized. One way of doing this would be to add a handler hook to the NettyHttpService that runs the service, which in its preCall method will have an authorization check.

Note: This approach may not work, because even if this is done, it is unclear how the hook would obtain the entity (the dataset) and the action (READ/WRITE, etc). TBD, to figure out during implementation.

Authorizing list operations

Operations that list entities (namespaces, artifacts, apps, programs, datasets, streams) should be updated so that they only return the entities that the currently logged-in user has privileges on.

Listing namespaces, apps, artifacts, datasets and streams should return only those respective entities that the user has READ, WRITE or ALL permissions on
Listing programs should return programs that the user has READ, WRITE, EXECUTE or ALL permissions on

To achieve this, the corresponding list APIs in CDAP (e.g. NamespaceAdmin, AppLifecycleService, ProgramLifecycleService, DatasetFramework, StreamAdmin) should be updated with a filter to only return elements that users have access to.
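
The filter added to the list APIs could be sketched along these lines. The names ListFilter and PrivilegeLookup are hypothetical, and entities and actions are plain strings here rather than CDAP's EntityId and Action types. The set of accepted actions is passed in, since it differs between entity types (for example, EXECUTE also counts for programs).

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: keeps only the entities on which the principal holds an accepted action.
class ListFilter {
  // Assumed lookup; in practice this would be backed by the privilege cache.
  interface PrivilegeLookup {
    Set<String> actionsFor(String principal, String entity);
  }

  static List<String> filterVisible(String principal, Collection<String> entities,
                                    Set<String> acceptedActions, PrivilegeLookup lookup) {
    return entities.stream()
        // An entity is visible if the principal holds at least one accepted action on it.
        .filter(entity -> !Collections.disjoint(lookup.actionsFor(principal, entity), acceptedActions))
        .collect(Collectors.toList());
  }
}
```

Each list API would apply such a filter to its full result before returning, so unauthorized entities never leave the service.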

Communication with Apache Sentry

The typical pattern in Sentry is to whitelist a set of users who the Sentry service can accept requests from. The property that dictates this is called sentry.service.allow.connect. The description for this property states: "List of users allowed to connect to the Sentry Server. These are usually service users such as hive and impala, and the list does not usually need to include end users." As a result, the pattern in 3.4 was to whitelist the cdap user, which was fine, because all authorization requests to Sentry originated from the CDAP Master. However, the difference in 3.5 is that CDAP will now make requests to Sentry for authorization enforcement from program containers. In addition, programs will run as the user that starts the program, and this user is configured at the namespace level in 3.5. Consider the following sequence:

1. A user creates a namespace 'myspace', and assigns the principal 'myuser' to it.
2. The user deploys an app in 'myspace', and starts a program.
3. The program is spawned as 'myuser'.
4. During the program execution, requests need to be made to Sentry.

For 4. above, there are two options:

1. Send the request as the 'cdap' user. This communication has been tested to work, and will always work, as long as the 'cdap' user is whitelisted using the property mentioned earlier in the Sentry Service. To achieve this, however, we will need to create an extra hop in this request: from the program container, an RPC request is made to another container (that also executes other operations like recording lineage, usage registry, run records and workflow tokens). This other container will have the cdap user's delegation token, and will make the request to Sentry.
2. Send the request as the user running the program. This will not need the extra hop in option 1. However, the disadvantages of this are:
   1. Every single user who will ever run a CDAP program will have to be whitelisted in the Sentry Service. An alternate approach, where a certain 'cdapprogramrunners' group is whitelisted, and all users who will run a program are part of that group, does not work. Even the whitelist property description suggests the same, and an experiment proved it as well.
   2. Once a user is whitelisted, it is whitelisted for all operations in the Sentry Service. This property merely decides whether a request will be accepted or rejected solely based on the defined users. It makes no distinction based on the operation being performed. There are other parameters that influence that (viz: admin groups; the fact that only admin groups can list all roles, create a role, etc; granting/revoking privileges is also determined by a policy in CDAP, which ensures that only a user that has ADMIN rights on an entity can grant/revoke - the whitelist does not influence any of these operations).

Taking all of the above into consideration, the first approach (an extra RPC call, but communicating with Sentry as 'cdap') makes sense, unless users are fine with going against both the Sentry norm and the property description by whitelisting every single user (for 3.5, this number is effectively equal to the number of namespaces in CDAP).

Dependencies

Ability to distinguish between read and write operations in datasets

Authorization Policies

An authorization policy (or ACL) is only valid if the entity exists in CDAP. There may be orphaned (invalid) policies, but they can only exist if entity deletion fails before or during policy revoke - Details in the deletion section.
Any API that lists entities will filter all the entities to only return entities that the logged in user has access (READ/WRITE/ADMIN/ALL) to.
Any API that gets details of an entity will require that the user has access (READ/WRITE/ADMIN/ALL) to that entity.
Any API that creates a new entity will require that the user has WRITE access on the entity's parent (e.g. to create a dataset, the user will need WRITE access on the namespace where the dataset will be created).
- Such APIs first perform a check for entity existence
  - Entity does not exist
    - There cannot be a valid enforcement check in this situation, because CDAP does not have privileges for non-existing entities. As a result, CDAP will try to get the metadata for that entity. Since it does not exist, CDAP will respond with Not Found
      Andreas: I don't understand. If user has WRITE access to parent (e.g., namespace), then the creation should succeed. Why NotFound?
    - Bhooshan: Nope, I got this wrong. I've missed a sentence before these bullet points. This only applies to an existence check during creation, not the actual creation process. e.g. Dataset creation during app deployment first makes a get() call to check if the dataset exists in the dataset service. For this call to proceed with dataset creation, it expects the get() call to respond with Not Found. I've added that point in blue.
  - Entity already exists
    - Since the entity exists, there can be valid enforcement checks for the entity.
      - User does not have access on existing entity
        CDAP will return an Unauthorized response
      - User has access on the entity
        CDAP will return an Already Exists response
  - Bear in mind, that in both the above conditions, the user can infer (implicitly or explicitly) that the entity exists.
Any API that creates a new entity grants ALL privileges to the user once it is determined that the user has privileges to create the entity. Once the privileges have been granted, CDAP proceeds to create the entity. If the creation of the entity fails, CDAP rolls back the privileges. This ordering means that there may be orphaned privileges in rare scenarios, but there can never be orphaned entities.
Andreas: Similar to delete, we don't want this to create an orphaned entity that nobody can ever see or delete again. So should we create the ACLs first, then create the entity, if that fails, attempt to remove the ACLs? That would cause, in rare situations, an orphaned ACL, but never an orphaned entity.
Bhooshan: Agreed. That makes it consistent. Updated
Any API that deletes an entity will require ADMIN privilege on that entity
- Irrespective of whether the entity exists or not, an authorization check will first be performed. It will return an Unauthorized response if the user does not have the ADMIN privilege. If the user does have the required privilege, the API will respond with a Not Found if the entity does not exist (because of an Orphaned ACL from a previous deletion). Else, it will proceed with deletion.
  Andreas: If the entity does not exist, then there cannot be an ACL for it (according to first point), so the user will never have the required privilege, right? That is, it will always return NotFound or succeed? I would think that if the user has any privilege (say READ), but not ADMIN, then this returns Unauthorized. If the user has no privileges at all, then it returns NotFound (irrespective of whether the entity exists). It will also return NotFound if the user has privileges but the entity does not exist, but that cannot happen according to the next bullet.
- Bhooshan: Discussed this in person and updated.
- Just to confirm: Deleting an entity that does not exist will - under normal circumstances - return Unauthorized. Is that intended? Whereas check for existence (as pointed out above) will return NotFound. And what will getDetail() return if the entity does not exist?
- Bhooshan: getDetail() returns NotFound if the entity does not exist; the authorization check is performed afterwards. Should we change delete to also do the same existence check first?
Any API that deletes an entity will first remove all metadata for that entity, once it is determined by the previous policy that the user has privileges to delete the entity. This is so that entity can never subsequently be returned as a response to a list or get API. If deletion fails midway or while revoking privileges, CDAP may have orphaned privileges (for non-existing entities). There would be no easy way to clean up that entity or its privileges later. If someone re-creates that entity, it could have some rogue privileges. Until such operations can be transactional, the create operation will first delete any privileges on the entity that was successfully created, then grant the user ALL privileges on the entity.
Andreas: That means if the deletion fails, then it can never be deleted again, because all privileges have already been removed? So it becomes an orphan invisible to everybody (because list calls will not show it any longer)? Seems weird.
Bhooshan: Updated per our discussion. Please review again.
Any API that changes the characteristics/properties (update properties, upgrade entity) of an entity will require ADMIN privilege on that entity.
- Irrespective of whether the entity exists or not, an authorization check will first be performed. It will return an Unauthorized response if the user does not have the required privilege. If the user does have the required privilege, the API will respond with a Not Found if the entity does not exist. Else, it will proceed with the modifications to the entity's properties.
  Andreas: Same comment applies as for delete above
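
The creation policy above (check WRITE on the parent, clear leftover privileges, grant ALL, create, roll back on failure) can be sketched as follows. All names here (InMemoryAuthorizer, EntityCreator, EntityStore) are hypothetical stand-ins, not CDAP classes, and entities are plain strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the authorization backend.
class InMemoryAuthorizer {
  private final Map<String, Set<String>> grants = new HashMap<>(); // key: "principal|entity"

  boolean has(String principal, String entity, String action) {
    Set<String> actions = grants.get(principal + "|" + entity);
    return actions != null && (actions.contains(action) || actions.contains("ALL"));
  }

  void grant(String principal, String entity, String action) {
    grants.computeIfAbsent(principal + "|" + entity, key -> new HashSet<>()).add(action);
  }

  void revokeAll(String entity) {
    grants.keySet().removeIf(key -> key.endsWith("|" + entity));
  }
}

class EntityCreator {
  // Assumed abstraction over the actual entity creation (dataset, stream, ...).
  interface EntityStore {
    void create(String entity);
  }

  static void create(String principal, String parent, String entity,
                     InMemoryAuthorizer authorizer, EntityStore store) {
    if (!authorizer.has(principal, parent, "WRITE")) {
      throw new SecurityException(principal + " needs WRITE on " + parent);
    }
    // Clear privileges possibly left behind by an earlier failed deletion.
    authorizer.revokeAll(entity);
    // Grant first, then create: a failure can orphan a privilege, never an entity.
    authorizer.grant(principal, entity, "ALL");
    try {
      store.create(entity);
    } catch (RuntimeException e) {
      authorizer.revokeAll(entity); // best-effort rollback of the grant
      throw e;
    }
  }
}
```

If the rollback itself fails, an orphaned privilege remains, which is the rare case the policy accepts.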

NOTE: Cells marked green were done in 3.4. Cells marked in yellow are in scope for 3.5.

Entity	Operation	Required Privileges	Resultant Privileges
Namespace	create	ADMIN (Instance)	ADMIN (Namespace)
	update	ADMIN (Namespace)
	list	READ/WRITE/ADMIN/ALL (Namespace)
	get	READ/WRITE/ADMIN/ALL (Namespace)
	delete	ADMIN (Namespace)
	set preference	WRITE (Namespace)
	get preference	READ (Namespace)
	search	READ/WRITE/ADMIN/ALL (Namespace)
Artifact	add	WRITE (Namespace)	ADMIN (Artifact)
	delete	ADMIN (Artifact)
	get	READ/WRITE/ADMIN/ALL (Artifact)
	list	READ/WRITE/ADMIN/ALL (Artifact)
	write property	ADMIN (Artifact)
	delete property	ADMIN (Artifact)
	get property	READ/WRITE/ADMIN/ALL (Artifact)
	refresh	WRITE (Instance)
	write metadata	ADMIN (Artifact)
	read metadata	READ (Artifact)
Application	deploy	WRITE (Namespace)	ADMIN (Application)
	get	READ/WRITE/ADMIN/ALL (Application)
	list	READ/WRITE/ADMIN/ALL (Application)
	update	ADMIN (Application)
	delete	ADMIN (Application)
	set preference	WRITE (Application)
	get preference	READ (Application)
	add metadata	ADMIN (Application)
	get metadata	READ (Application)
Programs	start/stop/debug	EXECUTE (Program)
	set instances	ADMIN (Program)
	list	READ/WRITE/ADMIN/ALL (Application)
	set runtime args	EXECUTE (Program)
	get runtime args	READ/WRITE/ADMIN/EXECUTE/ALL (Program)
	get instances	READ/WRITE/ADMIN/EXECUTE/ALL (Program)
	set preference	ADMIN (Program)
	get preference	READ (Program)
	get status	READ/WRITE/ADMIN/EXECUTE/ALL (Program)
	get history	READ/WRITE/ADMIN/EXECUTE/ALL (Program)
	add metadata	ADMIN (Program)
	get metadata	READ (Program)
	emit logs	WRITE (Program)
	view logs	READ (Program)
	emit metrics	WRITE (Program)
	view metrics	READ (Program)
Streams	create	WRITE (Namespace)	ALL (Stream)
	update properties	ADMIN (Stream)
	delete	ADMIN (Stream)
	truncate	ADMIN (Stream)
	enqueue asyncEnqueue batch	WRITE (Stream)
	get	READ/WRITE/ADMIN/ALL (Stream)
	list	READ/WRITE/ADMIN/ALL (Namespace)
	read events	READ (Stream)
	set preferences	ADMIN (Stream)
	get preferences	READ (Stream)
	add metadata	ADMIN (Stream)
	get metadata	READ (Stream)
	view lineage	READ (Stream)
	emit metrics	WRITE (Stream)
	view metrics	READ (Stream)
Datasets	list	READ/WRITE/ADMIN/ALL (Dataset)
	get	READ/WRITE/ADMIN/ALL (Dataset)
	create	WRITE (Namespace)	ADMIN (Dataset)
	update	ADMIN (Dataset)
	drop	ADMIN (Dataset)
	exists	READ/WRITE/ADMIN/ALL (Dataset)
	truncate	ADMIN (Dataset)
	upgrade	ADMIN (Dataset)
	add metadata	ADMIN (Dataset)
	get metadata	READ (Dataset)
	view lineage	READ (Dataset)
	emit metrics	WRITE (Dataset)
	view metrics	READ (Dataset)

Out-of-scope User Stories (4.0 and beyond)

As a CDAP admin, I should be able to authorize metadata changes to CDAP entities
As a CDAP system, I should be able to push down ACLs to storage providers
As a CDAP admin, I should be able to see an audit log of all authorization-related changes in CDAP
As a CDAP admin, I should be able to authorize all thrift-based traffic, so transaction management is also authorized.
As a CDAP admin, I should be able to authorize logging and metrics operations on CDAP entities.

Authorization - CDAP 3.5