Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Currently, when we load the privileges for a principal in master's cache it remains there forever as the cache expiry defaults to a higher value than the cache refresh interval. As the number of entities and users grow the cache size grows along with it and causes a lot of memory pressure. The cache is currently not bound by size. The master runs a proxy service that all other containers talk to using the RemotePrivilegeFetcher to get all the privileges for a principal. Once a container receives the privileges for a principal it caches it forever as well.

There a couple of reasons for keeping the cache in the master.

The typical access pattern for Sentry is to whitelist a set of users whom the Sentry service will accept requests from. This list generally contains service users and not end users. Following this pattern "cdap" should be whitelisted in Sentry and all requests to Sentry should be made as "cdap".
When an application is started, multiple containers could try to fetch the privileges for the same principal. Fetching the data from Sentry for each of these calls would be costly and greatly increase the program startup time.

So, master fetches privileges from Sentry and caches it and all the containers get it from master and cache it locally.

The way the cache is loaded is Apache Sentry specific and relies on a listPrivileges call that lists all Privileges for a principal. Other Authorization providers, e.g. Apache Ranger, may not have this api.

We also have a delay between when an entity is created and when the privileges for it make it down to the container. This is dictated by the cache refresh rate.

Goals

Make the caching model more scalable. It should be able to handle hundreds of thousands of privileges.
Refactor the code so that it is easy to support other authorization providers specifically Apache Ranger.

User Stories

User Story #3

Design

Following are the key principles for the caching design, based on the goals above.

Scalable: must be able to support hundreds of thousand of entities across namespaces and for a large number of principals.
Minimize calls to external service as the could be costly.
Should efficiently handle the case where multiple containers are launched for the same app and there is a spike in authorization requests.
Should be able to support different authorization providers, e.g. Apache Sentry, Apache Ranger, and Amazon IAM etc.
Minimize the increase in complexity of creating a new Authorization extension.
Aim to provide consistent performance for different use cases.

The cache will be changed from Principal -> Map<EntityId, Set<Action>> to <Principal, EntityId> -> Set<Action> to reduce caching unnecessary privileges in case the user has a lot of privileges but only a few of them are being used at one time. The cache will also not be refreshed constantly but only loaded in case of a miss. The entries will have configurable expiry time.

Following are the different approaches that could be taken. I prefer the third approach as that would be most suitable for adding new authorization providers later. It also simplifies the CDAP side interfaces and makes it more agnostic to the Authorization provider.

Approach

Approach #1

The reason containers need to keep refreshing their privilege cache is because they need to know about any changes to the policies i.e., if new privileges were granted to the principal or revoked from them. Since the containers refresh their caches from the master, we need to cache the privileges at the master too because we don't want to fetch the privileges from Sentry on every refresh call from the containers. One way to avoid that is to cache every thing for a principal on a container and any updates to the policy is pushed to the containers. If the containers get updated every time there is a policy change for the principal that the container is interested in they won't need to refresh their caches and we can avoid keeping a cache on the master for refreshes. Master would still need to cache the privileges to handle the spike in requests from multiple containers when a new application is launched but the expiry time in this case could be lower and we don't need to keep refreshing it.

In this approach, all the policy changes would need to be intercepted and pushed to the containers using TMS. This is possible for sentry as the policy changes only happen through CDAP CLI and our Hue plugin.

Pros:

Cache size in master can be limited
Containers don't need to keep refreshing their cache. This would reduce a lot of network traffic in a cluster with a large number of containers.

Cons:

All policy changes need to be intercepted

This approach would not work for Apache Ranger as the policy changes there are done through RangerAdmin and there is no way to intercept it from inside CDAP.

Approach #2

The caching model can be changed from Principal -> Map<EntityId, Set<Actions>> to <Principal, EntityId> -> Set<Actions>. All cache misses will trigger and update from Sentry. Caches on both master and other containers will be size limited. In this approach, when we have a cache miss for a principal and entity combination we will fetch privileges from the authorization provider. In case of Sentry this would be done by calling listPrivileges, which would fetch all privileges for the principal and then we break it by principal and entity combination and load the cache. In case of another authorization provider which supports querying by a principal and entity combination, we can simply fetch the requested privilege. This approach requires minimal change to existing code.

Pros

Only the privileges for the combination of Principal and entities in use will be cached.
Fewer changes required

Cons

Many more calls to the authorization provider.
More cache misses as the key is now narrower
Could have double caching for extensions that do their own caching

Approach #3

Apache Ranger's plugin has it's own cache. It uses local files to cache policies and polls the server at a configurable interval to see if there are any new policy changes. A possible approach could be to let the Authorizer handle its own caching and the only interaction between CDAP and the extension are policy management and policy enforcement. With this approach CDAP won't have to worry about what API's the backend provider has for accessing privileges and the interaction could be a lot more standardized. All provider specific details would be handled by the provider specific extension.

The Authorizer interface does not need to be changed. We can add a cache implementation to the AbstractAuthorizer. The enforce call will hit the cache and if the Principal, Entity combination is in the Cache then the request will be satisfied there or the cache will be populated by that extension's loader, which in Sentry's case could be the same listPrivilege mechanism that we are doing now but for Ranger, for example, it could be a call to Ranger with a isAccessAllowed method call.

Pros

Less work to support Apache Ranger
CDAP is more agnostic to different authorization providers
Removes need to keep refreshing cache
No need to have PrivilegeFetcher interface, Authorizer handles all authorization work

Cons

More work required than Approach #2
Many more calls to the authorization provider.
More cache misses as the key is now narrower

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path

Method

Description

Response Code

Response

/v3/apps/<app-id>

GET

Returns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test ID	Test Description	Expected Results

Releases

Release X.Y.Z

Related Work

Work #1
Work #2
Work #3

Future work

Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Currently, when we load the privileges for a principal in master's cache it remains there forever as the cache expiry defaults to a higher value than the cache refresh interval. As the number of entities and users grow the cache size grows along with it and causes a lot of memory pressure. The cache is currently not bound by size. The master runs a proxy service that all other containers talk to using the RemotePrivilegeFetcher to get all the privileges for a principal. Once a container receives the privileges for a principal it caches it forever as well.

There a couple of reasons for keeping the cache in the master.

The typical access pattern for Sentry is to whitelist a set of users whom the Sentry service will accept requests from. This list generally contains service users and not end users. Following this pattern "cdap" should be whitelisted in Sentry and all requests to Sentry should be made as "cdap".
When an application is started, multiple containers could try to fetch the privileges for the same principal. Fetching the data from Sentry for each of these calls would be costly and greatly increase the program startup time.

So, master fetches privileges from Sentry and caches it and all the containers get it from master and cache it locally.

The way the cache is loaded is Apache Sentry specific and relies on a listPrivileges call that lists all Privileges for a principal. Other Authorization providers, e.g. Apache Ranger, may not have this api.

We also have a delay between when an entity is created and when the privileges for it make it down to the container. This is dictated by the cache refresh rate.

Goals

Make the caching model more scalable. It should be able to handle hundreds of thousands of privileges.
Refactor the code so that it is easy to support other authorization providers specifically Apache Ranger.

User Stories

User Story #3

Design

Following are the key principles for the caching design, based on the goals above.

Scalable: must be able to support hundreds of thousand of entities across namespaces and for a large number of principals.
Minimize calls to external service as the could be costly.
Should efficiently handle the case where multiple containers are launched for the same app and there is a spike in authorization requests.
Should be able to support different authorization providers, e.g. Apache Sentry, Apache Ranger, and Amazon IAM etc.
Minimize the increase in complexity of creating a new Authorization extension.
Aim to provide consistent performance for different use cases.

The cache will be changed from Principal -> Map<EntityId, Set<Action>> to <Principal, EntityId> -> Set<Action> to reduce caching unnecessary privileges in case the user has a lot of privileges but only a few of them are being used at one time. The cache will also not be refreshed constantly but only loaded in case of a miss. The entries will have configurable expiry time.

Following are the different approaches that could be taken.

Approach

Approach #1

The reason containers need to keep refreshing their privilege cache is because they need to know about any changes to the policies i.e., if new privileges were granted to the principal or revoked from them. Since the containers refresh their caches from the master, we need to cache the privileges at the master too because we don't want to fetch the privileges from Sentry on every refresh call from the containers. One way to avoid that is to cache every thing for a principal on a container and any updates to the policy is pushed to the containers. If the containers get updated every time there is a policy change for the principal that the container is interested in they won't need to refresh their caches and we can avoid keeping a cache on the master for refreshes. Master would still need to cache the privileges to handle the spike in requests from multiple containers when a new application is launched but the expiry time in this case could be lower and we don't need to keep refreshing it.

In this approach, all the policy changes would need to be intercepted and pushed to the containers using TMS. This is possible for sentry as the policy changes only happen through CDAP CLI and our Hue plugin.

Pros:

Cache size in master can be limited
Containers don't need to keep refreshing their cache. This would reduce a lot of network traffic in a cluster with a large number of containers.

Cons:

All policy changes need to be intercepted

This approach would not work for Apache Ranger as the policy changes there are done through RangerAdmin and there is no way to intercept it from inside CDAP.

Approach #2

The caching model can be changed from Principal -> Map<EntityId, Set<Actions>> to <Principal, EntityId> -> Set<Actions>. All cache misses will trigger and update from Sentry. Caches on both master and other containers will be size limited. In this approach, when we have a cache miss for a principal and entity combination we will fetch privileges from the authorization provider. In case of Sentry this would be done by calling listPrivileges, which would fetch all privileges for the principal and then we break it by principal and entity combination and load the cache. In case of another authorization provider which supports querying by a principal and entity combination, we can simply fetch the requested privilege. This approach requires minimal change to existing code.

Pros

Only the privileges for the combination of Principal and entities in use will be cached.
Fewer changes required

Cons

Many more calls to the authorization provider.
More cache misses as the key is now narrower
Could have double caching for extensions that do their own caching

Approach #3

Apache Ranger's plugin has it's own cache. It uses local files to cache policies and polls the server at a configurable interval to see if there are any new policy changes. A possible approach could be to let the Authorizer handle its own caching and the only interaction between CDAP and the extension are policy management and policy enforcement. With this approach CDAP won't have to worry about what API's the backend provider has for accessing privileges and the interaction could be a lot more standardized. All provider specific details would be handled by the provider specific extension.

The Authorizer interface does not need to be changed. We can add a cache implementation to the

AbstractAuthorizer

Pros

Less work to support Apache Ranger
CDAP is more agnostic to different authorization providers
Removes need to keep refreshing cache
No need to have PrivilegeFetcher interface, Authorizer handles all authorization work

Cons

More work required than Approach #2
Many more calls to the authorization provider.
More cache misses as the key is now narrower

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path

Method

Description

Response Code

Response

/v3/apps/<app-id>

GET

Returns the application spec for a given application

200 - On success

404 - When application is not available

500 - Any internal errors

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test ID	Test Description	Expected Results

Privilege Caching scalability improvements

Introduction

Goals

User Stories

Design

Approach

Approach #1

Approach #2

Approach #3

API changes

New Programmatic APIs

Deprecated Programmatic APIs

New REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work

Introduction

Goals

User Stories

Design

Approach

Approach #1

Approach #2

Approach #3

API changes

New Programmatic APIs

Deprecated Programmatic APIs

New REST APIs

Deprecated REST API

CLI Impact or Changes

UI Impact or Changes

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work