Authorization Design

Authorization in CDAP


Overview:

CDAP currently provides an authentication service, which identifies whether users are who they claim to be. However, once a user has access to CDAP, that user can access everything in CDAP: create namespaces, stop programs, delete apps, delete datasets, and so on.


Apache Sentry Primer:

Apache Sentry, which is part of the CDH distribution, specifically addresses authorization for Hadoop ecosystem tools. At a high level, Sentry has these concepts:

  • Resource - Entity that you want to regulate access to - namespace, application, program, dataset

  • Privilege - Permission to perform an action on a resource, for example Read access (read-only explore, read in CDAP programs) to a CDAP Dataset

  • Role - Collection of privileges, for example a data_analyst role with read access to dataset A and write access to dataset B

  • Group - Collection of users (LDAP/OS groups); one or more roles can be assigned to a group

  • User - A user can belong to one or more groups
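
The relationships between these concepts can be sketched roughly as follows. This is only an illustration of the user -> group -> role -> privilege chain described above; the class and method names (SentryModelSketch, grantPrivilege, assignRole, addUserToGroup, isAllowed) are hypothetical and are not Sentry or CDAP APIs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical in-memory model of the concepts above; not the Sentry API.
class SentryModelSketch {
  enum Action { READ, WRITE }

  /** A privilege pairs an action with a resource name, e.g. READ on "datasetA". */
  static final class Privilege {
    final String resource;
    final Action action;

    Privilege(String resource, Action action) {
      this.resource = resource;
      this.action = action;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Privilege)) {
        return false;
      }
      Privilege other = (Privilege) o;
      return resource.equals(other.resource) && action == other.action;
    }

    @Override
    public int hashCode() {
      return Objects.hash(resource, action);
    }
  }

  // role name -> privileges, group name -> role names, user name -> group names
  private final Map<String, Set<Privilege>> rolePrivileges = new HashMap<>();
  private final Map<String, Set<String>> groupRoles = new HashMap<>();
  private final Map<String, Set<String>> userGroups = new HashMap<>();

  void grantPrivilege(String role, String resource, Action action) {
    rolePrivileges.computeIfAbsent(role, k -> new HashSet<>()).add(new Privilege(resource, action));
  }

  void assignRole(String group, String role) {
    groupRoles.computeIfAbsent(group, k -> new HashSet<>()).add(role);
  }

  void addUserToGroup(String user, String group) {
    userGroups.computeIfAbsent(user, k -> new HashSet<>()).add(group);
  }

  /** A user is allowed if any role of any of the user's groups holds the privilege. */
  boolean isAllowed(String user, String resource, Action action) {
    Privilege wanted = new Privilege(resource, action);
    for (String group : userGroups.getOrDefault(user, Collections.emptySet())) {
      for (String role : groupRoles.getOrDefault(group, Collections.emptySet())) {
        if (rolePrivileges.getOrDefault(role, Collections.emptySet()).contains(wanted)) {
          return true;
        }
      }
    }
    return false;
  }
}
```

Note that privileges are never attached to a user directly in this sketch: they are granted to roles, roles are assigned to groups, and users pick them up through group membership.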
     

User Requirements/Stories:

  • The admin wants to give user Super_Dev full access to the Dev namespace but give Data_Analyst only restricted (TBD) access. The admin should be able to grant and revoke privileges easily, using a framework/UI that is already familiar and already used to manage access to other tools in the cluster

  • The admin should be able to grant both high-level and low-level access, for example the ability to start/stop a specific program, or read/write access (explore/programmatic) to datasets

  • The admin also wants access to audit logs of access requests, covering both grants and denials

  • Ideally, fully restricting a user's access to, say, Dataset A should also prevent that user from reading or writing the dataset by bypassing CDAP and going to the storage layer (for example, by scanning the HBase table directly)

    • Alternatively, allow only the CDAP user to access storage-layer objects and manage all access through CDAP (this may not be feasible, since existing code reads directly from HDFS/HBase)


High-level Architecture:



Design Suggestion/Choices:

  • Apache Sentry, as described above, provides authorization control for Hadoop tools in CDH. We can thus delegate ACL management to Sentry

  • Note that Sentry provides only authorization services; we need to handle authentication ourselves

  • The admin uses Sentry directly to set ACLs. TBD: for the CDAP UI, we need to decide whether to simply hide namespaces/apps that users don’t have access to.

  • The Sentry service creates and maintains an audit log trail (TBD: figure out how the admin can access it. Does Hue provide access to it?)
     

Workflow:

  • The admin goes to Apache Sentry and grants specific groups access to namespaces/applications

  • The user provides an auth token with each request to CDAP, from which we can determine the user name

  • Once the auth token is verified, the router forwards the request to the appropriate system service

  • The system service's HTTP handler (say, AppFabric) checks with the Sentry service to see whether the user is authorized to perform the requested action. It gets a yes/no response and accepts or denies the request accordingly. This also lets us provide partial responses, for example hiding apps in that namespace that the user doesn’t have access to
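
The handler-side check in the last step could look roughly like the sketch below. Everything here is an assumption made for illustration: the Authorizer interface stands in for the call to the Sentry service, the "READ" action string is invented, and the 200/403 status codes are one plausible way to express accept/deny. None of these names exist in CDAP or Sentry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a system-service handler consulting an authorizer.
class HandlerCheckSketch {
  /** Stand-in for the yes/no question asked of the Sentry service. */
  interface Authorizer {
    boolean isAuthorized(String user, String entity, String action);
  }

  static final int OK = 200;         // assumed status for an accepted request
  static final int FORBIDDEN = 403;  // assumed status for a denied request

  private final Authorizer authorizer;

  HandlerCheckSketch(Authorizer authorizer) {
    this.authorizer = authorizer;
  }

  /** Accepts or denies a single request based on the yes/no response. */
  int handle(String user, String entity, String action) {
    return authorizer.isAuthorized(user, entity, action) ? OK : FORBIDDEN;
  }

  /** Partial response: returns only the apps the user is allowed to read. */
  List<String> listVisibleApps(String user, List<String> appsInNamespace) {
    List<String> visible = new ArrayList<>();
    for (String app : appsInNamespace) {
      if (authorizer.isAuthorized(user, app, "READ")) {
        visible.add(app);
      }
    }
    return visible;
  }
}
```

The user name passed to these methods is assumed to have been extracted from the verified auth token before the request reaches the handler.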

Scope for 3.3:

  • Namespace authorization: a user either has access to a specific namespace or does not.

  • The authorization client in CDAP should be pluggable, so that in the future we can drop in a Sentry or Ranger implementation without much modification.

  • We need to figure out how plugging a CDAP policy engine/data model into the Sentry service will work for new and existing Sentry installations (since management of Sentry is outside the scope of the CDAP installation)
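
One possible shape for the pluggable client is sketched below, limited to the namespace-level yes/no check that is in scope for 3.3. The interface name, the in-memory implementation, and the idea of selecting the implementation by class name (for example from a configuration property) are all assumptions made for illustration, not an agreed design.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a pluggable authorization client.
class PluggableAuthSketch {
  /** Extension point; a Sentry or Ranger backed client would implement this. */
  interface AuthorizationClient {
    boolean hasNamespaceAccess(String user, String namespace);
  }

  /** Simple in-memory implementation, useful as a default or in tests. */
  static class InMemoryClient implements AuthorizationClient {
    private final Map<String, Set<String>> namespacesByUser = new HashMap<>();

    InMemoryClient() {
    }

    void grant(String user, String namespace) {
      namespacesByUser.computeIfAbsent(user, k -> new HashSet<>()).add(namespace);
    }

    @Override
    public boolean hasNamespaceAccess(String user, String namespace) {
      Set<String> namespaces = namespacesByUser.get(user);
      return namespaces != null && namespaces.contains(namespace);
    }
  }

  /** Instantiates the implementation named by className via its no-arg constructor. */
  static AuthorizationClient load(String className) {
    try {
      return Class.forName(className)
          .asSubclass(AuthorizationClient.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot load authorization client " + className, e);
    }
  }
}
```

With this shape, callers depend only on AuthorizationClient, so swapping the backend would mean changing the configured class name rather than the calling code.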

Future Sentry Integrations:

  • ACLs should be pushed to the underlying storage layers. For example, restricting a user's access to a specific Dataset should also restrict that user's access in HBase

  • CDAP Programs (such as Worker) should inherit the Dataset access control (by impersonating the user who is starting the program)

  • Dataset operations in DatasetOpExecutor suffer from the same issues

References:

  1. http://events.linuxfoundation.org/sites/events/files/slides/ApacheSentry2015_0.pdf

  2. https://blogs.apache.org/sentry/

  3. https://github.com/apache/incubator-sentry

  4. https://cwiki.apache.org/confluence/display/SQOOP/High+Level+Design+of+Role+Based+Access+Controller