...
- User stories documented (Ali)
- User stories reviewed (Nitin)
- Design documented (Ali)
- Design reviewed (Andreas/Terence)
- Feature merged (Ali)
- Blog post
...
Hadoop's UserGroupInformation class has the following method:
// Log a user in from a keytab file.
UserGroupInformation loginUserFromKeytabAndReturnUGI(String user, String path);
...
A similar approach can be used for programs launched by a scheduler. The only difference is that the scheduler, rather than the caller, would resolve the principal and credentials.
System-executed operations on user data (dataset admin ops and namespace ops)
When the CDAP system performs dataset operations (create/delete/truncate/upgrade HBase tables, for instance), it is acting on user datasets. Because of this, and because we do not want the cdap system user to have superuser privileges, we need to impersonate users when executing these dataset admin operations.
To implement this, we'll have a DelegatingDatasetAdmin which will perform all of its operations for a particular UGI.
StorageProviderNamespaceAdmin will also have to perform all of its operations for a particular UGI (i.e. namespace create and namespace delete).
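The delegating pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual CDAP implementation: the reduced `DatasetAdmin` interface, the `Impersonator` hook, and the `FakeImpersonator` and `RecordingAdmin` helpers are hypothetical names introduced here. In a real deployment the `Impersonator` would presumably wrap a `doAs` call on the UGI resolved for the namespace; the fake below only tracks a "current user" on the thread so the sketch runs without Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical, reduced set of dataset admin operations. */
interface DatasetAdmin {
  void create() throws Exception;
  void truncate() throws Exception;
  void drop() throws Exception;
}

/** Hypothetical "run this action as a particular user" hook. */
interface Impersonator {
  <T> T doAs(Callable<T> action) throws Exception;
}

/** Routes every operation of the wrapped DatasetAdmin through the Impersonator. */
final class DelegatingDatasetAdmin implements DatasetAdmin {
  private interface Op { void run() throws Exception; }

  private final DatasetAdmin delegate;
  private final Impersonator impersonator;

  DelegatingDatasetAdmin(DatasetAdmin delegate, Impersonator impersonator) {
    this.delegate = delegate;
    this.impersonator = impersonator;
  }

  @Override public void create() throws Exception { execute(delegate::create); }
  @Override public void truncate() throws Exception { execute(delegate::truncate); }
  @Override public void drop() throws Exception { execute(delegate::drop); }

  private void execute(Op op) throws Exception {
    impersonator.doAs(() -> {
      op.run();
      return null;
    });
  }
}

/** Assumption for the demo: tracks the "active" user in a thread-local instead of using Kerberos. */
final class FakeImpersonator implements Impersonator {
  static final ThreadLocal<String> CURRENT_USER = ThreadLocal.withInitial(() -> "cdap");
  private final String user;

  FakeImpersonator(String user) { this.user = user; }

  @Override
  public <T> T doAs(Callable<T> action) throws Exception {
    String previous = CURRENT_USER.get();
    CURRENT_USER.set(user);
    try {
      return action.call();
    } finally {
      CURRENT_USER.set(previous); // always restore the system identity
    }
  }
}

/** Fake admin that records which user each operation ran as. */
final class RecordingAdmin implements DatasetAdmin {
  final List<String> log = new ArrayList<>();

  @Override public void create() { log.add("create as " + FakeImpersonator.CURRENT_USER.get()); }
  @Override public void truncate() { log.add("truncate as " + FakeImpersonator.CURRENT_USER.get()); }
  @Override public void drop() { log.add("drop as " + FakeImpersonator.CURRENT_USER.get()); }
}
```

Wrapping a `RecordingAdmin` in `new DelegatingDatasetAdmin(admin, new FakeImpersonator("ali"))` makes each operation run as `ali`, while code outside the wrapper keeps running as the system user. The same wrapper shape would apply to the namespace create and delete operations.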
Upgrade Tool changes (TBD)
The upgrade tool will very likely have to follow the same pattern as the dataset op executor service.
Other miscellaneous tools that interact with user data: Flowlet pending metrics corrector, Flowlet queue inspector.
Streams (TBD)
StreamWriters are system code, but they write to user Streams, so these writes should also be impersonated.
It is not yet determined how impersonation will work here, but the above approach cannot be used in this case.
The design for this will be fleshed out later. A couple of things to consider when working on that design:
...
The design of the necessary implementation for this has not been fleshed out either, and will come later.
Brief summary of overall changes
- During program runtime, cdap master will impersonate a user and launch the YARN app. This will make it so that cdap programs run as various users.
- Because these users will not have access to system tables, they will go through CDAP system services for writing to system tables (run records, lineage, usage, workflow token).
- During namespace operations (create/delete), dataset service will perform the namespace create and delete operations (HBase namespace, HDFS directories, explore database), while impersonating the configured user.
- During dataset admin operations (create/delete/truncate), dataset op executor service will perform the operations while impersonating the configured user.
- (to be finalized) Stream admin operations as well as stream writing operations will have to happen while impersonating the configured user.
- (to be finalized) Explore queries launched will have to happen while impersonating the configured user.
- (to be finalized) Artifact deployment will also need to impersonate the user when deploying an artifact in user scope.
Note: any time a system service needs to impersonate a user, it will have to look up the configured principal/keytab, localize the keytab from the distributed file system, and create a UGI based on this keytab. A caching mechanism for these UGIs would be useful.
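As a rough illustration of the caching idea in the note above, here is a minimal sketch of a per-principal cache. `LoginCache` and its loader are hypothetical names, not part of CDAP or Hadoop. In real use the cached type would presumably be `UserGroupInformation`, and the loader would localize the keytab and call `loginUserFromKeytabAndReturnUGI`; the sketch stays generic so it runs without Hadoop.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical cache holding one login handle per principal, so that the
 * expensive steps (keytab localization and login) happen once per principal.
 */
final class LoginCache<U> {
  private final ConcurrentMap<String, U> cache = new ConcurrentHashMap<>();
  private final Function<String, U> loader;

  /** @param loader assumed to perform the principal/keytab lookup and login */
  LoginCache(Function<String, U> loader) {
    this.loader = loader;
  }

  /** Returns the cached handle, invoking the loader only on first use of a principal. */
  U get(String principal) {
    return cache.computeIfAbsent(principal, loader);
  }

  /** Drops a cached handle, e.g. after the namespace's principal or keytab changes. */
  void invalidate(String principal) {
    cache.remove(principal);
  }
}
```

Since Kerberos tickets are time-limited, a real cache would likely also need expiry or re-login handling, which this sketch leaves out.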
Problems Encountered
...
User applications writing to CDAP System tables
...
Any thoughts on this approach, or workable alternatives to this, are welcome.
Pending Questions
- How will admins configure multiple keytabs (for the various configured principals)?
- Should we restrict updates to particular fields of the NamespaceConfig? Making it a 'final' configuration may simplify edge cases of the implementation, and will also reduce runtime failures. For instance, if user changes the principal of a namespace, the user would have to ensure that this new principal has all the appropriate permissions.
- When launching jobs through twill, the staging directory is always cdap/twill/...; do we need to change twill to pass in the staging dir through prepareRun?
- If a user is logged into cdap as 'ali', shouldn't we run the YARN app as user 'ali', instead of the mapping configured on the namespace/app/etc.?
- Programs launched by a workflow: how will the appropriate principal be used for the launched programs (MapReduce, Spark, Custom Action, etc.)?
...