Audit Log
Use-cases
Case #1
- Rishab is a data scientist/engineer at a company that implements a Data Lake. He is analyzing the effectiveness of the recommendation engine on the company's e-commerce site. For this investigation, he wants to analyze a dataset that includes click logs for the last year. He is looking for clean, up-to-date click log data. He wants to use part of the data to build a model and the rest to score the model and validate its predictions.
- Before he can conduct an analysis, Rishab needs to confirm the dataset is available in the Data Lake.
- To do so, he wishes to find all entities that include “click log”.
- He arrives at the Finder home screen (from nav, search results, other entry points?). For this analysis, Rishab is most concerned with the recency, the accuracy, and the integrity of the data.
- He enters “click log” in the Search Box and clicks Search.
- He arrives at the Results Page.
- Results returned
- By default, they are sorted by creation time
- Each Result includes:
- Snippet of the metadata that matches his query in context.
- Important to help him evaluate the relevance of the results.
- Date Created
- To know how recent/new it is.
- He clicks the result and arrives at the Entity Detail Page where he can view all of the metadata associated with an entity.
- Rishab wishes to verify the validity of this dataset's sources. To do so, he clicks the Lineage Tab to trace the dataset back to its source.
- Finder displays the lineage for this dataset as a diagram. The selected dataset is shown in the center; the entities that precede it appear to the left, and the entities it precedes appear to the right.
- Rishab discovers that it has been created from two separate sources.
- He then clicks one of the sources which takes him to the Entity Page of that dataset.
- He clicks on a program to see what has been done to the dataset.
- Rishab clicks the Audit Logs Tab to see how active this dataset has been: when it was last updated, who is using it, who is writing to it, and who is reading from it.
- Rishab clicks the appropriate action to make this dataset a new source for his existing Click Log processing pipeline.
- This takes him to the Hydrator Studio where he can edit the Master Click Log pipeline.
Storing Audit Log
- Goal: Read AuditLog messages from Kafka and write them to a Table dataset.
- Reusing the MetadataConsumer flowlet from the Navigator App to handle reading messages from Kafka
- Because of this, the app requires a Kafka config in order to be installed:
Code Block { "config": { "metadataKafkaConfig": { "brokerString": "<host>:<port>", "topic" : "audit" } } }
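A minimal wiring sketch, assuming CDAP's AbstractFlow API; the class and flowlet names are illustrative, the MetadataConsumer constructor arguments (e.g. the Kafka config) are omitted, and the AuditLogPublisher flowlet is described below:
Code Block
import co.cask.cdap.api.flow.AbstractFlow;

public class AuditLogFlow extends AbstractFlow {

  @Override
  protected void configure() {
    setName("AuditLogFlow");
    setDescription("Reads AuditLog messages from Kafka and writes them to a Table dataset");
    // Reused from the Navigator App; consumes messages from the configured Kafka topic.
    addFlowlet("metadataConsumer", new MetadataConsumer());
    // New flowlet that persists each message to the AuditLog dataset (see below).
    addFlowlet("auditLogPublisher", new AuditLogPublisher());
    connect("metadataConsumer", "auditLogPublisher");
  }
}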
- New Flowlet (AuditLogPublisher) for writing Kafka messages to Dataset
- Dataset is a Table class
- Data is stored using the inverse timestamp so that the most recent message is always stored and returned first
- Dataset key format: <namespace>DELIMITER<type>DELIMITER<name>DELIMITER<inverseTimeInMilliSecondsLong>DELIMITER<UUID> (a sketch of the key construction follows the column list below)
- DELIMITER is currently "\1"
- Dataset Columns:
- timestamp - Long - the timestamp at which the message was generated
- entityId - EntityId - the entity id that the message refers to. Only entity types with a namespace are supported.
- user - String - the name of the user that generated the message. If the user is blank, a default value of "unknown" is inserted.
- actionType - String - The type of action that was taken. For more details, see: Audit information publishing
- entityType - String - The EntityType from the id, lowercased
- entityName - String - The name of the Entity
- metadata - AuditPayload - The change that was made, either a metadata change or an access. For all other types, the payload is empty
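A minimal sketch of the row construction this implies, assuming the CDAP Table Put API; the helper names are hypothetical, and entityIdJson/metadataJson stand in for the serialized EntityId and AuditPayload values:
Code Block
import co.cask.cdap.api.dataset.table.Put;
import co.cask.cdap.api.dataset.table.Table;

import java.util.UUID;

// Hypothetical helper illustrating the row-key layout and column writes described above.
public final class AuditLogRow {
  // Current delimiter value, per the key format above.
  private static final String DELIMITER = "\1";

  // Builds <namespace>DELIMITER<type>DELIMITER<name>DELIMITER<inverseTime>DELIMITER<UUID>.
  static String rowKey(String namespace, String entityType, String entityName,
                       long messageTimeMillis) {
    // Inverse timestamp: newer messages sort first in a forward scan of the Table.
    long inverseTime = Long.MAX_VALUE - messageTimeMillis;
    return String.join(DELIMITER, namespace, entityType, entityName,
        Long.toString(inverseTime), UUID.randomUUID().toString());
  }

  // Writes one audit message as a row with the columns listed above.
  static void write(Table auditLog, String namespace, String entityType, String entityName,
                    long timestamp, String user, String actionType,
                    String entityIdJson, String metadataJson) {
    // A blank user is stored as "unknown", per the column spec above.
    String effectiveUser = (user == null || user.isEmpty()) ? "unknown" : user;
    auditLog.put(new Put(rowKey(namespace, entityType, entityName, timestamp))
        .add("timestamp", timestamp)
        .add("entityId", entityIdJson)
        .add("user", effectiveUser)
        .add("actionType", actionType)
        .add("entityType", entityType.toLowerCase())
        .add("entityName", entityName)
        .add("metadata", metadataJson));
  }
}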
...
- Goal: Expose the AuditLog dataset as a REST API for consumption by the UI
- Fields returned
- totalResults - the total number of results for the query. If there are more than 100 results, counting stops early at 100, since more than that cannot be shown in the UI.
- offset - The starting offset of the first result
- results - An array of result records with a max length of limit and most recent timestamp first
REST API Design
HTTP Request Type: GET
Endpoint: /namespaces/{namespace-id}/apps/Tracker/services/AuditLog/methods/auditlog/{type}/{name}
Request Params:
- type (required) - The type of the entity to search for, e.g. dataset or stream. Any namespaced entity can be searched for. Possible values: dataset, stream, stream_views
- name (required) - The name of the entity to search for
- startTime (optional, default: 0) - The start time to search from. Accepts "now - 1d" syntax. Milliseconds granularity for timestamps.
- endTime (optional, default: now) - The end time to search to. Accepts "now - 1d" syntax. Milliseconds granularity for timestamps.
- offset (optional, default: 0) - The offset at which to start the results, for paging
- limit (optional, default: 10) - The maximum number of results to return
Response Status:
- 200 - Returns the requested audit log entries
- 400 - Bad Request; returned when the input values are invalid, such as an incorrect date format, negative offsets/limits, or an invalid range. The response will include an appropriate error message.
- 500 - Unknown server error
Response Body:
Code Block
{
  "totalResults": 1,
  "results": [
    {
      "time": 1457467029557,
      "entityId": {
        "namespace": "default",
        "application": "testCubeAdapter",
        "type": "Workflow",
        "program": "ETLWorkflow",
        "entity": "PROGRAM"
      },
      "user": "unknown",
      "type": "METADATA_CHANGE",
      "payload": {
        "previous": {
          "SYSTEM": { "properties": { }, "tags": [ ] }
        },
        "additions": {
          "SYSTEM": { "properties": { }, "tags": [ "ETLMapReduce", "Batch", "Workflow", "ETLWorkflow" ] }
        },
        "deletions": {
          "SYSTEM": { "properties": { }, "tags": [ ] }
        }
      }
    }
  ],
  "offset": 0
}
Example response when no results are found:
Code Block
{
  "totalResults": 0,
  "results": [ ],
  "offset": 0
}
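One possible shape for the CDAP service handler behind this endpoint, a sketch assuming the service is implemented as an HttpServiceHandler; the method name and validation details are illustrative, and parsing/scan logic is elided:
Code Block
import co.cask.cdap.api.service.http.AbstractHttpServiceHandler;
import co.cask.cdap.api.service.http.HttpServiceRequest;
import co.cask.cdap.api.service.http.HttpServiceResponder;

import java.net.HttpURLConnection;
import javax.ws.rs.DefaultValue;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.QueryParam;

public class AuditLogHandler extends AbstractHttpServiceHandler {

  @GET
  @Path("auditlog/{type}/{name}")
  public void auditLog(HttpServiceRequest request, HttpServiceResponder responder,
                       @PathParam("type") String type,
                       @PathParam("name") String name,
                       @QueryParam("startTime") @DefaultValue("0") String startTime,
                       @QueryParam("endTime") @DefaultValue("now") String endTime,
                       @QueryParam("offset") @DefaultValue("0") int offset,
                       @QueryParam("limit") @DefaultValue("10") int limit) {
    // Invalid input (bad date format, negative offsets/limits, bad range) -> 400.
    if (offset < 0 || limit < 0) {
      responder.sendError(HttpURLConnection.HTTP_BAD_REQUEST,
                          "offset and limit must be non-negative");
      return;
    }
    // Parse startTime/endTime (including the "now - 1d" syntax), scan the AuditLog
    // Table between the corresponding inverse timestamps, and respond with JSON
    // containing totalResults (capped at 100), offset, and results, newest first.
  }
}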
...