Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This also includes removing the temporary datasets that could be created as a part of workflows.

 

API Changes: 

The two choices for APIs: 

  1. Collapse on Program: The API will request collapse=workflow for this view. If collapse=workflow is not provided then, the workflow fields are filled (if applicable) but relations are not collapsed. This approach is an extension of the collapse approach used today. But since workflows are also programs, it has the ambiguity when collapsing on "programs" because it still shows workflows. 
  2. Group-by on Workflow: In addition to the collapse, if a group-by is provided, it can be used as group-by=workflow and the programs that dont have workflows associated are left as it is. This approach also has some level of ambiguity listed above but is extensible if in future there is a requirement for group-by on application. [Because collapse workflow is confusing if there are no workflows present]  

Implementation  #1 rollup=workflow: New argument added for the lineage API. Programs in all relations are simply replaced by their associated workflows, if any. Programs that are not part of any workflows are left as it is. 

Lineage view will not have a mapping between programs and workflows. 

  • JSON Response Changes: 
  1. In the "programs" section: the ProgramId of the Program will be replaced by ProgramId of the workflow if the program is associated with a workflow. If not the Programs show as it is. 
  2. In the "relations" section: a new field called "workflow" will be added for all the relations that could be collapsed based on workflows. The programs field will be a collection of all the programs that were collapsed to form this workflow.
    the programs field will carry the name of the workflows if applicable.

 

Implementation Changes: 

  1. Once all the relations are set up for the required lineage, the new code: 
    1. Walks over all relations and makes a list of all Workflow IDs associated with all Programs in the relations. 
    2. For all these workflow IDs, AppMetadataStore is scanned to return a map of <ProgramID, MDSkey>. [while applying the workflow IDs as filter]
    3. Walks this map and creates another map of <RunIDs, ProgramIDs>. This map contains all workflow RunIDs for associated programs in the required relations. 
    4. Walk over the relations again, and replace ProgramIDs of Programs with ProgramIDs of associated workflows.  

curl "http://127.0.0.1:10000/v3/namespaces/default/datasets/EmpAgg/lineage?collapse=access&collapse=run&collapse=component&

...

rollup=

...

workflow&end=now&levels=1&start=now-7d" | python -m json.tool

 

...

                                 Dload  Upload   Total   Spent    Left  Speed

100  1006  100  1006    0     0   121k      0 --:--:-- --:--:-- --:--:--  140k

{

    "data": {

...

Code Block
curl "http://127.0.0.1:10000/v3/namespaces/default/datasets/EmpAgg/lineage?collapse=access&collapse=run&collapse=component&rollup=workflow&end=now&levels=1&start=now-7d" | python -m json.tool

{
    "data": {
        "dataset.default.EmpAgg":

...

 {
            "entityId":

...

 {
                "id":

...

 {
                    "instanceId": "EmpAgg",

...


                    "namespace":

...

 {
                        "id": "default"

...

                    }

                },

                "type": "datasetinstance"

            }

        },

...


                    }
                },
                "type": "datasetinstance"
            }
        },
        "dataset.default.conn-0.e0591f36-9661-11e6-af4a-0000007182af":

...

 {
            "entityId":

...

 {
                "id":

...

 {
                    "instanceId": "conn-0.e0591f36-9661-11e6-af4a-0000007182af",

...


                    "namespace":

...

 {
                        "id": "default"

...

                    }

                },

                "type": "datasetinstance"

            }

        }

    },

    "end": 1476926333,

    "programs": { 

        "<workflow name>": {

            "entityId": {

                "id": {

                    "application": {

                        "applicationId": "EmployeePipe_Long_copy",

                        "namespace": {

                            "id": "default"

                        }

                    },

                    "id": "phase-2",

                    "type": "Mapreduce"

                },

                "type": "<type workflow>"

            }

        }

    },

    "relations": [

        {

            "accesses": [

                "read", 

                "write"

            ],

            "components": [],

...


                    }
                },
                "type": "datasetinstance"
            }
        }
    },
    "end": 1476926333,
    "programs": { 
        "<workflow name>": {
            "entityId": {
                "id": {
                    "application": {
                        "applicationId": "EmployeePipe_Long_copy",
                        "namespace": {
                            "id": "default"
                        }
                    },
                    "id": "phase-2",
                    "type": "Mapreduce"
                },
                "type": "<type workflow>"
            }
        }
    },
    "relations": [
        {
            "accesses": [
                "read", 
                "write"
            ],
            "components": [],
            "data": "dataset.default.conn-0.e0591f36-9661-11e6-af4a-0000007182af",

...

            "workflow" : "<workflow name>"

            "program": [

                "mapreduce.default.EmployeePipe_Long_copy.phase-2",

                "mapreduce.default.EmployeePipe_Long_copy.phase-1"

            ]

            "runs": [

...


            "program" : "<workflow name>"
            "runs": [
                "e4051038-9661-11e6-8060-000000d79ea8",

...


                "e4051038-9661-11e6-8060-000000d79ea8"

...

            ]

        },

        {

...


            ]
        },
        {
            "accesses":

...

                "write",

                "read"

            ],

            "components": [],

...

 [
                "write",
                "read"
            ],
            "components": [],
            "data": "dataset.default.EmpAgg",

...

            "workflow": "", 

...


            "program": "mapreduce.default.EmployeePipe_Long_copy.phase-2",

...


            "runs":

...

 [
                "e4051038-9661-11e6-8060-000000d79ea8"
            ]
        }
    ],
    "start": 1476321533
}


            ]

        }

    ],

    "start": 1476321533



}Other Approaches Considered: 

#2 Collapse=program: Programs are collapsed into workflows if applicable. Mapping of program to workflow is maintained in this case. This approach is an extension of the collapse approach used today. But since workflows are also programs, it has the ambiguity when collapsing on "programs" because it still shows workflows. 

  • JSON Response Changes: 
  1. In the "programs" section: the ProgramId of the Program will be replaced by ProgramId of the workflow if the program is associated with a workflow. If not the Programs show as it is.
  2. In the "relations" section: a new field called "workflow" will be added for all the relations that could be collapsed based on workflows. The programs field will be a collection of all the programs that were collapsed to form this workflow.

#3 Group-by on Workflow: In addition to the collapse, if a group-by is provided, it can be used as group-by=workflow and the programs that dont have workflows associated are left as it is. This approach also has some level of ambiguity listed above but is extensible if in future there is a requirement for group-by on application. [Because collapse workflow is confusing if there are no workflows present]