Jerry,

Great question. Spark and Tachyon capture lineage information at different granularities. We are working on a Spark/Tachyon integration for this and hope to have it ready for release soon.

Best,

Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Spark developers,
>
> I was thinking it would be nice to extract the data lineage information
> from a data processing pipeline. I assume that Spark/Tachyon keeps this
> information somewhere. For instance, a data processing pipeline uses
> data sources A and B to produce C. C is then used by another process to
> produce D and E. Assuming A, B, C, D, and E are stored on disk, it would
> be very useful to have a way to capture this information when we use
> Spark/Tachyon, and then to query the data lineage. For example: give me
> the datasets that produce E. It should give me a graph like
> (A and B)->C->E.
>
> Is this already possible with Spark/Tachyon? If not, do you think it
> could be made possible? Would anyone mind sharing their experience in
> capturing data lineage in a data processing pipeline?
>
> Best Regards,
>
> Jerry

--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
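
For illustration only, the query in Jerry's example (A and B produce C; C produces D and E; "give me the datasets that produce E") can be sketched as a small dataset-level lineage graph. This is plain Python, not Spark or Tachyon code: the LineageGraph class and its record and upstream_edges methods are hypothetical names invented for this sketch, and it assumes each job reports its input and output datasets explicitly.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical helper: remembers which datasets each output was derived from."""

    def __init__(self):
        # Maps a dataset name to the set of datasets it was directly built from.
        self._parents = defaultdict(set)

    def record(self, inputs, outputs):
        """Register one pipeline step that read `inputs` and wrote `outputs`."""
        for out in outputs:
            self._parents[out].update(inputs)

    def upstream_edges(self, dataset):
        """Return every (parent, child) edge on a path leading to `dataset`."""
        edges = set()
        seen = {dataset}
        stack = [dataset]
        while stack:
            child = stack.pop()
            # .get avoids creating empty entries for source datasets.
            for parent in self._parents.get(child, ()):
                edges.add((parent, child))
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return edges


# The pipeline from the question: (A and B) -> C, then C -> D and E.
graph = LineageGraph()
graph.record(inputs=["A", "B"], outputs=["C"])
graph.record(inputs=["C"], outputs=["D", "E"])

print(sorted(graph.upstream_edges("E")))
# [('A', 'C'), ('B', 'C'), ('C', 'E')]
```

The edges returned for E correspond to the graph (A and B)->C->E from the question; D is left out because it is not upstream of E.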