Re: Spark or Tachyon: capture data lineage

Haoyuan Li Fri, 02 Jan 2015 12:33:04 -0800

Jerry,

Great question. Spark and Tachyon capture lineage information at different
granularities. We are working on an integration between Spark and Tachyon for
this, and hope to have it ready for release soon.
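
To make the query in the quoted message below concrete, here is a minimal
sketch of the kind of dataset-level lineage graph being asked about. It is
plain Python and does not use any Spark or Tachyon API; the LineageGraph
class and its record/upstream methods are hypothetical names chosen only for
illustration. It assumes each pipeline step reports its input and output
datasets by name.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical in-memory lineage store (not a Spark or Tachyon API)."""

    def __init__(self):
        # Maps each dataset name to the set of datasets it was derived from.
        self._parents = defaultdict(set)

    def record(self, inputs, outputs):
        """Record that one pipeline step read `inputs` and wrote `outputs`."""
        for out in outputs:
            self._parents[out].update(inputs)

    def upstream(self, dataset):
        """Return every (parent, child) edge on a path leading to `dataset`."""
        edges, stack, seen = set(), [dataset], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            # .get avoids creating empty entries for source datasets.
            for parent in self._parents.get(node, ()):
                edges.add((parent, node))
                stack.append(parent)
        return edges


# The pipeline from the quoted message: A and B produce C; C produces D and E.
graph = LineageGraph()
graph.record(inputs=["A", "B"], outputs=["C"])
graph.record(inputs=["C"], outputs=["D", "E"])

# "Give me the datasets that produce E."
print(sorted(graph.upstream("E")))
# [('A', 'C'), ('B', 'C'), ('C', 'E')]
```

The result corresponds to the graph (A and B)->C->E; D is left out because it
is not an ancestor of E.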


Best,

Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Spark developers,
>
> I was thinking it would be nice to extract the data lineage information
> from a data processing pipeline. I assume that Spark/Tachyon keeps this
> information somewhere. For instance, a data processing pipeline uses
> data sources A and B to produce C. C is then used by another process to
> produce D and E. Assuming A, B, C, D, and E are stored on disk, it would be
> very useful to have a way to capture this information while using
> Spark/Tachyon, and then to query the data lineage. For example: give me the
> datasets that produce E. It should return a graph like (A and B)->C->E.
>
> Is this already possible with Spark/Tachyon? If not, do you think it could
> be made possible? Does anyone have experience capturing data lineage in a
> data processing pipeline that they would be willing to share?
>
> Best Regards,
>
> Jerry
>



-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
