On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax <re...@google.com> wrote:

> There has been a lot of conversation about schemas on PCollections
> recently. There are a number of reasons for this. Schemas as first-class
> objects in Beam provide a nice base for building BeamSQL. Spark has
> provided schema support via DataFrames for over two years, and it has
> proved to be very popular among Spark users; it turns out that FlumeJava -
> the original inspiration for the Beam API - has had schema support for even
> longer, though this feature was not included in the Beam (at that time
> Dataflow) API. It turns out that most records have structure, and allowing
> the system to understand record structure can both simplify usage of the
> system and allow for new performance optimizations.
>
> After discussion with JB, Eugene, Kenn, Robert, and a number of others on
> the list, I've started a proposal document here
> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing>
> describing how schemas can be added to Beam in a manner that integrates
> with the existing Beam API. The goal is not to blindly copy existing systems
> that have schemas, but rather to ensure that we get the best fit for Beam.
> Please comment on this proposal - as much feedback as possible is valuable.
>
> In addition, you may notice this document is incomplete. While it does
> sketch out how schemas can fit into Beam semantically, many portions of
> this design remain to be fleshed out. In particular, the API signatures are
> only sketched at a high level; exactly what all these APIs will look
> like has not yet been defined. I would welcome help from interested members
> of the community to define these APIs, and to make sure we're covering all
> relevant use cases.
>

Thanks for sharing this, Reuven. I'm excited to see this being discussed.
One global comment: all of the existing examples are in Java. It would be
great if we could design this with Python in mind (and how it could
interact cleanly with Pandas) at the same time. +Robert Bradshaw
<rober...@google.com>, +Holden Karau <hka...@google.com>, and +Ahmet Altay
<al...@google.com>, all of whom I've spoken with regarding this and other
Python things recently, just to be sure they see it. But of course it'd be
great if anyone working on Python could jump in.

-Tyler



>
> Thanks all,
>
> Reuven
>
>
>
