Re: Feature request: split dataset based on condition

Ryan Blue Mon, 04 Feb 2019 08:52:50 -0800

Andrew, can you give us more information about why partitioning the output
data doesn't work for your use case?


It sounds like all you need to do is create a table partitioned by A and
B; you would then automatically get the divisions you want. If what you're
looking for is a way to scale the number of combinations, you can use
formats that support more partitions, or you could sort by the fields and
rely on Parquet row group pruning to filter out data you don't want.
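
To make the idea concrete, here is a minimal pure-Python sketch (not Spark
code) of a single pass that keys each row by the outcome of every condition,
which is roughly what partitioning by A and B gives you. The function name
split_by_conditions, the field names pt and eta, and the thresholds are all
hypothetical, chosen only for illustration. In Spark the analogous step
would be to add boolean columns for A and B and write the result with
DataFrameWriter.partitionBy("A", "B").

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, float]
Condition = Callable[[Row], bool]


def split_by_conditions(
    rows: Iterable[Row], conditions: Dict[str, Condition]
) -> Dict[Tuple[bool, ...], List[Row]]:
    """Split rows into disjoint groups in one pass over the input.

    Each row is evaluated against every condition once; the tuple of
    results is the group key, so n conditions yield at most 2**n groups.
    """
    groups: Dict[Tuple[bool, ...], List[Row]] = defaultdict(list)
    for row in rows:
        # Key order follows the insertion order of `conditions`.
        key = tuple(check(row) for check in conditions.values())
        groups[key].append(row)
    return dict(groups)


# Hypothetical input rows and conditions, for illustration only.
events = [
    {"pt": 35.0, "eta": 0.5},
    {"pt": 12.0, "eta": 0.7},
    {"pt": 40.0, "eta": 2.9},
    {"pt": 8.0, "eta": 3.1},
    {"pt": 50.0, "eta": 1.0},
]
conditions = {
    "A": lambda r: r["pt"] > 20.0,
    "B": lambda r: abs(r["eta"]) < 2.4,
}

outputs = split_by_conditions(events, conditions)
for key, group in sorted(outputs.items(), reverse=True):
    print(dict(zip(conditions, key)), len(group))
```

Each key is a tuple such as (True, False) for A!B, so the four outputs from
the example with two conditions come out of one scan rather than four
separate filters.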

rb

On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo <[email protected]> wrote:

> Hello
>
> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini <[email protected]> wrote:
> >
> > I've seen many applications that need to split a dataset into multiple
> > datasets based on some conditions. As there is no method to do it in one
> > place, developers use the filter method multiple times. I think it would
> > be useful to have a method that splits a dataset based on conditions in
> > one iteration, something like Scala's partition method (of course,
> > Scala's partition just splits a list into two lists, but something more
> > general would be more useful).
> > If you think it can be helpful, I can create a Jira issue and work on it
> > to send a PR.
>
> This would be a really useful feature for our use case (processing
> collision data from the LHC). We typically want to take some sort of
> input and split into multiple disjoint outputs based on some
> conditions. E.g. if we have two conditions A and B, we'll end up with
> 4 outputs (AB, !AB, A!B, !A!B). As we add more conditions, the
> combinatorics explode like 2^n, when we could produce them all up
> front with this "multi filter" (or however it would be called).
>
> Cheers
> Andrew
>
> >
> > Best Regards
> > Moein
> >
> > --
> >
> > Moein Hosseini
> > Data Engineer
> > mobile: +98 912 468 1859
> > site: www.moein.xyz
> > email: [email protected]
> >

-- 
Ryan Blue
Software Engineer
Netflix
