Hi Alexander,

I had a look at the logical plan generated by Calcite for NOT IN again and
noticed that the CROSS JOIN is only used to join a single record (the
result of a global aggregation) to all records of the other table.
IMO, this is a special case that we can easily support and efficiently
implement.
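To illustrate the plan shape (a minimal, hypothetical sketch in plain Scala, not Flink or Calcite code; all names here are illustrative): the subquery is reduced by a global aggregation to a single record, which is then combined with every row of the outer table, after which the NOT IN predicate is applied (NULL handling omitted for brevity):

```scala
// Sketch of the plan shape behind NOT IN: global aggregation -> single
// record -> cross join with every outer row -> filter. Illustrative only.
object NotInShape {
  // the single record produced by the global aggregation over the subquery
  case class Agg(values: Set[Int])

  def notIn(outer: Seq[Int], agg: Agg): Seq[Int] =
    for {
      x   <- outer
      row <- Seq(agg)                // "cross join" with the one aggregate record
      if !row.values.contains(x)     // the NOT IN predicate
    } yield x
}
```

For example, `NotInShape.notIn(Seq(1, 2, 3, 4), NotInShape.Agg(Set(2, 4)))` yields `Seq(1, 3)`.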

For that, I would implement a translation rule that converts a Cross Join
(join(true)) into a DataSetSingleRowCross only if one input of the join is
a global aggregation, i.e., a LogicalAggregate without a groupSet / groupSets.
The DataSetSingleRowCross would be implemented as a MapFunction with a
BroadcastSet input (for the single-record input) that internally calls a
code-generated JoinFunction.
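The runtime strategy could look roughly like this (a hedged sketch simulated in plain Scala without a Flink dependency: the `broadcastRow` parameter plays the role of the BroadcastSet's single record, and `joinFn` stands in for the code-generated JoinFunction; all names are illustrative):

```scala
// Sketch of the proposed DataSetSingleRowCross runtime strategy:
// a single map over the large input, with the one-record side made
// available everywhere, so no shuffle or full cross product is needed.
object SingleRowCross {
  def singleRowCross[L, R, O](large: Seq[L], broadcastRow: R)
                             (joinFn: (L, R) => O): Seq[O] =
    large.map(l => joinFn(l, broadcastRow))
}
```

In the real implementation this would be a RichMapFunction reading the single record from a broadcast variable, but the data flow is the same.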

If you want to follow that approach, please add a summary of our discussion
to the JIRA issue and assign it to yourself (if that doesn't work, please
let me know and I'll give you contributor permissions).

Thanks,
Fabian

2016-11-11 9:28 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>:

> Hi Fabian,
>
> Should we close this issue then? Or I could just leave a comment explaining
> why we can't support NOT IN at the moment, so no one else will do the same
> research again. Perhaps the Calcite team will change the logical plan for
> NOT IN and we will come back to this issue.
>
> Regards,
> Alexander
>
> -----Original Message-----
> From: Fabian Hueske [mailto:fhue...@gmail.com]
> Sent: Friday, November 11, 2016 2:04 AM
> To: dev@flink.apache.org
> Subject: Re: [FLINK-4541] Support for SQL NOT IN operator
>
> > Hi Alexander,
> >
> > Thanks for looking into this issue!
> >
> > We did not support CROSS JOIN on purpose because the general case is
> very expensive to compute.
> > Also, as you noticed, we would have to make sure that inner joins are
> preferred over cross joins (if possible).
> > Cost-based optimizers (such as Calcite's Volcano Planner) use cost
> estimates for that.
> > So in principle, the approach of assigning high costs was correct. In
> the case of Flink,
> > we have to be careful though, because we do not have good estimates for
> the size of
> > our inputs (actually we do not have any at the moment...). So even
> tweaked cost functions might bite us.
>
> > I would actually prefer to not add support for NOT IN (except for the
> simple case of a list of a few literals) at this point.
> > Adding a CrossJoin translation rule will also add support for arbitrary
> cross joins, which I would like to avoid.
> > With support for cross joins it is possible to accidentally write queries
> that run for days and produce
> > vast amounts of data filling up all disks.
> >
> > I also think there are other issues that are more important to address
> and would have a bigger impact than support for NOT IN.
> >
> > Best, Fabian
> >
> > 2016-11-10 11:26 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com
> >:
> >
> > > Hi,
> > >
> > > I am working on the FLINK-4541 issue and these are my current changes:
> > > https://github.com/apache/flink/compare/master...
> > > AlexanderShoshin:FLINK-4541.
> > >
> > > I found that NOT IN does not work with nested queries because of a
> > > missing DataSet planner rule for a cross join. After adding
> > > DataSetCrossJoinRule, several tests from
> > > org.apache.flink.api.scala.batch.ExplainTest
> > > (testJoinWithExtended and testJoinWithoutExtended) that check
> > > execution plans started to fail because the VolcanoPlanner began to
> > > build new execution plans for them using DataSetCrossJoin. That's why I
> > > increased the DataSetCrossJoin cost (in the computeSelfCost(...)
> > > function) to avoid its usage where possible. But that does not seem to
> > > be a good idea.
> > >
> > > Do you have any ideas how to solve a problem with testJoinWithExtended
> > > and testJoinWithoutExtended tests in another way? Is it correct that
> > > these tests check an execution plan instead of a query result?
> > >
> > > Regards,
> > > Alexander
> > >
> > >
>
