Hi Fabian,

Should we close this issue then? Or I could just leave the comment why we can't 
repair NOT IN at the moment. So no one else will do the same research again. 
Perhaps the Calcite team will change a logical plan for NOT IN and we will be 
back to this issue.

Regards,
Alexander

-----Original Message-----
From: Fabian Hueske [mailto:fhue...@gmail.com] 
Sent: Friday, November 11, 2016 2:04 AM
To: dev@flink.apache.org
Subject: Re: [FLINK-4541] Support for SQL NOT IN operator

> Hi Alexander,
>
> Thanks for looking into this issue!
>
> We did not support CROSS JOIN on purpose because the general case is very 
> expensive to compute.
> Also as you noticed we would have to make sure that inner-joins are preferred 
> over cross joins (if possible).
> Cost-based optimizers (such as Calcite's Volcano Planner) use cost estimates 
> for that.
> So in principle, the approach of assigning high costs was correct. In the 
> case of Flink,
> we have to be careful though, because we do not have good estimates for the 
> size of
> our inputs (actually we do not have any at the moment...). So even tweaked 
> cost functions might bit us.

> I would actually prefer to not add support for NOT IN (except for the simple 
> case of a list of a few literals) at this point.
> Adding a CrossJoin translation rule will also add support for arbitrary cross 
> joins, which I would like to avoid.
> With support for cross joins it is possible to write by accident queries that 
> run for days and produce
> vast amounts of data filling up all disks.
>
> I also think there are other issues, which are more important to address and 
> would have a bigger impact than support for NOT IN.
>
> Best, Fabian
>
> 2016-11-10 11:26 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>:
>
> > Hi,
> >
> > I am working on FLINK-4541 issue and this is my current changes:
> > https://github.com/apache/flink/compare/master...
> > AlexanderShoshin:FLINK-4541.
> >
> > I found that NOT IN does not work with nested queries because of 
> > missed DataSet planner rule for a cross join. After adding 
> > DataSetCrossJoinRule several tests from 
> > org.apache.flink.api.scala.batch.ExplainTest
> > (testJoinWithExtended and testJoinWithoutExtended) that should check 
> > execution plans became fail because VolcanoPlanner started to build 
> > new execution plans for them using DataSetCrossJoin. That's why I 
> > increased DataSetCrossJoin cost (in computeSelfCost(...) function) to 
> > avoid its usage if it is possible. But it seems to be not a good idea.
> >
> > Do you have any ideas how to solve a problem with testJoinWithExtended 
> > and testJoinWithoutExtended tests in another way? Is it correct that 
> > these tests check an execution plan instead of a query result?
> >
> > Regards,
> > Alexander
> >
> >

Reply via email to