Hi Alexander,

I had a look at the logical plan generated by Calcite for NOT IN again and noticed that the CROSS JOIN is only used to join a single record (the result of a global aggregation) to all records of the other table. IMO, this is a special case that we can easily support and efficiently implement.
For that I would implement a translation rule that converts a Cross Join (join(true)) into a DataSetSingleRowCross only if one input of the join is a global aggregation, i.e., a LogicalAggregate without groupSet / groupSets. The DataSetSingleRowCross would be implemented as a MapFunction with a BroadcastSet input (for the single-record input) and internally call a code-generated JoinFunction.

If you want to follow that approach, please add a summary of our discussion to the JIRA issue and assign it to yourself (if that doesn't work, please let me know and I'll give you contributor permissions).

Thanks, Fabian

2016-11-11 9:28 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>:

> Hi Fabian,
>
> Should we close this issue then? Or I could just leave a comment on why we
> can't fix NOT IN at the moment, so no one else will do the same research
> again. Perhaps the Calcite team will change the logical plan for NOT IN and
> we will come back to this issue.
>
> Regards,
> Alexander
>
> -----Original Message-----
> From: Fabian Hueske [mailto:fhue...@gmail.com]
> Sent: Friday, November 11, 2016 2:04 AM
> To: dev@flink.apache.org
> Subject: Re: [FLINK-4541] Support for SQL NOT IN operator
>
> > Hi Alexander,
> >
> > Thanks for looking into this issue!
> >
> > We did not support CROSS JOIN on purpose because the general case is
> > very expensive to compute.
> > Also, as you noticed, we would have to make sure that inner joins are
> > preferred over cross joins (if possible).
> > Cost-based optimizers (such as Calcite's Volcano Planner) use cost
> > estimates for that.
> > So in principle, the approach of assigning high costs was correct.
> > In the case of Flink, we have to be careful though, because we do not
> > have good estimates for the size of our inputs (actually we do not
> > have any at the moment...). So even tweaked cost functions might bite us.
> > I would actually prefer not to add support for NOT IN (except for the
> > simple case of a list of a few literals) at this point.
> > Adding a CrossJoin translation rule would also add support for arbitrary
> > cross joins, which I would like to avoid. With support for cross joins
> > it is possible to accidentally write queries that run for days and
> > produce vast amounts of data, filling up all disks.
> >
> > I also think there are other issues that are more important to address
> > and would have a bigger impact than support for NOT IN.
> >
> > Best, Fabian
> >
> > 2016-11-10 11:26 GMT+01:00 Alexander Shoshin <alexander_shos...@epam.com>:
> >
> > > Hi,
> > >
> > > I am working on the FLINK-4541 issue and these are my current changes:
> > > https://github.com/apache/flink/compare/master...AlexanderShoshin:FLINK-4541
> > >
> > > I found that NOT IN does not work with nested queries because of a
> > > missing DataSet planner rule for a cross join. After adding
> > > DataSetCrossJoinRule, several tests from
> > > org.apache.flink.api.scala.batch.ExplainTest
> > > (testJoinWithExtended and testJoinWithoutExtended) that check
> > > execution plans started to fail because the VolcanoPlanner began to
> > > build new execution plans for them using DataSetCrossJoin. That's why
> > > I increased the DataSetCrossJoin cost (in the computeSelfCost(...)
> > > function) to avoid its usage where possible. But that does not seem
> > > to be a good idea.
> > >
> > > Do you have any ideas how to solve the problem with the
> > > testJoinWithExtended and testJoinWithoutExtended tests in another way?
> > > Is it correct that these tests check an execution plan instead of a
> > > query result?
> > >
> > > Regards,
> > > Alexander
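PS: The runtime pattern of the proposed DataSetSingleRowCross boils down to this: the global aggregation (no group set) produces exactly one row, which is broadcast and paired with every record of the large input in a plain map, so no shuffle or full cross product is needed. Here is a minimal plain-Java sketch of that idea — illustrative only, not Flink API code; the class, method names, and data are made up:

```java
import java.util.*;
import java.util.stream.*;

public class SingleRowCrossSketch {

    /**
     * Single-row cross join: pair every record of the (large) left input
     * with the one broadcast row. This is just a map over the left input,
     * so no shuffle or full cross product is required.
     */
    public static List<int[]> singleRowCross(List<Integer> left, int singleRow) {
        return left.stream()
                   .map(l -> new int[] { l, singleRow })
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);

        // A global aggregation (no group set) over the right input is
        // guaranteed to produce exactly one row; here we use MAX.
        int singleRow = Stream.of(10, 20, 30)
                              .mapToInt(Integer::intValue)
                              .max()
                              .getAsInt();

        for (int[] pair : singleRowCross(left, singleRow)) {
            System.out.println(pair[0] + "," + pair[1]);
        }
    }
}
```

In the actual DataSet runtime, the sketch's `singleRow` argument would correspond to the broadcast set input of the MapFunction described above, and the pairing step to the code-generated JoinFunction.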