Hi there,

I'm working on the Gradoop project at the University of Leipzig (https://github.com/dbs-leipzig/gradoop). Currently we're using the Batch-API - now we're investigating Table-API as an abstraction for Batch-API. I found 2 issues I want to discuss:

1. I get an error (Error while applying rule AggregateUnionAggregateRule) on compile time when having a DISTINCT on a result of a JOIN within an UNION, e.g.

(
  SELECT DISTINCT c
  FROM a JOIN b ON a = b
)
UNION
(
  SELECT c
  FROM c
)

Java example: https://gist.github.com/lordon/27fc5277b0d5abd58158f4ec40cda384

2. As we have large workflows, several parts of such a workflow are reused at differents point within the workflow. For example: Two datasets get scanned, INTERSECTED and JOINED to another dataset. The resulting dataset is used as JOIN partner for six other datasets. Using Table-API the resulting operator tree looks like:

Workflow

As you can see, the whole part of INTERSECTING and JOINING is executed for each reference. I guess this is because you decided to treat Flink Tables as VIEWs which get recalculated on each reference. In fact this doesn't make sense for our large workflows (note we're using the BatchEnvironment only). Is there any chance to avoid that behavior? Is there a possibility to allow Calcite to optimize/combine such common sub trees in the operator tree?

Thanks in advance!

Best,
Elias

Reply via email to