Re: Confused by the `projectFields` Method in `ProjectableTableSource` Interface

Rong Rong Fri, 12 Jul 2019 10:23:45 -0700

Hi Caizhi,

From my understanding, the "ProjectableTableSource" interface is used for
projection push-down scenarios, where the output your source produces
should match exactly what your SELECT statement requires.

For example, in the case of:
SourceSchema: {a: Int, b: Double, c: String, d: Long}
SQL: "select c, d from my_table"

If your source implements ProjectableTableSource, Flink will invoke the
projectFields method to create another TableSource that does not return the
full schema, but only the selected fields. (This is particularly useful for
columnar storage formats like Parquet, where only a subset of the fields
gets read into Flink.)

So, in short, when Flink invokes the overridden "projectFields" on your
table source, it won't pass in {1,2} as the argument, but {2,3}, i.e. the
fields "c" and "d".
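
To make the index mapping concrete, here is a minimal sketch in plain Java.
It does not use Flink's actual interfaces: SimpleSource and its fieldNames
are hypothetical stand-ins, and the only assumption taken from the
description above is that the indices refer to positions in the source
schema {a, b, c, d}.

```java
import java.util.Arrays;

public class ProjectionSketch {
    // Hypothetical stand-in for a table source; NOT Flink's TableSource API.
    static final class SimpleSource {
        final String[] fieldNames;

        SimpleSource(String[] fieldNames) {
            this.fieldNames = fieldNames;
        }

        // Mirrors the idea behind projectFields: return a new copy that
        // only produces the fields at the given schema positions.
        SimpleSource projectFields(int[] fields) {
            String[] projected = new String[fields.length];
            for (int i = 0; i < fields.length; i++) {
                projected[i] = fieldNames[fields[i]];
            }
            return new SimpleSource(projected);
        }
    }

    public static void main(String[] args) {
        SimpleSource full = new SimpleSource(new String[] {"a", "b", "c", "d"});
        // "select c, d from my_table" corresponds to schema positions 2 and 3.
        SimpleSource projected = full.projectFields(new int[] {2, 3});
        System.out.println(Arrays.toString(projected.fieldNames)); // [c, d]
    }
}
```

Passing {1,2} to the same sketch would yield "b" and "c" instead, which is
not what the query asked for.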

--------
This also brings up an interesting question: in your example, you mentioned
that the table source already has a produced type of {a: Int, c: String,
d: Long}. I am assuming you are asking this with a case like the following
in mind: two SQL queries acting on the same source, "SELECT a, c ..." and
"SELECT c, d ..."

I am not 100% sure, since it has been some time since I looked at the code,
but my understanding is:
projectFields will be invoked twice and generate 2 new instances of the
table source, both with the same table schema, but one with produced type
{a: Int, c: String} and one with {c: String, d: Long}. So there will not be
a table source with {a: Int, c: String, d: Long}.

Thanks,
Rong

On Thu, Jul 11, 2019 at 9:53 PM Caizhi Weng <tsreape...@gmail.com> wrote:

> Hi Flink developers,
>
> When implementing `JDBCTableSource` with `ProjectableTableSource` interface
> I'm confused by the `projectFields` method.
>
> The java doc of the `projectFields` states that (It also has a typo...
> poduced -> produced):
>
> > Creates a copy of the {@link TableSource} that projects its output to the
> > given field indexes.
> > The field indexes relate to the physical poduced data type ({@link
> > TableSource#getProducedDataType()}) and not to the table schema ({@link
> > TableSource#getTableSchema} of the {@link TableSource}.
>
>
> So my understanding of this java doc is that, if the table schema of the
> source is {a: Int, b: Double, c: String, d: Long} and the produced data
> type of the source is {a: Int, c: String, d: Long}. Then if user writes
> "select c, d from my_table" then the project field indices should be
> {1, 2} instead of {2, 3} (because they should be related to the produced
> type and not to the schema).
>
> But the implementation of `CSVTableSource` says otherwise. The field
> indices are related to the schema, not to the produced data type.
>
> I pick the implementation of `CSVTableSource` to implement JDBC table
> source (as `CSVTableSource` obviously passed all the test cases). So which
> one is correct, my understanding on the java doc or the implementation of
> `CSVTableSource`?
>
> Thanks.
>
