[jira] [Commented] (FLINK-5280) Extend TableSource to support nested data

Ivan Mushketyk (JIRA) Mon, 12 Dec 2016 09:22:21 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742507#comment-15742507
 ]


Ivan Mushketyk commented on FLINK-5280:
---------------------------------------

Hi [~fhueske] ,

Thank you for your comments. It's much clearer now, but either I am still
missing something obvious or the task is more involved than it was described.

Let me first describe how I understand this issue so that you can correct me.

The goal of this task is to support nested data structures. That means that
if we have a type definition like this:

{code:java}
class ParentPojo {
  ChildPojo child;
  int num;
}

class ChildPojo {
  String str;
}
{code}

and we have a *TableSource* that returns a dataset of *ParentPojo*, we should
be able to access nested fields in SQL queries. Something like:

{code:sql}
SELECT * FROM pojos WHERE child.str LIKE '%Rubber%'
{code}

In this case *child.str* is a way to access a nested field.

The first thing that confuses me is that the current
[SQL grammar|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/table_api.html#sql-syntax]
does not seem to support any nested field access, but I think this may be a
relatively minor nuisance.

If I understand it correctly, *flink-table* internally converts any input into
a dataset of Rows and then performs operations on it. To convert a nested
*ParentPojo* into a flat schema, we can extract all leaf values into two columns:

{code}
child.str num
{code}

similar to how *Parquet* identifies columns in nested types (see the
following
[slide|http://www.slideshare.net/julienledem/strata-london-2016-the-future-of-column-oriented-data-processing-with-arrow-and-parquet/10?src=clipshare]).
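To make the flattening concrete, here is a rough sketch of how the leaf column names could be derived from the POJO type. It uses plain Java reflection only and is not Flink code: *LeafColumns*, *leafPaths* and *isLeaf* are names I made up for illustration, and the rule for what counts as a leaf (primitives, boxed numbers, String) is an assumption.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of flink-table.
class LeafColumns {
  static class ChildPojo {
    String str;
  }

  static class ParentPojo {
    ChildPojo child;
    int num;
  }

  // Assumption for this sketch: primitives, boxed values and String are leaves.
  static boolean isLeaf(Class<?> type) {
    return type.isPrimitive()
        || type == String.class
        || Number.class.isAssignableFrom(type)
        || type == Boolean.class
        || type == Character.class;
  }

  // Collects dotted paths ("child.str", "num") for all leaf fields of a type.
  static List<String> leafPaths(Class<?> type) {
    List<String> out = new ArrayList<>();
    collect(type, "", out);
    return out;
  }

  // Note: no cycle detection and no handling of arrays or collections;
  // a self-referencing type would recurse forever.
  private static void collect(Class<?> type, String prefix, List<String> out) {
    for (Field f : type.getDeclaredFields()) {
      if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
        continue;
      }
      String path = prefix.isEmpty() ? f.getName() : prefix + "." + f.getName();
      if (isLeaf(f.getType())) {
        out.add(path);
      } else {
        collect(f.getType(), path, out);
      }
    }
  }
}
```

Calling *LeafColumns.leafPaths(LeafColumns.ParentPojo.class)* yields the two paths *child.str* and *num*. Since *getDeclaredFields()* does not guarantee any particular order, a real implementation would need to fix the column order explicitly.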

Now, this is where it becomes more interesting. If I understand it correctly,
*BatchScan#convertToExpectedType* is used to convert an input dataset into a
dataset of *Row*s. For this task it generates a mapper function in
*FlinkRel#getConversionMapper*, which then calls
*CodeGenerator#generateConverterResultExpression*.

So in our case it should generate code similar to this:

{code:java}
public Row map(ParentPojo parent) {
  // one Row field per leaf column: child.str, num
  Row row = new Row(2);
  row.setField(0, parent.child.str);
  row.setField(1, parent.num);

  return row;
}
{code}

*CodeGenerator* accepts *fieldNames* and an optional POJO field mapping to
generate accessors. It seems that the main work is performed in
*CodeGenerator#generateFieldAccess*, which generates access code for the
fields of a POJO. However, it does not create any code that accesses nested
fields; it only generates access code for the top-level POJO field with the
corresponding field name.

Therefore, if I understand this correctly, we need to start with updating 
*CodeGenerator* to generate nested accessors and then we can extend 
*TableSource* to support nested data.
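As a rough illustration of what I mean by "nested accessors", here is a small standalone sketch. It is not the actual *CodeGenerator* logic: *NestedAccessSketch*, *accessExpression* and *readPath* are hypothetical names, and I use Java reflection where the real code would rely on Flink's type information. The first method builds the access expression for a dotted path, the second reads the same path reflectively so the result can be checked.

```java
import java.lang.reflect.Field;

// Hypothetical sketch, not part of flink-table.
class NestedAccessSketch {
  static class ChildPojo {
    String str;
  }

  static class ParentPojo {
    ChildPojo child;
    int num;
  }

  // Builds a Java expression such as "parent.child.str" for a dotted path,
  // checking that every path segment exists as a declared field.
  static String accessExpression(Class<?> rootType, String inputTerm, String path) {
    StringBuilder expr = new StringBuilder(inputTerm);
    Class<?> current = rootType;
    for (String name : path.split("\\.")) {
      try {
        Field f = current.getDeclaredField(name);
        expr.append('.').append(f.getName());
        current = f.getType();
      } catch (NoSuchFieldException e) {
        throw new IllegalArgumentException(
            "No field '" + name + "' in " + current.getName(), e);
      }
    }
    return expr.toString();
  }

  // Reads the same dotted path reflectively; returns null if an
  // intermediate object on the path is null.
  static Object readPath(Object root, String path) {
    Object current = root;
    for (String name : path.split("\\.")) {
      if (current == null) {
        return null;
      }
      try {
        Field f = current.getClass().getDeclaredField(name);
        f.setAccessible(true);
        current = f.get(current);
      } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("Cannot read '" + name + "'", e);
      }
    }
    return current;
  }
}
```

For *ParentPojo* with input term *parent*, the path *child.str* produces the expression *parent.child.str*, which is the accessor used in the mapper above. Generated code would also have to decide what to do when *parent.child* is null; the sketch only handles that in the reflective variant.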

Am I overthinking this issue? Or am I missing something obvious?




> Extend TableSource to support nested data
> -----------------------------------------
>
>                 Key: FLINK-5280
>                 URL: https://issues.apache.org/jira/browse/FLINK-5280
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>    Affects Versions: 1.2.0
>            Reporter: Fabian Hueske
>            Assignee: Ivan Mushketyk
>
> The {{TableSource}} interface does currently only support the definition of 
> flat rows. 
> However, there are several storage formats for nested data that should be 
> supported such as Avro, Json, Parquet, and Orc. The Table API and SQL can 
> also natively handle nested rows.
> The {{TableSource}} interface and the code to register table sources in 
> Calcite's schema need to be extended to support nested data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
