[ https://issues.apache.org/jira/browse/FLINK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742507#comment-15742507 ]
Ivan Mushketyk commented on FLINK-5280: --------------------------------------- Hi [~fhueske] , Thank you for your comments. It's a much clearer now, but it seems that I am either still missing something obvious or it seems to me that the task is more involved than it was described. Let me first describe how I understand this issue so that you could correct me. So the goal of this task is to support nested data structures. So it means that if we have a type definition like this: {code:java} class ParentPojo { ChildPojo child; int num; } class ChildPojo { String str; } {code} and we have a *TableSource* that returns a dataset of *ParentPojo* we can access nested fields in SQL queries. Something like: {code:sql} SELECT * FROM pojos WHERE child.str LIKE '%Rubber%' {code} In this case *child.str* is a way to access a nested field. The first thing that confuses me is that current [SQL grammar|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/table_api.html#sql-syntax] does not seem to support any nested fields access, but I think may be a relatively minor nuisance. If I understand it correctly internally *flink-table* converts any input into a dataset of Rows and then performs operations on it. To convert a nested *ParentPojo* into a flat schema we can extract all leaf values into two columns: {code} child.str num {code} similarly to how *Parquet* identifies columns in nested types (see the following [slide|http://www.slideshare.net/julienledem/strata-london-2016-the-future-of-column-oriented-data-processing-with-arrow-and-parquet/10?src=clipshare]) Now, where this becomes more interesting. If I understand it correctly *BatchScan#convertToExpectedType* is used to convert an input dataset into a dataset of *Row*s. For this task it generates a mapper function in *FlinkRel#getConversionMapper* which than calls *CodeGenerator#generateConverterResultExpression*. So in our case it should generate code similar to something like: {code:java} public Row map(ParentPojo parent) { Row row = new Row(2); row.setField(0, parent.child.str); row.setField(1, parent.num); return row; } {code} *CodeGenerator* accepts *fieldNames* and optional POJO field mapping to generate accessors. It seems that the main work is performed in *CodeGenerator#generateFieldAccess* that generates an access code for different fields of the POJO, but it does not create any code that accesses nested fields. It just generates an access code to a POJO field with a corresponding field name in CodeGenerator#generateFieldAccess. Therefore, if I understand this correctly, we need to start with updating *CodeGenerator* to generate nested accessors and then we can extend *TableSource* to support nested data. Am I overthink this issue? Or am I missing something obvious? > Extend TableSource to support nested data > ----------------------------------------- > > Key: FLINK-5280 > URL: https://issues.apache.org/jira/browse/FLINK-5280 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL > Affects Versions: 1.2.0 > Reporter: Fabian Hueske > Assignee: Ivan Mushketyk > > The {{TableSource}} interface does currently only support the definition of > flat rows. > However, there are several storage formats for nested data that should be > supported such as Avro, Json, Parquet, and Orc. The Table API and SQL can > also natively handle nested rows. > The {{TableSource}} interface and the code to register table sources in > Calcite's schema need to be extended to support nested data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)