Benchao,

Thank you for your detailed explanation.
Schema inference can solve my problem partially. For example, if starting
from some point in time all the json events contain a new field, I think
schema inference will help. But if I need to handle json events with
different schemas in one table (this is case 2), I agree with you: schema
inference does not help either.

Guodong

On Fri, May 29, 2020 at 11:02 AM Benchao Li <libenc...@gmail.com> wrote:

> Hi Guodong,
>
> After an offline discussion with Leonard, I think you got the right
> meaning of schema inference. But there are two cases here:
> 1. The schema of the data is fixed: schema inference can save you the
> effort of writing the schema explicitly.
> 2. The schema of the data is dynamic: in this case schema inference
> cannot help, because SQL is a somewhat static language, which has to
> know all the data types at compile time.
>
> Maybe I misunderstood your question at the very beginning. I thought
> your case was #2. If your case is #1, then schema inference is a good
> choice.
>
> Guodong Wang <wangg...@gmail.com> wrote on Thu, May 28, 2020 at 11:39 PM:
>
>> Yes, setting the value type to RAW is one possible approach, and I
>> would like to vote for schema inference as well.
>>
>> Correct me if I am wrong, but IMO schema inference means that I can
>> provide a method in the table source to infer the data schema based
>> on runtime computation, just like some Calcite adapters do. Right?
>> For SQL table registration, I think requiring the table source to
>> provide a static schema might be too strict. Letting the planner
>> infer the table schema would be more flexible.
>>
>> Thank you for your suggestions.
>>
>> Guodong
>>
>> On Thu, May 28, 2020 at 11:11 PM Benchao Li <libenc...@gmail.com> wrote:
>>
>>> Hi Guodong,
>>>
>>> Does the RAW type meet your requirements? For example, you could
>>> specify the map<varchar, raw> type, where the value of the map is
>>> the raw JsonNode parsed by Jackson.
>>> This is not supported yet; however, IMO it could be supported.
>>>
>>> Guodong Wang <wangg...@gmail.com> wrote on Thu, May 28, 2020 at 9:43 PM:
>>>
>>>> Benchao,
>>>>
>>>> Thank you for your quick reply.
>>>>
>>>> As you mentioned, approach 2 should work for my current scenario.
>>>> But it is a little annoying that I have to modify the schema to add
>>>> new field types whenever the upstream app changes the json format
>>>> or adds new fields; otherwise my users cannot refer to those fields
>>>> in their SQL.
>>>>
>>>> Per the description in the jira, I think that after implementing
>>>> it, all the json values will be converted to strings.
>>>> I am wondering whether Flink SQL can/will support a flexible schema
>>>> in the future, for example registering a table without defining a
>>>> specific schema for each field, letting the user define a generic
>>>> map or array for a field whose values can be objects of any type.
>>>> Then the type-conversion cost might be saved.
>>>>
>>>> Guodong
>>>>
>>>> On Thu, May 28, 2020 at 7:43 PM Benchao Li <libenc...@gmail.com> wrote:
>>>>
>>>>> Hi Guodong,
>>>>>
>>>>> I think you almost have the answer:
>>>>> 1. Map type: this does not work with the current implementation.
>>>>> For example, with map<varchar, varchar>, if a value is a
>>>>> non-string json object, `JsonNode.asText()` may not work as you
>>>>> wish.
>>>>> 2. List all the fields you care about. IMO this fits your
>>>>> scenario, and you can set format.fail-on-missing-field = false so
>>>>> that fields missing from an event are set to null.
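
A minimal sketch of approach 2 for the sample event quoted in the
original question further down. The WITH properties follow the Flink
1.10-era DDL syntax and are illustrative only (a Kafka source is
assumed; the table, topic, and server names are made up):

    CREATE TABLE json_events (
      top_level_key1 STRING,
      nested_object ROW<
        nested_key1 STRING,
        nested_key2 INT,
        nested_key3 ARRAY<STRING>
      >
    ) WITH (
      'connector.type' = 'kafka',    -- illustrative; any JSON source works
      'connector.version' = 'universal',
      'connector.topic' = 'events',
      'connector.properties.bootstrap.servers' = 'localhost:9092',
      'format.type' = 'json',
      'format.derive-schema' = 'true',
      'format.fail-on-missing-field' = 'false'  -- missing fields become NULL
    );

Only the fields enumerated here become queryable, which is exactly the
trade-off discussed in this thread: any new upstream field requires a
DDL change.
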
>>>>>
>>>>> For 1, I think maybe we can support it in the future, and I've
>>>>> created a jira issue [1] to track this.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-18002
>>>>>
>>>>> Guodong Wang <wangg...@gmail.com> wrote on Thu, May 28, 2020 at 6:32 PM:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I want to use Flink SQL to process some json events. It is quite
>>>>>> challenging to define a schema for the Flink SQL table.
>>>>>>
>>>>>> My data source's format is json like this:
>>>>>>
>>>>>> {
>>>>>>   "top_level_key1": "some value",
>>>>>>   "nested_object": {
>>>>>>     "nested_key1": "abc",
>>>>>>     "nested_key2": 123,
>>>>>>     "nested_key3": ["element1", "element2", "element3"]
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> The big challenges for me in defining a schema for this data
>>>>>> source are:
>>>>>> 1. The keys in nested_object are flexible; there might be 3
>>>>>> unique keys or more. If I enumerate all the keys in the schema, I
>>>>>> think my code is fragile. How do I handle an event that contains
>>>>>> more nested keys in nested_object?
>>>>>> 2. I know the Table API supports the Map type, but I am not sure
>>>>>> whether I can put a generic object as the value of the map,
>>>>>> because the values in nested_object are of different types: some
>>>>>> are ints, some are strings or arrays.
>>>>>>
>>>>>> So, how do I expose this kind of json data as a table in Flink
>>>>>> SQL without enumerating all the nested keys?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Guodong
>>>>>
>>>>> --
>>>>> Best,
>>>>> Benchao Li
>>>
>>> --
>>> Best,
>>> Benchao Li
>
> --
> Best,
> Benchao Li
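
Assuming a table along the lines of the json_events sketch above has
been registered, users can refer to the enumerated nested fields
directly. Two illustrative queries (field and table names are the
assumed ones from the sketch; array subscripts in Flink SQL are
1-based):

    -- Project a top-level field, a nested field, and an array element.
    SELECT
      e.top_level_key1,
      e.nested_object.nested_key1,
      e.nested_object.nested_key3[1] AS first_element
    FROM json_events AS e;

    -- With 'format.fail-on-missing-field' = 'false', a field that is
    -- absent from an incoming event arrives as NULL and can be
    -- filtered explicitly.
    SELECT e.top_level_key1
    FROM json_events AS e
    WHERE e.nested_object.nested_key2 IS NOT NULL;
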