Hi Guodong,

After an offline discussion with Leonard, I think you have the right
understanding of schema inference.
But there are two cases to distinguish here:
1. The schema of the data is fixed: schema inference can save you the
effort of writing the schema explicitly.
2. The schema of the data is dynamic: in this case schema inference cannot
help, because SQL is a somewhat static language that must know all the
data types at compile time (see the sketch below).

Maybe I misunderstood your question at the very beginning. I thought your
case was #2. If your case is #1, then schema inference is a good choice.
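
To make the distinction concrete, here is a minimal DDL sketch (the table
name and connector properties are placeholders; the fields come from the
sample JSON quoted below):

CREATE TABLE events (
  top_level_key1 STRING, -- fine for case #1: the type never changes
  nested_key2    INT     -- breaks in case #2: a later event where this
                         -- field holds a string cannot be planned
) WITH (
  'connector.type' = 'kafka', -- placeholder
  'format.type' = 'json'
);

In case #1, inference would merely derive this fixed schema for you; in
case #2 there is no single schema to derive.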

On Thu, May 28, 2020 at 11:39 PM Guodong Wang <wangg...@gmail.com> wrote:

> Yes, setting the value type to raw is one possible approach, and I would
> like to vote for schema inference as well.
>
> Correct me if I am wrong: IMO schema inference means I can provide a
> method in the table source that infers the data schema based on runtime
> computation, just like some Calcite adapters do. Right?
> For SQL table registration, I think requiring the table source to provide
> a static schema might be too strict. Letting the planner infer the table
> schema would be more flexible.
>
> Thank you for your suggestions.
>
> Guodong
>
>
> On Thu, May 28, 2020 at 11:11 PM Benchao Li <libenc...@gmail.com> wrote:
>
>> Hi Guodong,
>>
>> Does the RAW type meet your requirements? For example, you could specify
>> the type map<varchar, raw>, where the map values are the raw JsonNode
>> objects parsed by Jackson.
>> This is not supported yet, but IMO it could be.
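>>
>> If it were supported, the DDL might look roughly like the sketch below.
>> This is purely illustrative: as said above, a raw-valued map is not
>> supported yet, so the exact syntax is an assumption, and the connector
>> properties are placeholders:
>>
>> CREATE TABLE json_events (
>>   top_level_key1 VARCHAR,
>>   nested_object MAP<VARCHAR, RAW> -- hypothetical, not supported yet
>> ) WITH (
>>   'connector.type' = 'kafka', -- placeholder
>>   'format.type' = 'json'
>> );
>>
>> The map values would then be Jackson JsonNode objects, and user code
>> would be responsible for interpreting each value's actual type.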
>>
>> On Thu, May 28, 2020 at 9:43 PM Guodong Wang <wangg...@gmail.com> wrote:
>>
>>> Benchao,
>>>
>>> Thank you for your quick reply.
>>>
>>> As you mentioned, approach 2 should work for my current scenario.
>>> But it is a bit annoying that I have to modify the schema to add new
>>> field types whenever the upstream app changes the JSON format or adds new
>>> fields. Otherwise, my users cannot refer to those fields in their SQL.
>>>
>>> Per the description in the jira, I think that after implementing this,
>>> all the JSON values will be converted to strings.
>>> I am wondering whether Flink SQL can/will support flexible schemas in
>>> the future, for example, registering a table without defining a specific
>>> schema for each field, letting the user define a generic map or array for
>>> one field whose values can be any object. The type conversion cost might
>>> then be saved.
>>>
>>> Guodong
>>>
>>>
>>> On Thu, May 28, 2020 at 7:43 PM Benchao Li <libenc...@gmail.com> wrote:
>>>
>>>> Hi Guodong,
>>>>
>>>> I think you almost have the answer:
>>>> 1. Map type: it does not work with the current implementation. For
>>>> example, with map<varchar, varchar>, if a value is a non-string JSON
>>>> object, then `JsonNode.asText()` may not return what you expect.
>>>> 2. List all the fields you care about. IMO this fits your scenario, and
>>>> you can set format.fail-on-missing-field = false so that missing fields
>>>> are set to null (see the DDL sketch below).
>>>>
>>>> For 1, I think we may be able to support it in the future, and I've
>>>> created a jira[1] to track it.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-18002
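>>>>
>>>> As a concrete sketch of approach 2 for the JSON sample quoted below,
>>>> assuming a Kafka source (the connector properties are placeholders):
>>>>
>>>> CREATE TABLE json_events (
>>>>   top_level_key1 STRING,
>>>>   nested_object ROW<
>>>>     nested_key1 STRING,
>>>>     nested_key2 INT,
>>>>     nested_key3 ARRAY<STRING>>
>>>> ) WITH (
>>>>   'connector.type' = 'kafka',              -- placeholder
>>>>   'connector.topic' = 'events',            -- placeholder
>>>>   'format.type' = 'json',
>>>>   'format.fail-on-missing-field' = 'false' -- missing fields become null
>>>> );
>>>>
>>>> Every nested key you want to query has to appear in the ROW type, which
>>>> is exactly the enumeration problem you described.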
>>>>
>>>> On Thu, May 28, 2020 at 6:32 PM Guodong Wang <wangg...@gmail.com> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I want to use Flink SQL to process some JSON events. It is quite
>>>>> challenging to define a schema for the Flink SQL table.
>>>>>
>>>>> My data source's format is JSON like this:
>>>>> {
>>>>>   "top_level_key1": "some value",
>>>>>   "nested_object": {
>>>>>     "nested_key1": "abc",
>>>>>     "nested_key2": 123,
>>>>>     "nested_key3": ["element1", "element2", "element3"]
>>>>>   }
>>>>> }
>>>>>
>>>>> The big challenges for me in defining a schema for this data source are:
>>>>> 1. The keys in nested_object are flexible; there might be 3 unique keys
>>>>> or more. If I enumerate all the keys in the schema, my code becomes
>>>>> fragile. How should I handle an event that contains additional nested
>>>>> keys in nested_object?
>>>>> 2. I know the Table API supports the Map type, but I am not sure whether
>>>>> I can put a generic object as the value of the map, because the values in
>>>>> nested_object are of different types: some are ints, some are strings or
>>>>> arrays.
>>>>>
>>>>> So, how can I expose this kind of JSON data as a table in Flink SQL
>>>>> without enumerating all the nested keys?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Guodong
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Best,
>>>> Benchao Li
>>>>
>>>
>>
>> --
>>
>> Best,
>> Benchao Li
>>
>

-- 

Best,
Benchao Li
