Hi Pierre,

Have you ever thought of declaring your entire JSON payload as a single string field in the `Table` and putting the parsing work in a UDF?
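For example, a rough sketch of what that could look like with a Python UDF (just an illustration, not tested; it assumes Flink 1.11, a STRING source column named `payload`, and placeholder function/key names):

import json

from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Parse the raw JSON string inside the UDF and return only the keys that are
# actually present, as a MAP<STRING, STRING>, instead of declaring 100k fields
# in the table schema.
@udf(input_types=[DataTypes.STRING()],
     result_type=DataTypes.MAP(DataTypes.STRING(), DataTypes.STRING()))
def parse_payload(payload):
    if payload is None:
        return {}
    return {k: str(v) for k, v in json.loads(payload).items()}

# registration and usage, e.g.:
# t_env.register_function("parse_payload", parse_payload)
# t.select("parse_payload(payload)")

This keeps the declared schema tiny and leaves the sparsity handling inside the UDF; you would then promote only the keys you really need downstream into typed columns.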
Best,
Xingbo

On Tue, Dec 1, 2020 at 4:13 AM, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:

> Hi Xingbo,
>
> Many thanks for your follow-up. Yes, you got it right.
> So, using the Table API and a ROW object for the nested output of my UDF, and since types are mandatory, I guess this boils down to:
> - How to nicely specify the types for the 100k fields: shall I use TypeInformation [1], or rather retrieve them from a Schema Registry [2]?
> - Do I have to put NULL values for all the fields that don't have a value in my JSON?
> - Will the resulting Table be "sparse" and suffer performance limitations?
> Let me know if the Table API and ROW are the right candidates here, or if better alternatives exist.
> As said, I'd be glad to apply some downstream transformations using key/value access (and possibly some Table <-> Pandas operations). Hope that doesn't make the wish list too long ;)
>
> Thanks a lot!
>
> Best regards,
>
> [1] https://stackoverflow.com/questions/48696875/how-to-attach-schema-to-a-flink-datastream-on-the-fly
> [2] https://docs.cloudera.com/csa/1.2.0/datastream-connectors/topics/csa-schema-registry.html
>
> On Sat, Nov 28, 2020 at 04:04, Xingbo Huang <hxbks...@gmail.com> wrote:
>
>> Hi Pierre,
>>
>> Sorry for the late reply.
>> Your requirement is that your `Table` has a field in JSON format whose number of keys has reached 100k, and you want to use such a field as the input/output of a UDF, right? As to whether there is a limit on the number of nested keys, I am not sure; other contributors with experience in this area may have answers. On the Python UDF side, if the type of the key or value of your `Map` is `Any`, that is not supported yet. You need to specify a concrete type. For more information, please refer to the related documentation [1].
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/python/table-api-users-guide/python_types.html
>>
>> Best,
>> Xingbo
>>
>> On Nov 28, 2020, at 12:49 AM, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>
>> Hello Wei, Dian, Xingbo,
>>
>> Not really sure when it is appropriate to knock on the door of the community ;)
>> I just wanted to mention that your feedback on the above topic will be highly appreciated, as it will drive the choice of framework on our side for the months to come, and potentially help the community to cover sparse data with Flink.
>>
>> Thanks a lot!
>>
>> Have a great weekend.
>>
>> Best,
>>
>> On Fri, Nov 20, 2020 at 10:11, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>
>>> Hi Wei,
>>>
>>> Thanks for the hint. May I please follow up by adding more context and asking for your guidance?
>>>
>>> In case the aforementioned Map[String,Any] object returned by Scala:
>>>
>>> - Has a defined schema (incl. nesting) with up to 100k (!) different possible keys
>>> - Has only some portion of the keys populated for each record
>>> - Is convertible to JSON
>>> - Has to undergo downstream processing in Flink and/or Python UDFs with key/value access
>>> - Ultimately has to be stored in a Kafka/Avro sink
>>>
>>> How would you declare the types explicitly in such a case?
>>>
>>> Thanks for your support!
>>>
>>> Pierre
>>>
>>> On Thu, Nov 19, 2020 at 03:54, Wei Zhong <weizhong0...@gmail.com> wrote:
>>>
>>>> Hi Pierre,
>>>>
>>>> Currently there is no type hint like 'Map[String, Any]'. The recommended way is to declare your type more explicitly.
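>>>> For instance, a rough, untested sketch of how an explicit type could be assembled on the Python side (the field names and types below are only placeholders; in practice the list could be generated from your own schema source instead of written out by hand):
>>>>
>>>> from pyflink.table import DataTypes
>>>>
>>>> # build an explicit ROW type from a (name, type) list, e.g. derived from your JSON schema
>>>> fields = [("key1", DataTypes.STRING()), ("key2", DataTypes.INT())]
>>>> row_type = DataTypes.ROW([DataTypes.FIELD(name, dtype) for name, dtype in fields])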
>>>>
>>>> If you insist on doing this, you can try declaring a RAW data type for java.util.HashMap [1], but you may encounter some trouble [2] related to the Kryo serializers.
>>>>
>>>> Best,
>>>> Wei
>>>>
>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/types.html#raw
>>>> [2] https://stackoverflow.com/questions/28157236/kryo-serialization-with-nested-hashmap-with-custom-class
>>>>
>>>> On Nov 19, 2020, at 04:31, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>>>
>>>> Hi Wei,
>>>>
>>>> It works! Thanks a lot for your support.
>>>> I hadn't tried this last combination for option 1, and I had the wrong syntax for option 2.
>>>>
>>>> So to summarize:
>>>>
>>>> Methods working:
>>>> - Current: DataTypeHint in the UDF definition + SQL for UDF registration
>>>> - Outdated: override getResultType in the UDF definition + t_env.register_java_function for UDF registration
>>>>
>>>> Type conversions working:
>>>> - scala.collection.immutable.Map[String,String] => org.apache.flink.types.Row => ROW<STRING,STRING>
>>>> - scala.collection.immutable.Map[String,String] => java.util.Map[String,String] => MAP<STRING,STRING>
>>>>
>>>> Any hint for Map[String,Any]?
>>>>
>>>> Best regards,
>>>>
>>>> On Wed, Nov 18, 2020 at 03:26, Wei Zhong <weizhong0...@gmail.com> wrote:
>>>>
>>>>> Hi Pierre,
>>>>>
>>>>> Both approaches work on my local machine; this is my code:
>>>>>
>>>>> Scala UDF:
>>>>>
>>>>> package com.dummy
>>>>>
>>>>> import org.apache.flink.api.common.typeinfo.TypeInformation
>>>>> import org.apache.flink.table.annotation.DataTypeHint
>>>>> import org.apache.flink.table.api.Types
>>>>> import org.apache.flink.table.functions.ScalarFunction
>>>>> import org.apache.flink.types.Row
>>>>>
>>>>> /**
>>>>>   * The Scala UDF.
>>>>>   */
>>>>> class dummyMap extends ScalarFunction {
>>>>>
>>>>>   // If the UDF is registered via a SQL statement, you need to add this type hint
>>>>>   @DataTypeHint("ROW<s STRING,t STRING>")
>>>>>   def eval(): Row = {
>>>>>     Row.of(java.lang.String.valueOf("foo"), java.lang.String.valueOf("bar"))
>>>>>   }
>>>>>
>>>>>   // If the UDF is registered via the method 'register_java_function', you need to override this
>>>>>   // method.
>>>>>   override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = {
>>>>>     // The type of the return values should be TypeInformation
>>>>>     Types.ROW(Array("s", "t"), Array[TypeInformation[_]](Types.STRING(), Types.STRING()))
>>>>>   }
>>>>> }
>>>>>
>>>>> Python code:
>>>>>
>>>>> from pyflink.datastream import StreamExecutionEnvironment
>>>>> from pyflink.table import StreamTableEnvironment
>>>>>
>>>>> s_env = StreamExecutionEnvironment.get_execution_environment()
>>>>> st_env = StreamTableEnvironment.create(s_env)
>>>>>
>>>>> # load the scala udf jar file, the path should be modified to yours
>>>>> # or you can also load the jar file via other approaches
>>>>> st_env.get_config().get_configuration().set_string("pipeline.jars", "file:///Users/zhongwei/the-dummy-udf.jar")
>>>>>
>>>>> # register the udf via a SQL statement
>>>>> st_env.execute_sql("CREATE FUNCTION dummyMap AS 'com.dummy.dummyMap' LANGUAGE SCALA")
>>>>> # or register via the method
>>>>> # st_env.register_java_function("dummyMap", "com.dummy.dummyMap")
>>>>>
>>>>> # prepare source and sink
>>>>> t = st_env.from_elements([(1, 'hi', 'hello'), (2, 'hi', 'hello')], ['a', 'b', 'c'])
>>>>> st_env.execute_sql("""create table mySink (
>>>>>     output_of_my_scala_udf ROW<s STRING,t STRING>
>>>>> ) with (
>>>>>     'connector' = 'print'
>>>>> )""")
>>>>>
>>>>> # execute query
>>>>> t.select("dummyMap()").execute_insert("mySink").get_job_client().get_job_execution_result().result()
>>>>>
>>>>> Best,
>>>>> Wei
>>>>>
>>>>> On Nov 18, 2020, at 03:28, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>>>>
>>>>> Hi Wei,
>>>>>
>>>>> True, I'm using the method you mention, but glad to change.
>>>>> I tried your suggestion instead, but got a similar error.
>>>>>
>>>>> Thanks for your support. This is much more tedious than I thought.
>>>>>
>>>>> *Option 1 - SQL UDF*
>>>>>
>>>>> *SQL UDF*
>>>>> create_func_ddl = """
>>>>> CREATE FUNCTION dummyMap
>>>>> AS 'com.dummy.dummyMap' LANGUAGE SCALA
>>>>> """
>>>>>
>>>>> t_env.execute_sql(create_func_ddl)
>>>>>
>>>>> *Error*
>>>>> Py4JJavaError: An error occurred while calling o672.execute.
>>>>> : org.apache.flink.table.api.TableException: Result field does not match requested type.
>>>>> Requested: Row(s: String, t: String); Actual: GenericType<org.apache.flink.types.Row>
>>>>>
>>>>> *Option 2 - Overriding getResultType*
>>>>>
>>>>> Back to the old registration method, but overriding getResultType:
>>>>>
>>>>> t_env.register_java_function("dummyMap", "com.dummy.dummyMap")
>>>>>
>>>>> *Scala UDF*
>>>>> class dummyMap() extends ScalarFunction {
>>>>>
>>>>>   def eval(): Row = {
>>>>>     Row.of(java.lang.String.valueOf("foo"), java.lang.String.valueOf("bar"))
>>>>>   }
>>>>>
>>>>>   override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = DataTypes.ROW(DataTypes.STRING,DataTypes.STRING)
>>>>> }
>>>>>
>>>>> *Error (on compilation)*
>>>>>
>>>>> [error] dummyMap.scala:66:90: overloaded method value ROW with alternatives:
>>>>> [error]   (x$1: org.apache.flink.table.api.DataTypes.AbstractField*)org.apache.flink.table.types.UnresolvedDataType <and>
>>>>> [error]   ()org.apache.flink.table.types.DataType <and>
>>>>> [error]   (x$1: org.apache.flink.table.api.DataTypes.Field*)org.apache.flink.table.types.DataType
>>>>> [error] cannot be applied to (org.apache.flink.table.types.DataType, org.apache.flink.table.types.DataType)
>>>>> [error]   override def getResultType(signature: Array[Class[_]]): TypeInformation[_] = DataTypes.ROW(DataTypes.STRING,DataTypes.STRING)
>>>>> [error]                                                                               ^
>>>>> [error] one error found
>>>>> [error] (Compile / compileIncremental) Compilation failed
>>>>> [error] Total time: 3 s, completed 17 nov. 2020 à 20:00:01
>>>>>
>>>>> On Tue, Nov 17, 2020 at 14:01, Wei Zhong <weizhong0...@gmail.com> wrote:
>>>>>
>>>>>> Hi Pierre,
>>>>>>
>>>>>> I guess your UDF is registered by the method 'register_java_function', which uses the old type system. In this situation you need to override the 'getResultType' method instead of adding the type hint.
>>>>>>
>>>>>> You can also try to register your UDF via the "CREATE FUNCTION" SQL statement, which accepts the type hint.
>>>>>>
>>>>>> Best,
>>>>>> Wei
>>>>>>
>>>>>> On Nov 17, 2020, at 19:29, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Wei,
>>>>>>
>>>>>> Thanks for your suggestion. Same error.
>>>>>>
>>>>>> *Scala UDF*
>>>>>>
>>>>>> @FunctionHint(output = new DataTypeHint("ROW<s STRING,t STRING>"))
>>>>>> class dummyMap() extends ScalarFunction {
>>>>>>   def eval(): Row = {
>>>>>>     Row.of(java.lang.String.valueOf("foo"), java.lang.String.valueOf("bar"))
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> On Tue, Nov 17, 2020 at 10:04, Wei Zhong <weizhong0...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Pierre,
>>>>>>>
>>>>>>> You can try to replace the '@DataTypeHint("ROW<s STRING,t STRING>")' with '@FunctionHint(output = new DataTypeHint("ROW<s STRING,t STRING>"))'
>>>>>>>
>>>>>>> Best,
>>>>>>> Wei
>>>>>>>
>>>>>>> On Nov 17, 2020, at 15:45, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Dian, Community,
>>>>>>>
>>>>>>> (bringing the thread back to a wider audience)
>>>>>>>
>>>>>>> As you suggested, I've tried to use DataTypeHint with Row instead of Map, but this simple case also leads to a type mismatch between the UDF and the Table API.
>>>>>>> I've also tried other Map objects from Flink (table.data.MapData, flink.types.MapValue, flink.table.api.DataTypes.MAP) in addition to Java (java.util.Map) in combination with DataTypeHint, without success.
>>>>>>> N.B. I'm using version 1.11.
>>>>>>>
>>>>>>> Am I doing something wrong, or am I facing limitations in the toolkit?
>>>>>>>
>>>>>>> Thanks in advance for your support!
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> *Scala UDF*
>>>>>>>
>>>>>>> class dummyMap() extends ScalarFunction {
>>>>>>>
>>>>>>>   @DataTypeHint("ROW<s STRING,t STRING>")
>>>>>>>   def eval(): Row = {
>>>>>>>     Row.of(java.lang.String.valueOf("foo"), java.lang.String.valueOf("bar"))
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> *Table DDL*
>>>>>>>
>>>>>>> my_sink_ddl = f"""
>>>>>>> create table mySink (
>>>>>>>     output_of_my_scala_udf ROW<s STRING,t STRING>
>>>>>>> ) with (
>>>>>>>     ...
>>>>>>> )
>>>>>>> """
>>>>>>>
>>>>>>> *Error*
>>>>>>>
>>>>>>> Py4JJavaError: An error occurred while calling o2.execute.
>>>>>>> : org.apache.flink.table.api.ValidationException: Field types of query result and registered TableSink `default_catalog`.`default_database`.`mySink` do not match.
>>>>>>> Query result schema: [output_of_my_scala_udf: GenericType<org.apache.flink.types.Row>]
>>>>>>> TableSink schema: [output_of_my_scala_udf: Row(s: String, t: String)]
>>>>>>>
>>>>>>> On Fri, Nov 13, 2020 at 11:59, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Dian, but same error when using an explicit return type:
>>>>>>>>
>>>>>>>> class dummyMap() extends ScalarFunction {
>>>>>>>>
>>>>>>>>   def eval(): util.Map[java.lang.String,java.lang.String] = {
>>>>>>>>     val states = Map("key1" -> "val1", "key2" -> "val2")
>>>>>>>>     states.asInstanceOf[util.Map[java.lang.String,java.lang.String]]
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> On Fri, Nov 13, 2020 at 10:34, Dian Fu <dian0511...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> You need to explicitly define the result type of the UDF. You could refer to [1] for more details if you are using Flink 1.11. If you are using other versions of Flink, you need to refer to the corresponding documentation.
>>>>>>>>>
>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/functions/udfs.html#implementation-guide
>>>>>>>>>
>>>>>>>>> On Nov 13, 2020, at 4:56 PM, Pierre Oberholzer <pierre.oberhol...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> ScalarFunction
>>>>>>>>
>>>>>>>> --
>>>>>>>> Pierre