On 11.07.20 10:31, Georg Heiler wrote:
1) Similarly to Spark, the Table API works on some optimized binary representation. 2) This is only available via the SQL way of interacting - there is no programmatic API.
Yes, it's available from SQL, but also from the Table API, which is a programmatic, declarative API, similar to Spark's Structured Streaming.
q1) I have read somewhere (I think in some Flink Forward presentations) that the SQL API is not necessarily stable with regard to state - even small changes to the query can change the DAG (due to optimization). Does this also/still apply to the Table API? (I assume yes.)
Yes, unfortunately this is correct. Because the Table API/SQL is declarative, users don't have control over the DAG or over the state that the operators hold. Some work will happen on at least making sure that the optimizer stays stable between Flink versions, or on letting users pin a certain physical graph of a query so that it can be reused across versions.
q2) When I use the DataSet/DataStream (classical Scala/Java) API, it looks like I must create a custom serializer if I want to handle one or all of: - side-output failing records instead of simply crashing the job - as asked before, automatic serialization to a Scala (case) class
This is true, yes.
But I also read that creating the ObjectMapper (in Jackson terms) inside the map function is not recommended. From Spark I know that there is a mapPartitions function, i.e. something where a database connection can be created once and then reused for the individual elements. Is a similar construct available in Flink as well?
Yes, for this you can use "rich functions", which have open()/close() methods that allow initializing resources once and reusing them across invocations: https://ci.apache.org/projects/flink/flink-docs-master/dev/user_defined_functions.html#rich-functions
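To make the lifecycle concrete, here is a minimal, Flink-free sketch of the pattern in plain Java. The RichMapSketch class is a hypothetical stand-in for Flink's RichMapFunction, and the toy parser stands in for a Jackson ObjectMapper; only the open()/map()/close() lifecycle itself is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-in for Flink's RichMapFunction lifecycle: open() is called once
// per task before any elements are processed, close() once after the
// last one. Heavy resources (ObjectMapper, database connections, ...)
// belong in open(), not in map().
abstract class RichMapSketch<I, O> {
    public void open() throws Exception {}
    public abstract O map(I value) throws Exception;
    public void close() throws Exception {}
}

class JsonFieldExtractor extends RichMapSketch<String, String> {
    private Function<String, String> parser; // stands in for an ObjectMapper

    @Override
    public void open() {
        // Initialized once, then reused for every element of the task.
        parser = s -> s.substring(s.indexOf(':') + 1).replace("}", "").trim();
    }

    @Override
    public String map(String json) {
        return parser.apply(json);
    }
}

public class RichFunctionDemo {
    // A tiny "runtime" that drives the lifecycle the way Flink would.
    static <I, O> List<O> run(RichMapSketch<I, O> fn, List<I> input) throws Exception {
        fn.open();
        List<O> out = new ArrayList<>();
        for (I element : input) {
            out.add(fn.map(element));
        }
        fn.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(new JsonFieldExtractor(), List.of("{\"a\": 1}", "{\"a\": 2}")));
    }
}
```

In real Flink the runtime calls open()/close() for you; the sketch only shows why the ObjectMapper should live in open() rather than being constructed per element.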
Also, I have read a lot of articles and it looks like many people are using the String serializer and then manually parsing the JSON, which also seems inefficient. Where would I find an example of a deserializer with side outputs for failed records, as well as efficient initialization using a construct similar to mapPartitions?
I'm not aware of such examples, unfortunately. I hope that at least some of the answers are helpful! Best, Aljoscha
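The "side output for failed records" pattern asked about above can still be sketched without Flink. This is an illustrative, plain-Java sketch, not an official example: in real Flink it would be a ProcessFunction emitting failures to an OutputTag via ctx.output(...), with the parser initialized in open(); here the toy format check stands in for ObjectMapper.readValue(...).

```java
import java.util.ArrayList;
import java.util.List;

// Every input record lands either in the main output (parsed) or in the
// side output (the raw record), so one malformed record never crashes
// the job.
public class SideOutputSketch {
    // Stand-in for the parsed record; in Flink this could be a POJO or case class.
    static class Event {
        final String id;
        Event(String id) { this.id = id; }
    }

    final List<Event> mainOutput = new ArrayList<>();
    final List<String> failedRecords = new ArrayList<>(); // the "side output"

    void process(String raw) {
        try {
            // Stand-in for ObjectMapper.readValue(raw, Event.class):
            // rejects anything that does not look like {"id":"..."}
            if (!raw.startsWith("{\"id\":\"") || !raw.endsWith("\"}")) {
                throw new IllegalArgumentException("malformed: " + raw);
            }
            mainOutput.add(new Event(raw.substring(7, raw.length() - 2)));
        } catch (Exception e) {
            // In Flink: ctx.output(failedTag, raw); here: collect locally.
            failedRecords.add(raw);
        }
    }

    public static void main(String[] args) {
        SideOutputSketch fn = new SideOutputSketch();
        for (String raw : List.of("{\"id\":\"a\"}", "not json", "{\"id\":\"b\"}")) {
            fn.process(raw);
        }
        System.out.println(fn.mainOutput.size() + " ok, " + fn.failedRecords.size() + " failed");
    }
}
```

The design point is that the catch block routes the record instead of rethrowing; the failed-record stream can then be written to a dead-letter sink and inspected separately.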