On 11.07.20 10:31, Georg Heiler wrote:
1) Similarly to Spark, the Table API works on some optimized binary
representation.
2) This is only available in the SQL way of interacting - there is no
programmatic API.

Yes, it's available from SQL, but also from the Table API, which is a programmatic, declarative API similar to Spark's Structured Streaming.
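
For illustration, a minimal Table API sketch in Scala; "Orders" is a hypothetical table that would have to be registered first (e.g. via a connector DDL):

import org.apache.flink.table.api._

// Build a TableEnvironment in streaming mode.
val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
val tableEnv = TableEnvironment.create(settings)

// The same kind of query you would write in SQL, expressed programmatically.
val revenue = tableEnv
  .from("Orders")
  .groupBy($"user")
  .select($"user", $"amount".sum)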


q1) I have read somewhere (I think in some Flink Forward presentations)
that the SQL API is not necessarily stable with regard to state - even
with small changes to the DAG (due to optimization). Does this also/still
apply to the Table API? (I assume yes.)

Yes, unfortunately this is correct. Because the Table API/SQL is declarative, users don't have control over the DAG or the state that the operators hold. Some work will happen on at least making sure that the optimizer stays stable between Flink versions, or on letting users pin a certain physical graph of a query so that it can be reused across versions.

q2) When I use the DataSet/DataStream (classical Scala/Java) API, it looks
like I must create a custom serializer if I want to handle one/all of:

   - side-output failing records instead of simply crashing the job
   - as asked before, automatic serialization to a Scala (case) class

This is true, yes.
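
For completeness, here is a minimal sketch of what such a custom deserialization schema could look like; the Event case class and all names are illustrative only:

import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical target case class.
case class Event(id: Long, payload: String)

class EventDeserializationSchema extends DeserializationSchema[Event] {

  // Transient + lazy so the non-serializable ObjectMapper is not shipped
  // with the schema but created on the task manager instead.
  @transient private lazy val mapper =
    new ObjectMapper().registerModule(DefaultScalaModule)

  override def deserialize(message: Array[Byte]): Event =
    mapper.readValue(message, classOf[Event])

  override def isEndOfStream(nextElement: Event): Boolean = false

  override def getProducedType: TypeInformation[Event] =
    createTypeInformation[Event]
}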

But I also read that creating the ObjectMapper (i.e., in Jackson terms)
inside the map function is not recommended. From Spark I know that there is
a mapPartitions function, i.e., something where a database connection can
be created once and then reused for the individual elements. Is a similar
construct available in Flink as well?

Yes, for this you can use "rich functions", which have open()/close() methods that allow initializing and re-using resources across invocations: https://ci.apache.org/projects/flink/flink-docs-master/dev/user_defined_functions.html#rich-functions
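
For example, a rich map function that creates the ObjectMapper once per parallel instance (reusing the hypothetical Event case class from the sketch above):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

class ParseJson extends RichMapFunction[String, Event] {

  // Initialized once per parallel instance in open(), not per record.
  @transient private var mapper: ObjectMapper = _

  override def open(parameters: Configuration): Unit = {
    mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  }

  override def map(value: String): Event =
    mapper.readValue(value, classOf[Event])
}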

Also, I have read a lot of articles, and it looks like a lot of people
are using the String serializer and then manually parsing the JSON, which
also seems inefficient.
Where would I find an example of a serializer with side outputs for
failed records, as well as efficient initialization using some construct
similar to mapPartitions?

I'm not aware of such examples, unfortunately.
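
As a rough starting point, though, the usual building blocks can be combined: a ProcessFunction (which is also a rich function) that initializes the ObjectMapper in open(), emits parsed records downstream, and routes unparseable records to a side output via an OutputTag. A sketch, again with illustrative names and reusing the Event case class from above:

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

import scala.util.control.NonFatal

object ParseJsonWithSideOutput {
  // Raw strings that fail to parse go here instead of failing the job.
  val failedTag: OutputTag[String] = OutputTag[String]("failed-records")
}

class ParseJsonWithSideOutput extends ProcessFunction[String, Event] {
  import ParseJsonWithSideOutput.failedTag

  @transient private var mapper: ObjectMapper = _

  // open() gives the mapPartitions-style one-time initialization.
  override def open(parameters: Configuration): Unit = {
    mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  }

  override def processElement(
      value: String,
      ctx: ProcessFunction[String, Event]#Context,
      out: Collector[Event]): Unit = {
    try {
      out.collect(mapper.readValue(value, classOf[Event]))
    } catch {
      case NonFatal(_) => ctx.output(failedTag, value)
    }
  }
}

// Usage: the main stream carries parsed events; failures come out the side.
// val parsed = lines.process(new ParseJsonWithSideOutput)
// val failures = parsed.getSideOutput(ParseJsonWithSideOutput.failedTag)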

I hope that at least some of the answers are helpful!

Best,
Aljoscha
