On 11.07.20 10:31, Georg Heiler wrote:
1) Similarly to Spark, the Table API works on some optimized binary representation. 2) This is only available via the SQL way of interacting - there is no programmatic API.
Yes, it's available from SQL, but also from the Table API, which is a programmatic, declarative API, similar to Spark's Structured Streaming.
q1) I have read somewhere (I think in some Flink Forward presentations) that the SQL API is not necessarily stable with regard to state - even small changes to the query can change the DAG (due to optimization). Does this also/still apply to the Table API? (I assume yes.)
Yes, unfortunately this is correct. Because the Table API/SQL is declarative, users don't have control over the DAG or over the state that the operators hold. Some work will happen on at least making sure that the optimizer stays stable between Flink versions, or on letting users pin a certain physical graph of a query so that it can be reused across versions.
q2) When I use the DataSet/DataStream (classical Scala/Java) API, it looks like I must create a custom serializer if I want to handle one or all of: - side-output failing records instead of simply crashing the job - as asked before, automatic serialization to a Scala (case) class
This is true, yes.
But I also read that creating the ObjectMapper (in Jackson terms) inside the map function is not recommended. From Spark I know that there is a mapPartitions function, i.e. something where a database connection can be created once and then reused for the individual elements. Is a similar construct available in Flink as well?
Yes, for this you can use "rich functions", which have open()/close() methods that allow initializing resources once and reusing them across invocations: https://ci.apache.org/projects/flink/flink-docs-master/dev/user_defined_functions.html#rich-functions
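To make the lifecycle concrete, here is a minimal, Flink-free sketch of the pattern in plain Java. The RichMapSketch class is a hypothetical stand-in for Flink's RichMapFunction, and the toy parser stands in for a Jackson ObjectMapper; only the open()/map()/close() lifecycle itself is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Stand-in for Flink's RichMapFunction lifecycle: open() is called once
// per task before any elements are processed, close() once after the
// last one. Heavy resources (ObjectMapper, database connections, ...)
// belong in open(), not in map().
abstract class RichMapSketch<I, O> {
    public void open() throws Exception {}
    public abstract O map(I value) throws Exception;
    public void close() throws Exception {}
}

class JsonFieldExtractor extends RichMapSketch<String, String> {
    private Function<String, String> parser; // stands in for an ObjectMapper

    @Override
    public void open() {
        // Initialized once, then reused for every element of the task.
        parser = s -> s.substring(s.indexOf(':') + 1).replace("}", "").trim();
    }

    @Override
    public String map(String json) {
        return parser.apply(json);
    }
}

public class RichFunctionDemo {
    // A tiny "runtime" that drives the lifecycle the way Flink would.
    static <I, O> List<O> run(RichMapSketch<I, O> fn, List<I> input) throws Exception {
        fn.open();
        List<O> out = new ArrayList<>();
        for (I element : input) {
            out.add(fn.map(element));
        }
        fn.close();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(new JsonFieldExtractor(), List.of("{\"a\": 1}", "{\"a\": 2}")));
    }
}
```

In real Flink the runtime calls open()/close() for you; the sketch only shows why the ObjectMapper should live in open() rather than being constructed per element.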
Also, I have read a lot of articles and it looks like many people are using the String serializer and then manually parsing the JSON, which also seems inefficient. Where would I find an example of a deserializer with side outputs for failed records, as well as efficient initialization using a construct similar to mapPartitions?
I'm not aware of such examples, unfortunately. I hope that at least some of the answers are helpful! Best, Aljoscha
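The "side output for failed records" pattern asked about above can still be sketched without Flink. This is an illustrative, plain-Java sketch, not an official example: in real Flink it would be a ProcessFunction emitting failures to an OutputTag via ctx.output(...), with the parser initialized in open(); here the toy format check stands in for ObjectMapper.readValue(...).

```java
import java.util.ArrayList;
import java.util.List;

// Every input record lands either in the main output (parsed) or in the
// side output (the raw record), so one malformed record never crashes
// the job.
public class SideOutputSketch {
    // Stand-in for the parsed record; in Flink this could be a POJO or case class.
    static class Event {
        final String id;
        Event(String id) { this.id = id; }
    }

    final List<Event> mainOutput = new ArrayList<>();
    final List<String> failedRecords = new ArrayList<>(); // the "side output"

    void process(String raw) {
        try {
            // Stand-in for ObjectMapper.readValue(raw, Event.class):
            // rejects anything that does not look like {"id":"..."}
            if (!raw.startsWith("{\"id\":\"") || !raw.endsWith("\"}")) {
                throw new IllegalArgumentException("malformed: " + raw);
            }
            mainOutput.add(new Event(raw.substring(7, raw.length() - 2)));
        } catch (Exception e) {
            // In Flink: ctx.output(failedTag, raw); here: collect locally.
            failedRecords.add(raw);
        }
    }

    public static void main(String[] args) {
        SideOutputSketch fn = new SideOutputSketch();
        for (String raw : List.of("{\"id\":\"a\"}", "not json", "{\"id\":\"b\"}")) {
            fn.process(raw);
        }
        System.out.println(fn.mainOutput.size() + " ok, " + fn.failedRecords.size() + " failed");
    }
}
```

The design point is that the catch block routes the record instead of rethrowing; the failed-record stream can then be written to a dead-letter sink and inspected separately.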