Hi all, I would like to consult with you regarding deployment strategies.
We have +250 Kafka topics that we want users of the platform to submit SQL queries that will run indefinitely. We have a query parsers to extract topic names from user queries, and the application locally creates Kafka tables and execute the query. The result can be collected to multiple sinks such as databases, files, cloud services. We want to have the best isolation between queries, so in case of failures, the other jobs will not get affected. We have a huge YARN cluster to handle 1PB a day scale from Kafka. I believe cluster per job type deployment makes sense for the sake of isolation. However, that creates some scalability problems. There might be SQL queries running on the same Kafka topic that we do not want to read them again for each query in different sessions. The ideal case is that we read the topic once and executes multiple queries on this data to avoid rereading the same topic. That breaks the desire of a fully isolated system, but it improves network and Kafka performance and still provides isolation on the topic level as we just read the topic once and execute multiple SQL queries on it. We are quite new to Flink, but we have experience with Spark. In Spark, we can submit an application, and in master, that can listen a query queue and submit jobs to the cluster dynamically from different threads. However, In Flink, it looks like the main() has to produce the job the graph in advance. We do use an EMR cluster; what would you recommend for my use case? Thank you. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/