I need to bootstrap state from Postgres (approximately 200 GB of data), and I notice that the State Processor API requires the DataSet API in order to bootstrap state for the DataStream API.
I wish I could use the SQL API with a partitioned scan, but I don't know whether that is even possible with the DataSet API. I have never used the DataSet API, and I am unsure how it manages memory or distributes load when handling large state. Would it run out of memory if I map data from a JDBCInputFormat into a large DataSet and then use that to bootstrap state for my streaming job? Any advice on how to proceed would be greatly appreciated. Thank you.
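For reference, this is roughly what I have in mind (an untested sketch; the table, connection URL, operator UID, and class names are placeholders, assuming the flink-jdbc connector and the State Processor API). The two `?` placeholders in the query plus a `NumericBetweenParametersProvider` should give a partitioned scan, so each parallel task reads only a slice of the table:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.io.jdbc.split.NumericBetweenParametersProvider;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.types.Row;

public class BootstrapFromPostgres {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Partitioned scan: Flink substitutes the two '?' placeholders with
        // per-split key ranges, so the 200 GB table is read in parallel slices
        // instead of one giant result set. Fetch size / key bounds are guesses.
        JDBCInputFormat input = JDBCInputFormat.buildJDBCInputFormat()
            .setDrivername("org.postgresql.Driver")
            .setDBUrl("jdbc:postgresql://localhost:5432/mydb")          // placeholder
            .setQuery("SELECT id, amount FROM accounts WHERE id BETWEEN ? AND ?")
            .setParametersProvider(new NumericBetweenParametersProvider(10_000L, 0L, 1_000_000L))
            .setRowTypeInfo(new RowTypeInfo(
                BasicTypeInfo.LONG_TYPE_INFO, BasicTypeInfo.LONG_TYPE_INFO))
            .finish();

        DataSet<Row> rows = env.createInput(input);

        Savepoint.create(new MemoryStateBackend(), 128)
            .withOperator("my-operator-uid",            // must match the stream job's uid()
                OperatorTransformation.bootstrapWith(rows)
                    .keyBy(new KeySelector<Row, Long>() {
                        @Override
                        public Long getKey(Row r) {
                            return (Long) r.getField(0);
                        }
                    })
                    .transform(new AccountBootstrapper()))
            .write("file:///tmp/bootstrap-savepoint");   // placeholder path

        env.execute("bootstrap state from Postgres");
    }

    // Writes one ValueState entry per key into the savepoint.
    static class AccountBootstrapper extends KeyedStateBootstrapFunction<Long, Row> {
        private transient ValueState<Long> amount;

        @Override
        public void open(Configuration parameters) {
            amount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("amount", Long.class));
        }

        @Override
        public void processElement(Row row, Context ctx) throws Exception {
            amount.update((Long) row.getField(1));
        }
    }
}
```

My worry is the `DataSet<Row>` step in the middle: whether Flink spills it to disk or tries to hold too much of it in memory at once.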