Hello, I am trying to implement a satellite image processing chain. Satellite images are stored as rasters which are heavy, (several GBs) in a FileSystem (I am currently using HDFS for testing purpose but will move on S3 when I'll deploy it on the cloud). So in order to reduce the processing time and save up RAM, the rasters are split into tiles to which we apply algorithms that can be parallelized. Once they are done, we have to aggregate the processed tiles into a processed raster (the output product). To do so, we directly write the processed tiles onto the processed raster (its skeloton was created in the FileSystem beforehand) which allows us to write them as they are produced and not wait for all of them to have been processed and save up RAM by not gathering them all on one node before writing down the raster.
Since it's a peculiar format not handled by the FileSystem connector and I don't know if the FileSystem Sink has a feature to write several times on the same file at specific parts, I decided to try my own implementation with a Flatmap that receives the processed tiles and write them in HDFS as they come using rasterio (a library for reading and writing rasters that uses gdal in fact). A simple solution is to open the connection in the Flatmap everytime and write (there shouldn't be concurrent writing since the tiles are keyed by raster id which will hand over all the tiles of a same raster to the same TaskSlot if I have understood correctly). However, I was wondering if I could open the connection, store the DatasetReader in the state and call it in the Flatmap to write down the tiles to avoid reopening it every time a new processed tile is produced (a raster can be divided into thousand of tiles). To sum it up, here are my questions: - Can the FileSystem sink write down rasters and write several times on the same file ? - Can Flink's state store java objects such as a DatasetReader (which is returned by rasterio.open(...)) ? Sincerely, Ky Alexandre