Hello,

I am trying to implement a satellite image processing chain. Satellite images 
are stored as rasters which are heavy, (several GBs) in a FileSystem (I am 
currently using HDFS for testing purpose but will move on S3 when I'll deploy 
it on the cloud). So in order to reduce the processing time and save up RAM, 
the rasters are split into tiles to which we apply algorithms that can be 
parallelized. Once they are done, we have to aggregate the processed tiles into 
a processed raster (the output product). To do so, we directly write the 
processed tiles onto the processed raster (its skeloton was created in the 
FileSystem beforehand) which allows us to write them as they are produced and 
not wait for all of them to have been processed and save up RAM by not 
gathering them all on one node before writing down the raster.

Since it's a peculiar format not handled by the FileSystem connector and I 
don't know if the FileSystem Sink has a feature to write several times on the 
same file at specific parts, I decided to try my own implementation with a 
Flatmap that receives the processed tiles and write them in HDFS as they come 
using rasterio (a library for reading and writing rasters that uses gdal in 
fact). A simple solution is to open the connection in the Flatmap everytime and 
write (there shouldn't be concurrent writing since the tiles are keyed by 
raster id which will hand over all the tiles of a same raster to the same 
TaskSlot if I have understood correctly). However, I was wondering if I could 
open the connection, store the DatasetReader in the state and call it in the 
Flatmap to write down the tiles to avoid reopening it every time a new 
processed tile is produced (a raster can be divided into thousand of tiles).

To sum it up, here are my questions:
- Can the FileSystem sink write down rasters and write several times on the 
same file ?
- Can Flink's state store java objects such as a DatasetReader (which is 
returned by rasterio.open(...)) ?

Sincerely,
Ky Alexandre

Reply via email to