Re: Extracting state keys for a very large RocksDB savepoint

2021-03-17 Thread Andrey Bulgakov
I guess there's no point in making it a KeyedProcessFunction since it's not going to have access to context, timers, or anything like that. So it can be a simple InputFormat returning a DataSet of key and value tuples.
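The message above only says "a simple InputFormat returning a DataSet of key and value tuples", so the sketch below is a guess at the shape, not the author's actual code. It extends Flink's `GenericInputFormat`; the class name, the byte-array tuple type, and the idea of filling an iterator from savepoint files in `open()` are all assumptions.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Iterator;

import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

// Hypothetical sketch: emits (serialized key, serialized value) pairs read
// from savepoint state files. How the iterator is populated is omitted here.
public class SavepointKeyValueInputFormat extends GenericInputFormat<Tuple2<byte[], byte[]>> {

    private transient Iterator<Tuple2<byte[], byte[]>> entries =
            Collections.emptyIterator(); // would be filled from savepoint files in open()

    @Override
    public boolean reachedEnd() throws IOException {
        return !entries.hasNext();
    }

    @Override
    public Tuple2<byte[], byte[]> nextRecord(Tuple2<byte[], byte[]> reuse) throws IOException {
        return entries.next();
    }
}
```

A DataSet could then be created with `env.createInput(new SavepointKeyValueInputFormat())`, letting Flink parallelize the reads across input splits.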

Re: Extracting state keys for a very large RocksDB savepoint

2021-03-17 Thread Andrey Bulgakov
Hi Gordon, I think my current implementation is very specific and wouldn't be that valuable for the broader public. But I think there's a potential version of it that could also retrieve values from a savepoint in the same efficient way, and that would be something that other people might need. […]

Re: Extracting state keys for a very large RocksDB savepoint

2021-03-14 Thread Tzu-Li (Gordon) Tai
Hi Andrey, Perhaps the functionality you described is worth adding to the State Processor API. Your observation on how the library currently works is correct: basically, it tries to restore the state backends as-is. In your current implementation, do you see it worthwhile to try to add this? Cheers, […]

Re: Extracting state keys for a very large RocksDB savepoint

2021-03-14 Thread Andrey Bulgakov
If anyone is interested, I realized that the State Processor API was not the right tool for this, since it spends a lot of time rebuilding RocksDB tables and then a lot of memory trying to read from them. All I really needed was the operator keys. So I used SavepointLoader.loadSavepointMetadata to get KeyGro[…]
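The metadata-only approach described above can be sketched roughly as follows. Note that `SavepointLoader` lives in an internal package (`org.apache.flink.state.api.runtime`), its return type has changed across Flink versions, and the traversal below is an assumption based on the Flink 1.12-era checkpoint metadata classes, not the author's actual code.

```java
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;
import org.apache.flink.runtime.state.KeyedStateHandle;
import org.apache.flink.state.api.runtime.SavepointLoader;

public class SavepointMetadataWalk {
    public static void main(String[] args) throws Exception {
        // Load only the _metadata file -- no state backend is restored,
        // so this avoids rebuilding RocksDB tables entirely.
        CheckpointMetadata metadata =
                SavepointLoader.loadSavepointMetadata(args[0]); // path to the savepoint

        for (OperatorState operatorState : metadata.getOperatorStates()) {
            for (OperatorSubtaskState subtask : operatorState.getStates()) {
                for (KeyedStateHandle handle : subtask.getManagedKeyedState()) {
                    // Each handle covers a contiguous KeyGroupRange; the state
                    // files it points at can then be read directly in parallel.
                    System.out.println(handle.getKeyGroupRange());
                }
            }
        }
    }
}
```

Since each handle maps to a key-group range, the actual key extraction can be parallelized by handing one handle (or one range) to each reader task.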

Extracting state keys for a very large RocksDB savepoint

2021-03-09 Thread Andrey Bulgakov
Hi all, I'm trying to use the State Processor API to extract all keys from a RocksDB savepoint produced by an operator in a Flink streaming job into CSV files. The problem is that the storage size of the savepoint is 30TB and I'm running into garbage collection issues no matter how much memory […]
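For context, the standard State Processor API route the thread starts from looks roughly like the sketch below (this is the approach that later turned out not to scale for a 30TB savepoint, since it restores the RocksDB backend). The savepoint path, backend path, and operator uid are placeholders, and the exact signatures follow the Flink 1.12-era `org.apache.flink.state.api` classes.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class ExtractKeys {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder paths/uid -- substitute the real savepoint location,
        // backend checkpoint directory, and operator uid.
        ExistingSavepoint savepoint = Savepoint.load(
                env, "s3://bucket/savepoints/savepoint-xxxxxx",
                new RocksDBStateBackend("file:///tmp/rocksdb-checkpoints"));

        DataSet<String> keys = savepoint.readKeyedState(
                "my-operator-uid",
                new KeyedStateReaderFunction<String, String>() {
                    @Override
                    public void readKey(String key, Context ctx, Collector<String> out) {
                        out.collect(key); // emit only the key, ignore state values
                    }
                });

        keys.writeAsText("file:///tmp/keys.csv");
        env.execute("extract-keys");
    }
}
```

The GC pressure described in the message comes from this restore-then-read model: the library rebuilds the keyed state backend before any key can be emitted, which is what motivated the metadata-only workaround discussed later in the thread.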