Hi,

I have been running some experiments on large graph data; the smallest graph I
have been using has around 70 billion edges. I have a graph generator that
generates the graph in parallel and feeds it to the running system. However,
reading the edges takes a lot of time: even though the graph generation
process is parallel, in Flink I can only listen from the master node (correct
me if I am wrong). Another option is to dump the generated data to a file and
read it with readFromCsv, but this is not feasible in terms of storage
management.

What I want to do is invoke my graph generator over IPC/TCP and read the
generated data from sockets. Since the graph data is generated in parallel on
each node, I want to make use of IPC and read the data in parallel on each
node as well. I did some digging online but couldn't find anything similar
that uses the DataSet API. I would be glad to hear about any similar use cases
or examples.
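
To make the idea concrete, here is a minimal plain-Java sketch of what I mean
by one node reading its own generator over a local socket. Everything in it is
an assumption for illustration only: the class and method names are
hypothetical, the "src,dst" line format is made up, and a thread stands in for
my real generator process. It uses only java.net and no Flink classes; in an
actual job the reading part would have to run inside each parallel source
instance, which is exactly the part I don't know how to do with the DataSet
API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one worker reads edges from a generator on the same node.
public class LocalEdgeSocketSketch {

    // Stand-in for the real generator (assumption): accepts one client and
    // writes numEdges lines of the form "src,dst", then closes the connection.
    static Thread startGenerator(ServerSocket server, int workerId, int numEdges) {
        Thread t = new Thread(() -> {
            try (Socket client = server.accept();
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), false)) {
                for (int i = 0; i < numEdges; i++) {
                    out.println(workerId + "," + i);
                }
                out.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        t.start();
        return t;
    }

    // Reader side: connects to the local generator and parses edges until the
    // generator closes the stream. A real job would emit each edge downstream
    // instead of collecting them in a list.
    static List<long[]> readEdges(int port) throws IOException {
        List<long[]> edges = new ArrayList<>();
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                edges.add(new long[] {Long.parseLong(parts[0]), Long.parseLong(parts[1])});
            }
        }
        return edges;
    }

    // One "node": start the local generator on a free loopback port, then read from it.
    static List<long[]> runWorker(int workerId, int numEdges) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread generator = startGenerator(server, workerId, numEdges);
            List<long[]> edges = readEdges(server.getLocalPort());
            generator.join();
            return edges;
        }
    }

    public static void main(String[] args) throws Exception {
        List<long[]> edges = runWorker(3, 5);
        System.out.println("worker 3 read " + edges.size() + " edges");
    }
}
```

Binding to the loopback address keeps the traffic on the node where the data
is generated, so each node would read only its own share of the edges.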

Is it possible to use the streaming environment to create the data in parallel
and then switch to the DataSet API?

Thanks in advance!

Best,
Kaan
