Hi Kaan, as far as I know there is no (easy) way to switch from the streaming API back to the batch API while retaining all data in memory (correct me if I misunderstood your setup).
However, there are a few things in your description that I don't quite understand. Why can't you dump the data to a file? Do you really have more main memory than disk space? Or is there no shared storage between your generating cluster and the Flink cluster? It almost sounds as if the issue at heart is finding a good serialization format for storing the edges. The 70 billion edges could be stored as an array of id pairs; with 64-bit ids that is 70 billion × 2 × 8 bytes ≈ 1.1 TB of uncompressed data in Avro (or any other binary serialization format). That's not much by today's standards and could also easily be offloaded to S3.

Alternatively, if graph generation is rather cheap, you could also try to incorporate it directly into the analysis job. (Rough sketches of both ideas are appended at the end of this message, below the quoted mail.)

On Wed, Apr 22, 2020 at 2:58 AM Kaan Sancak <kaans...@gmail.com> wrote:

> Hi,
>
> I have been running some experiments on large graph data; the smallest graph
> I have been using has around ~70 billion edges. I have a graph generator
> which generates the graph in parallel and feeds it to the running system.
> However, it takes a lot of time to read the edges, because even though the
> graph generation process is parallel, in Flink I can only listen from the
> master node (correct me if I am wrong). Another option is dumping the
> generated data to a file and reading it with readCsvFile, however this is
> not feasible in terms of storage management.
>
> What I want to do is invoke my graph generator using ipc/tcp protocols and
> read the generated data from the sockets. Since the graph data is also
> generated in parallel on each node, I want to make use of ipc and read the
> data in parallel at each node. I did some online digging but couldn't find
> something similar using the DataSet API. I would be glad if you have some
> similar use cases or examples.
>
> Is it possible to use the streaming environment to create the data in
> parallel and switch to the DataSet API?
>
> Thanks in advance!
>
> Best
> Kaan

--
Arvid Heise | Senior Java Developer, Ververica
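To make the size estimate concrete, here is a rough plain-Java sketch (the EdgeFileSketch class and the edges.bin path are just placeholders for illustration): each edge is written as a raw pair of 64-bit ids, i.e. 16 bytes per edge, so 70 billion edges come to roughly 1.1 TB before compression. Avro's variable-length encoding of longs would typically end up somewhat smaller.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/** Back-of-the-envelope sketch: edges stored as raw pairs of 64-bit ids. */
public class EdgeFileSketch {

    // Writes each edge as two 8-byte longs, i.e. 16 bytes per edge.
    static void writeEdges(String path, long[][] edges) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            for (long[] edge : edges) {
                out.writeLong(edge[0]); // source id
                out.writeLong(edge[1]); // target id
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long edgeCount = 70_000_000_000L;
        long bytes = edgeCount * 2 * Long.BYTES; // 2 ids * 8 bytes each
        System.out.printf("%d edges -> %.2f TB uncompressed%n", edgeCount, bytes / 1e12);

        // Tiny demo file with two edges, just to show the format.
        writeEdges("edges.bin", new long[][] {{0L, 1L}, {1L, 2L}});
    }
}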
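To illustrate the last point, here is a minimal sketch of generating the edges inside the job with the DataSet API, assuming a custom GenericInputFormat (the GeneratedEdgeInput class and its placeholder id arithmetic are made up; in nextRecord you would call into your real generator for the partition given by the split number). Each parallel subtask produces only its own share of the edges, so nothing needs to be funneled through the master node or staged in files first.

import org.apache.flink.api.common.io.GenericInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.io.GenericInputSplit;

import java.io.IOException;

/**
 * Sketch of an edge source that runs inside the Flink job: every parallel
 * instance generates its own split of the edges (placeholder logic below,
 * to be replaced by the real generator).
 */
public class GeneratedEdgeInput extends GenericInputFormat<Tuple2<Long, Long>> {

    private final long totalEdges;
    private long edgesPerSplit;
    private long emitted;
    private int splitNumber;

    public GeneratedEdgeInput(long totalEdges) {
        this.totalEdges = totalEdges;
    }

    @Override
    public void open(GenericInputSplit split) throws IOException {
        super.open(split);
        this.splitNumber = split.getSplitNumber();
        this.edgesPerSplit = totalEdges / split.getTotalNumberOfSplits();
        this.emitted = 0;
    }

    @Override
    public boolean reachedEnd() {
        return emitted >= edgesPerSplit;
    }

    @Override
    public Tuple2<Long, Long> nextRecord(Tuple2<Long, Long> reuse) {
        // Placeholder: replace with a call into the actual generator for
        // the partition identified by splitNumber.
        long source = splitNumber * edgesPerSplit + emitted;
        reuse.f0 = source;
        reuse.f1 = (source * 31 + 7) % totalEdges; // arbitrary target id
        emitted++;
        return reuse;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<Long, Long>> edges =
                env.createInput(new GeneratedEdgeInput(70_000_000_000L));
        // Feed 'edges' into Gelly / the rest of the analysis job; count() is
        // only here to make the sketch executable on its own.
        System.out.println(edges.count());
    }
}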