Hi Martin! You can use a broadcast join for that as well. It works just like the regular join, except that you call "joinWithTiny" or "joinWithHuge", depending on whether the data set passed as the argument is the small or the large one.
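A minimal sketch of what that could look like with the Java DataSet API (the paths, tuple types, and key positions below are made up for illustration; `big` stands for the ~5 GB set and `ids` for the ~10k ID list):

```java
// Hypothetical inputs: big = Tuple2<Long, String>, ids = Tuple1<Long>
DataSet<Tuple2<Long, String>> big =
    env.readCsvFile("hdfs:///path/to/big").types(Long.class, String.class);
DataSet<Tuple1<Long>> ids =
    env.readCsvFile("hdfs:///path/to/ids").types(Long.class);

// joinWithTiny hints that the argument (ids) is the small side,
// so the runtime can broadcast it to all parallel join instances.
big.joinWithTiny(ids)
   .where(0)                  // ID field of the big side
   .equalTo(0)                // ID field of the ID list
   .with((line, id) -> line)  // keep only the matching big-side record
   .print();
```

With "joinWithHuge" the roles are flipped: the argument is the large side, and the data set you call the method on gets broadcast.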
The broadcast join internally also broadcasts the small data set (like broadcast sets do), but keeps it in the managed memory, which has the benefit that it can spill to disk if needed.

BTW: in many cases (e.g., if you join a small file with a large one), the Flink optimizer will automatically pick the broadcast join. Currently that depends on whether the client can access HDFS to gather statistics, and on what kinds of functions sit between the data sources and the join (i.e., whether the size estimates are still good or already fuzzy). If you want to find out which strategy is used, have a look at the execution plan (either dump the JSON and put it into the HTML file, or use the web client).

Greetings,
Stephan

On Thu, Apr 9, 2015 at 5:36 PM, Martin Neumann <mneum...@sics.se> wrote:
> Hej,
>
> Up to what sizes are broadcast sets a good idea?
>
> I have a large dataset (~5 GB) and I'm only interested in lines with a
> certain ID that I have in a file. The file has ~10k entries.
> I could either join the dataset with the ID list, or I could broadcast the
> ID list and do the filtering in a mapper.
>
> Which would be the better solution given the data sizes described above?
> Is there a good rule of thumb for when to switch from one solution to the
> other?
>
> cheers Martin
>