Hi Martin,

The answer to your question depends on the degree of parallelism (DOP) at which you will run the job and on the expected selectivity (the fraction of lines with a matching ID), in case the selectivity does not depend on "the other side" and the data can be pre-filtered prior to broadcasting.
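To make the DOP dependence concrete, here is a rough back-of-the-envelope sketch. It is a simplified model for illustration, not Flink's actual cost model: it assumes a broadcast ships a full copy of the small side to every parallel task, while a re-partition ships each side once. The function name `shipped_bytes` and the 16 bytes per ID are assumptions made up for this example; only the ~5 GB and ~10 k figures come from your mail.

```python
def shipped_bytes(small_bytes, large_bytes, dop):
    """Rough data volume moved by each shipping strategy (simplified model)."""
    # Broadcast: every one of the `dop` parallel tasks gets a full copy
    # of the small side; the large side stays where it is.
    broadcast = small_bytes * dop
    # Re-partition: both sides are hash-partitioned on the key, so each
    # record is shipped once.
    repartition = small_bytes + large_bytes
    return broadcast, repartition


# Assumed sizes: ~10 k IDs at a hypothetical 16 bytes each, and a ~5 GB dataset.
ID_LIST_BYTES = 10_000 * 16
DATASET_BYTES = 5 * 10**9

for dop in (4, 64, 1024):
    broadcast, repartition = shipped_bytes(ID_LIST_BYTES, DATASET_BYTES, dop)
    cheaper = "broadcast" if broadcast < repartition else "re-partition"
    print(f"DOP={dop}: broadcast={broadcast} B, re-partition={repartition} B -> {cheaper}")
```

Under these assumptions the broadcast side stays far cheaper at any realistic DOP. In this model the balance only tips when the broadcast side is much larger or the DOP is very high.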
However, since Flink's optimizer chooses the actual data shipping strategy for the join (broadcast or re-partition) based on the estimated input sizes, I think sticking with the join is the more conservative and, in the long term, more flexible solution in your case.

To summarize: if you want to do a join, I would always stick with the Flink join primitive. Flink differs from other systems in that it hides multiple strategies behind one abstract primitive and decides which one to use in a cost-based way: a "map-side" join (which in other systems is done manually with a broadcast variable) and a "reduce-side" join (also known as a re-partition join).

PS: I think this kind of question is better suited for the user list.

2015-04-09 17:36 GMT+02:00 Martin Neumann <mneum...@sics.se>:

> Hej,
>
> Up to what sizes are broadcast sets a good idea?
>
> I have large dataset (~5 GB) and I'm only interested in lines with a
> certain ID that I have in a file. The file has ~10 k entries.
> I could either Join the dataset with the IDList or I could broadcast the
> ID list and do the filtering in a Mapper.
>
> What would be the better solution given the data sizes described above?
> Is there a good rule of thumb when to switch from one solution to the
> other?
>
> cheers Martin