What kind of benchmark do you need? Do you want to benchmark many-to-many 
joins in Spark specifically, or some other aspect of Spark or the cluster 
(such as network or disk)?
If you only need a many-to-many join, you can use a cross join, or 
repartition the data by a different key. Both operations run in a 
many-to-many manner across the Spark cluster.
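
If no ready-made dataset fits, synthetic inputs are one option. Below is a 
rough sketch in plain Python (not Spark code) of two tables that repeat each 
key on both sides, so an equi-join on the key is many-to-many. The names 
make_table, join_size and hash_join, and the row and key counts, are made up 
for illustration. With a single distinct key the join degenerates into a 
cross join.

```python
import random
from collections import Counter


def make_table(num_rows, num_keys, seed):
    """Return (key, payload) rows; keys repeat when num_keys < num_rows."""
    rng = random.Random(seed)
    return [(rng.randrange(num_keys), i) for i in range(num_rows)]


def join_size(left, right):
    """Row count an inner equi-join on the key would produce."""
    left_counts = Counter(k for k, _ in left)
    right_counts = Counter(k for k, _ in right)
    # Each key contributes (left matches) * (right matches) output rows.
    return sum(c * right_counts[k] for k, c in left_counts.items())


def hash_join(left, right):
    """Naive in-memory hash join, only to check join_size on small inputs."""
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


# Hypothetical sizes: 10 distinct keys, so every key repeats on both sides.
left = make_table(1000, 10, seed=1)
right = make_table(500, 10, seed=2)
print("many-to-many join rows:", join_size(left, right))

# One distinct key: every left row matches every right row (a cross join).
cross_left = make_table(1000, 1, seed=1)
cross_right = make_table(500, 1, seed=2)
print("cross join rows:", join_size(cross_left, cross_right))
```

Lowering num_keys raises the fan-out per key, which is the knob to turn if 
you want to stress the join itself rather than the scan.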

> On Ordibehesht 1, 1400 AP, at 21:25, Dhruv Kumar <dh...@umn.edu.INVALID> 
> wrote:
> 
> Hi
> 
> I wanted to ask if anyone knows any datasets or benchmarks which I can use 
> for evaluating many-to-many joins (as depicted in the attached snapshot). I 
> looked at TPC-H <http://tpc.org/tpch/> and TPC-DS <http://www.tpc.org/tpcds/> 
> benchmarks but surprisingly, they mostly have one-to-many joins and I could 
> not get much help there.
> 
> 
> <PastedGraphic-1.png>
> 
> 
> Thanks
> Dhruv
> 
> --------------------------------------------------
> Dhruv Kumar
> PhD Candidate
> Computer Science and Engineering
> University of Minnesota
> www.dhruvkumar.me <http://dhruvkumar.me/>
> 
> 
