Automated setup of a multi-node cluster for Apache Spark and generation of profiling results

2021-04-10 Thread Dhruv Kumar
Hello

I am new to Apache Spark and am looking for guidance or collaboration on my 
Spark project, which has the following main components:

1. Writing scripts for automated setup of a multi-node cluster for Apache Spark 
with the Hadoop Distributed File System (HDFS). This is required because I don’t 
have a fixed set of machines to run my Spark experiments on, so I need an easy, 
quick and automated way to do the entire Spark setup.

2. Writing scripts for simple SQL queries which read input from HDFS, run the 
SQL queries on the multi-node Spark cluster and store the output in HDFS.

3. Generating detailed profiling results, such as latency and shuffled data 
size for every task/operator in the SQL query, and generating graphs of these 
results.
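
For item 3, one possible starting point is Spark's event log (written when 
spark.eventLog.enabled is set), which stores one JSON event per line. The 
following is a minimal sketch, not a finished profiler: it assumes the 
task-end events use the field names shown ("Task Info", "Task Metrics", 
"Shuffle Bytes Written" and so on), which should be checked against the Spark 
version in use. The function name summarize_task_metrics and the sample 
records are hypothetical, and per-operator SQL metrics are not covered.

```python
import json
from collections import defaultdict


def summarize_task_metrics(lines):
    """Aggregate per-stage task latency and shuffle bytes from event-log lines.

    Assumption: each non-empty line is one JSON event, and task-end events
    carry the field names used below. Verify them for your Spark version.
    """
    stages = defaultdict(lambda: {
        "tasks": 0,
        "latency_ms": 0,
        "shuffle_read_bytes": 0,
        "shuffle_write_bytes": 0,
    })
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # only task-end events carry the metrics we want
        info = event.get("Task Info", {})
        metrics = event.get("Task Metrics") or {}  # may be missing for failed tasks
        read = metrics.get("Shuffle Read Metrics", {})
        write = metrics.get("Shuffle Write Metrics", {})

        stage = stages[event["Stage ID"]]
        stage["tasks"] += 1
        # Wall-clock task latency in milliseconds (finish minus launch).
        stage["latency_ms"] += info.get("Finish Time", 0) - info.get("Launch Time", 0)
        stage["shuffle_read_bytes"] += (
            read.get("Remote Bytes Read", 0) + read.get("Local Bytes Read", 0)
        )
        stage["shuffle_write_bytes"] += write.get("Shuffle Bytes Written", 0)
    return dict(stages)


if __name__ == "__main__":
    # Hypothetical, hand-written sample events standing in for a real log file.
    sample = [
        json.dumps({
            "Event": "SparkListenerTaskEnd", "Stage ID": 0,
            "Task Info": {"Launch Time": 1000, "Finish Time": 1500},
            "Task Metrics": {"Shuffle Write Metrics": {"Shuffle Bytes Written": 2048}},
        }),
        json.dumps({"Event": "SparkListenerJobEnd", "Job ID": 0}),
    ]
    for stage_id, totals in sorted(summarize_task_metrics(sample).items()):
        print(stage_id, totals)
```

With a real log, the same function can be fed an open file handle, and the 
per-stage totals can then be passed to any plotting library to produce the 
graphs.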

Happy to discuss in more detail.

Thanks
Dhruv
dh...@umn.edu

--
Dhruv Kumar
PhD Candidate
Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me <http://dhruvkumar.me/>

Benchmarks for Many-to-Many Joins

2021-04-21 Thread Dhruv Kumar
Hi

I wanted to ask if anyone knows of any datasets or benchmarks that I can use 
for evaluating many-to-many joins (as depicted in the attached snapshot). I 
looked at the TPC-H <http://tpc.org/tpch/> and TPC-DS 
<http://www.tpc.org/tpcds/> benchmarks but, surprisingly, they mostly have 
one-to-many joins, so they were not much help.
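
To make the join shape concrete: in a many-to-many join, a key can repeat on 
both sides, so each key contributes (left count) x (right count) output rows. 
As a fallback when no benchmark fits, synthetic inputs with this property can 
be generated. The sketch below is only an illustration under that assumption; 
the function names make_many_to_many and expected_join_size and all 
parameters are hypothetical, and it is not a substitute for a real benchmark.

```python
import random
from collections import Counter


def make_many_to_many(num_keys, max_left_dup, max_right_dup, seed=0):
    """Build two lists of (key, payload) rows where keys repeat on both sides.

    Every key appears between 1 and max_*_dup times on each side, so with
    both limits above 1 the join on `key` is many-to-many.
    """
    rng = random.Random(seed)  # seeded for reproducible experiments
    left, right = [], []
    for key in range(num_keys):
        for i in range(rng.randint(1, max_left_dup)):
            left.append((key, f"L{key}_{i}"))
        for j in range(rng.randint(1, max_right_dup)):
            right.append((key, f"R{key}_{j}"))
    return left, right


def expected_join_size(left, right):
    """Number of rows an inner equi-join on the key would produce."""
    left_counts = Counter(key for key, _ in left)
    right_counts = Counter(key for key, _ in right)
    # Each key contributes (left occurrences) * (right occurrences) rows.
    return sum(left_counts[key] * right_counts[key] for key in left_counts)


if __name__ == "__main__":
    left_rows, right_rows = make_many_to_many(
        num_keys=1000, max_left_dup=5, max_right_dup=5, seed=42
    )
    print(len(left_rows), len(right_rows), expected_join_size(left_rows, right_rows))
```

The two row lists can be written out as CSV files and loaded as tables; 
raising the duplication limits increases the output blow-up of the join.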

Thanks
Dhruv


Unsubscribe

2021-05-18 Thread Dhruv Kumar
Unsubscribe
