Hello,
I am new to Apache Spark and am looking for some close guidance or
collaboration on my Spark project, which has the following main components:
1. Writing scripts for the automated setup of a multi-node Apache Spark
cluster with the Hadoop Distributed File System (HDFS). This is required
since I don't have a fixed set of machines for my Spark experiments and
hence need an easy, quick, and automated way to perform the entire setup
(see the first sketch after this list).
2. Writing scripts for simple SQL queries that read input from HDFS, run the
queries on the multi-node Spark cluster, and store the output back in HDFS
(see the second sketch below).
3. Generating detailed profiling results, such as latency and shuffled data
size, for every task/operator in a SQL query, and generating graphs of these
metrics (see the third sketch below).
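
To give an idea of what I mean by the setup scripts, here is a minimal
sketch in Python. It assumes Spark 3.x and Hadoop 3.x (so the worker lists
live in conf/workers and etc/hadoop/workers), passwordless SSH between the
nodes, SPARK_HOME and HADOOP_HOME set, and placeholder hostnames:

    # Minimal sketch: assumes Spark 3.x / Hadoop 3.x, passwordless SSH,
    # and that SPARK_HOME and HADOOP_HOME are set on all nodes.
    import os
    import subprocess

    workers = ["node1", "node2", "node3"]  # placeholder worker hostnames

    # List the worker hosts for both Hadoop and Spark.
    for path in (os.path.expandvars("$HADOOP_HOME/etc/hadoop/workers"),
                 os.path.expandvars("$SPARK_HOME/conf/workers")):
        with open(path, "w") as f:
            f.write("\n".join(workers) + "\n")

    # Format HDFS once, then bring up HDFS and the Spark standalone daemons.
    subprocess.run(["hdfs", "namenode", "-format", "-nonInteractive"], check=True)
    subprocess.run([os.path.expandvars("$HADOOP_HOME/sbin/start-dfs.sh")], check=True)
    subprocess.run([os.path.expandvars("$SPARK_HOME/sbin/start-master.sh")], check=True)
    subprocess.run([os.path.expandvars("$SPARK_HOME/sbin/start-workers.sh")], check=True)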
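For the second component, this is a PySpark sketch of the read-query-write
flow I am describing; the HDFS paths, table name, and query are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-sql-example").getOrCreate()

    # Read input from HDFS (placeholder path) and register it as a SQL table.
    df = spark.read.parquet("hdfs:///user/dhruv/input/")
    df.createOrReplaceTempView("events")

    # Run a simple SQL query on the multi-node cluster.
    result = spark.sql("SELECT key, COUNT(*) AS cnt FROM events GROUP BY key")

    # Store the output back in HDFS (placeholder path).
    result.write.mode("overwrite").parquet("hdfs:///user/dhruv/output/")

    spark.stop()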
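For the third component, my current idea is to pull per-stage metrics (task
time and shuffle read/write sizes) from Spark's monitoring REST API while
the application runs; the driver host below is a placeholder and 4040 is
the default UI port:

    import requests

    base = "http://driver-host:4040/api/v1"  # placeholder driver host

    # Look up the running application's id.
    app_id = requests.get(f"{base}/applications").json()[0]["id"]

    # Per-stage metrics: total task time and shuffled data sizes.
    for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
        print(stage["stageId"], stage["name"],
              stage["executorRunTime"],   # ms spent in this stage's tasks
              stage["shuffleReadBytes"],
              stage["shuffleWriteBytes"])

These per-stage numbers could then be plotted (e.g. with matplotlib) to get
the graphs mentioned above; per-task detail is available from the same API.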
Happy to discuss in more detail.
Thanks,
Dhruv
dh...@umn.edu
--
Dhruv Kumar
PhD Candidate
Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me