Got a Spark/Shark cluster up and running recently, and I've been kicking the tires on it. However, I've been wrestling with an issue that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.)

I ran a simple Hive query (select count ...) against a dataset of .tsv files stored in S3, and then ran the same query in Shark for comparison. The Shark query took 3x as long.

After a bit of digging, I found out what was happening: with the Hive query, each map task was reading an input split consisting of 2 entire files from the dataset (approximately 180MB each), while with Shark each task was reading an input split consisting of a 64MB chunk of one of the files (180MB / 64MB works out to roughly 3 chunks per file). That explained the slowdown: since the Shark query had to open each S3 file 3 separate times (and run 3x as many tasks), it made sense that it took much longer.

After much experimentation I was finally able to work around the issue by overriding the value of mapreduce.input.fileinputformat.split.minsize in my hive-site.xml file (bumping it up to 512MB). However, this doesn't feel like the "right" way to solve the issue:
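For reference, the override I added to hive-site.xml looks like this (512MB expressed in bytes):

    <property>
      <name>mapreduce.input.fileinputformat.split.minsize</name>
      <!-- 512MB = 512 * 1024 * 1024 bytes -->
      <value>536870912</value>
    </property>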

a) That parameter is normally set to 1. It doesn't seem right that I should need to override it at all, let alone set it to a value as large as 512MB.

b) We only seem to hit this issue on an existing Hadoop cluster that we've deployed Spark/Shark onto. When we run the same query on a new cluster launched via the Spark EC2 scripts, the number of splits gets calculated correctly, without any need to override that parameter. This leads me to believe we may just have something misconfigured on our existing cluster.

c) This seems like an error-prone way to overcome the issue. 512MB is an arbitrary value, and if I ever run a query against files larger than 512MB, I'll run into the chunking issue all over again. (See the sketch of the split-size calculation below.)
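For anyone trying to help diagnose this, my understanding (paraphrased, so treat it as an assumption rather than the exact Hadoop source) is that FileInputFormat computes the split size roughly like this, which would explain both the 64MB chunks (the s3n filesystem reports a default block size of 64MB via fs.s3n.block.size) and why raising the minsize past the file size forces whole-file splits:

    // Sketch of FileInputFormat's split size logic, where:
    //   minSize   comes from mapreduce.input.fileinputformat.split.minsize
    //   maxSize   comes from mapreduce.input.fileinputformat.split.maxsize
    //   blockSize is the block size the filesystem reports for the file
    //             (for s3n, fs.s3n.block.size, which defaults to 64MB)
    long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }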

So my gut tells me there's a better way to solve this, i.e., somehow configuring Spark so that the input splits it generates don't chunk the input files. Anyone know how to accomplish this, or what I might have misconfigured?
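In the meantime, one slightly less invasive workaround I've been considering is setting the property per-session from the Shark CLI instead of globally in hive-site.xml (assuming Shark honors Hive's SET command the way Hive does; the table name below is just a placeholder):

    SET mapreduce.input.fileinputformat.split.minsize=536870912;
    SELECT COUNT(*) FROM my_table;  -- placeholder table name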

Thanks,

DR