Got a Spark/Shark cluster up and running recently, and I've been kicking the tires on it. However, I've been wrestling with an issue that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.)

I ran a simple Hive query (select count ...) against a dataset of .tsv files stored in S3, and then ran the same query in Shark for comparison. The Shark query took 3x as long.

After a bit of digging, I found out what was happening: with the Hive query, each map task was reading an input split consisting of 2 entire files from the dataset (approximately 180MB each), while with Shark each task was reading an input split consisting of a 64MB chunk of one of the files (180MB / 64MB works out to roughly 3 chunks per file). That explained the slowdown: since the Shark query had to open each S3 file 3 separate times (and run 3x as many tasks), it made sense that it took much longer.

After much experimentation I was finally able to work around the issue by overriding the value of mapreduce.input.fileinputformat.split.minsize in my hive-site.xml file (bumping it up to 512MB). However, this doesn't feel like the "right" way to solve the issue:
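For reference, the override I added to hive-site.xml looks like this (512MB expressed in bytes):

    <property>
      <name>mapreduce.input.fileinputformat.split.minsize</name>
      <!-- 512MB = 512 * 1024 * 1024 bytes -->
      <value>536870912</value>
    </property>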

a) That parameter is normally set to 1. It doesn't seem right that I should need to override it at all, let alone set it to a value as large as 512MB.

b) We only seem to hit this issue on an existing Hadoop cluster that we've deployed Spark/Shark onto. When we run the same query on a new cluster launched via the Spark EC2 scripts, the number of splits gets calculated correctly, without any need to override that parameter. This leads me to believe we may just have something misconfigured on our existing cluster.

c) This seems like an error-prone way to overcome the issue. 512MB is an arbitrary value, and if I ever run a query against files larger than 512MB, I'll run into the chunking issue all over again. (See the sketch of the split-size calculation below.)
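For anyone trying to help diagnose this, my understanding (paraphrased, so treat it as an assumption rather than the exact Hadoop source) is that FileInputFormat computes the split size roughly like this, which would explain both the 64MB chunks (the s3n filesystem reports a default block size of 64MB via fs.s3n.block.size) and why raising the minsize past the file size forces whole-file splits:

    // Sketch of FileInputFormat's split size logic, where:
    //   minSize   comes from mapreduce.input.fileinputformat.split.minsize
    //   maxSize   comes from mapreduce.input.fileinputformat.split.maxsize
    //   blockSize is the block size the filesystem reports for the file
    //             (for s3n, fs.s3n.block.size, which defaults to 64MB)
    long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }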

So my gut tells me there's a better way to solve this, i.e., somehow configuring Spark so that the input splits it generates don't chunk the input files. Anyone know how to accomplish this, or what I might have misconfigured?
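In the meantime, one slightly less invasive workaround I've been considering is setting the property per-session from the Shark CLI instead of globally in hive-site.xml (assuming Shark honors Hive's SET command the way Hive does; the table name below is just a placeholder):

    SET mapreduce.input.fileinputformat.split.minsize=536870912;
    SELECT COUNT(*) FROM my_table;  -- placeholder table name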

Thanks,

DR