Hi,

I ran the same version of a program with two different types of input
containing equivalent information.
Program 1: 10,000 files, each with on average 50 IDs, one per line
Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line

My program takes the input, creates key/value pairs from it, and performs
about 7 more steps. I have compared the sets of key/value pairs after the
initialization phase, and they are equivalent. All other steps are
identical. The only differences are using wholeTextFiles versus textFile
and the initialization input map function itself.
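For reference, the two read paths look roughly like this (a minimal sketch against the Spark 1.x Java API; parseIds is a hypothetical stand-in for my real pair-creation logic, and the paths are made up):

```java
// Sketch only -- parseIds is a hypothetical helper returning a
// List<Tuple2<String, Long>> of key/value pairs; paths are placeholders.

// Program 1: many small files. wholeTextFiles yields
// (filename, fileContents) pairs, typically one partition per file
// for files smaller than an HDFS block.
JavaPairRDD<String, String> files =
    sc.wholeTextFiles("hdfs:///input/many-small-files");
JavaPairRDD<String, Long> pairs1 =
    files.flatMapToPair(kv -> parseIds(kv._2()));

// Program 2: one large file. textFile splits on HDFS block
// boundaries, so a file of a few MB becomes only a few partitions.
JavaRDD<String> lines = sc.textFile("hdfs:///input/one-big-file.txt");
JavaPairRDD<String, Long> pairs2 =
    lines.flatMapToPair(line -> parseIds(line));

// Both programs then continue identically, e.g.:
JavaPairRDD<String, Long> reduced = pairs1.reduceByKey(Long::sum);
```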

Since I am reading from HDFS, where the minimum split size is 64 MB, every
file in program 1 occupies a full 64 MB split, even though the files are
only KBs in size. Because of this, I expected program 2 to be faster than,
or at least equivalent to, program 1, as the difference in the
initialization phase should be negligible.

In my first comparison, I provided the program with sufficient memory, and
they both take around 2 minutes. No surprises here.

In my second comparison, I limit the memory to 4 GB. Program 1 executes in
a little over 4 minutes, but program 2 takes over 15 minutes (at which
point I terminated it; it was making progress, but spilling massively in
every stage). The difference between the two runs is the amount of
spilling in later stages of the program (*not* in the initialization
phase). Program 1 spills 2 chunks per stage:
14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory
map of 327 MB to disk (1 time so far)
14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 242 spilling in-memory
map of 328 MB to disk (1 time so far)
Program 2 also spills these 2 chunks:
14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory
map of 320 MB to disk (1 time so far)
14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory
map of 323 MB to disk (1 time so far)
But then spills 2 * ~15,000 chunks of <1 MB each:
14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory
map of 0 MB to disk (15561 times so far)
14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory
map of 0 MB to disk (13866 times so far)

I understand that RDDs with narrow dependencies on their parents are
pipelined together into a single stage, and that all these stages
ultimately trace back to the parent RDD created during initialization.
This parent RDD will have different partitioning and different data
placement between the two programs. Because of this, I did expect a
difference in this first stage, but since we go from JavaRDD to
JavaPairRDD and do a reduceByKey right after initialization, I expected
the difference to disappear in the subsequent stages.
Can anyone explain this behavior, or point me in a direction?

Thanks in advance!




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Impact-of-input-format-on-timing-tp8655.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
