I am new to Spark, and to this user community, so my apologies if this was answered elsewhere and I missed it (I did try searching first).
We have multiple large RDDs stored in HDFS via Spark (by calling pairRDD.saveAsNewAPIHadoopFile()). One thing we need to do is re-load a given RDD (by calling sc.newAPIHadoopFile()) and return that data to our application. To manage the data returned, we call rdd.toLocalIterator() followed by successive rddIt.hasNext() and rddIt.next() calls to return the data to our application one row at a time. This works pretty well.

But we are observing that the first rddIt.hasNext() or rddIt.next() invocation blocks until the entire RDD has been read from HDFS, which can cause a considerable delay for a larger RDD. Within our application we may end up iterating over only the first few hundred rows of data, so it is more important to us to get those initial rows back as quickly as possible than to wait for the entire RDD to load from HDFS. Waiting for just the first partition (or the first few partitions) to finish loading before data starts coming back would be fine.

The only solution I could come up with is to save each of our large RDDs as a collection of smaller sub-RDDs, and then load those sub-RDDs from HDFS sequentially into our application. But that seems silly. Is there any approach in Spark that can start returning data from a large RDD before it is completely loaded from HDFS?

Thanks in advance...

- Chris

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incremental-load-of-RDD-from-HDFS-tp25145.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
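P.S. To make the behavior I'm hoping for concrete, here is a plain-Scala sketch (no Spark involved; mkLoader and the fake row contents are hypothetical stand-ins for whatever would read one HDFS split). It shows the partition-at-a-time iteration I want: because Scala's Iterator.flatMap is lazy, the first take()/next() forces only the first partition loader, not all of them. Whether something on the Spark side (e.g. sc.runJob over an explicit Seq of partition indices) can achieve the same effect is exactly what I'm asking about.

```scala
// Hedged sketch, not Spark itself: mkLoader is a hypothetical stand-in for
// reading one HDFS split. partitionsForced counts how many splits were read.
var partitionsForced = 0

def mkLoader(p: Int): () => Iterator[Int] =
  () => {
    partitionsForced += 1                      // this split has now been "loaded"
    Iterator.tabulate(100)(i => p * 100 + i)   // fake rows for partition p
  }

// Chain per-partition loaders lazily into one row iterator; nothing is
// loaded until the consumer actually pulls rows.
def lazyRows[A](loaders: Seq[() => Iterator[A]]): Iterator[A] =
  loaders.iterator.flatMap(load => load())

val rowIt    = lazyRows(Seq.tabulate(3)(mkLoader))
val firstFew = rowIt.take(5).toList  // forces only partition 0
```

After taking the first five rows, partitionsForced is still 1, i.e. only the first of the three "partitions" was ever read.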
