I am new to Spark, and to this user community, so my apologies if this was answered elsewhere and I missed it (I did try searching first).
We have multiple large RDDs stored in HDFS via Spark (by calling pairRDD.saveAsNewAPIHadoopFile()). One thing we need to do is re-load a given RDD (by calling sc.newAPIHadoopFile()) and return that data to our application. To manage the data returned, we call rdd.toLocalIterator() followed by successive rddIt.hasNext() and rddIt.next() calls to return the data to our application one row at a time. This works pretty well.

But we are observing that the first rddIt.hasNext() or rddIt.next() invocation blocks until the entire RDD has been read from HDFS, which can cause a considerable delay for a larger RDD. Within our application we may end up iterating over only the first few hundred rows of data, so it is more important to us to get those initial rows back as quickly as possible than to wait for the entire RDD to load from HDFS. Waiting for just the first partition (or the first few partitions) to finish loading before data starts coming back would be fine.

The only solution I could come up with is to save each of our large RDDs as a collection of smaller sub-RDDs, and then load those sub-RDDs from HDFS sequentially into our application. But that seems silly. Is there any approach in Spark that can start returning data from a large RDD before it is completely loaded from HDFS?

Thanks in advance...

- Chris

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incremental-load-of-RDD-from-HDFS-tp25145.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
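P.S. To make the behavior I'm hoping for concrete, here is a plain-Scala sketch (no Spark involved; mkLoader and the fake row contents are hypothetical stand-ins for whatever would read one HDFS split). It shows the partition-at-a-time iteration I want: because Scala's Iterator.flatMap is lazy, the first take()/next() forces only the first partition loader, not all of them. Whether something on the Spark side (e.g. sc.runJob over an explicit Seq of partition indices) can achieve the same effect is exactly what I'm asking about.

```scala
// Hedged sketch, not Spark itself: mkLoader is a hypothetical stand-in for
// reading one HDFS split. partitionsForced counts how many splits were read.
var partitionsForced = 0

def mkLoader(p: Int): () => Iterator[Int] =
  () => {
    partitionsForced += 1                      // this split has now been "loaded"
    Iterator.tabulate(100)(i => p * 100 + i)   // fake rows for partition p
  }

// Chain per-partition loaders lazily into one row iterator; nothing is
// loaded until the consumer actually pulls rows.
def lazyRows[A](loaders: Seq[() => Iterator[A]]): Iterator[A] =
  loaders.iterator.flatMap(load => load())

val rowIt    = lazyRows(Seq.tabulate(3)(mkLoader))
val firstFew = rowIt.take(5).toList  // forces only partition 0
```

After taking the first five rows, partitionsForced is still 1, i.e. only the first of the three "partitions" was ever read.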
