I could be misreading the code, but looking at the implementation of toLocalIterator
(copied below), it lazily calls runJob on each partition of your input. It
shouldn't be reading the entire RDD before returning from the first next()
call. If the first next() call is taking a long time, it means either your first
partition is very large or the runJob overhead is very large. If it is due to
partition size, perhaps you can repartition your data into smaller blocks (more
partitions).
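
For example, something along these lines (an untested sketch; the partition
count of 1000 is an arbitrary placeholder you would tune to your data size):

    // Untested sketch: split the RDD into more, smaller partitions so the
    // first next() only has to wait for one small runJob.
    val repartitioned = rdd.repartition(1000)  // 1000 is arbitrary; tune it
    val it = repartitioned.toLocalIterator
    // Each partition is fetched on demand as the iterator advances,
    // so taking the first few hundred rows should return quickly.
    it.take(100).foreach(println)

Note that repartition itself triggers a shuffle, so there is a trade-off; if
you control the write side, saving with more partitions in the first place
avoids that cost.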

def toLocalIterator: Iterator[T] = withScope {
  def collectPartition(p: Int): Array[T] = {
    sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p), allowLocal = false).head
  }
  (0 until partitions.length).iterator.flatMap(i => collectPartition(i))
}
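
The laziness comes from the .iterator before flatMap: flatMap over a Scala
Iterator only invokes collectPartition when the consumer actually reaches that
partition. A quick stand-alone illustration of the same pattern (plain Scala,
no Spark; fetch here is a stand-in for collectPartition):

    // fetch(p) stands in for collectPartition(p) and is only called
    // as the iterator advances past each "partition".
    def fetch(p: Int): Array[Int] = { println(s"fetching $p"); Array(p, p) }

    val it = (0 until 3).iterator.flatMap(i => fetch(i))
    it.next()  // prints "fetching 0" only; partitions 1 and 2 are untouched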

--
Ali

On Oct 20, 2015, at 9:23 AM, Chris Spagnoli <c...@clouddatavisualizer.com> 
wrote:

> I am new to Spark, and this user community, so my apologies if this was
> answered elsewhere and I missed it (I did try search first).
> 
> We have multiple large RDDs stored across a HDFS via Spark (by calling
> pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a
> given RDD (by calling sc.newAPIHadoopFile()) and return that data to our
> application.  In order to manage the data returned, we are calling
> rdd.toLocalIterator() followed by successive rddIt.hasNext() and
> rddIt.next() calls to return the data to our application one row at a time. 
> This works pretty well.
> 
> But we are observing that the first rddIt.hasNext() or rddIt.next()
> invocation will block until the entire RDD is read from HDFS, and this can
> cause a considerable delay for a larger RDD.  Within our application we may
> end up only iterating over the first few hundred rows of data, so it is more
> important to us to be able to get those initial rows back as quickly as
> possible, rather than having to wait for the entire RDD to be loaded from
> HDFS.  Waiting for only the first partition to complete loading before
> starting to get data back would be fine, or even the first few partitions.
> 
> The only solution I could come up with would be to save each of our large RDDs
> as a collection of smaller sub-RDDs, and then load those sub-RDDs from HDFS
> sequentially into our application.  But that seems silly.
> 
> Is there any approach using Spark which can start returning data from a
> large RDD before it is completely loaded from HDFS?
> 
> Thanks in advance...
> 
> - Chris
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Incremental-load-of-RDD-from-HDFS-tp25145.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

