Why don't you write your own Hadoop FileInputFormat? It can be used by Spark...
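Roughly, the skeleton could look something like the sketch below. ImageInputFormat and ImageRecordReader are made-up names, and this placeholder reader just emits one (file name, raw bytes) record per file; a real one would walk your file headers and sub-headers and emit one record per tile.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Placeholder reader: emits one (file name, raw bytes) record per file.
// Replace the body of nextKeyValue() with your header/sub-header parsing
// so that it emits one record per image tile instead.
class ImageRecordReader extends RecordReader[Text, BytesWritable] {
  private var split: FileSplit = _
  private var conf: org.apache.hadoop.conf.Configuration = _
  private var key: Text = _
  private var value: BytesWritable = _
  private var done = false

  override def initialize(s: InputSplit, context: TaskAttemptContext): Unit = {
    split = s.asInstanceOf[FileSplit]
    conf = context.getConfiguration
  }

  override def nextKeyValue(): Boolean = {
    if (done) return false
    val path = split.getPath
    val in = path.getFileSystem(conf).open(path)
    try {
      val bytes = new Array[Byte](split.getLength.toInt) // assumes the file fits in memory
      in.readFully(0, bytes)
      key = new Text(path.getName)
      value = new BytesWritable(bytes)
    } finally in.close()
    done = true
    true
  }

  override def getCurrentKey: Text = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (done) 1.0f else 0.0f
  override def close(): Unit = ()
}

class ImageInputFormat extends FileInputFormat[Text, BytesWritable] {
  // One split per file: the internal structure is handled by the reader.
  override def isSplitable(context: JobContext, file: Path): Boolean = false
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new ImageRecordReader
}

// Reading it from Spark (sc is your existing SparkContext):
val tiles = sc.newAPIHadoopFile(
  "hdfs:///path/to/images",   // placeholder path
  classOf[ImageInputFormat],
  classOf[Text],
  classOf[BytesWritable])

One caveat from the Spark docs: Hadoop RecordReaders reuse the same Writable object for every record, so map the values to plain types (e.g. value.copyBytes()) before caching or collecting the RDD.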
> On 28 Jul 2016, at 20:04, jtgenesis <jtgene...@gmail.com> wrote:
>
> Hey all,
>
> I was wondering what the best course of action is for processing an image
> that has an involved internal structure (file headers, sub-headers, image
> data, more sub-headers, more kinds of data, etc.). I was hoping to get some
> insight on the approach I'm using and whether there is a better, more Spark
> way of handling it.
>
> I'm coming from a Hadoop approach where I convert the image to a sequence
> file. Now, I'm new to both Spark and Hadoop, but I have a deeper
> understanding of Hadoop, which is why I went with the sequence files. The
> sequence file is chopped into key/value pairs that contain file and image
> metadata and separate key/value pairs that contain the raw image data. I
> currently use a LongWritable for the key and a BytesWritable for the value.
> This is a naive approach, but I plan to create a custom Writable key type
> that contains pertinent information about the corresponding image data. The
> idea is to create a custom Spark Partitioner, taking advantage of the key
> structure, to reduce inter-cluster communication; for example, store all
> image tiles with the same key.id property on the same node.
>
> 1.) Is converting the image to a SequenceFile superfluous? Is it better to
> do this pre-processing and create a custom key/value type another way, and
> would that be through Spark or Hadoop's Writable? It seems like Spark just
> uses different flavors of Hadoop's InputFormat under the hood.
>
> I see that Spark does have support for SequenceFiles, but I'm still not
> fully clear on the extent of it.
>
> 2.) When you read in a .seq file through sc.sequenceFile(), it uses
> SequenceFileInputFormat. This means that the number of partitions will be
> determined by the number of splits, as specified in
> SequenceFileInputFormat.getSplits. Do the input splits happen on key/value
> boundaries?
>
> 3.) The RDD created from sequence files will have the translated Scala
> key/value type, but if I use a custom Hadoop Writable, will I have to do
> anything on the Spark/Scala side to understand it?
>
> 4.) Since I'm using a custom Hadoop Writable, is it best to register my
> Writable types with Kryo?
>
> Thanks for any help!
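For the partitioner idea and question 4 above, something along these lines might be a starting point. TileKey, ImagePartitioner and the id field are illustrative placeholders, not an existing API; the general pattern is to pull the metadata out of the Writable into a plain Scala key and partition on that.

import org.apache.hadoop.io.BytesWritable
import org.apache.spark.{Partitioner, SparkConf}

// Hypothetical plain-Scala key extracted from the custom Writable; plain
// types avoid the object-reuse issue with Hadoop Writables when shuffling.
case class TileKey(id: Long, row: Int, col: Int)

// Register the classes that travel through shuffles or serialized caches.
val conf = new SparkConf()
  .setAppName("image-tiles")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[TileKey], classOf[BytesWritable]))

// Co-locate every tile of one image (same key.id) in the same partition.
class ImagePartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0)
  override def getPartition(key: Any): Int = key match {
    case TileKey(id, _, _) => ((id % numPartitions + numPartitions) % numPartitions).toInt
    case _                 => 0
  }
}

// Usage (tiles: RDD[(TileKey, Array[Byte])]):
//   val byImage = tiles.partitionBy(new ImagePartitioner(64))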