Custom Image RDD and Sequence Files

jtgenesis Thu, 28 Jul 2016 11:05:26 -0700

Hey all,

I was wondering what the best course of action is for processing an image
that has an involved internal structure (file headers, sub-headers, image
data, more sub-headers, more kinds of data etc). I was hoping to get some
insight on the approach I'm using and whether there is a better, more Spark
way of handling it.

I'm coming from a Hadoop approach where I convert the image to a sequence
file. Now, I'm new to both Spark and Hadoop, but I have a deeper
understanding of Hadoop, which is why I went with sequence files. The
sequence file is chopped into key/value pairs that contain file and image
metadata, and separate key/value pairs that contain the raw image data. I
currently use a LongWritable for the key and a BytesWritable for the value.
This is a naive approach, but I plan to create a custom Writable key type
that contains information pertinent to the corresponding image data. The
idea is to create a custom Spark Partitioner that takes advantage of the key
structure to reduce communication between nodes. For example, store all
image tiles with the same key.id property on the same node.
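
To make the partitioning idea concrete, here is a minimal sketch of a
structured tile key and a partition rule that looks only at the id. It is
plain Python so it runs without a cluster. The field names (image_id, row,
col), the byte layout, and the helper names are all hypothetical, made up for
illustration; this is not an existing Spark or Hadoop API.

```python
import struct
import zlib
from collections import namedtuple

# Hypothetical key layout: the image a tile belongs to plus its grid position.
TileKey = namedtuple("TileKey", ["image_id", "row", "col"])

# Assumed wire format: big-endian 8-byte id followed by two 4-byte ints.
_KEY_FORMAT = ">qii"


def pack_key(key):
    """Serialize a TileKey to bytes (the role write() plays in a Writable)."""
    return struct.pack(_KEY_FORMAT, key.image_id, key.row, key.col)


def unpack_key(data):
    """Rebuild a TileKey from bytes (the role readFields() plays)."""
    return TileKey(*struct.unpack(_KEY_FORMAT, data))


def partition_for(key, num_partitions):
    """Pick a partition from image_id alone, ignoring row and col.

    crc32 is used instead of hash() so the result is stable across processes.
    """
    digest = zlib.crc32(struct.pack(">q", key.image_id))
    return digest % num_partitions


# Group a few made-up tiles by the partition they would be sent to.
tiles = [TileKey(7, 0, 0), TileKey(7, 0, 1), TileKey(42, 0, 0), TileKey(7, 1, 0)]
placement = {}
for tile in tiles:
    placement.setdefault(partition_for(tile, 4), []).append(tile)
```

Because partition_for never looks at row or col, every tile of image 7 ends
up in the same bucket. In Spark the same rule would presumably go in the
getPartition method of a custom Partitioner subclass.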

1.) Is converting the image to a Sequence File superfluous? Would it be
better to do this pre-processing and create a custom key/value type another
way, and if so, through Spark or through Hadoop's Writable? It seems like
Spark just uses different flavors of Hadoop's InputFormat under the hood.

I see that Spark does have support for SequenceFiles, but I'm still not
fully clear on the extent of it.

2.) When you read in a .seq file through sc.sequenceFile(), it uses
SequenceFileInputFormat. This means that the number of partitions will be
determined by the number of splits returned by
SequenceFileInputFormat.getSplits. Do the input splits happen on key/value
boundaries?
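
To illustrate what I'm worried about in question 2, here is a toy model of
splits defined purely by byte offset over variable-length records. It only
shows the concern (a record cut in half by a split boundary); it makes no
claim about how SequenceFileInputFormat actually behaves. The record sizes
and the split size are made-up numbers.

```python
from itertools import accumulate


def straddling_records(record_lengths, split_size):
    """Return indices of records whose bytes cross a fixed-size split boundary.

    Assumes every record length is a positive number of bytes.
    """
    ends = list(accumulate(record_lengths))
    starts = [0] + ends[:-1]
    straddlers = []
    for index, (start, end) in enumerate(zip(starts, ends)):
        # A record occupies bytes [start, end); its last byte is end - 1.
        first_split = start // split_size
        last_split = (end - 1) // split_size
        if first_split != last_split:
            straddlers.append(index)
    return straddlers


# Hypothetical serialized record sizes in bytes, with 100-byte splits.
record_lengths = [40, 70, 30, 90, 20]
print(straddling_records(record_lengths, split_size=100))  # prints [1, 3]
```

If splits were cut like this, records 1 and 3 would each be divided between
two partitions, which is the situation I want to rule out.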

3.) The RDD created from Sequence Files will have the translated Scala
key/value types, but if I use a custom Hadoop Writable, will I have to do
anything on the Spark/Scala side for Spark to understand it?

4.) Since I'm using a custom Hadoop Writable, is it best to register my
writable types with Kryo?

Thanks for any help!
