Why don't you write your own Hadoop FileInputFormat? It can be used by Spark...
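Roughly, the skeleton could look something like the sketch below. ImageInputFormat and ImageRecordReader are made-up names, and this placeholder reader just emits one (file name, raw bytes) record per file; a real one would walk your file headers and sub-headers and emit one record per tile.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Placeholder reader: emits one (file name, raw bytes) record per file.
// Replace the body of nextKeyValue() with your header/sub-header parsing
// so that it emits one record per image tile instead.
class ImageRecordReader extends RecordReader[Text, BytesWritable] {
  private var split: FileSplit = _
  private var conf: org.apache.hadoop.conf.Configuration = _
  private var key: Text = _
  private var value: BytesWritable = _
  private var done = false

  override def initialize(s: InputSplit, context: TaskAttemptContext): Unit = {
    split = s.asInstanceOf[FileSplit]
    conf = context.getConfiguration
  }

  override def nextKeyValue(): Boolean = {
    if (done) return false
    val path = split.getPath
    val in = path.getFileSystem(conf).open(path)
    try {
      val bytes = new Array[Byte](split.getLength.toInt) // assumes the file fits in memory
      in.readFully(0, bytes)
      key = new Text(path.getName)
      value = new BytesWritable(bytes)
    } finally in.close()
    done = true
    true
  }

  override def getCurrentKey: Text = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (done) 1.0f else 0.0f
  override def close(): Unit = ()
}

class ImageInputFormat extends FileInputFormat[Text, BytesWritable] {
  // One split per file: the internal structure is handled by the reader.
  override def isSplitable(context: JobContext, file: Path): Boolean = false
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new ImageRecordReader
}

// Reading it from Spark (sc is your existing SparkContext):
val tiles = sc.newAPIHadoopFile(
  "hdfs:///path/to/images",   // placeholder path
  classOf[ImageInputFormat],
  classOf[Text],
  classOf[BytesWritable])

One caveat from the Spark docs: Hadoop RecordReaders reuse the same Writable object for every record, so map the values to plain types (e.g. value.copyBytes()) before caching or collecting the RDD.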
> On 28 Jul 2016, at 20:04, jtgenesis <jtgene...@gmail.com> wrote:
>
> Hey all,
>
> I was wondering what the best course of action is for processing an image
> that has an involved internal structure (file headers, sub-headers, image
> data, more sub-headers, more kinds of data, etc.). I was hoping to get some
> insight on the approach I'm using and whether there is a better, more Spark
> way of handling it.
>
> I'm coming from a Hadoop approach where I convert the image to a sequence
> file. Now, I'm new to both Spark and Hadoop, but I have a deeper
> understanding of Hadoop, which is why I went with the sequence files. The
> sequence file is chopped into key/value pairs that contain file and image
> metadata and separate key/value pairs that contain the raw image data. I
> currently use a LongWritable for the key and a BytesWritable for the value.
> This is a naive approach, but I plan to create a custom Writable key type
> that contains pertinent information about the corresponding image data. The
> idea is to create a custom Spark Partitioner, taking advantage of the key
> structure, to reduce inter-cluster communication; for example, store all
> image tiles with the same key.id property on the same node.
>
> 1.) Is converting the image to a SequenceFile superfluous? Is it better to
> do this pre-processing and create a custom key/value type another way, and
> would that be through Spark or Hadoop's Writable? It seems like Spark just
> uses different flavors of Hadoop's InputFormat under the hood.
>
> I see that Spark does have support for SequenceFiles, but I'm still not
> fully clear on the extent of it.
>
> 2.) When you read in a .seq file through sc.sequenceFile(), it uses
> SequenceFileInputFormat. This means that the number of partitions will be
> determined by the number of splits, as specified in
> SequenceFileInputFormat.getSplits. Do the input splits happen on key/value
> boundaries?
>
> 3.) The RDD created from sequence files will have the translated Scala
> key/value type, but if I use a custom Hadoop Writable, will I have to do
> anything on the Spark/Scala side to understand it?
>
> 4.) Since I'm using a custom Hadoop Writable, is it best to register my
> Writable types with Kryo?
>
> Thanks for any help!
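For the partitioner idea and question 4 above, something along these lines might be a starting point. TileKey, ImagePartitioner and the id field are illustrative placeholders, not an existing API; the general pattern is to pull the metadata out of the Writable into a plain Scala key and partition on that.

import org.apache.hadoop.io.BytesWritable
import org.apache.spark.{Partitioner, SparkConf}

// Hypothetical plain-Scala key extracted from the custom Writable; plain
// types avoid the object-reuse issue with Hadoop Writables when shuffling.
case class TileKey(id: Long, row: Int, col: Int)

// Register the classes that travel through shuffles or serialized caches.
val conf = new SparkConf()
  .setAppName("image-tiles")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[TileKey], classOf[BytesWritable]))

// Co-locate every tile of one image (same key.id) in the same partition.
class ImagePartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0)
  override def getPartition(key: Any): Int = key match {
    case TileKey(id, _, _) => ((id % numPartitions + numPartitions) % numPartitions).toInt
    case _                 => 0
  }
}

// Usage (tiles: RDD[(TileKey, Array[Byte])]):
//   val byImage = tiles.partitionBy(new ImagePartitioner(64))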