Hey all, I was wondering what the best course of action is for processing an image that has an involved internal structure (file headers, sub-headers, image data, more sub-headers, more kinds of data, etc.). I was hoping to get some insight on the approach I'm using and whether there is a better, more idiomatic Spark way of handling it.
I'm coming from a Hadoop approach where I convert the image to a sequence file. Now, I'm new to both Spark and Hadoop, but I have a deeper understanding of Hadoop, which is why I went with sequence files. The sequence file is chopped into key/value pairs that contain file and image metadata, and separate key/value pairs that contain the raw image data. I currently use a LongWritable for the key and a BytesWritable for the value. This is a naive approach, but I plan to create a custom Writable key type that contains information pertinent to the corresponding image data. The idea is to create a custom Spark Partitioner that takes advantage of the key structure to reduce communication between nodes in the cluster. For example, store all image tiles with the same key.id property on the same node.

1.) Is converting the image to a sequence file superfluous? Is it better to do this pre-processing and create a custom key/value type another way? Would that be through Spark or through Hadoop's Writable? It seems like Spark just uses different flavors of Hadoop's InputFormat under the hood. I see that Spark does support SequenceFiles, but I'm still not fully clear on the extent of that support.

2.) When you read in a .seq file through sc.sequenceFile(), it uses SequenceFileInputFormat. This means that the number of partitions will be determined by the number of splits returned by SequenceFileInputFormat.getSplits. Do the input splits happen on key/value boundaries?

3.) The RDD created from sequence files will have the translated Scala key/value types, but if I use a custom Hadoop Writable, will I have to do anything on the Spark/Scala side for Spark to understand it?

4.) Since I'm using a custom Hadoop Writable, is it best to register my Writable types with Kryo?

Thanks for any help!
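P.S. In case it helps to make the partitioning idea concrete, here is a rough sketch in plain Python with no Spark dependency. The names TileKey, image_id, tile_index, partition_for and group_by_partition are all hypothetical, made up for illustration; they are not part of Spark or Hadoop. In Spark itself I assume the same logic would go into the getPartition method of a custom Partitioner (or the partitionFunc passed to partitionBy in PySpark), with the key type being my custom Writable.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TileKey:
    """Hypothetical key type: which image a tile belongs to, and which tile it is."""
    image_id: str
    tile_index: int


def partition_for(key: TileKey, num_partitions: int) -> int:
    """Pick a partition from image_id only, ignoring tile_index.

    Because tile_index is ignored, every tile of one image maps to the
    same partition. crc32 is used instead of the built-in hash() because
    hash() of a str can differ between Python processes.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.image_id.encode("utf-8")) % num_partitions


def group_by_partition(keys, num_partitions):
    """Simulate the shuffle: bucket keys by the partition they would land in."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition_for(key, num_partitions)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    keys = [TileKey("scene-A", i) for i in range(3)]
    keys += [TileKey("scene-B", i) for i in range(2)]
    for partition, members in sorted(group_by_partition(keys, 4).items()):
        print(partition, members)
```

The point of the sketch is only that the partition is derived from the id part of the key, so tiles of the same image never get spread across partitions. Two different images can still end up in the same partition, which I think is fine for my use case.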