+1, looks good.

Regards,
Vaquar Khan
On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

+1; we should consider something similar for multi-dimensional tensors too.

Matei

On Sep 23, 2017, at 7:27 AM, Yanbo Liang <yblia...@gmail.com> wrote:

+1

On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <nomanbp...@live.com> wrote:

+1

Regards,
Noman

From: Denny Lee <denny.g....@gmail.com>
Sent: Friday, September 22, 2017 2:59:33 AM
To: Apache Spark Dev; Sean Owen; Tim Hunter
Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

+1

On Thu, Sep 21, 2017 at 11:15 Sean Owen <so...@cloudera.com> wrote:

Am I right that this doesn't mean other packages would have to use this representation, but that they could?

The representation looked fine to me w.r.t. what DL frameworks need.

My previous comment was that this is actually quite lightweight. It's kind of like how I/O support is provided for CSV and JSON, so it makes enough sense to add to Spark. It doesn't really preclude other solutions.

For those reasons I think it's fine. +1

On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter <timhun...@databricks.com> wrote:

Hello community,

I would like to call for a vote on SPARK-21866. It is a short proposal with important applications for image processing and deep learning. Joseph Bradley has offered to be the shepherd.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is used more and more in industry, new use cases are emerging for data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep learning applications commonly deal with image processing. A number of projects add deep learning capabilities to Spark (see the list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts to represent images in Spark by defining a representation that caters to the most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has design tradeoffs that make it better as a standalone solution.

This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that targets low-level applications. It uses significantly more memory than compressed image representations such as JPEG or PNG, but it allows easy communication with popular image processing libraries and has no decoding overhead.
Target users and personas:

Data scientists, data engineers, library developers.

The following libraries define primitives for loading and representing images, and would gain from a common interchange format (in alphabetical order):

• BigDL
• DeepLearning4J
• Deep Learning Pipelines
• MMLSpark
• TensorFlow (Spark connector)
• TensorFlowOnSpark
• TensorFrames
• Thunder

Goals:

• A simple representation of images in Spark DataFrames, based on pre-existing industrial standards (OpenCV).
• This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries.
• The reader should be able to read popular image formats from distributed sources.

Non-Goals:

Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly targets the most common use case in the industry today: multi-channel matrices of binary, int32, int64, float, or double data that fit comfortably in the heap of the JVM:

• the total size of an image should be restricted to less than 2 GB (roughly)
• the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard)
• specialized formats used in meteorology, the medical field, etc. are not supported
• this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark

Proposed API changes

We propose to add a new package in the package structure, under the MLlib project:

org.apache.spark.image

Data format

We propose to add the following structure:

imageSchema = StructType([

• StructField("mode", StringType(), False)
    • The exact representation of the data.
    • The values are described by the OpenCV convention (see the guide linked below). Basically, the type carries both "depth" and "number of channels" information: for example, type "CV_8UC3" means "3-channel unsigned bytes". The BGRA format would be CV_8UC4 (value 32 in the OpenCV table), with the channel order specified by convention.
    • The exact channel ordering and the meaning of each channel are dictated by convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
    • If the image failed to load, the value is the empty string "".

• StructField("origin", StringType(), True)
    • Some information about the origin of the image; the content is application-specific.
    • When the image is loaded from files, users should expect to find the file name in this field.

• StructField("height", IntegerType(), False)
    • The height of the image, in pixels.
    • If the image fails to load, the value is -1.

• StructField("width", IntegerType(), False)
    • The width of the image, in pixels.
    • If the image fails to load, the value is -1.

• StructField("nChannels", IntegerType(), False)
    • The number of channels in the image: typically 1 (B&W), 3 (RGB), or 4 (BGRA).
    • If the image fails to load, the value is -1.

• StructField("data", BinaryType(), False)
    • Packed array content. Due to an implementation limitation, it cannot currently store more than 2 billion pixels.
    • The data is stored in pixel-by-pixel, BGR, row-wise order, following the OpenCV convention.
    • If the image fails to load, this array is empty.

])
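For concreteness, here is a minimal sketch of this schema as it might be declared in Scala. The enclosing object name is hypothetical (the proposal only names the org.apache.spark.image package), and the comments merely restate the field descriptions above:

import org.apache.spark.sql.types._

// Illustrative sketch of the proposed image schema as a Spark SQL StructType.
// The object name `ImageSchema` is hypothetical, not part of the proposal.
object ImageSchema {
  val imageSchema: StructType = StructType(Seq(
    StructField("mode", StringType, nullable = false),       // OpenCV type, e.g. "CV_8UC3"; "" on load failure
    StructField("origin", StringType, nullable = true),      // application-specific, e.g. the source file name
    StructField("height", IntegerType, nullable = false),    // height in pixels; -1 on load failure
    StructField("width", IntegerType, nullable = false),     // width in pixels; -1 on load failure
    StructField("nChannels", IntegerType, nullable = false), // 1 (B&W), 3 (RGB), or 4 (BGRA); -1 on load failure
    StructField("data", BinaryType, nullable = false)        // packed pixels in BGR row-wise order; empty on failure
  ))
}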
For more information about image types, here is an OpenCV guide on types: http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates

The reference implementation provides functions to convert popular formats (JPEG, PNG, etc.) to the image specification above, and functions to verify that an image is valid.

Image ingest API

We propose the following function to load images from a remote distributed source as a DataFrame. Here is the signature in Scala; the Python interface is similar. For compatibility with Java, this function should be made available through a builder pattern or through the DataSource API. The exact mechanics can be discussed during implementation; the goal of the proposal below is to specify the behavior and the options:

def readImages(
    path: String,
    session: SparkSession = null,
    recursive: Boolean = false,
    numPartitions: Int = 0,
    dropImageFailures: Boolean = false,
    // Experimental options
    sampleRatio: Double = 1.0): DataFrame

The type of the returned DataFrame should be the structure type above, with the expectation that all the file names are filled in.

Mandatory parameters:

• path: a directory in a file system that contains images

Optional parameters:

• session (SparkSession, default null): the Spark session to use to create the DataFrame. If not provided, the current default Spark session is obtained via SparkSession.getOrCreate().
• recursive (bool, default false): whether to take only the top-level images or to look into directories recursively.
• numPartitions (int, default 0): the number of partitions of the final DataFrame. By default, uses Spark's default number of partitions.
• dropImageFailures (bool, default false): drops the files that failed to load. If false (do not drop), some invalid images are kept.

Parameters that are experimental and may be deprecated quickly; these would be useful to have but are not critical for a first cut:

• sampleRatio (float, in (0, 1], default 1): if less than 1, returns a fraction of the data. There is no statistical guarantee about how the sampling is performed. This proved very helpful for fast prototyping. Marked as experimental since it should eventually be pushed into Spark core.

The implementation is expected to be in Scala for performance, with a wrapper for Python.

This function should be lazy to the extent possible: it should not trigger access to the data when called. Ideally, any file system supported by Spark should be supported when loading images. There may be restrictions for some options such as zip files, etc.

The reference implementation also has some experimental options (undocumented here).

Reference implementation

A reference implementation is available as an open-source Spark package in this repository (Apache 2.0 license):

https://github.com/Microsoft/spark-images

This Spark package will also be published in binary form on spark-packages.org.

Comments about the API should be addressed in this ticket.

Optional Rejected Designs

The use of User-Defined Types was considered. It adds some burden to the implementation in various languages and does not provide significant advantages.
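As a usage sketch, the proposed ingest API could be exercised as follows. The import path, data location, and the validity filter on height are illustrative assumptions, not part of the proposal:

import org.apache.spark.sql.SparkSession
// Hypothetical import: the proposal names only the org.apache.spark.image
// package, not the object that would ultimately hold readImages.
import org.apache.spark.image.readImages

object ReadImagesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadImagesExample").getOrCreate()

    // Recursively load all images under a directory, keeping files that
    // failed to decode (dropImageFailures = false).
    val images = readImages(
      path = "hdfs:///data/images", // illustrative location
      session = spark,
      recursive = true,
      dropImageFailures = false)

    // Per the schema above, failed loads are marked with height == -1
    // (and an empty data array), so they can be filtered out explicitly.
    val valid = images.filter(images("height") =!= -1)
    valid.select("origin", "width", "height", "nChannels").show(5)
  }
}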
--
Regards,
Vaquar Khan
+1-224-436-0783
Greater Chicago