How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-21 Thread Niranda Perera
Hi,

Say I have a table with 1 column and 1000 rows, and I want to save the result
of a query on it to an RDBMS table using the JDBC relation provider. So I run
the following query,

"insert into table table2 select value, count(*) from table1 group by value
order by value"

While debugging, I found that the resulting DataFrame from select value,
count(*) from table1 group by value order by value would have around 200+
partitions, and say I have 4 executors attached to my driver. So I would have
200+ writing tasks assigned to 4 executors. I want to understand how these
executors are able to write the data to the underlying RDBMS table for
table2 without messing up the order.

I checked the JDBC insertable relation, and in JdbcUtils [1] it does the
following:

df.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes,
    batchSize, dialect)
}

So my understanding is that all of my 4 executors will run the savePartition
function (or closure) in parallel, without knowing which one should write its
data before the others!
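
To make sure I'm reading it right, my mental model of what each writing task
ends up doing is roughly this (a simplified sketch, not the actual JdbcUtils
code; getConnection, the INSERT statement and the column types are
placeholders):

df.foreachPartition { rows =>
  val conn = getConnection()                // one JDBC connection per task
  conn.setAutoCommit(false)                 // one transaction per partition
  val stmt = conn.prepareStatement("INSERT INTO table2 VALUES (?, ?)")
  try {
    rows.foreach { row =>
      stmt.setString(1, row.getString(0))   // value
      stmt.setLong(2, row.getLong(1))       // count(*)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    conn.close()
  }
}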

The comment on the savePartition method says:

"Saves a partition of a DataFrame to the JDBC database. This is done in
a single database transaction in order to avoid repeatedly inserting
data as much as possible."

I want to understand how these parallel executors save their partitions
without harming the order of the results. Is it by locking the database
resource from each executor (i.e. ex0 would first obtain a lock on the table
and write partition0, while ex1 ... ex3 would wait till the lock is
released)?

In my experience, there is no harm done to the order of the results at the
end of the day!

Would like to hear from you guys! :-)

[1]
https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277

-- 
Niranda Perera
@n1r44 
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/


Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-21 Thread Maciej Szymkiewicz
In commonly used RDBMSs, relations have no fixed order, and the physical
location of the records can change during routine maintenance operations.
Unless you explicitly order the data during retrieval, the order you see is
incidental and not guaranteed.

Conclusion: the order of inserts just doesn't matter.
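
In other words, if you need the rows back in order, order them when you read
them. Something along these lines (a sketch using the Spark 2.x reader API;
the JDBC URL is a placeholder):

val readBack = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")  // placeholder URL
  .option("dbtable", "table2")
  .load()
  .orderBy("value")  // without an explicit ordering, row order is incidental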

On 11/21/2016 10:03 AM, Niranda Perera wrote:
> Hi, 
>
> Say, I have a table with 1 column and 1000 rows. I want to save the
> result in a RDBMS table using the jdbc relation provider. So I run the
> following query, 
>
> "insert into table table2 select value, count(*) from table1 group by
> value order by value"
>
> While debugging, I found that the resultant df from select value,
> count(*) from table1 group by value order by value would have around
> 200+ partitions and say I have 4 executors attached to my driver. So,
> I would have 200+ writing tasks assigned to 4 executors. I want to
> understand, how these executors are able to write the data to the
> underlying RDBMS table of table2 without messing up the order. 
>
> I checked the jdbc insertable relation and in jdbcUtils [1] it does
> the following
>
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema,
> nullTypes, batchSize, dialect)
> }
>
> So, my understanding is, all of my 4 executors will parallely run the
> savePartition function (or closure) where they do not know which one
> should write data before the other! 
>
> In the savePartition method, in the comment, it says 
> "Saves a partition of a DataFrame to the JDBC database.  This is done in
>* a single database transaction in order to avoid repeatedly inserting
>* data as much as possible."
>
> I want to understand, how these parallel executors save the partition
> without harming the order of the results? Is it by locking the
> database resource, from each executor (i.e. ex0 would first obtain a
> lock for the table and write the partition0, while ex1 ... ex3 would
> wait till the lock is released )? 
>
> In my experience, there is no harm done to the order of the results at
> the end of the day! 
>
> Would like to hear from you guys! :-) 
>
> [1] 
> https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>
> -- 
> Niranda Perera
> @n1r44 
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/

-- 
Best regards,
Maciej Szymkiewicz



Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-21 Thread Ryan Blue
Aniket,

The solution was to add a sort so that only one file is written at a time,
which minimizes the memory footprint of columnar formats like Parquet.
That's been released for quite a while, so memory issues caused by Parquet
are rarer now. If you're using Parquet default settings and a recent
Spark version, you should be fine.
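
If you're writing partitioned output by hand, the idea looks roughly like
this (a sketch; the partition columns and output path are placeholders):

// Sorting by the partition columns first means each task streams one Parquet
// file at a time instead of keeping many open writers (and their buffers) alive.
df.sortWithinPartitions("date", "hour")
  .write
  .partitionBy("date", "hour")
  .parquet("/path/to/output")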

rb

On Sun, Nov 20, 2016 at 3:35 AM, Aniket  wrote:

> Was anyone able  find a solution or recommended conf for this? I am
> running into the same "java.lang.OutOfMemoryError: Direct buffer memory"
> but during snappy compression.
>
> Thanks,
> Aniket
>
> On Tue, Sep 23, 2014 at 7:04 PM Aaron Davidson [via Apache Spark
> Developers List] <[hidden email]
> > wrote:
>
>> This may be related: https://github.com/Parquet/parquet-mr/issues/211
>>
>> Perhaps if we change our configuration settings for Parquet it would get
>> better, but the performance characteristics of Snappy are pretty bad here
>> under some circumstances.
>>
>> On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger <[hidden email]
>> > wrote:
>>
>> > Cool, that's pretty much what I was thinking as far as configuration
>> goes.
>> >
>> > Running on Mesos.  Worker nodes are amazon xlarge, so 4 core / 15g.
>> I've
>> > tried executor memory sizes as high as 6G
>> > Default hdfs block size 64m, about 25G of total data written by a job
>> with
>> > 128 partitions.  The exception comes when trying to read the data (all
>> > columns).
>> >
>> > Schema looks like this:
>> >
>> > case class A(
>> >   a: Long,
>> >   b: Long,
>> >   c: Byte,
>> >   d: Option[Long],
>> >   e: Option[Long],
>> >   f: Option[Long],
>> >   g: Option[Long],
>> >   h: Option[Int],
>> >   i: Long,
>> >   j: Option[Int],
>> >   k: Seq[Int],
>> >   l: Seq[Int],
>> >   m: Seq[Int]
>> > )
>> >
>> > We're just going back to gzip for now, but might be nice to help
>> someone
>> > else avoid running into this.
>> >
>> > On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust <[hidden email]
>> 
>> > >
>> > wrote:
>> >
>> > > I actually submitted a patch to do this yesterday:
>> > > https://github.com/apache/spark/pull/2493
>> > >
>> > > Can you tell us more about your configuration.  In particular how
>> much
>> > > memory/cores do the executors have and what does the schema of your
>> data
>> > > look like?
>> > >
>> > > On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger <[hidden email]
>> >
>> > > wrote:
>> > >
>> > >> So as a related question, is there any reason the settings in
>> SQLConf
>> > >> aren't read from the spark context's conf?  I understand why the sql
>> > conf
>> > >> is mutable, but it's not particularly user friendly to have most
>> spark
>> > >> configuration set via e.g. defaults.conf or --properties-file, but
>> for
>> > >> spark sql to ignore those.
>> > >>
>> > >> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger <[hidden email]
>> >
>> > >> wrote:
>> > >>
>> > >> > After commit 8856c3d8 switched from gzip to snappy as default
>> parquet
>> > >> > compression codec, I'm seeing the following when trying to read
>> > parquet
>> > >> > files saved using the new default (same schema and roughly same
>> size
>> > as
>> > >> > files that were previously working):
>> > >> >
>> > >> > java.lang.OutOfMemoryError: Direct buffer memory
>> > >> >   java.nio.Bits.reserveMemory(Bits.java:658)
>> > >> >   java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>> > >> >   java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>> > >> >   parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
>> > >> >   parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
>> > >> >   java.io.DataInputStream.readFully(DataInputStream.java:195)
>> > >> >   java.io.DataInputStream.readFully(DataInputStream.java:169)
>> > >> >   parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
>> > >> >   parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
>> > >> >   parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
>> > >> >   parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
>> > >> >   parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
>> > >> >   parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>> > >> >   parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>> > >> >   parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
>> > >

Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-21 Thread Aniket
Thanks Ryan. I am running into this rarer issue. For now, I have moved away
from Parquet, but I will create a bug in JIRA if I am able to produce code
that easily reproduces this.

Thanks,
Aniket

On Mon, Nov 21, 2016, 3:24 PM Ryan Blue [via Apache Spark Developers List] <
ml-node+s1001551n19972...@n3.nabble.com> wrote:

> Aniket,
>
> The solution was to add a sort so that only one file is written at a time,
> which minimizes the memory footprint of columnar formats like Parquet.
> That's been released for quite a while, so memory issues caused by Parquet
> are more rare now. If you're using Parquet default settings and a recent
> Spark version, you should be fine.
>
> rb
> On Sun, Nov 20, 2016 at 3:35 AM, Aniket <[hidden email]
> > wrote:
>
> Was anyone able  find a solution or recommended conf for this? I am
> running into the same "java.lang.OutOfMemoryError: Direct buffer memory"
> but during snappy compression.
>
> Thanks,
> Aniket
>
> On Tue, Sep 23, 2014 at 7:04 PM Aaron Davidson [via Apache Spark
> Developers List] <[hidden email]
> > wrote:
>
> This may be related: https://github.com/Parquet/parquet-mr/issues/211
>
> Perhaps if we change our configuration settings for Parquet it would get
> better, but the performance characteristics of Snappy are pretty bad here
> under some circumstances.
>
> On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger <[hidden email]
> > wrote:
>
> > Cool, that's pretty much what I was thinking as far as configuration
> goes.
> >
> > Running on Mesos.  Worker nodes are amazon xlarge, so 4 core / 15g.
> I've
> > tried executor memory sizes as high as 6G
> > Default hdfs block size 64m, about 25G of total data written by a job
> with
> > 128 partitions.  The exception comes when trying to read the data (all
> > columns).
> >
> > Schema looks like this:
> >
> > case class A(
> >   a: Long,
> >   b: Long,
> >   c: Byte,
> >   d: Option[Long],
> >   e: Option[Long],
> >   f: Option[Long],
> >   g: Option[Long],
> >   h: Option[Int],
> >   i: Long,
> >   j: Option[Int],
> >   k: Seq[Int],
> >   l: Seq[Int],
> >   m: Seq[Int]
> > )
> >
> > We're just going back to gzip for now, but might be nice to help someone
> > else avoid running into this.
> >
> > On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust <[hidden email]
> 
> > >
> > wrote:
> >
> > > I actually submitted a patch to do this yesterday:
> > > https://github.com/apache/spark/pull/2493
> > >
> > > Can you tell us more about your configuration.  In particular how much
> > > memory/cores do the executors have and what does the schema of your
> data
> > > look like?
> > >
> > > On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger <[hidden email]
> >
> > > wrote:
> > >
> > >> So as a related question, is there any reason the settings in SQLConf
> > >> aren't read from the spark context's conf?  I understand why the sql
> > conf
> > >> is mutable, but it's not particularly user friendly to have most
> spark
> > >> configuration set via e.g. defaults.conf or --properties-file, but
> for
> > >> spark sql to ignore those.
> > >>
> > >> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger <[hidden email]
> >
> > >> wrote:
> > >>
> > >> > After commit 8856c3d8 switched from gzip to snappy as default
> parquet
> > >> > compression codec, I'm seeing the following when trying to read
> > parquet
> > >> > files saved using the new default (same schema and roughly same
> size
> > as
> > >> > files that were previously working):
> > >> >
> > >> > java.lang.OutOfMemoryError: Direct buffer memory
> > >> >   java.nio.Bits.reserveMemory(Bits.java:658)
> > >> >   java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
> > >> >   java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
> > >> >   parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
> > >> >   parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
> > >> >   java.io.DataInputStream.readFully(DataInputStream.java:195)
> > >> >   java.io.DataInputStream.readFully(DataInputStream.java:169)
> > >> >   parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
> > >> >   parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
> > >> >   parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
> > >> >   parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
> > >> >   parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
> > >> >   parquet.column.impl.ColumnReadStoreImpl.newMemColumnRe

Re: OutOfMemoryError on parquet SnappyDecompressor

2016-11-21 Thread Ryan Blue
It's unlikely that you're hitting this, unless you have several tasks
writing at once on the same executor. Parquet does have high memory
consumption, so the most likely explanation is either that you're close to
the memory limit for other reasons, or that you need to increase the amount
of overhead memory for off-heap tasks.
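
The knobs to look at are roughly these (illustrative values only, and the
exact property names depend on your Spark version and cluster manager):

import org.apache.spark.SparkConf

// Illustrative settings, not a tested recommendation for this workload.
val conf = new SparkConf()
  .set("spark.executor.memory", "6g")
  // Direct buffers live off-heap; the JVM cap for them can be raised explicitly.
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=1g")
  // On YARN, the container overhead must also account for off-heap usage.
  .set("spark.yarn.executor.memoryOverhead", "2048")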

rb

On Mon, Nov 21, 2016 at 10:07 AM, Aniket  wrote:

> Thanks Ryan. I am running into this rarer issue. For now, I have moved
> away from parquet but if I will create a bug in jira if I am able to
> produce code that easily reproduces this.
>
> Thanks,
> Aniket
>
> On Mon, Nov 21, 2016, 3:24 PM Ryan Blue [via Apache Spark Developers List]
> <[hidden email] >
> wrote:
>
>> Aniket,
>>
>> The solution was to add a sort so that only one file is written at a
>> time, which minimizes the memory footprint of columnar formats like
>> Parquet. That's been released for quite a while, so memory issues caused by
>> Parquet are more rare now. If you're using Parquet default settings and a
>> recent Spark version, you should be fine.
>>
>> rb
>> On Sun, Nov 20, 2016 at 3:35 AM, Aniket <[hidden email]
>> > wrote:
>>
>> Was anyone able  find a solution or recommended conf for this? I am
>> running into the same "java.lang.OutOfMemoryError: Direct buffer memory"
>> but during snappy compression.
>>
>> Thanks,
>> Aniket
>>
>> On Tue, Sep 23, 2014 at 7:04 PM Aaron Davidson [via Apache Spark
>> Developers List] <[hidden email]
>> > wrote:
>>
>> This may be related: https://github.com/Parquet/parquet-mr/issues/211
>>
>> Perhaps if we change our configuration settings for Parquet it would get
>> better, but the performance characteristics of Snappy are pretty bad here
>> under some circumstances.
>>
>> On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger <[hidden email]
>> > wrote:
>>
>> > Cool, that's pretty much what I was thinking as far as configuration
>> goes.
>> >
>> > Running on Mesos.  Worker nodes are amazon xlarge, so 4 core / 15g.
>> I've
>> > tried executor memory sizes as high as 6G
>> > Default hdfs block size 64m, about 25G of total data written by a job
>> with
>> > 128 partitions.  The exception comes when trying to read the data (all
>> > columns).
>> >
>> > Schema looks like this:
>> >
>> > case class A(
>> >   a: Long,
>> >   b: Long,
>> >   c: Byte,
>> >   d: Option[Long],
>> >   e: Option[Long],
>> >   f: Option[Long],
>> >   g: Option[Long],
>> >   h: Option[Int],
>> >   i: Long,
>> >   j: Option[Int],
>> >   k: Seq[Int],
>> >   l: Seq[Int],
>> >   m: Seq[Int]
>> > )
>> >
>> > We're just going back to gzip for now, but might be nice to help
>> someone
>> > else avoid running into this.
>> >
>> > On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust <[hidden email]
>> 
>> > >
>> > wrote:
>> >
>> > > I actually submitted a patch to do this yesterday:
>> > > https://github.com/apache/spark/pull/2493
>> > >
>> > > Can you tell us more about your configuration.  In particular how
>> much
>> > > memory/cores do the executors have and what does the schema of your
>> data
>> > > look like?
>> > >
>> > > On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger <[hidden email]
>> >
>> > > wrote:
>> > >
>> > >> So as a related question, is there any reason the settings in
>> SQLConf
>> > >> aren't read from the spark context's conf?  I understand why the sql
>> > conf
>> > >> is mutable, but it's not particularly user friendly to have most
>> spark
>> > >> configuration set via e.g. defaults.conf or --properties-file, but
>> for
>> > >> spark sql to ignore those.
>> > >>
>> > >> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger <[hidden email]
>> >
>> > >> wrote:
>> > >>
>> > >> > After commit 8856c3d8 switched from gzip to snappy as default
>> parquet
>> > >> > compression codec, I'm seeing the following when trying to read
>> > parquet
>> > >> > files saved using the new default (same schema and roughly same
>> size
>> > as
>> > >> > files that were previously working):
>> > >> >
>> > >> > java.lang.OutOfMemoryError: Direct buffer memory
>> > >> >   java.nio.Bits.reserveMemory(Bits.java:658)
>> > >> >   java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>> > >> >   java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>> > >> >   parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
>> > >> >   parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
>> > >> >   java.io.DataInputStream.readFully(DataInputStream.java:195)
>> > >> >   java.io.DataInputStream.readFully(DataInputStream.java:169)
>> > >

MinMaxScaler behaviour

2016-11-21 Thread Joeri Hermans
Hi all,

I observed some weird behaviour while applying feature transformations using
MinMaxScaler. More specifically, I am wondering whether this behaviour is
intended and makes sense, especially because I explicitly defined min and max.

Basically, I am preprocessing the MNIST dataset and scaling the features to
the range 0 to 1 using the following code:

# Clear the dataset in the case you ran this cell before.
dataset = dataset.select("features", "label", "label_encoded")
# Apply MinMax normalization to the features.
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features",
                      outputCol="features_normalized")
# Compute summary statistics and generate MinMaxScalerModel.
scaler_model = scaler.fit(dataset)
# Rescale each feature to range [min, max].
dataset = scaler_model.transform(dataset)

Complete code is here (Normalization section):
https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb

The original MNIST images are shown in original.png, whereas the processed
images are shown in processed.png. Note the 0.5 artifacts. I checked the
source code of this particular estimator / transformer and found the following.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191

According to the documentation:

$$
Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
$$

For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.

So basically, when the difference between E_{max} and E_{min} is 0, we assign
0.5 * (max + min) as the scaled value. I am wondering whether this is helpful
in any situation. Why not assign 0?
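
To make the degenerate case concrete, here is a minimal sketch (Scala API for
brevity; the Python behaviour is the same) with one feature that is constant
across all rows:

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._  // assumes a SparkSession named `spark`

val df = Seq(
  Vectors.dense(0.0, 7.0),  // the second feature is constant
  Vectors.dense(1.0, 7.0),
  Vectors.dense(2.0, 7.0)
).map(Tuple1.apply).toDF("features")

val scaled = new MinMaxScaler()
  .setMin(0.0).setMax(1.0)
  .setInputCol("features").setOutputCol("features_normalized")
  .fit(df)
  .transform(df)

// The constant feature ends up at 0.5 * (max + min) = 0.5 in every row,
// which is exactly the grey artifact visible in the processed images.
scaled.select("features_normalized").show(truncate = false)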



Kind regards,

Joeri
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: MinMaxScaler behaviour

2016-11-21 Thread Sean Owen
It's a degenerate case of course. 0, 0.5 and 1 all make about as much
sense. Is there a strong convention elsewhere to use 0?

Min/max scaling is the wrong thing to do for a data set like this anyway.
What you probably intend to do is scale each image so that its max
intensity is 1 and min intensity is 0, but that's different. Scaling each
pixel across all images doesn't make as much sense.
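
If per-image scaling is what you're after, here is a rough sketch of it as a
UDF on the feature vector (Scala; mapping a constant image to all zeros here
is an arbitrary choice):

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Rescale each image by its own min/max instead of per-pixel statistics.
val rescaleImage = udf { v: Vector =>
  val pixels = v.toArray
  val lo = pixels.min
  val hi = pixels.max
  val range = hi - lo
  Vectors.dense(pixels.map(p => if (range == 0.0) 0.0 else (p - lo) / range))
}

val normalized = dataset.withColumn("features_normalized", rescaleImage(col("features")))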

On Mon, Nov 21, 2016 at 8:26 PM Joeri Hermans <
joeri.raymond.e.herm...@cern.ch> wrote:

> Hi all,
>
> I observed some weird behaviour while applying some feature
> transformations using MinMaxScaler. More specifically, I was wondering if
> this behaviour is intended and makes sense? Especially because I explicitly
> defined min and max.
>
> Basically, I am preprocessing the MNIST dataset, and thereby scaling the
> features between the ranges 0 and 1 using the following code:
>
> # Clear the dataset in the case you ran this cell before.
> dataset = dataset.select("features", "label", "label_encoded")
> # Apply MinMax normalization to the features.
> scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features",
> outputCol="features_normalized")
> # Compute summary statistics and generate MinMaxScalerModel.
> scaler_model = scaler.fit(dataset)
> # Rescale each feature to range [min, max].
> dataset = scaler_model.transform(dataset)
>
> Complete code is here:
> https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb
> (Normalization section)
>
> The original MNIST images are shown in original.png. Whereas the processed
> images are shown in processed.png. Note the 0.5 artifacts. I checked the
> source code of this particular estimator / transformer and found the
> following.
>
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191
>
> According to the documentation:
>
>  * 
>  *$$
>  *Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max -
> min) + min
>  *$$
>  * 
>  *
>  * For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.
>
> So basically, when the difference between E_{max} and E_{min} is 0, we
> assing 0.5 as a raw value. I am wondering if this is helpful in any
> situation? Why not assign 0?
>
>
>
> Kind regards,
>
> Joeri
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


RE: MinMaxScaler behaviour

2016-11-21 Thread Joeri Hermans
I see. I think I read the documentation a little bit too quickly :)

My apologies.


Kind regards,

Joeri

From: Sean Owen [so...@cloudera.com]
Sent: 21 November 2016 21:32
To: Joeri Hermans; dev@spark.apache.org
Subject: Re: MinMaxScaler behaviour

It's a degenerate case of course. 0, 0.5 and 1 all make about as much sense. Is 
there a strong convention elsewhere to use 0?

Min/max scaling is the wrong thing to do for a data set like this anyway. What 
you probably intend to do is scale each image so that its max intensity is 1 
and min intensity is 0, but that's different. Scaling each pixel across all 
images doesn't make as much sense.

On Mon, Nov 21, 2016 at 8:26 PM Joeri Hermans 
mailto:joeri.raymond.e.herm...@cern.ch>> wrote:
Hi all,

I observed some weird behaviour while applying some feature transformations 
using MinMaxScaler. More specifically, I was wondering if this behaviour is 
intended and makes sense? Especially because I explicitly defined min and max.

Basically, I am preprocessing the MNIST dataset, and thereby scaling the 
features between the ranges 0 and 1 using the following code:

# Clear the dataset in the case you ran this cell before.
dataset = dataset.select("features", "label", "label_encoded")
# Apply MinMax normalization to the features.
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", 
outputCol="features_normalized")
# Compute summary statistics and generate MinMaxScalerModel.
scaler_model = scaler.fit(dataset)
# Rescale each feature to range [min, max].
dataset = scaler_model.transform(dataset)

Complete code is here: 
https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb
 (Normalization section)

The original MNIST images are shown in original.png. Whereas the processed 
images are shown in processed.png. Note the 0.5 artifacts. I checked the source 
code of this particular estimator / transformer and found the following.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191

According to the documentation:

 * 
 *$$
 *Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + 
min
 *$$
 * 
 *
 * For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.

So basically, when the difference between E_{max} and E_{min} is 0, we assing 
0.5 as a raw value. I am wondering if this is helpful in any situation? Why not 
assign 0?



Kind regards,

Joeri
-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Memory leak warnings in Spark 2.0.1

2016-11-21 Thread Nicholas Chammas
I'm also curious about this. Is there something we can do to help
troubleshoot these leaks and file useful bug reports?

On Wed, Oct 12, 2016 at 4:33 PM vonnagy  wrote:

> I am getting excessive memory leak warnings when running multiple mapping
> and
> aggregations and using DataSets. Is there anything I should be looking for
> to resolve this or is this a known issue?
>
> WARN  [Executor task launch worker-0]
> org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
> org.apache.spark.unsafe.map.BytesToBytesMap@33fb6a15
> WARN  [Executor task launch worker-0]
> org.apache.spark.memory.TaskMemoryManager - leak a page:
> org.apache.spark.unsafe.memory.MemoryBlock@29e74a69 in task 88341
> WARN  [Executor task launch worker-0]
> org.apache.spark.memory.TaskMemoryManager - leak a page:
> org.apache.spark.unsafe.memory.MemoryBlock@22316bec in task 88341
> WARN  [Executor task launch worker-0] org.apache.spark.executor.Executor -
> Managed memory leak detected; size = 17039360 bytes, TID = 88341
>
> Thanks,
>
> Ivan
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-leak-warnings-in-Spark-2-0-1-tp19424.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Please limit commits for branch-2.1

2016-11-21 Thread Joseph Bradley
To committers and contributors active in MLlib,

Thanks to everyone who has started helping with the QA tasks in SPARK-18316!
I'd like to request that we stop committing non-critical changes to MLlib,
including the Python and R APIs, since still-changing public APIs make it
hard to QA. We have already started to sign off on some QA tasks, but we may
need to re-open them if changes are committed, especially if those changes
are to public APIs. There's no need to push Python and R wrappers into 2.1
at the last minute.

Let's focus on completing QA, after which we can resume committing API
changes to master (not branch-2.1).

Thanks everyone!
Joseph


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com


Re: [SPARK-16654][CORE][WIP] Add UI coverage for Application Level Blacklisting

2016-11-21 Thread Reynold Xin
You can submit a pull request against Imran's branch.

On Mon, Nov 21, 2016 at 7:33 PM Jose Soltren  wrote:

> Hi - I'm proposing a patch set for UI coverage of Application Level
> Blacklisting:
>
> https://github.com/jsoltren/spark/pull/1
>
> This patch set builds on top of Imran Rashid's pending pull request,
>
> [SPARK-8425][CORE] Application Level Blacklisting #14079
> https://github.com/apache/spark/pull/14079/commits
>
> The best way I could find to send this for review was to fork and
> clone apache/spark, pull PR 14079, branch, apply my UI changes, and
> issue a pull request of the UI changes into PR 14079. If there is a
> better way, forgive me, I would love to hear it.
>
> Attached is a screen shot of the updated UI.
>
> I would appreciate feedback on this WIP patch. I will issue a formal
> pull request once PR 14079 is merged.
>
> Cheers,
> --José
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: How to convert spark data-frame to datasets?

2016-11-21 Thread Sachith Withana
Hi Minudika,

To add to what Oscar said, this blog post [1] should clarify it for you.
Also, this should be posted on the user list, not the dev list.

[1]
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
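
For completeness, the conversion itself is a one-liner in Scala, and in Java
the same thing goes through a bean encoder. A minimal sketch (the Person
schema and input path are made up for illustration):

import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, age: Long)  // made-up schema for illustration

val spark = SparkSession.builder().appName("df-to-ds").getOrCreate()
import spark.implicits._

val df = spark.read.json("people.json")     // placeholder input with name/age columns
val ds: Dataset[Person] = df.as[Person]     // DataFrame (= Dataset[Row]) -> Dataset[Person]

// The Java equivalent is roughly: df.as(Encoders.bean(Person.class))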

Cheers,
Sachith

On Thu, Aug 18, 2016 at 8:43 PM, Oscar Batori  wrote:

> From the docs
> ,
> DataFrame is just Dataset[Row]. The are various converters for subtypes of
> Product if you want, using "as[T]", where T <: Product, or there is an
> implicit decoder in scope, I believe.
>
> Also, this is probably a user list question.
>
>
> On Thu, Aug 18, 2016 at 10:59 AM Minudika Malshan 
> wrote:
>
>> Hi all,
>>
>> Most of Spark ML algorithms requires a dataset to train the model.
>> I would like to know how to convert a spark *data-frame* to a *dataset*
>> using Java.
>> Your support is much appreciated.
>>
>> Thank you!
>> Minudika
>>
>


-- 
Sachith Withana
Software Engineer; WSO2 Inc.; http://wso2.com
E-mail: sachith AT wso2.com
M: +94715518127
Linked-In: https://lk.linkedin.com/in/sachithwithana


Re: How to convert spark data-frame to datasets?

2016-11-21 Thread Minudika Malshan
Hi,

Thanks, everyone, for the support. And sorry for my mistake in posting here
instead of the users list. :)

BR

On Tue, Nov 22, 2016 at 10:33 AM, Sachith Withana  wrote:

> Hi Minudika,
>
> To add to what Oscar said, this blog post [1] should clarify it for you.
> And this should be posted in the user-list not the dev.
>
> [1] https://databricks.com/blog/2016/07/14/a-tale-of-three-
> apache-spark-apis-rdds-dataframes-and-datasets.html
>
> Cheers,
> Sachith
>
> On Thu, Aug 18, 2016 at 8:43 PM, Oscar Batori 
> wrote:
>
>> From the docs
>> ,
>> DataFrame is just Dataset[Row]. The are various converters for subtypes of
>> Product if you want, using "as[T]", where T <: Product, or there is an
>> implicit decoder in scope, I believe.
>>
>> Also, this is probably a user list question.
>>
>>
>> On Thu, Aug 18, 2016 at 10:59 AM Minudika Malshan 
>> wrote:
>>
>>> Hi all,
>>>
>>> Most of Spark ML algorithms requires a dataset to train the model.
>>> I would like to know how to convert a spark *data-frame* to a *dataset*
>>> using Java.
>>> Your support is much appreciated.
>>>
>>> Thank you!
>>> Minudika
>>>
>>
>
>
> --
> Sachith Withana
> Software Engineer; WSO2 Inc.; http://wso2.com
> E-mail: sachith AT wso2.com
> M: +94715518127
> Linked-In: https://lk.linkedin.com/in/
> sachithwithana
>



-- 
*Minudika Malshan*
Undergraduate
Department of Computer Science and Engineering
University of Moratuwa
Sri Lanka.