destroyPythonWorker job in PySpark

2016-06-23 Thread Krishna
Hi,

I am running a PySpark app with thousands of cores (the partition count is a
small multiple of the core count) and the overall application performance is
fine. However, I noticed that at the end of the job, PySpark initiates its
clean-up procedure, and as part of this procedure PySpark executes a job
shown in the Web UI as "runJob at PythonRDD.scala:361" for each
executor/core. The pain point is that this step runs sequentially, and it
has become the bottleneck in our application. Even though each job takes
only 0.5 sec on average, it adds up when running with thousands of
executors.

Looking at the code for "destroyPythonWorker" in "SparkEnv.scala", is this
behavior the result of "stopWorker" being executed sequentially within the
foreach? Let me know if I'm missing something and what can be done to fix
the issue.

  private[spark]
  def destroyPythonWorker(pythonExec: String, envVars: Map[String, String],
      worker: Socket) {
    synchronized {
      val key = (pythonExec, envVars)
      pythonWorkers.get(key).foreach(_.stopWorker(worker))
    }
  }
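
For illustration of the timing math only (this is not Spark code; everything
below is a hypothetical sketch): if each stopWorker call takes ~0.5 sec, a
sequential foreach over N workers costs ~N * 0.5 sec, while overlapping the
calls would cost ~0.5 sec total, assuming the calls are independent and need
not serialize on a shared lock.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration._

  object ShutdownTiming extends App {
    // Stand-in for stopWorker(worker): ~0.5 sec per call, matching the
    // per-job time observed in the Web UI.
    def stopWorker(id: Int): Unit = Thread.sleep(500)

    val workers = (1 to 8).toList

    // Sequential shutdown, as in the foreach above: ~N * 0.5 sec total.
    workers.foreach(stopWorker)

    // Concurrent shutdown: ~0.5 sec total, assuming the calls are
    // independent and safe to issue outside the synchronized block.
    Await.result(Future.traverse(workers)(id => Future(stopWorker(id))), 1.minute)
  }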



Spark version: 1.5.0-cdh5.5.1

Thanks



Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Krishna Sankar
Can't find the "spark-assembly-2.0.0-hadoop2.7.0.jar" after compilation.
Usually it is in assembly/target/scala-2.11.
Has the packaging changed for 2.0.0?
Cheers


On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc4
> (e5f8c1117e0c48499f54d62b556bc693435afae0).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1192/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>
> Note: There was a mistake made during "rc3" preparation, and as a result
> there is no "rc3", but only "rc4".
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 26:27 min
 mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 2.0.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Classification : Decision Tree, Naive Bayes OK
2.4. Clustering : KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK
(see the sketch after this list)
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.4.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
[DataFrame operations very fast: from 11-12 secs to 3 secs, to 1.8 secs, to
1.5 secs! Good work!!!]
7.0. GraphX/Scala
7.1. Create Graph (small and bigger dataset) OK
7.2. Structure APIs - OK
7.3. Social Network/Community APIs - OK
7.4. Algorithms : PageRank of 2 datasets, aggregateMessages() - OK
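
A hedged Scala sketch of the grid-search evaluation mentioned in items 2.6
and 3.5 (the pyspark version uses itertools; this is a Scala analog). The
ratings RDD, the split, and the grid values are illustrative assumptions,
not the code actually used:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // assumes "ratings" is an RDD[Rating] parsed from the MovieLens data
  val Array(training, validation) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

  val evals =
    for (rank <- Seq(8, 12); numIter <- Seq(10, 20); lambda <- Seq(0.01, 0.1)) yield {
      val model = ALS.train(training, rank, numIter, lambda)
      val predictions = model
        .predict(validation.map(r => (r.user, r.product)))
        .map(p => ((p.user, p.product), p.rating))
      val ratesAndPreds = validation
        .map(r => ((r.user, r.product), r.rating))
        .join(predictions)
      val mse = ratesAndPreds.values.map { case (r, p) => (r - p) * (r - p) }.mean()
      ((rank, numIter, lambda), mse)
    }

  val best = evals.minBy(_._2)  // (rank, numIter, lambda) with the lowest MSE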

Cheers


On Thu, Jul 14, 2016 at 11:59 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Sunday, July 17, 2016 at 12:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc4
> (e5f8c1117e0c48499f54d62b556bc693435afae0).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1192/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc4-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>
> Note: There was a mistake made during "rc3" preparation, and as a result
> there is no "rc3", but only "rc4".
>
>


Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-20 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OS X 10.11.5 (El Capitan) OK Total time: 24:07 min
 mvn clean package -Pyarn -Phadoop-2.7 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 2.0.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Classification : Decision Tree, Naive Bayes OK
2.4. Clustering : KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.4.0; see the sketch after this list)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
[DataFrame operations very fast: from 11 secs to 3 secs, to 1.8 secs, to 1.5
secs! Good work!!!]
7.0. GraphX/Scala
7.1. Create Graph (small and bigger dataset) OK
7.2. Structure APIs - OK
7.3. Social Network/Community APIs - OK
7.4. Algorithms : PageRank of 2 datasets, aggregateMessages() - OK
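
A minimal sketch of the spark-csv read/write check in item 5.1, assuming a
spark-shell session launched with the --packages coordinate above (matched
to the build's Scala version); the file names and options are illustrative:

  // spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("cars.csv")

  df.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("cars-out")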

Cheers


On Tue, Jul 19, 2016 at 7:35 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0. The vote is open until Friday, July 22, 2016 at 20:00 PDT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.0-rc5
> (13650fc58e1fcf2cf2a26ba11c819185ae1acc1f).
>
> This release candidate resolves ~2500 issues:
> https://s.apache.org/spark-2.0.0-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1195/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/
>
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.x.
>
> ==
> What justifies a -1 vote for this release?
> ==
> Critical bugs impacting major functionalities.
>
> Bugs already present in 1.x, missing features, or bugs related to new
> features will not necessarily block this release. Note that historically
> Spark documentation has been published on the website separately from the
> main release so we do not need to block the release due to documentation
> errors either.
>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Krishna Sankar
I do run both Python and Scala, but via iPython/Python2 with my own test
code, not the tests from the distribution.
Cheers


On Mon, Sep 26, 2016 at 11:59 AM, Holden Karau  wrote:

> I'm seeing some test failures with Python 3 that could definitely be
> environmental (going to rebuild my virtual env and double check), I'm just
> wondering if other people are also running the Python tests on this release
> or if everyone is focused on the Scala tests?
>
> On Mon, Sep 26, 2016 at 11:48 AM, Maciej Bryński 
> wrote:
>
>> +1
>> At last :)
>>
>> 2016-09-26 19:56 GMT+02:00 Sameer Agarwal :
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Sep 26, 2016 at 9:54 AM, Davies Liu 
>>> wrote:
>>>
 +1 (non-binding)

 On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley 
 wrote:
 > +1
 >
 > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee 
 wrote:
 >>
 >> +1 (non-binding)
 >> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang  wrote:
 >>>
 >>> +1
 >>>
 >>> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu
 >>>  wrote:
 
  +1
 
  On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee 
  wrote:
 >
 > +1
 >
 >
 > On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier
 >  wrote:
 >>
 >> +1 (non-binding)
 >>
 >> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida
 >>  wrote:
 >>>
 >>> +1 (non-binding)
 >>>
 >>> Built and tested on
 >>> - Ubuntu 16.04 / OpenJDK 1.8.0_91
 >>> - CentOS / Oracle Java 1.7.0_55
 >>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver
 >>> -Pyarn)
 >>>
 >>>
 >>> On 25 September 2016 at 22:35, Matei Zaharia
 >>>  wrote:
 
  +1
 
  Matei
 
  On Sep 25, 2016, at 1:25 PM, Josh Rosen <
 joshro...@databricks.com>
  wrote:
 
  +1
 
  On Sun, Sep 25, 2016 at 1:16 PM Yin Huai >>> >
  wrote:
 >
 > +1
 >
 > On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun
 >  wrote:
 >>
 >> +1 (non binding)
 >>
 >> RC3 is compiled and tested on the following two systems, too. All
 >> tests passed.
 >>
 >> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
 >>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
 >> -Phive-thriftserver -Dsparkr
 >> * CentOS 7.2 / Open JDK 1.8.0_102
 >>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
 >> -Phive-thriftserver
 >>
 >> Cheers,
 >> Dongjoon
 >>
 >>
 >>
 >> On Saturday, September 24, 2016, Reynold Xin <
 r...@databricks.com>
 >> wrote:
 >>>
 >>> Please vote on releasing the following candidate as Apache Spark
 >>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and
 >>> passes if a majority of at least 3 +1 PMC votes are cast.
 >>>
 >>> [ ] +1 Release this package as Apache Spark 2.0.1
 >>> [ ] -1 Do not release this package because ...
 >>>
 >>>
 >>> The tag to be voted on is v2.0.1-rc3
 >>> (9d28cc10357a8afcfb2fa2e6eecb5c2cc2730d17)
 >>>
 >>> This release candidate resolves 290 issues:
 >>> https://s.apache.org/spark-2.0.1-jira
 >>>
 >>> The release files, including signatures, digests, etc. can be
 >>> found at:
 >>>
 >>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-bin/
 >>>
 >>> Release artifacts are signed with the following key:
 >>> https://people.apache.org/keys/committer/pwendell.asc
 >>>
 >>> The staging repository for this release can be found at:
 >>>
 >>> https://repository.apache.org/content/repositories/orgapachespark-1201/
 >>>
 >>> The documentation corresponding to this release can be found at:
 >>>
 >>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc3-docs/
 >>>
 >>>
 >>> Q: How can I help test this release?
 >>> A: If you are a Spark user, you can help us test this release by
 >>> taking an existing Spark workload and running on this release candidate,
 >>> then reporting any regressions from 2.0.0.
 >>>
 >>> Q: What justifies a -1 vote for this release?
 >>> A: This is a maintenance release i

Contributing to PySpark

2016-10-18 Thread Krishna Kalyan
Hello,
I am a masters student. Could someone please let me know how to set up my
dev working environment to contribute to pyspark.
Questions I had were:
a) Should I use IntelliJ IDEA or PyCharm?
b) How do I test my changes?

Regards,
Krishna


Running Unit Tests in pyspark failure

2016-11-03 Thread Krishna Kalyan
Hello,
I am trying to run the unit tests on pyspark.

When I try to run the unit tests, I am faced with errors.
krishna@Krishna:~/Experiment/spark$ ./python/run-tests
Running PySpark tests. Output is in
/Users/krishna/Experiment/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.6']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Please install unittest2 to test with Python 2.6 or earlier
Had test failures in pyspark.sql.tests with python2.6; see logs.

and when I try to install unittest2, it says the requirement is already
satisfied.

krishna@Krishna:~/Experiment/spark$ sudo pip install --upgrade unittest2
Password:
Requirement already up-to-date: unittest2 in /usr/local/lib/python2.7/site-packages
Requirement already up-to-date: argparse in /usr/local/lib/python2.7/site-packages (from unittest2)
Requirement already up-to-date: six>=1.4 in /usr/local/lib/python2.7/site-packages (from unittest2)
Requirement already up-to-date: traceback2 in /usr/local/lib/python2.7/site-packages (from unittest2)
Requirement already up-to-date: linecache2 in /usr/local/lib/python2.7/site-packages (from traceback2->unittest2)

Help!

Thanks,
Krishna


Re: Running Unit Tests in pyspark failure

2016-11-03 Thread Krishna Kalyan
I could resolve this by passing the argument below
 ./python/run-tests --python-executables=python2.7

Thanks,
Krishna

On Thu, Nov 3, 2016 at 4:16 PM, Krishna Kalyan 
wrote:

> Hello,
> I am trying to run the unit tests on pyspark.
>
> When I try to run the unit tests, I am faced with errors.
> krishna@Krishna:~/Experiment/spark$ ./python/run-tests
> Running PySpark tests. Output is in
> /Users/krishna/Experiment/spark/python/unit-tests.log
> Will test against the following Python executables: ['python2.6']
> Will test the following Python modules: ['pyspark-core', 'pyspark-ml',
> 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
> Please install unittest2 to test with Python 2.6 or earlier
> Had test failures in pyspark.sql.tests with python2.6; see logs.
>
> and when I try to install unittest2, it says the requirement is already
> satisfied.
>
> krishna@Krishna:~/Experiment/spark$ sudo pip install --upgrade unittest2
> Password:
> Requirement already up-to-date: unittest2 in /usr/local/lib/python2.7/site-packages
> Requirement already up-to-date: argparse in /usr/local/lib/python2.7/site-packages (from unittest2)
> Requirement already up-to-date: six>=1.4 in /usr/local/lib/python2.7/site-packages (from unittest2)
> Requirement already up-to-date: traceback2 in /usr/local/lib/python2.7/site-packages (from unittest2)
> Requirement already up-to-date: linecache2 in /usr/local/lib/python2.7/site-packages (from traceback2->unittest2)
>
> Help!
>
> Thanks,
> Krishna
>
>
>
>
>


Contributing to Spark in GSoC 2017

2016-11-09 Thread Krishna Kalyan
Hello,
I am Krishna, currently a 2nd-year Masters student (MSc. in Data Mining)
in Barcelona, studying at the Universitat Politècnica de Catalunya.
I know it's a little early for GSoC, but I wanted to get a head start
working with the Spark community.
Is there anyone who would be mentoring GSoC 2017?
Could anyone please guide me on how to go about it?

Related Experience:
My masters is mostly focused on data mining and machine learning
techniques. Before my masters, I was a data engineer with IBM (India). I
was responsible for managing a 50-node Hadoop cluster for more than a year.
Most of my time was spent optimising and writing ETL (Apache Pig) jobs. Our
daily batch job aggregated more than 30 GB of CDRs + weblogs in our cluster.

I am most comfortable with Python and R. (Not a Scala expert, but I am sure
I can pick it up quickly.)

My CV can be viewed by following the link below.
(https://github.com/krishnakalyan3/Resume/raw/master/Resume.pdf)

My Spark Pull Requests
(
https://github.com/apache/spark/pulls?utf8=%E2%9C%93&q=is%3Apr%20author%3Akrishnakalyan3%20
)

Thank you so much,
Krishna


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Krishna Sankar
+1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package 10:49 min
2. Tested pyspark, mllib
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
2.5. rdd operations OK
2.6. recommendation OK
2.7. Good work! In 1.1.0, there was an error and my program used to hang
(over memory allocation) consistently when running validation using
itertools to compute optimum rank, lambda, numIterations/RMSE; data -
movielens medium dataset (1 million records). It works well in 1.1.1!
Cheers

P.S: Missed Reply all, first time

On Wed, Nov 12, 2014 at 8:35 PM, Andrew Or  wrote:

> I will start the vote with a +1
>
> 2014-11-12 20:34 GMT-08:00 Andrew Or :
>
> > Please vote on releasing the following candidate as Apache Spark version 1.1.1.
> >
> > This release fixes a number of bugs in Spark 1.1.0. Some of the notable
> > ones are
> > - [SPARK-3426] Sort-based shuffle compression settings are incompatible
> > - [SPARK-3948] Stream corruption issues in sort-based shuffle
> > - [SPARK-4107] Incorrect handling of Channel.read() led to data
> truncation
> > The full list is at http://s.apache.org/z9h and in the CHANGES.txt
> > attached.
> >
> > The tag to be voted on is v1.1.1-rc1 (commit 72a4fdbe):
> > http://s.apache.org/cZC
> >
> > The release files, including signatures, digests, etc can be found at:
> > http://people.apache.org/~andrewor14/spark-1.1.1-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/andrewor14.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1034/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~andrewor14/spark-1.1.1-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.1.1!
> >
> > The vote is open until Sunday, November 16, at 04:30 UTC and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> > [ ] +1 Release this package as Apache Spark 1.1.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > Cheers,
> > Andrew
> >
>


Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-19 Thread Krishna Sankar
+1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package 10:49 min
2. Tested pyspark, mllib
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
2.5. rdd operations OK
2.6. recommendation OK
2.7. Good work! In 1.1.0, there was an error and my program used to hang
(over memory allocation) consistently when running validation using
itertools to compute optimum rank, lambda, numIterations/RMSE; data -
movielens medium dataset (1 million records). It works well in 1.1.1!

Cheers


On Wed, Nov 19, 2014 at 6:00 PM, Xiangrui Meng  wrote:

> +1. Checked version numbers and doc. Tested a few ML examples with
> Java 6 and verified some recently merged bug fixes. -Xiangrui
>
> On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or  wrote:
> > I will start with a +1
> >
> > 2014-11-19 14:51 GMT-08:00 Andrew Or :
> >
> >> Please vote on releasing the following candidate as Apache Spark version 1.1.1.
> >>
> >> This release fixes a number of bugs in Spark 1.1.0. Some of the notable
> >> ones are
> >> - [SPARK-3426] Sort-based shuffle compression settings are incompatible
> >> - [SPARK-3948] Stream corruption issues in sort-based shuffle
> >> - [SPARK-4107] Incorrect handling of Channel.read() led to data
> truncation
> >> The full list is at http://s.apache.org/z9h and in the CHANGES.txt
> >> attached.
> >>
> >> Additionally, this candidate fixes two blockers from the previous RC:
> >> - [SPARK-4434] Cluster mode jar URLs are broken
> >> - [SPARK-4480][SPARK-4467] Too many open files exception from shuffle
> >> spills
> >>
> >> The tag to be voted on is v1.1.1-rc2 (commit 3693ae5d):
> >> http://s.apache.org/p8
> >>
> >> The release files, including signatures, digests, etc can be found at:
> >> http://people.apache.org/~andrewor14/spark-1.1.1-rc2/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/andrewor14.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1043/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~andrewor14/spark-1.1.1-rc2-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 1.1.1!
> >>
> >> The vote is open until Saturday, November 22, at 23:00 UTC and passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >> [ ] +1 Release this package as Apache Spark 1.1.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >>
> >> Cheers,
> >> Andrew
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-28 Thread Krishna Sankar
Looks like the documentation hasn't caught up with the new features on
the machine learning side, for example org.apache.spark.ml,
RandomForest, gbtree, and so forth. Is a refresh of the documentation
planned?
Am happy to see these capabilities, but they would need good explanations
as well, especially the new thinking around the ml package: pipelines,
transformations et al.
IMHO, the documentation is a -1.
Will check out the compilation, mllib et al.

Cheers


On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.2.0!
>
> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1048/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.0!
>
> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What justifies a -1 vote for this release? ==
> This vote is happening very late into the QA period compared with
> previous votes, so -1 votes should only occur for significant
> regressions from 1.0.2. Bugs already present in 1.1.X, minor
> regressions, or bugs related to new features will not block this
> release.
>
> == What default changes should I be aware of? ==
> 1. The default value of "spark.shuffle.blockTransferService" has been
> changed to "netty"
> --> Old behavior can be restored by switching to "nio"
>
> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
> --> Old behavior can be restored by setting "spark.shuffle.manager" to
> "hash".
>
> == Other notes ==
> Because this vote is occurring over a weekend, I will likely extend
> the vote if this RC survives until the end of the vote period.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
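
To make the default changes quoted above concrete: a minimal sketch,
assuming driver-side configuration via SparkConf (the property names and
values are exactly those in the vote e-mail; the same settings can also be
passed with --conf on spark-submit):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.shuffle.blockTransferService", "nio")  // restore pre-1.2 default; 1.2 switches to "netty"
    .set("spark.shuffle.manager", "hash")              // restore pre-1.2 default; 1.2 switches to "sort"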
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-29 Thread Krishna Sankar
+1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package 16:46 min (slightly slower
connection)
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
   Slight difference in the print method (vs. 1.1.x) of the model
object - with a label & more details. This is good.
2.3. Decision Tree, Naive Bayes OK
   Changes in print(model) - now print(model.toDebugString()) - OK
   Some changes in NaiveBayes; different from my 1.1.x code - had to
flatten list structures, and zip required the same number of elements in
each partition. After code changes, it ran fine.
2.4. KMeans OK
   zip occasionally fails with error "localhost):
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition"
Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared?
Made it work by doing a different transformation, i.e., reusing an original
rdd (see the sketch after this list).
2.5. rdd operations OK
   State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. recommendation OK
2.7. Good work! In 1.x.x, I had a map distinct over the movielens medium
dataset which never worked. Works fine in 1.2.0!
3. Scala MLlib - subset of examples as in #2 above, with Scala
3.1. statistics OK
3.2. Linear Regression OK
3.3. Decision Tree OK
3.4. KMeans OK
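
The zip workaround mentioned in item 2.4, as a minimal sketch; assumes a
spark-shell session (sc defined) and illustrative data:

  // zip requires both RDDs to have the same partition count and the same
  // number of elements per partition, or it throws the SparkException above.
  val base = sc.parallelize(1 to 1000, 4)

  // Risky: filter changes per-partition counts, so zipping the result
  // against the parent can fail at runtime:
  //   base.filter(_ % 2 == 0).zip(base)

  // Safe ("reusing an original rdd"): both sides are element-wise maps over
  // the same parent, so the counts line up by construction.
  val safe = base.map(_ * 2).zip(base.map(_ + 1))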
Cheers

P.S: Plan to add RF and .ml mechanics to this bank

On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.2.0!
>
> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1048/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.0!
>
> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What justifies a -1 vote for this release? ==
> This vote is happening very late into the QA period compared with
> previous votes, so -1 votes should only occur for significant
> regressions from 1.0.2. Bugs already present in 1.1.X, minor
> regressions, or bugs related to new features will not block this
> release.
>
> == What default changes should I be aware of? ==
> 1. The default value of "spark.shuffle.blockTransferService" has been
> changed to "netty"
> --> Old behavior can be restored by switching to "nio"
>
> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
> --> Old behavior can be restored by setting "spark.shuffle.manager" to
> "hash".
>
> == Other notes ==
> Because this vote is occurring over a weekend, I will likely extend
> the vote if this RC survives until the end of the vote period.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-04 Thread Krishna Sankar
Will do. Am on the road - will annotate an iPython notebook with what works
& what didn't work ...
Cheers


On Wed, Dec 3, 2014 at 4:19 PM, Xiangrui Meng  wrote:

> Krishna, could you send me some code snippets for the issues you saw
> in naive Bayes and k-means? -Xiangrui
>
> On Sun, Nov 30, 2014 at 6:49 AM, Krishna Sankar 
> wrote:
> > +1
> > 1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
> > -Dhadoop.version=2.4.0 -DskipTests clean package 16:46 min (slightly
> slower
> > connection)
> > 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x
> > 2.1. statistics OK
> > 2.2. Linear/Ridge/Lasso Regression OK
> >    Slight difference in the print method (vs. 1.1.x) of the model
> > object - with a label & more details. This is good.
> > 2.3. Decision Tree, Naive Bayes OK
> >    Changes in print(model) - now print(model.toDebugString()) - OK
> >    Some changes in NaiveBayes; different from my 1.1.x code - had to
> > flatten list structures, and zip required the same number of elements in
> > each partition. After code changes, it ran fine.
> > 2.4. KMeans OK
> >    zip occasionally fails with error "localhost):
> > org.apache.spark.SparkException: Can only zip RDDs with same number of
> > elements in each partition"
> > Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared?
> > Made it work by doing a different transformation, i.e., reusing an original
> > rdd.
> > 2.5. rdd operations OK
> >    State of the Union Texts - MapReduce, Filter, sortByKey (word count)
> > 2.6. recommendation OK
> > 2.7. Good work! In 1.x.x, I had a map distinct over the movielens medium
> > dataset which never worked. Works fine in 1.2.0!
> > 3. Scala MLlib - subset of examples as in #2 above, with Scala
> > 3.1. statistics OK
> > 3.2. Linear Regression OK
> > 3.3. Decision Tree OK
> > 3.4. KMeans OK
> > Cheers
> > 
> > P.S: Plan to add RF and .ml mechanics to this bank
> >
> > On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell 
> wrote:
> >
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 1.2.0!
> >>
> >> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1048/
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 1.2.0!
> >>
> >> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
> >> if a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 1.2.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >>
> >> == What justifies a -1 vote for this release? ==
> >> This vote is happening very late into the QA period compared with
> >> previous votes, so -1 votes should only occur for significant
> >> regressions from 1.0.2. Bugs already present in 1.1.X, minor
> >> regressions, or bugs related to new features will not block this
> >> release.
> >>
> >> == What default changes should I be aware of? ==
> >> 1. The default value of "spark.shuffle.blockTransferService" has been
> >> changed to "netty"
> >> --> Old behavior can be restored by switching to "nio"
> >>
> >> 2. The default value of "spark.shuffle.manager" has been changed to
> "sort".
> >> --> Old behavior can be restored by setting "spark.shuffle.manager" to
> >> "hash".
> >>
> >> == Other notes ==
> >> Because this vote is occurring over a weekend, I will likely extend
> >> the vote if this RC survives until the end of the vote period.
> >>
> >> - Patrick
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >>
>


Fwd: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-17 Thread Krishna Sankar
Forgot Reply To All ;o(
-- Forwarded message --
From: Krishna Sankar 
Date: Wed, Dec 10, 2014 at 9:16 PM
Subject: Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
To: Matei Zaharia 

+1
Works same as RC1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -DskipTests clean package 13:07 min
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
   Slight difference in the print method (vs. 1.1.x) of the model
object - with a label & more details. This is good.
2.3. Decision Tree, Naive Bayes OK
   Changes in print(model) - now print(model.toDebugString()) - OK
   Some changes in NaiveBayes; different from my 1.1.x code - had to
flatten list structures, and zip required the same number of elements in
each partition. After code changes, it ran fine.
2.4. KMeans OK
   Center And Scale OK
   zip occasionally fails with error "localhost):
org.apache.spark.SparkException: Can only zip RDDs with same number of
elements in each partition"
Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared?
Made it work by doing a different transformation, i.e., reusing an original
rdd.
(Xiangrui, I will send you the iPython Notebook & the dataset by a separate
e-mail)
2.5. rdd operations OK
   State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. recommendation OK
2.7. Good work! In 1.x.x, I had a map distinct over the movielens medium
dataset which never worked. Works fine in 1.2.0!
3. Scala MLlib - subset of examples as in #2 above, with Scala
3.1. statistics OK
3.2. Linear Regression OK
3.3. Decision Tree OK
3.4. KMeans OK
Cheers


On Wed, Dec 10, 2014 at 3:05 PM, Matei Zaharia 
wrote:

> +1
>
> Tested on Mac OS X.
>
> Matei
>
> > On Dec 10, 2014, at 1:08 PM, Patrick Wendell  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 1.2.0!
> >
> > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1055/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.2.0!
> >
> > The vote is open until Saturday, December 13, at 21:00 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == What justifies a -1 vote for this release? ==
> > This vote is happening relatively late into the QA period, so
> > -1 votes should only occur for significant regressions from
> > 1.0.2. Bugs already present in 1.1.X, minor
> > regressions, or bugs related to new features will not block this
> > release.
> >
> > == What default changes should I be aware of? ==
> > 1. The default value of "spark.shuffle.blockTransferService" has been
> > changed to "netty"
> > --> Old behavior can be restored by switching to "nio"
> >
> > 2. The default value of "spark.shuffle.manager" has been changed to
> "sort".
> > --> Old behavior can be restored by setting "spark.shuffle.manager" to
> "hash".
> >
> > == How does this differ from RC1 ==
> > This has fixes for a handful of issues identified - some of the
> > notable fixes are:
> >
> > [Core]
> > SPARK-4498: Standalone Master can fail to recognize completed/failed
> > applications
> >
> > [SQL]
> > SPARK-4552: Query for empty parquet table in spark sql hive get
> > IllegalArgumentException
> > SPARK-4753: Parquet2 does not prune based on OR filters on partition
> columns
> > SPARK-4761: With JDBC server, set Kryo as default serializer and
> > disable reference tracking
> > SPARK-4785: When called with arguments referring column fields, PMOD
> throws NPE
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Krishna Sankar
+1
1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.0
2.1. statistics OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed: org.apache.spark.SparkException in zip!
2.5. rdd operations OK
   State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. recommendation OK

Cheers


On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1061/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until Friday, January 30, at 07:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:22 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0
-Phive -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed: org.apache.spark.SparkException in zip!
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK

Cheers


On Wed, Jan 28, 2015 at 5:17 AM, Sean Owen  wrote:

> +1 (nonbinding). I verified that all the hash / signing items I
> mentioned before are resolved.
>
> The source package compiles on Ubuntu / Java 8. I ran tests and they
> passed. Well, actually I see the same failure I've been seeing locally on
> OS X and on Ubuntu for a while, but I think nobody else has seen this?
>
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
>   org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in
> progress
>   at
> org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
>
> Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
> is something perhaps related to my env that I haven't figured out yet,
> so should not be considered a blocker.
>
> On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> 1.2.1!
> >
> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1062/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
> >
> > Changes from rc1:
> > This has no code changes from RC1. Only minor changes to the release
> script.
> >
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >
> > The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 11:13 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed: org.apache.spark.SparkException in zip!
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK (see the sketch after
this list)
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
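
A minimal sketch of the statistics checks in item 3.1 (min, max, mean,
Pearson, Spearman), assuming a spark-shell session (sc defined) and
illustrative data:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.Statistics

  val obs = sc.parallelize(Seq(
    Vectors.dense(1.0, 10.0),
    Vectors.dense(2.0, 22.0),
    Vectors.dense(3.0, 31.0)))

  val summary = Statistics.colStats(obs)  // per-column min, max, mean, ...
  println(summary.min + " " + summary.max + " " + summary.mean)

  val pearson = Statistics.corr(obs, "pearson")    // correlation matrices
  val spearman = Statistics.corr(obs, "spearman")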

Cheers



On Mon, Feb 2, 2015 at 8:57 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc3/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1065/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
>
> Changes from rc2:
> A single patch fixing a windows issue.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until Friday, February 06, at 05:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 14:50 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.x
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK

But MSE has increased from 40.81 to 105.86. Has some refactoring happened
on SGD/Linear Models? Or do we have some extra parameters, or a change
of defaults?

2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   WSSSE has come down slightly
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK

Cheers

P.S: For some reason, replacing "import sqlContext.createSchemaRDD" with
"import sqlContext.implicits._" doesn't do the implicit conversions.
registerTempTable gives a syntax error. I will dig deeper tomorrow. Has
anyone seen this?

On Wed, Feb 18, 2015 at 3:25 PM, Sean Owen  wrote:

> On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell 
> wrote:
> >> Patrick this link gives a 404:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >
> > Works for me. Maybe it's some ephemeral issue?
>
> Yes works now; I swear it didn't before! that's all set now. The
> signing key is in that file.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Krishna Sankar
Excellent. Explicit toDF() works.
a) employees.toDF().registerTempTable("Employees") - works
b) Also affects saveAsParquetFile - orders.toDF().saveAsParquetFile

Adding to my earlier tests:
4.0 SQL from Scala and Python
4.1 result = sqlContext.sql("SELECT * from Employees WHERE State = 'WA'") OK
4.2 result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.3 result = sqlContext.sql("SELECT ShipCountry, Sum(OrderDetails.UnitPrice
* Qty * Discount) AS ProductSales FROM Orders INNER JOIN OrderDetails ON
Orders.OrderID = OrderDetails.OrderID GROUP BY ShipCountry") OK
4.4 saveAsParquetFile OK
4.5 Read and verify the 4.4 save - sqlContext.parquetFile,
registerTempTable, sql OK

Cheers & thanks Michael




On Thu, Feb 19, 2015 at 12:02 PM, Michael Armbrust 
wrote:

> P.S: For some reason, replacing "import sqlContext.createSchemaRDD" with
>> "import sqlContext.implicits._" doesn't do the implicit conversions.
>> registerTempTable gives a syntax error. I will dig deeper tomorrow. Has
>> anyone seen this?
>
>
> We will write up a whole migration guide before the final release, but I
> can quickly explain this one.  We made the implicit conversion
> significantly less broad to avoid the chance of confusing conflicts.
> However, now you have to call .toDF in order to force RDDs to become
> DataFrames.
>
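
A minimal sketch of the 1.2 -> 1.3 change discussed above, assuming a
spark-shell 1.3 session (sc and sqlContext predefined); the Employee case
class and the sample data are illustrative:

  // Spark 1.2.x relied on: import sqlContext.createSchemaRDD
  // Spark 1.3.0 narrows the implicits, so call .toDF() explicitly.
  case class Employee(name: String, state: String)

  import sqlContext.implicits._

  val employees = sc.parallelize(Seq(Employee("ann", "WA"), Employee("bob", "CA")))
  val df = employees.toDF()
  df.registerTempTable("Employees")
  sqlContext.sql("SELECT * FROM Employees WHERE state = 'WA'").show()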


Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-03 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:53 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.x
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
But MSE has increased from 40.81 to 105.86. Has some refactoring happened
on SGD/Linear Models? Or do we have some extra parameters, or a change of
defaults?
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   WSSSE has come down slightly
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
4.0. SQL from Python
4.1. result = sqlContext.sql("SELECT * from Employees WHERE State = 'WA'")
OK

Cheers


On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.3.0!
>
> The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> Staging repositories for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1074/
> (published with version '1.3.0')
> https://repository.apache.org/content/repositories/orgapachespark-1075/
> (published with version '1.3.0-rc2')
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0!
>
> The vote is open until Saturday, March 07, at 04:17 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How does this compare to RC1 ==
> This patch includes a variety of bug fixes found in RC1.
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.2 workload and running on this release candidate,
> then reporting any regressions.
>
> If you are happy with this release based on your own testing, give a +1
> vote.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period,
> so -1 votes should only occur for significant regressions from 1.2.1.
> Bugs already present in 1.2.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-04 Thread Krishna Sankar
It is the LR over car-data at https://github.com/xsankar/cloaked-ironman.
1.2.0 gives Mean Squared Error = 40.8130551358
1.3.0 gives Mean Squared Error = 105.857603953
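
For context, a hedged sketch of the kind of MSE computation being compared
(the actual code is in the cloaked-ironman repo linked above; the file name,
parsing, and iteration count here are assumptions):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  val parsed = sc.textFile("car-data.csv").map { line =>
    val cols = line.split(',').map(_.toDouble)
    LabeledPoint(cols.head, Vectors.dense(cols.tail))
  }
  val model = LinearRegressionWithSGD.train(parsed, 100)  // numIterations
  val mse = parsed
    .map { p => val e = model.predict(p.features) - p.label; e * e }
    .mean()
  println("Mean Squared Error = " + mse)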

I will verify it one more time tomorrow.

Cheers


On Tue, Mar 3, 2015 at 11:28 PM, Xiangrui Meng  wrote:

> On Tue, Mar 3, 2015 at 11:15 PM, Krishna Sankar 
> wrote:
> > +1 (non-binding, of course)
> >
> > 1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:53 min
> >  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
> > -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
> > 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
> > 1.2.x
> > 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> > 2.2. Linear/Ridge/Lasso Regression OK
> > But MSE has increased from 40.81 to 105.86. Has some refactoring happened
> > on SGD/Linear Models? Or do we have some extra parameters, or a change of
> > defaults?
>
> Could you share the code you used? I don't remember any changes in
> linear regression. Thanks! -Xiangrui
>
> > 2.3. Decision Tree, Naive Bayes OK
> > 2.4. KMeans OK
> >Center And Scale OK
> >WSSSE has come down slightly
> > 2.5. rdd operations OK
> >   State of the Union Texts - MapReduce, Filter, sortByKey (word count)
> > 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >    Model evaluation/optimization (rank, numIter, lambda) with itertools OK
> > 3. Scala - MLlib
> > 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> > 3.2. LinearRegressionWithSGD OK
> > 3.3. Decision Tree OK
> > 3.4. KMeans OK
> > 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> > 4.0. SQL from Python
> > 4.1. result = sqlContext.sql("SELECT * from Employees WHERE State =
> 'WA'")
> > OK
> >
> > Cheers
> > 
> >
> > On Tue, Mar 3, 2015 at 8:19 PM, Patrick Wendell 
> wrote:
> >
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 1.3.0!
> >>
> >> The tag to be voted on is v1.3.0-rc2 (commit 3af2687):
> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3af26870e5163438868c4eb2df88380a533bb232
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~pwendell/spark-1.3.0-rc2/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> Staging repositories for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1074/
> >> (published with version '1.3.0')
> >> https://repository.apache.org/content/repositories/orgapachespark-1075/
> >> (published with version '1.3.0-rc2')
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-1.3.0-rc2-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 1.3.0!
> >>
> >> The vote is open until Saturday, March 07, at 04:17 UTC and passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 1.3.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >>
> >> == How does this compare to RC1 ==
> >> This patch includes a variety of bug fixes found in RC1.
> >>
> >> == How can I help test this release? ==
> >> If you are a Spark user, you can help us test this release by
> >> taking a Spark 1.2 workload and running on this release candidate,
> >> then reporting any regressions.
> >>
> >> If you are happy with this release based on your own testing, give a +1
> >> vote.
> >>
> >> == What justifies a -1 vote for this release? ==
> >> This vote is happening towards the end of the 1.3 QA period,
> >> so -1 votes should only occur for significant regressions from 1.2.1.
> >> Bugs already present in 1.2.X, minor regressions, or bugs related
> >> to new features will not block this release.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
> >>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-06 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:55 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x &
1.2.x
   pyspark works well with the new IPython 3.0.0 release
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
 Note: But MSE has increased from 40.81 (1.2.x) to 105.86 (1.3.0).
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Note: WSSSE has come down slightly
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK (see the sketch after this list)
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0  Good work on introducing DataFrames. Didn’t test DataFrames. Will add
test cases for next release.
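
For reference, the item-2.6 model evaluation above is roughly the following
kind of sweep (a minimal pyspark sketch; "ratings" is a hypothetical RDD of
(user, movie, rating) tuples parsed from the MovieLens data, and the grid
values are illustrative):

from itertools import product
from math import sqrt
from pyspark.mllib.recommendation import ALS

training, validation = ratings.randomSplit([0.8, 0.2], seed=42)
truth = validation.map(lambda r: ((r[0], r[1]), r[2]))

def rmse(model):
    # predictAll takes (user, product) pairs and returns Rating objects
    preds = model.predictAll(validation.map(lambda r: (r[0], r[1]))) \
                 .map(lambda p: ((p.user, p.product), p.rating))
    se = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2)
    return sqrt(se.mean())

# grid-search rank, numIter, lambda with itertools.product
best = min((rmse(ALS.train(training, rank, numIter, lmbda)),
            rank, numIter, lmbda)
           for rank, numIter, lmbda in product([8, 12], [10, 20], [0.1, 1.0]))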

Cheers


On Thu, Mar 5, 2015 at 6:52 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.3.0!
>
> The tag to be voted on is v1.3.0-rc2 (commit 4aaf48d4):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4aaf48d46d13129f0f9bdafd771dd80fe568a7dc
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc3/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> Staging repositories for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1078
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc3-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0!
>
> The vote is open until Monday, March 09, at 02:52 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How does this compare to RC2 ==
> This release includes the following bug fixes:
>
> https://issues.apache.org/jira/browse/SPARK-6144
> https://issues.apache.org/jira/browse/SPARK-6171
> https://issues.apache.org/jira/browse/SPARK-5143
> https://issues.apache.org/jira/browse/SPARK-6182
> https://issues.apache.org/jira/browse/SPARK-6175
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.2 workload and running on this release candidate,
> then reporting any regressions.
>
> If you are happy with this release based on your own testing, give a +1
> vote.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period,
> so -1 votes should only occur for significant regressions from 1.2.1.
> Bugs already present in 1.2.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-08 Thread Krishna Sankar
Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
Distributions X ...

Maybe one option is to have a minimum basic set (which I know is what we
are discussing) and move the rest to spark-packages.org. There the vendors
can add the latest downloads - for example when 1.4 is released, HDP can
build a release of HDP Spark 1.4 bundle.

Cheers


On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell  wrote:

> We probably want to revisit the way we do binaries in general for
> 1.4+. IMO, something worth forking a separate thread for.
>
> I've been hesitating to add new binaries because people
> (understandably) complain if you ever stop packaging older ones, but
> on the other hand the ASF has complained that we have too many
> binaries already and that we need to pare it down because of the large
> volume of files. Doubling the number of binaries we produce for Scala
> 2.11 seemed like it would be too much.
>
> One solution potentially is to actually package "Hadoop provided"
> binaries and encourage users to use these by simply setting
> HADOOP_HOME, or have instructions for specific distros. I've heard
> that our existing packages don't work well on HDP for instance, since
> there are some configuration quirks that differ from the upstream
> Hadoop.
>
> If we cut down on the cross building for Hadoop versions, then it is
> more tenable to cross build for Scala versions without exploding the
> number of binaries.
>
> - Patrick
>
> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen  wrote:
> > Yeah, interesting question of what is the better default for the
> > single set of artifacts published to Maven. I think there's an
> > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
> > and cons discussed more at
> >
> > https://issues.apache.org/jira/browse/SPARK-5134
> > https://github.com/apache/spark/pull/3917
> >
> > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia 
> wrote:
> >> +1
> >>
> >> Tested it on Mac OS X.
> >>
> >> One small issue I noticed is that the Scala 2.11 build is using Hadoop
> 1 without Hive, which is kind of weird because people will more likely want
> Hadoop 2 with Hive. So it would be good to publish a build for that
> configuration instead. We can do it if we do a new RC, or it might be that
> binary builds may not need to be voted on (I forgot the details there).
> >>
> >> Matei
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Krishna Sankar
Excellent, Thanks Xiangrui. The mystery is solved.
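
In case it helps anyone else who hits the same MSE jump: per Xiangrui's
explanation quoted below, a minimal sketch of the workaround ("points" is a
hypothetical RDD[LabeledPoint] from the car-data example, and 0.0001 stands
in for whatever step size the 1.2.x run used):

from pyspark.mllib.regression import LinearRegressionWithSGD

# 1.2.x minimized 1/n * ||Ax - b||^2; 1.3.0 minimizes 1/(2n) * ||Ax - b||^2,
# so each gradient (and SGD update) is halved. Doubling the step size
# recovers the 1.2.x behavior over the same 100 iterations.
old_step = 0.0001  # hypothetical 1.2.x setting
model = LinearRegressionWithSGD.train(points, iterations=100,
                                      step=2 * old_step)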
Cheers



On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng  wrote:

> Krishna, I tested your linear regression example. For linear
> regression, we changed its objective function from 1/n * \|A x -
> b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least
> squares formulations. It means you could reproduce the same result by
> multiplying the step size by 2. This is not a problem if both run
> until convergence (if they do not blow up). However, in your example, a very
> small step size is chosen and it didn't converge in 100 iterations. In
> this case, the step size matters. I will put a note in the migration
> guide. Thanks! -Xiangrui
>
> On Mon, Mar 9, 2015 at 1:38 PM, Sean Owen  wrote:
> > I'm +1 as I have not heard of anyone else seeing the Hive test
> > failure, which is likely a test issue rather than code issue anyway,
> > and not a blocker.
> >
> > On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen  wrote:
> >> Although the problem is small, especially if indeed the essential docs
> >> changes are following just a couple days behind the final release, I
> >> mean, why the rush if they're essential? Wait a couple days, finish
> >> them, make the release.
> >>
> >> Answer is, I think these changes aren't actually essential given the
> >> comment from tdas, so: just mark these Critical? (although ... they do
> >> say they're changes for the 1.3 release, so kind of funny to get to
> >> them for 1.3.x or 1.4, but that's not important now.)
> >>
> >> I thought that Blocker really meant Blocker in this project, as I've
> >> been encouraged to use it to mean "don't release without this." I
> >> think we should use it that way. Just thinking of it as "extra
> >> Critical" doesn't add anything. I don't think Documentation should be
> >> special-cased as less important, and I don't think there's confusion
> >> if Blocker means what it says, so I'd 'fix' that way.
> >>
> >> If nobody sees the Hive failure I observed, and if we can just zap
> >> those "Blockers" one way or the other, +1
> >>
> >>
> >> On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell 
> wrote:
> >>> Sean,
> >>>
> >>> The docs are distributed and consumed in a fundamentally different way
> >>> than Spark code itself. So we've always considered the "deadline" for
> >>> doc changes to be when the release is finally posted.
> >>>
> >>> If there are small inconsistencies with the docs present in the source
> >>> code for that release tag, IMO that doesn't matter much since we don't
> >>> even distribute the docs with Spark's binary releases and virtually no
> >>> one builds and hosts the docs on their own (that I am aware of, at
> >>> least). Perhaps we can recommend if people want to build the doc
> >>> sources that they should always grab the head of the most recent
> >>> release branch, to set expectations accordingly.
> >>>
> >>> In the past we haven't considered it worth holding up the release
> >>> process for the purpose of the docs. It just doesn't make sense since
> >>> they are consumed "as a service". If we decide to change this
> >>> convention, it would mean shipping our releases later, since we
> >>> couldn't pipeline the doc finalization with voting.
> >>>
> >>> - Patrick
> >>>
> >>> On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen  wrote:
> >>>> Given the title and tagging, it sounds like there could be some
> >>>> must-have doc changes to go with what is being released as 1.3. It can
> >>>> be finished later, and published later, but then the docs source
> >>>> shipped with the release doesn't match the site, and until then, 1.3
> >>>> is released without some "must-have" docs for 1.3 on the site.
> >>>>
> >>>> The real question to me is: are there any further, absolutely
> >>>> essential doc changes that need to accompany 1.3 or not?
> >>>>
> >>>> If not, just resolve these. If there are, then it seems like the
> >>>> release has to block on them. If there are some docs that should have
> >>>> gone in for 1.3, but didn't, but aren't essential, well I suppose it
> >>>> bears thi

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-04 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 15:04 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.3.0
   pyspark works well with the new IPython 3.0.0 release
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK

On Sat, Apr 4, 2015 at 5:13 PM, Reynold Xin  wrote:

> +1
>
> Tested some DataFrame functions locally on Mac OS X.
>
> On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell 
> wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.3.1!
> >
> > The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
> >
> > The list of fixes present in this release can be found at:
> > http://bit.ly/1C2nVPY
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.1-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1080
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.3.1!
> >
> > The vote is open until Wednesday, April 08, at 01:10 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.3.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Krishna Sankar
+1

On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.2.2!
>
> The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7c684272b2e27
>
> The list of fixes present in this release can be found at:
> http://bit.ly/1DCNddt
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.2-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1082/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.2-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.2!
>
> The vote is open until Thursday, April 08, at 00:30 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.1 (RC2)

2015-04-08 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 14:16 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.3.0
   pyspark works well with the new IPython 3.0.0 release
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK

On Tue, Apr 7, 2015 at 10:46 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.3.1!
>
> The tag to be voted on is v1.3.1-rc2 (commit 7c4473a):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7c4473aa5a7f5de0323394aaedeefbf9738e8eb5
>
> The list of fixes present in this release can be found at:
> http://bit.ly/1C2nVPY
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1083/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc2-docs/
>
> The patches on top of RC1 are:
>
> [SPARK-6737] Fix memory leak in OutputCommitCoordinator
> https://github.com/apache/spark/pull/5397
>
> [SPARK-6636] Use public DNS hostname everywhere in spark_ec2.py
> https://github.com/apache/spark/pull/5302
>
> [SPARK-6205] [CORE] UISeleniumSuite fails for Hadoop 2.x test with
> NoClassDefFoundError
> https://github.com/apache/spark/pull/4933
>
> Please vote on releasing this package as Apache Spark 1.3.1!
>
> The vote is open until Saturday, April 11, at 07:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-11 Thread Krishna Sankar
+1. All tests OK (same as RC2)
Cheers


On Fri, Apr 10, 2015 at 11:05 PM, Patrick Wendell 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.3.1!
>
> The tag to be voted on is v1.3.1-rc3 (commit 3e83913):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44
>
> The list of fixes present in this release can be found at:
> http://bit.ly/1C2nVPY
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc3/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1088/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/
>
> The patches on top of RC2 are:
> [SPARK-6851] [SQL] Create new instance for each converted parquet relation
> [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
> [SPARK-6343] Doc driver-worker network reqs
> [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
> [SPARK-6781] [SQL] use sqlContext in python shell
> [SPARK-6753] Clone SparkConf in ShuffleSuite tests
> [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...
>
> Please vote on releasing this package as Apache Spark 1.3.1!
>
> The vote is open until Tuesday, April 14, at 07:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-19 Thread Krishna Sankar
Quick tests from my side - looks OK. The results are same or very similar
to 1.3.1. Will add dataframes et al in future tests.

+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:42 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.3.1
2.1. statistics (min,max,mean,Pearson,Spearman) OK (see the sketch below)
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
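
The item-2.1 statistics checks amount to something like this (a minimal
sketch on made-up vectors; "sc" is the shell's SparkContext):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

rdd = sc.parallelize([Vectors.dense([1.0, 10.0]),
                      Vectors.dense([2.0, 20.0]),
                      Vectors.dense([3.0, 31.0])])
summary = Statistics.colStats(rdd)       # per-column min/max/mean
print(summary.min())
print(summary.max())
print(summary.mean())
print(Statistics.corr(rdd, method="pearson"))   # or method="spearman"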

Cheers


On Tue, May 19, 2015 at 9:10 AM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc1 (commit 777a081):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1092/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Friday, May 22, at 17:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-24 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 16:52 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.3.1
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
4.2. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.3. saveAsParquetFile OK
4.4. Read and verify the 4.3 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK (see the sketch below)
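
The 4.3/4.4 round trip is essentially the following sketch ("df" and the
/tmp path are hypothetical):

df.saveAsParquetFile("/tmp/orders.parquet")           # 4.3 write
back = sqlContext.parquetFile("/tmp/orders.parquet")  # 4.4 read it back
back.registerTempTable("orders")
sqlContext.sql("SELECT COUNT(*) FROM orders").show()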

Cheers


On Sun, May 24, 2015 at 12:22 AM, Patrick Wendell 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc2 (commit 03fb26a3):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=03fb26a3e50e00739cc815ba4e2e82d71d003168
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.0]
> https://repository.apache.org/content/repositories/orgapachespark-1103/
> [published as version: 1.4.0-rc2]
> https://repository.apache.org/content/repositories/orgapachespark-1104/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Wednesday, May 27, at 08:12 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What has changed since RC1 ==
> Below is a list of bug fixes that went into this RC:
> http://s.apache.org/U1M
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-05-30 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 17:07 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.3.1
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK

Cheers


On Fri, May 29, 2015 at 4:40 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc3 (commit dd109a8):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=dd109a8746ec07c7c83995890fc2c0cd7a693730
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.0]
> https://repository.apache.org/content/repositories/orgapachespark-1109/
> [published as version: 1.4.0-rc3]
> https://repository.apache.org/content/repositories/orgapachespark-1110/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc3-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Tuesday, June 02, at 00:32 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What has changed since RC1 ==
> Below is a list of bug fixes that went into this RC:
> http://s.apache.org/vN
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-05 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 25:42 min (My brand new
shiny MacBookPro12,1 : 16GB. Inaugurated the machine with compile & test
1.4.0-RC4 !)
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -DskipTests
2. Tested pyspark, mllib - running as well as comparing results with 1.3.1
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK (see the sketch below)
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
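
The "Center And Scale" step in item 2.4 is sketched below (made-up 2-D
points; StandardScaler centers and scales before KMeans, and computeCost
gives the WSSSE):

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([1.0, 9.0]), Vectors.dense([2.0, 8.0]),
                       Vectors.dense([9.0, 1.0]), Vectors.dense([8.0, 2.0])])
scaler = StandardScaler(withMean=True, withStd=True).fit(data)
scaled = scaler.transform(data)          # centered and scaled vectors
model = KMeans.train(scaled, k=2, maxIterations=10)
print(model.computeCost(scaled))         # WSSSE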

Cheers


On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.0!
>
> The tag to be voted on is v1.4.0-rc4 (commit 22596c5):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> 22596c534a38cfdda91aef18aa9037ab101e4251
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.0]
> https://repository.apache.org/content/repositories/orgapachespark-/
> [published as version: 1.4.0-rc4]
> https://repository.apache.org/content/repositories/orgapachespark-1112/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.0!
>
> The vote is open until Saturday, June 06, at 05:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What has changed since RC3 ==
> In addition to many smaller fixes, three blocker issues were fixed:
> 4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make
> metadataHive get constructed too early
> 6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
> 78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.3 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.4 QA period,
> so -1 votes should only occur for significant regressions from 1.3.1.
> Bugs already present in 1.3.X, minor regressions, or bugs related
> to new features will not block this release.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.1

2015-06-28 Thread Krishna Sankar
Patrick,
   Haven't seen any replies on test results. I will byte ;o) - Should I
test this version, or is another one in the wings?
Cheers


On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.1!
>
> This release fixes a handful of known issues in Spark 1.4.0, listed here:
> http://s.apache.org/spark-1.4.1
>
> The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> 60e08e50751fe3929156de956d62faea79f5b801
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.1]
> https://repository.apache.org/content/repositories/orgapachespark-1118/
> [published as version: 1.4.1-rc1]
> https://repository.apache.org/content/repositories/orgapachespark-1119/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.1!
>
> The vote is open until Saturday, June 27, at 06:32 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.1

2015-06-29 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:26 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK

Cheers


On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.1!
>
> This release fixes a handful of known issues in Spark 1.4.0, listed here:
> http://s.apache.org/spark-1.4.1
>
> The tag to be voted on is v1.4.1-rc1 (commit 60e08e5):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> 60e08e50751fe3929156de956d62faea79f5b801
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.1]
> https://repository.apache.org/content/repositories/orgapachespark-1118/
> [published as version: 1.4.1-rc1]
> https://repository.apache.org/content/repositories/orgapachespark-1119/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.1!
>
> The vote is open until Saturday, June 27, at 06:32 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


except vs subtract

2015-07-02 Thread Krishna Sankar
Guys,
   Scala says except while Python has subtract. (I verified that except
doesn't exist in Python.) Why the difference in syntax for the same
functionality?
Cheers



Re: except vs subtract

2015-07-03 Thread Krishna Sankar
Thanks. Forgot about that ;o(
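
For the archives, a minimal illustration with two hypothetical one-column
DataFrames:

df1 = sqlContext.createDataFrame([(1,), (2,), (3,)], ["id"])
df2 = sqlContext.createDataFrame([(2,), (3,)], ["id"])
df1.subtract(df2).show()   # Python; the Scala equivalent is df1.except(df2)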

On Thu, Jul 2, 2015 at 11:57 PM, Reynold Xin  wrote:

> "except" is a keyword in Python unfortunately.
>
>
>
> On Thu, Jul 2, 2015 at 11:54 PM, Krishna Sankar 
> wrote:
>
>> Guys,
>>Scala says except while Python has subtract. (I verified that except
>> doesn't exist in Python.) Why the difference in syntax for the same
>> functionality?
>> Cheers
>> 
>>
>
>


Re: [VOTE] Release Apache Spark 1.4.1 (RC2)

2015-07-03 Thread Krishna Sankar
Yep, happens to me as well. Build loops.
Cheers


On Fri, Jul 3, 2015 at 2:40 PM, Ted Yu  wrote:

> Patrick:
> I used the following command:
> ~/apache-maven-3.3.1/bin/mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean
> package
>
> The build doesn't seem to stop.
> Here is the tail of the build output:
>
> [INFO] Dependency-reduced POM written at:
> /home/hbase/spark-1.4.1/bagel/dependency-reduced-pom.xml
> [INFO] Dependency-reduced POM written at:
> /home/hbase/spark-1.4.1/bagel/dependency-reduced-pom.xml
>
> Here is part of the stack trace for the build process:
>
> http://pastebin.com/xL2Y0QMU
>
> FYI
>
> On Fri, Jul 3, 2015 at 1:15 PM, Patrick Wendell 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.4.1!
>>
>> This release fixes a handful of known issues in Spark 1.4.0, listed here:
>> http://s.apache.org/spark-1.4.1
>>
>> The tag to be voted on is v1.4.1-rc2 (commit 07b95c7):
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
>> 07b95c7adf88f0662b7ab1c47e302ff5e6859606
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> [published as version: 1.4.1]
>> https://repository.apache.org/content/repositories/orgapachespark-1120/
>> [published as version: 1.4.1-rc2]
>> https://repository.apache.org/content/repositories/orgapachespark-1121/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-docs/
>>
>> Please vote on releasing this package as Apache Spark 1.4.1!
>>
>> The vote is open until Monday, July 06, at 22:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.4.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: [VOTE] Release Apache Spark 1.4.1 (RC2)

2015-07-03 Thread Krishna Sankar
I have 3.3.3
USS-Defiant:NW ksankar$ mvn -version
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06;
2015-04-22T04:57:37-07:00)
Maven home: /usr/local/apache-maven-3.3.3
Java version: 1.7.0_60, vendor: Oracle Corporation
Java home:
/Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.3", arch: "x86_64", family: "mac"

Let me nuke it and reinstall maven.

Cheers


On Fri, Jul 3, 2015 at 3:41 PM, Patrick Wendell  wrote:

> What if you use the built-in maven (i.e. build/mvn). It might be that
> we require a newer version of maven than you have. The release itself
> is built with maven 3.3.3:
>
> https://github.com/apache/spark/blob/master/build/mvn#L72
>
> - Patrick
>
> On Fri, Jul 3, 2015 at 3:19 PM, Krishna Sankar 
> wrote:
> > Yep, happens to me as well. Build loops.
> > Cheers
> > 
> >
> > On Fri, Jul 3, 2015 at 2:40 PM, Ted Yu  wrote:
> >>
> >> Patrick:
> >> I used the following command:
> >> ~/apache-maven-3.3.1/bin/mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive
> clean
> >> package
> >>
> >> The build doesn't seem to stop.
> >> Here is the tail of the build output:
> >>
> >> [INFO] Dependency-reduced POM written at:
> >> /home/hbase/spark-1.4.1/bagel/dependency-reduced-pom.xml
> >> [INFO] Dependency-reduced POM written at:
> >> /home/hbase/spark-1.4.1/bagel/dependency-reduced-pom.xml
> >>
> >> Here is part of the stack trace for the build process:
> >>
> >> http://pastebin.com/xL2Y0QMU
> >>
> >> FYI
> >>
> >> On Fri, Jul 3, 2015 at 1:15 PM, Patrick Wendell 
> >> wrote:
> >>>
> >>> Please vote on releasing the following candidate as Apache Spark
> version
> >>> 1.4.1!
> >>>
> >>> This release fixes a handful of known issues in Spark 1.4.0, listed
> here:
> >>> http://s.apache.org/spark-1.4.1
> >>>
> >>> The tag to be voted on is v1.4.1-rc2 (commit 07b95c7):
> >>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> >>> 07b95c7adf88f0662b7ab1c47e302ff5e6859606
> >>>
> >>> The release files, including signatures, digests, etc. can be found at:
> >>> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-bin/
> >>>
> >>> Release artifacts are signed with the following key:
> >>> https://people.apache.org/keys/committer/pwendell.asc
> >>>
> >>> The staging repository for this release can be found at:
> >>> [published as version: 1.4.1]
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1120/
> >>> [published as version: 1.4.1-rc2]
> >>>
> https://repository.apache.org/content/repositories/orgapachespark-1121/
> >>>
> >>> The documentation corresponding to this release can be found at:
> >>>
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc2-docs/
> >>>
> >>> Please vote on releasing this package as Apache Spark 1.4.1!
> >>>
> >>> The vote is open until Monday, July 06, at 22:00 UTC and passes
> >>> if a majority of at least 3 +1 PMC votes are cast.
> >>>
> >>> [ ] +1 Release this package as Apache Spark 1.4.1
> >>> [ ] -1 Do not release this package because ...
> >>>
> >>> To learn more about Apache Spark, please see
> >>> http://spark.apache.org/
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>
> >
>


Re: Can not build master

2015-07-03 Thread Krishna Sankar
Patrick,
   I assume an RC3 will be out for folks like me to test the distribution.
As usual, I will run the tests when you have a new distribution.
Cheers


On Fri, Jul 3, 2015 at 4:38 PM, Patrick Wendell  wrote:

> Patch that added test-jar dependencies:
> https://github.com/apache/spark/commit/bfe74b34
>
> Patch that originally disabled dependency reduced poms:
>
> https://github.com/apache/spark/commit/984ad60147c933f2d5a2040c87ae687c14eb1724
>
> Patch that reverted the disabling of dependency reduced poms:
>
> https://github.com/apache/spark/commit/bc51bcaea734fe64a90d007559e76f5ceebfea9e
>
> On Fri, Jul 3, 2015 at 4:36 PM, Patrick Wendell 
> wrote:
> > Okay I did some forensics with Sean Owen. Some things about this bug:
> >
> > 1. The underlying cause is that we added some code to make the tests
> > of sub modules depend on the core tests. For unknown reasons this
> > causes Spark to hit MSHADE-148 for *some* combinations of build
> > profiles.
> >
> > 2. MSHADE-148 can be worked around by disabling building of
> > "dependency reduced poms" because then the buggy code path is
> > circumvented. Andrew Or did this in a patch on the 1.4 branch.
> > However, that is not a tenable option for us because our *published*
> > pom files require dependency reduction to substitute in the scala
> > version correctly for the poms published to maven central.
> >
> > 3. As a result, Andrew Or reverted his patch recently, causing some
> > package builds to start failing again (but publishing works now).
> >
> > 4. The reason this is not detected in our test harness or release
> > build is that it is sensitive to the profiles enabled. The combination
> > of profiles we enable in the test harness and release builds does not
> > trigger this bug.
> >
> > The best path I see forward right now is to do the following:
> >
> > 1. Disable creation of dependency reduced poms by default (this
> > doesn't matter for people doing a package build) so typical users
> > won't have this bug.
> >
> > 2. Add a profile that re-enables that setting.
> >
> > 3. Use the above profile when publishing release artifacts to maven
> central.
> >
> > 4. Hope that we don't hit this bug for publishing.
> >
> > - Patrick
> >
> > On Fri, Jul 3, 2015 at 3:51 PM, Tarek Auel  wrote:
> >> Doesn't change anything for me.
> >>
> >> On Fri, Jul 3, 2015 at 3:45 PM Patrick Wendell 
> wrote:
> >>>
> >>> Can you try using the built-in maven "build/mvn..."? All of our builds
> >>> are passing on Jenkins so I wonder if it's a maven version issue:
> >>>
> >>> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/
> >>>
> >>> - Patrick
> >>>
> >>> On Fri, Jul 3, 2015 at 3:14 PM, Ted Yu  wrote:
> >>> > Please take a look at SPARK-8781
> >>> > (https://github.com/apache/spark/pull/7193)
> >>> >
> >>> > Cheers
> >>> >
> >>> > On Fri, Jul 3, 2015 at 3:05 PM, Tarek Auel 
> wrote:
> >>> >>
> >>> >> I found a solution, there might be a better one.
> >>> >>
> >>> >> https://github.com/apache/spark/pull/7217
> >>> >>
> >>> >> On Fri, Jul 3, 2015 at 2:28 PM Robin East 
> >>> >> wrote:
> >>> >>>
> >>> >>> Yes me too
> >>> >>>
> >>> >>> On 3 Jul 2015, at 22:21, Ted Yu  wrote:
> >>> >>>
> >>> >>> This is what I got (the last line was repeated non-stop):
> >>> >>>
> >>> >>> [INFO] Replacing original artifact with shaded artifact.
> >>> >>> [INFO] Replacing
> >>> >>> /home/hbase/spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT.jar
> >>> >>> with
> >>> >>>
> >>> >>>
> /home/hbase/spark/bagel/target/spark-bagel_2.10-1.5.0-SNAPSHOT-shaded.jar
> >>> >>> [INFO] Dependency-reduced POM written at:
> >>> >>> /home/hbase/spark/bagel/dependency-reduced-pom.xml
> >>> >>> [INFO] Dependency-reduced POM written at:
> >>> >>> /home/hbase/spark/bagel/dependency-reduced-pom.xml
> >>> >>>
> >>> >>> On Fri, Jul 3, 2015 at 1:13 PM, Tarek Auel 
> >>> >>> wrote:
> >>> 
> >>>  Hi all,
> >>> 
> >>>  I am trying to build the master, but it stucks and prints
> >>> 
> >>>  [INFO] Dependency-reduced POM written at:
> >>>  /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml
> >>> 
> >>>  build command:  mvn -DskipTests clean package
> >>> 
> >>>  Do others have the same issue?
> >>> 
> >>>  Regards,
> >>>  Tarek
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.1 (RC3)

2015-07-07 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 27:24 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK (see the sketch below)
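
Section 6 boils down to checks along these lines (a sketch; the
people-style DataFrame and column names are made up):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sqlContext.createDataFrame([("Ann", 30, "WA"), ("Bo", 25, "CA")],
                                ["name", "age", "state"])
print(df.dtypes)                                     # 6.1 dtypes
df = df.withColumn("age_d", df.age.cast("double"))   # 6.1 cast
df.groupBy("state").avg("age_d").show()              # 6.2 groupBy/avg
df.na.drop().show()                                  # 6.2 null handling
upper = udf(lambda s: s.upper(), StringType())
df.select(upper(df.name)).show()                     # 6.3 udf
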
Cheers


On Tue, Jul 7, 2015 at 12:06 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.1!
>
> This release fixes a handful of known issues in Spark 1.4.0, listed here:
> http://s.apache.org/spark-1.4.1
>
> The tag to be voted on is v1.4.1-rc3 (commit 3e8ae38):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> 3e8ae38944f13895daf328555c1ad22cd590b089
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.1]
> https://repository.apache.org/content/repositories/orgapachespark-1123/
> [published as version: 1.4.1-rc3]
> https://repository.apache.org/content/repositories/orgapachespark-1124/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc3-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.1!
>
> The vote is open until Friday, July 10, at 20:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Krishna Sankar
+1

1. Compiled OSX 10.10 (Yosemite) OK Total time: 38:11 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK

Cheers


On Wed, Jul 8, 2015 at 10:55 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.4.1!
>
> This release fixes a handful of known issues in Spark 1.4.0, listed here:
> http://s.apache.org/spark-1.4.1
>
> The tag to be voted on is v1.4.1-rc4 (commit dbaa5c2):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=
> dbaa5c294eb565f84d7032e387e4b8c1a56e4cd2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> [published as version: 1.4.1]
> https://repository.apache.org/content/repositories/orgapachespark-1125/
> [published as version: 1.4.1-rc4]
> https://repository.apache.org/content/repositories/orgapachespark-1126/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.4.1-rc4-docs/
>
> Please vote on releasing this package as Apache Spark 1.4.1!
>
> The vote is open until Sunday, July 12, at 06:55 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.4.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-27 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (see the sketch below)
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. joins,sql,set operations,udf OK
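
The 5.1 read/write check is along these lines (a sketch; the /tmp paths are
hypothetical, and the shell is assumed started with
--packages com.databricks:spark-csv_2.11:1.2.0):

df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/tmp/cars.csv"))
(df.write.format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/tmp/cars_out"))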

Cheers


On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc2:
>
> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc2) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1141/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1140/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> 
> What justifies a -1 vote for this release?
> 
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not
> block this release.
>
>
> ===
> What should happen to JIRA tickets still targeting 1.5.0?
> ===
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentations will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==
> Major changes to help you focus your testing
> ==
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression into DataFrame column
> - Improved support for NaN values
> - StructType now supports ordering
> - TimestampType precision is reduced to 1us
> - 100 new built-in expressions, including date/time, string, math
> - memory and local disk only checkpointing
>
> DataFrame/SQL Backend Execution
>
> - Code generation on by default
> - Improved join, aggregation, shuffle, sorting with cache friendly
> algorithms and external algorithms
> - Improved window function performance
> - Better metrics instrumentation and reporting for DF/SQL execution plan

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Krishna Sankar
+?

1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
(--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
com.databricks:spark-csv_2.11:1.2.0 worked)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK

Two Problems:

1. The synthetic column names are lowercase (i.e. now ‘sum(OrderPrice)’,
previously ‘SUM(OrderPrice)’; now ‘avg(Total)’, previously ‘AVG(Total)’).
So programs that depend on the case of the synthetic column names would
fail.
2. orders_3.groupBy("Year","Month").sum('Total').show()
fails with the error ‘java.io.IOException: Unable to acquire 4194304
bytes of memory’
orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with
the same error
Is this a known bug ?
Cheers

P.S: Sorry for the spam, forgot Reply All

On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc3:
>
> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc3) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1143/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1142/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> 
> What justifies a -1 vote for this release?
> 
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not
> block this release.
>
>
> ===
> What should happen to JIRA tickets still targeting 1.5.0?
> ===
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentations will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==
> Major changes to help you focus your testing
> ==
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression into DataFrame column

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Krishna Sankar
Thanks Tom.  Interestingly it happened between RC2 and RC3.
Now my vote is +1/2 unless the memory error is known and has a workaround.

Cheers



On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves  wrote:

> The upper/lower case thing is known.
> https://issues.apache.org/jira/browse/SPARK-9550
> I assume it was decided to be ok and its going to be in the release notes
>  but Reynold or Josh can probably speak to it more.
>
> Tom
>
>
>
> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <
> ksanka...@gmail.com> wrote:
>
>
> +?
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib
> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> 2.2. Linear/Ridge/Lasso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
> 2.5. RDD operations OK
>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>Model evaluation/optimization (rank, numIter, lambda) with
> itertools OK
> 3. Scala - MLlib
> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> 3.2. LinearRegressionWithSGD OK
> 3.3. Decision Tree OK
> 3.4. KMeans OK
> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> 3.6. saveAsParquetFile OK
> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> registerTempTable, sql OK
> 3.8. result = sqlContext.sql("SELECT
> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> 4.0. Spark SQL from Python OK
> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
> 5.0. Packages
> 5.1. com.databricks.spark.csv - read/write OK
> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
> com.databricks:spark-csv_2.11:1.2.0 worked)
> 6.0. DataFrames
> 6.1. cast,dtypes OK
> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> 6.3. All joins,sql,set operations,udf OK
>
> Two Problems:
>
> 1. The synthetic column names are lowercase ( i.e. now ‘sum(OrderPrice)’;
> previously ‘SUM(OrderPrice)’, now ‘avg(Total)’; previously 'AVG(Total)').
> So programs that depend on the case of the synthetic column names would
> fail.
> 2. orders_3.groupBy("Year","Month").sum('Total').show()
> fails with the error ‘java.io.IOException: Unable to acquire 4194304
> bytes of memory’
> orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails with
> the same error
> Is this a known bug ?
> Cheers
> 
> P.S: Sorry for the spam, forgot Reply All
>
> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc3:
>
> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc3) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1143/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1142/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> 
> What justifies a -1 vote for this release?
> 
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not block this release.

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Krishna Sankar
Yin,
   It is the
https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
Cheers


On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai  wrote:

> Hi Krishna,
>
> Can you share your code to reproduce the memory allocation issue?
>
> Thanks,
>
> Yin
>
> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar 
> wrote:
>
>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
>> Now my vote is +1/2 unless the memory error is known and has a workaround.
>>
>> Cheers
>> 
>>
>>
>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves  wrote:
>>
>>> The upper/lower case thing is known.
>>> https://issues.apache.org/jira/browse/SPARK-9550
>>> I assume it was decided to be ok and its going to be in the release
>>> notes  but Reynold or Josh can probably speak to it more.
>>>
>>> Tom
>>>
>>>
>>>
>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <
>>> ksanka...@gmail.com> wrote:
>>>
>>>
>>> +?
>>>
>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
>>>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>> 2. Tested pyspark, mllib
>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>> 2.2. Linear/Ridge/Lasso Regression OK
>>> 2.3. Decision Tree, Naive Bayes OK
>>> 2.4. KMeans OK
>>>Center And Scale OK
>>> 2.5. RDD operations OK
>>>   State of the Union Texts - MapReduce, Filter,sortByKey (word count)
>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>Model evaluation/optimization (rank, numIter, lambda) with
>>> itertools OK
>>> 3. Scala - MLlib
>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>> 3.2. LinearRegressionWithSGD OK
>>> 3.3. Decision Tree OK
>>> 3.4. KMeans OK
>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>> 3.6. saveAsParquetFile OK
>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
>>> registerTempTable, sql OK
>>> 3.8. result = sqlContext.sql("SELECT
>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>> 4.0. Spark SQL from Python OK
>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>> OK
>>> 5.0. Packages
>>> 5.1. com.databricks.spark.csv - read/write OK
>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But
>>> com.databricks:spark-csv_2.11:1.2.0 worked)
>>> 6.0. DataFrames
>>> 6.1. cast,dtypes OK
>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>> 6.3. All joins,sql,set operations,udf OK
>>>
>>> Two Problems:
>>>
>>> 1. The synthetic column names are lowercase ( i.e. now
>>> ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’;
>>> previously 'AVG(Total)'). So programs that depend on the case of the
>>> synthetic column names would fail.
>>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>>> fails with the error ‘java.io.IOException: Unable to acquire 4194304
>>> bytes of memory’
>>> orders_3.groupBy("CustomerID","Year").sum('Total').show() - fails
>>> with the same error
>>> Is this a known bug ?
>>> Cheers
>>> 
>>> P.S: Sorry for the spam, forgot Reply All
>>>
>>> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>>
>>> The tag to be voted on is v1.5.0-rc3:
>>>
>>> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Krishna Sankar
Excellent & Thanks Davies. Yep, now runs fine and takes 1/2 the time!
This was exactly why I had put in the elapsed time calculations.
And thanks for the new pyspark.sql.functions.
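
For reference, Davies' suggestion boils down to replacing the two Python
UDFs with the built-in date functions. A minimal sketch (shown in Scala;
pyspark.sql.functions gained the same month/year functions in 1.5; the
`orders` DataFrame and its "OrderDate" column are hypothetical stand-ins
for the notebook's data):

  import org.apache.spark.sql.functions.{col, month, year}

  // built-in expressions instead of Python UDFs, so the computation
  // stays in the JVM and avoids Python serialization overhead
  val withParts = orders
    .withColumn("Year",  year(col("OrderDate")))
    .withColumn("Month", month(col("OrderDate")))
  withParts.groupBy("Year", "Month").sum("Total").show()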

+1 from my side for 1.5.0 RC3.
Cheers


On Fri, Sep 4, 2015 at 9:57 PM, Davies Liu  wrote:

> Could you update the notebook to use builtin SQL function month and year,
> instead of Python UDF? (they are introduced in 1.5).
>
> Once remove those two udfs, it runs successfully, also much faster.
>
> On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar 
> wrote:
> > Yin,
> >It is the
> > https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> > Cheers
> > 
> >
> > On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai  wrote:
> >>
> >> Hi Krishna,
> >>
> >> Can you share your code to reproduce the memory allocation issue?
> >>
> >> Thanks,
> >>
> >> Yin
> >>
> >> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar 
> >> wrote:
> >>>
> >>> Thanks Tom.  Interestingly it happened between RC2 and RC3.
> >>> Now my vote is +1/2 unless the memory error is known and has a
> >>> workaround.
> >>>
> >>> Cheers
> >>> 
> >>>
> >>>
> >>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves 
> wrote:
> >>>>
> >>>> The upper/lower case thing is known.
> >>>> https://issues.apache.org/jira/browse/SPARK-9550
> >>>> I assume it was decided to be ok and its going to be in the release
> >>>> notes  but Reynold or Josh can probably speak to it more.
> >>>>
> >>>> Tom
> >>>>
> >>>>
> >>>>
> >>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
> >>>>  wrote:
> >>>>
> >>>>
> >>>> +?
> >>>>
> >>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:09 min
> >>>>  mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> >>>> 2. Tested pyspark, mllib
> >>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
> >>>> 2.2. Linear/Ridge/Lasso Regression OK
> >>>> 2.3. Decision Tree, Naive Bayes OK
> >>>> 2.4. KMeans OK
> >>>>Center And Scale OK
> >>>> 2.5. RDD operations OK
> >>>>   State of the Union Texts - MapReduce, Filter,sortByKey (word
> >>>> count)
> >>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >>>>Model evaluation/optimization (rank, numIter, lambda) with
> >>>> itertools OK
> >>>> 3. Scala - MLlib
> >>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
> >>>> 3.2. LinearRegressionWithSGD OK
> >>>> 3.3. Decision Tree OK
> >>>> 3.4. KMeans OK
> >>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
> >>>> 3.6. saveAsParquetFile OK
> >>>> 3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
> >>>> registerTempTable, sql OK
> >>>> 3.8. result = sqlContext.sql("SELECT
> >>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders
> INNER
> >>>> JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
> >>>> 4.0. Spark SQL from Python OK
> >>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State =
> 'WA'")
> >>>> OK
> >>>> 5.0. Packages
> >>>> 5.1. com.databricks.spark.csv - read/write OK
> >>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work.
> But
> >>>> com.databricks:spark-csv_2.11:1.2.0 worked)
> >>>> 6.0. DataFrames
> >>>> 6.1. cast,dtypes OK
> >>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
> >>>> 6.3. All joins,sql,set operations,udf OK
> >>>>
> >>>> Two Problems:
> >>>>
> >>>> 1. The synthetic column names are lowercase ( i.e. now
> >>>> ‘sum(OrderPrice)’; previously ‘SUM(OrderPrice)’, now ‘avg(Total)’;
> >>>> previously 'AVG(Total)'). So programs that depend on the case of the
> >>>> synthetic column names would fail.
> >>>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
> >>>> fails with the error ‘java.io.IOException: Unable to acquire 4194304 bytes of memory’

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0, FYI, notebook install is separate
“conda install ipython” and then “conda install jupyter”)
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.2.0; see the sketch below)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
*Notes:*
1. Speed improvement in DataFrame functions groupBy, avg,sum et al. *Good
work*. I am working on a project to reduce processing time from ~24 hrs to
... Let us see what Spark does. The speedups would help a lot.
2. FYI, UDFs getM and getY work now (Thanks). Slower; saturates the CPU. A
non-scientific snapshot below. I know that this really has to be done more
rigorously, on a bigger machine, with more cores, etc.
[image: Inline image 1] [image: Inline image 2]
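
A minimal sketch of the 5.1 spark-csv check (spark-shell launched with
--packages com.databricks:spark-csv_2.10:1.2.0; the file paths are
placeholders):

  // read a CSV with a header row, infer the schema, write it back out
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("orders.csv")
  df.write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("orders_out")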

On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.1
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 81 known issues in Spark 1.5.0, listed here:
> http://s.apache.org/spark-1.5.1
>
> The tag to be voted on is v1.5.1-rc1:
>
> https://github.com/apache/spark/commit/4df97937dbf68a9868de58408b9be0bf87dbbb94
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (1.5.1) can be found at:
> *https://repository.apache.org/content/repositories/orgapachespark-1148/
> *
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.0. Bugs already
> present in 1.5.0 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.1?
> ===
> Please target 1.5.2 or 1.6.0.
>
>
>
>


Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-12 Thread Krishna Sankar
I think the key is to vote a specific set of source tarballs without any
binary artifacts. The specific binaries are useful but shouldn't be part of
the voting process. Makes sense, we really cannot prove (and no need to)
that the binaries do not contain malware, but the source can be proven to
be clean by inspection, I assume.
Cheers


On Mon, Oct 12, 2015 at 6:56 AM, Tom Graves 
wrote:

> I know there are multiple things being talked about here, but  I agree
> with Patrick here, we vote on the source distribution - src tarball (and of
> course the tag should match).  Perhaps in principle we vote on all the
> other specific binary distributions since they are generated from source
> tarball but that isn't the main thing and I surely don't test and verify
> each one of those.
>
> Tom
>
>
>
> On Monday, October 12, 2015 12:13 AM, Sean Owen 
> wrote:
>
>
> No we are voting on the artifacts being released (too) in principle.
> Although of course the artifacts should be a deterministic function of the
> source at a certain point in time.
> I think the concern is about putting Spark binaries or its dependencies
> into a source release. That should not happen, but it is not what has
> happened here.
>
> On Mon, Oct 12, 2015, 6:03 AM Patrick Wendell  wrote:
>
> Oh I see - yes it's the build/. I always thought release votes related to
> a source tag rather than specific binaries. But maybe we can just fix it in
> 1.5.2 if there is concern about mutating binaries. It seems reasonable to
> me.
>
> For tests... in the past we've tried to avoid having jars inside of the
> source tree, including some effort to generate jars on the fly which a lot
> of our tests use. I am not sure whether it's a firm policy that you can't
> have jars in test folders, though. If it is, we could probably do some
> magic to get rid of these few ones that have crept in.
>
> - Patrick
>
> On Sun, Oct 11, 2015 at 9:57 PM, Sean Owen  wrote:
>
> Agree, but we are talking about the build/ bit right?
> I don't agree that it invalidates the release, which is probably the more
> important idea. As a point of process, you would not want to modify and
> republish the artifact that was already released after being voted on -
> unless it was invalid in which case we spin up 1.5.1.1 or something.
> But that build/ directory should go in future releases.
> I think he is talking about more than this though and the other jars look
> like they are part of tests, and still nothing to do with Spark binaries.
> Those can and should stay.
>
> On Mon, Oct 12, 2015, 5:35 AM Patrick Wendell  wrote:
>
> I think Daniel is correct here. The source artifact incorrectly includes
> jars. It is inadvertent and not part of our intended release process. This
> was something I noticed in Spark 1.5.0 and filed a JIRA and was fixed by
> updating our build scripts to fix it. However, our build environment was
> not using the most current version of the build scripts. See related links:
>
> https://issues.apache.org/jira/browse/SPARK-10511
> https://github.com/apache/spark/pull/8774/files
>
> I can update our build environment and we can repackage the Spark 1.5.1
> source tarball so that it does not include the jars.
>
>
> - Patrick
>
> On Sun, Oct 11, 2015 at 8:53 AM, Sean Owen  wrote:
>
> Daniel: we did not vote on a tag. Please again read the VOTE email I
> linked to you:
>
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-1-RC1-tt14310.html#none
>
> among other things, it contains a link to the concrete source (and
> binary) distribution under vote:
>
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>
> You can still examine it, sure.
>
> Dependencies are *not* bundled in the source release. You're again
> misunderstanding what you are seeing. Read my email again.
>
> I am still pretty confused about what the problem is. This is entirely
> business as usual for ASF projects. I'll follow up with you offline if
> you have any more doubts.
>
> On Sun, Oct 11, 2015 at 4:49 PM, Daniel Gruno 
> wrote:
> > Here's my issue:
> >
> > How am I to audit that the dependencies you bundle are in fact what you
> > claim they are?  How do I know they don't contain malware or - in light
> > of recent events - emissions test rigging? ;)
> >
> > I am not interested in a git tag - that means nothing in the ASF voting
> > process, you cannot vote on a tag, only on a release candidate. The VCS
> > in use is irrelevant in this issue. If you can point me to a release
> > candidate archive that was voted upon and does not contain binary
> > applications, all is well.
> >
> > If there is no such thing, and we cannot come to an understanding, I
> > will exercise my ASF Members' rights and bring this to the attention of
> > the board of directors and ask for a clarification of the legality of
> this.
> >
> > I find it highly irregular. Perhaps it is something some projects do in
> > the Java community, but that doesn't make it permissible in my view

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-26 Thread Krishna Sankar
Guys,
   The sc.version returns 1.5.1 in python and scala. Is anyone getting the
same results? Probably I am doing something wrong.
Cheers


On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.2
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 51 known issues in Spark 1.5.1, listed here:
> http://s.apache.org/spark-1.5.2
>
> The tag to be voted on is v1.5.2-rc1:
> https://github.com/apache/spark/releases/tag/v1.5.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> *http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc1-bin/
> *
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> - as version 1.5.2-rc1:
> https://repository.apache.org/content/repositories/orgapachespark-1151
> - as version 1.5.2:
> https://repository.apache.org/content/repositories/orgapachespark-1150
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.5.2-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present in 1.5.1 will not block this release.
>
> ===
> What should happen to JIRA tickets still targeting 1.5.2?
> ===
> Please target 1.5.3 or 1.6.0.
>
>
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-06 Thread Krishna Sankar
+1 (non-binding, of course) (Hope I made it in time. ~T-20!)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 25:52 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0, FYI, notebook install is separate
“conda install ipython” and then “conda install jupyter”)
2.0 Spark version is 1.5.2.
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.2.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
Cheers


On Tue, Nov 3, 2015 at 3:22 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.2
> [ ] -1 Do not release this package because ...
>
>
> The release fixes 59 known issues in Spark 1.5.1, listed here:
> http://s.apache.org/spark-1.5.2
>
> The tag to be voted on is v1.5.2-rc2:
> https://github.com/apache/spark/releases/tag/v1.5.2-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> - as version 1.5.2-rc2:
> https://repository.apache.org/content/repositories/orgapachespark-1153
> - as version 1.5.2:
> https://repository.apache.org/content/repositories/orgapachespark-1152
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> What justifies a -1 vote for this release?
> 
> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
> present in 1.5.1 will not block this release.
>
>
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-08 Thread Krishna Sankar
In addition to the wrong entry point, I suspect there is a cache problem as
well. I have seen strange errors that disappear completely once the ivy
cache is deleted.
Cheers


On Sun, Nov 8, 2015 at 7:54 PM, Ted Yu  wrote:

> Why did you directly jump to spark-streaming-mqtt module ?
>
> Can you drop 'spark-streaming-mqtt' and try again ?
>
> Not sure why 1.5.0-SNAPSHOT showed up.
> Were you using RC2 source ?
>
> Cheers
>
> On Sun, Nov 8, 2015 at 7:28 PM, 欧锐 <494165...@qq.com> wrote:
>
>>
>> build spark-streaming-mqtt_2.10 failed!
>>
>> nohup mvn -X -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
>> -Phive-thriftserver -DskipTests clean package -rf
>> :spark-streaming-mqtt_2.10 &
>>
>> [DEBUG] org.scala-tools.testing:test-interface:jar:0.5:test
>> [DEBUG] org.apache.activemq:activemq-core:jar:5.7.0:test
>> [DEBUG] org.apache.geronimo.specs:geronimo-jms_1.1_spec:jar:1.1.1:test
>> [DEBUG] org.apache.activemq:kahadb:jar:5.7.0:test
>> [DEBUG] org.apache.activemq.protobuf:activemq-protobuf:jar:1.1:test
>> [DEBUG] org.fusesource.mqtt-client:mqtt-client:jar:1.3:test
>> [DEBUG] org.fusesource.hawtdispatch:hawtdispatch-transport:jar:1.11:test
>> [DEBUG] org.fusesource.hawtdispatch:hawtdispatch:jar:1.11:test
>> [DEBUG] org.fusesource.hawtbuf:hawtbuf:jar:1.9:test
>> [DEBUG]
>> org.apache.geronimo.specs:geronimo-j2ee-management_1.1_spec:jar:1.0.1:test
>> [DEBUG] org.springframework:spring-context:jar:3.0.7.RELEASE:test
>> [DEBUG] org.springframework:spring-aop:jar:3.0.7.RELEASE:test
>> [DEBUG] aopalliance:aopalliance:jar:1.0:test
>> [DEBUG] org.springframework:spring-beans:jar:3.0.7.RELEASE:test
>> [DEBUG] org.springframework:spring-core:jar:3.0.7.RELEASE:test
>> [DEBUG] commons-logging:commons-logging:jar:1.1.1:test
>> [DEBUG] org.springframework:spring-expression:jar:3.0.7.RELEASE:test
>> [DEBUG] org.springframework:spring-asm:jar:3.0.7.RELEASE:test
>> [DEBUG] org.jasypt:jasypt:jar:1.9.0:test
>> [DEBUG] org.spark-project.spark:unused:jar:1.0.0:compile
>> [DEBUG] org.scalatest:scalatest_2.10:jar:2.2.1:test
>> [DEBUG] org.scala-lang:scala-reflect:jar:2.10.4:provided
>> [INFO]
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project External MQTT  FAILURE [
>> 2.403 s]
>> [INFO] Spark Project External MQTT Assembly ... SKIPPED
>> [INFO] Spark Project External ZeroMQ .. SKIPPED
>> [INFO] Spark Project External Kafka ... SKIPPED
>> [INFO] Spark Project Examples . SKIPPED
>> [INFO] Spark Project External Kafka Assembly .. SKIPPED
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 4.471 s
>> [INFO] Finished at: 2015-11-09T11:10:57+08:00 
>> [INFO] Final Memory: 31M/173M
>> [INFO]
>> 
>> [WARNING] The requested profile "hive" could not be activated because it
>> does not exist.
>> [ERROR] Failed to execute goal on project spark-streaming-mqtt_2.10:
>> Could not resolve dependencies for project
>> org.apache.spark:spark-streaming-mqtt_2.10:jar:1.5.0-SNAPSHOT: The
>> following artifacts could not be resolved:
>> org.apache.spark:spark-streaming_2.10:jar:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-core_2.10:jar:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-core_2.10:jar:tests:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-launcher_2.10:jar:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-network-common_2.10:jar:1.5.0-SNAPSHOT,
>> org.eclipse.paho:org.eclipse.paho.client.mqttv3:jar:1.0.1: Failure to find
>> org.apache.spark:spark-streaming_2.10:jar:1.5.0-20150818.023902-334 in
>> http://maven.cnsuning.com/content/groups/public/ was cached in the local
>> repository, resolution will not be reattempted until the update interval of
>> suning_maven_repo has elapsed or updates are forced -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal on project spark-streaming-mqtt_2.10: Could not resolve dependencies
>> for project org.apache.spark:spark-streaming-mqtt_2.10:jar:1.5.0-SNAPSHOT:
>> The following artifacts could not be resolved:
>> org.apache.spark:spark-streaming_2.10:jar:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-core_2.10:jar:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-core_2.10:jar:tests:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-launcher_2.10:jar:1.5.0-SNAPSHOT,
>> org.apache.spark:spark-network-common_2.10:jar:1.5.0-SNAPSHOT,
>> org.eclipse.paho:org.eclipse.paho.client.mqttv3:jar:1.0.1: Failure to find
>> org.apache.spark:spark-streaming_2.10:jar:1.5.0-20150818.023902-334 in
>> http://maven.cnsuning.com/content/groups/public/ was cached in the local
>> repository, resolution will not be reattempted until the update interval of
>> suning_maven_repo has elapsed or updates are forced -> [Help 1]

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Krishna Sankar
Guys,
   The sc.version gives 1.6.0-SNAPSHOT; it needs to change to 1.6.0. Can you
please verify?
Cheers


On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Spark 1.6.0 Preview
>
> Notable changes since 1.6 RC1
>
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
>- SPARK-11787  Parquet
>Performance - Improve Parquet scan performance when using flat schemas.
>- SPARK-10810 
>    Session Management - Isolated default database (i.e. USE mydb) even on
>shared clusters.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>of any supported format without registering a table.
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>(*) expansion for StructTypes - Makes it easier to nest and unest
>arbitrary numbers of columns
>- SPARK-10917 ,
>SPARK-11149 

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-17 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 29:32 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 1.6.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK (see the grid-search sketch after this list)
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.3.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
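
The 2.6 model evaluation/optimization step is a plain grid search; a
sketch of the same idea in Scala (a for-comprehension plays the role of
Python's itertools; `training` and `test` are assumed to be RDD[Rating]
splits of the MovieLens ratings, not part of the original checklist):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // grid over (rank, numIter, lambda); keep the (params, MSE) pairs
  val results = for (rank <- Seq(4, 8, 12);
                     numIter <- Seq(10, 20);
                     lambda <- Seq(0.01, 0.1)) yield {
    val model = ALS.train(training, rank, numIter, lambda)
    val preds = model.predict(test.map(r => (r.user, r.product)))
                     .map(p => ((p.user, p.product), p.rating))
    val mse = test.map(r => ((r.user, r.product), r.rating))
                  .join(preds).values
                  .map { case (actual, pred) => math.pow(actual - pred, 2) }
                  .mean()
    ((rank, numIter, lambda), mse)
  }
  results.minBy(_._2)  // parameters with the lowest test MSE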

Cheers & Good work guys


On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 19, 2015 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc3
> (168c89e07c51fa24b0bb88582c739cec0acb44d7)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1174/
>
> The test repository (versioned as v1.6.0-rc3) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1173/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc3-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
>
> Spark SQL
>
>- SPARK-11787  Parquet
>    Performance - Improve Parquet scan performance when using flat schemas.

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-25 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 29:25 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 1.6.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count; see
  the sketch after this list)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools
OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 4.3 save(above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.3.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
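
The 2.5 word-count check, sketched in Scala (the State of the Union text
path is a placeholder):

  // classic word count: flatMap/filter/map/reduceByKey, then sortByKey
  val counts = sc.textFile("sotu/*.txt")
    .flatMap(_.split("""\W+"""))
    .filter(_.nonEmpty)
    .map(w => (w.toLowerCase, 1))
    .reduceByKey(_ + _)
  counts.map(_.swap).sortByKey(ascending = false).take(10).foreach(println)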

Cheers & Holiday "Spark-ling" Wishes!


On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc4
> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1176/
>
> The test repository (versioned as v1.6.0-rc4) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1175/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Notable changes since 1.6 RC3
>
>   - SPARK-12404 - Fix serialization error for Datasets with
> Timestamps/Arrays/Decimal
>   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
>   - SPARK-12395 - Fix join columns of outer join for DataFrame using
>   - SPARK-12413 - Fix mesos HA
>
> Notable changes since 1.6 RC2
> - SPARK_VERSION has been set correctly
> - SPARK-12199 ML Docs are publishing correctly
> - SPARK-12345 Mesos cluster mode has been fixed
>
> Notable changes since 1.6 RC1
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>    bugs in eviction of storage memory by execution.

Re: [GRAPHX] Graph Algorithms and Spark

2016-04-21 Thread Krishna Sankar
Hi,

   1. Yep, GraphX is stable and would be a good choice for you to implement
   algorithms. For a quick intro you can refer to our Strata MLlib tutorial
   GraphX slides http://goo.gl/Ffq2Az
   2. GraphX has implemented algorithms like PageRank &
   ConnectedComponents[1]
   3. It also has primitives to develop the kind of algorithms that you are
   talking about
   4. For you to implement interesting algorithms, the main APIs of
   interest would be the pregel API and the aggregateMessages API[2]; see
   the sketch below. I am sure you will also use the map*, subgraph and
   the join APIs.

Cheers

[1]
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphOps
[2]
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph
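
To make (4) concrete, a minimal spark-shell sketch of aggregateMessages
(the toy edge list is made up; the same send/merge pattern underlies
kNN- and MST-style message passing):

  import org.apache.spark.graphx.{Edge, Graph}

  // toy directed graph: three vertices, three edges
  val edges = sc.parallelize(Seq(
    Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(1L, 3L, 1.0)))
  val graph = Graph.fromEdges(edges, defaultValue = 0)

  // every edge sends a message to its destination vertex; messages are
  // merged per vertex - here a simple in-degree count
  val inDeg = graph.aggregateMessages[Int](
    sendMsg = ctx => ctx.sendToDst(1),
    mergeMsg = _ + _)
  inDeg.collect().foreach(println)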

On Thu, Apr 21, 2016 at 11:47 AM, tgensol 
wrote:

> Hi there,
>
> I am working in a group at the University of Michigan, and we are trying
> to find and implement some distributed graph algorithms.
>
> I know spark, and I found GraphX. I read the docs, but I only found Latent
> Dirichlet Allocation algorithms working with GraphX, so I was wondering why
> ?
>
> Basically, the group wants to implement Minimum Spanning Tree, kNN, and
> shortest path at first.
>
> So my questions are:
> Is GraphX stable enough for developing these kinds of algorithms on it?
> Do you know of algorithms like these working on top of GraphX? And if
> not, why do you think nobody has tried to do it? Is it too hard? Or just
> because nobody needs it?
>
> Maybe it is only my knowledge of GraphX that is weak, and it is not
> possible to implement these algorithms with GraphX.
>
> Thanking you in advance,
> Best regards,
>
> Thibaut
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/GRAPHX-Graph-Algorithms-and-Spark-tp17301.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-18 Thread Krishna Sankar
+1. Looks Good.
The mllib results are in line with 1.6.1, with deprecation messages. I
will convert to ml and test later in the day.
Also will try GraphX exercises for our Strata London Tutorial

Quick Notes:

   1. pyspark env variables need to be changed
      - IPYTHON and IPYTHON_OPTS are removed
      - This works:
        PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook"
        ~/Downloads/spark-2.0.0-preview/bin/pyspark --packages
        com.databricks:spark-csv_2.10:1.4.0
   2. maven 3.3.9 is required. (I was running 3.3.3)
   3. Tons of interesting warnings and deprecations.
      - The messages look descriptive and very helpful (Thanks. This will
        help migration to 2.0, mllib -> ml et al). Will dig deeper.
   4. Compiled OSX 10.10 (Yosemite) OK Total time: 31:28 min
      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
      - Spark version is 2.0.0-preview
      - Tested pyspark, mllib (iPython 4.2.0)

Cheers & Good work folks


On Wed, May 18, 2016 at 7:28 AM, Sean Owen  wrote:

> I think it's a good idea. Although releases have been preceded before
> by release candidates for developers, it would be good to get a formal
> preview/beta release ratified for public consumption ahead of a new
> major release. Better to have a little more testing in the wild to
> identify problems before 2.0.0 is finalized.
>
> +1 to the release. License, sigs, etc check out. On Ubuntu 16 + Java
> 8, compilation and tests succeed for "-Pyarn -Phive
> -Phive-thriftserver -Phadoop-2.6".
>
> On Wed, May 18, 2016 at 6:40 AM, Reynold Xin  wrote:
> > Hi,
> >
> > In the past the Apache Spark community have created preview packages (not
> > official releases) and used those as opportunities to ask community
> members
> > to test the upcoming versions of Apache Spark. Several people in the
> Apache
> > community have suggested we conduct votes for these preview packages and
> > turn them into formal releases by the Apache foundation's standard.
> Preview
> > releases are not meant to be functional, i.e. they can and highly likely
> > will contain critical bugs or documentation errors, but we will be able
> to
> > post them to the project's website to get wider feedback. They should
> > satisfy the legal requirements of Apache's release policy
> > (http://www.apache.org/dev/release.html) such as having proper licenses.
> >
> >
> > Please vote on releasing the following candidate as Apache Spark version
> > 2.0.0-preview. The vote is open until Friday, May 20, 2015 at 11:00 PM
> PDT
> > and passes if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 2.0.0-preview
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is 2.0.0-preview
> > (8f5a04b6299e3a47aca13cbb40e72344c0114860)
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The documentation corresponding to this release can be found at:
> >
> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
> >
> > The list of resolved issues are:
> >
> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
> >
> >
> > If you are a Spark user, you can help us test this release by taking an
> > existing Apache Spark workload and running on this candidate, then reporting
> > any regressions.
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Thanks For a Job Well Done !!!

2016-06-18 Thread Krishna Sankar
Hi all,
   Just wanted to thank you all for the dataset API - most of the time we see
only bugs in these lists ;o).

   - For some context: this weekend I was updating the SQL chapters of my
   book - they had all the ugliness of SchemaRDD, registerTempTable,
   take(10).foreach(println), and
   take(30).foreach(e => println("%15s | %9.2f |".format(e(0), e(1)))) ;o)
      - I remember Hossein Falaki chiding me about the ugly println
      statements!
   - It took me a little while to grok the dataset, sparksession, and
   spark.read.option("header","true").option("inferSchema","true").csv(...)
   et al. (a minimal sketch follows this list)
      - I am a big R fan and know the language pretty decently - so the
      constructs are familiar
   - Once I got it (I am sure there are still more mysteries to uncover ...)
   it was just beautiful - well done, folks!!!
   - One sees the contrast a lot better while teaching or writing books,
   because one has to think through the old, the new, and the transitional arc
      - I even remember the good old days when we were discussing, at one of
      Paco's sessions, whether Spark would get dataframes like R's!
   - And now, it looks very decent for data wrangling.
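
For the curious, a minimal sketch of the 2.0-style read path mentioned above;
the orders.csv file name is a hypothetical stand-in:

import org.apache.spark.sql.SparkSession

object CsvReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-read-sketch")
      .master("local[2]")
      .getOrCreate()

    // 2.0 style: SparkSession replaces SQLContext, and the built-in csv
    // reader replaces the external spark-csv package.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("orders.csv") // hypothetical input file

    df.printSchema()
    df.show(30) // aligned tabular output, no hand-rolled format strings

    spark.stop()
  }
}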

Cheers & keep up the good work

P.S.: My next chapter is MLlib - I need to convert it to ml (a sketch of that
migration follows below). Should be interesting ... I am a glutton for
punishment - of the Spark kind, of course!
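
A minimal sketch of that mllib-to-ml conversion against the 2.0
DataFrame-based API, with hypothetical toy data in place of a real training
set:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object MlMigrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-migration-sketch")
      .master("local[2]")
      .getOrCreate()

    // The DataFrame-based ml API trains on (label, features) columns,
    // replacing mllib's RDD[LabeledPoint] + LinearRegressionWithSGD.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1)),
      (2.0, Vectors.dense(1.0, 2.1)),
      (3.0, Vectors.dense(2.0, 3.0))
    )).toDF("label", "features")

    val lr = new LinearRegression().setMaxIter(10)
    val model = lr.fit(training)
    println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

    spark.stop()
  }
}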


Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-22 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OS X 10.10 (Yosemite) OK Total time: 37:11 min
 mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0)
2.0 Spark version is 1.6.2
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
2.5. RDD operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min,max,mean,Pearson,Spearman) OK (see the sketch after this list)
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT
OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders INNER
JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK (--packages
com.databricks:spark-csv_2.10:1.4.0)
6.0. DataFrames
6.1. cast,dtypes OK
6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
6.3. All joins,sql,set operations,udf OK
7.0. GraphX/Scala
7.1. Create Graph (small and bigger dataset) OK
7.2. Structure APIs - OK
7.3. Social Network/Community APIs - OK
7.4. Algorithms (PageRank of 2 datasets, aggregateMessages() ) OK
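
For reference, a minimal sketch of the 3.1-style statistics check against the
1.6-era Scala MLlib API; the toy data is a hypothetical stand-in, not the
actual test suite:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

object StatsSmokeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stats-smoke-sketch").setMaster("local[2]"))

    // Column statistics (the min/max/mean part of the check).
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))
    val summary = Statistics.colStats(observations)
    println(s"mean=${summary.mean} min=${summary.min} max=${summary.max}")

    // Pearson and Spearman correlation between two series.
    val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
    val y = sc.parallelize(Seq(1.1, 2.3, 2.9, 4.2))
    val pearson = Statistics.corr(x, y, "pearson")
    val spearman = Statistics.corr(x, y, "spearman")
    println(s"pearson=$pearson spearman=$spearman")

    sc.stop()
  }
}

The point of such a smoke test is less the numbers themselves than exercising
the RDD and MLlib code paths end to end on the release candidate.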

Cheers & Good Work, Folks


On Sun, Jun 19, 2016 at 9:24 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v1.6.2-rc2
> (54b1121f351f056d6b67d2bb4efe0d553c0f7482)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1186/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 1.6.1.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This is a maintenance release in the 1.6.x series.  Bugs already present
> in 1.6.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
>
>
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-28 Thread Krishna Sankar
+1
Pulled & built on Mac OS X and EC2 Amazon Linux.
Ran test programs on OS X and a 5-node c3.4xlarge cluster.
Cheers



On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski wrote:

> +1
> On May 28, 2014 7:05 PM, "Xiangrui Meng"  wrote:
>
> > +1
> >
> > Tested apps with standalone client mode and yarn cluster and client modes.
> >
> > Xiangrui
> >
> > On Wed, May 28, 2014 at 1:07 PM, Sean McNamara  wrote:
> > > Pulled down, compiled, and tested examples on OS X and ubuntu.
> > > Deployed app we are building on spark and poured data through it.
> > >
> > > +1
> > >
> > > Sean
> > >
> > >
> > > On May 26, 2014, at 8:39 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> > >
> > >> Please vote on releasing the following candidate as Apache Spark
> > version 1.0.0!
> > >>
> > >> This has a few important bug fixes on top of rc10:
> > >> SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853
> > >> SPARK-1870: https://github.com/apache/spark/pull/848
> > >> SPARK-1897: https://github.com/apache/spark/pull/849
> > >>
> > >> The tag to be voted on is v1.0.0-rc11 (commit c69d97cd):
> > >> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a
> > >>
> > >> The release files, including signatures, digests, etc. can be found
> at:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11/
> > >>
> > >> Release artifacts are signed with the following key:
> > >> https://people.apache.org/keys/committer/tdas.asc
> > >>
> > >> The staging repository for this release can be found at:
> > >> https://repository.apache.org/content/repositories/orgapachespark-1019/
> > >>
> > >> The documentation corresponding to this release can be found at:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/
> > >>
> > >> Please vote on releasing this package as Apache Spark 1.0.0!
> > >>
> > >> The vote is open until Thursday, May 29, at 16:00 UTC and passes if
> > >> a majority of at least 3 +1 PMC votes are cast.
> > >>
> > >> [ ] +1 Release this package as Apache Spark 1.0.0
> > >> [ ] -1 Do not release this package because ...
> > >>
> > >> To learn more about Apache Spark, please see
> > >> http://spark.apache.org/
> > >>
> > >> == API Changes ==
> > >> We welcome users to compile Spark applications against 1.0. There are
> > >> a few API changes in this release. Here are links to the associated
> > >> upgrade guides - user facing changes have been kept as small as
> > >> possible.
> > >>
> > >> Changes to ML vector specification:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10
> > >>
> > >> Changes to the Java API:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> > >>
> > >> Changes to the streaming API:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> > >>
> > >> Changes to the GraphX API:
> > >> http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> > >>
> > >> Other changes:
> > >> coGroup and related functions now return Iterable[T] instead of Seq[T]
> > >> ==> Call toSeq on the result to restore the old behavior
> > >>
> > >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> > >> ==> Call toSeq on the result to restore old behavior
> > >
> >
>
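
A minimal sketch of the coGroup migration called out in the upgrade notes
above, using a hypothetical pair of small RDDs (the SparkContext._ import is
the 1.0-era way to bring pair-RDD functions into scope):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object CoGroupMigrationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cogroup-migration-sketch").setMaster("local[2]"))

    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y"))

    // In 1.0, cogroup yields (Iterable[String], Iterable[String]) per key
    // instead of (Seq[String], Seq[String]).
    val grouped = left.cogroup(right)

    // Calling toSeq on each side restores the pre-1.0 shape.
    val asSeqs = grouped.mapValues { case (ls, rs) => (ls.toSeq, rs.toSeq) }
    asSeqs.collect().foreach(println)

    sc.stop()
  }
}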


Re: Contributing Spark Infrastructure Configuration Docs

2014-06-05 Thread Krishna Sankar
Stephen,
We are working through Dell configurations; we would be happy to review your
diagrams and offer feedback from our experience. Let me know the URLs.
Cheers



On Thu, Jun 5, 2014 at 2:51 PM, Stephen Watt  wrote:

> Hi Folks
>
> My name is Steve Watt and I work in the CTO Office at Red Hat. I've
> recently spent quite a bit of time designing single-rack and multi-rack
> infrastructures for Spark for our own hardware procurement at Red Hat, and I
> thought the diagrams and server specs for both Dell and HP would be useful
> to the broader community as well. Even if folks don't want to go with my
> exact design, having the designs as a starting point should save quite a
> bit of time. I think I can fold this quite easily into
> http://spark.apache.org/docs/latest/hardware-provisioning.html
>
> However, before submitting a pull request for the modified page I thought
> I'd send this note up front and see if anyone wanted to provide some
> feedback first. If there is none, I'll submit the pull request early next
> week.
>
> Regards
> Steve Watt
>


Re: [VOTE] Release Apache Spark 1.0.1 (RC1)

2014-06-27 Thread Krishna Sankar
+1
Compiled for CentOS 6.5, deployed on our 4-node cluster (Hadoop 2.2, YARN).
Smoke tests (SparkPi, spark-shell, web UI) successful.

Cheers



On Thu, Jun 26, 2014 at 7:06 PM, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.0.1!
>
> The tag to be voted on is v1.0.1-rc1 (commit 7feeda3):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7feeda3d729f9397aa15ee8750c01ef5aa601962
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.1-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1020/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.1-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.1!
>
> The vote is open until Monday, June 30, at 03:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> === About this release ===
> This release fixes a few high-priority bugs in 1.0 and has a variety
> of smaller fixes. The full list is here: http://s.apache.org/b45. Some
> of the more visible patches are:
>
> SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
> SPARK-1790: Support r3 instance types on EC2.
>
> This is the first maintenance release on the 1.0 line. We plan to make
> additional maintenance releases as new fixes come in.
>
> - Patrick
>


Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-05 Thread Krishna Sankar
+1

   - Compiled rc2 on CentOS 6.5 w/ YARN, Hadoop 2.2.0 - successful
   - Smoke tests (Scala, Python) on a distributed cluster - successful
   - We had run Java/Spark SQL (count, distinct, et al.) on a ~250M-record RDD
   over HBase 0.98.3 with the last build (rc1) - successful
   - A standalone multi-node cluster is working better for us than YARN

Cheers



On Fri, Jul 4, 2014 at 12:40 PM, Patrick Wendell  wrote:

> I'll start the voting with a +1 - ran tests on the release candidate
> and ran some basic programs. RC1 passed our performance regression
> suite, and there are no major changes from that RC.
>
> On Fri, Jul 4, 2014 at 12:39 PM, Patrick Wendell  wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> 1.0.1!
> >
> > The tag to be voted on is v1.0.1-rc2 (commit 7d1043c):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.1-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1021/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.0.1!
> >
> > The vote is open until Monday, July 07, at 20:45 UTC and passes if
> > a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.0.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > === Differences from RC1 ===
> > This release includes only one "blocking" patch from rc1:
> > https://github.com/apache/spark/pull/1255
> >
> > There are also smaller fixes which came in over the last week.
> >
> > === About this release ===
> > This release fixes a few high-priority bugs in 1.0 and has a variety
> > of smaller fixes. The full list is here: http://s.apache.org/b45. Some
> > of the more visible patches are:
> >
> > SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
> > SPARK-2156 and SPARK-1112: Issues with jobs hanging due to akka frame size.
> > SPARK-1790: Support r3 instance types on EC2.
> >
> > This is the first maintenance release on the 1.0 line. We plan to make
> > additional maintenance releases as new fixes come in.
>


Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Krishna Sankar
Well done, guys. The MapReduce sort was a good feat at the time, and Spark has
now raised the bar with the ability to sort a PB.
Like some other folks on the list, I would welcome a summary of what worked
(and what didn't), as well as the monitoring practices used.
Cheers

P.S.: What are you folks planning next?

On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia  wrote:

> Hi folks,
>
> I interrupt your regularly scheduled user / dev list to bring you some
> pretty cool news for the project, which is that we've been able to use
> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
> faster on 10x fewer nodes. There's a detailed writeup at
> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
> Summary: while Hadoop MapReduce held last year's 100 TB world record by
> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>
> I want to thank Reynold Xin for leading this effort over the past few
> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
> providing the machines to make this possible. Finally, this result would of
> course not be possible without the many many other contributions, testing
> and feature requests from throughout the community.
>
> For an engine to scale from these multi-hour petabyte batch jobs down to
> 100-millisecond streaming and interactive queries is quite uncommon, and
> it's thanks to all of you folks that we are able to make this happen.
>
> Matei
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>