Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
The documentation states that `The input columns should be of DoubleType or FloatType.`, so I don't think that is what I'm looking for. Also, in general the API around vectors is highly lacking, especially on the PySpark side. Very common vector operations like addition, subtraction and d

Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer: https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py which should at least partially address the problem. On 06/22/2017 03:03 AM, Franklyn D'souza wrote: > I just wanted to highlight some of the rough edges around using > vect
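
As a rough illustration of what mean imputation does on a scalar column (the DoubleType/FloatType case mentioned in this thread), here is a plain-Python sketch. It is not Spark's implementation: `impute_mean` is a hypothetical helper name, and treating both None and NaN as missing is an assumption of this sketch.

```python
import math


def impute_mean(column):
    """Replace missing entries (None or NaN) with the mean of the present values.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    present = [x for x in column if x is not None and not math.isnan(x)]
    if not present:
        raise ValueError("cannot impute: column has no non-missing values")
    mean = sum(present) / len(present)
    # Keep present values as they are, substitute the mean for missing ones.
    return [mean if x is None or math.isnan(x) else x for x in column]


print(impute_mean([1.0, None, 3.0, float("nan")]))  # [1.0, 2.0, 3.0, 2.0]
```

Because this operates on one scalar per row, it does not by itself address nulls in a column whose values are whole vectors.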

Why does Spark SQL use custom spark.sql.execution.id local property not SparkContext.setJobGroup?

2017-06-21 Thread Jacek Laskowski
Hi, Just noticed that Spark SQL uses spark.sql.execution.id local property (via SQLExecution.withNewExecutionId [1]) to group Spark jobs logically together while Structured Streaming uses SparkContext.setJobGroup [2] to do the same. I think Structured Streaming is more correct as it uses what Spa

Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
I just wanted to highlight some of the rough edges around using vectors in DataFrame columns. If there is a null in a DataFrame column containing vectors, PySpark ML models like logistic regression will completely fail. However, from what I've read there is no good way to fill in these nulls wi
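
One way to picture the kind of null filling being asked for is the plain-Python sketch below, which swaps a missing vector for a constant vector of a known length. Everything here is an assumption for illustration: `fill_null_vector` is a hypothetical name, vectors are modelled as Python lists rather than Spark vector types, and whether a zero vector is a sensible replacement depends on the model. In PySpark, logic like this would presumably have to be wrapped in a user-defined function, which is not shown.

```python
def fill_null_vector(vec, size, fill=0.0):
    """Return a copy of `vec`, or a vector of `size` copies of `fill` if `vec` is None.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    if vec is None:
        return [fill] * size
    if len(vec) != size:
        raise ValueError("expected length %d, got %d" % (size, len(vec)))
    return list(vec)


rows = [[1.0, 2.0], None, [3.0, 4.0]]
filled = [fill_null_vector(v, size=2) for v in rows]
print(filled)  # [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
```

The explicit `size` argument is needed because a null row carries no information about how long its vector should have been.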

Re: [build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
ok, amplab.cs.berkeley.edu is back up and you can reach jenkins. On Wed, Jun 21, 2017 at 4:18 PM, shane knapp wrote: > a lot of berkeley cs infrastructure we depend on is still down. no > ETA as to when they'll be up. > > On Wed, Jun 21, 2017 at 3:43 PM, shane knapp wrote: >> a construction cre

Re: [build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
a lot of berkeley cs infrastructure we depend on is still down. no ETA as to when they'll be up. On Wed, Jun 21, 2017 at 3:43 PM, shane knapp wrote: > a construction crew working outside hit an underground power line, and > power has just been restored. our servers are coming back up, and > acc

Re: [build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
a construction crew working outside hit an underground power line, and power has just been restored. our servers are coming back up, and access to jenkins should be restored shortly. On Wed, Jun 21, 2017 at 2:14 PM, shane knapp wrote: > ...it pours. > > we lost power in our building, including t

[build system] when it rains... berkeley lost power. again. use new url to visit jenkins

2017-06-21 Thread shane knapp
...it pours. we lost power in our building, including the machine room where amplab.cs.berkeley.edu lives. jenkins is still up and you can visit the site by ignoring the reverse proxy: https://hadrian.ist.berkeley.edu/jenkins/ the bad news is that pull request builds won't run. ETA on power res

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Imran Rashid
-1 I'm sorry for discovering this so late, but I just filed https://issues.apache.org/jira/browse/SPARK-21165 which I think should be a blocker; it's a regression from 2.1. On Wed, Jun 21, 2017 at 1:43 PM, Nick Pentreath wrote: > As before, release looks good, all Scala, Python tests pass. R test

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, release looks good, all Scala, Python tests pass. R tests fail with same issue in SPARK-21093 but it's not a blocker. +1 (binding) On Wed, 21 Jun 2017 at 01:49 Michael Armbrust wrote: > I will kick off the voting with a +1. > > On Tue, Jun 20, 2017 at 4:49 PM, Michael Armbrust > wr

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Sean Owen
+1 Sigs/hashes look good. Tests pass on Java 8 / Ubuntu 17 with -Pyarn -Phive -Phadoop-2.7 for me. The only open issues for 2.2.0 are: SPARK-21144 Unexpected results when the data schema and partition schema have the duplicate columns SPARK-18267 Distribute PySpark via Python Package Index (pypi

[build system] patching post-mortem: back to normal!

2017-06-21 Thread shane knapp
all systems were updated fully, as it had been over a year since i'd last done it. risky, i know but... things that went right: * a lot of vulnerabilities in the systems were patched. short list: - CVE-2017-1000364 (stack guard) - CVE-2017-1000363 (stack overflow) - CVE-2017-1000366 (gnu C

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Michael Armbrust
This vote fails. Please test RC5. On Jun 21, 2017 6:50 AM, "Nick Pentreath" wrote: > Thanks, I added the details of my environment to the JIRA (for what it's > worth now, as the issue is identified) > > On Wed, 14 Jun 2017 at 11:28 Hyukjin Kwon wrote: > >> Actually, I opened - https://issues.a

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
Thanks, I added the details of my environment to the JIRA (for what it's worth now, as the issue is identified) On Wed, 14 Jun 2017 at 11:28 Hyukjin Kwon wrote: > Actually, I opened - https://issues.apache.org/jira/browse/SPARK-21093. > > 2017-06-14 17:08 GMT+09:00 Hyukjin Kwon : > >> For a shor