Re: GraphX: New graph operator

2015-06-02 Thread Reynold Xin
Hi Tarek, I took a quick look at the materials you shared. It actually seems to me it'd be super easy to express a graph as two DataFrames: one for edges (srcid, dstid, and other edge attributes) and one for vertices (vid, and other vertex attributes). Then intersection is just edges1.intersect
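A minimal sketch of that representation (Scala; the column names srcid/dstid/vid and the DFGraph wrapper are illustrative, not an actual API):

```scala
import org.apache.spark.sql.DataFrame

// A graph as two DataFrames, per the suggestion above:
// vertices(vid, ...attrs), edges(srcid, dstid, ...attrs).
case class DFGraph(vertices: DataFrame, edges: DataFrame)

// Graph intersection then reduces to DataFrame.intersect on each part.
def intersect(g1: DFGraph, g2: DFGraph): DFGraph =
  DFGraph(
    g1.vertices.intersect(g2.vertices),
    g1.edges.intersect(g2.edges))
```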

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-02 Thread Patrick Wendell
Hey all - a tiny nit from the last e-mail. The tag is v1.4.0-rc4. The exact commit and all other information is correct. (Thanks to Shivaram, who pointed this out.) On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.4.

[VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 22596c534a38cfdda91aef18aa9037ab101e4251 The release files, including signatures, digests, etc. ca

[RESULT] [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-02 Thread Patrick Wendell
This vote is cancelled in favor of RC4. Thanks everyone for the thorough testing of this RC. We are really close, but there were a few blockers found. I've cut a new RC to address those issues. The following patches were merged during the RC3 testing period: (blockers) 4940630 [SPARK-8020] [

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Reynold Xin
.select itself is the bulk add, right? On Tue, Jun 2, 2015 at 5:32 PM, Andrew Ash wrote: > Would it be valuable to create a .withColumns([colName], [ColumnObject]) > method that adds in bulk rather than iteratively? > > Alternatively, effort might be better spent in making .withColumn() > singular
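For the archive, a hedged sketch of the difference being discussed (Scala, 1.4-era API; column names are made up for illustration): the loop re-plans the query once per column, while a single select adds everything at once:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Iterative: each withColumn call builds a new query plan (slow in a loop).
def addIteratively(df: DataFrame, n: Int): DataFrame =
  (0 until n).foldLeft(df)((d, i) => d.withColumn("col" + i, lit(i)))

// Bulk: one select carrying all existing plus all new columns,
// so analysis/resolution happens a single time.
def addInBulk(df: DataFrame, n: Int): DataFrame = {
  val newCols = (0 until n).map(i => lit(i).as("col" + i))
  df.select(df.columns.map(col) ++ newCols: _*)
}
```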

Re: Unit tests can generate spurious shutdown messages

2015-06-02 Thread Reynold Xin
Can you submit a pull request for it? Thanks. On Tue, Jun 2, 2015 at 4:25 AM, Mick Davies wrote: > If I write unit tests that indirectly initialize > org.apache.spark.util.Utils, > for example use sql types, but produce no logging, I get the following > unpleasant stack trace in my test output.

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Andrew Ash
Would it be valuable to create a .withColumns([colName], [ColumnObject]) method that adds in bulk rather than iteratively? Alternatively, effort might be better spent in making .withColumn() singular faster. On Tue, Jun 2, 2015 at 3:46 PM, Reynold Xin wrote: > We improved this in 1.4. Adding 100

Re: Possible space improvements to shuffle

2015-06-02 Thread John Carrino
Yes, I think that bug is what I want. Thank you. So I guess the current reason is that we don't want to buffer up numMapper incoming streams, and instead just iterate through each and transfer it over in full because that is more network-efficient? I'm not sure I understand why you wouldn't want to so

Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Reynold Xin
Almost all DataFrame work is tracked by this umbrella ticket: https://issues.apache.org/jira/browse/SPARK-6116 For the reader/writer interface, it's here: https://issues.apache.org/jira/browse/SPARK-7654 https://github.com/apache/spark/pull/6175 On Tue, Jun 2, 2015 at 3:57 PM, Matt Cheah wro

Re: [SQL] Write parquet files under partition directories?

2015-06-02 Thread Matt Cheah
Excellent! Where can I find the code, pull request, and Spark ticket where this was introduced? Thanks, -Matt Cheah From: Reynold Xin Date: Monday, June 1, 2015 at 10:25 PM To: Matt Cheah Cc: "dev@spark.apache.org", Mingyu Kim, Andrew Ash Subject: Re: [SQL] Write parquet files under pa

Re: createDataframe from s3 results in error

2015-06-02 Thread Reynold Xin
Maybe an incompatible Hive package or Hive metastore? On Tue, Jun 2, 2015 at 3:25 PM, Ignacio Zendejas wrote: > From RELEASE: > > "Spark 1.3.1 built for Hadoop 2.4.0 > > Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests > -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Ph

Re: createDataframe from s3 results in error

2015-06-02 Thread Ignacio Zendejas
From RELEASE: "Spark 1.3.1 built for Hadoop 2.4.0 Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive -Phive-thriftserver" And this stacktrace may be more useful: http://pastebin.ca/3016483 On Tue, Jun 2, 2015 at 3:13

Re: createDataframe from s3 results in error

2015-06-02 Thread Reynold Xin
What version of Spark is this? On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas wrote: > I've run into an error when trying to create a dataframe. Here's the code: > > -- > from pyspark import StorageLevel > from pyspark.sql import Row > > table = 'blah' > ssc = HiveContext(sc) > > data = sc.tex

createDataframe from s3 results in error

2015-06-02 Thread Ignacio Zendejas
I've run into an error when trying to create a dataframe. Here's the code: -- from pyspark import StorageLevel from pyspark.sql import Row table = 'blah' ssc = HiveContext(sc) data = sc.textFile('s3://bucket/some.tsv') def deserialize(s): p = s.strip().split('\t') p[-1] = float(p[-1]) ret

Re: CSV Support in SparkR

2015-06-02 Thread Shivaram Venkataraman
Thanks for testing. We should probably include a section for this in the SparkR programming guide given how popular CSV files are in R. Feel free to open a PR for that if you get a chance. Shivaram On Tue, Jun 2, 2015 at 2:20 PM, Eskilson,Aleksander < alek.eskil...@cerner.com> wrote: > Seems to

Re: CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Seems to work great in the master build. It’s really good to have this functionality. Regards, Alek Eskilson From: "Eskilson, Aleksander" <alek.eskil...@cerner.com> Date: Tuesday, June 2, 2015 at 2:59 PM To: "shiva...@eecs.berkeley.edu" <shiva...@

Re: Possible space improvements to shuffle

2015-06-02 Thread Josh Rosen
The relevant JIRA that springs to mind is https://issues.apache.org/jira/browse/SPARK-2926 If an aggregator and ordering are both defined, then the map side of sort-based shuffle will sort based on the key ordering so that map-side spills can be efficiently merged. We do not currently do a sort-b

Possible space improvements to shuffle

2015-06-02 Thread John Carrino
One thing I have noticed with ExternalSorter is that if an ordering is not defined, it does the sort using only the partition_id, instead of (partition_id, hash). This means that on the reduce side you need to pull the entire dataset into memory before you can begin iterating over the results. I f
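A user-level illustration of the ordering point (hedged sketch; Spark 1.2+ API, toy data, assumes an existing SparkContext `sc`): supplying a key ordering via repartitionAndSortWithinPartitions pushes the sort into the shuffle, so downstream consumers see sorted runs rather than having to materialize everything:

```scala
import org.apache.spark.HashPartitioner

// Toy key/value pairs; keys and values here are made up.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// Because an implicit Ordering[String] is in scope, map output is sorted
// by key within each shuffle partition and merged on the reduce side.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

// Inspect per-partition contents: keys arrive in sorted order.
println(sorted.glom().collect().map(_.mkString(", ")).mkString(" | "))
```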

Re: CSV Support in SparkR

2015-06-02 Thread Shivaram Venkataraman
There was a bug in the SparkContext creation that I fixed yesterday. https://github.com/apache/spark/commit/6b44278ef7cd2a278dfa67e8393ef30775c72726 If you build from master it should be fixed. Also I think we might have an rc4 which should have this. Thanks Shivaram On Tue, Jun 2, 2015 at 12:56

Re: CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Ah, alright, cool. I’ll rebuild and let you know. Thanks again, Alek From: Shivaram Venkataraman <shiva...@eecs.berkeley.edu> Reply-To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu> Date: Tuesday, June 2, 2015 at 2:57 PM To: Aleksande

Re: CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Hey, that’s pretty convenient. Unfortunately, although the package seems to pull fine into the session, I’m getting class not found exceptions with: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3

Re: DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread Reynold Xin
We improved this in 1.4. Adding 100 columns took 4s on my laptop. https://issues.apache.org/jira/browse/SPARK-7276 Still not the fastest, but much faster. scala> Seq((1, 2)).toDF("a", "b") res6: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> val start = System.nanoTime start: L

DataFrame.withColumn very slow when used iteratively?

2015-06-02 Thread zsampson
Hey, I'm seeing extreme slowness in withColumn when it's used in a loop. I'm running this code: for (int i = 0; i < NUM_ITERATIONS; ++i) { df = df.withColumn("col"+i, new Column(new Literal(i, DataTypes.IntegerType))); } where df is initially a trivial dataframe. Here are the results of runni

Re: CSV Support in SparkR

2015-06-02 Thread Shivaram Venkataraman
Hi Alek, As Burak said, you can already use spark-csv with SparkR in the 1.4 release. Right now I use it with something like this: # Launch SparkR ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3 df <- read.df(sqlContext, "./nycflights13.csv", "com.databricks.spark.csv", header="t
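For anyone following the thread from Scala, the same package plugs into the new DataFrameReader in 1.4 (a sketch; the file path and options mirror the R example above):

```scala
// Launched with the package on the classpath, as in the R example:
//   ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("./nycflights13.csv")
df.printSchema()
```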

Re: CSV Support in SparkR

2015-06-02 Thread Burak Yavuz
Hi, cc'ing Shivaram here, because he worked on this yesterday. If I'm not mistaken, you can use the following workflow: ```./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3``` and then ```df <- read.df(sqlContext, "/data", "csv", header = "true")``` Best, Burak On Tue, Jun 2, 2015 a

CSV Support in SparkR

2015-06-02 Thread Eskilson,Aleksander
Are there any intentions to provide first-class support for CSV files as one of the loadable file types in SparkR? Databricks’ spark-csv API [1] has support for SQL, Python, and Java/Scala, and implements most of the arguments of R’s read.table API [2], but currently there is no way to load CSV

Re: about Spark MLlib StandardScaler's Implementation

2015-06-02 Thread Joseph Bradley
Your understanding is correct: When used without centering (withMean = false), the 2 implementations are different: * R: normalize by RMS * MLlib: normalize by stddev With centering, they are the same. It's hard to say which one is better a priori, but my guess is that most R users center their da
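A quick numeric illustration of the distinction (plain Scala, no MLlib): without centering, R's scale() divides by the root-mean-square sqrt(sum(x^2)/(n-1)), while MLlib's StandardScaler divides by the sample standard deviation, so the results differ whenever the column mean is nonzero:

```scala
val xs = Seq(1.0, 2.0, 3.0, 4.0)
val n = xs.length
val mean = xs.sum / n

// Sample standard deviation, which MLlib's StandardScaler divides by.
val stddev = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (n - 1))

// Root-mean-square as defined by R's scale() when center = FALSE.
val rms = math.sqrt(xs.map(x => x * x).sum / (n - 1))

println(s"stddev = $stddev, rms = $rms") // they agree only when mean == 0
```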

Unit tests can generate spurious shutdown messages

2015-06-02 Thread Mick Davies
If I write unit tests that indirectly initialize org.apache.spark.util.Utils, for example by using sql types, but produce no logging, I get the following unpleasant stack trace in my test output. This is caused by the Utils class adding a shutdown hook which logs the message logDebug("Shutdown hook ca

Re: [VOTE] Release Apache Spark 1.4.0 (RC3)

2015-06-02 Thread Olivier Girardot
Hi everyone, I think there's a blocker on PySpark: the "when" function in Python seems to be broken, but the Scala API seems fine. Here's a snippet demonstrating that with Spark 1.4.0 RC3: In [*1*]: df = sqlCtx.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"]) In [*2*]:
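For comparison, a sketch of the Scala side that reportedly works (same toy data as the Python snippet; hedged, using the 1.4 functions API):

```scala
import org.apache.spark.sql.functions.{col, when}

val df = sqlContext.createDataFrame(
  Seq((1, "1"), (2, "2"), (1, "2"), (1, "2"))).toDF("key", "value")

// when/otherwise builds a conditional Column; this is what the Python
// `when` wraps, and it behaves as expected on the Scala side.
df.select(col("key"), when(col("key") === 1, 0).otherwise(1).as("flag")).show()
```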

Re: please use SparkFunSuite instead of ScalaTest's FunSuite from now on

2015-06-02 Thread Steve Loughran
thanks. All I'd need would be the base class, so that new tests can be written to work across branches. -steve On 1 Jun 2015, at 18:45, Andrew Or <and...@databricks.com> wrote: It will be within the next few days 2015-06-01 9:17 GMT-07:00 Reynold Xin <r...@databricks.com>: I do

about Spark MLlib StandardScaler's Implementation

2015-06-02 Thread RoyGaoVLIS
Hi, When I was trying to add a test case for ML’s StandardScaler, I found MLlib’s StandardScaler’s output differs from R’s with params (withMean = false, withStd = true), because in R’s scale function columns are divided by the root-mean-square rather than the standard deviation. I’m

Re: GraphX: New graph operator

2015-06-02 Thread Tarek Auel
Okay, thanks for your feedback. What is the expected behavior of union? Like UNION and/or UNION ALL in SQL? Union all would be more or less trivial if we just concatenate the vertices and edges (vertex ID conflicts have to be resolved). Should union look for duplicates on the actual attribute (VD)
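In the two-DataFrame encoding suggested earlier in this thread, both semantics have direct analogues (a sketch; the edge DataFrames e1/e2 are illustrative): UNION ALL is plain concatenation, and UNION adds a distinct that deduplicates on all columns, attributes included:

```scala
import org.apache.spark.sql.DataFrame

// UNION ALL semantics: concatenate edge lists, keeping duplicates.
def unionAllEdges(e1: DataFrame, e2: DataFrame): DataFrame =
  e1.unionAll(e2)

// UNION semantics: concatenate, then drop rows that match on every
// column, so duplicates are judged on ids and attributes (VD/ED) alike.
def unionEdges(e1: DataFrame, e2: DataFrame): DataFrame =
  e1.unionAll(e2).distinct
```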