Re: Build fail...

2015-05-08 Thread Andrew Or
Thanks for pointing this out. I reverted that commit. 2015-05-08 19:01 GMT-07:00 Ted Yu : > Looks like you're right: > > > https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console > > [error] > /home/jenkins/workspace/Spark

Re: Build fail...

2015-05-08 Thread Ted Yu
Looks like you're right: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/427/console [error] /home/jenkins/workspace/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/core/src/main/scala/org/apache/spark/MapOut

Re: Build fail...

2015-05-08 Thread rtimp
Hi, From what I myself noticed a few minutes ago, I think branch-1.3 might be failing to compile due to the most recent commit. I tried reverting to commit 7fd212b575b6227df5068844416e51f11740e771 (the commit prior to the head) on that branch and recompiling, and was successful. As Ferris would s

Re: Recent Spark test failures

2015-05-08 Thread Ted Yu
Andrew: Do you think the -M and -A options described here can be used in test runs? http://scalatest.org/user_guide/using_the_runner Cheers On Wed, May 6, 2015 at 5:41 PM, Andrew Or wrote: > Dear all, > > I'm sure you have all noticed that the Spark tests have been fairly > unstable recently.

Intellij Spark Source Compilation

2015-05-08 Thread rtimp
Hello, I'm trying to compile the master branch of the Spark source (25889d8) in IntelliJ. I followed the instructions in the wiki https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools, namely I downloaded IntelliJ 14.1.2 with JRE 1.7.0_55, imported pom.xml, generated all sources

Re: Having pyspark.sql.types.StructType implement __iter__()

2015-05-08 Thread Reynold Xin
Sure. On Fri, May 8, 2015 at 2:43 PM, Nicholas Chammas wrote: > StructType looks an awful lot like a Python dictionary. > > However, it doesn’t implement __iter__() > , so doing > a quick conversion like this doesn’t work: > > >>>

Having pyspark.sql.types.StructType implement __iter__()

2015-05-08 Thread Nicholas Chammas
StructType looks an awful lot like a Python dictionary. However, it doesn’t implement __iter__(), so doing a quick conversion like this doesn’t work: >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))>>> >>>
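A minimal PySpark sketch of the workaround this implies, assuming a 1.3-era shell where sc and sqlContext are defined (the variable names and dict comprehension are illustrative, not part of the original message):

    df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
    # StructType has no __iter__(), but its .fields attribute is a plain
    # list of StructField objects, so a dict can be built by hand:
    schema_dict = {f.name: f.dataType for f in df.schema.fields}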

Re: branch-1.4 nightly builds?

2015-05-08 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-1517 That issue should probably be unassigned since I am not actively working on it. (I can't unassign myself.) Nick On Fri, May 8, 2015 at 5:38 PM Punyashloka Biswal wrote: > Dear Spark devs, > > Does anyone maintain nightly builds for branch-1.4? I

branch-1.4 nightly builds?

2015-05-08 Thread Punyashloka Biswal
Dear Spark devs, Does anyone maintain nightly builds for branch-1.4? I'd like to start testing against it, and having a regularly updated build on a well-publicized repository would be a great help! Punya

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Josh Rosen
Do you have any more specific profiling data that you can share? I'm curious to know where AppendOnlyMap.changeValue is being called from. On Fri, May 8, 2015 at 1:26 PM, Michal Haris wrote: > +dev > On 6 May 2015 10:45, "Michal Haris" wrote: > > > Just wanted to check if somebody has seen sim

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-08 Thread Michal Haris
+dev On 6 May 2015 10:45, "Michal Haris" wrote: > Just wanted to check if somebody has seen similar behaviour or knows what > we might be doing wrong. We have a relatively complex spark application > which processes half a terabyte of data at various stages. We have profiled > it in several ways

Re: DataFrame distinct vs RDD distinct

2015-05-08 Thread Olivier Girardot
I'll try to reproduce what has been reported to me first :) and I'll let you know. Thanks! On Thu, May 7, 2015 at 21:16, Michael Armbrust wrote: > I'd happily merge a PR that changes the distinct implementation to be more > like Spark core, assuming it includes benchmarks that show better > pe

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Ah, neat. So in the example I gave earlier, I’d do this to get columns from specific dataframes: >>> df12.select(df1['a'], df2['other']) DataFrame[a: bigint, other: string]>>> df12.select(df1['a'], df2['other']).show() a other 4 I dunno This perhaps should be documented in an examp
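Pieced together from the snippets in this thread, a hedged PySpark sketch of the approach (the sample data and the df1/df2/df12 names come from the earlier messages; a 1.3-era shell with sc and sqlContext is assumed):

    df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
    df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
    # join on the shared column, then disambiguate identically named columns
    # in the projection by referencing each one through its parent DataFrame:
    df12 = df1.join(df2, df1['a'] == df2['a'])
    df12.select(df1['a'], df2['other']).show()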

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Reynold Xin
You can actually just use df1['a'] in projection to differentiate. e.g. in Scala (similar things work in Python): scala> val df1 = Seq((1, "one")).toDF("a", "b") df1: org.apache.spark.sql.DataFrame = [a: int, b: string] scala> val df2 = Seq((2, "two")).toDF("a", "b") df2: org.apache.spark.sql.D

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
Oh, I didn't know about that. Thanks for the pointer, Rakesh. I wonder why they did that, as opposed to taking the cue from SQL and prefixing column names with a specifiable dataframe alias. The suffix approach seems quite ugly. Nick On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani wrote: > To

Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Rakesh Chalasani
To add to the above discussion, Pandas allows suffixing and prefixing to solve this issue http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html Rakesh On Fri, May 8, 2015 at 2:47 PM Nicholas Chammas wrote: > To
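For reference, a small pandas sketch of the suffixing behavior linked above (column values and suffix names are illustrative):

    import pandas as pd

    left = pd.DataFrame({'a': [4], 'other': ['I know']})
    right = pd.DataFrame({'a': [4], 'other': ['I dunno']})
    # join() refuses overlapping column names unless suffixes are given:
    joined = left.join(right, lsuffix='_left', rsuffix='_right')
    # resulting columns: a_left, other_left, a_right, other_right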

Re: [SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Michael Armbrust
What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can reproduce the same bug in master, can you open a JIRA? On Fri, May 8, 2015 at 1:36 AM, Haopu Wang wrote: > I want to

DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Nicholas Chammas
DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table aliases. This is essential when joining dataframes that have identically named columns. >>> # PySpark 1.3.1>>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, >>> "other": "I know"}']))>>> df2 = sqlContext.jsonRDD(sc.pa

Re: [build system] QA infrastructure wiki updated w/latest package installs/versions

2015-05-08 Thread Patrick Wendell
Thanks Shane - really useful as I know that several companies are interested in having in-house replicas of our QA infra. On Fri, May 8, 2015 at 7:15 PM, shane knapp wrote: > so i spent a good part of the morning parsing out all of the packages and > versions of things that we have installed on o

[build system] QA infrastructure wiki updated w/latest package installs/versions

2015-05-08 Thread shane knapp
so i spent a good part of the morning parsing out all of the packages and versions of things that we have installed on our jenkins workers: https://cwiki.apache.org/confluence/display/SPARK/Spark+QA+Infrastructure if you're looking to set up something to mimic our build system, this should be a g

Re: Easy way to convert Row back to case class

2015-05-08 Thread Reynold Xin
In 1.4, you can do row.getInt("colName"). In 1.5, some variant of this will come to allow you to turn a DataFrame into a typed RDD, where the case class's field names match the column names. https://github.com/apache/spark/pull/5713 On Fri, May 8, 2015 at 11:01 AM, Will Benton wrote: > This m

Re: [SparkR] is toDF() necessary

2015-05-08 Thread Shivaram Venkataraman
Agree that toDF is not very useful. In fact it was removed from the namespace in a recent change https://github.com/apache/spark/commit/4e930420c19ae7773b138dfc7db8fc03b4660251 Thanks Shivaram On Fri, May 8, 2015 at 1:10 AM, Sun, Rui wrote: > toDF() is defined to convert an RDD to a DataFrame.

Re: Easy way to convert Row back to case class

2015-05-08 Thread Will Benton
This might not be the easiest way, but it's pretty easy: you can use Row(field_1, ..., field_n) as a pattern in a case match. So if you have a data frame with foo as an int column and bar as a String column and you want to construct instances of a case class that wraps these up, you can do so

Re: Back-pressure for Spark Streaming

2015-05-08 Thread Akhil Das
We had a similar issue while working on one of our use cases where we were processing at a moderate throughput (around 500 MB/s). When the processing time exceeded the batch duration, it started throwing BlockNotFound exceptions; I made a workaround for that issue, which is explained over here http://a

Re: Spark Streaming with Tachyon : Some findings

2015-05-08 Thread Haoyuan Li
Thanks for the updates! Best, Haoyuan On Fri, May 8, 2015 at 8:40 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Just a followup on this Thread . > > I tried Hierarchical Storage on Tachyon ( > http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ) , and that > see

Easy way to convert Row back to case class

2015-05-08 Thread Ulanov, Alexander
Hi, I created a dataset RDD[MyCaseClass], converted it to a DataFrame and saved it to a Parquet file, following https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds When I load this dataset with sqlContext.parquetFile, I get a DataFrame with column names as in initia

Re: Spark Streaming with Tachyon : Some findings

2015-05-08 Thread Dibyendu Bhattacharya
Just a followup on this thread. I tried Hierarchical Storage on Tachyon ( http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ), and that seems to have worked and I did not see any Spark job fail due to BlockNotFoundException. Below are my Hierarchical Storage settings.. -Dtach

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
...and this is done. thanks for your patience! On Fri, May 8, 2015 at 7:00 AM, shane knapp wrote: > this is happening now. > > On Thu, May 7, 2015 at 3:40 PM, shane knapp wrote: > >> yes, docker. that wonderful little wrapper for linux containers will be >> installed and ready for play on all

Back-pressure for Spark Streaming

2015-05-08 Thread François Garillot
Hi guys, We[1] are doing a bit of work on Spark Streaming, to help it face situations where the throughput of data on an InputStream can (momentarily) overwhelm the Receivers' memory. The JIRA & design doc are here: https://issues.apache.org/jira/browse/SPARK-7398 We'd sure appreci

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
yes, absolutely. right now i'm just getting the basics set up for a student's build in the lab. later on today i will be updating the spark wiki qa infrastructure page w/more information. On Fri, May 8, 2015 at 7:06 AM, Punyashloka Biswal wrote: > Just curious: will docker allow new capabiliti

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Punyashloka Biswal
Is there a foolproof way to access methods exclusively (instead of picking between columns and methods at runtime)? Here are two ideas, neither of which seems particularly Pythonic - pyspark.sql.methods(df).name() - df.__methods__.name() Punya On Fri, May 8, 2015 at 10:06 AM Nicholas Chamm

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread Punyashloka Biswal
Just curious: will docker allow new capabilities for the Spark build? (Where can I read more?) Punya On Fri, May 8, 2015 at 10:00 AM shane knapp wrote: > this is happening now. > > On Thu, May 7, 2015 at 3:40 PM, shane knapp wrote: > > > yes, docker. that wonderful little wrapper for linux co

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Nicholas Chammas
And a link to SPARK-7035 (which Xiangrui mentioned in his initial email) for the lazy. On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng wrote: > On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman > wrote: > > I dont know much about Python style

Re: [build infra] quick downtime again tomorrow morning for DOCKER

2015-05-08 Thread shane knapp
this is happening now. On Thu, May 7, 2015 at 3:40 PM, shane knapp wrote: > yes, docker. that wonderful little wrapper for linux containers will be > installed and ready for play on all of the jenkins workers tomorrow morning. > > the downtime will be super quick: i just need to kill the jenki

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Ganelin, Ilya
All – the issue was much more subtle. I’d accidentally included a reference to a static object in a class that I wasn’t actually including in my build – hence the unrelated run-time error. Thanks for the clarification on what the “provided” scope means. Ilya Ganelin

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Olivier Girardot
You're trying to launch, using sbt run, a "provided" dependency; the goal of the "provided" scope is exactly to exclude that dependency at runtime, considering it as "provided" by the environment. Your configuration is correct for creating an assembly jar - but not for using sbt run to test your proje

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-08 Thread Steve Loughran
> 2. I can add a hadoop-2.6 profile that sets things up for s3a, azure and > openstack swift. Added: https://issues.apache.org/jira/browse/SPARK-7481 One thing to consider here is testing; the s3x clients themselves have some tests that individuals/orgs can run against different S3 install

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-08 Thread Steve Loughran
> On 7 May 2015, at 18:02, Matei Zaharia wrote: > > We should make sure to update our docs to mention s3a as well, since many > people won't look at Hadoop's docs for this. > > Matei > 1. to use s3a you'll also need an amazon toolkit JAR on the cp 2. I can add a hadoop-2.6 profile that sets

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Akhil Das
Looks like the jar you provided has some missing classes. Try this: scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0", "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided", "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a Scala case class, it works (comparing either as String or as Date). But if the DataFrame is generated by specifying a schema for an RDD, it doesn't work. Below are the exception and test code. D
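A hedged PySpark rendering of the scenario described (the reporter's actual test code is not shown in this digest; the column name, dates, and use of createDataFrame are illustrative, and per the report the filter on the schema-built frame may raise an exception on 1.3-era builds):

    from datetime import date
    from pyspark.sql.types import StructType, StructField, DateType

    # DataFrame built by applying an explicit schema to an RDD, rather
    # than from a case class, then filtered on its DateType column:
    rdd = sc.parallelize([(date(2015, 5, 8),), (date(2015, 5, 9),)])
    schema = StructType([StructField("d", DateType(), True)])
    df = sqlContext.createDataFrame(rdd, schema)
    df.filter(df["d"] == date(2015, 5, 8)).show()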

[SparkR] is toDF() necessary

2015-05-08 Thread Sun, Rui
toDF() is defined to convert an RDD to a DataFrame. But it is just a very thin wrapper around createDataFrame() that helps the caller avoid passing in a SQLContext. Since Scala/pySpark does not have toDF(), and we'd better keep the API as narrow and simple as possible, is toDF() really necessary? Could we elim

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman wrote: > I don't know much about Python style, but I think the point Wes made about > usability on the JIRA is pretty powerful. IMHO the number of methods on a > Spark DataFrame might not be much more compared to Pandas. Given that it > looks l

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Shivaram Venkataraman
I don't know much about Python style, but I think the point Wes made about usability on the JIRA is pretty powerful. IMHO the number of methods on a Spark DataFrame might not be much more compared to Pandas. Given that it looks like users are okay with the possibility of collisions in Pandas I think

Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all, In PySpark, a DataFrame column can be referenced using df["abcd"] (__getitem__) and df.abcd (__getattr__). There is a discussion on SPARK-7035 on compatibility issues with the __getattr__ approach, and I want to collect more inputs on this. Basically, if in the future we introduce a new m
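A short sketch of the kind of collision at stake, assuming a 1.3-era PySpark shell (the column names are illustrative):

    df = sqlContext.jsonRDD(sc.parallelize(['{"abcd": 1, "count": 2}']))
    df["abcd"]    # __getitem__: always resolves to the column
    df.abcd       # __getattr__: also the column, since no method is named "abcd"
    df["count"]   # the only safe way to reach this column, because...
    df.count      # ...normal attribute lookup finds DataFrame.count() first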