What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-06 Thread YiZhi Liu
Hi everyone, I'm curious about the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS. Both of them are optimized using LBFGS; the only difference I see is that LogisticRegression takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD. So
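
A minimal sketch of the two entry points (1.5-era APIs), assuming a training DataFrame with the usual label/features columns and an RDD[LabeledPoint]:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // spark.ml: DataFrame-based pipeline API
    def fitMl(training: DataFrame) =
      new LogisticRegression().setMaxIter(100).fit(training)

    // spark.mllib: RDD[LabeledPoint]-based API
    def fitMllib(training: RDD[LabeledPoint]) =
      new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)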

Re: failure notice

2015-10-06 Thread Tathagata Das
Unfortunately, there is no obvious way to do this. I am guessing that you want to partition your stream such that the same keys always go to the same executor, right? You could do it by writing a custom RDD. See ShuffledRDD
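
For the common case of a key-value stream, a hedged sketch of the same idea without a custom RDD, assuming a pair DStream and a fixed partition count (partitionBy produces a ShuffledRDD underneath; note that Spark does not pin partitions to executors, which is why there is no obvious way to guarantee it):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.dstream.DStream

    // Repartition every batch with the same partitioner so a given key
    // always hashes to the same partition number.
    def keyPartitioned(stream: DStream[(String, Int)],
                       numPartitions: Int): DStream[(String, Int)] = {
      val partitioner = new HashPartitioner(numPartitions)
      stream.transform(rdd => rdd.partitionBy(partitioner))
    }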

Re: multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
To provide more context, if we do remove this feature, the following SQL query would throw an AnalysisException: select count(distinct colA), count(distinct colB) from foo; The following should still work: select count(distinct colA) from foo; The following should also work: select count(disti
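
For reference, a sketch of the same distinction through the DataFrame API, assuming a DataFrame foo with columns colA and colB:

    import org.apache.spark.sql.functions.countDistinct

    // Two distinct counts in one aggregation -- the shape of query that
    // would be affected if the feature were removed.
    foo.agg(countDistinct("colA"), countDistinct("colB"))

    // A single distinct count, which would keep working.
    foo.agg(countDistinct("colA"))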

multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
The current implementation of multiple count distinct in a single query is quite poor in terms of both performance and robustness, and it is also hard to guarantee its correctness through some of the refactorings for Tungsten. Supporting a better version of it is possible in the future,

Re: Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
I'll put together a Google doc and send that out (in the meantime, a quick guide to how the current package can be used is in the blog post I did at http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/ ) If people think it's better to keep it as a pack
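
For context, a minimal sketch of what using the package looks like today, assuming the spark-testing-base artifact and ScalaTest on the test classpath:

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    // SharedSparkContext supplies a SparkContext (sc) that is created once
    // and shared across the tests in the suite.
    class WordCountSuite extends FunSuite with SharedSparkContext {
      test("word count") {
        val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
        assert(counts("a") === 2)
      }
    }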

Re: Adding Spark Testing functionality

2015-10-06 Thread Patrick Wendell
Hey Holden, It would be helpful if you could outline the set of features you'd imagine being part of Spark in a short doc. I didn't see a README on the existing repo, so it's hard to know exactly what is being proposed. As a general point of process, we've typically avoided merging modules into S

Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
Hi Spark Devs, So this has been brought up a few times before, and generally on the user list people get directed to use spark-testing-base. I'd like to start moving some of spark-testing-base's functionality into Spark so that people don't need a library to do what is (hopefully :p) a very common

Re: CQs on WindowedStream created on running StreamingContext

2015-10-06 Thread Yogesh Mahajan
Anyone know about this? TD? -yogesh > On 30-Sep-2015, at 1:25 pm, Yogs wrote: > > Hi, > > We intend to run ad hoc windowed continuous queries on Spark Streaming data. > The queries could be registered/deregistered dynamically or can be submitted > through the command line. Currently Spark str
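
A hedged sketch of the kind of windowed query being asked about, assuming a DStream[String] of events and a SQLContext shared across batches (the table name recent_events is just an example):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Expose each 30-second window (sliding every 10 seconds) as a temp table
    // so ad hoc SQL can be run against the most recent window.
    def registerWindow(events: DStream[String], sqlContext: SQLContext): Unit = {
      import sqlContext.implicits._
      events.window(Seconds(30), Seconds(10)).foreachRDD { rdd =>
        rdd.map(Tuple1(_)).toDF("line").registerTempTable("recent_events")
      }
    }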

Re: SparkR dataframe UDF

2015-10-06 Thread Hossein
User-defined functions written in R are not supported yet. You can implement your UDF in Scala, register it in sqlContext, and use it in SparkR, provided that you share your context between R and Scala. --Hossein On Friday, October 2, 2015, Renyi Xiong wrote: > Hi Shiva, > > Is Dataframe UDF impl
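
A minimal sketch of the Scala side, assuming a shared SQLContext; the UDF name strLen is just an example:

    // Register a Scala function under a name that SparkR can reference in SQL.
    sqlContext.udf.register("strLen", (s: String) => s.length)

    // From SparkR, against the same shared context, something like:
    //   head(sql(sqlContext, "SELECT strLen(name) FROM people"))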

Re: failure notice

2015-10-06 Thread Renyi Xiong
Yes, it can recover on a different node. It uses a write-ahead log, checkpoints offsets of both ingress and egress (e.g. using ZooKeeper and/or Kafka), and relies on the streaming engine's deterministic operations. By replaying back a certain range of data based on the checkpointed ingress offset (at least
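
A hedged sketch of the pieces that enable this in Spark Streaming, assuming a receiver-based source and a checkpoint directory on a fault-tolerant filesystem (the paths are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("recoverable-stream")
      // Persist received blocks to the write-ahead log before acknowledging.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    // getOrCreate rebuilds the context from the checkpoint after a failure,
    // possibly on a different node, and replays unprocessed data.
    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", () => {
      val newSsc = new StreamingContext(conf, Seconds(10))
      newSsc.checkpoint("hdfs:///checkpoints/app")
      // ... define sources, transformations, and output operations here ...
      newSsc
    })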

Re: StructType has more rows, than corresponding Row has objects.

2015-10-06 Thread Eugene Morozov
Davies, that seemed to be my issue; my colleague helped me resolve it. The problem was that we build the RDD and corresponding StructType ourselves (no JSON, Parquet, Cassandra, etc. - we take a list of business objects and convert them to Rows, then infer the struct type) and I missed one thing. --
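
For reference, a minimal sketch of that pattern with hypothetical business objects; the thing to get right is that every Row must match the StructType field-for-field, in both order and type:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    case class Item(name: String, count: Int) // stand-in for a business object

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("count", IntegerType, nullable = false)))

    // Each Row's values must line up with the schema positionally and by type.
    val rows = sc.parallelize(Seq(Item("a", 1), Item("b", 2)))
      .map(i => Row(i.name, i.count))

    val df = sqlContext.createDataFrame(rows, schema)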

Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
i personally find the comma-separated paths feature much more important than commas in paths (which one could argue you should avoid). but assuming people want to keep commas as legitimate characters in paths: https://issues.apache.org/jira/browse/SPARK-10185 https://github.com/apache/spark/pull/8

Re: Pyspark dataframe read

2015-10-06 Thread Reynold Xin
I think the problem is that comma is actually a legitimate character in a file name, and as a result ... On Tuesday, October 6, 2015, Josh Rosen wrote: > Could someone please file a JIRA to track this? > https://issues.apache.org/jira/browse/SPARK > > On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers

Re: Pyspark dataframe read

2015-10-06 Thread Josh Rosen
Could someone please file a JIRA to track this? https://issues.apache.org/jira/browse/SPARK On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers wrote: > i ran into the same thing in scala api. we depend heavily on comma > separated paths, and it no longer works. > > > On Tue, Oct 6, 2015 at 3:02 AM, B

Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
i ran into the same thing in the scala api. we depend heavily on comma-separated paths, and it no longer works. On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl wrote: > Hello everyone. > > It seems pyspark dataframe read is broken for reading multiple files. > > sql.read.json( "file1,file2") fails wit

Pyspark dataframe read

2015-10-06 Thread Blaž Šnuderl
Hello everyone. It seems pyspark dataframe read is broken for reading multiple files. sql.read.json("file1,file2") fails with java.io.IOException: No input paths specified in job. This used to work in Spark 1.4 and also still works with sc.textFile Blaž
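
Until that is fixed, a hedged workaround sketch: read each path separately and union the results (unionAll on 1.5-era DataFrames):

    // Load each file on its own and union, instead of relying on comma splitting.
    val paths = Seq("file1", "file2")
    val df = paths.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)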