Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Sasha Kacanski
Thanks for the quick reply, I will check the link. Hopefully, with the conversion to Python 3 (3.4), we could take advantage of asyncio and other cool new stuff ... On Thu, Jan 29, 2015 at 7:41 PM, Reynold Xin wrote: > It is something like this: > https://issues.apache.org/jira/browse/SPARK-5097 > > On the

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
It is something like this: https://issues.apache.org/jira/browse/SPARK-5097 On the master branch, we have a Pandas-like API already. On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski wrote: > Hi Reynold, > In my project I want to use Python API too. > When you mention DF's are we talking about p
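
For readers following along, here is a minimal sketch of what that Pandas-like usage looks like in PySpark, assuming the 1.3-era DataFrame API; the input file and column names are invented for illustration:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc: an existing SparkContext, e.g. from the PySpark shell
    df = sqlContext.jsonFile("events.json")      # hypothetical input
    result = (df.filter(df["age"] > 21)          # row filter, pandas-style column access
                .select("name", "age")           # projection
                .groupBy("age")
                .count())
    result.show()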

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Sasha Kacanski
Hi Reynold, In my project I want to use the Python API too. When you mention DFs, are we talking about pandas, or is this something internal to the Spark Python API? If you could elaborate a bit on this or point me to alternate documentation. Thanks much --sasha On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin wr

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Cheng Lian
Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar. Cheng On 1/29/15 1:24 PM, Koert Kuipers wrote: to me the word DataFrame does come with certain expectations. one of them is
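
As a rough illustration of the two options Cheng mentions (paths here are hypothetical, and df/sqlContext are assumed to exist as in the PySpark shell):

    # cache in memory: stored in an efficient in-memory columnar format
    df.cache()
    df.count()  # an action to materialize the cache

    # persist on disk as Parquet, which is also columnar
    df.saveAsParquetFile("/tmp/events.parquet")

    # read it back later
    df2 = sqlContext.parquetFile("/tmp/events.parquet")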

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Cheng Lian
Forgot to mention that you can find it here. On 1/29/15 1:59 PM, Cheng Lian wrote: Yes, when a DataFrame is cached in memory, it'

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Koert Kuipers
To me the word DataFrame does come with certain expectations. One of them is that the data is stored in columnar form. In R, data.frame internally uses a list of sequences, I think, but since lists can have labels it's more like a SortedMap[String, Array[_]]. This makes certain operations very cheap (such as
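
A toy sketch of that columnar picture in plain Python (nothing Spark-specific, just an illustration of why a name-to-column mapping makes some operations cheap): adding or selecting a column is a single map operation, while appending a row touches every column.

    from collections import OrderedDict

    # a "data frame" as an ordered map of column name -> column values
    frame = OrderedDict([
        ("name", ["alice", "bob", "carol"]),
        ("age",  [31, 42, 27]),
    ])

    # adding a derived column is cheap: one new entry, no per-row copying
    frame["age_plus_one"] = [a + 1 for a in frame["age"]]

    # selecting a subset of columns is also cheap
    subset = OrderedDict((k, frame[k]) for k in ("name", "age"))

    # appending a single row, by contrast, touches every column
    row = {"name": "dave", "age": 55, "age_plus_one": 56}
    for col, values in frame.items():
        values.append(row[col])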

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Once the data frame API is released in 1.3, you can write your thing in Python and get the same performance. It can't express everything, but for basic things like projection, filter, join, aggregate, and simple numeric computation, it should work pretty well. On Thu, Jan 29, 2015 at 12:45 PM, rt
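
To make that scope concrete, here is a hedged sketch of the operations he lists, written against the then-upcoming DataFrame API; the table and column names are invented. Because these expressions are evaluated on the JVM rather than in Python lambdas, they avoid the per-record Python serialization cost discussed in the messages below.

    users  = sqlContext.parquetFile("/data/users.parquet")    # hypothetical inputs
    orders = sqlContext.parquetFile("/data/orders.parquet")

    joined  = orders.join(users, orders["user_id"] == users["id"])   # join
    summary = (joined
               .filter(joined["amount"] > 0)                         # filter
               .select("country", "amount")                          # projection
               .groupBy("country")
               .sum("amount"))                                       # aggregate
    summary.show()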

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Davies Liu
Hey, Without having Python as fast as Scala/Java, I think it's impossible to get similar performance in PySpark as in Scala/Java. Jython is also much slower than Scala/Java. With Jython, we can avoid the cost of managing multiple processes and RPC, but we may still need to do the data conversion between Java

How to speed PySpark to match Scala/Java performance

2015-01-29 Thread rtshadow
Hi, In my company, we've been trying to use PySpark to run ETLs on our data. Alas, it turned out to be terribly slow compared to the Java or Scala API (which we ended up using to meet performance criteria). To be more quantitative, let's consider a simple case: I've generated a test file (848 MB): /seq 1
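
The benchmark itself is cut off above, but the general shape of the problem is easy to reproduce: any job built from plain RDD transformations with Python lambdas pays a per-record round trip between the JVM and the Python worker processes. A hedged sketch of that pattern (the file name and logic are placeholders, not the poster's actual test):

    lines = sc.textFile("/tmp/test_file.txt")

    # every record is serialized to a Python worker, transformed by the
    # lambda, and serialized back; this round trip dominates the runtime
    counts = (lines
              .map(lambda line: (line.split(",")[0], 1))
              .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("/tmp/counts")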

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-29 Thread Octavian Geagla
Thanks for the responses. How would something like HadamardProduct or similar be, in order to keep it explicit? It would still be a VectorTransformer, so the name and trait would hopefully lead to a somewhat self-documenting class. Xiangrui, do you mean the Hadamard product or transform? My initial pr
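
For readers unfamiliar with the term: the Hadamard (element-wise) product multiplies two vectors component by component, which is what the proposed transformer would apply using a fixed scaling vector. A minimal sketch of the operation itself, in plain NumPy rather than the MLlib API under discussion:

    import numpy as np

    scaling = np.array([0.5, 2.0, 1.0])   # the transformer's fixed weights
    v       = np.array([4.0, 3.0, 7.0])   # an input feature vector

    # component-wise (Hadamard) product
    scaled = v * scaling                   # -> array([2.0, 6.0, 7.0])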

TimeoutException on tests

2015-01-29 Thread Dirceu Semighini Filho
Hi All, I'm trying to use a locally built Spark, adding PR 1290 to the 1.2.0 build, and after I do the build my tests start to fail. should create labeledpoint *** FAILED *** (10 seconds, 50 milliseconds) [info] java.util.concurrent.TimeoutException: Futures timed out after [1 millisecon

Re: renaming SchemaRDD -> DataFrame

2015-01-29 Thread Evan Chan
+1. Having proper NA support is much cleaner than using null, at least the Java null. On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks wrote: > You've got to be a little bit careful here. "NA" in systems like R or pandas > may have special meaning that is distinct from "null". > > See, e.g. htt

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread Mohit Jaggi
Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is the kind of implementation I am looking for, one that does not allocate intermediate space for storing (K,V) pairs. When working with large datasets this type of intermediate memory allocation wreaks havoc with g
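
For context, a hedged PySpark sketch of the pattern being discussed (the field extraction is made up): aggregateByKey and combineByKey do aggregate per key, but they operate on an RDD that is already made of (K, V) pairs, so getting there usually means a map() that allocates a tuple for every record first, which is exactly the intermediate allocation in question.

    records = sc.textFile("/data/events.txt")

    # the step in question: one (key, value) tuple allocated per record
    pairs = records.map(lambda line: (line.split(",")[0], 1))

    # aggregation by key on the pair RDD
    counts = pairs.aggregateByKey(0,
                                  lambda acc, v: acc + v,   # within a partition
                                  lambda a, b: a + b)       # across partitions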

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-29 Thread Robert C Senkbeil
+1 I verified that the REPL jars published work fine with the Spark Kernel project (can build/test against them). Signed, Chip Senkbeil From: Krishna Sankar To: Sean Owen Cc: Patrick Wendell, "dev@spark.apache.org" Date: 01/28/2015 02:52 PM Subject: Re: [VOT

Re: emergency jenkins restart soon

2015-01-29 Thread shane knapp
The master builds triggered around ~1am last night (according to the logs), so it looks like we're back in business. On Wed, Jan 28, 2015 at 10:32 PM, shane knapp wrote: > np! the master builds haven't triggered yet, but let's give the Rube > Goldberg machine a minute to get its bearings. > >

Re: Data source API | Support for dynamic schema

2015-01-29 Thread Aniket Bhatnagar
Thanks Reynold and Cheng. It does seem quite a bit of heavy lifting to have a schema per row. For now I will settle for doing a union schema of all the schema versions and complaining about any incompatibilities :-) Looking forward to doing great things with the API! Thanks, Aniket On Thu Jan 29 2015
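
A hedged, API-agnostic illustration of the "union schema" fallback Aniket describes (field names and types are made up): collect the fields of every schema version, take their union, and complain if the same field appears with incompatible types.

    # each schema version as a mapping of field name -> type name
    v1 = {"id": "long", "name": "string"}
    v2 = {"id": "long", "name": "string", "email": "string"}

    def union_schema(*versions):
        merged = {}
        for schema in versions:
            for field, ftype in schema.items():
                if field in merged and merged[field] != ftype:
                    raise ValueError("incompatible types for field %r: %s vs %s"
                                     % (field, merged[field], ftype))
                merged[field] = ftype
        return merged

    print(union_schema(v1, v2))   # the union of both versions' fields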