Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Fair points, I also like simpler solutions. The overhead of a Python task could be a few milliseconds, which means we should also evaluate them as batches (one Python task per batch). Decreasing the batch size for UDFs sounds reasonable to me, together with other tricks to reduce the data in socket/p
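
A minimal sketch of the batching idea described above, using a hypothetical eval_in_batches helper (this is not the actual BatchPythonEvaluation code): the UDF is applied one batch at a time, so per-invocation overhead is amortized while the number of rows held in memory stays bounded by the batch size.

# Sketch only: batch-wise evaluation of a UDF over an iterator of rows.
from itertools import islice

def eval_in_batches(udf, rows, batch_size=100):
    """Yield udf(row) for each row, pulling rows in batches of `batch_size`."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # One "round trip" per batch instead of one per row.
        for result in map(udf, batch):
            yield result

# Hypothetical usage:
# results = list(eval_in_batches(lambda x: x * 2, range(10), batch_size=4))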

[VOTE] Release Apache Spark 1.4.1

2015-06-23 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in Spark 1.4.0, listed here: http://s.apache.org/spark-1.4.1 The tag to be voted on is v1.4.1-rc1 (commit 60e08e5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=co

Re: Python UDF performance at large scale

2015-06-23 Thread Justin Uang
// + punya Thanks for your quick response! I'm not sure that using an unbounded buffer is a good solution to the locking problem. For example, in the situation where I have 500 columns, I am in fact storing 499 extra columns on the Java side, which might make me OOM if I have to store many rows. I

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Thanks for looking into it, I like the idea of having a ForkingIterator. If we have an unlimited buffer in it, then we will not have the deadlock problem, I think. The writing thread will be blocked by the Python process, so there will not be many rows buffered (still a potential cause of OOM). At least, this
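
A rough, single-threaded sketch of the ForkingIterator idea with hypothetical names (not Spark's implementation): one upstream iterator is split into two consumers, and whichever side runs ahead leaves its items in the other side's buffer. The buffer is unbounded here, which avoids the deadlock but is also the potential OOM mentioned above.

from collections import deque

def fork(source):
    """Split one iterator into two; whichever side runs ahead fills the other
    side's buffer (essentially what itertools.tee does). Single-threaded sketch."""
    src = iter(source)
    buf_a, buf_b = deque(), deque()

    def side(my_buf, other_buf):
        while True:
            if my_buf:
                yield my_buf.popleft()
            else:
                try:
                    item = next(src)
                except StopIteration:
                    return
                other_buf.append(item)   # unbounded buffer: no deadlock, but can grow
                yield item

    return side(buf_a, buf_b), side(buf_b, buf_a)

# left, right = fork(range(5))
# list(left) == list(right) == [0, 1, 2, 3, 4]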

Python UDF performance at large scale

2015-06-23 Thread Justin Uang
BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but I have a proof-of-concept implementation that avoids caching the entire dataset. Hi, We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPyth
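
For context, a minimal example (not taken from the thread itself) of the kind of workload under discussion -- a Python UDF applied to a DataFrame column, which in 1.4-era PySpark is planned through BatchPythonEvaluation:

# Minimal illustration of a Python UDF on a DataFrame column.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

sc = SparkContext(appName="udf-example")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(i, i * 2) for i in range(1000)], ["a", "b"])

double_it = udf(lambda x: x * 2, IntegerType())   # evaluated in Python worker processes
result = df.withColumn("a_doubled", double_it(df["a"]))
result.show(5)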

Re: how can I write a language "wrapper"?

2015-06-23 Thread Matei Zaharia
Just FYI, it would be easiest to follow SparkR's example and add the DataFrame API first. Other APIs will be designed to work on DataFrames (most notably machine learning pipelines), and the surface of this API is much smaller than that of the RDD API. This API will also give you great performance as

Re: how can I write a language "wrapper"?

2015-06-23 Thread Shivaram Venkataraman
Every language has its own quirks / features -- so I don't think there exists a document on how to go about doing this for a new language. The most closely related write-up I know of is the wiki page on PySpark internals https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals written by Josh Ro

how can I write a language "wrapper"?

2015-06-23 Thread Vasili I. Galchin
Hello, I want to add support for another language (other than Scala, Java, et al.). Where is the documentation that explains how to provide support for a new language? Thank you, Vasili

Re: Calculating tuple count /input rate with time

2015-06-23 Thread Tathagata Das
This should give an accurate count for each batch, though for getting the rate you have to make sure that your streaming app is stable, that is, batches are processed as fast as they are received (scheduling delay in the Spark Streaming UI is approx 0). TD On Tue, Jun 23, 2015 at 2:49 AM, anshu shukl

[DataFrame] partitionBy issues

2015-06-23 Thread vladio
Hi, I'm running into a strange memory scaling issue when using the partitionBy feature of DataFrameWriter. I've generated a table (a CSV file) with 3 columns (A, B and C) and 32*32 different entries, with a size on disk of about 20 KB. There are 32 distinct values for column A and 32 distinct values
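
A sketch approximating the setup described above, since the exact reproduction is not included in this digest; the choice of partition column and output format here is an assumption.

# Approximate reproduction sketch: 32*32 rows with columns A, B, C, written
# with DataFrameWriter.partitionBy (the actual script from the thread is not shown).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="partitionby-example")
sqlContext = SQLContext(sc)

rows = [(a, b, a * 32 + b) for a in range(32) for b in range(32)]
df = sqlContext.createDataFrame(rows, ["A", "B", "C"])

# One output directory per distinct value of A (32 partitions on disk).
df.write.partitionBy("A").parquet("/tmp/partitionby_repro")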

Re: [SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-23 Thread Michael Armbrust
Can you file a JIRA please? On Tue, Jun 23, 2015 at 1:42 AM, StanZhai wrote: > Hi all, > > After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered > the > following exception when using concat with a UDF in the where clause: > > ===Exception > org.apa

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
Hey Spark devs, I've been looking at DF UDFs and UDAFs. The approximate distinct count is using HyperLogLog, but there is only an option to return the count as a Long. It can be useful to be able to return and store the actual data structure (i.e., the serialized HLL). This effectively allows one to do aggregation
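
A toy, pure-Python illustration (not Spark's implementation) of why storing the sketch rather than only the Long count is useful: two HLL sketches merge by an element-wise max of their registers, so partial aggregates can be combined later without rescanning the data.

import hashlib

M = 1 << 14  # number of registers (toy precision choice)

def new_sketch():
    return [0] * M

def add(sketch, value):
    h = int(hashlib.sha1(str(value).encode("utf-8")).hexdigest(), 16)
    idx = h % M              # which register this value lands in
    rest = h // M            # remaining hash bits
    rho = 1
    while rest % 2 == 0 and rho < 64:   # position of the lowest 1-bit
        rho += 1
        rest //= 2
    sketch[idx] = max(sketch[idx], rho)

def merge(a, b):
    # Merging two sketches is just an element-wise max of registers; the
    # cardinality estimate from the registers is omitted here for brevity.
    return [max(x, y) for x, y in zip(a, b)]

s1, s2 = new_sketch(), new_sketch()
for v in range(1000):
    add(s1, v)
for v in range(500, 1500):
    add(s2, v)
union_sketch = merge(s1, s2)   # sketch of the union, without re-scanning the data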

Calculating tuple count /input rate with time

2015-06-23 Thread anshu shukla
I am calculating the input rate using the following logic, and I think this foreachRDD is always running on the driver (the printlns are seen on the driver). 1- Is there any other way to do this at lower cost? 2- Will this give me the correct count for the rate? //code - inputStream.foreachRDD(new Function, Void
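
The snippet above is Java and truncated in this digest; as a hedged PySpark sketch of the same pattern, the per-batch count can be taken in foreachRDD and divided by the batch interval to get a rate. Note that rdd.count() runs on the executors; only the resulting number (and the print) lives on the driver.

# PySpark Streaming sketch: count records per batch and derive a rate.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

BATCH_INTERVAL = 2  # seconds

sc = SparkContext(appName="input-rate")
ssc = StreamingContext(sc, BATCH_INTERVAL)

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source

def print_rate(time, rdd):
    count = rdd.count()  # distributed count; only the result returns to the driver
    print("%s: %d records, ~%.1f records/sec" % (time, count, count / float(BATCH_INTERVAL)))

lines.foreachRDD(print_rate)

ssc.start()
ssc.awaitTermination()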

OK to add committers active on JIRA to JIRA admin role?

2015-06-23 Thread Sean Owen
There are some committers who are active on JIRA and sometimes need to do things that require JIRA admin access -- in particular, I'm thinking of adding a new person as "Contributor" in order to assign them a JIRA. We can't change what roles can do what (I think that INFRA ticket is dead) but we can add to t

[SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-23 Thread StanZhai
Hi all, After upgrading the cluster from Spark 1.3.1 to 1.4.0 (rc4), I encountered the following exception when using concat with a UDF in the where clause: ===Exception org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved ob
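
The full query and stack trace are truncated in this digest; the sketch below is only an assumption about the pattern being described (a registered Python UDF wrapping concat in a WHERE clause), with made-up table, column, and function names, and it assumes a HiveContext so that concat resolves on 1.4.

# Hedged reproduction sketch -- not the original query from the report.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="concat-udf-repro")
sqlContext = HiveContext(sc)  # assumption: concat available via Hive functions on 1.4

df = sqlContext.createDataFrame([("foo", "bar"), ("baz", "qux")], ["a", "b"])
df.registerTempTable("t")

sqlContext.registerFunction("my_udf", lambda s: s.upper())

# concat(...) wrapped in a UDF inside the WHERE clause -- the pattern that
# reportedly fails on 1.4.0 (rc4) with an UnresolvedException.
sqlContext.sql("SELECT * FROM t WHERE my_udf(concat(a, b)) = 'FOOBAR'").show()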