Hi all,
We have noticed a lot of broadcast timeouts in our pipelines, and after some
inspection, it seems that they happen when I have two threads trying to
save two different DataFrames. We use the FIFO scheduler, so if I launch a
job that needs all the executors, the second DataFrame's collect on
> the typetags for arguments but haven't gotten around to using them
> yet. I think it'd make sense to have them and do the auto cast, but we can
> have rules in analysis to forbid certain casts (e.g. don't auto cast double
> to int).
>
>
> On Sat, May 30, 2015 at 7:12 AM,
Hey guys,
BLUF: sorry for the length of this email. I'm trying to figure out how to batch
Python UDF executions, and since this is my first time messing with
Catalyst, I'd appreciate any feedback.
My team is starting to use PySpark UDFs quite heavily, and performance is a
huge blocker. The extra roundtrip
Hi,
If I had a df and I wrote it out via partitionBy("id"), presumably, when I
load the df back in and do a groupBy("id"), a shuffle shouldn't be necessary, right?
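For concreteness, a rough sketch of what I mean (the output path is just a
placeholder):

df.write.partitionBy("id").parquet("/data/events")

reloaded = sqlContext.read.parquet("/data/events")
# ideally this groupBy could reuse the on-disk partitioning by id
# instead of triggering a full shuffle
counts = reloaded.groupBy("id").count()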
right? Effectively, we can load in the dataframe with a hash partitioner
already set, since each task can simply read all the folders where
id= whe
Filed here:
https://issues.apache.org/jira/browse/SPARK-12157
On Sat, Dec 5, 2015 at 3:08 PM Reynold Xin wrote:
> Not aware of any jira ticket, but it does sound like a great idea.
>
>
> On Sat, Dec 5, 2015 at 11:03 PM, Justin Uang
> wrote:
>
>> Hi,
>>
>
Hi,
I have fallen into the trap of returning numpy types from UDFs, such as
np.float64 and np.int. It's hard to spot the issue because they behave
pretty much like regular pure Python floats and ints, so can we make
PySpark translate them automatically?
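A minimal sketch of the kind of UDF I mean (the functions themselves are
made up):

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# returns np.float64 rather than a plain Python float
buggy_mean = udf(lambda xs: np.mean(xs), DoubleType())

# current workaround: cast back to a builtin type before returning
safe_mean = udf(lambda xs: float(np.mean(xs)), DoubleType())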
If so, I'll create a Jira ticket.
Justin
Hi,
I have seen massive gains with the broadcast hint for joins with
DataFrames, and I was wondering if we have thought about allowing the
broadcast hint for the implementation of subtract and intersect.
Right now, when I try it, it says that there is no plan for the broadcast
hint.
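For reference, roughly what I'm trying (the DataFrame names are
placeholders):

from pyspark.sql.functions import broadcast

# this works today and gives the gains mentioned above
joined = large_df.join(broadcast(small_df), "id")

# this is what I'd like to be able to do, but the hint currently has no plan
remaining = large_df.subtract(broadcast(small_df))
common = large_df.intersect(broadcast(small_df))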
Justin
Hi,
I was looking through some of the PRs slated for 1.6.0 and noticed
something called a Dataset, which looks like a new concept judging from the
scaladoc for the class. Can anyone point me to some references or design
docs regarding the decision to introduce the new concept? I presume it is probably
't even think the
> current offheap API makes much sense, and we should consider just removing
> it to simplify things.
>
> On Tue, Nov 3, 2015 at 1:20 PM, Justin Uang wrote:
>
>> Alright, we'll just stick with normal caching then.
>>
>> Just for future
books can be idle for long periods of time while holding
onto cached rdds.
On Tue, Nov 3, 2015 at 10:15 PM Reynold Xin wrote:
> It is lost unfortunately (although can be recomputed automatically).
>
>
> On Tue, Nov 3, 2015 at 1:13 PM, Justin Uang wrote:
>
>> Thanks for your res
Is the Manager a Python multiprocessing manager? Why are you using
parallelism in Python when, theoretically, most of the heavy lifting is done
via Spark?
On Wed, Oct 28, 2015 at 4:27 PM agg212 wrote:
> I would just like to be able to put a Spark DataFrame in a manager.dict()
> and
> be able to ge
ally
> simplify the internals.
>
>
>
>
> On Tue, Nov 3, 2015 at 7:59 AM, Justin Uang wrote:
>
>> Yup, but I'm wondering what happens when an executor does get removed,
>> but when we're using tachyon. Will the cached data still be available,
>> since
http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation>
> controls this.
>
> On Fri, Oct 30, 2015 at 12:14 PM Justin Uang
> wrote:
>
>> Hey guys,
>>
>> According to the docs for 1.5.1, when an executor is removed for dynamic
>> allocation, t
Hey guys,
According to the docs for 1.5.1, when an executor is removed under dynamic
allocation, its cached data is gone. If I use off-heap storage like
Tachyon, conceptually this issue goes away, but is the cached
data still available in practice? This would be great because then we would
Cool, filed here: https://issues.apache.org/jira/browse/SPARK-10915
On Fri, Oct 2, 2015 at 3:21 PM Reynold Xin wrote:
> No, not yet.
>
>
> On Fri, Oct 2, 2015 at 12:20 PM, Justin Uang
> wrote:
>
>> Hi,
>>
>> Is there a Python API for UDAFs?
>>
>> Thanks!
>>
>> Justin
>>
>
>
Hi,
Is there a Python API for UDAFs?
Thanks!
Justin
Hi,
What is the normal workflow for the core devs?
- Do we need to build the assembly jar to be able to run it from the spark
repo?
- Do you use sbt or maven to do the build?
- Is zinc only usable with Maven?
I'm asking because my current process is to do a full sbt build, which
means
One other question: Do we have consensus on publishing the pip-installable
source distribution to PyPI? If so, is that something that the maintainers
need to add to the process that they use to publish releases?
On Thu, Aug 20, 2015 at 5:44 PM Justin Uang wrote:
> I would prefer to just do
g 10, 2015 at 11:23 AM, Davies Liu <
>> dav...@databricks.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> I think so, any contributions on this are welcome.
>> >> >>>>
>> >&g
as a comment to that ticket
> and we'll make sure to test both cases and break it out if the root cause
> ends up being different.
>
> On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang
> wrote:
>
>> Sweet! Does this cover DataFrame#rdd also using the cached query from
>&
t; On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang
> wrote:
>
>> Hey guys,
>>
>> I'm running into some pretty bad performance issues when it comes to
>> using a CrossValidator, because of caching behavior of DataFrames.
>>
>> The root of the problem
tually work until you
> install Pandas.
>
> Punya
> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
> wrote:
>
>> // + *Davies* for his comments
>> // + Punya for SA
>>
>> For development and CI, like Olivier mentioned, I think it would be
>> hugely be
Hey guys,
I'm running into some pretty bad performance issues when it comes to using
a CrossValidator, because of the caching behavior of DataFrames.
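Roughly my setup, as a sketch (the estimator, grid, and column names are
placeholders):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train = featurized_df.cache()  # cached at the DataFrame level

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator())

# the per-fold DataFrames that CrossValidator builds internally are not
# themselves cached, so a lot of work is redone for every fold/param setting
model = cv.fit(train)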
The root of the problem is that while I have cached my DataFrame
representing the features and labels, it is caching at the DataFrame level,
while Cros
// + *Davies* for his comments
// + Punya for SA
For development and CI, like Olivier mentioned, I think it would be hugely
beneficial to publish pyspark (just the code in the python/ dir) on PyPI. If
anyone wants to develop against the PySpark APIs, they currently need to
download the whole distribution and do a lot of
My guess is that if you are just wrapping the Spark SQL APIs, you can get
away with not having to reimplement a lot of the complexity in PySpark,
like storing everything in RDDs as pickled byte arrays, pipelining RDDs,
and doing aggregations and joins in the Python interpreters, etc.
Since the canoni
, you can give it a try.
>
> On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang
> wrote:
> > Correct, I was running with a batch size of about 100 when I did the
> tests,
> > because I was worried about deadlocks. Do you have any concerns regarding
> > the batched synchron
?
On Wed, Jun 24, 2015 at 7:27 PM Davies Liu wrote:
> From your comment, the 2x improvement only happens when you have the
> batch size set to 1, right?
>
> On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang
> wrote:
> > FYI, just submitted a PR to Pyrolite to remove their Sto
r.
>>
>> BTW, what does your UDF look like? How about using Jython to run
>> simple Python UDFs (without external libraries)?
>>
>> On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang
>> wrote:
>> > // + punya
>> >
>> > Thanks for your qu
nding out the PR?
>
> On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang
> wrote:
> > BLUF: BatchPythonEvaluation's implementation is unusable at large scale,
> but
> > I have a proof-of-concept implementation that avoids caching the entire
> > dataset.
> >
&g
BLUF: BatchPythonEvaluation's implementation is unusable at large scale,
but I have a proof-of-concept implementation that avoids caching the entire
dataset.
Hi,
We have been running into performance problems using Python UDFs with
DataFrames at large scale.
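As a representative (made-up) example of the kind of query involved:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

normalize = udf(lambda x: (x - 10.0) / 3.0, DoubleType())

# every row is pickled, shipped to a Python worker, evaluated there, and
# shipped back, which is where the cost shows up at large scale
result = df.withColumn("x_norm", normalize(df["x"]))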
From the implementation of BatchPythonEvaluation
ntial performance gains?
>>
>>
>> On Sat, May 30, 2015 at 9:02 AM, Justin Uang
>> wrote:
>>
>>> On second thought, perhaps this can be done by writing a rule that
>>> builds the DAG of dependencies between expressions, then converts it into
>
approach?
On Sat, May 30, 2015 at 11:30 AM Justin Uang wrote:
> If I do the following
>
> df2 = df.withColumn('y', df['x'] * 7)
> df3 = df2.withColumn('z', df2.y * 3)
> df3.explain()
>
> Then the result is
>
> > Project
If I do the following
df2 = df.withColumn('y', df['x'] * 7)
df3 = df2.withColumn('z', df2.y * 3)
df3.explain()
Then the result is
> Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59
* 7.0) AS y#64 * 3.0) AS z#65]
> PhysicalRDD [date#56,id#57,timestamp#58,x
> }
>
> Ideally, we should ask for both argument class and return class, so we can
> do the proper type conversion (e.g. if the UDF expects a string, but the
> input expression is an int, Catalyst can automatically add a cast).
> However, we haven't implemented those in UserDefin
I would like to define a UDF in Java via a closure and then use it without
registration. In Scala, I believe there are two ways to do this:
val myUdf = functions.udf((x: Int) => x + 5)
myDf.select(myUdf(myDf("age")))
or
myDf.select(functions.callUDF((x: Int) => x + 5, DataTypes.IntegerType,
myDf("age")))
; for more information.
> >>> import types
> >>> types
> '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>
>
> Without renaming, our `types.py` will conflict with it when you run
> unittests in pyspark/sql/ .
>
> On Tue, May 26, 2015
In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was
renamed to pyspark/sql/_types.py, and then some magic in
pyspark/sql/__init__.py dynamically renamed the module back to types. I
imagine that this is some naming conflict with Python 3, but what was the
error that showed up?
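My guess at the kind of conflict involved (purely hypothetical sketch, with
an illustrative path):

import sys
# e.g. what effectively happens when tests are run from inside pyspark/sql/
sys.path.insert(0, "python/pyspark/sql")

import types
print(types.__file__)  # would point at pyspark/sql/types.py, not the stdlib module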
Th
I'm working on one of the Palantir teams using Spark, and here is our
feedback:
We have encountered three issues when upgrading to Spark 1.4.0. I'm not
sure they qualify as a -1, as they come from using non-public APIs and
multiple SparkContexts for testing purposes, but I do want to bring
We ran into an issue regarding Strings in UDTs when upgrading to Spark
1.4.0-rc. I understand that it's a non-public API, so breakage is expected, but I
just wanted to bring it up for awareness and so we can maybe change the
release notes to mention them =)
Our UDT was serializing to a StringType, but n
To do it in one pass, conceptually what you would need to do is to consume
the entire parent iterator and store the values either in memory or on
disk, which is generally something you want to avoid given that the parent
iterator length is unbounded. If you need to start spilling to disk, you
might
Hi,
I have been trying to figure out how to ship a Python package that I have
been working on, and this has brought up a couple of questions. Please
note that I'm fairly new to Python package management, so any
feedback/corrections are welcome =)
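What I'm doing at the moment, roughly (the package name and path are
placeholders): zip up the package and make it importable on the executors,
either with --py-files at submit time or programmatically:

sc.addPyFile("/path/to/mypackage.zip")

import mypackage  # now importable inside UDFs / mapPartitions on the executors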
It looks like the --py-files support we have mer
Hi,
I have a question regarding SQLContext#createDataFrame(JavaRDD[Row],
java.util.List[String]). It looks like when I try to call it, it results in
an infinite recursion that overflows the stack. I filed it here:
https://issues.apache.org/jira/browse/SPARK-6999.
What is the best way to fix this?