Hi all,
We have noticed a lot of broadcast timeouts in our pipelines, and after some
inspection, it seems that they happen when I have two threads trying to
save two different DataFrames. We use the FIFO scheduler, so if I launch a
job that needs all the executors, the second DataFrame's collect on
> the typetags for arguments but haven't gotten around to using them
> yet. I think it'd make sense to have them and do the auto cast, but we can
> have rules in analysis to forbid certain casts (e.g. don't auto cast double
> to int).
>
>
> On Sat, May 30, 2015 at 7:12 AM,
Hey guys,
BLUF: sorry for the length of this email. I'm trying to figure out how to batch
Python UDF executions, and since this is my first time messing with
Catalyst, I'd appreciate any feedback.
My team is starting to use PySpark UDFs quite heavily, and performance is a
huge blocker. The extra roundtrip
Hi,
If I had a df and I wrote it out via partitionBy("id"), presumably, when I
load the df back in and do a groupBy("id"), a shuffle shouldn't be necessary, right?
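For concreteness, a rough sketch of what I mean (the output path is just a
placeholder):

df.write.partitionBy("id").parquet("/data/events")

reloaded = sqlContext.read.parquet("/data/events")
# ideally this groupBy could reuse the on-disk partitioning by id
# instead of triggering a full shuffle
counts = reloaded.groupBy("id").count()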
right? Effectively, we can load in the dataframe with a hash partitioner
already set, since each task can simply read all the folders where
id= whe
Filed here:
https://issues.apache.org/jira/browse/SPARK-12157
On Sat, Dec 5, 2015 at 3:08 PM Reynold Xin wrote:
> Not aware of any jira ticket, but it does sound like a great idea.
>
>
> On Sat, Dec 5, 2015 at 11:03 PM, Justin Uang
> wrote:
>
>> Hi,
>>
>
Hi,
I have fallen into the trap of returning numpy types from UDFs, such as
np.float64 and np.int. It's hard to spot the issue because they behave
pretty much like regular pure Python floats and ints, so can we make
PySpark translate them automatically?
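A minimal sketch of the kind of UDF I mean (the functions themselves are
made up):

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# returns np.float64 rather than a plain Python float
buggy_mean = udf(lambda xs: np.mean(xs), DoubleType())

# current workaround: cast back to a builtin type before returning
safe_mean = udf(lambda xs: float(np.mean(xs)), DoubleType())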
If so, I'll create a Jira ticket.
Justin
Hi,
I have seen massive gains with the broadcast hint for joins with
DataFrames, and I was wondering if we have thought about allowing the
broadcast hint for the implementation of subtract and intersect.
Right now, when I try it, it says that there is no plan for the broadcast
hint.
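For reference, roughly what I'm trying (the DataFrame names are
placeholders):

from pyspark.sql.functions import broadcast

# this works today and gives the gains mentioned above
joined = large_df.join(broadcast(small_df), "id")

# this is what I'd like to be able to do, but the hint currently has no plan
remaining = large_df.subtract(broadcast(small_df))
common = large_df.intersect(broadcast(small_df))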
Justin
Hi,
I was looking through some of the PRs slated for 1.6.0 and noticed
something called a Dataset, which looks like a new concept judging from the
scaladoc for the class. Can anyone point me to some references or design
docs regarding the decision to introduce the new concept? I presume it is probably
't even think the
> current offheap API makes much sense, and we should consider just removing
> it to simplify things.
>
> On Tue, Nov 3, 2015 at 1:20 PM, Justin Uang wrote:
>
>> Alright, we'll just stick with normal caching then.
>>
>> Just for future
books can be idle for long periods of time while holding
onto cached rdds.
On Tue, Nov 3, 2015 at 10:15 PM Reynold Xin wrote:
> It is lost unfortunately (although can be recomputed automatically).
>
>
> On Tue, Nov 3, 2015 at 1:13 PM, Justin Uang wrote:
>
>> Thanks for your res
Is the Manager a Python multiprocessing manager? Why are you using
parallelism in Python when, theoretically, most of the heavy lifting is done
via Spark?
On Wed, Oct 28, 2015 at 4:27 PM agg212 wrote:
> I would just like to be able to put a Spark DataFrame in a manager.dict()
> and
> be able to ge
ally
> simplify the internals.
>
>
>
>
> On Tue, Nov 3, 2015 at 7:59 AM, Justin Uang wrote:
>
>> Yup, but I'm wondering what happens when an executor does get removed,
>> but when we're using tachyon. Will the cached data still be available,
>> since
http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation>
> controls this.
>
> On Fri, Oct 30, 2015 at 12:14 PM Justin Uang
> wrote:
>
>> Hey guys,
>>
>> According to the docs for 1.5.1, when an executor is removed for dynamic
>> allocation, t
Hey guys,
According to the docs for 1.5.1, when an executor is removed under dynamic
allocation, its cached data is gone. If I use off-heap storage like
Tachyon, conceptually this issue goes away, but is the cached
data still available in practice? This would be great because then we would
Cool, filed here: https://issues.apache.org/jira/browse/SPARK-10915
On Fri, Oct 2, 2015 at 3:21 PM Reynold Xin wrote:
> No, not yet.
>
>
> On Fri, Oct 2, 2015 at 12:20 PM, Justin Uang
> wrote:
>
>> Hi,
>>
>> Is there a Python API for UDAFs?
>>
>> Thanks!
>>
>> Justin
>>
>
>
Hi,
Is there a Python API for UDAFs?
Thanks!
Justin
Hi,
What is the normal workflow for the core devs?
- Do we need to build the assembly jar to be able to run it from the spark
repo?
- Do you use sbt or maven to do the build?
- Is zinc only usable with Maven?
I'm asking because my current process is to do a full sbt build, which
means
One other question: Do we have consensus on publishing the pip-installable
source distribution to PyPI? If so, is that something that the maintainers
need to add to the process that they use to publish releases?
On Thu, Aug 20, 2015 at 5:44 PM Justin Uang wrote:
> I would prefer to just do
g 10, 2015 at 11:23 AM, Davies Liu <
>> dav...@databricks.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> I think so, any contributions on this are welcome.
>> >> >>>>
>> >&g
as a comment to that ticket
> and we'll make sure to test both cases and break it out if the root cause
> ends up being different.
>
> On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang
> wrote:
>
>> Sweet! Does this cover DataFrame#rdd also using the cached query from
>&
t; On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang
> wrote:
>
>> Hey guys,
>>
>> I'm running into some pretty bad performance issues when it comes to
>> using a CrossValidator, because of caching behavior of DataFrames.
>>
>> The root of the problem
tually work until you
> install Pandas.
>
> Punya
> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
> wrote:
>
>> // + *Davies* for his comments
>> // + Punya for SA
>>
>> For development and CI, like Olivier mentioned, I think it would be
>> hugely be
Hey guys,
I'm running into some pretty bad performance issues when it comes to using
a CrossValidator, because of the caching behavior of DataFrames.
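Roughly my setup, as a sketch (the estimator, grid, and column names are
placeholders):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train = featurized_df.cache()  # cached at the DataFrame level

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator())

# the per-fold DataFrames that CrossValidator builds internally are not
# themselves cached, so a lot of work is redone for every fold/param setting
model = cv.fit(train)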
The root of the problem is that while I have cached my DataFrame
representing the features and labels, it is caching at the DataFrame level,
while Cros
// + *Davies* for his comments
// + Punya for SA
For development and CI, like Olivier mentioned, I think it would be hugely
beneficial to publish pyspark (just the code in the python/ dir) on PyPI. If
anyone wants to develop against the PySpark APIs, they currently need to
download the whole distribution and do a lot of
My guess is that if you are just wrapping the Spark SQL APIs, you can get
away with not having to reimplement a lot of the complexity in PySpark,
like storing everything in RDDs as pickled byte arrays, pipelining RDDs,
and doing aggregations and joins in the Python interpreters, etc.
Since the canoni
, you can give it a try.
>
> On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang
> wrote:
> > Correct, I was running with a batch size of about 100 when I did the
> tests,
> > because I was worried about deadlocks. Do you have any concerns regarding
> > the batched synchron
?
On Wed, Jun 24, 2015 at 7:27 PM Davies Liu wrote:
> From your comment, the 2x improvement only happens when you have the
> batch size set to 1, right?
>
> On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang
> wrote:
> > FYI, just submitted a PR to Pyrolite to remove their Sto
r.
>>
>> BTW, what does your UDF look like? How about using Jython to run
>> simple Python UDFs (without external libraries)?
>>
>> On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang
>> wrote:
>> > // + punya
>> >
>> > Thanks for your qu
nding out the PR?
>
> On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang
> wrote:
> > BLUF: BatchPythonEvaluation's implementation is unusable at large scale,
> but
> > I have a proof-of-concept implementation that avoids caching the entire
> > dataset.
> >
&g
BLUF: BatchPythonEvaluation's implementation is unusable at large scale,
but I have a proof-of-concept implementation that avoids caching the entire
dataset.
Hi,
We have been running into performance problems using Python UDFs with
DataFrames at large scale.
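As a representative (made-up) example of the kind of query involved:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

normalize = udf(lambda x: (x - 10.0) / 3.0, DoubleType())

# every row is pickled, shipped to a Python worker, evaluated there, and
# shipped back, which is where the cost shows up at large scale
result = df.withColumn("x_norm", normalize(df["x"]))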
From the implementation of BatchPythonEvaluation
ntial performance gains?
>>
>>
>> On Sat, May 30, 2015 at 9:02 AM, Justin Uang
>> wrote:
>>
>>> On second thought, perhaps this can be done by writing a rule that
>>> builds the DAG of dependencies between expressions, then converts it into
>
approach?
On Sat, May 30, 2015 at 11:30 AM Justin Uang wrote:
> If I do the following
>
> df2 = df.withColumn('y', df['x'] * 7)
> df3 = df2.withColumn('z', df2.y * 3)
> df3.explain()
>
> Then the result is
>
> > Project
If I do the following
df2 = df.withColumn('y', df['x'] * 7)
df3 = df2.withColumn('z', df2.y * 3)
df3.explain()
Then the result is
> Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59
* 7.0) AS y#64 * 3.0) AS z#65]
> PhysicalRDD [date#56,id#57,timestamp#58,x
> }
>
> Ideally, we should ask for both argument class and return class, so we can
> do the proper type conversion (e.g. if the UDF expects a string, but the
> input expression is an int, Catalyst can automatically add a cast).
> However, we haven't implemented those in UserDefin
I would like to define a UDF in Java via a closure and then use it without
registration. In Scala, I believe there are two ways to do this:
val myUdf = functions.udf((x: Int) => x + 5)
myDf.select(myUdf(myDf("age")))
or
myDf.select(functions.callUDF((x: Int) => x + 5, DataTypes.IntegerType,
myDf("age")))
; for more information.
> >>> import types
> >>> types
> '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>
>
> Without renaming, our `types.py` will conflict with it when you run
> unittests in pyspark/sql/ .
>
> On Tue, May 26, 2015
In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was
renamed to pyspark/sql/_types.py, and then some magic in
pyspark/sql/__init__.py dynamically renamed the module back to types. I
imagine that this is some naming conflict with Python 3, but what was the
error that showed up?
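My guess at the kind of conflict involved (purely hypothetical sketch, with
an illustrative path):

import sys
# e.g. what effectively happens when tests are run from inside pyspark/sql/
sys.path.insert(0, "python/pyspark/sql")

import types
print(types.__file__)  # would point at pyspark/sql/types.py, not the stdlib module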
Th
I'm working on one of the Palantir teams using Spark, and here is our
feedback:
We have encountered three issues when upgrading to Spark 1.4.0. I'm not
sure they qualify as a -1, as they come from using non-public APIs and
multiple SparkContexts for testing purposes, but I do want to bring
We ran into an issue regarding Strings in UDTs when upgrading to Spark
1.4.0-rc. I understand that it's a non-public API, so breakage is expected, but I
just wanted to bring it up for awareness and so we can maybe change the
release notes to mention them =)
Our UDT was serializing to a StringType, but n
To do it in one pass, conceptually what you would need to do is to consume
the entire parent iterator and store the values either in memory or on
disk, which is generally something you want to avoid given that the parent
iterator length is unbounded. If you need to start spilling to disk, you
might
Hi,
I have been trying to figure out how to ship a Python package that I have
been working on, and this has brought up a couple of questions. Please
note that I'm fairly new to Python package management, so any
feedback/corrections are welcome =)
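What I'm doing at the moment, roughly (the package name and path are
placeholders): zip up the package and make it importable on the executors,
either with --py-files at submit time or programmatically:

sc.addPyFile("/path/to/mypackage.zip")

import mypackage  # now importable inside UDFs / mapPartitions on the executors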
It looks like the --py-files support we have mer
Hi,
I have a question regarding SQLContext#createDataFrame(JavaRDD[Row],
java.util.List[String]). It looks like when I try to call it, it results in
an infinite recursion that overflows the stack. I filed it here:
https://issues.apache.org/jira/browse/SPARK-6999.
What is the best way to fix this?