I am not a Spark committer and haven't been working on Spark for a while.
However, I was heavily involved in the original cogroup work and we use the
cogroup functionality pretty heavily, so I want to give my two cents
here.
I think this is a nice improvement and I hope someone from the PySpark
Aha I see. Thanks Hyukjin!
On Tue, Jul 19, 2022 at 9:09 PM Hyukjin Kwon wrote:
> This is done by cloudpickle. They pickle global variables referred within
> the func together, and register it to the global imported modules.
>
> On Wed, 20 Jul 2022 at 00:55, Li Jin wrote:
>
>
Hi,
I have a question about how "imports" get sent to the Python worker.
For example, I have
def foo(x):
    return np.abs(x)
If I run this code directly, it obviously failed (because np is undefined
on the driver process):
sc.parallelize([1, 2, 3]).map(foo).collect()
However, if I add
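For anyone following along, here is a minimal sketch of what Hyukjin describes
above (assuming numpy and pyspark are installed; the local[2] master is just
for illustration): a module-level import referenced inside the function is
captured by cloudpickle as a reference and re-imported on the worker.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
sc = spark.sparkContext

def foo(x):
    # np is a module-level global; cloudpickle serializes the reference and
    # the worker re-imports numpy when it unpickles foo.
    return np.abs(x)

print(sc.parallelize([1, -2, 3]).map(foo).collect())  # [1, 2, 3]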
Hi dear devs,
I recently came across the checkpoint functionality in Spark and found (a
little surprisingly) that checkpoint causes the DataFrame to be computed
twice unless cache is called before checkpoint.
My guess is that this is probably hard to fix and/or maybe the checkpoint
feature is not very freq
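A minimal sketch of the behavior being described (the checkpoint directory is
a hypothetical local path; assumes a local PySpark session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

df = spark.range(1000).selectExpr("id * 2 AS doubled")

# Without cache(), the lineage is computed twice: once for the job that
# materializes the result and once more when the checkpoint files are written.
checkpointed = df.checkpoint()

# Caching first lets the second pass read the cached data instead of
# recomputing the upstream plan.
checkpointed_from_cache = df.cache().checkpoint()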
Li Jin wrote:
> I am going to review this carefully today. Thanks for the work!
>
> Li
>
> On Wed, Jan 1, 2020 at 10:34 PM Hyukjin Kwon wrote:
>
>> Thanks for comments Maciej - I am addressing them.
>> adding Li Jin too.
>>
>> I plan to proceed with this la
I am going to review this carefully today. Thanks for the work!
Li
On Wed, Jan 1, 2020 at 10:34 PM Hyukjin Kwon wrote:
> Thanks for comments Maciej - I am addressing them.
> adding Li Jin too.
>
> I plan to proceed with this late this week or early next week to make it on
> time bef
Dear Spark devs,
I am debugging a weird "Java gateway process exited before sending the
driver its port number" error when creating a SparkSession with PySpark. I
am running the following simple code with pytest:
"
from pyspark.sql import SparkSession
def test_spark():
spark = SparkSession.builde
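For reference, a minimal, self-contained version of such a test (a sketch
assuming pyspark is installed and JAVA_HOME points at a working JDK; the
gateway error usually means the JVM child process failed to start):

from pyspark.sql import SparkSession

def test_spark():
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("pytest-spark")
        .getOrCreate()
    )
    try:
        # A tiny end-to-end check that the session actually works.
        assert spark.range(3).count() == 3
    finally:
        spark.stop()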
Thanks for summary!
I have a question that is semi-related - What's the process to propose a
feature to be included in the final Spark 3.0 release?
In particular, I am interested in
https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the
work so want to make sure I don't miss the
Congrats to all!
On Tue, Sep 17, 2019 at 6:51 PM Bryan Cutler wrote:
> Congratulations, all well deserved!
>
> On Thu, Sep 12, 2019, 3:32 AM Jacek Laskowski wrote:
>
>> Hi,
>>
>> What a great news! Congrats to all awarded and the community for voting
>> them in!
>>
>> p.s. I think it should go
ou access. Can you try
> again now?
>
> thanks,
>
> Chris
>
>
>
> On Mon, Apr 15, 2019 at 9:49 PM Li Jin wrote:
>
>> Hi Chris,
>>
>> Thanks! The permissions on the Google doc may not be set up properly; I
>> cannot view the doc by default.
>&
>> I agree, it would be great to have a document to comment on.
>>
>> The main thing that stands out right now is that this is only for PySpark
>> and states that it will not be added to the Scala API. Why not make this
>> available since most of the work would be do
level and I don't
>> think it's necessary to include details of the Python worker, we can hash
>> that out after the SPIP is approved.
>>
>> Bryan
>>
>> On Mon, Apr 8, 2019 at 10:43 AM Li Jin wrote:
>>
>>> Thanks Chris, look forward
traightforward. If anyone has any ideas as to how
> this might be achieved in an elegant manner I’d be happy to hear them!
>
> Thanks,
>
> Chris
>
> On 26 Feb 2019, at 14:55, Li Jin wrote:
>
> Thank you both for the reply. Chris and I have very similar use cases for
>
Hi,
Pandas UDFs support struct-type input. However, note that the struct will be
turned into a Python dict, because pandas itself does not have a native
struct type.
On Fri, Mar 8, 2019 at 2:55 PM peng yu wrote:
> Yeah, that seems most likely i have wanted, does the scalar Pandas UDF
> support input is a
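A hedged sketch of passing a struct column into a scalar Pandas UDF. On the
Spark versions discussed here the struct arrives as a Series of Python dicts,
as described above; the sketch below assumes a more recent Spark release,
where it arrives as a pandas DataFrame instead.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, struct

spark = SparkSession.builder.master("local[2]").getOrCreate()

@pandas_udf("double")
def sum_fields(s: pd.DataFrame) -> pd.Series:
    # Each struct field shows up as a column of the pandas DataFrame.
    return s["a"] + s["b"]

df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["a", "b"])
df.select(sum_fields(struct("a", "b")).alias("total")).show()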
based or union
> all collect list based).
>
> I might be biased, but find the approach very useful in project to
> simplify and speed up transformations, and remove a lot of intermediate
> stages (distinct + join => just cogroup).
>
> Plus spark 2.4 introduced a lot of
I am wondering whether other people have opinions or use cases for cogroup.
On Wed, Feb 20, 2019 at 5:03 PM Li Jin wrote:
> Alessandro,
>
> Thanks for the reply. I assume by "equi-join", you mean "equality full
> outer join".
>
> Two issues I see with equality outer
"groupby" on such key
> 3) finally apply a udaf (you can have a look here if you are not familiar
> with them
> https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html), that
> will process each group "in isolation".
>
> HTH,
> Alessandro
>
> On
Hi,
We have been using PySpark's groupby().apply() quite a bit and it has been
very helpful in integrating Spark with our existing pandas-heavy libraries.
Recently, we have found more and more cases where groupby().apply() is not
sufficient - in some cases, we want to group two dataframes by the
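For readers finding this thread later, a hedged sketch of the
groupby().cogroup().applyInPandas() API that later Spark releases expose for
exactly this kind of use case (data and schema are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

df1 = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["id", "v1"])
df2 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "v2"])

def merge_groups(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Called once per key with the matching rows from each DataFrame.
    return pd.merge(left, right, on="id")

result = (
    df1.groupby("id")
    .cogroup(df2.groupby("id"))
    .applyInPandas(merge_groups, schema="id long, v1 double, v2 string")
)
result.show()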
equivalent in python?
>
> thanks,
> Imran
>
> On Fri, Dec 7, 2018 at 1:33 PM Li Jin wrote:
>
>> Imran,
>>
>> Thanks for sharing this. When working on interop between Spark and
>> Pandas/Arrow in the past, we also faced some issues due to the different
>
Imran,
Thanks for sharing this. When working on interop between Spark and
Pandas/Arrow in the past, we also faced some issues due to the different
definitions of timestamp in Spark and Pandas/Arrow, because Spark timestamp
has Instant semantics and Pandas/Arrow timestamp has either LocalDateTime
o
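A hedged sketch of the difference being described: Spark's TIMESTAMP holds an
instant and renders it in the session time zone, while a tz-naive pandas
Timestamp is just a local wall-clock value with no instant attached.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.sql("SELECT timestamp '2018-01-01 12:00:00' AS ts")
df.show()  # 2018-01-01 12:00:00, interpreted and rendered in UTC

spark.conf.set("spark.sql.session.timeZone", "America/New_York")
df.show()  # 2018-01-01 07:00:00, the same instant shown in the new time zone

# The pandas value carries no time zone unless one is attached explicitly.
print(pd.Timestamp("2018-01-01 12:00:00").tz is None)  # True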
> (2) If the method forces evaluation and this matches the most obvious way
that it would be implemented, then we should add it with a note in the docstring
I am not sure about this, because forced evaluation could have side effects.
For example, df.count() can realize a cache, and if we
implement _
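A minimal sketch of the side effect mentioned above (local session for
illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.range(1000).cache()  # marks the DataFrame for caching; nothing runs yet
df.count()                      # forces evaluation and materializes the cache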
Although I am not specifically involved in DSv2, I think having this kind
of meeting is definitely helpful to discuss, move certain effort forward
and keep people on the same page. Glad to see this kind of working group
happening.
On Thu, Oct 25, 2018 at 5:58 PM John Zhuge wrote:
> Great idea!
>
:
> Definitely!
> numba numbers are amazing
>
> --
> *From:* Wes McKinney
> *Sent:* Saturday, September 8, 2018 7:46 AM
> *To:* Li Jin
> *Cc:* dev@spark.apache.org
> *Subject:* Re: [DISCUSS] PySpark Window UDF
>
> hi Li,
>
> These result
people think? I'd love to
hear the community's feedback.
Links:
You can reproduce the benchmark with the numpy variant by using this branch:
https://github.com/icexelloss/spark/tree/window-udf-numpy
PR link:
https://github.com/apache/spark/pull/22305
On Wed, May 16, 2018 at 3:34 PM Li Jin wrote:
Hi Imran,
My understanding is that doctests and unittests are orthogonal - doctests
are used to make sure docstring examples are correct and are not meant to
replace unittests.
Functionality is covered by unit tests to ensure correctness, and doctests
are used to test the docstring, not the func
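A minimal sketch of the distinction (plain Python, nothing Spark-specific):
the doctest checks that the docstring example stays correct, while separate
unit tests would cover the behavior more thoroughly.

def double(x):
    """Return x multiplied by two.

    >>> double(3)
    6
    """
    return x * 2

if __name__ == "__main__":
    import doctest
    doctest.testmod()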
Thanks for looking into this Shane. If we can only have a single python 3
version, I agree 3.6 would be better than 3.5. Otherwise, ideally I think
it would be nice to test all supported 3.x versions (latest micros should
be fine).
On Mon, Aug 20, 2018 at 7:07 PM shane knapp wrote:
> initially,
I agree with Bryan. If it's acceptable to have another job to test with
Python 3.5 and pyarrow 0.10.0, I am leaning towards upgrading Arrow.
Arrow 0.10.0 has tons of bug fixes and improvements over 0.8.0, including
important memory leak fixes such as
https://issues.apache.org/jira/browse/ARROW-1973. I
Hi Linar,
This seems useful. But perhaps reusing the same function name is better?
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame
Currently createDataFrame takes an RDD of any kind of SQL data
representation (e.g. row, tuple, int, boolean,
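A hedged sketch of the overloading pattern being suggested: createDataFrame
already accepts several input kinds under the same name, for example:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# From a list of tuples plus column names.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# From a pandas DataFrame (converted through Arrow when it is enabled).
df2 = spark.createDataFrame(pd.DataFrame({"id": [1, 2], "label": ["a", "b"]}))

df1.show()
df2.show()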
Hi Xiangrui,
Thanks for sending this out. I have left some comments on the google doc:
https://docs.google.com/document/d/1dFOFV3LOu6deSNd8Ndp87-wsdxQqA9cnkuj35jUlrmQ/edit#heading=h.84jotgsrp6bj
Look forward to your response.
Li
On Mon, Jun 18, 2018 at 11:33 AM, Xiangrui Meng wrote:
> Hi all,
uld do the
> code review more carefully.
>
> Xiao
>
> 2018-06-14 9:18 GMT-07:00 Li Jin :
>
>> Are there objections to restoring the behavior for PySpark users? I am happy
>> to submit a patch.
>>
>> On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin wrote:
>>
Are there objections to restoring the behavior for PySpark users? I am happy
to submit a patch.
On Thu, Jun 14, 2018 at 12:15 PM Reynold Xin wrote:
> The behavior change is not good...
>
> On Thu, Jun 14, 2018 at 9:05 AM Li Jin wrote:
>
>> Ah, looks like it's this change:
&
ile? but if so that
> would have been true for a while now.
>
>
> On Thu, Jun 14, 2018 at 10:38 AM Li Jin wrote:
>
>> Hey all,
>>
>> I just did a clean checkout of github.com/apache/spark but failed to
>> start PySpark, this is what I did:
>>
>> gi
I can work around it by using:
bin/pyspark --conf spark.sql.catalogImplementation=in-memory
for now, but I still wonder what's going on with HiveConf.
On Thu, Jun 14, 2018 at 11:37 AM, Li Jin wrote:
> Hey all,
>
> I just did a clean checkout of github.com/apache/spark but failed to
&
Hey all,
I just did a clean checkout of github.com/apache/spark but failed to start
PySpark; this is what I did:
git clone git@github.com:apache/spark.git; cd spark; build/sbt package;
bin/pyspark
And got this exception:
(spark-dev) Lis-MacBook-Pro:spark icexelloss$ bin/pyspark
Python 3.6.3 |
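For completeness, a hedged sketch of the in-memory catalog workaround
mentioned above, applied programmatically (relevant when the build was made
without the Hive profile):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    # Avoid requiring Hive classes by forcing the in-memory catalog.
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()
)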
n Fri, Jun 8, 2018 at 3:57 PM, Herman van Hövell tot Westerflier <
her...@databricks.com> wrote:
> But that is still cheaper than executing that expensive UDF for each row
> in your dataset right?
>
> On Fri, Jun 8, 2018 at 9:51 PM Li Jin wrote:
>
>> I see. Thanks fo
on is not run right? So it is
> still lazy.
>
>
> On Fri, Jun 8, 2018 at 12:35 PM Li Jin wrote:
>
>> Hi All,
>>
>> Sorry for the long email title. I am a bit surprised to find that the
>> current optimizer rule "ConvertToLocalRelation" causes expre
Hi All,
Sorry for the long email title. I am a bit surprised to find that the
current optimizer rule "ConvertToLocalRelation" causes expressions to be
eagerly evaluated in the planning phase; this can be demonstrated with the
following code:
scala> val myUDF = udf((x: String) => { println("UDF evaled")
@aexp.com.invalid> wrote:
>
>> I agree
>>
>>
>>
>>
>>
>>
>>
>> Thanks
>>
>> Himanshu
>>
>>
>>
>> *From:* Li Jin [mailto:ice.xell...@gmail.com]
>> *Sent:* Friday, March 23, 2018 8:24 PM
>> *To:
I'd like to bring https://issues.apache.org/jira/browse/SPARK-24373 to
people's attention because it could be a regression from 2.2.
I will leave it to more experienced people to decide whether this should be
a blocker or not.
On Wed, May 23, 2018 at 12:54 PM, Marcelo Vanzin
wrote:
> Sure. Also,
Hi All,
I have been looking into leveraging the Arrow and Pandas UDF work we have
done so far for window UDFs in PySpark. I have done some investigation and
believe there is a way to do PySpark window UDFs efficiently.
The basic idea is instead of passing each window to Python separately, we
can pass
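For context, a hedged sketch of what a Pandas UDF over a window looks like
with the grouped-aggregate flavor that later Spark releases support, i.e. the
kind of usage this proposal aims to make efficient:

import pandas as pd
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ["id", "v"])

@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    # Receives the rows of one window frame as a pandas Series.
    return v.mean()

w = Window.partitionBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_mean", mean_udf("v").over(w)).show()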
ng, the WINDOW
>>>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND
>>>>> UNBOUNDED FOLLOWING.
>>>>
>>>>
>>>> It sort of makes sense if you think about it. If there is no ordering
>>>> there is no way to have a
PM, Reynold Xin wrote:
> Seems like a bug.
>
>
>
> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin wrote:
>
>> Hi Devs,
>>
>> I am seeing some behavior with window functions that is a bit unintuitive
>> and would like to get some clarification.
>>
>>
Hi Devs,
I am seeing some behavior with window functions that is a bit unintuitive
and would like to get some clarification.
When using an aggregation function with a window, the frame boundary seems
to change depending on the ordering of the window.
Example:
(1)
df = spark.createDataFrame([[0, 1], [0,
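A small sketch of the behavior under discussion (illustrative data): the
default window frame depends on whether the window specification is ordered.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(0, 1), (0, 2), (0, 3)], ["id", "v"])

# No ordering: the default frame is the entire partition, so every row sees 6.
w_unordered = Window.partitionBy("id")
df.withColumn("total", F.sum("v").over(w_unordered)).show()

# With ordering: the default frame is UNBOUNDED PRECEDING to CURRENT ROW,
# so the same aggregate becomes a running total (1, 3, 6).
w_ordered = Window.partitionBy("id").orderBy("v")
df.withColumn("running", F.sum("v").over(w_ordered)).show()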
Hi All,
I came across the two types MatrixUDT and VectorUDT in Spark ML when
doing feature extraction and preprocessing with PySpark. However, when
trying to do some basic operations, such as vector multiplication and
matrix multiplication, I had to go down to a Python UDF.
It seems to me it woul
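A hedged sketch of the kind of fallback being described: even a simple dot
product on VectorUDT columns goes through a Python UDF (data is illustrative).

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 4.0]))], ["u", "v"]
)

@udf("double")
def dot(u, v):
    # The vectors arrive as pyspark.ml.linalg.DenseVector objects on the Python side.
    return float(u.dot(v))

df.select(dot("u", "v").alias("u_dot_v")).show()  # 11.0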
Thanks for the pointer!
On Mon, Mar 12, 2018 at 1:40 PM, Sean Owen wrote:
> (There was also https://github.com/sryza/spark-timeseries -- might be
> another point of reference for you.)
>
> On Mon, Mar 12, 2018 at 10:33 AM Li Jin wrote:
>
>> Hi All,
>>
>> Thi
Hi All,
This is Li Jin. We (my colleagues at Two Sigma and I) have been using Spark
for time series analysis for the past two years, and it has been a success
in scaling up our time series analysis.
Recently, we started a conversation with Reynold about potential
opportunities to collaborate
Congrats!
On Fri, Mar 2, 2018 at 5:49 PM Holden Karau wrote:
> Congratulations and welcome everyone! So excited to see the project grow
> our committer base.
>
> On Mar 2, 2018 2:42 PM, "Reynold Xin" wrote:
>
>> Congrats and welcome!
>>
>>
>> On Fri, Mar 2, 2018 at 10:41 PM, Matei Zaharia
>> wr
Hi community,
Following the instructions on https://spark.apache.org/improvement-proposals.html,
I'd like to propose a SPIP: as-of join in Spark SQL.
Here is the Jira:
https://issues.apache.org/jira/browse/SPARK-22947
If you are interested, please take a look and let me know what you think. I
am look
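For anyone unfamiliar with the term, a hedged sketch (using pandas'
merge_asof, with made-up data) of what an as-of join does: each left row is
matched with the most recent right row at or before its key.

import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime(["2018-01-01 10:00:01", "2018-01-01 10:00:03"]),
    "qty": [100, 200],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2018-01-01 10:00:00", "2018-01-01 10:00:02"]),
    "px": [9.9, 10.1],
})

# Each trade picks up the latest quote at or before its timestamp.
print(pd.merge_asof(trades, quotes, on="time"))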
Sorry, s/ordered distributed/ordered distribution/g
On Mon, Dec 4, 2017 at 10:37 AM, Li Jin wrote:
> Just to give another data point: most of the data we use with Spark are
> sorted on disk, having a way to allow data source to pass ordered
> distributed to DataFrames is really usef
Just to give another data point: most of the data we use with Spark are
sorted on disk, having a way to allow data source to pass ordered
distributed to DataFrames is really useful for us.
On Mon, Dec 4, 2017 at 9:12 AM, Николай Ижиков
wrote:
> Hello, guys.
>
> Thank you for answers!
>
> > I thi
I think this makes sense. PySpark/pandas interop in 2.3 is new anyway, so I
don't think we need to support the new functionality with older versions of
pandas (Takuya's reason 3).
One thing I am not sure about is how complicated it is to support pandas < 0.19.2
with the old non-Arrow interop and require panda