Thanks for the feedback.

Hyukjin Kwon:
> My only worry is users who depend on lower pandas versions

That's what I was worried about, and it's one of the reasons I moved this
discussion here.

Li Jin:
> how complicated it is to support pandas < 0.19.2 with old non-Arrow
> interops

In my original PR (https://github.com/apache/spark/pull/19607), we fix the
behavior of timestamp values for Pandas.
If we need to keep supporting old Pandas, we will at least need workarounds
like the ones at the following link:
https://github.com/apache/spark/blob/e919ed55758f75733d56287d5a49326b1067a44c/python/pyspark/sql/types.py#L1718-L1774
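
For reference, here is a minimal sketch of that kind of version-gated
workaround. The helper name _localize_timestamps and the exact fallback
are illustrative assumptions on my part, not the actual code in types.py:

    # Illustrative sketch only; not the actual helper in pyspark/sql/types.py.
    from distutils.version import LooseVersion

    import pandas as pd

    def _localize_timestamps(series, timezone):
        """Localize naive UTC timestamp values to the given timezone."""
        if LooseVersion(pd.__version__) >= LooseVersion("0.19.2"):
            # From 0.19.2 on, the vectorized .dt accessor handles the
            # conversion reliably.
            return series.dt.tz_localize("UTC").dt.tz_convert(timezone)
        # Older pandas needs a slower per-value fallback because the
        # vectorized path mishandles some timestamp values.
        return series.apply(
            lambda ts: ts.tz_localize("UTC").tz_convert(timezone)
            if pd.notnull(ts) else ts)

Every code path that touches timestamps would need a branch like this, which
is the maintenance burden we'd like to avoid.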


Thanks.


On Wed, Nov 15, 2017 at 12:59 AM, Li Jin <ice.xell...@gmail.com> wrote:

> I think this makes sense. The PySpark/Pandas interops in 2.3 are new anyway,
> so I don't think we need to support the new functionality with older versions
> of pandas (Takuya's reason 3).
>
> One thing I am not sure about is how complicated it would be to support
> pandas < 0.19.2 with the old non-Arrow interops while requiring pandas >=
> 0.19.2 for the new Arrow interops. Maybe it makes sense to let users keep
> using their existing PySpark code if they don't want to use any of the new
> stuff. If this is still too complicated, I would lean towards not supporting
> < 0.19.2.
>
>
> On Tue, Nov 14, 2017 at 6:04 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> +0 to drop it, as I said in the PR. I am seeing that it makes it hard to
>> get the cool changes through and is slowing them down from getting pushed.
>>
>> My only worry is users who depend on lower pandas versions (pandas 0.19.2
>> seems to have been released less than a year ago, around the same time
>> Spark 2.1.0 was released).
>>
>> If this worry is smaller than I expect, I definitely support it. It should
>> speed up those cool changes.
>>
>>
>> On 14 Nov 2017 7:14 pm, "Takuya UESHIN" <ues...@happy-camper.st> wrote:
>>
>> Hi all,
>>
>> I'd like to raise a discussion about the Pandas version.
>> We were originally discussing it at
>> https://github.com/apache/spark/pull/19607, but we'd like to ask for
>> feedback from the community.
>>
>>
>> Currently we don't explicitly specify the Pandas version we support, but we
>> need to decide which version to support because:
>>
>>   - There have been a number of API evolutions around extension dtypes
>> that make supporting pandas 0.18.x and lower challenging.
>>
>>   - Pandas older than 0.19.2 sometimes doesn't handle timestamp values
>> properly. We want to provide proper support for timestamp values.
>>
>>   - If users want to use vectorized UDFs, or toPandas / createDataFrame
>> from a Pandas DataFrame with Arrow, which will be released in Spark 2.3,
>> they have to upgrade to Pandas 0.19.2 or later anyway, because we use
>> pyarrow internally and it supports only 0.19.2 or later.
>>
>>
>> The point I'd like to ask is:
>>
>> Can we drop support for old Pandas (< 0.19.2)?
>> If not, what version should we support?
>>
>>
>> References:
>>
>> - vectorized UDF
>>   - https://github.com/apache/spark/pull/18659
>>   - https://github.com/apache/spark/pull/18732
>> - toPandas with Arrow
>>   - https://github.com/apache/spark/pull/18459
>> - createDataFrame from pandas DataFrame with Arrow
>>   - https://github.com/apache/spark/pull/19646
>>
>>
>> Any comments are welcome!
>>
>> Thanks.
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>>
>>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
