Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-13 Thread Wes McKinney
need use something such as flatbuffer to do my own? On Thu, Apr 25, 2019 at 5:57 PM Antoine Pitrou <anto...@python.org> w

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-09 Thread Tim Swast
o avoid copies, otherwise there's no benefit to using Arrow over pickle. Perhaps would you like to try and use pickle5 with out-of-

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-09 Thread Tim Swast
So it seems that RecordBatch serialization is able to avoid copies, otherwise there's no benefit to using Arrow over pickle.

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-05-09 Thread Shawn Yang
Antoine. On 25/04/2019 at 11:23, Shawn Yang wrote: Hi Antoine,

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-30 Thread Wes McKinney
er-images.githubusercontent.com/12445254/56651475-aaaea300-66bb-11e9-8b4f-4632e96bd079.png https://user-images.githubusercontent.com/12445254/56651484-b5693800-6

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-29 Thread Shawn Yang
https://user-images.githubusercontent.com/12445254/56651484-b5693800-66bb-11e9-9b1f-d004212e6aac.png https://user-images.githubusercontent.com/12445254/56651490-b8fcbf00-66bb-11e9-8f01-ef4919

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-26 Thread Antoine Pitrou
It's "arbitrary" from Arrow's point of view, because Arrow itself cannot represent this data (except as a binary blob). Though, as Micah said, this may change at some point. Instead of extending Arrow to fit this use case, perhaps it would be better to write a separate library that sits atop Ar

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Micah Kornfield
t.com/12445254/56629689-c9437880-6680-11e9-8756-02acb47fdb30.png Regards Shawn. On Thu, Apr 25, 2019 at 4:03 PM Antoine Pitrou <anto...@python.org>

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Philipp Moritz
ython.org>> wrote: Hi Shawn, Your images don't appear here. It seems they weren't attached to your

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Shawn Yang
Hi Shawn, Your images don't appear here. It seems they weren't attached to your e-mail? About serialization

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Shawn Yang
Your images don't appear here. It seems they weren't attached to your e-mail? About serialization: I am still working on PEP 574 (*), which I hope will be integrated in Python 3.8. The standalone "pic

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Antoine Pitrou
e-mail? About serialization: I am still working on PEP 574 (*), which I hope will be integrated in Python 3.8. The standalone "pickle5" module is also available as a backport. Both Arrow and Numpy support it. You may

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Wes McKinney
will be integrated in Python 3.8. The standalone "pickle5" module is also available as a backport. Both Arrow and Numpy support it. You may get different pickle performance using it, especially on large data.

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Shawn Yang
I hope will be integrated in Python 3.8. The standalone "pickle5" module is also available as a backport. Both Arrow and Numpy support it. You may get different pickle performance using it, especially on large data. (

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Antoine Pitrou
) https://www.python.org/dev/peps/pep-0574/ Regards Antoine. On 25/04/2019 at 05:19, Shawn Yang wrote: Motivate We want to use arrow as a general data serialization framework in

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Shawn Yang
and Numpy support it. You may get different pickle performance using it, especially on large data. (*) https://www.python.org/dev/peps/pep-0574/ Regards Antoine. On 25/04/2019 at 05:19, Shawn Yang wrote: Motivate

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-25 Thread Antoine Pitrou
and Numpy support it. You may get different pickle performance using it, especially on large data. (*) https://www.python.org/dev/peps/pep-0574/ Regards Antoine. On 25/04/2019 at 05:19, Shawn Yang wrote: Motivate We want to use arrow as a general data serialization f
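
For readers unfamiliar with the out-of-band buffers that PEP 574 describes, here is a minimal sketch using only the standard `pickle` module. It assumes Python 3.8 or later, where protocol 5 is part of the standard library, so the "pickle5" backport is not needed. It does not use Arrow or NumPy, and the helper names `dump_out_of_band` and `load_out_of_band` are hypothetical, chosen only for illustration.

```python
import pickle


def dump_out_of_band(payload):
    """Pickle a buffer-like object, keeping its bytes out of the pickle stream.

    Hypothetical helper: returns the small metadata stream plus the list of
    buffers that the transport layer would have to ship separately.
    """
    buffers = []
    # buffer_callback receives each PickleBuffer instead of copying its bytes
    # into the stream; list.append returns None, which means "out-of-band".
    meta = pickle.dumps(
        pickle.PickleBuffer(payload),
        protocol=5,
        buffer_callback=buffers.append,
    )
    return meta, buffers


def load_out_of_band(meta, buffers):
    """Rebuild a view over the payload from the metadata and the buffers."""
    return memoryview(pickle.loads(meta, buffers=buffers))


payload = bytearray(b"x" * 1_000_000)
meta, buffers = dump_out_of_band(payload)
view = load_out_of_band(meta, buffers)

# The metadata stream stays tiny because the payload was never copied into it.
print(len(meta), len(buffers), view.nbytes)
```

In a real distributed setting the buffers would be sent over the wire (or placed in shared memory) alongside the metadata. Here they are handed back in-process, which is enough to show that the payload bytes never enter the pickle stream.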

Use arrow as a general data serialization framework in distributed stream data processing

2019-04-24 Thread Shawn Yang
Motivate We want to use Arrow as a general data serialization framework in distributed stream data processing. We are working on Ray <https://github.com/ray-project/ray>, which is written in C++ at the low level and Java/Python at the high level. We want to transfer streaming data between java/py