Re: [Python] Retrieving a RecordBatch from plasma inside a function

ALBERTO Bocchinfuso Wed, 21 Feb 2018 02:05:12 -0800

Hi,

Is there any news on this issue?
Do you plan to fix it in one of the next Arrow releases, or is there a way to
work around the problem?


Thanks in advance,
Alberto
From: Philipp Moritz<mailto:pcmor...@gmail.com>
Sent: Friday, 9 February 2018 00:30
To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function

Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
plan to look into it on the weekend.

Here is the preliminary backtrace for everybody interested:

EXC_BAD_ACCESS (code=1, address=0x111138158)
    frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c               ; <+44>
(lldb) bt
* thread #1: tid = 0xf1378e, 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x111138158)
  * frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x000000010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
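
The backtrace shows the crash happening while an Int32 value is read during `__repr__`, but the thread does not state the cause. One possible reading (an assumption, not confirmed here) is that the column data lives in memory owned by something that goes away when the function returns. The sketch below uses only the standard library, not plasma, to show the general dependency between a view and the mapping that owns its memory; `load_view` and `VALUES` are illustrative names.

```python
import mmap
import struct

# Same values as the FIELD1 column used in this thread.
VALUES = [1, 12, 23, 3, 21, 34]


def load_view():
    """Create an anonymous mapping and return it with an int32 view into it."""
    owner = mmap.mmap(-1, 4 * len(VALUES))
    owner[:] = struct.pack("%di" % len(VALUES), *VALUES)
    # The view does not copy the data: it reads straight from the mapping.
    view = memoryview(owner).cast("i")
    return owner, view


owner, view = load_view()
first_read = list(view)

# CPython refuses to unmap memory that a live view still points into.
try:
    owner.close()
    closed_while_viewed = True
except BufferError:
    closed_while_viewed = False

# Once the view is released, the mapping can be closed safely.
view.release()
owner.close()

print(first_read, closed_while_viewed, owner.closed)
```

Pure Python guards against this situation with a `BufferError`. Code that wraps native memory has to provide an equivalent guarantee itself, for example by keeping the owner of the memory alive for as long as any view of it is in use.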

On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Hi,
>
> I’m using Python 3.5.2 and pyarrow 0.8.0.
>
> As the key I use a 20-byte string, of course. I build it differently from
> the canonical way because I am no longer using Python 2.7 but Python 3,
> and this seemed to me to be the right way to create a 20-byte string.
> The full code is:
>
> import pyarrow as pa
> import pyarrow.plasma as plasma
>
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
>
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>
>     return batch
>
> client = plasma.connect('test', "", 0)
>
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
>
> bff = client.create(pid, sink.size())
>
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> I hope this helps,
> thank you
>
> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> Sent: Tuesday, 6 February 2018 00:00
> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
>
> Hey Alberto,
>
> Thanks for your message! I'm trying to reproduce it.
>
> Can you attach the code you use to write the batch into the store?
>
> Also can you say which version of Python and Arrow you are using? On my
> installation, I get
>
> ```
>
> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-5-fbec5bb33c33> in <module>()
> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>
> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> ```
>
> (the canonical way to do this would be plasma.ObjectID(b"keynumber1keynumber1"))
>
> Best,
> Philipp.
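
A minimal sketch of building the 20-byte key in Python 3, using only the standard library so that it runs without a plasma store. The helper name `make_object_id_bytes` is hypothetical and not part of pyarrow. Whether `plasma.ObjectID` accepts a `bytearray` appears to depend on the installation, as the traceback above shows, so the sketch produces an immutable `bytes` object, which matches the canonical `b"..."` form.

```python
ID_LENGTH = 20  # key length required by the error message quoted in this thread


def make_object_id_bytes(key: str) -> bytes:
    """Encode a text key as bytes and check that it is exactly 20 bytes long.

    Hypothetical helper for illustration; not a pyarrow function.
    """
    raw = key.encode("utf-8")
    if len(raw) != ID_LENGTH:
        raise ValueError(
            "key must encode to %d bytes, got %d" % (ID_LENGTH, len(raw))
        )
    return raw


key = "keynumber1keynumber1"
as_bytes = make_object_id_bytes(key)

# The three spellings carry the same 20 bytes; only the type differs.
same_content = as_bytes == b"keynumber1keynumber1" == bytes(bytearray(key, "UTF-8"))

# A key of the wrong length is rejected before it reaches the store.
try:
    make_object_id_bytes("short")
    rejected_short = False
except ValueError:
    rejected_short = True

print(len(as_bytes), same_content, rejected_short)
```

The result would then be passed as `plasma.ObjectID(as_bytes)`. Note that the length check is on the encoded bytes, so a key containing non-ASCII characters may have 20 characters and still be rejected.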
>
> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
> alberto_boc...@hotmail.it> wrote:
>
> > Good morning,
> >
> > I am experiencing problems with the RecordBatches stored in plasma in a
> > particular situation.
> >
> > If I return a RecordBatch as the result of a Python function, I can read
> > only the metadata, and I get an error when reading the columns.
> >
> > For example, the following code
> > def retrieve1():
> >     client = plasma.connect('test', "", 0)
> >
> >     key = "keynumber1keynumber1"
> >     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
> >
> >     [buff] = client.get_buffers([pid])
> >     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> >     return batch
> >
> > batch = retrieve1()
> > print(batch)
> > print(batch.schema)
> > print(batch[0])
> >
> > is a simple Python example in which a function retrieves the RecordBatch
> > from the plasma store and then returns it to the caller. Running this
> > example, I get:
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > Segmentation fault (core dumped)
> >
> >
> > If I retrieve and use the data in the same part of the code (as I do
> > inside the function retrieve1(); it also works when I put everything in
> > the main program), everything runs without problems.
> >
> > Also the problem seems to be related to the particular case in which I
> > retrieve the RecordBatch from the plasma store, since the following
> > (simpler) code:
> > def create():
> >         test1 = [1, 12, 23, 3, 21, 34]
> >         test1 = pa.array(test1, pa.int32())
> >
> >         batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> >         print(batch)
> >         print(batch.schema)
> >         print(batch[0])
> >         return batch
> >
> > batch1 = create()
> > print(batch1)
> > print(batch1.schema)
> > print(batch1[0])
> >
> > Prints:
> >
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> >
> > Which is what I expect.
> >
> > Is this issue known or am I doing something wrong when retrieving the
> > RecordBatch from plasma?
> >
> > I would also like to point out that this problem was as easy to run into
> > as it is hard to reproduce. There may therefore be other situations in
> > which the same problem arises that I have not experienced, since I mostly
> > work with plasma and have only used Python so far: the description I gave
> > is not intended to be complete.
> >
> > Thank you,
> > Alberto
> >
>
>
