Re: [Python] Retrieving a RecordBatch from plasma inside a function

Philipp Moritz Thu, 08 Feb 2018 15:30:27 -0800

Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
plan to look into it on the weekend.


Here is the preliminary backtrace for everybody interested:

EXC_BAD_ACCESS (code=1, address=0x111138158)

    frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c               ; <+44>

(lldb) bt
* thread #1: tid = 0xf1378e, 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x111138158)
  * frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x000000010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
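
A side note on the 20-byte object ID discussed in the quoted messages below: this is a minimal sketch, using only the Python 3 standard library, of how the same key can be built as a bytes literal, with str.encode, or with bytearray, and of why the byte length rather than the character count is what matters. The helper name is_valid_id_length and the non-ASCII sample are hypothetical and exist only for illustration. The sketch does not touch plasma, so it says nothing about which of these types plasma.ObjectID accepts in a given pyarrow version.

```python
# Hypothetical illustration: three ways to build a 20-byte key in Python 3.
key = "keynumber1keynumber1"

as_literal = b"keynumber1keynumber1"      # bytes literal
as_encoded = key.encode("utf-8")          # str -> bytes
as_bytearray = bytearray(key, "UTF-8")    # mutable byte sequence, as in the quoted code


def is_valid_id_length(raw, expected=20):
    """Return True if raw holds exactly `expected` bytes (helper name is an assumption)."""
    return len(raw) == expected


# All three hold the same 20 bytes; only the container type differs.
same_content = as_literal == as_encoded == bytes(as_bytearray)
print(same_content, is_valid_id_length(as_literal))

# Length is counted in bytes after encoding, not in characters.
non_ascii = "\u00e9" * 20                 # 20 characters, each 2 bytes in UTF-8
encoded_non_ascii = non_ascii.encode("utf-8")
print(len(non_ascii), len(encoded_non_ascii), is_valid_id_length(encoded_non_ascii))
```

Converting with bytes(...) before handing the key to an API that expects exactly 20 bytes is the conservative choice here, since it removes any dependence on how that API treats bytearray.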

On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Hi,
>
> I’m using Python 3.5.2 and pyarrow 0.8.0.
>
> As the key I use a 20-byte string, of course. I build it differently from
> the canonical way because I am no longer on Python 2.7 but on Python 3,
> and this seemed to me the right way to create a 20-byte string.
> The full code is:
>
> import pyarrow as pa
> import pyarrow.plasma as plasma
>
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
>
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>
>     return batch
>
> client = plasma.connect('test', "", 0)
>
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
>
> bff = client.create(pid, sink.size())
>
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> I hope this helps,
> thank you
>
> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> Sent: Tuesday, 6 February 2018 00:00
> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
>
> Hey Alberto,
>
> Thanks for your message! I'm trying to reproduce it.
>
> Can you attach the code you use to write the batch into the store?
>
> Also can you say which version of Python and Arrow you are using? On my
> installation, I get
>
> ```
> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-5-fbec5bb33c33> in <module>()
> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>
> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> ```
>
> (the canonical way to do this would be
> plasma.ObjectID(b"keynumber1keynumber1"))
>
> Best,
> Philipp.
>
> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
> alberto_boc...@hotmail.it> wrote:
>
> > Good morning,
> >
> > I am experiencing problems with the RecordBatches stored in plasma in a
> > particular situation.
> >
> > If I return a RecordBatch as result of a python function, I am able to
> > read just the metadata, while I get an error when reading the columns.
> >
> > For example, the following code
> > def retrieve1():
> >         client = plasma.connect('test', "", 0)
> >
> >         key = "keynumber1keynumber1"
> >         pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >
> >         [buff] = client.get_buffers([pid])
> >         batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> >         return batch
> >
> > batch = retrieve1()
> > print(batch)
> > print(batch.schema)
> > print(batch[0])
> >
> > is a simple Python example in which a function retrieves the RecordBatch
> > from the plasma store and then returns it to the caller. Running it,
> > I get:
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > Segmentation fault (core dumped)
> >
> >
> > If I retrieve and use the data in the same scope (as I do inside the
> > function retrieve1(); it also works when I put everything in the main
> > program), everything runs without problems.
> >
> > Also the problem seems to be related to the particular case in which I
> > retrieve the RecordBatch from the plasma store, since the following
> > (simpler) code:
> > def create():
> >         test1 = [1, 12, 23, 3, 21, 34]
> >         test1 = pa.array(test1, pa.int32())
> >
> >         batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> >         print(batch)
> >         print(batch.schema)
> >         print(batch[0])
> >         return batch
> >
> > batch1 = create()
> > print(batch1)
> > print(batch1.schema)
> > print(batch1[0])
> >
> > Prints:
> >
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> >
> > Which is what I expect.
> >
> > Is this issue known or am I doing something wrong when retrieving the
> > RecordBatch from plasma?
> >
> > Also, I would like to point out that this problem was as easy to stumble
> > upon as it was hard to reproduce. For this reason, there may be other
> > situations in which the same problem arises that I have not encountered,
> > since I mostly deal with plasma and have so far used only Python: the
> > description I gave is not intended to be complete.
> >
> > Thank you,
> > Alberto
> >
>
>
