Re: [Python] Retrieving a RecordBatch from plasma inside a function

Wes McKinney Wed, 21 Feb 2018 08:12:28 -0800

Can we create a JIRA to track this issue?


On Wed, Feb 21, 2018 at 5:04 AM, ALBERTO Bocchinfuso
<alberto_boc...@hotmail.it> wrote:
> Hi,
>
> Have you had any news on this issue?
> Do you plan to solve it for the next releases of Arrow, or is there any way 
> to avoid the problem?
>
> Thanks in advance,
> Alberto
> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> Sent: Friday, 9 February 2018 00:30
> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function
>
> Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
> plan to look into it on the weekend.
>
> Here is the preliminary backtrace for everybody interested:
>
> EXC_BAD_ACCESS (code=1, address=0x111138158)
>
>     frame #0: 0x000000010e6457fc
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
>
> ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
>
>     0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for:
> PyInt_FromLong
>
>     0x10e645805 <+37>: testq  %rax, %rax
>
>     0x10e645808 <+40>: je     0x10e64580c               ; <+44>
>
> (lldb) bt
>
> * thread #1: tid = 0xf1378e, 0x000000010e6457fc
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28,
> queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1,
> address=0x111138158)
>
>   * frame #0: 0x000000010e6457fc
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>
>     frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*)
> + 133
>
>     frame #2: 0x000000010e613b25
> lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
>
>     frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
>
>     frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx +
> 22305
>
> On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
> alberto_boc...@hotmail.it> wrote:
>
>> Hi,
>>
>> I’m using python 3.5.2 and pyarrow 0.8.0
>>
>> As the key, I pass a string of 20 bytes, of course. I’m not doing it the
>> canonical way because I’m no longer using Python 2.7 but Python 3, and
>> this seemed to me to be the right way to create a string of 20 bytes.
>> The full code is:
>>
>> import pyarrow as pa
>> import pyarrow.plasma as plasma
>>
>> def retrieve1():
>>              client = plasma.connect('test', "", 0)
>>
>>              key = "keynumber1keynumber1"
>>              pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>>
>>              [buff] = client.get_buffers([pid])
>>              batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>>
>>              print(batch)
>>              print(batch.schema)
>>              print(batch[0])
>>
>>              return batch
>>
>> client = plasma.connect('test', "", 0)
>>
>> test1 = [1, 12, 23, 3, 21, 34]
>> test1 = pa.array(test1, pa.int32())
>>
>> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>>
>> key = "keynumber1keynumber1"
>> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>> sink = pa.MockOutputStream()
>> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
>> stream_writer.write_batch(batch)
>> stream_writer.close()
>>
>> bff = client.create(pid, sink.size())
>>
>> stream = pa.FixedSizeBufferWriter(bff)
>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
>> writer.write_batch(batch)
>> client.seal(pid)
>>
>> batch = retrieve1()
>> print(batch)
>> print(batch.schema)
>> print(batch[0])
>>
>> I hope this helps,
>> thank you
>>
>> From: Philipp Moritz<mailto:pcmor...@gmail.com>
>> Sent: Tuesday, 6 February 2018 00:00
>> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
>> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a
>> function
>>
>> Hey Alberto,
>>
>> Thanks for your message! I'm trying to reproduce it.
>>
>> Can you attach the code you use to write the batch into the store?
>>
>> Also can you say which version of Python and Arrow you are using? On my
>> installation, I get
>>
>> ```
>> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>> ---------------------------------------------------------------------------
>> ValueError                                Traceback (most recent call last)
>> <ipython-input-5-fbec5bb33c33> in <module>()
>> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>>
>> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>>
>> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
>> ```
>>
>> (The canonical way to do this would be
>> plasma.ObjectID(b"keynumber1keynumber1").)
>>
>> Best,
>> Philipp.
>>
>> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
>> alberto_boc...@hotmail.it> wrote:
>>
>> > Good morning,
>> >
>> > I am experiencing problems with the RecordBatches stored in plasma in a
>> > particular situation.
>> >
>> > If I return a RecordBatch as the result of a Python function, I am able
>> > to read just the metadata, while I get an error when reading the columns.
>> >
>> > For example, the following code
>> > def retrieve1():
>> >         client = plasma.connect('test', "", 0)
>> >
>> >         key = "keynumber1keynumber1"
>> >         pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>> >
>> >         [buff] = client.get_buffers([pid])
>> >         batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>> >         return batch
>> >
>> > batch = retrieve1()
>> > print(batch)
>> > print(batch.schema)
>> > print(batch[0])
>> >
>> > is simple Python code in which a function is in charge of
>> > retrieving the RecordBatch from the plasma store and then returns it to
>> > the caller. Running the previous example I get:
>> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
>> > FIELD1: int32
>> > metadata
>> > --------
>> > {}
>> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
>> > [
>> >   1,
>> >   12,
>> >   23,
>> >   3,
>> >   21,
>> >   34
>> > ]
>> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
>> > FIELD1: int32
>> > metadata
>> > --------
>> > {}
>> > Segmentation fault (core dumped)
>> >
>> >
>> > If I retrieve and use the data in the same part of the code (as I do
>> > inside the function retrieve1(); it also works when I put everything in
>> > the main program), everything runs without problems.
>> >
>> > Also the problem seems to be related to the particular case in which I
>> > retrieve the RecordBatch from the plasma store, since the following
>> > (simpler) code:
>> > def create():
>> >         test1 = [1, 12, 23, 3, 21, 34]
>> >         test1 = pa.array(test1, pa.int32())
>> >
>> >         batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>> >         print(batch)
>> >         print(batch.schema)
>> >         print(batch[0])
>> >         return batch
>> >
>> > batch1 = create()
>> > print(batch1)
>> > print(batch1.schema)
>> > print(batch1[0])
>> >
>> > Prints:
>> >
>> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
>> > FIELD1: int32
>> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
>> > [
>> >   1,
>> >   12,
>> >   23,
>> >   3,
>> >   21,
>> >   34
>> > ]
>> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
>> > FIELD1: int32
>> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
>> > [
>> >   1,
>> >   12,
>> >   23,
>> >   3,
>> >   21,
>> >   34
>> > ]
>> >
>> > Which is what I expect.
>> >
>> > Is this issue known or am I doing something wrong when retrieving the
>> > RecordBatch from plasma?
>> >
>> > I would also like to point out that this problem was as easy to run
>> > into as it was hard to reproduce. For this reason, there may be other
>> > situations in which the same problem arises that I have not
>> > experienced, since I mostly deal with plasma and I have only been using
>> > Python so far: the description I gave is not intended to be complete.
>> >
>> > Thank you,
>> > Alberto
>> >
>>
>>
>
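
On the 20-byte key discussed earlier in the thread: below is a minimal
sketch, assuming only the 20-byte requirement stated in the quoted
ValueError, of turning a Python 3 str into raw bytes before building an
object ID. The helper make_object_id_bytes and the constant
OBJECT_ID_LENGTH are hypothetical names for illustration and are not part
of pyarrow. The sketch does not touch the segfault itself.

```python
# Hypothetical helper (not a pyarrow API): normalize a key to 20 raw bytes.
OBJECT_ID_LENGTH = 20  # length taken from the ValueError quoted in the thread


def make_object_id_bytes(key):
    """Return `key` as an immutable 20-byte `bytes` value, or raise ValueError."""
    if isinstance(key, str):
        # In Python 3 a str holds text, so it must be encoded explicitly.
        raw = key.encode("utf-8")
    else:
        # Accepts bytes, bytearray, or memoryview and copies into bytes.
        raw = bytes(key)
    if len(raw) != OBJECT_ID_LENGTH:
        raise ValueError(
            "key must be %d bytes, got %d" % (OBJECT_ID_LENGTH, len(raw))
        )
    return raw


key_bytes = make_object_id_bytes("keynumber1keynumber1")
# Same value as the bytes literal suggested in the thread.
assert key_bytes == b"keynumber1keynumber1"

# A 20-character string is not always 20 bytes: non-ASCII characters
# encode to more than one byte in UTF-8.
assert len(("é" * 20).encode("utf-8")) == 40
```

The result could then be passed as plasma.ObjectID(key_bytes), which matches
the bytes-literal form suggested above. Whether a given pyarrow version also
accepts a bytearray is not settled by this thread, so converting to bytes
first is the conservative choice.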
