Thanks! I can indeed reproduce this problem. I'm a bit busy right now and plan to look into it on the weekend.
Here is the preliminary backtrace for everybody interested:

EXC_BAD_ACCESS (code=1, address=0x111138158)
    frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170 ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c ; <+44>
(lldb) bt
* thread #1: tid = 0xf1378e, 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x111138158)
  * frame #0: 0x000000010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x000000010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x000000010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x000000010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x000000010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <alberto_boc...@hotmail.it> wrote:

> Hi,
>
> I’m using python 3.5.2 and pyarrow 0.8.0
>
> As key, I put a string of 20 bytes, of course. I’m doing it differently
> from the canonical way since I no longer use Python 2.7, but Python 3,
> and this seemed to me to be the right way to create a string of 20 bytes.
> The full code is:
>
> import pyarrow as pa
> import pyarrow.plasma as plasma
>
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>
>     return batch
>
> client = plasma.connect('test', "", 0)
>
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
>
> bff = client.create(pid, sink.size())
>
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> I hope this helps,
> thank you
>
> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> Sent: Tuesday, 6 February 2018 00:00
> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function
>
> Hey Alberto,
>
> Thanks for your message! I'm trying to reproduce it.
>
> Can you attach the code you use to write the batch into the store?
>
> Also can you say which version of Python and Arrow you are using?
> On my installation, I get
>
> ```
> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-5-fbec5bb33c33> in <module>()
> ----> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>
> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> ```
>
> (the canonical way to do this would be plasma.ObjectID(b"keynumber1keynumber1"))
>
> Best,
> Philipp.
>
> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <alberto_boc...@hotmail.it> wrote:
>
> > Good morning,
> >
> > I am experiencing problems with the RecordBatches stored in plasma in a
> > particular situation.
> >
> > If I return a RecordBatch as the result of a Python function, I am able to
> > read just the metadata, while I get an error when reading the columns.
> >
> > For example, the following code
> >
> > def retrieve1():
> >     client = plasma.connect('test', "", 0)
> >
> >     key = "keynumber1keynumber1"
> >     pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> >
> >     [buff] = client.get_buffers([pid])
> >     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> >     return batch
> >
> > batch = retrieve1()
> > print(batch)
> > print(batch.schema)
> > print(batch[0])
> >
> > is a simple piece of Python code in which a function is in charge of
> > retrieving the RecordBatch from the plasma store, and then returns it to
> > the caller.
> > Running the previous example I get:
> >
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > <pyarrow.lib.Int32Array object at 0x7fd0ebfc0f98>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7fd0ebfc0f48>
> > FIELD1: int32
> > metadata
> > --------
> > {}
> > Segmentation fault (core dumped)
> >
> > If I retrieve and use the data in the same part of the code (as I do in
> > the function retrieve1(), but it also works when I put everything in the
> > main program), everything runs without problems.
> >
> > Also the problem seems to be related to the particular case in which I
> > retrieve the RecordBatch from the plasma store, since the following
> > (simpler) code:
> >
> > def create():
> >     test1 = [1, 12, 23, 3, 21, 34]
> >     test1 = pa.array(test1, pa.int32())
> >
> >     batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> >     print(batch)
> >     print(batch.schema)
> >     print(batch[0])
> >     return batch
> >
> > batch1 = create()
> > print(batch1)
> > print(batch1.schema)
> > print(batch1[0])
> >
> > Prints:
> >
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f691fd9a8>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> > <pyarrow.lib.RecordBatch object at 0x7f5f7b7a9598>
> > FIELD1: int32
> > <pyarrow.lib.Int32Array object at 0x7f5f7e29f318>
> > [
> >   1,
> >   12,
> >   23,
> >   3,
> >   21,
> >   34
> > ]
> >
> > Which is what I expect.
> >
> > Is this issue known or am I doing something wrong when retrieving the
> > RecordBatch from plasma?
> >
> > Also I would like to point out that this problem was as easy to
> > find as it was hard to re-create.
> > For this reason, there can be other situations
> > in which the same problem arises that I have not experienced, since I mostly
> > deal with plasma and I’ve been using only Python so far: the description I
> > gave is not intended to be complete.
> >
> > Thank you,
> > Alberto
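
A note on the 20-byte key discussed in the thread: in Python 3 a str has to be encoded before it can serve as raw bytes, which is why the bytearray(key, 'UTF-8') form and the b"..." literal both come up. The sketch below is only an illustration in plain Python, not pyarrow code. The names make_object_id_bytes and OBJECT_ID_LENGTH are hypothetical, and the length of 20 is taken from the error message quoted above. It normalizes str, bytes, or bytearray input to an immutable bytes value and checks the length before anything would be handed to plasma.ObjectID.

```python
# Hypothetical helper, not part of pyarrow: normalize a key to 20 raw bytes.
OBJECT_ID_LENGTH = 20  # assumption: length taken from the quoted error message


def make_object_id_bytes(key):
    """Return `key` as an immutable bytes value of exactly 20 bytes."""
    if isinstance(key, str):
        # Python 3 text must be encoded explicitly; non-ASCII characters
        # may occupy more than one byte each.
        raw = key.encode("utf-8")
    elif isinstance(key, (bytes, bytearray)):
        raw = bytes(key)
    else:
        raise TypeError("key must be str, bytes, or bytearray")
    if len(raw) != OBJECT_ID_LENGTH:
        raise ValueError(
            "expected %d bytes, got %d" % (OBJECT_ID_LENGTH, len(raw))
        )
    return raw


literal = b"keynumber1keynumber1"
from_str = make_object_id_bytes("keynumber1keynumber1")
from_bytearray = make_object_id_bytes(bytearray("keynumber1keynumber1", "UTF-8"))
```

All three values hold the same 20 bytes, so either spelling describes the same key. Checking the length on the Python side simply surfaces a wrong-sized key before it reaches the store.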