function
> for that specific case.
>
> Best,
>
> Will
>
> On Tue, Mar 28, 2023 at 10:14 AM John Muehlhausen wrote:
>
> > Is there a way to pass a RecordBatch (or a batch wrapped as a Table) to
> > Take and get back a Table composed of in-place (zero copy) slices
Is there a way to pass a RecordBatch (or a batch wrapped as a Table) to
Take and get back a Table composed of in-place (zero copy) slices of the
input? I suppose this is not too hard to code, just wondered if there is
already a utility.
Result<Datum> Take(const Datum& values, const Datum& indices,
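A hand-rolled Python sketch of the idea, assuming the indices are already sorted (take_as_slices is an illustrative name, not an existing Arrow utility): consecutive indices are grouped into runs, and each run becomes a zero-copy Table.slice of the input.

    import pyarrow as pa

    def take_as_slices(table: pa.Table, sorted_indices):
        # Group consecutive indices into runs; each run becomes a zero-copy
        # slice of the input, so the result references the original buffers.
        runs = []
        start = prev = None
        for i in sorted_indices:
            if start is None:
                start = prev = i
            elif i == prev + 1:
                prev = i
            else:
                runs.append(table.slice(start, prev - start + 1))
                start = prev = i
        if start is not None:
            runs.append(table.slice(start, prev - start + 1))
        return pa.concat_tables(runs)  # chunks still point at the input data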
Hello,
pyarrow.Table
from_batches(batches, Schema schema=None)
Construct a Table from a sequence or iterator of Arrow RecordBatches.
What is the equivalent of this in Java? What is the relationship between
VectorSchemaRoot, Table and RecordBatch in Java? It all seems a bit
different...
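For reference, a minimal sketch of the Python call being quoted (the Java mapping asked about is not shown here):

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
    table = pa.Table.from_batches([batch, batch])  # schema inferred from the batches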
Specifi
:GetValueBytes(int64_t index)
> > >
> > >
> > > I think this would be problematic for Boolean?
> > >
> > > On Tue, Nov 15, 2022 at 11:01 AM John Muehlhausen wrote:
> > >
> > >> If that covers primitive and binary(string) types, that
If that covers primitive and binary(string) types, that would work for me.
On Tue, Nov 15, 2022 at 13:50 Antoine Pitrou wrote:
>
> Then perhaps we can define a method:
>
> std::string_view FlatArray::GetValueBytes(int64_t index)
>
> ?
>
>
> Le 15/11/2022 à 19:3
r place for this method if there is
> > consensus on adding it.
> >
> > Cheers,
> > Micah
> >
> > [1]
> >
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L219
> >
> > On Mon, Nov 14, 2022 at 11:46 AM John Muehlha
There exists:
const uint8_t* BaseBinaryArray::GetValue(int64_t i, offset_type*
out_length) const
What about adding:
const uint8_t* Array::GetValue(int64_t i, offset_type* out_length) const
This would allow GetValue to get the untyped bytes/length of any value?
E.g. out_length would be set to size
if (fieldNullCount < 0)
{
    throw new InvalidDataException("Null count length must be >= 0"); // TODO: Localize exception message
}
Above from Ipc/ArrowReaderImplementation.cs.
pyarrow is fine with -1, probably due to the following. It would be ni
/// When building
/// messages using the encapsulated IPC message, padding bytes may be written
/// after a buffer, but such padding bytes do not need to be accounted for in
/// the size here.
length: long;
}
On Thu, Sep 22, 2022 at 9:10 AM John Muehlhausen wrote:
> Regarding tab=feather.read_tab
e positions of the messages are declared in the file's footer's
> "record_batches".
>
> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87
>
> Best,
> Jorge
>
>
> On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen wrote:
>
>
5
On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen wrote:
> The following seems like good news... like I should be able to decompress
> just one column of a RecordBatch in the middle of a compressed feather v2
> file. Is there a Python API for this kind of access? C++?
>
> ///
/// compression does not yield appreciable savings.
BUFFER
}
On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen wrote:
> ``Internal structure supports random access and slicing from the middle.
> This also means that you can read a large file chunk by chunk without
> having to pull the whole t
``Internal structure supports random access and slicing from the middle.
This also means that you can read a large file chunk by chunk without
having to pull the whole thing into memory.''
https://ursalabs.org/blog/2020-feather-v2/
For a compressed v2 file, can I decompress just one column of a ba
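For the column-selective part, pyarrow.feather.read_table accepts a columns argument, which loads only the requested columns (the file name and column name below are placeholders):

    import pyarrow.feather as feather

    table = feather.read_table("quotes.feather", columns=["price"])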
error: invalid operands to binary expression
('nonstd::sv_lite::basic_string_view >' and
'basic_string_view')
This from
val == "str"sv
Is there a way to access a util::string_view as a std::string_view other
than re-building a std::string_view from data()/size() ?
-John
ons& options, io::InputStream* stream);
On Fri, Jul 1, 2022 at 3:18 PM John Muehlhausen wrote:
> If I call `Consume(std::shared_ptr<Buffer> buffer)` and it is already
> pre-framed to contain (e.g.) an entire RecordBatch Message and nothing
> else, will it use this Buffer in zero-copy mode w
If I call `Consume(std::shared_ptr<Buffer> buffer)` and it is already
pre-framed to contain (e.g.) an entire RecordBatch Message and nothing
else, will it use this Buffer in zero-copy mode when calling my
Listener::OnRecordBatchDecoded() implementation? I.e. will data in that
RecordBatch refer directly to
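A rough Python analogue of the same round trip, for illustration only: serialize produces a single encapsulated RecordBatch message, and read_record_batch is expected to reference that buffer's memory rather than copy it.

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
    buf = batch.serialize()                              # one framed IPC RecordBatch message
    same = pa.ipc.read_record_batch(buf, batch.schema)   # references buf's memory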
om default C++ memory pool on Linux, and/or interception/auditing
> of system pool" on Tue, 14 Jun 2022 09:06:51 -0500,
> John Muehlhausen wrote:
>
> > Hello,
> >
> > This comment is regarding installation with `apt` on ubuntu 18.04 ...
> > `libarrow-dev/
oc
-fno-builtin-__libc_memalign -fno-builtin-__posix_memalign
-fno-builtin-operator_new -fno-builtin-operator_delete" cmake --preset
ninja-debug-minimal -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=OFF
-DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=/usr/local ..
On Tue, Jun 14, 2022 at 12:36 PM John Muehl
My best guess at this moment is that the Arrow lib I'm using was built with
a compiler that had something like __builtin_posix_memalign in effect ??
I say this because deploying __builtin_malloc has the same deleterious
effect on my own .so
On Tue, Jun 14, 2022 at 10:53 AM John Muehlh
>
> > Arrow still uses the system allocator for all non-buffer allocations.
> > So, for example, when reading in a large IPC file, the majority of the
> > data will be allocated by Arrow's memory pool. However, the schema,
> > and the wrapper array object itself wi
I take that back... the preload is not intercepting memory_pool.cc
-> SystemAllocator -> AllocateAligned -> posix_memalign (if indeed this is
the system allocator path), although it is intercepting posix_memalign from
a different .so
On Tue, Jun 14, 2022 at 10:27 AM John Muehlhausen wr
4, 2022 at 9:06 AM John Muehlhausen wrote:
> Hello,
>
> This comment is regarding installation with `apt` on ubuntu 18.04 ...
> `libarrow-dev/bionic,now 8.0.0-1 amd64`
>
> I'm a bit confused about the memory pool situation:
>
> * I run with `ARROW_DEFAULT
Hello,
This comment is regarding installation with `apt` on ubuntu 18.04 ...
`libarrow-dev/bionic,now 8.0.0-1 amd64`
I'm a bit confused about the memory pool situation:
* I run with `ARROW_DEFAULT_MEMORY_POOL=system` and check that
`arrow::default_memory_pool()->backend_name() ==
arrow::system_m
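A quick way to check which allocator backs the default pool from Python (a small sketch, assuming a recent pyarrow):

    import pyarrow as pa

    # e.g. after launching with ARROW_DEFAULT_MEMORY_POOL=system
    print(pa.default_memory_pool().backend_name)       # 'system', 'jemalloc' or 'mimalloc'
    print(pa.default_memory_pool().bytes_allocated())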
Motivation:
We have memory-mappable Arrow IPC files with N batches where column(s) are
sorted to support binary search. Because log2(n) < log2(n/2)+log2(n/2) and
binary search is required on each batch, we prefer the batches to be as
large as possible to reduce total search time... perhaps larger
to build.
>
> This is one of many reasons we recommend using conda to organizations
> because things like the VS runtime are automatically handled. I'm not
> sure if there's a way to equivalently handle this with pip
>
> On Tue, Oct 6, 2020 at 9:16 AM John Muehlhausen
"pip install pyarrow
If you encounter any importing issues of the pip wheels on Windows, you may
need to install the Visual C++ Redistributable for Visual Studio 2015."
http://arrow.apache.org/docs/python/install.html
Just now wading into the use of pyarrow on Windows. Users are confused and
irr
a good idea.
>
> Cheers,
> Micah
>
>
> [1]
>
> https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf
> [2] https://github.com/apache/arrow/pull/4815
> [3]
>
> https://github.com/apache/arrow/blob/master/docs/source/format/Colu
new datatypes there is no separate flag to check?
On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney wrote:
> On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote:
> >
> > Again, I know very little about Parquet, so your patience is appreciated.
> >
> > At the moment I
have compression algorithm where the columnar engine can
> benefit from it [1] than marginally improving a file-system-os
> specific feature.
>
> François
>
> [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf
>
>
>
>
> On Thu, Jan 23, 2020 at 12:
n Thu, Jan 23, 2020 at 11:23 AM Antoine Pitrou wrote:
>
>
> Le 23/01/2020 à 18:16, John Muehlhausen a écrit :
> > Perhaps related to this thread, are there any current or proposed tools to
> > transform columns for fixed-length data types according to a "shuffle?"
>
Perhaps related to this thread, are there any current or proposed tools to
transform columns for fixed-length data types according to a "shuffle?"
For precedent see the implementation of the shuffle filter in hdf5.
https://support.hdfgroup.org/ftp/HDF5//documentation/doc1.6/TechNotes/shuffling-alg
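An illustrative numpy sketch of the byte-shuffle transform for fixed-width values (group byte 0 of every value, then byte 1, and so on, which typically makes the buffer more compressible); this demonstrates the idea only and is not an Arrow API:

    import numpy as np

    def shuffle(values: np.ndarray) -> bytes:
        # Byte-transpose: output is all first bytes, then all second bytes, ...
        width = values.dtype.itemsize
        raw = values.view(np.uint8).reshape(-1, width)
        return np.ascontiguousarray(raw.T).tobytes()

    def unshuffle(buf: bytes, dtype) -> np.ndarray:
        # Inverse transform: regroup the bytes of each value.
        dtype = np.dtype(dtype)
        raw = np.frombuffer(buf, dtype=np.uint8).reshape(dtype.itemsize, -1)
        return np.ascontiguousarray(raw.T).reshape(-1).view(dtype)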
Given input data and a type, how do we predict whether array() will produce
ChunkedArray?
I figure the formula involves:
- the length of input
- the type, and max length (to be conservative) for variable length types
- some constant(s) that Arrow knows internally... that may change in the
future?
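One practical check is simply to look at what comes back; the 2 GiB figure below is a rule of thumb for 32-bit offsets, an assumption rather than something stated in this thread:

    import pyarrow as pa

    result = pa.array(["spam", "eggs"])
    print(isinstance(result, pa.ChunkedArray))   # False for small inputs
    # Variable-length types tend to come back as a ChunkedArray once a single
    # chunk's character data would overflow 32-bit offsets (~2 GiB).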
n that
> does modify headers) and then "touch" up the metadata for later analysis,
> so it conforms to the specification (and standard libraries can be used).
>
> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L49
> [2] https://github.com/apache/arrow/blob/mast
> I contend that it can only be useful and will never be harmful. What are
> > the counter-examples of concrete harm?
>
>
> I'm not sure there is anything obviously wrong, however changes to
> semantics are always dangerous. One blemish on the current proposal is
"that's where the danger lies"
What danger? I have no idea what the specific danger is, assuming that all
reference implementations have test cases that hedge around this.
I contend that it can only be useful and will never be harmful. What are
the counter-examples of concrete harm?
s, we need it in a handful of implementations. I'm
willing to provide all of them. To me that is the lowest complexity
solution.
-John
On Wed, Oct 16, 2019 at 10:45 AM Wes McKinney wrote:
> On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen wrote:
> >
> > "pyar
t works without my proposed change, we can go back to how
the user ignores the empty/undefined array portions without knowing whether
they exist.
-John
On Wed, Oct 16, 2019 at 10:45 AM Wes McKinney wrote:
> On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen wrote:
> >
> > "pya
"smart" or "magical", instead maintaining tight
> developer control over what is going on.
>
> - Wes
>
> On Wed, Oct 16, 2019 at 2:18 AM Micah Kornfield
> wrote:
> >
> > Still thinking through the implications here, but to save others from
> &g
fashion and therefore has some unused array elements.
The change itself seems relatively simple. What negative consequences do
we anticipate, if any?
Thanks,
-John
On Fri, Jul 5, 2019 at 10:42 AM John Muehlhausen wrote:
> This seems to help... still testing it though.
>
> Status GetF
ARROW-6837 (which, er, includes ARROW-6836) and ARROW-5916 have PRs.
Would appreciate some feedback. I will finish the Python part of 6837 when
I know I'm on the right track.
Thanks,
John
On Thu, Oct 10, 2019 at 9:54 AM John Muehlhausen wrote:
> The format change is ARROW-6836 .
I'm missing something about this script.
FORMAT_DIR=$CWD/../..
How can any of the fbs files be in ../../ when they are in format/ ?
ntegration tests to prove it. The issues you listed
> sound more like C++ library changes to me?
>
> If you want to propose Format-related changes, that would need to
> happen right away otherwise the ship will sail on that.
>
> - Wes
>
> On Wed, Oct 9, 2019 at 9:08 PM John M
ARROW-5916
ARROW-6836/6837
These are of particular interest to me because they enable recordbatch
"incrementalism" which is useful for streaming applications:
ARROW-5916 allows a recordbatch to pre-allocate space for future records
that have not yet been populated, making it safe for readers to c
John Muehlhausen created ARROW-6840:
---
Summary: [C++/Python] retrieve fd of open memory mapped file and
Open() memory mapped file by fd
Key: ARROW-6840
URL: https://issues.apache.org/jira/browse/ARROW-6840
John Muehlhausen created ARROW-6839:
---
Summary: [Java] access File Footer custom_metadata
Key: ARROW-6839
URL: https://issues.apache.org/jira/browse/ARROW-6839
Project: Apache Arrow
Issue
John Muehlhausen created ARROW-6838:
---
Summary: [JS] access File Footer custom_metadata
Key: ARROW-6838
URL: https://issues.apache.org/jira/browse/ARROW-6838
Project: Apache Arrow
Issue
John Muehlhausen created ARROW-6837:
---
Summary: [C++/Python] access File Footer custom_metadata
Key: ARROW-6837
URL: https://issues.apache.org/jira/browse/ARROW-6837
Project: Apache Arrow
John Muehlhausen created ARROW-6836:
---
Summary: [Format] add a custom_metadata:[KeyValue] field to the
Footer table in File.fbs
Key: ARROW-6836
URL: https://issues.apache.org/jira/browse/ARROW-6836
I thought I should open all of the issues for tracking even if I don't
implement all of them right away?
On Thu, Oct 3, 2019 at 5:46 PM Antoine Pitrou wrote:
>
> Le 04/10/2019 à 00:18, John Muehlhausen a écrit :
> > I need to create two (or more) issues for
> > cu
PM Antoine Pitrou wrote:
>
> Le 03/10/2019 à 23:21, John Muehlhausen a écrit :
> >
> > Would we just make a variant of Open() that takes a fd rather than a
> path?
>
> That sounds like a good idea. Would you like to open a JIRA and a PR?
>
> > Would this API hav
I need to create two (or more) issues for
custom_metadata in Footer ...
https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E
and
memory map based on fd ...
https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b
I have a situation where multiple processes need to access a memory mapped
file.
However, between the time the first process maps the file and the time a
subsequent process in the group maps the file, the file may have been
removed from the filesystem. (I.e. has no "path") Coordinating the cache
John Muehlhausen created ARROW-5916:
---
Summary: [C++] Allow RecordBatch.length to be less than array
lengths
Key: ARROW-5916
URL: https://issues.apache.org/jira/browse/ARROW-5916
Project: Apache
It seems as if Arrow expects for some vectors to be empty rather than null.
(Examples: Footer.dictionaries, Field.children)
Anyone using --gen-object-api with flatc will get code that writes null
when (e.g.) _o->children.size() is zero in CreateField().
I may be missing something but I don't see
kely malformed");
}
const flatbuf::FieldNode* node = nodes->Get(field_index);
//out->length = node->length();
out->length = metadata_->length();
out->null_count = node->null_count();
out->offset = 0;
return Status::OK();
}
On Fri, Jul
So far it seems as if pyarrow is completely ignoring the RecordBatch.length
field. More info to follow...
On Tue, Jul 2, 2019 at 3:02 PM John Muehlhausen wrote:
> Crikey! I'll do some testing around that and suggest some test cases to
> ensure it continues to work, assuming t
Crikey! I'll do some testing around that and suggest some test cases to
ensure it continues to work, assuming that it does.
-John
On Tue, Jul 2, 2019 at 2:41 PM Wes McKinney wrote:
> Thanks for the attachment, it's helpful.
>
> On Tue, Jul 2, 2019 at 1:40 PM John
Attachments referred to in previous two messages:
https://www.dropbox.com/sh/6ycfuivrx70q2jx/AAAt-RDaZWmQ2VqlM-0s6TqWa?dl=0
On Tue, Jul 2, 2019 at 1:14 PM John Muehlhausen wrote:
> Thanks, Wes, for the thoughtful reply. I really appreciate the
> engagement. In order to clarify things a
: on
the one hand, length 1 RecordBatches that don't result in a stream that is
computationally efficient. On the other hand, adding artificial latency by
accumulating events before "freezing" a larger batch and only then making
it available to computation.
-John
On Tue, Jul 2,
During my time building financial analytics and trading systems (23
years!), both the "batch processing" and "stream processing" paradigms have
been extensively used by myself and by colleagues.
Unfortunately, the tools used in these paradigms have not successfully
overlapped. For example, an ana
If there is going to be a breaking change to the IPC format, I'd appreciate
some discussion about an idea I had for RecordBatch metadata. I previously
promised to create a discussion thread with an initial write-up but have
not yet done so. I will try to do this tomorrow. (The basic idea is to
h
> > >
> > > Note here are the other places where we have such fields:
> > >
> > > * Field
> > > * Schema
> > > * Message
> > >
> > > An alternative solution would be to handle such metadata in a separate
> > > file
Original write of File:
  Schema: custom_metadata: {"value":1}
  Message
  Message
  Footer
    Schema: custom_metadata: {"value":1}
Process appends messages (new data marked with *):
  Schema: custom_metadata: {"value":1}
  Message
  Message
  *Message*
  *Footer*
    *Schema: custom_metadata: {"value":2}*
Re-writing t
John Muehlhausen created ARROW-5439:
---
Summary: [Java] Utilize stream EOS in File format
Key: ARROW-5439
URL: https://issues.apache.org/jira/browse/ARROW-5439
Project: Apache Arrow
Issue
John Muehlhausen created ARROW-5438:
---
Summary: [JS] Utilize stream EOS in File format
Key: ARROW-5438
URL: https://issues.apache.org/jira/browse/ARROW-5438
Project: Apache Arrow
Issue Type
ach, so maybe we can just sort out C++
> for now
>
> On Wed, May 22, 2019 at 3:03 PM John Muehlhausen wrote:
> >
> > I added this to https://github.com/apache/arrow/pull/4372 and am hoping
> CI
> > will test it for me. Do Java/JS require separate JIRA entries?
> &
ent across
> platforms
>
> On Wed, May 22, 2019 at 11:02 PM John Muehlhausen wrote:
> >
> > Well, it works fine on Linux... and the Linux mmap man page seems to
> > indicate you are right about MAP_PRIVATE:
> >
> > "It is unspecified whether changes ma
We have __eq__ leaning on as_py() already ... any reason not to have __lt__
?
This makes it possible to use bisect to find slices in ordered data without
a __getitem__ wrapper:
1176.0 key=pa.array(['AAPL'])
110.0 print(bisect.bisect_left(batch[3],key[0]))
64.0 print(bisect.bisec
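Without __lt__ on scalars, a small __getitem__ wrapper (the kind the message wants to avoid) already makes bisect work; a sketch:

    import bisect
    import pyarrow as pa

    class AsPyView:
        # Present a pyarrow Array to bisect as a plain Python sequence.
        def __init__(self, arr):
            self.arr = arr
        def __len__(self):
            return len(self.arr)
        def __getitem__(self, i):
            return self.arr[i].as_py()

    symbols = pa.array(['AAPL', 'AAPL', 'IBM', 'MSFT'])   # already sorted
    lo = bisect.bisect_left(AsPyView(symbols), 'AAPL')
    hi = bisect.bisect_right(AsPyView(symbols), 'AAPL')
    print(symbols.slice(lo, hi - lo))                     # zero-copy slice of the run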
ow/blob/master/ci/conda_env_cpp.yml#L31
>
> On Thu, May 23, 2019 at 12:53 PM John Muehlhausen wrote:
> >
> > The pyarrow-dev conda environment does not include llvm 7, which appears
> to
> > be a requirement for Gandiva.
> >
> > So I'm just trying to figure out a pa
hon.rst
>
> Let us know if that does not work.
>
> - Wes
>
> On Wed, May 22, 2019 at 11:02 AM John Muehlhausen wrote:
> >
> > Set up pyarrow-dev conda environment as at
> > https://arrow.apache.org/docs/developers/python.html
> >
> > Got the following
es it work as expected on MacOS. Still odd
that the changes are only sometimes visible ... but I guess that is
compatible with it being "unspecified."
-John
On Wed, May 22, 2019 at 8:56 PM John Muehlhausen wrote:
> I'll mess with this on various platforms and report back. Tha
>field1
> 0 1.0
> 1 NaN
>
> Now ran dd to overwrite the file contents
>
> In [14]: batch.to_pandas()
> Out[14]:
> field1
> 0 NaN
> 1 -245785081.0
>
> On Wed, May 22, 2019 at 8:34 PM John Muehlhausen wrote:
> >
> > I don
ithub.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
>
> Some more investigation would be required
>
> On Wed, May 22, 2019 at 7:43 PM John Muehlhausen wrote:
> >
> > Is there an example somewhere of referring to the RecordBatch data in a
> memory-mapped IPC
(new test attached)
On Wed, May 22, 2019 at 8:09 PM John Muehlhausen wrote:
> I don't think that is it. I changed my mmap to MAP_PRIVATE in the first
> raw mmap test and the dd changes are still visible. I also changed to
> storing the stream format instead of the file format an
Is there an example somewhere of referring to the RecordBatch data in a
memory-mapped IPC File in a zero-copy manner?
I tried to do this in Python and must be doing something wrong. (I don't
really care whether the example is Python or C++)
In the attached test, when I get to the first prompt an
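A minimal zero-copy read sketch in Python, assuming a recent pyarrow ("example.arrow" is a placeholder path):

    import pyarrow as pa

    source = pa.memory_map("example.arrow", "r")
    reader = pa.ipc.open_file(source)
    batch = reader.get_batch(0)   # column buffers reference the mapped region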
/vector/ipc/ArrowFileWriter.java#L67
>
> On Wed, May 22, 2019 at 12:24 PM John Muehlhausen wrote:
> >
> > https://github.com/apache/arrow/pull/4372
> >
> > First contribution attempt... sorry in advance if I'm not coloring inside
> > the lines!
> >
https://github.com/apache/arrow/pull/4372
First contribution attempt... sorry in advance if I'm not coloring inside
the lines!
On Wed, May 22, 2019 at 9:06 AM John Muehlhausen wrote:
> I will submit a patch once I get set up for that. My crystal ball says
> that some people w
John Muehlhausen created ARROW-5395:
---
Summary: Utilize stream EOS in File format
Key: ARROW-5395
URL: https://issues.apache.org/jira/browse/ARROW-5395
Project: Apache Arrow
Issue Type
Set up pyarrow-dev conda environment as at
https://arrow.apache.org/docs/developers/python.html
Got the following error. I will disable Gandiva for now but I'd like to
get it back at some point. I'm on Mac OS 10.13.6.
CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
Could not fi
tation is not "wrong".
>
> On Wed, May 22, 2019 at 8:37 AM John Muehlhausen wrote:
> >
> > I believe the change involves updating the File format notes as above, as
> > well as something like the following. The format also mentions "there is
> >
ote:
> This seems like a reasonable change. Is there any reason that we shouldnt
> always append EOS?
>
> On Tuesday, May 21, 2019, John Muehlhausen wrote:
>
> > Wes,
> >
> > Check out reader.cpp. It seg faults when it gets to the next
> > message-that
where messages are popped off the InputStream here
>
>
> https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281
>
> If the end of the byte stream is reached, or EOS (0) is encountered,
> then the stream reader stops iteration.
>
https://arrow.apache.org/docs/format/IPC.html#file-format
If this stream marker is optional in the file format, doesn't this prevent
someone from reading the file without being able to seek() it, e.g. if it
is "piped in" to a program? Or otherwise they'll have to stream in the
entire thing befo
; On Mon, May 13, 2019 at 8:36 AM Wes McKinney
> > wrote:
> > >
> > > > hi John -- I'd recommend implementing these capabilities as Kernel
> > > > functions under cpp/src/arrow/compute, then they can be exposed in
> > > > Python easily.
; number of interested parties and start designing a proposal (which may
> or may not include spec additions).
>
> Regards
>
> Antoine.
>
>
> Le 13/05/2019 à 15:38, John Muehlhausen a écrit :
> > Micah, yes, it all works at the moment. How have we staked out that it
Does pyarrow currently support filter/sort/search without conversion to
pandas? I don’t see anything but want to be sure. Sorry if I overlooked it.
Specific needs:
1- filter an arrow record batch and sort the results into a new batch
2- find slice locations for a sorted batch using binary search
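Both needs can be expressed with pyarrow.compute in current releases (these kernels postdate this 2019 message); a sketch of the first:

    import pyarrow as pa
    import pyarrow.compute as pc

    batch = pa.RecordBatch.from_arrays(
        [pa.array(["IBM", "AAPL", "AAPL"]), pa.array([3.0, 2.0, 1.0])],
        names=["sym", "px"])

    # 1- filter, then sort the surviving rows into a new batch
    mask = pc.equal(batch.column("sym"), "AAPL")
    filtered = batch.filter(mask)
    order = pc.sort_indices(filtered, sort_keys=[("px", "ascending")])
    result = filtered.take(order)

For the second need, bisect over a sorted column (as in the wrapper sketch earlier in this archive) still applies.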
favor of
> > making changes to the binary protocol for this use case; if others
> > have opinions I'll let them speak for themselves.
> >
> > - Wes
> >
> > On Mon, May 13, 2019 at 7:50 AM John Muehlhausen wrote:
> > >
> > > Any thoughts on
ocks so that readers know to call "Slice" on the blocks to obtain
> only the written-so-far portion. I'm not likely to be in favor of
> making changes to the binary protocol for this use case; if others
> have opinions I'll let them speak for themselves.
>
>
Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow
std::vector terminology)
Thanks,
John
On Thu, May 9, 2019 at 2:46 PM John Muehlhausen wrote:
> Wes et al, I think my core proposal is that Message.fbs:RecordBatch split
> the "length" parameter into
e case of the file format, while the file is locked, a new
RecordBatch would overwrite the previous file Footer and a new Footer would
be written. In order to be able to delete or archive old data multiple
files could be strung together in a logical series.
-John
On Tue, May 7, 2019 at 2:39
f you'd like to experiment with creating an API for pre-allocating
> > fixed-size Arrow protocol blocks and then mutating the data and
> > metadata on disk in-place, please be our guest. We don't have the
> > tools developed yet to do this for you
> >
> > - Wes
I'm not
sure how to better make my case
-John
On Tue, May 7, 2019 at 11:02 AM Wes McKinney wrote:
> hi John,
>
> On Tue, May 7, 2019 at 10:53 AM John Muehlhausen wrote:
> >
> > Wes et al, I completed a preliminary study of populating a Feather file
> > incrementally.
t forking the project, IMHO that is a dark path
> that leads nowhere good. We have a large community here and we accept
> pull requests -- I think the challenge is going to be defining the use
> case to suitable clarity that a general purpose solution can be
> developed.
>
> - Wes
l/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
>
>
>
>
>
>
>
> On Mon, May 6, 2019 at 10:39 AM John Muehlhausen wrote:
> >
> > Wes,
> >
> > I’m not afraid of writing my own C++ code to deal with all of this on
s restarted or two separate processes active simultaneously) you'll
> > need to build up your own data structures to help with this.
> >
> > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen wrote:
> >
> > > Hello,
> > >
> > > Glad to
t is the
> specific pattern you're trying to undertake for building.
>
> If you're trying to go across independent processes (whether the same
> process restarted or two separate processes active simultaneously) you'll
> need to build up your own data structures to hel
Hello,
Glad to learn of this project— good work!
If I allocate a single chunk of memory and start building Arrow format
within it, does this chunk save any state regarding my progress?
For example, suppose I allocate a column for floating point (fixed width)
and a column for string (variable wid