Re: Arrow as a streaming format

2020-09-19 Thread Micah Kornfield
> > Furthermore, these types of queries seem to fit what I would call (for lack of a better word) "sliding" dataframes. Arrow's aim (as I understand it) is to standardize the static dataframe data structure memory model; can it also support a sliding version? I don't think there are any ex

Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-19 Thread Micah Kornfield
Hi Niklas, Two suggestions: * Try adjusting row_group_size on write_table [1] to a smaller-than-default value. If I read the code correctly, the default is currently 64 million rows [2], which seems potentially too high (I'll open a JIRA about this). * If this is on Linux/Mac, try setting the

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-19 Thread Wes McKinney
I took a look at https://github.com/kpamnany/partr and Julia's production iteration of that -- kpamnany/partr depends on libconcurrent's coroutine implementation which does not work on Windows. It appears that Julia is using libuv instead. If we're looking for a lighter-weight C coroutine implement

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-19 Thread Weston Pace
Ok, my skill with C++ got in the way of my ability to put something together. First, I did not realize that C++ futures were a little different than the definition I'm used to for futures. By default, C++ futures are not composable; you can't add continuations with `then`, `when_all` or `when_any
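
To illustrate the composability point in the snippet above: `std::future` has no built-in `then`, so a continuation has to be emulated by hand. The following is a minimal sketch, not anything from the thread or from Arrow. The helper `then` and the function `compute_chained` are hypothetical names introduced here, and the approach (a second `std::async` task that blocks on the first future) is only one simple way to do it, at the cost of tying up a thread while it waits.

```cpp
#include <future>
#include <utility>

// Hypothetical helper (not part of the standard library or Arrow):
// emulates a `then` continuation by launching a second task that
// blocks on the first future and passes its result to `fn`.
template <typename T, typename Fn>
auto then(std::future<T> fut, Fn fn)
    -> std::future<decltype(fn(std::declval<T>()))> {
  return std::async(std::launch::async,
                    [f = std::move(fut), fn = std::move(fn)]() mutable {
                      // get() blocks this worker until the input is ready.
                      return fn(f.get());
                    });
}

// Hypothetical usage: chain two continuations onto an initial task.
int compute_chained() {
  std::future<int> first =
      std::async(std::launch::async, [] { return 20; });
  auto second = then(std::move(first), [](int v) { return v + 1; });
  auto third = then(std::move(second), [](int v) { return v * 2; });
  return third.get();  // (20 + 1) * 2
}
```

Each stage here occupies a thread while waiting on the previous one, which is the kind of overhead a scheduler with real continuation support would avoid. The sketch needs C++14 for the generalized lambda capture.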

[NIGHTLY] Arrow Build Report for Job nightly-2020-09-19-0

2020-09-19 Thread Crossbow
Arrow Build Report for Job nightly-2020-09-19-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-19-0 Failed Tasks: - conda-linux-gcc-py36-aarch64: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-19-0-drone-conda-linux-gcc-py3