Re: [Flight Format] Authentication Redesign

2020-09-04 Thread James Duong
Are we concerned about backward compatibility with older FlightClients? Would it make sense to continue to support handshakes with auth payloads in addition to header-based authentication using middlewares? Perhaps we create a dedicated endpoint for server capabilities if we need to remain backward

Re: Multifile parquet support

2020-09-04 Thread Weston Pace
Hello Radu, If your goal is strictly "append" with common schema then maybe the terminology you are looking for is "append a parquet file to a parquet dataset" and not "append a row group to a multi-file parquet file". Parquet datasets (and arrow datasets) support having a common schema which is u

Re: Arrow as a streaming format

2020-09-04 Thread Micah Kornfield
Hi Pedro, I think the answer is it likely depends. The main trade-off in using Arrow in a streaming process is the high metadata overhead if you have very few rows. There have been prior discussions on the mailing list about row-based and streaming that might be useful [1][2] in expanding on the

Re: Multifile parquet support

2020-09-04 Thread Neal Richardson
Hi Radu, It might be easier to get feedback on some concrete code. Perhaps make a PR with a proof of concept and we can discuss there? Neal On Fri, Sep 4, 2020 at 4:27 AM Radu Teodorescu wrote: > Micah and all, > Thanks for that pointer, I certainly didn’t follow it in detail at the > time. > >

Re: Adding Parquet encryption support to PyArrow

2020-09-04 Thread Roee Shlomo
Sounds good. In the suggestion above the builders for FileEncryptionProperties/FileDecryptionProperties should not be exposed, so only key tools would create those. This is just one option of course. On 2020/09/03 20:44:26, Antoine Pitrou wrote: > > It would be useful for outsiders to expose

Re: Arrow as a streaming format

2020-09-04 Thread Radu Teodorescu
Hi Pedro, You should be able to use Flight for this: pack your subscription call in a DoGet and listen on the FlightDataStream for new data. I think you can control the granularity of your messages through the size of the record batches you are writing, but I am not a Flight developer so don’t t

Re: Multifile parquet support

2020-09-04 Thread Radu Teodorescu
Micah and all, Thanks for that pointer, I certainly didn’t follow it in detail at the time. My question/thoughts are actually more limited in scope and I am specifically targeting features that are supported by the standard AND by other major Parquet implementations. Specifically I would li

[NIGHTLY] Arrow Build Report for Job nightly-2020-09-04-0

2020-09-04 Thread Crossbow
Arrow Build Report for Job nightly-2020-09-04-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-04-0 Failed Tasks: - test-conda-python-3.7-hdfs-2.9.2: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-04-0-github-test-conda-pyt

Arrow as a streaming format

2020-09-04 Thread Pedro Silva
Hello, This may be a stupid question, but is Arrow used for, or designed with, stream-processing use cases in mind, where data is non-stationary (e.g. Flink stream processing jobs)? Particularly, is it possible from a given event source (say Kafka) to efficiently generate incremental record batche

Re: Adding Parquet encryption support to PyArrow

2020-09-04 Thread Gidon Gershinsky
Sure, I'll prep a brief summary on this by Sunday, got a weekend kicking in here today. Cheers, Gidon On Thu, Sep 3, 2020 at 11:44 PM Antoine Pitrou wrote: > > It would be useful for outsiders to expose what those two API levels > are, and to what usage they correspond. > Is Parquet encryption