Re: [DISCUSS] Iceberg community sync?

2019-10-07 Thread Gautam
+1, 9 am PST on Tues/Wednesday works.

On Mon, Oct 7, 2019 at 4:50 AM Jacques Nadeau wrote:
> Tuesdays work best for me.
>
> On Sun, Oct 6, 2019, 4:18 PM Anton Okolnychyi wrote:
>> Tuesday/Wednesday/Thursday works fine for me. Anything up to 19:00 UTC / 20:00 BST / 12:00 PDT is OK if 09:0…

Re: [DISCUSS] Iceberg community sync?

2019-10-07 Thread 俊杰陈
+1 for once a month. Could we set an alternate time for CCT folks?

On Mon, Oct 7, 2019 at 3:23 PM Gautam wrote:
> +1, 9 am PST on Tues/Wednesday works.
>
> On Mon, Oct 7, 2019 at 4:50 AM Jacques Nadeau wrote:
>> Tuesdays work best for me.
>>
>> On Sun, Oct 6, 2019, 4:18 PM Anton Okolnychyi…

Streaming Partitioned Writing

2019-10-07 Thread Dave Sugden
Hi,

We are using the Iceberg Spark datasource with Spark Structured Streaming. The issue with this is, of course, that the incoming partitions are not sorted. We have implemented our own streaming partition writer (extending DataSourceWriter and StreamWriter). We started by keeping the F…
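A minimal, library-free sketch of the buffering idea a custom streaming partition writer might use (all names here are hypothetical, not the poster's actual implementation): accumulate rows per partition value as they arrive out of order, then flush each partition to a single file at commit time:

```python
import json
import os
import tempfile
from collections import defaultdict

class BufferingPartitionWriter:
    """Hypothetical sketch: buffer rows per partition value so that
    unsorted streaming input still produces one file per partition
    at commit, instead of one file per (partition, arrival chunk)."""

    def __init__(self, out_dir, partition_key):
        self.out_dir = out_dir
        self.partition_key = partition_key
        self.buffers = defaultdict(list)

    def write(self, row):
        # Rows may arrive in any partition order; route each to its buffer.
        self.buffers[row[self.partition_key]].append(row)

    def commit(self):
        # Flush every buffered partition to its own file.
        paths = []
        for part, rows in self.buffers.items():
            path = os.path.join(self.out_dir, f"{self.partition_key}={part}.json")
            with open(path, "w") as f:
                for row in rows:
                    f.write(json.dumps(row) + "\n")
            paths.append(path)
        self.buffers.clear()
        return paths

# Usage: rows arrive interleaved across two partitions.
out = tempfile.mkdtemp()
writer = BufferingPartitionWriter(out, "day")
for row in [{"day": 1, "v": "a"}, {"day": 2, "v": "b"}, {"day": 1, "v": "c"}]:
    writer.write(row)
created = writer.commit()  # one file per partition: day=1.json, day=2.json
```

The trade-off, as the thread goes on to discuss, is that buffering everything in memory until commit has its own cost.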

Re: [DISCUSS] Iceberg community sync?

2019-10-07 Thread Ryan Blue
Okay, let's set it for Tuesday the 8th (tomorrow) at 16:00 UTC since that works for most people. I'll send out an invite to everyone on this thread. If you'd like to be included, just send me a direct email. Everyone is welcome. We'll schedule the next one for a time when people in CCT can make it

Iceberg Vectorized Reads Meeting Notes (Oct 7)

2019-10-07 Thread Gautam
Hello Devs,

We met to discuss progress and next steps on the vectorized read path in Iceberg. Here are my notes from the sync. Feel free to reply with clarifications in case I mis-quoted or missed anything.

*Attendees*: Anjali Norwood, Padma Pennumarthy, Ryan Blue, Samarth Jain, Gautam Ko…

Re: Streaming Partitioned Writing

2019-10-07 Thread Ryan Blue
The approach sounds okay to me. It's usually preferable to repartition the data by your partition dimensions, to keep the number of data files that each writer needs to create to a minimum. Also, if buffering in memory starts taking too much memory, you can switch to using Avro instead of Parquet f…
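The repartitioning advice can be illustrated with a small, self-contained simulation (everything below is a hypothetical model, not Iceberg or Spark code): each writer task opens a separate data file for every partition value it encounters, so the file count is the number of distinct (task, partition) pairs. Hash-partitioning rows by the partition column first collapses that count to roughly one file per partition:

```python
def files_written(records, num_tasks, assign):
    """Count distinct (task, partition) pairs; each pair costs at
    least one data file, since every task opens its own file for
    every partition value it sees."""
    pairs = set()
    for i, rec in enumerate(records):
        pairs.add((assign(i, rec, num_tasks), rec["day"]))
    return len(pairs)

# 40 rows spread over 5 daily partitions, written by 8 tasks.
records = [{"day": i % 5, "value": i} for i in range(40)]

# Without repartitioning (round-robin): every task ends up seeing
# every partition, so files multiply.
unsorted_files = files_written(records, 8, lambda i, r, n: i % n)

# Hash-repartition on the partition column first: all of a given
# day's rows land on the same task, so one file per partition.
shuffled_files = files_written(records, 8, lambda i, r, n: hash(r["day"]) % n)

print(unsorted_files, shuffled_files)  # prints "40 5"
```

The same pressure motivates the second suggestion in the reply: fewer simultaneously open files per task also means less write-side buffering.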