Hi folks,

We are now consistently hitting this problem:
1) We want to write a data set on the order of 1-5 TB into a partitioned Iceberg table.

2) On a single write, Iceberg will fail a partitioned writer if it detects that the data is not sorted. This has been discussed on this list before, and, per Ryan, this behavior exists to avoid the small-files problem.

3) Thus, you are forced to sort. Ryan recommends sorting by "all partitioning columns, plus a high cardinality column like an ID" (see the Iceberg dev list thread "Timeout waiting for connection from pool (S3)").

4) For small writes this is no problem, but I have had many job failures from sorting and running out of temp space during the shuffle. The worker nodes are well configured, with 800 GB of temp space available.

5) When the sort does succeed, it consistently makes writes take at least twice as long.

To summarize: given the cost of sorting TBs of data, any tips on what to do in Iceberg to avoid it? Should we allow unsorted writes to go through with a flag?

Thanks,

Xabriel J Collazo Mojica | Senior Software Engineer | Adobe | xcoll...@adobe.com
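For context, the recommended sort in point 3 looks roughly like the following in Spark SQL. This is just a sketch; the table and column names (`db.events`, `event_date`, `event_id`) are made up for illustration, and the global ORDER BY is what triggers the expensive range shuffle described above:

```sql
-- Hypothetical target table, partitioned by a date column:
--   CREATE TABLE db.events (event_id BIGINT, event_date DATE, payload STRING)
--   USING iceberg PARTITIONED BY (event_date);

-- Sort by all partition columns first, then a high-cardinality column,
-- so each task writes to few partitions at a time:
INSERT INTO db.events
SELECT event_id, event_date, payload
FROM staging_events
ORDER BY event_date, event_id;
```

At 1-5 TB, that ORDER BY is a full range-partitioned sort, which is where the temp-space pressure during shuffle comes from.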