Re: Very strange (AI generated) issues

2025-01-31 Thread Steve Loughran
What about extending the issue templates? Because of a growing problem with worthless LLM-generated issues, github MAY terminate any account doing this to our project [ ] I am a human being and am not creating AI generated issues. [ ] I accept that if I am posting AI-generated issues, my github ac

Re: Proposal: Parquet footer size in Iceberg metadata

2025-01-30 Thread Steve Loughran
Knowing the footer offset would be really useful if passed down to whatever is implementing the input stream, along with the actual file size. This can be used for prefetching the footer, as well as caching it (Azure ABFS, google GCS connectors): right now they guess that about 1MB is all they nee
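
A minimal sketch of why a known footer size helps, assuming only the standard Parquet layout (footer metadata, then a 4-byte little-endian footer length, then the 4-byte magic `PAR1`). The helper names `footer_range` and `footer_len_from_tail` are hypothetical and not part of Iceberg, Parquet, or any connector API.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TRAILER_LEN = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_range(file_size: int, footer_len: int) -> tuple[int, int]:
    """Return the [start, end) byte range covering footer plus trailer.

    With file_size and footer_len known up front, a stream could fetch
    exactly this range in one request instead of guessing a tail size.
    """
    start = file_size - footer_len - TRAILER_LEN
    # The file also begins with the 4-byte magic, so the footer cannot
    # start before offset 4.
    if footer_len < 0 or start < len(PARQUET_MAGIC):
        raise ValueError("footer length inconsistent with file size")
    return start, file_size


def footer_len_from_tail(tail: bytes) -> int:
    """Fallback when the size is unknown: parse it from the last 8 bytes."""
    if len(tail) < TRAILER_LEN or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet trailer")
    return struct.unpack("<I", tail[-8:-4])[0]


# Usage: a 1000-byte file with a 100-byte footer.
start, end = footer_range(1000, 100)
```

Without the size, a reader needs either a speculative tail read or two round trips (trailer first, then footer), which is the cost the proposal would remove.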

Re: missing files in an Iceberg table

2025-01-30 Thread Steve Loughran
Are these people using S3 versioned buckets? If so, until actually purged, the files are just hiding under tombstone markers. Our little cloud-storage support-call library, cloudstore, has something to list and recover these: https://github.com/steveloughran/cloudstore https://github.com/steveloughran/clou

Re: Very strange (AI generated) issues

2025-01-29 Thread Steve Loughran
Are these issues being manually created? Maybe add a new checkbox: [ ] I am not participating in any AI training/experiment, and if it turns out that I am, I agree to compensate developers for the time wasted. Or have something to specifically handle new posters, or at least automatically flag th

Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-03 Thread Steve Loughran
Actually, there is a way for the catalog to return S3 objects without granting access to the entire bucket: AWS presigned URLs. https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html This offers time-bounded access to an object. The catalog will need to generate and return the pres
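
To illustrate what a presigned URL carries, here is a hand-rolled sketch of AWS Signature Version 4 query-string signing for an S3 GET, using only the Python standard library. It is an illustration, not how any catalog actually does it; a real service would presumably use the AWS SDK's presigner. The function name `presign_get` and its parameters are hypothetical, and the credentials in the usage lines are the placeholder values from AWS's public documentation.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def presign_get(host, key, access_key, secret_key, region, expires_s, now):
    """Build a time-bounded GET URL for one object (SigV4, query-string auth).

    `now` is a UTC datetime; `key` is the object key without a leading slash.
    """
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires_s),  # validity window in seconds
        "X-Amz-SignedHeaders": "host",
    }
    # Canonical query string: sorted by name, strictly URI-encoded.
    query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(params.items())
    )
    path = "/" + quote(key, safe="/-_.~")
    canonical_request = "\n".join(
        ["GET", path, query, f"host:{host}\n", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Derive the signing key: date -> region -> service -> "aws4_request".
    signing_key = _hmac(("AWS4" + secret_key).encode("utf-8"), datestamp)
    for part in (region, "s3", "aws4_request"):
        signing_key = _hmac(signing_key, part)
    signature = hmac.new(
        signing_key, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return f"https://{host}{path}?{query}&X-Amz-Signature={signature}"


# Usage with AWS's documented placeholder credentials (not real keys).
url = presign_get(
    "examplebucket.s3.amazonaws.com", "test.txt",
    "AKIAIOSFODNN7EXAMPLE", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "us-east-1", 86400, datetime.datetime(2013, 5, 24, 0, 0, 0),
)
```

The client holding the URL never sees the secret key: it receives only a signature bound to one object, one HTTP method, and one expiry window.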

Re: There is no easy way to secure Iceberg data. How can we improve?

2025-01-02 Thread Steve Loughran
If the data is stored in S3 and someone has unrestricted access to a single store containing all the data (the default without S3 Access Grants, Cloudera Ranger extensions or some other access control mechanism to grant access to clients without sharing credentials), then it's effectively impossib

Re: Storing catalog directly on object store

2024-12-06 Thread Steve Loughran
I am not expressing any opinion on the product whatsoever. What I will note is that I have spent 8 weeks full time this year dealing with AWS Java SDK problems in the more foundational parts of the SDK. https://github.com/steveloughran/engineering-proposals/blob/trunk/refactoring-s3a.md#aws-sdk-v

Re: Storing catalog directly on object store

2024-11-27 Thread Steve Loughran
There's a PR up from Amazon to add this to the s3a connector: https://github.com/apache/hadoop/pull/7011 targeting a 3.4.2 release early next year, though they've not updated the PR as requested yet. 1. It doesn't give you the same semantics as a POSIX create-no-overwrite call: you only get t

Re: [DISCUSS] Variant Spec Location

2024-08-28 Thread Steve Loughran
> I think Parquet is a better place for the variant spec than Arrow. Parquet is upstream of nearly every project (other than ORC) log4j is that -but it doesn't mean that it is the right place. What is key is: what does it mean for parquet to have a variant type in there? Does it actually make se

Re: Welcome Péter, Amogh and Eduard to the Apache Iceberg PMC

2024-08-14 Thread Steve Loughran
congratulations all. On Tue, 13 Aug 2024 at 21:25, Russell Spitzer wrote: > Hi Y'all, > > It is my pleasure to let everyone know that the Iceberg PMC has voted to > have several talented individuals join us. > > So without further ado, please welcome Péter Váry, Amogh Jahagirdar and > Eduard Tud

Re: [DISCUSS] Filesystem in PyIceberg

2024-08-13 Thread Steve Loughran
On Tue, 13 Aug 2024 at 03:50, Xuanwo wrote: > Hi, André > > Thanks a lot for starting this thread. > > List operations on storage services are expensive and slow. That's why > Iceberg is designed to store metadata in files and avoid using list > operations in FileIO. However, `orphan file removal

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Steve Loughran
On Thu, 18 Jul 2024 at 00:02, Ryan Blue wrote: > Hey everyone, > > There has been some recent discussion about improving > HadoopTableOperations and the catalog based on those tables, but we've > discouraged using file system only table (or "hadoop" tables) for years now > because of major proble

Re: Building with JDK 21

2024-07-11 Thread Steve Loughran
A move to Java 11 means it is time to move to Hadoop 3.3.x as the minimum release; anything 17+ means Hadoop 3.4.x, which before long will make Java 11 its minimum version. That is: - cut the hadoop2 version/profile, which is really Java 7. - be prepared to move to 3.4.x if some java11/17 incompa