First, anyone who is an active committer on a project with >5k GitHub
stars gets six months of Claude Max free:
https://claude.com/contact-sales/claude-for-oss

Which means: many more ASF committers will be experiencing what it can and
can't do.

I'm still learning what it can do, especially on any large body of code,
and am happy with the blocking of pure/overly AI-generated content, as it
will only create issues downstream. That covers production code, tests, etc.
Documentation is an interesting one, though, as the tools are good for tasks
like "review all links and flag broken ones" as well as "read the docs and
highlight inconsistencies".

One thing which may be good for any OSS project is to have official
CLAUDE.md, GEMINI.md and Copilot-equivalent files to provide strict
instructions to the AI tooling which it doesn't auto-infer from the simple
/init commands (attached: those two for Iceberg).

I'm thinking of extra style and process guidance, but also instructions to
stop the AI getting over-enthusiastic:

   1. always use SLF4J logging (I had a bad experience with Gemini replacing
   every log statement with System.out in my two-file project because it
   couldn't see the output to debug test setup)
   2. thread-safety requirements
   3. tests to go with the code, exploring all branches and failure
   conditions
   4. use no content outside this directory tree
   5. add /* begin: AI */ and /* end: AI */ markers around changes above a
   given size (ASF policy, after all)
   6. do not touch anything under /format

Plus: add the various .gemini/.copilot/.claude dirs, with .gitignore set up
to ignore local customisations.
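
As a sketch, the items above might translate into a CLAUDE.md fragment like
this (the wording and the ~20-line threshold are mine for illustration, not
official Iceberg text):

```markdown
## AI Contribution Rules

- Always use SLF4J for logging; never `System.out`/`System.err`.
- State the thread-safety guarantees of any new code.
- Every change must ship with tests covering branches and failure paths.
- Do not read or reference content outside this repository tree.
- Wrap generated changes above ~20 lines in `/* begin: AI */` ... `/* end: AI */`.
- Never modify anything under `/format`.
```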

On Tue, 10 Mar 2026 at 03:32, vaquar khan <[email protected]> wrote:

> Hi Huaxin, Junwang,
>
> I’ve been following this thread and I feel the same pain. Reviewing "AI
> slop" is the fastest way to burn out a committer, and Junwang is right:
> manual closing is just extra work we don't need.
>
> I've been working on a small utility called AIV (Automated Integrity
> Validation) to help with this exact problem at my day job. Instead of
> trying to "detect" AI, which is a losing battle, it focuses on logic
> density. Essentially, it checks the ratio of real functional changes to
> boilerplate. If someone submits 300 lines of scaffolding but only 2 lines
> of actual logic, AIV flags it as "Low Substance." This directly addresses
> Sung’s point about "readiness": it forces the author to prove there’s
> actual work in the PR before a human ever looks at it.
>
> I’ve already put together a few Iceberg-specific design rules for testing.
> For example, it can catch when a PR tries to bypass the ExpireSnapshots API
> or ignores the new V4 metadata constraints, patterns that AI agents miss
> 100% of the time.
>
> It runs 100% locally or in a CI step, with no API keys needed. If the
> community is interested, I’m happy to share the code; it's already
> Apache-licensed, and we could look at a non-blocking trial to help triage
> the incoming queue.
>
> Regards,
> Viquar Khan
>
> On Mon, 9 Mar 2026 at 22:13, Kevin Liu <[email protected]> wrote:
>
>> Thank you for bringing this up. I also feel like I've interacted with a
>> few of these PRs recently. My suspicion is that these PRs are created by an
>> "openclaw"-like agent that is automatically finding issues, creating PRs,
>> and responding to reviews. This is slightly different from our previous
>> conversation, which was centered around AI-generated PRs with a
>> human-in-the-loop. I've just pinged the author in one of the suspected PRs
>> and linked to the guidelines.
>>
>> I'm in favor of adding some more to the "Guidelines for AI-assisted
>> Contributions" section [1]. I want to especially call out the burden on the
>> reviewers and the limited reviewer resources.
>>
>> A wild idea: if we add an AGENTS.md to the Iceberg repo, maybe the agent
>> will respect it?
>>
>> Best,
>> Kevin Liu
>>
>>
>> [1]
>> https://iceberg.apache.org/contribute/#guidelines-for-ai-assisted-contributions
>>
>> On Mon, Mar 9, 2026 at 8:05 PM Alex Stephen via dev <
>> [email protected]> wrote:
>>
>>> One thing worth considering is a .github/PULL_REQUEST_TEMPLATE.md file.
>>>
>>> If somebody isn’t looking over their PR, they probably aren’t going to
>>> look over the guidelines around contributing, especially if those are
>>> located over on a docs page.
>>>
>>> A Pull Request Template forces them to see the community’s guidelines
>>> before they formally make the PR.
>>>
>>> On Mon, Mar 9, 2026 at 7:55 PM Sung Yun <[email protected]> wrote:
>>>
>>>> Thanks for raising this Huaxin. I do think this is very much worth
>>>> discussing.
>>>>
>>>> I also want to acknowledge that we recently updated the contribution
>>>> guide here [1], so there is already some baseline guidance in place around
>>>> AI-assisted contributions.
>>>>
>>>> My instinct is that we should be careful not to make this too much
>>>> about AI itself, even though I agree that AI is what has made this issue
>>>> much more pronounced. It is now much easier to generate PRs that look ready
>>>> for review on the surface, even when the author has not really gone through
>>>> the content carefully themselves.
>>>>
>>>> Because of that, I think it may be more useful to frame any additional
>>>> guidance around the quality and readiness of the contribution, rather than
>>>> around AI use by itself. That feels like a more durable way to set the
>>>> standard, since it focuses on things we can actually assess consistently in
>>>> review, rather than trying to determine how the content was produced.
>>>>
>>>> On that note, one practical place to start might be to have a more
>>>> formal guideline around when a PR should be marked draft versus ready for
>>>> review. I think a positive direction for the community would be to
>>>> strengthen contributor judgment around what it means for a PR to actually
>>>> be ready for reviewer attention, even if the change looks substantial on
>>>> the surface. We already have a fairly simple mention of the draft PR
>>>> process [2], and maybe that is a natural place to clarify our standard for
>>>> what should be labeled ready for review.
>>>>
>>>> I also think that kind of guideline would be constructive for someone
>>>> who is misreading the readiness of generated code. It gives them a clear
>>>> way to adjust their behavior going forward, without making the first
>>>> response a punishing one. If we start from an assumption of good intent,
>>>> that seems like a better way to help contributors build stronger judgment
>>>> over time.
>>>>
>>>> If the same pattern keeps repeating after that, then I think it makes
>>>> sense to handle it as a contribution-process issue, regardless of whether
>>>> generative tooling was involved. That may also be worth clarifying, and it
>>>> aligns with your question about limiting contributions from people who
>>>> repeatedly ignore these guidelines, although I hope clearer standards help
>>>> avoid getting to that point.
>>>>
>>>> Cheers,
>>>> Sung
>>>>
>>>> [1] https://github.com/apache/iceberg/pull/15213
>>>> [2] https://iceberg.apache.org/contribute/#pull-request-process
>>>>
>>>> On 2026/03/10 00:52:43 huaxin gao wrote:
>>>> > Hi everyone,
>>>> >
>>>> > Some recent PRs look like they were made entirely by AI: finding
>>>> issues,
>>>> > writing code, opening PRs, and replying to review comments, with no
>>>> human
>>>> > review and no disclosure.
>>>> >
>>>> > Our guidelines already say contributors are expected to understand
>>>> their
>>>> > code, verify AI output before submitting, and disclose AI usage. The
>>>> > problem is there's nothing about what happens when someone ignores
>>>> them.
>>>> >
>>>> > Should we define consequences? For example:
>>>> >
>>>> >
>>>> >    - Closing PRs that were clearly not reviewed by a human before
>>>> submitting
>>>> >    - Limiting contributions from people who repeatedly ignore these
>>>> >    guidelines
>>>> >
>>>> > It's OK to use AI to help write code, but submitting AI output without
>>>> > looking at it and leaving it to maintainers to catch the problems is
>>>> not
>>>> > OK.
>>>> >
>>>> > What do you all think?
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Huaxin
>>>> >
>>>>
>>>
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

Apache Iceberg — a high-performance table format for huge analytic tables. This is the Java reference implementation. Multi-engine support: Spark (3.4, 3.5, 4.0, 4.1), Flink (1.20, 2.0, 2.1), Kafka Connect, Hive.

## Build Commands

Requires **Java 17 or 21**. Uses Gradle 8.x with parallel builds and build cache enabled.

```bash
# Build everything (skip tests for speed)
./gradlew build -x test -x integrationTest

# Run all tests in a module
./gradlew :iceberg-core:test

# Run a single test class
./gradlew :iceberg-core:test --tests org.apache.iceberg.TestSomeClass

# Run a single test method
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestSomeClass.testMethod"

# Format code (required before commits)
./gradlew spotlessApply

# Format across all Spark/Flink versions
./gradlew spotlessApply -DallModules

# Check API binary compatibility
./gradlew revapi
```

Module names follow the pattern `:iceberg-<module>`, e.g. `:iceberg-api`, `:iceberg-core`, `:iceberg-data`, `:iceberg-parquet`. Spark modules use `:iceberg-spark-<sparkVersion>` (e.g. `:iceberg-spark-4.1`). Flink modules use `:iceberg-flink-<flinkVersion>`.

By default only the default Spark/Flink versions are built. Use `-DsparkVersions=3.4,3.5,4.0,4.1` or `-DallModules` to include all.

## Architecture

| Module | Purpose |
|--------|---------|
| `api/` | Public interfaces: Table, Scan, Schema, Snapshot, Catalog, FileIO, DataFile, DeleteFile |
| `core/` | Core implementations: metadata, manifests, operations, transactions, catalog base classes |
| `common/` | Shared utilities used across modules |
| `data/` | Direct JVM table read/write access |
| `parquet/`, `orc/`, `arrow/` | File format integrations |
| `spark/` | Spark DSv2 integration (versioned subdirectories) |
| `flink/` | Flink integration (versioned subdirectories) |
| `hive-metastore/` | Hive Metastore Thrift client |
| `kafka-connect/` | Kafka Connect sink |
| `open-api/` | REST catalog OpenAPI spec |
| `format/` | Iceberg format specification (Markdown) |

**Key design patterns:**
- Immutable metadata — all table state is versioned through immutable snapshots
- Optimistic concurrency — ACID transactions via MVCC
- Lazy manifest loading — manifest files loaded on-demand
- Pluggable file formats (Parquet, ORC, Avro) and catalogs
- API binary compatibility enforced via RevAPI
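
The optimistic-concurrency bullet can be sketched generically. This is a toy
Python model of the compare-and-swap commit loop, not Iceberg's actual API
(all class and method names here are invented for illustration):

```python
class OptimisticTable:
    """Toy model of snapshot-based optimistic commits (not Iceberg code)."""

    def __init__(self):
        self.version = 0       # current metadata version
        self.snapshots = [()]  # immutable snapshots; index = version

    def commit(self, base_version, new_snapshot):
        # Compare-and-swap: succeed only if nobody committed since we read.
        if base_version != self.version:
            return False       # conflict: caller must re-read and retry
        self.snapshots.append(tuple(new_snapshot))
        self.version += 1
        return True

def append_with_retry(table, rows, max_attempts=5):
    # Writers never mutate a snapshot in place; they build a new one
    # from the latest version and retry on conflict.
    for _ in range(max_attempts):
        base = table.version
        current = table.snapshots[base]
        if table.commit(base, current + tuple(rows)):
            return True
    return False

t = OptimisticTable()
append_with_retry(t, ["file-1.parquet"])
append_with_retry(t, ["file-2.parquet"])
print(t.version, t.snapshots[-1])  # prints: 2 ('file-1.parquet', 'file-2.parquet')
```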

## Testing

- **Framework:** JUnit 5 (Jupiter), AssertJ for assertions, Mockito for mocking
- **Integration tests:** Require Docker (TestContainers)
- Tests live in `src/test/java/` within each module
- Use `-DtestParallelism=auto` to parallelize test execution across CPU cores

## Code Style

- Run `./gradlew spotlessApply` to auto-format before committing
- Spotless enforces Google Java Format with Palantir baseline conventions
- Apache License v2 headers required on all source files

# Apache Iceberg

Apache Iceberg is a high-performance format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

This repository contains the **Java reference implementation** of Iceberg.

## Project Structure & Architecture

Iceberg's Java implementation is organized into several key library modules:

| Module | Description |
| :--- | :--- |
| `iceberg-api` | Public Iceberg API (Table, Scan, Schema, Snapshot, Catalog, etc.) |
| `iceberg-common` | Shared utility classes used across other modules |
| `iceberg-core` | Core implementations of the API, metadata management, and Avro support |
| `iceberg-data` | Direct JVM table read/write access |
| `iceberg-parquet`, `iceberg-orc`, `iceberg-arrow` | Optional modules for specific file format support |
| `iceberg-hive-metastore` | Implementation of Iceberg tables backed by Hive Metastore |
| `iceberg-spark` | Spark Datasource V2 integration (versioned: 3.4, 3.5, 4.0, 4.1) |
| `iceberg-flink` | Flink integration (versioned: 1.20, 2.0, 2.1) |
| `iceberg-kafka-connect` | Kafka Connect sink and runtime modules |
| `format/` | Iceberg format specifications (Markdown) |

## Development Requirements

- **Java:** 17 or 21
- **Gradle:** 8.x (wrapper included)
- **Docker:** Required for running integration tests (uses TestContainers)

## Key Commands

### Building

```bash
# Build all modules and run tests
./gradlew build

# Build everything while skipping tests for speed
./gradlew build -x test -x integrationTest

# Build with all Spark/Flink/Kafka versions enabled
./gradlew build -DallModules -x test -x integrationTest
```

### Testing

```bash
# Run all tests in a specific module
./gradlew :iceberg-core:test

# Run a single test class
./gradlew :iceberg-core:test --tests org.apache.iceberg.TestSomeClass

# Run a single test method
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestSomeClass.testMethod"

# Enable parallel test execution
./gradlew test -DtestParallelism=auto
```

### Code Style & Maintenance

```bash
# Automatically format code (required before commits)
./gradlew spotlessApply

# Format across all Spark/Flink/Kafka versions
./gradlew spotlessApply -DallModules

# Check API binary compatibility
./gradlew revapi
```

## Development Conventions

- **Coding Style:** Enforced via [Spotless](https://github.com/diffplug/spotless) using Google Java Format with Palantir baseline conventions.
- **License Headers:** Every source file must include the Apache License v2 header.
- **Testing Framework:** JUnit 5 (Jupiter) for testing, AssertJ for fluent assertions, and Mockito for mocking.
- **Concurrency:** Iceberg relies on immutable metadata and optimistic concurrency (MVCC) for ACID transactions.
- **Versioning:** Spark and Flink integrations are version-specific and located in subdirectories (e.g., `spark/v3.5`).

## Documentation

- **Official Site:** [iceberg.apache.org](https://iceberg.apache.org)
- **Specifications:** [Iceberg Table Spec](https://iceberg.apache.org/spec/)
- **Local Docs:** For building documentation locally, refer to [site/README.md](site/README.md).
