Hi all,

I'd like to refactor the entire OSSFileIO implementation to improve its performance and fix several bugs.

## Background

First, let me briefly explain how the following test results were obtained. I implemented a FileIO benchmark that runs both S3FileIO and OSSFileIO against the same Aliyun OSS bucket from the same VM. Aliyun OSS is S3-protocol compatible, so S3FileIO can target it directly. I also ensured that disk, memory, CPU, and network bandwidth were not bottlenecks, and I used identical runtime parameters. Any performance differences in the results should therefore come from the FileIO implementations themselves.

## Issues

### 1. Random Read: Critical Performance Issue

The random read code path has a serious problem that results in extremely poor random read performance.

**Test Results**

```
Benchmark                   (bufferSizeKB)  (fileIOClass)                            (fileSizeKB)  Mode  Cnt      Score       Error  Units
FileIOBenchmark.randomRead            1024  org.apache.iceberg.aws.s3.S3FileIO             131072  avgt    4   1817.108 ±    37.337  ms/op
FileIOBenchmark.randomRead            1024  org.apache.iceberg.aliyun.oss.OSSFileIO        131072  avgt    5  27164.064 ± 24437.452  ms/op
```

With a buffer size of 1MB and a total file size of 128MB, OSSFileIO is more than 10x slower than S3FileIO: 27164 ms/op vs. 1817 ms/op, roughly 15x on average.

**Analysis**

When a random read finishes, `OSSInputStream` calls `close()` on the underlying stream. That call keeps consuming the remaining data on the TCP connection (the rest of the HTTP response body) before it returns, which causes unnecessary waiting. In contrast, `S3InputStream` calls `abort()`, which tears down the TCP connection directly.

**Problems and Impact**

1. Calling `close()` wastes both time and network bandwidth. The impact is significant. A slowdown of more than 10x, as in the benchmark above, may make OSSFileIO completely unusable in certain scenarios.
2. `OSSInputStream` does not implement `RangeReadable`, although `S3InputStream` does. As a result, every random read disrupts the sequential read stream. The impact is moderate.

### 2. Sequential Write: Poor Performance

**Test Results**

```
Benchmark                        (bufferSizeKB)  (fileIOClass)                            (fileSizeKB)  Mode  Cnt      Score       Error  Units
FileIOBenchmark.sequentialWrite            1024  org.apache.iceberg.aliyun.oss.OSSFileIO       1048576  avgt    5   4162.820 ±   162.809  ms/op
FileIOBenchmark.sequentialWrite            1024  org.apache.iceberg.aws.s3.S3FileIO            1048576  avgt    4   1615.085 ±    73.897  ms/op
```

With a buffer size of 1MB and a total file size of 1GB, OSSFileIO is about 2.6x slower: 4163 ms/op vs. 1615 ms/op. In terms of per-stream bandwidth, S3FileIO achieves roughly 634MB/s (1024MB / 1.615s), while OSSFileIO achieves only about 246MB/s (1024MB / 4.163s).

**Analysis**

The current OSSFileIO implementation first writes all data to a local file. It then uploads the entire file with the `PutObject` API. For large files, S3FileIO instead uploads in parts (32MB per part by default). These uploads run asynchronously and concurrently, so the upload time overlaps with the upper-layer business logic.

**Problems and Impact**

1. Sequential write performance is roughly 2.6x worse. The impact is moderate: usable, but suboptimal.
2. File size has an upper limit. `PutObject` accepts objects of at most 5GB, while multipart upload supports up to about 48TB. This may make OSSFileIO unusable in some scenarios.
3. Page cache thrashing. OSSFileIO accumulates all data in a single local file, so the dirty pages it creates in the page cache may trigger disk flushing. In contrast, S3FileIO deletes each 32MB part file after uploading it, which avoids excessive page cache accumulation. In memory-constrained or disk-performance-constrained environments, this may become an upload throughput bottleneck.

### 3. OSS SDK Version Update

The OSS SDK now has a brand-new V2 version (see https://github.com/aliyun/alibabacloud-oss-java-sdk-v2). It offers improvements in both community activity and performance.

## Plan

I propose to complete this work in two phases:

1. Refactor the entire OSSFileIO to fix the issues described above.
2. Continue with deeper performance optimizations based on Aliyun OSS-specific features and prefetching.

Looking forward to your feedback and suggestions!

Thanks,
Liquan Liu
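The following sketch makes the `close()` versus `abort()` difference from issue 1 concrete. It is minimal and self-contained and does not use the OSS or AWS SDKs. `RangedResponseStream` is a hypothetical stand-in for the HTTP response body of a ranged GET. The sketch assumes the stream was opened from the read position to the end of the object. In this stand-in, `close()` drains whatever the caller has not read, which is what a connection-reusing HTTP client does. `abort()` simply drops the connection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/**
 * Hypothetical stand-in for the body of a ranged GET response.
 * close() drains the unread remainder (so the connection could be reused);
 * abort() drops the connection without transferring any more data.
 */
class RangedResponseStream extends InputStream {
  private long remaining;      // body bytes not yet consumed
  private long drainedOnClose; // bytes consumed only to finish the response
  private boolean closed;

  RangedResponseStream(long contentLength) {
    this.remaining = contentLength;
  }

  @Override
  public int read() throws IOException {
    if (closed) {
      throw new IOException("stream closed");
    }
    if (remaining == 0) {
      return -1;
    }
    remaining--;
    return 0; // payload content is irrelevant for this illustration
  }

  @Override
  public void close() {
    if (!closed) {
      // Simulates reading and discarding the rest of the body from the socket.
      drainedOnClose += remaining;
      remaining = 0;
      closed = true;
    }
  }

  /** Tears down the connection immediately; unread bytes are never transferred. */
  public void abort() {
    remaining = 0;
    closed = true;
  }

  long drainedOnClose() {
    return drainedOnClose;
  }
}

public class CloseVsAbortDemo {
  /** Reads readBytes from a fresh stream, then closes or aborts it; returns drained bytes. */
  static long drainedAfterPartialRead(long contentLength, int readBytes, boolean abort) {
    RangedResponseStream stream = new RangedResponseStream(contentLength);
    try {
      stream.readNBytes(readBytes);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    if (abort) {
      stream.abort();
    } else {
      stream.close();
    }
    return stream.drainedOnClose();
  }

  public static void main(String[] args) {
    long objectSize = 128L * 1024 * 1024; // 128MB object, as in the benchmark
    int readSize = 1024 * 1024;           // one 1MB random read
    System.out.println("close(): " + drainedAfterPartialRead(objectSize, readSize, false));
    System.out.println("abort(): " + drainedAfterPartialRead(objectSize, readSize, true));
  }
}
```

With the benchmark's 128MB object and a 1MB read, the `close()` path reads and discards the remaining 127MB, while `abort()` transfers nothing more. The usual trade-off is that an aborted connection cannot be returned to the connection pool. Aborting therefore pays off mainly when a large part of the response is still unread.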

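For issue 2, the core idea of a multipart writer is an `OutputStream` that cuts the byte stream into fixed-size parts. It hands each part to an uploader as soon as the part is full, instead of staging the whole object in one local file. In this minimal sketch, `PartBufferingOutputStream` and its `BiConsumer` uploader are hypothetical. A real implementation would differ in three ways: it would stage parts in temporary files or pooled buffers, submit uploads to an executor, and finish with a complete-multipart-upload call through the OSS SDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/**
 * Hypothetical sketch: splits written bytes into parts of partSize bytes and passes
 * each finished part, with its 1-based part number, to the uploader callback.
 */
public class PartBufferingOutputStream extends OutputStream {
  private final int partSize;
  private final BiConsumer<Integer, byte[]> uploader;
  private final ByteArrayOutputStream current = new ByteArrayOutputStream();
  private int nextPartNumber = 1;
  private boolean closed;

  public PartBufferingOutputStream(int partSize, BiConsumer<Integer, byte[]> uploader) {
    if (partSize <= 0) {
      throw new IllegalArgumentException("partSize must be positive");
    }
    this.partSize = partSize;
    this.uploader = uploader;
  }

  @Override
  public void write(int b) {
    write(new byte[] {(byte) b}, 0, 1);
  }

  @Override
  public void write(byte[] b, int off, int len) {
    if (closed) {
      throw new IllegalStateException("stream closed");
    }
    while (len > 0) {
      // Copy only as much as still fits into the current part.
      int n = Math.min(len, partSize - current.size());
      current.write(b, off, n);
      off += n;
      len -= n;
      if (current.size() == partSize) {
        flushPart(); // a full part can be uploaded while the caller keeps writing
      }
    }
  }

  /** Hands the buffered part to the uploader and starts a new, empty part. */
  private void flushPart() {
    uploader.accept(nextPartNumber++, current.toByteArray());
    current.reset();
  }

  @Override
  public void close() {
    if (closed) {
      return;
    }
    closed = true;
    // The last part may be smaller than partSize.
    if (current.size() > 0) {
      flushPart();
    }
  }

  public static void main(String[] args) {
    List<Integer> partSizes = new ArrayList<>();
    try (PartBufferingOutputStream out =
        new PartBufferingOutputStream(32, (partNumber, data) -> partSizes.add(data.length))) {
      out.write(new byte[100], 0, 100);
    }
    System.out.println(partSizes); // [32, 32, 32, 4]
  }
}
```

With a 32MB part size, a 1GB file becomes 32 parts, and each part can begin uploading while later data is still being written. A production writer would typically also fall back to a single `PutObject` for objects smaller than one part.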