GitHub user MukundaKatta edited a comment on the discussion: Safe way to periodically add arrow RecordBatch to a file
Quick trade-off summary for "periodic writes + random access + crash tolerant": Parquet writes a row group per call, but the file is only valid once `FileWriter::Close()` writes the footer, so a crash mid-training can lose everything. Arrow IPC **stream** format is append-safe (no footer; each batch is self-contained with a length prefix) but has no random access. Arrow IPC **file** format has random access via its footer but the same close-to-read issue: the footer is only written on close. In practice the cleanest shape for your case is one IPC file per epoch plus an append-only JSONL manifest (`fsync` per line). Random access goes through the manifest, and crash loss is bounded to the in-flight file. The Dataset API unions the per-epoch files for scans when you need to read everything.

GitHub link: https://github.com/apache/arrow/discussions/48124#discussioncomment-16658186
