GitHub user MukundaKatta edited a comment on the discussion: Safe way to 
periodically add arrow RecordBatch to a file

Quick trade-off summary for "periodic writes + random access + crash tolerant": 
Parquet writes row groups per call, but the file is only valid once 
`FileWriter::Close()` writes the footer, so a crash mid-training can lose 
everything. Arrow IPC **stream** format is append-safe (no footer, batches 
self-contained with length prefix) but has no random access. Arrow IPC **file** 
format has random access via its footer, but it has the same constraint: the 
file is not readable until the writer has been closed. In practice the cleanest 
shape for your case is one IPC file per epoch plus an append-only JSONL 
manifest (`fsync` per line). Random access goes through the manifest, and crash 
loss is bounded to the in-flight file. When you need to read everything, the 
Dataset API can union the per-epoch files for scans.
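
A minimal sketch of that layout in Python with `pyarrow`, in case it helps. The function names, the manifest fields (`epoch`, `file`, `num_batches`), and the `epoch-NNNNN.arrow` naming are hypothetical. Writing to a `.tmp` file and renaming it afterwards is an extra assumption on my part, not something the format requires; it just keeps half-written files from ever sitting under their final name.

```python
import json
import os


def append_manifest_entry(manifest_path, entry):
    """Append one JSON line and fsync so the entry survives a crash."""
    line = json.dumps(entry, sort_keys=True) + "\n"
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())


def read_manifest(manifest_path):
    """Return all complete entries, skipping a torn line left by a crash."""
    entries = []
    try:
        with open(manifest_path, "r", encoding="utf-8") as f:
            for line in f:
                if not line.endswith("\n"):
                    continue  # partial final line: the append never finished
                try:
                    entries.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # corrupted or blank line, ignore it
    except FileNotFoundError:
        pass  # no manifest yet means no completed epochs
    return entries


def write_epoch_file(directory, manifest_path, epoch, batches, schema):
    """Write one complete IPC file for this epoch, then record it."""
    # Imported lazily so the manifest helpers above stay stdlib-only.
    import pyarrow as pa

    name = f"epoch-{epoch:05d}.arrow"  # hypothetical naming scheme
    final_path = os.path.join(directory, name)
    tmp_path = final_path + ".tmp"

    num_batches = 0
    with pa.OSFile(tmp_path, "wb") as sink:
        # Leaving the inner `with` closes the writer, which writes the footer.
        with pa.ipc.new_file(sink, schema) as writer:
            for batch in batches:
                writer.write_batch(batch)
                num_batches += 1

    # Flush file contents to disk before publishing the file.
    with open(tmp_path, "rb+") as f:
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)

    # Only files listed in the manifest are considered readable.
    append_manifest_entry(
        manifest_path,
        {"epoch": epoch, "file": name, "num_batches": num_batches},
    )


def read_batch(directory, entry, index):
    """Random access: fetch batch `index` from the file a manifest entry names."""
    import pyarrow as pa

    reader = pa.ipc.open_file(os.path.join(directory, entry["file"]))
    return reader.get_batch(index)
```

On restart, `read_manifest` tells you which epochs finished; any leftover `.tmp` file is the in-flight one and can be deleted. For full scans, something like `pyarrow.dataset.dataset(paths, format="ipc")` over the files listed in the manifest should work. The sketch does not fsync the containing directory after the rename, which a stricter durability guarantee would also need.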

GitHub link: 
https://github.com/apache/arrow/discussions/48124#discussioncomment-16658186
