GitHub user MukundaKatta added a comment to the discussion: Safe way to
periodically add arrow RecordBatch to a file
Yes, and the choice between Arrow IPC and Parquet really matters for your
access patterns — they behave very differently when you want "append
periodically + random access later."
### Parquet: append = new row group per write
Parquet files are a sequence of row groups followed by a footer.
`parquet::arrow::FileWriter` lets you write N row groups in a single open file:
```cpp
#include <parquet/arrow/writer.h>
auto schema = batch_0->schema();
std::shared_ptr<parquet::arrow::FileWriter> writer;
ARROW_ASSIGN_OR_RAISE(
writer, parquet::arrow::FileWriter::Open(
*schema, arrow::default_memory_pool(), out_stream,
parquet::WriterProperties::Builder().build(),
parquet::ArrowWriterProperties::Builder().store_schema()->build()));
// At each epoch end:
ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch_epoch_N));
// Flush so readers can see the data even mid-write:
ARROW_RETURN_NOT_OK(writer->NewRowGroup(batch_epoch_N->num_rows()));
// On training end:
ARROW_RETURN_NOT_OK(writer->Close()); // writes the footer
```
Important: **the file is not a valid Parquet file until `Close()` writes the
footer.** If the training job crashes before Close, you lose *all* row groups —
this is the "safely survive interruption" question at the core of your design.
Two common workarounds:
**(a) One file per N epochs + a dataset view.** Close the writer every K
epochs, start a new file. Readers open the directory as a Parquet Dataset;
Arrow transparently unions the files. You lose at most K epochs on crash.
Random access is fine — Parquet's page index + row group statistics let readers
skip cheaply.
**(b) Periodic footer write via `Close()` + reopen.** Close the file, then
reopen with a fresh `FileWriter` — but Parquet doesn't natively support
appending to an existing file in one footer; you'd need to either rewrite the
footer (library-level surgery, fragile) or just go with (a).
### Arrow IPC stream format: true append-safe
If your priority is "every epoch's write is durably readable even after a
crash," Arrow IPC **stream** format is a better fit:
```cpp
#include <arrow/ipc/writer.h>
std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
ARROW_ASSIGN_OR_RAISE(writer, arrow::ipc::MakeStreamWriter(out_stream, schema));
// Each epoch:
ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch_epoch_N));
// Flush the underlying OS buffer so a crash-after-write keeps the data.
ARROW_RETURN_NOT_OK(out_stream->Flush());
```
The stream format has no footer — each batch is self-contained with its own
length prefix. A truncated file is still a valid prefix of a stream: readers
read batches until EOF or a bad frame and stop. No "file gets corrupted if
interrupted" failure mode.
Trade-off: **no random access.** To read batch 47, you have to sequentially
skip 0..46. For your use case that may be fine (reading is much rarer than
writing), but if you want O(1) jump-to-epoch, you need an index.
### The best of both: IPC file + external index
Arrow IPC **file** format (different from stream) has a footer with per-message
offsets — it's designed for random access. It has the same "crash-before-Close
= unreadable footer" problem as Parquet, but you can work around it by writing
an external index:
```
heavily_nested_output/
epoch-0001.arrow (IPC file, one batch, 1 epoch's data)
epoch-0002.arrow
...
MANIFEST.jsonl ({"epoch": 1, "file": "...", "rows": 123, "sha256": "..."})
```
- One file per epoch = bounded crash loss (last in-flight write).
- MANIFEST is append-only, `fsync` after each line.
- Random access to epoch N = one `fopen(MANIFEST)` scan +
`mmap(epoch-NNNN.arrow)`.
- Union view for scans: `arrow::dataset::FileSystemDataset::Make(...)` on the
directory.
This is effectively what modern training frameworks (Weights & Biases' artifact
layer, MLflow's run artifacts) do under the hood.
### Recommendation for your shape
Given "training loop, epoch-end writes, random access reads, tolerate-crash":
1. **One Arrow IPC *file* per epoch** (one row group equivalent), plus an
append-only JSONL manifest with fsync. Simplest, most crash-resilient, random
access via manifest.
2. If you specifically want one file on disk, **Parquet with
row-group-per-epoch + `FileWriter::Close()` in an `atexit` hook + periodic
checkpoint files**. More moving parts.
3. If you never need random access (replay-only analysis), **Arrow IPC stream**
+ `Flush()` per epoch is the minimum viable code.
The binary-blob trick you linked for the nested tensor shapes transfers cleanly
across all three.
GitHub link:
https://github.com/apache/arrow/discussions/48124#discussioncomment-16658186
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]