GitHub user MukundaKatta added a comment to the discussion: Safe way to 
periodically add arrow RecordBatch to a file

Yes, and the choice between Arrow IPC and Parquet really matters for your 
access patterns — they behave very differently when you want "append 
periodically + random access later."

### Parquet: append = new row group per write

Parquet files are a sequence of row groups followed by a footer. 
`parquet::arrow::FileWriter` lets you write N row groups in a single open file:

```cpp
#include <parquet/arrow/writer.h>

auto schema = batch_0->schema();
std::shared_ptr<parquet::arrow::FileWriter> writer;
ARROW_ASSIGN_OR_RAISE(
    writer, parquet::arrow::FileWriter::Open(
                *schema, arrow::default_memory_pool(), out_stream,
                parquet::WriterProperties::Builder().build(),
                
parquet::ArrowWriterProperties::Builder().store_schema()->build()));

// At each epoch end:
ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch_epoch_N));
// Flush so readers can see the data even mid-write:
ARROW_RETURN_NOT_OK(writer->NewRowGroup(batch_epoch_N->num_rows()));

// On training end:
ARROW_RETURN_NOT_OK(writer->Close());   // writes the footer
```

Important: **the file is not a valid Parquet file until `Close()` writes the 
footer.** If the training job crashes before Close, you lose *all* row groups — 
this is the "safely survive interruption" question at the core of your design.

Two common workarounds:

**(a) One file per N epochs + a dataset view.** Close the writer every K 
epochs, start a new file. Readers open the directory as a Parquet Dataset; 
Arrow transparently unions the files. You lose at most K epochs on crash. 
Random access is fine — Parquet's page index + row group statistics let readers 
skip cheaply.

**(b) Periodic footer write via `Close()` + reopen.** Close the file, then 
reopen with a fresh `FileWriter` — but Parquet doesn't natively support 
appending to an existing file in one footer; you'd need to either rewrite the 
footer (library-level surgery, fragile) or just go with (a).

### Arrow IPC stream format: true append-safe

If your priority is "every epoch's write is durably readable even after a 
crash," Arrow IPC **stream** format is a better fit:

```cpp
#include <arrow/ipc/writer.h>

std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
ARROW_ASSIGN_OR_RAISE(writer, arrow::ipc::MakeStreamWriter(out_stream, schema));

// Each epoch:
ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch_epoch_N));
// Flush the underlying OS buffer so a crash-after-write keeps the data.
ARROW_RETURN_NOT_OK(out_stream->Flush());
```

The stream format has no footer — each batch is self-contained with its own 
length prefix. A truncated file is still a valid prefix of a stream: readers 
read batches until EOF or a bad frame and stop. No "file gets corrupted if 
interrupted" failure mode.

Trade-off: **no random access.** To read batch 47, you have to sequentially 
skip 0..46. For your use case that may be fine (reading is much rarer than 
writing), but if you want O(1) jump-to-epoch, you need an index.

### The best of both: IPC file + external index

Arrow IPC **file** format (different from stream) has a footer with per-message 
offsets — it's designed for random access. It has the same "crash-before-Close 
= unreadable footer" problem as Parquet, but you can work around it by writing 
an external index:

```
heavily_nested_output/
  epoch-0001.arrow   (IPC file, one batch, 1 epoch's data)
  epoch-0002.arrow
  ...
  MANIFEST.jsonl     ({"epoch": 1, "file": "...", "rows": 123, "sha256": "..."})
```

- One file per epoch = bounded crash loss (last in-flight write).
- MANIFEST is append-only, `fsync` after each line.
- Random access to epoch N = one `fopen(MANIFEST)` scan + 
`mmap(epoch-NNNN.arrow)`.
- Union view for scans: `arrow::dataset::FileSystemDataset::Make(...)` on the 
directory.

This is effectively what modern training frameworks (Weights & Biases' artifact 
layer, MLflow's run artifacts) do under the hood.

### Recommendation for your shape

Given "training loop, epoch-end writes, random access reads, tolerate-crash":

1. **One Arrow IPC *file* per epoch** (one row group equivalent), plus an 
append-only JSONL manifest with fsync. Simplest, most crash-resilient, random 
access via manifest.
2. If you specifically want one file on disk, **Parquet with 
row-group-per-epoch + `FileWriter::Close()` in an `atexit` hook + periodic 
checkpoint files**. More moving parts.
3. If you never need random access (replay-only analysis), **Arrow IPC stream** 
+ `Flush()` per epoch is the minimum viable code.

The binary-blob trick you linked for the nested tensor shapes transfers cleanly 
across all three.


GitHub link: 
https://github.com/apache/arrow/discussions/48124#discussioncomment-16658186

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to