ArnavBalyan commented on code in PR #3269:
URL: https://github.com/apache/parquet-java/pull/3269#discussion_r2298310434


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:
##########
@@ -1804,14 +1804,27 @@ private static void copy(SeekableInputStream from, 
PositionOutputStream to, long
    * @throws IOException if there is an error while writing
    */
   public void end(Map<String, String> extraMetaData) throws IOException {
+    final long footerStart = out.getPos();
+
+    // Build the footer metadata) in memory using the helper stream
+    InMemoryPositionOutputStream buffer = new 
InMemoryPositionOutputStream(footerStart);

Review Comment:
   Thanks for the reply! Maybe there is a gap in my understanding. The goal of 
this change was to reduce the likelihood of partial data written to disk during 
the footer write. 
   
   Currently column index, offset, bloom filter and footer is serialized and 
bytes are sent to stream in stages (serialize + **write**, serialize + 
**write** .... etc). any network failure in the serialization process can 
corrupt data (since previous stage may have sent the bytes to disk if the 
internal stream buffer got full, the exposure window is large). 
   
   With the new change the serialized outputs are built in memory against a 
virtual position, then emitting once (serialize + serialize + serialize + 
**write**). This minimizes the risk of partial data writen, since data is held 
back until all serialization is complete. The underlying concrete stream may 
have varying sizes of in memory buffer and may decide to flush to disk at 
anytime, but avoided in the latter code, since the write is done once at the 
end. 
   
   Completely agree this will not make it atomic, and hard to eliminate this 
kind of problem in the way current code is structured, but this should reduce 
the likelyhood of corrupt data during the footer write. This change is only 
meant to optmize the footer. 
   If I am missing something about the internal stream that negates the gap 
removal, happy to revise and iterate 😀, Appreciate the review and guidance 🙏



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to