Hi, I was reading the following tutorial from the Databricks guide on writing streaming output to S3: https://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/08%20Write%20Output%20To%20S3.html
It states that sometimes I need to compact the many small files (e.g. from Spark Streaming) into one big compacted file. I understand why: better read performance, avoiding the "many small files" problem, etc. My questions are:

1. What happens when I have a big Parquet file partitioned by some field and I want to append new small files to it? Does Spark overwrite the whole dataset, or can it append the new data at the end? (A sketch of the write I have in mind is at the end of this post.)

2. While the append is happening, how can I ensure that readers of the big Parquet files are not blocked and won't get any errors? In other words, are the files still "available" while new data is being appended to them?

I will highly appreciate any pointers. Thanks in advance, Igor
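
To make question 1 concrete, here is roughly the write pattern I have in mind. This is just a sketch, not my actual job: the bucket paths, the `event_date` partition column, and the batch directory name are made-up placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-append-question").getOrCreate()

# Existing compacted dataset, partitioned by a date column (placeholder path/column).
base_path = "s3a://my-bucket/events_compacted/"

# One small batch produced by the streaming job (placeholder path).
new_batch = spark.read.parquet("s3a://my-bucket/streaming_output/batch-0042/")

# Question 1: does this rewrite the whole dataset, or only add new part files
# under the matching partition directories?
new_batch.write.mode("append").partitionBy("event_date").parquet(base_path)

# Question 2: while this write is running, can other jobs still read base_path
# without being blocked or hitting errors/partial results?
```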