Hi,
I'm ingesting a lot of small JSON files and converting them to unified Parquet
files, but even the unified files are fairly small (~10MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000 Parquet files.
Basically, what I'm doing is loading the files in the existing directory,
coalescing them, and saving the result to a new directory:
val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")
parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")
Doing this takes over an hour on my 3-node cluster.
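For context, a self-contained version of this kind of job might be sketched as below. This is only a sketch, assuming the Spark 1.x API used above (parquetFile and saveAsParquetFile); the object name MergeParquet, the outputDir helper, the application name, and the yyyy-MM-dd format for currday are assumptions made for illustration.

```scala
import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeParquet {
  // Hypothetical helper: builds the dated output directory path.
  def outputDir(base: String, day: String): String = s"$base/$day"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-parquet"))
    val sqlContext = new SQLContext(sc)

    // Assumption: currday is today's date formatted as yyyy-MM-dd.
    val currday = new SimpleDateFormat("yyyy-MM-dd").format(new Date())

    // Load every Parquet file under the input directory.
    val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")

    // Reduce to 2 partitions (no shuffle), then write them out as Parquet.
    parquetFiles.coalesce(2).saveAsParquetFile(outputDir("/requests_merged", currday))

    sc.stop()
  }
}
```

The path helper is kept separate only so that the directory naming can be checked without running a cluster job.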
Is there a better way to achieve this?
Any ideas what could cause such a simple operation to take so long?
Thanks,
Daniel