Hi,
I'm running Spark 3 on Kubernetes and using the S3A staging committer (the
directory committer) to write data to an S3 bucket. The same setup works fine
with Spark 2, but with Spark 3 the final data (written in Parquet format) is
not visible in the S3 bucket, and a subsequent read of that Parquet data
fails because the path is empty.
Since the S3A staging committer requires a shared file system (such as NFS or
HDFS) for staging data, I have set up a shared PVC for the driver and all
executors, i.e. spark.hadoop.fs.s3a.committer.staging.tmp.path points to a
shared PVC with the ReadWriteMany access mode.
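
For reference, the committer setup described above is usually expressed
through a handful of properties. The following is a minimal sketch, not the
attached spark-default.conf: the property names are the ones documented for
the S3A committers and for Spark's cloud integration module (which has to be
on the classpath for the two binding classes to resolve), while the mount
path /mnt/shared-staging and both helper functions are hypothetical.

```python
# Hypothetical mount point of the shared ReadWriteMany PVC; not taken from the post.
SHARED_STAGING_PATH = "/mnt/shared-staging"


def directory_committer_conf(staging_path):
    """Build the Spark properties that select the S3A directory committer."""
    return {
        # Pick the staging "directory" committer and its shared staging path.
        "spark.hadoop.fs.s3a.committer.name": "directory",
        "spark.hadoop.fs.s3a.committer.staging.tmp.path": staging_path,
        # Route s3a:// output through the S3A committer factory.
        "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
        # Bindings from Spark's cloud integration module.
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(directory_committer_conf(SHARED_STAGING_PATH))))
```

Comparing the printed properties with the ones actually in effect for the
Spark 2 and Spark 3 jobs is one way to spot a binding that is missing in only
one of them.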

In the S3 bucket I can see only the _SUCCESS file, without any data files:

bash-4.2# s3cmd ls  --no-ssl --host=${AWS_ENDPOINT} --host-bucket= s3://rookbucket/shiva/ --recursive | grep people.parquet
2021-02-22 11:55      4074   s3://rookbucket/shiva/people.parquet/_SUCCESS
bash-4.2#

The _SUCCESS file is in JSON format, with the following content:

==============================
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1613994948681,
  "date" : "Mon Feb 22 11:55:48 UTC 2021",
  "hostname" : "spark-thrift-hdfs",
  "committer" : "directory",
  "description" : "Task committer attempt_20210222115547_0000_m_000000_0",
  "metrics" : {
    "stream_write_block_uploads" : 0,
    "files_created" : 5,
    "S3guard_metadatastore_put_path_latencyNumOps" : 0,
    "stream_write_block_uploads_aborted" : 0,
    "committer_commits_reverted" : 0,
    "op_open" : 2,
    "stream_closed" : 12,
    "committer_magic_files_created" : 0,
    "object_copy_requests" : 0,
    "s3guard_metadatastore_initialization" : 0,
    "S3guard_metadatastore_put_path_latency90thPercentileLatency" : 0,
    "stream_write_block_uploads_committed" : 0,
    "S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz)" : 0,
    "committer_bytes_committed" : 0,
    "op_create" : 5,
    "stream_read_fully_operations" : 0,
    "committer_commits_completed" : 0,
    "object_put_requests_active" : 0,
    "s3guard_metadatastore_retry" : 0,
    "stream_write_block_uploads_active" : 0,
    "stream_opened" : 12,
    "S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz)" : 0,
    "op_create_non_recursive" : 0,
    "object_continue_list_requests" : 0,
    "committer_jobs_completed" : 5,
    "S3guard_metadatastore_put_path_latency50thPercentileLatency" : 0,
    "stream_close_operations" : 12,
    "stream_read_operations" : 378,
    "object_delete_requests" : 4,
    "fake_directories_deleted" : 8,
    "stream_aborted" : 0,
    "op_rename" : 0,
    "object_multipart_aborted" : 0,
    "committer_commits_created" : 0,
    "op_get_file_status" : 26,
    "s3guard_metadatastore_put_path_request" : 9,
    "committer_commits_failed" : 0,
    "stream_bytes_read_in_close" : 0,
    "op_glob_status" : 1,
    "stream_read_exceptions" : 0,
    "op_exists" : 5,
    "stream_read_version_mismatches" : 0,
    "S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_put_path_latency95thPercentileLatency" : 0,
    "stream_write_block_uploads_pending" : 4,
    "directories_created" : 0,
    "S3guard_metadatastore_throttle_rateNumEvents" : 0,
    "S3guard_metadatastore_put_path_latency99thPercentileLatency" : 0,
    "stream_bytes_backwards_on_seek" : 0,
    "stream_bytes_read" : 2997558,
    "stream_write_total_data" : 16282,
    "committer_jobs_failed" : 0,
    "stream_read_operations_incomplete" : 29,
    "files_copied_bytes" : 0,
    "op_delete" : 8,
    "object_put_bytes_pending" : 0,
    "stream_write_block_uploads_data_pending" : 0,
    "op_list_located_status" : 0,
    "object_list_requests" : 19,
    "stream_forward_seek_operations" : 0,
    "committer_tasks_completed" : 0,
    "committer_commits_aborted" : 0,
    "object_metadata_requests" : 45,
    "object_put_requests_completed" : 4,
    "stream_seek_operations" : 0,
    "op_list_status" : 0,
    "store_io_throttled" : 0,
    "stream_write_failures" : 0,
    "op_get_file_checksum" : 0,
    "files_copied" : 0,
    "ignored_errors" : 8,
    "committer_bytes_uploaded" : 0,
    "committer_tasks_failed" : 0,
    "stream_bytes_skipped_on_seek" : 0,
    "op_list_files" : 0,
    "files_deleted" : 0,
    "stream_bytes_discarded_in_abort" : 0,
    "op_mkdirs" : 1,
    "op_copy_from_local_file" : 0,
    "op_is_directory" : 1,
    "s3guard_metadatastore_throttled" : 0,
    "S3guard_metadatastore_put_path_latency75thPercentileLatency" : 0,
    "stream_write_total_time" : 0,
    "stream_backward_seek_operations" : 0,
    "object_put_requests" : 4,
    "object_put_bytes" : 16282,
    "directories_deleted" : 0,
    "op_is_file" : 2,
    "S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0
  },
  "diagnostics" : {
    "fs.s3a.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",
    "fs.s3a.committer.magic.enabled" : "false",
    "fs.s3a.metadatastore.authoritative" : "false"
  },
  "filenames" : [ ]
}

===============================
If I run the same job against the same S3 bucket with Spark 2, it writes the
data to s3://rookbucket/shiva/people.parquet/ and the _SUCCESS file looks
similar to the one above, except that its "filenames" key lists the part
files (the Parquet data files). With Spark 3 that list is empty, as shown
above. There is no exception or error during the write, but the read fails
to get the schema because the Parquet output path is empty.
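
The comparison described above can be sketched as a small check on the
_SUCCESS content. This is a minimal sketch using only the Python standard
library: the sample is a trimmed copy of the JSON shown earlier, the function
name is hypothetical, and treating a zero value of these two committer
counters as suspicious is an assumption rather than documented behavior.

```python
import json

# Trimmed copy of the _SUCCESS content from the post: only the fields checked below.
SAMPLE_SUCCESS = """
{
  "committer" : "directory",
  "metrics" : {
    "committer_bytes_committed" : 0,
    "committer_commits_completed" : 0,
    "object_put_requests_completed" : 4
  },
  "filenames" : [ ]
}
"""


def diagnose_success_marker(text):
    """Return human-readable findings for a _SUCCESS JSON document."""
    data = json.loads(text)
    metrics = data.get("metrics", {})
    findings = []
    # A job that committed output normally lists its part files here.
    if not data.get("filenames"):
        findings.append("filenames is empty")
    # Counters whose zero value is treated as suspicious here (an assumption).
    for counter in ("committer_commits_completed", "committer_bytes_committed"):
        if metrics.get(counter, 0) == 0:
            findings.append(f"{counter} is 0")
    return findings


for finding in diagnose_success_marker(SAMPLE_SUCCESS):
    print(finding)
```

Running the same check on the _SUCCESS file written by the Spark 2 job would
show which of these fields differ between the two versions.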

I'm not sure what is causing this. The Spark configuration used to submit
the job is attached (spark-default.conf
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t11249/spark-default.conf>).

I'm using Ceph as the underlying storage for the S3 buckets. If I check the
data with the rados command, I can see the Parquet data in objects whose
names contain "multipart", as below, but not under the final S3 output path:

bash-4.2# rados ls  -p rook-ceph-store.rgw.buckets.data | grep "part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5"
4bd26ab1-6211-4aa5-92d9-9595ad0ee383.454449.1__multipart_shiva/people.parquet/part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5-c000-spark-66e7529285e54226b94d61c2263be83b.snappy.parquet.2~d3to_jPrAO_BLxTu74GXr_g_sz4pvQF.1

Could someone help me debug this, or point me to any known issue around it?

Regards,
Shiva






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
