Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
So what are you intending to do with the result set produced?

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco.
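A sketch of the post-write rename idea Marco describes, as it might run inside an AWS Lambda using boto3. The bucket, prefix, and target filename are placeholders, and note that S3 has no true rename, so this is a copy followed by a delete:

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"                   # hypothetical bucket
    prefix = "output/year=2023/month=05/"  # one partition folder written by Spark

    # Find the Spark-generated part-* file in the folder, copy it to a
    # chosen name, then delete the original (S3 cannot rename in place).
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        key = obj["Key"]
        if key.split("/")[-1].startswith("part-"):
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": key},
                           Key=prefix + "data.json")
            s3.delete_object(Bucket=bucket, Key=key)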

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
You can try:

    df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")

    hdfs dfs -ls /tmp/pairs.json
    Found 2 items
    -rw-r--r--   3 hduser supergroup    0 2023-05-04 22:21 /tmp/pairs.json/_SUCCESS
    -rw-r--r--   3 hduser supergroup   96 2023-05-04 22:21 /tmp/pairs.json/part-0-21f1
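To go from the single part-* file in that listing to a filename of your choosing, one option is a rename through the JVM Hadoop FileSystem API after the write. A minimal sketch, assuming an active SparkSession named spark and the /tmp/pairs.json path from the listing above:

    # Rename the single part file produced by coalesce(1) to a chosen name.
    hadoop = spark._jvm.org.apache.hadoop.fs
    fs = hadoop.FileSystem.get(spark._jsc.hadoopConfiguration())

    # globStatus matches the Spark-generated part-* file inside the directory
    part_file = fs.globStatus(hadoop.Path("/tmp/pairs.json/part-*"))[0].getPath()
    fs.rename(part_file, hadoop.Path("/tmp/pairs.json/pairs.json"))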

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hi Mich, Thank you. Are you saying this satisfies my requirement? On the other hand, I sense something else is going on. Perhaps the Spark 'part' files should not be thought of as files, but rather as pieces of a conceptual file. If that is true, then your approach (of which I'm well aware) makes sense.

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3 and Google GCS are Hadoop-compatible file systems (HCFS), so Spark shards the output when writing to them to improve read performance. Let us take your code for a drive:

    import findspark
    findspark.init()
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct
    fro
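The archive truncates the code here; a self-contained sketch of the pattern the snippet appears to set up, with sample data and column names invented for illustration:

    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct

    spark = SparkSession.builder.appName("pairs").getOrCreate()

    # Build a small key/value DataFrame, pack the columns into a struct,
    # then write it out and let Spark decide the part-* filenames.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "value"])
    df2 = df.select(struct("key", "value").alias("pair"))
    df2.write.mode("overwrite").json("/tmp/pairs.json")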

Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hello, I am testing writing my DataFrame to S3 using the DataFrame `write` method. It mostly does a great job. However, it fails one of my requirements. Here are my requirements:
- Write to S3
- Use `partitionBy` to automatically make folders based on my chosen partition columns
- Control the resulting filenames (see the sketch after this list)
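A minimal sketch of the write being described, with the bucket, paths, and partition columns as placeholders; partitionBy controls the folder layout, but the filenames inside each folder remain Spark-generated part-* names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-write").getOrCreate()
    df = spark.read.json("s3a://my-bucket/input/")  # hypothetical source

    # Creates output/year=<y>/month=<m>/part-*.json per distinct value pair;
    # the DataFrameWriter API offers no way to choose the part filenames.
    (df.write
       .partitionBy("year", "month")
       .mode("overwrite")
       .json("s3a://my-bucket/output/"))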