Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Arwin Tio
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
```

The problem is that my Spark cluster has about 500 partitions/tasks/executors 
(not sure the terminology), so I end up with files that look like:

```
part-00001-{UUID}_00001.c000.snappy.parquet
part-00001-{UUID}_00002.c000.snappy.parquet
...
part-00001-{UUID}_00500.c000.snappy.parquet

part-00002-{UUID}_00001.c000.snappy.parquet
part-00002-{UUID}_00002.c000.snappy.parquet
...
part-00002-{UUID}_00500.c000.snappy.parquet

part-00500-{UUID}_00001.c000.snappy.parquet
part-00500-{UUID}_00002.c000.snappy.parquet
...
part-00500-{UUID}_00500.c000.snappy.parquet
```

That's 500x500 = 250,000 bucketed parquet files! It takes forever for the 
`FileOutputCommitter` to commit that to S3.

Is there a way to generate **one file per bucket**, like in Hive? Or is there a 
better way to deal with this problem? As of now it seems like I have to choose 
between lowering the parallelism of my cluster (reduce number of writers) or 
reducing the parallelism of my parquet files (reduce number of buckets), which 
will lower the parallelism of my downstream jobs.

Thanks


Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Phillip Henry
Hi, Arwin.

If I understand you correctly, this is totally expected behaviour.

I don't know much about saving to S3, but maybe you could write to HDFS
first and then copy everything to S3? The write to HDFS will probably be
much faster, since Spark/HDFS will write locally or to a machine on the same
LAN. After writing to HDFS, you can then iterate over the resulting
sub-directories (representing each bucket) and coalesce the files in them.
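
For the compaction step, a rough sketch in Java follows. It assumes a SparkSession
named `spark` is in scope and that the bucketed output lands under an HDFS path such
as hdfs:///tmp/my_table with one sub-directory per bucket; both the path and the
layout are illustrative assumptions, not tested code.

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SaveMode;

// Walk the table directory on HDFS and rewrite each bucket's files as one file.
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
for (FileStatus entry : fs.listStatus(new Path("hdfs:///tmp/my_table"))) {
    if (!entry.isDirectory()) {
        continue; // skip _SUCCESS and other top-level files
    }
    String bucketDir = entry.getPath().toString();
    spark.read().parquet(bucketDir)           // re-read one bucket's files
         .coalesce(1)                         // collapse them into a single output file
         .write()
         .mode(SaveMode.Overwrite)
         .parquet(bucketDir + "_compacted");  // write the compacted copy alongside
}
```

The compacted copies could then be pushed to S3 in a single pass.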

Regards,

Phillip




On Thu, Jul 4, 2019 at 8:22 AM Arwin Tio  wrote:

> I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
>
> ```java
> dataframe.write()
> .format("parquet")
> .bucketBy(500, bucketColumn1, bucketColumn2)
> .mode(SaveMode.Overwrite)
> .option("path", "s3://my-bucket")
> .saveAsTable("my_table");
> ```
>
> The problem is that my Spark cluster has about 500
> partitions/tasks/executors (not sure the terminology), so I end up with
> files that look like:
>
> ```
> part-00001-{UUID}_00001.c000.snappy.parquet
> part-00001-{UUID}_00002.c000.snappy.parquet
> ...
> part-00001-{UUID}_00500.c000.snappy.parquet
>
> part-00002-{UUID}_00001.c000.snappy.parquet
> part-00002-{UUID}_00002.c000.snappy.parquet
> ...
> part-00002-{UUID}_00500.c000.snappy.parquet
>
> part-00500-{UUID}_00001.c000.snappy.parquet
> part-00500-{UUID}_00002.c000.snappy.parquet
> ...
> part-00500-{UUID}_00500.c000.snappy.parquet
> ```
>
> That's 500x500 = 250,000 bucketed parquet files! It takes forever for the
> `FileOutputCommitter` to commit that to S3.
>
> Is there a way to generate **one file per bucket**, like in Hive? Or is
> there a better way to deal with this problem? As of now it seems like I
> have to choose between lowering the parallelism of my cluster (reduce
> number of writers) or reducing the parallelism of my parquet files (reduce
> number of buckets), which will lower the parallelism of my downstream jobs.
>
> Thanks
>


Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Silvio Fiorito
You need to repartition first (at a minimum by bucketColumn1), since each task 
writes out its own set of bucket files. If the bucket keys are distributed randomly 
across the RDD partitions, then you will get multiple files per bucket.
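
A minimal sketch of that suggestion, reusing the bucket count and column names from
the question (the column names are carried over as plain strings; treat the snippet
as an untested illustration rather than a verified fix):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.SaveMode;

// Repartition by the bucket columns first, so rows that share bucket key values
// land in the same task; each task then writes only the buckets it actually holds,
// which keeps the total file count close to the number of buckets.
dataframe
    .repartition(500, col("bucketColumn1"), col("bucketColumn2"))
    .write()
    .format("parquet")
    .bucketBy(500, "bucketColumn1", "bucketColumn2")
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```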

From: Arwin Tio 
Date: Thursday, July 4, 2019 at 3:22 AM
To: "user@spark.apache.org" 
Subject: Parquet 'bucketBy' creates a ton of files

I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
.format("parquet")
.bucketBy(500, bucketColumn1, bucketColumn2)
.mode(SaveMode.Overwrite)
.option("path", "s3://my-bucket")
.saveAsTable("my_table");
```

The problem is that my Spark cluster has about 500 partitions/tasks/executors 
(not sure the terminology), so I end up with files that look like:

```
part-00001-{UUID}_00001.c000.snappy.parquet
part-00001-{UUID}_00002.c000.snappy.parquet
...
part-00001-{UUID}_00500.c000.snappy.parquet

part-00002-{UUID}_00001.c000.snappy.parquet
part-00002-{UUID}_00002.c000.snappy.parquet
...
part-00002-{UUID}_00500.c000.snappy.parquet

part-00500-{UUID}_00001.c000.snappy.parquet
part-00500-{UUID}_00002.c000.snappy.parquet
...
part-00500-{UUID}_00500.c000.snappy.parquet
```

That's 500x500 = 250,000 bucketed parquet files! It takes forever for the 
`FileOutputCommitter` to commit that to S3.

Is there a way to generate **one file per bucket**, like in Hive? Or is there a 
better way to deal with this problem? As of now it seems like I have to choose 
between lowering the parallelism of my cluster (reduce number of writers) or 
reducing the parallelism of my parquet files (reduce number of buckets), which 
will lower the parallelism of my downstream jobs.

Thanks


Spark 2.4.3 with hadoop 3.2 docker image.

2019-07-04 Thread José Luis Pedrosa
Hi All

I'm trying to create Docker images that can access Azure services using the
ABFS Hadoop driver, which is only available in Hadoop 3.2.

So I downloaded Spark without Hadoop and generated Spark images using
docker-image-tool.sh itself.

In a new image that uses the resulting image as FROM, I've added the Hadoop 3.2
binary distro and, following
https://spark.apache.org/docs/2.2.0/hadoop-provided.html, I've set:

```
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
```


Then when launching jobs on K8s, it turns out that the driver is started
internally via spark-submit, but the executor seems to be launched with java
directly.

The result is that drivers run correctly, but executors fail due to a missing
slf4j class:

```
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/Logger
  at java.lang.Class.getDeclaredMethods0(Native Method)
  at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
  at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
  at java.lang.Class.getMethod0(Class.java:3018)
  at java.lang.Class.getMethod(Class.java:1784)
  at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
  at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.slf4j.Logger
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
```


If I add it manually to the classpath, then another Hadoop class goes missing.

What is the right way to generate a Docker image for Spark 2.4 with a custom
Hadoop distribution?


Thanks and regards
JL


Avro support broken?

2019-07-04 Thread Paul Wais
Dear List,

Has anybody gotten Avro support to work in PySpark? I see multiple reports of
it being broken on Stack Overflow, and I added my own repro to this ticket:
https://issues.apache.org/jira/browse/SPARK-27623?focusedCommentId=16878896&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16878896

Cheers,
-Paul

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Learning Spark

2019-07-04 Thread Vikas Garg
Hi,

I am a new Spark learner. Can someone guide me on a strategy for gaining
expertise in PySpark?

Thanks!!!


Re: Learning Spark

2019-07-04 Thread Kurt Fehlhauer
Are you a data scientist or data engineer?


On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg  wrote:

> Hi,
>
> I am a new Spark learner. Can someone guide me on a strategy for gaining
> expertise in PySpark?
>
> Thanks!!!
>


Re: Learning Spark

2019-07-04 Thread ayan guha
My best advice is to go through the docs and watch lots of demos/videos
from Spark committers.

On Fri, 5 Jul 2019 at 3:03 pm, Kurt Fehlhauer  wrote:

> Are you a data scientist or data engineer?
>
>
> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg  wrote:
>
>> Hi,
>>
>> I am a new Spark learner. Can someone guide me on a strategy for gaining
>> expertise in PySpark?
>>
>> Thanks!!!
>>
>
--
Best Regards,
Ayan Guha


Re: Learning Spark

2019-07-04 Thread Vikas Garg
I am currently working as a data engineer, working with Power BI and
SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
also able to run queries through Spark on a multi-node cluster DB (I am using
Vertica DB and will later move to HDFS or SQL Server).

I have good knowledge of Python also.

On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer  wrote:

> Are you a data scientist or data engineer?
>
>
> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg  wrote:
>
>> Hi,
>>
>> I am a new Spark learner. Can someone guide me on a strategy for gaining
>> expertise in PySpark?
>>
>> Thanks!!!
>>
>