Building Spark + Hadoop Docker for OpenShift

2020-03-30 Thread Antoine DUBOIS
ecret" \ --conf "spark.kubernetes.namespace=spark2" \ --conf "spark.executor.instances=4" \ --class SparkPi "local:///opt/jar/sparkpi_2.10-1.0.jar" 10 of course /opt/jar/sparkpi_2.10-1.0.jar is part of my docker build. Thank you in advance. Antoine DUBOIS CCIN2P3 smime.p7s Description: S/MIME Cryptographic Signature

Re: Solved: Identify bottleneck

2019-12-20 Thread Antoine DUBOIS
…number of cores, and processing uncompressed data is indeed faster. My bottleneck seems to be the compression. Thank you all, and have a merry Christmas. From: "ayan guha" To: "Enrico Minack" Cc: "Antoine DUBOIS", "Chris Teoh", user@spark.apache.org Sent…
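A minimal way to reproduce that comparison in spark-shell, with hypothetical paths: time the same action over a compressed and an uncompressed copy of the input.

    // Hypothetical paths. gzip is not splittable, so the compressed read
    // decompresses in a single task; the plain file splits across cores.
    val compressed   = spark.read.option("header", "true").csv("/data/fs-dump.csv.gz")
    val uncompressed = spark.read.option("header", "true").csv("/data/fs-dump.csv")

    spark.time(compressed.count())
    spark.time(uncompressed.count())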

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Also, the framework allows executing all the modifications at the same time as one big request (but I won't paste it here; it would not be really relevant). From: "Antoine DUBOIS" To: "Enrico Minack" Cc: "Chris Teoh", "user @spark" Sent: Wednesday…

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
…didn't have time to let it finish. From: "Enrico Minack" To: "Chris Teoh", "Antoine DUBOIS" Cc: "user @spark" Sent: Wednesday, 18 December 2019 14:29:07 Subject: Re: Identify bottleneck Good points, but single-line CSV files are splittable (n…

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
nack" À: user@spark.apache.org, "Antoine DUBOIS" Envoyé: Mercredi 18 Décembre 2019 11:13:38 Objet: Re: Identify bottleneck How many withColumn statements do you have? Note that it is better to use a single select, rather than lots of withColumn. This also makes drops redundant. Readin

Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Hello, I'm working on an ETL based on CSV files describing file systems, transforming them into Parquet so I can work on them easily to extract information. I'm using Mr. Powers' Daria framework to do so. I have quite different inputs and a lot of transformations, and the framework helps organize the code.
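A minimal sketch of such a pipeline, with hypothetical paths and a hypothetical transformation; custom transformation functions compose through Dataset.transform, which is the pattern Daria builds on:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("fs-etl").getOrCreate()

    // A hypothetical Daria-style transformation: one self-contained step.
    def withSizeMb(df: DataFrame): DataFrame =
      df.withColumn("size_mb", col("size") / (1024 * 1024))

    spark.read
      .option("header", "true")
      .csv("/ceph/input/fs-dump.csv")   // hypothetical input path
      .transform(withSizeMb)
      .write
      .mode("overwrite")
      .parquet("/ceph/output/fs-dump")  // hypothetical output path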

Spark and ZStandard

2019-08-19 Thread Antoine DUBOIS
Hello, I'm using Hadoop 3.1.2 with YARN and Spark 2.4.2. I'm trying to read files compressed with the zstd command-line tool from the Spark shell. After a huge fight to finally understand issues with library imports and other things, I no longer get errors when trying to read those files. However, if I tr…
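For reference, reading such a file from spark-shell can look like the sketch below (path hypothetical). Hadoop 3.x ships org.apache.hadoop.io.compress.ZStandardCodec, which is resolved from the .zst extension, provided the native zstd library is available on every node:

    // Hypothetical path; requires the Hadoop native zstd library on all nodes.
    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/fs-dump.csv.zst") // .zst maps to ZStandardCodec

    df.show(5)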

Spark streaming

2019-05-17 Thread Antoine DUBOIS
Hello, I have a question regarding a use case. I have an ETL using Spark that works great. I use CephFS mounted on all Spark nodes to store data. However, one problem I have is that bzip2-compressing and transferring from the source to Spark storage takes really long. I would like to be able to process the file as…
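If the intent is to process files as they land, the Structured Streaming file source covers that use case; a sketch with a hypothetical directory and schema:

    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    // File sources require an explicit schema (hypothetical here).
    val schema = new StructType()
      .add("path", StringType)
      .add("size", LongType)

    val incoming = spark.readStream
      .schema(schema)
      .option("header", "true")
      .csv("/ceph/incoming/") // each new file is picked up in a micro-batch

    incoming.writeStream
      .format("parquet")
      .option("path", "/ceph/output/")
      .option("checkpointLocation", "/ceph/checkpoints/etl")
      .start()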

Using spark and mesos container with host_path volume

2018-12-03 Thread Antoine DUBOIS
Hello, I'm trying to mount a local Ceph volume into my Mesos container. My CephFS is mounted on all agents at /ceph. I'm using Spark 2.4 with Hadoop 3.11, and I'm not using Docker to deploy Spark. The only option I could find to mount a volume, though, is the following (which is also a line I added t…
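The option referred to is presumably spark.mesos.executor.docker.volumes, which Spark documents for the Docker containerizer with the volume spec format [host_path:]container_path[:ro|:rw]. A sketch, noting that it may not help a non-Docker deployment like the one described:

    import org.apache.spark.SparkConf

    // Sketch only: the image name is hypothetical, and this conf targets the
    // Docker containerizer, which the original message says is not in use.
    val conf = new SparkConf()
      .set("spark.mesos.executor.docker.image", "myrepo/spark:2.4")
      .set("spark.mesos.executor.docker.volumes", "/ceph:/ceph:rw")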