Merging Parquet Files

2020-08-31 Thread Tzahi File
Hi, I would like to develop a process that merges parquet files. My first intention was to develop it with PySpark using coalesce(1) to create only one file. This process is going to run on a huge number of files. I wanted your advice on the best way to implement it (PySpark isn't a must).
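
A minimal sketch of the approach described above, with hypothetical S3 paths; note that coalesce(1) funnels every row through a single task, so repartition() to a small-but-greater-than-one file count is usually safer on a huge number of files:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge_parquet").getOrCreate()

    # Hypothetical input/output locations.
    df = spark.read.parquet("s3://bucket/small-files/")

    # coalesce(1) would write a single file but push all data through one task;
    # repartition(n) spreads the work and still collapses many small files into n.
    df.repartition(8).write.mode("overwrite").parquet("s3://bucket/merged/")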

Adding Partitioned Field to the File

2020-08-31 Thread Tzahi File
Hi, I'm using PySpark to write a df to S3 in parquet. I would like to add the partitioned columns to the file as well. What is the best way to do this? e.g. with df.write.partitionBy('day','hour') the desired file outcome is day,hour,time,name and not just time,name. Thanks! Tzahi
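
partitionBy() moves the partition columns into the directory structure and drops them from the parquet files themselves, so one common workaround, sketched here with illustrative duplicated column names, is to partition on copies and keep the originals in the data:

    from pyspark.sql import functions as F

    # Partition on duplicated columns; 'day' and 'hour' then remain inside each
    # file, while day_part/hour_part become the directory layout.
    (df.withColumn("day_part", F.col("day"))
       .withColumn("hour_part", F.col("hour"))
       .write.partitionBy("day_part", "hour_part")
       .parquet("s3://bucket/output/"))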

Re: Merging Parquet Files

2020-08-31 Thread Tzahi File
…al terabytes is bad. It depends on your use case, but you might also look at partitions etc. On 31.08.2020 at 16:17, Tzahi File wrote: > Hi, I would like to develop a process that merges parquet files. My first intention was to develop it with PySpa…

Spark Parquet file size

2020-11-10 Thread Tzahi File
Hi, We have many Spark jobs that create multiple small files. We would like to improve analysts' read performance, so I'm testing the optimal parquet file size. I've found that the optimal file size should be around 1 GB, and not less than 128 MB, depending on the size of the data. I took on…
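
One way to steer toward a target file size, sketched under the assumption that the average row size has been measured on a sample; maxRecordsPerFile caps rows per output file without an extra shuffle:

    # Illustrative numbers: aim for ~1 GB files given an estimated average row size.
    target_bytes = 1024 * 1024 * 1024
    avg_row_bytes = 500                     # assumption - measure on a sample
    rows_per_file = target_bytes // avg_row_bytes

    (df.write
       .option("maxRecordsPerFile", rows_per_file)
       .parquet("s3://bucket/optimized/"))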

Spark performance over S3

2021-04-06 Thread Tzahi File
Hi All, We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge nodes. The Spark job running on that cluster reads from an S3 bucket and writes to that bucket; the bucket and the EC2 instances run in the same region. As part of our efforts to reduce the runtime of our Spark jobs we found there's serious la…
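
The latency described above is often dominated by the S3A connection pool and the rename-based output commit; a hedged sketch of settings that are commonly tuned when using the S3A connector (values are illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3_job")
             # Larger connection pool for many concurrent S3 readers/writers.
             .config("spark.hadoop.fs.s3a.connection.maximum", "200")
             # Avoid the slowest rename-based commit path on S3.
             .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
             .getOrCreate())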

Re: Spark performance over S3

2021-04-07 Thread Tzahi File
…ndations from Cloudera <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html> for optimal use of S3A. Thanks, Hariharan. On Wed, Apr 7, 2021 at 12:15 AM Tzahi File wrote: > Hi All, …

Convert timestamp to unix milliseconds

2021-08-04 Thread Tzahi File
Hi All, I'm using Spark 2.4 and trying to convert a timestamp column to unix time with milliseconds using the unix_timestamp function. I tried casting the result to double: cast(unix_timestamp(timestamp) as double). I also tried using the timestamp format "yyyy-MM-dd HH:mm:ss.sss", and no matter what…
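
unix_timestamp() in Spark 2.4 only has second precision regardless of the format string; one workaround, sketched with an illustrative column name, is to cast the timestamp column itself to double (seconds with a fractional part) and scale to milliseconds:

    from pyspark.sql import functions as F

    # Casting a timestamp to double keeps the fractional seconds,
    # so multiplying by 1000 gives epoch milliseconds.
    df = df.withColumn(
        "unix_millis",
        (F.col("timestamp").cast("double") * 1000).cast("long"),
    )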

Performance Issue

2019-01-08 Thread Tzahi File
Hello, I have a performance issue running a SQL query on Spark. The query joins one partitioned parquet table (partitioned by date), where each partition is about 200 GB, with a simple table of about 100 records. The Spark cluster is of type m5.2xlarge - 8 cores. I'm using the Qubole interface for runnin…
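
With a ~100-record dimension table, a broadcast join usually helps here because it avoids shuffling the 200 GB partitions; a sketch with illustrative table and column names:

    from pyspark.sql import functions as F

    big = spark.table("events").where(F.col("date") == "2019-01-08")   # partition pruning
    small = spark.table("lookup")                                      # ~100 rows

    # Broadcasting the small side keeps the big table from being shuffled.
    joined = big.join(F.broadcast(small), on="key", how="left")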

Re: Performance Issue

2019-01-10 Thread Tzahi File
…Regards, Gourav Sengupta. On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote: > What is your performance issue? At 2019-01-08 22:09:24, "Tzahi File" wrote: > Hello, I have some…

Re: Performance Issue

2019-01-13 Thread Tzahi File
…a smaller data set by having joined on the device_id before as a subquery or separate query. And when you are writing the output of the JOIN between csv_file and raw_e, ORDER BY the output based on campaign_ID. Thanks and Regards, Gourav Sengupta. On Thu, …

Re: Performance Issue

2019-01-13 Thread Tzahi File
…ourav Sengupta wrote: Hi Tzahi, I think that Spark automatically broadcasts with the latest versions, but you might have to check with your version. Did you try filtering first and then doing the LEFT JOIN? Regards, Gourav Sengupta. On Sun, Jan 13…

[Spark SQL] failure in query

2019-08-25 Thread Tzahi File
Hi, I encountered an issue running a Spark SQL query and would be happy to get some advice. I'm trying to run a query on a very big data set (around 1.5 TB) and it fails in all of my tries. A template of the query is as below: insert overwrite table partition(part) select /*+ BROADCAST(c) */ …

Caching tables in spark

2019-08-28 Thread Tzahi File
Hi, Looking to tap your knowledge with a question. I have 2 different processes that read from the same raw data table (around 1.5 TB). Is there a way to read this data once, cache it somehow, and use it in both processes? Thanks -- Tzahi File, Data Engineer
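
Within a single Spark application the table can be cached once and reused by both computations; separate applications cannot share an in-memory cache, and the usual fallback there is to materialize an intermediate table that both of them read. A sketch with illustrative names:

    # Cache the shared input once; both downstream jobs reuse it.
    raw = spark.table("raw_data")
    raw.cache()

    result_a = raw.groupBy("country").count()        # "process 1" (illustrative)
    result_b = raw.filter("hour = 1").count()        # "process 2" (illustrative)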

Re: Caching tables in spark

2019-08-28 Thread Tzahi File
…Take a look at this article: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html From: Tzahi File; Sent: Wednesday, August 28, 2019 5:18 AM; To: user; Subject: Caching t…

Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Hi, Currently I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to Spark SQL. Any suggestions on how to use a percentile function in Spark? Thanks, -- Tzahi File, Data Engineer
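
Spark SQL has a built-in approximate percentile that usually scales much better than the exact aggregate; a sketch with illustrative table and column names:

    # percentile_approx() is the built-in approximate percentile in Spark SQL;
    # the exact percentile() aggregate also exists but is far more memory-hungry.
    spark.sql("""
        SELECT country,
               percentile_approx(latency_ms, 0.95) AS p95_latency
        FROM events
        GROUP BY country
    """).show()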

Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
…ntRank function on the data frame. Are you currently using a user-defined function for this task? Because I bet that's what's slowing you down. On Mon, Nov 11, 2019 at 9:46 AM Tzahi File wrote: > Hi, …

Issue With mod function in Spark SQL

2019-12-17 Thread Tzahi File
In my Spark SQL query I have a calculated field that gets the value of field1 % 3. I'm using this field as a partition, so I expected to get 3 partitions in that case, and I do. The issue happens with even numbers (4, 2, ... instead of 3). When I tried to use even numbers, for exampl…
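
If the partitioning here means DataFrame.repartition(n, col), the remainder values are hashed and taken modulo the shuffle partition count, so distinct remainders can collide into the same partition and leave others empty; a sketch (column name and counts are illustrative) for inspecting how rows actually land:

    from pyspark.sql import functions as F

    # Compute the mod bucket, repartition on it, and count rows per physical partition.
    parted = (df.withColumn("bucket", F.col("field1") % 4)
                .repartition(4, "bucket")
                .withColumn("pid", F.spark_partition_id()))
    parted.groupBy("pid").count().show()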

Re: Issue With mod function in Spark SQL

2019-12-17 Thread Tzahi File
No, there are 100M records, both even and odd. On Tue, Dec 17, 2019 at 8:13 PM Russell Spitzer wrote: > Is there a chance your data is all even or all odd? On Tue, Dec 17, 2019 at 11:01 AM Tzahi File wrote: > I have in my spark sql query a calculated fi…

Splitting resource in Spark cluster

2019-12-29 Thread Tzahi File
Hi All, I'm using one Spark cluster that contains 50 nodes of type i3.4xl (16 vCores). I'm trying to run 4 Spark SQL queries simultaneously. The data is split into 10 even partitions and the 4 queries run on the same data, but on different partitions. I have tried to configure the cluster so ea…
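
One common way to let several queries share one application's resources, sketched with assumed pool names and an illustrative query: enable the FAIR scheduler and run each query from its own thread, tagged with its own pool.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())

    # Each query thread tags itself with a scheduler pool (names are illustrative).
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "query_1")
    spark.sql("SELECT count(*) FROM events WHERE bucket = 1").show()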

Spark Adaptive configuration

2020-04-22 Thread Tzahi File
Hi, I saw that Spark has an option to adapt the join and shuffle configuration, for example "spark.sql.adaptive.shuffle.targetPostShuffleInputSize". I wanted to know if you have experience with such configuration and how it changed performance. Another question is whether, along Spark SQL quer…
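
A hedged sketch of the adaptive-execution settings the question refers to, using the Spark 2.x property names (Spark 3.x renamed several of them, e.g. to spark.sql.adaptive.advisoryPartitionSizeInBytes); the values are illustrative:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.adaptive.enabled", "true")
             # Target post-shuffle partition size, here 128 MB (illustrative).
             .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
                     str(128 * 1024 * 1024))
             .getOrCreate())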

Issue with pyspark query

2020-06-10 Thread Tzahi File
Hi, This is a general question regarding moving a Spark SQL query to PySpark; if needed I will add more from the error log and query syntax. I'm trying to move a Spark SQL query to run through PySpark. The query syntax and Spark configuration are the same. For some reason the query fails to r…

Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
Hi, I'm using pyspark to write df to s3, using the following command: "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". Is there any way to get the partitions created? e.g. day=2020-06-20/hour=1/country=US
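
One way to get the created partitions without scanning the data is to list the output directories after the write; the sketch below goes through Hadoop's FileSystem API via Spark's (private) JVM gateway, and the glob pattern mirrors the partition columns above:

    # List day=*/hour=*/country=* directories under the output path.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    pattern = jvm.org.apache.hadoop.fs.Path(s3_output + "/day=*/hour=*/country=*")
    fs = pattern.getFileSystem(conf)
    partitions = [status.getPath().toString() for status in fs.globStatus(pattern)]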

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
I don't want to query with a distinct on the partitioned columns; the df contains over 1 billion records. I just want to know the partitions that were created. On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke wrote: > By doing a select on the df? On 25.06.2020 at 14:52 …

Can't import confluent_kafka package

2024-11-19 Thread Tzahi File
Hello, I'm running PySpark on Dataproc and I tried to add a new Python package. I zipped the confluent_kafka package and added it to the spark-submit using --py-files. For some reason I keep getting the following error: ModuleNotFoundError: No module named 'confluent_kafka.cimpl'. When running l…
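
confluent_kafka contains a compiled C extension (cimpl), which Python cannot import from a zip shipped with --py-files; the approach suggested in the PySpark packaging docs is to ship a packed virtualenv instead. A sketch assuming Spark 3.1+ (older YARN setups use spark.yarn.dist.archives) and an archive built beforehand with venv-pack; the archive name and paths are assumptions:

    import os
    from pyspark.sql import SparkSession

    # Point the workers' Python at the unpacked environment and ship the archive
    # (built earlier with: pip install confluent_kafka venv-pack;
    #  venv-pack -o pyspark_venv.tar.gz).
    os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

    spark = (SparkSession.builder
             .config("spark.archives", "pyspark_venv.tar.gz#environment")
             .getOrCreate())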