Are you trying to run in the cloud?
On Mon, 3 Oct 2022, 21:55 Sachit Murarka, wrote:
> Hello,
>
> I am reading too many files in Spark 3.2 (Parquet). It is not giving any
> error in the logs, but after spark.read.parquet, it is not able to proceed
> further.
> Can anyone please suggest if there is
splitting of your data), in case those files are big.
>
> Enrico
>
>
> Am 14.09.22 um 21:57 schrieb Sid:
>
> Okay so you mean to say that parquet compresses the denormalized data
> using snappy so it won't affect the performance.
>
> Only using snappy will affe
Okay, so you mean to say that Parquet compresses the denormalized data using
Snappy, so it won't affect the performance.
Only using Snappy (on its own) will affect the performance.
Am I correct?
On Thu, 15 Sep 2022, 01:08 Amit Joshi, wrote:
> Hi Sid,
>
> Snappy itself is not splittable. But t
Try spark.driver.maxResultSize=0
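For reference, a minimal sketch (with a hypothetical app name) of setting that property when the session is built; 0 removes the cap on results collected to the driver, while a bounded value such as "4g" keeps a limit:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("skewed-join-job")                    # hypothetical
    .config("spark.driver.maxResultSize", "0")     # 0 = unlimited
    .getOrCreate()
)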
On Mon, 12 Sep 2022, 09:46 rajat kumar, wrote:
> Hello Users,
>
> My 2 tasks are running forever. One of them gave a java heap space error.
> I have 10 joins, and all tables are big. I understand this is data skewness.
> Apart from changes at the code level, any prop
like *part*.snappy.parquet*
So, when I read this data, will it affect my performance?
Please help me if there is any gap in my understanding.
Thanks,
Sid
Table A has two partitions A & B where data is hashed
based on hash value and sorted within partitions.
So my question is: how does it know which partition from Table
A has to be joined or searched with which partition from Table B?
TIA,
Sid
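A minimal sketch of the underlying idea, on synthetic data (not from this thread): both sides of a shuffle join are hash-partitioned on the join key into the same number of partitions, so equal keys land in the same partition number on both sides and only matching partition pairs need to be compared.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "4")

a = spark.range(0, 10).withColumnRenamed("id", "key")
b = spark.range(0, 10).withColumnRenamed("id", "key")

# After repartitioning both tables by the key, spark_partition_id() shows that
# the same key value maps to the same partition number in A and in B.
a.repartition(4, "key").withColumn("pid", F.spark_partition_id()).show()
b.repartition(4, "key").withColumn("pid", F.spark_partition_id()).show()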
Hi Everyone,
Thanks a lot for your answers. It helped me a lot to clear the concept :)
Best,
Sid
On Mon, Aug 1, 2022 at 12:17 AM Vinod KC wrote:
> Hi Sid,
> This example code with output will add some more clarity
>
> spark-shell --conf spark.sql.shuffle.partiti
s are segregated in different partitions
based on unique keys, they are not matching, because x_1/x_2 != x_8/x_9.
How do you ensure that the results are matched?
Best,
Sid
On Sun, Jul 31, 2022 at 1:34 AM Amit Joshi
wrote:
> Hi Sid,
>
> Salting is normally a technique to add random characte
some
understanding gap?
Could anyone help me in layman's terms?
TIA,
Sid
rations on RDDs. Just the
validation part.
Therefore, I wanted to understand: is there any better way to achieve this?
Thanks,
Sid
Yeah, I understood that now.
Thanks for the explanation, Bjorn.
Sid
On Wed, Jul 6, 2022 at 1:46 AM Bjørn Jørgensen
wrote:
> Ehh.. What is "*duplicate column*" ? I don't think Spark supports that.
>
> duplicate column = duplicate rows
>
>
> tir. 5. jul. 202
Hi Team,
I still need help in understanding how reading works exactly.
Thanks,
Sid
On Mon, Jun 20, 2022 at 2:23 PM Sid wrote:
> Hi Team,
>
> Can somebody help?
>
> Thanks,
> Sid
>
> On Sun, Jun 19, 2022 at 3:51 PM Sid wrote:
>
>> Hi,
>>
>> I alrea
>> in
>> reality spark does store intermediate results on disk, just in fewer places
>> than MR
>>
>> --- Original Message ---
>> On Saturday, July 2nd, 2022 at 5:27 PM, Sid
>> wrote:
>>
>> I have explained the same thing in a very layman
I have explained the same thing in layman's terms. Please go through it
once.
On Sat, 2 Jul 2022, 19:45 krexos, wrote:
>
> I think I understand where Spark saves IO.
>
> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>
> which writes results to disk at the end of each such "a
enough memory to store intermediate results) on average involves 3
times less disk I/O, i.e. only while reading the data and writing the final
data to disk.
Thanks,
Sid
On Sat, 2 Jul 2022, 17:58 krexos, wrote:
> Hello,
>
> One of the main "selling points" of Spark is tha
> > https://en.m.wikipedia.org/wiki/Serverless_computing
> >
> >
> >
> > søn. 26. jun. 2022, 10:26 skrev Sid :
> >
> > > Hi Team,
> > >
> > > I am developing a spark job in glue and have read that glue is
> serverless.
> > > I know t
uld be the join operation performed? On another worker node,
or is it performed on the driver side?
Could somebody please help me understand this by correcting my
points or just adding an explanation to them.
TIA,
Sid
managed by Glue and we don't have to worry about the
underlying infrastructure?
Please help me to understand in layman's terms.
Thanks,
Sid
on is called it is
submitted to the DAG scheduler and so on...
Thanks,
Sid
On Sat, Jun 25, 2022 at 12:34 PM Tufan Rakshit wrote:
> Please find the answers inline.
> 1) Can I apply predicate pushdown filters if I have data stored in S3, or
> should they be used only while reading from a DB?
s the data in the memory and then transfers to another partition".
Is it correct? If not, please correct me.
Please help me to understand these things in layman's terms if my
assumptions are not correct.
Thanks,
Sid
Where can I find information on the size of the datasets supported by AWS
Glue? I didn't see it in the documentation.
Also, if I want to process TBs of data, e.g. 1 TB, what should be the ideal
EMR cluster configuration?
Could you please guide me on this?
Thanks,
Sid.
On Thu, 23 Jun 2022,
Hi Team,
Could anyone help me with the below problem:
https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data
Thanks,
Sid
Thanks all for your answers. Much appreciated.
On Thu, Jun 23, 2022 at 6:07 AM Yong Walt wrote:
> We have many cases like this. it won't cause OOM.
>
> Thanks
>
> On Wed, Jun 22, 2022 at 8:28 PM Sid wrote:
>
>> I have a 150TB CSV file.
>>
>> I have a to
Hi Enrico,
Thanks for the insights.
Could you please help me understand, with one example of compressed files,
a case where the file wouldn't be split into partitions, would put the load
on a single partition, and might lead to an OOM error?
Thanks,
Sid
On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack
I have a 150TB CSV file.
I have a total of 100 TB of RAM and 100 TB of disk. So if I do something like this:
spark.read.option("header","true").csv(filepath).show(false)
Will it lead to an OOM error since it doesn't have enough memory, or will it
spill data onto the disk and process it?
Thanks,
Sid
Hi,
Thanks for your answers. Much appreciated
I know that we can cache the data frame in memory or on disk, but I want to
understand: when the data frame is loaded initially, where does it reside
by default?
Thanks,
Sid
On Wed, Jun 22, 2022 at 6:10 AM Yong Walt wrote:
> These are the ba
Hi Team,
I have a few doubts about the below questions:
1) Where does a data frame reside? In memory? On disk? How is memory allocated for a data
frame?
2) How do you configure each partition?
3) Is there any way to calculate the exact number of partitions needed to load a
specific file? (See the sketch after this message.)
Thanks,
Sid
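A minimal sketch (hypothetical path) relevant to questions 2 and 3: file-based reads are split into partitions of roughly spark.sql.files.maxPartitionBytes bytes each, and the resulting partition count can be inspected on the loaded data frame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Roughly how many bytes of input go into one partition (128 MB by default).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

df = spark.read.parquet("/path/to/some/dataset")   # hypothetical path
# Actual number of partitions Spark created for this read.
print(df.rdd.getNumPartitions())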
Hi Team,
Can somebody help?
Thanks,
Sid
On Sun, Jun 19, 2022 at 3:51 PM Sid wrote:
> Hi,
>
> I already have a partitioned JSON dataset in s3 like the below:
>
> edl_timestamp=2022090800
>
> Now, the problem is, in the earlier 10 days of data collection there was
stand how Spark reads the data.
Does it read the full dataset and filter on the basis of the last saved timestamp, or
does it filter only what is required? If the second case is true, then it
should have read the data, since the latest data is correct.
So just trying to understand. Could anyone help here?
Thanks,
Sid
est_batch")
>
>
> the above code should be able to then be run with a udf as long as we are
> able to control the parallelism with the help of executor count and task
> cpu configuration.
>
> But once again, this is just an unnecessary overkill.
>
>
> Regards,
> Gour
n the number of executors that you put in the job.
>
> But once again, this is kind of an overkill; for fetching data from an API,
> creating a simple Python program works quite well.
>
>
> Regards,
> Gourav
>
> On Mon, Jun 13, 2022 at 9:28 AM Sid wrote:
>
>>
Hi Gourav,
Do you have any examples or links, please? That would help me to understand.
Thanks,
Sid
On Mon, Jun 13, 2022 at 1:42 PM Gourav Sengupta
wrote:
> Hi,
> I think that serialising data using Spark is overkill; why not use
> normal Python?
>
> Also have you tried
result
}
# print("Customer Request Body::")
# print(json.dumps(custRequestBody))
response = call_to_cust_bulk_api(policyUrl, custRequestBody)
print(response)
finalDFStatus = finalDF.withColumn("edl_timestamp",
to_timestamp(lit(F.TimeNow()))).withColumn(
"status_for_each_batch",
lit(str(response)))
print("Max Value:::")
print(maxValue)
print("Next I:::")
i = rangeNum + 1
print(i)
This is my very first approach to hitting APIs with Spark. So, could
anyone please help me redesign the approach, or share some links or
references with which I can go into the depth of this and correct myself?
How can I scale this?
Any help is much appreciated.
TIA,
Sid
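One possible shape, sketched with entirely hypothetical names (endpoint, paths; error handling omitted): serialize each row to a JSON string and post from inside foreachPartition on the executors, instead of collecting to the driver.

import requests  # assumed to be installed on the executors
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/input")                    # hypothetical source

# One JSON string per row, built from all columns.
payloads = df.select(F.to_json(F.struct(*df.columns)).alias("payload"))

def post_partition(rows):
    # Runs on the executors, once per partition; URL and headers are assumptions.
    session = requests.Session()
    for row in rows:
        session.post(
            "https://example.com/bulk-api",
            data=row["payload"],
            headers={"Content-Type": "application/json"},
        )

payloads.foreachPartition(post_partition)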
Request Body::")
# print(json.dumps(custRequestBody))
response = call_to_cust_bulk_api(policyUrl, custRequestBody)
print(response)
finalDFStatus = finalDF.withColumn("edl_timestamp",
to_timestamp(lit(F.TimeNow(
Hi Enrico,
Thanks for your time. Much appreciated.
I am expecting the payload to be a JSON string per record, like below:
{"A":"some_value","B":"some_value"}
Where A and B are the columns in my dataset.
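A minimal, self-contained sketch (sample values invented) of producing exactly that shape with to_json over a struct of the two columns:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("some_value", "some_value")], ["A", "B"])

df.select(F.to_json(F.struct("A", "B")).alias("payload")).show(truncate=False)
# Each row of "payload" is a string like {"A":"some_value","B":"some_value"}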
On Fri, Jun 10, 2022 at 6:09 PM Enrico Mi
Hi Stelios,
Thank you so much for your help.
If I use lit, it gives a "column is not iterable" error.
Can you suggest a simple way of achieving my use case? I need to send the
entire column, record by record, to the API in JSON format.
TIA,
Sid
On Fri, Jun 10, 2022 at 2:51 PM Stelios Philippou
rtition(finalDF.rdd.getNumPartitions()).withColumn("status_for_batch
>>
>> To
>>
>> finalDF.repartition(finalDF.rdd.getNumPartitions()).withColumn(col("status_for_batch")
>>
>>
>>
>>
>> On Thu, 9 Jun 2022, 22:32 Sid, wrote:
>
Hi Experts,
I am facing one problem while passing a column to the method. The problem
is described in detail here:
https://stackoverflow.com/questions/72565095/how-to-pass-columns-as-a-json-record-to-the-api-method-using-pyspark
TIA,
Sid
on, Jun 6, 2022 at 4:07 PM Sid wrote:
>
>> Hi experts,
>>
>>
>> When we load any file, I know that based on the information in the spark
>> session about the executors' location, status, etc., the data is
>> distributed among the worker nodes and executors.
or is it directly distributed amongst the workers?
Thanks,
Sid
Yeah, Stelios. It worked. Could you please post it as an answer so that I
can accept it on the post and can be of help to people?
Thanks,
Sid
On Mon, May 30, 2022 at 4:42 PM Stelios Philippou
wrote:
> Sid,
>
> According to the error that i am seeing there, this is the Date Forma
Best Regards,
Sid
Hi Team,
I need help with the below problem:
https://stackoverflow.com/questions/72422872/unable-to-format-double-values-in-pyspark?noredirect=1#comment127940175_72422872
What am I doing wrong?
Thanks,
Siddhesh
orrupt record?
Or is there any way to process such kinds of records?
Thanks,
Sid
On Thu, May 26, 2022 at 10:54 PM Gourav Sengupta
wrote:
> Hi,
> can you please give us a simple map of what the input is and what the
> output should be like? From your description it looks a bit difficult
e/spark/blob/8f610d1b4ce532705c528f3c085b0289b2b17a94/python/pyspark/pandas/namespace.py#L216
> probably have to be updated with the default options. This is so that
> pandas API on spark will be like pandas.
>
> tor. 26. mai 2022 kl. 17:38 skrev Sid :
>
>> I was passing the wrong escape cha
this
approach :) Fingers crossed.
Thanks,
Sid
On Thu, May 26, 2022 at 8:43 PM Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:
> Since you cannot create the DF directly, you may try to first create an
> RDD of tuples from the file
>
> and then convert the RDD to
Thanks for opening the issue, Bjorn. However, could you help me to address
the problem for now with some kind of alternative?
I have actually been stuck on this since yesterday.
Thanks,
Sid
On Thu, 26 May 2022, 18:48 Bjørn Jørgensen,
wrote:
> Yes, it looks like a bug that we also have in pandas
Hello Everyone,
I have posted a question finally with the dataset and the column names.
PFB link:
https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
Thanks,
Sid
On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen
wrote:
> Sid, dump one of yours files.
>
&
I have 10 columns, but in the dataset I observed that some records
have 11 columns of data (the additional column is marked as null).
How do I handle this?
Thanks,
Sid
On Thu, May 26, 2022 at 2:22 AM Sid wrote:
> How can I do that? Any examples or links, please. So, t
Therefore, I was wondering if Spark could handle this in a much better way.
Thanks,
Sid
On Thu, May 26, 2022 at 2:19 AM Gavin Ray wrote:
> Forgot to reply-all last message, whoops. Not very good at email.
>
> You need to normalize the CSV with a parser that can escape commas inside
>
", "true").option("multiline",
"true").option("inferSchema", "true").option("quote",
'"').option("delimiter", ",").csv("path")
What else can I do?
Thanks,
records in total. These are the sample records.
Should I load it as an RDD and then maybe eliminate the
new lines using a regex? Or how should it be done? With ". \n"?
Any suggestions?
Thanks,
Sid
counts are not matching? Also what could be the possible reason for
that simple count error?
Environment:
AWS GLUE 1.X
10 workers
Spark 2.4.3
Thanks,
Sid
Yes,
It created a list of records separated by commas, and it was created faster as
well.
On Wed, 27 Apr 2022, 13:42 Gourav Sengupta,
wrote:
> Hi,
> did that result in valid JSON in the output file?
>
> Regards,
> Gourav Sengupta
>
> On Tue, Apr 26, 2022 at 8:18 PM Sid wr
I have .txt files with JSON inside them. They are generated by some API calls by
the client.
On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen
wrote:
> What is that you have? Is it txt files or json files?
> Or do you have txt files with JSON inside?
>
>
>
> tir. 26. apr. 2022 k
Thanks for your time, everyone :)
Much appreciated.
I solved it using the jq utility, since I was dealing with JSON, with the
below script:
find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt
Thanks,
Sid
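For completeness, a Spark-side sketch of the same merge (path pattern assumed): the JSON reader does not care about the .txt extension, and writing with fewer partitions collapses the many small files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("/path/to/small-files/*.txt")   # JSON content in .txt files
df.coalesce(1).write.mode("overwrite").json("/path/to/merged-output")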
On Tue, Apr 26, 2022 at 9:37 PM B
Hello,
Can somebody help me with the below problem?
https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark
Thanks,
Sid
df1 = df.count()
3) df.persist()
4) df.repartition().write.mode().parquet("")
So please help me understand how the ordering should be exactly, and why, if I am
not correct. (See the sketch after this message.)
Thanks,
Sid
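One common ordering, sketched with hypothetical paths and partition count: persist before the first action, so the count and the write both reuse the cached data instead of re-reading the source.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/input")            # hypothetical source
df.persist(StorageLevel.MEMORY_AND_DISK)             # mark for caching
row_count = df.count()                               # first action materializes the cache
df.repartition(200).write.mode("overwrite").parquet("/path/to/output")
df.unpersist()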
Hi Team,
I need help with the below problem:
https://stackoverflow.com/questions/71613292/how-to-use-columns-in-if-else-condition-in-pyspark
Thanks,
Sid
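The usual PySpark pattern for a column-level if/else, sketched here with invented column names, is when/otherwise rather than Python if statements:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 200)], ["id", "amount"])

df.withColumn(
    "bucket",
    F.when(F.col("amount") > 100, F.lit("high")).otherwise(F.lit("low")),
).show()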
the intended comparison is then; these are not equivalent
> ways of doing the same thing, or does not seem so as far as I can see.
>
> On Sun, Feb 27, 2022 at 12:30 PM Sid wrote:
>
>> My bad.
>>
>> Aggregation Query:
>>
>> # Write your MySQL query statement
Hi Enrico,
Thanks for your time :)
Consider a huge data volume scenario: if I don't use any keywords like
distinct, which one would be faster, a window with partitionBy or normal SQL
aggregation methods? And how does df.groupBy().reduceByGroups() work
internally?
Thanks,
Sid
On Mon, F
k
from Department d join Employee e on e.departmentId=d.id ) a where rnk<=3
Time Taken: 790 ms
Thanks,
Sid
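For comparison, a window-function sketch of the same top-3-per-department result; departmentId and id come from the quoted query, while the salary column and the registered table names are assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
employee = spark.table("Employee")       # assumed registered tables
department = spark.table("Department")

w = Window.partitionBy("departmentId").orderBy(F.col("salary").desc())

top3 = (
    employee.join(department, employee["departmentId"] == department["id"])
            .withColumn("rnk", F.dense_rank().over(w))
            .where(F.col("rnk") <= 3)
)
top3.show()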
On Sun, Feb 27, 2022 at 11:35 PM Sean Owen wrote:
> Those two queries are identical?
>
> On Sun, Feb 27, 2022 at 11:30 AM Sid wrote:
>
>> Hi Team,
>>
>> I am awa
re.
Please correct me if I am wrong.
TIA,
Sid
orted out, like for string columns. In this case, are the string columns
sorted? My question might be a duplicate, but I want to understand what
happens when there is non-numeric data as the joining key.
Thanks,
Sid
On Thu, 24 Feb 2022, 02:45 Mich Talebzadeh,
wrote:
> Yes correct beca
> On Wed, 23 Feb 2022 at 20:07, Sid wrote:
>
>> Hi Mich,
>>
>> Thanks for the link. I will go through it. I have two
en in this case what can I do? Suppose I have a "Department"
column and need to join with the other table based on "Department". So, can
I sort the strings as well? What exactly is meant by non-sortable keys?
Thanks,
Sid
On Wed, Feb 23, 2022 at 11:46 PM Mich Talebzadeh
wrote
Okay. So what should I do if I get such data?
On Wed, Feb 23, 2022 at 11:59 PM Sean Owen wrote:
> There is no record "345" here it seems, right? it's not that it exists and
> has null fields; it's invalid w.r.t. the schema that the rest suggests.
>
> On Wed, Feb
Hi,
Can you help me with my doubts? Any links would also be helpful.
Thanks,
Sid
On Wed, Feb 23, 2022 at 1:22 AM Sid Kal wrote:
> Hi Mich / Gourav,
>
> Thanks for your time :) Much appreciated. I went through the article
> shared by Mich about the query execution plan. I
},
"Party3": {
"FIRSTNAMEBEN": "ABCDDE",
"ALIASBEN": "",
"RELATIONSHIPTYPE": "ABC, FGHIJK LMN",
"DATEOFBIRTH": "7/Oct/1969"
}
},
"GeneratedTime": "2022-01-30 03:09:26"
},
{
"345": {
},
"GeneratedTime": "2022-01-30 03:09:26"
}
]
However, when I try to display this JSON using the below code, it doesn't show
the blank records. In my case I don't get any records for 345, since it is
null, but I want to display it in the final flattened dataset.
val df = spark.read.option("multiline",
true).json("/home/siddhesh/Documents/nested_json.json")
Spark version:3.1.1
Thanks,
Sid
n Wed, Feb 23, 2022 at 8:28 PM Bjørn Jørgensen
wrote:
> You will. The pandas API on Spark, imported with `from pyspark import
> pandas as ps`, is not pandas but an API that uses PySpark underneath.
>
> ons. 23. feb. 2022 kl. 15:54 skrev Sid :
>
>> Hi Bjørn,
>>
>> Tha
Hi Bjørn,
Thanks for your reply. This doesn't help while loading huge datasets; I won't
be able to get Spark's functionality of loading the file in a
distributed manner.
Thanks,
Sid
On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen
wrote:
> from pyspark import pandas as ps
>
&
Hi Gourav,
Thanks for your time.
I am worried about the distribution of data in case of a huge dataset file.
Is Koalas still a better option to go ahead with? If yes, how can I use it
with Glue ETL jobs? Do I have to pass some kind of external jars for it?
Thanks,
Sid
On Wed, Feb 23, 2022 at 7
("/home/.../Documents/test_excel.xlsx")
It is giving me the below error message:
java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
I tried several JARs for this error but had no luck. Also, what would be the
most efficient way to load it?
Thanks,
Sid
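A sketch of one common route, assuming the third-party spark-excel package is added (e.g. --packages com.crealytics:spark-excel_2.12:<version>); option names differ between spark-excel releases, so treat them as assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")                        # older releases use "useHeader"
    .load("/home/.../Documents/test_excel.xlsx")
)
df.show()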
her you are using SPARK Dataframes or SPARK SQL,
> and the settings in SPARK (look at the settings for SPARK 3.x) and a few
> other aspects you will see that the plan is quite cryptic and difficult to
> read sometimes.
>
> Regards,
> Gourav Sengupta
>
> On Sun, Feb 20, 2022 at
Thank you so much for your reply, Mich. I will go through it. However, I
want to understand how to read this plan if I face any errors, or if I want
to look at how Spark is cost optimizing, and how we should approach it.
Could you help me in layman's terms?
Thanks,
Sid
On Sun, 20 Feb 2022, 17:50 Mich
is happening and what needs to be modified on the query side? Also,
internally, Spark by default uses sort-merge join, as I can see from
the plan, but when does it opt for Sort-Merge Join and when does it opt
for Shuffle-Hash Join?
Thanks,
Sid
> On Thu, 27 Jan 2022 at 18:46, Sid Kal
Okay, sounds good.
So, the below two options would help me to capture CDC changes:
1) Delta Lake (see the merge sketch after this message)
2) Maintaining a snapshot of records with some indicators and a timestamp.
Correct me if I'm wrong.
Thanks,
Sid
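A minimal Delta Lake merge sketch for option 1 (paths and the join key are hypothetical; assumes the delta-spark package is installed and the session has the Delta extensions enabled):

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = spark.read.parquet("s3://bucket/dms-changes/")             # CDC feed
target = DeltaTable.forPath(spark, "s3://bucket/curated/customers")  # existing Delta table

(
    target.alias("t")
    .merge(changes.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)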
On Thu, 27 Jan 2022, 23:59 Mich Talebzadeh,
wrote:
> There are two ways of
>> On Thu, 27 Jan 20
Hi Mich,
Thanks for your time.
Data is stored in S3 via DMS and is read in the Spark jobs.
How can I mark a record as a soft delete?
Any small snippet / link / example would help.
Thanks,
Sid
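A minimal sketch of one way to do it (column names are assumptions; AWS DMS CDC output typically carries an operation indicator column): keep deleted rows and flag them instead of removing them.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cdc = spark.read.parquet("s3://bucket/dms-output/")               # hypothetical path

# "Op" == "D" marks a delete in DMS CDC output; flag and timestamp names are assumptions.
marked = (
    cdc.withColumn("is_deleted", F.col("Op") == F.lit("D"))
       .withColumn("updated_at", F.current_timestamp())
)
marked.write.mode("append").parquet("s3://bucket/curated/")       # hypothetical target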
On Thu, 27 Jan 2022, 22:26 Mich Talebzadeh,
wrote:
> Where ETL data is stored?
>
>
,
Sid
I have implemented the advice and now the issue is resolved.
r
Spark assembly has been built with Hive, including Datanucleus jars on
classpath Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties 14/10/25 05:02:39 WARN Utils:
Your hostname, Sid resolves to a loopback address: 127.0.1.1; using
192.168.0.15 instead (on interfac