I've tried that too - it doesn't work. It does do a repartition, but not right
after the broadcast join - it does a lot more processing first and only
repartitions right before my next sort-merge join (stage 12 I described
above).
As the heavy processing happens before the sort-merge join, it still doesn't help.
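One trick that sometimes works (a rough sketch, not tested against your job; df_big, df_small, the join key "key" and the partition count are made-up placeholders) is to put a cache() and count() right after the repartition, so the shuffle is materialized at that point instead of being pushed down to the next sort-merge join:

from pyspark.sql.functions import broadcast

joined = df_big.join(broadcast(df_small), "key")   # broadcast join
joined = joined.repartition(200)                   # spread the work over more tasks
joined.cache()                                     # materialization barrier: forces the
joined.count()                                     # repartition to happen right here
# ... heavy per-record processing on `joined` follows ...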
Abdeali Kothari wrote
> My entire CSV is less than 20KB.
> Somewhere in between, I do a broadcast join with 3500 records in another
> file.
> After the broadcast join I have a lot of processing to do. Overall, the
> time to process a single record goes up to 5 mins on 1 executor.
>
> I'm trying
My entire CSV is less than 20KB.
Somewhere in between, I do a broadcast join with 3500 records in another
file.
After the broadcast join I have a lot of processing to do. Overall, the
time to process a single record goes up to 5 mins on 1 executor.
I'm trying to increase the partitions of my dataframe.
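A rough sketch of that setup (the file names, the join key "id", and the partition count are placeholders):

from pyspark.sql.functions import broadcast

records = spark.read.csv("records.csv", header=True)   # ~500 records, < 20KB
lookup  = spark.read.csv("lookup.csv", header=True)    # ~3500 records
joined  = records.join(broadcast(lookup), "id")         # broadcast the small side
joined  = joined.repartition(100)                       # try to spread the expensive rows over more tasks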
Abdeali Kothari wrote
> I am using Spark 2.3.0 and trying to read a CSV file which has 500
> records.
> When I try to read it, Spark says that it has two stages: 10 and 11, which
> then join into stage 12.
What's your CSV size per file? I think the Spark optimizer may put many files
into one task when the files are small.
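A quick way to check what the read actually produced, plus the knob that controls how much file data is packed into one task (a sketch with a placeholder path; for a single tiny file, repartition() after the read is the more direct control):

df = spark.read.csv("records.csv", header=True)    # placeholder path
print(df.rdd.getNumPartitions())                   # how many tasks the read actually produced

# lower the packing threshold so small files are split across more read tasks
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024))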
Daniel Haviv wrote
> Hi,
> I'm trying to debug an issue with Spark so I've set log level to DEBUG but
> at the same time I'd like to avoid the httpclient.wire's verbose output by
> setting it to WARN.
>
> I tried the following log4j.properties config but I'm still getting DEBUG
> outputs for httpclient.wire.
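For what it's worth, something along these lines in log4j.properties is usually enough (a sketch: the "console" appender is assumed to be defined elsewhere in the file, and the right logger name depends on the HttpClient version, so both common names are shown):

log4j.rootCategory=DEBUG, console
# keep everything at DEBUG but quiet the wire-level HTTP logging
log4j.logger.httpclient.wire=WARN
log4j.logger.org.apache.http.wire=WARN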
Hi Dimitri,
You can do the following:
1. Create an initial DataFrame from an empty CSV.
2. Use "union" to insert new rows (rough sketch below).
Do not forget that Spark cannot replace a DBMS. Spark is mainly used
for analytics.
If you need select/insert/delete/update capabilities, perhaps you should
look at a proper database instead.
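A rough sketch of those two steps in PySpark (here with an explicit schema instead of an empty CSV, and made-up example rows; with an older SQLContext, sqlContext.createDataFrame works the same way):

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Column1", StringType(), True),
    StructField("Column2", StringType(), True),
])
df = spark.createDataFrame([], schema)              # empty dataframe with the two columns

for a, b in [("x", "1"), ("y", "2")]:               # rows produced inside your loop
    df = df.union(spark.createDataFrame([(a, b)], schema))

Note that union inside a long loop builds a deep plan; collecting the rows into a Python list first and calling createDataFrame once is usually cheaper.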
This is one quick and dirty way to do it:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/321083416305398/3601578643761083/latest.html
Cheers
Jules
Sent from my iPhone
Pardon the dumb thumb typos :)
> On Jun 30, 2018, at 7:46 AM, d
I am new to PySpark and want to initialize a new empty dataframe with
sqlContext with two columns ("Column1", "Column2"), and I want to append
rows dynamically in a for loop.
Is there any way to achieve this?
Thank you in advance.