Re: Repartition not working on a csv file

2018-06-30 Thread Abdeali Kothari
I've tried that too, and it doesn't work. It does a repartition, but not right after the broadcast join: it does a lot more processing first and repartitions right before my next sort-merge join (stage 12, which I described above). As the heavy processing happens before the sort-merge join, it still doesn't help.

Re: Repartition not working on a csv file

2018-06-30 Thread yujhe.li
Abdeali Kothari wrote:
> My entire CSV is less than 20 KB. Somewhere in between, I do a broadcast
> join with 3500 records in another file. After the broadcast join I have a
> lot of processing to do. Overall, the time to process a single record goes
> up to 5 minutes on 1 executor.
>
> I'm trying …

Re: Repartition not working on a csv file

2018-06-30 Thread Abdeali Kothari
My entire CSV is less than 20 KB. Somewhere in between, I do a broadcast join with 3500 records in another file. After the broadcast join I have a lot of processing to do. Overall, the time to process a single record goes up to 5 minutes on 1 executor. I'm trying to increase the partitions that my dataframe …

Re: Repartition not working on a csv file

2018-06-30 Thread yujhe.li
Abdeali Kothari wrote:
> I am using Spark 2.3.0 and trying to read a CSV file which has 500 records.
> When I try to read it, Spark says that it has two stages: 10 and 11, which
> then join into stage 12.

What's your CSV size per file? I think the Spark optimizer may put many files into one task when …

Re: Setting log level to DEBUG while keeping httpclient.wire on WARN

2018-06-30 Thread yujhe.li
Daniel Haviv wrote:
> Hi,
> I'm trying to debug an issue with Spark, so I've set the log level to DEBUG,
> but at the same time I'd like to avoid httpclient.wire's verbose output by
> setting it to WARN.
>
> I tried the following log4j.properties config but I'm still getting DEBUG
> output for httpclient.wire …
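A log4j.properties sketch of the intended setup, using the log4j 1.x syntax that Spark's bundled `conf/log4j.properties` template follows (the `console` appender name matches that template; the exact wire-logger names depend on which HTTP client version is on the classpath):

```properties
# Root logger at DEBUG; everything inherits this unless overridden.
log4j.rootCategory=DEBUG, console

# Quiet the wire-level HTTP loggers specifically.
log4j.logger.httpclient.wire=WARN
log4j.logger.org.apache.http.wire=WARN
log4j.logger.org.apache.http.headers=WARN
```

If DEBUG output still appears, a common cause is a second log4j.properties earlier on the classpath shadowing the edited one.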

Re: Create an Empty dataframe

2018-06-30 Thread Apostolos N. Papadopoulos
Hi Dimitri, you can do the following:

1. Create an initial dataframe from an empty CSV.
2. Use "union" to insert new rows.

Do not forget that Spark cannot replace a DBMS. Spark is mainly used for analytics. If you need select/insert/delete/update capabilities, perhaps you should look at a …

Re: Create an Empty dataframe

2018-06-30 Thread Jules Damji
This is one quick-and-dirty way to do it: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/321083416305398/3601578643761083/latest.html Cheers, Jules

Create an Empty dataframe

2018-06-30 Thread dimitris plakas
I am new to PySpark and want to initialize a new empty dataframe with sqlContext() with two columns ("Column1", "Column2"), and I want to append rows dynamically in a for loop. Is there any way to achieve this? Thank you in advance.