Re: spark df.write.partitionBy run very slow

2019-03-14 Thread JF Chen
But now I have another question: how can I determine which data node the Spark task is writing to? It's really important for digging into the problem. Regards, Junfeng Chen On Thu, Mar 14, 2019 at 2:26 PM Shyam P wrote: > cool. > > On Tue, Mar 12, 2019 at 9:08 AM JF Chen wrote: > >> Hi >> Finally I fo…
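One practical way to answer that question (a minimal sketch, assuming the job's DataFrame is named df; not from the thread itself): log the host inside each task with TaskContext. The Spark UI (Stages, then the task list) shows the same task-to-host mapping, and with HDFS the first replica of each block is normally written to the DataNode local to the writing executor.

    import java.net.InetAddress
    import org.apache.spark.TaskContext

    // Probe: print which host each partition's task runs on just before the
    // write; the actual write happens on these same executors.
    df.rdd.foreachPartition { _ =>
      val ctx = TaskContext.get()
      println(s"partition ${ctx.partitionId()} runs on ${InetAddress.getLocalHost.getHostName}")
    }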

Re: Structured Streaming & Query Planning

2019-03-14 Thread Alessandro Solimando
Hello Paolo, generally speaking, query planning is mostly based on statistics and the distribution of values in the involved columns, which can change significantly over time in a streaming context, so to me it makes a lot of sense that it is run at every schedule, even though I understand yo…
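For measuring this in practice: Structured Streaming already reports the planning cost per trigger through the progress events; a minimal sketch, assuming a running StreamingQuery named query:

    // durationMs is a java.util.Map[String, java.lang.Long] with keys such as
    // "queryPlanning" and "triggerExecution" (lastProgress is null before the
    // first completed trigger).
    val p = query.lastProgress
    println(s"queryPlanning: ${p.durationMs.get("queryPlanning")} ms " +
      s"of ${p.durationMs.get("triggerExecution")} ms trigger")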

How does Spark operate internally for an individual task?

2019-03-14 Thread swastik mittal
I am running a grep application on Spark 2.3.4 with Scala 2.11. I have an 813 MB input text file stored on a remote source (not part of the Spark infrastructure) accessed via HDFS. My application just reads the text file line by line from the HDFS server, filters each line for a given keyword, and o…
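For reference, a minimal sketch of a job like the one described (the HDFS path and keyword below are invented for illustration):

    import org.apache.spark.sql.SparkSession

    // Read a remote text file line by line and keep lines containing a keyword.
    val spark = SparkSession.builder().appName("grep").getOrCreate()
    val keyword = "ERROR" // hypothetical keyword
    val lines = spark.read.textFile("hdfs://remote-host:9000/data/input.txt")
    println(s"matched ${lines.filter(_.contains(keyword)).count()} lines")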

Re: Yarn job is Stuck

2019-03-14 Thread swastik mittal
It is possible that the Application Master is not getting started. Try increasing the memory limit for the Application Master in yarn-site.xml, or in capacity-scheduler.xml if you have it configured.
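For reference, these are the yarn-site.xml properties that most commonly cap container sizes, including the Application Master's (the values below are illustrative only; size them to your nodes):

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4096</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>4096</value>
    </property>

On the Spark side, the AM size is controlled by spark.yarn.am.memory in client mode; in cluster mode the driver runs inside the AM and is sized by spark.driver.memory.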

Yarn job is Stuck

2019-03-14 Thread dimitris plakas
Hello everyone, I have set up a 3-node Hadoop cluster according to this tutorial: https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/#run-yarn and I ran the YARN example described in that tutorial (the one with the books) in order to test whether everything w…

Re: Multiple context in one Driver

2019-03-14 Thread Marcelo Vanzin
It doesn't work (except if you're extremely lucky); it will eat your lunch and will also kick your dog. And it's not even going to be an option in the next version of Spark. On Wed, Mar 13, 2019 at 11:38 PM Ido Friedman wrote: > > Hi, > > I am researching the use of multiple SparkContexts in one…
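For reference, the supported pattern is one SparkContext per JVM, with SparkSession.newSession() providing isolated sessions on top of it; a minimal sketch:

    import org.apache.spark.sql.SparkSession

    // One shared SparkContext; each newSession() gets its own temp views,
    // SQL conf and UDF registry, but shares the context and cached data.
    val base = SparkSession.builder().appName("shared").getOrCreate()
    val s1 = base.newSession()
    val s2 = base.newSession()

    s1.range(5).createOrReplaceTempView("t")
    println(s2.catalog.tableExists("t")) // false: temp views are per-session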

Structured Streaming & Query Planning

2019-03-14 Thread Paolo Platter
Hi All, I would like to understand why, in a streaming query (which should not be able to change its behaviour across iterations), there is a queryPlanning-Duration cost (in my case 33% of the trigger interval) at every schedule. I don't understand why this is needed and whether it is possible to d…

Re: Windowing LAG function usage in Spark 2.2 Dataset Scala

2019-03-14 Thread Magnus Nilsson
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val partitionBy = Window.partitionBy("name", "sit").orderBy("data_date")
val newDf = df.withColumn("PreviousDate", lag("uniq_im", 1).over(partitionBy))

Cheers... On Thu, Mar 14, 2019 at 4:55 AM anbu wrote: > Hi, > > To calculate LAG functions dif…
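A self-contained version of the same snippet, with invented sample data so the result can be checked locally (only the column names come from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val spark = SparkSession.builder().appName("lag-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", "s1", "2019-03-01", 10),
      ("a", "s1", "2019-03-02", 20),
      ("a", "s1", "2019-03-03", 30)
    ).toDF("name", "sit", "data_date", "uniq_im")

    val w = Window.partitionBy("name", "sit").orderBy("data_date")
    df.withColumn("PreviousDate", lag("uniq_im", 1).over(w)).show()
    // PreviousDate is null on the first row of each partition,
    // then carries the previous row's uniq_im.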