SparkSQL read Hive transactional table

2018-10-15 Thread daily
Hi, I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then I use SparkSQL to read data from that table. I get the right result. However, SparkSQL takes more time to read a Hive ORC bucketed transactional table, because SparkSQL rea…
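A minimal sketch of the read path being described, assuming a reachable Hive metastore and a transactional table named `db.acid_table` (hypothetical name):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes a Hive metastore is configured and the ACID table
// `db.acid_table` (hypothetical name) was written via the Streaming Mutation API.
val spark = SparkSession.builder()
  .appName("read-hive-acid")
  .enableHiveSupport()
  .getOrCreate()

// SparkSQL reads the table like any other Hive table; for an ORC bucketed
// transactional table it must also reconcile base and delta files,
// which is one plausible source of the extra read time observed.
val df = spark.sql("SELECT * FROM db.acid_table")
df.show()
```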

Re: Timestamp Difference/operations

2018-10-15 Thread Srabasti Banerjee
Hi Paras, Check out the link: Spark Scala: DateDiff of two columns by hour or minute. I have two timestamp columns in a dataframe that I'd like to get the minute difference of, or alternatively…
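For reference, the minute/hour difference can be computed with plain `java.time`, which is the same arithmetic Spark applies when you subtract `unix_timestamp` values; the two timestamps below are made-up illustrations:

```scala
import java.time.{Duration, LocalDateTime}

// Hypothetical sample timestamps
val start = LocalDateTime.of(2018, 10, 15, 9, 30, 0)
val stop  = LocalDateTime.of(2018, 10, 15, 12, 45, 0)

val diff    = Duration.between(start, stop)
val minutes = diff.toMinutes // 195
val hours   = diff.toHours   // 3 (truncated, not rounded)

// In Spark SQL the equivalent column expression would be along the lines of:
//   (unix_timestamp(col("stop")) - unix_timestamp(col("start"))) / 60
```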

Re: overcommit: cpus / vcores

2018-10-15 Thread Peter Liu
Hi Khaled, I have attached the Spark Streaming config below in (a). In the case of the 100-vcore run (see the initial email), I used 50 executors, where each executor has 2 vcores and 3g of memory. For the 70-vcore case, 35 executors; for the 80-vcore case, 40 executors. In the YARN config (yarn-site.xml, (b) bel…
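The vcore arithmetic above works out as executors × cores-per-executor (50 × 2 = 100, 35 × 2 = 70, 40 × 2 = 80). The knob that caps, or overcommits, the vcore pool on each node lives in yarn-site.xml; a hedged sketch with an illustrative value:

```xml
<!-- yarn-site.xml: advertise more vcores than the 80 physical ones
     to allow overcommit (the value below is illustrative) -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>100</value>
</property>
```

With, e.g., `--num-executors 50 --executor-cores 2 --executor-memory 3g`, the job then requests the full 100 advertised vcores.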

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-15 Thread Imran Rashid
I just discovered https://issues.apache.org/jira/browse/SPARK-25738 with some more testing. I only marked it as critical, but it seems pretty bad -- I'll defer to others' opinions. On Sat, Oct 13, 2018 at 4:15 PM Dongjoon Hyun wrote: > Yes. From my side, it's -1 for RC3. > > Bests, > Dongjoon. > > On…

Re: overcommit: cpus / vcores

2018-10-15 Thread Khaled Zaouk
Hi Peter, What parameters are you putting in your Spark Streaming configuration? What are you setting as the number of executor instances, and how many cores per executor are you setting in your Spark job? Best, Khaled On Mon, Oct 15, 2018 at 9:18 PM Peter Liu wrote: > Hi there, > > I have a syste…

re: overcommit: cpus / vcores

2018-10-15 Thread Peter Liu
Hi there, I have a system with 80 vcores and a relatively light Spark Streaming workload. Overcommitting the vcore resource (i.e. > 80) in the config (see (a) below) seems to help improve the average Spark batch time (see (b) below). Is there any best-practice guideline on resource overcommit wit…

Re: [Events] Events not fired for SaveAsTextFile (?)

2018-10-15 Thread Bolke de Bruin
Hi Fokko, Spark fires it off for many other things. It does so for ML pipelines, and it does make the information available for data frames. We use S3 in this case; I just simplified the example. It is important to know what process took what action. Only Spark knows this, and it does supply this informa…

Re: [Events] Events not fired for SaveAsTextFile (?)

2018-10-15 Thread Driesprong, Fokko
Hi Bolke, I would argue that Spark is not the right level of abstraction for doing this. I would create a wrapper around the particular filesystem: http://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html That way you can write a wrapper around the LocalFileSystem if data will…
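The wrapper idea can be sketched with Hadoop's `FilterFileSystem`, which decorates an existing `FileSystem` and delegates everything by default; the lineage sink here is a hypothetical placeholder (`println`):

```scala
import org.apache.hadoop.fs.{FileSystem, FilterFileSystem, Path}

// Sketch only: decorate an existing FileSystem and record lineage-relevant
// actions before delegating. Registering the wrapper (e.g. via the
// fs.<scheme>.impl Hadoop config key) is left out for brevity.
class LineageFileSystem(underlying: FileSystem) extends FilterFileSystem(underlying) {

  override def rename(src: Path, dst: Path): Boolean = {
    println(s"lineage: rename $src -> $dst") // replace with your lineage sink
    super.rename(src, dst)
  }

  override def delete(f: Path, recursive: Boolean): Boolean = {
    println(s"lineage: delete $f")
    super.delete(f, recursive)
  }
}
```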

[Events] Events not fired for SaveAsTextFile (?)

2018-10-15 Thread Bolke de Bruin
Hi, Apologies upfront if this should have gone to user@, but it seems a developer question, so here goes. We are trying to improve a listener to track lineage across our platform. This requires tracking where data comes from and where it goes to. E.g. sc.setLogLevel("INFO"); val data = sc.textF…
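The listener approach being improved can be sketched with Spark's `SparkListener`; this minimal version only logs job boundaries, which is roughly the event surface that does (or does not) fire around actions like `saveAsTextFile`:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Sketch only: a lineage listener that logs job start/end events.
class LineageListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"lineage: job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"lineage: job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// Registered on a live SparkContext, e.g.:
//   sc.addSparkListener(new LineageListener)
```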

Re: Coalesce behaviour

2018-10-15 Thread Koert Kuipers
I realize it is unlikely all data will be local to tasks, so placement will not be optimal and there will be some network traffic, but is this the same as a shuffle? CoalescedRDD shows a NarrowDependency, which I thought meant it could be implemented without a shuffle. On Mon, Oct 15, 2018 a…
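The distinction under discussion can be sketched as follows; with `shuffle = false` (the default), `coalesce` builds a `CoalescedRDD` with a `NarrowDependency`, so each output partition reads its parent partitions directly and no shuffle files are written, although reads may still cross the network when parents are remote:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: the two ways to shrink partition count (target of 10 is illustrative).
def shrink(rdd: RDD[String]): (RDD[String], RDD[String]) = {
  val narrowed = rdd.coalesce(10)                 // NarrowDependency: no shuffle
  val shuffled = rdd.coalesce(10, shuffle = true) // same as repartition(10)
  (narrowed, shuffled)
}
```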