Re: Strange codegen error for SortMergeJoin in Spark 2.2.1

2018-06-06 Thread Kazuaki Ishizaki
Thank you for reporting this problem. Would it be possible to create a JIRA entry with a small program that can reproduce it? Best Regards, Kazuaki Ishizaki From: Rico Bergmann To: "user@spark.apache.org" Date: 2018/06/05 19:58 Subject: Strange codegen error for SortMer

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread amihay gonen
If you are using the Kafka direct connect API, it might be committing offsets back to Kafka itself. On Thu, Jun 7, 2018, 4:10, licl wrote: > I met the same issue and I tried to delete the checkpoint dir before the > job , > > but Spark seems to read the correct offset even after the
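The behavior licl describes is consistent with Structured Streaming tracking offsets in its checkpoint rather than (only) in Kafka. A minimal sketch of pinning the starting position explicitly, assuming hypothetical broker, topic, and checkpoint paths (none of these names come from the thread, and the `startingOffsets` option is only honored when no checkpoint exists yet):

```python
# Hypothetical sketch: make Structured Streaming start from a known
# position instead of whatever an old checkpoint recorded.
# Broker, topic, and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-offsets-sketch").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    # "earliest", "latest", or a JSON map of explicit per-partition
    # offsets; this is ignored once a checkpoint for the query exists.
    .option("startingOffsets", "earliest")
    .load())

query = (stream.writeStream
    .format("console")
    # Pointing at a fresh checkpoint directory is what makes Spark
    # re-read startingOffsets instead of resuming prior progress.
    .option("checkpointLocation", "/tmp/new-checkpoint-dir")
    .start())
```

This is a configuration sketch, not a runnable example: it needs a live Kafka broker and a Spark cluster to do anything.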

If there is timestamp type data in DF, Spark 2.3 toPandas is much slower than spark 2.2.

2018-06-06 Thread 李斌松
If there is timestamp-type data in a DataFrame, Spark 2.3's toPandas is much slower than Spark 2.2's.
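For reference, Spark 2.3 introduced an Arrow-based columnar path for `toPandas` (off by default) that is commonly suggested for exactly this kind of slowdown. A hedged sketch, assuming the config key available in 2.3 and an illustrative DataFrame:

```python
# Sketch (assumes Spark 2.3 with pyarrow installed on the driver):
# enable Arrow-based transfer before calling toPandas.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("topandas-arrow-sketch").getOrCreate()

# Arrow-based conversion is disabled by default in Spark 2.3.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1000).withColumn("ts", current_timestamp())
pdf = df.toPandas()  # timestamp columns now go through the Arrow path
```

Whether this fixes the regression the poster saw is untested here; it is the standard first thing to try.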

Pyspark Join and then column select is showing unexpected output

2018-06-06 Thread bis_g
I am not sure if the long hours are getting to me, but I am seeing some unexpected behavior in Spark 2.2.0. I have created a toy example as below: toy_df = spark.createDataFrame([ ['p1','a'], ['p1','b'], ['p1','c'], ['p2','a'], ['p2','b'], ['p2','d']],schema=['patient','drug']) I create another da
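The message is truncated, but "join then column select shows unexpected output" in Spark 2.2 is very often the ambiguous-column problem: after a join on a column name both sides share, selecting by bare name can resolve to the wrong side. A sketch of the usual fix, assuming a second hypothetical DataFrame (`other_df` and its `cost` column are my invention, not from the thread):

```python
# Sketch: disambiguate columns after a join with aliases.
# other_df and its columns are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("join-select-sketch").getOrCreate()

toy_df = spark.createDataFrame(
    [['p1', 'a'], ['p1', 'b'], ['p1', 'c'],
     ['p2', 'a'], ['p2', 'b'], ['p2', 'd']],
    schema=['patient', 'drug'])

other_df = spark.createDataFrame(
    [['a', 1.0], ['b', 2.0]], schema=['drug', 'cost'])

# Aliasing both sides makes every later column reference unambiguous,
# avoiding the surprising resolution Spark applies to duplicate names.
joined = (toy_df.alias("t")
          .join(other_df.alias("o"), col("t.drug") == col("o.drug"))
          .select(col("t.patient"), col("t.drug"), col("o.cost")))
```

This requires a running SparkSession, so it is a sketch of the pattern rather than a verified reproduction of the poster's issue.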

Re: Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-06-06 Thread licl
I met the same issue and I tried to delete the checkpoint dir before the job, but Spark still reads the correct offset even after the checkpoint dir is deleted. I don't know how Spark does this without the checkpoint's metadata. -- Sent from: http://apache-spark-user-list.1001560.n3.

Spark ML online serving

2018-06-06 Thread Holden Karau
At Spark Summit some folks were talking about model serving and we wanted to collect requirements from the community. -- Twitter: https://twitter.com/holdenkarau

Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-06-06 Thread spark receiver
Use unix time and write it to Oracle as a NUMBER column type; create a virtual column in the Oracle database for the unix time, like "oracle_time generated always as (to_date('1970010108','YYYYMMDDHH24')+(1/24/60/60)*unixtime)" > On Mar 20, 2018, at 11:08 PM, Gurusamy Thirupathy wrote: > >
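The Oracle expression works because adding a number to an Oracle DATE adds that many days, so (1/24/60/60)*unixtime adds exactly `unixtime` seconds to the epoch; the '08' hour in '1970010108' looks like a fixed timezone offset baked into the base date. A quick pure-Python check of that arithmetic:

```python
from datetime import datetime, timedelta

def oracle_virtual_time(unixtime):
    # Mirrors: to_date('1970010108','YYYYMMDDHH24') + (1/24/60/60)*unixtime
    # Oracle date arithmetic adds days, so 1/24/60/60 of a day is one
    # second, and the product adds `unixtime` seconds to the base date.
    epoch = datetime(1970, 1, 1, 8)  # the 08 hour from the expression
    return epoch + timedelta(days=unixtime / 86400.0)

# 1528243200 s after the UTC epoch is 2018-06-06 00:00:00 UTC,
# so with the 8-hour base offset we get 08:00.
print(oracle_virtual_time(1528243200))  # 2018-06-06 08:00:00
```

The 8-hour base is copied from the quoted expression as-is; whether it is the right offset for the poster's data depends on their timezone handling.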

Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread raksja
It's happening in the executor: # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError="kill -9 %p" # Executing /bin/sh -c "kill -9 25800"... -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

FINAL REMINDER: Apache EU Roadshow 2018 in Berlin next week!

2018-06-06 Thread sharan
Hello Apache Supporters and Enthusiasts This is a final reminder that our Apache EU Roadshow will be held in Berlin next week on 13th and 14th June 2018. We will have 28 different sessions running over 2 days that cover some great topics. So if you are interested in Microservices, Internet of

Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-06 Thread Marcelo Vanzin
That feature has not been implemented yet. https://issues.apache.org/jira/browse/SPARK-11033 On Wed, Jun 6, 2018 at 5:18 AM, Behroz Sikander wrote: > I have a client application which launches multiple jobs in Spark Cluster > using SparkLauncher. I am using Standalone cluster mode. Launching jobs

Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-06 Thread bsikander
Any help would be appreciated. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

RE: [External] Re: Sorting in Spark on multiple partitions

2018-06-06 Thread Sing, Jasbir
Hi Jorn, We are using Spark 2.2.0 for our development. Below is the code snippet for your reference: var newDf = data.repartition(col("userid")).sortWithinPartitions("sid", "time") newDf.write.format("parquet").saveAsTable("tempData") newDf.coalesce(1).write.format(outputFormat).option("header", "true"
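One likely source of the sorting confusion in this thread: `sortWithinPartitions` orders rows only inside each partition, not globally, and a later `coalesce(1)` just concatenates partitions in partition order. A PySpark rendering of the snippet above (the sample data is invented for illustration):

```python
# Sketch of the Scala snippet in PySpark; sample rows are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("sort-partitions-sketch").getOrCreate()

data = spark.createDataFrame(
    [('u1', 's1', 2), ('u1', 's1', 1), ('u2', 's2', 3)],
    schema=['userid', 'sid', 'time'])

# Co-locate each user's rows, then sort only within each partition.
# This yields a per-partition order; a single globally sorted file
# would need orderBy('sid', 'time') instead.
new_df = (data.repartition(col('userid'))
              .sortWithinPartitions('sid', 'time'))
```

This is a sketch of the semantics, not a claim about what the original job should do; it needs a SparkSession to run.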

Re: Reg:- Py4JError in Windows 10 with Spark

2018-06-06 Thread Jay
Are you running this in local mode or cluster mode? If you are running in cluster mode, have you ensured that numpy is present on all nodes? On Tue 5 Jun, 2018, 2:43 AM @Nandan@, wrote: > Hi , > I am getting error :- > > --

Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread Jay
I might have missed it, but can you tell if the OOM is happening in the driver or an executor? Also, it would be good if you could post the actual exception. On Tue 5 Jun, 2018, 1:55 PM Nicolas Paris, wrote: > IMO your json cannot be read in parallel at all; then Spark only offers > you > to play again w

[SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-06 Thread Behroz Sikander
I have a client application which launches multiple jobs in a Spark cluster using SparkLauncher. I am using *Standalone* *cluster mode*. Launching jobs has worked fine so far; I use launcher.startApplication() to launch. But now I have a requirement to check the state of my driver process. I added a

[Spark Streaming] Distinct Count on unrelated columns

2018-06-06 Thread Aakash Basu
Hi guys, Posted a question (link) on StackOverflow, any help? Thanks, Aakash.