RE: mapPartitioningWithIndex in Dataframe

2017-08-05 Thread Mendelson, Assaf
First, I believe you mean the Dataset API rather than the DataFrame API. You can easily add the partition index as a new column to your dataframe using spark_partition_id(). Then a normal mapPartitions should work fine (i.e. you should create the appropriate case class which includes the partiti…
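A minimal sketch of the suggested pattern in Scala (the case class Rec, the app name and the doubling step are illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.spark_partition_id

    // Hypothetical case class: the payload plus its partition index
    case class Rec(value: Long, pid: Int)

    object PartitionIndexSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("partition-index").getOrCreate()
        import spark.implicits._

        val ds = spark.range(0, 100)
          .withColumn("pid", spark_partition_id()) // tag every row with its partition index
          .select($"id".as("value"), $"pid")
          .as[Rec]

        // A plain mapPartitions now sees the index on every record,
        // emulating mapPartitionsWithIndex from the RDD API
        val out = ds.mapPartitions(iter => iter.map(r => (r.pid, r.value * 2)))
        out.show()
        spark.stop()
      }
    }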

kafka setting enable.auto.commit to false is being overridden and I lose data. please help!

2017-08-05 Thread shyla deshpande
Hello All, I am using Spark 2.0.2 and spark-streaming-kafka-0-10_2.11. I am setting enable.auto.commit to false and want to manually commit the offsets after my output operation is successful, so when an exception is raised during the processing I do not want the offsets to be committed. B…
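The documented manual-commit pattern for spark-streaming-kafka-0-10 captures the offset ranges before running the output operation and commits them only afterwards, so a thrown exception skips the commit; a minimal sketch (broker address, topic and group id are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}

    object ManualCommitSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("manual-commit"), Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",          // placeholder
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "example-group",           // placeholder
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        stream.foreachRDD { rdd =>
          // grab the offsets before doing anything else with the RDD
          val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          // ... output operation goes here; if it throws, commitAsync below
          // is never reached, so the offsets stay uncommitted
          stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }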

Trying to connect Spark 1.6 to Hive

2017-08-05 Thread toletum
Hi everybody, I'm trying to connect Spark to Hive. Hive uses Derby Server for metastore_db. $SPARK_HOME/conf/hive-site.xml: javax.jdo.option.ConnectionURL = jdbc:derby://derby:1527/metastore_db;create=true (JDBC connect string for a JDBC metastore), javax.jdo.option.ConnectionDriverName = org.ap…
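The configuration above reads like a flattened hive-site.xml; a minimal sketch of the same properties as XML (the driver class is an assumption based on the standard Derby network client, since the message is cut off at "org.ap"):

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby://derby:1527/metastore_db;create=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <!-- assumption: the usual client driver for a networked Derby server -->
        <value>org.apache.derby.jdbc.ClientDriver</value>
      </property>
    </configuration>

With a networked Derby, the client JAR (derbyclient.jar) also has to be on Spark's classpath.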

Re: SPARK Issue in Standalone cluster

2017-08-05 Thread Marco Mistroni
Uh, believe me, there are lots of people on this list who will send you code snippets if you ask... 😀 Yes, that is what Steve pointed out, suggesting also that for that simple exercise you should perform all operations on a Spark standalone instance instead (or, alternatively, use an NFS on the cluster). I'd agree with his sugg…

Re: SPARK Issue in Standalone cluster

2017-08-05 Thread Gourav Sengupta
Hi Marco, for the first time in several years, FOR THE VERY FIRST TIME, I am seeing someone actually executing code and providing a response. It feels wonderful that at least someone considered responding by executing the code and did not just filter out each and every technical detail to brood only…

PySpark Streaming keeps dying

2017-08-05 Thread Riccardo Ferrari
Hi list, I have Spark 2.2.0 in standalone mode and Python 3.6. It is a very small testing cluster with two nodes. I am running (trying to run) a streaming job that simply reads from Kafka, applies an ML model and stores the result back into Kafka. The job is run with the following parameters: "--conf spark.cores.max=2 --…
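A hypothetical spark-submit invocation consistent with the quoted settings (the master URL, the Kafka package coordinates and the script name are all assumptions; the original message is cut off after spark.cores.max=2):

    spark-submit \
      --master spark://master:7077 \
      --conf spark.cores.max=2 \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
      streaming_job.py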