Checkpointing runs the job twice?

2015-10-17 Thread jatinganhotra
Hi, I noticed that when you checkpoint a given RDD, the action is performed twice: I can see 2 jobs being executed in the Spark UI. Example: val logFile = "/data/pagecounts" sc.setCheckpointDir("/checkpoints") val logData = sc.textFile(logFile, 2) val as = logData.filter(line => lin
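For context, RDD.checkpoint launches a separate job to materialize the RDD, and that job recomputes the whole lineage unless the RDD is already persisted. A minimal sketch (the filter predicate is hypothetical, since the example above is cut off):

val logFile = "/data/pagecounts"
sc.setCheckpointDir("/checkpoints")
val logData = sc.textFile(logFile, 2)
val as = logData.filter(line => line.contains("a"))  // hypothetical predicate
as.cache()       // persist first: the checkpoint job then reads cached partitions
as.checkpoint()  // without the cache, the second job recomputes the lineage from scratch
as.count()       // first action: one job for the count, a second to write the checkpoint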

Re: Problem installing Spark on Windows 8

2015-10-17 Thread Marco Mistroni
Hi, still having issues installing Spark on Windows 8. The Spark web console runs successfully and I can run the Spark Pi example; however, when I run spark-shell I am getting the following exception: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should
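A commonly reported fix for this error on Windows (an assumption, not confirmed in this thread) is to grant permissions on the Hive scratch directory using Hadoop's winutils.exe:

REM assumes winutils.exe is present under %HADOOP_HOME%\bin
%HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive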

Should I convert json into parquet?

2015-10-17 Thread Gavin Yue
I have JSON files which contain timestamped events. Each event is associated with a user ID. Now I want to group by user ID, so convert from: Event1 -> UserIDA; Event2 -> UserIDA; Event3 -> UserIDB; to intermediate storage: UserIDA -> (Event1, Event2...) UserIDB -> (Event3...) Then I will label po
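One way to get that layout is to write the events out as Parquet partitioned by user id. A rough sketch against the Spark 1.5-era DataFrame API (paths and the column name are assumptions):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val events = sqlContext.read.json("/data/events")   // hypothetical input path
events.write
  .partitionBy("userId")                            // assumed column name
  .parquet("/data/events_by_user")                  // hypothetical output path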

Re: repartition vs partitionBy

2015-10-17 Thread Adrian Tanase
If the dataset allows it, you can try to write a custom partitioner to help Spark distribute the data more uniformly. Sent from my iPhone On 17 Oct 2015, at 16:14, shahid ashraf <sha...@trialx.com> wrote: yes, I know about that; it's in case we want to reduce partitions. The point here is the data
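For reference, a minimal custom partitioner might look like the sketch below. The key type and the idea of giving a known hot key a dedicated partition are assumptions, just to illustrate the shape:

import org.apache.spark.Partitioner

// Hypothetical sketch: route keys by their hash, but give a known hot key
// a dedicated partition so it does not crowd out the others.
class SkewAwarePartitioner(partitions: Int, hotKey: String) extends Partitioner {
  require(partitions > 1, "need at least 2 partitions")
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case k: String if k == hotKey => 0                     // dedicated slot for the hot key
    case k => 1 + math.abs(k.hashCode % (partitions - 1))  // remaining slots for the rest
  }
}

// usage (hypothetical): pairRdd.partitionBy(new SkewAwarePartitioner(100, "hotKey"))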

Spark Streaming scheduler delay VS driver.cores

2015-10-17 Thread Adrian Tanase
Hi, I've recently bumped up the resources for a Spark Streaming job, and the performance started to degrade over time. It was running fine on 7 nodes with 14 executor cores each (via YARN) until I bumped executor.cores to 22 cores/node (out of 32 on AWS c3.xlarge, 24 for YARN). The driver has 2
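For reference, the settings under discussion map onto these configuration keys (values copied from the report above; a sketch, not a recommendation):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.cores", "22")   // the bump that preceded the degradation
  .set("spark.driver.cores", "2")      // driver.cores, honored in cluster mode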

Output println info in LogMessage Info?

2015-10-17 Thread kali.tumm...@gmail.com
Hi All, In Unix I can print a warning or info using LogMessage WARN "Hi All" or LogMessage INFO "Hello World". Is there a similar thing in Spark? Imagine I want to print the count of an RDD in the logs instead of using println. Thanks, Sri
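Spark itself logs through log4j, so the equivalent of LogMessage WARN/INFO is a log4j Logger. A minimal sketch (the logger name and RDD are made up):

import org.apache.log4j.Logger

val log = Logger.getLogger("MyJob")          // Spark bundles log4j, so this works as-is
val rdd = sc.parallelize(1 to 100)           // hypothetical RDD
log.warn("Hi All")                           // equivalent of LogMessage WARN "Hi All"
log.info(s"RDD count: ${rdd.count()}")       // the count goes to the logs, not stdout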

Re: Can I use Spark as an alternative to GemFire cache?

2015-10-17 Thread Ndjido Ardo Bar
Hi Kali, If I understand you well, Tachyon (http://tachyon-project.org) can be a good alternative. You can use the Spark API to load and persist data into Tachyon. Hope that will help. Ardo > On 17 Oct 2015, at 15:28, "kali.tumm...@gmail.com" wrote: > > Hi All, > > Can Spark be used as a
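In the Spark 1.x releases current at the time, the OFF_HEAP storage level was backed by Tachyon, so persisting into Tachyon looks roughly like this sketch (the path is hypothetical, and it assumes a Tachyon deployment already wired into the Spark configuration):

import org.apache.spark.storage.StorageLevel

val dims = sc.textFile("/data/dimensions")   // hypothetical dimension data
dims.persist(StorageLevel.OFF_HEAP)          // backed by Tachyon in Spark 1.x
dims.count()                                 // materialize the off-heap copy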

Can I use Spark as an alternative to GemFire cache?

2015-10-17 Thread kali.tumm...@gmail.com
Hi All, Can Spark be used as an alternative to GemFire cache? We use GemFire cache to save (cache) dimension data in memory, which is later used by our custom-made Java ETL tool. Can I do something like below? Can I cache an RDD in memory for a whole day? As far as I know, the RDD will get empty once t
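On the lifetime question: a cached RDD lives only as long as the SparkContext that created it, so keeping it for a whole day means keeping the driver alive (e.g. a long-running job server). A sketch of the cache-and-refresh pattern (path and storage level are assumptions):

import org.apache.spark.storage.StorageLevel

var dims = sc.textFile("/data/dimensions").persist(StorageLevel.MEMORY_AND_DISK)
dims.count()          // materialize the cache up front

// ... many jobs later, at the daily refresh point:
dims.unpersist()
dims = sc.textFile("/data/dimensions").persist(StorageLevel.MEMORY_AND_DISK)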

Re: repartition vs partitionBy

2015-10-17 Thread shahid ashraf
Yes, I know about that; it's in case we want to reduce partitions. The point here is that the data is skewed to a few partitions. On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <raghavendra.pan...@gmail.com> wrote: > You can use the coalesce function, if you want to reduce the number of > partitions. This one

Re: s3a file system and spark deployment mode

2015-10-17 Thread Raghavendra Pandey
You can add classpath info in the Hadoop env file. Add the following line to your $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Add the following line to $SPARK_HOME/conf/spark-env.sh:
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/
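The second export is cut off above; its usual completion (an assumption, based on the standard way of pointing Spark at the Hadoop-provided jars) is:

export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)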

Re: repartition vs partitionBy

2015-10-17 Thread Raghavendra Pandey
You can use the coalesce function if you want to reduce the number of partitions; it minimizes the data shuffle. -Raghav On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri wrote: > Hi folks > > I need to repartition a large set of data, around 300G, as I see some portions > have large data (data skew)
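For reference, a side-by-side sketch (the RDD and partition counts are illustrative):

val rdd = sc.parallelize(1 to 1000000, 1000)   // hypothetical input

// coalesce merges existing partitions, avoiding a full shuffle
val fewer = rdd.coalesce(100)

// repartition is coalesce(shuffle = true): full shuffle, can also grow the count
val reshuffled = rdd.repartition(400)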

Re: Complex transformation on a dataframe column

2015-10-17 Thread Raghavendra Pandey
Here is a quick code sample I can come up with: case class Input(ID: String, Name: String, PhoneNumber: String, Address: String) val df = sc.parallelize(Seq(Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip"))).toDF() val formatAddress = udf { (s: String) => s.split(":").mkString("
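The sample is cut off above; a self-contained version of the same idea (the mkString separator and the spark-shell sqlContext import are assumptions) would be:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._              // spark-shell's sqlContext, for toDF()

case class Input(ID: String, Name: String, PhoneNumber: String, Address: String)

val df = sc.parallelize(Seq(
  Input("1", "raghav", "0123456789", "houseNo:StreetNo:City:State:Zip")
)).toDF()

// assumed completion of the truncated line: rejoin the parts with a comma
val formatAddress = udf { (s: String) => s.split(":").mkString(", ") }
df.withColumn("FormattedAddress", formatAddress(df("Address"))).show()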

Re: Spark on Mesos / Executor Memory

2015-10-17 Thread Bharath Ravi Kumar
To be precise, the MesosExecutorBackend's Xms & Xmx equal spark.executor.memory. So there's no question of expanding or contracting the memory held by the executor. On Sat, Oct 17, 2015 at 5:38 PM, Bharath Ravi Kumar wrote: > David, Tom, > > Thanks for the explanation. This confirms my suspicion

Re: How to have a single reference of a class in Spark Streaming?

2015-10-17 Thread Deenar Toraskar
Swetha, look at http://spark.apache.org/docs/latest/programming-guide.html#shared-variables Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on
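The shared-variables section being pointed to covers broadcast variables, among other things; a minimal sketch:

// Ship one read-only copy of the map per node instead of one per task
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val result = sc.parallelize(Seq("a", "b", "a"))
  .map(k => lookup.value.getOrElse(k, 0))
  .collect()                               // Array(1, 2, 1)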

Re: Spark on Mesos / Executor Memory

2015-10-17 Thread Bharath Ravi Kumar
David, Tom, Thanks for the explanation. This confirms my suspicion that the executor holds on to memory, regardless of the tasks in execution, once it expands to occupy memory in keeping with spark.executor.memory. There certainly is scope for improvement here, though I realize there will substan

PySpark: breaking down application execution time and fine-tuning

2015-10-17 Thread saluc
Hello, I am using PySpark to develop my big-data application. I have the impression that most of the execution time of my application is spent on infrastructure (distributing the code and the data across the cluster, IPC between the Python processes and the JVM) rather than on the computation itself.

repartition vs partitionby

2015-10-17 Thread shahid qadri
Hi folks, I need to repartition a large set of data, around 300G, as I see some portions have large data (data skew). I have pairRDDs [({},{}),({},{}),({},{})]. What is the best way to solve the problem?
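For pair RDDs, the two operations in the subject line compare roughly like this sketch (the input and partition counts are illustrative):

import org.apache.spark.HashPartitioner

val pairRdd = sc.parallelize(Seq(("k1", 1), ("k2", 2)))   // stand-in for the skewed pairRDDs

// repartition: full shuffle to n partitions, keys spread arbitrarily
val r1 = pairRdd.repartition(400)

// partitionBy: shuffle governed by a Partitioner, so records are co-located by key
val r2 = pairRdd.partitionBy(new HashPartitioner(400))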