unserialize error in sparkR

2015-07-26 Thread Jennifer15
Hi, I have a newbie question: I get the following error when I increase the number of samples in my sample script samplescript.R, which is written against Spark 1.2 (there is no error for a small number of samples): Error in unserial

PYSPARK_DRIVER_PYTHON="ipython" spark/bin/pyspark does not create SparkContext

2015-07-26 Thread Zerony Zhao
Hello everyone, I have a newbie question. $SPARK_HOME/bin/pyspark will create SparkContext automatically. Welcome to [Spark ASCII-art banner] version 1.4.1 Using Python version 2.7.3 (default, Ju

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Ted Yu
If I read the code correctly, that error message came from CoarseGrainedSchedulerBackend. There may be existing or future error messages from that class, other than the one cited below, which are useful. Maybe change the log level of this message to DEBUG? Cheers On Sun, Jul 26, 2015 at 3:28 PM, Mridul Muralid

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Mridul Muralidharan
Simply customize your log4j config instead of modifying the code if you don't want messages from that class. Regards Mridul On Sunday, July 26, 2015, Sea <261810...@qq.com> wrote: > This exception is so ugly!!! The screen is full of these messages when > the program runs for a long time, and they
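[Editorial note] A minimal sketch of that log4j approach, for conf/log4j.properties; the package path is assumed from the Spark 1.4 source layout, and OFF suppresses all output from that one class:

    # Suppress all log output from the class that emits the
    # "Asked to remove non-existent executor" messages.
    log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend=OFF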

Re: Custom partitioner

2015-07-26 Thread Ted Yu
You can write subclass of Partitioner whose getPartition() returns partition number corresponding to the given key. Take a look at core/src/main/scala/org/apache/spark/api/python/PythonPartitioner.scala for an example. Cheers On Sun, Jul 26, 2015 at 1:43 PM, Hafiz Mujadid wrote: > Hi > > I hav
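[Editorial note] For illustration, a minimal sketch of such a subclass: a hypothetical month-based partitioner matching the question below, assuming the key is already the month number (1-12):

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: one partition per calendar month, assuming
    // the key is the month number (1-12) extracted from the date-time column.
    class MonthPartitioner extends Partitioner {
      override def numPartitions: Int = 12
      override def getPartition(key: Any): Int = key match {
        case month: Int => (month - 1) % 12   // months 1-12 map to partitions 0-11
        case _          => 0
      }
    }

    // Usage sketch: key each record by its month, then repartition.
    // rdd.map(row => (extractMonth(row), row)).partitionBy(new MonthPartitioner)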

Custom partitioner

2015-07-26 Thread Hafiz Mujadid
Hi, I have CSV data in which I have a column of date-time. I want to partition my data into 12 partitions, with each partition containing data of one month only. I am not sure how to write such a partitioner and how to use it to read and write data. Kindly help me in this regard. Thanks

Re: RDD[Future[T]] => Future[RDD[T]]

2015-07-26 Thread Ayoub Benali
It doesn't work because mapPartitions expects a function f:(Iterator[T]) ⇒ Iterator[U] while .sequence wraps the iterator in a Future 2015-07-26 22:25 GMT+02:00 Ignacio Blasco : > Maybe using mapPartitions and .sequence inside it? > El 26/7/2015 10:22 p. m., "Ayoub" escribió: > >> Hello, >> >> I
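[Editorial note] For reference, a minimal sketch of what Ignacio's suggestion would look like; httpCall here is a hypothetical stand-in for the original post's async call, and note that Await.result blocks once per partition, so this is not the non-blocking solution Ayoub is asking for:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.rdd.RDD

    // Hypothetical stand-in for the async call from the original post.
    def httpCall(x: Int): Future[String] = Future { "response-" + x }

    def resolvePerPartition(rdd: RDD[Int]): RDD[String] =
      rdd.mapPartitions { iter =>
        // Future.sequence turns a collection of Futures into a Future of a
        // collection, so it cannot be returned from mapPartitions directly;
        // blocking once per partition gets back a plain Iterator[String].
        Await.result(Future.sequence(iter.map(httpCall).toList), 10.minutes).iterator
      }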

Re: RDD[Future[T]] => Future[RDD[T]]

2015-07-26 Thread Ignacio Blasco
Maybe using mapPartitions and .sequence inside it? El 26/7/2015 10:22 p. m., "Ayoub" escribió: > Hello, > > I am trying to convert the result I get after doing some async IO : > > val rdd: RDD[T] = // some rdd > > val result: RDD[Future[T]] = rdd.map(httpCall) > > Is there a way collect all futur

RDD[Future[T]] => Future[RDD[T]]

2015-07-26 Thread Ayoub
Hello, I am trying to convert the result I get after doing some async IO: val rdd: RDD[T] = // some rdd val result: RDD[Future[T]] = rdd.map(httpCall) Is there a way to collect all futures once they are completed in a *non-blocking* (i.e. without scala.concurrent.Await) and lazy way? If the RDD

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-26 Thread Zoran Jeremic
Yes. You're right. I didn't get it till now. Thanks. On Sun, Jul 26, 2015 at 7:36 AM, Ted Yu wrote: > bq. [INFO] \- org.apache.spark:spark-core_2.10:jar:1.4.0:compile > > I think the above notation means spark-core_2.10 is the last dependency. > > Cheers > > On Thu, Jul 23, 2015 at 9:22 PM, Zor

Writing streaming data to cassandra creates duplicates

2015-07-26 Thread Priya Ch
Hi All, I have a problem when writing streaming data to Cassandra. Our existing product is on an Oracle DB, in which locks are maintained while writing data so that duplicates in the DB are avoided. But as Spark has a parallel processing architecture, if more than 1 thread is trying to write same d

Spark - Cassandra (timestamp question)

2015-07-26 Thread Ivan Babic
Hi, I am using Spark to load data from Cassandra. One of the fields in the C* table is a timestamp. When queried in C* it looks like this: "2015-06-01 02:56:07-0700" After loading the data into a Spark DataFrame (using sqlContext) and printing it from there, I lose the last field (4-digit time zone) and then

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Sea
This exception is so ugly!!! The screen is full of these messages when the program runs for a long time, and they will not fail the job. I commented it out in the source code. I think this information is useless, because the executor is already removed, and I don't know what the executor id mean

spark as a lookup engine for dedup

2015-07-26 Thread Shushant Arora
Hi, I have a requirement to process a large volume of events while ignoring duplicates at the same time. Events are consumed from Kafka and each event has an eventid. It may happen that an event has already been processed and comes again at some other offset. 1. Can I use a Spark RDD to persist processed events and th

Schema evolution in tables

2015-07-26 Thread sim
The schema merging section of the Spark SQL documentation shows an example of schema evolution in a partitioned table. Is this functionality only available when creating a Spark SQL table? dataFrameWithEvolvedSche
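[Editorial note] For reference, a minimal sketch along the lines of the documentation's schema-merging example; the paths and column names are hypothetical, and sc/sqlContext are assumed from a Spark shell:

    import sqlContext.implicits._

    // Two DataFrames with overlapping but different schemas, written to
    // subdirectories of the same base path (hypothetical paths).
    val df1 = sc.parallelize(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
    df1.write.parquet("data/test_table/key=1")

    val df2 = sc.parallelize(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
    df2.write.parquet("data/test_table/key=2")

    // Reading the base path picks up both partitions and merges their schemas.
    val merged = sqlContext.read.parquet("data/test_table")
    merged.printSchema()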

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Peter Leventis
Thank you for the answers. I followed numerous recipes, including videos, and encountered many obstacles, such as 7-Zip being unable to unzip the *.gz file and the need to use SBT. My setup is fixed: I use a Windows 7 PC (not Linux). I would be very grateful for an approach that simply works. This is

Re: Long running streaming application - worker death

2015-07-26 Thread Ashwin Giridharan
Hi Aviemzur, As of now the Spark workers just run indefinitely in a loop, irrespective of whether the data source (Kafka) is active or has lost its connection, because they just read ZooKeeper for the offset of the data to be consumed. So when your DStream receiver is lost, it's LOST! A

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Ted Yu
You can list the files in tmpfs in reverse chronological order and remove the oldest until you have enough space. Cheers On Sun, Jul 26, 2015 at 12:43 AM, Pa Rö wrote: > I have seen that the "tmpfs" is full; how can I clear this? > > 2015-07-23 13:41 GMT+02:00 Pa Rö : > >> hello spark community

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-26 Thread Ted Yu
bq. [INFO] \- org.apache.spark:spark-core_2.10:jar:1.4.0:compile I think the above notation means spark-core_2.10 is the last dependency. Cheers On Thu, Jul 23, 2015 at 9:22 PM, Zoran Jeremic wrote: > Hi Yana, > > Sorry for late response. I just saw your email. At the end I ended with > the fo

Long running streaming application - worker death

2015-07-26 Thread aviemzur
Hi all, I have a question about long-running streaming applications and workers that act as consumers. Specifically, my program runs on a Spark standalone cluster with a small number of workers acting as Kafka consumers using Spark Streaming. What I noticed was that in a long running application

Re: Parallelism of Custom receiver in spark

2015-07-26 Thread Michal Čizmazia
#1 see https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving #2 By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transform
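[Editorial note] For #1, a minimal sketch of running several copies of a custom receiver in parallel and unioning the resulting streams, in the spirit of the linked guide section; CustomReceiver, host, port and ssc are hypothetical/assumed:

    // Several receiver instances, each occupying one core, unioned into a
    // single DStream for downstream processing.
    val numReceivers = 3
    val streams = (1 to numReceivers).map { _ =>
      ssc.receiverStream(new CustomReceiver(host, port))
    }
    val unified = ssc.union(streams)
    unified.print()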

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Jörn Franke
Use a Hadoop distribution that supports Windows and has Spark included. Generally, if you want to use Windows, you should use the server version. On Sat, Jul 25, 2015 at 20:11, Peter Leventis wrote: > I just wanted an easy step by step guide as to exactly what version of what > ever to down

Re: Writing binary files in Spark

2015-07-26 Thread Oren Shpigel
As I wrote before, the result of my pipeline is binary objects, which I want to write directly as raw bytes, not serialize them again. Is this possible? On Sat, Jul 25, 2015 at 11:28 AM Akhil Das wrote: > It's been added since Spark 1.1.0, I guess > https://issues.apache.org/jira/browse/SPARK-
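[Editorial note] One possible approach, as a sketch only with hypothetical names (this is not a built-in Spark API): write each partition's byte arrays straight to storage via the Hadoop FileSystem API inside foreachPartition, so nothing is re-serialized:

    import java.util.UUID
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD

    // Writes one raw ".bin" file per partition under baseDir (hypothetical
    // layout); the Configuration is created on the executor side.
    def writeRawBytes(rdd: RDD[Array[Byte]], baseDir: String): Unit =
      rdd.foreachPartition { iter =>
        val fs = FileSystem.get(new Configuration())
        val out = fs.create(new Path(baseDir, UUID.randomUUID().toString + ".bin"))
        try iter.foreach(bytes => out.write(bytes)) finally out.close()
      }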

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
I got it, thanks for that. 2015-07-26 17:21 GMT+08:00 Paolo Platter : > If you want a performance boost, you need to load the full table in > memory using caching and then execute your query directly on the cached > dataframe. Otherwise you use spark only as a bridge and you don't leverage > the dist

R: Spark is much slower than direct access MySQL

2015-07-26 Thread Paolo Platter
If you want a performance boost, you need to load the full table into memory using caching and then execute your query directly on the cached DataFrame. Otherwise you use Spark only as a bridge and you don't leverage Spark's distributed in-memory engine. Paolo Sent from my Windows Phone
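[Editorial note] A minimal sketch of that approach; the JDBC URL, credentials and table name are hypothetical, and the Spark 1.4 DataFrame reader and a live sqlContext are assumed:

    val props = new java.util.Properties()
    props.setProperty("user", "root")
    props.setProperty("password", "secret")

    // Pull the MySQL table into Spark once...
    val df = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/test", "my_table", props)
    df.cache()                       // lazy; materialized on the first action
    df.registerTempTable("my_table")

    // ...then run subsequent queries against the cached DataFrame.
    sqlContext.sql("SELECT count(*) FROM my_table").show()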

Re: Parquet writing gets progressively slower

2015-07-26 Thread Cheng Lian
Actually no. In general, Spark SQL doesn't trust Parquet summary files. The reason is that it's not unusual to fail to write Parquet summary files. For example, Hive never writes summary files for Parquet tables because it uses NullOutputCommitter, which bypasses Parquet's own output committer.

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
Thanks for your explanation. 2015-07-26 16:22 GMT+08:00 Shixiong Zhu : > Oh, I see. That's the total time of executing a query in Spark. Then the > difference is reasonable, considering Spark has much more work to do, e.g., > launching tasks in executors. > > Best Regards, > Shixiong Zhu > > 2015-07-2

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Shixiong Zhu
Oh, I see. That's the total time of executing a query in Spark. Then the difference is reasonable, considering Spark has much more work to do, e.g., launching tasks in executors. Best Regards, Shixiong Zhu 2015-07-26 16:16 GMT+08:00 Louis Hust : > Look at the given url: > > Code can be found at:

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
Look at the given url: Code can be found at: https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java 2015-07-26 16:14 GMT+08:00 Shixiong Zhu : > Could you clarify how you measure the Spark time cost? Is it the total > time of running the query? If so, it's possible

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Shixiong Zhu
Could you clarify how you measure the Spark time cost? Is it the total time of running the query? If so, it's possible because the overhead of Spark dominates for small queries. Best Regards, Shixiong Zhu 2015-07-26 15:56 GMT+08:00 Jerrick Hoang : > how big is the dataset? how complicated is the

Re: Spark is much slower than direct access MySQL

2015-07-26 Thread Jerrick Hoang
how big is the dataset? how complicated is the query? On Sun, Jul 26, 2015 at 12:47 AM Louis Hust wrote: > Hi, all, > > I am using spark DataFrame to fetch small table from MySQL, > and i found it cost so much than directly access MySQL Using JDBC. > > Time cost for Spark is about 2033ms, and di

Spark is much slower than direct access MySQL

2015-07-26 Thread Louis Hust
Hi, all, I am using a Spark DataFrame to fetch a small table from MySQL, and I found it costs much more than accessing MySQL directly using JDBC. The time cost for Spark is about 2033 ms, versus about 16 ms for direct access. Code can be found at: https://github.com/louishust/sparkDemo/blob/master/src/main/java/Di

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Pa Rö
I have seen that the "tmpfs" is full; how can I clear this? 2015-07-23 13:41 GMT+02:00 Pa Rö : > hello spark community, > > I have built an application with GeoMesa, Accumulo and Spark. > It works in Spark local mode, but not on the Spark > cluster. In short it says: No space left on