That was a) fuzzy b) insufficient – one can certainly use foreach (only) on
DStream RDDs – it works, as an empirical observation
As another empirical observation:
foreachPartition results in one instance of the lambda/closure per
partition when e.g. publishing to output systems like message brokers,
databases and file systems - that increases the level of parallelism of your
output processing
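The per-partition pattern described above can be sketched as follows. This is a sketch only: `createConnection`, `send` and `close` stand in for whatever client your output system provides (a Kafka producer, a DB connection pool, etc.) and are hypothetical names, not a real API.

```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection (and one closure instance) per partition, created on the
    // executor that owns the partition - not serialized from the driver.
    val connection = createConnection()
    partition.foreach(record => connection.send(record))
    connection.close()
  }
}
```

Creating the connection inside foreachPartition (rather than per record, or on the driver) amortizes the setup cost across the whole partition, and the output runs with as many parallel writers as there are partitions.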
-Original Message-
From: dgoldenberg [mailto:dg
Also try to increase the number of partitions gradually – not in one big jump
from 20 to 100 but adding e.g. 10 at a time, and see whether there is a
correlation with adding more RAM to the executors
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Wednesday, July 8, 2015 1:26 PM
To
them will be in a suspended mode
waiting for free core (Thread contexts also occupy additional RAM )
From: Aniruddh Sharma [mailto:asharma...@gmail.com]
Sent: Wednesday, July 8, 2015 12:52 PM
To: Evo Eftimov
Subject: Re: Out of Memory Errors on less number of cores in proportion to
Partitions
This is most likely due to the internal implementation of ALS in MLlib. Probably,
for each parallel unit of execution (a partition in Spark terms), the
implementation allocates and uses a RAM buffer where it keeps interim results
during the ALS iterations
If we assume that the size of that intern
ext...
dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter)).cache()
}
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Tuesday, July 7, 2015 1:55 PM
To: Evo Eftimov
Cc: Anand Nalya; spark users
Subject: Re:
Evo,
I'd let the OP clarify the qu
spark.streaming.unpersist = false // in order for SStreaming to not drop the
raw RDD data
spark.cleaner.ttl =
why is the above suggested, provided the persist/cache operation on the
constantly unionized Batch RDD will have to be invoked anyway (after every
union with a DStream RDD), besides
The “RDD”, aka Batch RDD, which you load from file will be kept for as long as
the Spark framework is instantiated / running – you can also ensure it is
flagged explicitly as persisted, e.g. in memory and/or on disk
From: Anand Nalya [mailto:anand.na...@gmail.com]
Sent: Tuesday, July 7, 2015 12:3
I had a look at the new R "on Spark" API / feature in Spark 1.4.0
For those "skilled in the art" (of R and distributed computing) it will be
immediately clear that "ON" is a marketing ploy: what it actually is, is
"TO", ie Spark 1.4.0 offers an INTERFACE from R TO DATA stored in Spark in
distributed
that limits the number of cores
per Executor rather than the total cores for the job and hence will probably
not yield the effect you need
From: Wojciech Pituła [mailto:w.pit...@gmail.com]
Sent: Wednesday, June 24, 2015 10:49 AM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Spark
There is no direct one to one mapping between Executor and Node
Executor is simply the spark framework term for JVM instance with some spark
framework system code running in it
A node is a physical server machine
You can have more than one JVM per node
And vice versa you can hav
Probably your application has crashed or was terminated without invoking the
stop method of the Spark context - in such cases it doesn't create the empty
flag file which apparently tells the history server that it can safely show
the log data - simply go to some of the other dirs of the history server t
Spark Streaming 1.3.0 on YARN during Job Execution keeps generating the
following error while the application is running:
ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
etc
etc
Caused by: java.io.IOException: Filesystem closed
What is GraphX:
- It can be viewed as a kind of Distributed, Parallel, Graph Database
- It can be viewed as Graph Data Structure (Data Structures 101 from
your CS course)
- It features some off-the-shelf algos for Graph Processing and
Navigation (Algos and Data
The only thing which doesn't make much sense in Spark Streaming (and I am
not saying it is done better in Storm) is the iterative and "redundant"
shipping of essentially the same tasks (closures/lambdas/functions) to
the cluster nodes AND re-launching them there again and again
This is a l
Best is by measuring and recording how the performance of your solution
scales as the workload scales - recording as in "data points recording" - and
then you can do some time-series stat analysis and visualizations
For example you can start with a single box with e.g. 8 CPU cores
Use e.g. 1 or
https://spark.apache.org/docs/latest/monitoring.html
also subscribe to various Listeners for various Metrics Types e.g. Job
Stats/Statuses - this will allow you (in the driver) to decide when to stop
the context gracefully (the listening and stopping can be done from a
completely separa
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages
(outside spark streaming)” – what I meant by that was “turn the Updates to your
HDFS dataset into Messages” and send them as such to spark streaming
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, June
the app
And at the same time you join the DStream RDDs of your actual Streaming Data
with the above continuously updated DStream RDD representing your HDFS file
From: Ilove Data [mailto:data4...@gmail.com]
Sent: Monday, June 15, 2015 5:19 AM
To: Tathagata Das
Cc: Evo Eftimov; Akhil Das
need to go for what is essentially a bit of a hack ie
running more than 1 worker per machine
Sent from Samsung Mobile
Original message From: Sandy Ryza
Date:2015/06/10 21:31 (GMT+00:00)
To: Evo Eftimov Cc: maxdml
,user@spark.apache.org Subject: Re: Determining
number of
Yes I think it is ONE worker ONE executor, as an executor is nothing but a JVM
instance spawned by the worker
To run more executors ie JVM instances on the same physical cluster node you
need to run more than one worker on that node and then allocate only part of
the sys resources to that worker/ex
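In Spark standalone mode of that era, running more than one worker per node is done via environment variables in conf/spark-env.sh on each node. The values below are illustrative, not a recommendation:

```shell
# conf/spark-env.sh (Spark standalone mode) - illustrative values
SPARK_WORKER_INSTANCES=2   # run two Worker daemons (hence two executor JVMs per app) per node
SPARK_WORKER_CORES=4       # CPU cores granted to each worker
SPARK_WORKER_MEMORY=8g     # RAM granted to each worker
```

Make sure the per-worker cores/memory sum to no more than the node actually has, otherwise the workers will oversubscribe the box.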
It depends on how big the Batch RDD requiring reloading is
Reloading it for EVERY single DStream RDD would slow down the stream processing
inline with the total time required to reload the Batch RDD …..
But if the Batch RDD is not that big then that might not be an issue
especially in t
The foreachPartition callback is provided with an Iterator by the Spark Framework –
while iterator.hasNext() ……
Also check whether this is not some sort of Python Spark API bug – Python seems
to be the foster child here – Scala and Java are the darlings
From: John Omernik [mailto:j...@omernik.co
once in your Spark Streaming App
and then keep joining and then e.g. filtering every incoming DStream RDD with
the (big static) Batch RDD
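The load-once-then-join pattern described above can be sketched like this. Note that `keyOf` and `myPredicate` are hypothetical placeholders for your own key-extraction and filtering logic, and the HDFS path is illustrative:

```scala
// Load the big static reference data once and cache it, keyed for joining.
val staticRdd = sc.textFile("hdfs:///ref/data")
  .map(line => (keyOf(line), line))
  .cache()

// Join every incoming micro-batch RDD against the cached batch RDD.
val enriched = dstream
  .map(msg => (keyOf(msg), msg))
  .transform(rdd => rdd.join(staticRdd))
  .filter { case (key, (msg, ref)) => myPredicate(msg, ref) }
```

transform() drops you to RDD level per micro-batch, so the (cheap, already-cached) batch RDD is joined rather than reloaded on every batch interval.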
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, June 5, 2015 3:27 PM
To: 'Dmitry Goldenberg'
Cc: 'Yiannis Gkoufas'; '
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like
share data
between Jobs, while an RDD is ALWAYS visible within Jobs using the same Spark
Context
From: Charles Earl [mailto:charles.ce...@gmail.com]
Sent: Friday, June 5, 2015 12:10 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re
€ρ@Ҝ (๏̯͡๏)
Cc: Evo Eftimov; user
Subject: Re: How to increase the number of tasks
just multiply 2-4 by the CPU core number of the node.
2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
I did not change spark.default.parallelism,
What is recommended value for it.
On Fri, Jun 5, 2015 at 3
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark
Batch Jobs (besides, anyone can put something like that together in 5 min), while
I am under the impression that Dmitry is working on a Spark Streaming app
Besides, the Job Server is essentially for sharing the Spark Context betwe
It may be that your system runs out of resources (ie 174 is the ceiling) due to
the following
1. RDD Partition = (Spark) Task
2. RDD Partition != (Spark) Executor
3. (Spark) Task != (Spark) Executor
4. (Spark) Task = JVM Thread
5. (Spark) Executor = JVM insta
more
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Wednesday, June 3, 2015 4:46 PM
To: Evo Eftimov
Cc: Cody Koeninger; Andrew Or; Gerard Maas; spark users
Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of
growth in Kafka or Spark's metrics?
)
Then just shutdown your currently running spark streaming job/app and restart
it with new params to take advantage of the larger cluster
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Wednesday, June 3, 2015 4:14 PM
To: Cody Koeninger
Cc: Andrew Or; Evo Eftimov; Gerard
Dmitry was concerned about the “serialization cost” NOT the “memory footprint” –
hence option a) is still viable since a Broadcast is performed only ONCE for
the lifetime of the Driver instance
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, June 3, 2015 2:44 PM
To: Evo Eftimov
Cc
Hmmm, Spark Streaming app code doesn't execute in the linear fashion
assumed in your previous code snippet - to achieve your objectives you
should do something like the following
in terms of your second objective - saving the initialization and
serialization of the params you can:
a) broadcast
I don’t think the number of CPU cores controls the “number of parallel tasks”.
The number of Tasks corresponds first and foremost to the number of (Dstream)
RDD Partitions
The Spark documentation doesn’t mention what is meant by “Task” in terms of
Standard Multithreading Terminology ie a T
free RAM ) and
taking a performance hit from that, BUT only until there is no free RAM
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Thursday, May 28, 2015 2:34 PM
To: Evo Eftimov
Cc: Gerard Maas; spark users
Subject: Re: FW: Re: Autoscaling Spark cluster based on topic
Original message
From: Evo Eftimov
Date:2015/05/28 13:22 (GMT+00:00)
To: Dmitry Goldenberg
Cc: Gerard Maas ,spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in
Kafka or Spark's metrics?
You can always spin new boxes i
shut down your job
gracefully. Besides, managing the offsets explicitly is not a big deal if
necessary
Sent from Samsung Mobile
Original message From: Dmitry Goldenberg
Date:2015/05/28 13:16 (GMT+00:00)
To: Evo Eftimov Cc: Gerard Maas
,spark users Subject:
Re: Autoscaling
@DG; The key metrics should be
- Scheduling delay – its ideal state is to remain constant over time
and ideally be less than the time of the microbatch window
- The average job processing time should remain less than the
micro-batch window
- Number of Lost Jobs –
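The scheduling-delay metric above can be watched from the driver with a StreamingListener. A minimal sketch (the 3000 ms threshold is illustrative and should match your micro-batch window; `ssc` is your StreamingContext):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Warn whenever a completed batch's scheduling delay exceeds the batch window.
class DelayListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    for (delayMs <- batch.batchInfo.schedulingDelay) {   // schedulingDelay is an Option[Long]
      if (delayMs > 3000)
        println(s"scheduling delay ${delayMs} ms exceeds the micro-batch window")
    }
  }
}

ssc.addStreamingListener(new DelayListener)
```

A steadily growing scheduling delay means the app cannot keep up with the ingestion rate; a flat one near zero is the ideal state described above.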
message From: Arush Kharbanda
Date:2015/05/26 10:55 (GMT+00:00)
To: canan chen Cc: Evo Eftimov
,user@spark.apache.org Subject: Re: How does
spark manage the memory of executor with multiple tasks
Hi Evo,
Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you
9:30 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: How does spark manage the memory of executor with multiple tasks
Yes, I know that one task represents a JVM thread. This is what confused me.
Usually users want to specify the memory at task level, so how can I do it if
task if
An Executor is a JVM instance spawned and running on a Cluster Node (Server
machine). Task is essentially a JVM Thread – you can have as many Threads as
you want per JVM. You will also hear about “Executor Slots” – these are
essentially the CPU Cores available on the machine and granted for use
A receiver occupies a cpu core, an executor is simply a jvm instance and as
such it can be granted any number of cores and ram
So check how many cores you have per executor
Sent from Samsung Mobile
Original message From: Mike Trienis
Date:2015/05/22 21:51 (GMT+00:00)
To:
performance in the
name of the reliability/integrity of your system ie not losing messages)
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015 9:39 PM
To: 'Tathagata Das'; 'Gautam Bajaj'
Cc: 'user'
Subject: RE: Storing spark processed out
If the message consumption rate is higher than the time required to process ALL
data for a micro batch (ie the next RDD produced for your stream), the
following happens – let's say that e.g. your micro batch time is 3 sec:
1. Based on your message streaming and consumption rate, you ge
OR you can run Drools in a Central Server Mode ie as a common/shared service,
but that would slow down your Spark Streaming job due to the remote network call
which will have to be generated for every single message
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015
The only “tricky” bit would be when you want to manage/update the Rule Base in
your Drools Engines already running as Singletons in Executor JVMs on Worker
Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already
loaded in Drools is not a problem.
From: Evo Eftimov
From: Antonio Giambanco [mailto:antogia...@gmail.com]
Sent: Friday, May 22, 2015 11:07 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Spark Streaming and Drools
Thanks a lot Evo,
do you know where I can find some examples?
Have a great one
A G
2015-05-22 12:00 GMT+02:00
You can deploy and invoke Drools as a Singleton on every Spark Worker Node /
Executor / Worker JVM
You can invoke it from e.g. map, filter etc and use the result from the Rule to
make decision how to transform/filter an event/message
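The singleton-per-executor-JVM idea above can be sketched as follows. Heavy hedging applies: the KIE API calls are from memory of Drools 6.x and should be checked against your Drools version, the session name "myRulesSession" is hypothetical, and `passedRules` is a hypothetical helper that reads the outcome your rules produce:

```scala
// A per-JVM Drools singleton: `lazy val` means each executor JVM initializes
// its own session the first time a task on that executor touches it.
object RuleEngine {
  import org.kie.api.KieServices
  lazy val session = KieServices.Factory.get()
    .getKieClasspathContainer
    .newKieSession("myRulesSession")   // hypothetical kmodule session name
}

// Invoke the rules from an ordinary transformation, e.g. a filter:
val decided = dstream.filter { event =>
  RuleEngine.session.synchronized {    // a KieSession is not thread-safe
    RuleEngine.session.insert(event)
    RuleEngine.session.fireAllRules()
  }
  passedRules(event)                    // hypothetical: read the rule verdict
}
```

Because the object lives in the executor JVM, no remote call is made per message; the trade-off, as noted above, is that updating the rule base inside these already-running singletons needs its own mechanism.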
From: Antonio Giambanco [mailto:antogia...@gmail.com]
Check whether the name can be resolved in the /etc/hosts file (or DNS) of the
worker
(the same btw applies for the Node where you run the driver app – all other
nodes must be able to resolve its name)
From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Wednesday, May 20, 2015 10:07 AM
Is that a Spark or Spark Streaming application
Re the map transformation which is required you can also try flatMap
Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN –
giving 60GB RAM to a single JVM will certainly result in “off the charts” GC. I
would sugges
er latency
Ps: ultimately though remember that none of this stuff is part of spark
streaming as of yet
Sent from Samsung Mobile
Original message From: Akhil Das
Date:2015/05/18 16:56 (GMT+00:00)
To: Evo Eftimov Cc:
user@spark.apache.org Subject: RE: Spark Streaming an
communication and facts
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 18, 2015 2:28 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; user@spark.apache.org
Subject: Re: Spark Streaming and reducing latency
we = Sigmoid
back-pressuring mechanism = Stopping the receiver from
, as Evo says
Spark Streaming DOES crash in an “unceremonious way” when the free RAM available
for In-Memory Cached RDDs gets exhausted
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 18, 2015 2:03 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; user@spark.apache.org
Subject: Re
-4-1410542878200 not found
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, May 18, 2015 12:13 PM
To: 'Dmitry Goldenberg'; 'Akhil Das'
Cc: 'user@spark.apache.org'
Subject: RE: Spark Streaming and reducing latency
You can use spark.streaming.receiver.maxRate (default: not set) – the maximum
rate (number of records per second) at which each receiver will receive
data. Effectively, each stream will consume at most this number of records per
second. Setting this configuration to 0 or a negative number will put no limit
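Setting the property looks like this (the 10000 records/sec cap and the app name are illustrative):

```scala
import org.apache.spark.SparkConf

// Cap each receiver at 10,000 records/sec to keep micro-batches processable
// within the batch window.
val conf = new SparkConf()
  .setAppName("throttled-stream")
  .set("spark.streaming.receiver.maxRate", "10000")
```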
This is the nature of Spark Streaming as a System Architecture:
1. It is a batch processing system architecture (Spark Batch) optimized
for Streaming Data
2. In terms of sources of Latency in such System Architecture, bear in
mind that besides “batching”, there is also the Centra
You can make ANY standard receiver sleep by implementing a custom Message
Deserializer class with sleep method inside it.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Sunday, May 17, 2015 4:29 PM
To: Haopu Wang
Cc: user
Subject: Re: [SparkStreaming] Is it possible to delay the s
Ok thanks a lot for clarifying that – btw was your application a Spark
Streaming App – I am also looking for confirmation that FAIR scheduling is
supported for Spark Streaming Apps
From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Friday, May 15, 2015 7:20 PM
To: Evo Eftimov
me
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Friday, May 15, 2015 6:45 PM
To: Evo Eftimov
Cc: user
Subject: Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond
How are you configuring the fair scheduler pools?
On Fri, May 15, 2015 at 8:33 AM, Evo Eftimov wrote
I have run / submitted a few Spark Streaming apps configured with Fair
scheduling on Spark Streaming 1.2.0, however they still run in a FIFO mode.
Is FAIR scheduling supported at all for Spark Streaming apps and from what
release / version - e.g. 1.3.1
Where is the “Tuple” supposed to be in - you can refer to a
“Tuple” if it was e.g. >
From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of
Holden Karau
Sent: Thursday, May 14, 2015 5:56 PM
To: Yasemin Kaya
Cc: user@spark.apache.org
Subject: Re: swap tuple
Can you pas
inherent to the “commercial”
vendors, but I can confirm as fact it is also in effect in the “open source
movement” (because human nature remains the same)
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:30 PM
To: Paolo Platter
Cc: Evo Eftimov; Matei Zaharia
That has been a really rapid “evaluation” of the “work” and its “direction”
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:12 PM
To: Matei Zaharia
Cc: user@spark.apache.org
Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming
Than
I can confirm it does work in Java
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Tuesday, May 12, 2015 5:53 PM
To: Evo Eftimov
Cc: Saisai Shao; user@spark.apache.org
Subject: Re: DStream Union vs. StreamingContext Union
Thanks Evo. I tried chaining Dstream unions like
You can also union multiple DStream RDDs in this way
DStreamRDD1.union(DStreamRDD2).union(DStreamRDD3) etc etc
Ps: the API is not “redundant” – it offers several ways of achieving the same
thing as a convenience depending on the situation
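Both forms mentioned in this thread can be sketched side by side (`ssc` is the StreamingContext, and the three input DStreams are assumed to carry the same element type):

```scala
// Chained pairwise unions:
val merged  = dstream1.union(dstream2).union(dstream3)

// Equivalent convenience form on the StreamingContext for many streams at once:
val merged2 = ssc.union(Seq(dstream1, dstream2, dstream3))
```

The StreamingContext form is handier when the number of streams is dynamic, e.g. one receiver per Kafka topic built in a loop.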
From: Vadim Bichutskiy [mailto:vadim.bichuts...@
LOCAL_ONE.
* spark.cassandra.output.throughput_mb_per_sec: maximum write throughput
allowed per single core in MB/s; on long (8+ hour) runs, limit this to 70% of
your max throughput as seen on a smaller job, for stability
From: Sergio Jiménez Barrio [mailto:drarse.a...@gmail.com]
Sent: Sunday, May 10, 2015 12:59 PM
To: E
from Samsung Mobile
Original message From: Evo Eftimov
Date:2015/05/10 12:02 (GMT+00:00)
To: 'Gerard Maas' Cc: 'Sergio
Jiménez Barrio' ,'spark users'
Subject: RE: Spark streaming closes with Cassandra Conector
Hmm there is also a Connection
and distribution profile which may
skip a potential error in the Connection Pool code
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Sunday, May 10, 2015 11:56 AM
To: Evo Eftimov
Cc: Sergio Jiménez Barrio; spark users
Subject: Re: Spark streaming closes with Cassandra Conector
I
I think the message that it has written 2 rows is misleading
If you look further down you will see that it could not initialize a connection
pool for Cassandra (presumably while trying to write the previously mentioned 2
rows)
Another confirmation of this hypothesis is the phrase “error d
You can implement a custom partitioner
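A custom partitioner is a small class extending org.apache.spark.Partitioner. A minimal sketch (the modular-hash rule and the partition count 8 are illustrative; replace getPartition with whatever routing rule your analysis needs):

```scala
import org.apache.spark.Partitioner

// Route keys to partitions by a custom rule; getPartition must return
// a value in [0, numPartitions), hence the double-modulo to avoid
// negative hashCodes.
class MyPartitioner(val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
}

// pairRdd is an RDD[(K, V)], e.g. (clientIp, logLine) pairs from the access log
val routed = pairRdd.partitionBy(new MyPartitioner(8))
</imports>
```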
-Original Message-
From: skippi [mailto:skip...@gmx.de]
Sent: Sunday, May 10, 2015 10:19 AM
To: user@spark.apache.org
Subject: spark streaming and computation
Assuming a web server access log shall be analyzed and target of computation
shall be csv
: Bill Q [mailto:bill.q@gmail.com]
Sent: Thursday, May 7, 2015 6:27 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD
The multi-threading code in Scala is quite simple and you can google it pretty
easily. We used the Future framework. You can use Akka also
: Bill Q [mailto:bill.q@gmail.com]
Sent: Thursday, May 7, 2015 4:55 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD
Thanks for the replies. We decided to use concurrency in Scala to do the two
mappings using the same source RDD in parallel. So far, it
This is about Kafka Receiver IF you are using Spark Streaming
Ps: that book is now behind the curve in quite a few areas since the release
of 1.3.1 – read the documentation and forums
From: James King [mailto:jakwebin...@gmail.com]
Sent: Wednesday, May 6, 2015 1:09 PM
To: user
Subjec
To: Evo Eftimov
Cc: anshu shukla; ayan guha; user@spark.apache.org
Subject: Re: Creating topology in spark streaming
Hi,
I agree with Evo, Spark works at a different abstraction level than Storm, and
there is not a direct translation from Storm topologies to Spark Streaming
jobs. I think
val RDD1 = RDD.filter(predicate1)
val RDD2 = RDD.filter(predicate2)
From: Bill Q [mailto:bill.q@gmail.com]
Sent: Tuesday, May 5, 2015 10:42 PM
To: user@spark.apache.org
Subject: Map one RDD into two RDD
Hi all,
I have a large RDD that I map a function to it. Based on the nature of each
record in the input RDD,
What is called a Bolt in Storm is essentially a combination of
[Transformation/Action and DStream RDD] in Spark – so to achieve higher
parallelism for a specific Transformation/Action on a specific DStream RDD, simply
repartition it to the required number of partitions, which directly relates to
the
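The repartition step described above is a one-liner on the DStream (the partition count 32 and `expensiveTransform` are illustrative):

```scala
// Raise parallelism for a specific transformation by repartitioning first:
// the map below now runs as 32 parallel tasks per micro-batch.
val result = dstream
  .repartition(32)
  .map(expensiveTransform)
```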
a) copy the required file to a temp location and
then b) move it from there to the dir monitored by spark filestream - this
will ensure it is with a recent timestamp
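As a shell sketch of the copy-then-move recipe (paths are illustrative; mktemp is used here only so the sketch is runnable anywhere, and with HDFS you would use hdfs dfs -put / -mv instead):

```shell
BASE=$(mktemp -d)
TMP="$BASE/tmp"; WATCHED="$BASE/watched"   # WATCHED stands in for the dir fileStream monitors
mkdir -p "$TMP" "$WATCHED"
echo "some,record" > "$BASE/source.csv"

cp "$BASE/source.csv" "$TMP/input.csv"     # the copy gets a fresh mtime, outside the watched dir
mv "$TMP/input.csv" "$WATCHED/input.csv"   # atomic rename into the watched dir, mtime stays recent
```

The move is atomic within a filesystem, so the streaming job never sees a half-written file, and the timestamp is recent because the copy (not the original file) is what lands in the watched dir.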
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Saturday, May 2, 2015 5:09 PM
To: user
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them, but that leads to a well known problem with .tmp files which
get removed and make spark streaming filestream throw an exception
You can resort to Serialized storage (still in memory) of your RDDs - this
will obviate the need for GC since the RDD elements are stored as serialized
objects off the JVM heap (most likely in Tachyon, the distributed in-memory
file system used by Spark internally)
Also review the Object O
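The serialized storage levels are selected via persist() (a sketch; which level is appropriate depends on your memory budget):

```scala
import org.apache.spark.storage.StorageLevel

// Serialized in-memory storage: more CPU to deserialize on access,
// but far fewer live objects for the GC to trace.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Variant that spills serialized partitions to disk when memory is short:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Spark 1.x also offered OFF_HEAP, which stored blocks in Tachyon:
// rdd.persist(StorageLevel.OFF_HEAP)
```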
# of tasks = # of partitions, hence you can provide the desired number of
partitions to the textFile API which should result a) in a better spatial
distribution of the RDD b) each partition will be operated upon by a separate
task
You can provide the number of p
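Passing the partition count to textFile looks like this (the path and the count 64 are illustrative; the argument is a minimum, so the actual count can be higher):

```scala
// Ask for at least 64 partitions up front: better spatial distribution,
// and each partition is operated upon by its own task.
val lines = sc.textFile("hdfs:///data/input", 64)
println(lines.partitions.length)   // >= 64
```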
-Original Message-
Fro
d" - every function (provided it is not Action) applied to an
RDD within foreach is distributed across the cluster since it gets applied
to an RDD
From: davidkl [via Apache Spark User List]
[mailto:ml-node+s1001560n22630...@n3.nabble.com]
Sent: Thursday, April 23, 2015 10:13 AM
To: Evo Ef
Is the only way to implement custom partitioning of a DStream via the foreach
approach, so as to gain access to the actual RDDs comprising the DStream and
hence their partitionBy method?
DStream has only a "repartition" method accepting only the number of
partitions, BUT not the method of partitioning
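Besides foreach, transform() also exposes the underlying RDDs and so admits a real Partitioner, not just a partition count. A sketch (HashPartitioner and the count 16 are illustrative; any custom Partitioner works in its place):

```scala
import org.apache.spark.HashPartitioner

// Drop to RDD level per micro-batch and choose the partitioning *method*:
val custom = pairDStream.transform(rdd => rdd.partitionBy(new HashPartitioner(16)))
```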
detailed description / spec of both
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user
Subject: Re: Super slow caching in 1.3?
Here are the types that we specialize, other types will be much slower. This
is
data more evenly you can partition it explicitly
Also contact Databricks about why the Receivers are not being distributed on
different cluster nodes
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:57 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal
And what is the message rate of each topic mate – that was the other part of
the required clarifications
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:38 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal number of RDD Blocks
Hi,
I have two
What is meant by “streams” here:
1. Two different DSTream Receivers producing two different DSTreams
consuming from two different kafka topics, each with different message rate
2. One kafka topic (hence only one message rate to consider) but with two
different DStream receivers
Check whether your partitioning results in balanced partitions ie partitions
with similar sizes - one of the reasons for the performance differences
observed by you may be that after your explicit repartitioning, the partition
on your master node is much smaller than the RDD partitions on the ot
In fact you can return “null” from your initial map and hence not resort to
Optional at all
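The null-then-filter pattern is a two-step pipeline (`parseOrNull` is a hypothetical parser that returns null on failure):

```scala
// Emit null for unparsable records, then drop the nulls before downstream use.
val parsed = rdd.map(parseOrNull).filter(_ != null)
```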
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Sunday, April 19, 2015 9:48 PM
To: 'Steve Lewis'
Cc: 'Olivier Girardot'; 'user@spark.apache.org'
Subject: RE:
Spark exception THEN
as far as I am concerned, chess-mate
From: Steve Lewis [mailto:lordjoe2...@gmail.com]
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null
So you imagine something like this
I am on the move at the moment so I can't try it immediately, but from previous
memory / experience I think if you return plain null you will get a spark
exception
Anyway you can try it and see what happens and then ask the question
If you do get an exception, try Optional instead of plain null
Se
other apps would not be very appropriate in production because the two
resource managers will be competing for cluster resources - but you can use
this for performance tests
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 6:28 PM
To: 'Manish Gupta 8';
bottom line / the big picture – in some models, friction
can be a huge factor in the equations; in some others it is just part of the
landscape
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Friday, April 17, 2015 10:12 AM
To: Evo Eftimov
Cc: Tathagata Das; Jianshi Huang; user; Shao
the context of
graphx
From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Thursday, April 16, 2015 10:31 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
i believe it is a generalization of some classes inside graphx, where
Yes, simply look for partitionBy in the javadoc for e.g. JavaPairRDD
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Thursday, April 16, 2015 9:57 PM
To: Evo Eftimov
Cc: Wang, Ningjun (LNG-NPV); user
Subject: Re: How to join RDD keyValuePairs efficiently
Does this same
Can somebody from Databricks shed more light on this Indexed RDD library
https://github.com/amplab/spark-indexedrdd
It seems to come from AMP Labs and most of the Databricks guys are from
there
What is especially interesting is whether the Point Lookup (and the other
primitives) can work fr
ine with 16GB and 4 cores).
I found something called IndexedRDD on the web
https://github.com/amplab/spark-indexedrdd
Has anybody use it?
Ningjun
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 12:18 PM
To: 'Sean Owen'; Wang, Nin
no limitations to the level of hierarchy in the Object Oriented Model
of the RDD elements (limitations in terms of performance impact/degradation) –
right?
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user