RE: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Evo Eftimov
Check whether the name can be resolved in the /etc/hosts file (or DNS) of the worker (the same btw applies for the Node where you run the driver app – all other nodes must be able to resolve its name) From: Stephen Boesch [mailto:java...@gmail.com] Sent: Wednesday, May 20, 2015 10:07 AM

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
You can deploy and invoke Drools as a Singleton on every Spark Worker Node / Executor / Worker JVM. You can invoke it from e.g. map, filter etc and use the result from the Rule to decide how to transform/filter an event/message. From: Antonio Giambanco [mailto:antogia...@gmail.com]
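A minimal sketch of that singleton-per-Executor pattern, assuming eventsDStream is a DStream[Event]; RuleEngine and evaluate() are hypothetical stand-ins for the actual Drools session wrapper, which the thread does not show:

  // hypothetical stand-ins for the Drools session and the stream element type
  case class Event(payload: String)
  class RuleEngine { def evaluate(e: Event): Boolean = e.payload.nonEmpty }

  // a Scala object is initialised per classloader, i.e. once per Executor JVM,
  // and is NOT shipped with the closure - which gives you the Singleton
  object RuleEngineSingleton {
    lazy val engine = new RuleEngine()
  }

  // invoke the per-JVM engine from map/filter and use the result to transform/filter
  val decisions = eventsDStream.map(event => RuleEngineSingleton.engine.evaluate(event))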

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
From: Antonio Giambanco [mailto:antogia...@gmail.com] Sent: Friday, May 22, 2015 11:07 AM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Spark Streaming and Drools Thanks a lot Evo, do you know where I can find some examples? Have a great one A G 2015-05-22 12:00 GMT+02:00

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
The only “tricky” bit would be when you want to manage/update the Rule Base in your Drools Engines already running as Singletons in Executor JVMs on Worker Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already loaded in Drools is not a problem. From: Evo Eftimov

RE: Spark Streaming and Drools

2015-05-22 Thread Evo Eftimov
OR you can run Drools in a Central Server Mode ie as a common/shared service, but that would slowdown your Spark Streaming job due to the remote network call which will have to be generated for every single message From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, May 22, 2015

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
If the message consumption rate is higher than the rate at which ALL data for a micro batch (ie the next RDD produced for your stream) can be processed, the following happens – let's say that e.g. your micro batch time is 3 sec: 1. Based on your message streaming and consumption rate, you ge

RE: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Evo Eftimov
performance in the name of the reliability/integrity of your system ie not losing messages) From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, May 22, 2015 9:39 PM To: 'Tathagata Das'; 'Gautam Bajaj' Cc: 'user' Subject: RE: Storing spark processed out

Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Evo Eftimov
A receiver occupies a cpu core; an executor is simply a jvm instance and as such it can be granted any number of cores and ram. So check how many cores you have per executor. Sent from Samsung Mobile Original message From: Mike Trienis Date:2015/05/22 21:51 (GMT+00:00) To:

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
An Executor is a JVM instance spawned and running on a Cluster Node (Server machine). A Task is essentially a JVM Thread – you can have as many Threads as you want per JVM. You will also hear about “Executor Slots” – these are essentially the CPU Cores available on the machine and granted for use

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
9:30 AM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: How does spark manage the memory of executor with multiple tasks Yes, I know that one task represent a JVM thread. This is what I confused. Usually users want to specify the memory on task level, so how can I do it if task if

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
message From: Arush Kharbanda Date:2015/05/26 10:55 (GMT+00:00) To: canan chen Cc: Evo Eftimov ,user@spark.apache.org Subject: Re: How does spark manage the memory of executor with multiple tasks Hi Evo, Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you

RE: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
@DG; The key metrics should be - Scheduling delay – its ideal state is to remain constant over time and ideally be less than the time of the microbatch window - The average job processing time should remain less than the micro-batch window - Number of Lost Jobs –

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
shut down your job gracefully. Besides, managing the offsets explicitly is not a big deal if necessary Sent from Samsung Mobile Original message From: Dmitry Goldenberg Date:2015/05/28 13:16 (GMT+00:00) To: Evo Eftimov Cc: Gerard Maas ,spark users Subject: Re: Autoscaling

FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
Original message From: Evo Eftimov Date:2015/05/28 13:22 (GMT+00:00) To: Dmitry Goldenberg Cc: Gerard Maas ,spark users Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? You can always spin new boxes i

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Evo Eftimov
free RAM ) and taking a performance hit from that, BUT only until there is no free RAM From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Thursday, May 28, 2015 2:34 PM To: Evo Eftimov Cc: Gerard Maas; spark users Subject: Re: FW: Re: Autoscaling Spark cluster based on topic

RE: Value for SPARK_EXECUTOR_CORES

2015-05-28 Thread Evo Eftimov
I don’t think the number of CPU cores controls the “number of parallel tasks”. The number of Tasks corresponds first and foremost to the number of (Dstream) RDD Partitions The Spark documentation doesn’t mention what is meant by “Task” in terms of Standard Multithreading Terminology ie a T

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Hmmm, Spark Streaming app code doesn't execute in the linear fashion assumed in your previous code snippet - to achieve your objectives you should do something like the following; in terms of your second objective - saving the initialization and serialization of the params - you can: a) broadcast
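A minimal sketch of option a), assuming ssc is the StreamingContext, stream is a DStream[String], and process() is a placeholder for the real per-record logic:

  // build the expensive-to-initialise params ONCE on the driver and broadcast them
  val params = Map("dictPath" -> "/some/path", "threshold" -> "0.8")
  val paramsBc = ssc.sparkContext.broadcast(params)

  def process(record: String, p: Map[String, String]): Unit = ()   // placeholder

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val p = paramsBc.value          // fetched/deserialised per Executor, not per record
      records.foreach(r => process(r, p))
    }
  }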

RE: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Evo Eftimov
Dmitry was concerned about the “serialization cost” NOT the “memory footprint – hence option a) is still viable since a Broadcast is performed only ONCE for the lifetime of Driver instance From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, June 3, 2015 2:44 PM To: Evo Eftimov Cc

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
) Then just shutdown your currently running spark streaming job/app and restart it with new params to take advantage of the larger cluster From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Wednesday, June 3, 2015 4:14 PM To: Cody Koeninger Cc: Andrew Or; Evo Eftimov; Gerard

RE: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Evo Eftimov
more From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Wednesday, June 3, 2015 4:46 PM To: Evo Eftimov Cc: Cody Koeninger; Andrew Or; Gerard Maas; spark users Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
It may be that your system runs out of resources (ie 174 is the ceiling) due to the following 1. RDD Partition = (Spark) Task 2. RDD Partition != (Spark) Executor 3. (Spark) Task != (Spark) Executor 4. (Spark) Task = JVM Thread 5. (Spark) Executor = JVM insta
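In other words the task count per stage is driven by the partition count, so if 174 is a ceiling you can raise the number of partitions explicitly (a sketch; the numbers and path are illustrative):

  val raw = sc.textFile("hdfs:///data/input", 400)   // ask for ~400 partitions at load time
  println(raw.partitions.size)                       // = number of tasks for the next stage
  val widened  = raw.repartition(800)                // full shuffle up to 800 partitions/tasks
  val narrowed = widened.coalesce(200)               // shrink again without a full shuffle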

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark Batch Jobs (besides, anyone can put something like that together in 5 min), while I am under the impression that Dmitry is working on a Spark Streaming app. Besides, the Job Server is essentially for sharing the Spark Context betwe

RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
€ρ@Ҝ (๏̯͡๏) Cc: Evo Eftimov; user Subject: Re: How to increase the number of tasks just multiply 2-4 with the cpu core number of the node . 2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) : I did not change spark.default.parallelism, What is recommended value for it. On Fri, Jun 5, 2015 at 3

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
share data between Jobs, while an RDD is ALWAYS visible within Jobs using the same Spark Context From: Charles Earl [mailto:charles.ce...@gmail.com] Sent: Friday, June 5, 2015 12:10 PM To: Evo Eftimov Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org Subject: Re

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Friday, June 5, 2015 3:15 PM To: Evo Eftimov Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org Subject: Re: How to share large resources like

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
once in your Spark Streaming App and then keep joining and then e.g. filtering every incoming DStream RDD with the (big static) Batch RDD From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Friday, June 5, 2015 3:27 PM To: 'Dmitry Goldenberg' Cc: 'Yiannis Gkoufas'; '
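The pattern being described, sketched out (assuming keyedDStream is a DStream of (key, value) pairs; the HDFS path and partition count are illustrative):

  import org.apache.spark.HashPartitioner

  // the big static Batch RDD: loaded, keyed and cached ONCE at app startup
  val staticRdd = sc.textFile("hdfs:///ref/dictionary")
    .map(line => (line.split(",")(0), line))
    .partitionBy(new HashPartitioner(100))
    .cache()

  // every incoming DStream RDD is joined (and can then be filtered) against it
  val enriched = keyedDStream.transform(rdd => rdd.join(staticRdd))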

RE: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread Evo Eftimov
The foreachPartition callback is provided with an Iterator by the Spark Framework – while iterator.hasNext() …… Also check whether this is not some sort of Python Spark API bug – Python seems to be the foster child here – Scala and Java are the darlings From: John Omernik [mailto:j...@omernik.co

RE: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Evo Eftimov
It depends on how big the Batch RDD requiring reloading is. Reloading it for EVERY single DStream RDD would slow down the stream processing in line with the total time required to reload the Batch RDD ….. But if the Batch RDD is not that big then that might not be an issue especially in t
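One hedged way to strike that balance is to reload the Batch RDD only every Nth micro-batch rather than for every DStream RDD (a sketch; the refresh interval, path and key function are illustrative, and keyedDStream is an assumed pair DStream):

  def loadRef() = sc.textFile("hdfs:///ref/data").map(l => (l.split(",")(0), l)).cache()

  var refRdd = loadRef()
  var batchCount = 0L

  // the function passed to transform runs on the driver once per micro-batch
  val joined = keyedDStream.transform { rdd =>
    batchCount += 1
    if (batchCount % 10 == 0) {        // refresh roughly every 10 micro-batches
      refRdd.unpersist()
      refRdd = loadRef()
    }
    rdd.join(refRdd)
  }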

Re: Determining number of executors within RDD

2015-06-10 Thread Evo Eftimov
Yes, I think it is ONE worker ONE executor, as an executor is nothing but a jvm instance spawned by the worker. To run more executors ie jvm instances on the same physical cluster node you need to run more than one worker on that node and then allocate only part of the sys resources to that worker/ex

Re: Determining number of executors within RDD

2015-06-10 Thread Evo Eftimov
need to go for what is essentially a bit of a hack ie running more than 1 worker per machine Sent from Samsung Mobile Original message From: Sandy Ryza Date:2015/06/10 21:31 (GMT+00:00) To: Evo Eftimov Cc: maxdml ,user@spark.apache.org Subject: Re: Determining number of

RE: Join between DStream and Periodically-Changing-RDD

2015-06-15 Thread Evo Eftimov
the app And at the same time you join the DStream RDDs of your actual Streaming Data with the above continuously updated DStream RDD representing your HDFS file From: Ilove Data [mailto:data4...@gmail.com] Sent: Monday, June 15, 2015 5:19 AM To: Tathagata Das Cc: Evo Eftimov; Akhil Das

RE: Join between DStream and Periodically-Changing-RDD

2015-06-15 Thread Evo Eftimov
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages (outside spark streaming)” – what I meant by that was “turn the Updates to your HDFS dataset into Messages” and send them as such to spark streaming From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Monday, June

RE: stop streaming context of job failure

2015-06-16 Thread Evo Eftimov
https://spark.apache.org/docs/latest/monitoring.html also subscribe to various Listeners for various Metrics Types e.g. Job Stats/Statuses - this will allow you (in the driver) to decide when to stop the context gracefully (the listening and stopping can be done from a completely separa
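A minimal sketch of that listener-plus-graceful-stop wiring (ssc is the StreamingContext; the failure condition is purely illustrative):

  import org.apache.spark.streaming.scheduler._
  import java.util.concurrent.atomic.AtomicBoolean

  val failed = new AtomicBoolean(false)

  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      // illustrative condition: flag any batch that took longer than 30s to process
      if (batch.batchInfo.processingDelay.exists(_ > 30000)) failed.set(true)
    }
  })

  ssc.start()
  while (!failed.get()) Thread.sleep(5000)   // the monitoring can live in a separate thread
  ssc.stop(stopSparkContext = true, stopGracefully = true)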

RE: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Evo Eftimov
Best is by measuring and recording how the Performance of your solution scales as the Workload scales - recording as in "Data Points recording" - and then you can do some time series stat analysis and visualizations. For example you can start with a single box with e.g. 8 CPU cores Use e.g. 1 or

RE: Spark or Storm

2015-06-17 Thread Evo Eftimov
The only thing which doesn't make much sense in Spark Streaming (and I am not saying it is done better in Storm) is the iterative and "redundant" shipping of essentially the same tasks (closures/lambdas/functions) to the cluster nodes AND re-launching them there again and again. This is a l

RE: Machine Learning on GraphX

2015-06-18 Thread Evo Eftimov
What is GraphX: - It can be viewed as a kind of Distributed, Parallel, Graph Database - It can be viewed as Graph Data Structure (Data Structures 101 from your CS course) - It features some off-the-shelf algos for Graph Processing and Navigation (Algos and Data

Spark Streaming 1.3.0 ERROR LiveListenerBus

2015-06-19 Thread Evo Eftimov
Spark Streaming 1.3.0 on YARN during Job Execution keeps generating the following error while the application is running: ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException etc etc Caused by: java.io.IOException: Filesystem closed

RE: Web UI vs History Server Bugs

2015-06-23 Thread Evo Eftimov
Probably your application has crashed or was terminated without invoking the stop method of spark context - in such cases it doesn't create the empty flag file which apparently tells the history server that it can safely show the log data - simply go to some of the other dirs of the history server t

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
There is no direct one to one mapping between Executor and Node. Executor is simply the spark framework term for a JVM instance with some spark framework system code running in it. A node is a physical server machine. You can have more than one JVM per node and vice versa you can hav

RE: Spark Streaming: limit number of nodes

2015-06-24 Thread Evo Eftimov
that limits the number of cores per Executor rather than the total cores for the job and hence will probably not yield the effect you need From: Wojciech Pituła [mailto:w.pit...@gmail.com] Sent: Wednesday, June 24, 2015 10:49 AM To: Evo Eftimov; user@spark.apache.org Subject: Re: Spark
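For reference, the two settings being contrasted (values illustrative; spark.cores.max applies to standalone/Mesos deployments):

  val conf = new org.apache.spark.SparkConf()
    .setAppName("core-limits-example")
    .set("spark.cores.max", "8")        // TOTAL cores the application may take across the cluster
    .set("spark.executor.cores", "2")   // cores granted to EACH Executor JVM, not a job-wide cap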

R "on spark"

2015-06-27 Thread Evo Eftimov
I had a look at the new R "on Spark" API / Feature in Spark 1.4.0. For those "skilled in the art" (of R and distributed computing) it will be immediately clear that "ON" is a marketing ploy and what it actually is, is "TO" ie Spark 1.4.0 offers an INTERFACE from R TO DATA stored in Spark in distributed

RE:

2015-07-07 Thread Evo Eftimov
The “RDD” aka Batch RDD which you load from file, will be kept for as long as the Spark Framework is instantiated / running – you can also ensure it is flagged explicitly as Persisted e.g. In Memory and/or disk From: Anand Nalya [mailto:anand.na...@gmail.com] Sent: Tuesday, July 7, 2015 12:3

RE:

2015-07-07 Thread Evo Eftimov
spark.streaming.unpersist = false // in order for SStreaming to not drop the raw RDD data spark.cleaner.ttl = why is the above suggested provided the persist/cache operation on the constantly unioned Batch RDD will have to be invoked anyway (after every union with DStream RDD), besides

RE:

2015-07-07 Thread Evo Eftimov
ext... dstream.foreachRDD{ rdd => myRDD = myRDD.union(rdd.filter(myfilter)).cache() } From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Tuesday, July 7, 2015 1:55 PM To: Evo Eftimov Cc: Anand Nalya; spark users Subject: Re: Evo, I'd let the OP clarify the qu
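Spelled out a little more fully (a sketch; dstream is as in the snippet above, myfilter is a placeholder predicate, and the old RDD is unpersisted so cached copies don't pile up):

  def myfilter(s: String): Boolean = s.nonEmpty            // placeholder predicate

  var myRDD: org.apache.spark.rdd.RDD[String] = sc.emptyRDD[String]

  dstream.foreachRDD { rdd =>
    val updated = myRDD.union(rdd.filter(myfilter)).cache()
    updated.count()            // materialise the new union before dropping the old one
    myRDD.unpersist()
    myRDD = updated
  }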

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
This is most likely due to the internal implementation of ALS in MLlib. Probably for each parallel unit of execution (partition in Spark terms) the implementation allocates and uses a RAM buffer where it keeps interim results during the ALS iterations. If we assume that the size of that intern

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
them will be in a suspended mode waiting for a free core (thread contexts also occupy additional RAM) From: Aniruddh Sharma [mailto:asharma...@gmail.com] Sent: Wednesday, July 8, 2015 12:52 PM To: Evo Eftimov Subject: Re: Out of Memory Errors on less number of cores in proportion to Partitions

RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Also try to increase the number of partitions gradually – not in one big jump from 20 to 100 but adding e.g. 10 at a time and see whether there is a correlation with adding more RAM to the executors From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Wednesday, July 8, 2015 1:26 PM To

RE: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Evo Eftimov
foreachPartition results in having one instance of the lambda/closure per partition when e.g. publishing to output systems like message brokers, databases and file systems - that increases the level of parallelism of your output processing -Original Message- From: dgoldenberg [mailto:dg
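A minimal sketch of that pattern, with a hypothetical SinkConnection standing in for a real database / message-broker client and dstream as an assumed DStream[String]:

  // hypothetical client - stands in for a JDBC connection, Kafka producer, etc.
  class SinkConnection { def send(s: String): Unit = println(s); def close(): Unit = () }

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val conn = new SinkConnection()     // one instance per partition, created on the worker
      records.foreach(r => conn.send(r))
      conn.close()
    }
  }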

RE: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Evo Eftimov
That was a) fuzzy b) insufficient – one can certainly use foreach (only) on DStream RDDs – it works, as an empirical observation. As another empirical observation: foreachPartition results in having one instance of the lambda/closure per partition when e.g. publishing to output systems like

adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
The only way to join / union /cogroup a DStream RDD with Batch RDD is via the "transform" method, which returns another DStream RDD and hence it gets discarded at the end of the micro-batch. Is there any way to e.g. union Dstream RDD with Batch RDD which produces a new Batch RDD containing the el

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
iginal Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 7:43 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD What do you mean by "batch RDD"? they're just RDDs, though store their d

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
h RDDs from file for e.g. a second time moreover after specific period of time -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:14 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
e not getting anywhere -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:30 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD What API differences are you talking about? a DStream gives

RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
There are indications that joins in Spark are implemented with / based on the cogroup function/primitive/transform. So let me focus first on cogroup - it returns a result which is an RDD consisting of essentially ALL elements of the cogrouped RDDs. Said another way - for every key in each of the co

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
change the total number of elements included in the result RDD and RAM allocated – right? From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 9:25 PM To: Evo Eftimov Cc: user Subject: Re: RAM management during cogroup and join Significant optimizations can be made

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
Thank you Sir, and one final confirmation/clarification - are all forms of joins in the Spark API for DStream RDDs based on cogroup in terms of their internal implementation From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 9:48 PM To: Evo Eftimov Cc: user

RE: RAM management during cogroup and join

2015-04-15 Thread Evo Eftimov
that DStreams are some sort of different type of RDDs From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 11:11 PM To: Evo Eftimov Cc: user Subject: Re: RAM management during cogroup and join Well, DStream joins are nothing but RDD joins at its core. However

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Also you can have each message type in a different topic (needs to be arranged upstream from your Spark Streaming app ie in the publishing systems and the messaging brokers) and then for each topic you can have a dedicated instance of InputReceiverDStream which will be the start of a dedicated D

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
And yet another way is to demultiplex at one point which will yield separate DStreams for each message type which you can then process in independent DAG pipelines in the following way: MessageType1DStream = MainDStream.filter(message type1) MessageType2DStream = MainDStream.filter(message t
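Continuing the snippet above in full Scala (MainDStream from the snippet is assumed here to be a DStream[Msg]; the predicates are illustrative):

  case class Msg(msgType: String, body: String)

  val type1Stream = mainDStream.filter(_.msgType == "type1")
  val type2Stream = mainDStream.filter(_.msgType == "type2")

  // each filtered DStream then feeds its own independent DAG pipeline
  type1Stream.map(_.body).print()
  type2Stream.map(_.body.length).print()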

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
which can not be done at the same time and has to be processed sequentially is a BAD thing So the key is whether it is about 1 or 2 and if it is about 1, whether it leads to e.g. Higher Throughput and Lower Latency or not Regards, Evo Eftimov From: Gerard Maas [mailto:gerard.m

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
How do you intend to "fetch the required data" - from within Spark or using an app / code / module outside Spark -Original Message- From: mas [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:08 PM To: user@spark.apache.org Subject: Data partitioning and node tracking in Spa

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
/framework your app code should not be bothered on which physical node exactly, a partition resides Regards Evo Eftimov From: MUHAMMAD AAMIR [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:20 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Data partitioning and node

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
, April 16, 2015 4:32 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Data partitioning and node tracking in Spark-GraphX Thanks a lot for the reply. Indeed it is useful but to be more precise i have 3D data and want to index it using octree. Thus i aim to build a two level indexing

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Ningjun, to speed up your current design you can do the following: 1. partition the large doc RDD based on the hash function on the key ie the docid 2. persist the large dataset in memory to be available for subsequent queries without reloading and repartitioning for every search query 3. parti
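Steps 1 and 2 in code form, plus the natural follow-on of co-partitioning the (much smaller) query RDD the same way so the join avoids shuffling the big side (a sketch; partition count, key extraction and the query RDD are illustrative):

  import org.apache.spark.HashPartitioner

  val docs = sc.textFile("hdfs:///docs")
    .map(line => (line.split("\t")(0), line))          // (docId, document)
    .partitionBy(new HashPartitioner(200))             // 1. hash-partition on the docid
    .cache()                                           // 2. keep it in memory across queries
  docs.count()                                         //    force materialisation once

  // co-partition the query RDD with the same partitioner before joining
  val queries = sc.parallelize(Seq(("doc42", "query1"))).partitionBy(new HashPartitioner(200))
  val hits = docs.join(queries)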

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
because all worker instances run in the memory of a single machine .. Regards, Evo Eftimov From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Thursday, April 16, 2015 6:03 PM To: user@spark.apache.org Subject: General configurations on CDH5 to achieve maximum Spark Performance Hi

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael what exactly do you mean by "flattened" version/structure here e.g.: 1. An Object with only primitive data types as attributes 2. An Object with no more than one level of other Objects as attributes 3. An Array/List of primitive types 4. An Array/List of Objects This question is in ge

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
-on-yarn.html From: Manish Gupta 8 [mailto:mgupt...@sapient.com] Sent: Thursday, April 16, 2015 6:21 PM To: Evo Eftimov; user@spark.apache.org Subject: RE: General configurations on CDH5 to achieve maximum Spark Performance Thanks Evo. Yes, my concern is only regarding the infrastructure

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
HDFS adapter and invoke it in forEachRDD and foreach Regards Evo Eftimov From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:33 PM To: user@spark.apache.org Subject: saveAsTextFile I am using Spark Streaming where during each micro-batch I

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nope Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed to work. Batches ar

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them wherever you want - Use foreachPartition and then foreach -Original Message- From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:39 PM To: Sean Owen Cc: user@spark.apache.or
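The unbundling route uses the same foreachRDD / foreachPartition pattern shown earlier in this digest; where one directory of part-files per micro-batch is acceptable, a time-stamped saveAsTextFile is the simpler variant (a sketch; the path is illustrative and dstream is an assumed DStream[String]):

  dstream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty())                                // skip empty micro-batches
      rdd.saveAsTextFile(s"hdfs:///out/batch-${time.milliseconds}")
  }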

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
files and directories From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:45 PM To: Evo Eftimov Cc: Subject: Re: saveAsTextFile Thanks Evo for your detailed explanation. On Apr 16, 2015, at 1:38 PM, Evo Eftimov wrote: The reason for this is

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
no limitations to the level of hierarchy in the Object Oriented Model of the RDD elements (limitations in terms of performance impact/degradation) – right? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, April 16, 2015 7:23 PM To: Evo Eftimov Cc: Christian Perez; user

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
ine with 16GB and 4 cores). I found something called IndexedRDD on the web https://github.com/amplab/spark-indexedrdd Has anybody use it? Ningjun -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 12:18 PM To: 'Sean Owen'; Wang, Nin

AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Can somebody from Data Bricks shed more light on this Indexed RDD library https://github.com/amplab/spark-indexedrdd It seems to come from AMP Labs and most of the Data Bricks guys are from there. What is especially interesting is whether the Point Lookup (and the other primitives) can work fr

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Yes, simply look for partitionBy in the javadoc for e.g. JavaPairRDD From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Thursday, April 16, 2015 9:57 PM To: Evo Eftimov Cc: Wang, Ningjun (LNG-NPV); user Subject: Re: How to join RDD keyValuePairs efficiently Does this same

RE: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
the context of graphx From: Koert Kuipers [mailto:ko...@tresata.com] Sent: Thursday, April 16, 2015 10:31 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs i believe it is a generalization of some classes inside graphx, where

RE: How to do dispatching in Streaming?

2015-04-17 Thread Evo Eftimov
bottom line / the big picture – in some models friction can be a huge factor in the equations, in some others it is just part of the landscape From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Friday, April 17, 2015 10:12 AM To: Evo Eftimov Cc: Tathagata Das; Jianshi Huang; user; Shao

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-17 Thread Evo Eftimov
other apps would not be very appropriate in production because the two resource managers will be competing for cluster resources - but you can use this for performance tests From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Thursday, April 16, 2015 6:28 PM To: 'Manish Gupta 8';

Re: Can a map function return null

2015-04-19 Thread Evo Eftimov
I am on the move at the moment so I can't try it immediately, but from previous memory / experience I think if you return plain null you will get a spark exception. Anyway you can try it and see what happens and then ask the question. If you do get an exception try Optional instead of plain null Se

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
Spark exception THEN as far as I am concerned, checkmate From: Steve Lewis [mailto:lordjoe2...@gmail.com] Sent: Sunday, April 19, 2015 8:16 PM To: Evo Eftimov Cc: Olivier Girardot; user@spark.apache.org Subject: Re: Can a map function return null So you imagine something like this

RE: Can a map function return null

2015-04-19 Thread Evo Eftimov
In fact you can return “NULL” from your initial map and hence not resort to Optional at all From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Sunday, April 19, 2015 9:48 PM To: 'Steve Lewis' Cc: 'Olivier Girardot'; 'user@spark.apache.org' Subject: RE:
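Both variants from this thread, sketched (parseOrNull is a placeholder and input is an assumed RDD[String]):

  def parseOrNull(s: String): String =
    if (s.nonEmpty) s.toUpperCase else null            // placeholder for the real parsing logic

  // variant 1: return null from map, then filter the nulls out
  val cleaned  = input.map(parseOrNull).filter(_ != null)

  // variant 2: avoid nulls altogether by emitting 0-or-1 elements via flatMap + Option
  val cleaned2 = input.flatMap(s => Option(parseOrNull(s)))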

RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
Check whether your partitioning results in balanced partitions ie partitions with similar sizes - one of the reasons for the performance differences observed by you may be that after your explicit repartitioning, the partition on your master node is much smaller than the RDD partitions on the ot

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
What is meant by “streams” here: 1. Two different DSTream Receivers producing two different DSTreams consuming from two different kafka topics, each with different message rate 2. One kafka topic (hence only one message rate to consider) but with two different DStream receivers

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
And what is the message rate of each topic mate – that was the other part of the required clarifications From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com] Sent: Monday, April 20, 2015 3:38 PM To: Evo Eftimov; user@spark.apache.org Subject: Re: Equal number of RDD Blocks Hi, I have two

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
data more evenly you can partition it explicitly Also contact Data Bricks why the Receivers are not being distributed on different cluster nodes From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com] Sent: Monday, April 20, 2015 3:57 PM To: Evo Eftimov; user@spark.apache.org Subject: Re: Equal

RE: Super slow caching in 1.3?

2015-04-20 Thread Evo Eftimov
detailed description / spec of both From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Thursday, April 16, 2015 7:23 PM To: Evo Eftimov Cc: Christian Perez; user Subject: Re: Super slow caching in 1.3? Here are the types that we specialize, other types will be much slower. This is

Custom partitioning of DStream

2015-04-20 Thread Evo Eftimov
Is the only way to implement custom partitioning of a DStream via the foreach approach, so as to gain access to the actual RDDs comprising the DStream and hence their partitionBy method? DStream has only a "repartition" method accepting only the number of partitions, BUT not the method of partitioning
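Besides foreach/foreachRDD, transform also exposes the underlying RDDs, so a pair DStream can be given an explicit partitioner while remaining a DStream (a sketch; the partitioner and count are illustrative and rawDStream is an assumed DStream[String]):

  import org.apache.spark.HashPartitioner

  val keyed = rawDStream.map(s => (s, 1))

  // DStream.repartition(n) only takes a count; for a specific partitioning method
  // go through transform and use RDD.partitionBy
  val customPartitioned = keyed.transform(rdd => rdd.partitionBy(new HashPartitioner(32)))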

RE: Custom partitioning of DStream

2015-04-23 Thread Evo Eftimov
d" - every function (provided it is not Action) applied to an RDD within foreach is distributed across the cluster since it gets applied to an RDD From: davidkl [via Apache Spark User List] [mailto:ml-node+s1001560n22630...@n3.nabble.com] Sent: Thursday, April 23, 2015 10:13 AM To: Evo Ef

RE: Tasks run only on one machine

2015-04-24 Thread Evo Eftimov
# of tasks = # of partitions, hence you can provide the desired number of partitions to the textFile API which should result a) in a better spatial distribution of the RDD b) each partition will be operated upon by a separate task You can provide the number of p -Original Message- Fro

RE: Slower performance when bigger memory?

2015-04-24 Thread Evo Eftimov
You can resort to Serialized storage (still in memory) of your RDDs - this greatly reduces the GC pressure since the RDD elements are stored as serialized byte buffers rather than large numbers of individual objects (and with the OFF_HEAP level they can even be moved off the JVM heap into Tachyon, the distributed in-memory file system used by Spark internally) Also review the Object O
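A sketch of that serialized-storage suggestion (Kryo is optional but usually worth enabling alongside it; bigRdd is an assumed RDD):

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  // MEMORY_ONLY_SER keeps each partition as a serialized byte buffer (far fewer objects
  // for the GC to walk); OFF_HEAP (Tachyon in the 1.x era) moves the data off the JVM heap
  bigRdd.persist(StorageLevel.MEMORY_ONLY_SER)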

spark filestream problem

2015-05-02 Thread Evo Eftimov
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it doesn't detect new files when moving or renaming them on HDFS - only when copying them but that leads to a well known problem with .tmp files which get removed and make spark streaming filestream throw exception

RE: spark filestream problem

2015-05-02 Thread Evo Eftimov
a) copy the required file to a temp location and then b) move it from there to the dir monitored by spark filestream - this will ensure it has a recent timestamp -Original Message- From: Evo Eftimov [mailto:evo.efti...@isecc.com] Sent: Saturday, May 2, 2015 5:07 PM To: user

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
What is called Bolt in Storm is essentially a combination of [Transformation/Action and DStream RDD] in Spark – so to achieve a higher parallelism for specific Transformation/Action on specific Dstream RDD simply repartition it to the required number of partitions which directly relates to the

RE: Map one RDD into two RDD

2015-05-06 Thread Evo Eftimov
RDD1 = RDD.filter() RDD2 = RDD.filter() From: Bill Q [mailto:bill.q@gmail.com] Sent: Tuesday, May 5, 2015 10:42 PM To: user@spark.apache.org Subject: Map one RDD into two RDD Hi all, I have a large RDD that I map a function to it. Based on the nature of each record in the input RDD,

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
To: Evo Eftimov Cc: anshu shukla; ayan guha; user@spark.apache.org Subject: Re: Creating topology in spark streaming Hi, I agree with Evo, Spark works at a different abstraction level than Storm, and there is not a direct translation from Storm topologies to Spark Streaming jobs. I think

RE: Receiver Fault Tolerance

2015-05-06 Thread Evo Eftimov
This is about the Kafka Receiver IF you are using Spark Streaming. Ps: that book is now behind the curve in quite a few areas since the release of 1.3.1 – read the documentation and forums From: James King [mailto:jakwebin...@gmail.com] Sent: Wednesday, May 6, 2015 1:09 PM To: user Subjec

RE: Map one RDD into two RDD

2015-05-07 Thread Evo Eftimov
: Bill Q [mailto:bill.q@gmail.com] Sent: Thursday, May 7, 2015 4:55 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: Map one RDD into two RDD Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far, it
