That was a) fuzzy b) insufficient – one can certainly use foreach (only) on
DStream RDDs – it works, as an empirical observation
As another empirical observation:
foreachPartition results in one instance of the lambda/closure per
partition when e.g. publishing to output systems like message brokers,
databases and file systems - that increases the level of parallelism of your
output processing
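The per-partition pattern described above can be sketched as follows. This is a sketch only: `createConnection`, `send` and `close` stand in for whatever client your output system provides (a Kafka producer, a DB connection pool, etc.) and are hypothetical names, not a real API.

```scala
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection (and one closure instance) per partition, created on the
    // executor that owns the partition - not serialized from the driver.
    val connection = createConnection()
    partition.foreach(record => connection.send(record))
    connection.close()
  }
}
```

Creating the connection inside foreachPartition (rather than per record, or on the driver) amortizes the setup cost across the whole partition, and the output runs with as many parallel writers as there are partitions.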
-Original Message-
From: dgoldenberg [mailto:dg
Also try to increase the number of partitions gradually – not in one big jump
from 20 to 100 but adding e.g. 10 at a time, and see whether there is a
correlation with adding more RAM to the executors
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Wednesday, July 8, 2015 1:26 PM
To
them will be in a suspended mode
waiting for free core (Thread contexts also occupy additional RAM )
From: Aniruddh Sharma [mailto:asharma...@gmail.com]
Sent: Wednesday, July 8, 2015 12:52 PM
To: Evo Eftimov
Subject: Re: Out of Memory Errors on less number of cores in proportion to
Partitions
This is most likely due to the internal implementation of ALS in MLlib. Probably,
for each parallel unit of execution (a partition in Spark terms), the
implementation allocates and uses a RAM buffer where it keeps interim results
during the ALS iterations
If we assume that the size of that intern
ext...
dstream.foreachRDD { rdd =>
  myRDD = myRDD.union(rdd.filter(myfilter)).cache()
}
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Tuesday, July 7, 2015 1:55 PM
To: Evo Eftimov
Cc: Anand Nalya; spark users
Subject: Re:
Evo,
I'd let the OP clarify the qu
spark.streaming.unpersist = false // in order for SStreaming to not drop the
raw RDD data
spark.cleaner.ttl =
why is the above suggested, provided the persist/cache operation on the
constantly unionized Batch RDD will have to be invoked anyway (after every
union with a DStream RDD), besides
The “RDD”, aka Batch RDD, which you load from file will be kept for as long as
the Spark framework is instantiated / running – you can also ensure it is
flagged explicitly as persisted, e.g. in memory and/or on disk
From: Anand Nalya [mailto:anand.na...@gmail.com]
Sent: Tuesday, July 7, 2015 12:3
I had a look at the new R "on Spark" API / feature in Spark 1.4.0
For those "skilled in the art" (of R and distributed computing) it will be
immediately clear that "ON" is a marketing ploy: what it actually is, is
"TO", ie Spark 1.4.0 offers an INTERFACE from R TO DATA stored in Spark in
distributed
that limits the number of cores
per Executor rather than the total cores for the job and hence will probably
not yield the effect you need
From: Wojciech Pituła [mailto:w.pit...@gmail.com]
Sent: Wednesday, June 24, 2015 10:49 AM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Spark
There is no direct one to one mapping between Executor and Node
Executor is simply the spark framework term for JVM instance with some spark
framework system code running in it
A node is a physical server machine
You can have more than one JVM per node
And vice versa you can hav
Probably your application has crashed or was terminated without invoking the
stop method of the Spark context - in such cases it doesn't create the empty
flag file which apparently tells the history server that it can safely show
the log data - simply go to some of the other dirs of the history server t
Spark Streaming 1.3.0 on YARN during Job Execution keeps generating the
following error while the application is running:
ERROR LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
etc
etc
Caused by: java.io.IOException: Filesystem closed
What is GraphX:
- It can be viewed as a kind of Distributed, Parallel, Graph Database
- It can be viewed as Graph Data Structure (Data Structures 101 from
your CS course)
- It features some off-the-shelf algos for Graph Processing and
Navigation (Algos and Data
The only thing which doesn't make much sense in Spark Streaming (and I am
not saying it is done better in Storm) is the iterative and "redundant"
shipping of essentially the same tasks (closures/lambdas/functions) to
the cluster nodes AND re-launching them there again and again
This is a l
Best is by measuring and recording how the performance of your solution
scales as the workload scales - recording as in "data points recording" - and
then you can do some time-series stat analysis and visualizations
For example you can start with a single box with e.g. 8 CPU cores
Use e.g. 1 or
https://spark.apache.org/docs/latest/monitoring.html
also subscribe to various Listeners for various Metrics Types e.g. Job
Stats/Statuses - this will allow you (in the driver) to decide when to stop
the context gracefully (the listening and stopping can be done from a
completely separa
“turn (keep turning) your HDFS file (Batch RDD) into a stream of messages
(outside spark streaming)” – what I meant by that was “turn the Updates to your
HDFS dataset into Messages” and send them as such to spark streaming
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, June
the app
And at the same time you join the DStream RDDs of your actual Streaming Data
with the above continuously updated DStream RDD representing your HDFS file
From: Ilove Data [mailto:data4...@gmail.com]
Sent: Monday, June 15, 2015 5:19 AM
To: Tathagata Das
Cc: Evo Eftimov; Akhil Das
need to go for what is essentially a bit of a hack ie
running more than 1 worker per machine
Sent from Samsung Mobile
Original message From: Sandy Ryza
Date:2015/06/10 21:31 (GMT+00:00)
To: Evo Eftimov Cc: maxdml
,user@spark.apache.org Subject: Re: Determining
number of
Yes I think it is ONE worker ONE executor, as an executor is nothing but a JVM
instance spawned by the worker
To run more executors ie JVM instances on the same physical cluster node you
need to run more than one worker on that node and then allocate only part of
the sys resources to that worker/ex
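In Spark standalone mode of that era, running more than one worker per node is done via environment variables in conf/spark-env.sh on each node. The values below are illustrative, not a recommendation:

```shell
# conf/spark-env.sh (Spark standalone mode) - illustrative values
SPARK_WORKER_INSTANCES=2   # run two Worker daemons (hence two executor JVMs per app) per node
SPARK_WORKER_CORES=4       # CPU cores granted to each worker
SPARK_WORKER_MEMORY=8g     # RAM granted to each worker
```

Make sure the per-worker cores/memory sum to no more than the node actually has, otherwise the workers will oversubscribe the box.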
It depends on how big the Batch RDD requiring reloading is
Reloading it for EVERY single DStream RDD would slow down the stream processing
inline with the total time required to reload the Batch RDD …..
But if the Batch RDD is not that big then that might not be an issue
especially in t
The foreachPartition callback is provided with an Iterator by the Spark Framework –
while iterator.hasNext() ……
Also check whether this is not some sort of Python Spark API bug – Python seems
to be the foster child here – Scala and Java are the darlings
From: John Omernik [mailto:j...@omernik.co
once in your Spark Streaming App
and then keep joining and then e.g. filtering every incoming DStream RDD with
the (big static) Batch RDD
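The load-once-then-join pattern described above can be sketched like this. Note that `keyOf` and `myPredicate` are hypothetical placeholders for your own key-extraction and filtering logic, and the HDFS path is illustrative:

```scala
// Load the big static reference data once and cache it, keyed for joining.
val staticRdd = sc.textFile("hdfs:///ref/data")
  .map(line => (keyOf(line), line))
  .cache()

// Join every incoming micro-batch RDD against the cached batch RDD.
val enriched = dstream
  .map(msg => (keyOf(msg), msg))
  .transform(rdd => rdd.join(staticRdd))
  .filter { case (key, (msg, ref)) => myPredicate(msg, ref) }
```

transform() drops you to RDD level per micro-batch, so the (cheap, already-cached) batch RDD is joined rather than reloaded on every batch interval.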
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, June 5, 2015 3:27 PM
To: 'Dmitry Goldenberg'
Cc: 'Yiannis Gkoufas'; '
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re: How to share large resources like
share data
between Jobs, while an RDD is ALWAYS visible within Jobs using the same Spark
Context
From: Charles Earl [mailto:charles.ce...@gmail.com]
Sent: Friday, June 5, 2015 12:10 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org
Subject: Re
€ρ@Ҝ (๏̯͡๏)
Cc: Evo Eftimov; user
Subject: Re: How to increase the number of tasks
just multiply 2-4 by the CPU core number of the node.
2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) :
I did not change spark.default.parallelism,
What is recommended value for it.
On Fri, Jun 5, 2015 at 3
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark
Batch Jobs (besides, anyone can put something like that together in 5 min), while
I am under the impression that Dmitry is working on a Spark Streaming app
Besides, the Job Server is essentially for sharing the Spark Context betwe
It may be that your system runs out of resources (ie 174 is the ceiling) due to
the following
1. RDD Partition = (Spark) Task
2. RDD Partition != (Spark) Executor
3. (Spark) Task != (Spark) Executor
4. (Spark) Task = JVM Thread
5. (Spark) Executor = JVM insta
more
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Wednesday, June 3, 2015 4:46 PM
To: Evo Eftimov
Cc: Cody Koeninger; Andrew Or; Gerard Maas; spark users
Subject: Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of
growth in Kafka or Spark's metrics?
)
Then just shutdown your currently running spark streaming job/app and restart
it with new params to take advantage of the larger cluster
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Wednesday, June 3, 2015 4:14 PM
To: Cody Koeninger
Cc: Andrew Or; Evo Eftimov; Gerard
Dmitry was concerned about the “serialization cost” NOT the “memory footprint” –
hence option a) is still viable since a Broadcast is performed only ONCE for
the lifetime of the Driver instance
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, June 3, 2015 2:44 PM
To: Evo Eftimov
Cc
Hmmm, Spark Streaming app code doesn't execute in the linear fashion
assumed in your previous code snippet - to achieve your objectives you
should do something like the following
in terms of your second objective - saving the initialization and
serialization of the params you can:
a) broadcast
I don’t think the number of CPU cores controls the “number of parallel tasks”.
The number of Tasks corresponds first and foremost to the number of (Dstream)
RDD Partitions
The Spark documentation doesn’t mention what is meant by “Task” in terms of
Standard Multithreading Terminology ie a T
free RAM ) and
taking a performance hit from that, BUT only until there is no free RAM
From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Thursday, May 28, 2015 2:34 PM
To: Evo Eftimov
Cc: Gerard Maas; spark users
Subject: Re: FW: Re: Autoscaling Spark cluster based on topic
Original message
From: Evo Eftimov
Date:2015/05/28 13:22 (GMT+00:00)
To: Dmitry Goldenberg
Cc: Gerard Maas ,spark users
Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in
Kafka or Spark's metrics?
You can always spin new boxes i
shut down your job
gracefully. Besides, managing the offsets explicitly is not a big deal if
necessary
Sent from Samsung Mobile
Original message From: Dmitry Goldenberg
Date:2015/05/28 13:16 (GMT+00:00)
To: Evo Eftimov Cc: Gerard Maas
,spark users Subject:
Re: Autoscaling
@DG; The key metrics should be
- Scheduling delay – its ideal state is to remain constant over time
and ideally be less than the time of the microbatch window
- The average job processing time should remain less than the
micro-batch window
- Number of Lost Jobs –
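The scheduling-delay metric above can be watched from the driver with a StreamingListener. A minimal sketch (the 3000 ms threshold is illustrative and should match your micro-batch window; `ssc` is your StreamingContext):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Warn whenever a completed batch's scheduling delay exceeds the batch window.
class DelayListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    for (delayMs <- batch.batchInfo.schedulingDelay) {   // schedulingDelay is an Option[Long]
      if (delayMs > 3000)
        println(s"scheduling delay ${delayMs} ms exceeds the micro-batch window")
    }
  }
}

ssc.addStreamingListener(new DelayListener)
```

A steadily growing scheduling delay means the app cannot keep up with the ingestion rate; a flat one near zero is the ideal state described above.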
message From: Arush Kharbanda
Date:2015/05/26 10:55 (GMT+00:00)
To: canan chen Cc: Evo Eftimov
,user@spark.apache.org Subject: Re: How does
spark manage the memory of executor with multiple tasks
Hi Evo,
Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you
9:30 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: How does spark manage the memory of executor with multiple tasks
Yes, I know that one task represents a JVM thread. This is what confused me.
Usually users want to specify the memory at task level, so how can I do it if
task if
An Executor is a JVM instance spawned and running on a Cluster Node (Server
machine). Task is essentially a JVM Thread – you can have as many Threads as
you want per JVM. You will also hear about “Executor Slots” – these are
essentially the CPU Cores available on the machine and granted for use
A receiver occupies a cpu core, an executor is simply a jvm instance and as
such it can be granted any number of cores and ram
So check how many cores you have per executor
Sent from Samsung Mobile
Original message From: Mike Trienis
Date:2015/05/22 21:51 (GMT+00:00)
To:
performance in the
name of the reliability/integrity of your system ie not losing messages)
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015 9:39 PM
To: 'Tathagata Das'; 'Gautam Bajaj'
Cc: 'user'
Subject: RE: Storing spark processed out
If the message consumption rate is higher than the time required to process ALL
data for a micro batch (ie the next RDD produced for your stream), the
following happens – let's say that e.g. your micro batch time is 3 sec:
1. Based on your message streaming and consumption rate, you ge
OR you can run Drools in a Central Server Mode ie as a common/shared service,
but that would slow down your Spark Streaming job due to the remote network call
which will have to be generated for every single message
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Friday, May 22, 2015
The only “tricky” bit would be when you want to manage/update the Rule Base in
your Drools Engines already running as Singletons in Executor JVMs on Worker
Nodes. The invocation of Drools from Spark Streaming to evaluate a Rule already
loaded in Drools is not a problem.
From: Evo Eftimov
From: Antonio Giambanco [mailto:antogia...@gmail.com]
Sent: Friday, May 22, 2015 11:07 AM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Spark Streaming and Drools
Thanks a lot Evo,
do you know where I can find some examples?
Have a great one
A G
2015-05-22 12:00 GMT+02:00
You can deploy and invoke Drools as a Singleton on every Spark Worker Node /
Executor / Worker JVM
You can invoke it from e.g. map, filter etc and use the result from the Rule to
make decision how to transform/filter an event/message
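The singleton-per-executor-JVM idea above can be sketched as follows. Heavy hedging applies: the KIE API calls are from memory of Drools 6.x and should be checked against your Drools version, the session name "myRulesSession" is hypothetical, and `passedRules` is a hypothetical helper that reads the outcome your rules produce:

```scala
// A per-JVM Drools singleton: `lazy val` means each executor JVM initializes
// its own session the first time a task on that executor touches it.
object RuleEngine {
  import org.kie.api.KieServices
  lazy val session = KieServices.Factory.get()
    .getKieClasspathContainer
    .newKieSession("myRulesSession")   // hypothetical kmodule session name
}

// Invoke the rules from an ordinary transformation, e.g. a filter:
val decided = dstream.filter { event =>
  RuleEngine.session.synchronized {    // a KieSession is not thread-safe
    RuleEngine.session.insert(event)
    RuleEngine.session.fireAllRules()
  }
  passedRules(event)                    // hypothetical: read the rule verdict
}
```

Because the object lives in the executor JVM, no remote call is made per message; the trade-off, as noted above, is that updating the rule base inside these already-running singletons needs its own mechanism.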
From: Antonio Giambanco [mailto:antogia...@gmail.com]
Check whether the name can be resolved in the /etc/hosts file (or DNS) of the
worker
(the same btw applies for the Node where you run the driver app – all other
nodes must be able to resolve its name)
From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Wednesday, May 20, 2015 10:07 AM
Is that a Spark or Spark Streaming application
Re the map transformation which is required you can also try flatMap
Finally, an Executor is essentially a JVM spawned by a Spark Worker Node or YARN –
giving 60GB RAM to a single JVM will certainly result in “off the charts” GC. I
would sugges
er latency
Ps: ultimately though remember that none of this stuff is part of spark
streaming as of yet
Sent from Samsung Mobile
Original message From: Akhil Das
Date:2015/05/18 16:56 (GMT+00:00)
To: Evo Eftimov Cc:
user@spark.apache.org Subject: RE: Spark Streaming an
communication and facts
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 18, 2015 2:28 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; user@spark.apache.org
Subject: Re: Spark Streaming and reducing latency
we = Sigmoid
back-pressuring mechanism = Stopping the receiver from
, as Evo says
Spark Streaming DOES crash in an “unceremonious way” when the free RAM available
for In-Memory Cached RDDs gets exhausted
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Monday, May 18, 2015 2:03 PM
To: Evo Eftimov
Cc: Dmitry Goldenberg; user@spark.apache.org
Subject: Re
-4-1410542878200 not found
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Monday, May 18, 2015 12:13 PM
To: 'Dmitry Goldenberg'; 'Akhil Das'
Cc: 'user@spark.apache.org'
Subject: RE: Spark Streaming and reducing latency
You can use spark.streaming.receiver.maxRate (default: not set) – the maximum
rate (number of records per second) at which each receiver will receive
data. Effectively, each stream will consume at most this number of records per
second. Setting this configuration to 0 or a negative number will put no limit
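Setting the property looks like this (the 10000 records/sec cap and the app name are illustrative):

```scala
import org.apache.spark.SparkConf

// Cap each receiver at 10,000 records/sec to keep micro-batches processable
// within the batch window.
val conf = new SparkConf()
  .setAppName("throttled-stream")
  .set("spark.streaming.receiver.maxRate", "10000")
```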
This is the nature of Spark Streaming as a System Architecture:
1. It is a batch processing system architecture (Spark Batch) optimized
for Streaming Data
2. In terms of sources of Latency in such System Architecture, bear in
mind that besides “batching”, there is also the Centra
You can make ANY standard receiver sleep by implementing a custom Message
Deserializer class with sleep method inside it.
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Sunday, May 17, 2015 4:29 PM
To: Haopu Wang
Cc: user
Subject: Re: [SparkStreaming] Is it possible to delay the s
Ok thanks a lot for clarifying that – btw was your application a Spark
Streaming App – I am also looking for confirmation that FAIR scheduling is
supported for Spark Streaming Apps
From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Friday, May 15, 2015 7:20 PM
To: Evo Eftimov
me
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Friday, May 15, 2015 6:45 PM
To: Evo Eftimov
Cc: user
Subject: Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond
How are you configuring the fair scheduler pools?
On Fri, May 15, 2015 at 8:33 AM, Evo Eftimov wrote
I have run / submitted a few Spark Streaming apps configured with Fair
scheduling on Spark Streaming 1.2.0, however they still run in a FIFO mode.
Is FAIR scheduling supported at all for Spark Streaming apps and from what
release / version - e.g. 1.3.1
Where is the “Tuple” supposed to be in - you can refer to a
“Tuple” if it was e.g. >
From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of
Holden Karau
Sent: Thursday, May 14, 2015 5:56 PM
To: Yasemin Kaya
Cc: user@spark.apache.org
Subject: Re: swap tuple
Can you pas
inherent to the “commercial”
vendors, but I can confirm as fact it is also in effect in the “open source
movement” (because human nature remains the same)
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:30 PM
To: Paolo Platter
Cc: Evo Eftimov; Matei Zaharia
That has been a really rapid “evaluation” of the “work” and its “direction”
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:12 PM
To: Matei Zaharia
Cc: user@spark.apache.org
Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming
Than
I can confirm it does work in Java
From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
Sent: Tuesday, May 12, 2015 5:53 PM
To: Evo Eftimov
Cc: Saisai Shao; user@spark.apache.org
Subject: Re: DStream Union vs. StreamingContext Union
Thanks Evo. I tried chaining Dstream unions like
You can also union multiple DStream RDDs in this way
DStreamRDD1.union(DStreamRDD2).union(DStreamRDD3) etc etc
Ps: the API is not “redundant” – it offers several ways of achieving the same
thing as a convenience depending on the situation
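Both forms mentioned in this thread can be sketched side by side (`ssc` is the StreamingContext, and the three input DStreams are assumed to carry the same element type):

```scala
// Chained pairwise unions:
val merged  = dstream1.union(dstream2).union(dstream3)

// Equivalent convenience form on the StreamingContext for many streams at once:
val merged2 = ssc.union(Seq(dstream1, dstream2, dstream3))
```

The StreamingContext form is handier when the number of streams is dynamic, e.g. one receiver per Kafka topic built in a loop.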
From: Vadim Bichutskiy [mailto:vadim.bichuts...@
LOCAL_ONE.
* spark.cassandra.output.throughput_mb_per_sec: maximum write throughput
allowed per single core in MB/s; on long (8+ hour) runs, limit this to 70% of
your max throughput as seen on a smaller job, for stability
From: Sergio Jiménez Barrio [mailto:drarse.a...@gmail.com]
Sent: Sunday, May 10, 2015 12:59 PM
To: E
from Samsung Mobile
Original message From: Evo Eftimov
Date:2015/05/10 12:02 (GMT+00:00)
To: 'Gerard Maas' Cc: 'Sergio
Jiménez Barrio' ,'spark users'
Subject: RE: Spark streaming closes with Cassandra Conector
Hmm there is also a Connection
and distribution profile which may
skip a potential error in the Connection Pool code
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Sunday, May 10, 2015 11:56 AM
To: Evo Eftimov
Cc: Sergio Jiménez Barrio; spark users
Subject: Re: Spark streaming closes with Cassandra Conector
I
I think the message that it has written 2 rows is misleading
If you look further down you will see that it could not initialize a connection
pool for Cassandra (presumably while trying to write the previously mentioned 2
rows)
Another confirmation of this hypothesis is the phrase “error d
You can implement a custom partitioner
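A custom partitioner is a small class extending org.apache.spark.Partitioner. A minimal sketch (the modular-hash rule and the partition count 8 are illustrative; replace getPartition with whatever routing rule your analysis needs):

```scala
import org.apache.spark.Partitioner

// Route keys to partitions by a custom rule; getPartition must return
// a value in [0, numPartitions), hence the double-modulo to avoid
// negative hashCodes.
class MyPartitioner(val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    ((key.hashCode % numPartitions) + numPartitions) % numPartitions
}

// pairRdd is an RDD[(K, V)], e.g. (clientIp, logLine) pairs from the access log
val routed = pairRdd.partitionBy(new MyPartitioner(8))
</imports>
```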
-Original Message-
From: skippi [mailto:skip...@gmx.de]
Sent: Sunday, May 10, 2015 10:19 AM
To: user@spark.apache.org
Subject: spark streaming and computation
Assuming a web server access log shall be analyzed and target of computation
shall be csv
: Bill Q [mailto:bill.q@gmail.com]
Sent: Thursday, May 7, 2015 6:27 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD
The multi-threading code in Scala is quite simple and you can google it pretty
easily. We used the Future framework. You can use Akka also
: Bill Q [mailto:bill.q@gmail.com]
Sent: Thursday, May 7, 2015 4:55 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD
Thanks for the replies. We decided to use concurrency in Scala to do the two
mappings using the same source RDD in parallel. So far, it
This is about Kafka Receiver IF you are using Spark Streaming
Ps: that book is now behind the curve in quite a few areas since the release
of 1.3.1 – read the documentation and forums
From: James King [mailto:jakwebin...@gmail.com]
Sent: Wednesday, May 6, 2015 1:09 PM
To: user
Subjec
To: Evo Eftimov
Cc: anshu shukla; ayan guha; user@spark.apache.org
Subject: Re: Creating topology in spark streaming
Hi,
I agree with Evo, Spark works at a different abstraction level than Storm, and
there is not a direct translation from Storm topologies to Spark Streaming
jobs. I think
val RDD1 = RDD.filter(predicate1)
val RDD2 = RDD.filter(predicate2)
From: Bill Q [mailto:bill.q@gmail.com]
Sent: Tuesday, May 5, 2015 10:42 PM
To: user@spark.apache.org
Subject: Map one RDD into two RDD
Hi all,
I have a large RDD that I map a function to it. Based on the nature of each
record in the input RDD,
What is called a Bolt in Storm is essentially a combination of
[Transformation/Action and DStream RDD] in Spark – so to achieve higher
parallelism for a specific Transformation/Action on a specific DStream RDD, simply
repartition it to the required number of partitions, which directly relates to
the
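The repartition step described above is a one-liner on the DStream (the partition count 32 and `expensiveTransform` are illustrative):

```scala
// Raise parallelism for a specific transformation by repartitioning first:
// the map below now runs as 32 parallel tasks per micro-batch.
val result = dstream
  .repartition(32)
  .map(expensiveTransform)
```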
a) copy the required file to a temp location and
then b) move it from there to the dir monitored by spark filestream - this
will ensure it is with a recent timestamp
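As a shell sketch of the copy-then-move recipe (paths are illustrative; mktemp is used here only so the sketch is runnable anywhere, and with HDFS you would use hdfs dfs -put / -mv instead):

```shell
BASE=$(mktemp -d)
TMP="$BASE/tmp"; WATCHED="$BASE/watched"   # WATCHED stands in for the dir fileStream monitors
mkdir -p "$TMP" "$WATCHED"
echo "some,record" > "$BASE/source.csv"

cp "$BASE/source.csv" "$TMP/input.csv"     # the copy gets a fresh mtime, outside the watched dir
mv "$TMP/input.csv" "$WATCHED/input.csv"   # atomic rename into the watched dir, mtime stays recent
```

The move is atomic within a filesystem, so the streaming job never sees a half-written file, and the timestamp is recent because the copy (not the original file) is what lands in the watched dir.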
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Saturday, May 2, 2015 5:09 PM
To: user
it seems that on Spark Streaming 1.2 the filestream API may have a bug - it
doesn't detect new files when moving or renaming them on HDFS - only when
copying them, but that leads to a well known problem with .tmp files which
get removed and make spark streaming filestream throw an exception
You can resort to Serialized storage (still in memory) of your RDDs - this
will obviate the need for GC since the RDD elements are stored as serialized
objects off the JVM heap (most likely in Tachyon, the distributed in-memory
file system used by Spark internally)
Also review the Object O
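The serialized storage levels are selected via persist() (a sketch; which level is appropriate depends on your memory budget):

```scala
import org.apache.spark.storage.StorageLevel

// Serialized in-memory storage: more CPU to deserialize on access,
// but far fewer live objects for the GC to trace.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Variant that spills serialized partitions to disk when memory is short:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Spark 1.x also offered OFF_HEAP, which stored blocks in Tachyon:
// rdd.persist(StorageLevel.OFF_HEAP)
```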
# of tasks = # of partitions, hence you can provide the desired number of
partitions to the textFile API which should result a) in a better spatial
distribution of the RDD b) each partition will be operated upon by a separate
task
You can provide the number of p
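Passing the partition count to textFile looks like this (the path and the count 64 are illustrative; the argument is a minimum, so the actual count can be higher):

```scala
// Ask for at least 64 partitions up front: better spatial distribution,
// and each partition is operated upon by its own task.
val lines = sc.textFile("hdfs:///data/input", 64)
println(lines.partitions.length)   // >= 64
```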
-Original Message-
Fro
d" - every function (provided it is not Action) applied to an
RDD within foreach is distributed across the cluster since it gets applied
to an RDD
From: davidkl [via Apache Spark User List]
[mailto:ml-node+s1001560n22630...@n3.nabble.com]
Sent: Thursday, April 23, 2015 10:13 AM
To: Evo Ef
Is the only way to implement custom partitioning of a DStream via the foreach
approach, so as to gain access to the actual RDDs comprising the DStream and
hence their partitionBy method?
DStream has only a "repartition" method accepting only the number of
partitions, BUT not the method of partitioning
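Besides foreach, transform() also exposes the underlying RDDs and so admits a real Partitioner, not just a partition count. A sketch (HashPartitioner and the count 16 are illustrative; any custom Partitioner works in its place):

```scala
import org.apache.spark.HashPartitioner

// Drop to RDD level per micro-batch and choose the partitioning *method*:
val custom = pairDStream.transform(rdd => rdd.partitionBy(new HashPartitioner(16)))
```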
detailed description / spec of both
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user
Subject: Re: Super slow caching in 1.3?
Here are the types that we specialize, other types will be much slower. This
is
data more evenly you can partition it explicitly
Also contact Databricks about why the Receivers are not being distributed on
different cluster nodes
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:57 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal
And what is the message rate of each topic mate – that was the other part of
the required clarifications
From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com]
Sent: Monday, April 20, 2015 3:38 PM
To: Evo Eftimov; user@spark.apache.org
Subject: Re: Equal number of RDD Blocks
Hi,
I have two
What is meant by “streams” here:
1. Two different DSTream Receivers producing two different DSTreams
consuming from two different kafka topics, each with different message rate
2. One kafka topic (hence only one message rate to consider) but with two
different DStream receivers
Check whether your partitioning results in balanced partitions ie partitions
with similar sizes - one of the reasons for the performance differences
observed by you may be that after your explicit repartitioning, the partition
on your master node is much smaller than the RDD partitions on the ot
In fact you can return “null” from your initial map and hence not resort to
Optional at all
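The null-then-filter pattern is a two-step pipeline (`parseOrNull` is a hypothetical parser that returns null on failure):

```scala
// Emit null for unparsable records, then drop the nulls before downstream use.
val parsed = rdd.map(parseOrNull).filter(_ != null)
```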
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Sunday, April 19, 2015 9:48 PM
To: 'Steve Lewis'
Cc: 'Olivier Girardot'; 'user@spark.apache.org'
Subject: RE:
Spark exception THEN
as far as I am concerned, chess-mate
From: Steve Lewis [mailto:lordjoe2...@gmail.com]
Sent: Sunday, April 19, 2015 8:16 PM
To: Evo Eftimov
Cc: Olivier Girardot; user@spark.apache.org
Subject: Re: Can a map function return null
So you imagine something like this
I am on the move at the moment so I can't try it immediately, but from previous
memory / experience I think if you return plain null you will get a spark
exception
Anyway you can try it and see what happens and then ask the question
If you do get an exception, try Optional instead of plain null
Se
other apps would not be very appropriate in production because the two
resource managers will be competing for cluster resources - but you can use
this for performance tests
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 6:28 PM
To: 'Manish Gupta 8';
bottom line / the big picture – in some models, friction
can be a huge factor in the equations; in some others it is just part of the
landscape
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Friday, April 17, 2015 10:12 AM
To: Evo Eftimov
Cc: Tathagata Das; Jianshi Huang; user; Shao
the context of
graphx
From: Koert Kuipers [mailto:ko...@tresata.com]
Sent: Thursday, April 16, 2015 10:31 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs
i believe it is a generalization of some classes inside graphx, where
Yes, simply look for partitionBy in the javadoc for e.g. JavaPairRDD
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Thursday, April 16, 2015 9:57 PM
To: Evo Eftimov
Cc: Wang, Ningjun (LNG-NPV); user
Subject: Re: How to join RDD keyValuePairs efficiently
Does this same
Can somebody from Databricks shed more light on this Indexed RDD library
https://github.com/amplab/spark-indexedrdd
It seems to come from AMP Labs and most of the Databricks guys are from
there
What is especially interesting is whether the Point Lookup (and the other
primitives) can work fr
ine with 16GB and 4 cores).
I found something called IndexedRDD on the web
https://github.com/amplab/spark-indexedrdd
Has anybody use it?
Ningjun
-Original Message-
From: Evo Eftimov [mailto:evo.efti...@isecc.com]
Sent: Thursday, April 16, 2015 12:18 PM
To: 'Sean Owen'; Wang, Nin
no limitations to the level of hierarchy in the Object Oriented Model
of the RDD elements (limitations in terms of performance impact/degradation) –
right?
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Thursday, April 16, 2015 7:23 PM
To: Evo Eftimov
Cc: Christian Perez; user