Note: I’m assuming you were after the size of your RDD, meaning that in the
‘collect’ alternative you would call size() right after the collect().
If you were simply trying to materialize your RDD, Sean’s answer is more
complete.
—
FG
On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc
wrote:
The short answer:
count(), as the sum can be partially aggregated on the mappers.
The long answer:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
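Roughly, the two alternatives look like this (a quick sketch; the input path is
made up):

val rdd = sc.textFile("hdfs:///path/to/large/input")

// count() is computed distributedly: each partition counts its own elements
// and only those partial counts travel to the driver, where they are summed.
val total: Long = rdd.count()

// collect() first ships every element to the driver (risking OOM on a large
// RDD), and only then do you take the size.
val totalViaCollect: Int = rdd.collect().length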
—
FG
On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc
wrote:
> H
In a nutshell: because it moves all of your data, whereas other operations
(e.g. reduce) summarize it in one form or another before moving it.
For the longer answer:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_grou
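To make it concrete, a toy sketch (the data is made up):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// groupByKey shuffles every (key, value) pair across the network before
// anything is combined; the summing only happens after the move.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so only partial sums are
// moved during the shuffle.
val viaReduce = pairs.reduceByKey(_ + _)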
Sanity check: could `threshold_var` be negative?
—
FG
On Fri, Jan 30, 2015 at 5:06 PM, Peter Thai wrote:
> Hello,
> I am seeing negative values for accumulators. Here's my implementation in a
> standalone app in Spark 1.1.1rc:
> implicit object BigIntAccumulatorParam ex
Sorry, I answered too fast. Please disregard my last message: I did mean
aggregate.
You say: "RDD.aggregate() does not support aggregation by key."
What would you need aggregation by key for, if you do not start with an RDD of
key-value pairs and do not want to build one?
Oh, I’m sorry, I meant `aggregateByKey`.
https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
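To illustrate the difference (a quick sketch with toy data and zero values):

val nums = sc.parallelize(1 to 100)

// RDD.aggregate: one result for the whole RDD, e.g. (sum, count) in one pass.
val (sum, count) = nums.aggregate((0, 0))(
  (acc, n) => (acc._1 + n, acc._2 + 1),     // seqOp: fold an element in
  (a, b) => (a._1 + b._1, a._2 + b._2))     // combOp: merge partition results

// PairRDDFunctions.aggregateByKey: one result per key, which is why it needs
// an RDD of key-value pairs to start from.
val perKey = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  .aggregateByKey((0, 0))(
    (acc, n) => (acc._1 + n, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))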
—
FG
On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi wrote:
> Francois,
> RDD.aggregate() does not support aggregation by key. But, indeed, that is the
> kind of imp
Have you looked at the `aggregate` function in the RDD API?
If your way of extracting the “key” (identifier) and “value” (payload) parts of
the RDD elements is uniform (a function), it’s unclear to me how this would be
more efficient than extracting key and value and then using combine, howe
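For what it’s worth, here is a rough sketch of the “extract key and value, then
combine” path I have in mind (the element type and extractors are made up):

case class Record(id: String, payload: Int)
val records = sc.parallelize(Seq(Record("a", 1), Record("a", 2), Record("b", 3)))

val combined = records
  .map(r => (r.id, r.payload))                  // uniform key/value extraction
  .combineByKey(
    (v: Int) => List(v),                        // createCombiner
    (acc: List[Int], v: Int) => v :: acc,       // mergeValue
    (a: List[Int], b: List[Int]) => a ::: b)    // mergeCombiners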
Why would you want to? That would be equivalent to setting your batch interval
to the same duration you’re suggesting for both window length and slide
interval, IIUC.
Here’s a nice explanation of windowing by TD :
https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
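As a rough sketch (durations and source are made up): with window length equal
to the slide interval, every 30 seconds you get exactly the last 30 seconds of
data, which is what a 30-second batch interval would give you directly.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))          // 10s batch interval
val lines = ssc.socketTextStream("localhost", 9999)
val windowed = lines.window(Seconds(30), Seconds(30))    // behaves like a 30s batch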
- You are launching up to 10 threads/topic per Receiver. Are you sure your
receivers can support 10 threads each? (i.e., in the default configuration, do
they have 10 cores?) If they have 2 cores, that would explain why this works
with 20 partitions or fewer.
- If you have 90 partitions, why
Hi Mukesh,
If my understanding is correct, each DStream only has a single Receiver. So, if
you have each receiver consuming 9 partitions, you need 10 input DStreams to
create 10 concurrent receivers:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism
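Something along these lines (a rough sketch, following the link above; the
ZooKeeper quorum, group id and topic map are made up):

import org.apache.spark.streaming.kafka.KafkaUtils

val topicMap = Map("my-topic" -> 1)
val kafkaStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", topicMap)
}
// Union them so the rest of the pipeline sees a single DStream backed by
// 10 concurrent receivers.
val unified = ssc.union(kafkaStreams)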
IIUC, Receivers run on workers, colocated with other tasks.
The Driver, on the other hand, can either run on the querying machine (local
mode) or as a worker (cluster mode).
—
FG
On Fri, Dec 12, 2014 at 4:49 PM, twizansk wrote:
> Thanks for the reply. I might be misunderstanding something
[sorry for the botched half-message]
Hi Mukesh,
There’s been some great work on Spark Streaming reliability lately.
https://www.youtube.com/watch?v=jcJq3ZalXD8
Look at the links from:
https://issues.apache.org/jira/browse/SPARK-3129
I’m not aware of any doc yet (did I miss somet
Hi Mukesh,
There’s been some great work on Spark Streaming reliability lately.
I’m not aware of any doc yet (did I miss something?) but you can look at the
ReliableKafkaReceiver’s test suite:
—
FG
On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha
wrote:
> Hello Guys,
> Any insights on t
How about writing to a buffer? Then you would flush the buffer to Kafka if and
only if the output operation reports successful completion. In the event of a
worker failure, that would not happen.
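Roughly (a sketch of the idea only; `dstream` is assumed to be a
DStream[String], and the broker address and topic name are made up):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Accumulate this partition's output locally first.
    val buffer = scala.collection.mutable.ArrayBuffer.empty[String]
    records.foreach(buffer += _)

    // We only get here if the whole partition was processed without error;
    // on a worker failure the task dies before this flush ever runs.
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    buffer.foreach(msg => producer.send(new ProducerRecord[String, String]("out-topic", msg)))
    producer.close()
  }
}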
—
FG
On Sun, Nov 30, 2014 at 2:28 PM, Josh J wrote:
> Is there a way to do this that preserves
Hi Steve,
Are you talking about sequence alignment?
—
FG
On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis
wrote:
> The original problem is in biology but the following captures the CS
> issues, Assume I have a large number of locks and a large number of keys.
> There is a scoring function bet