Ravi, did your issue ever get solved for this?
I think I've been hitting the same thing; it looks like
the spark.sql.autoBroadcastJoinThreshold setting isn't kicking in as
expected. If I set it to -1 then the computation proceeds successfully.
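For reference, a rough sketch of the workaround (assuming a 1.x-style SQLContext; the same value can also be passed with --conf at launch):

// disable automatic broadcast joins entirely
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

// or at submit time:
//   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 ...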
On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal wrote
So if your data can be kept in memory on the driver node, then you don't
really need Spark? If you want to use it for Hadoop reading, then I'd
immediately call collect after you open it, and then you can do normal Scala
collections operations.
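Roughly what that looks like (the path and the parsing below are just placeholders):

// read via the Hadoop APIs, then pull everything onto the driver
val lines = sc.textFile("hdfs:///some/path").collect()   // Array[String] on the driver
// from here on it's ordinary Scala collections
val parsed = lines.filter(_.nonEmpty).map(_.split("\t"))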
On Tue, Jun 3, 2014 at 2:56 PM, Amit Kumar wrote:
> Hi
Depending on your requirements, when doing hourly metrics that calculate
distinct cardinality, a much more scalable method would be to use a
HyperLogLog data structure.
A Scala implementation people have used with Spark is
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/
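A minimal sketch of the idea, assuming Algebird's HyperLogLogMonoid API (the bit size and the users RDD are placeholders):

import com.twitter.algebird._
import com.twitter.algebird.HyperLogLog._

val hll = new HyperLogLogMonoid(12)               // 12 bits of precision; more bits = lower error
val approxDistinct = users                        // RDD[String] of ids for one hour
  .map(id => hll.create(id.getBytes("UTF-8")))    // one small HLL per element
  .reduce(_ + _)                                  // HLLs merge monoidally
println(approxDistinct.estimatedSize)             // approximate distinct count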
Mmm, how many days' worth of data / how deep is your data nesting?
I suspect you're running into a current issue with Parquet (a fix is in
master but I don't believe it's released yet..). It reads all the metadata to
the submitter node as part of scheduling the job. This can cause long start
times (timeouts t
I would guess the field serializer is having issues being able to
reconstruct the class again; it's pretty much best effort.
Is this an intermediate type?
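If that is what's happening, one workaround (purely a sketch; MyIntermediate stands in for whatever type is being spilled, and it would need to be java.io.Serializable for this particular fallback) is to register an explicit serializer for it rather than relying on Kryo's best-effort FieldSerializer:

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // fall back to plain Java serialization for the problematic class
    kryo.register(classOf[MyIntermediate], new JavaSerializer())
  }
}

// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "MyRegistrator")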
On Thu, Sep 25, 2014 at 2:12 PM, Sandy Ryza wrote:
> We're running into an error (below) when trying to read spilled shuffle
> data back in.
What's the error with the 2.10 version of Algebird?
On Thu, Oct 30, 2014 at 12:49 AM, thadude wrote:
> I've tried:
>
> ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
>
> scala> import com.twitter.algebird._
> import com.twitter.algebird._
>
> scala> import HyperLogLog._
> import HyperLog
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env.
On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev wrote:
> Thanks.. I was using Scala 2.11.1 and was able to
> use algebird-core_2.10-0.1.11.jar with spark-shell.
>
> On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell
object MyCoreNLP {
  @transient lazy val coreNLP = new CoreNLP()   // your CoreNLP wrapper/pipeline class
}
and then refer to it from your map/reduce/mapPartitions and it should
be fine (presuming it's thread safe); it will only be initialized once per
classloader per JVM.
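For example, something along these lines (docs and annotateDoc are placeholders for whatever per-record work you're doing):

val annotated = docs.mapPartitions { iter =>
  // MyCoreNLP.coreNLP is initialized lazily, at most once per executor JVM,
  // and is never shipped as part of the closure
  iter.map(doc => annotateDoc(MyCoreNLP.coreNLP, doc))
}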
On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks wrote:
> We ha
I'm guessing the other result was wrong, or just never evaluated here. The
RDD transforms being lazy may have let it be expressed, but it wouldn't
work. Nested RDDs are not supported.
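i.e. roughly speaking, the first form below can never work and has to be restructured as a join (the RDD and field names are made up):

// not supported: one RDD referenced inside another RDD's transformation
// val bad = orders.map(o => users.filter(_.id == o.userId).first())

// supported: key both sides and join them
val joined = orders.map(o => (o.userId, o))
  .join(users.map(u => (u.id, u)))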
On Mon, Mar 17, 2014 at 4:01 PM, anny9699 wrote:
> Hi Andrew,
>
> Thanks for the reply. However I did almost t
Objects being transformed need to be one of these in flight. Source data can
just use the MapReduce input formats, so anything you can do with mapred.
For doing an Avro one, you probably want one of:
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantb
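If you'd rather not pull in elephant-bird, a rough sketch of reading Avro through the new-API Hadoop input format directly (the path and the use of GenericRecord here are assumptions):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
    AvroKeyInputFormat[GenericRecord]]("hdfs:///some/path/*.avro")
  .map { case (key, _) => key.datum() }   // unwrap to the GenericRecord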
A mutable map in an object should do what you're looking for then, I believe.
You just reference the object as an object in your closure, so it won't be
swept up when your closure is serialized, and you can then reference variables
of the object on the remote host. e.g.:
object MyObject {
  // key/value types here are just for illustration
  val mmap = scala.collection.mutable.Map.empty[String, Int]
}
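Usage then looks roughly like this (bearing in mind every executor JVM has its own copy of the map, so it is not a shared global, and concurrent tasks in one JVM will race on it unless you synchronize):

rdd.foreach { x =>
  // MyObject isn't serialized with the closure; this mutates the copy
  // living in whichever executor JVM happens to run the task
  MyObject.mmap(x.toString) = MyObject.mmap.getOrElse(x.toString, 0) + 1
}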
I think the distinction there might be that they never said they ran that code
under CDH5, just that Spark supports it and Spark runs under CDH5, not that
you can use these features while running under CDH5.
They could use Mesos or the standalone scheduler to run them.
On Tue, May 6, 2014 at 6:16 AM,