Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-20 Thread Ian O'Connell
Ravi, did your issue ever get solved? I think I've been hitting the same thing; it looks like the spark.sql.autoBroadcastJoinThreshold logic isn't kicking in as expected. If I set it to -1, the computation proceeds successfully. On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal wrote
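
A minimal sketch of the workaround described above, assuming a Spark 2.0 SparkSession bound to `spark` (as in spark-shell):

    // Disable automatic broadcast joins; a workaround if Spark mis-estimates
    // a join side as small enough to broadcast and OOMs while building it.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // The same setting can also be passed at submit time:
    //   spark-submit --conf spark.sql.autoBroadcastJoinThreshold=-1 ...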

Re: RDD with a Map

2014-06-03 Thread Ian O'Connell
So if your data can be kept in memory on the driver node, do you really need Spark? If you just want it for Hadoop reading, I'd call collect immediately after you open the RDD, and then you can do normal Scala collections operations. On Tue, Jun 3, 2014 at 2:56 PM, Amit Kumar wrote: > Hi
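
A sketch of that approach, with a hypothetical HDFS path and record format, assuming the data fits in driver memory:

    // Use Spark only as a Hadoop reader, then collect to the driver and
    // work with plain Scala collections from there.
    val pairs: Map[String, Int] =
      sc.textFile("hdfs:///some/path/part-*")
        .map { line =>
          val Array(k, v) = line.split("\t")
          (k, v.toInt)
        }
        .collect()
        .toMap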

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Ian O'Connell
Depending on your requirements, when computing distinct cardinality for hourly metrics a much more scalable method is a HyperLogLog data structure. A Scala implementation people have used with Spark is https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/
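
A hedged sketch of the HLL approach with Algebird; the `events` RDD of (hour, userId) pairs and the 12-bit precision are assumptions:

    import com.twitter.algebird.HyperLogLogMonoid

    val hll = new HyperLogLogMonoid(12) // roughly 1.6% standard error at 12 bits

    // Approximate distinct users per hour without holding full sets in memory.
    val hourlyDistinct =
      events
        .map { case (hour, userId) => (hour, hll.create(userId.getBytes("UTF-8"))) }
        .reduceByKey(hll.plus(_, _))
        .mapValues(_.approximateSize.estimate)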

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Ian O'Connell
Mmm, how many days' worth of data, and how deep is your data nesting? I suspect you're running into a current issue with Parquet (a fix is in master but I don't believe it's released yet). It reads all the metadata on the submitter node as part of scheduling the job. This can cause long start times (timeouts t
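
One hedged workaround until that fix ships: read only the partitions you need, so the driver has fewer Parquet footers to load. The bucket and layout below are hypothetical, using the Spark 1.1-era SQLContext API:

    // Restrict the read to one day instead of the whole dataset.
    val recent = sqlContext.parquetFile("s3n://my-bucket/events/2014-09-07")
    recent.registerTempTable("recent_events")
    sqlContext.sql("SELECT COUNT(*) FROM recent_events").collect().foreach(println)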

Re: Kryo UnsupportedOperationException

2014-09-25 Thread Ian O'Connell
I would guess the field serializer is having trouble reconstructing the class; it's pretty much best-effort. Is this an intermediate type? On Thu, Sep 25, 2014 at 2:12 PM, Sandy Ryza wrote: > We're running into an error (below) when trying to read spilled shuffle > data back in.
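
If FieldSerializer can't cope (for example, a type with no zero-arg constructor), a sketch of the usual escape hatch is registering an explicit serializer for just that type; `Checkpoint` here is a hypothetical stand-in for the intermediate type:

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.serializers.JavaSerializer
    import org.apache.spark.serializer.KryoRegistrator

    case class Checkpoint(id: Long, payload: Array[Byte])

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // Fall back to Java serialization for the one problematic type.
        kryo.register(classOf[Checkpoint], new JavaSerializer)
      }
    }

    // then set spark.kryo.registrator=MyRegistrator in the Spark conf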

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
What's the error with the 2.10 version of Algebird? On Thu, Oct 30, 2014 at 12:49 AM, thadude wrote: > I've tried: > > ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar > > scala> import com.twitter.algebird._ > import com.twitter.algebird._ > > scala> import HyperLogLog._ > import HyperLog

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 environment. On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev wrote: > Thanks.. I was using Scala 2.11.1 and was able to > use algebird-core_2.10-0.1.11.jar with spark-shell. > > On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell
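
A sketch of a session where the Scala binary versions line up (a Spark build on Scala 2.10 with the matching algebird-core_2.10 jar; the values are illustrative):

    // started with: ./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
    import com.twitter.algebird._
    import HyperLogLog._

    val hll = new HyperLogLogMonoid(12)
    val approx = hll.create("user-42".getBytes("UTF-8"))
    println(approx.approximateSize.estimate)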

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Ian O'Connell
object MyCoreNLP { @transient lazy val coreNLP = new StanfordCoreNLP() } and then refer to it from your map/reduce/mapPartitions and it should be fine (presuming it's thread safe); it will only be initialized once per classloader per JVM. On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks wrote: > We ha
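
A slightly fuller sketch of that pattern; the annotator list is an assumption:

    import java.util.Properties
    import edu.stanford.nlp.pipeline.StanfordCoreNLP

    object MyCoreNLP {
      // Built lazily on first use in each executor JVM, never shipped
      // from the driver, and shared by all tasks in that JVM.
      @transient lazy val pipeline: StanfordCoreNLP = {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, pos")
        new StanfordCoreNLP(props)
      }
    }

    // used inside a transformation, e.g.:
    // rdd.map { text => MyCoreNLP.pipeline.process(text) ... }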

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread Ian O'Connell
I'm guessing the other result was wrong, or just never evaluated here. The RDD transforms being lazy may have let it be expressed, but it wouldn't work: nested RDDs are not supported. On Mon, Mar 17, 2014 at 4:01 PM, anny9699 wrote: > Hi Andrew, > > Thanks for the reply. However I did almost t
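
To make the failure mode concrete, a hedged sketch (rdd1 and rdd2 are hypothetical):

    // Not supported: an RDD referenced inside another RDD's closure.
    // rdd1.map { x => rdd2.filter(_ == x).count() }

    // A working alternative when one side fits in memory: collect and broadcast it.
    val small = sc.broadcast(rdd2.collect().toSet)
    val matches = rdd1.filter(x => small.value.contains(x))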

Re: Avro serialization

2014-04-03 Thread Ian O'Connell
Objects being transformed need to be one of these in flight. Source data can just use the MapReduce input formats, so anything you can do with mapred works. For doing an Avro one here you probably want one of: https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantb
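
A sketch of reading Avro through the plain MapReduce input format (avro-mapred rather than elephant-bird; the path and the "name" field are assumptions):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("hdfs:///data/events.avro")

    // Extract a field from each record; the Avro objects themselves should
    // not be cached or shuffled without copying, since the reader reuses them.
    val names = records.map { case (k, _) => k.datum().get("name").toString }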

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Ian O'Connell
A mutable map in an object should do what you're looking for then, I believe. You just reference the object in your closure, so it won't be swept up when your closure is serialized, and you can then reference the object's variables on the remote host. e.g.: object MyObject { val mmap =
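
A completed sketch of that pattern; the key type and usage are assumptions:

    import scala.collection.mutable

    object MyObject {
      // Lives once per executor JVM; the closure captures only a reference
      // to the object, not the map's contents.
      val mmap = mutable.Map.empty[String, Int]
    }

    // Hypothetical use inside a task, synchronized since tasks share the JVM:
    // rdd.foreach { k =>
    //   MyObject.mmap.synchronized {
    //     MyObject.mmap(k) = MyObject.mmap.getOrElse(k, 0) + 1
    //   }
    // }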

Re: Spark and Java 8

2014-05-06 Thread Ian O'Connell
I think the distinction there might be that they never said they ran that code under CDH5, just that Spark supports it and Spark runs under CDH5; not that you can use these features while running under CDH5. They could use Mesos or the standalone scheduler to run them. On Tue, May 6, 2014 at 6:16 AM,