Re: ISpark class not found

2014-11-12 Thread Laird, Benjamin
books/Scala/Untitled0.ipynb How did you start the notebook? Thanks & Regards, Meethu M On Wednesday, 12 November 2014 6:50 AM, "Laird, Benjamin" <benjamin.la...@capitalone.com> wrote: I've been experimenting with the ISpark extension to IScala (https://git

ISpark class not found

2014-11-11 Thread Laird, Benjamin
I've been experimenting with the ISpark extension to IScala (https://github.com/tribbloid/ISpark). Objects created in the REPL are not being loaded correctly on worker nodes, leading to a ClassNotFoundException. This does work correctly in spark-shell. I was curious if anyone has used ISpark an
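The failure described above can be shown with a minimal REPL session (a hypothetical repro; the `Point` class and the data are made up, and `sc` is the context the notebook provides):

```scala
// Hypothetical repro: a class defined in the REPL is used inside a closure.
// spark-shell ships REPL-generated classes to executors via its class server;
// if ISpark does not wire that up, the map below fails on the workers with
// java.lang.ClassNotFoundException for Point.
case class Point(x: Double, y: Double)

val points = sc.parallelize(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))
val norms = points.map(p => math.sqrt(p.x * p.x + p.y * p.y))
norms.collect()  // works in spark-shell; fails under ISpark per the report
```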

Re: AVRO specific records

2014-11-05 Thread Laird, Benjamin
Something like this works and is how I create an RDD of specific records. val avroRdd = sc.newAPIHadoopFile("twitter.avro", classOf[AvroKeyInputFormat[twitter_schema]], classOf[AvroKey[twitter_schema]], classOf[NullWritable], conf) (From https://github.com/julianpeeters/avro-scala-macro-annotat
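Filled out, the snippet above looks roughly like this (a sketch: `twitter_schema` is the Avro-generated specific record class from the thread, and `sc` is an existing SparkContext):

```scala
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable

val conf = new Configuration()
// Read the Avro file as (AvroKey[twitter_schema], NullWritable) pairs.
val avroRdd = sc.newAPIHadoopFile(
  "twitter.avro",
  classOf[AvroKeyInputFormat[twitter_schema]],
  classOf[AvroKey[twitter_schema]],
  classOf[NullWritable],
  conf)

// Unwrap the AvroKey to get an RDD of the specific record type itself.
val records = avroRdd.map { case (k, _) => k.datum() }
```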

Re: Executor Memory, Task hangs

2014-08-19 Thread Laird, Benjamin
Thanks Akhil and Sean. All three workers are doing the work and tasks stall simultaneously on all three. I think Sean hit on my issue. I've been under the impression that each application has one executor process per worker machine (not per core per machine). Is that incorrect? If an executor i

Executor Memory, Task hangs

2014-08-19 Thread Laird, Benjamin
Hi all, I'm doing some testing on a small dataset (HadoopRDD, 2GB, ~10M records) with a cluster of 3 nodes. Simple calculations like count take approximately 5s when using the default value of executor.memory (512MB). When I scale this up to 2GB, several tasks take 1m or more (while most still
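The setting under discussion is per executor JVM, not per core. A minimal sketch of raising it when building the context (assuming Spark 1.x standalone mode, as in the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In standalone mode an application gets one executor per worker machine;
// spark.executor.memory is the heap of that single JVM, shared by all task
// slots on the node -- it is not a per-core amount.
val conf = new SparkConf()
  .setAppName("count-test")
  .set("spark.executor.memory", "2g")  // total heap per executor JVM
val sc = new SparkContext(conf)
```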

Re: Avro Schema + GenericRecord to HadoopRDD

2014-07-30 Thread Laird, Benjamin
nd (I've done it for Cascading/Scalding). -Chris From: Laird, Benjamin [benjamin.la...@capitalone.com] Sent: Tuesday, July 29, 2014 8:00 AM To: user@spark.apache.org; u...@spark.incubator.apache.org Subject: Avro Schema + GenericRecor

Avro Schema + GenericRecord to HadoopRDD

2014-07-29 Thread Laird, Benjamin
Hi all, I can read in Avro files to Spark with HadoopRDD and submit the schema in the jobConf, but with the guidance I've seen so far, I'm left with an Avro GenericRecord of untyped Java objects. How do I actually use the schema to have the types inferred? Example: scala> AvroJob.setInputSc
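One way to get from GenericRecord to typed values is to cast each field to what the schema declares, since `GenericRecord.get` only returns `Object`. A sketch of the pattern (field names `id` and `name` are hypothetical, and `sc` is an existing SparkContext):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val jobConf = new JobConf()
FileInputFormat.addInputPath(jobConf, new Path("twitter.avro"))

// Old-API Avro input: (AvroWrapper[GenericRecord], NullWritable) pairs.
val rdd = sc.hadoopRDD(
  jobConf,
  classOf[AvroInputFormat[GenericRecord]],
  classOf[AvroWrapper[GenericRecord]],
  classOf[NullWritable])

val typed = rdd.map { case (wrapper, _) =>
  val rec = wrapper.datum()
  // get() returns Object; Avro strings come back as org.apache.avro.util.Utf8,
  // so call toString; numeric fields must be cast to their schema type.
  (rec.get("id").asInstanceOf[Long], rec.get("name").toString)
}
```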

User/Product Clustering with pySpark ALS

2014-04-29 Thread Laird, Benjamin
Hi all - I’m using pySpark/MLlib ALS for user/item clustering and would like to directly access the user/product RDDs (called userFeatures/productFeatures in class MatrixFactorizationModel, in mllib/recommendation/MatrixFactorizationModel.scala). This doesn’t seem too complex, but it doesn’t seem l
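On the Scala side those RDDs are public fields on the model, so the access being asked about is straightforward there (a sketch; the file path and CSV layout are hypothetical, and `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(u, p, r) = line.split(',')
  Rating(u.toInt, p.toInt, r.toDouble)
}
val model = ALS.train(ratings, 10, 20)

// RDD[(Int, Array[Double])]: one latent-feature vector per user / product.
// These can be fed directly into a clustering algorithm such as KMeans.
val userVecs = model.userFeatures
val productVecs = model.productFeatures
```

pySpark of this era wrapped the model in a Java handle and did not expose these fields, which is presumably why they seem unreachable from Python.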

RE: running SparkALS

2014-04-28 Thread Laird, Benjamin
Good clarification, Sean. Diana, I was also referring to this example when setting up some of my bigger ALS runs. I don't think this particular example is very helpful, as it creates the initial matrix locally in memory before parallelizing it in Spark. So (unless I'm misunderstanding), it is an ok ex
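The limitation described above is that driver memory bounds the problem size when the matrix is built locally first. Generating (or loading) the data inside a distributed transformation avoids that; a sketch with hypothetical synthetic data:

```scala
val n = 10000000  // number of ratings to generate

// Driver-bound (what the SparkALS example effectively does):
//   val local = sc.parallelize(generateFullMatrix(n))   // fills driver heap first
// Distributed alternative: parallelize only indices, then have each
// partition generate its own slice of the data on the executors.
val ratings = sc.parallelize(0 until n, 100).map { i =>
  (i % 50000, i % 2000, scala.util.Random.nextDouble * 5)  // (user, item, rating)
}
```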

RE: help

2014-04-28 Thread Laird, Benjamin
Joe, do you have your SPARK_HOME variable set correctly in the spark-env.sh script? I was getting that error when first setting up my cluster; it turned out I had to make some changes in the spark-env script to get things working correctly. Ben -Original Message- From: Joe L [mailt
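The fix mentioned above amounts to a couple of export lines in conf/spark-env.sh on each node (paths here are hypothetical; adjust for your install):

```shell
# conf/spark-env.sh -- sourced by the Spark daemons and by spark-shell
export SPARK_HOME=/opt/spark        # hypothetical install location
export JAVA_HOME=/usr/lib/jvm/java  # workers need this set as well
```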

Running large join in ALS example through PySpark

2014-04-22 Thread Laird, Benjamin
Hello all - I'm running the ALS/Collaborative Filtering code through pySpark on Spark 0.9.0. (http://spark.apache.org/docs/0.9.0/mllib-guide.html#using-mllib-in-python) My data file has about 27M tuples (User, Item, Rating). ALS.train(ratings,1,30) runs on my 3-node cluster (24 cores, 60GB RAM)
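The Python `ALS.train` call forwards to the Scala MLlib API, which also takes a `blocks` argument that controls how users and products are partitioned for the shuffle-heavy factor joins; raising it spreads a large join across more, smaller tasks. A sketch (values and file layout are hypothetical):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("ratings.txt").map { line =>
  val Array(u, p, r) = line.split(',')
  Rating(u.toInt, p.toInt, r.toDouble)
}

// blocks = 100 partitions the 27M ratings into more, smaller join tasks
// than the default, which can help when the per-task shuffle is too large.
val model = ALS.train(ratings, rank = 1, iterations = 30, lambda = 0.01, blocks = 100)
```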