Re: Spark and Stanford CoreNLP

2014-11-25 Thread Evan Sparks
If you only mark it as transient, then the object won't be serialized, and on the worker the field will be null; when the worker goes to use it, you get an NPE. Marking it lazy defers initialization to first use. If that first use happens after serialization time (e.g. on the worker), then the object is constructed fresh on the worker and the NPE goes away.
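
The `@transient lazy val` pattern can be demonstrated without Spark by round-tripping an object through plain Java serialization. `HeavyResource` and `Wrapper` below are illustrative stand-ins (not from the thread) for a CoreNLP pipeline and the closure that carries it:

```scala
import java.io._

// Hypothetical stand-in for an expensive, non-serializable resource
// (e.g. a CoreNLP pipeline). The name is illustrative only.
class HeavyResource {
  val label: String = "initialized"
}

class Wrapper extends Serializable {
  // transient: the field is skipped during serialization.
  // lazy: it is not constructed until first access, so after
  // deserialization on a worker it is rebuilt there instead of being null.
  @transient lazy val resource: HeavyResource = new HeavyResource
}

// Serialize and deserialize, as Spark would when shipping a closure.
def roundTrip(w: Wrapper): Wrapper = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(w)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[Wrapper]
}

val copy = roundTrip(new Wrapper)
// Lazy initialization happens here, after deserialization — no NPE.
println(copy.resource.label)
```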

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan Sparks
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization cost, it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example. As an aside, if you're using it from Scala, have a look at sistanlp, which provides a Scala-friendly wrapper around CoreNLP.
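
The per-partition function that `.mapPartitions` would call can be sketched in plain Scala as an `Iterator => Iterator` transformation. `ExpensiveAnnotator` is a hypothetical stand-in for a CoreNLP pipeline:

```scala
// Hypothetical stand-in for an expensive-to-build annotator (e.g. CoreNLP).
class ExpensiveAnnotator {
  def annotate(s: String): String = s.toUpperCase
}

// One annotator per partition, reused for every record in that partition —
// this is the function you would pass to rdd.mapPartitions.
def annotatePartition(records: Iterator[String]): Iterator[String] = {
  val annotator = new ExpensiveAnnotator // built once per partition, not per record
  records.map(annotator.annotate)
}

println(annotatePartition(Iterator("spark", "nlp")).toList)
```

With Spark this would be invoked as `rdd.mapPartitions(annotatePartition)`, so the construction cost is paid once per partition rather than once per record as with `.map`.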

Re: MLlib linking error Mac OS X

2014-10-20 Thread Evan Sparks
MLlib relies on Breeze for much of its linear algebra, which in turn relies on netlib-java. netlib-java will attempt to load a system-native BLAS at runtime, and then attempt to load its own precompiled native version. Failing that, it will fall back to the pure-Java implementation it has built in. The Java version is significantly slower than a native BLAS.
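
One common way to get the precompiled natives onto the classpath is to depend on netlib-java's "all" artifact; a build.sbt sketch (the version number is illustrative, check for the current release):

```scala
// build.sbt fragment: pulls in netlib-java's native reference/system
// artifacts so Breeze can find an optimized BLAS instead of falling
// back to the pure-Java implementation.
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
```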

Re: Spark speed performance

2014-10-18 Thread Evan Sparks
How many files do you have, and how big is each JSON object? Spark works better with a few big files than with many smaller ones, so you could try cat'ing your files together and rerunning the same experiment.

- Evan

> On Oct 18, 2014, at 12:07 PM, wrote:
>
> Hi,
>
> I have a program that I have

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Evan Sparks
Try s3n:// instead.

> On Aug 6, 2014, at 12:22 AM, sparkuser2345 wrote:
>
> I'm getting the same "Input path does not exist" error also after setting the
> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
> the format "s3:///test_data.txt" for the input file.

Re: MLLib sample data format

2014-06-22 Thread Evan Sparks
Oh, and the MovieLens one is userid::movieid::rating.

- Evan

> On Jun 22, 2014, at 3:35 PM, Justin Yip wrote:
>
> Hello,
>
> I am looking into a couple of MLlib data files in
> https://github.com/apache/spark/tree/master/data/mllib, but I cannot find any
> explanation for these files. Does a
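
A minimal sketch of parsing that `userid::movieid::rating` layout; the helper name is illustrative:

```scala
// Parse one MovieLens-style rating line of the form "userid::movieid::rating".
def parseRating(line: String): (Int, Int, Double) = {
  val Array(user, movie, rating) = line.split("::")
  (user.toInt, movie.toInt, rating.toDouble)
}

println(parseRating("196::242::3.0")) // (196,242,3.0)
```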

Re: MLLib sample data format

2014-06-22 Thread Evan Sparks
These files follow the libsvm format, where each line is a record, the first column is a label, and each field after that is offset:value, where offset is the offset into the feature vector and value is the value of that input feature. This is a fairly efficient representation for sparse binary data.
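
A sketch of parsing one such line in plain Scala (in practice MLlib's own libsvm loader would do this for you; the helper name here is illustrative):

```scala
// Parse one libsvm-format line: "<label> <offset>:<value> <offset>:<value> ..."
def parseLibsvmLine(line: String): (Double, Seq[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.toSeq.map { t =>
    val Array(offset, value) = t.split(":")
    (offset.toInt, value.toDouble) // offset into the feature vector, its value
  }
  (label, features)
}

println(parseLibsvmLine("1 3:0.5 7:2.0"))
```

Fields with value zero are simply omitted, which is what makes the format efficient for sparse data.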

Re: how to save RDD partitions in different folders?

2014-04-04 Thread Evan Sparks
Have a look at MultipleOutputs in the Hadoop API. Spark can read and write to arbitrary Hadoop formats.

> On Apr 4, 2014, at 6:01 AM, dmpour23 wrote:
>
> Hi all,
> Say I have an input file which I would like to partition using
> HashPartitioner k times.
>
> Calling rdd.saveAsTextFile("hdfs:
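
Stripped of the Hadoop machinery, the underlying idea is just "bucket each record by its partition key and write each bucket under its own folder." A plain-Scala sketch (all names illustrative; in Spark you would reach for MultipleOutputs or a custom OutputFormat instead):

```scala
import java.io.{File, PrintWriter}

// Write (key, value) records so that each key gets its own folder,
// mirroring what a per-partition output layout would look like on disk.
def savePartitioned(records: Seq[(String, String)], baseDir: File): Unit = {
  records.groupBy(_._1).foreach { case (key, group) =>
    val dir = new File(baseDir, key)
    dir.mkdirs()
    val out = new PrintWriter(new File(dir, "part-00000"))
    try group.foreach { case (_, value) => out.println(value) }
    finally out.close()
  }
}
```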