If you only mark it as transient, then the object won't be serialized, and on
the worker the field will be null. When the worker goes to use it, you get an
NPE.
Marking it lazy as well defers initialization to first use. If that use happens
after serialization time (e.g. on the worker), then the object is constructed
fresh on the worker instead of being shipped over from the driver.
We have gotten this to work, but it requires instantiating the CoreNLP object
on the worker side. Because of the initialization time it makes a lot of sense
to do this inside of a .mapPartitions instead of a .map, for example.
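For example, a minimal sketch of the mapPartitions approach (the RDD
`documents` and the annotator list are placeholders; the pipeline calls are the
standard CoreNLP API):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// documents: RDD[String] -- build the pipeline once per partition so the
// expensive model loading happens a few times per executor, not per record.
val annotated = documents.mapPartitions { docs =>
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos")
  val pipeline = new StanfordCoreNLP(props)

  docs.map { text =>
    val annotation = new Annotation(text)
    pipeline.annotate(annotation)
    annotation
  }
}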
As an aside, if you're using it from Scala, have a look at sistanlp, which
wraps CoreNLP in a friendlier Scala API.
MLlib relies on Breeze for much of its linear algebra, which in turn relies on
netlib-java. netlib-java will attempt to load a system-native BLAS at runtime,
and then attempt to load its own precompiled version. Failing that, it will
fall back to a pure-Java implementation that it has built in. The Java version
is noticeably slower than a native BLAS.
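If you want the native path, one common route is to pull the netlib-java
natives onto the classpath yourself. A sketch for an sbt build (version numbers
are only indicative, and pomOnly() is the usual qualifier since "all" is
published as a pom artifact):

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided",
  // Pulls in the netlib-java JNI loaders so a native BLAS/LAPACK can be
  // picked up at runtime instead of the pure-Java fallback.
  "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
)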
How many files do you have and how big is each JSON object?
Spark works better with a few big files vs many smaller ones. So you could try
cat'ing your files together and rerunning the same experiment.
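If shuffling the files around by hand is a pain, a Spark-side equivalent is to
read everything once, coalesce, and write a consolidated copy to re-run
against (a sketch; the paths and partition count are placeholders):

// Read the many small files with a glob, collapse to a few partitions,
// and write out a consolidated copy.
val raw = sc.textFile("s3n://my-bucket/json-small/*")
raw.coalesce(16).saveAsTextFile("s3n://my-bucket/json-combined")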
- Evan
> On Oct 18, 2014, at 12:07 PM,
> wrote:
>
> Hi,
>
> I have program that I have
Try s3n://
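Something like this, for instance (the bucket name is a placeholder; the
fs.s3n.* properties are the Hadoop config keys the s3n filesystem reads, and
note the bucket name has to sit between the scheme and the path):

// Credentials can come from the environment or be set explicitly on the
// Hadoop configuration Spark uses for s3n:// paths.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val data = sc.textFile("s3n://my-bucket/test_data.txt")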
> On Aug 6, 2014, at 12:22 AM, sparkuser2345 wrote:
>
> I'm getting the same "Input path does not exist" error also after setting the
> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using
> the format "s3:///test_data.txt" for the input file.
>
Oh, and the movie lens one is userid::movieid::rating
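So parsing it for ALS looks roughly like this (a sketch; the path and the
rank/iterations/lambda values are placeholders):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Each line is userid::movieid::rating, so split on "::" and build Ratings.
val ratings = sc.textFile("path/to/movielens.data").map { line =>
  val fields = line.split("::")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}

val model = ALS.train(ratings, 10, 10, 0.01)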
- Evan
> On Jun 22, 2014, at 3:35 PM, Justin Yip wrote:
>
> Hello,
>
> I am looking into a couple of MLLib data files in
> https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any
> explanation for these files? Does a
These files follow the libsvm format: each line is a record, the first column
is the label, and the remaining fields are offset:value pairs, where offset is
the index into the feature vector and value is the value of that input
feature.
This is a fairly efficient representation for sparse data, since zero-valued
features simply aren't stored.
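MLlib can read files in this format directly, e.g. (the path points at one of
the files in that directory):

import org.apache.spark.mllib.util.MLUtils

// Each line: <label> <offset1>:<value1> <offset2>:<value2> ...
// loadLibSVMFile parses that into LabeledPoints with sparse feature vectors.
val examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
examples.take(2).foreach(println)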
Have a look at MultipleOutputs in the Hadoop API. Spark can read from and
write to arbitrary Hadoop formats.
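For the case below, one pattern (a sketch; the key/value types, output path,
and k are placeholders, and MultipleTextOutputFormat is the old mapred-API
sibling of MultipleOutputs) is to hash-partition the pair RDD and let the
output format choose a file per key:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Route each record to an output file named after its key.
class KeyedOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    s"part-$key"
}

val k = 8
pairs // an RDD[(String, String)], keyed however you want the data split
  .partitionBy(new HashPartitioner(k))
  .saveAsHadoopFile("hdfs:///out/partitioned",
    classOf[String], classOf[String], classOf[KeyedOutput])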
> On Apr 4, 2014, at 6:01 AM, dmpour23 wrote:
>
> Hi all,
> Say I have an input file which I would like to partition using
> HashPartitioner k times.
>
> Calling rdd.saveAsTextFile("hdfs: