Hi Divya,
I guess the error is thrown from spark-csv: it tries to parse the string
"null" as a double.
The workaround is to add the nullValue option, like .option("nullValue",
"null"). However, the nullValue feature is not included in the current
spark-csv 1.3 release. Just check out the master branch of spark-csv and use
that.
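For example, a minimal sketch with a master build of spark-csv (the
sqlContext name and the file path "data.csv" are hypothetical):

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("nullValue", "null")   // treat the literal string "null" as null
    .load("data.csv")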
Hi Francis,
From my observation, when using Spark SQL, dataframe.limit(n) does not
necessarily return the same result each time an application runs.
To be more precise, within one application the result should be the same for
the same n; however, changing n might not yield the same prefix (the result
for n = 1 is not necessarily a prefix of the result for a larger n).
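A minimal sketch of what I mean (df is a hypothetical, unordered DataFrame
and "id" a hypothetical column):

  val first1 = df.limit(1).collect()
  val first5 = df.limit(5).collect()
  // first1(0) is not guaranteed to appear in first5, because limit just
  // takes whichever rows arrive first from the partitions.
  // An explicit ordering makes the prefix deterministic:
  val prefix5 = df.orderBy("id").limit(5).collect()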
Hi Ricky,
In your first try you are using flatMap, which gives you a flat list of
strings. You then try to map each individual string to a Row, which
definitely throws an exception.
Following Terry's idea, you instead map the input to a list of arrays, each
of which contains the fields as strings. Then you can turn each array into a
Row.
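A minimal sketch of the second approach (lines is a hypothetical
RDD[String] of comma-separated records with two fields):

  import org.apache.spark.sql.Row

  val rows = lines
    .map(_.split(","))          // RDD[Array[String]], one array per record
    .map(a => Row(a(0), a(1)))  // build one Row per record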
Hi Sarath,
It might be questionable to set num-executors to 64 if you only have 8
nodes. Do you use any action like "collect"? That would overwhelm the
driver, since you have a large dataset.
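If you do, a minimal sketch of how to keep the data off the driver (data is
a hypothetical RDD and the output path is made up):

  val preview = data.take(100)               // bounded amount on the driver
  data.saveAsTextFile("hdfs:///tmp/output")  // results stay distributed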
Thanks
On Tue, Apr 28, 2015 at 10:50 AM, sarath wrote:
>
> I am trying to train a large dataset consisting
Hi Oleg,
For 1, RDD#union will help. You can iterate over the folders and union the
resulting RDDs as you go; see the sketch below.
For 2, it seems it won't work in a deterministic way, according to this
discussion (http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontex
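A minimal sketch for 1 (sc is the SparkContext; folders is a hypothetical
Seq[String] of input directories):

  val perFolder = folders.map(path => sc.textFile(path))
  val combined = sc.union(perFolder)   // same as chaining RDD#union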
Hi Dan,
In
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala,
you can see that Spark uses Utils.nonNegativeMod(term.##, numFeatures) to
locate a term.
It's also mentioned in the doc that it "Maps a sequence of terms to their
term frequencies using the hashing trick."
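A minimal sketch of that behaviour (numFeatures = 1000 is an arbitrary
choice):

  import org.apache.spark.mllib.feature.HashingTF

  val tf = new HashingTF(numFeatures = 1000)
  // indexOf applies Utils.nonNegativeMod(term.##, numFeatures), so the same
  // term always lands in the same bucket and distinct terms can collide.
  val idx = tf.indexOf("spark")
  // transform counts terms per bucket and returns a sparse vector.
  val vector = tf.transform(Seq("spark", "hashing", "tf", "spark"))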