Re: [Help]: DataframeNAfunction fill method throwing exception

2016-03-01 Thread ai he
Hi Divya, I guess the error is thrown from spark-csv: spark-csv tries to parse the string "null" as a double. The workaround is to add the nullValue option, like .option("nullValue", "null"). But this nullValue feature is not included in the current spark-csv 1.3 release. Just check out the master of spark-csv and use
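A minimal sketch of that workaround, assuming PySpark on Spark 1.x with a spark-csv build that includes the nullValue option; the file path, schema options, and fill value below are placeholders, not from the original thread:

```python
# Hedged sketch: read a CSV whose missing values are written as the literal
# string "null", turning them into real nulls so DataFrame NA functions work.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-null-example")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("nullValue", "null")   # treat the literal string "null" as a real null
      .load("hdfs:///path/to/data.csv"))  # placeholder path

# With real nulls in place, na.fill behaves as expected.
df_filled = df.na.fill(0.0)
```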

Re: Sporadic "Input validation failed" error when executing LogisticRegressionWithLBFGS.train

2015-08-11 Thread ai he
Hi Francis, From my observation when using Spark SQL, dataframe.limit(n) does not necessarily return the same result on each application run. To be more precise, within one application the result should be the same for a given n; however, changing n might not give you the same prefix (the result for n = 1
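A small PySpark sketch of that observation, with made-up data: for a fixed n, limit(n) is repeatable within one run, but limit(1) is not guaranteed to be a prefix of limit(2) on a multi-partition DataFrame unless an ordering is imposed first.

```python
# Hedged illustration only; the data, partition count, and n values are invented.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="limit-example")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    sc.parallelize([Row(id=i) for i in range(100)], 8))

first_one = df.limit(1).collect()
first_two = df.limit(2).collect()

# first_one may or may not equal first_two[:1]; do not rely on any particular
# prefix unless you sort explicitly, e.g. df.orderBy("id").limit(2).collect().
print(first_one, first_two)
```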

Re: Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-28 Thread ai he
Hi Ricky, In your first try you use flatMap, which gives you a flat list of strings. You then try to map each string to a Row, which definitely throws an exception. Following Terry's idea, you map the input to a list of arrays, each of which contains some strings. Then you
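A hedged PySpark sketch contrasting the two approaches; the input path, delimiter, and column names are placeholders (in the original Scala context, treating a single field as a whole record is what surfaces as the StringIndexOutOfBoundsException):

```python
# Illustration of flatMap vs. map when building Rows from text lines.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="row-mapping-example")
sqlContext = SQLContext(sc)

lines = sc.textFile("hdfs:///path/to/input.txt")  # placeholder path

# Problematic: flatMap flattens every line into individual fields, so each
# element is one field string, not a whole record; code that expects a full
# record (e.g. indexing far into it) will fail.
flat_fields = lines.flatMap(lambda line: line.split("\t"))

# Following Terry's idea: keep one array (list) of fields per line, then
# build a Row from each array.
rows = (lines
        .map(lambda line: line.split("\t"))
        .map(lambda fields: Row(col0=fields[0], col1=fields[1])))  # placeholder columns

df = sqlContext.createDataFrame(rows)
```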

Re: MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread ai he
Hi Sarath, It might be questionable to set num-executors to 64 if you only have 8 nodes. Are you using any action like "collect", which will overwhelm the driver since you have a large dataset? Thanks On Tue, Apr 28, 2015 at 10:50 AM, sarath wrote: > > I am trying to train a large dataset consisting
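As an illustration only (not Sarath's actual job), a PySpark sketch that trains SVMWithSGD and inspects a small sample with take() rather than pulling the whole dataset to the driver with collect(); the input path and iteration count are placeholders, and --num-executors would normally be kept in line with the real cluster size at submit time.

```python
# Hedged sketch: train an SVM on a libsvm file and avoid driver-side collect().
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="svm-example")

# Placeholder path; cache since SGD makes multiple passes over the data.
data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/training.libsvm").cache()

model = SVMWithSGD.train(data, iterations=100)

# Inspect only a handful of predictions instead of collect()-ing everything.
preds = data.map(lambda p: (p.label, model.predict(p.features)))
print(preds.take(5))
```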

Re: multiple hdfs folder & files input to PySpark

2015-05-05 Thread Ai He
Hi Oleg, For 1, RDD#union will help. You can iterate over the folders and union the obtained RDDs along the way. For 2, it seems like it won't work in a deterministic way according to this discussion (http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontex
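A minimal PySpark sketch of point 1, with placeholder folder names: build one RDD per folder and union them, either iteratively or in one call via SparkContext.union.

```python
# Hedged sketch: combine several HDFS folders into a single RDD.
from pyspark import SparkContext

sc = SparkContext(appName="multi-folder-input")

folders = ["hdfs:///data/2015-05-01",   # placeholder folder list
           "hdfs:///data/2015-05-02",
           "hdfs:///data/2015-05-03"]

combined = sc.textFile(folders[0])
for folder in folders[1:]:
    combined = combined.union(sc.textFile(folder))

# Equivalently, SparkContext.union takes a list of RDDs directly:
# combined = sc.union([sc.textFile(f) for f in folders])
print(combined.count())
```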

Re: question about the TFIDF.

2015-05-07 Thread ai he
Hi Dan, In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala, you can see spark uses Utils.nonNegativeMod(term.##, numFeatures) to locate a term. It's also mentioned in the doc that " Maps a sequence of terms to their term frequencies