Hello,
I used Mahout for text classification and now I'm trying Spark.
I had the same problem training Bayes with (only) 569 documents.
I solved it by doing htf = HashingTF(5000) instead of htf = HashingTF() (default
feature space 2^20). I don't know if it can be considered a long-term
solution (w
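For context, the fix above just shrinks the hashed feature space. The sketch below is my own minimal re-implementation of the hashing trick that HashingTF applies (Spark uses a different hash function internally, so bucket indices will not match HashingTF's), only to show how the constructor argument bounds the vector dimension:

```python
# Minimal hashing-trick sketch (illustrative only; not Spark's HashingTF).
def hash_vectorize(tokens, num_features=5000):
    """Map tokens to term-frequency counts in a fixed-size sparse vector."""
    counts = {}
    for tok in tokens:
        idx = hash(tok) % num_features  # bucket index in [0, num_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts  # sparse {index: count} representation

doc = "spark naive bayes out of memory spark".split()
vec = hash_vectorize(doc, num_features=5000)
assert all(0 <= i < 5000 for i in vec)  # dimension is capped at 5000
```

Passing 5000 instead of the default 2^20 cuts the model size proportionally, at the cost of more hash collisions.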
For the vectorizer, what's the output feature dimension and are you
creating sparse vectors or dense vectors? The model on the driver
consists of numClasses * numFeatures doubles. However, the driver
needs more memory in order to receive the task result (of the same
size) from executors. So you nee
Hi,
I was able to get the training running in local mode with default settings;
there was a problem with the document labels, which were quite large (not 20 as
suggested earlier).
I am currently training 175,000 documents on a single node with 2GB of
executor memory and 5GB of driver memory successfully.
Xiangrui, Thanks for replying.
I am using the subset of newsgroup20 data. I will send you the vectorized
data for analysis shortly.
I have tried running in local mode as well, but I get the same OOM exception.
I started with 4GB of data but then moved to a smaller set to verify that
everything was
Your dataset is small. NaiveBayes should work under the default
settings, even in local mode. Could you try local mode first without
changing any Spark settings? Since your dataset is small, could you
save the vectorized data (RDD[LabeledPoint]) and send me a sample? I
want to take a look at the fea
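One portable way to share vectorized data like that is the LibSVM text format, which MLlib can read and write (e.g. via MLUtils.saveAsLibSVMFile in Scala). The helper below is a hypothetical pure-Python formatter, included only to show what one line of that format looks like:

```python
# Hypothetical formatter for one LibSVM line:
# "<label> <index1>:<value1> <index2>:<value2> ..." with 1-based indices.
def to_libsvm_line(label, sparse):
    """sparse: dict of {0-based feature index: value}."""
    feats = " ".join(f"{i + 1}:{v:g}" for i, v in sorted(sparse.items()))
    return f"{label:g} {feats}"

print(to_libsvm_line(3, {0: 1.0, 41: 2.5}))  # → "3 1:1 42:2.5"
```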
I get the following stacktrace if it is of any help.
14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7:
List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7
(MapPartitionsRDD[24] at combineByK
Xiangrui,
Yes, the total number of terms is 43839. I have also tried running it using
different values of parallelism ranging from 1/core to 10/core. I also used
multiple configurations, like setting spark.storage.memoryFraction and
spark.shuffle.memoryFraction to default values. The point to note
d that is unnerving for
> me. I can't use the HashingTF available with Spark due to the resultant
> decrease in accuracy, but with this feature size, I expect Spark to run
> easily. Thanks, Jatin
> Novice Big Data Programmer
Hi,
I have been facing an unusual issue with Naive Bayes training. I run out
of heap space even with limited data during the training phase. I am trying
to run the same on a rudimentary cluster of two development machines in
standalone mode. I am reading data from an HBase table, converting them in
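For reference, the memory settings that eventually worked later in this thread (2GB per executor, 5GB for the driver) can be passed at submit time. A sketch, with the master URL and script name as placeholders:

```shell
spark-submit \
  --master spark://<master-host>:7077 \
  --driver-memory 5g \
  --executor-memory 2g \
  train_naive_bayes.py
```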