Re: lda times out prior to writing topics

Steven Cullens Fri, 14 Mar 2014 10:13:37 -0700

I misspelled dictionary when calling cvb.  Thanks for your help, Suneel.
Otherwise, I'd be at the office on the weekend trying to figure out the
problem.



On Fri, Mar 14, 2014 at 12:41 PM, Suneel Marthi <[email protected]>wrote:

> the issue is that the numTerms in dictionary is 0.
>
> learning for LDA on
> reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
> document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
> to run will be 2, unless the change in perplexity is less than 0.0.  Topic
> model output (p(term|topic) for each topic) will be stored
>
> Could you check ur seq2sparse output and the terms in the dictionary ?
>
>
>
>
>
>
>
>
> On Friday, March 14, 2014 12:16 PM, Steven Cullens <[email protected]>
> wrote:
>
> Hi,
>
> I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the
> examples/bin/cluster-reuters.sh script, but manually entering commands
> because the script crashes.  Data preparation runs smoothly, but when I
> call cvb, it times out prior to writing topics.  Any ideas?  Thanks in
> advance and here are the commands:
>
> # comparison of output of seqdumper with raw text files was fine
> bin/mahout seqdirectory -i
> file:///home/hduser/software/mahout-distribution-0.9/examples/reuters-out
> -o reuters-lda/seqfile -c UTF-8 -chunk 64 -xm sequential
>
> # output had dictionary and a frequency file.  tf-vector part file had text
> file name for key and vector of large number (word id?) : integer (word
> count?)
> bin/mahout seq2sparse -i reuters-lda/seqfile -o reuters-lda/vectors -ow
> --maxDFPercent 85 --namedVector
>
> # key was replaced by integer: Key: 0: Value:
> /reut2-000.sgm-0.txt:{26587:3.0,19426:6.0,41154:1.0
> bin/mahout rowid -i reuters-lda/vectors/tf-vectors -o
> reuters-lda/reuters-matrix
>
> # times out prior to writing topics.
> bin/mahout cvb -i reuters-lda/reuters-matrix/matrix -o reuters-lda/lda -k 5
> -ow -x 2 -dict reuters-lda/vectors/dictonary.file-* -dt reuters-lda/topics
> -mt reuters-lda/model
>
> Here's the output of the last step:
>
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /home/hduser/software/hadoop-1.1.1/bin/hadoop and
> HADOOP_CONF_DIR=/home/hduser/software/hadoop-1.1.1/conf
> MAHOUT-JOB:
> /home/hduser/software/mahout-distribution-0.9/mahout-examples-0.9-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 14/03/14 11:41:58 WARN driver.MahoutDriver: No cvb.props found on
> classpath, will use command-line arguments only
> 14/03/14 11:41:58 INFO common.AbstractJob: Command line arguments:
> {--convergenceDelta=[0.0],
> --dictionary=[reuters-lda/vectors/dictonary.file-*],
> --doc_topic_output=[reuters-lda/topics], --doc_topic_smoothing=[1.0E-4],
> --endPhase=[2147483647], --input=[reuters-lda/reuters-matrix/matrix],
> --iteration_block_size=[10], --maxIter=[2], --max_doc_topic_iters=[10],
> --num_reduce_tasks=[10], --num_topics=[5], --num_train_threads=[4],
> --num_update_threads=[1], --output=[reuters-lda/lda], --overwrite=null,
> --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[1.0E-4],
> --test_set_fraction=[0.0], --topic_model_temp_dir=[reuters-lda/model]}
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes
> (0th-derivative approximation) learning for LDA on
> reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
> document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
> to run will be 2, unless the change in perplexity is less than 0.0.  Topic
> model output (p(term|topic) for each topic) will be stored
> reuters-lda/lda.  Random initialization seed is 7962, holding out 0.0 of
> the data for perplexity check
>
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: Dictionary to be used located
> reuters-lda/vectors/dictonary.file-*
> p(topic|docId) will be stored reuters-lda/topics
>
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: Current iteration number: 0
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run iteration 1 of 2
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run: Iteration 1 of 2,
> input path: reuters-lda/model/model-0
> 14/03/14 11:41:59 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/03/14 11:41:59 INFO mapred.JobClient: Running job: job_201403131444_0034
> 14/03/14 11:42:00 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:42:11 INFO mapred.JobClient:  map 86% reduce 0%
> 14/03/14 11:42:14 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/14 11:42:22 INFO mapred.JobClient:  map 100% reduce 3%
> 14/03/14 11:42:23 INFO mapred.JobClient:  map 100% reduce 6%
> 14/03/14 11:42:24 INFO mapred.JobClient:  map 100% reduce 20%
> 14/03/14 11:42:32 INFO mapred.JobClient:  map 100% reduce 30%
> 14/03/14 11:42:33 INFO mapred.JobClient:  map 100% reduce 40%
> 14/03/14 11:42:40 INFO mapred.JobClient:  map 100% reduce 43%
> 14/03/14 11:42:41 INFO mapred.JobClient:  map 100% reduce 60%
> 14/03/14 11:42:48 INFO mapred.JobClient:  map 100% reduce 66%
> 14/03/14 11:42:49 INFO mapred.JobClient:  map 100% reduce 73%
> 14/03/14 11:42:50 INFO mapred.JobClient:  map 100% reduce 80%
> 14/03/14 11:42:57 INFO mapred.JobClient:  map 100% reduce 83%
> 14/03/14 11:42:58 INFO mapred.JobClient:  map 100% reduce 100%
> 14/03/14 11:42:59 INFO mapred.JobClient: Job complete:
> job_201403131444_0034
> 14/03/14 11:42:59 INFO mapred.JobClient: Counters: 29
> 14/03/14 11:42:59 INFO mapred.JobClient:   Job Counters
> 14/03/14 11:42:59 INFO mapred.JobClient:     Launched reduce tasks=10
> 14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12784
> 14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/03/14 11:42:59 INFO mapred.JobClient:     Launched map tasks=1
> 14/03/14 11:42:59 INFO mapred.JobClient:     Data-local map tasks=1
> 14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=85618
> 14/03/14 11:42:59 INFO mapred.JobClient:   File Output Format Counters
> 14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Written=1040
> 14/03/14 11:42:59 INFO mapred.JobClient:   FileSystemCounters
> 14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
> 14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6924921
> 14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=279071
> 14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
> 14/03/14 11:42:59 INFO mapred.JobClient:   File Input Format Counters
> 14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Read=6924788
> 14/03/14 11:42:59 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map output materialized
> bytes=178
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map input records=21578
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
> 14/03/14 11:42:59 INFO mapred.JobClient:     Spilled Records=10
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map output bytes=30
> 14/03/14 11:42:59 INFO mapred.JobClient:     CPU time spent (ms)=7510
> 14/03/14 11:42:59 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=323293184
> 14/03/14 11:42:59 INFO mapred.JobClient:     Combine input records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input groups=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Combine output records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=579133440
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce output records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=4194500608
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map output records=5
> 14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run iteration 2 of 2
> 14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run: Iteration 2 of 2,
> input path: reuters-lda/model/model-1
> 14/03/14 11:43:00 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/03/14 11:43:00 INFO mapred.JobClient: Running job: job_201403131444_0035
> 14/03/14 11:43:01 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:43:12 INFO mapred.JobClient:  map 82% reduce 0%
> 14/03/14 11:43:15 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/14 11:43:22 INFO mapred.JobClient:  map 100% reduce 6%
> 14/03/14 11:43:23 INFO mapred.JobClient:  map 100% reduce 13%
> 14/03/14 11:43:24 INFO mapred.JobClient:  map 100% reduce 20%
> 14/03/14 11:43:30 INFO mapred.JobClient:  map 100% reduce 23%
> 14/03/14 11:43:32 INFO mapred.JobClient:  map 100% reduce 40%
> 14/03/14 11:43:39 INFO mapred.JobClient:  map 100% reduce 43%
> 14/03/14 11:43:40 INFO mapred.JobClient:  map 100% reduce 50%
> 14/03/14 11:43:41 INFO mapred.JobClient:  map 100% reduce 60%
> 14/03/14 11:43:49 INFO mapred.JobClient:  map 100% reduce 66%
> 14/03/14 11:43:50 INFO mapred.JobClient:  map 100% reduce 80%
> 14/03/14 11:43:57 INFO mapred.JobClient:  map 100% reduce 83%
> 14/03/14 11:43:59 INFO mapred.JobClient:  map 100% reduce 100%
> 14/03/14 11:43:59 INFO mapred.JobClient: Job complete:
> job_201403131444_0035
> 14/03/14 11:43:59 INFO mapred.JobClient: Counters: 29
> 14/03/14 11:43:59 INFO mapred.JobClient:   Job Counters
> 14/03/14 11:43:59 INFO mapred.JobClient:     Launched reduce tasks=10
> 14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11051
> 14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/03/14 11:43:59 INFO mapred.JobClient:     Launched map tasks=1
> 14/03/14 11:43:59 INFO mapred.JobClient:     Data-local map tasks=1
> 14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=87458
> 14/03/14 11:43:59 INFO mapred.JobClient:   File Output Format Counters
> 14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Written=1040
> 14/03/14 11:43:59 INFO mapred.JobClient:   FileSystemCounters
> 14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
> 14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6925961
> 14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=288058
> 14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
> 14/03/14 11:43:59 INFO mapred.JobClient:   File Input Format Counters
> 14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Read=6924788
> 14/03/14 11:43:59 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map output materialized
> bytes=178
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map input records=21578
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
> 14/03/14 11:43:59 INFO mapred.JobClient:     Spilled Records=10
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map output bytes=30
> 14/03/14 11:43:59 INFO mapred.JobClient:     CPU time spent (ms)=7080
> 14/03/14 11:43:59 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=323293184
> 14/03/14 11:43:59 INFO mapred.JobClient:     Combine input records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input groups=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Combine output records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=579375104
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce output records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=4192706560
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map output records=5
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: Completed 2 iterations in 120
> seconds
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: Perplexities: ()
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
> topic/term distributions from reuters-lda/model/model-2 to reuters-lda/lda
> 14/03/14 11:43:59 INFO input.FileInputFormat: Total input paths to process
> : 10
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
> document/topic inference from reuters-lda/reuters-matrix/matrix to
> reuters-lda/topics
> 14/03/14 11:44:00 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/03/14 11:44:00 INFO mapred.JobClient: Running job: job_201403131444_0036
> 14/03/14 11:44:01 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:44:11 INFO mapred.JobClient:  map 10% reduce 0%
> 14/03/14 11:44:12 INFO mapred.JobClient:  map 20% reduce 0%
> 14/03/14 11:44:14 INFO mapred.JobClient:  map 40% reduce 0%
> 14/03/14 11:44:16 INFO mapred.JobClient:  map 60% reduce 0%
> 14/03/14 11:44:19 INFO mapred.JobClient:  map 80% reduce 0%
> 14/03/14 11:44:21 INFO mapred.JobClient:  map 90% reduce 0%
> 14/03/14 11:44:22 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/14 11:44:22 INFO mapred.JobClient: Job complete:
> job_201403131444_0036
> 14/03/14 11:44:22 INFO mapred.JobClient: Counters: 19
> 14/03/14 11:44:22 INFO mapred.JobClient:   Job Counters
> 14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30038
> 14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/03/14 11:44:22 INFO mapred.JobClient:     Launched map tasks=10
> 14/03/14 11:44:22 INFO mapred.JobClient:     Data-local map tasks=10
> 14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/03/14 11:44:22 INFO mapred.JobClient:   File Output Format Counters
> 14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Written=1040
> 14/03/14 11:44:22 INFO mapred.JobClient:   FileSystemCounters
> 14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_READ=2420
> 14/03/14 11:44:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250140
> 14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
> 14/03/14 11:44:22 INFO mapred.JobClient:   File Input Format Counters
> 14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Read=1040
> 14/03/14 11:44:22 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/14 11:44:22 INFO mapred.JobClient:     Map input records=5
> 14/03/14 11:44:22 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=364630016
> 14/03/14 11:44:22 INFO mapred.JobClient:     Spilled Records=0
> 14/03/14 11:44:22 INFO mapred.JobClient:     CPU time spent (ms)=800
> 14/03/14 11:44:22 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=162529280
> 14/03/14 11:44:22 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=3778887680
> 14/03/14 11:44:22 INFO mapred.JobClient:     Map output records=5
> 14/03/14 11:44:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1380
> 14/03/14 11:44:22 INFO mapred.JobClient: Running job: job_201403131444_0037
> 14/03/14 11:44:23 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:54:28 INFO mapred.JobClient: Task Id :
> attempt_201403131444_0037_m_000000_0, Status : FAILED
> org.apache.mahout.math.IndexException: Index 26587 is outside allowable
> range of [0,0)
>     at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:416)
>     at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Task attempt_201403131444_0037_m_000000_0 failed to report status for 600
> seconds. Killing!
> attempt_201403131444_0037_m_000000_0: log4j:WARN No appenders could be
> found for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201403131444_0037_m_000000_0: log4j:WARN Please initialize the
> log4j system properly.
> 14/03/14 12:04:35 INFO mapred.JobClient: Task Id :
> attempt_201403131444_0037_m_000000_1, Status : FAILED
> org.apache.mahout.math.IndexException: Index 26587 is outside allowable
> range of [0,0)
>     at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:416)
>     at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Task attempt_201403131444_0037_m_000000_1 failed to report status for 600
> seconds. Killing!
> attempt_201403131444_0037_m_000000_1: log4j:WARN No appenders could be
> found for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201403131444_0037_m_000000_1: log4j:WARN Please initialize the
> log4j system properly.
>

Re: lda times out prior to writing topics

Reply via email to