how to extract feature vectors.

2013-04-14 Thread Sachin Kulkarni
Dear all, I would like to extract feature vectors for each document that is relevant to a query and write them out to a file. Is there a way in Lucene where I can specify a parameter to do this? Or which part of the code deals with the feature vectors related to the documents so that I can modify th…
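There is no single parameter for this, but on the Lucene 4.x APIs discussed on this list, per-document term vectors are the usual building block for such feature vectors, provided the field was indexed with term vectors enabled. A minimal sketch, assuming a field named "body" and a PrintWriter `out` (both names are assumptions, not from the original message):

```java
// Sketch: for each hit of a query, dump (docID, term, frequency) triples,
// which can serve as a sparse feature vector for the document.
// Requires the field to have been indexed with FieldType.setStoreTermVectors(true).
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    Terms vector = reader.getTermVector(sd.doc, "body"); // null if no term vector stored
    if (vector == null) continue;
    TermsEnum te = vector.iterator(null);
    BytesRef term;
    while ((term = te.next()) != null) {
        out.printf("%d\t%s\t%d%n", sd.doc, term.utf8ToString(), te.totalTermFreq());
    }
}
```

Weighting (e.g. TF-IDF instead of raw frequency) would need the collection statistics from the reader as well; the snippet above only emits raw per-document term frequencies.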

Re: DiskDocValuesFormat

2013-04-14 Thread Wei Wang
Strange. That's all I got from the log besides the first line I wrote to show the start of merging with a timestamp. On Sun, Apr 14, 2013 at 4:58 PM, Robert Muir wrote: > Your stack trace is incomplete: it doesn't even show where the OOM > occurred. > > On Sun, Apr 14, 2013 at 7:48 PM, Wei Wang wro…

Re: DiskDocValuesFormat

2013-04-14 Thread Robert Muir
Your stack trace is incomplete: it doesn't even show where the OOM occurred. On Sun, Apr 14, 2013 at 7:48 PM, Wei Wang wrote: > Unfortunately, I got another problem. My index has 9 segments (9 dvdd > files) with a total size of about 22GB. The merging step eventually failed > and I saw an error me…

Re: DiskDocValuesFormat

2013-04-14 Thread Wei Wang
Unfortunately, I got another problem. My index has 9 segments (9 dvdd files) with a total size of about 22GB. The merging step eventually failed and I saw an error message: Exception in thread "main" java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot complete forceMerge…

Re: DiskDocValuesFormat

2013-04-14 Thread Wei Wang
That makes sense. BTW, I checked the jar file. Exactly as you pointed out, the services files only contain info from lucene-core, without the codec from lucene-codecs. After adding the Maven plugin, it is now running. Thanks! On Sun, Apr 14, 2013 at 3:26 PM, Uwe Schindler wrote: > Hi, > > > Thank…

RE: DiskDocValuesFormat

2013-04-14 Thread Uwe Schindler
Hi, > Thanks for the hint. I will double check the jar file. > > I am just a bit puzzled that if the indexing step recognizes the 'Disk' codec and > creates the index properly, the merge step that immediately follows indexing > seems like it should also recognize the 'Disk' codec. This is easy to explain: By cr…
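The asymmetry Uwe explains comes from how the two paths find the format: the writer holds a direct class reference, while the read/merge path looks the format up by the name recorded in the index, via Java SPI. A short sketch of the lookup that fails (the exact exception message is paraphrased, not quoted):

```java
// The read path resolves the format by name through META-INF/services
// (org.apache.lucene.codecs.DocValuesFormat). If the merged "fat" JAR
// kept only lucene-core's services file, the lucene-codecs entries are
// missing and this lookup throws IllegalArgumentException, even though
// indexing with a direct class reference worked fine.
DocValuesFormat dvf = DocValuesFormat.forName("Disk");
```

This is why the failure only appears at merge/open time: that is the first moment the name-based SPI lookup is actually exercised.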

Re: DiskDocValuesFormat

2013-04-14 Thread Wei Wang
Thanks for the hint. I will double check the jar file. I am just a bit puzzled that if the indexing step recognizes the 'Disk' codec and creates the index properly, the merge step that immediately follows indexing seems like it should also recognize the 'Disk' codec. On Sun, Apr 14, 2013 at 3:03 PM, Uwe Schindle…

RE: DiskDocValuesFormat

2013-04-14 Thread Uwe Schindler
Are you sure that you use the ServicesResourceTransformer in your shade config? http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer The problem is: lucene-core.jar and lucene-codecs.jar both contain codec components and their classes…
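For reference, the transformer Uwe links to is enabled in the POM roughly like this (version 2.0 matches the build log quoted later in the thread; the execution details are a typical setup, not copied from the poster's POM):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Concatenates META-INF/services files from all input JARs
               instead of letting one JAR's copy overwrite the other's -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Without this transformer the shaded JAR keeps only one of the competing META-INF/services files, which is exactly the symptom confirmed later in the thread.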

Re: DiskDocValuesFormat

2013-04-14 Thread Wei Wang
Yes, I used the Maven Shade plugin, but I still have this problem. Here is the Maven output during packaging: [INFO] --- maven-shade-plugin:2.0:shade (default) @ audience-profile-indexer --- [INFO] Including commons-collections:commons-collections:jar:3.2.1 in the shaded jar. [INFO] Including org.mockit…

RE: DiskDocValuesFormat

2013-04-14 Thread Uwe Schindler
If you create a single JAR file out of multiple Lucene JAR files, use a tool like the Maven Shade plugin; otherwise, the required metadata (META-INF/services) files in the JAR files are not merged together correctly. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.…

Re: DiskDocValuesFormat

2013-04-14 Thread Wei Wang
Hi Adrien, The Lucene42Codec works well to generate the index with DiskDocValuesFormat. But when I tried to merge the index segments by calling: IndexWriter iw = new IndexWriter(directory, iw_config); ... iw.forceMerge(1); I got the following error message: Caused by: java.lang.IllegalArgumentE…
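For context, the usual Lucene 4.2-era way to route a field's doc values through the "Disk" format is to override the per-field hook on Lucene42Codec. A hedged sketch of the setup implied by this thread; the analyzer choice and `directory` variable are assumptions:

```java
// Sketch: select DiskDocValuesFormat (by SPI name "Disk") for all fields.
// forceMerge later has to resolve "Disk" by name again, which is where the
// missing META-INF/services entry discussed in this thread bites.
Codec codec = new Lucene42Codec() {
    @Override
    public DocValuesFormat getDocValuesFormatForField(String field) {
        return DocValuesFormat.forName("Disk");
    }
};
IndexWriterConfig iwConfig =
    new IndexWriterConfig(Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
iwConfig.setCodec(codec);
IndexWriter iw = new IndexWriter(directory, iwConfig);
// ... add documents ...
iw.forceMerge(1); // fails with IllegalArgumentException if "Disk" is not on the SPI classpath
```

Note that the codec only needs to be set on the writer; readers discover the format per segment from the name written into the index.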

SmartChineseAnalyzer & JapaneseAnalyzer description/paper

2013-04-14 Thread Lucenius
Hi community, I am looking for a description or paper about the SmartChineseAnalyzer and the JapaneseAnalyzer. The SmartChineseAnalyzer uses (Hierarchical?) Hidden Markov Models? The JapaneseAnalyzer (Kuromoji) uses Conditional Random Fields? Thx. :)