Re: [EASY]How to change the demo of lucene143 into a multithread one?

2009-08-13 Thread Glen Newton
You are optimizing before the threads are finished adding to the index. I think this should work: IndexWriter writer = new IndexWriter("D:\\index", new StandardAnalyzer(), true); File file=new File(args[0]); Thread t1=new Thread(new IndexFiles(writer,file)); Thread t2=new Thread(new IndexFiles(wri
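The ordering issue described above (optimizing before the worker threads have finished adding) can be sketched without Lucene on the classpath. This is only an illustration under stated assumptions: `FakeWriter` is a hypothetical stand-in for a shared, thread-safe writer and is not part of any Lucene API; the point is just that every worker is joined before the final step runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class JoinBeforeOptimize {

    /** Hypothetical stand-in for a shared, thread-safe index writer (not a Lucene class). */
    static class FakeWriter {
        final AtomicInteger added = new AtomicInteger();
        volatile int docsAtOptimize = -1;

        void addDocument() { added.incrementAndGet(); }

        /** Records how many documents were present when the final step ran. */
        void optimize() { docsAtOptimize = added.get(); }
    }

    /** Starts the workers, waits for all of them, and only then runs the final step. */
    static int indexWithThreads(int threads, int docsPerThread) {
        FakeWriter writer = new FakeWriter();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                for (int d = 0; d < docsPerThread; d++) {
                    writer.addDocument();
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // block until this worker has finished adding
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        writer.optimize(); // safe: no worker is still adding documents
        return writer.docsAtOptimize;
    }
}
```

With a real writer the same shape would apply: start the threads, join each one, then optimize and close.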

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread rishisinghal
I am running this on OpenVMS V8.2-1 on IA64. For a small number of files this all works fine. I checked the resources and I have enough disk and RAM available. Regards, Rishi Michael McCandless-2 wrote: > > It's very odd that CheckIndex has no trouble opening the segment's > files, yet

Re: AW: Wildcard search fails

2009-08-13 Thread AHMET ARSLAN
> we used different analyzers and regenerated the index each > time with the same results...used Luke each time already. > Currently we're using SnowBall and Luke can't find any > documents using the supplied query examples below (in > zzz-all). > > Same happened using StandardAnalyzer (for both,

Simple tf cosine similarity

2009-08-13 Thread Claudio Gennaro
I would like to know if there is a simple way to force Lucene to adopt the simple cosine similarity of the term frequency vectors of the documents and the query for ranking the result. In practice the score sc_i of the document i should be given by: sc_i = (D_i*Q)/(|D_i|*|Q|) where D_i = vector o
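As a worked instance of the formula sc_i = (D_i*Q)/(|D_i|*|Q|), here is a minimal standalone sketch that computes the cosine of two term-frequency maps. It does not plug into Lucene's scoring and makes no claim about how to configure a Similarity; the class and method names are made up for illustration.

```java
import java.util.Map;

public class TfCosine {

    /** Cosine of two term-frequency vectors: (D*Q) / (|D| * |Q|). */
    static double cosine(Map<String, Integer> d, Map<String, Integer> q) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : d.entrySet()) {
            Integer qf = q.get(e.getKey());
            if (qf != null) {
                dot += (double) e.getValue() * qf; // only shared terms contribute
            }
        }
        double denom = norm(d) * norm(q);
        return denom == 0.0 ? 0.0 : dot / denom; // treat an empty vector as score 0
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }
}
```

For example, with D = {lucene: 2, index: 1} and Q = {lucene: 1}, the dot product is 2, |D| is sqrt(5) and |Q| is 1, so the score is 2/sqrt(5).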

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-13 Thread Shai Erera
In 2.9 there will be - IndexWriter#getReader(). BTW, note that even if someone deletes, your reader may not see this delete. If you use IndexWriter to delete docs, the open reader won't see those deletes. So you may still have a problem. I don't know how much stuff users can index, and how often

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-13 Thread Daniel Shane
Users can index really a lot of stuff, so I'd like not to keep things in memory for too long. Even if I keep a set of things added, how do I know if something has been deleted via a delete? It seems rather difficult to keep this set of documents added in sync with the index reader on the index

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-13 Thread Shai Erera
How many documents do you index between reader refreshes? If it's not too many, I'd keep a Set of those terms and check every incoming document against the set and then the reader. Note that the set keeps only the terms of those documents your reader doesn't see. You should clear() it after yo
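The suggestion above (check the in-memory Set first, then the reader, and clear the Set after a refresh) might look roughly like the sketch below. Everything here is hypothetical: `UniquenessGuard` is an invented name, and the `Predicate<String>` stands in for whatever lookup against the open reader the application uses.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

public class UniquenessGuard {

    /** Keys added since the last reader refresh, i.e. not yet visible to the reader. */
    private final Set<String> pending = new HashSet<>();

    /** Hypothetical stand-in for "does the current reader already contain this key?". */
    private final Predicate<String> visibleInReader;

    public UniquenessGuard(Predicate<String> visibleInReader) {
        this.visibleInReader = visibleInReader;
    }

    /** Returns true if the key is new and was recorded; false if it is a duplicate. */
    public synchronized boolean tryAdd(String key) {
        if (pending.contains(key) || visibleInReader.test(key)) {
            return false;
        }
        pending.add(key);
        return true;
    }

    /** Call after the reader is reopened: it now sees everything that was pending. */
    public synchronized void onReaderRefreshed() {
        pending.clear();
    }

    public synchronized int pendingSize() {
        return pending.size();
    }
}
```

The Set only grows between refreshes, so its memory footprint is bounded by how many documents are indexed per refresh interval. Deletes made through a separate path are not tracked by this sketch.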

Is there a way to check for field "uniqueness" when indexing?

2009-08-13 Thread Daniel Shane
Hi all! I'm currently running a big lucene index and one of my main concerns is the integrity of the data entered. A few things come to mind, like enforcing that certain fields be non-blank, forcing certain formats etc... All these validations are easy to do with lucene, since I can validate

AW: Wildcard search fails

2009-08-13 Thread Ueli Kistler
Hello, we used different analyzers and regenerated the index each time with the same results...used Luke each time already. Currently we're using SnowBall and Luke can't find any documents using the supplied query examples below (in zzz-all). Same happened using StandardAnalyzer (for both, inde

Re: Generating Query

2009-08-13 Thread AHMET ARSLAN
> hm...try tat...but doesn't seems to be working for me though Discarding lengthNorm didn't work for you. Very interesting. I am not sure, but I think inverse document frequency is causing the problem. Probably one of the query words (a very common word) has a high document frequency, and the docs having

Re: Term Extraction

2009-08-13 Thread Grant Ingersoll
I would just throw your doc into a MemoryIndex (lives in contrib/memory, I think; it only holds one doc), get the Vector and do what you need to do. So you would kind of be doing indexing, but not really. On Aug 13, 2009, at 8:43 AM, joe_coder wrote: Grant, thanks for responding. My i

Re: Wildcard search fails

2009-08-13 Thread Erick Erickson
Several, all of which boil down to "what analyzers are you using during indexing and searching?". Without that information, we can't say much. Also, I'd recommend you get a copy of Luke and examine your index to see whether what's in there is what you expect. And query.toString and (as Grant says)

Re: Term Extraction

2009-08-13 Thread joe_coder
For example, I am able to do Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer TokenStream ts = analyzer.tokenStream("myfield",new StringReader("some text goes here")); Token t = ts.next(); while (t!=null) { System.out.println("token: "+t); t

Re: Term Extraction

2009-08-13 Thread joe_coder
Grant, thanks for responding. My issue is that I am not planning to use Lucene (as I don't need any search capability, at least not yet). All I have is a text document and I need to extract keywords and their frequency (which could be a simple split on space and tracking the count). But I realize th
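The "simple split on space and tracking the count" baseline mentioned above could be sketched as follows. This is deliberately naive and uses no Lucene classes: it lowercases and splits on whitespace only, with no stemming, stopword removal or synonym handling. The class name is invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class NaiveTermFrequency {

    /** Counts whitespace-separated tokens, case-insensitively, in first-seen order. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace can produce an empty first token
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }
}
```

Punctuation stays attached to tokens in this version ("dog." and "dog" count separately), which is one reason a real analyzer chain is attractive for this task.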

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread Michael McCandless
It's very odd that CheckIndex has no trouble opening the segment's files, yet when you run optimize the OS reports a "file not found" exception (errno 5). Something odd is happening at the OS/filesystem level. What OS are you running on? Can you boil this down to a smallish standalone test that

Re: Term Extraction

2009-08-13 Thread Grant Ingersoll
On Aug 13, 2009, at 7:40 AM, joe_coder wrote: I was wondering if there is any way to directly use Lucene API to extract terms from a given string. My requirement is that I have a text document for which I need a term frequency vector ( after stemming, removing stopwords and synonyms che

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread rishisinghal
I tried creating the index on different disks but I still see the issue :-( I tried to index documents on other disks also and got the same exception. I also tried $ java org.apache.lucene.index.CheckIndex /SYS$SYSDEVICE/RISHI/melon_1600/ -segment _61 NOTE: testing will be more thorough if you

Wildcard search fails

2009-08-13 Thread Ueli Kistler
Hello, We're experiencing a problem using Lucene 2.4.1 and Compass 2.1.4 using wildcard search. Attribute values containing slashes can be searched using the full word, but not using wildcards. We already tried different analyzers with the same result. Slash isn't mentioned as a stop word onl

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread Shai Erera
I noticed the exception is "Caused by: java.io.FileNotFoundException: /SYS$SYSDEVICE/RISHI/melon_1600/_61.cfs (i/o error (errno:5))" I searched for i/o error (errno:5) and found some information which associates it w/ a more native IO problem, like corrupt file due to system crash etc. Did you ex

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread rishisinghal
It is a local file system. We are using lucene 2.4 and java 1.5 Regards, Rishi Shai Erera wrote: > > Is that a local file system, or a network share? > > On Thu, Aug 13, 2009 at 1:07 PM, rishisinghal > wrote: > >> >> >>Is there any chance that two writers are open on this directory? >> No,

Term Extraction

2009-08-13 Thread joe_coder
I was wondering if there is any way to directly use Lucene API to extract terms from a given string. My requirement is that I have a text document for which I need a term frequency vector ( after stemming, removing stopwords and synonyms checks ). The result needs to be the terms and frequency. I

Re: [EASY]How to change the demo of lucene143 into a multithread one?

2009-08-13 Thread Amin Mohammed-Coleman
Hi I have recently created an indexing reference project using Spring Integration. May not help you with what you're doing but it might be interesting for creating asynchronous indexing using JMS. http://code.google.com/p/lucene-indexing-with-si/ Cheers Amin On Thu, Aug 13, 2009 at 11:53 AM,

Re: Lucene Vs Sphinx benchmarking for large dataset

2009-08-13 Thread Anshum
Thanks Simon! :) -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw On Thu, Aug 13, 2009 at 3:58 PM, Simon Willnauer < simon.willna...@googlemail.com> wrote: > On Thu, Aug 13, 20

[EASY]How to change the demo of lucene143 into a multithread one?

2009-08-13 Thread Chuan SHI
Hi all, I am new to multi-threaded programming and Lucene. I want to change the indexing demo of lucene143 into a multi-threaded one. I create one instance of IndexWriter which is shared by three threads. But I find that the time it takes when three threads are used is approximately three times of

Re: Lucene Vs Sphinx benchmarking for large dataset

2009-08-13 Thread Simon Willnauer
On Thu, Aug 13, 2009 at 12:24 PM, Anshum wrote: > Hey Simon, > Thanks for the comment, though would be great to have the comment @ the > blog! :) done! Simon > About testing vanilla sphinx Vs Sphinx, have that pipelined but would be > some time before I go ahead and do that. > I'm also planning a

Re: Lucene Vs Sphinx benchmarking for large dataset

2009-08-13 Thread Anshum
Hey Simon, Thanks for the comment, though would be great to have the comment @ the blog! :) About testing vanilla sphinx Vs Sphinx, have that pipelined but would be some time before I go ahead and do that. I'm also planning a benchmarking (of search & indexing) of 2.4 & 2.9 (when it's here) with the

Re: Generating Query

2009-08-13 Thread bourne71
hm...try tat...but doesn't seems to be working for me though Ahmet Arslan wrote: > >> I am trying to boost  results that have all the query >> in it to increase its ranking. But both the query unfortunately does not >> > seems to effect it > > Did you read last two messages on this thread? > >

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread Shai Erera
Is that a local file system, or a network share? On Thu, Aug 13, 2009 at 1:07 PM, rishisinghal wrote: > > >>Is there any chance that two writers are open on this directory? > No, thats not true. > > >>something external to Lucene is removing files from the directory. > No this also has rare chanc

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread rishisinghal
>>Is there any chance that two writers are open on this directory? No, that's not true. >>something external to Lucene is removing files from the directory. No, this is also unlikely, as I am the owner of these files and no one other than me can delete them :-) Here are all the files in the

Re: Generating Query

2009-08-13 Thread AHMET ARSLAN
> I am trying to boost  results that have all the query > in it to increase its ranking. But both the query unfortunately does not > > seems to effect it Did you read last two messages on this thread? http://www.nabble.com/Generating-Query-for-Multiple-Clauses-in-a-Single-Field-td24694748.html

Re: Indexer crashes with "hit exception during merge"

2009-08-13 Thread Michael McCandless
Is there any chance that two writers are open on this directory? Or, something external to Lucene is removing files from the directory. It looks like there were at least two missing files (_37 On Thu, Aug 13, 2009 at 5:19 AM, rishisinghal wrote: > > Hi, > > I am trying to index documents and whe

Indexer crashes with "hit exception during merge"

2009-08-13 Thread rishisinghal
Hi, I am trying to index documents and when all is complete and optimize is called I get IFD [main]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@4fced0 IW 0 [main]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/SYS$SYSDEVICE/RISHI/melon_1600 au

Re: Generating Query

2009-08-13 Thread bourne71
I am trying to boost results that contain all the query terms, to increase their ranking. But unfortunately neither query seems to affect it. Ahmet Arslan wrote: > >> thanks for the suggestion, but unfortunately it does not >> work. > > What are you trying to do? Both Adriano's and my query

Re: Lucene Vs Sphinx benchmarking for large dataset

2009-08-13 Thread Simon Willnauer
Anshum, thanks for posting this on the list. I have a few comments on that benchmark while being happy that Lucene has the upper hand in yours. I wonder if you can publish the various modifications you did to either of those? If not would it be possible to run the benchmarks against the vanilla ver

Re: Contribute to Lucene

2009-08-13 Thread Simon Willnauer
Once you open issues on the spatial / analyzers/cn contribs feel free to assign me to them. simon On Thu, Aug 13, 2009 at 9:47 AM, Amin Mohammed-Coleman wrote: > Cool! I'll be on the case. > > Cheers! > > Amin > > On Thu, Aug 13, 2009 at 8:44 AM, Simon Willnauer > wrote: >> >> There is a lot

Re: Contribute to Lucene

2009-08-13 Thread Amin Mohammed-Coleman
Cool! I'll be on the case. Cheers! Amin On Thu, Aug 13, 2009 at 8:44 AM, Simon Willnauer < simon.willna...@googlemail.com> wrote: > There is a lot of code in /contrib which needs proper documentation, > refactoring and clean-up. > For refactoring you can have a quick look at /analyzers/smartcn.

Re: Contribute to Lucene

2009-08-13 Thread Simon Willnauer
There is a lot of code in /contrib which needs proper documentation, refactoring and clean-up. For refactoring you can have a quick look at /analyzers/smartcn. Clean-up and documentation are needed in /contrib/spatial which still suffers from lots of legacy comments and certainly legacy code. I gue

Simple Cosine Similarity

2009-08-13 Thread Claudio Gennaro
I would like to know if there is a simple way to force Lucene to adopt the simple cosine similarity of the term frequency vectors of the documents and the query for ranking the result. Thank you Claudio

Re: term query boost problem

2009-08-13 Thread Simon Willnauer
Christian, if you haven't done so you might find Luke (http://www.getopt.org/luke/) very helpful to see what has been indexed and how. simon On Thu, Aug 13, 2009 at 6:10 AM, Christian Bongiorno wrote: > turns out the index is being built with lower-case terms which is why we > aren't getting hits

Re: Contribute to Lucene

2009-08-13 Thread Amin Mohammed-Coleman
Thanks for your replies. I have checked out trunk and have started looking at what I can do. Any more suggestions as usual always welcome. Thanks all! Amin On Wed, Aug 12, 2009 at 10:28 PM, Chris Hostetter wrote: > > : that you use. Also, we are nearing 2.9 release, so it would > : be great t