Re: Possible exceptions using IndexReader & IndexWriter

2006-09-18 Thread Jason Polites
I've also seen FileNotFound exceptions when attempting a search on an index while it's being updated, and the searcher is in a different JVM. This is supposed to be supported, but on Windows seems to regularly fail (for me anyway). The simplest solution to this would be a service oriented approa

Re: Stop words in index

2006-09-04 Thread Jason Polites
e" which is what should have come in your doc when it was indexed using that analyzer. : : On 9/3/06, Jason Polites <[EMAIL PROTECTED]> wrote: : > : > Roger that. I'll double check my code. : > : > Thanks. : > : > : > On 9/3/06, Otis Gospodnetic <[EMAIL PROT

Re: Stop words in index

2006-09-03 Thread Jason Polites
", but not "on". This is fine, and if the user searches for: Disney on Ice They will get a match. But, it seems that a search for: "Disney on Ice" With the quotations indicating the desire for an "exact match", the absence of stop words in the index means this

Re: Stop words in index

2006-09-02 Thread Jason Polites
Original Message ---- From: Jason Polites <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Saturday, September 2, 2006 9:05:27 AM Subject: Stop words in index Hey all, I am using the StandardAnalyzer with my own list of stop words (which is more comprehensive than the default list), and m

Stop words in index

2006-09-02 Thread Jason Polites
Hey all, I am using the StandardAnalyzer with my own list of stop words (which is more comprehensive than the default list), and my expectation was that this would omit these stop words from the index when data is indexed using this analyzer. However, I am seeing stop words in the term vector fo

Re-created fields consistently indexed?

2006-08-30 Thread Jason Polites
Hi all, I understand that it is possible to "re-create" fields which are indexed but not stored (as is done by Luke), and that this is a lossy process, however I am wondering whether the indexed version of this remains consistent. That is, if I re-create a non-stored field, then re-index this fi

Re: Straight TF-IDF cosine similarity?

2006-08-29 Thread Jason Polites
Have you looked at the MoreLikeThis class in the similarity package? On 8/30/06, Winton Davies <[EMAIL PROTECTED]> wrote: Hi All, I'm scratching my head - can someone tell me which class implements an efficient multiple term TF.IDF Cosine similarity scoring mechanism? There is clearly the sin

Re: java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-28 Thread Jason Polites
ound.. if that helps. On 8/28/06, Michael McCandless <[EMAIL PROTECTED]> wrote: Jason Polites wrote: > Yeah.. I had a think about this, and I now remember why I originally > came to > the conclusion about cross-JVM access. > > When I was adding documents to the index, and searc

Re: java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-28 Thread Jason Polites
Yeah.. I had a think about this, and I now remember why I originally came to the conclusion about cross-JVM access. When I was adding documents to the index, and searching at the same time (from a different JVM) I would get the occasional (but regular) FileNotFoundException. I don't recall the

Re: Lucene displaying results in the order they were added

2006-08-27 Thread Jason Polites
Not sure what the desired end result is here, but you shouldn't need to update the document just to give it a boost factor. This can be done in the query string used to search the index. As for updating affecting search order, I don't think you can assume any guarantees in this regard. You're pr

Re: java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-27 Thread Jason Polites
]> wrote: Doron Cohen wrote: > "Jason Polites" <[EMAIL PROTECTED]> wrote on 27/08/2006 09:36:07: > >> I would have thought that simultaneous cross-JVM access to an index was >> outside of scope of the core Lucene API (although it would be great), but &

Re: java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-27 Thread Jason Polites
due to any reason can be thought of as the same thing, regardless of the reason (so long as it's logged). Seems like the simplest solution too. On 8/28/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 8/26/06, Jason Polites <[EMAIL PROTECTED]> wrote: > Synchronization at this

Re: java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-26 Thread Jason Polites
On 8/26/06, Michael McCandless <[EMAIL PROTECTED]> wrote: Are you also running searchers against this index? Are they re-init'ing frequently or being opened and then held open? No searches running in my initial test, although I can't be certain what is happening under the Compass hood. This

java.io.IOException: Access is denied on java.io.WinNTFileSystem.createFileExclusively

2006-08-26 Thread Jason Polites
Hi all, When indexing with multiple threads, and under heavy load, I get the following exception: java.io.IOException: Access is denied at java.io.WinNTFileSystem.createFileExclusively(Native Method) at java.io.File.createNewFile(File.java:850) at org.apache.lucene.store.FSDirectory$1.o

Re: index update with database insertion

2006-08-21 Thread Jason Polites
I'm not sure about the solution in the referenced thread. It will work, but doesn't it run the risk of breaching the transaction isolation of the database write? The issue is when the index is notified of a database update. If it is notified prior to the transaction commit, and the commit fails
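The message above is cut off, but the concern it raises is ordering: an index that is updated before the database transaction commits can end up describing data that was never saved. A minimal sketch of one way to avoid that, assuming the application has some hook that fires after commit and after rollback. The class `DeferredIndexUpdates`, its method names, and the `indexer` callback are hypothetical and are not part of Lucene or any database API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers index notifications until the database
// transaction outcome is known.
class DeferredIndexUpdates {
    private final List<String> pending = new ArrayList<>();
    private final Consumer<String> indexer; // assumed callback that (re)indexes one document id

    DeferredIndexUpdates(Consumer<String> indexer) {
        this.indexer = indexer;
    }

    // Called while the transaction is still open; nothing reaches the index yet.
    void recordChange(String documentId) {
        pending.add(documentId);
    }

    // Called only once the commit has succeeded; now the index may be told.
    void afterCommit() {
        for (String documentId : pending) {
            indexer.accept(documentId);
        }
        pending.clear();
    }

    // Called when the transaction fails; the buffered changes are dropped.
    void afterRollback() {
        pending.clear();
    }
}
```

With this shape a failed commit leaves the index untouched. The remaining window is a crash between commit and indexing, which this sketch does not address.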

Re: Indexing Documents which has Attachments and are Refered many times!!

2006-08-19 Thread Jason Polites
dex all data that way. The database is not required. To address your search complexity concern, you can create queries that search only those Field(s) the user wants -- there is no need to have a Field for each possible combination of content type. Steve Jason Polites wrote: > Maybe I'm not u

Re: Index not recreated

2006-08-14 Thread Jason Polites
fferent threads accessing the index. This would also explain why you see the problem in production and not testing. On 8/15/06, Jason Polites <[EMAIL PROTECTED]> wrote: My advice would be the "back-to-basics" approach. Create a test case which creates a simple index with a few do

Re: Index not recreated

2006-08-14 Thread Jason Polites
My advice would be the "back-to-basics" approach. Create a test case which creates a simple index with a few documents, verify the index is as you expect, then re-create the index and verify again. Run this test case on your production environment (if you are able). This will determine once and

Re: updating document

2006-08-12 Thread Jason Polites
This strategy can also be nicely abstracted from your main app. Whilst I haven't yet implemented it, my plan is to create a template style structure which tells me which fields are in lucene, and which are externalized. This way I don't bother storing data in lucene that is stored elsewhere, but

Re: 30 milllion+ docs on a single server

2006-08-12 Thread Jason Polites
Sounds like you're a bit frustrated. Cheer up, the simple fact is that engineering and business rarely see eye-to-eye. Just focus on the fact that what you have learnt from the process will help you, and they paid for it ;) On the issue at hand...Lucene should scale to this level, but you need

Re: WIll storing docs affect lucene's search performance ?

2006-08-12 Thread Jason Polites
IMO you should avoid storing any data in the index that you don't need for display. Lucene is an index (and a damn good one), not a database. If you find yourself storing large amounts of data in the index, this could be an indication that you may need to re-think your architecture. In its simp

Re: Indexing Documents which has Attachments and are Refered many times!!

2006-08-12 Thread Jason Polites
Maybe I'm not understanding your requirement, but this should be fairly simple in Lucene. Each document in your document management system would be represented by a single Lucene document in the index. Each lucene document will then have several fields, each field representing the values of the

Re: search document for keywords and keyphrases

2006-08-11 Thread Jason Polites
Yes you could use lucene for this, but it may be overkill for your requirement. If I understand you correctly, all you need to do is find documents which match "any" of the words in your list? Do you need to rank the results? If not, it's probably easier just to create your own inverted index of
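As a rough illustration of the "own inverted index" idea mentioned above, here is a minimal sketch that maps each word to the set of documents containing it and returns every document matching any keyword, unranked. The class `KeywordIndex` and its methods are hypothetical names, and the tokenization (lower-casing and splitting on non-word characters) is an assumption; multi-word key phrases would need extra handling that is not shown:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical minimal inverted index: word -> ids of documents containing it.
class KeywordIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Tokenizes the text and records the document id under every word.
    void add(String documentId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(documentId);
            }
        }
    }

    // Returns the ids of all documents containing at least one of the keywords.
    Set<String> matchAny(Collection<String> keywords) {
        Set<String> result = new TreeSet<>();
        for (String keyword : keywords) {
            result.addAll(postings.getOrDefault(
                    keyword.toLowerCase(Locale.ROOT), Collections.<String>emptySet()));
        }
        return result;
    }
}
```

This trades away everything Lucene adds (ranking, analyzers, phrase queries) for a structure that fits in a few lines, which is the trade-off the message describes.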

Re: updating document

2006-08-10 Thread Jason Polites
the index. Lucene works best when the index is light-weight. My recommendation is to think carefully about the "role" of the index, vs the role of your data storage approach. On 8/11/06, Deepan Chakravarthy <[EMAIL PROTECTED]> wrote: On Fri, 2006-08-11 at 01:58 +1000, Jason Po

Re: Field compression too slow

2006-08-10 Thread Jason Polites
I can share the data.. but it would be quicker for you to just pull out some random text from anywhere you like. The issue is that the text was in an email, which was one of about 2,000 and I don't know which one. I got the 4.5MB figure from the number of bytes in the byte array reported in the

Re: updating document

2006-08-10 Thread Jason Polites
Are you storing the contents of the fields in the index? That is, specifying Field.Store.YES when creating the field? In my experience fields which are not stored are not recoverable from the index (well.. they can be reconstructed but it's a lossy process). So when you retrieve the document,

Re: Field compression too slow

2006-08-10 Thread Jason Polites
Thanks for the Jira issue... one question on your synchronization comment... I have "assumed" I can't have two threads writing to the index concurrently, so have implemented my own read/write locking system. Are you saying I don't need to bother with this? My reading of the doco suggests that y

Field compression too slow

2006-08-10 Thread Jason Polites
Hello all, I am experiencing some performance problems indexing large(ish) amounts of text using the Field.Store.COMPRESS option when creating a Field in Lucene. I have a sample document which has about 4.5MB of text to be stored as compressed data within the field, and the indexing of this

Re: Inappropriate content detection

2006-02-06 Thread Jason Polites
There is also an open source Java anti-spam API which does a Bayesian scan of email content (plus other stuff). You could retro-fit it to work with raw text. www.jasen.org (get the latest HEAD from CVS as the current release is a bit old... new version imminent) - Original Message - From:

RE: Search Timeout - abort a search

2005-07-07 Thread Jason Polites
You could do it asynchronously. That is, split the actual Lucene search off into a different thread; the calling thread then simply waits for a maximum time for the search thread to complete, then queries the status of the search thread to get the results obtained
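The approach described above can be sketched with the standard `java.util.concurrent` classes. This is only an illustration under stated assumptions: `TimedSearch` and `runWithTimeout` are hypothetical names, the `search` argument stands in for whatever code performs the actual Lucene query, and the sketch returns a fallback value on timeout rather than the partial results the message mentions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a search on a worker thread with a time limit.
class TimedSearch {
    static <T> T runWithTimeout(Callable<T> search, long timeoutMillis, T fallback) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<T> pending = worker.submit(search);
        try {
            // The calling thread waits here for at most timeoutMillis.
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // requests an interrupt of the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            pending.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException("search failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }
}
```

Note that `cancel(true)` only interrupts the worker; whether a running search actually stops depends on the search code checking for interruption, so the caller may get control back while the work continues in the background.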

RE: FileNotFoundException segments

2005-07-07 Thread Jason Polites
if ((indexFile = new File(indexDir)).exists() && indexFile.isDirectory()) {
    exists = false;
Isn't this backwards? Couldn't you just do:
    indexFile = new File(indexDir);
    exists = (indexFile.exists() && indexFile.isDirectory());
-Original Message- From: bib_lucene bib [mailto: