Re: np-pandock search problem (again, with more detail)

2007-06-07 Thread Michael D. Curtin
Doron Cohen wrote: From the StandardAnalyzer javacc grammar: // floating point, serial, model numbers, ip addresses, etc. // every other segment must have at least one digit etc. <#P: ("_"|"-"|"/"|"."|",") > My understanding of this: a non-whitespace sequence is broken at either

Re: np-pandock search problem (again, with more detail)

2007-06-07 Thread Michael D. Curtin
Doron Cohen wrote: I think it splits by hyphens unless the no-hyphen part has digits, so: np-pandock-a7 becomes np pandock-a7 This is for the indexing part. Wow! Do you know the thinking behind that, i.e. why a number in a hyphenated expression prevents the split? --MDC

Re: np-pandock search problem (again, with more detail)

2007-06-07 Thread Michael D. Curtin
John Powers wrote: Np-pandock Np-pandock-1 Np-pandock-2 Np-pandock-L Np-pandock-L1 Np-pandock-L2 I'm not positive, but I think StandardAnalyzer splits this input at the hyphens. That is, it gives the terms "Np", "pandock", "1", "2", "L", "L1", and "L2", but NOT "Np-pandock", etc. --MDC
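The splitting rule described in the replies above (break hyphenated tokens, but keep a segment attached to the previous one when it contains a digit, as in "pandock-a7") can be sketched in plain Java. This is only an approximation of what StandardAnalyzer's javacc grammar does, for illustration; the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

public class HyphenSplit {
    // Approximates the behavior described in the thread: split on hyphens,
    // but keep a segment joined to the previous one when it contains a
    // digit (the "model number" case), so "np-pandock-a7" stays as
    // "np" + "pandock-a7" rather than three separate terms.
    public static List<String> tokenize(String token) {
        String[] parts = token.split("-");
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].chars().anyMatch(Character::isDigit)) {
                cur.append('-').append(parts[i]);  // digit segment: keep the run together
            } else {
                out.add(cur.toString());           // no digit: start a new term
                cur = new StringBuilder(parts[i]);
            }
        }
        out.add(cur.toString());
        return out;
    }
}
```

Under this sketch, tokenize("np-pandock-a7") yields ["np", "pandock-a7"] while tokenize("np-pandock") yields ["np", "pandock"], matching the indexing behavior Doron describes.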

Re: Merge performance

2007-04-19 Thread Michael D. Curtin
david m wrote: A couple of reasons that lead to the merge approach: - Source documents are written to archive media and retrieval is relatively slow. Add to that our processing pipeline (including text extraction)... Retrieving and merging minis is faster than re-processing and re-indexing f

Re: Merge performance

2007-04-18 Thread Michael D. Curtin
d m wrote: I'd like to share index merge performance data and have a couple of questions about it... We (AXS-One, www.axsone.com) build one "master" index per day. For backup and recovery purposes, we also build many individual "mini" indexes from the docs added to the master index. Should one

Re: Design Problem: Searching large set of protected documents

2007-04-03 Thread Michael D. Curtin
Jonathan O'Connor wrote: I have a database of a million documents and about 100 users. The documents can have an access control list, and there is a complex, recursive algorithm to say if a particular user can see a particular document. My problem is that my search algorithm is to first do a st

Re: one Field in many documents

2007-03-08 Thread Michael D. Curtin
<[EMAIL PROTECTED]> wrote on 08/03/2007 12:56:33: I have to index many documents with the same fields (only one or two fields are different). Can I add a field (Field instance) to many documents? It seems to work but I'm not sure if this is the right way... What does "many" mean in this context?

Re: Lucene 1.4.3 : IndexWriter.addDocument(doc) fails when run on OS requiring permissions

2007-02-22 Thread Michael D. Curtin
Is your disk almost full? Under Linux, when you reach about 90% used on a file system, only the superuser can allocate more space (e.g. create files, add data to files, etc.). --MDC

Re: search on colon ":" ending words

2007-01-28 Thread Michael D. Curtin
Felix Litman wrote: We want to be able to return a result regardless if users use a colon or not in the query. So 'work:' and 'work' query should still return same result. With the current parser if a user enters 'work:' with a ":" , Lucene does not return anything :-(. It seems to me the

Re: Long Query Performance

2007-01-22 Thread Michael D. Curtin
Somnath Banerjee wrote: Thanks for the reply. Good guess I think. DB (Index) is basically a collection of encyclopedia documents. Queries are also a collection of documents but of various domains. My task is to find out for each "query document" top 100 matching encyclopedia contents. I tried b

Re: Long Query Performance

2007-01-22 Thread Michael D. Curtin
Somnath Banerjee wrote: I have created a 8GB index of almost 2 million documents. My requirement is to run nearly 0.72 million query on this index. Each query consists of 200 - 400 words. I have created a Boolean Query by ORing these words. But each query is taking nearly 5 - 10 secon

Re: term vectors

2006-11-15 Thread Michael D. Curtin
Phil Rosen wrote: I would like to get the sum of frequency counts for each term in the fields I specify across the search results. I can just iterate through the documents and use getTermFreqVector() for each desired field on each document, then sum that; but this seems slow to me. It seems

Re: term vectors

2006-11-15 Thread Michael D. Curtin
Phil Rosen wrote: I am building an application that requires I index a set of documents on the scale of hundreds of thousands. A document can have a varying number of attribute fields with an unknown set of potential values. I realize that just indexing a blob of fields would be much faster, ho

Re: termpositions at index time...

2006-10-18 Thread Michael D. Curtin
Erick Erickson wrote: Arbitrary restrictions by IT on the space the indexes can take up. Actually, I won't categorically say I *can't* make this happen, but in order to use this option, I need to be able to present a convincing case. And I can't do that until I've exhausted my options/creativity.

Re: termpositions at index time...

2006-10-18 Thread Michael D. Curtin
Erick Erickson wrote: Here's my problem: We're indexing books. I need to a> return books ordered by relevancy b> for any single book, return the number of hits in each chapter (which, of course, may be many pages). 1>If I index each page as a document, creating the relevance on a book basis

Re: index architectures

2006-10-18 Thread Michael D. Curtin
On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote: No they don't want that. They just want a small number. What happens is they enter some silly query, like searching for all stories with a single common non-stop-word in them, and with the usual sort criterion of by date (ie. a field) descendi

Re: FWD: Re: parser question

2006-09-08 Thread Michael D. Curtin
If your question is why are the queries '(field:software field:engineer)' and '(+field:software +field:engineer)' returning the same results, it could be because none of your documents have *only* "software" *or* "engineer", i.e. they all have both words or neither. You could tes

Re: Documents that know more?

2006-08-29 Thread Michael D. Curtin
Furash Gary wrote: I'm sure this is just a design point that I'm missing, but is there a way to have my document objects know more about themselves? At the time I create my document, I know a bit about how information is being stored in it (e.g., this field represents a SOUNDEX copy, etc.), yet

Re: a "fair" similarity

2006-08-14 Thread Michael D. Curtin
Daniel Naber wrote: Hi, as some of you may have noticed, Lucene prefers shorter documents over longer ones, i.e. shorter documents get a higher ranking, even if the ratio "matched terms / total terms in document" is the same. For example, take these two artificial documents: doc1: x 2 3 4 5
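The preference for shorter documents comes from Lucene's default length normalization: DefaultSimilarity computes a field norm of roughly 1/sqrt(number of terms), so at an equal matched/total ratio the shorter document still scores higher. A minimal illustration of just that formula, not the full scoring pipeline:

```java
public class LengthNormDemo {
    // DefaultSimilarity-style length norm: 1/sqrt(numTerms).
    // Shorter fields get a larger norm, hence a higher score.
    public static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

For a 5-term document versus a 10-term one, the norms are about 0.447 and 0.316, a roughly 41% boost for the shorter document even when both match the same fraction of their terms.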

Re: MultiField Query

2006-07-17 Thread Michael D. Curtin
Erick Erickson wrote: I'm pretty sure your problem is Query q = new BooleanQuery... should be BooleanQuery q = new BooleanQuery... Good catch! That's why the add() method is barfing. The parse() thing, though, is probably a change to the QueryParser interface for 2.0. --MDC --

Re: MultiField Query

2006-07-17 Thread Michael D. Curtin
[EMAIL PROTECTED] wrote: When I try this ( using Lucene 2.0 API ) I get the error: "..The method parse(String) is not applicable for arguments (String, String, StopAnalyzer) ... When I try this I get the error: The method add(TermQuery, BooleanClause.Occur) is undefined for type Query Could th

Re: MultiField Query

2006-07-16 Thread Michael D. Curtin
[EMAIL PROTECTED] wrote: I am using Lucene 2.0 and trying to use the MultiFieldQueryParser in my search. I want to limit my search to documents which have "silly" in "field1" ...within that subset of documents, I want documents which have "example" in "field2" OR "field3" The code fragment b

Re: BooleanQuery question

2006-07-06 Thread Michael D. Curtin
Van Nguyen wrote: I just want results that have: ID: 1234 OR 2344 OR 2323 LOCATION: A1 LANGUAGE: ENU This query returns everything from my index. How would I create a query that will only return results that must have LOCATION and LANGUAGE and have only those three IDs? I think you'll ne
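In Lucene's query syntax, "must have" clauses carry a leading + and the OR'd IDs go into one parenthesized sub-clause, e.g. +location:A1 +language:ENU +(id:1234 id:2344 id:2323). A small helper that assembles such a string (field names taken from the post; the helper itself is hypothetical):

```java
import java.util.List;

public class RequiredClauseQuery {
    // Builds: +location:<loc> +language:<lang> +(id:a id:b ...)
    // The two + clauses are required; the parenthesized group is itself
    // required, but any single id inside it is enough to satisfy it.
    public static String build(String location, String language, List<String> ids) {
        StringBuilder q = new StringBuilder();
        q.append("+location:").append(location);
        q.append(" +language:").append(language);
        q.append(" +(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) q.append(' ');
            q.append("id:").append(ids.get(i));
        }
        q.append(')');
        return q.toString();
    }
}
```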

Re: Lucene and database

2006-07-04 Thread Michael D. Curtin
Alexander Mashtakov wrote: But, the database is going to be big enough, and the list of IDs returned by Lucene too. This may cause high memory usage and slow sql query speed (for instance 1000 IDs in "IN (id1, id2 ...)" sql filter) For this part, I recommend using a working table to hold the

Re: Sorting & SQL-Database

2006-07-01 Thread Michael D. Curtin
Dominik Bruhn wrote: Hi, I use Lucene to index an SQL table which contains three fields: an index field, the text to search in, and another field. When adding a Lucene document I let Lucene index the search field and also save the id along with it in the Lucene index. Upon searching I collect

Re: BooleanQuery.TooManyClauses on MultiSearcher

2006-06-15 Thread Michael D. Curtin
Rob Staveley (Tom) wrote: I guess the most expensive thing I'm doing from the perspective of Boolean clauses is heavily using PrefixQuery. I want my user to be able to find e-mail to, cc or from [EMAIL PROTECTED], so I opted for PrefixQuery on James. Bearing in mind that this is causing me grie

Re: Searching UN_TOKENIZED fields

2006-06-15 Thread Michael D. Curtin
[EMAIL PROTECTED] wrote: Hi, I have a field indexed as follows: new Field(name, value, Store.YES, Index.UN_TOKENIZED) I would like to search this field for exact match of the query term. Thus if, for instance in the above code snippet: String name="PROJECT"; String value="Apache Lucene";

Re: Does more memory help Lucene?

2006-06-12 Thread Michael D. Curtin
Nadav Har'El wrote: Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 12/06/2006 04:36:45 PM: Nadav, Look up one of my onjava.com Lucene articles, where I talk about this. You may also want to tell Lucene to merge segments on disk less frequently, which is what mergeFactor does. Thanks. Can

Re: Does more memory help Lucene?

2006-06-12 Thread Michael D. Curtin
Nadav Har'El wrote: What I couldn't figure out how to use, however, was the abundant memory (2 GB) that this machine has. I tried playing with IndexWriter.setMaxBufferedDocs(), and noticed that there is no speed gain after I set it to 1000, at which point the running Lucene takes up just 70 MB

Re: searching in more than fields on document

2006-06-06 Thread Michael D. Curtin
Not sure if I understand exactly what you want to do, but would the ":" syntax that QueryParser understands work for you? That is, you could send query text like f1:foo f2:foo f3:foo to search for "foo" in any of the 3 fields. If you need boolean capabilities you can use parentheses, li
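The f1:foo f2:foo f3:foo expansion can be generated mechanically from a term and a field list; a tiny sketch (the helper name is made up):

```java
import java.util.StringJoiner;

public class MultiFieldExpand {
    // Expands one term across several fields for QueryParser:
    // "foo" over f1, f2, f3 -> "f1:foo f2:foo f3:foo"
    public static String expand(String term, String... fields) {
        StringJoiner sj = new StringJoiner(" ");
        for (String f : fields) {
            sj.add(f + ":" + term);
        }
        return sj.toString();
    }
}
```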

Re: Search precondition: matching area

2006-05-16 Thread Michael D. Curtin
David Trattnig wrote: Hello LuceneList, I've got at least the following fields in my index: AREA = "home news business" CONTENTS = "... hello world ..." If I submit the query string: "hello area:home" Lucene should only search those documents which have the matching area. Actually Lucene s

Re: IndexReader seems loading the full index

2006-05-16 Thread Michael D. Curtin
Sharad Agarwal wrote: I am a newbie in the Lucene space and am trying to understand Lucene search result caching; I'm facing a weird issue. After creating the IndexReader from a file system directory, I rename/remove the index directory; but still I am able to search the index and able to get the

Re: Boosting Fields (in index) or Queries

2006-04-14 Thread Michael D. Curtin
Jeremy Hanna wrote: I would use a database function to force the ordering like the one you provided that works in Oracle, but it doesn't look like MySQL 5 supports that. If anyone else knows of a way to force the ordering using MySQL 5 queries, please respond. I think I'll just resort th

Re: Best design for an use case which is going to stress Lucene

2006-03-16 Thread Michael D. Curtin
Terenzio Treccani wrote: You're both right, this doesn't sound like Lucene at all... But the problem of such SQL tables is their size: speaking about millions of customers and thousands of news items, the many-to-many (CustArt) table would end up containing BILLIONS of lines. A bit too big

Re: Best design for an use case which is going to stress Lucene

2006-03-15 Thread Michael D. Curtin
This doesn't sound like a Lucene problem, at least the way you've described it. For example, Lucene can't search on any field that isn't indexed (and most of yours aren't indexed). Given that, it seems like your option (c) is the way to go. Seems like a simple RDBMS schema with 3 tables woul

Re: search problem

2006-02-28 Thread Michael D. Curtin
Anton Potehin wrote: I have a problem. There is an index, which contains about 6,000,000 records (15,000,000 will be soon) the size is 4GB. Index is optimized and consists of only one segment. This index stores the products. Each product has brand, price and about 10 more additional fields. I

Re: Question to Lucene Index

2006-02-24 Thread Michael D. Curtin
Volodymyr Bychkoviak wrote: This is not the case. maxClauseCount limits the number of terms that can fit into a BooleanQuery during some query rewriting. And the default value is 1024, not 32. 32 required/prohibited clauses is a limitation of Lucene 1.4.3 due to usage of bit patterns to mask required/

Re: Question to Lucene Index

2006-02-24 Thread Michael D. Curtin
Thomas Papke wrote: What is the disadvantage of doing that? Besides being a bit awkward, it could use more RAM and time to compile queries. That could be because a query like "foo*" gets expanded into all the terms from the index that begin with "foo", ORed together, like so: "foo" OR "fo
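The expansion just described can be mimicked over a sorted term list: collect every indexed term carrying the prefix, each of which Lucene then ORs in as its own clause, which is exactly where the extra RAM and clause count come from. A pure-Java sketch, not the real PrefixQuery rewrite code:

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixExpand {
    // Mimics prefix-query rewriting: every indexed term that starts with
    // the prefix becomes one OR'd clause in the rewritten query.
    public static List<String> expand(List<String> sortedTerms, String prefix) {
        List<String> out = new ArrayList<>();
        for (String t : sortedTerms) {
            if (t.startsWith(prefix)) out.add(t);
        }
        return out;
    }
}
```

With enough matching terms, the rewritten clause list can even trip the maxClauseCount limit discussed elsewhere in this thread.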

Re: Question to Lucene Index

2006-02-24 Thread Michael D. Curtin
Thomas Papke wrote: I am a "newbie" in the usage of Apache Lucene. I have a relatively big database indexed by Lucene (about a 300MB file). Up to now, all users could search over the whole index. How do I restrict the result set? I have tried adding some BooleanQuerys to restrict entries. But wi

Re: SQL DISTINCT functionality in Lucene

2006-02-23 Thread Michael D. Curtin
Hugh Ross wrote: I need to find all distinct values for a keyword field in a Lucene index. I think the IndexReader.terms() method will do what you want. Good luck! --MDC

Re: Custom Sorting

2006-02-20 Thread Michael D. Curtin
SOME ONE wrote: Hi, Yes, my queries are like the first case. And as there have been no other suggestions to do it in a single search operation, will have to do it the way you suggested. This technique will do the job particularly because title's text is always in the body as well. So finally I

Re: Custom Sorting

2006-02-18 Thread Michael D. Curtin
I'm not sure you can do what you want in a single search. But, I'm not sure I actually understand what your queries look like, either. I *think* you want to search like (title:a OR body:a) AND (title:b OR body:b) AND (title:c OR body:c) not something like (title:a OR title:b OR title:c) AND

Re: Custom Sorting

2006-02-18 Thread Michael D. Curtin
SOME ONE wrote: Yes, I could run two searches, but that means running two searches for each request from user and that I think doubles the job taking double time. Any suggestions to do it more efficiently please ? I think it would only take double time if the sets of hit documents have substa

Re: Custom Sorting

2006-02-17 Thread Michael D. Curtin
SOME ONE wrote: Hi, I am using MultiFieldQueryParser (Lucene 1.9) to search title and body fields in the documents. The requirement is that documents with title match should be returned before the documents with body match. Using the default scoring, title matches do come before the body matche

Re: Speedup indexing process

2006-02-17 Thread Michael D. Curtin
Java Programmer wrote: Hi, Maybe this question is trivial but I need to ask it. I have some problems with indexing a large number of documents, and I'm seeking a better solution. The task is to index about 33GB of CSV text data (each record about 30kB); it is possible of course to index these data but I'm not ver

Re: Help with mass delete from large index

2006-02-15 Thread Michael D. Curtin
Chandramohan wrote: perform such a cull again, you might make several distinct indexes (one per day, per week, per whatever) during that reindexing so the next time will be much easier. How would you search and consolidate the results across multiple indexes? Hits from each index will have

Re: Help with mass delete from large index

2006-02-13 Thread Michael D. Curtin
Greg Gershman wrote: No problem; this is not meant to be a regular operation, rather it's a (hopefully) one-time thing till the index can be restructured. The data is chronological in nature, deleting everything before a specific point in time. The index is optimized, so is it possible to remo

Re: Help with mass delete from large index

2006-02-13 Thread Michael D. Curtin
Greg Gershman wrote: I'm trying to delete a large number of documents (~15 million) from a large index (30+ million documents). I've started with an optimized index, and a list of docIds (our own unique identifier for a document, not a Lucene doc number) to pass to the IndexReader.delete(Term

Re: Performance and FS block size

2006-02-10 Thread Michael D. Curtin
Otis Gospodnetic wrote: Michael, Actually, one more thing - you said you changed the store/BufferedIndexOutput.BUFFER_SIZE from 1024 to 4096 and that turned out to yield the fastest indexing. Does your FS block size also happen to be 4k (dumpe2fs output) on that FC3 box? If so, I wonder if

Re: Performance and FS block size

2006-02-10 Thread Michael D. Curtin
Otis Gospodnetic wrote: Hi, I'm wondering if anyone has tested Lucene indexing/search performance with different file system block sizes? I just realized one of the servers where I run a lot of Lucene indexing and searching has an FS with blocks of only 1K in size (typically they are 4k or

Re: Queries not derived from the text index

2006-02-07 Thread Michael D. Curtin
Daniel Noll wrote: Is it possible to customise the QueryParser so that it returns Query instances that have no relationship to the text index whatsoever? The syntax that Lucene's QueryParser supports isn't very complicated. I'm sure you could write your own parser from scratch, perhaps with s

Re: How to find "function()" - ?

2006-01-30 Thread Michael D. Curtin
Dmitry Goldenberg wrote: a) if I index "function()" as "function()" rather than "function", does that mean that if I search for "function", then it won't be found? -- the problem is that in some cases, the user will want to find function(), and in some cases just function -- can I accommodate

Re: How to find "function()" - ?

2006-01-27 Thread Michael D. Curtin
Dmitry Goldenberg wrote: Hi, I'm trying to figure out a way to locate tokens which include special characters. The actual text in the file being indexed is something like "function() { statement1; statement2; }" The query I'm using is "function\()" since I want to locate precisely "function

Re: Help needed with BooleanQuery formation

2006-01-25 Thread Michael D. Curtin
Michael Pickard wrote: Can anyone help me with the formation of a BooleanQuery? I want a query of the form: x AND (a OR b OR c OR d) You're going to need 2 BooleanQuery objects, one for the OR'd expression in parentheses, and another for the AND expression. Something like this:
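In Lucene query syntax the same nesting reads +x +(a b c d): the OR group is assembled first and then attached as one required clause. A string-building sketch of that shape (with the BooleanQuery API the structure is the same: an inner query of SHOULD clauses added as a MUST clause of the outer query; the method name here is made up):

```java
public class NestedBool {
    // x AND (a OR b OR c OR d)  ->  "+x +(a b c d)" in Lucene query syntax.
    // The parenthesized OR group is one required clause of the outer query.
    public static String xAndAnyOf(String x, String... terms) {
        StringBuilder inner = new StringBuilder("(");
        for (int i = 0; i < terms.length; i++) {
            if (i > 0) inner.append(' ');
            inner.append(terms[i]);
        }
        inner.append(')');
        return "+" + x + " +" + inner;
    }
}
```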

Re: performance implications for an index with large number of documents.

2006-01-24 Thread Michael D. Curtin
Hi Ori, Before taking drastic rehosting measures, and introducing the associated software complexity of splitting your application into pieces running on separate machines, I'd recommend looking at the way your document data is distributed and the way you're searching it. Here are some qu

Re: OutOfMemory during optimize

2006-01-23 Thread Michael D. Curtin
Steve Rajavuori wrote: I am periodically getting "Too many open files" error when searching. Currently there are over 500 files in my Lucene directory. I am attempting to run optimize( ) to reduce the number of files. However, optimize never finishes because whenever I run it, it quits with a

Re: How to retrieve distinct field matches?

2005-12-16 Thread Michael D. Curtin
Plat wrote: Basically, pretend I do a regular search for "category:fiction". After stemming/etc, this would match any Document with a category of "fiction", "non-fiction", "fictitious", etc. All 900+ of them. BUT as far as the results are concerned, I'm not actually interested in each Document

Re: How to retrieve distinct field matches?

2005-12-15 Thread Michael D. Curtin
Mr Plate wrote: This puzzle has been bugging me for a while; I'm hoping there's an elegant way to handle it in Lucene. DATA DESCRIPTION: I've got an index of over 100,000 Documents. In addition to other fields, each of these Documents has 0 or more "category" field values. There are over

Re: delete and optimize

2005-12-08 Thread Michael D. Curtin
Mordo, Aviran (EXP N-NANNATEK) wrote: Optimization also purges the deleted documents, thus reduces the size (in bytes) of the index. Until you optimize documents stay in the index only marked as deleted. Deleted documents' space is reclaimed during optimization, 'tis true. But it can also be

Re: Wildcard

2005-12-02 Thread Michael D. Curtin
John Powers wrote: Hello, Lucene only lets you use a wildcard after a term, not before, correct? What workarounds are there for that? If I have an item 108585-123 And another 332323-123 How can I look for all the -123 family of items? Classic indexing problem. Here are a couple of simple ideas
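One classic workaround for the no-leading-wildcard restriction is to index each part number a second time reversed: the unsupported leading-wildcard search *-123 then becomes a supported prefix search 321-* against the reversed field. The field layout is up to you; this sketch shows only the token transform (class name is made up):

```java
public class ReverseToken {
    // Store "108585-123" also as "321-585801" in a second field; a prefix
    // search for "321-" on that field then finds the whole -123 family.
    public static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }
}
```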

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: If you're willing to continue subsetting / summarizing the data out into Lucene, how about subsetting it out into a dedicated MySQL instance for this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int = roughly 1 GB of data, which would easily fit into RAM. Queries

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: The data I'm dealing with is stored over a few MySQL DBs on different machines, horizontally partitioned so each user is assigned to a single DB. The queries I'm doing can be done in SQL in parallel over all machines and then combined, which I've tested - it's unacceptably slo

Re: Creating document fields by providing termvector directly (bypassing the analyzing/tokenizing stage)

2005-11-02 Thread Michael D. Curtin
Richard Jones wrote: Hi, I'm using Lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and I've run into a situation that seems somewhat inelegant regarding populating fields which I already know the termvector for. I'm creating a document for each user (last.fm t

Re: BooleanQuery

2005-11-01 Thread Michael D. Curtin
tcorbet wrote: I have an index over the titles to .mp3 songs. It is not unreasonable for the user to want to see the results from: "Show me Everything". I understand that title:* is not a valid wildcard query. I understand that title:[a* TO z*] is a valid wildcard query. What I cannot underst

Re: what is the best way to sort by document ids

2005-11-01 Thread Michael D. Curtin
Oren Shir wrote: My documents contain a field called SORT_ID, which contains an int that increases with every document added to the index. I want my results to be sorted by it. Which approach will give the best performance: 1) Zero-pad the SORT_ID field and sort by it as plain text. 2) Sort using
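Option 1 works because fixed-width, zero-padded decimal strings sort lexicographically in the same order as their numeric values. A sketch of the padding (the width of 10 digits is an arbitrary choice that covers any non-negative int):

```java
public class PadSortKey {
    // Pads SORT_ID to a fixed width so that string order == numeric order.
    public static String pad(int sortId) {
        return String.format("%010d", sortId);
    }
}
```

Unpadded, "9" sorts after "10" as plain text; padded, "0000000009" sorts before "0000000010" as required.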