Re: a complete solution for building a website search with lucene

2010-01-08 Thread Otis Gospodnetic
Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs that the JVM supports. But it does contain some shell scripts, as does Hadoop that Nutch uses. Oh, I guess Windows people run it under Cygwin? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

Re: Indexing pages and chapters of a book

2010-01-08 Thread Erick Erickson
Sure, you can add any data to any document that you want, probably stored but not indexed in this case. It could even be a serialized Java object. Or an XML packet or a stringized map. Or... whatever suits your fancy. If it's not indexed, only stored it'll make your index larger but have a negligib

Indexing pages and chapters of a book

2010-01-08 Thread LucasMeadows
I have a large number of text files (books) that I am trying to make searchable with Lucene 2.3.2. I would like search results to display the page and chapter in which a match with the search term occurred. My question is whether it is possible to add structural data (xml perhaps) to the files s

Re: Search query problem

2010-01-08 Thread Will Murnane
On Fri, Jan 8, 2010 at 16:27, Jamie wrote: > Hi Ian / Will > > Thanks. Surely, the Porter Stemmer should not stem proper nouns. i.e. it > could check the capitalization of the first letter of a word and whether or > not the word is the start of a sentence. If so, it could choose not to apply any > ste

Re: ShingleFilter with outputUnigrams=false

2010-01-08 Thread Simon Willnauer
You can find the issue for this here https://issues.apache.org/jira/browse/LUCENE-2199 On Fri, Jan 8, 2010 at 8:53 PM, Simon Willnauer wrote: > This is truly a bug. The outputUnigram internally only works if you > request bi-grams. > If the outputUnigram is set to false the filter increments the >

Re: Search query problem

2010-01-08 Thread Jamie
Hi Ian / Will Thanks. Surely, the Porter Stemmer should not stem proper nouns. i.e. it could check the capitalization of the first letter of a word and whether or not the word is the start of a sentence. If so, it could choose not to apply any stemming. Or am I completely out of whack? Jamie I

Re: Search query problem

2010-01-08 Thread Ian Lea
Looks like PorterStemFilter converts "Lowe's" to low. Not very surprising. Options include . Drop the stemming . Index stemmed and non-stemmed variants and search both, maybe boosting the non-stemmed variant. If you really want exact matches only, you may also/instead want untokenized fields
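The second option (index stemmed and non-stemmed variants, search both, boost the non-stemmed match) can be illustrated with a toy sketch in plain Java. This is not Lucene code and does not reproduce Lucene's scoring: the class `DualFieldSketch`, the lookup-table `stem` helper, and the boost value are all hypothetical, and the only stems included are the two reductions reported in these threads ("Lowe's" to "low", "Wallis" to "wall").

```java
import java.util.Locale;
import java.util.Map;

class DualFieldSketch {
    // Hypothetical stand-in for a real stemmer: only the reductions reported in the threads.
    static final Map<String, String> STEMS = Map.of("lowe's", "low", "wallis", "wall");

    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return STEMS.getOrDefault(t, t);
    }

    /**
     * Toy score: every token is checked against an "unstemmed field" (exact,
     * lowercased match, weighted by exactBoost) and a "stemmed field"
     * (match after stemming, weight 1.0).
     */
    static double score(String docText, String queryTerm, double exactBoost) {
        String q = queryTerm.toLowerCase(Locale.ROOT);
        double score = 0.0;
        for (String raw : docText.split("\\s+")) {
            String exact = raw.toLowerCase(Locale.ROOT);
            if (exact.equals(q)) {
                score += exactBoost; // unstemmed variant matched
            }
            if (stem(exact).equals(stem(q))) {
                score += 1.0;        // stemmed variant matched
            }
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("Lowe's hardware store", "Lowe's", 2.0));
        System.out.println(score("low prices today", "Lowe's", 2.0));
    }
}
```

With a boost above zero, the document containing the literal "Lowe's" outranks the one that only matches through the stem, while the stem-only match is still returned. Dropping the stemmed comparison entirely corresponds to the first option.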

Re: Search query problem

2010-01-08 Thread Will Murnane
On Fri, Jan 8, 2010 at 15:01, Jamie wrote: > Hi There > > We are trying to search for the exact word "Lowe's" across a large set of > indexed data. Our results include everything with "low" in it. Thus, we are > receiving a much larger data set than we expected. The data is indexed > using the an

Search query problem

2010-01-08 Thread Jamie
Hi There We are trying to search for the exact word "Lowe's" across a large set of indexed data. Our results include everything with "low" in it. Thus, we are receiving a much larger data set than we expected. The data is indexed using the analyzer: TokenStream result = new Standa

Re: ShingleFilter with outputUnigrams=false

2010-01-08 Thread Simon Willnauer
This is truly a bug. The outputUnigram internally only works if you request bi-grams. If the outputUnigram is set to false the filter increments the shingleposition by one and therefore skips every even shingle. The position should only be incremented if shingleBufferPosition % maxShingle == 0 I ha
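For reference, the output one would expect when no shingles are skipped can be sketched in plain Java. This is not the ShingleFilter implementation and not the fix for LUCENE-2199; the class `ShingleSketch`, the space separator, and the sample tokens are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ShingleSketch {
    /**
     * Emits, for every start position, all word n-grams from the minimum size
     * (1 if outputUnigrams, otherwise 2) up to maxShingleSize.
     */
    static List<String> shingles(List<String> tokens, int maxShingleSize, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        int minSize = outputUnigrams ? 1 : 2;
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxShingleSize && start + size <= tokens.size(); size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        // maxShingleSize=3, outputUnigrams=false: bi-grams and tri-grams only
        for (String s : shingles(tokens, 3, false)) {
            System.out.println(s);
        }
    }
}
```

Every start position contributes its bi-gram and its tri-gram where enough tokens remain, so no position is skipped.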

Re: ShingleFilter with outputUnigrams=false

2010-01-08 Thread Chris Hostetter
: I am using lucene 2.9.1 and I was trying to understand the ShingleFilter and wrote the code below. ... : I was expecting the output as follows with maxShingleSize=3 and outputUnigrams=false : ... : Am I missing something or this is the expected behavior? I'm not very familiar

Re: Term Frequency for phrases

2010-01-08 Thread Erick Erickson
What are the associated Analyzers for your Gene and Token? Because if they're NOT something akin to KeywordAnalyzer, you have a problem. Specifically, most of the "regular" tokenizers will break this stream up into three separate terms, "brain", "natriuretic", and "peptide". If that's the case, the
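The mismatch can be shown with a small plain-Java sketch that uses no Lucene classes. The class `PhraseCount`, the regex-based tokenization, and the sample text are assumptions for illustration; a real analyzer tokenizes differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PhraseCount {
    /** Lowercases and splits on non-word characters, roughly like a "regular" tokenizer. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
    }

    /** Counts tokens that equal the given term exactly. */
    static int termFrequency(List<String> tokens, String term) {
        int count = 0;
        for (String t : tokens) {
            if (t.equals(term)) count++;
        }
        return count;
    }

    /** Counts positions where the phrase's tokens occur consecutively. */
    static int phraseFrequency(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) return 0;
        int count = 0;
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(
            "Brain natriuretic peptide is elevated. Plasma brain natriuretic peptide was measured.");
        // The whole phrase never appears as one token in the tokenized text...
        System.out.println(termFrequency(tokens, "brain natriuretic peptide"));
        // ...but it does appear as three consecutive tokens.
        System.out.println(phraseFrequency(tokens, tokenize("brain natriuretic peptide")));
    }
}
```

Looking up the phrase as a single term finds nothing in the tokenized text, whereas matching the three tokens in sequence, which is conceptually what a phrase query does, finds both occurrences.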

Re: Term Frequency for phrases

2010-01-08 Thread Jason Rutherglen
I'm not going to go into too much code level detail, however I'd index the phrases using tri-gram shingles, and as uni-grams. I think this'll give you the results you're looking for. You'll be able to quickly recall the count of a given phrase aka tri-gram such as "blue_shorts_burough" On Fri, J

Re: Term Frequency for phrases

2010-01-08 Thread hrishim
@All : Elaborating the problem The phrase is being indexed as a single token ... I have a Gene tag in the xml document which is like brain natriuretic peptide This phrase is present in the abstract text for the given document . Code is as : doc.add(new Field("Gene", geneName, Field.Store.YES

Re: Term Frequency for phrases

2010-01-08 Thread Grant Ingersoll
When do you detect that they are phrases? During indexing or during search? On Jan 8, 2010, at 5:16 AM, hrishim wrote: > > Hi . > I have phrases like brain natriuretic peptide indexed as a single token > using Lucene. > When I calculate the term frequency for the same the count is 0 since the

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-08 Thread Yuliya Palchaninava
Mike, thanks a lot! That's exactly what we'll do. Actually we have a lot of dynamic fields which are not analyzed and not involved in field/document boosting, so we can disable norms on these fields without problems. Thanks again. Yuliya > -Original Message- > From: Michael

Re: Term Frequency for phrases

2010-01-08 Thread Erick Erickson
On a quick read, your statements are contradictory <<>> <<>> Either "brain natriuretic peptide" is a single token/term or it's not Are you sure you're not confusing indexing and storing? What analyzer are you using at index time? Erick On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-08 Thread Michael McCandless
Lucene stores 1 byte (disk and RAM, when searching that field) per document for any field that has norms enabled, even for documents that do not contain that field. In your case, that's ~20 MB per field (once optimize is done), times 559 fields = ~11 GB of storage. You should index these fields wi
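The estimate can be checked with a few lines of Java, using the document and field counts reported elsewhere in this thread (20845906 documents, 559 fields). The class and method names are hypothetical, and the sketch assumes every one of the 559 fields has norms enabled.

```java
class NormsEstimate {
    /** Norms cost one byte per document for every field that has norms enabled. */
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long numDocs = 20_845_906L; // document count reported in the thread
        int fields = 559;           // field count reported in the thread
        System.out.printf("per field:  %.1f MB%n", normsBytes(numDocs, 1) / 1e6);
        System.out.printf("all fields: %.2f GB%n", normsBytes(numDocs, fields) / 1e9);
    }
}
```

The total is on the order of 11 to 12 GB, which roughly matches the reported gap between the 4.1G unoptimized index and the 15G optimized one.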

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-08 Thread Yuliya Palchaninava
Thanks Michael. You are probably right. Not optimized size is 4.1G, optimized index is about 15G. Yes, our documents do have many different indexed fields and norms are enabled. Nr of fields: 559 Nr of documents: 20845906 Nr of terms: 25615389 Could you please give me a more detailed explanat

Re: Concurrent access IndexReader / IndexWriter - FileNotFoundException

2010-01-08 Thread Michael McCandless
Normally, this (using an IndexReader, [re-]opening a new IndexReader while an IndexWriter is committing) is perfectly fine. The reader searches the point-in-time snapshot of the index as of when it was opened. But: what filesystem are you using? NFS presents challenges, for example. Mike On Fr

Re: Question about relevance

2010-01-08 Thread Erik Hatcher
One technique I've seen commonly used is to index both stemmed and unstemmed fields, and during search query both and boost the unstemmed field matches higher. Erik On Jan 8, 2010, at 4:05 AM, Yannick Caillaux wrote: Hi, I index 2 documents. the first contains the word "Wallis" in

Concurrent access IndexReader / IndexWriter - FileNotFoundException

2010-01-08 Thread legrand thomas
Hi, I often get a FileNotFoundException when my single IndexWriter commits while the IndexReader also tries to read. My application is multithreaded (Tomcat uses the business APIs); I first thought the read/write access was thread-safe but I am probably forgetting something. Please help me to unde

Re: Is there a way to limit the size of an index?

2010-01-08 Thread Michael McCandless
On Fri, Jan 8, 2010 at 1:22 AM, Babak Farhang wrote: >>> I wonder if renaming that to maxSegSizeMergeMB would make it more obvious >>> what this does? > > How about using the *able* moniker to make it clear we're referring to > the size of the to-be-merged segment, not the resultant merged > segm

Re: Term Frequency for phrases

2010-01-08 Thread Michael McCandless
Issue a PhraseQuery and count how many hits came back? Is that too slow? If so, you could detect all phrases during indexing and add them as tokens to the index? Mike On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote: > > Hi . > I have phrases like brain natriuretic peptide indexed as a single tok

Re: Implementing filtering based on multiple fields

2010-01-08 Thread Yaniv Ben Yosef
Thanks Otis, that's very helpful. On Fri, Jan 8, 2010 at 2:08 AM, Otis Gospodnetic wrote: > Ah, well, masking it didn't help. Yes, ignore Bixo, Nutch, and Droids > then. > Consider DataImportHandler from Solr or wait a bit for Lucene Connectors > Framework to materialize. Or use LuSql, or DbSi

Term Frequency for phrases

2010-01-08 Thread hrishim
Hi . I have phrases like brain natriuretic peptide indexed as a single token using Lucene. When I calculate the term frequency for the same the count is 0 since the tokens from the text are indexed separately i.e. brain , natriuretic , peptide. Is there a way to solve this problem and get the ter

Re: *Only* the matching whole sentence highlighted

2010-01-08 Thread Simon Willnauer
You need contrib-memory.jar in your classpath to use MemoryIndex. simon On Fri, Jan 8, 2010 at 10:42 AM, Li Leon wrote: > Hi all, > > I was able to get a whole sentence(including stop words) highlighted with > "StandardAnalyzer" and an empty stop words String[]. > > The current issue I'm having

Re: a complete solution for building a website search with lucene

2010-01-08 Thread jyzhou817
Hi Paul, Thanks. So: use Nutch to do the crawling, and integrate Lucene into the web application so that it can search online. BTW, Nutch seems to have only a Linux version, while my development is on Windows. Am I right? Zhou --- On Fri, 8/1/10, Paul Libbrecht wrote: From: Paul Libbrecht Subject: Re

*Only* the matching whole sentence highlighted

2010-01-08 Thread Li Leon
Hi all, I was able to get a whole sentence (including stop words) highlighted with "StandardAnalyzer" and an empty stop words String[]. The current issue I'm having is that not only does the whole sentence get highlighted, but tokens that partially match the sentence are also highlighted. I tried to u

Question about relevance

2010-01-08 Thread Yannick Caillaux
Hi, I index 2 documents. The first contains the word "Wallis" in the title field. The second has the same title but "Wallis" is replaced by "Wall". I execute the query : "title:wallis" During the search, "Wallis" is cut by the FrenchAnalyzer and becomes "wall". So the two documents are results

Re: a complete solution for building a website search with lucene

2010-01-08 Thread Paul Libbrecht
Zhou, Lucene is a back-end library, it's very useful for developer but it is not a complete site-search-engine. A lucene-based site-search-engine is Nutch, it does crawl. Solr also provides functions close to these with a large amount of thoughts on flexible integration; crawling methods are