RE: Linking two different indexes
Hi Mike,

IndexWriter provides an addIndexes() method which should do what you are looking for, if I understand correctly.

Damien

-----Original Message-----
From: Yakn [mailto:[EMAIL PROTECTED]
Sent: 25 March 2007 03:02
To: java-user@lucene.apache.org
Subject: Linking two different indexes

I am trying to link the Nutch index and the index generated from my database using Lucene. At the time of indexing my database, I want to pull in the Nutch index and link the content, matching the URL in the database with the URL that Nutch crawled. Can anyone tell me if they have done this and, if so, how they did it? I would appreciate the help. If anyone knows of another way, I would be interested in that as well. Thanks in advance.

Mike
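For reference, the method Damien is pointing at lives on IndexWriter and is spelled addIndexes(). A minimal sketch, assuming the Lucene 2.x-era API (FSDirectory.getDirectory and the Directory[] overload of addIndexes); the class name and argument handling are made up for illustration:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            // args[0] = existing destination index, args[1] = index to merge in
            Directory dest = FSDirectory.getDirectory(new File(args[0]), false);
            Directory other = FSDirectory.getDirectory(new File(args[1]), false);

            // Open the destination for appending (create = false) and pull the
            // other index's documents across.
            IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(), false);
            writer.addIndexes(new Directory[] { other });
            writer.optimize();
            writer.close();
        }
    }

Note that this only copies the documents of one index into the other; it does not relate them to each other, which is the point Mike comes back to in his follow-up below.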
Re: MergeFactor and MaxBufferedDocs value should ...?
I should add that in my situation, the number of documents that fit in RAM is... er... problematic to determine. My current project consists of books, which I chose to index one book at a time. Unfortunately, answering the question "how big is a book?" doesn't help much; they range from 2 pages to over 7,000 pages. So setting the various indexing parameters, especially maxBufferedDocs, is a hard balance between efficiency and memory. Will I happen to get a string of 100 large books? If so, I need to set the limit to a small number, which will not be terribly efficient for the "usual" case.

That said, I don't much care about efficiency in this case. I can't generate the index quickly (20,000+ books), and the factors I've chosen let me generate it between the time I leave work and the time I get back in the morning, so I don't really need much more tweaking. But this illustrates why I referred to picking factors as a "guess". With a heterogeneous index where the documents vary widely in size, picking parameters isn't straightforward. My current parameters may not work if I index the documents in a different order than I do now. I just don't know. They may not even work on the next set of data, since much of the data is OCR, and for many books it's pretty trashy and/or incomplete (imagine the OCR output of a genealogy book that consists entirely of a stylized tree with the names written by hand along the branches in many orientations!). We're promised much better OCR data in the next set of books we index, which may blow my current indexer out of the water.

Which is why I'm so glad that ramSizeInBytes() has been added. It seems to me that I can now create a reasonably generalized way to index heterogeneous documents with "good enough" efficiency. I'm imagining keeping a few simple statistics, like the size of each incoming document and the change in index size as a result of indexing it. This should let me figure out a reasonable factor for predicting how much the *next* addition will increase the index, and flush RAM based on that prediction, with, probably, quite a large safety margin. I don't really care about squeezing out every last bit of efficiency here. What I *do* care about is that the indexing run completes, and this new capability seems to let me ensure that without penalizing the bulk of my indexing because of a few edge cases.

Anyway, thanks for adding this capability, which I'll probably use in the pretty near future. And thanks Michael for your explanation of what these factors really do. It may have been documented before, but this time it is finally sticking in my aging brain...

Erick

On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

"Erick Erickson" <[EMAIL PROTECTED]> wrote:
> I haven't used it yet, but I've seen several references to
> IndexWriter.ramSizeInBytes() and using it to control when the writer
> flushes the RAM. This seems like a more deterministic way of making
> things efficient than trying various combinations of maxBufferedDocs,
> mergeFactor, etc., all of which are guesses at best.

I agree this is the most efficient way to flush. The one caveat is this Jira issue:

    http://issues.apache.org/jira/browse/LUCENE-845

which can cause over-merging if you make maxBufferedDocs too large. I think the rule of thumb to avoid this issue is 1) set maxBufferedDocs to be no more than 10X the "typical" number of docs you will flush, and then 2) flush by RAM usage.
So, for example, if when you flush by RAM you typically flush "around" 200-300 docs, then setting maxBufferedDocs to e.g. 1000 is good: it's far above 200-300 (so it won't trigger a flush when you didn't want one) but also well below 10X your typical range of docs (so it won't tickle the above bug).

Mike
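Pulling Mike's advice together: a minimal sketch of flushing by RAM usage, assuming a Lucene version whose IndexWriter exposes ramSizeInBytes() and a public flush call (the exact name of that call differs between versions, so check your IndexWriter); the 48 MB threshold, the class name and the Iterator-of-Documents parameter are made up for illustration.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class RamFlushingIndexer {
        // Flush buffered documents once they take roughly this much RAM.
        private static final long MAX_RAM_BYTES = 48 * 1024 * 1024;

        public static void indexAll(IndexWriter writer, Iterator docs) throws IOException {
            // Keep maxBufferedDocs well above the number of docs that typically
            // fit in MAX_RAM_BYTES, so only the RAM check below triggers flushes,
            // but below ~10x that number to avoid LUCENE-845.
            writer.setMaxBufferedDocs(1000);
            while (docs.hasNext()) {
                writer.addDocument((Document) docs.next());
                if (writer.ramSizeInBytes() > MAX_RAM_BYTES) {
                    writer.flush();   // the flush call's name varies by Lucene version
                }
            }
        }
    }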
RE: Linking two different indexes
Thanks Damien. I believe addIndexes() is only going to add one index into the other. But how do I actually link the documents, either at search time or at index time, between the URL in the database index and the URL in the Nutch index? To explain my problem a little better:

Nutch index (URL, content):

    A    www.something.com        A lot of junk that needs linking
    B    www.somethingelse.com    Some more junk that needs linking

Lucene index from the database (URL, other fields):

    D    www.something.com
    G    www.something.com

I want D and G to be linked with A, either at indexing time or at searching time. Can anyone elaborate on how to do this? Thanks in advance, and thanks again Damien.

Mike
Re: Reverse search
On app startup:

1) Parse all queries and place them in an array.
2) Create a RAM index containing a doc for each query, with content consisting of the query's terms (see Query.extractTerms). For optimal performance, only index the rarest term for queries with multiple mandatory criteria, e.g. PhraseQuerys. "Rarest" can be determined by looking at IndexReader.docFreq(t) on an existing index which is representative of your type of content.
3) Any queries that can't be handled by 2), e.g. FuzzyQueries, go on a list of "run always" queries.

Whenever you receive a new document:

1) Put it in a MemoryIndex.
2) Get a list of the document's terms by calling memoryIndex.getReader().terms().
3) For each term, hit your query RAM index and get queryIndexReader.termDocs(term). This gives you the ids of queries that need to be run; you can use the doc id to index straight into your parsed queries array.
4) Run all queries found in 3), plus all those held in your "run always" list, against the MemoryIndex containing your new document.

Hope this helps,
Mark

Melanie Langlois wrote:

> Hi Mark, if I follow you, I should list the key terms in my incoming document, then select the queries which contain these key terms, and then run those queries on my index? If this is correct, there are two things I don't understand: how do I know which term is a key term in my document? And how can I select the queries? Should I index them in a separate index?
>
> Thanks,
> Mélanie Langlois

mark harwood wrote (Friday, March 23, 2007):

> Bear in mind that the million queries you run on the MemoryIndex can be shortlisted if you place those queries in a RAM index and use the source document's terms to "query the queries". The list of unique terms for your document is readily available in the MemoryIndex's TermEnum. You can take this list and find the "likely related queries" to execute from your query index. Note that for phrase queries, or other forms of query with multiple mandatory terms, you should only index one of the terms (preferably the rarest) to ensure that your query is not needlessly executed. For example, using this approach I need only run the phrase query for "XYZ limited" whenever I encounter a document with the rare term "XYZ" in it, rather than the much more commonplace "limited".
>
> Cheers,
> Mark

karl wettin wrote (Friday, 23 March 2007):

> On 23 Mar 2007, at 09:57, Melanie Langlois wrote:
>
> > Well, I thought to use the PerFieldAnalyzerWrapper, which contains as its base the SnowballAnalyzer with English stopwords, and use SnowballAnalyzer with language-specific keywords for the fields which will be in different languages. But I see that in your MemoryIndexTest you commented out the use of SnowballAnalyzer; is that because it's too slow? In that case, I think I could use the StandardAnalyzer... what do you think?
>
> I think that creating an index with a couple of documents takes a fraction of the time it will take to place a million queries on that index. There is no real need to optimize something that takes milliseconds when you in the same process do something that takes half a minute.
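Putting Mark's steps together: a minimal sketch, assuming Lucene's contrib MemoryIndex and the 2.x-era core API. It indexes every term of every query (rather than only the rarest mandatory term, which is Mark's optimisation), and it assumes the stored queries all target a "content" field and can report their terms via Query.extractTerms(); the class and field names are made up for illustration.

    import java.util.*;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    public class ReverseSearchSketch {

        /** Returns the positions (into 'queries') of the queries matching 'incomingText'. */
        public static List match(Query[] queries, String incomingText) throws Exception {
            // 1) Startup: one document per query, holding the query's own terms.
            //    (Queries that cannot report their terms, e.g. FuzzyQuery, would
            //    go on Mark's "run always" list instead.)
            RAMDirectory queryDir = new RAMDirectory();
            IndexWriter qw = new IndexWriter(queryDir, new WhitespaceAnalyzer(), true);
            for (int i = 0; i < queries.length; i++) {
                Set terms = new HashSet();
                queries[i].extractTerms(terms);
                Document doc = new Document();
                for (Iterator it = terms.iterator(); it.hasNext();) {
                    Term t = (Term) it.next();
                    doc.add(new Field("term", t.text(), Field.Store.NO, Field.Index.UN_TOKENIZED));
                }
                qw.addDocument(doc);      // doc id == position in 'queries'
            }
            qw.close();
            IndexReader queryReader = IndexReader.open(queryDir);

            // 2) New document arrives: put it in a MemoryIndex and walk its terms,
            //    collecting the ids of queries that share at least one term.
            MemoryIndex mi = new MemoryIndex();
            mi.addField("content", incomingText, new WhitespaceAnalyzer());
            TermEnum te = mi.createSearcher().getIndexReader().terms();
            Set candidates = new HashSet();
            while (te.next()) {
                TermDocs td = queryReader.termDocs(new Term("term", te.term().text()));
                while (td.next()) {
                    candidates.add(new Integer(td.doc()));
                }
            }

            // 3) Run only the candidate queries against the single-document index.
            List matches = new ArrayList();
            for (Iterator it = candidates.iterator(); it.hasNext();) {
                int id = ((Integer) it.next()).intValue();
                if (mi.search(queries[id]) > 0.0f) {
                    matches.add(new Integer(id));
                }
            }
            queryReader.close();
            return matches;
        }
    }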
Re: index word files ( doc )
I've been using Ryan's textmining in preference to POI, since internally textmining uses POI plus the Word 6 extractor and so handles a greater variety of files.

Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files, and any ideas on Word files older than the Word 6 format?

Regards,
Antony

Ryan Ackley wrote:

> As the author of both Word POI and textmining.org, I recommend using textmining.org. POI is for general-purpose manipulation of Word documents; textmining's only purpose is extracting text.
>
> Also, people recommend using POI for text extraction, but the only place I've seen an actual how-to on this is in the "Lucene in Action" book.
Re: index word files ( doc )
Yes, I do have plans for adding fast-save support and support for more file formats. The time frame for this happening is the next couple of months.

I'm playing with the idea of offering a commercial version. I want to continue to support the open source community, so I want to keep it open source or free and add value that people would be willing to pay for. Any comments on this are appreciated. One thing I thought of would be to continue to offer the text extraction as open source, but add HTML conversion with hit highlighting for a variety of file formats as a commercial add-on. Is this something anyone would pay for? What are some other pain points of the Lucene community besides text extraction?

On 3/25/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:

> Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse the 'fast-saved' files, and any ideas on Word files older than the Word 6 format?
Re: index word files ( doc )
Ryan Ackley wrote:

> As the author of both Word POI and textmining.org, I recommend using textmining.org. POI is for general-purpose manipulation of Word documents; textmining's only purpose is extracting text.

I wish the two would collaborate, though. It's true that POI contains code for writing, which isn't necessary for indexing. But it's also true that POI contains code for extracting images, which for many projects *is* necessary.

> Also, people recommend using POI for text extraction, but the only place I've seen an actual how-to on this is in the "Lucene in Action" book.

It's not too difficult, though:

    doc.getTextTable().getTextPieces();

The downside of that approach is that some of the text you get back isn't "text" in the sense that you might expect. (I consider it an upside myself, because sometimes it's good to find all this otherwise hidden text.)

Daniel
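To flesh out Daniel's one-liner: a minimal sketch using POI's HWPF classes, under the assumption of a 2007-era POI where TextPiece lives in org.apache.poi.hwpf.model and exposes its text as a StringBuffer; both the package and the accessor have moved around between POI versions, so treat those names as assumptions.

    import java.io.FileInputStream;
    import java.util.Iterator;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.TextPiece;

    public class WordTextDump {
        public static void main(String[] args) throws Exception {
            HWPFDocument doc = new HWPFDocument(new FileInputStream(args[0]));

            // The text-piece route: walks every stored piece, which can include
            // text that is not visible in the document body.
            StringBuffer sb = new StringBuffer();
            for (Iterator it = doc.getTextTable().getTextPieces().iterator(); it.hasNext();) {
                TextPiece piece = (TextPiece) it.next();
                sb.append(piece.getStringBuffer());   // accessor name differs across POI versions
            }
            System.out.println(sb.toString());

            // A simpler alternative is the main document range as one string:
            // System.out.println(doc.getRange().text());
        }
    }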
Re: Linking two different indexes
Yakn wrote:

> But how do I actually link the documents, either at search time or at index time, between the URL in the database index and the URL in the Nutch index? [...] I want D and G to be linked with A, either at indexing time or at searching time. Can anyone elaborate on how to do this?

Unless you define what "linked with" actually means, it's going to be hard to offer suggestions, but have you looked at ParallelReader? If that won't do what you want, then the better way to approach this is to explain what you're actually trying to *do*, rather than asking for advice on how to implement one possible way of doing it.

Daniel
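For what it's worth, a minimal sketch of the ParallelReader route Daniel mentions, assuming the Lucene 2.x API; the index paths and the "url" field name are made up for illustration. The big caveat is in the first comment: both indexes must line up document for document, which in Mike's case would mean building the database index in the same document order as the Nutch index.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class ParallelSearch {
        public static void main(String[] args) throws Exception {
            // ParallelReader requires that both indexes contain the same number of
            // documents, added in the same order, so that document N in one index
            // describes the same item as document N in the other.
            ParallelReader reader = new ParallelReader();
            reader.add(IndexReader.open(args[0]));   // e.g. the database index
            reader.add(IndexReader.open(args[1]));   // e.g. the Nutch index

            IndexSearcher searcher = new IndexSearcher(reader);
            Hits hits = searcher.search(new TermQuery(new Term("url", "www.something.com")));
            for (int i = 0; i < hits.length(); i++) {
                // Each hit exposes the union of the fields from both indexes.
                System.out.println(hits.doc(i));
            }
            searcher.close();
        }
    }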
how to search over another search
Hi,

I have two separate indexes, but some fields are common between them. I want to search one index and then apply the result to the second one. What solution do you suggest? And what happens with the fields? The first document has some fields that are not present in the second one, so I need the final document to have all the fields from both indexes.

Thanks.

--
Regards,
Mohammad
Re: index word files ( doc )
Ryan Ackley wrote:

> Yes, I do have plans for adding fast-save support and support for more file formats. The time frame for this happening is the next couple of months.

That would be good when it comes. It would be nice if it could handle a 'brute force' mode where, in the event of problems, it just allows whatever text it can find to be extracted. Currently, if there is an exception, I run a raw strings parser on the file to fetch what I can.

One problem I found is that files not padded to 512-byte blocks cannot be parsed, although Word reads them happily. They seem to be valid in other respects, i.e. they have the 1Table, Root Entry and other recognisable parts. Padding the file to a 512-byte boundary with nulls makes it parse OK.

Antony
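In case it helps anyone hitting the same problem, a minimal sketch of the workaround Antony describes: padding the file's bytes with trailing nulls to a 512-byte boundary before handing them to the extractor. The class and method names are made up, and the resulting stream would then be passed to whatever parser you use.

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class Pad512 {
        /** Pads the bytes of a .doc file with trailing nulls to a 512-byte boundary. */
        public static byte[] padTo512(byte[] raw) {
            int remainder = raw.length % 512;
            if (remainder == 0) {
                return raw;
            }
            byte[] padded = new byte[raw.length + (512 - remainder)];
            System.arraycopy(raw, 0, padded, 0, raw.length);
            return padded;   // the extra trailing bytes are already zero
        }

        public static InputStream paddedStream(File f) throws IOException {
            byte[] raw = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            in.readFully(raw);
            in.close();
            return new ByteArrayInputStream(padTo512(raw));
        }
    }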