RE: Linking two different indexes

2007-03-25 Thread Damien McCarthy
Hi Mike,

IndexWriter provides a method addIndexes() which should do what you are
looking for, if I understand correctly.
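
A minimal sketch of merging one index into another that way (editor's
illustration, not Damien's code; it assumes Lucene 2.x APIs, and the paths
are invented):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Merge the Nutch index into an existing database index.
    IndexWriter writer = new IndexWriter("/path/to/db-index",
            new StandardAnalyzer(), false);      // false = append, don't create
    writer.addIndexes(new Directory[] {
            FSDirectory.getDirectory("/path/to/nutch-index", false) });
    writer.close();

Note that this merges the documents of both indexes into one; it does not
join documents that share a URL.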

Damien

-Original Message-
From: Yakn [mailto:[EMAIL PROTECTED] 
Sent: 25 March 2007 03:02
To: java-user@lucene.apache.org
Subject: Linking two different indexes


I am trying to link the Nutch index and the index generated from my database
using Lucene. At the time of indexing my database, I want to pull the
indexes in from Nutch and link the content from the URL in the database to
the URL that Nutch hit. Can anyone tell me if they have done this and, if
so, how they did it? I would appreciate the help. If anyone knows of another
way, I would be interested in that as well. Thanks in advance.

Mike





Re: MergeFactor and MaxBufferedDocs value should ...?

2007-03-25 Thread Erick Erickson

I should add that in my situation, the number of documents that
fit in RAM is... er... problematic to determine. My current project
is composed of books, each of which I chose to index as a single
document.

Unfortunately, answering the question "how big is a book" doesn't
help much; they range from 2 pages to over 7,000 pages. So setting
the various indexing parameters, especially maxBufferedDocs, is a
hard balance between efficiency and memory. Will I happen
to get a string of 100 large books? If so, I need to set the limit
to a small number, which will not be terribly efficient for the
"usual" case.

That said, I don't much care about efficiency in this case. I can't
generate the index quickly (20,000+ books) and the factors I've
chosen let me generate it between the time I leave work and the
time I get back in the morning, so I don't really need much more
tweaking.

But this illustrates why I referred to picking factors as a "guess".
With a heterogeneous index where the documents vary widely
in size, picking parameters isn't straightforward. My current
parameters may not work if I index the documents in a different
order than I do currently. I just don't know.

They may not even work on the next set of data, since much of
the data is OCR and for many books it's pretty trashy and/or
incomplete (imagine the OCR output of a genealogy book that
consists entirely of a stylized tree with the names written
by hand along the branches in many orientations!).

We're promised much better OCR data in the next set of books
we index, which may blow my current indexer out of the water.

Which is why I'm so glad that ramSizeInBytes() has been
added. It seems to me that I can now create a reasonably
generalized way to index heterogeneous documents with
"good enough" efficiency. I'm imagining keeping a few
simple statistics, like the size of each incoming document and
the change in index size as a result of indexing that doc. This
should allow me to figure out a reasonable factor for
predicting how much the *next* addition will increase the index,
and flushing RAM based upon that prediction. With, probably,
quite a large safety margin. Something like the sketch below.
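
A rough sketch of that bookkeeping (editor's illustration, not Erick's
code; every name and constant is invented, and it assumes an IndexWriter
that exposes ramSizeInBytes() and a publicly callable flush() - on
versions without one, closing and reopening the writer forces the flush):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class AdaptiveFlusher {
        private long ramBudget = 64L * 1024 * 1024; // hypothetical RAM ceiling
        private long lastRam = 0;
        private double costPerInputByte = 2.0;  // running estimate; starts pessimistic
        private double safety = 2.0;            // the "quite large safety margin"

        public void addBook(IndexWriter writer, Document book, long inputBytes)
                throws IOException {
            // Flush pre-emptively if the predicted growth would blow the budget.
            long predicted = (long) (costPerInputByte * inputBytes * safety);
            if (lastRam + predicted > ramBudget) {
                writer.flush();                 // assumed public on this version
                lastRam = 0;
            }
            writer.addDocument(book);
            long ram = writer.ramSizeInBytes();
            if (inputBytes > 0)                 // update observed cost per input byte
                costPerInputByte = 0.9 * costPerInputByte
                        + 0.1 * ((ram - lastRam) / (double) inputBytes);
            lastRam = ram;
        }
    }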

I don't really care if I get every last bit of efficiency in this case. What
I *do* care about is that the indexing run completes, and this
new capability seems to allow me to ensure that without
penalizing the bulk of my indexing because I have a few edge
cases.

Anyway, thanks for adding this capability, which I'll probably
use in the pretty near future.

And thanks Michael for your explanation of what these factors
really do. It may have been documented before, but this time
it finally is sticking in my aging brain...

Erick


On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote:



"Erick Erickson" <[EMAIL PROTECTED]> wrote:
> I haven't used it yet, but I've seen several references to
> IndexWriter.ramSizeInBytes() and using it to control when the writer
> flushes the RAM. This seems like a more deterministic way of
> making things efficient than trying various combinations of
> maxBufferedDocs, MergeFactor, etc., all of which are guesses
> at best.

I agree this is the most efficient way to flush.  The one caveat is
this Jira issue:

  http://issues.apache.org/jira/browse/LUCENE-845

which can cause over-merging if you make maxBufferedDocs too large.

I think the rule of thumb to avoid this issue is: 1) set
maxBufferedDocs to no more than 10X the "typical" number of docs
you will flush, and then 2) flush by RAM usage.

So, for example, if when you flush by RAM you typically flush "around"
200-300 docs, then setting maxBufferedDocs to e.g. 1000 is good, since
it's far above 200-300 (so it won't trigger a flush when you didn't
want it to) but also well below 10X your range of docs (so it
won't tickle the above bug).
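
An editor's sketch of that rule of thumb (not from the thread; the 32 MB
budget and the helpers moreDocs()/nextDoc() are invented, and it assumes
an IndexWriter that exposes ramSizeInBytes() and a publicly callable
flush() - on versions without one, closing and reopening the writer
forces the flush):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index",
            new StandardAnalyzer(), true);
    writer.setMaxBufferedDocs(1000);     // loose bound, well above a typical flush
    long ramBudget = 32L * 1024 * 1024;  // flush whenever buffered RAM exceeds this
    while (moreDocs()) {                 // moreDocs()/nextDoc(): hypothetical helpers
        writer.addDocument(nextDoc());
        if (writer.ramSizeInBytes() > ramBudget)
            writer.flush();              // assumed public on this version
    }
    writer.close();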

Mike





RE: Linking two different indexes

2007-03-25 Thread Yakn

Thanks Damien. I believe that addIndexes() is only going to merge one index
into the other. But how do I actually link the documents, either at search
time or at index time, between the URL in the database index and the URL in
the Nutch index? To explain my problem a little better:

Nutch Index:

    Doc  URL                    Content
    A    www.something.com      A lot of junk that needs linking
    B    www.somethingelse.com  Some more junk that needs linking

Lucene Index (from database):

    Doc  URL                Other Fields
    D    www.something.com
    G    www.something.com


I want D and G to be linked with A, either at indexing time or at search
time. Can anyone elaborate on how to do this? Thanks in advance, and thanks
again Damien.

Mike 



Damien McCarthy wrote:
> 
> Hi Mike,
> 
> IndexWriter provides a method addIndexes() which should do what you are
> looking for, if I understand correctly.
> 
> Damien




Re: Reverse search

2007-03-25 Thread markharw00d


On app startup:
1) Parse all queries and place them in an array.
2) Create a RAMIndex containing a doc for each query, with content
consisting of the query's terms (see Query.extractTerms). For optimal
performance, only index the rarest term for queries with multiple
mandatory criteria, e.g. PhraseQuery. "Rarest" can be determined by
looking at IndexReader.docFreq(t) using an existing index which is
representative of your type of content.
3) For any queries that can't be handled by 2), e.g. FuzzyQuery, add
them to a list of "run always" queries.


Whenever you receive a new document:
1) Put it in a MemoryIndex.
2) Get a list of the document's terms from the MemoryIndex's reader,
e.g. memoryIndex.createSearcher().getIndexReader().terms().
3) For each term, hit your query RAMIndex and get
queryIndexReader.termDocs(term) - this will give you the ids of the
queries that need to be run, and you can use the doc id to index
straight into your parsed queries array.
4) Run all queries found in 3), and all those held in your "run always"
list, against the MemoryIndex containing your new document.
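
To make the recipe concrete, here is an editor's sketch (not Mark's code);
it assumes Lucene 2.x with the contrib MemoryIndex, and the class name,
the field names "qterms" and "content", and the analyzer choice are all
invented:

    import java.io.IOException;
    import java.util.*;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.*;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    public class ReverseSearch {
        private final Query[] queries;           // parsed once at startup (step 1)
        private final List runAlways = new ArrayList();
        private final RAMDirectory queryDir = new RAMDirectory();

        public ReverseSearch(Query[] queries) throws IOException {
            this.queries = queries;
            IndexWriter w = new IndexWriter(queryDir, new WhitespaceAnalyzer(), true);
            for (int i = 0; i < queries.length; i++) {
                StringBuffer sb = new StringBuffer();
                try {                               // step 2: index each query's terms
                    Set terms = new HashSet();      // (for multi-term mandatory queries,
                    queries[i].extractTerms(terms); // index only the rarest term)
                    for (Iterator it = terms.iterator(); it.hasNext();)
                        sb.append(((Term) it.next()).text()).append(' ');
                } catch (UnsupportedOperationException e) {
                    runAlways.add(queries[i]);      // step 3: e.g. FuzzyQuery
                }
                Document d = new Document();        // add a doc even when empty so
                d.add(new Field("qterms", sb.toString(), // that doc id i == queries[i]
                        Field.Store.NO, Field.Index.TOKENIZED));
                w.addDocument(d);
            }
            w.close();
        }

        public List matchingQueries(String docText) throws IOException {
            MemoryIndex mi = new MemoryIndex();     // step 1: doc into a MemoryIndex
            mi.addField("content", docText, new WhitespaceAnalyzer());
            IndexReader docReader = mi.createSearcher().getIndexReader();
            IndexReader queryReader = IndexReader.open(queryDir);
            BitSet candidates = new BitSet();
            TermEnum te = docReader.terms();        // step 2: the document's terms
            while (te.next()) {                     // step 3: "query the queries"
                TermDocs td = queryReader.termDocs(
                        new Term("qterms", te.term().text()));
                while (td.next())
                    candidates.set(td.doc());
            }
            List matches = new ArrayList();         // step 4: run shortlisted queries
            for (int q = candidates.nextSetBit(0); q >= 0;
                    q = candidates.nextSetBit(q + 1))
                if (mi.search(queries[q]) > 0.0f)
                    matches.add(queries[q]);
            for (Iterator it = runAlways.iterator(); it.hasNext();) {
                Query q = (Query) it.next();
                if (mi.search(q) > 0.0f)
                    matches.add(q);
            }
            queryReader.close();
            return matches;
        }
    }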


Hope this helps,
Mark


Melanie Langlois wrote:

Hi Mark,
If I follow you, I should list the key terms in my incoming document, then
select the queries which contain these key terms, and then run those
queries on my index? If this is correct, there are two things I don't
understand:
- how do I know which term is a key term in my document?
- how can I select the queries? Should I index them in a separate index?

Thanks,


Mélanie Langlois 
  
-Original Message-
From: mark harwood [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 11:19 PM

To: java-user@lucene.apache.org
Subject: Re: Reverse search

Bear in mind that the million queries you run on the MemoryIndex can be shortlisted if 
you place those queries in a RAMIndex and use the source document's terms to "query 
the queries". The list of unique terms for your document is readily available in the 
MemoryIndex's TermEnum.
You can take this list and find "likely related queries" to execute from your 
Query index.
Note that for phrase queries, or other forms of query with multiple mandatory
terms, you should only index one of the terms (preferably the rarest) to ensure
that your query is not needlessly executed. For example, using this approach I
need only run the phrase query for "XYZ limited" whenever I encounter a document
containing the rare term "XYZ", rather than the much more commonplace "limited".


Cheers
Mark

- Original Message 
From: karl wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 23 March, 2007 12:54:36 PM
Subject: Re: Reverse search


On 23 Mar 2007, at 09:57, Melanie Langlois wrote:

  
Well, I thought to use the PerFieldAnalyzerWrapper, with the
SnowballAnalyzer with English stopwords as the default, and the
SnowballAnalyzer with language-specific stopwords for the fields
which will be in different languages. But I see that in your
MemoryIndexTest you commented out the use of SnowballAnalyzer; is it
because it's too slow? In that case, I think I could use the
StandardAnalyzer... what do you think?



I think that creating an index with a couple of documents takes a
fraction of the time it will take to place a million queries on that
index. There is no real need to optimize something that takes
milliseconds when, in the same process, you do something that takes
half a minute.


  










Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman
I've been using Ryan's textmining in preference to POI, as internally TM uses
POI plus the Word6 extractor, so it handles a greater variety of files.


Ryan, thanks for fixing your site.  Do you have any plans/ideas on how to parse
the 'fast-saved' files, or on Word files older than the Word 6 format?


Regards
Antony


Ryan Ackley wrote:

As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.

Also, people recommend using POI for text extraction but the only
place I've seen an actual how-to on this is in the "Lucene in Action"
book.







Re: index word files ( doc )

2007-03-25 Thread Ryan Ackley

Yes, I do have plans to add fast-save support and support for more
file formats. The time frame for this is the next couple of
months.

I'm playing with the idea of offering a commercial version. I want to
continue to support the open source community, so I would keep it
open source or free and add value that people would be willing to pay
for.

Any comments on this are appreciated. One thing I thought of would be
to continue to offer the text extraction as open source, but add HTML
conversion with hit highlighting for a variety of file formats as a
commercial add-on. Is this something anyone would pay for? What are
some other pain points of the Lucene community besides text
extraction?

On 3/25/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:

I've been using Ryan's textmining in preference to POI, as internally TM uses
POI plus the Word6 extractor, so it handles a greater variety of files.

Ryan, thanks for fixing your site.  Do you have any plans/ideas on how to parse
the 'fast-saved' files, or on Word files older than the Word 6 format?

Regards
Antony








Re: index word files ( doc )

2007-03-25 Thread Daniel Noll

Ryan Ackley wrote:

As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.


I wish the two would collaborate, though.  It's true that POI contains
code for writing, which isn't necessary for indexing.  But it's also true
that POI contains code for extracting images, which for many projects
*is* necessary.



Also, people recommend using POI for text extraction but the only
place I've seen an actual how-to on this is in the "Lucene in Action"
book.


It's not too difficult, though:

  doc.getTextTable().getTextPieces();

The downside of that approach is that some of the text you get back isn't
"text" in the sense that you might expect.  (I consider it an upside
myself, because sometimes it's good to find all this otherwise hidden text.)
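
A slightly fuller sketch of that call (editor's illustration, assuming the
org.apache.poi.hwpf API of the time; the file name is invented):

    import java.io.FileInputStream;
    import java.util.Iterator;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.TextPiece;

    // Collect every text piece in the file, including fragments that
    // are not visible in the document body.
    HWPFDocument doc = new HWPFDocument(new FileInputStream("sample.doc"));
    StringBuffer text = new StringBuffer();
    for (Iterator it = doc.getTextTable().getTextPieces().iterator();
            it.hasNext();) {
        TextPiece piece = (TextPiece) it.next();
        text.append(piece.getStringBuffer());
    }
    System.out.println(text);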


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902




Re: Linking two different indexes

2007-03-25 Thread Daniel Noll

Yakn wrote:

Thanks Damien. I believe that addIndexes() is only going to merge one index
into the other. But how do I actually link the documents, either at search
time or at index time, between the URL in the database index and the URL in
the Nutch index? To explain my problem a little better:

Nutch Index:

    Doc  URL                    Content
    A    www.something.com      A lot of junk that needs linking
    B    www.somethingelse.com  Some more junk that needs linking

Lucene Index (from database):

    Doc  URL                Other Fields
    D    www.something.com
    G    www.something.com

I want D and G to be linked with A, either at indexing time or at search
time. Can anyone elaborate on how to do this? Thanks in advance, and thanks
again Damien.


Unless you define what "linked with" actually means, it's going to be
hard to offer suggestions, but have you looked at ParallelReader?


If that won't do what you want, then the better approach is to explain
what you're actually trying to *do*, rather than asking for advice on
how to implement one possible way of doing it.
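
For reference, a minimal sketch of the ParallelReader approach (editor's
illustration, assuming Lucene 2.x; the paths are invented). ParallelReader
requires both indexes to contain the same documents in the same order, so
that document N in one lines up with document N in the other:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.search.IndexSearcher;

    // Each hit then exposes the fields of both the Nutch document and
    // the database document.
    ParallelReader pr = new ParallelReader();
    pr.add(IndexReader.open("/path/to/nutch-index"));
    pr.add(IndexReader.open("/path/to/db-index"));
    IndexSearcher searcher = new IndexSearcher(pr);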


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902




how to search over another search

2007-03-25 Thread Mohammad Norouzi

Hi,
I have two separate indexes, but some fields are common between them.
Now I want to search one index and then apply the result to the second
one. What solution do you suggest?
And what happens with the fields? The first document has some fields
that are not present in the second one, so I need the final document to
have all the fields of both indexes.

Thanks


--
Regards,
Mohammad


Re: index word files ( doc )

2007-03-25 Thread Antony Bowesman

Ryan Ackley wrote:

Yes, I do have plans to add fast-save support and support for more
file formats. The time frame for this is the next couple of
months.


That would be good when it comes.  It would be nice if it could handle a 'brute
force' mode where, in the event of problems, it just allows whatever text it
can find to be extracted.  Currently, if there is an exception, I just run a
raw strings parser on the file to fetch what I can.


One problem I found is that files not padded to 512-byte blocks cannot be
parsed, even though Word reads them happily.  They seem to be valid in other
respects, i.e. they have the 1Table, Root Entry and other recognisable
parts.  Padding the file to a 512-byte boundary with nulls makes the file
parse OK.
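
A small sketch of that workaround (editor's illustration; plain JDK, the
class and method names are invented):

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class DocPadder {
        // Pad a .doc file with nulls out to the next 512-byte boundary
        // so the OLE2 container parser will accept it.
        static void padTo512(File f) throws IOException {
            long rem = f.length() % 512;
            if (rem == 0) return;                       // already block-aligned
            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            try {
                raf.seek(raf.length());
                raf.write(new byte[(int) (512 - rem)]); // new byte[] is zero-filled
            } finally {
                raf.close();
            }
        }
    }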


Antony


