Eric,

Thank you very much for the insight on offsets. I think I may not really
need to worry about offsets.

Nonetheless, I solved my offset problem by overriding the Java
StringReader class instead of overriding the TokenStream class. The
StringReader class does the streaming nicely, solving both the offset
problem and the memory-hog problem.
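A minimal sketch of that streaming idea, for the archives (the
MultiPageReader name and the Iterator<String> page source are my
illustration, not the exact code):

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.Iterator;

    // Streams the text of many pages as one logical character stream,
    // so the analyzer only ever holds one page's text in memory.
    public class MultiPageReader extends Reader {
        private final Iterator<String> pages; // source of page texts
        private Reader current = new StringReader("");

        public MultiPageReader(Iterator<String> pages) {
            this.pages = pages;
        }

        public int read(char[] buf, int off, int len) throws IOException {
            int n = current.read(buf, off, len);
            while (n < 0 && pages.hasNext()) {
                // Current page exhausted: advance to the next, with a
                // space so tokens at page boundaries don't run together.
                current = new StringReader(" " + pages.next());
                n = current.read(buf, off, len);
            }
            return n;
        }

        public void close() throws IOException {
            current.close();
        }
    }

Handing such a Reader to the Field(String name, Reader reader)
constructor lets Lucene pull the content lazily, and the offsets come
out contiguous because the tokenizer sees a single stream.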

On Jan 6, 2008 5:43 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:

> Well, I wonder if offsets mean anything in this context. I suppose another
> way of asking this is "why do you care?". What is it that you're trying
> to make happen? Or prevent from happening?
>
> The offsets are used for a variety of purposes, but what does it mean
> for two offsets to be "close" in this context? Imagine that you have
> the following pages:
> p1
>  p1.1
>  p1.2
>    p 1.2.1
>    p 1.2.2
>  p1.3
>  p1.4
> p2
>  p2.1
>  p2.2
>     p2.2.1
>     p2.2.2
>
>
> Now, is a token on p1.2.1 near or far from a token on p2.2.1? Does
> it depend upon how you happened to crawl the site? If so, could you
> crawl it once and get pages p1 followed by p2, and the next time
> get p1, p1.1, p1.2..... p2? What does adjacency mean then? And since
> it can change anyway, does it really provide you with anything useful?
>
> What all this is leading to is that so far, this seems like an XY problem.
> That is, you're trying to solve some (as yet unstated) problem and
> assuming that tokenizing is the answer. What is the higher-level
> problem you're struggling with? Why couldn't you go ahead and
> index each page as a single document, *also* index the base URL,
> then just search with an "AND URL:blah.blort.blivet" clause?
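>
> A rough sketch of that, in 2.x-era API (the Page type, the field
> names, and contentQuery are illustrative assumptions, not tested
> code):
>
>                for (Page page : pages) { // one Document per crawled page
>                    Document doc = new Document();
>                    // untokenized, so the URL is matched exactly
>                    doc.add(new Field("url", page.getBaseUrl(),
>                            Field.Store.YES, Field.Index.UN_TOKENIZED));
>                    doc.add(new Field("content", page.getText(),
>                            Field.Store.NO, Field.Index.TOKENIZED));
>                    writer.addDocument(doc);
>                }
>
>                // restrict any content query to one site
>                BooleanQuery q = new BooleanQuery();
>                q.add(contentQuery, BooleanClause.Occur.MUST);
>                q.add(new TermQuery(new Term("url", "www.xxx.com")),
>                        BooleanClause.Occur.MUST);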
>
> I think my first approach would be really simple-minded: just crawl the
> site and write the resulting pages out to one file per site on disk,
> then index those files. If for no other reason than that re-crawling
> the website every time will be an utter pain in the neck: you'll be
> waiting forever for each iteration of your index if you have to go back
> to the site each time. *Then* worry about elegance <G>...
>
> But I will say, about offsets: If you're overriding next(), you can make
> them anything you want.
>
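> For the record, a sketch of that kind of next() override (2.x-era
> Token API; the base-offset bookkeeping is an assumption about how one
> might stitch pages together):
>
>                // Wraps another TokenStream and shifts every token's
>                // offsets by a fixed base, e.g. the length of text
>                // already emitted for earlier pages.
>                class OffsetShiftingTokenStream extends TokenStream {
>                    private final TokenStream in;
>                    private final int base;
>
>                    OffsetShiftingTokenStream(TokenStream in, int base) {
>                        this.in = in;
>                        this.base = base;
>                    }
>
>                    public Token next() throws IOException {
>                        Token t = in.next();
>                        if (t == null) return null; // stream exhausted
>                        Token shifted = new Token(t.termText(),
>                                t.startOffset() + base,
>                                t.endOffset() + base, t.type());
>                        shifted.setPositionIncrement(t.getPositionIncrement());
>                        return shifted;
>                    }
>                }
>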
> Best
> Erick
>
> On Jan 6, 2008 1:54 PM, Developer Developer <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Eric,
> >
> > No, you are not off base. You are on track, but here is my problem.
> >
> > I have a requirement to create one Lucene document per site. That is,
> > suppose I crawl www.xxx.com, which has 1000 pages in it. If I use Nutch,
> > it will create 1000 Lucene documents, i.e. one document per page. My
> > requirement is to combine all 1000 pages into just one Lucene document.
> >
> > One approach is to construct an in-memory String by combining content
> > from all the pages and then index it in Lucene as one document, but
> > this is not an elegant approach because the in-memory String would be
> > a memory hog. Therefore I am trying to construct a TokenStream for
> > each document as follows:
> >
> >                StandardAnalyzer st = new StandardAnalyzer();
> >                TokenStream stream = st.tokenStream("content",
> >                        new StringReader(documentText));
> >
> > and then construct a Lucene Document by using a Field based on the
> > TokenStream, via the constructor
> >
> >                Field(String name, TokenStream tokenStream)
> >
> > (javadoc:
> > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Field.html#Field%28java.lang.String,%20org.apache.lucene.analysis.TokenStream%29
> > )
> >
> > The TokenStream would be my own implementation, which will override
> > the next() method and return tokens one by one. With this approach I
> > can avoid creating a huge in-memory String.
> >
> > So, I am wondering: will the tokens have correct offset values with
> > this approach?
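> >
> > A sketch of that wiring, with MultiPageTokenStream as a hypothetical
> > stand-in for my implementation:
> >
> >                TokenStream stream = new MultiPageTokenStream(pages);
> >                Document doc = new Document();
> >                // Field(String, TokenStream): indexed and tokenized
> >                // from the stream, nothing stored
> >                doc.add(new Field("content", stream));
> >                writer.addDocument(doc);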
> >
> > Thanks !
> >
> >
> >
> >
> > On Jan 6, 2008 1:13 PM, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > > I don't get what you mean about extracting token streams. TokenStreams
> > > are, as far as I understand, an analysis-time class; that is, they're
> > > used either when originally indexing the document or when analyzing a
> > > query.
> > >
> > > If you do not have the entire document stored in the index, you have
> to
> > > do something like reconstruct the document from the indexed data,
> which
> > > is time-consuming. But see the Luke code for a way to do this.
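> > >
> > > (If term vectors were stored at index time, the raw terms are at
> > > least retrievable; a sketch, assuming the field was indexed with
> > > Field.TermVector.YES:
> > >
> > >                // pull the indexed terms for one document's field
> > >                TermFreqVector tfv =
> > >                        reader.getTermFreqVector(docId, "content");
> > >                String[] terms = tfv.getTerms();
> > >
> > > Full reconstruction from postings alone, which is what Luke does, is
> > > considerably more work.)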
> > >
> > > If you *do* have stored fields, then you have the raw text available.
> > >
> > > In either case, you eventually get a string representation of the
> > various
> > > fields in the documents you want to combine. Why not just index that?
> > > Since this is an index process and (presumably) can take some time,
> > > you could either concatenate the strings together in memory and index
> > > the string or write it to a file on disk and then index *that*.
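> > >
> > > (For the file-on-disk route, Lucene can tokenize straight from a
> > > Reader, so the combined text never has to live in memory as one
> > > String; the field name and file variable are illustrative:
> > >
> > >                Document doc = new Document();
> > >                // Field(String, Reader): tokenized from the file,
> > >                // not stored
> > >                doc.add(new Field("content",
> > >                        new FileReader(combinedFile)));
> > >                writer.addDocument(doc);
> > > )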
> > >
> > > If this is way off base, perhaps a bit more explanation of the problem
> > > you're trying to solve would be in order.
> > >
> > > Best
> > > Erick
> > >
> > > On Jan 6, 2008 12:45 PM, Developer Developer <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Hello Friends,
> > > >
> > > > I have an unusual requirement: merging two or more Lucene-indexed
> > > > documents into just one indexed document. For example:
> > > >
> > > > Document newDocument = doc1 + doc2 + doc3
> > > >
> > > > In order to do this I am planning to extract token streams from
> > > > each document (i.e. doc1, doc2 and doc3) and use them to construct
> > > > newDocument. The reason is that I do not have access to the content
> > > > of the original documents (doc1, doc2, doc3).
> > > >
> > > >
> > > > My questions are:
> > > >
> > > > 1. Is this the correct approach?
> > > > 2. Do I have to update the start and end offsets of the tokens? The
> > > > tokens from the original documents (doc1, 2, 3) were relative to
> > > > those documents, so in newDocument these offsets may be wrong.
> > > > 3. If yes, then how do I make sure that the merged tokens have
> > > > correct start and end offsets?
> > > >
> > > > Thanks !
> > > >
> > >
> >
>
