Hi Dan,

My guess, though you didn't directly say so, is that you're representing each 
sentence/"line" as a separate Lucene document.  To directly answer your 
question: as far as I know, Lucene can't query inter-document relations (like 
database joins) in a single search.  The workaround is to perform multiple 
searches, feeding the results of one query into the next.  E.g.: first query 
for all lines with tag X and retrieve their line-ID and transcript-ID field 
values; then query for tag Y, requiring the same transcript-ID field value and 
any one of the line-ID values that fall within the window you want.
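
The two-pass idea can be sketched without any Lucene calls. This is only an 
illustration: LineDoc and TwoPassTagSearch are hypothetical names, and an 
in-memory list stands in for the index, so each "query" here is a plain loop 
rather than a real search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for one per-line Lucene document.
class LineDoc {
    final String transcriptId;
    final int lineId;
    final Set<String> tags;

    LineDoc(String transcriptId, int lineId, Set<String> tags) {
        this.transcriptId = transcriptId;
        this.lineId = lineId;
        this.tags = tags;
    }
}

class TwoPassTagSearch {
    // Stand-in for "query 1": all lines carrying the given tag.
    static List<LineDoc> linesWithTag(List<LineDoc> index, String tag) {
        List<LineDoc> hits = new ArrayList<>();
        for (LineDoc d : index) {
            if (d.tags.contains(tag)) {
                hits.add(d);
            }
        }
        return hits;
    }

    // Stand-in for "query 2", applied to each hit from query 1: keep lines
    // with tagY in the same transcript whose line-ID is within `window`.
    // Each result is a pair {line with tagX, line with tagY}.
    static List<LineDoc[]> near(List<LineDoc> index, String tagX,
                                String tagY, int window) {
        List<LineDoc[]> pairs = new ArrayList<>();
        for (LineDoc x : linesWithTag(index, tagX)) {
            for (LineDoc y : linesWithTag(index, tagY)) {
                boolean sameTranscript = y.transcriptId.equals(x.transcriptId);
                if (sameTranscript && Math.abs(y.lineId - x.lineId) <= window) {
                    pairs.add(new LineDoc[] {x, y});
                }
            }
        }
        return pairs;
    }
}
```

Against a real index the second pass means one extra search per first-pass 
hit (or one large boolean query), which is the main cost of this approach.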

If instead (or perhaps in addition, depending on your other needs), each full 
transcript is a Lucene document, you can perform the kinds of searches you're 
talking about with tools available in Lucene.

I'm thinking of a Lucene document with a "line-tags" field, populated with the 
tags you've associated with each line, and with token positions adjusted so 
that each line occupies exactly one position.  Two tags assigned to the same 
line are given the same position (the second gets a position increment of 
zero), and a line with no tags still advances the position, so that position 
distance equals line distance.  (Lucene users sometimes call terms with the 
same position "synonyms", because that's the most common thing this capability 
is used for.)
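
As a rough sketch of that position bookkeeping: the code below computes, for 
each tag, the position increment it would need so that every line maps to one 
position.  TagToken and LineTagPositions are hypothetical names, and this is 
plain Java rather than a Lucene analyzer; in a real indexer these values would 
presumably be applied to the tokens of a custom TokenStream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one token of the "line-tags" field.
class TagToken {
    final String term;
    final int positionIncrement;

    TagToken(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }
}

class LineTagPositions {
    // tagsPerLine.get(i) holds the tags of the (i + 1)-th line; it may be
    // empty.  The first tag on a line advances the position by the number
    // of lines since the last emitted tag; further tags on the same line
    // get an increment of 0, so they share that line's position.
    static List<TagToken> tokens(List<List<String>> tagsPerLine) {
        List<TagToken> out = new ArrayList<>();
        int pendingIncrement = 0; // lines passed since the last emitted tag
        for (List<String> tags : tagsPerLine) {
            pendingIncrement++;
            for (String tag : tags) {
                out.add(new TagToken(tag, pendingIncrement));
                pendingIncrement = 0;
            }
        }
        return out;
    }
}
```

For three lines tagged [Symptom, Brand], [], [Brand], this yields increments 
1, 0 and 2: the first two tags share a position, and the untagged middle line 
is still counted.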

Then you can run a SpanNearQuery over the line-tags field, to return matches 
where tag X is within N lines of tag Y.

(See 
<http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/spans/package-summary.html>
 for info on the Lucene Span Query family.)
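
To make "within N lines" concrete in terms of positions, here is a small 
model of the check an unordered near query over two tags performs.  It is a 
sketch in plain Java, not Lucene code, and TagProximity is a hypothetical 
name.  The slopFor mapping is an assumption worth verifying against your 
Lucene version: it takes slop to count the unmatched positions lying between 
the two terms.

```java
import java.util.List;

class TagProximity {
    // True if some position of tag X lies within maxLines positions of
    // some position of tag Y, in either order.  Positions are line
    // positions in the "line-tags" field.
    static boolean withinLines(List<Integer> xPositions,
                               List<Integer> yPositions, int maxLines) {
        for (int x : xPositions) {
            for (int y : yPositions) {
                if (Math.abs(x - y) <= maxLines) {
                    return true;
                }
            }
        }
        return false;
    }

    // Assumed mapping to a two-clause span-near slop: tags N lines apart
    // have N - 1 positions between them.  Only meaningful for N >= 1.
    static int slopFor(int maxLines) {
        if (maxLines < 1) {
            throw new IllegalArgumentException("maxLines must be >= 1");
        }
        return maxLines - 1;
    }
}
```

In Lucene itself the equivalent would be a SpanNearQuery built from two 
SpanTermQuery clauses on the line-tags field, with inOrder set to false.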

Steve

On 4/8/2009 at 9:33 AM, Dan Scrima wrote:
> So I have a requirement where I have a directory filled with xml files.
> I wrote a parser to parse these files, and index all of the xml
> attributes and properties into documents. An example of one of these
> documents is below. I'm parsing sentences into words, and tagging the
> sentences based on certain criteria.
> 
> My issue is trying to find out if lucene can handle cross-document
> searching. So below is indexed as a single document... and there will
> be multiple sentences before, after, and throughout an entire
> transcript. Is it possible somehow to say, "I want a result where one
> line marked as Symptom is 5 lines away from another line marked as
> Brand." So in essence, I'm trying to search across multiple lucene
> documents.
> 
> Any thoughts or literature out there?
> 
> <transcript>
>   <line id="1">
>     <tag id="10" type="Symptom" />
>     <tag id="12" type="Brand" />
>     <word>
>       <token>Coughing</token>
>       <part-of-speech>SBJ</part-of-speech>
>     </word>
>     <word>
>       <token>is</token>
>       <part-of-speech>VB</part-of-speech>
>     </word>
>     <word>
>       <token>caused</token>
>       <part-of-speech>NP</part-of-speech>
>     </word>
>     <word>
>       <token>by</token>
>       <part-of-speech>PP</part-of-speech>
>     </word>
>     <word>
>       <token>Mucinex</token>
>       <part-of-speech>PDC</part-of-speech>
>     </word>
>   </line>
> </transcript>

