Re: Indexing document instances and retrieving instance attributes

Chris D Thu, 11 Aug 2005 11:34:49 -0700

So I've decided to simply use empty fields, which
brings up several other questions.


First, is there a limit on the number of fields per document?

Secondly, why are fields in Document implemented with a Vector instead
of a HashSet or similar? Wouldn't retrieval be faster without
iterating through a list?

Lastly, how difficult (or even possible) is it to extend the Document
class to add the functionality I want?

I know I'm likely missing a simple solution but I just can't see it.
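
A minimal sketch of the empty-field idea, in plain Java rather than against
the Lucene API: every instance adds exactly one value for each per-instance
field name, with "" standing in for a missing attribute, so the n-th value of
each field belongs to the n-th instance. The class name PaddedInstanceFields
and its methods are hypothetical stand-ins for illustration, not Lucene's
Document class, and whether a real index keeps empty values in this order is
an assumption that would need checking.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a document with ordered, repeatable fields (NOT Lucene's Document). */
public class PaddedInstanceFields {

    /** Per-instance field names; every instance adds exactly one value for each. */
    static final String[] INSTANCE_FIELDS = {"FILENAME", "DATE", "AUTH"};

    // Ordered name/value pairs; the same name may appear many times.
    private final List<String[]> fields = new ArrayList<>();

    public void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    /** Adds one instance, substituting "" for any attribute that is null. */
    public void addInstance(String filename, String date, String author) {
        String[] values = {filename, date, author};
        for (int i = 0; i < INSTANCE_FIELDS.length; i++) {
            add(INSTANCE_FIELDS[i], values[i] == null ? "" : values[i]);
        }
    }

    /** Returns all values stored under the given name, in insertion order. */
    public List<String> getValues(String name) {
        List<String> out = new ArrayList<>();
        for (String[] f : fields) {
            if (f[0].equals(name)) {
                out.add(f[1]);
            }
        }
        return out;
    }

    /** Looks up one attribute of the n-th instance (0-based). */
    public String attributeOf(int instance, String name) {
        return getValues(name).get(instance);
    }

    /** Builds the example from the quoted message: two instances, one without an author. */
    public static PaddedInstanceFields example() {
        PaddedInstanceFields doc = new PaddedInstanceFields();
        doc.add("FILEID", "123");
        doc.add("MIME", "text/html");
        doc.add("CONTENT", "blam blam blam");
        doc.addInstance("File1", "090909", null);         // author unknown, padded with ""
        doc.addInstance("File2", "101010", "Jim Jones");
        return doc;
    }

    public static void main(String[] args) {
        PaddedInstanceFields doc = example();
        System.out.println("instance 0 has no author: " + doc.attributeOf(0, "AUTH").isEmpty());
        System.out.println("instance 1 author: " + doc.attributeOf(1, "AUTH"));
    }
}
```

With the padding in place, the author lookup for the second instance no
longer slides into the first slot, at the cost of one extra (empty) value per
missing attribute.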

Chris

On 8/10/05, Chris D <[EMAIL PROTECTED]> wrote:
> I'm adding files to an index over time, so after some time I'm likely
> to see the same file more than once. I would like to be able to search
> for the information about that particular instance of the file
> (filename, date, etc.). For instance, if I index File1 and then File2
> (which are identical) at different times, I want to be able to search
> for the contents and retrieve all the filenames and the MIME type.
> 
> The first way I did it was to add a separate doc for every instance, as follows:
> 
> DOC   1
> FILEID 123
> MIME   text/html
> CONTENT   blam blam blam etc.
> 
> DOC   2
> FILEID 123
> FILENAME  File1
> DATE   090909
> 
> DOC   3
> FILEID 123
> FILENAME  File2
> DATE   101010
> AUTH   Jim Jones
> 
> The problem with this was that if the user needed all of the Filenames
> that are associated with content:blam I would have to search for
> fileID:123 to retrieve them. This gets slow with several thousand hits
> because I have to do a search for every hit.
> 
> I solved that by using multiple fields of the same name.
> 
> DOC   1
> FILEID 123
> MIME   text/html
> CONTENT   blam blam blam etc.
> FILENAME  File1
> DATE   090909
> FILENAME  File2
> DATE   101010
> AUTH   Jim Jones
> 
> But now I have a problem: I can't retrieve specific information
> about an instance of the file. I tried using getFields(String), but if
> I want the author for instance 2, it should be Jim Jones, yet in the
> index it looks like he's the author for instance 1.
> 
> One solution I see would be to fill all of the fields for each
> instance with empty strings, but that seems like a bit of a hack.
> 
> Another that fell apart fairly quickly was to have a reference table.
> 
> DOCID            1
> FILEID           123abd321
> MIME/TYPE        text/html
> INSTANCE         uri1 collectiondate1
> URI1             http://blam.com/
> COLLECTIONDATE1  12355
> INSTANCE         uri2 collectiondate2 author2
> URI2             http://google.ca/
> COLLECTIONDATE2  12356
> AUTHOR2          Jim Brown
> 
> Now I can't search for URI without having to search for URI1:foo + URI2:foo 
> ...
> 
> How can I make specific attributes of an instance of the file
> searchable without having to do a search for every hit?
> 
> Thanks,
> Chris
>
