Re: Indexing document instances and retrieving instance attributes

Otis Gospodnetic Thu, 11 Aug 2005 11:52:08 -0700

> So I've decided I'm going to simply have empty fields, and that
> brought up several other questions.
> 
> First, is there a limit on the number of fields per document?


I don't think so.

> Secondly why are fields in Document implemented with a Vector instead
> of a HashSet or similar? Wouldn't retrieval be faster without
> iterating through a list?

Field order may be important.  Maybe it could be done with Lists
instead of Vectors, though.

> Lastly how difficult (or possible) is it to do something like extend
> the Document class to have the functionality I want?

I'm not sure, I never had to do it.  I see the class is final, so you
won't be able to extend it.  Maybe the final modifier could be removed,
if you present a good use case.

Otis

> I know I'm likely missing a simple solution but I just can't see it.
> 
> Chris
> 
> On 8/10/05, Chris D <[EMAIL PROTECTED]> wrote:
> > I'm adding files to an index over time, so after some time I'm
> likely
> > to see the same file more than once. I would like to be able to
> search
> > for the information about that particular instance of the file
> > (Filename, date etc) For instance I index File1 and then File2
> (which
> > are identical) at different times I want to be able to search for
> the
> > contents and retrieve all the Filenames and MIME.
> > 
> > The first way I did it was to add a seperate doc for every instance
> as follows
> > 
> > DOC   1
> > FILEID 123
> > MIME   test/html
> > CONTENT   blam blam blam etc.
> > 
> > DOC   2
> > FILEID 123
> > FILENAME  File1
> > DATE   090909
> > 
> > DOC   3
> > FILEID 123
> > FILENAME  File2
> > DATE   101010
> > AUTH   Jim Jones
> > 
> > The problem with this was that if the user needed all of the
> Filenames
> > that are associated with content:blam I would have to search for
> > fileID:123 to retrieve them. This gets slow with several thousand
> hits
> > because I have to do a search for every hit.
> > 
> > I solved that by using multiple fields of the same name.
> > 
> > DOC   1
> > FILEID 123
> > MIME   test/html
> > CONTENT   blam blam blam etc.
> > FILENAME  File1
> > DATE   090909
> > FILENAME  File2
> > DATE   101010
> > AUTH   Jim Jones
> > 
> > But now I have a problem where I can't retrieve specific
> information
> > about an instance of the file. I tried using getFields(String) but
> if
> > I wanted the author for instance 2 I have a problem, it should be
> Jim
> > jones but in the index it looks like he's the auther for instance
> 1.
> > 
> > One solution I see would be to fill all of the fields for each
> > instance with empty strings, but that seems like a bit of a hack.
> > 
> > Another that fell appart fairly quickly was to have a reference
> table.
> > 
> > DOCID                            1
> > FILEID                             123abd321
> > MIME/TYPE                       text/html
> > INSTANCE                        uri1 collectiondate1
> > URI1                                http://blam.com/
> > COLLECTIONDATE1         12355
> > INSTANCE                        uri2 collectiondate2 author2
> > URI2                                 http://google.ca/
> > COLLECTIONDATE2         12356
> > AUTHOR2                         Jim Brown
> > 
> > Now I can't search for URI without having to search for URI1:foo +
> URI2:foo ...
> > 
> > How can I make specific attributes of an instance of the file
> > searchable without having to do a search for every hit?
> > 
> > Thanks,
> > Chris
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Indexing document instances and retrieving instance attributes

Reply via email to