So I've decided I'm going to simply have empty fields, and that brought up several other questions.
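To make the empty-fields idea concrete, here is a rough sketch of what I have in mind. It is plain Java with a small map-backed class standing in for a document that allows repeated field names; PaddedInstanceFields, addInstance, and valueFor are names I made up for illustration, not Lucene's API. It assumes that values for a repeated field come back in the order they were added, so padding every instance with "" keeps position n of each field pointing at instance n.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document with repeated field names.
// This is NOT Lucene's Document class; it only models "values for a
// field name are returned in the order they were added".
class PaddedInstanceFields {
    // Per-instance fields; every instance writes all of them, using "" when unknown.
    static final String[] INSTANCE_FIELDS = {"FILENAME", "DATE", "AUTH"};

    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Appends a value under a field name, keeping insertion order.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Adds one instance of the file, padding missing fields with empty strings
    // so that all per-instance value lists stay the same length.
    void addInstance(Map<String, String> known) {
        for (String name : INSTANCE_FIELDS) {
            add(name, known.getOrDefault(name, ""));
        }
    }

    // Returns the value of a per-instance field for the given 0-based instance.
    String valueFor(String name, int instance) {
        return fields.get(name).get(instance);
    }

    // Builds the example from this thread: one file (FILEID 123) seen twice.
    static PaddedInstanceFields example() {
        PaddedInstanceFields doc = new PaddedInstanceFields();
        doc.add("FILEID", "123");
        doc.add("MIME", "text/html");
        doc.add("CONTENT", "blam blam blam");

        Map<String, String> first = new HashMap<>();
        first.put("FILENAME", "File1");
        first.put("DATE", "090909");
        doc.addInstance(first); // no author known, so AUTH is padded with ""

        Map<String, String> second = new HashMap<>();
        second.put("FILENAME", "File2");
        second.put("DATE", "101010");
        second.put("AUTH", "Jim Jones");
        doc.addInstance(second);
        return doc;
    }

    public static void main(String[] args) {
        PaddedInstanceFields doc = example();
        System.out.println("instance 0 author: [" + doc.valueFor("AUTH", 0) + "]");
        System.out.println("instance 1 author: [" + doc.valueFor("AUTH", 1) + "]");
    }
}
```

With the padding in place, the author for the second instance sits at index 1 of AUTH, and the first instance shows an empty string instead of appearing to belong to Jim Jones.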
First, is there a limit on the number of fields per document? Secondly, why are fields in Document implemented with a Vector instead of a HashSet or similar? Wouldn't retrieval be faster without iterating through a list? Lastly, how difficult (or even possible) is it to extend the Document class to have the functionality I want?

I know I'm likely missing a simple solution, but I just can't see it.

Chris

On 8/10/05, Chris D <[EMAIL PROTECTED]> wrote:
> I'm adding files to an index over time, so after some time I'm likely
> to see the same file more than once. I would like to be able to search
> for the information about that particular instance of the file
> (filename, date, etc.). For instance, if I index File1 and then File2
> (which are identical) at different times, I want to be able to search
> for the contents and retrieve all the filenames and MIME types.
>
> The first way I did it was to add a separate doc for every instance, as follows:
>
> DOC 1
> FILEID 123
> MIME text/html
> CONTENT blam blam blam etc.
>
> DOC 2
> FILEID 123
> FILENAME File1
> DATE 090909
>
> DOC 3
> FILEID 123
> FILENAME File2
> DATE 101010
> AUTH Jim Jones
>
> The problem with this was that if the user needed all of the filenames
> that are associated with content:blam, I would have to search for
> fileID:123 to retrieve them. This gets slow with several thousand hits
> because I have to do a search for every hit.
>
> I solved that by using multiple fields of the same name.
>
> DOC 1
> FILEID 123
> MIME text/html
> CONTENT blam blam blam etc.
> FILENAME File1
> DATE 090909
> FILENAME File2
> DATE 101010
> AUTH Jim Jones
>
> But now I have a problem where I can't retrieve specific information
> about an instance of the file. I tried using getFields(String), but if
> I want the author for instance 2 I have a problem: it should be Jim
> Jones, but in the index it looks like he's the author for instance 1.
>
> One solution I see would be to fill all of the fields for each
> instance with empty strings, but that seems like a bit of a hack.
>
> Another, which fell apart fairly quickly, was to have a reference table.
>
> DOCID 1
> FILEID 123abd321
> MIME/TYPE text/html
> INSTANCE uri1 collectiondate1
> URI1 http://blam.com/
> COLLECTIONDATE1 12355
> INSTANCE uri2 collectiondate2 author2
> URI2 http://google.ca/
> COLLECTIONDATE2 12356
> AUTHOR2 Jim Brown
>
> Now I can't search for URI without having to search for URI1:foo + URI2:foo
> ...
>
> How can I make specific attributes of an instance of the file
> searchable without having to do a search for every hit?
>
> Thanks,
> Chris