Yes, that should work. Stream the file, converting each record to a Lucene Document. All of the fields should probably be indexed only (not stored) for size reasons, and then you could have a single stored but not indexed field that would be the offset into your binary file.
-Yonik On Fri, Jan 30, 2009 at 6:45 AM, Paul Feuer <paul...@gmail.com> wrote: > > The ~25 GB represents about 100 million events an avg of about 250 bytes > each. the indexed and searchable values are normal things: small bits of text > (8-10 bytes usually); longs; ints; etc... > > Also this 25GB is a per-day size, which is why expanding the values in it to > ascii is problematic from a storage perspective. > > I've never used lucene before, but looking at the javadocs, I was hoping that > I'd be able to implement some IndexOutput that would store the relevant > offsets and then be able to search custom Fieldables in my document. (I'm > writing this on the subway right now, so I don't have the docs in front of > me, but I think that's what I was thinking last night) > > ./paul > > Sent from my Verizon Wireless BlackBerry > > -----Original Message----- > From: Michael McCandless <luc...@mikemccandless.com> > > Date: Fri, 30 Jan 2009 05:45:17 > To: <java-user@lucene.apache.org> > Subject: Re: indexing binary files? > > > > You can also create a Lucene field using a Reader, if the String is > really too large to materialize at once. Such fields cannot be stored > though. > > But, if the String really is so large, I would worry about the end > user's experience (normally you want a Document to be a rather bite- > sized piece of content so users browsing through search results won't > see a single monolithic result covering tons and tons of content). > > Mike > > Ganesh wrote: > >> Use your parser to get the string out of the binary file and index >> them using Lucene. >> >> Store the string as it is, if it is small otherwise store the path >> and its offset position. The content could be later retrieved. >> >> Regards >> Ganesh >> >> >> ----- Original Message ----- From: "Paul Feuer" <paul...@gmail.com> >> To: <java-user@lucene.apache.org> >> Sent: Friday, January 30, 2009 10:00 AM >> Subject: Re: indexing binary files? >> >> >>> we have parsers for these files. >>> >>> to index them, do the string representations need to be stored (aside >>> from sitting in the index file)? or can the reader simply provide the >>> string in order to record the location of the record in the binary >>> file? >>> >>> if i need to convert the binary file into text fields, the files will >>> get VERY large. >>> >>> the binary data are well-formed events, so queries would be like >>> "where ACCOUNT = 'Microsoft'" >>> >>> ./paul >>> >>> >>> On Thu, Jan 29, 2009 at 11:00 PM, Anshum <ansh...@gmail.com> wrote: >>>> Hi Paul, >>>> Lucene is a 'text only' saerch lib. i.e. as long as you feed in >>>> anything as >>>> a string, you'd be able to use lucene else I don't think there's a >>>> way. >>>> How do you even intend to search in those binary files? as in... >>>> what would >>>> be the keyword/phrase? asking out of curiosity! >>>> >>>> -- >>>> Anshum Gupta >>>> Naukri Labs! >>>> http://ai-cafe.blogspot.com >>>> >>>> The facts expressed here belong to everybody, the opinions to me. >>>> The >>>> distinction is yours to draw............ >>>> >>>> >>>> On Fri, Jan 30, 2009 at 9:13 AM, Paul Feuer <paul...@gmail.com> >>>> wrote: >>>> >>>>> Hi - >>>>> >>>>> I've looked on the FAQ, the Java Docs, and searched a little in >>>>> google, but haven't been able to figure out if Lucene can index >>>>> binary >>>>> files. >>>>> >>>>> Our binary files can get up into the 20-30 gigabyte range. >>>>> >>>>> If it is possible, anyone have any pointers to what interfaces I >>>>> should >>>>> look at? >>>>> >>>>> Thanks, >>>>> >>>>> ./paul >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> Send instant messages to your online friends http://in.messenger.yahoo.com >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org