Yes, that should work.  Stream the file, converting each record to a
Lucene Document.  All of the fields should probably be indexed only
(not stored) for size reasons, and then you could have a single stored
but not indexed field that would be the offset into your binary file.
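
A rough sketch of that layout, in plain Java rather than Lucene so that it runs without extra jars: the in-memory map below stands in for the indexed-only fields, and the byte offset is the one value kept per record. The record layout (a writeUTF account name, a long, an int), the class name, and the "account:" key prefix are all assumptions made up for illustration; a real parser and a real Lucene index would replace them.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a map of term -> record offsets, not a Lucene index.
class OffsetIndexSketch {

    // Hypothetical record layout: account (writeUTF), timestamp (long), quantity (int).
    static void writeRecord(DataOutputStream out, String account, long ts, int qty)
            throws IOException {
        out.writeUTF(account);
        out.writeLong(ts);
        out.writeInt(qty);
    }

    // Stream the file once, remembering where each record starts.
    // Only the searchable term and the offset are kept, never the record itself.
    static Map<String, List<Long>> buildIndex(Path file) throws IOException {
        Map<String, List<Long>> index = new HashMap<>();
        long size = Files.size(file);
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            while (in.getFilePointer() < size) {
                long offset = in.getFilePointer(); // start of this record
                String account = in.readUTF();
                in.readLong(); // timestamp, skipped in this sketch
                in.readInt();  // quantity, skipped in this sketch
                index.computeIfAbsent("account:" + account, k -> new ArrayList<>())
                     .add(offset);
            }
        }
        return index;
    }

    // Resolve a hit by seeking back into the original binary file.
    static String readRecordAt(Path file, long offset) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(offset);
            return in.readUTF() + "," + in.readLong() + "," + in.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("events", ".bin");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            writeRecord(out, "Microsoft", 1L, 10);
            writeRecord(out, "Acme", 2L, 20);
            writeRecord(out, "Microsoft", 3L, 30);
        }
        // Roughly: where ACCOUNT = 'Microsoft'
        Map<String, List<Long>> index = buildIndex(file);
        for (long offset : index.get("account:Microsoft")) {
            System.out.println(offset + " -> " + readRecordAt(file, offset));
        }
        Files.delete(file);
    }
}
```

A lookup for account "Microsoft" returns only offsets; the event itself is read back from the binary file with a seek, so nothing is expanded to text on disk. RandomAccessFile is unbuffered, so a real pass over ~25 GB would likely want a buffered reader that tracks its own position.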

-Yonik

On Fri, Jan 30, 2009 at 6:45 AM, Paul Feuer <paul...@gmail.com> wrote:
>
> The ~25 GB represents about 100 million events at an average of about 250
> bytes each. The indexed and searchable values are normal things: small bits
> of text (8-10 bytes usually), longs, ints, etc.
>
> Also this 25GB is a per-day size, which is why expanding the values in it to 
> ascii is problematic from a storage perspective.
>
> I've never used Lucene before, but looking at the javadocs, I was hoping that
> I'd be able to implement some IndexOutput that would store the relevant 
> offsets and then be able to search custom Fieldables in my document. (I'm 
> writing this on the subway right now, so I don't have the docs in front of 
> me, but I think that's what I was thinking last night)
>
> ./paul
>
> -----Original Message-----
> From: Michael McCandless <luc...@mikemccandless.com>
>
> Date: Fri, 30 Jan 2009 05:45:17
> To: <java-user@lucene.apache.org>
> Subject: Re: indexing binary files?
>
>
>
> You can also create a Lucene field using a Reader, if the String is
> really too large to materialize at once.  Such fields cannot be stored
> though.
>
> But, if the String really is so large, I would worry about the end
> user's experience (normally you want a Document to be a rather
> bite-sized piece of content so users browsing through search results
> won't see a single monolithic result covering tons and tons of content).
>
> Mike
>
> Ganesh wrote:
>
>> Use your parser to extract the strings from the binary file and index
>> them using Lucene.
>>
>> Store each string as it is if it is small; otherwise store the file path
>> and the record's offset position. The content can then be retrieved later.
>>
>> Regards
>> Ganesh
>>
>>
>> ----- Original Message ----- From: "Paul Feuer" <paul...@gmail.com>
>> To: <java-user@lucene.apache.org>
>> Sent: Friday, January 30, 2009 10:00 AM
>> Subject: Re: indexing binary files?
>>
>>
>>> we have parsers for these files.
>>>
>>> to index them, do the string representations need to be stored (aside
>>> from sitting in the index file)? or can the reader simply provide the
>>> string in order to record the location of the record in the binary
>>> file?
>>>
>>> if i need to convert the binary file into text fields, the files will
>>> get VERY large.
>>>
>>> the binary data are well-formed events, so queries would be like
>>> "where ACCOUNT = 'Microsoft'"
>>>
>>> ./paul
>>>
>>>
>>> On Thu, Jan 29, 2009 at 11:00 PM, Anshum <ansh...@gmail.com> wrote:
>>>> Hi Paul,
>>>> Lucene is a 'text only' search lib, i.e. as long as you feed in anything
>>>> as a string, you'd be able to use Lucene; otherwise I don't think
>>>> there's a way.
>>>> How do you even intend to search in those binary files? As in, what
>>>> would be the keyword/phrase? Asking out of curiosity!
>>>>
>>>> --
>>>> Anshum Gupta
>>>> Naukri Labs!
>>>> http://ai-cafe.blogspot.com
>>>>
>>>> The facts expressed here belong to everybody, the opinions to me.
>>>> The distinction is yours to draw............
>>>>
>>>>
>>>> On Fri, Jan 30, 2009 at 9:13 AM, Paul Feuer <paul...@gmail.com> wrote:
>>>>
>>>>> Hi -
>>>>>
>>>>> I've looked on the FAQ, the Java Docs, and searched a little in
>>>>> google, but haven't been able to figure out if Lucene can index
>>>>> binary files.
>>>>>
>>>>> Our binary files can get up into the 20-30 gigabyte range.
>>>>>
>>>>> If it is possible, anyone have any pointers to what interfaces I
>>>>> should look at?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> ./paul
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
>
>

