RE: example on RegexQuery

2008-10-24 Thread Agrawal, Aashish (IT)
Hi, 
 
I want to use Lucene for a simple search engine with regex support.
I tried using RegexQuery, but it seems I am missing something.
Is there any working example of using RegexQuery?
 
thanks
Aashish Agrawal


NOTICE: If received in error, please destroy and notify sender. Sender does not 
intend to waive confidentiality or privilege. Use of this email is prohibited 
when received in error.


tag search

2008-10-24 Thread Borja Martín

Hi,
I want to index a document that has a field called 'tags' that looks
like this: 'foo, foo bar'
The comma is the separator for each tag, so I have one tag with the value
'foo' and another with 'foo bar'.
What I want is to be able to retrieve the documents with a certain
tag (only one tag per query): if I search for 'foo', this document
should be hit, as it should if I search for 'foo bar', but if I enter 'bar'
as the tag it shouldn't be, although currently it is retrieved too.
I tried to index the field as keyword and as text (I know the latter is
tokenized, so it shouldn't work at all) and tried several queries with no
success. Any tip to achieve what I want? Should I write my own analyzer?
Thanks in advance.

Regards


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Combining keyword queries with database-style queries

2008-10-24 Thread Niels Ott

Erick,

this RangeQuery thing looks promising. It might be a bit hacky but it 
will most probably do the job in the given time and framework.


Thanks a lot,

   Niels

Erick Erickson wrote:

Well, assuming that token_count is an indexed field
in your documents (i.e. not something you're
computing on the fly), just use a RangeQuery for the numeric
part. Actually, you probably want to use
ConstantScoreRangeQuery...

The only thing you have to watch is that Lucene does a
lexical compare, so you have to index your numbers
as comparable strings, probably left-padding to some
fixed width with zeros, see NumberTools.
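For example, here is a minimal plain-Java illustration of why the padding matters (NumberTools itself uses a different, base-36 encoding; the width of 12 used here is arbitrary):

```java
public class PadDemo {
    // Left-pad a non-negative number to a fixed width so that
    // lexicographic (string) order agrees with numeric order.
    public static String pad(long n) {
        return String.format("%012d", n);  // e.g. 250 -> "000000000250"
    }

    public static void main(String[] args) {
        // Unpadded, "1000" sorts before "250" lexically -- the wrong order:
        System.out.println("1000".compareTo("250") < 0);        // true
        // Padded, string order matches numeric order:
        System.out.println(pad(250).compareTo(pad(1000)) < 0);  // true
    }
}
```

With values indexed this way, a (ConstantScore)RangeQuery over the padded strings behaves like a numeric range.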

Best
Erick

On Thu, Oct 23, 2008 at 8:27 AM, Niels Ott <[EMAIL PROTECTED]>wrote:


Hi everybody,

I need to query for documents not only for search terms but also for
numeric values (or other general types). Let me try to explain with a
hypothetical example.

Assuming there is a value for the number of words in each document (or the
number of person names, or whatever), I would want to formulate a query
like "Give me documents containing 'jack johnson' AND with token_count >
250".

I've been working with Lucene before and the keyword part is easy, but
what would be a good solution to query for numbers etc.?

One first idea I had was storing the numbers (which are basically a
HashMap) in the index in some way or the other. But it is
not at all obvious for me how to query them then.

Another thing I could think of would be using a separate database of any
type, but then how to bring those two together in a way that makes sense?

Any pointers to useful resources and any types of hints are welcome! :-)

Best,

 Niels



--
Niels Ott
Computational Linguist (B.A.)
http://www.drni.de/niels/




RE: tag search

2008-10-24 Thread Daan de Wit
Hi Borja,

Try to add multiple untokenized fields named 'tag', each holding one tag.
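To make that concrete, here is a plain-Java sketch of the splitting step; the commented-out line shows where each value would be added as an untokenized Lucene field (in 2.4 that is spelled `Field.Index.NOT_ANALYZED`; older releases call it `UN_TOKENIZED` — field names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class TagSplitter {
    // Split a comma-separated tag string into whole-tag values.
    public static List<String> split(String raw) {
        List<String> tags = new ArrayList<String>();
        for (String part : raw.split(",")) {
            String tag = part.trim();
            if (tag.length() > 0) {
                tags.add(tag);
                // Each value would then be added untokenized, e.g.:
                // doc.add(new Field("tag", tag, Field.Store.YES,
                //                   Field.Index.NOT_ANALYZED));
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> tags = split("foo, foo bar");
        System.out.println(tags);                     // [foo, foo bar]
        System.out.println(tags.contains("foo bar")); // true
        System.out.println(tags.contains("bar"));     // false: not a whole tag
    }
}
```

Because 'foo bar' is stored as a single untokenized value, a query for the exact tag matches it, while 'bar' alone does not.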

Regards,
Daan




Re: tag search

2008-10-24 Thread Borja Martín
I already tried that but with no success. Here is a snippet of what I
tried: http://pastebin.com/m41f6719a

Regards

Daan de Wit wrote:

Hi Borja,

Try to add multiple untokenized fields named 'tag', each holding one tag.

Regards,
Daan

  





Re: Any Spanish analyzer available?

2008-10-24 Thread Grant Ingersoll

The Snowball stuff supports Spanish.

On Oct 23, 2008, at 6:13 PM, Zhang, Lisheng wrote:


Hi,

Is there any Spanish analyzer available for lucene applications?
I did not see any in lucene 2.4.0 contribute folders.

Thanks very much for helps, Lisheng




--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



Re: tag search

2008-10-24 Thread Grant Ingersoll
You either need to write a tokenizer that breaks on comma, or you can
do as Daan suggested.





--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



Re: tag search

2008-10-24 Thread Borja Martín

Sorry,
as I was trying to do this with the PHP implementation and thought it
was a problem with the query syntax, I sent the message to this list
too. But it seems that the PHP version lacks some features.
Sorry for the inconvenience.


Regards.





Re: Combining keyword queries with database-style queries

2008-10-24 Thread Erick Erickson
Hacky is in the eye of the hacker.

It's hard to keep in mind that Lucene is a search engine,
not a database, so whenever I find myself thinking in
database terms, I'm usually making things difficult. It
operates on strings, not the "usual" data types that one
thinks are available in programming languages, DBs, etc...

So I find myself doing things that "aren't natural" ...

Best
Erick



RE: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Sudarsan, Sithu D.
 
Hi Glen, Mike, Grant & Mark

Thank you for the quick responses.

1. Yes, I'm looking now at ThreadPoolExecutor. Looking for a sample code
to improve the multi-threaded code.

2. We'll try using as many IndexWriters as the number of cores, first
(which is 2 CPUs x 4 cores = 8).

3. Yes, PDFBox exceptions have been independently checked. We have a
prototype module to check PDF files that contain errors. Generally they
are few, less than 1% of the total number of files. The PDFs have all
been OCRed. Also, any file that throws exceptions is quarantined in
a separate folder for further analysis, to have a look at the document
itself.

4. We've tried using a larger JVM heap by defining -Xms1800m and
-Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m seem
stable. That is strange, as we have 32 GB of RAM and 34 GB of swap space.
Typically no other application is running. However, the CentOS version
is 32 bit. The Ungava project seems to be using 64 bit.

5. The -QUIT signal on Linux does produce a stack trace, but after a few
threads it hangs. Don't know why. Need to look at that.

Meantime, we're seriously looking for ThreadPoolExecutor sample source
code. It looks like we need to use unbounded queues.

Really appreciate your inputs and will keep you posted on what we get.

Now working on the code for ThreadPoolExecutor.
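A minimal ThreadPoolExecutor sketch along those lines — a fixed pool sized to the core count feeding off an unbounded LinkedBlockingQueue; the task body is only a placeholder for per-file PDF parsing and indexing. One caveat: with an unbounded queue nothing throttles the producer, so if pending document objects are large, a bounded queue plus ThreadPoolExecutor.CallerRunsPolicy may be safer.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class IndexingPool {
    // Runs n placeholder indexing jobs on a fixed-size pool fed by an
    // unbounded queue; returns how many completed.
    public static int runJobs(int n) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        // Unbounded queue: execute() never blocks or rejects; all
        // pending files simply wait their turn.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());

        final AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < n; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    // Placeholder: parse one PDF and hand it to the writer.
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown();                       // no new tasks; drain the queue
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runJobs(100)); // 100
    }
}
```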

Thanks and regards,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

[EMAIL PROTECTED]
[EMAIL PROTECTED]

-Original Message-
From: Michael McCandless [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 23, 2008 5:01 PM
To: java-user@lucene.apache.org; Glen Newton
Subject: Re: Multi -threaded indexing of large number of PDF documents


Glen Newton wrote:

> 2008/10/23 Michael McCandless <[EMAIL PROTECTED]>:
>>
>> Mark Miller wrote:
>>
>>> Glen Newton wrote:

 2008/10/23 Mark Miller <[EMAIL PROTECTED]>:

> It sounds like you might have some thread synchronization issues  
> outside
> of
> Lucene. To simplify things a bit, you might try just using one
> IndexWriter.
> If I remember right, the IndexWriter is now pretty efficient,  
> and there
> isn't much need to index to smaller indexes and then merge.  
> There is a
> lot
> of juggling to get wrong with that approach.
>

 While I agree it is easier to have a single IndexWriter, if you  
 have
 multiple cores you will get significant speed-ups with multiple
 IndexWriters, even with the impact of merging at the end.
 #IndexWriters = # physical cores is a reasonable rule of thumb.

 General speed-up estimate: # cores * 0.6 - 0.8  over single  
 IndexWriter
 YMMV

 When I get around to it, I'll re-run my tests varying the # of
 IndexWriters & post.

 -Glen

>>> Hey Mr McCandless, whats up with that? Can IndexWriter be made to  
>>> be as
>>> efficient as using Multiple Writers? Where do you suppose the hold  
>>> up is?
>>> Number of threads doing merges? Sync contention? I hate the idea  
>>> of multiple
>>> IndexWriter/Readers being more efficient than a single instance.  
>>> In an ideal
>>> Lucene world, a single instance would hide the complexity and use  
>>> the number
>>> of threads needed to match multiple instance performance.
>>
>> Honestly this surprises me: I would expect a single IndexWriter with
>> multiple threads to be as fast (or faster, considering the extra  
>> merge time
>> at the end) than multiple IndexWriters.
>>
>> IndexWriter's concurrency has improved alot lately, with
>> ConcurrentMergeScheduler.  The only serious operation that is not  
>> concurrent
>> is flushing the RAM buffer as a new segment; but in a well tuned  
>> indexing
>> process (large RAM buffer) the time spent there should be quite  
>> small,
>> especially with a fast IO system.
>>
>> Actually, addIndexes is also not concurrent in that if multiple  
>> threads call
>> it, only one can run at once.  But normally you would call it with  
>> all the
>> indices you want to add, and then the merging is concurrent.
>>
>> Glen, in your single IndexWriter test, is it possible there was  
>> accidental
>> thread contention during document preparation or analysis?
>
> I don't think there is. I've been refining this for quite a while, and
> have done a lot of analysis and hand-checking of the threading stuff.

OK.

For your multiple-index-writer test, how much time is spent building  
the N indices vs merging them in the end?

> I do use multiple threads for document creation: this is where much of
> the speed-up happens (at least in my case where I have a large indexed
> field for the full-text of an article: the parsing becomes a
> significant part of the process).

So in the single-index-writer vs multiple-index-writer tests, this  
part (64 threads that construct document objects) is unchanged, right?

How do you rate limit the 64 threads?  (Ie, slow them down when they  
get too far ahead of i

RE: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Toke Eskildsen
On Fri, 2008-10-24 at 16:01 +0200, Sudarsan, Sithu D. wrote:
> 4. We've tried using larger JVM space by defining -Xms1800m and
> -Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m seems
> stable. That is strange as we have 32 GB of RAM and 34GB swap space.
> Typically no other application is running. However, the CentOS version
> is 32 bit. The Ungava project seems to be using 64 bit.

The <2GB limit for Java is a known problem under Windows. I don't know
about CentOS, but from your description it seems that the problem exists
on that platform too. Anyway, you'll never get above 4GB for Java when
you're running 32bit. Might I ask why you're not using 64bit for a 32GB
machine?





RE: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Sudarsan, Sithu D.

 There have been some earlier messages about the memory consumption of
Lucene Documents under 64 bit (double that of 32 bit). We expect
the index to grow very large, and we may end up maintaining more than
one index with different analyzers for the same data set. Hence we are
concerned about the index size as well. If there are ways to overcome
it, we're game for the 64 bit version as well :-)

Any ideas?


Thanks and regards,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

[EMAIL PROTECTED]
[EMAIL PROTECTED]




performance boost through multithreaded query processing?

2008-10-24 Thread pfaun
Hello,

Currently we are facing the problem that some searches, especially fuzzy 
(term~0.6) and wildcard searches (*term*), need some time depending on the 
field/search-word combination (the more terms there are, the more processing 
has to be done).
We improved the performance by caching the bitsets of the individual fuzzy 
and wildcard queries.

In our logs we can see that combined queries within a BooleanQuery are 
processed sequentially. So our question is: does it make sense to 
parallelize the processing of the clauses within a BooleanQuery (with a 
restriction on the number of clauses processed in parallel)? With the caches 
in mind it might be faster, and the system is running on a multicore machine. 
Has anyone experience with parallelizing single query processing within a 
BooleanQuery? Could there be drawbacks to combining the results of the 
boolean clauses? In the end there should only be the bitsets connected to 
the terms, shouldn't there?
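For what it's worth, the idea can be sketched in plain Java with java.util.BitSet and an ExecutorService — each clause's bitset is produced on its own thread and the results are intersected for the AND case. The per-clause work is faked here with precomputed bitsets; in a real system each Callable would evaluate one fuzzy/wildcard sub-query or fetch its cached bitset:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBoolean {
    // AND-combine per-clause bitsets, evaluating each clause on its own thread.
    public static BitSet and(ExecutorService pool, List<BitSet> clauses)
            throws Exception {
        List<Future<BitSet>> futures = new ArrayList<Future<BitSet>>();
        for (final BitSet clause : clauses) {
            futures.add(pool.submit(new Callable<BitSet>() {
                public BitSet call() {
                    // Stand-in for evaluating one fuzzy/wildcard sub-query
                    // (or fetching its cached bitset).
                    return (BitSet) clause.clone();
                }
            }));
        }
        BitSet result = futures.get(0).get();     // wait for the first clause
        for (int i = 1; i < futures.size(); i++) {
            result.and(futures.get(i).get());     // intersect the rest
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        BitSet a = new BitSet(); a.set(1); a.set(2); a.set(5);
        BitSet b = new BitSet(); b.set(2); b.set(5); b.set(9);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println(and(pool, Arrays.asList(a, b))); // {2, 5}
        pool.shutdown();
    }
}
```

Whether this pays off depends on each clause being expensive enough to amortize the thread hand-off; the final intersection itself is cheap and stays sequential.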

Thanks in advance

stephan



  

WG: performance boost through multithreaded query processing?

2008-10-24 Thread pfaun
Hello,

Currently we are facing the problem that some searches, especially fuzzy 
(term~0.6) and wildcard searches (*term*), need some time depending on the 
field/search-word combination (the more terms there are, the more processing 
has to be done).
We improved the performance by caching the bitsets of the individual fuzzy 
and wildcard queries.

In our logs we can see that combined queries within a BooleanQuery are 
processed sequentially. So our question is: does it make sense to 
parallelize the processing of the clauses within a BooleanQuery (with a 
restriction on the number of clauses processed in parallel)? With the caches 
in mind it might be faster, and the system is running on a multicore machine. 
Has anyone experience with parallelizing single query processing within a 
BooleanQuery? Could there be drawbacks to combining the results of the 
boolean clauses (e.g. some IO)? In the end there should only be the bitsets 
connected to the terms, and these should be in memory already, shouldn't they?

Thanks in advance

stephan



  

Combining keyword queries with database-style queries

2008-10-24 Thread Niels Ott

Hi everybody,

I need to query for documents not only for search terms but also for 
numeric values (or other general types). Let me try to explain with a 
hypothetical example.


Assuming there is a value for the number of words in each document (or the 
number of person names, or whatever), I would want to formulate a query 
like "Give me documents containing 'jack johnson' AND with token_count > 
250".


I've been working with Lucene before and the keyword part is easy, but 
what would be a good solution to query for numbers etc.?


One first idea I had was storing the numbers (which are basically a 
HashMap) in the index in some way or another. But it is 
not at all obvious to me how to query them then.


Another thing I could think of would be using a separate database of any 
type, but then how to bring those two together in a way that makes sense?


Any pointers to useful resources and any types of hints are welcome! :-)

Best,

  Niels


--
Niels Ott
Computational Linguist (B.A.)
http://www.drni.de/niels/




RE: example on RegexQuery

2008-10-24 Thread Steven A Rowe
Hi Aashish,

On 10/24/2008 at 3:35 AM, Agrawal, Aashish (IT) wrote:
> I want to use lucene for a simple search engine  with regex support .
> I tried using RegexQuery.. but seems I am missing something.
> Is there any working exmaple on using RegexQuery ??

How about TestRegexQuery?:



Steve
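For intuition, RegexQuery (in Lucene's contrib) matches its pattern against the terms indexed in a field, keeping those that match in full — conceptually like this plain-Java sketch (the term list is made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RegexOverTerms {
    // Keep every term the regex matches in full, as a term enumeration would.
    public static List<String> matching(List<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<String>();
        for (String term : terms) {
            if (p.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apache", "lucene", "lucky", "search");
        System.out.println(matching(terms, "luc.*")); // [lucene, lucky]
    }
}
```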




Multiple values in field

2008-10-24 Thread agatone

Hello,

I know I can store multiple values under the same field and can later retrieve
all those values. But the problem I have is a bit structure-related: when
I'm reading a field that usually has more than one value, it can happen
that it holds only one value, and then I cannot tell whether that field is
meant to have multiple values.

Is there a way (at indexing time, when creating fields) to mark that a certain
field is meant for multiple values, so that later, when a search hits a
document, I can tell from each field in it how to represent the hit?

Thank you.

-- 
View this message in context: 
http://www.nabble.com/Multiple-values-in-field-tp20152411p20152411.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Combining keyword queries with database-style queries

2008-10-24 Thread Erick Erickson
Is this an inadvertent re-post or is there still something you're wondering
about?

Erick



RE: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Toke Eskildsen
Sudarsan, Sithu D. [EMAIL PROTECTED] wrote:
> There have been some earlier messages, where memory consumption issue
> for Lucene Documents due to 64 bit (double that of 32 bit).

All pointers are doubled, yes. While not a doubling in total RAM consumption,
it does give a substantial overhead.

> We expect the index to grow very large, and we may end up maintaining
> more than one with different analyzers for the same data set. Hence we are
> concerned about the index size as well. If there are ways to overcome
> it, we're game for 64 bit version as well :-)

Fair enough.  We've chosen to use 64bit on our 16 and 32GB machines
and have never looked back, but our initial requirements called for ~7GB
for each JVM, so we didn't have a choice at the time.

> Any ideas,

Solaris should be capable of giving you ~3.5GB for JVMs with 32bit.




Re: Multiple values in field

2008-10-24 Thread Erick Erickson
I *think* what you're looking for is Document.getFields(String field),
which returns a list corresponding to every Document.add() you did
originally.

Alternatively, you could always index a companion field that had the
count of times you called Document.add() on a particular field.
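A tiny plain-Java model of that companion-field idea, using hypothetical names ('tag', 'tag.count'); in Lucene the count would simply be one more stored field added to the Document before it is indexed:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiValueDoc {
    private final Map<String, List<String>> fields =
            new HashMap<String, List<String>>();

    // Mirrors Document.add(): repeated adds on one name accumulate values.
    public void add(String name, String value) {
        List<String> values = fields.get(name);
        if (values == null) {
            values = new ArrayList<String>();
            fields.put(name, values);
        }
        values.add(value);
    }

    // Called once before "indexing": records a companion count field so a
    // reader can tell a multi-valued field holding one value from a
    // genuinely single-valued field.
    public void sealWithCounts(String... multiValuedFields) {
        for (String name : multiValuedFields) {
            List<String> values = fields.get(name);
            int count = (values == null) ? 0 : values.size();
            add(name + ".count", Integer.toString(count));
        }
    }

    public String get(String name) {
        List<String> values = fields.get(name);
        return (values == null || values.isEmpty()) ? null : values.get(0);
    }

    public static void main(String[] args) {
        MultiValueDoc doc = new MultiValueDoc();
        doc.add("tag", "foo");
        doc.add("tag", "foo bar");
        doc.sealWithCounts("tag");
        System.out.println(doc.get("tag.count")); // 2
    }
}
```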

Best
Erick



Re: Multi -threaded indexing of large number of PDF documents

2008-10-24 Thread Michael McCandless


Sudarsan, Sithu D. wrote:



Hi Glen, Mike, Grant & Mark

Thank you for the quick responses.

1. Yes, I'm looking now at ThreadPoolExecutor. Looking for a sample code
to improve the multi-threaded code.

2. We'll try using as many Indexwriters as the number of cores, first
(which is 2cpu x 4 core = 8).


You could also try multiple threads against a single IndexWriter.   
It's simpler, and you don't have to merge indices in the end.  It'd be  
great if you could post back on net throughput because I'd really like  
to understand if there is some sort of thread issue sharing a single  
IndexWriter.



3. Yes, PDFBox exceptions have been independently checked. We've a
prototype module to check PDF files that contain errors. Generally they
are few, less than 1% of the total number of files. The PDFs all have
been OCRed. Also, if any throws exceptions then they are quarantined in
a separate folder for further analysis to have a look at the document
itself.

4. We've tried using larger JVM space by defining -Xms1800m and
-Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m seems
stable. That is strange as we have 32 GB of RAM and 34GB swap space.
Typically no other application is running. However, the CentOS version
is 32 bit. The Ungava project seems to be using 64 bit.

5. -QUIT option for Linux does throw stack trace, but after few threads
it hangs. Don't know why. Need to look at that.


Can you post the stack traces that you did see?  (Do you think those  
threads are hung?)


Mike




Re: Any Spanish analyzer available?

2008-10-24 Thread Marcelo Ochoa
Zhang:
  I have done a simple SpanishAnalyzer for the Lucene Domain Index test
suites, which index Spanish Wikipedia dumps.
  This simple analyzer has a list of stop words and is faster than the
SnowballAnalyzer, which also performs stemming.
  You can get the code using CVS from the SourceForge.net servers or
simply cut and paste this code:
http://dbprism.cvs.sourceforge.net/viewvc/dbprism/ojvm/src/test/org/apache/lucene/analysis/SpanishAnalyzer.java?revision=1.2&view=markup
  There is also a WikipediaAnalyzer which uses the previous analyzer:
http://dbprism.cvs.sourceforge.net/viewvc/dbprism/ojvm/src/test/org/apache/lucene/analysis/SpanishWikipediaAnalyzer.java?revision=1.3&view=markup
  Best regards, Marcelo.




-- 
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
__
Want to integrate Lucene and Oracle?
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
Is Oracle 11g REST ready?
http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html




Re: Multiple values in field

2008-10-24 Thread agatone

That sounds like abuse of Document.add()  :)
OK, so I'd add one extra "empty" value up front for every field I wish to mark
as multi-valued.
Well, if that ain't so wrong, I'll use that :)

Ty





-- 
View this message in context: 
http://www.nabble.com/Multiple-values-in-field-tp20152411p20156607.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Lucene Input/Output error

2008-10-24 Thread JulieSoko

Hello All,
  First of all, I'm new to Lucene, and have written code using it to search
over 1 to many indexes, using a user-defined query.
 I don't have any code on this system so have to type everything in here...
I have the following design but am getting
an input/output error exception, part of which I have typed in below.
My question is this: do I have a glaring flaw
in this design? I am reusing the IndexSearchers/IndexReaders and not
closing them. The input/output error arises when 2 or
more searches occur at the same time over some of the same indexes. Can you
give me some direction on where I should look
for the solution to the exception?

Here is an explanation of my data:
Up to 60 different indexes used at a time
 1 directory per 1 day of data
Millions of documents per day
Data is received and indexes merged on a continual basis - a whole
separate process

Index contains:
value: content
eventType: type of data
eventTime: time data collected
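One thing worth double-checking in a design like this: RangeFilter on a text field compares terms lexicographically, so if eventTime holds numeric timestamps they need to be indexed zero-padded to a fixed width, or range boundaries will misbehave. A quick standalone illustration (pad is a hypothetical helper, not a Lucene API):

```java
public class TimePadDemo {
    // RangeFilter compares indexed terms as strings, so numeric times must
    // be stored fixed-width for lexicographic order to match numeric order.
    public static String pad(long millis) {
        return String.format("%019d", millis); // 19 digits covers Long.MAX_VALUE
    }

    public static void main(String[] args) {
        // Unpadded, "999" sorts AFTER "1500" lexically; padded, the order
        // matches the numeric order, which is what a range filter needs.
        System.out.println("999".compareTo("1500") > 0);           // true
        System.out.println(pad(999L).compareTo(pad(1500L)) < 0);   // true
    }
}
```

The same padded form must then be used both at index time and for the startTime/endTime bounds passed to the filter.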

1 to many users can create individual queries containing 1 or more of
the fields and values
searching over 1 to many indexes 

Design:
 Utilize the IndexAccessor classes to cache
IndexSearchers/IndexReaders, i.e. one is made per index and never
closed.
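For what it's worth, one common way to realize that caching step is a map keyed by index path with open-once semantics. A minimal thread-safe sketch (SearcherCache and Factory are made-up names here, and the type parameter stands in for IndexSearcher so this compiles without Lucene):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SearcherCache<S> {
    // Callback that opens a searcher for a given index directory.
    public interface Factory<S> { S open(String indexPath); }

    private final ConcurrentMap<String, S> cache = new ConcurrentHashMap<String, S>();
    private final Factory<S> factory;

    public SearcherCache(Factory<S> factory) { this.factory = factory; }

    // Returns the one shared searcher for an index, opening it on first use.
    public S get(String indexPath) {
        S s = cache.get(indexPath);
        if (s == null) {
            S fresh = factory.open(indexPath);
            S prior = cache.putIfAbsent(indexPath, fresh);
            s = (prior != null) ? prior : fresh; // another thread won the race
        }
        return s;
    }

    public static void main(String[] args) {
        SearcherCache<Object> c = new SearcherCache<Object>(new Factory<Object>() {
            public Object open(String indexPath) { return new Object(); }
        });
        System.out.println(c.get("/idx/day1") == c.get("/idx/day1")); // same instance
    }
}
```

The putIfAbsent call is what keeps two concurrent requests for the same index from each opening (and leaking) their own reader.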
   

 Use a ParallelMultiSearcher  - create one per request using 1 to
many of the indexes

 try {
     QueryParser parser = new QueryParser("value", new StandardAnalyzer());
     parser.setDefaultOperator(QueryParser.AND_OPERATOR);
     Query query = parser.parse(queryString);

     TopDocCollector col = new TopDocCollector(MAX_NUMBER_HITS);
     multiSearcher.search(query,
         new RangeFilter("eventTime", startTime, endTime, true, true), col);
     int numHits = col.getTotalHits();
     TopDocs docs = col.topDocs();

     if (numHits > 0) {
         for (int i = 0; i < numHits && i < MAX_NUMBER_HITS; i++) {
             Document doc = multiSearcher.doc(docs.scoreDocs[i].doc);
             // ... use doc ...
         }
     }
 } catch (Exception e) {
     e.printStackTrace();
 } finally {
     // IndexSearchers are not closed since they are shared by many users
 }


 
When the second user accesses directories used by the first query then I get
the following error:

java.io.IOException: Input/output error
 at java.io.RandomAccessFile.readBytes(Native Method)
 at java.io.RandomAccessFile.read(RandomAccessFile.java:315)
 at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:550)
 at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:131)
 at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:240)
 at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
 at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:152)
 at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
 at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:123)
 at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:154)
 at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
 at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
 at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:678)
 at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:87)
 at org.apache.lucene.search.Searcher.docFreqs(Searcher.java:118)
 at org.apache.lucene.search.MultiSearcher.createWeight(MultiSearcher.java:311)
 at org.apache.lucene.search.Searcher.search(Searcher.java:178)

Thanks!

-- 
View this message in context: 
http://www.nabble.com/Lucene-Input-Output-error-tp20156805p20156805.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Multiple values in field

2008-10-24 Thread Erick Erickson
No, no, no...

Say you have the following
Document doc = new Document();
doc.add(new Field("field1", "stuff", blah, blah));
doc.add(new Field("field1", "more stuff", blah, blah));
doc.add(new Field("field1", "stuff and nonsense", blah, blah));
indexWriter.addDocument(doc);




Now, in your search code that document comes up as a hit and you have
Field[] adds = doc.getFields("field1");


adds.length should == 3

whenever adds.length > 1, you know it has multiple entries

I wasn't suggesting that you ever add empty fields, and I don't think an
empty
add would even compile.

Best
[EMAIL PROTECTED]


On Fri, Oct 24, 2008 at 3:38 PM, agatone <[EMAIL PROTECTED]> wrote:

>
> That sounds like abuse of Document.add()  :)
> Ok, so adding first one extra "empty" value for every field i wish to mark
> as multi.
> Well if that ain't so wrong, I'll use that :)
>
> Ty
>
>
>
>
> Erick Erickson wrote:
> >
> > I *think* what you're looking for is Document.getFields(String field),
> > which returns a list corresponding to every Document.add() you did
> > originally.
> >
> > Alternatively, you could always index a companion field that had the
> > count of times you called Document.add() on a particular field.
> >
> > Best
> > Erick
> >
> > On Fri, Oct 24, 2008 at 11:36 AM, agatone <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> Hello,
> >>
> >> I know I can store multiple values under same field and I can later
> >> retrieve
> >> all those values. But the problem I have is a bit structure related.
> When
> >> I'm reading those fields (that usually have more than one value) it
> >> happens
> >> that it has only one value and I cannot know if that field is meant to
> >> have
> >> multiple values.
> >>
> >> Is there a way (at indexing (creating fields)) to set that certain field
> >> is
> >> meant for multiple values, so that later when I'm searching and I get
> >> document/s hit, I can get from each field in it how to represent the
> hit.
> >>
> >> Thank you.
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Multiple-values-in-field-tp20152411p20152411.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Multiple-values-in-field-tp20152411p20156607.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Possible payload bug in lucene

2008-10-24 Thread Fatih Emekci
Hi all,
I am getting the exception below when I try to read the payload data:
   [java] java.lang.NullPointerException
 [java] at
org.apache.lucene.index.MultiSegmentReader$MultiTermPositions.nextPosition(MultiSegmentReader.java:631)

However, if I optimize the index before reading the payload, it works fine.
This seems like a bug. I am pasting the code below; please let me know if I
am doing something wrong.

thanks
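As a side note, the intToByteArray/byteArrayToInt helpers in the code below do round-trip correctly, including for negative values, so the payload encoding itself is not the issue here. A standalone check (the class name is just for the demo):

```java
public class IntBytesDemo {
    // Big-endian int -> 4 bytes, mirroring the helper in the posted code.
    public static byte[] intToByteArray(int value) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            int offset = (b.length - 1 - i) * 8;
            b[i] = (byte) ((value >>> offset) & 0xFF);
        }
        return b;
    }

    // 4 big-endian bytes -> int; the & 0xFF masks undo Java's sign extension.
    public static int byteArrayToInt(byte[] b) {
        return (b[0] << 24)
            + ((b[1] & 0xFF) << 16)
            + ((b[2] & 0xFF) << 8)
            + (b[3] & 0xFF);
    }

    public static void main(String[] args) {
        int[] vals = { 0, 1, -1, 42, Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int v : vals) {
            if (byteArrayToInt(intToByteArray(v)) != v)
                throw new AssertionError("round-trip failed for " + v);
        }
        System.out.println("round-trip ok");
    }
}
```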


import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;
import java.io.IOException;

public class Experiment
{

  /**
   * TokenStream for the anet ids' payload
   */
  private static class IDArrayPayloadStream extends TokenStream
  {
private final Token _token;
private boolean _returnToken = false;

public IDArrayPayloadStream(Term term)
{
  _token = new Token(term.text(), 0, 0);
}


/**
 * Output the ids into the payload
 * @param ids the list of ids.
 */
    void setIDs(List<Integer> ids)
    {
      byte[] buffer = new byte[ids.size() * 4];
      for (int i = 0; i < ids.size(); i++)
      {
        // copy each int's 4 bytes into its own slot in the buffer
        System.arraycopy(intToByteArray(ids.get(i)), 0, buffer, i * 4, 4);
      }
      _token.setPayload(new Payload(buffer));
      _returnToken = true;
    }


/**
 * Return the single token created.
 * @return token if it's already been set, null otherwise.
 * @throws IOException
 */
public Token next() throws IOException
{
  if (_returnToken)
  {
_returnToken = false;
return _token;
  }
  else
  {
return null;
  }
}
  }

  /**
   * Helper method to add payload (list of integers) to the term in the
document.
   * @param document given document
   * @param term term to use to add payload
   * @param data payload content
   */
  private static void addPayload(Document document, Term term,
                                 List<Integer> data)
  {
if (data.size() > 0)
{
  // add a payload for the anet ids
  IDArrayPayloadStream aps = new IDArrayPayloadStream(term);
  aps.setIDs(data);
  Field f = document.getField(term.field());
  if (f == null)
  {
f = new Field(term.field(), aps);
document.add(f);
  }
  else
  {
f.setValue(aps);
  }
}
  }

  public static byte[] intToByteArray(int value) {
byte[] b = new byte[4];
for (int i = 0; i < 4; i++) {
  int offset = (b.length - 1 - i) * 8;
  b[i] = (byte) ((value >>> offset) & 0xFF);
}
return b;
  }

  public static final int byteArrayToInt(byte [] b) {
return (b[0] << 24)
+ ((b[1] & 0xFF) << 16)
+ ((b[2] & 0xFF) << 8)
+ (b[3] & 0xFF);
  }
  /**
   * @param args
   */
  public static void main(String[] args) throws Exception
  {
int TOTDOC = 10;
IndexWriter writer = new IndexWriter("/Users/femekci/lucene/deneme3",
new WhitespaceAnalyzer(),
 true);
try {
  int ll =0;
  while( ll < TOTDOC){
try {
  String state   =  (ll*ll%10)+"123";
  String email = ll*2 +"s";

  String last_tran = ll*3+"67";
  String memid = ll +"0";
  String fname = ll + " " + ll*101;
          String lname = ll + " " + ll*1001;

  Document doc = new Document();
  doc.add(new Field("state", state, Field.Store.NO,
Field.Index.TOKENIZED));

  doc.add(new Field("email", email, Field.Store.NO,
Field.Index.TOKENIZED));
  doc.add(new Field("last_tran", last_tran, Field.Store.NO,
Field.Index.TOKENIZED));
  doc.add(new Field("memid", memid.trim(), Field.Store.NO,
Field.Index.TOKENIZED));
  doc.add(new Field("fname", fname.trim(), Field.Store.NO,
Field.Index.TOKENIZED));
  doc.add(new Field("lname", lname.trim(), Field.Store.NO,
Field.Index.TOKENIZED));


  Term t = new Term("dids", "did");
          List<Integer> l = new ArrayList<Integer>();
  l.add(new Integer(ll));
  addPayload(doc, t, l);
  System.out.println(memid);
  doc.add(new Field("row", email + " " + fname + " " + lname +" "
+memid, Field.Store.NO,   Field.Index.TOKENIZED));
  writer.addDocument(doc);
  ll++;
} catch(Exception e) {System.out.println("except" + e.toString());

}
  }
  //writer.optimize(); /***Removing the comment and
optimizing fixes the problem */
  writer.close();
} catch (Exception e)
{
  //writer.optimize();
  writer.close();
  System.out.println("exception in indexing "