Re: Using Hits as document space for new search

2008-09-23 Thread nukie

Thanks a lot. I've since found out about Solr, column classification and other
interesting things. ;)

Best

Stas


hossman wrote:
> 
> 
> : For example, in my case it's a car search form.
> : First I tell the system that I want to search for BMW. The system returns
> : a set of results.
> : While the results are being viewed, the system shows additional criteria
> : for making the search more exact, along with the size of the result set
> : after adding each criterion (this count is smaller than the current
> : result set size, because the new result is just a subset of the current
> : result list).
> 
> This is generally known as "faceted searching". If you search the list
> archives for "facet" or, in some cases, "category counts", you'll find
> numerous discussions on how to tackle problems like this.
>
> In general: you don't want to try this using something like the Hits
> class. Its internal behavior is very inefficient for something like this.
> Building Filters (and caching them) tends to be the way to go (and you
> can always build a Filter out of a query).
> 
> 
> 
> -Hoss
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Using-Hits-as-document-space-for-new-search-tp19511672p19624222.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
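
The Filter-based approach Hoss describes can be sketched in plain Java, with
no Lucene dependency. The sketch assumes that the set of matching document
IDs for each refinement has already been computed and cached as a bit set,
and that the current result set is available in the same form. The class
name, method names and the example refinements ("color:black",
"fuel:diesel") are hypothetical and only serve the illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Counts how many documents of the current result set fall under each refinement. */
class FacetCounts {
    // One cached bit set per refinement; bit i is set when document i matches it.
    private final Map<String, BitSet> cachedFilters = new LinkedHashMap<>();

    /** Caches the documents matching one refinement (hypothetical helper). */
    void cacheFilter(String label, int... matchingDocIds) {
        BitSet bits = new BitSet();
        for (int id : matchingDocIds) {
            bits.set(id);
        }
        cachedFilters.put(label, bits);
    }

    /** Returns, for each cached filter, the size of (current results AND filter). */
    Map<String, Integer> countWithin(BitSet currentResults) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : cachedFilters.entrySet()) {
            BitSet narrowed = (BitSet) currentResults.clone(); // keep the cached bits untouched
            narrowed.and(entry.getValue());
            counts.put(entry.getKey(), narrowed.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        FacetCounts facets = new FacetCounts();
        facets.cacheFilter("color:black", 0, 2, 5);
        facets.cacheFilter("fuel:diesel", 1, 7);

        BitSet bmwResults = new BitSet(); // documents matching the first query, "BMW"
        bmwResults.set(0);
        bmwResults.set(1);
        bmwResults.set(2);

        System.out.println(facets.countWithin(bmwResults));
    }
}
```

Because each count is an intersection with the current result set, it can
never exceed the size of that set, which matches the behaviour described in
the original question.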





Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Anshul jain
Here is what I'm trying to do:

Say I have a Lucene document:
name: abc ^10
organization: xyz ^3

^10 and ^3 are boosts in the document.

Now if I query name: abc ^5 AND organization: xyz, this will work.

But if I query (default_field): abc^5 AND xyz, this won't work.

What I want is for a piece of text to be associated with more than one
field, i.e.

(field1,field2,field3):value
name,(default_field),title: abc^10
organization,(default_field),institute: xyz^3

then both of my queries will work.

Is it possible to do this in Lucene without changing the source?
If not, can anyone please explain the indexing and searching
mechanism in Lucene, so that I can start working on it?

The solution suggested on java-user won't work for me, as I do not
want to add all the contents of the document to a single field and
then search that field: this would increase the index size, and
I have to index more than 10 million documents. Also,
MultiFieldQueryParser will make query execution inefficient, as
there will be thousands of fields.

If I store just a single field such as (default_field): "name abc
organization xyz", then some other documents that are not relevant
might get selected. I also want to boost individual
fields in a document.

Anshul




Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Grant Ingersoll
So, the piece I'm missing is how you know which field goes with which
terms.  In other words, how do you know xyz goes against organization
and abc against name?  Your wording implies that you don't know this
beforehand, yet you are somehow suggesting that Lucene should be able
to do it.  Correct me if I'm wrong.


-Grant





--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: Query attached words

2008-09-23 Thread Erick Erickson
Yes, you can query *method. But you have to turn on leading wildcards
(I don't have the exact setting at my fingertips, but I know it's been
an option for some time now).

But your solution doesn't scale well. If you had
a.b.c.d.e.f.g.h you'd have to store many combinations in order
to do what you want, quickly becoming really, really ugly.

But you could store the tokens
a
.
b
.
c
.
d
.
e
.
f
.
g
.
h
by using the appropriate analyzer (or perhaps rolling
your own). Then you could use either PhraseQuery or
SpanQuery instances to do what you want.

Best
Erick
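
The tokenization idea above can be sketched in plain Java. This is not a
Lucene analyzer, only an illustration of the principle: the hypothetical
tokenize method emits words and single punctuation characters as separate
tokens, and the hypothetical containsPhrase method stands in for a phrase
query by checking that the query tokens occur at consecutive positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DottedTokens {
    /** Splits text into words and single punctuation characters, in order. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '_') {
                word.append(c);
            } else {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c)); // keep "." and friends as their own token
                }
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    /** True when the phrase tokens occur at consecutive positions in the document. */
    static boolean containsPhrase(List<String> docTokens, List<String> phrase) {
        return Collections.indexOfSubList(docTokens, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("object.method();");
        System.out.println(doc);
        System.out.println(containsPhrase(doc, tokenize("object")));
        System.out.println(containsPhrase(doc, tokenize("method")));
        System.out.println(containsPhrase(doc, tokenize("object.method")));
    }
}
```

With this scheme only one token stream is stored per document, however many
dotted segments the text has, and all three queries from the original
question match without any leading wildcard.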

On Mon, Sep 22, 2008 at 5:40 PM, Jean-Claude Antonio
<[EMAIL PROTECTED]>wrote:

> Hello,
>
> If I had a file with the following content:
> ...
> object.method();
> ...
> I would like to be able to query for
> object
> method
> object.method
>
> My guess is that I should store not only "object.method", but also "object"
> and "method" as I cannot query *method.
> Any other suggestion?
>
> Kind regards,
>
> JClaude
>
>
>
>
>
>


Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Anshul jain
Yes, you are partly correct.

What I need is for Lucene to support two types of queries for the
following document:
name: abc^10
organization: xyz^3

structured query:
name: abc and organization: xyz

unstructured query:
default_field: abc ^5 and xyz

But I do not want to create one more field (default_field) that will
contain all the values concatenated in it. Also, even if I collect all the
field names during indexing and pass them to MultiFieldQueryParser, the
query will become very inefficient, as there can be thousands of
fields. I hope this clarifies my point.



On Tue, Sep 23, 2008 at 1:58 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> So, the piece I'm missing is how do you know what field for which terms.  In
> other words how do you know xyz goes against organization and abc against
> name.  Your wording implies that you don't know this before hand, yet you
> are somehow suggesting that Lucene should be able to do it.  Correct me if
> I'm wrong.
>
> -Grant



-- 
Anshul Jain




Re: Exception while doing sorting

2008-09-23 Thread Erick Erickson
That still seems excessive. Are you measuring your first sort? Lucene
builds up caches to help sort with the first few *sorts* that happen, so
that's a possibility.

But if that isn't the case, I think you need to slap a profiler on the
problem and see where you're spending your time. I'd also be careful
about what you measure when you measure your query. For instance,
I've been fooled by measuring the total time to get an assembled response
and it turned out that the time was spent fetching the documents, NOT
searching/sorting.

Try measuring various operations. In particular comment out anything having
to do with assembling the response. Perhaps just substitute in making a list
of the doc IDs and time *that*. Slowly build back up to your current app, and
I suspect that one of the steps will cause your time to increase dramatically.

How many documents are you assembling to respond? If you're assembling
40,000 hits, then 10-20 seconds may not be unreasonable.

Best
Erick
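
One way to follow the measuring advice above is to wrap each stage in a
small timing helper. The sketch below is plain Java; the helper name and
the two stand-in stages are hypothetical, and the lambdas would be replaced
by the real search call and the real document-loading loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class StageTimer {
    /** Runs one stage, prints its elapsed wall-clock time, and returns its result. */
    static <T> T timed(String label, Supplier<T> stage) {
        long start = System.nanoTime();
        T result = stage.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + elapsedMs + " ms");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the search and sort: collect doc IDs only, load nothing.
        List<Integer> docIds = timed("search + sort (doc IDs only)", () -> {
            List<Integer> ids = new ArrayList<>();
            for (int i = 0; i < 40_000; i++) {
                ids.add(i);
            }
            return ids;
        });

        // Stand-in for fetching the documents and assembling the response.
        List<String> rendered = timed("fetch + assemble response", () -> {
            List<String> out = new ArrayList<>();
            for (int id : docIds) {
                out.add("doc-" + id);
            }
            return out;
        });

        System.out.println(rendered.size() + " hits assembled");
    }
}
```

Running the same stages twice in one process also shows whether the first
sort is paying a one-time cache-warming cost.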

On Tue, Sep 23, 2008 at 12:51 AM, Ganesh - yahoo <[EMAIL PROTECTED]>wrote:

> System Specification:
> Processor speed: 2Ghz
> Ram: 3 GB
>
> IndexDB size 5 GB.
> Total documents indexed: 5.8 million.
>
> To collect hits, I have replaced the Hits object with TopFieldDocs. This has
> improved search performance. Sorting is faster on a date / long
> field, but it is very slow on a string field. In a standalone application it
> took 10 - 20 secs to display the results sorted on a string field. [I am not
> opening an IndexSearcher every time].
>
> Regards
> Ganesh
>
>
>
> - Original Message - From: "Erick Erickson" <
> [EMAIL PROTECTED]>
> To: 
> Sent: Monday, September 22, 2008 6:29 PM
>
> Subject: Re: Exception while doing sorting
>
>
>> Sure, your Tomcat instance is assigning some amount of memory
>> to the JVM that your searcher is running in. Of course, now you're
>> going to ask me how to increase that number... I have no idea, but
>> I've seen this question multiple times in the mail archive,
>> so a search there or in the Tomcat docs should let you know.
>>
>> But 12 seconds is still a long time to wait for a search to complete.
>> Can you tell us more about your search?
>>
>> For instance, are you opening a searcher for each request? That's bad.
>> Are you sorting? that can take a long time, but again the first one
>> will have a performance penalty as things are cached.
>>
>> There are a number of tips here:
>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>>
>> Best
>> Erick
>>
>> On Mon, Sep 22, 2008 at 7:45 AM, Ganesh - yahoo <[EMAIL PROTECTED]
>> >wrote:
>>
>>> My index crossed 5 GB and 5 million documents are indexed.
>>> My query, which includes searching and sorting, returns 4 hits.
>>>
>>> If I search from a standalone application, the results are returned in
>>> 12 seconds. If I perform the same search from a web application running
>>> inside Tomcat, an out-of-memory exception occurs.
>>>
>>> Could anyone clarify this?
>>>
>>> Regards
>>> Ganesh
>>>
>>> - Original Message - From: "Ganesh - yahoo" <
>>> [EMAIL PROTECTED]
>>> >
>>> To: 
>>> Sent: Friday, September 19, 2008 10:56 AM
>>>
>>> Subject: Re: Exception while doing sorting
>>>
>>>
>>>> Ok. If I distribute the indexes, would sorting be faster?
>>>>
>>>> On the Lucene user mailing list, most emails suggest using a single
>>>> index. Wouldn't searching across multiple indexes be slower?
>>>>
>>>>> Lucene uses FieldCache for sorting on non-tokenized field and tries to
>>>>> maintain fields from all your 4 millions documents, even if you need
>>>>> to sort only 4000 docs.
>>>>
>>>> I don't know why Lucene keeps all terms in FieldCache for sorting. It is
>>>> supposed to sort only the hits. Please clarify?
>>>>
>>>> Regards
>>>> Ganesh
>>>>
>>>> - Original Message - From: "Otis Gospodnetic" <
>>>> [EMAIL PROTECTED]>
>>>> To:
>>>> Sent: Thursday, September 18, 2008 12:17 PM
>>>> Subject: Re: Exception while doing sorting
>>>>
>>>>> If your index is increasing in size so fast, you should start thinking
>>>>> about sharding your index (breaking it into multiple smaller indices
>>>>> that each fits on its server) and searching across them (aka distributed
>>>>> search).
>>>>>
>>>>> Yes, Lucene can handle millions of records if run on adequate hardware
>>>>> and if used correctly.
>>>>>
>>>>> Otis
>>>>> --
>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>
>>>>> - Original Message
>>>>>
>>>>>> From: Ganesh - yahoo <[EMAIL PROTECTED]>
>>>>>> To: java-user@lucene.apache.org
>>>>>> Sent: Thursday, September 18, 2008 12:53:19 AM
>>>>>> Subject: Re: Exception while doing sorting
>>>>>>
>>>>>> My index is growing by 1 million records per day. How much memory do I
>>>>>> need to add?
>>>>>>
>>>>>> What kind of sorting algorithm is used in Lucene? Is it efficient
>>>>>> enough to handle millions of records?
>>>>>>
>>>>>> Whether we could do sorting using our own algori

Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Erick Erickson
Are you sure you want to be boosting the document fields at
index time? From Hossman

<<>>

But Lucene isn't magic, it's an engine that you have to
make do what you want. You say

"But i do not want to create one  more field(default_field)
that will contain all the values  concatenated in it"

Is this for theoretical reasons or do you have evidence that this
is unacceptable? You haven't told us how much data you're
indexing, so we have no way to reassure (or warn) you about
trying this.

I suggest you try the "bag of words" solution (this should
not take you more than a few hours) and see if it's
unacceptable before rejecting it.

Best
Erick



Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Umesh Prasad
On Tue, Sep 23, 2008 at 5:28 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote:

> So, the piece I'm missing is how do you know what field for which terms.
>  In other words how do you know xyz goes against organization and abc
> against name.  Your wording implies that you don't know this before hand,


  I guess this would be the case. Free-flowing text search leads
to this issue.


> yet you are somehow suggesting that Lucene should be able to do it.
>  Correct me if I'm wrong.

  I am not sure Lucene can do this directly. However, the indexed terms
in Lucene can certainly be used to learn the field of a particular
word/token.
  One way would be to traverse the Lucene index to generate a Learning
System, which is later used to look up the field name of a particular
word. I suggest traversing the termDocs and extracting the words and
field information, which can be stored in a separate DB/index (the
Learning System). This system can then be queried first to determine the
field type of a word. The additional time that the Learning System
requires should be compensated for by the smaller index size.



Thanks
Umesh
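
A minimal sketch of the lookup described above, in plain Java. It assumes
that the (field, text) pairs are available while indexing or while walking
the index afterwards; the class and method names are hypothetical. The
lookup maps each term to the fields it was seen in, and free text is then
rewritten into field-qualified clauses before being handed to the real
search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FieldLookup {
    // term -> names of the fields in which that term was seen
    private final Map<String, Set<String>> fieldsByTerm = new HashMap<>();

    /** Records every whitespace-separated term of the text under the given field. */
    void record(String field, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                fieldsByTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(field);
            }
        }
    }

    /** Rewrites free text into field-qualified clauses; unknown terms are left bare. */
    String rewrite(String freeText) {
        List<String> clauses = new ArrayList<>();
        for (String term : freeText.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> fields = fieldsByTerm.get(term);
            if (fields == null) {
                clauses.add(term);
                continue;
            }
            List<String> alternatives = new ArrayList<>();
            for (String field : fields) {
                alternatives.add(field + ":" + term);
            }
            String joined = String.join(" OR ", alternatives);
            clauses.add(alternatives.size() > 1 ? "(" + joined + ")" : joined);
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        FieldLookup lookup = new FieldLookup();
        lookup.record("name", "abc");
        lookup.record("organization", "xyz");
        lookup.record("title", "abc");
        System.out.println(lookup.rewrite("abc xyz"));
    }
}
```

The trade-off is one extra lookup per query term against a structure whose
size grows with the number of distinct terms, not with the number of
documents.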




>
> -Grant
>


RE: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Dino Korah
Just an idea, and a long-winded one. I'm not sure about it either! Pardon me
if I am pointing you in the wrong direction.


If you add a lucene doc like below into your main index

- Doc 1 -
Field1: rainy today
Field2: rainy yesterday
Field3: weather forecast for tomorrow

- Doc 2 -
Field1: rainy tomorrow
Field2: rainy today
Field3: weather forecast for today


... etc


And if you create something like an inverted index like below

- Doc 1 -
Field: Field1
Value: rainy today

- Doc 2 -
Field: Field2
Value: rainy yesterday

- Doc 3 -
Field: Field3
Value: weather forecast for tomorrow

- Doc 4 -
Field: Field1
Value: rainy tomorrow

- Doc 5 -
Field: Field2
Value: rainy today

- Doc 6 -
Field: Field3
Value: weather forecast for today

And if you run a query on the inverted index to find the field that is
most likely to match the text you are about to search for in the main
index, I have a feeling that this might work.






Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Grant Ingersoll


On Sep 23, 2008, at 8:35 AM, Anshul jain wrote:


> yes you are partly correct
>
> what I need is that lucene should support two type of queries for the
> following document:
> name: abc^10
> organization: xyz^3
>
> structured query:
> name: abc and organization: xyz
>
> unstructured query:
> default_field: abc ^5 and xyz


And what field(s) should "xyz" be searched against?  Again, I ask, how
do you know what fields "xyz" should go against, and why does abc go
against the default_field?  You've said it shouldn't go against all
fields (b/c there are thousands of them), and you've said it shouldn't
go against a catch-all field, but otherwise I still have no clue what
your criteria are for which fields xyz should search.  Are you saying
that you want it to intelligently know that when "xyz" comes in, it
should search the organization field?


Other than seconding Umesh's or Dino's suggestions of using machine  
learning or heuristics or using some type of templating system, I'm  
not sure what else to offer.  You might look at Solr's Dismax Query  
Parser, which allows you to specify the field structure of queries in  
a multi-field way, but again, I doubt that is wholly what you are  
looking for.
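
For readers unfamiliar with the disjunction-max idea behind the Dismax
parser, here is a small plain-Java sketch of how one term's scores across
several fields are combined: the best-scoring field wins, and the remaining
fields contribute only through a tie-breaker multiplier. The class name and
the example scores are hypothetical, and scores are assumed to be
non-negative.

```java
class DisMaxScore {
    /**
     * Combines one term's per-field scores: the maximum score, plus
     * tieBreaker times the sum of the scores of all other fields.
     * Scores are assumed to be non-negative.
     */
    static double combine(double tieBreaker, double... fieldScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double score : fieldScores) {
            max = Math.max(max, score);
            sum += score;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical scores for the term "xyz" in two fields.
        double organizationScore = 2.0;
        double nameScore = 0.5;
        System.out.println(combine(0.0, organizationScore, nameScore)); // best field only
        System.out.println(combine(0.1, organizationScore, nameScore)); // small tie-break
        System.out.println(combine(1.0, organizationScore, nameScore)); // plain sum
    }
}
```

A tie-breaker of 0 means only the best field counts, and a tie-breaker of 1
degenerates into summing all fields. Note that the fields still have to be
listed up front, so this does not remove the need to know which fields to
search.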








Re: Query attached words

2008-09-23 Thread Jean-Claude Antonio

Thanks Erick, you are right about the various combinations.
Cheers,




Re: Query attached words

2008-09-23 Thread Matthew Hall

We have a similar requirement here at our work.

To get around it we create two indexes: one in which punctuation is
relevant, and one in which all punctuation is treated as a place to
break tokens.


We then search against both indexes and merge the results. It seems
that such a technique might help you here as well. (Though upon
rereading, it seems that perhaps you want SOME punctuation to be
relevant, but not the rest. The technique itself could still be
applied with those rules instead.)


- Matt
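
A rough plain-Java sketch of the two-index technique, under the assumption
that only the dot counts as relevant punctuation in the first index. The
class name, the in-memory maps and the two splitting rules are hypothetical
stand-ins for two real indexes built with different analyzers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DualIndexSearch {
    // Index 1: the dot is part of a token, so "object.method" stays whole.
    private final Map<String, Set<Integer>> punctuationKept = new HashMap<>();
    // Index 2: every punctuation character breaks tokens.
    private final Map<String, Set<Integer>> punctuationSplit = new HashMap<>();

    void add(int docId, String text) {
        for (String token : text.split("[^A-Za-z0-9_.]+")) {
            if (!token.isEmpty()) {
                index(punctuationKept, token, docId);
            }
        }
        for (String token : text.split("[^A-Za-z0-9_]+")) {
            if (!token.isEmpty()) {
                index(punctuationSplit, token, docId);
            }
        }
    }

    private static void index(Map<String, Set<Integer>> idx, String token, int docId) {
        idx.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
    }

    /** Looks the term up in both indexes and merges the matching document IDs. */
    Set<Integer> search(String term) {
        Set<Integer> merged = new TreeSet<>();
        merged.addAll(punctuationKept.getOrDefault(term, Collections.emptySet()));
        merged.addAll(punctuationSplit.getOrDefault(term, Collections.emptySet()));
        return merged;
    }

    public static void main(String[] args) {
        DualIndexSearch search = new DualIndexSearch();
        search.add(1, "object.method();");
        search.add(2, "other.method();");
        System.out.println(search.search("object"));
        System.out.println(search.search("method"));
        System.out.println(search.search("object.method"));
    }
}
```

The cost is that the text is indexed twice, in exchange for simple term
lookups at query time.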




Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Anshul jain
The line

unstructured query:
 default_field: abc ^5 and xyz

seems to have created some confusion. What I meant was that when
initializing the parser I pass "default_field" as the default text field.
So the query should be:

QueryParser parser = new QueryParser("default_field", analyzer);
Query query = parser.parse("abc^5 AND xyz");

so the query will be: default_field:abc^5 AND default_field:xyz^3

I am sorry for stating it wrongly earlier.

To answer Erick's question: I'll be indexing around 10-20 million
documents with an average size of 4 KB, but the number of documents could
be more.

Now let me explain my problem clearly again:

Say I have a set of Lucene documents:

Document 1:
name: Anshul ^10
organization: EPFL ^5
sex: Male

Document 2:
name: Rakesh ^10
organization: IIT-B ^5
sex: Male

Document 3:
name: erin brochowich^10
organization: ABC law firm
sex: Female

Document 4:
title: lord of the rings ^10
directors: John ^2
actors: Kate

Document 5:
title: godfather ^10
directors: Kate ^2
actors: alpachino

Documents 1, 2 and 3 belong to the same class, so their boosting
parameters will be the same. The same holds for documents 4 and 5.

If I give a query like:

name: "Erin Brochowich" AND organization: "ABC law firm"

this query will work perfectly.

But if the query is

QueryParser parser = new QueryParser("default_field", analyzer);
query = parser.parse("Erin Brochowich AND ABC law firm");

it will not work.

What I want is for default_field to be connected to all of the text
somehow, without taking extra space to store its own copy of the
text.

I think it should be clear enough now.

Thank you for your responses.
Regards,
Anshul





On Tue, Sep 23, 2008 at 4:55 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> On Sep 23, 2008, at 8:35 AM, Anshul jain wrote:
>
>> yes you are partly correct
>>
>> what I need is that lucene should support two type of queries for the
>> following document:
>> name: abc^10
>> organization: xyz^3
>>
>> structured query:
>> name: abc and organization: xyz
>>
>> unstructured query:
>> default_field: abc ^5 and xyz
>
> And what field(s) should "xyz" be searched against?  Again, I ask, how do
> you know what fields "xyz" should go against and why does abc go against the
> default_field?  You've said it shouldn't go against all fields (b/c there
> are thousands of them), and you've said it shouldn't go against a catch-all
> field, but otherwise I still have no clue your criteria for what fields xyz
> should search.  Are you saying that you want it to intelligently know that
> when "xyz" comes in that it should search the organization field?
>
> Other than seconding Umesh's or Dino's suggestions of using machine learning
> or heuristics or using some type of templating system, I'm not sure what
> else to offer.  You might look at Solr's Dismax Query Parser, which allows
> you to specify the field structure of queries in a multi-field way, but
> again, I doubt that is wholly what you are looking for.
>
>>
>>
>> But i do not want to create one more field(default_field) that will
>> contain all the values concatenated in it. Also, even if i get all the
>> fields during indexing and use it for multi field query parser, then
>> the query will become very inefficient as there can be thousands of
>> fields. I think it should clarify my point.
>>
>>
>>
>> On Tue, Sep 23, 2008 at 1:58 PM, Grant Ingersoll <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> So, the piece I'm missing is how do you know what field for which terms.
>>>  In
>>> other words how do you know xyz goes against organization and abc against
>>> name.  Your wording implies that you don't know this before hand, yet you
>>> are somehow suggesting that Lucene should be able to do it.  Correct me
>>> if
>>> I'm wrong.
>>>
>>> -Grant
>>>
>>>
>>> On Sep 23, 2008, at 6:51 AM, Anshul jain wrote:
>>>
>>>> Here is what I'm trying to do:
>>>>
>>>> say a lucene document:
>>>> name: abc ^10
>>>> organization: xyz ^3
>>>>
>>>> ^10 and ^3 are boosts in the document.
>>>>
>>>> now if I query name: abc ^5 AND organization: xyz this will work.
>>>>
>>>> but if I query (default_field): abc^5 AND xyz this won't work.
>>>>
>>>> Now what I want is that a text can be associated with more than one
>>>> field.
>>>> i.e.
>>>>
>>>> (field1,field2,field3):value
>>>> name,(default_field),title: abc^10
>>>> organization,(default_field),institute: xyz^3
>>>>
>>>> then both of my queries will work.
>>>>
>>>> Is it possible to do so in lucene without changing the source?
>>>> If no then can anyone please explain the indexing and searching
>>>> mechanism for lucene, so that I can start working on it.
>>>>
>>>> The solution given by the java-users won't work for me as I do not
>>>> want to add all the contents of the document in a single field and
>>>> then search for that field, as this would increase the index size and
>>>> I've to index more than 10 million documents. Also
>>>> multifieldqueryparser will make it query execution inefficient, as
>>>> there will be thousands of fields.
>>>>

Re: Multi Field search without Multifieldqueryparser

2008-09-23 Thread Erick Erickson
But the "default_field" for your query parser is just that, the default
*if nothing else is specified*. So the following would work just fine:

QueryParser parser = new QueryParser("default_field", analyzer);
query = parser.parse("name:Erin AND name:Brochowich AND organization:ABC AND "
    + "organization:law AND organization:firm");

None of the terms would go against default_field since an
explicit field is given for each. You'd have to break up the
incoming queries and add the field to each, but that's not hard.

Or even

query = parser.parse("name:\"Erin Brochowich\"~3 AND "
    + "organization:\"ABC law firm\"~3");

for phrase queries with slop.
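
"Break up the incoming queries and add the field to each" could look roughly
like the following. It is a plain-Java sketch that only builds the query
string to hand to the parser; the class and method names are hypothetical,
and it assumes the input contains no query-syntax special characters (no
escaping is done).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: builds field-qualified query strings for a query parser.
class FieldedQuerySketch {

    // ("name", "Erin Brochowich") -> name:Erin AND name:Brochowich
    static String allTerms(String field, String text) {
        List<String> clauses = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                clauses.add(field + ":" + term);
            }
        }
        return String.join(" AND ", clauses);
    }

    // ("organization", "ABC law firm", 3) -> organization:"ABC law firm"~3
    static String phrase(String field, String text, int slop) {
        return field + ":\"" + text.trim() + "\"~" + slop;
    }
}
```

The caller still has to decide which field each piece of user input belongs
to; the sketch only removes the string handling.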

I *still* think you're misunderstanding index-time boosting. It is
INDEPENDENT of query-time boosting. Index-time boosting has the effect of
raising the importance of a particular field IN THAT DOCUMENT relative to
that field IN OTHER DOCUMENTS. Boosting all the terms for a given field for
ALL documents is essentially doing nothing.
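
A toy model of that last point, in plain Java: if the boost is treated as a
simple multiplier on each document's score for the field (an assumption;
real Lucene scoring has more factors and stores the boost in a lossy
encoded form), giving every document the same boost leaves the ranking
untouched. Only a boost that differs between documents changes the order.
The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: boost modelled as a plain multiplier on a raw score.
class UniformBoostSketch {

    // Returns document indexes ordered best-first by rawScore * boost.
    static List<Integer> ranking(double[] rawScores, double[] boosts) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < rawScores.length; i++) {
            order.add(i);
        }
        // Negate so that the highest boosted score sorts first.
        order.sort(Comparator.comparingDouble((Integer i) -> -rawScores[i] * boosts[i]));
        return order;
    }
}
```

For raw scores {0.2, 0.9, 0.5}, boosts of {1, 1, 1} and {10, 10, 10} give the
same order, while {10, 1, 1} moves the first document to the top.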

I very strongly recommend you get a copy of Luke and experiment with how
queries are parsed. That tool can send any given query through the parser
and show you exactly what it looks like after parsing. I think that would
allow you to get much better answers much more quickly. Just google
lucene luke and you should be fine.

Finally, the number of documents you're talking about will produce a pretty
small index by Lucene standards. There's no reason to avoid the "bag of
words" solution, if that solves your problem, just because you fear
bloating your index.
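
For a sense of scale, here is a plain-Java sketch of the "bag of words"
field and of the raw text volume implied by the numbers quoted in this
thread (10-20 million documents of about 4 KB each). The class and method
names are hypothetical, and the size is raw input text only; how much an
extra indexed field adds depends on the analyzer and on whether the field
is stored, so it is worth measuring.

```java
import java.util.Map;

// Illustration only: a catch-all field plus a back-of-envelope size estimate.
class CatchAllSketch {

    // Concatenate every field value into one searchable blob; field names are discarded.
    static String catchAll(Map<String, String> fields) {
        return String.join(" ", fields.values());
    }

    // Raw text size in gigabytes, taking 1 GB = 1024 * 1024 KB.
    static double rawGigabytes(long documents, double averageKilobytes) {
        return documents * averageKilobytes / (1024.0 * 1024.0);
    }
}
```

With those figures the raw text comes to roughly 38 to 76 GB in total.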

Best
Erick



Re: Query attached words

2008-09-23 Thread Jean-Claude Antonio

Thanks Matt,
I will go for Erick's suggestion, as the combinations can get messy: for
a.b.c I would need to store a, b, c, a.b, b.c and a.b.c.
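
To put a number on "messy": storing every contiguous run of segments means
n(n+1)/2 entries for a name with n segments. A small plain-Java sketch
(class and method names made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: enumerates every contiguous run of dotted segments.
class SubPathSketch {

    // a.b.c -> [a, a.b, a.b.c, b, b.c, c]
    static List<String> contiguousRuns(String dotted) {
        String[] parts = dotted.split("\\.");
        List<String> runs = new ArrayList<>();
        for (int start = 0; start < parts.length; start++) {
            StringBuilder run = new StringBuilder();
            for (int end = start; end < parts.length; end++) {
                if (end > start) {
                    run.append('.');
                }
                run.append(parts[end]);
                runs.add(run.toString());
            }
        }
        return runs;
    }
}
```

That is 6 entries for a.b.c but already 36 for a.b.c.d.e.f.g.h.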

Cheers




Matthew Hall wrote:

We have a similar requirement here at work.

To get around it we create two indexes: one in which punctuation is
relevant, and one in which all punctuation is treated as a place to
break tokens.


We then search against both indexes and merge the results. It seems
that such a technique might help you here as well. (Though upon
rereading, it seems that perhaps you want SOME punctuation to be
relevant and the rest not; the technique itself could still be
applied with those rules instead.)
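
The merge step is not described in detail above, so the following is only
one possible reading: a plain-Java sketch that assumes both indexes return
hits keyed by a shared document id with comparable scores, keeps the higher
score when a document is found in both, and orders the result best-first.
All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: merges hits from two indexes by a shared document id.
class MergeResultsSketch {

    // Keep the higher score per document, then sort by score (ties broken by id).
    static List<String> merge(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> best = new HashMap<>(a);
        b.forEach((id, score) -> best.merge(id, score, Double::max));
        List<String> ids = new ArrayList<>(best.keySet());
        ids.sort((x, y) -> {
            int byScore = Double.compare(best.get(y), best.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        return ids;
    }
}
```

Scores from two differently analyzed indexes are not necessarily
comparable, so a real merge might need normalization or a different rule.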


- Matt

Jean-Claude Antonio wrote:

Thanks Erick, you are right about the various combinations.
Cheers,




Rsync causing search timeouts on master

2008-09-23 Thread rahul_k123

Hi,

I am using snappuller to sync my slave with the master. I am not using the
rsync daemon; I am running rsync over a remote shell.

When I serve requests from the master while snappuller is running
(after optimization the total index is around 4 GB, and it transfers the
whole index), performance is very bad, actually causing timeouts.



Any ideas why this happens?


Any suggestions will help.


Thanks.
-- 
View this message in context: 
http://www.nabble.com/Rsync-causing-search-timeouts-on-master-tp19641103p19641103.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Rsync causing search timeouts on master

2008-09-23 Thread Otis Gospodnetic
Hi,

Wrong list. :)  I answered your question on solr-user.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message -----
> From: rahul_k123 <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, September 23, 2008 11:00:02 PM
> Subject: Rsync causing search timeouts on master
> 
> 

