Re: frequent keyword computation within a search ( and timeinterval )

Jason Rutherglen Thu, 05 Jan 2012 17:13:14 -0800

> Although I still question whether this is a *good* use of Solr

It's a great use of Lucene, which can be made into a superior
horizontally scalable database when compared with open source
relational database systems.


My only concern, going back to *other* conversation(s), is whether the
field cache used by the stats component is populated per-segment.  If
*true*, then the stats part of Solr can be checked off as NRT / soft
commit capable / efficient.

I think the answer is *FALSE*, based on these lines in StatsComponent,
which seem to operate on the top-level reader (i.e., NOT per-segment).

  si = FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);

  UnInvertedField uif = UnInvertedField.getUnInvertedField(f, searcher);
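
To illustrate why per-segment matters for NRT, here is a minimal sketch
in plain Java. It does not use the Lucene or Solr APIs; Segment and
PerSegmentSum are hypothetical names, and it assumes segments are
immutable and ignores deleted documents. Partial sums are cached per
segment, so after a soft commit only the newly added segment is
scanned, whereas a cache keyed on the top-level reader would be rebuilt
in full.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: Segment and PerSegmentSum are hypothetical names,
// not Lucene or Solr classes.
public class PerSegmentSum {

    /** An immutable slice of the index, standing in for one segment. */
    static final class Segment {
        final String name;
        final double[] prices;

        Segment(String name, double... prices) {
            this.name = name;
            this.prices = prices;
        }
    }

    // Partial sums keyed by segment name.
    private final Map<String, Double> cache = new HashMap<>();

    // Number of segments actually scanned so far.
    int scans = 0;

    /** Sums prices over all segments, scanning only uncached segments. */
    double sum(List<Segment> segments) {
        double total = 0;
        for (Segment seg : segments) {
            Double partial = cache.get(seg.name);
            if (partial == null) {
                double s = 0;
                for (double p : seg.prices) {
                    s += p;
                }
                scans++;
                cache.put(seg.name, s);
                partial = s;
            }
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        PerSegmentSum stats = new PerSegmentSum();
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("_0", 1.0, 2.0));
        index.add(new Segment("_1", 3.0));
        System.out.println(stats.sum(index)); // scans both segments

        index.add(new Segment("_2", 4.0));    // a soft commit adds one small segment
        System.out.println(stats.sum(index)); // scans only the new segment
        System.out.println("segments scanned: " + stats.scans);
    }
}
```

With a top-level cache the second call would have to rescan every
document, which is the cost I am worried about in StatsComponent.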

On Thu, Jan 5, 2012 at 4:54 PM, Erick Erickson <[email protected]> wrote:
> Hmmm, guess you're right, the stats component
> does return that data. It's been a long day...
>
> Although I still question whether this is a *good*
> use of Solr, I'd still re-examine my approach
> whenever I found myself trying to translate
> SQL queries into Solr....
>
> But if, after that examination I still required
> SUM, stats would do it.
>
> Erick
>
> On Thu, Jan 5, 2012 at 7:23 PM, Jason Rutherglen
> <[email protected]> wrote:
>>> Short answer is that no, there isn't an aggregate
>>> function. And you shouldn't even try
>>
>> If that is the case why does a 'stats' component exist for Solr with
>> the SUM function built in?
>>
>> http://wiki.apache.org/solr/StatsComponent
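
For reference, a rough sketch of how the SELECT SUM(price) WHERE
item=fruits example from earlier in the thread might map onto a
StatsComponent request. The stats=true and stats.field parameters are
the ones described on the wiki page above; the host, port, and the
StatsQuery helper are assumptions for illustration only.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative helper; StatsQuery is a hypothetical name, not part of SolrJ.
public class StatsQuery {

    /** Builds a select URL that asks StatsComponent for statistics on one field. */
    static String statsUrl(String baseUrl, String query, String field) {
        return baseUrl + "/select"
            + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&rows=0"          // only the statistics are needed, not the documents
            + "&stats=true"      // enable StatsComponent
            + "&stats.field=" + URLEncoder.encode(field, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed local Solr instance; adjust the base URL to your setup.
        System.out.println(statsUrl("http://localhost:8983/solr", "item:fruits", "price"));
    }
}
```

The stats section of the response reports sum alongside min, max,
count, and mean for the price field over the documents matching the
query.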
>>
>> On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson <[email protected]> wrote:
>>> You will encounter endless grief until you stop
>>> thinking of Solr/Lucene as a replacement for
>>> an RDBMS. It is a *text search engine*.
>>> Whenever you start asking "how do I implement
>>> a SQL statement in Solr", you have to stop
>>> and reconsider *why* you are trying to do that.
>>> Then recast the question in terms of searching.
>>>
>>> Short answer is that no, there isn't an aggregate
>>> function. And you shouldn't even try.
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Jan 5, 2012 at 12:53 PM, prasenjit mukherjee
>>> <[email protected]> wrote:
>>>> Thanks Erick for the response.
>>>>
>>>> Will lucene/solr provide me aggregations ( of field values ) satisfying
>>>> a query criteria ? e.g. SELECT SUM(price) WHERE item=fruits
>>>>
>>>> Or do I need to use hitCollector to achieve that ?
>>>>
>>>> Any sample solr/lucene query to compute aggregates ( like SUM ) will be
>>>> great.
>>>>
>>>> -Thanks,
>>>> Prasenjit
>>>>
>>>> On Thu, Jan 5, 2012 at 7:10 PM, Erick Erickson <[email protected]> wrote:
>>>>> the time interval is just a RangeQuery in the Lucene
>>>>> world. The rest is pretty standard search stuff.
>>>>>
>>>>> You probably want to have a look at the NRT
>>>>> (near real time) stuff in trunk.
>>>>>
>>>>> Your reads/writes are pretty high, so you'll need
>>>>> some experimentation to size your site
>>>>> correctly.
>>>>>
>>>>> Best
>>>>> Erick
>>>>>
>>>>> On Wed, Jan 4, 2012 at 12:17 AM, prasenjit mukherjee
>>>>> <[email protected]> wrote:
>>>>>> I have a requirement where reads and writes are quite high ( @ 100-500
>>>>>> per-sec ). A document has the following fields : timestamp,
>>>>>> unique-docid,  content-text, keyword. Average content-text length is ~
>>>>>> 20 bytes, there is only 1 keyword for a given docid.
>>>>>>
>>>>>> At runtime, given a query-term ( which could be null ) and a
>>>>>> time-interval,  I need to find the top-k most frequent keywords among
>>>>>> documents that contain the query-term ( skipped if it's null ) in their
>>>>>> content-text field within that time-interval. I can purge the data every
>>>>>> day, hence no need for me to have more than a day's data.
>>>>>>
>>>>>> I have quite a few options here : Starting with MySQL, NoSQLs (
>>>>>> Cassandra, Mongo, Couch, Riak, Redis ) , Search-Engine based (
>>>>>> lucene/solr ) each having its own pros/cons.
>>>>>>
>>>>>> In MySQL we can achieve this via a GROUP BY/COUNT clause.
>>>>>> In NoSQL I can probably write a map/reduce task to query these
>>>>>> numbers, although I am not very sure about the query response time.
>>>>>> Not sure if we can achieve it via lucene/solr OOB.
>>>>>>
>>>>>> Any suggestions on what would be a good choice for this use case ?
>>>>>>
>>>>>> -Thanks,
>>>>>> prasenjit
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
