Re: frequent keyword computation within a search ( and timeinterval )

prasenjit mukherjee Thu, 05 Jan 2012 17:41:57 -0800

It seems that the field ( on which stats need to be computed ) must
always remain in memory. This could be a killer. Why isn't it possible
to put that stat-field information into the posting stream ( using
payloads ), which would allow fast computation of stats without
requiring the content to be kept in memory?
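
To make the idea concrete, here is a minimal Java sketch of what I have in
mind. It is a toy model built on plain collections, not the Lucene postings
or payload API: the class PayloadStatsSketch, its Posting entries and the
add/sum methods are all hypothetical names. It assumes the stat value is
stored next to each postings entry of the term being queried, so a SUM for
a single-term query only walks that term's postings and never needs a
per-document array covering the whole index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: plain Java collections, not the Lucene postings or payload API. */
final class PayloadStatsSketch {

    /** One postings entry: a document id plus the numeric value carried as its "payload". */
    static final class Posting {
        final int docId;
        final double value;

        Posting(int docId, double value) {
            this.docId = docId;
            this.value = value;
        }
    }

    // term -> postings list; stands in for the posting stream of each term
    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Records that docId contains term, with the stat value attached to that entry. */
    void add(String term, int docId, double value) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(new Posting(docId, value));
    }

    /** SUM over the documents matching one term, reading values while walking its postings. */
    double sum(String term) {
        double total = 0.0;
        for (Posting p : postings.getOrDefault(term, Collections.<Posting>emptyList())) {
            total += p.value;
        }
        return total;
    }

    public static void main(String[] args) {
        PayloadStatsSketch index = new PayloadStatsSketch();
        index.add("fruits", 0, 1.5);
        index.add("fruits", 2, 2.5);
        index.add("toys", 1, 10.0);
        // Rough analogue of SELECT SUM(price) WHERE item=fruits
        System.out.println(index.sum("fruits"));
    }
}
```

The sketch covers only the single-term case. A query that combines several
terms would need the value to be reachable from whichever postings list
drives the match, which this toy does not address.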



On 1/6/12, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>> Although I still question whether this is a *good* use of Solr
>
> It's a great use of Lucene, which can be made into a superior
> horizontally scalable database when compared with open source
> relational database systems.
>
> My only concern, going back to *other* conversation(s) is whether or
> not the field cache used by stats component is operated on per-segment
> or not.  If *true* then the stats part of Solr can be checked off as
> NRT / soft commit capable / efficient.
>
> I think the answer is *FALSE* based on these lines in StatsComponent
> which seem to be operating on the top-level reader (i.e., NOT
> per-segment).
>
>   si = FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);
>
>   UnInvertedField uif = UnInvertedField.getUnInvertedField(f, searcher);
>
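
A rough Java sketch of why the per-segment question matters for NRT. This
is a toy model, not Lucene code: PerSegmentStatsSketch, Segment and the
segmentsComputed counter are hypothetical names, and the cache here holds a
partial SUM per segment, whereas the real field cache holds per-document
values. The assumption being illustrated is only that segments are
immutable, so work cached per segment stays valid after a reopen and only
newly added segments cost anything.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: "segments" are immutable value arrays, not Lucene segment readers. */
final class PerSegmentStatsSketch {

    /** An immutable segment identified by name, holding one numeric value per document. */
    static final class Segment {
        final String name;
        final double[] values;

        Segment(String name, double... values) {
            this.name = name;
            this.values = values.clone();
        }
    }

    // Cache keyed by segment name; entries stay valid because segments never change.
    private final Map<String, Double> sumBySegment = new HashMap<>();

    // Counts cache misses, i.e. how many segments actually had to be scanned.
    int segmentsComputed = 0;

    /** SUM across the current segments, scanning only segments not seen before. */
    double sum(List<Segment> currentSegments) {
        double total = 0.0;
        for (Segment s : currentSegments) {
            Double cached = sumBySegment.get(s.name);
            if (cached == null) {
                double partial = 0.0;
                for (double v : s.values) {
                    partial += v;
                }
                sumBySegment.put(s.name, partial);
                segmentsComputed++;
                cached = partial;
            }
            total += cached;
        }
        return total;
    }

    public static void main(String[] args) {
        PerSegmentStatsSketch stats = new PerSegmentStatsSketch();
        Segment a = new Segment("_0", 1.0, 2.0);
        Segment b = new Segment("_1", 3.0);
        System.out.println(stats.sum(Arrays.asList(a)));    // first view: one segment
        System.out.println(stats.sum(Arrays.asList(a, b))); // after a reopen: one new segment
        System.out.println(stats.segmentsComputed);         // segments scanned in total
    }
}
```

A cache keyed on the top-level reader would, by contrast, be rebuilt for the
whole index on every reopen, which is the concern raised in the quoted text.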
> On Thu, Jan 5, 2012 at 4:54 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>> Hmmm, guess you're right, the stats component
>> does return that data. It's been a long day...
>>
>> Although I still question whether this is a *good*
>> use of Solr, I'd still re-examine my approach
>> whenever I found myself trying to translate
>> SQL queries into Solr....
>>
>> But if, after that examination I still required
>> SUM, stats would do it.
>>
>> Erick
>>
>> On Thu, Jan 5, 2012 at 7:23 PM, Jason Rutherglen
>> <jason.rutherg...@gmail.com> wrote:
>>>> Short answer is that no, there isn't an aggregate
>>>> function. And you shouldn't even try
>>>
>>> If that is the case why does a 'stats' component exist for Solr with
>>> the SUM function built in?
>>>
>>> http://wiki.apache.org/solr/StatsComponent
>>>
>>> On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson <erickerick...@gmail.com>
>>> wrote:
>>>> You will encounter endless grief until you stop
>>>> thinking of Solr/Lucene as a replacement for
>>>> an RDBMS. It is a *text search engine*.
>>>> Whenever you start asking "how do I implement
>>>> a SQL statement in Solr", you have to stop
>>>> and reconsider *why* you are trying to do that.
>>>> Then recast the question in terms of searching.
>>>>
>>>> Short answer is that no, there isn't an aggregate
>>>> function. And you shouldn't even try.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Thu, Jan 5, 2012 at 12:53 PM, prasenjit mukherjee
>>>> <prasen....@gmail.com> wrote:
>>>>> Thanks Erick for the response.
>>>>>
>>>>> Will lucene/solr provide me aggregations ( of field values ) satisfying
>>>>> a query criteria ? e.g. SELECT SUM(price) WHERE item=fruits
>>>>>
>>>>> Or do I need to use a hitCollector to achieve that ?
>>>>>
>>>>> Any sample solr/lucene query to compute aggregates ( like SUM ) will be
>>>>> great.
>>>>>
>>>>> -Thanks,
>>>>> Prasenjit
>>>>>
>>>>> On Thu, Jan 5, 2012 at 7:10 PM, Erick Erickson
>>>>> <erickerick...@gmail.com> wrote:
>>>>>> the time interval is just a RangeQuery in the Lucene
>>>>>> world. The rest is pretty standard search stuff.
>>>>>>
>>>>>> You probably want to have a look at the NRT
>>>>>> (near real time) stuff in trunk.
>>>>>>
>>>>>> Your reads/writes are pretty high, so you'll need
>>>>>> some experimentation to size your site
>>>>>> correctly.
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>> On Wed, Jan 4, 2012 at 12:17 AM, prasenjit mukherjee
>>>>>> <prasen....@gmail.com> wrote:
>>>>>>> I have a requirement where reads and writes are quite high ( @ 100-500
>>>>>>> per-sec ). A document has the following fields : timestamp,
>>>>>>> unique-docid, content-text, keyword. Average content-text length is ~
>>>>>>> 20 bytes, and there is only 1 keyword for a given docid.
>>>>>>>
>>>>>>> At runtime, given a query-term ( which could be null ) and a
>>>>>>> time-interval, I need to find the top-k frequent keywords among
>>>>>>> documents that contain the query-term ( optional if it's null ) in their
>>>>>>> content-text field within that time-interval. I can purge the data every
>>>>>>> day, hence no need for me to have more than a day's data.
>>>>>>>
>>>>>>> I have quite a few options here : Starting with MySQL, NoSQLs (
>>>>>>> Cassandra, Mongo, Couch, Riak, Redis ) , Search-Engine based (
>>>>>>> lucene/solr ) each having its own pros/cons.
>>>>>>>
>>>>>>> In MySQL we can achieve this via a GROUP BY / COUNT clause.
>>>>>>> In NoSQL I can probably write a map/reduce task to query these
>>>>>>> numbers, although I am not very sure about the query response time.
>>>>>>> Not sure if we can achieve it via lucene/solr OOB.
>>>>>>>
>>>>>>> Any suggestions on what would be a good choice for this use case ?
>>>>>>>
>>>>>>> -Thanks,
>>>>>>> prasenjit
>>>>>>>
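
For reference, the computation being asked for can be pinned down with a
small Java sketch. It is a toy in-memory model, not a Lucene or Solr API:
TopKeywordsSketch, Doc and topK are hypothetical names, timestamps are
assumed to be plain longs with an inclusive interval, and the query-term
match is a simple substring test rather than analyzed term matching. It is
the equivalent of the GROUP BY / COUNT query, restricted by time and an
optional term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of the question; not a Lucene or Solr API. */
final class TopKeywordsSketch {

    /** One document with the fields named in the question (docid omitted for brevity). */
    static final class Doc {
        final long timestamp;
        final String contentText;
        final String keyword;

        Doc(long timestamp, String contentText, String keyword) {
            this.timestamp = timestamp;
            this.contentText = contentText;
            this.keyword = keyword;
        }
    }

    /**
     * Returns up to k keywords, most frequent first, among documents whose timestamp
     * lies in [from, to] and whose content-text contains queryTerm.
     * A null queryTerm matches every document.
     */
    static List<String> topK(List<Doc> docs, String queryTerm, long from, long to, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (Doc d : docs) {
            boolean inRange = d.timestamp >= from && d.timestamp <= to;
            boolean matches = queryTerm == null || d.contentText.contains(queryTerm);
            if (inRange && matches) {
                counts.merge(d.keyword, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically so the result is deterministic.
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(10, "red apple", "fruit"),
                new Doc(20, "green apple", "fruit"),
                new Doc(30, "apple pie", "dessert"),
                new Doc(40, "car", "vehicle"));
        System.out.println(topK(docs, "apple", 0, 100, 2));
        System.out.println(topK(docs, null, 0, 100, 1));
    }
}
```

This scans every document per query, so it says nothing about whether the
100-500 reads per second can be met; it only fixes the semantics that any
of the candidate stores would have to reproduce.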
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org

-- 
Sent from my mobile device
