Poll: how to report # of docs in index over time

2012-02-13 Thread Otis Gospodnetic
Hello, Quick poll for those who have an opinion about what index size monitoring should report in terms of the number of documents in the index. Poll: http://blog.sematext.com/2012/02/13/poll-solr-index-size-monitoring/ For example, imagine that in some 5-minute time period (say 10:00 AM to 10:

Re: Any benchmark corps to evaluate performance of specified query?

2013-01-17 Thread Otis Gospodnetic
Hi, Maybe https://github.com/sematext/ActionGenerator could be of help? We use it to produce query load for Solr and ElasticSearch and the whole thing is extensible, so you could easily add support for talking directly to Lucene. Oh, and there is the benchmark in Lucene:  http://lucene.apache.or

Document scoring order?

2013-04-03 Thread Otis Gospodnetic
Hi, When Lucene scores matching documents, what is the order in which documents are processed/scored and can that be changed? I'm guessing it scores matches in whichever order they are stored in the index/on disk, which means by increasing docIDs? I do see some out of order scoring is possible..

Re: Content based recommender using lucene/solr

2013-06-28 Thread Otis Gospodnetic
Hi, Have a look at http://www.youtube.com/watch?v=13yQbaW2V4Y . I'd say it's easier than Mahout, especially if you already have and know your way around Solr. Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Jun 28, 2013 at

Re: Content based recommender using lucene/solr

2013-06-28 Thread Otis Gospodnetic
Hi, It doesn't have to be one or the other. In the past I've built a news recommender engine based on CF (Mahout) and combined it with Content Similarity-based engine (wasn't Solr/Lucene, but something custom that worked with ngrams, but it may have as well been Lucene/Solr/ES). It worked well.

Re: TermsFilter instead of "should" TermQueries

2010-05-09 Thread Otis Gospodnetic
I think what Tomislav was trying to ask is: Can filters replace only strictly boolean clauses (i.e. only MUST and MUST_NOT), such as +gender:F, -rating:xxx? Or can filters also replace SHOULD clauses, such as food:banana (which is neither absolutely required nor strictly prohibited)? Otis --

Re: Filter vs. TermQuery performance

2010-05-09 Thread Otis Gospodnetic
I think others will have more thoughts on this, esp. for Numeric* questions... but I'll try answering... - Original Message > From: Tomislav Poljak > To: java-user@lucene.apache.org > Sent: Fri, May 7, 2010 2:34:46 PM > Subject: Filter vs. TermQuery performance > > Hi, > when is it w

Re: Grouping or de-duping

2010-05-31 Thread Otis Gospodnetic
Pasa, Maybe Field Collapsing (Solr) can help? See SOLR-236 in JIRA http://search-lucene.com/?q=field+collapsing&fc_project=Lucene&fc_project=Solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message --

Re: Is Lucene a "document oriented database"?

2010-05-31 Thread Otis Gospodnetic
I think those doc-oriented DBs tend to be distributed, with replication built-in and such, but yes, in some way the schemaless DB with docs and fields (whether they are pumped in as JSON or XML or Java objects) feels the same. I saw something from Grant about 2 months ago how Lucene is "nosql-i

Re: Using JSON for index input and search output

2010-05-31 Thread Otis Gospodnetic
VL, Solr (not Lucene, but you can embed Solr) has JsonUpdateRequestHandler, which lets you send docs to Solr for indexing in JSON (instead of the usual XML): http://search-lucene.com/c/Solr:/src/java/org/apache/solr/handler/JsonUpdateRequestHandler.java And you can get Solr to respond with JSON

Re: Wich way would you recommend for successive-words similarity and scoring ?

2010-06-01 Thread Otis Gospodnetic
Hi Pablo, This question comes up every once in a while. You'll find some previous discussions and answers here: http://search-lucene.com/?q=terms+closer+together+score Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -

Re: is there any resources that explain detailed implementation of lucene?

2010-06-03 Thread Otis Gospodnetic
Li Li: Then best to go to the source. Here's one version with syntax highlighting and line numbers, should you have questions about specific parts of that class: http://search-lucene.com/c/Lucene:/src/java/org/apache/lucene/search/PhraseQuery.java Otis Sematext :: http://sematext.com/ ::

Re: numDeletedDocs()

2010-06-03 Thread Otis Gospodnetic
Btw. folks, http://search-lucene.com/ has a really handy source code search with auto-completion for Lucene, Solr, etc. For example, I typed in: numDel - and immediately found those methods. Use it. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search

Re: Monitoring low level IO

2010-06-03 Thread Otis Gospodnetic
Other than iostat, vmstat and such? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Jason Rutherglen > To: java-user@lucene.apache.org > Sent: Thu, June 3, 2010 2:13:17 PM > Subject: Mo

Re: Monitoring low level IO

2010-06-04 Thread Otis Gospodnetic
Ah, there is another one I came across several months back - http://wiki.sdn.sap.com/wiki/display/Java/JPicus. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Otis Gospodnetic &

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Lucene/Solr choice typically means: * lower cost of ownership (think about various crazy licensing models some of the commercial search vendors have: per doc, per server, per query, per year) * faster implementation (just think about the duration of the sales/negotiation phase for commerci

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Off the top of my head: FAST Endeca Coveo Attivio Vivisimo Google Search Appliance (tell me when to stop) Dieselpoint IBM OmniFind Exalead Autonomy dtSearch ISYS Oracle ... ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
nd Lucene... And I > personally wouldn't count full text search solutions such as > Oracle's. Itamar. > -----Original Message- > From: > Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] >

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
On Wed, Jun 23, > 2010 at 11:41 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> > wrote: > Off the top of my head: > > FAST > > Endeca > Co

Re: arguments in favour of lucene over commercial competition

2010-06-24 Thread Otis Gospodnetic
too, to show how it has improved in the last > versions (not that it was bad before) does anyone have a link to a nice page > with numbers/graphs ? On Thu, Jun 24, 2010 at 7:43 AM, Otis > Gospodnetic <otis_gospodne...@yahoo.co

Re: Personal Intro and a question on "find top 10 similar items" functionality

2010-07-08 Thread Otis Gospodnetic
Igor, You can treat that question as the query and use it to search the index where you've indexed other questions. More Like This is another option. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message

Re: LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)

2010-08-07 Thread Otis Gospodnetic
Utku, you should ask via comments on https://issues.apache.org/jira/browse/LUCENE-2453. What happened with Lucandra? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Utku Can Topçu > To

Re: understanding lucene

2010-08-08 Thread Otis Gospodnetic
Manning, the Lucene in Action publisher, frequently offers 30-50% off on a number of their books, including LIA2. See http://twitter.com/ManningBooks Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message

Re: Using categories with Lucene

2010-08-08 Thread Otis Gospodnetic
Hello Luan, I think you are looking for facets and faceted search. In short, it means storing the category for a document (web page) in the Document Field in Lucene index . Then, at search time, you count how many matches were in which category. You can implement this yourself or you can use
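The counting step described in this reply can be sketched without any Lucene classes. The snippet below is only an illustration, assuming each matching document has already been reduced to the value of its category field; the class name `FacetCountSketch` and the sample categories are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count matches per category, the core idea behind facets.
final class FacetCountSketch {

    // Takes the category value of every matching document and
    // returns category -> number of matches, ordered by category name.
    static Map<String, Integer> countByCategory(List<String> categoriesOfMatches) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String category : categoriesOfMatches) {
            counts.merge(category, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up category values for four matching documents.
        List<String> matches = Arrays.asList("sports", "news", "sports", "tech");
        System.out.println(countByCategory(matches));
    }
}
```

In a real index the category values would come from the stored or indexed field of each hit rather than from a hard-coded list.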

Re: Calculate Term Co-occurrence Matrix

2010-08-20 Thread Otis Gospodnetic
There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs, and a few other things: http://sematext.com/products/key-phrase-extractor/index.html One of the demos that uses news data is at http://sematext.com/demo/kpe/index.html Otis Sematext :: http://sematext.com/ :: Solr - Lu

Re: lucene indexing configuration

2010-08-20 Thread Otis Gospodnetic
Hi, Are you actually talking about Solr? Sounds like it. Check solr-u...@lucene list. Maybe you need to treat those words as protected words? See the protwords.txt file in the conf dir. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://se

Re: Calculate Term Co-occurrence Matrix

2010-08-21 Thread Otis Gospodnetic
s.searchenginewatch.com/showthread.php?t=48>. > I hope to find some code that given a text corpus, generate all the words > pairs with their probability of occurring together. > > > On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic < > otis_gospodne...@yahoo.com> wro

Re: does lucene support Database full text search

2010-09-10 Thread Otis Gospodnetic
Hello, You can use LuSQL to index DB content into Lucene. Solr (the "Lucene Server") has DataImportHandler for indexing data from DBs: http://search-lucene.com/?q=dataimporthandler Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-luce

Re: How about lucene's delete performance ?

2010-10-13 Thread Otis Gospodnetic
Hello, Of course, if you actually want the rolling last-7-days effect and not this week vs. previous week, then you could go with smaller indices, say daily ones. Then you'd always add new docs to the latest index and remove the oldest index completely every 24 hours. You could go hourly
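The bookkeeping behind such a rolling window of daily indices might look roughly like this. It is a sketch under assumptions: `RollingDailyIndices` is a hypothetical helper that only tracks which days have an index, and actually opening, searching and deleting the index directories is left to the caller.

```java
import java.time.LocalDate;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: tracks a rolling window of per-day indices.
final class RollingDailyIndices {
    private final int maxDays;
    private final Deque<LocalDate> days = new ArrayDeque<>(); // oldest first

    RollingDailyIndices(int maxDays) {
        this.maxDays = maxDays;
    }

    // Called when a new day starts. Registers the new day's index and returns
    // the day whose index should now be removed, or null if the window is not full.
    LocalDate rollTo(LocalDate today) {
        days.addLast(today);
        return days.size() > maxDays ? days.removeFirst() : null;
    }

    // The day whose index receives all new documents.
    LocalDate current() {
        return days.peekLast();
    }

    int size() {
        return days.size();
    }

    public static void main(String[] args) {
        RollingDailyIndices window = new RollingDailyIndices(7);
        LocalDate start = LocalDate.of(2010, 10, 1);
        for (int i = 0; i < 9; i++) {
            LocalDate dropped = window.rollTo(start.plusDays(i));
            System.out.println("current=" + window.current() + " dropped=" + dropped);
        }
    }
}
```

The point of the design is that expiring old data becomes removing one whole small index instead of deleting documents one by one from a large one.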

Re: Best practices for multiple languages?

2011-01-18 Thread Otis Gospodnetic
Hi Clemens, If you will be searching individual languages, go with language-specific indices. Wunder likes to give an example of "die" in German vs. English. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Orig

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Otis Gospodnetic
> [X] ASF Mirrors (linked in our release announcements or via the Lucene >website) > > [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) > > [X] I/we build them from source via an SVN/Git checkout. > > [] Other (someone in your company mirrors them internally or via a > d

Re: Backup or replication option with lucene

2011-03-02 Thread Otis Gospodnetic
Hi Ganesh, You could probably use replication scripts from Solr. But why not just use Solr? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Ganesh > To: java-user@lucene.apache.org > S

Re: Detecting duplicates

2011-03-08 Thread Otis Gospodnetic
Mark, Keep in mind that there are actually multiple patches for this. SOLR-236 and SOLR-1086 IIRC. Also, I just noticed this is java-user@lucene. You may want to continue on solr-user@lucene. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http:/

Re: Indexing Non-Textual Data

2011-04-06 Thread Otis Gospodnetic
Hi Chris, Yes, people have done classification with Lucene before. Have a look at http://search-lucene.com/?q=classifier&fc_project=Lucene for some discussions and actual code (in old JIRA issues) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: ht

Re: NRT consistency

2011-04-11 Thread Otis Gospodnetic
I think what's being described here is a lot like what I *think* ElasticSearch does, where there is no single master and index changes made to any node get propagated to N-1 other nodes (N=number of index replicas). I'm not sure how it deals with situations where "incompatible" index changes a

SorterTemplate.quickSort causes StackOverflowError

2011-04-28 Thread Otis Gospodnetic
Hi, I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's exhibiting a strange behaviour - it slows down over time. The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand queries against it. The set of queries does not change - the same set of queries

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-28 Thread Otis Gospodnetic
y ArrayUtils.mergeSort() > and see if problem is still there? > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: Otis Gospodnetic [mailto:otis

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
s and stack overflow. In Lucene 3.0 this used > > > stock java sort (which is mergesort), maybe replace the > > > ArrayUtils.quickSort by ArrayUtils.mergeSort() and see if problem is > still > > there? > > > > > > Uwe > > > > > > - > > >

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
at (nearly) full speed and once > you hit the breakpoint, inspect the stack, variables, etc... > > Dawid > > On Fri, Apr 29, 2011 at 1:40 PM, Otis Gospodnetic < > otis_gospodne...@yahoo.com> wrote: > > > Hi, > > > > OK, so it looks like it's not

Reusing Query instances

2011-04-29 Thread Otis Gospodnetic
Hi, Is there any reason why one would *not* want to reuse Query instances? I'm using MemoryIndex with a fixed set of queries and I'm executing them all on each new document that comes in. Because each document needs to have many tens of thousands of queries executed against it, I thought I'd j

Thoughts on Search Analytics?

2011-05-01 Thread Otis Gospodnetic
Hi, I'd like to solicit your thoughts about Search Analytics if you are doing any sort of analysis/reporting of search logs or click stream or anything related. * Which information or reports do you find the most useful and why? * Which reports would you like to have, but don't have for whatever

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-02 Thread Otis Gospodnetic
Hi, I think this describes what's going on: 10 load N stored queries 20 parse N stored queries, keep them in some List forever 30 for each incoming document create a new MemoryIndex instance "mi" 40 for query 1 to N do mi.search(query) Over time this step 40 takes longer and longer and longer --

Re: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming yo
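To make the n-gramming suggestion concrete, here is a small sketch of what character n-grams of a term look like. It uses plain string handling rather than any Lucene tokenizer; the class and method names are invented for the example, and `n` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: sliding-window character n-grams of a single term.
final class NGramSketch {

    // Returns every substring of length n, left to right.
    // A term shorter than n yields an empty list.
    static List<String> nGrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(nGrams("lucene", 3));
    }
}
```

Two terms that differ by a typo still share most of their n-grams, which is why indexing n-grams can give a fuzzy-match effect with ordinary term queries.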

Re: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
k that just "n-grams" the docs/fields. > > class SimpleNGramAnalyzer extends Analyzer > { > @Override > public TokenStream tokenStream ( String fieldName, Reader reader ) > { >EdgeNGramTokenFilter... ??? > } > } > > > -Ursprüngliche Nachric

Re: AW: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
eld content) as it is... > > > -Original Message- > > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > > Sent: Tuesday, May 3, 2011 21:31 > > To: java-user@lucene.apache.org > > Subject: Re: AW: AW: "fuzzy prefix" search >

Re: AW: AW: AW: AW: "fuzzy prefix" search

2011-05-04 Thread Otis Gospodnetic
ne - Nutch Lucene > > > > ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > > > > > ----- Original Message > > > > > From: Clemens Wyss > > > > > To: "java-user@lucene.apache.org"

Re: How do I sort lucene search results by relevance and time?

2011-05-11 Thread Otis Gospodnetic
If only you were using Solr http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Johnbin Wang > To: java-user@l

Re: distributing the indexing process

2011-07-06 Thread Otis Gospodnetic
We've used Hadoop MapReduce with Solr to parallelize indexing for a customer and that brought their multi-hour indexing process down to a couple of minutes. There is/was also a Lucene-level contrib in Hadoop that makes use of MapReduce to parallelize indexing. Otis Sematext :: http://

Castle for Lucene/Solr?

2011-09-03 Thread Otis Gospodnetic
Hello, I saw mentions of something called "Castle" a while back, but only now looked at what it is, and it sounds like something that's potentially interesting/useful (performance-wise) for Lucene/Solr. See http://twitter.com/#!/otisg/status/109768673467699200 Has anyone tried it with Lucene/S

Hit search-lucene.com a little harder

2011-10-18 Thread Otis Gospodnetic
Hello folks, Do you ever use http://search-lucene.com (SL) or http://search-hadoop.com (SH)? If you do, I'd like to ask you for a small favour: We are at Lucene Eurocon in Barcelona and we are about to show the Search Analytics [1] and Performance Monitoring [2] tools/services we've built and t

Re: OutOfMemoryError

2011-10-18 Thread Otis Gospodnetic
Bok Tamara, You didn't say what -Xmx value you are using. Try a slightly higher value. Note that loading field values (and it looks like this one may be big because it is compressed) from a lot of hits is not recommended. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene e

Re: How can i search lucene java user list archive?

2011-10-20 Thread Otis Gospodnetic
Have a look at http://search-lucene.com/ where you can search Lucene mailing list archives (user, dev, common) its web site, wiki, source code, jira, etc. as well as the same types of data for Solr, Nutch, and so on. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene eco

Re: Lucene for Log file indexing and search

2013-09-20 Thread Otis Gospodnetic
Hi, Logstash is the piece that first touches your logs, filters them, and then outputs them somewhere. People often use it with ElasticSearch.  Once logs are in ES, they look at them with Kibana. Note: somebody should write a Logstash output for Solr! In Solr world there is Flume, which has a

MergePolicy for append-only indices?

2014-01-06 Thread Otis Gospodnetic
Hi, (cross-posting to both Solr and Lucene user lists because while this is a Lucene-level question, I suspect a lot of people who know about this or are interested in this subject are actually on the Solr list) I have a large append-only index and I looked at merge policies hoping to identify one

Re: MergePolicy for append-only indices?

2014-01-28 Thread Otis Gospodnetic
Thanks Mike(s) & Co. Added https://issues.apache.org/jira/browse/LUCENE-5419 Sounds like a killer feature :) Otis On Wed, Jan 8, 2014 at 4:17 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov > wrote: > > I think the key optimization

JOB @ Sematext: Professional Services Lead => Head

2014-02-18 Thread Otis Gospodnetic
Hello, We have what I think is a great opening at Sematext. Ideal candidate would be in New York, but that's not an absolute must. More info below + on http://sematext.com/about/jobs.html in job-ad-speak, but I'd be happy to describe what we are looking for, what we do, and what types of companie

BTRFS ?

2014-12-21 Thread Otis Gospodnetic
Hi, I spotted Uwe's comment in JIRA the other day "BTRFS, which might also bring some cool things for Lucene.". Has anyone tried Lucene (or Solr or Elasticsearch) with BTRFS and seen some (performance) benefits over ext3/4 or xfs for example? Thanks, Otis -- Monitoring * Alerting * Anomaly D

Re: FilteredQuery

2008-08-25 Thread Otis Gospodnetic
Heiko, It's most likely because that B case has a purely negative query. Perhaps you can combine it with MatchAllDocs query? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Heiko <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sen

Re: FilteredQuery

2008-08-25 Thread Otis Gospodnetic
} > return false; > } > > For this simplified call: > > public boolean next() { > return (id++ < maxId); > } > > This change doesn't validate deleted documents, in my implementation it was > not a problem, so, it's possible that this

Re: system design for big numbers

2008-08-26 Thread Otis Gospodnetic
Giovanni, You could try the approach you described - one index per user. When I built Simpy (see http://simpy.com ) a few years ago I chose the same approach and I never regretted it. The hardware behind Simpy is very modest, usage is high, and I never had problems with too many indices open

Re: Case Sensitivity

2008-08-26 Thread Otis Gospodnetic
Dino, you lost me half-way through your email :( NO_NORMS does not mean the field is not tokenized. UN_TOKENIZED does mean the field is not tokenized. Otis-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Dino Korah <[EMAIL PROTECTED]> > To: java

Re: Case Sensitivity

2008-08-26 Thread Otis Gospodnetic
Dino, If a field is not tokenized then it is indexed as is. For example: "Dino Korah" would get indexed just like that. It would not get split into multiple tokens, it would not be lowercased, it would not have any stop words removed from it, etc. Otis -- Sematext -- http://sematext.com/ -- Lu

Re: Case Sensitivity

2008-08-27 Thread Otis Gospodnetic
> Field.Index.UN_TOKENIZED plus field.setOmitNorms(true). > > Probably we should rename it to Field.Index.UN_TOKENiZED_NO_NORMS? > > Mike > > Otis Gospodnetic wrote: > > > Dino, you lost me half-way through your email :( > > > > NO_NORMS does not me

Re: Replicating Lucene Index with out SOLR

2008-08-27 Thread Otis Gospodnetic
Hi, You may want to ask on the java-user list (more subscribers), which I'm CC-ing, so we can continue discussion there. I think you will have to implement your own logic that runs on A and does something like this: - stop adding new docs - call commit on the IndexWriter - copy the index - res
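The "copy the index" step in that outline could be sketched as below. This is an assumption-laden illustration, not the Solr replication scripts: `IndexCopySketch` is a hypothetical helper, it assumes the index directory is a flat directory of files, and it assumes the caller has already paused indexing and committed so the files do not change during the copy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: copy all files of a quiescent index directory to a snapshot directory.
final class IndexCopySketch {

    // Copies every regular file in indexDir into snapshotDir (created if missing)
    // and returns the number of files copied.
    static int copyIndex(Path indexDir, Path snapshotDir) {
        int copied = 0;
        try {
            Files.createDirectories(snapshotDir);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
                for (Path file : files) {
                    if (Files.isRegularFile(file)) {
                        Files.copy(file, snapshotDir.resolve(file.getFileName()),
                                StandardCopyOption.REPLACE_EXISTING);
                        copied++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copied;
    }

    public static void main(String[] args) {
        if (args.length == 2) {
            int n = copyIndex(Paths.get(args[0]), Paths.get(args[1]));
            System.out.println(n + " files copied");
        }
    }
}
```

Copying the whole directory every time is the simplest variant; copying only files that are new since the last snapshot would be a refinement.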

Re: Lucene sample code and api documentation

2008-08-27 Thread Otis Gospodnetic
Sithu, Old emails: markmail.org Sample code: Lucene in Action has free downloadable code -- manning.com/hatcher2 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "Sudarsan, Sithu D." <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Se

Re: Replicating Lucene Index with out SOLR

2008-08-27 Thread Otis Gospodnetic
dex every certain amount of time on A. > > -copy the index > Copying the whole index everytime ? > > Currently i am investigating how i can make use of SOLR replication scripts > to achive this. > > > Is there anyone who did this with out SOLR before? > > > Tha

Re: Case Sensitivity

2008-08-28 Thread Otis Gospodnetic
So in other words, it *is* possible to have the field both tokenized and its norms omitted? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Karl Wettin <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Thursday, August 28, 200

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Otis Gospodnetic
> > > > The snapinstaller runs on the slave after a snapshot has > > been pulled from > > the master. This signals the local Solr server to open a > > new index reader, > > then auto-warming of the cache(s) begins (in the new > > reader), while ot

Re: Case Sensitivity

2008-08-28 Thread Otis Gospodnetic
e.org > Sent: Thursday, August 28, 2008 1:39:21 PM > Subject: Re: Case Sensitivity > > Otis Gospodnetic wrote: > > So in other words, it *is* possible to have the field both tokenized and > > its > norms omitted? > > Yes. Probably this is an unintended side-ef

Re: Confused with NGRAM results

2008-08-28 Thread Otis Gospodnetic
This actually sounds bugish to me, but you removed the text from your original email, so I don't know what context this was in. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: gaz77 <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Se

Re: C++ Bindings for Lucene?

2008-09-08 Thread Otis Gospodnetic
Joe, CLucene is slightly behind Java Lucene, but I believe CLucene developers are working on 2.3.2 port. I think that's the only C++ option. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Joseph Kovacic <[EMAIL PROTECTED]> > To: "java-us

Re: Terms with different boosts

2008-09-11 Thread Otis Gospodnetic
Guy, ulimit -n is your friend. As is the compound index format. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Guy Gavriely <[EMAIL PROTECTED]> > To: "java-user@lucene.apache.org" > Sent: Thursday, September 11, 2008 10:28:34 AM > Subjec

Re: IndexSearcher.search

2008-09-15 Thread Otis Gospodnetic
Hi, Check the Hits javadoc: * @deprecated Hits will be removed in Lucene 3.0. * Instead e. g. {@link TopDocCollector} and {@link TopDocs} can be used: * * TopDocCollector collector = new TopDocCollector(hitsPerPage); * searcher.search(query, collector); * Scor

Re: patching lucene-1314

2008-09-15 Thread Otis Gospodnetic
Yes, probably out of sync with the 2.3.2 code. Have you tried applying it to the trunk? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Cam Bazz <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Monday, September 15, 2008 11:14

Re: warming up searchers

2008-09-15 Thread Otis Gospodnetic
I don't think the "exists vs. doesn't exist" matters (but I should really try it and see) as much as using Sort vs. not using it, if you use sorting, because sorting requires FieldCache loading. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > Fr

Re: TopDocs question

2008-09-15 Thread Otis Gospodnetic
I think Daniel was suggesting you write your own HitCollector with its own "int hits" counter var. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Cam Bazz <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Monday, September 15,

Re: Using separate index for each user

2008-09-16 Thread Otis Gospodnetic
Tobias, That's the approach I took with Simpy.com and it's been working well for several years now. You'll have to keep track of searchers and close them when appropriate, of course. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Tobias

Re: Phrase Query

2008-09-16 Thread Otis Gospodnetic
Are the terms stopwords? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Cam Bazz <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Tuesday, September 16, 2008 1:33:48 AM > Subject: Phrase Query > > Hello, > > Lets say I have

Re: Exception while doing sorting

2008-09-17 Thread Otis Gospodnetic
If your index is increasing in size so fast, you should start thinking about sharding your index (breaking it into multiple smaller indices that each fits on its server) and searching across them (aka distributed search). Yes, Lucene can handle millions of records if run on adequate hardware and

Re: Case studies for Lucene in Action 2nd edition

2008-09-18 Thread Otis Gospodnetic
t.com/ -- Lucene - Solr - Nutch - Original Message > From: Otis Gospodnetic <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Tuesday, August 12, 2008 3:37:00 PM > Subject: Case studies for Lucene in Action 2nd edition > > Hello, > > We are work

Re: Rsync causing search timeouts on master

2008-09-23 Thread Otis Gospodnetic
Hi, Wrong list. :) I answered your question on solr-user. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: rahul_k123 <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Tuesday, September 23, 2008 11:00:02 PM > Subject: Rsync cau

Re: Getting all found document ids from a search result

2008-09-26 Thread Otis Gospodnetic
Gregor, You could loop through the results or collect them using a custom HitCollector. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Friday, September 26,
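A custom collector of the kind suggested here might look like the sketch below. To keep it compilable without Lucene on the classpath, it declares a local stand-in class, `HitCollectorStandIn`, that mirrors the `collect(int doc, float score)` callback of Lucene 2.x's `org.apache.lucene.search.HitCollector`; the stand-in and `AllDocIdsCollector` are names made up for this example. With the real class, the collector would be passed to `searcher.search(query, collector)`.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for Lucene 2.x's HitCollector, declared only so this
// sketch compiles on its own. It is not the real Lucene class.
abstract class HitCollectorStandIn {
    abstract void collect(int doc, float score);
}

// Hypothetical collector that remembers the id of every document it is handed.
final class AllDocIdsCollector extends HitCollectorStandIn {
    private final List<Integer> docIds = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        docIds.add(doc); // the score is ignored, only the id is kept
    }

    List<Integer> docIds() {
        return docIds;
    }

    // Total number of documents collected so far.
    int hits() {
        return docIds.size();
    }

    public static void main(String[] args) {
        AllDocIdsCollector collector = new AllDocIdsCollector();
        // Simulate the searcher calling back once per matching document.
        collector.collect(3, 0.9f);
        collector.collect(17, 0.4f);
        System.out.println(collector.hits() + " hits: " + collector.docIds());
    }
}
```

Since the callback runs once per matching document, the collector sees every match rather than only a top-N page, but the ids arrive in whatever order the searcher produces them, not sorted by score.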

Re: sharing SearchIndexer

2008-09-26 Thread Otis Gospodnetic
I think somebody provided a patch (might have been a whole new IndexReader impl?) many moons ago (2005?), but it never attracted enough interest to get committed. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Michael Wechner <

Re: Getting all found document ids from a search result

2008-09-29 Thread Otis Gospodnetic
bject: RE: Getting all found document ids from a search result > > Hi, > > Do I really get all results if I use a custom hitcollector? > This would be great :-) > > Regards, > Gregor > > > > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PR

Re: Extracting Dates

2008-10-03 Thread Otis Gospodnetic
David, this is not really a Lucene issue. Here is some Perl code that you could either use or rewrite in Java if you need it in Java: http://search.cpan.org/dist/Date-Extract/ Tika won't help with this, and I believe UIMA itself will not help either, although there may be components for date ex

Re: Performance of never optimizing

2008-11-02 Thread Otis Gospodnetic
Hello, Very quick comments. - Original Message > From: Justus Pendleton <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Sunday, November 2, 2008 10:42:52 PM > Subject: Performance of never optimizing > > Howdy, > > I have a couple of questions regarding some Lucene ben

Re: Feasibility question

2008-11-11 Thread Otis Gospodnetic
Yes, I think it is. I think the only catch will be those log timestamps, how fine you really need them to be, and if you want them very fine what happens when you do range queries on timestamps. If you have a pile of log files lying around, it should be pretty easy to get them indexed. You do
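One way to act on the timestamp-granularity concern is to round each log timestamp to the coarsest resolution the searches need before indexing it, so that a range query has fewer distinct values to cover. The sketch below assumes minute resolution and UTC; `TimestampRoundingSketch` and the term format are choices made for this example, not anything prescribed in the thread.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch: turn a millisecond timestamp into a minute-resolution term.
final class TimestampRoundingSketch {
    // Fixed-width pattern, so string order matches chronological order.
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC);

    // Drops seconds and milliseconds, then formats the result in UTC.
    static String toMinuteTerm(long epochMillis) {
        Instant truncated = Instant.ofEpochMilli(epochMillis).truncatedTo(ChronoUnit.MINUTES);
        return MINUTE.format(truncated);
    }

    public static void main(String[] args) {
        System.out.println(toMinuteTerm(System.currentTimeMillis()));
    }
}
```

All log lines from the same minute then map to the same term, at the cost of not being able to search at finer resolution on that field.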

Re: 1:n queries again

2008-11-12 Thread Otis Gospodnetic
Christian, If I understand your situation correctly, you should look at sloppy phrases and at Span family of queries. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Christian Reuschling <[EMAIL PROTECTED]> To: java-user@lucene.apache

Re: AW: Parsing MSWord

2008-11-12 Thread Otis Gospodnetic
Or Tika, Lucene's cousin: http://incubator.apache.org/tika/ (which uses POI under the hood, but goes beyond MS Word parsing) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: Donna L Gresh <[EMAIL PROTECTED]> To: java-user@lucene.apache.or

Re: About counting term hits

2008-11-13 Thread Otis Gospodnetic
Mario, Does this help: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/TermFreqVector.html Plus: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/IndexReader.html#method_summary (look for "getTerm.Freq...") Otis -- Se

Re: About counting term hits

2008-11-13 Thread Otis Gospodnetic
The more Documents you have to look at the slower it will be, but it may still be fast enough - it's impossible to tell without considering index size, query volume, hardware, number of hits/Docs, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ___

Re: Scoped Search and Facets generation using Lucene

2008-11-14 Thread Otis Gospodnetic
Hi Mayur, Solr has built-in support for facets. I don't understand what you mean by scoped searches. Could you please give a concrete example? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: "Bapat, Mayur" <[EMAIL PROTECTED]> To: ja

Re: I would want to know more about the lucene implementation in C++

2008-12-04 Thread Otis Gospodnetic
There is CLucene. It's not a part of Apache, but lives on SourceForge, I think. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Ariel <[EMAIL PROTECTED]> > To: lucene user > Sent: Tuesday, December 2, 2008 2:13:08 PM > Subject: I wou

Re: Slow queries with lots of hits

2008-12-04 Thread Otis Gospodnetic
Tim (and we should move this to java-dev if it gains traction), Perhaps you can come up with a mechanism to perform scoring in two passes instead of one: - first pass is cheap and fast - second pass is more expensive and slower Currently, there is no choice - Lucene does 2). But perhaps you can

Re: Design guidance - search strategy

2008-12-05 Thread Otis Gospodnetic
Yeah, I think we'll have to start paying the commission fee! ;) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Erick Erickson <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Friday, December 5, 2008 8:37:20 AM > Subject: Re:

Re: Inquiry on Lucene Stemming

2008-12-21 Thread Otis Gospodnetic
If Hoss is referring to synonym expansion, allow me to point out that freely downloadable code from Lucene in Action (first edition) has code for that, if you'd like to have a look, OP. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Chri

Re: Default and optimal use of RAMDirectory

2008-12-21 Thread Otis Gospodnetic
Let me add to that that I clearly recall having a hard time getting the tests for that particular section of LIA1 to clearly and consistently show that using the RAMDirectory buffering approach instead of vanilla IndexWriter yields faster indexing. Even back then IndexWriter buffered indexed da

Re: Url Analyzer

2008-12-21 Thread Otis Gospodnetic
Mark, This is simple enough that it should be easy to put together. If you search the ML archives you'll see that one of the common "tricks" is to "flip" host name parts (e.g. com.sematext.www). The details of this have been discussed before, so have a look. Otis -- Sematext -- http://semat
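The host-name "flip" mentioned in this reply is simple enough to sketch directly. The class and method names are invented for the example, and the input is assumed to be a bare host name with no scheme, port or path.

```java
// Hypothetical sketch: reverse the dot-separated parts of a host name.
final class HostFlipSketch {

    // "www.sematext.com" becomes "com.sematext.www".
    static String flipHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder flipped = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            flipped.append(parts[i]);
            if (i > 0) {
                flipped.append('.');
            }
        }
        return flipped.toString();
    }

    public static void main(String[] args) {
        System.out.println(flipHost("www.sematext.com"));
    }
}
```

One commonly cited reason for the trick is that all hosts under the same domain then start with the same characters, so they sort together and can be matched with a prefix.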

Re: lucene suiteable ? 6 mio recods / day 1k

2008-12-21 Thread Otis Gospodnetic
Christian, You can certainly purge old documents on a daily basis in order to keep the corpus from growing, but note that 3M*90=270M 2K docs may be a bit too much for a single index unless you really have lots of RAM or you don't need queries to be quick. In other words, you may have to spread

Re: lucene suiteable ? 6 mio recods / day 1k

2008-12-22 Thread Otis Gospodnetic
. > can you give me an idea what in your opinion would mean "don't need > queries to be quick" ... > i have no idea in what timeframe it could be handeled if it is not > completely in RAM. > > regards chris > > > > On Mon, Dec 22, 2008 at 4:41 AM, Oti
