BTRFS ?

2014-12-21 Thread Otis Gospodnetic
Hi, I spotted Uwe's comment in JIRA the other day "BTRFS, which might also bring some cool things for Lucene.". Has anyone tried Lucene (or Solr or Elasticsearch) with BTRFS and seen some (performance) benefits over ext3/4 or xfs for example? Thanks, Otis -- Monitoring * Alerting * Anomaly D

JOB @ Sematext: Professional Services Lead => Head

2014-02-18 Thread Otis Gospodnetic
Hello, We have what I think is a great opening at Sematext. Ideal candidate would be in New York, but that's not an absolute must. More info below + on http://sematext.com/about/jobs.html in job-ad-speak, but I'd be happy to describe what we are looking for, what we do, and what types of companie

Re: MergePolicy for append-only indices?

2014-01-28 Thread Otis Gospodnetic
Thanks Mike(s) & Co. Added https://issues.apache.org/jira/browse/LUCENE-5419 Sounds like a killer feature :) Otis On Wed, Jan 8, 2014 at 4:17 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov > wrote: > > I think the key optimization

MergePolicy for append-only indices?

2014-01-06 Thread Otis Gospodnetic
Hi, (cross-posting to both Solr and Lucene user lists because while this is a Lucene-level question, I suspect a lot of people who know about this or are interested in this subject are actually on the Solr list) I have a large append-only index and I looked at merge policies hoping to identify one

Re: Lucene for Log file indexing and search

2013-09-20 Thread Otis Gospodnetic
Hi, Logstash is the piece that first touches your logs, filters them, and then outputs them somewhere. People often use it with ElasticSearch.  Once logs are in ES, they look at them with Kibana. Note: somebody should write a Logstash output for Solr! In Solr world there is Flume, which has a

Re: Content based recommender using lucene/solr

2013-06-28 Thread Otis Gospodnetic
Hi, It doesn't have to be one or the other. In the past I've built a news recommender engine based on CF (Mahout) and combined it with a content similarity-based engine (it wasn't Solr/Lucene but something custom that worked with ngrams, though it may as well have been Lucene/Solr/ES). It worked well.

Re: Content based recommender using lucene/solr

2013-06-28 Thread Otis Gospodnetic
Hi, Have a look at http://www.youtube.com/watch?v=13yQbaW2V4Y . I'd say it's easier than Mahout, especially if you already have and know your way around Solr. Otis -- Solr & ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Jun 28, 2013 at

Document scoring order?

2013-04-03 Thread Otis Gospodnetic
Hi, When Lucene scores matching documents, what is the order in which documents are processed/scored and can that be changed? I'm guessing it scores matches in whichever order they are stored in the index/on disk, which means by increasing docIDs? I do see some out of order scoring is possible..

Re: Any benchmark corps to evaluate performance of specified query?

2013-01-17 Thread Otis Gospodnetic
Hi, Maybe https://github.com/sematext/ActionGenerator could be of help? We use it to produce query load for Solr and ElasticSearch and the whole thing is extensible, so you could easily add support for talking directly to Lucene. Oh, and there is the benchmark in Lucene:  http://lucene.apache.or

Poll: how to report # of docs in index over time

2012-02-13 Thread Otis Gospodnetic
Hello, Quick poll for those who have an opinion about what index size monitoring should report in terms of the number of documents in the index. Poll: http://blog.sematext.com/2012/02/13/poll-solr-index-size-monitoring/ For example, imagine that in some 5-minute time period (say 10:00 AM to 10:

Re: How can i search lucene java user list archive?

2011-10-20 Thread Otis Gospodnetic
Have a look at http://search-lucene.com/ where you can search Lucene mailing list archives (user, dev, common), its web site, wiki, source code, JIRA, etc., as well as the same types of data for Solr, Nutch, and so on. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene eco

Re: OutOfMemoryError

2011-10-18 Thread Otis Gospodnetic
Hi Tamara, You didn't say what -Xmx value you are using.  Try a slightly higher value.  Note that loading field values (and it looks like this one may be big because it is compressed) from a lot of hits is not recommended. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene e

Hit search-lucene.com a little harder

2011-10-18 Thread Otis Gospodnetic
Hello folks, Do you ever use http://search-lucene.com (SL) or http://search-hadoop.com (SH)? If you do, I'd like to ask you for a small favour: We are at Lucene Eurocon in Barcelona and we are about to show the Search Analytics [1] and Performance Monitoring [2] tools/services we've built and t

Castle for Lucene/Solr?

2011-09-03 Thread Otis Gospodnetic
Hello, I saw mentions of something called "Castle" a while back, but only now looked at what it is, and it sounds like something that's potentially interesting/useful (performance-wise) for Lucene/Solr. See http://twitter.com/#!/otisg/status/109768673467699200 Has anyone tried it with Lucene/S

Re: distributing the indexing process

2011-07-06 Thread Otis Gospodnetic
We've used Hadoop MapReduce with Solr to parallelize indexing for a customer and that brought their multi-hour indexing process down to a couple of minutes.  There is/was also Lucene-level contrib in Hadoop that makes use of MapReduce to parallelize indexing. Otis Sematext :: http://

Re: How do I sort lucene search results by relevance and time?

2011-05-11 Thread Otis Gospodnetic
If only you were using Solr http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Johnbin Wang > To: java-user@l

Re: AW: AW: AW: AW: "fuzzy prefix" search

2011-05-04 Thread Otis Gospodnetic
ne - Nutch Lucene > > > > ecosystem search :: http://search-lucene.com/ > > > > > > > > > > > > > > > > ----- Original Message > > > > > From: Clemens Wyss > > > > > To: "java-user@lucene.apache.org"

Re: AW: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
eld content) as it is... > > > -Original Message- > > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > > Sent: Tuesday, May 3, 2011 21:31 > > To: java-user@lucene.apache.org > > Subject: Re: AW: AW: "fuzzy prefix" search >

Re: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
k that just "n-grams" the docs/fields. > > class SimpleNGramAnalyzer extends Analyzer > { > @Override > public TokenStream tokenStream ( String fieldName, Reader reader ) > { >EdgeNGramTokenFilter... ??? > } > } > > > -Ursprüngliche Nachric

Re: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming yo
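For readers unfamiliar with the n-gramming idea raised in this thread, here is a minimal sketch of what "edge n-grams" of a term look like. It is plain Java written for illustration only: the class and method names are hypothetical, and it does not use or reproduce Lucene's EdgeNGramTokenFilter.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of edge n-grams; this is NOT Lucene's EdgeNGramTokenFilter. */
final class EdgeNGramSketch {

    /** Returns the prefixes of {@code term} whose length lies between minGram and maxGram. */
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int lower = Math.max(1, minGram);                 // ignore empty or negative sizes
        int upper = Math.min(maxGram, term.length());     // cannot be longer than the term
        for (int len = lower; len <= upper; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }
}
```

Indexing such prefixes alongside the full term is one way a prefix-style match can be turned into an ordinary term match; whether that fits the "fuzzy prefix" case discussed here depends on details the preview does not show.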

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-02 Thread Otis Gospodnetic
Hi, I think this describes what's going on: 10 load N stored queries 20 parse N stored queries, keep them in some List forever 30 for each incoming document create a new MemoryIndex instance "mi" 40 for query 1 to N do mi.search(query) Over time this step 40 takes longer and longer and longer --

Thoughts on Search Analytics?

2011-05-01 Thread Otis Gospodnetic
Hi, I'd like to solicit your thoughts about Search Analytics if you are doing any sort of analysis/reporting of search logs or click stream or anything related. * Which information or reports do you find the most useful and why? * Which reports would you like to have, but don't have for whatever

Reusing Query instances

2011-04-29 Thread Otis Gospodnetic
Hi, Is there any reason why one would *not* want to reuse Query instances? I'm using MemoryIndex with a fixed set of queries and I'm executing them all on each new document that comes in. Because each document needs to have many tens of thousands of queries executed against it, I thought I'd j

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
at (nearly) full speed and once > you hit the breakpoint, inspect the stack, variables, etc... > > Dawid > > On Fri, Apr 29, 2011 at 1:40 PM, Otis Gospodnetic < > otis_gospodne...@yahoo.com> wrote: > > > Hi, > > > > OK, so it looks like it's not

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
s and stack overflow. In Lucene 3.0 this used > > > stock java sort (which is mergesort), maybe replace the > > > ArrayUtils.quickSort by ArrayUtils.mergeSort() and see if problem is > still > > there? > > > > > > Uwe > > > > > > - > > >

Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-28 Thread Otis Gospodnetic
y ArrayUtils.mergeSort() > and see if problem is still there? > > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- > > From: Otis Gospodnetic [mailto:otis

SorterTemplate.quickSort causes StackOverflowError

2011-04-28 Thread Otis Gospodnetic
Hi, I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's exhibiting a strange behaviour - it slows down over time. The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand queries against it. The set of queries does not change - the same set of queries

Re: NRT consistency

2011-04-11 Thread Otis Gospodnetic
I think what's being described here is a lot like what I *think* ElasticSearch does, where there is no single master and index changes made to any node get propagated to N-1 other nodes (N=number of index replicas). I'm not sure how it deals with situations where "incompatible" index changes a

Re: Indexing Non-Textual Data

2011-04-06 Thread Otis Gospodnetic
Hi Chris, Yes, people have done classification with Lucene before. Have a look at http://search-lucene.com/?q=classifier&fc_project=Lucene for some discussions and actual code (in old JIRA issues) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: ht

Re: Detecting duplicates

2011-03-08 Thread Otis Gospodnetic
Mark, Keep in mind that there are actually multiple patches for this. SOLR-236 and SOLR-1086 IIRC. Also, I just noticed this is java-user@lucene. You may want to continue on solr-user@lucene. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http:/

Re: Backup or replication option with lucene

2011-03-02 Thread Otis Gospodnetic
Hi Ganesh, You could probably use replication scripts from Solr. But why not just use Solr? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Ganesh > To: java-user@lucene.apache.org > S

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Otis Gospodnetic
> [X] ASF Mirrors (linked in our release announcements or via the Lucene >website) > > [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) > > [X] I/we build them from source via an SVN/Git checkout. > > [] Other (someone in your company mirrors them internally or via a > d

Re: Best practices for multiple languages?

2011-01-18 Thread Otis Gospodnetic
Hi Clemens, If you will be searching individual languages, go with language-specific indices. Wunder likes to give an example of "die" in German vs. English. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Orig

Re: How about lucene's delete performance ?

2010-10-13 Thread Otis Gospodnetic
Hello, Of course, if you actually want a rolling last-7-days effect and not this week vs. previous week, then you could go with smaller indices, say daily ones. Then you'd always add new docs to the latest index and remove the oldest index completely every 24 hours. You could go hourly
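The daily-index scheme described in this message can be sketched roughly as follows. This is an illustration in plain Java, not Lucene code: the "index-YYYY-MM-DD" naming and all class and method names are assumptions made up for the example.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a rolling window of daily indices; names are hypothetical. */
final class RollingIndexSketch {

    /** Name of the (assumed) directory holding one day's documents, e.g. index-2010-10-13. */
    static String indexNameFor(LocalDate day) {
        return "index-" + day; // LocalDate.toString() yields ISO-8601 yyyy-MM-dd
    }

    /** Indices to search for a window covering the last {@code days} days, newest first. */
    static List<String> activeIndices(LocalDate today, int days) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < days; i++) {
            names.add(indexNameFor(today.minusDays(i)));
        }
        return names;
    }

    /** The index that has just fallen out of the window and can be removed as a whole. */
    static String expiredIndex(LocalDate today, int days) {
        return indexNameFor(today.minusDays(days));
    }
}
```

The point of the design is that expiring old data becomes a whole-index removal rather than many per-document deletes; new documents always go to the index named for the current day.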

Re: does lucene support Database full text search

2010-09-10 Thread Otis Gospodnetic
Hello, You can use LuSQL to index DB content into Lucene. Solr (the "Lucene Server") has DataImportHandler for indexing data from DBs: http://search-lucene.com/?q=dataimporthandler Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-luce

Re: Calculate Term Co-occurrence Matrix

2010-08-21 Thread Otis Gospodnetic
s.searchenginewatch.com/showthread.php?t=48>. > I hope to find some code that given a text corpus, generate all the words > pairs with their probability of occurring together. > > > On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic < > otis_gospodne...@yahoo.com> wro

Re: lucene indexing configuration

2010-08-20 Thread Otis Gospodnetic
Hi, Are you actually talking about Solr? Sounds like it. Check solr-u...@lucene list. Maybe you need to treat those words as protected words? See the protwords.txt file in the conf dir. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://se

Re: Calculate Term Co-occurrence Matrix

2010-08-20 Thread Otis Gospodnetic
There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs, and a few other things: http://sematext.com/products/key-phrase-extractor/index.html One of the demos that uses news data is at http://sematext.com/demo/kpe/index.html Otis Sematext :: http://sematext.com/ :: Solr - Lu

Re: Using categories with Lucene

2010-08-08 Thread Otis Gospodnetic
Hello Luan, I think you are looking for facets and faceted search. In short, it means storing the category for a document (web page) in a Document Field in the Lucene index. Then, at search time, you count how many matches were in which category. You can implement this yourself or you can use
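The "count how many matches were in which category" step can be sketched like this. It is plain Java for illustration only: it assumes the category value of each matching document has already been read from the index into a list, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of facet counting over the category values of matching documents. */
final class FacetCountSketch {

    /** Maps each category to the number of hits carrying it, sorted by category name. */
    static Map<String, Integer> countByCategory(List<String> categoriesOfHits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String category : categoriesOfHits) {
            counts.merge(category, 1, Integer::sum); // start at 1, then add 1 per further hit
        }
        return counts;
    }
}
```

A real implementation would usually avoid loading a stored field for every hit and would read category values from an in-memory structure instead; the sketch only shows the counting itself.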

Re: understanding lucene

2010-08-08 Thread Otis Gospodnetic
Manning, the Lucene in Action publisher, frequently offers 30-50% off on a number of their books, including LIA2. See http://twitter.com/ManningBooks Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message

Re: LUCENE-2456 (A Column-Oriented Cassandra-Based Lucene Directory)

2010-08-07 Thread Otis Gospodnetic
Utku, you should ask via comments on https://issues.apache.org/jira/browse/LUCENE-2453. What happened with Lucandra? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Utku Can Topçu > To

Re: Personal Intro and a question on "find top 10 similar items" functionality

2010-07-08 Thread Otis Gospodnetic
Igor, You can treat that question as the query and use it to search the index where you've indexed other questions. More Like This is another option. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message

Re: arguments in favour of lucene over commercial competition

2010-06-24 Thread Otis Gospodnetic
too, to show how it has improved in the last > versions (not that it was bad before) does anyone have a link to a nice page > with numbers/graphs ? On Thu, Jun 24, 2010 at 7:43 AM, Otis Gospodnetic <otis_gospodne...@yahoo.co

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
On Wed, Jun 23, 2010 at 11:41 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote: > Off the top of my head: > > FAST > > Endeca > Co

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
nd Lucene... And I > personally wouldn't count full text search solutions such as > Oracle's. Itamar. > -----Original Message- > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] >

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Off the top of my head: FAST Endeca Coveo Attivio Vivisimo Google Search Appliance (tell me when to stop) Dieselpoint IBM OmniFind Exalead Autonomy dtSearch ISYS Oracle ... ... Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com

Re: arguments in favour of lucene over commercial competition

2010-06-23 Thread Otis Gospodnetic
Lucene/Solr choice typically means: * lower cost of ownership (think about various crazy licensing models some of the commercial search vendors have: per doc, per server, per query, per year) * faster implementation (just think about the duration of the sales/negotiation phase for commerci

Re: Monitoring low level IO

2010-06-04 Thread Otis Gospodnetic
Ah, there is another one I came across several months back - http://wiki.sdn.sap.com/wiki/display/Java/JPicus. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Otis Gospodnetic &

Re: Monitoring low level IO

2010-06-03 Thread Otis Gospodnetic
Other than iostat, vmstat and such? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Jason Rutherglen > To: java-user@lucene.apache.org > Sent: Thu, June 3, 2010 2:13:17 PM > Subject: Mo

Re: numDeletedDocs()

2010-06-03 Thread Otis Gospodnetic
Btw. folks, http://search-lucene.com/ has a really handy source code search with auto-completion for Lucene, Solr, etc. For example, I typed in: numDel - and immediately found those methods. Use it. :) Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search

Re: is there any resources that explain detailed implementation of lucene?

2010-06-03 Thread Otis Gospodnetic
Li Li: Then best to go to the source. Here's one version with syntax highlighting and line numbers, should you have questions about specific parts of that class: http://search-lucene.com/c/Lucene:/src/java/org/apache/lucene/search/PhraseQuery.java Otis Sematext :: http://sematext.com/ ::

Re: Wich way would you recommend for successive-words similarity and scoring ?

2010-06-01 Thread Otis Gospodnetic
Hi Pablo, This question comes up every once in a while. You'll find some previous discussions and answers here: http://search-lucene.com/?q=terms+closer+together+score Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ -

Re: Using JSON for index input and search output

2010-05-31 Thread Otis Gospodnetic
VL, Solr (not Lucene, but you can embed Solr) has JsonUpdateRequestHandler, which lets you send docs to Solr for indexing in JSON (instead of the usual XML): http://search-lucene.com/c/Solr:/src/java/org/apache/solr/handler/JsonUpdateRequestHandler.java And you can get Solr to respond with JSON

Re: Is Lucene a "document oriented database"?

2010-05-31 Thread Otis Gospodnetic
I think those doc-oriented DBs tend to be distributed, with replication built-in and such, but yes, in some way the schemaless DB with docs and fields (whether they are pumped in as JSON or XML or Java objects) feels the same. I saw something from Grant about 2 months ago how Lucene is "nosql-i

Re: Grouping or de-duping

2010-05-31 Thread Otis Gospodnetic
Pasa, Maybe Field Collapsing (Solr) can help? See SOLR-236 in JIRA http://search-lucene.com/?q=field+collapsing&fc_project=Lucene&fc_project=Solr Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message --

Re: Filter vs. TermQuery performance

2010-05-09 Thread Otis Gospodnetic
I think others will have more thoughts on this, esp. for Numeric* questions... but I'll try answering... - Original Message > From: Tomislav Poljak > To: java-user@lucene.apache.org > Sent: Fri, May 7, 2010 2:34:46 PM > Subject: Filter vs. TermQuery performance > > Hi, > when is it w

Re: TermsFilter instead of "should" TermQueries

2010-05-09 Thread Otis Gospodnetic
I think what Tomislav was trying to ask is: Can filters replace only strictly boolean clauses (i.e. only MUST and MUST_NOT), such as: +gender:F, -rating:xxx? Or can filters also replace SHOULD clauses, such as: food:banana (which is neither absolutely required nor strictly prohibited)? Otis --

Lucandra - Lucene/Solr on Cassandra: April 26, NYC

2010-04-22 Thread Otis Gospodnetic
Hello folks, Those of you in or near NYC and using Lucene or Solr should come to "Lucandra - a Cassandra-based backend for Lucene and Solr" on April 26th: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/ The presenter will be Lucandra's author, Jake Luciani. Please spread the

Re: Range Query Assistance

2010-04-21 Thread Otis Gospodnetic
Joseph, If you can, get the latest Lucene and use NumericField to index your dates with appropriate precision and then use NumericRangeQueries when searching. This will be faster than searching for string dates in a given range. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nut
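To illustrate what "appropriate precision" can mean for dates, here is a small plain-Java sketch that reduces a millisecond timestamp to whole days (UTC) before it would be indexed as a number, and checks a numeric range. It does not call NumericField or NumericRangeQuery; the class and method names are hypothetical.

```java
/** Sketch: reduce timestamps to day precision and compare them numerically. */
final class DatePrecisionSketch {

    private static final long MILLIS_PER_DAY = 86_400_000L;

    /** Days since 1970-01-01 (UTC) for the given epoch milliseconds; floors negative values. */
    static long toEpochDay(long epochMillis) {
        return Math.floorDiv(epochMillis, MILLIS_PER_DAY);
    }

    /** Inclusive numeric range check, the comparison a numeric range query performs logically. */
    static boolean inRange(long value, long min, long max) {
        return min <= value && value <= max;
    }
}
```

If searches only ever filter by day, indexing the day number rather than the full millisecond value means far fewer distinct values in the index, which is the kind of saving the advice above is pointing at.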

Re: NumericField indexing performance

2010-04-15 Thread Otis Gospodnetic
Hi, I actually don't follow your change, because after "but changing it to" line the only different thing I see is the doc.add(dateField) call, which you didn't list before "but changing it to". Also, if I understood Uwe correctly, he was suggesting reusing NumericField instances, which means

Slides from Finite-State Queries, Flexible Indexing, Scoring talk

2010-03-25 Thread Otis Gospodnetic
Hello everyone, Robert Muir gave a great presentation on a few advanced Lucene topics last night and even found time to send this presentation to me, which I just uploaded: http://www.slideshare.net/otisg/finite-state-queries-in-lucene You'll find all other presentations from the NYC Search

Re: Searching Subversion comments:

2010-03-08 Thread Otis Gospodnetic
Hi Erick, For what it's worth, we are considering indexing JIRA comments over on http://search-lucene.com/ , though I'm not entirely convinced searching in comments would be super valuable. Would it? But note that JIRA (and LucidFind) already do that. For example, go to http://issues.apache.

Re: OutOfMemoryError

2010-03-05 Thread Otis Gospodnetic
Maybe it's not a leak, Monique. :) If you use sorting in Lucene, then the FieldCache object will keep some data permanently in memory, for example. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Message -

Re: SpanQueries in Luke

2010-03-04 Thread Otis Gospodnetic
Andrzej, Does that mean the regular Lucene QP will get Span query syntax support (vs. having it in that separate Surround QP)? Or maybe that already happened and I missed it? :) Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://searc

Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more

2010-03-03 Thread Otis Gospodnetic
Hello folks, Those of you in or near New York and using Lucene or Solr should come to "Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more" on March 24th: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12720960/ The presenter will be the hyper active Lucene committer R

Re: "one of the terms"

2010-01-29 Thread Otis Gospodnetic
Paul, Custom Similarity perhaps, oui. Not 100% sure, maybe have this always return 1.0f. /** Computes a score factor based on the fraction of all query terms that a * document contains. This value is multiplied into scores. * * The presence of a large portion of the query terms ind
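The effect of making the coordination factor always return 1.0f can be seen in a small sketch. It is plain Java, not a Similarity subclass; the class and method names are hypothetical, and the "default" formula shown (overlap divided by maxOverlap) is stated from memory of Lucene's DefaultSimilarity of that era rather than taken from this message.

```java
/** Sketch contrasting a fraction-based coord factor with a constant one. */
final class CoordSketch {

    /** Fraction of query terms the document matched (assumed default behaviour). */
    static float defaultCoord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Constant factor: a document matching one term is not penalised versus one matching all. */
    static float flatCoord(int overlap, int maxOverlap) {
        return 1.0f;
    }
}
```

With the fraction-based version, a document matching 1 of 4 query terms has its score multiplied by 0.25; with the constant version the multiplier is 1.0 regardless of how many terms matched.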

Re: Email Filter using Lucene 3.0

2010-01-29 Thread Otis Gospodnetic
Hi Jamie, Could you say more about how it's not working? Not compiling? Run-time exceptions? Doesn't work as expected after you run a unit test for it? Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Hadoop ecosystem search :: http://search-hadoop.com/ - Original Mes

Re: index demo throws LockObtainFailedException

2010-01-28 Thread Otis Gospodnetic
Fedora Core 4 is *ancient*! :) Could it be that the NFS client on it is old, and this is causing problems? I remember emails about NFS 3 vs. NFS 4 and some improvements in the latter. I don't recall the details and tend to keep my Lucene and Solr instances away from NFS mounts. Otis Sema

Re: Proximity of More than Single Words?

2010-01-21 Thread Otis Gospodnetic
Yes, that's just a phrase slop, allowing for variable gaps between words. I *believe* the Surround QP that works with Span family of queries does handle what you are looking for. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: T. R. Halvor

Re: Can you boost multiple terms using brackets ?

2010-01-20 Thread Otis Gospodnetic
Yes, I believe it is the same. I bet the Explain explanation would help confirm this. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Paul Taylor > To: java-user@lucene.apache.org > Sent: Wed, January 20, 2010 1:03:14 PM > Subject: Can yo

Re: Lucene as a primary datastore

2010-01-20 Thread Otis Gospodnetic
Guido, No, you should absolutely not need to constantly rebuild the index. If you find you have to do that, you'll know you are doing something wrong. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Guido Bartolucci > To: java-user@lucen

Re: Lucene as a primary datastore

2010-01-19 Thread Otis Gospodnetic
t know how to modify/use the solr script. > > Regards > Ganesh > > > - Original Message - > From: "Otis Gospodnetic" > To: ; > Sent: Wednesday, January 20, 2010 10:45 AM > Subject: Re: Lucene as a primary datastore > > > > You are not al

Re: Lucene as a primary datastore

2010-01-19 Thread Otis Gospodnetic
You are not alone, Guido. It's a good question. In my experience, Lucene is as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not corrupt it. Of course, even with the most expensive databases, you'd want to make backups. The same goes with Lucene. Nowadays, one way

Re: A way to download URLs and index better ?

2010-01-16 Thread Otis Gospodnetic
Hello, Use Droids, it's much simpler than Nutch or Heritrix: http://incubator.apache.org/droids/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Phan The Dai > To: java-user@lucene.apache.org > Sent: Sat, January 16, 2010 2:20:47 AM > Sub

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Otis Gospodnetic
I think Jason meant "15-20GB segments"? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch From: Jason Rutherglen To: java-user@lucene.apache.org Sent: Wed, January 13, 2010 5:54:38 PM Subject: Re: Max Segmentation Size when Optimizing Index Ye

Re: lucene index file randomly crash and need to reindex

2010-01-12 Thread Otis Gospodnetic
Hi, Use the latest version of Lucene, obey Lucene's locks, write with 1 IndexWriter, avoid NFS... Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: zhang99 > To: java-user@lucene.apache.org > Sent: Tue, January 12, 2010 10:41:19 PM > Subje

Re: how to follow intranet: configuration in nutch website

2010-01-12 Thread Otis Gospodnetic
Zhou, Your question will get more attention if you send it to nutch-u...@lucene.apache.org list instead. This list is for Lucene Java. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: "jyzhou...@yahoo.com" > To: java-user@lucene.apache.o

NYC Search in the Cloud meetup: Jan 20

2010-01-12 Thread Otis Gospodnetic
Hello, If "Search Engine Integration, Deployment and Scaling in the Cloud" sounds interesting to you, and you are going to be in or near New York next Wednesday (Jan 20) evening: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/ Sorry for dupes to those of you subscribed to mul

Re: a complete solution for building a website search with lucene

2010-01-08 Thread Otis Gospodnetic
Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs that the JVM supports. But it does contain some shell scripts, as does Hadoop that Nutch uses. Oh, I guess Windows people run it under Cygwin? Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

Re: Implementing filtering based on multiple fields

2010-01-07 Thread Otis Gospodnetic
> Also, what did you mean about isolating users and their data/indices. Did > you mean that I should create a separate index per user? > > Thanks again! > > On Fri, Jan 8, 2010 at 12:35 AM, Otis Gospodnetic < > otis_gospodne...@yahoo.com> wrote: > > > For something li

Re: Implementing filtering based on multiple fields

2010-01-07 Thread Otis Gospodnetic
For something like CSE, I think you want to isolate users and their data/indices. I'd look at Bixo or Nutch or Droids ==> Lucene or Solr Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Yaniv Ben Yosef > To: java-user@lucene.apache.org > S

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Otis Gospodnetic
o limit the size of an index? > > On Thu, Jan 7, 2010 at 2:23 PM, Otis Gospodnetic > wrote: > >> Merge factor controls how many segments are merged at once. The default > >> is > 10. > >> > >> The maxMergeMB setting sets the max size for a given seg

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Otis Gospodnetic
> Merge factor controls how many segments are merged at once. The default is > 10. > > The maxMergeMB setting sets the max size for a given segment to be > included in a merge. I wonder if renaming that to maxSegSizeMergeMB would make it more obvious what this does? Otis -- Sematext -- http:/

Re: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Otis Gospodnetic
/is completed successfully and, as you say, > there is only one segment in the directory. > > Some other ideas? > > Thanks, > Yuliya > > > -Original Message- > > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > > Sent: Donner

Re: Performance Results on changing the way fields are stored

2010-01-07 Thread Otis Gospodnetic
You could try Avro instead of JSON/XML/Java Serialization. It's compact (and new). http://hadoop.apache.org/avro/ Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Paul Taylor > To: java-user@lucene.apache.org > Sent: Tue, January 5, 2010

Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

2010-01-07 Thread Otis Gospodnetic
Yuliya, The index *directory* will be larger *while* you are optimizing. After the optimization is completed successfully, the index directory will be smaller. It is possible that your index directory is large(r) because you have some left-over segments (e.g. from some earlier failed/interrup

Re: NGramTokenizer stops working after about 1000 terms

2010-01-03 Thread Otis Gospodnetic
This actually rings a bell for me... have a look at Lucene's JIRA, I think this was reported as a bug once and perhaps has been fixed. Note that Lucene in Action 2 has a case study that talks about searching source code. You may find that study interesting. Otis -- Sematext -- http://sematext

Re: Getting score of explicit documents for a query

2009-12-03 Thread Otis Gospodnetic
I think you should be able to use 1+ FilteredQuery (with IDs of your docs) with your main query and thus get the scores only for docs that interest you. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Erdinc Yilmazel > To: java-user@lucen

Re: Snowball Stemmer Question

2009-12-03 Thread Otis Gospodnetic
Chris, You could look at KStem to see if that does a better job. Or perhaps WordNet can be used to get the lemma of those terms instead of using stemming. Finally what was I going to say... ah, yes, using synonyms may be another way this can be handled. Otis -- Sematext -- http://sematext.c

Re: Need help regarding implementation of autosuggest using jquery

2009-12-01 Thread Otis Gospodnetic
Hi, Have a look at http://www.sematext.com/products/autocomplete/index.html It handles Chinese and large volumes of data. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: fulin tang > To: java-user@lucene.apache.org > Sent: Thu, November

NYC Search & Discovery Meetup

2009-12-01 Thread Otis Gospodnetic
Hello, For those living in or near NYC, you may be interested in joining (and/or presenting?) at the NYC Search & Discovery Meetup. Topics are: search, machine learning, data mining, NLP, information gathering, information extraction, etc. http://www.meetup.com/NYC-Search-and-Discovery/ Our

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-24 Thread Otis Gospodnetic
For what it's worth, AOL uses a Solr cluster to handle searches for @aol users. Each user has his own index. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From: fulin tang > To

Re: lucene not returning correct results eventhough search query is present

2009-11-18 Thread Otis Gospodnetic
Hi, Please use java-user list for user questions. Are you sure the file got fully indexed in the first place? Use Luke to check. Also, see: IndexWriter.MaxFieldLength Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NE
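
To illustrate why IndexWriter.MaxFieldLength matters here: when a field length limit is in effect, terms beyond the limit are never indexed, so a query for text near the end of a long file finds nothing even though the text is in the source. The sketch below is a plain-Java simulation, not Lucene code; the class name, method name, and whitespace tokenization are assumptions made for the illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative simulation only; this does not call Lucene.
final class FieldLengthLimitDemo {

    // Returns the set of terms that would be searchable if only the first
    // maxFieldLength whitespace-separated tokens of the text were indexed.
    static Set<String> indexedTerms(String text, int maxFieldLength) {
        String[] tokens = text.trim().split("\\s+");
        int kept = Math.min(tokens.length, maxFieldLength);
        return new HashSet<>(Arrays.asList(tokens).subList(0, kept));
    }

    public static void main(String[] args) {
        Set<String> terms = indexedTerms("alpha beta gamma", 2);
        // "gamma" is past the limit, so a search for it would miss the document.
        System.out.println("beta indexed:  " + terms.contains("beta"));
        System.out.println("gamma indexed: " + terms.contains("gamma"));
    }
}
```

Checking the document in Luke, as suggested above, shows which terms actually made it into the index.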

Re: Lucene Java 3.0.0 RC1 now available for testing

2009-11-17 Thread Otis Gospodnetic
Well, I think some people will be for hiding complexity, while others will be for being in control and having transparency. Think how surprised one would be to find 1 extra field in his index, say when looking at their index with Luke. :) Otis -- Sematext is hiring -- http://sematext.com/about

Re: Why Lucene takes longer time for the first query and less for subsequent ones

2009-11-17 Thread Otis Gospodnetic
Hello, Most likely due to the operating system caching the relevant portions of the index after the first set of queries. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From: Din

Re: Prefix Query for autocomplete - TooManyClauses

2009-11-13 Thread Otis Gospodnetic
Hello, Also keep in mind prefix queries are not the cheapest. Plug: We've seen people use this successfully: http://www.sematext.com/products/autocomplete/index.html I believe somebody is trying this out with a set of 1B suggestions. The demo at http://www.sematext.com/demo/ac/index.html search

Re: OutofMemory in large index

2009-11-13 Thread Otis Gospodnetic
Hello, Comments inlined. - Original Message > From: vsevel > To: java-user@lucene.apache.org > Sent: Fri, November 13, 2009 11:32:02 AM > Subject: Re: OutofMemory in large index > > > Hi, I am jumping into the thread because I have got a similar issue. > My index is 30Gb large and

Re: Lucene index write performance optimization

2009-11-10 Thread Otis Gospodnetic
This is what we have in Lucene in Action 2: ~/lia2$ ff \*Thread\*java ./src/lia/admin/CreateThreadedIndexTask.java ./src/lia/admin/ThreadedIndexWriter.java Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR -

Re: Filtering query results based on relevance/acuracy

2009-09-22 Thread Otis Gospodnetic
Alex, If I understand you correctly, all you have to do is either make sure that query is run as a phrase query (with quotes around it), or as a term query where both terms are required (with a plus sign in front of each term, no space). As for detecting score gap and such, you could do that
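
The two query forms described above can be sketched as query strings. This is a minimal plain-Java sketch that only builds the strings; the class and method names are hypothetical, and it does not escape query-parser special characters, which real user input would need.

```java
// Illustrative helper only; names are hypothetical and nothing here calls Lucene.
final class QuerySyntaxDemo {

    // Wraps the input in double quotes so it is parsed as a phrase query.
    // Embedded double quotes are dropped to keep the phrase well-formed.
    static String asPhrase(String userInput) {
        return "\"" + userInput.trim().replace("\"", "") + "\"";
    }

    // Prefixes every term with '+' (no space after the plus) so each term is required.
    static String asRequiredTerms(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("new york"));
        System.out.println(asRequiredTerms("new york"));
    }
}
```

The phrase form also requires the terms to be adjacent and in order, while the required-terms form only requires that both occur somewhere in the field.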

Re: Taking too much time in optimization

2009-08-10 Thread Otis Gospodnetic
Hi, That mergeFactor is too high. I suggest going back to the default (10). maxBufferedDocs is an old and not very accurate setting (imagine what happens with the JVM heap if your indexer hits a SUPER LARGE document). Use setRAMBufferSizeMB instead. Otis -- Sematext is hiring -- http://sematext.c
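
A rough way to see the point about maxBufferedDocs: flushing by document count puts no bound on how many bytes sit in the buffer, while flushing by RAM usage keeps the buffer near the budget (plus at most one document). The simulation below is a plain-Java sketch under that simplified model; it is not Lucene code, and the class name, method names, and document sizes are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation only; this does not model Lucene's real memory accounting.
final class FlushPolicyDemo {

    // Peak bytes buffered when the buffer is flushed after every maxDocs documents.
    static long peakWithDocCountFlush(List<Long> docSizes, int maxDocs) {
        long buffered = 0;
        long peak = 0;
        int count = 0;
        for (long size : docSizes) {
            buffered += size;
            count++;
            peak = Math.max(peak, buffered);
            if (count >= maxDocs) {
                buffered = 0;
                count = 0;
            }
        }
        return peak;
    }

    // Peak bytes buffered when the buffer is flushed once it reaches ramBudget bytes.
    static long peakWithRamFlush(List<Long> docSizes, long ramBudget) {
        long buffered = 0;
        long peak = 0;
        for (long size : docSizes) {
            buffered += size;
            peak = Math.max(peak, buffered);
            if (buffered >= ramBudget) {
                buffered = 0;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 100 small documents followed by 50 large ones.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 100; i++) sizes.add(1_000L);
        for (int i = 0; i < 50; i++) sizes.add(10_000_000L);

        System.out.println("doc-count flush peak: " + peakWithDocCountFlush(sizes, 1000));
        System.out.println("RAM-based flush peak: " + peakWithRamFlush(sizes, 16_000_000L));
    }
}
```

In this model the document-count policy never flushes for the sample workload, so the peak equals the total size of all documents, whereas the RAM-based policy stays within the budget plus one document.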
