Hi,
I spotted Uwe's comment in JIRA the other day: "BTRFS, which might also
bring some cool things for Lucene."
Has anyone tried Lucene (or Solr or Elasticsearch) with BTRFS and seen some
(performance) benefits over ext3/4 or xfs for example?
Thanks,
Otis
--
Monitoring * Alerting * Anomaly D
Hello,
We have what I think is a great opening at Sematext. Ideal candidate would
be in New York, but that's not an absolute must. More info below + on
http://sematext.com/about/jobs.html in job-ad-speak, but I'd be happy to
describe what we are looking for, what we do, and what types of companie
Thanks Mike(s) & Co.
Added https://issues.apache.org/jira/browse/LUCENE-5419
Sounds like a killer feature :)
Otis
On Wed, Jan 8, 2014 at 4:17 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Mon, Jan 6, 2014 at 3:42 PM, Michael Sokolov
> wrote:
> > I think the key optimization
Hi,
(cross-posting to both Solr and Lucene user lists because while this is a
Lucene-level question, I suspect a lot of people who know about this or are
interested in this subject are actually on the Solr list)
I have a large append-only index and I looked at merge policies hoping to
identify one
Hi,
Logstash is the piece that first touches your logs, filters them, and then
outputs them somewhere.
People often use it with ElasticSearch. Once logs are in ES, they look at them
with Kibana.
Note: somebody should write a Logstash output for Solr!
In the Solr world there is Flume, which has a
Hi,
It doesn't have to be one or the other. In the past I've built a news
recommender engine based on CF (Mahout) and combined it with a content
similarity-based engine (it wasn't Solr/Lucene, but something custom that
worked with n-grams, though it might as well have been Lucene/Solr/ES). It
worked well.
Hi,
Have a look at http://www.youtube.com/watch?v=13yQbaW2V4Y . I'd say
it's easier than Mahout, especially if you already have and know your
way around Solr.
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
On Fri, Jun 28, 2013 at
Hi,
When Lucene scores matching documents, what is the order in which
documents are processed/scored and can that be changed? I'm guessing
it scores matches in whichever order they are stored in the index/on
disk, which means by increasing docIDs?
I do see that some out-of-order scoring is possible...
Hi,
Maybe https://github.com/sematext/ActionGenerator could be of help?
We use it to produce query load for Solr and ElasticSearch and the whole thing
is extensible, so you could easily add support for talking directly to Lucene.
Oh, and there is the benchmark in Lucene:
http://lucene.apache.or
Hello,
Quick poll for those who have an opinion about what index size monitoring
should report in terms of the number of documents in the index.
Poll: http://blog.sematext.com/2012/02/13/poll-solr-index-size-monitoring/
For example, imagine that in some 5-minute time period (say 10:00 AM to 10:
Have a look at http://search-lucene.com/ where you can search Lucene mailing
list archives (user, dev, common), its web site, wiki, source code, JIRA, etc.,
as well as the same types of data for Solr, Nutch, and so on.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene eco
Hi Tamara,
You didn't say what -Xmx value you are using. Try a slightly higher value. Note
that loading field values (and it looks like this one may be big because it is
compressed) from a lot of hits is not recommended.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene e
Hello folks,
Do you ever use http://search-lucene.com (SL) or http://search-hadoop.com (SH)?
If you do, I'd like to ask you for a small favour:
We are at Lucene Eurocon in Barcelona and we are about to show the Search
Analytics [1] and Performance Monitoring [2] tools/services we've built and
t
Hello,
I saw mentions of something called "Caste" a while back, but only now looked at
what it is, and it sounds like something that's potentially interesting/useful
(performance-wise) for Lucene/Solr.
See http://twitter.com/#!/otisg/status/109768673467699200
Has anyone tried it with Lucene/S
We've used Hadoop MapReduce with Solr to parallelize indexing for a customer
and that brought their multi-hour indexing process down to a couple of
minutes. There is/was also Lucene-level contrib in Hadoop that makes use of
MapReduce to parallelize indexing.
Otis
Sematext :: http://
If only you were using Solr
http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
> From: Johnbin Wang
> To: java-user@l
ne - Nutch Lucene
> > > > ecosystem search :: http://search-lucene.com/
> > > >
> > > >
> > > >
> > > > ----- Original Message
> > > > > From: Clemens Wyss
> > > > > To: "java-user@lucene.apache.org"
eld content) as it is...
>
> > -Original Message-
> > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> > Sent: Tuesday, May 3, 2011 21:31
> > To: java-user@lucene.apache.org
> > Subject: Re: AW: AW: "fuzzy prefix" search
> >
k that just "n-grams" the docs/fields.
>
> class SimpleNGramAnalyzer extends Analyzer
> {
>   @Override
>   public TokenStream tokenStream ( String fieldName, Reader reader )
>   {
>     EdgeNGramTokenFilter... ???
>   }
> }
>
> > -Original Message
Hi,
I didn't read this thread closely, but just in case:
* Is this something you can handle with synonyms?
* If this is for English and you are trying to handle typos, there is a list of
common English misspellings out there that you could use for this perhaps.
* Have you considered n-gramming yo
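On the n-gramming idea in the last bullet, here is a minimal plain-Java sketch of what character n-grams look like and why they tolerate small typos. It is only an illustration, not Lucene's own n-gram tokenizers or filters; the class and method names (NGramSketch, nGrams, sharedGrams) are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: plain-Java character n-grams, not a Lucene analyzer. */
final class NGramSketch {

    /** Returns all substrings of length n, in order of their start position. */
    static List<String> nGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Counts the distinct n-grams that two strings have in common. */
    static int sharedGrams(String a, String b, int n) {
        Set<String> shared = new HashSet<String>(nGrams(a, n));
        shared.retainAll(nGrams(b, n));
        return shared.size();
    }

    public static void main(String[] args) {
        // A correctly spelled term and a typo still share several bigrams.
        System.out.println(nGrams("lucene", 2));
        System.out.println(sharedGrams("lucene", "lucnee", 2));
    }
}
```

If both the indexed text and the query are broken into grams like this, a misspelled query term can still match on the grams it shares with the correct spelling.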
Hi,
I think this describes what's going on:
10 load N stored queries
20 parse N stored queries, keep them in some List forever
30 for each incoming document create a new MemoryIndex instance "mi"
40 for query 1 to N do mi.search(query)
Over time this step 40 takes longer and longer and longer --
Hi,
I'd like to solicit your thoughts about Search Analytics if you are doing any
sort of analysis/reporting of search logs or click stream or anything related.
* Which information or reports do you find the most useful and why?
* Which reports would you like to have, but don't have for whatever
Hi,
Is there any reason why one would *not* want to reuse Query instances?
I'm using MemoryIndex with a fixed set of queries and I'm executing them all on
each new document that comes in. Because each document needs to have many tens
of thousands of queries executed against it, I thought I'd j
at (nearly) full speed and once
> you hit the breakpoint, inspect the stack, variables, etc...
>
> Dawid
>
> On Fri, Apr 29, 2011 at 1:40 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
> > Hi,
> >
> > OK, so it looks like it's not
s and stack overflow. In Lucene 3.0 this used
> > > stock java sort (which is mergesort), maybe replace the
> > > ArrayUtils.quickSort by ArrayUtils.mergeSort() and see if problem is
> still
> > there?
> > >
> > > Uwe
> > >
> > > -
> > >
y ArrayUtils.mergeSort()
> and see if problem is still there?
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: Otis Gospodnetic [mailto:otis
Hi,
I'm looking at some code that uses MemoryIndex (Lucene 3.1) and that's
exhibiting a strange behaviour - it slows down over time.
The MemoryIndex contains 1 doc, of course, and executes a set of a few thousand
queries against it. The set of queries does not change - the same set of
queries
I think what's being described here is a lot like what I *think* ElasticSearch
does, where there is no single master and index changes made to any node get
propagated to N-1 other nodes (N=number of index replicas). I'm not sure how it
deals with situations where "incompatible" index changes a
Hi Chris,
Yes, people have done classification with Lucene before. Have a look at
http://search-lucene.com/?q=classifier&fc_project=Lucene for some discussions
and actual code (in old JIRA issues)
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: ht
Mark,
Keep in mind that there are actually multiple patches for this. SOLR-236 and
SOLR-1086 IIRC.
Also, I just noticed this is java-user@lucene. You may want to continue on
solr-user@lucene.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http:/
Hi Ganesh,
You could probably use replication scripts from Solr.
But why not just use Solr?
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
> From: Ganesh
> To: java-user@lucene.apache.org
> S
> [X] ASF Mirrors (linked in our release announcements or via the Lucene
>website)
>
> [X] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
>
> [X] I/we build them from source via an SVN/Git checkout.
>
> [] Other (someone in your company mirrors them internally or via a
> d
Hi Clemens,
If you will be searching individual languages, go with language-specific
indices. Wunder likes to give an example of "die" in German vs. English. :)
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Orig
Hello,
Of course, if you actually want a rolling last-7-days effect rather than this
week vs. previous week, then you could go with smaller indices, say daily ones.
Then you'd always add new docs to the latest index and remove the oldest index
completely every 24 hours.
You could go hourly
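A rough sketch of the daily-index bookkeeping described above, in plain Java. The "index-yyyy-MM-dd" naming scheme and all class and method names here are assumptions made up for illustration; nothing in it touches Lucene itself, it only works out which per-day indices fall inside the window and which one has just dropped out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: name bookkeeping for a rolling window of daily indices. */
final class RollingDailyIndices {

    /** Hypothetical naming scheme: one index per day, e.g. "index-2010-08-21". */
    static String indexNameFor(LocalDate day) {
        // LocalDate.toString() produces the ISO-8601 form yyyy-MM-dd.
        return "index-" + day;
    }

    /** Names of the daily indices inside the window, newest first. */
    static List<String> indicesToSearch(LocalDate today, int windowDays) {
        List<String> names = new ArrayList<String>();
        for (int i = 0; i < windowDays; i++) {
            names.add(indexNameFor(today.minusDays(i)));
        }
        return names;
    }

    /** The index that has just fallen out of the window and can be removed whole. */
    static String indexToDrop(LocalDate today, int windowDays) {
        return indexNameFor(today.minusDays(windowDays));
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        System.out.println("search: " + indicesToSearch(today, 7));
        System.out.println("drop:   " + indexToDrop(today, 7));
    }
}
```

The appeal of this layout is that expiring old data means deleting one whole index rather than deleting individual documents from a big one.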
Hello,
You can use LuSQL to index DB content into Lucene. Solr (the "Lucene Server")
has DataImportHandler for indexing data from DBs:
http://search-lucene.com/?q=dataimporthandler
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-luce
s.searchenginewatch.com/showthread.php?t=48>.
> I hope to find some code that given a text corpus, generate all the words
> pairs with their probability of occurring together.
>
>
> On Sat, Aug 21, 2010 at 1:46 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wro
Hi,
Are you actually talking about Solr? Sounds like it. Check solr-u...@lucene
list.
Maybe you need to treat those words as protected words? See the protwords.txt
file in the conf dir.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://se
There is also a non-Mahout Key Phrase Extractor for Collocations, SIPs, and a
few other things: http://sematext.com/products/key-phrase-extractor/index.html
One of the demos that uses news data is at
http://sematext.com/demo/kpe/index.html
Otis
Sematext :: http://sematext.com/ :: Solr - Lu
Hello Luan,
I think you are looking for facets and faceted search. In short, it means
storing the category for a document (web page) in a Document Field in the Lucene
index. Then, at search time, you count how many matches were in which
category. You can implement this yourself or you can use
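For the "implement this yourself" route, the counting step could look roughly like the sketch below. It assumes you have already read the stored category value for each matching document into a list; the class and method names are hypothetical and no Lucene API is used here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: counting matches per category once the hits are known. */
final class FacetCountSketch {

    /** Counts how many matching documents fall into each category. */
    static Map<String, Integer> countByCategory(List<String> categoriesOfHits) {
        // TreeMap keeps the categories sorted by name for stable display.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String category : categoriesOfHits) {
            Integer current = counts.get(category);
            counts.put(category, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed input: the category field value of each hit, in hit order.
        List<String> hits = Arrays.asList("news", "sports", "news", "blogs");
        System.out.println(countByCategory(hits));
    }
}
```

Loading a stored field for every hit gets expensive on large result sets, so treat this as a way to see the idea rather than something tuned for big indices.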
Manning, the Lucene in Action publisher, frequently offers 30-50% off on a
number of their books, including LIA2.
See http://twitter.com/ManningBooks
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
Utku, you should ask via comments on
https://issues.apache.org/jira/browse/LUCENE-2453.
What happened with Lucandra?
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
> From: Utku Can Topçu
> To
Igor,
You can treat that question as the query and use it to search the index where
you've indexed other questions.
More Like This is another option.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
too, to show how it has improved in the last versions (not that it was bad
before) does anyone have a link to a nice page with numbers/graphs ?
On Thu, Jun 24, 2010 at 7:43 AM, Otis Gospodnetic
<otis_gospodne...@yahoo.co
On Wed, Jun 23, 2010 at 11:41 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Off the top of my head:
>
> FAST
>
> Endeca
> Co
nd Lucene...
And I personally wouldn't count full text search solutions such as Oracle's.
Itamar.
> -----Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
>
Off the top of my head:
FAST
Endeca
Coveo
Attivio
Vivisimo
Google Search Appliance
(tell me when to stop)
Dieselpoint
IBM OmniFind
Exalead
Autonomy
dtSearch
ISYS
Oracle
...
...
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com
Lucene/Solr choice typically means:
* lower cost of ownership (think about various crazy licensing models some of
the commercial search vendors have: per doc, per server, per query, per
year)
* faster implementation (just think about the duration of the sales/negotiation
phase for commerci
Ah, there is another one I came across several months back -
http://wiki.sdn.sap.com/wiki/display/Java/JPicus.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
> From: Otis Gospodnetic
&
Other than iostat, vmstat and such?
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message
> From: Jason Rutherglen
> To: java-user@lucene.apache.org
> Sent: Thu, June 3, 2010 2:13:17 PM
> Subject: Mo
Btw. folks, http://search-lucene.com/ has a really handy source code search
with auto-completion for Lucene, Solr, etc. For example, I typed in: numDel -
and immediately found those methods. Use it. :)
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search
Li Li:
Then best to go to the source.
Here's one version with syntax highlighting and line numbers, should you have
questions about specific parts of that class:
http://search-lucene.com/c/Lucene:/src/java/org/apache/lucene/search/PhraseQuery.java
Otis
Sematext :: http://sematext.com/ ::
Hi Pablo,
This question comes up every once in a while. You'll find some previous
discussions and answers here:
http://search-lucene.com/?q=terms+closer+together+score
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
-
VL,
Solr (not Lucene, but you can embed Solr) has JsonUpdateRequestHandler, which
lets you send docs to Solr for indexing in JSON (instead of the usual XML):
http://search-lucene.com/c/Solr:/src/java/org/apache/solr/handler/JsonUpdateRequestHandler.java
And you can get Solr to respond with JSON
I think those doc-oriented DBs tend to be distributed, with replication
built-in and such, but yes, in some way the schemaless DB with docs and fields
(whether they are pumped in as JSON or XML or Java objects) feels the same. I
saw something from Grant about 2 months ago how Lucene is "nosql-i
Pasa,
Maybe Field Collapsing (Solr) can help? See SOLR-236 in JIRA
http://search-lucene.com/?q=field+collapsing&fc_project=Lucene&fc_project=Solr
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
- Original Message --
I think others will have more thoughts on this, esp. for Numeric* questions...
but I'll try answering...
- Original Message
> From: Tomislav Poljak
> To: java-user@lucene.apache.org
> Sent: Fri, May 7, 2010 2:34:46 PM
> Subject: Filter vs. TermQuery performance
>
> Hi,
> when is it w
I think what Tomislav was trying to ask is:
Can filters replace only strictly boolean clauses (i.e. only MUST and
MUST_NOT, such as +gender:F or -rating:xxx)?
Or can filters also replace SHOULD clauses, such as food:banana (which is
neither absolutely required nor strictly prohibited)?
Otis
--
Hello folks,
Those of you in or near NYC and using Lucene or Solr should come to "Lucandra -
a Cassandra-based backend for Lucene and Solr" on April 26th:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12979971/
The presenter will be Lucandra's author, Jake Luciani.
Please spread the
Joseph,
If you can, get the latest Lucene and use NumericField to index your dates with
appropriate precision and then use NumericRangeQueries when searching. This
will be faster than searching for string dates in a given range.
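To make "appropriate precision" a bit more concrete, here is a plain-Java sketch that reduces a millisecond timestamp to whole days before it would be indexed as a number. The class and method names are made up, day precision and UTC are assumptions, and the NumericField and NumericRangeQuery calls themselves are deliberately left out.

```java
import java.time.Instant;
import java.time.ZoneOffset;

/** Illustrative only: reducing a timestamp to day precision before indexing. */
final class DatePrecisionSketch {

    /** Truncates an epoch-millisecond timestamp to days since 1970-01-01 (UTC). */
    static long toEpochDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate()
                .toEpochDay();
    }

    /** Inclusive numeric range check, the comparison a numeric range query expresses. */
    static boolean inRange(long value, long min, long max) {
        return value >= min && value <= max;
    }

    public static void main(String[] args) {
        long nowDay = toEpochDay(System.currentTimeMillis());
        // "Within the last 7 days" becomes a simple comparison of small numbers.
        System.out.println(inRange(nowDay - 3, nowDay - 7, nowDay));
    }
}
```

If you only ever search by day, indexing the day number instead of the full millisecond value leaves far fewer distinct values in the field.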
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nut
Hi,
I actually don't follow your change, because after the "but changing it to" line
the only different thing I see is the doc.add(dateField) call, which you didn't
list before "but changing it to".
Also, if I understood Uwe correctly, he was suggesting reusing NumericField
instances, which means
Hello everyone,
Robert Muir gave a great presentation on a few advanced Lucene topics last
night and even found time to send this presentation to me, which I just
uploaded:
http://www.slideshare.net/otisg/finite-state-queries-in-lucene
You'll find all other presentations from the NYC Search
Hi Erick,
For what it's worth, we are considering indexing JIRA comments over on
http://search-lucene.com/ , though I'm not entirely convinced searching in
comments would be super valuable. Would it?
But note that JIRA (and LucidFind) already do that. For example, go to
http://issues.apache.
Maybe it's not a leak, Monique. :)
If you use sorting in Lucene, then the FieldCache object will keep some data
permanently in memory, for example.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
- Original Message -
Andrzej,
Does that mean the regular Lucene QP will get Span query syntax support (vs.
having it in that separate Surround QP)? Or maybe that already happened and I
missed it? :)
Thanks,
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://searc
Hello folks,
Those of you in or near New York and using Lucene or Solr should come to
"Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more" on March
24th:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12720960/
The presenter will be the hyperactive Lucene committer R
Paul,
Custom Similarity perhaps, oui. Not 100% sure, maybe have this always return
1.0f.
/** Computes a score factor based on the fraction of all query terms that a
* document contains. This value is multiplied into scores.
*
* The presence of a large portion of the query terms ind
Hi Jamie,
Could you say more about how it's not working? Not compiling? Run-time
exceptions? Doesn't work as expected after you run a unit test for it?
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
- Original Mes
Fedora Core 4 is *ancient*! :)
Could it be that the NFS client on it is old, and this is causing problems? I
remember emails about NFS 3 vs. NFS 4 and some improvements in the latter. I
don't recall the details and tend to keep my Lucene and Solr instances away
from NFS mounts.
Otis
Sema
Yes, that's just a phrase slop, allowing for variable gaps between words.
I *believe* the Surround QP that works with Span family of queries does handle
what you are looking for.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: T. R. Halvor
Yes, I believe it is the same. I bet the Explain explanation would help
confirm this.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Paul Taylor
> To: java-user@lucene.apache.org
> Sent: Wed, January 20, 2010 1:03:14 PM
> Subject: Can yo
Guido,
No, you should absolutely not need to constantly rebuild the index. If you
find you have to do that, you'll know you are doing something wrong.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Guido Bartolucci
> To: java-user@lucen
t know how to modify/use the solr script.
>
> Regards
> Ganesh
>
>
> - Original Message -
> From: "Otis Gospodnetic"
> To: ;
> Sent: Wednesday, January 20, 2010 10:45 AM
> Subject: Re: Lucene as a primary datastore
>
>
> > You are not al
You are not alone, Guido. It's a good question. In my experience, Lucene is
as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not
corrupt it. Of course, even with the most expensive databases, you'd want to
make backups. The same goes with Lucene. Nowadays, one way
Hello,
Use Droids, it's much simpler than Nutch or Heritrix:
http://incubator.apache.org/droids/
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Phan The Dai
> To: java-user@lucene.apache.org
> Sent: Sat, January 16, 2010 2:20:47 AM
> Sub
I think Jason meant "15-20GB segments"?
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
From: Jason Rutherglen
To: java-user@lucene.apache.org
Sent: Wed, January 13, 2010 5:54:38 PM
Subject: Re: Max Segmentation Size when Optimizing Index
Ye
Hi,
Use the latest version of Lucene, obey Lucene's locks, write with 1
IndexWriter, avoid NFS...
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: zhang99
> To: java-user@lucene.apache.org
> Sent: Tue, January 12, 2010 10:41:19 PM
> Subje
Zhou,
Your question will get more attention if you send it to
nutch-u...@lucene.apache.org list instead. This list is for Lucene Java.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: "jyzhou...@yahoo.com"
> To: java-user@lucene.apache.o
Hello,
If "Search Engine Integration, Deployment and Scaling in the Cloud" sounds
interesting to you, and you are going to be in or near New York next Wednesday
(Jan 20) evening:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/
Sorry for dupes to those of you subscribed to mul
Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs
that the JVM supports.
But it does contain some shell scripts, as does Hadoop, which Nutch uses. Oh, I
guess Windows people run it under Cygwin?
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> Also, what did you mean about isolating users and their data/indices? Did
> you mean that I should create a separate index per user?
>
> Thanks again!
>
> On Fri, Jan 8, 2010 at 12:35 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
> > For something li
For something like CSE, I think you want to isolate users and their
data/indices.
I'd look at Bixo or Nutch or Droids ==> Lucene or Solr
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Yaniv Ben Yosef
> To: java-user@lucene.apache.org
> S
o limit the size of an index?
>
> On Thu, Jan 7, 2010 at 2:23 PM, Otis Gospodnetic
> wrote:
> >> Merge factor controls how many segments are merged at once. The default
> >> is 10.
> >>
> >> The maxMergeMB setting sets the max size for a given seg
> Merge factor controls how many segments are merged at once. The default is
> 10.
>
> The maxMergeMB setting sets the max size for a given segment to be
> included in a merge.
I wonder if renaming that to maxSegSizeMergeMB would make it more obvious what
this does?
Otis
--
Sematext -- http:/
/is completed successfully and, as you say,
> there is only one segment in the directory.
>
> Some other ideas?
>
> Thanks,
> Yuliya
>
> > -Original Message-
> > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> > Sent: Thursday
You could try Avro instead of JSON/XML/Java Serialization. It's compact (and
new).
http://hadoop.apache.org/avro/
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Paul Taylor
> To: java-user@lucene.apache.org
> Sent: Tue, January 5, 2010
Yuliya,
The index *directory* will be larger *while* you are optimizing. After the
optimization is completed successfully, the index directory will be smaller.
It is possible that your index directory is large(r) because you have some
left-over segments (e.g. from some earlier failed/interrup
This actually rings a bell for me... have a look at Lucene's JIRA, I think this
was reported as a bug once and perhaps has been fixed.
Note that Lucene in Action 2 has a case study that talks about searching source
code. You may find that study interesting.
Otis
--
Sematext -- http://sematext
I think you should be able to use 1+ FilteredQuery (with IDs of your docs) with
your main query and thus get the scores only for docs that interest you.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Erdinc Yilmazel
> To: java-user@lucen
Chris,
You could look at KStem to see if that does a better job.
Or perhaps WordNet can be used to get the lemma of those terms instead of using
stemming.
Finally what was I going to say... ah, yes, using synonyms may be another
way this can be handled.
Otis
--
Sematext -- http://sematext.c
Hi,
Have a look at http://www.sematext.com/products/autocomplete/index.html
It handles Chinese and large volumes of data.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: fulin tang
> To: java-user@lucene.apache.org
> Sent: Thu, November
Hello,
For those living in or near NYC, you may be interested in joining (and/or
presenting?) at the NYC Search & Discovery Meetup.
Topics are: search, machine learning, data mining, NLP, information gathering,
information extraction, etc.
http://www.meetup.com/NYC-Search-and-Discovery/
Our
For what it's worth, AOL uses a Solr cluster to handle searches for @aol users.
Each user has his own index.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
> From: fulin tang
> To
Hi,
Please use java-user list for user questions.
Are you sure the file got fully indexed in the first place? Use Luke to check.
Also, see:
IndexWriter.MaxFieldLength
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NE
Well, I think some people will be for hiding complexity, while others will be
for being in control and having transparency. Think how surprised one would be
to find 1 extra field in his index, say when looking at their index with Luke.
:)
Otis
--
Sematext is hiring -- http://sematext.com/about
Hello,
Most likely due to the operating system caching the relevant portions of the
index after the first set of queries.
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
- Original Message
> From: Din
Hello,
Also keep in mind prefix queries are not the cheapest.
Plug:
We've seen people use this successfully:
http://www.sematext.com/products/autocomplete/index.html
I believe somebody is trying this out with a set of 1B suggestions. The demo
at http://www.sematext.com/demo/ac/index.html search
Hello,
Comments inlined.
- Original Message
> From: vsevel
> To: java-user@lucene.apache.org
> Sent: Fri, November 13, 2009 11:32:02 AM
> Subject: Re: OutofMemory in large index
>
>
> Hi, I am jumping into the thread because I have got a similar issue.
> My index is 30Gb large and
This is what we have in Lucene in Action 2:
~/lia2$ ff \*Thread\*java
./src/lia/admin/CreateThreadedIndexTask.java
./src/lia/admin/ThreadedIndexWriter.java
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
-
Alex,
If I understand you correctly, all you have to do is either make sure that the
query is run as a phrase query (with quotes around it), or as a term query
where both terms are required (with a plus sign in front of each term, no space).
As for detecting score gap and such, you could do that
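The two query forms mentioned in the first paragraph can be sketched as plain string building, which is all the snippet below does. It assumes the terms contain no characters that are special in the query syntax (no escaping is attempted), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: builds the two query-string forms as plain text. */
final class QueryStringSketch {

    /** Wraps the terms in double quotes so they are parsed as one phrase. */
    static String asPhrase(List<String> terms) {
        return "\"" + String.join(" ", terms) + "\"";
    }

    /** Prefixes every term with '+' (no space after it) so each is required. */
    static String allRequired(List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("new", "york");
        System.out.println(asPhrase(terms));
        System.out.println(allRequired(terms));
    }
}
```

The phrase form also requires the terms to appear next to each other in that order, while the plus form only requires that both appear somewhere in the field.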
Hi,
That mergeFactor is too high. I suggest going back to the default (10).
maxBufferedDocs is an old and not very accurate setting (imagine what happens
with the JVM heap if your indexer hits a SUPER LARGE document). Use
setRAMBufferSizeMB instead.
Otis
--
Sematext is hiring -- http://sematext.c