I have been asked to forward this information to the dev and user mailing
lists - there is an opportunity to travel to Beijing for CoC Asia. Please
read the information below and apply, if you're interested.
Dawid
---------- Forwarded message ---------
From: Gavin McDonald
The Travel Assistance
s(DocumentsWriterPerThread.java:274)
> >> at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
> >> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552)
> >> at org.ap
Split your large file into smaller fragments and index each fragment as a
document.
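A minimal sketch of that approach (the field names, chunk size and splitting strategy here are my assumptions, not anything prescribed by Lucene; a real splitter would cut at whitespace so terms aren't broken at fragment boundaries):

    import java.io.BufferedReader;
    import java.nio.file.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    public class ChunkedIndexer {
      public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);
        try (FSDirectory dir = FSDirectory.open(Paths.get("index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
             BufferedReader reader = Files.newBufferedReader(file)) {
          char[] buf = new char[1 << 20]; // ~1M chars per fragment (arbitrary)
          int len;
          int fragment = 0;
          while ((len = reader.read(buf)) > 0) {
            Document doc = new Document();
            doc.add(new StringField("file", file.toString(), Field.Store.YES));
            doc.add(new StoredField("fragment", fragment++));
            doc.add(new TextField("body", new String(buf, 0, len), Field.Store.NO));
            writer.addDocument(doc);
          }
        }
      }
    }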
D.
On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira
wrote:
> Hi. I have apache-lucene version 10.1.0:
> ```
> $ pacman -Qs apache-lucene
> local/apache-lucene 10.1.0-1
> Apache Lucene is a high-performance,
You could flatten the intervals into different documents. This would
make retrieval of all of document's sectors a bit more clumsy but
searching would be simpler and the number of fields would be constant.
So each document would look like this:
document_id: xyz
sector_num: ...
start: ...
end: ...
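A sketch of what indexing and querying such flattened documents could look like (the point-field types and the overlap query are my choice, not from the thread):

    import java.io.IOException;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.*;

    static void addSector(IndexWriter writer, String docId,
                          int sector, long start, long end) throws IOException {
      Document doc = new Document();
      doc.add(new StringField("document_id", docId, Field.Store.YES));
      doc.add(new StoredField("sector_num", sector));
      doc.add(new LongPoint("start", start));
      doc.add(new LongPoint("end", end));
      writer.addDocument(doc);
    }

    // intervals overlapping [lo, hi]: start <= hi AND end >= lo
    static Query overlapping(long lo, long hi) {
      return new BooleanQuery.Builder()
          .add(LongPoint.newRangeQuery("start", Long.MIN_VALUE, hi), BooleanClause.Occur.MUST)
          .add(LongPoint.newRangeQuery("end", lo, Long.MAX_VALUE), BooleanClause.Occur.MUST)
          .build();
    }

Retrieving all sectors of one document is then a TermQuery on document_id.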
> I spent some time with ChatGPT and Google, looking for a simple CLI method
> to explore the content. I see mention of Luke, but it seems very dated.
Luke is your best bet. There is no command-line tool to "explore the
content" because Lucene indexes are fairly low
level. I'm guessing you'd like
You need to subtract the matching documents from everything else in the
negative part, effectively:
*:* AND NOT (zips-within area)
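As a BooleanQuery this is a MatchAllDocsQuery with the geo query as a MUST_NOT clause (sketch; the inner query is whatever matches zips inside the area):

    import org.apache.lucene.search.*;

    static Query outside(Query zipsWithinArea) {
      return new BooleanQuery.Builder()
          .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
          .add(zipsWithinArea, BooleanClause.Occur.MUST_NOT)
          .build();
    }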
D.
On Wed, May 8, 2024 at 8:27 PM Siraj Haider
wrote:
> Hello there,
> We are using Lucene v6.4.1 and are looking to implement geopoint searching
> within or outsi
ularly large, but
> > are the most suspect.
> >
> > Thanks for any input,
> > Marc
> >
> > 9.4.2
> > Time(ms) per Document
> > facetConfig.build : 0.9882365
> > Taxo Add: 0.8334876
> >
> > 9.5
> > facetConfig.build : 11.0
If you download the binary distribution, try this:
Windows:
java --module-path modules;modules-thirdparty --module
org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles
Linux/Unix/Mac:
java --module-path modules:modules-thirdparty --module
org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles
Hi Marc,
You could try git bisect lucene repository to pinpoint the commit that
caused what you're observing. It'll take some time to build but it's a
logarithmic bisection and you'd know for sure where the problem is.
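Roughly (the tag names here just follow the releases/lucene/* convention; the build-and-benchmark step in the middle is yours):

    git bisect start
    git bisect bad releases/lucene/9.5.0     # version with the regression
    git bisect good releases/lucene/9.4.2    # last known-good version
    # at each step: build, run the facet benchmark, then mark the commit:
    git bisect good   # or: git bisect bad
    git bisect reset  # when done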
D.
On Thu, Apr 18, 2024 at 11:16 PM Marc Davenport
wrote:
> Hi Adrien et al
Hello,
This recent issue is exactly referring to what you need - you may want to
continue the discussion there, perhaps
ask the commenter to make the code public on github for experiments?
https://github.com/apache/lucene/issues/13065
Dawid
On Sun, Feb 11, 2024 at 8:42 AM _ SATNAM wrote:
> He
Hi Erel,
3. A few days ago I started the local server and was surprised to see
> that the index is corrupt. It failed to decompress a stored field.
> Something deep inside Lucene.
>
If you can include a stack trace, it would be great. Also, try running
CheckIndex on that index to see if it s
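For reference, CheckIndex has a command-line entry point (the jar name depends on your version and distribution):

    java -cp lucene-core-<version>.jar org.apache.lucene.index.CheckIndex /path/to/index

There is also an -exorcise option that drops unreadable segments, but it loses the documents in them, so take a backup first.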
You can use two different queries - the query is just used as a source of
information on what to highlight (it can even be completely different and
unrelated to the query that retrieved the documents).
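For example, with the unified highlighter (a sketch; the field name is made up):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.search.*;
    import org.apache.lucene.search.uhighlight.UnifiedHighlighter;

    static String[] highlightWithOtherQuery(IndexSearcher searcher, Analyzer analyzer,
        Query retrievalQuery, Query highlightQuery) throws IOException {
      TopDocs hits = searcher.search(retrievalQuery, 10);
      UnifiedHighlighter uh = new UnifiedHighlighter(searcher, analyzer);
      // the query used for highlighting is independent of the one that fetched the hits
      return uh.highlight("body", highlightQuery, hits);
    }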
Separately, unified highlighter is great but you may also try the matches
API - I found it to be
Can't open this repository, it's probably private.
Dawid
On Tue, Feb 14, 2023 at 2:42 PM Thanos Agelakpoulos
wrote:
>
> Thanks for the response David !
>
> I created a quick repo just to showcase,
> https://github.com/aggelako/JavaSpellchecker
> In there you can see how I'm using Lucene, in the
It'd be good if you could share the problematic scenario as a piece of code
(ideally a forked Lucene repository, with a test case?) so that we can take
a look. There's been a ton of improvements to hunspell packages in Lucene 9
(and on the main branch) - you should take a look and perhaps take some
if I'm using the MMapDirectory. The data
> is on heap.
>
> For my use case, it's a huge waste of memory :( 90% of my data could be
> correctly organised and kept in disk.
>
> Thanks for the support
>
> Best regards
> Marcos Rebelo
>
> On Tue, 27 Dec 2022, 09:11 Dawi
Looking at the code briefly, I think WFSTCompletionLookup uses an on-heap
store for the FST. You'd have to load it with the off-heap FST store instead:
https://github.com/apache/lucene/blob/1b9d98d6ec079e950bdd37137082f81400d3bc2e/lucene/core/src/java/org/apache/lucene/util/fst/OffHeapFSTStore.java
but
Don't ignore them - restore them to previous values after the test is
complete. This can be done with a test rule or a before/afterclass hook.
See here, for example:
https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java#L642-L655
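A bare-bones version with before/after-class hooks (JUnit 4; the property name is just an example):

    import org.junit.AfterClass;
    import org.junit.BeforeClass;

    public class MyTest {
      private static String saved;

      @BeforeClass
      public static void saveProperty() {
        saved = System.getProperty("some.property");
        System.setProperty("some.property", "test-value");
      }

      @AfterClass
      public static void restoreProperty() {
        if (saved == null) {
          System.clearProperty("some.property");
        } else {
          System.setProperty("some.property", saved);
        }
      }
    }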
> (so deleted docs == max docs) and call commit. Will/Can this segment still
> exist after commit?
>
Depends on your merge policy and index deletion policy. You can configure
Lucene to keep older commits (and then you'll preserve all historical
segments).
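For example (sketch):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.*;

    static IndexWriterConfig keepAllCommits(Analyzer analyzer) {
      IndexWriterConfig config = new IndexWriterConfig(analyzer);
      // never delete old commit points; historical segments stay on disk
      config.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
      return config;
    }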
I don't know the answer to your second quest
svcpp/version/CommandLineToolVersionLocator.java#L63
>
> -Rahul
>
> On Wed, Sep 14, 2022 at 11:51 AM Dawid Weiss wrote:
>
> > > I have no idea how to fix this. Dawid: Maybe we can also make the
> > > configuration of that native stuff only opt-in? So only detect Vis
> I have no idea how to fix this. Dawid: Maybe we can also make the
> configuration of that native stuff only opt-in? So only detect Visual
> Studio when you actively activate native code compilation?
It is an opt-in, actually. The problem is: gradle fails on applying the
plugin - even if the task
ting. Here
> > is the link to the diagnostics that you requested (since attachments/images
> > won't make it through):
> >
> > https://drive.google.com/file/d/15pt9Qt1H98gOvA5e0NrtY8YYHao0lgdM/view?usp=sharing
> >
> >
> > Thanks,
> > Rahul
> >
Hi Rahul,
Well, that's weird.
> "releases/lucene/9.2.0" -> Run "gradlew help"
>
> If you need additional stacktrace or other diagnostics I am happy to
> provide the same.
Could you do the following:
1) run: git --version so that we're on the same page as to what the
git version is (I don't thi
It does work just fine. Use cmd or powershell though. I don't think
things are even tested with cygwin/msys.
Dawid
On Tue, Sep 13, 2022 at 4:55 AM Rahul Goswami wrote:
>
> Hello,
> I am using gitbash to build lucene 9.2.0 on Windows. I checked out the
> release/lucene/9.2.0 tag and tried running
It looks great, thank you for this monumental effort, Tomoko.
Dawid
On Wed, Aug 24, 2022 at 9:19 PM Tomoko Uchida
wrote:
>
>
>
> Issue migration has been completed (except for minor cleanups).
> This is the Jira -> GitHub issue number mapping for possible future usage.
> https://github.com/apac
Yes, you need to build a third FST. You can build a merging iterator
that will combine two or more FST traversal streams so that they're in
order and then build a merged FST directly, with no extra sorting cost.
https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/util/fst/Builder.html#add-
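The merging iterator itself is plain Java; the merged (still sorted) stream then feeds the new builder's add() calls. A sketch of just the merge, leaving out the FST-specific parts (duplicate keys arriving from different FSTs would additionally need their outputs merged):

    import java.util.*;

    static Iterator<String> mergeSorted(List<Iterator<String>> inputs) {
      PriorityQueue<Map.Entry<String, Iterator<String>>> pq =
          new PriorityQueue<>(Map.Entry.comparingByKey());
      for (Iterator<String> it : inputs) {
        if (it.hasNext()) pq.add(new AbstractMap.SimpleEntry<>(it.next(), it));
      }
      return new Iterator<String>() {
        public boolean hasNext() { return !pq.isEmpty(); }
        public String next() {
          Map.Entry<String, Iterator<String>> e = pq.poll();
          if (e.getValue().hasNext()) {
            pq.add(new AbstractMap.SimpleEntry<>(e.getValue().next(), e.getValue()));
          }
          return e.getKey();
        }
      };
    }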
This change was intentional to make it consistent with package naming.
Dawid
On Tue, Jul 26, 2022 at 10:34 PM Baris Kazar wrote:
> Dear Folks,-
> I see that Lucene has changed one of the JAR files' name to
> lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0.
> It used to use analyzers.
ion to get more log output. Run with --scan to get full insights.
> * Get more help at https://help.gradle.org
> BUILD FAILED in 6s
> 56 actionable tasks: 16 executed, 40 up-to-date
>
>
> Wu Da (吴 达)
>
> Cloud Management R&D Department | 梯度科技
>
> 15915997306
> w...@ti
ut. Run with --scan to get full insights.
>
> * Get more help at https://help.gradle.org
>
> Deprecated Gradle features were used in this build, making it incompatible
> with Gradle 7.0.
> Use '--warning-mode all' to show the individual deprecation warnings.
> See
> http
I'm sorry, I was out of office. I can't see that attachment you
posted. If it's still a problem, can you copy-paste what you see on
the console once you issue "git status"?
Dawid
On Wed, Aug 4, 2021 at 1:07 PM Da Wu wrote:
>
> I have executed like this.
>
>
What does "git status" say? The hashes of generated files are not what
they're supposed to be - either something has changed them or you have
a git configuration that replaces something on the fly (line endings,
most likely).
Dawid
On Wed, Jul 28, 2021 at 9:55 AM Da Wu wrote:
>
> i want to contr
For what it is worth, I would be also interested in answers to
these questions. ;)
On Mon, Sep 21, 2020, 19:08 Uwe Schindler wrote:
> Hi all, hi Alan,
>
> I am currently rewriting some SpanQuery code to use IntervalQuery. Most of
> the transformations can be done quite easily and it is also bett
D. Binding.
Dawid
On Tue, Sep 1, 2020 at 10:21 PM Ryan Ernst wrote:
> Dear Lucene and Solr developers!
>
> Sorry for the multiple threads. This should be the last one.
>
> In February a contest was started to design a new logo for Lucene
> [jira-issue]. The initial attempt [first-vote] to call
I'm still in favor of the current logo (D), binding vote.
Dawid
A is nice and modern... but I still like the current logo better, so
for me it's "C".
Dawid
On Tue, Jun 16, 2020 at 12:08 AM Ryan Ernst wrote:
>
> Dear Lucene and Solr developers!
>
> In February a contest was started to design a new logo for Lucene [1]. That
> contest concluded, and I am now (
Thank you - I filed an issue for this, we will investigate.
https://issues.apache.org/jira/browse/LUCENE-9117
Dawid
On Mon, Jan 6, 2020 at 11:39 PM Cleber Muramoto
wrote:
>
> After generating a pre-compiled image lucene-core (8.3.0) with jaotc (JDK
> 13.0.1), RamUsageEstimator class is never lo
> You always pass "piwko" for stemming.
I'm afraid that's not correct? You should *never* pass on piwko when
stemming. :)
Dawid
Hi Maciej,
Stempel uses a pretrained heuristic. You can find a longer description
at [1] and [2]. The specific reason for the problems you mentioned may
be the smaller training dictionary used for the version embedded in
Lucene, I honestly don't know. If you need exact stemming/
lemmatization then
> We have chosen G1GC for both Java 8 and Java 11 versions.
It's not like we have answers for everything. ;) If it's the same GC
on both and there is still a slowdown then something else may be
causing it -- hard to tell without doing trial-and-error. There is a
set of performance benchmarks; perh
k on it.
>
> Regards,
> Jerven
> On 11/30/18 12:01 PM, Dawid Weiss wrote:
> > Just FYI: I implemented a quick and dirty PoC to see how it'd work.
> > Not much of a difference on my machine (since postings merging
> > dominates everything else). Interesting prob
ashMap by simply counting the size of the
> Node that is used for each entry, although given the dynamic nature of
> these data structures (HashMap eg can use TreeNodes sometimes
> depending on data distribution) it would be almost impossible to be
> 100% accurate.
> On Thu, Dec 6, 2018
> It's entirely possible it fails to dig into Maps correctly with newer Java
> releases; maybe Dawid or Uwe would know?
We have removed all reflection from that class a while ago exactly
because of encapsulation issues introduced in newer Java versions.
https://github.com/apache/lucene-solr/blob/
bq. We switched to ByteBuffersDirectory with 7.5, but
I actually didn't see much performance improvements or savings in memory.
Once the indexes are built I don't think there will be much of a
difference. The core problem with RAMDirectory was related to
synchronizations during merges/ file manipu
/jira/browse/LUCENE-8580
Dawid
On Fri, Nov 2, 2018 at 10:17 PM Dawid Weiss wrote:
>
> Thanks for chipping in, Toke. A ~1TB index is impressive.
>
> Back of the envelope says reading & writing 900GB in 8 hours is
> 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interfa
Thanks for chipping in, Toke. A ~1TB index is impressive.
Back of the envelope says reading & writing 900GB in 8 hours is
2*900GB/(8*60*60s) = 64MB/s. I don't remember the interface for our
SSD machine, but even with SATA II this is only ~1/5th of the possible
fairly sequential IO throughput. So f
> int processors = Runtime.getRuntime().availableProcessors();
> ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
> cms.setMaxMergesAndThreads(processors,processors);
Note that the number of threads in the CMS only matters if you have
concurrent merges of independent segments. What you
We are faced with a similar situation. Yes, the merge process can take
a long time and is mostly single-threaded (if you're merging from N
segments into a single segment, only one thread does the job). As
Erick pointed out, the merge process takes a backseat compared to
indexing and searches (in mo
Use MMapDirectory on a temporary location, Matthias. If you really
need in-memory indexes, a new Directory implementation is coming
(RAMDirectory will be deprecated, then removed), but the difference
compared to MMapDirectory is typically not worth the hassle. See this
issue for more discussion.
h
Erick already pointed you at the "cleanup" rule. This is fairly
generic, but if you know
the properties being modified you should still clean them up in @After or
@AfterClass -- this is useful for other people to know that you're modifying
them, if for nothing else.
Randomized testing package has
> That helps and explains why there is no support in std api
This isn't an API problem. This is by design -- this is how it works.
If you wish
to retrieve fields that are indexed and stored with the document, the
API provides
such an option (indexed and stored field type). Your indexed fields
are
t. Actual indexed content would be same if both index have
> "status" field indexed so we only need to validate fieldnames per
> document. Something like
>
> Thanks for reading all this if you have read so far :)
>
> Chetan Mehrotra
> [1]
> https://github.com/apach
basis using the Lucene API?
> Chetan Mehrotra
>
>
> On Tue, Jan 2, 2018 at 1:03 PM, Dawid Weiss wrote:
>> How about the quickest solution: dump the content of both indexes to a
>> document-per-line text
>> file, sort, diff?
>>
>> Even if your indexes are large,
How about the quickest solution: dump the content of both indexes to a
document-per-line text
file, sort, diff?
Even if your indexes are large, if you have large spare disk, this
will be super fast.
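A rough dumper, assuming all fields are stored (this ignores deletions and uses the stored-fields API of that era):

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    public class DumpIndex {
      public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
          for (int i = 0; i < reader.maxDoc(); i++) {
            Document doc = reader.document(i);
            System.out.println(doc.getFields()); // one line per document
          }
        }
      }
    }

Then sort both dumps and diff them:

    sort dump-a.txt > a.sorted; sort dump-b.txt > b.sorted; diff a.sorted b.sorted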
Dawid
On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra
wrote:
> Hi,
>
> We use Lucene for indexin
for example,
> be usefull in bioinformatic or all those cases where data is not a basic
> ADT.
>
> Cristian
>
> 2017-09-30 12:24 GMT+02:00 Dawid Weiss :
>
>> > Hi , it is possible to create a Automaton in lucene parsing not a string
>> > but a byte array?
> Hi , it is possible to create a Automaton in lucene parsing not a string
> but a byte array?
Can you state what problem are you trying to solve? This seems to be a
question stripped of a more general context -- why do you need those
byte-based automata?
Dawid
Hi Mike. Search lucene dev archives. I did write a decompounder with Daniel
Naber. The quality was not ideal but perhaps better than nothing. Also,
Daniel works on languagetool.org? They should have something in there.
Dawid
On Sep 16, 2017 1:58 AM, "Michael McCandless"
wrote:
> Hello,
>
> I ne
> it will be good if Lucene team can share their plans for a full java 9
> support (e.g. named modules of Lucene libraries)
So, here it is: we plan to support it. (*)
Dawid
(*) When it's stabilized and documented (it still isn't) [1]. And when
somebody has the time to do it (patches welcome, it'
https://issues.apache.org/jira/browse/SOLR-1105
Yes, this is spot-on what I need with regard to copyTo fields, thanks
for the link!
> Or are the overlaps coming from passage offset ranges from separate queries
> to the same content?
The overlaps are caused by the fact that we have multiple sour
> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
> The UH can't currently do this; but with the OH (original Highlighter) you
> can but it appears somewhat awkward. See SimpleSpanFragmenter. I had said
> it was easy but I was mistaken; I'm getting rustier on the OH.
Thanks for your explanation, David.
I actually found working with all Lucene highlighters pretty
difficult. I have a few requirements which seemed deceptively simple:
1) highlight query hit regions (phrase, fuzzy, terms);
2) try to organise the resulting snippets to visually "center" the hit
regi
> Dawid, the thing is that I am not even sure that Automata are the perfect
> fit for my project and I thought some literature on it would help me decide
> whether to use it or not.
Still looks to me like you're approaching the problem from the wrong
side or don't
want to share the core problem wh
> One small correction: we moved away from objects to more compact int[] a
> while ago for our automata implementation.
Right, forgot about that. There are still some trappy object-heavy
utilities like this one:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/luc
> I'd like to read something written by who designed these classes. What
> motivated, usage examples, what it is good for and what it is not good for.
> Maybe a history of the development of Automata on Lucene
Are you looking for a historical book on Lucene development or are you
looking to solve
Or you could encode those term/ngram frequencies in one FST and then
reuse it. This would be memory-saving and fairly fast (~comparable to
a hash table).
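A sketch with the FST API of that era (Builder with long outputs; input keys must be added in sorted order):

    import java.io.IOException;
    import org.apache.lucene.util.*;
    import org.apache.lucene.util.fst.*;

    static FST<Long> buildFreqFst() throws IOException {
      PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
      Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
      IntsRefBuilder scratch = new IntsRefBuilder();
      // terms must arrive in sorted order
      builder.add(Util.toIntsRef(new BytesRef("bar"), scratch), 17L);
      builder.add(Util.toIntsRef(new BytesRef("foo"), scratch), 42L);
      return builder.finish();
    }

    // lookup: Long freq = Util.get(fst, new BytesRef("foo")); // -> 42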
Dawid
On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless
wrote:
> Yes, this is a reasonable way to use Lucene (to see terms statistics across
> PatriciaTrie. In particular building an FST with doShareSuffix = false is
> the fastest of any option,
If you don't share the suffix then you are building a kind of Patricia
trie... But suffix sharing is cheap and can give you a memory saving
(and resulting cache locality sometimes) that is non-
tate registry
> more ram efficient too ... I think it's essentially the same thing as
> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> automata.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 15, 2017 a
You could try using morfologik's byte-based implementation:
https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
I can't guarantee it'll be fast enough -- you need to sort those input
sequences and even thi
> But it is fairly trivially to tweak/extend the query parser to produce
> diff behavior.
I think the conclusion for the original poster should be that there's
really not enough information to provide a definite answer. Lucene is
a search engine. Much like with a mechanical engine, its final
appli
curl -s 'localhost:9200/test/_search?pretty' -d '{ "query": {
> "query_string": { "query": "foo AND bar OR baz" , "default_operator": "or"
> } } , "profile" : true}' | grep lucene
> "lucene"
Which Lucene version and which query parser is this? Can you provide a
test case/ code sample?
I just tried with StandardQueryParser and for:
sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.AND);
dump(sqp.parse("foo AND bar OR baz", "field_a"));
sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.OR);
There are multiple Highlighter implementations for this purpose. Check
them out -- I'm sure one of them will suit your needs. In fact,
there's a new highlighter implemented very recently! Check out this
JIRA issue:
https://issues.apache.org/jira/browse/LUCENE-7438
Dawid
On Fri, Sep 30, 2016 at 8
> I think I see this now, and how skipping determinization and matching with
> the NFA could easily leave you with an intractable amount of backtracking
> for even the simpler binary question of does my input match any of the
> automatons I've unioned.
Note that with NFAs you may answer the questi
> Point taken, but I wonder if there's an algorithmic shortcut to determinize
> the union of Levenshtein DFAs...
Levenshtein DFA is an automaton like any other; when you merge two
such automata, they will very likely contain states that need to be
merged (and their transition split) in order to be
You could try to implement this refactoring, which would combine
linear storage of values (without the need to save the length of each
key explicitly) with their incremental addition order.
https://issues.apache.org/jira/browse/LUCENE-5854
The outcome may or may not be faster in practice (due to
The GC change is after this:
BJ (2015-12-02): Upgrade to beast2 (72 cores, 256 GB RAM)
which leads me to believe these results are not comparable (different
machines, architectures, disks, CPUs perhaps?).
Dawid
On Thu, Apr 14, 2016 at 7:13 PM, Otis Gospodnetić
wrote:
> Hi,
>
> I was looking a
You can addIndexes(Directory... dirs) -- then you don't have to deal
with CodecReader?
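Something like (sketch):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.Directory;

    static void merge(Directory target, Analyzer analyzer, Directory... sources) throws IOException {
      try (IndexWriter writer = new IndexWriter(target, new IndexWriterConfig(analyzer))) {
        writer.addIndexes(sources); // no CodecReader juggling needed
      }
    }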
Dawid
On Tue, Jan 12, 2016 at 4:43 PM, Manner Róbert wrote:
> Hi,
>
> we have used lucene 4.7.0 before, we are on the way to upgrade to 5.4.0.
>
> The problem I have is that writer.addIndexes now needs CodecRe
> LowerCaseFilter will not handle that. So whereas it is "safe" for
> English hard-coded strings, it isn't safe for all fields you might
> index in general.
This filter is a "safe" fallback that works identically regardless of
the locale you
have on your computer (or on the server). This, I believ
unt(Character.toLowerCase(cp));
    if (c1 != c2 || c1 != c3) {
      System.out.println(String.format(Locale.ROOT, "%d %d %d", c1, c2, c3));
    }
  }
D.
On Thu, Oct 22, 2015 at 10:15 AM, Dawid Weiss wrote:
I think the issue here is what happens if an "uppercase" codepoint requires
a surrogate pair and the lowercase counterpart does not -- then the index
variable would indeed be screwed.
Dawid
On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler wrote:
> Hi,
>
> > Setting aside the fact that Character.
It is (b).
D.
On Fri, Aug 7, 2015 at 3:05 AM, Trejkaz wrote:
> I have recently done updates from Lucene 3.6 to 4.x and 4.x to 5.2.
>
> During this process, I noticed that the FST used by the Japanese
> analyser (AKA Kuromoji) was changing between releases. As I fear
> breakages in backwards comp
> Otherwise, it violates the Liskov substitution principle as well.
Sadly it also violates Heisenberg's principle at the bit state
energy levels. We're working on improving that.
From your heated comments I think you should switch the language to
something that guarantees immutability of any
> BytesRef is not different, because it is just a "reference" to pass around.
> And cloning a reference for sure should not clone the target of the
> reference. You are "cloning" the reference and only that (as the name of the
> class says: Bytes*Ref*)!
Exactly. It is a reference and as such, c
Yes, BytesRef can be surprising. No, it probably won't change in
Lucene to comply with superb design principles. Yes, the odd design is
there for performance reasons and it does provide noticeable gain.
Perhaps you could file a JIRA issue to improve the documentation, this
would be helpful. For wh
Thanks for contributing time to the release, Anshum.
Dawid
On Fri, Feb 20, 2015 at 10:16 PM, Anshum Gupta wrote:
> Sure, I'll fix that on the wiki. Thanks for pointing that out Uwe.
>
> On Fri, Feb 20, 2015 at 1:10 PM, Uwe Schindler wrote:
>
>> Many thanks! :-) Nice work!
>>
>> I found a small
> This could speed up tests, especially Solr where some dirs are copied over
> and over for every test case. :-)
A wild idea, but since there's NIO everywhere now you could use an
in-memory filesystem for tests and avoid going to disk entirely :D
https://github.com/google/jimfs
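Something like this, assuming jimfs is on the classpath (note mmap won't work on a virtual filesystem, so pick NIOFSDirectory explicitly rather than FSDirectory.open):

    import java.io.IOException;
    import java.nio.file.*;
    import com.google.common.jimfs.Configuration;
    import com.google.common.jimfs.Jimfs;
    import org.apache.lucene.store.*;

    static Directory inMemoryDirectory() throws IOException {
      FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
      Path indexPath = fs.getPath("/index");
      Files.createDirectories(indexPath);
      return new NIOFSDirectory(indexPath); // backed by RAM, no disk I/O
    }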
Dawid
.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Dawid Weiss [mailto:dawid.we...@gmail.com]
>> Sent: Monday, December 22, 2014 8:48 AM
>> To: java-user@lucene.apache.org
>> Cc: Uwe Schindler
>> Subject: Re: BTRFS ?
>
> I spotted Uwe's comment in JIRA the other day "BTRFS, which might also
> bring some cool things for Lucene.".
What cool things about BTRFS are you talking about, Uwe? Just curious.
Dawid
Hi Steve,
I have to admit I also find it frequently useful to include
punctuation as tokens (even if it's filtered out by subsequent token
filters for indexing, it's a useful to-have for other NLP tasks). Do
you think it'd be possible (read: relatively easy) to create an
analyzer (or a modificatio
Hi Michal,
Pretty cool. Your work reminds me of what Leo Galambos did a while back:
http://link.springer.com/chapter/10.1007/978-3-540-39985-8_22
I believe his implementation is still available in the Egothor search
engine project.
Dawid
On Wed, Oct 23, 2013 at 5:17 PM, Michal Hlavac wrote:
Here's another thought: if you desperately need complex searches then
you could do a heuristic filtering to narrow down the search: use an
analyzer that does some form of input splitting into terms (removing
excess whitespace or even producing n-grams from the input), then do
the same for the query
Jerome,
Some of the tokens are removed because their part of speech tags are
in the stoptags file? That's my guess at least -- you can always try
to copy/paste Japanese analyzer and change the token stream
components:
protected TokenStreamComponents createComponents(String fieldName,
Reader rea
> Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and
> then have a separate search index with "boston red sox rumor" indexed
> as a document. If the user types "red so", then you run suggest on
> "red" and on "so", and then run a hmm MultiPhraseQuery for
> (red|redmond|reddit)
Iterating character-by-character is different than considering the
entire string at once so your observation is correct, that's how it's
supposed to work. In particular, note this in String#toLowerCase
documentation:
"Since case mappings are not always 1:1 char mappings, the resulting
String may b
> https://issues.apache.org/jira/browse/LUCENE-4491 ? Could you simply
> stuff your ISBN onto the end of the suggestion (ie enroll Lucene in
> Action|1933988177)?
Just remember that if your suffixes are unique then you'll be
expanding the automaton quite a bit (unique suffix paths).
D.
> The WhitespaceAnalyzer breaks up text by spaces and tabs and newlines.
> After that, you can use wildcards. This will use very little space. I
> believe leading & trailing wildcards are supported now, right?
If leading wildcards take too much time (don't know, really) then one
could also try to index
> Does Lucene support this type of structure, or do I need to somehow implement
> it outside Lucene?
You'd have to implement it separately but it'd be much, much smaller
than Lucene itself (even obfuscated).
> By the way, I need this to run on an Android phone so size of memory might be
> an is
What you need is a suffix tree or a suffix array. Both data structures
will allow you to perform constant-time searches for existence/
occurrence of any input pattern. Depending on how much text you have
on the input it may either be a simple task -- see here:
http://labs.carrotsearch.com/jsuffixa
http://static1.blip.pl/user_generated/update_pictures/1758685.jpg
On Thu, Aug 2, 2012 at 8:32 AM, roz dev wrote:
> wow!! That was quick.
>
> Thanks a ton.
>
>
> On Wed, Aug 1, 2012 at 11:07 PM, Simon Willnauer
> wrote:
>
>> On Thu, Aug 2, 2012 at 7:53 AM, roz dev wrote:
>> > Thanks Robert for th
Read this:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
Dawid
On Thu, Jul 19, 2012 at 1:32 PM, Dragon Fly wrote:
>
> The slowest part of my application is to read the search hits from disk. I
> was hoping that using an SSD or RAMDirectory/MMapDirectory would speed th
> Why anyone buys computers without SSD's is a mystery to me. Use SSDs for
On topic and highly recommended:
http://www.youtube.com/watch?v=H7PJ1oeEyGg
Dawid
> Rum is an essential ingredient in all software systems :-)
You probably meant "social systems".
D.