Community Over Code Asia Travel Assistance Applications

2025-03-04 Thread Dawid Weiss
I have been asked to forward this information to the dev and user mailing lists - there is an opportunity to travel to Beijing for CoC Asia. Please read the information below and apply if you're interested. Dawid -- Forwarded message - From: Gavin McDonald The Travel Assistance

Re: apache-lucene blowing up with large file

2025-03-01 Thread Dawid Weiss
s(DocumentsWriterPerThread.java:274) > >> at > >> > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425) > >> at > >> > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1552) > >> at > >> > org.ap

Re: apache-lucene blowing up with large file

2025-02-28 Thread Dawid Weiss
Split your large file into smaller fragments and index each fragment as a document. D. On Fri, Feb 28, 2025 at 10:30 AM Daniel Cerqueira wrote: > Hi. I have apache-lucene version 10.1.0: > ``` > $ pacman -Qs apache-lucene > local/apache-lucene 10.1.0-1 > Apache Lucene is a high-performance,
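
A minimal sketch of the splitting step, in plain Java with no Lucene dependency. The class name FragmentSplitter and the fixed character limit are assumptions made for illustration; a real indexer would more likely stream the file and cut on line or paragraph boundaries, then add each fragment as its own document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class FragmentSplitter {
    /** Splits text into consecutive fragments of at most maxChars chars each. */
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> fragments = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            // Avoid cutting a surrogate pair in half when the fragment has room to shrink.
            if (end < text.length() && end - start > 1
                    && Character.isHighSurrogate(text.charAt(end - 1))) {
                end--;
            }
            fragments.add(text.substring(start, end));
            start = end;
        }
        return fragments;
    }
}
```

Each returned fragment would then be indexed as a separate document, ideally with a field that records which file it came from.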

Re: Suggestions for modeling an Index

2025-01-21 Thread Dawid Weiss
You could flatten the intervals into different documents. This would make retrieval of all of a document's sectors a bit more clumsy but searching would be simpler and the number of fields would be constant. So each document would look like this: document_id: xyz sector_num: ... start: ... end: ...
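
The flattening can be sketched in plain Java as below. The field names follow the message; the class name SectorFlattening and the use of maps in place of Lucene documents are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: one flat record per interval, always with the same four fields.
final class SectorFlattening {
    /** sectors[i] holds {start, end} for the i-th interval of the document. */
    static List<Map<String, Object>> flatten(String documentId, int[][] sectors) {
        List<Map<String, Object>> docs = new ArrayList<>();
        for (int i = 0; i < sectors.length; i++) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("document_id", documentId);
            doc.put("sector_num", i);
            doc.put("start", sectors[i][0]);
            doc.put("end", sectors[i][1]);
            docs.add(doc);
        }
        return docs;
    }
}
```

Fetching all sectors of one original document then becomes a lookup on document_id, which is the clumsier retrieval the message mentions.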

Re: Current command line tools for Lucene?

2024-09-24 Thread Dawid Weiss
> I spent some time with ChatGPT and Google, looking for a simple CLI method > to explore the content. I see mention of Luke, but it seems very dated. Luke is your best bet. There is no command-line tool to "explore the content" because Lucene indexes are fairly low level. I'm guessing you'd like

Re: Zipcode radius search outside certain miles of a zipcode

2024-05-08 Thread Dawid Weiss
You need to subtract the matching documents from everything else in the negative part, effectively: *:* AND NOT (zips-within area) D. On Wed, May 8, 2024 at 8:27 PM Siraj Haider wrote: > Hello there, > We are using Lucene v6.4.1 and are looking to implement geopoint searching > within or outsi
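
The set logic behind this can be sketched with java.util.BitSet standing in for the sets of matching document ids. This is an illustration of the idea, not Lucene's implementation; the class and method names are made up.

```java
import java.util.BitSet;

// Hypothetical illustration of "*:* AND NOT (zips-within area)".
final class NegativeClause {
    /** Returns the ids in [0, maxDoc) that are not set in 'matching'. */
    static BitSet allExcept(BitSet matching, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);     // the "*:*" part: start from every document
        result.andNot(matching);   // the "AND NOT (...)" part: subtract the matches
        return result;
    }
}
```

Without the first step there is nothing to subtract from, which is why the negative part needs a positive clause alongside it.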

Re: Indexing time increase moving from Lucene 8 to 9

2024-04-23 Thread Dawid Weiss
ularly large, but > > are the most suspect. > > > > Thanks for any input, > > Marc > > > > 9.4.2 > > Time(ms) per Document > > facetConfig.build : 0.9882365 > > Taxo Add: 0.8334876 > > > > 9.5 > > facetConfig.build : 11.0

Re: Help running the demo program

2024-04-22 Thread Dawid Weiss
If you download the binary distribution, try this: Windows: java --module-path modules;modules-thirdparty --module org.apache.lucene.demo/org.apache.lucene.demo.IndexFiles Linux/Unix/Mac: java --module-path modules:modules-thirdparty --module org.apache.lucene.demo/org.apache.lucene.demo.IndexFil

Re: Indexing time increase moving from Lucene 8 to 9

2024-04-18 Thread Dawid Weiss
Hi Marc, You could try running git bisect on the Lucene repository to pinpoint the commit that caused what you're observing. It'll take some time to build but it's a logarithmic bisection and you'd know for sure where the problem is. D. On Thu, Apr 18, 2024 at 11:16 PM Marc Davenport wrote: > Hi Adrien et al
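
To show why the bisection is logarithmic, here is a small Java sketch of the search that git bisect performs. The commit numbering and the isBad predicate are assumptions for illustration; in practice the predicate is "build this commit and run the benchmark".

```java
import java.util.function.IntPredicate;

// Hypothetical model: commits are numbered in history order.
final class Bisect {
    /**
     * Finds the first bad commit, given that 'good' is known good, 'bad' is
     * known bad, and every commit after the first bad one is also bad.
     */
    static int firstBad(int good, int bad, IntPredicate isBad) {
        while (bad - good > 1) {
            int mid = good + (bad - good) / 2;
            if (isBad.test(mid)) {
                bad = mid;    // the regression is at mid or earlier
            } else {
                good = mid;   // the regression is after mid
            }
        }
        return bad;
    }
}
```

Each test halves the remaining range, so a thousand commits need about ten builds rather than a thousand.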

Re: Hebrew Tockenizer

2024-02-11 Thread Dawid Weiss
Hello, This recent issue refers to exactly what you need - you may want to continue the discussion there, perhaps ask the commenter to make the code public on GitHub for experiments? https://github.com/apache/lucene/issues/13065 Dawid On Sun, Feb 11, 2024 at 8:42 AM _ SATNAM wrote: > He

Re: Windows issue with "Using MemorySegmentIndexInput with Java 20" (?)

2023-12-14 Thread Dawid Weiss
Hi Erel, > 3. A few days ago I started the local server and was surprised to see > that the index is corrupt. It failed to decompress a stored field. > Something deep inside Lucene. If you can include a stack trace, it would be great. Also, try running CheckIndex on that index to see if it s

Re: Highlighting query results, my method is too crude, but how to improve it?

2023-02-20 Thread Dawid Weiss
You can use two different queries - the query is just used as a source of information on what to highlight (it can even be completely different and unrelated to the query that retrieved the documents). Separately, the unified highlighter is great but you may also try the matches API - I found it to be

Re: Lucene Hunpell Spell checker

2023-02-17 Thread Dawid Weiss
Can't open this repository, it's probably private. Dawid On Tue, Feb 14, 2023 at 2:42 PM Thanos Agelakpoulos wrote: > > Thanks for the response David ! > > I created a quick repo just to showcase, > https://github.com/aggelako/JavaSpellchecker > In there you can see how im using lucene, in the

Re: Lucene Hunpell Spell checker

2023-02-13 Thread Dawid Weiss
It'd be good if you could share the problematic scenario as a piece of code (ideally a forked Lucene repository, with a test case?) so that we can take a look. There have been a ton of improvements to hunspell packages in Lucene 9 (and on the main branch) - you should take a look and perhaps take some

Re: Loading WFST to Memory Mapped File in Lucene

2022-12-27 Thread Dawid Weiss
if I'm using the MMapDirectory. The data > is on heap. > > For my use case, it's a huge waste of memory :( 90% of my data could be > correctly organised and kept in disk. > > Thanks for the support > > Best regards > Marcos Rebelo > > On Tue, 27 Dec 2022, 09:11 Dawi

Re: Loading WFST to Memory Mapped File in Lucene

2022-12-27 Thread Dawid Weiss
Looking at the code briefly, I think WFSTCompletionLookup uses an on-heap store for the FST. You'd have to load it with an off-heap FST store instead: https://github.com/apache/lucene/blob/1b9d98d6ec079e950bdd37137082f81400d3bc2e/lucene/core/src/java/org/apache/lucene/util/fst/OffHeapFSTStore.java but

Re: How to ignore system properties?

2022-09-19 Thread Dawid Weiss
Don't ignore them - restore them to previous values after the test is complete. This can be done with a test rule or a before/afterclass hook. See here, for example: https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java#L642-L655
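
A minimal sketch of the save-and-restore idea using only the JDK. The class and method names are made up for illustration; in a JUnit test the same two steps would sit in a rule or in before/after-class hooks, as the message suggests.

```java
import java.util.Properties;

// Hypothetical helper: snapshot system properties, run the test body, restore.
final class SystemPropertiesRestore {
    static void runWithRestoredProperties(Runnable testBody) {
        // Shallow copy of the current properties, taken before the test touches them.
        Properties saved = (Properties) System.getProperties().clone();
        try {
            testBody.run();
        } finally {
            // Put the snapshot back even if the test body throws.
            System.setProperties(saved);
        }
    }
}
```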

Re: Questions about Lucene source

2022-09-17 Thread Dawid Weiss
> (so deleted docs == max docs) and call commit. Will/Can this segment still > exist after commit? Depends on your merge policy and index deletion policy. You can configure Lucene to keep older commits (and then you'll preserve all historical segments). I don't know the answer to your second quest

Re: Lucene 9.2.0 build fails on Windows

2022-09-14 Thread Dawid Weiss
svcpp/version/CommandLineToolVersionLocator.java#L63 > > -Rahul > > On Wed, Sep 14, 2022 at 11:51 AM Dawid Weiss wrote: > > > > I have no idea how to fix this. Dawid: Maybe we can also make the > > > configuration of that native stuff only opt-in? So only detect Vis

Re: Lucene 9.2.0 build fails on Windows

2022-09-14 Thread Dawid Weiss
> I have no idea how to fix this. Dawid: Maybe we can also make the > configuration of that native stuff only opt-in? So only detect Visual > Studio when you actively activate native code compilation? It is an opt-in, actually. The problem is: gradle fails on applying the plugin - even if the task

Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Dawid Weiss
ting. Here > > is the link to the diagnostics that you requested (since attachments/images > > won't make it through): > > > > https://drive.google.com/file/d/15pt9Qt1H98gOvA5e0NrtY8YYHao0lgdM/view?usp=sharing > > > > > > Thanks, > > Rahul

Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Dawid Weiss
Hi Rahul, Well, that's weird. > "releases/lucene/9.2.0" -> Run "gradlew help" > > If you need additional stacktrace or other diagnostics I am happy to > provide the same. Could you do the following: 1) run: git --version so that we're on the same page as to what the git version is (I don't thi

Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Dawid Weiss
It does work just fine. Use cmd or powershell though. I don't think things are even tested with cygwin/msys. Dawid On Tue, Sep 13, 2022 at 4:55 AM Rahul Goswami wrote: > > Hello, > I am using gitbash to build lucene 9.2.0 on Windows. I checked out the > release/lucene/9.2.0 tag and tried running

Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-25 Thread Dawid Weiss
It looks great, thank you for this monumental effort, Tomoko. Dawid On Wed, Aug 24, 2022 at 9:19 PM Tomoko Uchida wrote: > > > > Issue migration has been completed (except for minor cleanups). > This is the Jira -> GitHub issue number mapping for possible future usage. > https://github.com/apac

Re: Lucene Suggester APIs question

2022-08-20 Thread Dawid Weiss
Yes, you need to build a third FST. You can build a merging iterator that will combine two or more FST traversal streams so that they're in order and then build a merged FST directly, with no extra sorting cost. https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/util/fst/Builder.html#add-
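
The merging iterator can be sketched in plain Java as below. It merges two already sorted key streams; feeding the result into an FST builder is left out. The class name is made up, and two simplifications are assumptions: keys are compared with String.compareTo, whereas a real FST needs its own input order, and a key present in both streams is emitted twice rather than combined.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical two-way merge of sorted streams; extends to N streams with a priority queue.
final class MergingIterator implements Iterator<String> {
    private final Iterator<String> a;
    private final Iterator<String> b;
    private String nextA;
    private String nextB;

    MergingIterator(Iterator<String> a, Iterator<String> b) {
        this.a = a;
        this.b = b;
        nextA = a.hasNext() ? a.next() : null;
        nextB = b.hasNext() ? b.next() : null;
    }

    @Override
    public boolean hasNext() {
        return nextA != null || nextB != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result;
        // Take from 'a' when 'b' is exhausted or the head of 'a' sorts first (ties go to 'a').
        if (nextB == null || (nextA != null && nextA.compareTo(nextB) <= 0)) {
            result = nextA;
            nextA = a.hasNext() ? a.next() : null;
        } else {
            result = nextB;
            nextB = b.hasNext() ? b.next() : null;
        }
        return result;
    }
}
```

Because both inputs are already sorted, the output is sorted as well and no extra sorting pass is needed before building the merged structure.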

Re: Lucene 9.1.0 has changed name of lucene-analysis-common-9.1.0.jar

2022-07-27 Thread Dawid Weiss
This change was intentional to make it consistent with package naming. Dawid On Tue, Jul 26, 2022 at 10:34 PM Baris Kazar wrote: > Dear Folks,- > I see that Lucene has changed one of the JAR files' name to > lucene-analysis-common-9.1.0.jar in Lucene version 9.1.0. > It used to use analyzers.

Re: lucene execute ./gradlew precommit and ./gradlew test failed

2021-08-17 Thread Dawid Weiss
ion to get more log output. Run with --scan to get full insights. > * Get more help at https://help.gradle.org > BUILD FAILED in 6s > 56 actionable tasks: 16 executed, 40 up-to-date > > > 吴 达 > > 云管研发部 | 梯度科技 > > 15915997306 > w...@ti

Re: lucene execute ./gradlew precommit and ./gradlew test failed

2021-08-16 Thread Dawid Weiss
ut. Run with --scan to get full insights. > > * Get more help at https://help.gradle.org > > Deprecated Gradle features were used in this build, making it incompatible > with Gradle 7.0. > Use '--warning-mode all' to show the individual deprecation warnings. > See > http

Re: lucene execute ./gradlew precommit and ./gradlew test failed

2021-08-16 Thread Dawid Weiss
I'm sorry, I was out of office. I can't see that attachment you posted. If it's still a problem, can you copy-paste what you see on the console once you issue "git status"? Dawid On Wed, Aug 4, 2021 at 1:07 PM Da Wu wrote: > > I have executed like this.

Re: lucene execute ./gradlew precommit and ./gradlew test failed

2021-08-02 Thread Dawid Weiss
What does "git status" say? The hashes of generated files are not what they're supposed to be - either something has changed them or you have a git configuration that replaces something on the fly (line endings, most likely). Dawid On Wed, Jul 28, 2021 at 9:55 AM Da Wu wrote: > > i want to contr
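
A small Java sketch of why an on-the-fly line-ending conversion breaks such checks: the same text hashes differently once LF becomes CRLF. SHA-1 is used here only as an example; which digest the build actually uses is an assumption, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for comparing content hashes.
final class LineEndingChecksum {
    /** Returns the SHA-1 digest of the UTF-8 bytes of 'content' as lowercase hex. */
    static String sha1Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Comparing sha1Hex("a\nb\n") with sha1Hex("a\r\nb\r\n") gives two different values even though an editor would show identical text.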

Re: IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?

2020-09-21 Thread Dawid Weiss
For what it is worth, I would also be interested in answers to these questions. ;) On Mon, Sep 21, 2020, 19:08 Uwe Schindler wrote: > Hi all, hi Alan, > > I am currently rewriting some SpanQuery code to use IntervalQuery. Most of > the transformations can be done quite easily and it is also bett

Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-01 Thread Dawid Weiss
D. Binding. Dawid On Tue, Sep 1, 2020 at 10:21 PM Ryan Ernst wrote: > Dear Lucene and Solr developers! > > Sorry for the multiple threads. This should be the last one. > > In February a contest was started to design a new logo for Lucene > [jira-issue]. The initial attempt [first-vote] to call

Re: [VOTE] Lucene logo contest, here we go again

2020-09-01 Thread Dawid Weiss
I'm still in favor of the current logo (D), binding vote. Dawid - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: [VOTE] Lucene logo contest

2020-06-16 Thread Dawid Weiss
A is nice and modern... but I still like the current logo better, so for me it's "C". Dawid On Tue, Jun 16, 2020 at 12:08 AM Ryan Ernst wrote: > > Dear Lucene and Solr developers! > > In February a contest was started to design a new logo for Lucene [1]. That > contest concluded, and I am now (

Re: RamUsageEstimator hangs with AOT compilation

2020-01-06 Thread Dawid Weiss
Thank you - I filed an issue for this, we will investigate. https://issues.apache.org/jira/browse/LUCENE-9117 Dawid On Mon, Jan 6, 2020 at 11:39 PM Cleber Muramoto wrote: > > After generating a pre-compiled image lucene-core (8.3.0) with jaotc (JDK > 13.0.1), RamUsageEstimator class is never lo

Re: Limitations of StempelStemmer

2019-09-24 Thread Dawid Weiss
> You always pass "piwko" for stemming. I'm afraid that's not correct? You should *never* pass on piwko when stemming. :) Dawid

Re: Limitations of StempelStemmer

2019-09-10 Thread Dawid Weiss
Hi Maciej, Stempel uses a pretrained heuristic. You can find a longer description at [1] and [2]. The specific reason for the problems you mentioned may be the smaller training dictionary used for the version embedded in Lucene, I honestly don't know. If you need exact stemming/lemmatization then

Re: Slowness on Java 11 with Lucene 6

2019-07-30 Thread Dawid Weiss
> We have chosen G1GC for both Java 8 and Java 11 versions. It's not like we have answers for everything. ;) If it's the same GC on both and there is still a slowdown then something else may be causing it -- hard to tell without doing trial-and-error. There is a set of performance benchmarks; perh

Re: Static index, fastest way to do forceMerge

2018-12-18 Thread Dawid Weiss
k on it. > > Regards, > Jerven > On 11/30/18 12:01 PM, Dawid Weiss wrote: > > Just FYI: I implemented a quick and dirty PoC to see what it'd work > > like. Not much of a difference on my machine (since postings merging > > dominates everything else). Interesting prob

Re: RamUsageCrawler

2018-12-06 Thread Dawid Weiss
ashMap by simply counting the size of the > Node that is used for each entry, although given the dynamic nature of > these data structures (HashMap eg can use TreeNodes sometimes > depending on data distribution) it would be almost impossible to be > 100% accurate. > On Thu, Dec 6, 2018

Re: RamUsageCrawler

2018-12-06 Thread Dawid Weiss
> It's entirely possible it fails to dig into Maps correctly with newer Java > releases; maybe Dawid or Uwe would know? We have removed all reflection from that class a while ago exactly because of encapsulation issues introduced in newer Java versions. https://github.com/apache/lucene-solr/blob/

Re: RAMDirectory or Redis

2018-12-02 Thread Dawid Weiss
> We switched to ByteBuffersDirectory with 7.5, but I actually didn't see much performance improvements or savings in memory. Once the indexes are built I don't think there will be much of a difference. The core problem with RAMDirectory was related to synchronizations during merges/ file manipu

Re: Static index, fastest way to do forceMerge

2018-11-30 Thread Dawid Weiss
/jira/browse/LUCENE-8580 Dawid On Fri, Nov 2, 2018 at 10:17 PM Dawid Weiss wrote: > > Thanks for chipping in, Toke. A ~1TB index is impressive. > > Back of the envelope says reading & writing 900GB in 8 hours is > 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interfa

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Dawid Weiss
Thanks for chipping in, Toke. A ~1TB index is impressive. Back of the envelope says reading & writing 900GB in 8 hours is 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interface for our SSD machine, but even with SATA II this is only ~1/5th of the possible fairly sequential IO throughput. So f
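
The back-of-the-envelope figure can be reproduced as below. The one assumption is that 1 GB is taken as 1024 MB, which is what makes the result come out at 64; the class and method names are made up.

```java
// Hypothetical helper reproducing the estimate 2*900GB/(8*60*60s).
final class MergeThroughput {
    /** Average MB/s needed to read and write an index of 'indexGb' GB in 'hours' hours. */
    static double megabytesPerSecond(double indexGb, double hours) {
        double megabytesMoved = 2 * indexGb * 1024;   // factor 2: read once, write once
        double seconds = hours * 60 * 60;
        return megabytesMoved / seconds;
    }
}
```

For 900 GB and 8 hours this is 1,843,200 MB over 28,800 s, that is 64 MB/s.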

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Dawid Weiss
> int processors = Runtime.getRuntime().availableProcessors(); > ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler(); > cms.setMaxMergesAndThreads(processors,processors); See the number of threads in the CMS only matters if you have concurrent merges of independent segments. What you

Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Dawid Weiss
We are faced with a similar situation. Yes, the merge process can take a long time and is mostly single-threaded (if you're merging from N segments into a single segment, only one thread does the job). As Erick pointed out, the merge process takes a backseat compared to indexing and searches (in mo

Re: RamDirectory vs MemoryIndex vs MMapDirectory for In-Memory-Index

2018-09-25 Thread Dawid Weiss
Use MMapDirectory on a temporary location, Matthias. If you really need in-memory indexes, a new Directory implementation is coming (RAMDirectory will be deprecated, then removed), but the difference compared to MMapDirectory is typically not worth the hassle. See this issue for more discussion. h

Re: testing with system properties

2018-08-09 Thread Dawid Weiss
Erick already pointed you at the "cleanup" rule. This is fairly generic, but if you know the properties being modified you should still clean them up in @After or @AfterClass -- this is useful for other people to know that you're modifying them, if for nothing else. Randomized testing package has

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-03 Thread Dawid Weiss
> That helps and explains why there is no support in std api This isn't an API problem. This is by design -- this is how it works. If you wish to retrieve fields that are indexed and stored with the document, the API provides such an option (indexed and stored field type). Your indexed fields are

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Dawid Weiss
t. Actual indexed content would be same if both index have > "status" field indexed so we only need to validate fieldnames per > document. Something like > > Thanks for reading all this if you have read so far :) > > Chetan Mehrotra > [1] > https://github.com/apach

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Dawid Weiss
basis using the Lucene API? > Chetan Mehrotra > > > On Tue, Jan 2, 2018 at 1:03 PM, Dawid Weiss wrote: >> How about the quickest solution: dump the content of both indexes to a >> document-per-line text >> file, sort, diff? >> >> Even if your indexes are large,

Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-01 Thread Dawid Weiss
How about the quickest solution: dump the content of both indexes to a document-per-line text file, sort, diff? Even if your indexes are large, if you have large spare disk, this will be super fast. Dawid On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra wrote: > Hi, > > We use Lucene for indexin
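
A sketch of the sort-and-diff step in plain Java, assuming each index has already been dumped to one line of text per document. How the lines are produced from the index is not shown, and the class name and the "<" / ">" prefixes are choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: compares two document-per-line dumps regardless of line order.
final class IndexDumpDiff {
    /** Returns lines found in only one dump, prefixed "< " (left only) or "> " (right only). */
    static List<String> diff(List<String> left, List<String> right) {
        List<String> a = new ArrayList<>(left);
        List<String> b = new ArrayList<>(right);
        Collections.sort(a);
        Collections.sort(b);
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        // Walk both sorted lists in step, like a merge.
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                i++;
                j++;
            } else if (cmp < 0) {
                out.add("< " + a.get(i++));
            } else {
                out.add("> " + b.get(j++));
            }
        }
        while (i < a.size()) {
            out.add("< " + a.get(i++));
        }
        while (j < b.size()) {
            out.add("> " + b.get(j++));
        }
        return out;
    }
}
```

For dumps too large for memory, the same result comes from writing the lines to files and using the operating system's sort and diff tools, which is what the message proposes.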

Re: Binary Automaton

2017-09-30 Thread Dawid Weiss
for example, > be usefull in bioinformatic or all those cases where data is not a basic > ADT. > > Cristian > > 2017-09-30 12:24 GMT+02:00 Dawid Weiss : > >> > Hi , it is possible to create a Automaton in lucene parsing not a string >> > but a byte array?

Re: Binary Automaton

2017-09-30 Thread Dawid Weiss
> Hi , it is possible to create a Automaton in lucene parsing not a string > but a byte array? Can you state what problem you are trying to solve? This seems to be a question stripped of a more general context -- why do you need those byte-based automata? Dawid

Re: German decompounding/tokenization with Lucene?

2017-09-16 Thread Dawid Weiss
Hi Mike. Search lucene dev archives. I did write a decompounder with Daniel Naber. The quality was not ideal but perhaps better than nothing. Also, Daniel works on languagetool.org? They should have something in there. Dawid On Sep 16, 2017 1:58 AM, "Michael McCandless" wrote: > Hello, > > I ne

Re: Java 9 issues

2017-07-28 Thread Dawid Weiss
> it will be good if Lucene team can share their plans for a full java 9 > support (e.g. named modules of Lucene libraries) So, here it is: we plan to support it. (*) Dawid (*) When it's stabilized and documented (it still isn't) [1]. And when somebody has the time to do it (patches welcome, it'

Re: Highlighting and delineating Passages (fragmenting)

2017-05-30 Thread Dawid Weiss
https://issues.apache.org/jira/browse/SOLR-1105 Yes, this is spot-on what I need with regard to copyTo fields, thanks for the link! > Or are the overlaps coming from passage offset ranges from separate queries > to the same content? The overlaps are caused by the fact that we have multiple sour

Re: Highlighting and delineating Passages (fragmenting)

2017-05-30 Thread Dawid Weiss
> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3. > The UH can't currently do this; but with the OH (original Highlighter) you > can but it appears somewhat awkward. See SimpleSpanFragmenter. I had said > it was easy but I was mistaken; I'm getting rustier on the OH.

Re: Highlighting and delineating Passages (fragmenting)

2017-05-27 Thread Dawid Weiss
Thanks for your explanation, David. I actually found working with all Lucene highlighters pretty difficult. I have a few requirements which seemed deceptively simple: 1) highlight query hit regions (phrase, fuzzy, terms); 2) try to organise the resulting snippets to visually "center" the hit regi

Re: Automata and Transducer on Lucene 6

2017-04-19 Thread Dawid Weiss
> Dawid, the thing is that I am not even sure that Automata are the perfect > fit for my project and I thought some literature on it would help me decide > whether to use it or not. Still looks to me like you're approaching the problem from the wrong side or don't want to share the core problem wh

Re: Automata and Transducer on Lucene 6

2017-04-18 Thread Dawid Weiss
> One small correction: we moved away from objects to more compact int[] a > while ago for our automata implementation. Right, forgot about that. There are still some trappy object-heavy utilities like this one: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/luc

Re: Automata and Transducer on Lucene 6

2017-04-18 Thread Dawid Weiss
> I'd like to read something written by who designed these classes. What > motivated, usage examples, what it is good for and what it is not good for. > Maybe a history of the development of Automata on Lucene Are you looking for a historical book on Lucene development or are you looking to solve

Re: codec: accessing term dictionary

2017-03-10 Thread Dawid Weiss
Or you could encode those term/n-gram frequencies in one FST and then reuse it. This would be memory-saving and fairly fast (~comparable to a hash table). Dawid On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless wrote: > Yes, this is a reasonable way to use Lucene (to see terms statistics across

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-20 Thread Dawid Weiss
> PatriciaTrie. In particular building an FST with doShareSuffix = false is > the fastest of any option, If you don't share the suffix then you are building a kind of Patricia trie... But suffix sharing is cheap and can give you a memory saving (and resulting cache locality sometimes) that is non-

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Dawid Weiss
tate registry > more ram efficient too ... I think it's essentially the same thing as > the FST.Builder's NodeHash, just minus the outputs that FSTs have vs > automata. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Wed, Feb 15, 2017 a

Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)

2017-02-15 Thread Dawid Weiss
You could try using morfologik's byte-based implementation: https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java I can't guarantee it'll be fast enough -- you need to sort those input sequences and even thi

Re: question

2017-01-20 Thread Dawid Weiss
> But it is fairly trivially to tweak/extend the query parser to produce > diff behavior. I think the conclusion for the original poster should be that there's really not enough information to provide a definite answer. Lucene is a search engine. Much like with a mechanical engine, its final appli

Re: Query parser and default operator

2016-11-10 Thread Dawid Weiss
curl -s 'localhost:9200/test/_search?pretty' -d '{ "query": { > "query_string": { "query": "foo AND bar OR baz" , "default_operator": "or" > } } , "profile" : true}' | grep luce > "lucene"

Re: Query parser and default operator

2016-11-09 Thread Dawid Weiss
Which Lucene version and which query parser is this? Can you provide a test case/ code sample? I just tried with StandardQueryParser and for: sqp.setDefaultOperator(StandardQueryConfigHandler.Operator.AND); dump(sqp.parse("foo AND bar OR baz", "field_a")); sqp.setDefaultOpe

Re: How can I get the term positions from a query?

2016-09-29 Thread Dawid Weiss
There are multiple Highlighter implementations for this purpose. Check them out -- I'm sure one of them will suit your needs. In fact, there's a new highlighter implemented very recently! Check out this JIRA issue: https://issues.apache.org/jira/browse/LUCENE-7438 Dawid On Fri, Sep 30, 2016 at 8

Re: Levenshtein FST's?

2016-05-29 Thread Dawid Weiss
> I think I see this now, and how skipping determinization and matching with > the NFA could easily leave you with an intractable amount of backtracking > for even the simpler binary question of does my input match any of the > automatons I've unioned. Note that with NFAs you may answer the questi

Re: Levenshtein FST's?

2016-05-28 Thread Dawid Weiss
> Point taken, but I wonder if there's an algorithmic shortcut to determinize > the union of Levenshtein DFAs... Levenshtein DFA is an automaton like any other; when you merge two such automata, they will very likely contain states that need to be merged (and their transition split) in order to be

Re: Re: Why Two Levels of Indirection in BytesRefHash class ?

2016-05-09 Thread Dawid Weiss
You could try to implement this refactoring, which would combine linear storage of values (without the need to save the length of each key explicitly) with their incremental addition order. https://issues.apache.org/jira/browse/LUCENE-5854 The outcome may or may not be faster in practice (due to

Re: Lucene indexing throughput (and Mike's lucenebench charts)

2016-04-14 Thread Dawid Weiss
The GC change is after this: BJ (2015-12-02): Upgrade to beast2 (72 cores, 256 GB RAM) which leads me to believe these results are not comparable (different machines, architectures, disks, CPUs perhaps?). Dawid On Thu, Apr 14, 2016 at 7:13 PM, Otis Gospodnetić wrote: > Hi, > > I was looking a

Re: IndexWriter.addIndexes with LeafReader parameter

2016-01-12 Thread Dawid Weiss
You can addIndexes(Directory... dirs) -- then you don't have to deal with CodecReader? Dawid On Tue, Jan 12, 2016 at 4:43 PM, Manner Róbert wrote: > Hi, > > we have used lucene 4.7.0 before, we are on the way to upgrade to 5.4.0. > > The problem I have is that writer.addIndexes now needs CodecRe

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Dawid Weiss
> LowerCaseFilter will not handle that. So whereas it is "safe" for > English hard-coded strings, it isn't safe for all fields you might > index in general. This filter is a "safe" fallback that works identically regardless of the locale you have on your computer (or on the server). This, I believ

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Dawid Weiss
unt(Character.toLowerCase(cp)); if (c1 != c2 || c1 != c3) { System.out.println(String.format(Locale.ROOT, "%d %d %d", c1, c2, c3)); } } D. On Thu, Oct 22, 2015 at 10:15 AM, Dawid Weiss wrote:

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Dawid Weiss
I think the issue here is what happens if an "uppercase" codepoint requires a surrogate pair and the lowercase counterpart does not -- then the index variable would indeed be screwed. Dawid On Thu, Oct 22, 2015 at 10:05 AM, Uwe Schindler wrote: > Hi, > > > Setting aside the fact that Character.
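
The condition described here can be checked directly with JDK methods. The sketch below scans all code points and counts those whose lowercase form needs a different number of UTF-16 chars; the count depends on the Unicode tables of the running JDK, so no value is stated. The class name is made up.

```java
// Hypothetical check for case mappings that change the UTF-16 width of a code point.
final class CaseMappingWidth {
    /** Counts code points whose lowercase mapping occupies a different number of chars. */
    static int countWidthChanges() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            int lower = Character.toLowerCase(cp);
            // charCount is 1 for BMP code points and 2 for supplementary ones.
            if (Character.charCount(cp) != Character.charCount(lower)) {
                count++;
            }
        }
        return count;
    }
}
```

A result of zero would mean an in-place lowercase pass can never shift the index on that JDK; any other result would confirm the concern.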

Re: Why do the Japanese analyser FST files change every release?

2015-08-06 Thread Dawid Weiss
It is (b). D. On Fri, Aug 7, 2015 at 3:05 AM, Trejkaz wrote: > I have recently done updates from Lucene 3.6 to 4.x and 4.x to 5.2. > > During this process, I noticed that the FST used by the Japanese > analyser (AKA Kuromoji) was changing between releases. As I fear > breakages in backwards comp

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Dawid Weiss
> Otherwise, it violates the Liskov substitution principle as well. Sadly it also violates the Heisenberg's principle at the bit state energy levels. We're working on improving that. From your heated comments I think you should switch the language to something that guarantees immutability of any

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Dawid Weiss
> BytesRef is not different, because it is just a "reference" to pass around. > And cloning a reference for sure should not clone the target of the > reference. You are "cloning" the reference and only that (as the name of the > class says: Bytes*Ref*)! Exactly. It is a reference and as such, c
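
The difference between copying a reference and copying its target can be sketched with a small stand-in class. ByteSlice below is hypothetical and is not Lucene's BytesRef; it only mirrors the bytes/offset/length shape to show the two kinds of copy.

```java
import java.util.Arrays;

// Hypothetical stand-in for a "reference to a slice of a byte array".
final class ByteSlice {
    final byte[] bytes;
    final int offset;
    final int length;

    ByteSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    /** Copies only the reference: the new slice shares the same underlying array. */
    ByteSlice shallowCopy() {
        return new ByteSlice(bytes, offset, length);
    }

    /** Copies the referenced bytes into a fresh array owned by the new slice. */
    ByteSlice deepCopy() {
        return new ByteSlice(Arrays.copyOfRange(bytes, offset, offset + length), 0, length);
    }
}
```

A later write to the shared array is visible through the shallow copy but not through the deep one, which is the behaviour being debated in this thread.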

Re: BytesRef violates the principle of least astonishment

2015-05-20 Thread Dawid Weiss
Yes, BytesRef can be surprising. No, it probably won't change in Lucene to comply with superb design principles. Yes, the odd design is there for performance reasons and it does provide noticeable gain. Perhaps you could file a JIRA issue to improve the documentation, this would be helpful. For wh

Re: [ANNOUNCE] Apache Lucene 5.0.0 released

2015-02-20 Thread Dawid Weiss
Thanks for contributing time to the release, Anshum. Dawid On Fri, Feb 20, 2015 at 10:16 PM, Anshum Gupta wrote: > Sure, I'll fix that on the wiki. Thanks for pointing that out Uwe. > > On Fri, Feb 20, 2015 at 1:10 PM, Uwe Schindler wrote: > >> Many thanks! :-) Nice work! >> >> I found a small

Re: BTRFS ?

2014-12-23 Thread Dawid Weiss
> This could speed up tests, especially Solr where some dirs are copied over > and over for every test case. :-) A wild idea, but since there's NIO everywhere now you could use an in-memory filesystem for tests and avoid going to disk entirely :D https://github.com/google/jimfs Dawid

Re: BTRFS ?

2014-12-22 Thread Dawid Weiss
.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Dawid Weiss [mailto:dawid.we...@gmail.com] >> Sent: Monday, December 22, 2014 8:48 AM >> To: java-user@lucene.apache.org >> Cc: Uwe Schindler >> Subject: Re: BTRFS ?

Re: BTRFS ?

2014-12-21 Thread Dawid Weiss
> I spotted Uwe's comment in JIRA the other day "BTRFS, which might also > bring some cool things for Lucene.". What cool things about BTRFS are you talking about, Uwe? Just curious. Dawid

Re: Does StandardTokenizer remove punctuation (in Lucene 4.1)

2014-10-01 Thread Dawid Weiss
Hi Steve, I have to admit I also find it frequently useful to include punctuation as tokens (even if it's filtered out by subsequent token filters for indexing, it's a useful to-have for other NLP tasks). Do you think it'd be possible (read: relatively easy) to create an analyzer (or a modificatio

Re: JLemmaGen project

2013-11-04 Thread Dawid Weiss
Hi Michal, Pretty cool. Your work reminds me of what Leo Galambos did a while back: http://link.springer.com/chapter/10.1007/978-3-540-39985-8_22 I believe his implementation is still available in the Egothor search engine project. Dawid On Wed, Oct 23, 2013 at 5:17 PM, Michal Hlavac wrote:

Re: Lucene vs Glimpse

2013-02-05 Thread Dawid Weiss
Here's another thought: if you desperately need complex searches then you could do a heuristic filtering to narrow down the search: use an analyzer that does some form of input splitting into terms (removing excess whitespace or even producing n-grams from the input), then do the same for the query
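
The heuristic filtering idea above can be sketched as follows. This is a minimal illustration, not the approach from the original message, which is cut off: the class name `TrigramPrefilter`, the choice of character trigrams, and the in-memory postings map are all assumptions made for the example. Documents and queries go through the same normalization (whitespace collapsing, lowercasing, trigram splitting), and the candidate set is the intersection of the postings of the query's trigrams. The candidates are a superset of the true substring matches, so a slower exact check would presumably run only on them.

```java
import java.util.*;

// Hypothetical helper, not part of Lucene: narrows a search with character trigrams.
final class TrigramPrefilter {
    private final List<String> docs = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Normalizes the input (collapse whitespace, lowercase) and returns its trigrams. */
    static Set<String> grams(String s) {
        String norm = s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 3 <= norm.length(); i++) {
            out.add(norm.substring(i, i + 3));
        }
        return out;
    }

    /** Indexes a document under every trigram it contains. */
    void add(String doc) {
        int id = docs.size();
        docs.add(doc);
        for (String g : grams(doc)) {
            postings.computeIfAbsent(g, k -> new HashSet<>()).add(id);
        }
    }

    /** Ids of documents that contain every trigram of the query. */
    Set<Integer> candidates(String query) {
        Set<Integer> result = null;
        for (String g : grams(query)) {
            Set<Integer> ids = postings.getOrDefault(g, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        if (result == null) {
            // Query shorter than three characters: nothing to filter on.
            result = new TreeSet<>();
            for (int i = 0; i < docs.size(); i++) {
                result.add(i);
            }
        }
        return result;
    }
}
```

In a real Lucene setup the same effect would more likely come from an n-gram analyzer at index and query time; the map above only stands in for the index.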

Re: Japanese analyzer

2013-01-18 Thread Dawid Weiss
Jerome, Some of the tokens are removed because their part of speech tags are in the stoptags file? That's my guess at least -- you can always try to copy/paste Japanese analyzer and change the token stream components: protected TokenStreamComponents createComponents(String fieldName, Reader rea

Re: Suggesters: circumfix suggestions

2013-01-16 Thread Dawid Weiss
> Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and > then have a separate search index with "boston red sox rumor" indexed > as a document. If the user types "red so", then you run suggest on > "red" and on "so", and then run a hmm MultiPhraseQuery for > (red|redmond|reddit)

Re: Difference in behaviour between LowerCaseFilter and String.toLowerCase()

2012-12-01 Thread Dawid Weiss
Iterating character by character is different from considering the entire string at once, so your observation is correct: that's how it's supposed to work. In particular, note this in the String#toLowerCase documentation: "Since case mappings are not always 1:1 char mappings, the resulting String may b
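
The difference can be reproduced with the JDK alone. The sketch below is an illustration rather than Lucene code: `perCodePoint` is a hypothetical helper that lowercases one code point at a time, which is assumed here to approximate what a per-character filter does. The sample input is U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE), whose full lowercase mapping is two chars long.

```java
import java.util.Locale;

// Hypothetical demo class; only JDK calls are used.
final class LowerCaseComparison {
    /** Lowercases each code point in isolation, so 1:n mappings cannot apply. */
    static String perCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().map(Character::toLowerCase).forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String dottedI = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
        String whole = dottedI.toLowerCase(Locale.ROOT); // "i" followed by U+0307
        String perChar = perCodePoint(dottedI);          // plain "i"
        System.out.println(whole.length() + " vs " + perChar.length());
        System.out.println(whole.equals(perChar));
    }
}
```

The whole-string call can grow the string, while the per-code-point loop always keeps one output code point per input code point, which is why the two results may differ.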

Re: WFST/Analyzing Suggesters: foreign keys, user-supplied filter, highlighting

2012-10-30 Thread Dawid Weiss
> https://issues.apache.org/jira/browse/LUCENE-4491 ? Could you simply > stuff your ISBN onto the end of the suggestion (ie enroll Lucene in > Action|1933988177)? Just remember that if your suffixes are unique then you'll be expanding the automaton quite a bit (unique suffix paths). D.

Re: Efficient string lookup using Lucene

2012-08-26 Thread Dawid Weiss
> The WhitespaceAnalyzer breaks up text by spaces and tabs and newlines. > After that, you can use wildcards. This will use very little space. I > believe leading and trailing wildcards are supported now, right? If leading wildcards take too much time (don't know, really) then one could also try to index

Re: Efficient string lookup using Lucene

2012-08-26 Thread Dawid Weiss
> Does Lucene support this type of structure, or do I need to somehow implement > it outside Lucene? You'd have to implement it separately but it'd be much, much smaller than Lucene itself (even obfuscated). > By the way, I need this to run on an Android phone so size of memory might be > an is

Re: Efficient string lookup using Lucene

2012-08-24 Thread Dawid Weiss
What you need is a suffix tree or a suffix array. Both data structures will allow you to perform constant-time searches for existence/occurrence of any input pattern. Depending on how much text you have on the input it may either be a simple task -- see here: http://labs.carrotsearch.com/jsuffixa
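
For small inputs, a suffix array can be sketched in a few lines. The class below is a naive illustration under stated assumptions, not the library linked above: `NaiveSuffixArray` is a hypothetical name, construction simply sorts all suffixes (far slower than dedicated algorithms), and lookup is a binary search, so its cost grows with the pattern length and the logarithm of the text length rather than with the text length itself.

```java
import java.util.Arrays;

// Hypothetical, naive suffix array; suitable only for small texts.
final class NaiveSuffixArray {
    private final String text;
    private final Integer[] suffixes; // start offsets, sorted by the suffix they denote

    NaiveSuffixArray(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction: compare full suffixes as strings.
        Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** Returns true if the pattern occurs anywhere in the text. */
    boolean contains(String pattern) {
        if (pattern.isEmpty()) {
            return true;
        }
        int lo = 0, hi = suffixes.length;
        while (lo < hi) { // lower bound: first suffix that is >= pattern
            int mid = (lo + hi) >>> 1;
            if (compareSuffix(suffixes[mid], pattern) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < suffixes.length && text.startsWith(pattern, suffixes[lo]);
    }

    /** Lexicographic comparison of the suffix starting at 'start' with the pattern. */
    private int compareSuffix(int start, String pattern) {
        int n = Math.min(text.length() - start, pattern.length());
        for (int i = 0; i < n; i++) {
            int d = text.charAt(start + i) - pattern.charAt(i);
            if (d != 0) {
                return d;
            }
        }
        return (text.length() - start) - pattern.length();
    }
}
```

For example, `new NaiveSuffixArray("banana").contains("nan")` finds the match by landing on the suffix "nana" in the sorted order.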

Re: Memory leak?? with CloseableThreadLocal with use of Snowball Filter

2012-08-01 Thread Dawid Weiss
http://static1.blip.pl/user_generated/update_pictures/1758685.jpg On Thu, Aug 2, 2012 at 8:32 AM, roz dev wrote: > wow!! That was quick. > > Thanks a ton. > > > On Wed, Aug 1, 2012 at 11:07 PM, Simon Willnauer > wrote: > >> On Thu, Aug 2, 2012 at 7:53 AM, roz dev wrote: >> > Thanks Robert for th

Re: RAM or SSD...

2012-07-19 Thread Dawid Weiss
Read this: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Dawid On Thu, Jul 19, 2012 at 1:32 PM, Dragon Fly wrote: > > The slowest part of my application is to read the search hits from disk. I > was hoping that using an SSD or RAMDirectory/MMapDirectory would speed th

Re: RAM or SSD...

2012-07-18 Thread Dawid Weiss
> Why anyone buys computers without SSD's is a mystery to me. Use SSDs for On topic and highly recommended: http://www.youtube.com/watch?v=H7PJ1oeEyGg Dawid

Re: RAM or SSD...

2012-07-18 Thread Dawid Weiss
> Rum is an essential ingredient in all software systems :-) You probably meant "social systems". D.
