Re: Format of Wikipedia Index

2018-01-22 Thread Will Martin
From the javadoc for DocMaker: * *doc.stored* - specifies whether fields should be stored (default *false*). * *doc.body.stored* - specifies whether the body field should be stored (default = *doc.stored*). So ootb you won't get content stored. Does this help? regards -will On

Re: Explain Scoring function in LMJelinekMercerSimilarity Class

2016-12-20 Thread Will Martin
https://doi.org/10.3115/981574.981579 On 12/20/2016 12:21 PM, Dwaipayan Roy wrote: Hello, Can anyone help me understand the scoring function in the LMJelinekMercerSimilarity class? The scoring function in LMJelinekMercerSimilarity is shown below: -
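For readers following the question quoted above, here is a minimal sketch of Jelinek-Mercer smoothed scoring in plain Java. It assumes the usual rank-equivalent form, in which the smoothed probability (1 - lambda) * tf/|d| + lambda * P(t|C) is divided by lambda * P(t|C) before taking the log, so a term absent from the document scores 0. The class name, method name, and sample numbers are illustrative only, and the query boost is left out; check the actual class source before relying on this.

```java
// Sketch only: plain Java with no Lucene dependency. All names are illustrative.
class JelinekMercerSketch {

    /**
     * @param freq           term frequency in the document
     * @param docLen         document length in tokens (must be > 0)
     * @param collectionProb P(term | collection), must be > 0
     * @param lambda         smoothing weight in (0, 1]
     */
    static double score(double freq, double docLen, double collectionProb, double lambda) {
        // Document-model part, weighted by (1 - lambda).
        double docPart = (1.0 - lambda) * freq / docLen;
        // Collection-model part, weighted by lambda.
        double collectionPart = lambda * collectionProb;
        // log((docPart + collectionPart) / collectionPart) == log(1 + docPart / collectionPart)
        return Math.log(1.0 + docPart / collectionPart);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a term seen 3 times in a 100-token document.
        System.out.println(score(3, 100, 0.001, 0.7));
        // A term that does not occur in the document contributes nothing.
        System.out.println(score(0, 100, 0.001, 0.7));
    }
}
```

A larger lambda leans harder on the collection model, which flattens the differences between documents.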

Re: Multi-field IDF

2016-11-18 Thread Will Martin
ver "or" is, perhaps, not so usual in titles. Then, "or" will have a high IDF value and be treated as an important term. That's bad. One solution I see is to modify the Similarity to have a global, or multi-field IDF value. This value would include in its calculation longer

Re: Multi-field IDF

2016-11-17 Thread Will Martin
n be bad for very short fields (like titles). One example of this problem: If I don't delete stop words, then "or", "and", etc. should be dealt with low IDF values, however "or" is, perhaps, not so usual in titles. Then, "or" will have a high IDF value
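The problem and the proposed fix quoted in the two Multi-field IDF entries above can be illustrated with a small plain-Java sketch. It assumes a classic-style IDF of the form 1 + ln(N / (df + 1)); the document counts are invented, and treating "global" document frequency as the number of documents containing the term in any field is an assumption, since the quoted text is cut off before it defines the calculation.

```java
// Sketch only: hypothetical counts, no Lucene dependency.
class MultiFieldIdfSketch {

    /** Classic-style inverse document frequency: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        long orInTitles = 5;      // assumed: "or" is rare in the short title field
        long orInAnyField = 900;  // assumed: "or" occurs in most documents overall

        // Per-field IDF makes the stop word look important in titles.
        System.out.println("title-only IDF:  " + idf(orInTitles, numDocs));
        // A global (multi-field) document frequency pulls it back down.
        System.out.println("multi-field IDF: " + idf(orInAnyField, numDocs));
    }
}
```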

Re: Searching in a bitMask

2016-08-27 Thread will martin
hi aren’t we waltzing terribly close to the use of a bit vector in your field caches? there’s no reason to not filter longword operations on a cache if alignment is consistent across multiple caches just be sure to abstract your operations away from individual bits….imo -will > On Aug
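A rough sketch of what "longword operations on a cache" could look like, assuming each document's bit mask has already been loaded into a long[] indexed by document number. The cache layout and all names here are hypothetical and not taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the long[] stands in for a per-document field cache.
class BitMaskFilterSketch {

    /** Returns the doc ids whose mask has every bit of {@code required} set. */
    static List<Integer> matchAll(long[] masksByDoc, long required) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < masksByDoc.length; doc++) {
            // One 64-bit AND per document, with no per-bit loop.
            if ((masksByDoc[doc] & required) == required) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        long[] cache = {0b0101L, 0b0111L, 0b0001L};
        System.out.println(matchAll(cache, 0b0101L));
    }
}
```

Keeping the test behind a method like matchAll is one way to keep callers away from individual bits, as the reply suggests.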

Re: how to backup index files with Replicator

2016-01-23 Thread will martin
() can trigger a commit. hmmm thread: http://grokbase.com/t/lucene/java-user/143dsnrxh8/replicator-how-to-use-it -will > On Jan 23, 2016, at 4:39 AM, Dancer <462921...@qq.com> wrote: > > Hi, > h

Re: SolrIndexSearcher throws Misleading Error Message When timeAllowed is Specified.

2016-01-08 Thread will martin
Please read the javadoc for System.nanoTime(). I won’t bore you with the details about how computer clocks work. > On Jan 8, 2016, at 4:14 AM, Vishnu Mishra wrote: > > I am using Solr 5.3.1 and we are facing OutOfMemory exception while doing > some complex wildcard and proximity query (even fo
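The point of the System.nanoTime() javadoc, as far as it applies to a time budget, is that the returned values are only meaningful as differences and may wrap around, so elapsed time should be computed by subtraction rather than by comparing two readings directly. A minimal sketch follows; the method and class names are made up, and nothing here is taken from Solr's code.

```java
// Sketch only: shows the overflow-safe comparison described in the nanoTime javadoc.
class NanoTimeBudgetSketch {

    /** True when more than {@code budgetNanos} have elapsed between the two readings. */
    static boolean budgetExceeded(long startNanos, long nowNanos, long budgetNanos) {
        // Subtract first: the difference stays correct even if the counter wrapped.
        return nowNanos - startNanos > budgetNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(5);
        long now = System.nanoTime();
        // The sleep is at least 5 ms, so a 1 ms budget has been exceeded.
        System.out.println(budgetExceeded(start, now, 1_000_000L));
    }
}
```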

Re: Any lucene query sorts docs by Hamming distance?

2015-12-24 Thread will martin
m distance 0 to 3. > > 2015-12-22 21:42 GMT+08:00 will martin : > >> Yonghui: >> >> Do you mean sort, rank or score? >> >> Thanks, >> Will >> >> >> >>> On Dec 22, 2015, at 4:02 AM, Yonghui Zhao wrote: >>> >&

Re: range query highlighting

2015-12-23 Thread will martin
Todd: "This trick just converts the multi term queries like PrefixQuery or RangeQuery to boolean query by expanding the terms using index reader." http://stackoverflow.com/questions/7662829/lucene-net-range-queries-highlighting beware cost. (my comment) g’luck will > On Dec 2

Re: Any lucene query sorts docs by Hamming distance?

2015-12-22 Thread will martin
Yonghui: Do you mean sort, rank or score? Thanks, Will > On Dec 22, 2015, at 4:02 AM, Yonghui Zhao wrote: > > Hi, > > Is there any query that can sort docs by Hamming distance if field values are > same length, > > Seems fuzzy query onl
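To make the "sort" reading of the question concrete, here is a plain-Java sketch that orders equal-length field values by Hamming distance to a query value. It works on an in-memory list rather than a Lucene query or collector, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: in-memory sorting, not a Lucene query or collector.
class HammingSortSketch {

    /** Number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("values must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Returns a copy of {@code values} ordered by ascending distance to {@code query}. */
    static List<String> sortByDistance(List<String> values, String query) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingInt((String v) -> hamming(v, query)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        values.add("1111");
        values.add("0000");
        values.add("0001");
        System.out.println(sortByDistance(values, "0000"));
    }
}
```

If the values were fixed-width binary fingerprints stored as longs, Long.bitCount(a ^ b) would give the same distance in a single operation.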

Re: Jensen–Shannon divergence

2015-12-14 Thread will martin
>> On Sun, Dec 13, 2015 at 8:30 AM, Shay Hummel >> wrote: >> >>> Hi >>> >>> I need help to implement similarity between query model and document >> model. >>> I would like to use the JS-Divergence >>> <https://en.wikipedia.org/

Re: Jensen–Shannon divergence

2015-12-13 Thread will martin
g'luck > On Dec 13, 2015, at 10:55 AM, Shay Hummel wrote: > > Hi > > I am sorry but I didn't understand your answer. Can you please elaborate? > > Shay > > On Sun, Dec 13, 2015 at 3:41 PM will martin wrote: > >> expand your due d

Re: Jensen–Shannon divergence

2015-12-13 Thread will martin
expand your due diligence beyond wikipedia: i.e. http://ciir.cs.umass.edu/pubfiles/ir-464.pdf > On Dec 13, 2015, at 8:30 AM, Shay Hummel wrote: > > LMDiricletbut its feasibilit
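For reference, the divergence asked about in this thread can be sketched in a few lines of plain Java. The sketch assumes the query model and the document model have already been turned into probability distributions over the same vocabulary (arrays of equal length that each sum to 1); how those models are estimated and smoothed is outside the sketch, and the names are illustrative.

```java
// Sketch only: natural-log Jensen-Shannon divergence between two discrete distributions.
class JensenShannonSketch {

    /** Kullback-Leibler divergence KL(p || q); terms with p[i] == 0 contribute 0. */
    static double kl(double[] p, double[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                sum += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return sum;
    }

    /** JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint distribution. */
    static double jsd(double[] p, double[] q) {
        double[] m = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            m[i] = 0.5 * (p[i] + q[i]);
        }
        // m[i] is positive wherever p[i] or q[i] is, so kl never divides by zero here.
        return 0.5 * kl(p, m) + 0.5 * kl(q, m);
    }

    public static void main(String[] args) {
        double[] queryModel = {0.5, 0.5, 0.0};
        double[] docModel = {0.25, 0.25, 0.5};
        System.out.println(jsd(queryModel, docModel));
    }
}
```

With natural logarithms the value lies between 0 (identical distributions) and ln 2 (distributions with no overlap), so it needs inverting or negating before use as a ranking score.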

Re: debugging growing index size

2015-11-13 Thread will martin
/201509.mbox/%3c55f0461a.2070...@gmail.com%3E hth -will > On Nov 13, 2015, at 11:23 AM, Rob Audenaerde wrote: > > I'm currently running using NIOFS. It seems to prevent the issue from > appearing. > > This is a second run (with applied deletes etc) > > rauden

Re: index size growing while deleting

2015-11-05 Thread will
Hi Rob: Do you understand how deletes work and how an index is compacted? There are some configuration/runtime activities you don't mention, and you make the testing process sound like a mirror of production? (Including configuration?) -will On 11/5/15 7:33 AM, Rob Audenaerde wrot

Re: Two different types of values in same field name in single index

2015-10-27 Thread will
Kumaran - Aren't you creating an unworkable scenario for sorting? -will On 10/27/15 5:49 AM, Kumaran Ramasubramanian wrote: Hi All, i have indexed module wise data in same index. In this case, we index two types of field in same name in two different document like this. *docu

Re: need help in search

2015-10-05 Thread will
Hi Bhaskar: For everyone's benefit, I hope you will collate the emails into a wiki page and carry it forward. Meritocracy's might have rtfm'd the whole thing. With all respect: Will On 10/5/15 1:06 PM, Bhaskar wrote: Hi, Actually I am looking for auto complete only.

RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-30 Thread will martin
call IndexReader.checkIntegrity. Mike McCandless http://blog.mikemccandless.com On Tue, Sep 29, 2015 at 9:00 PM, will martin wrote: > Ok So I'm a little confused: > > The 4.10 JavaDoc for LiveIndexWriterConfig supports volatile access on > a flag to setCheckIntegrityAtMerge ..

RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
rom the runtime system. The file system is EMC Isilon via NFS. Jim From: will martin Sent: 29 September 2015 14:29 To: java-user@lucene.apache.org Subject: RE: Lucene 5 : any merge performance metrics compared to 4.x? This sounds robust. Is the index

RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
This sounds robust. Is the index batch creation workflow a separate process? Distributed shared filesystems? --will -Original Message- From: McKinley, James T [mailto:james.mckin...@cengage.com] Sent: Tuesday, September 29, 2015 2:22 PM To: java-user@lucene.apache.org Subject: Re

RE: Lucene 5 : any merge performance metrics compared to 4.x?

2015-09-29 Thread will martin
So, if it's new, it adds to pre-existing time? So it is a cost that needs to be understood I think. And, I'm really curious, what happens to the result of the post merge checkIntegrity IFF (if and only if) there was corruption pre-merge: I mean if you let it merge anyway could you get a false

RE: Solr java.lang.OutOfMemoryError: Java heap space

2015-09-28 Thread will martin
http://opensourceconnections.com/blog/2014/07/13/reindexing-collections-with-solrs-cursor-support/ -Original Message- From: Ajinkya Kale [mailto:kaleajin...@gmail.com] Sent: Monday, September 28, 2015 2:46 PM To: solr-u...@lucene.apache.org; java-user@lucene.apache.org Subject: Solr jav

RE: hello,I have a problem about lucene,please help me to explain ,thank you

2015-09-22 Thread will martin
Hi: Would you mind doing websearch and cataloging the relevant pages into a primer? Thx, Will -Original Message- From: 王建军 [mailto:jianjun200...@163.com] Sent: Tuesday, September 22, 2015 4:02 AM To: java-user@lucene.apache.org Subject: hello,I have a problem about lucene,please help me

RE: A really hairy token graph case

2014-10-24 Thread Will Martin
lemma2 PI 0 lemmaN PI 0 comp0-1 PI 0 comp1-1 PI 0 comp0-N compM-N That is, group all the first-components, and all the second-components. But now the bits and pieces of the compounds are interspersed. Maybe that's OK. On Fri, Oct 2

RE: A really hairy token graph case

2014-10-24 Thread Will Martin
HI Benson: This is the case with n-gramming (though you have a more complicated start chooser than most I imagine). Does that help get your ideas unblocked? Will -Original Message- From: Benson Margulies [mailto:bimargul...@gmail.com] Sent: Friday, October 24, 2014 4:43 PM To: java

Re: Lucene Challenge - sum, count, avg, etc.

2010-04-01 Thread Will Johnson
Hi Michel, You can do all of this with Lucene, though not with the standard index and query operators. At Attivio we have a custom Lucene index structure + custom query operators that support relational joins across records in an index. You can write the queries in our standard query language or run

Re: Search query problem

2010-01-08 Thread Will Murnane
On Fri, Jan 8, 2010 at 16:27, Jamie wrote: > Hi Ian / Will > > Thanks. Surely, the Porter Stemmer should not stem proper nouns. i.e. it > could check the capitalization of the first letter of a word and whether or > not the word is the start of a sentence. If so, it could

Re: Search query problem

2010-01-08 Thread Will Murnane
result = new PorterStemFilter(result); PorterStemFilter is changing Lowe to low. Change your tokenizer so that Lowe's is tokenized as a single token, and that should avoid it. Will - To unsubscribe, e-mail: java-user-unsubscr.

Re: Split single string into several fields?

2009-10-27 Thread Will Murnane
> which means you have to analyze it. > > > I think Will is suggesting that he doesn't want to have to analyze it > *again* - > if he really has different fields for every tag type, it would get > prohibitively > expensive in terms of Indexing CPU usage to retokenize over an

Re: Split single string into several fields?

2009-10-27 Thread Will Murnane
ng like this: Document doc = new Document(); doc.add(new Field("h1", "hello\0world")); doc.add(new Field("alltext", "hello\0world\0goodnight\0moon")); I think that makes sense. Comments? Will > > HTH > Erick > > > On Tue, Oct 27, 2009 at

Split single string into several fields?

2009-10-27 Thread Will Murnane
hat's the best way to approach this? My initial thought is to make some kind of MultiAnalyzer that consumes the text and produces several token streams, which are added to the document one at a time. Is that a reasonable strategy? Thanks! Will

RE: Postcode/zipcode search

2008-05-06 Thread Will Johnson
do fuzzy search ie post1:NW10 post2:7?Y and so on. - will -Original Message- From: Chris Mannion [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 06, 2008 12:28 PM To: java-user@lucene.apache.org Subject: Postcode/zipcode search Hi all I've got a bit of a niggling problem with how one of

RE: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Will Johnson
' which are probably the ones you would want anyways. - will -Original Message- From: Duan, Nick [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 04, 2008 2:29 PM To: java-user@lucene.apache.org Subject: RE: Why indexing database is necessary? (RE: indexing database) Hmm, I guess t

RE: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Will Johnson
t to say that a search engine is always better, just that it often is when the inputs and outputs are carefully defined. - will -Original Message- From: Darren Hartford [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 04, 2008 1:52 PM To: java-user@lucene.apache.org Subject: RE: Why in

Re: Postal Code Radius Search

2007-08-29 Thread Will Johnson
a CustomScoreQuery combined with a FieldCacheSource that holds the lat/lon might work. - will On Aug 29, 2007, at 11:15 AM, Mike wrote: I've searched the mailing list archives, the web, read the FAQ, etc and I don't see anything relevant so here it goes… I'm trying
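Whatever supplies the cached lat/lon values, the radius test itself comes down to a great-circle distance. The sketch below uses the haversine formula, which is a choice made here and is not named in the thread, and it assumes a mean Earth radius of 6371 km. It is plain Java and does not touch CustomScoreQuery or FieldCacheSource.

```java
// Sketch only: haversine distance, independent of any Lucene class.
class RadiusSearchSketch {

    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean radius

    /** Great-circle distance in kilometres between two points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_KM * c;
    }

    /** True when the second point lies within {@code radiusKm} of the first. */
    static boolean withinRadius(double lat1, double lon1, double lat2, double lon2, double radiusKm) {
        return distanceKm(lat1, lon1, lat2, lon2) <= radiusKm;
    }

    public static void main(String[] args) {
        // Hypothetical coordinates, roughly New York and Boston.
        System.out.println(distanceKm(40.71, -74.01, 42.36, -71.06));
    }
}
```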

Re: function query - get DocValues

2007-08-27 Thread Will Johnson
; System.out.println(q.getDocValues().getMinValue()); - will On Aug 24, 2007, at 5:17 PM, Grant Ingersoll wrote: Can you provide more details on what you are trying to do? Are you trying to collect information from the FunctionQuery after it is done? -Grant On Aug 24, 2007, at 5:03 PM

Re: function query - get DocValues

2007-08-24 Thread Will Johnson
at a basic level yes, just getting the avg/min/max from a function query would be awesome. once that is in place getting more complex stats would be gravy. i need to do something in this area i just want to know if there is something more fundamental that i'm working against. - will O

function query - get DocValues

2007-08-24 Thread Will Johnson
resting to anyone other than me? - will - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

RE: TermEnum - previous() method ?

2007-07-20 Thread Will Johnson
ard only mode and just reverse out the strings on the display side. This method makes a number of assumptions about index size constraints, character sets; ie ymmv. - will -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Friday, July 20, 2007 10:05

RE: question about lucene

2007-06-01 Thread Will Johnson
Solr, which is built on top of lucene and adds highlighting among other features, gets close to what you want. Check out: http://wiki.apache.org/solr/HighlightingParameters - will -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Friday, June 01, 2007 8:57 AM To

Re: Doubt in FuzzyQuery

2007-05-03 Thread Stefan Will
It seems to me like a French stemmer is what you need instead of a fuzzy query. What analyzer are you using for your documents and queries? -- Stefan [EMAIL PROTECTED] wrote: Hi! I have a problem in dealing with a fuzzy query in Lucene 2.1.0. In order to explain my problem, I illustrate it

Re: Dealing with acronyms

2006-04-26 Thread Stefan Will
This makes perfect sense to me. Of course the hard part will be how to extract the acronyms. -- Stefan Hannes Carl Meyer wrote: Hi All, I would like enable users to do an acronym search on my index. My idea is the following: 1.) Extract acronyms (ABS, ESP, VCG etc.) from the given document

Re: AW: How does Lucene to compute score ?

2005-08-22 Thread Will (sent by Nabble.com)
Hey guys, here is the exact thing you want, check out this searchable archive hosted by Nabble: http://www.nabble.com/Lucene-f44.html - it archives all Lucene mailing lists into a forum, you can cross search all or drill down and search a single list. You can also narrow search by author, sort

RE: Using Highlighter to highlight entire HTML documents?

2005-05-24 Thread Will Allen
The challenge with this is always not breaking the HTML page itself. -Original Message- From: Fred Toth [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 24, 2005 3:47 PM To: java-user@lucene.apache.org Subject: Using Highlighter to highlight entire HTML documents? Hi, We have a need to pres

RE: Time taken in Indexing when the index is already huge

2005-04-05 Thread Will Allen
I would recommend not optimizing your index that often. Another solution is to use the multisearcher and keep one fully optimized primary index, and an unoptimized secondary index that you add to. Then search against both. During off peak hours you could merge the secondary index onto your pr