Re: How to get hits coordinates in Lucene 4.4.0

2013-08-18 Thread Karl Wettin
On Aug 13, 2013, at 12:55 PM, Michael McCandless wrote: > I'm less familiar with the older highlighters but likely it's possible > to get the absolute offsets from them as well. Using vector highlighter I've achieved that by extending and cloning the code of ScoreOrderFragmentsBuilder#makeFrag

A couple of thoughts on non technical users and query parsers.

2013-05-30 Thread Karl Wettin
Non technical users understand what a field is. Not all of them might know that they can use fields, but it's easy for them to learn that name:john will search for john only in names. Non technical users can learn to understand that logic and functionality can be specified in their qu

Re: Blåbærsyltetøy v.s. Räksmörgås

2013-05-23 Thread Karl Wettin
22 maj 2013 kl. 20:29 skrev Petite Abeille: > > On May 22, 2013, at 7:08 PM, Karl Wettin wrote: > >>> * Use a filter after ASCIIFoldingFilter that discriminate all use of ae, >>> oe, oo, and other combination of double vowels, just keeping the first one. >>

Re: Blåbærsyltetøy v.s. Räksmörgås

2013-05-22 Thread Karl Wettin
22 maj 2013 kl. 14:37 skrev Karl Wettin: > * Use a filter after ASCIIFoldingFilter that discriminate all use of ae, oe, > oo, and other combination of double vowels, just keeping the first one. I ended up with that solution. https://issues.apache.org/jira/browse/LUCEN

Blåbærsyltetøy v.s. Räksmörgås

2013-05-22 Thread Karl Wettin
This is a question (or perhaps a line of thought) regarding the mutually intelligible Scandinavian languages Danish, Norwegian and Swedish. The Swedish letters åäö are in fact the same letters as the Danish/Norwegian åæø. A Norwegian writing about the Swedish city of Göteborg writes Gøteborg and

Re: Best practices in boosting by proximity?

2013-05-04 Thread Karl Wettin
arser. Try something like "your proximity query"~20, but consider the cost of a large slop. 4 maj 2013 kl. 20:41 skrev Karl Wettin: > The simplest solution is to use slop in PhraseQuery, SpanNearQuery, > etc(?). Also consider permutations of #isInOrder() with alternative

Re: Best practices in boosting by proximity?

2013-05-04 Thread Karl Wettin
The simplest solution is to use slop in PhraseQuery, SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with alternative query boosts. Even though slop will create a greater score the closer the terms are, it might still in some cases (usually when combined with other subq

Re: Reg Lucene Naive Bayesian classifier.

2013-01-14 Thread Karl Wettin
14 jan 2013 kl. 14:53 skrev VIGNESH S: > Anyone Used the Naive Bayesian Classifier? > > It will be really helpful if some one Can post how to use the > classifiers in Lucene .. Hi there, I posted an NB classifier in the JIRA back in 2007 that uses Lucene as a data matrix. It probably needs a bit

Re: SSD Experience

2011-08-22 Thread Karl Wettin
22 aug 2011 kl. 18.49 skrev Rich Cariens: > I found a Lucene SSD performance benchmark > doc but > the wiki engine is refusing to let me view the attachment (I get "You > are not allowed to d

Re: negative wildcard query

2011-06-29 Thread Karl Wettin
You'll also need things to exclude from, eg a MatchAllDocsQuery. karl 29 jun 2011 kl. 17.25 skrev Clemens Wyss: > Say I have a document with field "f1". How can I search Documents which have > not "test" in field "f" > I tried: > -f: *test* > f: -*test* > f: NOT *test* > > but no luck.

Re: Lemmatization

2011-06-08 Thread Karl Wettin
Perhaps "least frequent substring" or even "suffix truncation" might be enough for your needs. Here is a related paper: http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf karl On Jun 8, 2011, at 1:52 PM, Mohamed Yahya wrote: > You're right. Still, I am not sure if there is a library that wo

Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-19 Thread Karl Wettin
On Jan 18, 2011, at 10:04 PM, Grant Ingersoll wrote: > As devs of Lucene/Solr, due to the way ASF mirrors, etc. works, we really > don't have a good sense of how people get Lucene and Solr for use in their > application. Because of this, there has been some talk of dropping Maven > support fo

Re: [SOLR] DisMaxQParserPlugin and Tokenization

2010-11-23 Thread Karl Wettin
22 nov 2010 kl. 10.56 skrev : > Using the SearchHandler with the deftype=”dismax” option enables the > DisMaxQParserPlugin. From investigating it seems, it is just tokenizing by > whitespace. > > Although by looking in the code I could not find the place, where this > behavior is enforced? I

Re: Fuzzy Phrase

2010-09-27 Thread Karl Wettin
There is a SpanFuzzyQuery for Lucene 1.9 from 2006 in LUCENE-522. karl 27 sep 2010 kl. 00.19 skrev Fabiano Nunes: > Thank you, Schindler. > When combining queries, I need two strings, one for each field. I want to > use just one string like -- head:"hello~ world"~3 AND contents:"colorle

Re: instantiated contrib

2010-08-27 Thread Karl Wettin
(exception Hotel Chain). so I guess it's distribution is a litter term is very frequent and other term is very rare. 2010/8/27 Karl Wettin: My mail client died while sending this mail.. Sorry for any duplicate. It is strange that it should take 20 seconds to gather fields, this is the only thing

Re: instantiated contrib

2010-08-26 Thread Karl Wettin
My mail client died while sending this mail.. Sorry for any duplicate. It is strange that it should take 20 seconds to gather fields, this is the only thing that really surprises me. I'd expect it to be instant compared to RAMDirectory. It is hard to say from the information you provided. Did

Re: Hot to get word importance in lucene index

2010-07-23 Thread Karl Wettin
Are you perhaps looking for this: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/similar/MoreLikeThis.html ? karl 23 jul 2010 kl. 10.54 skrev Xaida: Hi! thanks for reply! I will try to explain better, sorry if it was unclear. I have user text document colle

Re: Reverse Lucene queries

2010-07-23 Thread Karl Wettin
23 jul 2010 kl. 08.30 skrev sk...@sloan.mit.edu: Hi all, I have an interesting problem...instead of going from a query to a document collection, is it possible to come up with the best fit query for a given document collection (results)? "Best fit" being a query which maximizes the hit scores o

Re: Hot to get word importance in lucene index

2010-07-23 Thread Karl Wettin
Hi, Please define "important". Important to do what? It would probably be helpful if you explained what it is you attempt to achieve by doing this. Perhaps there is something in MoreLikeThis that will help you? karl 23 jul 2010 kl. 04.44 skrev Xaida: Hi all! hmmm, i need t

Re: about contrib instantiated

2010-07-03 Thread Karl Wettin
2 jul 2010 kl. 08.32 skrev Li Li: I have an index of about 8,000,000 documents and the current index size is about 30GB. Is it possible to use this contrib to speed up my search? I have enough memory for it. In order to answer your question you'll need to benchmark using a lot of typical q

Re: Lucene Partition Size

2010-04-09 Thread Karl Wettin
d via NFS to EMC Celera devices. (NFS 3) - The drives are 300 gb fiber attached with 10,000 rpm. Thanks, Ivan --- On Thu, 4/8/10, Karl Wettin wrote: From: Karl Wettin Subject: Re: Lucene Partition Size To: java-user@lucene.apache.org Date: Thursday, April 8, 2010, 2:44 PM 8 apr 2010 kl.

Re: Lucene Partition Size

2010-04-08 Thread Karl Wettin
8 apr 2010 kl. 20.05 skrev Ivan Provalov: We are using Lucene for searching of 200+ mln documents (periodical publications). Is there any limitation on the size of the Lucene index (file size, number of docs, etc...)? The only such limitation in Lucene I'm aware of is Integer.MAX_VALUE

Re: query: order of search

2010-04-01 Thread Karl Wettin
1 apr 2010 kl. 11.21 skrev >: its written "to do a "search within search", so that the second search is constrained by the results of the first query" If I understand your needs you could while collecting search results populate a new filter with all matching documents and use that filt

Re: InstantiatedIndex performance

2010-03-31 Thread Karl Wettin
31 mar 2010 kl. 10.21 skrev Michael Stoppelman: I was wondering why the InstantiatedIndex gets very slow as the number of documents increases in the index. I've been looking at the source and have only found comments saying "it's slow" when the index is big but not why. Do folks just run

Re: Lucene as a primary datastore

2010-01-20 Thread Karl Wettin
20 jan 2010 kl. 04.58 skrev Guido Bartolucci: Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL? Since all your comparisons are with relational databases I feel obligated to say what has been said so many times on this list: Lucene is an index and not a relatio

Re: Extracting contact data

2010-01-13 Thread Karl Wettin
Lucene will probably only be helpful if you know what you are looking for, e.g. that you search for a given person, a given street and given time intervals. Is this what you want to do? If you instead are looking for a way to really extract any person, street and time interval that a docum

Re: Text extraction from ms word doc

2010-01-11 Thread Karl Wettin
Have you tried antiword? http://www.winfield.demon.nl/ karl 11 jan 2010 kl. 21.04 skrev maxSchlein: I was looking for an option for Text extraction from a word doc. Currently I am using POI; however, when there is a table in the doc, for each column POI brings back a . The whites

Re: Copy and augment an indexed Document

2010-01-03 Thread Karl Wettin
31 dec 2009 kl. 02.19 skrev Erick Erickson: It is possible to reconstruct a document from the terms, but it's a lossy process. Luke does this (you can see from the UI, and the code is available). There's no utility that I know of to make this easy. http://svn.apache.org/repos/asf/lucene/java/

Re: NumericRangeQuery performance with 1/2 billion documents in the index

2010-01-03 Thread Karl Wettin
3 jan 2010 kl. 16.32 skrev Yonik Seeley: Perhaps this is just a huge index, and not enough of it can be cached in RAM. Adding additional clauses to a boolean query incrementally destroys locality. 104GB of index and 4GB of RAM means you're going to be hitting the disk constantly. You need

Re: about optimize() quetion ,Looking forward to hearing from you soon! Thank you in advance!

2010-01-03 Thread Karl Wettin
3 jan 2010 kl. 13.33 skrev luocanrao: 1、if the readers do not call re-open, segment file the readers will see is after merged or before merged when optimize() done 2、when old segment file on disk is removed,if old segment files are removed after optimize() done at once, How can the read

Re: MatchAllDocsQuery and InstantiatedIndex on Lucene 2.9.1

2009-12-10 Thread Karl Wettin
https://issues.apache.org/jira/browse/LUCENE-2144 9 dec 2009 kl. 23.22 skrev Uwe Schindler: This is a bug in InstantiatedIndex. The termDoc(null) was added to get all documents. This was never implemented in Instantiated Index. Can you open an issue? There maybe other queries fail because

Re: search problem

2009-10-29 Thread Karl Wettin
29 okt 2009 kl. 12.12 skrev m.harig: i've a doubt in search , i've a word in my index welcomelucene (without spaces) , when i search for welcome lucene(with a space) , am not able to get the hits. It should pick the document welcomelucene.. is there anyway to do it ? i've used wildcar

Re: XorReader?

2009-10-22 Thread Karl Wettin
22 okt 2009 kl. 20.00 skrev Chris Hostetter: : I'm thinking a decorator with deletions on top of the original reader, merged : with the clone reader using a MultiReader. But this would still require a new you don't really mean a clone do you? ... you should just need a very small index c

XorReader?

2009-10-21 Thread Karl Wettin
Hi people, I have an application in which the users are allowed to make changes to the database, changes visible only to that user. I.e. they don't modify the original data, they create a clone of the original. When the user request the instance I retrieve the modified clone rather than t

Re: Need to know pros and cons of using RAMDirectory

2009-10-17 Thread Karl Wettin
Hi, you should probably ask yourself why your performance is bad before looking at solving it by scaling hardware. I.e. what are your application needs, how do you solve your needs at index/query time and how can you replace this with something better? If you tell us a bit more about your

Re: Using TermVectorMapper to compute term frequency across documents

2009-10-15 Thread Karl Wettin
14 okt 2009 kl. 15.15 skrev Grant Ingersoll: On Oct 12, 2009, at 10:46 PM, Thomas D'Silva wrote: I am trying to compute the counts of terms of the documents returned by running a query using a TermVectorMapper. I was wondering if anyone knew if there was a faster way to do this rather than

Re: Reverse stemmer?

2009-10-08 Thread Karl Wettin
For the case where the text contains mixed languages there are solutions that simultaneously use morphological rules of two or more languages. Coveo search does this but I don't know what their solution looks like. I suppose one way to do it would be to stem all tokens with all algorithms an

Re: Phase Extraction, mainly for English

2009-10-06 Thread Karl Wettin
completely disregard the meaning, that's not good enough. Regards, Andrew On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin wrote: Hi Andrew, I think you are looking for the shingle package in contrib/analyzers. karl 6 okt 2009 kl. 13.42 skrev Andrew Zhang: Hi guys, The requirement

Re:InstantiatedIndex questions

2009-10-06 Thread Karl Wettin
6 okt 2009 kl. 18.54 skrev David Causse: David, your timing couldn't be better. Just the other day I proposed that we deprecate InstantiatedIndexWriter. The sum of the reasons to this is that I'm a bit lazy. Your mail makes me reconsider. https://issues.apache.org/jira/browse/LUCENE-1948

Re: Phase Extraction, mainly for English

2009-10-06 Thread Karl Wettin
Hi Andrew, I think you are looking for the shingle package in contrib/analyzers. karl 6 okt 2009 kl. 13.42 skrev Andrew Zhang: Hi guys, The requirement is very simple here, e.g. for this sentence, 'The NBA formally announced its new *social media* guidelines Wednesday', I want to t

Re: Help understanding fieldNorm

2009-10-05 Thread Karl Wettin
f the title was increased by 1, from 41 to 42 characters. -- Ole-Martin Mørk On Mon, Oct 5, 2009 at 12:39 PM, Karl Wettin wrote: sorry, I meant title. 5 okt 2009 kl. 11.57 skrev Simon Willnauer: Ole-Martin, did you mention that you did not change the URL value but the title? simon O

Re: Help understanding fieldNorm

2009-10-05 Thread Karl Wettin
Sorry, I meant title. 5 okt 2009 kl. 11.57 skrev Simon Willnauer: Ole-Martin, did you mention that you did not change the URL value but the title? simon On Mon, Oct 5, 2009 at 11:52 AM, Karl Wettin wrote: Hi Ole-Martin, how many characters were in the URL before and after the update

Re: Help understanding fieldNorm

2009-10-05 Thread Karl Wettin
Hi Ole-Martin, how many characters were in the URL before and after the update? karl 5 okt 2009 kl. 10.21 skrev Ole-Martin Mørk: Hi. I am trying to understand Lucene's scoring algorithm. We're getting some strange results. First we search for a given page by its URL. We get this resul

Re: Help needed bubbling up relevant records with most recent date

2009-10-02 Thread Karl Wettin
Use a span near query to add boost for the phrases. If you only want to add boost for exact phrases (0 slop) you might want to consider using shingles. In order to add greater score for a date closer in time you can choose between a range of solutions depending on your needs. Using a functi

Re: Help needed ordering search results

2009-10-01 Thread Karl Wettin
Not quite sure what you ask for, but I think you want to use a span near query (for adding boost to phrases) in a disjunction max query (to define weights of the different fields). karl 1 okt 2009 kl. 02.40 skrev mitu2009: Hi, I've 3 records in Lucene index. Record 1 contains healt

Re: Whitespace/Standard Analyzer and punctuation

2009-09-30 Thread Karl Wettin
You could look into modifying the standard tokenizer lexer code to handle punctuation (there is a patch in the issue tracker for the old JavaCC grammar to handle punctuation) and there is also the GATE NLP project which has a fairly nice sentence splitter you might find useful. Add a whol

Re: Memory consumed by IndexSearcher

2009-09-23 Thread Karl Wettin
23 sep 2009 kl. 17.55 skrev Mindaugas Žakšauskas: I was kind of hinting on the resource planning. Every decent enterprise application, apart from other things, has to provide its memory requirements, and my point was - if it uses memory, how much of it needs to be allocated? What are the bounda

Re: Memory consumed by IndexSearcher

2009-09-23 Thread Karl Wettin
23 sep 2009 kl. 17.55 skrev Mindaugas Žakšauskas: Luke says: Has deletions? / Optimized? Yes (1614) / No Very quick response, try optimizing your index and see what happens. I'll get back to you unless someone beats me to it. karl

Re: Memory consumed by IndexSearcher

2009-09-22 Thread Karl Wettin
Hi Mindaugas, it is - as you sort of point out - the readers associated with your searcher that consume the memory, and not so much the searcher itself. The things that consume the most memory are probably field norms (8 bits per field and document unless omitted) and flyweighted terms (Strin

Re: Help Needed...

2009-05-28 Thread Karl Wettin
28 maj 2009 kl. 12.22 skrev Gaurav Kumar: Hi everyone, I am doing a project using Lucene where i need to index HTML files. I am using Tika to parse HTML files. But i need to index files according to their tags which means that every text present in different HTML tag (like ) should be s

Re: Using Lucene for a classification problem

2009-05-19 Thread Karl Wettin
Hi Jeetu, whether or not it makes sense to use Lucene as your data matrix depends a bit on your requirements. There is a Bayesian classifier available in the issue tracker that might be helpful, although it does need a little bit of refact

Re: InstantiatedIndex Memory required

2009-05-13 Thread Karl Wettin
Hi Ravichandra, this question is better suited to the java-user mailing list. On this list we talk about the development of the Lucene API rather than how to use it. To answer your question, there is no simple formula that says how much RAM an InstantiatedIndex will consume given the FSDi

Re: Lucene Index Encryption

2009-05-08 Thread Karl Wettin
I might be missing something here, but why not just store the index on a cryptographic virtual file system? karl 8 maj 2009 kl. 19.09 skrev >: Michael, Thanks for the comments they are very insightful. I hadn't thought about the Random Access issues until you brought it up. T

Re: interpreting scores

2009-05-08 Thread Karl Wettin
for my needs? Would the Lucene SpellChecker classes be of any use? I really feel like I'm floundering here. I am more than willing to put in the work, I just need a push or two in the right directions. :) Thanks! -Nate On Thu, May 7, 2009 at 7:50 AM, Karl Wettin wrote: Nate, will there al

Re: interpreting scores

2009-05-08 Thread Karl Wettin
of any use? I really feel like I'm floundering here. I am more than willing to put in the work, I just need a push or two in the right directions. :) Thanks! -Nate On Thu, May 7, 2009 at 7:50 AM, Karl Wettin wrote: Nate, will there always be a corresponding mp3 for any given note sheet? As

Re: interpreting scores

2009-05-07 Thread Karl Wettin
Nate, will there always be a corresponding mp3 for any given note sheet? As for analysis, I'd try using ngrams of the complete untokenized file name if I were you. "Michael Jackson Don't Stop 'till You Get Enough" -> "^mic", "mich", "icha", "chae", "hael", "ael ", "el j", "l ja", and so on
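
The n-gram expansion described in this message can be sketched in plain Java. This is only an illustration of the idea, not code from the thread: the class and method names are hypothetical, and the lowercasing, the "^" start-of-string marker and the gram size of 4 are read off the example output in the message. In a real index the same expansion would normally happen inside the analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: character n-grams over the whole,
// untokenized string, with "^" marking the start of the string.
final class FileNameNGrams {

    /** Lowercases the text, prepends "^", and returns every window of n characters. */
    static List<String> ngrams(String text, int n) {
        String marked = "^" + text.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= marked.length(); i++) {
            grams.add(marked.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // The first grams are "^mic", "mich", "icha", "chae", as in the message.
        System.out.println(ngrams("Michael Jackson", 4));
    }
}
```

Because whitespace is kept inside the grams, a gram such as "el j" only matches where the two words are adjacent, which is what makes this useful for comparing whole file names.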

Re: Exact match on entire field

2009-05-06 Thread Karl Wettin
You should probably tell us the reason why you need this functionality. Given you only load the stored comparative field for the first it doesn't really have to be that expensive. If you know that the first hit was not a perfect match then you know that any matching documents with a l

Re: Suggestive Search

2009-04-08 Thread Karl Wettin
If you use prefix grams only then you'll get a forward-only suggestion scheme. I've seen several implementations that use that and it works quite well. harry potter: ^ha, ^har, ^harr, ^harry, ^harry p, ^harry po.. harry houdini: ^ha, ^har, ^harr, ^harry, ^harry h, ^harry ho.. I prefer the tr
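
A minimal sketch of the prefix-gram scheme from this message, in plain Java. The names are hypothetical and this is not the EdgeNGramTokenFilter shipped in contrib/analyzers; it only reproduces the example lists above, under the assumptions that the minimum prefix length is 2 and that prefixes ending in whitespace are skipped (the lists go from "^harry" straight to "^harry p").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: every prefix of the phrase,
// marked with "^", so that only forward (starts-with) matches are possible.
final class PrefixGrams {

    /** Returns "^"-marked prefixes of at least minLength characters. */
    static List<String> prefixGrams(String phrase, int minLength) {
        String lower = phrase.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int end = minLength; end <= lower.length(); end++) {
            // Assumption: a prefix that ends in whitespace adds nothing useful.
            if (Character.isWhitespace(lower.charAt(end - 1))) {
                continue;
            }
            grams.add("^" + lower.substring(0, end));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(prefixGrams("harry potter", 2));
        System.out.println(prefixGrams("harry houdini", 2));
    }
}
```

Both phrases share the grams up to "^harry", so a user who has typed "harr" matches both, and the suggestions only diverge once the second word begins.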

Re: Suggestive Search

2009-04-08 Thread Karl Wettin
For this you probably want to use ngrams. Whether or not this is something that fits in your current index is hard to say. My guess is that you want to create a new index with one document per unique phrase. You might also want to try to load this index in an InstantiatedIndex, that could sp

Re: Filters, what's going on under the hood?

2009-04-06 Thread Karl Wettin
6 apr 2009 kl. 15.47 skrev Lebiram: I am thinking of adding search filters to my application thinking that they would more efficient. Can anyone explain what lucene does with search filters? Like, what generally happens when calling search() A filter is a bitset, one bit per document in t
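
The "filter is a bitset" statement can be illustrated with java.util.BitSet. This is a conceptual sketch only, with hypothetical names; it does not show how Lucene applies filters internally, just the underlying idea that a document is kept only when it matches the query and its bit is set in the filter.

```java
import java.util.BitSet;

// Conceptual sketch, not Lucene code: bit i is set when document i passes.
final class BitSetFilterSketch {

    /** Returns the documents that match the query and are allowed by the filter. */
    static BitSet applyFilter(BitSet queryMatches, BitSet filter) {
        BitSet result = (BitSet) queryMatches.clone(); // leave the inputs untouched
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        BitSet queryMatches = new BitSet();
        queryMatches.set(0);
        queryMatches.set(2);
        queryMatches.set(3);
        queryMatches.set(5);

        BitSet filter = new BitSet();
        filter.set(2);
        filter.set(3);
        filter.set(4);

        // Documents 2 and 3 are the only ones present in both sets.
        System.out.println(applyFilter(queryMatches, filter));
    }
}
```

Since the filter is just one bit per document, it can be computed once and reused across many queries, which is the usual reason for preferring it over an extra query clause.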

Re: Lucene and Phrase Correction

2009-04-06 Thread Karl Wettin
6 apr 2009 kl. 14.59 skrev Glyn Darkin: Hi Glyn, to be able to spell check phrases E.g "Harry Poter" is converted to "Harry Potter" We have a fixed dataset so can build indexes/ dictionaries from our own data. the most obvious solution is index your contrib/spell checker with shingles. T

Re: Free software for language detection

2009-03-29 Thread Karl Wettin
You can also look at https://issues.apache.org/jira/browse/LUCENE-1039 that I've successfully used for language detection of user queries. karl 27 mar 2009 kl. 18.35 skrev Boris Aleksandrovsky: Lisheng, You might want to look at the Nutch LanguageID plugin (http://wiki.apache.org/nut

Re: "People you might know" ( a la Facebook) - *slightly offtopic*

2009-03-24 Thread Karl Wettin
There is even an old thread about this on the Mahout-users list: http://markmail.org/message/ludu5hjfczuvgk3n 17 mar 2009 kl. 15.17 skrev Grant Ingersoll: Have a look at the Lucene sister project: Mahout: http://lucene.apache.org/mahout . In there is the Taste collaborative filtering project

Re: Upper limit on number of Fields

2009-02-15 Thread Karl Wettin
15 feb 2009 kl. 16.27 skrev Joel Halbert: Is there any practical limit on the number of fields that can be maintained on an index? My index looks something like this, 1 million documents. For each group of 1000 documents I might have 10 indexed fields. This would mean in total about 1 f

Re: Partial / starts with searching

2009-02-14 Thread Karl Wettin
e without losing too much 'searching' speed. ...or am I wrong? Karl Wettin wrote: If you attach an NgramTokenFilter to your analyzer at index and query time you should be able to query for parts of the word. http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/ngram/N

Re: Partial / starts with searching

2009-02-13 Thread Karl Wettin
but then I just don't understand how Google manages to do this :) Jori. Karl Wettin wrote: Hi again Jori, did you try N-grams as suggested in the reply on -dev? karl 13 feb 2009 kl. 09.05 skrev d-fader: Hi, I've actually posted this message in de dev mailing list e

Re: Partial / starts with searching

2009-02-13 Thread Karl Wettin
Hi again Jori, did you try N-grams as suggested in the reply on -dev? karl 13 feb 2009 kl. 09.05 skrev d-fader: Hi, I've actually posted this message in de dev mailing list earlier, because I though my 'issue' is a limitation of the functionality of Lucene, but they redirected me to th

Re: TermQuery search returns the same Document several times

2009-02-07 Thread Karl Wettin
5 feb 2009 kl. 14.44 skrev Lebiram: If HitCollector only returns a document once then he might be referring to an application ID that is assigned to a field that has been indexed twice or more with different document IDs. I'll clarify this with him. However is there a way to somehow do a

Re: Field.Store.YES Question

2009-02-05 Thread Karl Wettin
5 feb 2009 kl. 09.30 skrev Amin Mohammed-Coleman: Is there a separate part in the lucene document that the tokenised strings are stored and therefore Lucene knows where to look? Yes. Stored fields are metadata bound to a document, for instance the primary key of the object the Lucene do

Re: ShingleMatrixFilter for synonyms

2009-01-14 Thread Karl Wettin
Hi Eric, ShingleMatrixFilter does not add some sort of multiple token synonym feature on top of a plain old Lucene index, it does however create permutations of tokens in a matrix. My suggestion is that you first look at what shingles are and make sure this is something you feel is intere

Re: updating payloads

2009-01-03 Thread Karl Wettin
I think it would be nice with a little payload modification tool in the SVN. karl 2 jan 2009 kl. 23.02 skrev Grant Ingersoll: I don't think there is any API support for this, but in theory it is possible, as long as you aren't changing the size. It sounds like it could work for you si

Re: Re-combining already indexed documents

2009-01-02 Thread Karl Wettin
Hello, the easiest way would be to construct the combined document using the data from your primary source rather than reconstructing it from the index. If the source data no longer is available you could still reconstruct a token stream. The data is however a bit spread out so it can tur

Re: Extract the text that was indexed

2008-12-30 Thread Karl Wettin
30 dec 2008 kl. 17.13 skrev Lebiram: Hi Lebiram, contrib/misc contains a couple of tools that might be of help. Just wanted to reconstruct a new index based on an existing index(but turning off norms) that's all. If you want to create an identical index but without norms use FieldNormModi

Payloads

2008-12-26 Thread Karl Wettin
I would very much like to hear how people use payloads. Personally I use them for weight only. And I use them a lot, in almost all applications. I factor the weight of synonyms, stems, dediacritization and what not. I create huge indices that contain lots of tokens at the same position but wi

Re: Any way to ignore repeated terms in TF calculation?

2008-12-26 Thread Karl Wettin
Hi Israel, you can solve your problem at search time by passing a custom Similarity class that looks something like this: private Similarity similarity = new DefaultSimilarity() { public float tf(float v) { return 1f; } public float tf(int i) { return 1f; } };

Re: Lucene - Authentication

2008-12-14 Thread Karl Wettin
13 dec 2008 kl. 06.05 skrev Aaron Schon: Hi , if I have a Lucene index (or Solr) that is installed in client premises. how would you go about securing the index from being queries in unauthorized fashion. For example, from malicious users or hackers, or for that matter "internal" users try

Re: Cannot find gdata-server

2008-12-04 Thread Karl Wettin
Hello Anees, the Gdata server was phased out by 2.3. You can still get it from the 2.2 tag in the SVN: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_2_0/ karl 5 dec 2008 kl. 07.13 skrev Anees Haider: I have setup lucene, test run it and go through samples. Now I have been w

Re: Slow queries with lots of hits

2008-12-04 Thread Karl Wettin
Hi Tim, is it possible that the slow queries contain terms that are very common in your index? If so you could replace those clauses with a filter. This would impact the score, as filters do not contribute to it, but if your query contains enough other clauses that should not be a problem.

Re: serialVersionUID issue between 2.3 and 2.4

2008-12-01 Thread Karl Wettin
You could get the 2.4 code and set the serialVersionUID of the Term class to the UID assigned to the 2.3 Term class (554776219862331599l) and recompile. As for statically setting a serialVersionUID in the class, one could instead set it to a final value and implement Externalizable in order

Re: SpanFirstQuery is not taking wildcard characters (like *) as a logical operator for the preffix

2008-11-28 Thread Karl Wettin
nced search techniques and i thought it can be possible with SpanFirstQuery because it can search by sequence, but it cant search as a startswith (for "library inf*") Karl Wettin wrote: SpanTermQuery is a TermQuery and not a WildcardQuery. You could use a SpanRegexQuery. You coul

Re: SpanFirstQuery is not taking wildcard characters (like *) as a logical operator for the preffix

2008-11-27 Thread Karl Wettin
SpanTermQuery is a TermQuery and not a WildcardQuery. You could use a SpanRegexQuery. You could also make your own SpanWildcardQuery based on either WildcardQuery or SpanRegexQuery. You should probably tell us a bit about the problem you try to solve rather than asking about the solution y

Re: Query time document group boosting

2008-11-27 Thread Karl Wettin
27 nov 2008 kl. 10.15 skrev Toke Eskildsen: On Thu, 2008-11-27 at 07:30 +0100, Karl Wettin wrote: The most scary part is that you will have to score each and every document that has a source, probably all of the documents in your corpus. I now see my query-logic was flawed. In order

Re: Scoring issue

2008-11-26 Thread Karl Wettin
Alex, if you have length normalization turned on then the length (the number of tokens and perhaps even the distance between the tokens) of the second document is much greater than the length of the first document. The length is the complete number of tokens in the field, i.e. if you add

Re: Query time document group boosting

2008-11-26 Thread Karl Wettin
The most scary part is that you will have to score each and every document that has a source, probably all of the documents in your corpus. So if you have a very large number of documents it might be a bit expensive. Also, appending this query for boost only means that you will get hit

Re: InstatiatedIndex questions

2008-11-19 Thread karl wettin
Hi David, thanks for the report! I suppose you speak of IndexWriter vs InstantiatedIndexWriter? These are definitely considered discrepancy problems. I've created a new issue in the tracker: http://issues.apache.org/jira/browse/LUCENE-1462 For what reason do you try to serialize the InstantatedIn

Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
On Wed, Nov 19, 2008 at 3:27 AM, karl wettin <[EMAIL PROTECTED]> wrote: > rewritten query. I.e. this is probably as much a store related expense > as it is a Levenshtein calculation expense. "this is probably *not* as much a store related.."

Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
The actual performance depends on how much you load to the index. Can you tell us how many documents and how large these documents are that you have in your index? Compared with RAMDirectory I've seen performance boosts of up to 100x in a small index that contains (1-20) Wikipedia sized document

Re: instantiated index in 2.4

2008-10-29 Thread Karl Wettin
Hi Darren, How large is your corpus? The speed you can expect depends on how much data you load it with. There is a graph in the package level javadocs that shows this: http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/org/apache/lucene/store/instantiated/package-summary.html

Re: Calculation of fieldNorm causes irritating effect of sort order

2008-10-02 Thread Karl Wettin
2 okt 2008 kl. 14.47 skrev Jimi Hullegård: But apparently this setOmitNorms(true) also disables boosting as well. That is ok for now, but what if we want to use boosting in the future? Is there no way to disable the length normalization while still keeping the boost calculation? You can m

Re: Index time Document Boosting and Query Time Sorts

2008-09-24 Thread Karl Wettin
24 sep 2008 kl. 12.40 skrev Grant Ingersoll: One side note based on your example, below: Index time boosting does not have much granularity (only 255 values), in other words, there is a loss of precision. Thus, you want to make sure your boosts are different enough such that you can dist

Re: lucene Front-end match

2008-09-19 Thread Karl Wettin
19 sep 2008 kl. 11.05 skrev 叶双明: Document> Document> How can I get the first Document buy some query string like "a" , "ab" or "abc" but no "b" and "bc"? You would create an ngram filter that create grams from the first position only. Take a look at EdgeNGramTokenFilter in contrib/analy

Re: Some SSD results to share

2008-09-16 Thread Karl Wettin
Related, I've been considering filesystem based filters on SSD. That ought to be rather fast, consume no memory and be as simple as a RandomAccessFile. I didn't spend too much time on it, gave up when I couldn't figure out when it made sense to close the file. Perhaps it would be nice with a

Re: instantiated index in 2.4

2008-09-15 Thread Karl Wettin
15 sep 2008 kl. 18.51 skrev Karl Wettin: Are the adds reflected directly to the index? Yes. An InstantiatedIndexReader is always current. You will probably still have to reconstruct your searcher. I never really looked into what happens if you don't. The second statement was

Re: instantiated index in 2.4

2008-09-15 Thread Karl Wettin
15 sep 2008 kl. 18.45 skrev Cam Bazz: I have been looking at instantiated index in the trunk. Does this come with a searcher? Pass an InstantiatedIndexReader to the constructor of an IndexSearcher. Are the adds reflected directly to the index? Yes. An InstantiatedIndexReader is always cur

Re: Sorting in lucene through Document boosting

2008-09-15 Thread Karl Wettin
15 sep 2008 kl. 14.08 skrev Dragan Jotanovic: I made simple Similarity implementation: public float tf(float arg0) { return 1f; } Why do you touch the term frequency? Is that perhaps unrelated to what's discussed in this thread? karl

Re: Frequently updated fields

2008-09-12 Thread Karl Wettin
with frequently changing fields. Karl Wettin wrote: Hi Wojciech, can you please give us a bit more specific information about the metadata fields that will change? I would recommend looking at creating filters from your primary persistency for query clauses such as unread/read, ma

Re: Frequently updated fields

2008-09-12 Thread Karl Wettin
12 sep 2008 kl. 14.51 skrev Wojciech Strzałka: The most changing fields will be I think: Status (read/unread): in fact I'm afraid of this the most - any mail incoming to the system will need to be indexed at least twice This is why I recommended you to use a filte

Re: Frequently updated fields

2008-09-12 Thread Karl Wettin
Hi Wojciech, can you please give us a bit more specific information about the metadata fields that will change? I would recommend looking at creating filters from your primary persistency for query clauses such as unread/read, mailbox folders, etc. karl 12 sep 2008 kl. 13.57

Re: removing norms

2008-09-12 Thread Karl Wettin
12 sep 2008 kl. 12.25 skrev Bogdan Ghidireac: I have a large index and I want to remove the norms from a field. Is there a way to do this without reindexing everything ? You could invoke IndexReader#setNorm(int, String, float) and set the value to 1f. karl --

Re: string similarity measures

2008-09-04 Thread Karl Wettin
ply to my case? tanimoto coefficient over shingles? Not really, no. karl Best, On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <[EMAIL PROTECTED]> wrote: 4 sep 2008 kl. 14.38 skrev Cam Bazz: Hello, This came up before but - if we were to make a swear word filter, st
