Modifying IDF

2010-01-28 Thread Franz Allan Valencia See
Good day, I am currently using Lucene for my searches. One of the problems I'm facing is when the keyword is a URL. Tokens such as http, https, ://, index, html, etc. seem to be messing up our search results. The focus was supposed to be only on the URL domain. The idea that I have i

Re: index demo throws LockObtainFailedException

2010-01-28 Thread Otis Gospodnetic
Fedora Core 4 is *ancient*! :) Could it be that the NFS client on it is old, and this is causing problems? I remember emails about NFS 3 vs. NFS 4 and some improvements in the latter. I don't recall the details and tend to keep my Lucene and Solr instances away from NFS mounts. Otis Sema

index demo throws LockObtainFailedException

2010-01-28 Thread Teruhiko Kurosaka
We have many Linux machines of different brands, sharing the same NFS filesystem for home. The Lucene file indexing demo program is failing with LockObtainFailedException only on one particular Linux machine (Fedora Core 4, x86). I am including the console output at the bottom of this message.

AW: AW: index a database

2010-01-28 Thread Marc Schwarz
Maybe you should separate the add method from the database function... Separate the db loop, something like this: try { ResultSet rs2 = stm.executeQuery(sql); while(rs2.next()) { String text = rs2.getString("textvalue"); addDoc(w,

Re: AW: index a database

2010-01-28 Thread luciusvorenus
yes many thanks .. But /.../my index folder is empty. Have I done something wrong in "private static void indexDocs"? It is not indexed Marc Schwarz wrote: > > StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT); > > any difference with that ? > > -Ursprün

Re: AW: index a database

2010-01-28 Thread luciusvorenus
"" Exception in thread "main" java.lang.NullPointerException at org.apache.lucene.analysis.StopFilter.getEnablePositionIncrementsVersionDefault(StopFilter.java:162) at org.apache.lucene.analysis.standard.StandardAnalyzer.(StandardAnalyzer.java:73) at org.apache.lucene.ana

AW: index a database

2010-01-28 Thread Marc Schwarz
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT); Any difference with that? -Original Message- From: luciusvorenus [mailto:lucius.vore...@hotmail.de] Sent: Thursday, 28 January 2010 22:46 To: java-user@lucene.apache.org Subject: Re: index a database

Re: index a database

2010-01-28 Thread luciusvorenus
Lucene 3.3. I tried it like this: "" import org.apache.lucene.demo.FileDocument; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.index

AW: index a database

2010-01-28 Thread Marc Schwarz
I had that problem yesterday... this works in my app: Directory directory = new SimpleFSDirectory(new File("c:\\lucene\\index")); IndexWriter w = new IndexWriter(directory, analyzer,true, new IndexWriter.MaxFieldLength(25000)); -Original Message- From: Erick Erickson [mailto:eric

Re: Average Precision - TREC-3

2010-01-28 Thread Ivan Provalov
Great points, Robert! I agree, we have a lot of fine tuning ahead of us. I think we probably have achieved the baseline with our MAP of 0.14. We should move on to stage two and apply some of the suggestions to improve the overall scores. These are just the first steps. Both you and Grant
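For readers unfamiliar with the MAP figure quoted in this thread, the following is a minimal sketch of how average precision and its mean over queries are commonly computed. It is plain Java and does not use Lucene's benchmark code; the class name, method names, and the relevance judgments in `main` are hypothetical and chosen only for illustration.

```java
final class AveragePrecisionSketch {

    // relevant[i] is true when the result at rank i + 1 was judged relevant.
    // totalRelevant is the number of relevant documents known for the query.
    static double averagePrecision(boolean[] relevant, int totalRelevant) {
        if (totalRelevant <= 0) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < relevant.length; i++) {
            if (relevant[i]) {
                hits++;
                // Precision at this rank, counted only at relevant positions.
                sum += (double) hits / (i + 1);
            }
        }
        return sum / totalRelevant;
    }

    // MAP is the arithmetic mean of the per-query average precisions.
    static double meanAveragePrecision(double[] perQuery) {
        if (perQuery.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double ap : perQuery) {
            sum += ap;
        }
        return sum / perQuery.length;
    }

    public static void main(String[] args) {
        // Hypothetical judgments: relevant results at ranks 1 and 3, two relevant docs overall.
        double ap = averagePrecision(new boolean[] {true, false, true}, 2);
        System.out.println(ap); // (1/1 + 2/3) / 2
    }
}
```

Dividing by the total number of relevant documents, rather than by the number retrieved, means a relevant document that never appears in the result list still lowers the score.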

Re: Average Precision - TREC-3

2010-01-28 Thread Ivan Provalov
Great reference, Grant! Thank you! Our content is very similar to TREC-3 (periodicals). In fact, there is some content overlap between our content and TREC's (actual documents). The query types are very similar (ad hoc). The cost of extracting our top queries is that we would have to also

Re: index a database

2010-01-28 Thread Erick Erickson
What version are you using? Because there's no such constructor (i.e. one that takes a File) in 3.0. You might want to use something like FSDirectory.open(file) in your IndexWriter constructor If this doesn't work, more details please Erick On Thu, Jan 28, 2010 at 3:30 PM, luciusvor

index a database

2010-01-28 Thread luciusvorenus
Hello I tried to index a database "" import org.apache.lucene.demo.FileDocument; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene

Re: lucene search

2010-01-28 Thread Erick Erickson
The issue with non-letter characters is, indeed, the analyzer. Have a look at all the different subclasses of Analyzer in the javadocs. Getting a copy of Luke will show you exactly what gets into your index, but KeywordAnalyzer and WhitespaceAnalyzer may work for you (but they don't normalize the cas

Re: lucene search

2010-01-28 Thread Shashi Kant
Hi, if you want to search by substring (i.e. "lp" should return "lpg" as a result) you should look at wildcards. So a search for "lp*" (* is the wildcard character) would return lpg, lpghxyz, lp12345 and so on... On Thu, Jan 28, 2010 at 1:41 PM, andy green wrote: > > hello, > > I programmed wit

RE: combine query score with external score

2010-01-28 Thread Steven A Rowe
Hi Dennis, You should check out payloads (arbitrary per-index-term byte[] arrays), which can be used to encode values which are then incorporated into documents' scores, by overriding Similarity.scorePayload():

lucene search

2010-01-28 Thread andy green
Hello, I use Lucene to handle search on my site. The indexed articles are those stored in a database; I then search with "lucene.queryparser" on the field "code" of various objects (a "code" is a word of 3 to 6 characters) ... My problem is the fact that when I search, I

Re: Average Precision - TREC-3

2010-01-28 Thread Robert Muir
right, but the problem is when something is currently ranked as doc 20 but should be in the top 1, 5, or 10, and you aren't seeing it. so I think if you are judging top-N docs from an existing system, you should look a little farther ahead than the top-N you care about. I think you should also ind

Re: Average Precision - TREC-3

2010-01-28 Thread Grant Ingersoll
On Jan 28, 2010, at 11:00 AM, Robert Muir wrote: > in addition to what Grant said, even if your documents are similar, what > about queries? > > For example, if only a few trec queries contain proper names, acronyms, > abbreviations, or whatever, but your users frequently input things like > thi

Re: How to get matched terms

2010-01-28 Thread Benjamin Heilbrunn
You could use Query.extractTerms(..) and then search for possible matches in the field term vector (requires stored TV). 2010/1/28 Vaijanath Rao : > Hi All, > > What is the simplest way of getting the matched terms of the query with > respect to the document. So for example let's say a document ha
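The suggestion above boils down to intersecting two term sets: the terms pulled out of the query and the terms present in the document's field. Here is a minimal plain-Java sketch of that intersection step only. It does not call Query.extractTerms or read a term vector; the class name, method name, and hard-coded term sets are hypothetical stand-ins for what those calls would supply.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class MatchedTermsSketch {

    // Returns the query terms that also occur in the document's field,
    // preserving the order in which they appear in the query.
    static Set<String> matchedTerms(Set<String> queryTerms, Set<String> docTerms) {
        Set<String> matched = new LinkedHashSet<>(queryTerms);
        matched.retainAll(docTerms);
        return matched;
    }

    public static void main(String[] args) {
        // Assumed inputs: field X contains "a b c"; the query asks for b, c and an extra term d.
        Set<String> docTerms = new LinkedHashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> queryTerms = new LinkedHashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println(matchedTerms(queryTerms, docTerms)); // prints [b, c]
    }
}
```

In a real index the document-side set would come from the stored term vector, which is why the reply notes that term vectors must be stored.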

How to get matched terms

2010-01-28 Thread Vaijanath Rao
Hi All, What is the simplest way of getting the matched terms of the query with respect to the document? So for example, let's say a document has field X and the contents of the field are "a b c". Now when I do a search for 'b c', the document will be returned. I want to get back the terms that this

Re: Average Precision - TREC-3

2010-01-28 Thread Robert Muir
in addition to what Grant said, even if your documents are similar, what about queries? For example, if only a few trec queries contain proper names, acronyms, abbreviations, or whatever, but your users frequently input things like this, it won't be representative. i will disagree with him on a f

Highlighter / cannot be instantiated

2010-01-28 Thread Marc Schwarz
I'm trying to get the highlighter running, but couldn't get it to work. Everywhere it's posted as follows: Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query)); but that gives me a "Highlighter is abstract; cannot be instantiated". I'm using version 2.9 of

Re: Average Precision - TREC-3

2010-01-28 Thread Grant Ingersoll
On Jan 27, 2010, at 1:36 PM, Ivan Provalov wrote: > Robert, Grant: > > Thank you for your replies. > > Our goal is to fine-tune our existing system to perform better on relevance. What kind of documents do you have? Are they very similar to the TREC docs (i.e. news articles)? There can be

Search a PhraseQuery one multiple terms with the same position

2010-01-28 Thread Karsten F.
Hi, I have a problem with the checkedRepeats in SloppyPhraseScorer. This feature is for phrases like "1st word 2nd word". Without this feature the result would be the same as "1st word 2nd". OK. But I have an index with more than one token at the same position. The German sentence "Die käuflich

Re: Lucene full text search

2010-01-28 Thread Erick Erickson
Well, there are a couple of approaches: 1> enable leading wildcards and search for *arabic*. You probably don't want to do this, it's really, really expensive. 2> use the ngram (edgengram?) tokenizers. This'll cost you some index space, but that may be acceptable. HTH Erick 2010/1/28
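To illustrate why the second approach costs index space, here is a minimal plain-Java sketch of what n-gram and edge n-gram splitting produce for a single term. This is not Lucene's tokenizer code; the class name, method names, and gram sizes are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramSketch {

    // All contiguous substrings of length n, e.g. every 3-character window of the term.
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    // Prefixes only (the "edge" variant), with lengths from minLen to maxLen.
    static List<String> edgeNgrams(String term, int minLen, int maxLen) {
        List<String> grams = new ArrayList<>();
        for (int len = minLen; len <= Math.min(maxLen, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("arabic", 3));        // prints [ara, rab, abi, bic]
        System.out.println(edgeNgrams("arabic", 2, 4)); // prints [ar, ara, arab]
    }
}
```

Each indexed term expands into several grams, so a substring lookup becomes an ordinary term match at query time instead of a scan over the term dictionary, at the price of a larger index.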

Lucene full text search

2010-01-28 Thread Lutischán Ferenc
Hi, I have a problem with Lucene: I indexed an English phrase list with Lucene: doc.add(new Field("r1", r1.toLowerCase(), Field.Store.NO, Field.Index.ANALYZED)); I searched for the word 'arabic': Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

combine query score with external score

2010-01-28 Thread Dennis Hendriksen
Hi, I'm struggling to create a performant query in Lucene 3.0.0 in which I want to combine 'regular' scoring with scores derived from external sources. For each document a fixed set of scores is calculated in the range [0.0, 1.0>. These scores represent the confidences that a document falls into

Roadmap for next release

2010-01-28 Thread Ganesh
Hello all, Please provide information on the roadmap for the next release. This will be really helpful in planning our product roadmap for this year. Are the features below planned for this year? - 1. To reduce sorting memory consumpti