Re: PDF text extracted without spaces

2010-12-02 Thread Lance Norskog
The text should come out as a stream of words with spaces, but without any of the formatting in the PDF. Extraction is only good enough to tell you that a word is somewhere inside a PDF file. Can you post a short bit of the text that it extracted? Also, you should try this test on different PDF fi

PDF text extracted without spaces

2010-12-02 Thread Ganesh
Hello all, I know this is not the right group to ask this question, but I thought some of you might have experienced this. I'm a newbie with Tika, using the latest version, 0.8. I extracted text from a PDF document but found spaces and newlines missing. Indexing the data gives wrong results. Cou
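
For reference, a minimal sketch of dumping Tika's extracted text so the missing spaces and newlines can be inspected before the text ever reaches the index; the path sample.pdf is a placeholder, and the API shown is the generic AutoDetectParser/BodyContentHandler route rather than anything specific to the 0.8 release.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfTextDump {
        public static void main(String[] args) throws Exception {
            InputStream in = new FileInputStream("sample.pdf"); // placeholder path
            try {
                BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
                Metadata metadata = new Metadata();
                new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
                // Print the raw extracted text so missing spaces/newlines are visible
                // before the text reaches any analyzer or the index.
                System.out.println(handler.toString());
            } finally {
                in.close();
            }
        }
    }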

RE: tokensFromAnalysis

2010-12-02 Thread McGibbney, Lewis John
Having done a search, the version is 2.3! I have been making changes since posting my first message; if it is OK I will add a reply to this thread in the coming days if I bump into more problems. I didn't realise the API had changed to the degree it has. Lewis -Original Message- From

RE: tokensFromAnalysis

2010-12-02 Thread Steven A Rowe
Lewis, Simon asked about the version of Lucene you're using because this section of the API has seen regular change. If you don't tell us which version, we can't help, because we don't know what you're coding against. Steve > -Original Message- > From: McGibbney, Lewis John [mailto:le

RE: tokensFromAnalysis

2010-12-02 Thread McGibbney, Lewis John
I have been trying to reuse code which was originally written a while ago, as you can tell. I am looking for suggestions as to how I could get the code working; if this is not possible, I will start from scratch. Thank you -Original Message- From: Simon Willnauer [mailto:s

Re: tokensFromAnalysis

2010-12-02 Thread Simon Willnauer
Man, what version of Lucene are you using? simon On Thu, Dec 2, 2010 at 4:27 PM, McGibbney, Lewis John wrote: > Hello List, > > Having posted a couple of days ago, I have one last question regarding the > following code fragment > >  public static Token[] tokensFromAnalysis(Analyzer analyzer, S

tokensFromAnalysis

2010-12-02 Thread McGibbney, Lewis John
Hello List, Having posted a couple of days ago, I have one last question regarding the following code fragment: public static Token[] tokensFromAnalysis(Analyzer analyzer, String text) throws IOException { TokenStream stream = analyzer.tokenStream("contents", new StringReader(t
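
Since the Token[] approach from the 2.x-era examples no longer lines up with later APIs, below is a rough sketch of the attribute-based replacement (written against roughly a Lucene 3.1+ API, returning plain strings instead of Token objects); the field name "contents" is carried over from the fragment above.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalysisUtil {
        /** Collects the terms an analyzer produces for the given text. */
        public static List<String> tokensFromAnalysis(Analyzer analyzer, String text)
                throws IOException {
            List<String> terms = new ArrayList<String>();
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                      // required before the first incrementToken()
            while (stream.incrementToken()) {
                terms.add(termAtt.toString());
            }
            stream.end();
            stream.close();
            return terms;
        }
    }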

Re: Using metadata of the requested files with Lucene

2010-12-02 Thread Erick Erickson
An example would help. But assuming you've indexed the part or filename (and that it's unique), just search for it. You should only get a single document back, and then IndexReader.doc(luceneID) will get you the stored fields for that document. You have to watch out for tokenization of file names (use Key
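
A minimal sketch of that lookup, written against a roughly 3.x-era searcher API and assuming a hypothetical untokenized "filename" field that is unique per document:

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class MetadataLookup {
        /** Looks up the stored metadata for a uniquely named file, or returns null. */
        public static Document findByFilename(IndexSearcher searcher, String filename)
                throws IOException {
            // "filename" is a hypothetical field name; it must be indexed untokenized
            // (keyword-style) so the exact string matches.
            TopDocs hits = searcher.search(new TermQuery(new Term("filename", filename)), 1);
            if (hits.scoreDocs.length == 0) {
                return null;
            }
            ScoreDoc top = hits.scoreDocs[0];
            return searcher.doc(top.doc);   // the stored fields for that Lucene doc id
        }
    }

Document.get("title") and similar calls on the returned document then give back whatever was stored at index time.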

Re: a proof that every word is indexing properly

2010-12-02 Thread Erick Erickson
I'm really curious how your expert knows that the present system "indexes every word properly". You can certainly test any scenario that can be defined precisely via unit tests, as Lance suggests. Ask for *concrete* examples he's concerned with. Write tests to show that each example works. Ask for m
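
As an illustration of the kind of concrete, precisely defined test being suggested, here is a small self-contained sketch against a roughly Lucene 3.0-era API: index one document into a RAMDirectory and assert that a given word comes back from a search.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class EveryWordIndexedTest {
        public void testWordIsSearchable() throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("body", "the quick brown fox",
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            int hits = searcher.search(new TermQuery(new Term("body", "quick")), 1).totalHits;
            if (hits != 1) {
                throw new AssertionError("expected 'quick' to be found exactly once");
            }
            searcher.close();
        }
    }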

Re: Using metadata of the requested files with Lucene

2010-12-02 Thread Ian Lea
You need to store the data in the index (Field.Store.YES) and then you can get it back by calling doc.get("fieldname"). -- Ian. On Thu, Dec 2, 2010 at 1:34 PM, reis3k wrote: > > Hi All, > I'm trying to write a small app, ebook organizer, using Lucene. > > I index metadata of various file types
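
The indexing side of that suggestion, sketched against a roughly 3.x-era Field API; the field names filename, title and author are illustrative only:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class EbookDocFactory {
        /** Builds a Lucene document whose metadata can be read back after a search. */
        public static Document build(String filename, String title, String author) {
            Document doc = new Document();
            // Stored + untokenized so the exact filename can be used as a lookup key.
            doc.add(new Field("filename", filename, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Stored + analyzed so they are both searchable and retrievable via doc.get(...).
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
            return doc;
        }
    }

After a search, doc.get("filename") or doc.get("title") then returns exactly the string that was stored.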

Using metadata of the requested files with Lucene

2010-12-02 Thread reis3k
Hi All, I'm trying to write a small app, an ebook organizer, using Lucene. I index metadata of various file types properly, and when I search a keyword related to the metadata of documents I get a result. However, I want to get the metadata of a specific indexed document, e.g. I'll send the part/filena

Re: Analyzer

2010-12-02 Thread Ahmet Arslan
> By the way, is there an analyzer > which splits each letter of a word? > e.g. > hello world => h/e/l/l/o/w/o/r/l/d There are classes under the package org.apache.lucene.analysis.ngram
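
For example, a rough sketch using NGramTokenizer from that package with a minimum and maximum gram size of 1, written against a roughly Lucene 3.1-era constructor that takes a Reader:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LetterSplitDemo {
        public static void main(String[] args) throws IOException {
            // min and max gram size of 1 => one token per character.
            NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("hello"), 1, 1);
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());   // h, e, l, l, o
            }
            tokenizer.end();
            tokenizer.close();
        }
    }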

Re: Analyzer

2010-12-02 Thread Christoph Hermann
On Thursday, 2 December 2010 at 11:11:03, Sean wrote: Hi, > By the way, is there an analyzer which splits each letter of a word? > e.g. > hello world => h/e/l/l/o/w/o/r/l/d There is a CharTokenizer; that should help you. regards Christoph Hermann -- Christoph Hermann Institut für Informati

Re: Analyzer

2010-12-02 Thread Sean
By the way, is there an analyzer which splits each letter of a word? e.g. hello world => h/e/l/l/o/w/o/r/l/d Regards, Sean -- Original -- From: "Erick Erickson"; Date: Tue, Nov 30, 2010 09:07 PM To: "java-user"; Subject: Re: Analyzer WhitespaceAnalyzer

Re: Wikileaks Iraq log

2010-12-02 Thread Simon Willnauer
On Wed, Dec 1, 2010 at 11:10 PM, Uwe Schindler wrote: > The question is: What does this have to do with Lucene? Nothing! Can you please not discuss things that are already known to be off-topic on this list! Thank you! simon > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.th

Re: a proof that every word is indexing properly

2010-12-02 Thread Toke Eskildsen
On Thu, 2010-12-02 at 03:54 +0100, David Linde wrote: > Has anyone figured out a way to logically prove that lucene indexes every > word properly? The "Precision and recall in lucene" thread seems relevant here. > Our company has done a lot of research into lucene, all of our IT department > is rea

Re: Analyzer

2010-12-02 Thread manjula wijewickrema
Dear Erick, Thanks for your information. Manjula. On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson wrote: > WhitespaceAnalyzer does just that: splits the incoming stream on > white space. > > From the javadocs for StandardAnalyzer: > > A grammar-based tokenizer constructed with JFlex > > This sho
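
For comparison, a small sketch (assuming a roughly Lucene 3.1-era API, where both analyzers take a Version argument) that prints what WhitespaceAnalyzer and StandardAnalyzer each produce for the same input:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzerComparison {
        public static void main(String[] args) throws IOException {
            // WhitespaceAnalyzer keeps [The] [quick-brown] [Fox] untouched;
            // StandardAnalyzer lowercases, splits on the hyphen and drops the stopword.
            String text = "The quick-brown Fox";
            print("WhitespaceAnalyzer", new WhitespaceAnalyzer(Version.LUCENE_31), text);
            print("StandardAnalyzer  ", new StandardAnalyzer(Version.LUCENE_31), text);
        }

        private static void print(String label, Analyzer analyzer, String text) throws IOException {
            TokenStream stream = analyzer.tokenStream("f", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            StringBuilder out = new StringBuilder(label + ": ");
            stream.reset();
            while (stream.incrementToken()) {
                out.append('[').append(term).append("] ");
            }
            stream.end();
            stream.close();
            System.out.println(out);
        }
    }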