Re: PDF text extracted without spaces

2010-12-02 Thread Lance Norskog
The text should come out as a stream of words with spaces, but without any of the formatting in the PDF. Extraction is only good enough to tell you that a word is somewhere inside a PDF file. Can you post a short bit of the text that it extracted? Also, you should try this test on different PDF fi

PDF text extracted without spaces

2010-12-02 Thread Ganesh
Hello all, I know this is not the right group to ask this question, but I thought some of you might have experienced this. I'm a newbie with Tika, using the latest version, 0.8. I extracted text from a PDF document but found spaces and newlines missing. Indexing the data gives wrong results. Cou
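
For reference, a minimal sketch of dumping Tika's extracted text so the missing spaces and newlines can be inspected before the text ever reaches the index; the path sample.pdf is a placeholder, and the API shown is the generic AutoDetectParser/BodyContentHandler route rather than anything specific to the 0.8 release.

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfTextDump {
        public static void main(String[] args) throws Exception {
            InputStream in = new FileInputStream("sample.pdf"); // placeholder path
            try {
                BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
                Metadata metadata = new Metadata();
                new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
                // Print the raw extracted text so missing spaces/newlines are visible
                // before the text reaches any analyzer or the index.
                System.out.println(handler.toString());
            } finally {
                in.close();
            }
        }
    }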

RE: tokensFromAnalysis

2010-12-02 Thread McGibbney, Lewis John
Having done a search, the version is 2.3! I have been making changes since posting my first message; if it is OK I will add a reply to this thread in the coming days if I bump into more problems. I didn't realise the API had changed to the degree it has. Lewis -Original Message- From

RE: tokensFromAnalysis

2010-12-02 Thread Steven A Rowe
Lewis, Simon asked about the version of Lucene you're using because this section of the API has seen regular change. If you don't tell us which version, we can't help, because we don't know what you're coding against. Steve > -Original Message- > From: McGibbney, Lewis John [mailto:le

RE: tokensFromAnalysis

2010-12-02 Thread McGibbney, Lewis John
I have been trying to reuse code which was originally written a while ago, as you can tell. I am looking for suggestions as to how I could get the code working; if this is not possible, I will start from scratch. Thank you -Original Message- From: Simon Willnauer [mailto:s

Re: tokensFromAnalysis

2010-12-02 Thread Simon Willnauer
Man, what version of Lucene are you using? simon On Thu, Dec 2, 2010 at 4:27 PM, McGibbney, Lewis John wrote: > Hello List, > > Having posted a couple of days ago, I have one last question regarding the > following code fragment > >  public static Token[] tokensFromAnalysis(Analyzer analyzer, S

tokensFromAnalysis

2010-12-02 Thread McGibbney, Lewis John
Hello List, Having posted a couple of days ago, I have one last question regarding the following code fragment: public static Token[] tokensFromAnalysis(Analyzer analyzer, String text) throws IOException { TokenStream stream = analyzer.tokenStream("contents", new StringReader(t
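
Since the Token[] approach from the 2.x-era examples no longer lines up with later APIs, below is a rough sketch of the attribute-based replacement (written against roughly a Lucene 3.1+ API, returning plain strings instead of Token objects); the field name "contents" is carried over from the fragment above.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalysisUtil {
        /** Collects the terms an analyzer produces for the given text. */
        public static List<String> tokensFromAnalysis(Analyzer analyzer, String text)
                throws IOException {
            List<String> terms = new ArrayList<String>();
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                      // required before the first incrementToken()
            while (stream.incrementToken()) {
                terms.add(termAtt.toString());
            }
            stream.end();
            stream.close();
            return terms;
        }
    }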

Re: Using metadata of the requested files with Lucene

2010-12-02 Thread Erick Erickson
An example would help. But assuming you've indexed the part or filename (and that it's unique), just search for it. You should only get a single document back, and then IndexReader.doc(luceneID) will get you the stored fields for that document. You have to watch out for tokenization of file names (use Key
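
A minimal sketch of that lookup, written against a roughly 3.x-era searcher API and assuming a hypothetical untokenized "filename" field that is unique per document:

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class MetadataLookup {
        /** Looks up the stored metadata for a uniquely named file, or returns null. */
        public static Document findByFilename(IndexSearcher searcher, String filename)
                throws IOException {
            // "filename" is a hypothetical field name; it must be indexed untokenized
            // (keyword-style) so the exact string matches.
            TopDocs hits = searcher.search(new TermQuery(new Term("filename", filename)), 1);
            if (hits.scoreDocs.length == 0) {
                return null;
            }
            ScoreDoc top = hits.scoreDocs[0];
            return searcher.doc(top.doc);   // the stored fields for that Lucene doc id
        }
    }

Document.get("title") and similar calls on the returned document then give back whatever was stored at index time.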

Re: a proof that every word is indexing properly

2010-12-02 Thread Erick Erickson
I'm really curious how your expert knows that the present system "indexes every word properly". You can certainly test any scenario that can be defined precisely via unit tests, as Lance suggests. Ask for *concrete* examples he's concerned with. Write tests to show that each example works. Ask for m
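
As an illustration of the kind of concrete, precisely defined test being suggested, here is a small self-contained sketch against a roughly Lucene 3.0-era API: index one document into a RAMDirectory and assert that a given word comes back from a search.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class EveryWordIndexedTest {
        public void testWordIsSearchable() throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            Document doc = new Document();
            doc.add(new Field("body", "the quick brown fox",
                    Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(dir);
            int hits = searcher.search(new TermQuery(new Term("body", "quick")), 1).totalHits;
            if (hits != 1) {
                throw new AssertionError("expected 'quick' to be found exactly once");
            }
            searcher.close();
        }
    }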

Re: Using metadata of the requested files with Lucene

2010-12-02 Thread Ian Lea
You need to store the data in the index (Field.Store.YES) and then you can get it back by calling doc.get("fieldname"). -- Ian. On Thu, Dec 2, 2010 at 1:34 PM, reis3k wrote: > > Hi All, > I'm trying to write a small app, ebook organizer, using Lucene. > > I index metadata of various file types
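
The indexing side of that suggestion, sketched against a roughly 3.x-era Field API; the field names filename, title and author are illustrative only:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class EbookDocFactory {
        /** Builds a Lucene document whose metadata can be read back after a search. */
        public static Document build(String filename, String title, String author) {
            Document doc = new Document();
            // Stored + untokenized so the exact filename can be used as a lookup key.
            doc.add(new Field("filename", filename, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Stored + analyzed so they are both searchable and retrievable via doc.get(...).
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
            return doc;
        }
    }

After a search, doc.get("filename") or doc.get("title") then returns exactly the string that was stored.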

Using metadata of the requested files with Lucene

2010-12-02 Thread reis3k
Hi All, I'm trying to write a small app, an ebook organizer, using Lucene. I index metadata of various file types properly, and when I search a keyword related to the metadata of documents I get a result. However, I want to get the metadata of a specific indexed document, e.g. I'll send the part/filena

Re: Analyzer

2010-12-02 Thread Ahmet Arslan
> By the way, is there an analyzer > which splits each letter of a word? > e.g. > hello world => h/e/l/l/o/w/o/r/l/d There are classes under the package org.apache.lucene.analysis.ngram
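
For example, a rough sketch using NGramTokenizer from that package with a minimum and maximum gram size of 1, written against a roughly Lucene 3.1-era constructor that takes a Reader:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class LetterSplitDemo {
        public static void main(String[] args) throws IOException {
            // min and max gram size of 1 => one token per character.
            NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("hello"), 1, 1);
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());   // h, e, l, l, o
            }
            tokenizer.end();
            tokenizer.close();
        }
    }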

Re: Analyzer

2010-12-02 Thread Christoph Hermann
On Thursday, 2 December 2010 at 11:11:03, Sean wrote: Hi, > By the way, is there an analyzer which splits each letter of a word? > e.g. > hello world => h/e/l/l/o/w/o/r/l/d There is a CharTokenizer; that should help you. regards Christoph Hermann -- Christoph Hermann Institut für Informati

Re: Analyzer

2010-12-02 Thread Sean
By the way, is there an analyzer which splits each letter of a word? e.g. hello world => h/e/l/l/o/w/o/r/l/d Regards, Sean -- Original -- From: "Erick Erickson"; Date: Tue, Nov 30, 2010 09:07 PM To: "java-user"; Subject: Re: Analyzer WhitespaceAnalyzer

Re: Wikileaks Iraq log

2010-12-02 Thread Simon Willnauer
On Wed, Dec 1, 2010 at 11:10 PM, Uwe Schindler wrote: > The question is: What does this have to do with Lucene? Nothing! Can you please not discuss things that are already known to be off-topic on this list! Thank you! simon > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.th

Re: a proof that every word is indexing properly

2010-12-02 Thread Toke Eskildsen
On Thu, 2010-12-02 at 03:54 +0100, David Linde wrote: > Has anyone figured out a way to logically prove that lucene indexes every > word properly? The "Precision and recall in lucene" thread seems relevant here. > Our company has done a lot of research into lucene, all of our IT department > is rea

Re: Analyzer

2010-12-02 Thread manjula wijewickrema
Dear Erick, Thanks for your information. Manjula. On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson wrote: > WhitespaceAnalyzer does just that: splits the incoming stream on > white space. > > From the javadocs for StandardAnalyzer: > > A grammar-based tokenizer constructed with JFlex > > This sho
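
For comparison, a small sketch (assuming a roughly Lucene 3.1-era API, where both analyzers take a Version argument) that prints what WhitespaceAnalyzer and StandardAnalyzer each produce for the same input:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalyzerComparison {
        public static void main(String[] args) throws IOException {
            // WhitespaceAnalyzer keeps [The] [quick-brown] [Fox] untouched;
            // StandardAnalyzer lowercases, splits on the hyphen and drops the stopword.
            String text = "The quick-brown Fox";
            print("WhitespaceAnalyzer", new WhitespaceAnalyzer(Version.LUCENE_31), text);
            print("StandardAnalyzer  ", new StandardAnalyzer(Version.LUCENE_31), text);
        }

        private static void print(String label, Analyzer analyzer, String text) throws IOException {
            TokenStream stream = analyzer.tokenStream("f", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            StringBuilder out = new StringBuilder(label + ": ");
            stream.reset();
            while (stream.incrementToken()) {
                out.append('[').append(term).append("] ");
            }
            stream.end();
            stream.close();
            System.out.println(out);
        }
    }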