The text should come out as a stream of words with spaces, but without
any of the formatting in the PDF. Extraction is only good enough to
tell you that a word is somewhere inside a PDF file. Can you post a
short bit of the text that it extracted?
Also, you should try this test on different PDF files.
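For reference, a minimal sketch of such a test, assuming Tika 0.8 and a made-up file name; it simply dumps whatever Tika extracts so the whitespace handling can be inspected directly:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DumpExtractedText {
    public static void main(String[] args) throws Exception {
        // path to one of the PDFs under test (hypothetical)
        InputStream in = new FileInputStream("sample.pdf");
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no character limit
        try {
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        } finally {
            in.close();
        }
        // print the raw extracted text so missing spaces/newlines are visible
        System.out.println(handler.toString());
    }
}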
Hello all,
I know this is not the right group to ask this question, but I thought some of you
might have experienced this.
I am a newbie with Tika and am using the latest version, 0.8. I extracted text
from a PDF document but found spaces and newlines missing. Indexing the data
gives wrong results. Cou
Having done a search, the version is 2.3!!!
I have been making changes since posting my first message; if it is OK, I will
add a reply to this thread in the coming days if I bump into more problems. I
didn't realise the API had changed to the degree it has.
Lewis
-----Original Message-----
From
Lewis,
Simon asked about the version of Lucene you're using because this section of
the API has seen regular change. If you don't tell us which version, we can't
help, because we don't know what you're coding against.
Steve
> -----Original Message-----
> From: McGibbney, Lewis John [mailto:le
I have been trying to reuse code which was originally written a while ago as
you can tell.
I am looking for suggestions as to how I could get the code working; if this is
not possible, I will start from scratch.
Thank you
-----Original Message-----
From: Simon Willnauer [mailto:s
man what version of lucene are you using?
simon
On Thu, Dec 2, 2010 at 4:27 PM, McGibbney, Lewis John
wrote:
> Hello List,
>
> Having posted a couple of days ago, I have one last question regarding the
> following code fragment
>
> public static Token[] tokensFromAnalysis(Analyzer analyzer, S
Hello List,
Having posted a couple of days ago, I have one last question regarding the
following code fragment
public static Token[] tokensFromAnalysis(Analyzer analyzer, String text)
    throws IOException {
  TokenStream stream = analyzer.tokenStream("contents",
      new StringReader(text));
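For comparison, here is a sketch of how that fragment is typically written against the attribute-based TokenStream API that replaced Token/next() from Lucene 2.9 onwards; the class name and the choice to return plain term strings instead of Token objects are illustrative:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AnalysisUtils {
    public static String[] termsFromAnalysis(Analyzer analyzer, String text)
            throws IOException {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        // the attribute instance is reused for every token; copy its value inside the loop
        TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
        List<String> terms = new ArrayList<String>();
        stream.reset();
        while (stream.incrementToken()) {
            terms.add(termAtt.term());
        }
        stream.end();
        stream.close();
        return terms.toArray(new String[terms.size()]);
    }
}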
An example would help. But assuming you've indexed the part or
filename (and that it's unique), just search for it. You should
only get a single document back, and then IndexReader.doc(luceneID)
will get you the stored fields for that document.
You have to watch out for tokenization of file names (use Key
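A minimal sketch of that lookup, assuming Lucene 3.0.x; the index path and the field names ("filename", "title", "author") are purely illustrative, and "filename" is assumed to have been indexed untokenized and stored:

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LookupByFilename {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        // exact match on the untokenized filename field; expect at most one hit
        TopDocs hits = searcher.search(new TermQuery(new Term("filename", "some-ebook.pdf")), 1);
        if (hits.totalHits > 0) {
            Document doc = searcher.doc(hits.scoreDocs[0].doc);
            // only fields stored with Field.Store.YES come back here
            System.out.println(doc.get("title") + " by " + doc.get("author"));
        }
        searcher.close();
        reader.close();
    }
}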
I'm really curious how your expert knows that the present system
"indexes every word properly". You can certainly test any scenario that
can be defined precisely via unit tests as Lance suggests.
Ask for *concrete* examples he's concerned with. Write tests to show that
each
example works. Ask for m
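As an illustration of that kind of test, a small self-contained sketch assuming Lucene 3.0.x and JUnit 4; the field name and sample text are made up:

import static org.junit.Assert.assertEquals;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class IndexingSmokeTest {
    @Test
    public void expectedTermIsSearchable() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents", "the quick brown fox", Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        // the concrete example: the word "quick" must be findable after indexing
        assertEquals(1, searcher.search(new TermQuery(new Term("contents", "quick")), 1).totalHits);
        searcher.close();
    }
}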
You need to store the data in the index (Field.Store.YES) and then you
can get it back by calling doc.get("fieldname").
--
Ian.
On Thu, Dec 2, 2010 at 1:34 PM, reis3k wrote:
>
> Hi All,
> I'm trying to write a small app, ebook organizer, using Lucene.
>
> I index metadata of various file types
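On the indexing side, Ian's Field.Store.YES point looks roughly like this; a sketch assuming Lucene 3.0.x, with the index path, field names and values all illustrative:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexEbookMetadata {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/path/to/index")),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // stored + untokenized so it can be matched exactly and returned as-is
        doc.add(new Field("filename", "some-ebook.pdf", Field.Store.YES, Field.Index.NOT_ANALYZED));
        // stored + analyzed so they are searchable and retrievable via doc.get(...)
        doc.add(new Field("title", "Some Ebook Title", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("author", "Jane Doe", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}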
Hi All,
I'm trying to write a small app, ebook organizer, using Lucene.
I index metadata of various file types properly, and when I search a keyword
related to the metadata of the documents I get a result. However, I want to get the
metadata of a specific indexed document, e.g. I'll send the part/filename
> By the way, is there an analyzer
> which splits each letter of a word?
> e.g.
> hello world => h/e/l/l/o/w/o/r/l/d
There are classes under the package org.apache.lucene.analysis.ngram
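A sketch of how that could be wired up for single letters, assuming the contrib analyzers jar from Lucene 2.9/3.x; note that with a plain NGramTokenizer the space character also comes out as a token, so it may need filtering:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SingleLetterTokens {
    public static void main(String[] args) throws IOException {
        // min gram = max gram = 1 => one token per character
        TokenStream stream = new NGramTokenizer(new StringReader("hello world"), 1, 1);
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        StringBuilder out = new StringBuilder();
        while (stream.incrementToken()) {
            if (out.length() > 0) {
                out.append('/');
            }
            out.append(term.term());
        }
        stream.close();
        System.out.println(out); // h/e/l/l/o/ /w/o/r/l/d (note the space token)
    }
}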
On Thursday, 2 December 2010 at 11:11:03, Sean wrote:
Hi,
> By the way, is there an analyzer which splits each letter of a word?
> e.g.
> hello world => h/e/l/l/o/w/o/r/l/d
There is a CharTokenizer that should help you.
regards
Christoph Hermann
--
Christoph Hermann
Institut für Informatik
By the way, is there an analyzer which splits each letter of a word?
e.g.
hello world => h/e/l/l/o/w/o/r/l/d
Regards,
Sean
-- Original --
From: "Erick Erickson";
Date: Tue, Nov 30, 2010 09:07 PM
To: "java-user";
Subject: Re: Analyzer
WhitespaceAnalyzer
On Wed, Dec 1, 2010 at 11:10 PM, Uwe Schindler wrote:
> The question is: What does this have to do with Lucene?
nothing!!!
Can you please not discuss things that are already known to be off-topic on this list!
thank you !
simon
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.th
On Thu, 2010-12-02 at 03:54 +0100, David Linde wrote:
> Has anyone figured out a way to logically prove that lucene indexes every
> word properly?
The "Precision and recall in lucene"-thread seems relevant here.
> Our company has done a lot of research into lucene, all of our IT department
> is rea
Dear Erick,
Thanx for your information.
Manjula.
On Tue, Nov 30, 2010 at 6:37 PM, Erick Erickson wrote:
> WhitespaceAnalyzer does just that: it splits the incoming stream on
> white space.
>
> From the javadocs for StandardAnalyzer:
>
> A grammar-based tokenizer constructed with JFlex
>
> This sho
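To see the difference Erick describes, a small sketch assuming Lucene 3.0.x that prints the tokens each analyzer produces for the same input; the sample text is arbitrary:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {
    public static void main(String[] args) throws IOException {
        String text = "The Quick-Brown Fox, e-mail: fox@example.com";
        printTokens(new WhitespaceAnalyzer(), text);                 // splits on whitespace only, keeps case and punctuation
        printTokens(new StandardAnalyzer(Version.LUCENE_30), text);  // grammar-based (JFlex), lowercases, drops stop words
    }

    private static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        System.out.println();
        stream.close();
    }
}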