Nutch is written in Java, so Nutch itself *should* work on other non-Linux OSs
that the JVM supports.
But it does contain some shell scripts, as does the Hadoop that Nutch uses;
Windows users typically run those under Cygwin.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Sure, you can add any data to any document that you want,
probably stored but not indexed in this case. It could even
be a serialized Java object. Or an XML packet or a
stringized map. Or... whatever suits your fancy. If it's not
indexed, only stored, it'll make your index larger but have
a negligible effect on search performance.
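A minimal sketch of what that looks like with the Lucene 2.3-era Field API; the field names and the serialized value here are made up for illustration:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Assumes bookText and serializedPageMap are Strings you already have.
Document doc = new Document();
// Indexed for search, not stored.
doc.add(new Field("contents", bookText,
                  Field.Store.NO, Field.Index.TOKENIZED));
// Stored only: retrievable with the hit, contributes nothing to the
// inverted index (and so nothing to search cost beyond index size).
doc.add(new Field("pageInfo", serializedPageMap,
                  Field.Store.YES, Field.Index.NO));
```

(In Lucene 2.4+ the constant is Field.Index.ANALYZED rather than TOKENIZED.)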
I have a large number of text files (books) that I am trying to make
searchable with Lucene 2.3.2.
I would like search results to display the page and chapter in which a match
with the search term occurred.
My question is whether it is possible to add structural data (xml perhaps)
to the files so that the page and chapter can be reported with each hit.
On Fri, Jan 8, 2010 at 16:27, Jamie wrote:
> Hi Ian / Will
>
> Thanks. Surely, the Porter Stemmer should not stem proper nouns, i.e. it
> could check the capitalization of the first letter of a word and whether or
> not the word is the start of a sentence. If so, it could choose not to apply
> any stemming. Or am I completely out of whack?
You can find the issue for this here
https://issues.apache.org/jira/browse/LUCENE-2199
On Fri, Jan 8, 2010 at 8:53 PM, Simon Willnauer
wrote:
> This is truly a bug. The outputUnigram internally only works if you
> request bi-grams.
> If the outputUnigram is set to false the filter increments the shingle
> position by one and therefore skips every even shingle.
Hi Ian / Will
Thanks. Surely, the Porter Stemmer should not stem proper nouns, i.e.
it could check the capitalization of the first letter of a word and
whether or not the word is the start of a sentence. If so, it could
choose not to apply any stemming. Or am I completely out of whack?
Jamie
Looks like PorterStemFilter converts "Lowe's" to "low". Not very surprising.
Options include:
. Drop the stemming
. Index stemmed and non-stemmed variants and search both, maybe
boosting the non-stemmed variant.
If you really want exact matches only, you may also/instead want
untokenized fields.
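A rough sketch of the second option, assuming the same text was indexed into two fields, one through a stemming analyzer and one not (the field names "body" and "body_exact" are hypothetical; a PerFieldAnalyzerWrapper is one way to route different analyzers to them):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Search both fields; boost the unstemmed field so exact matches rank first.
// Terms shown as they might appear after lowercasing but no stemming.
TermQuery exact = new TermQuery(new Term("body_exact", "lowe's"));
exact.setBoost(4.0f); // prefer documents matching the unstemmed form

TermQuery stemmed = new TermQuery(new Term("body", "low"));

BooleanQuery query = new BooleanQuery();
query.add(exact, BooleanClause.Occur.SHOULD);
query.add(stemmed, BooleanClause.Occur.SHOULD);
```

Documents matching only the stemmed form still match, but ones containing the literal "Lowe's" score higher.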
On Fri, Jan 8, 2010 at 15:01, Jamie wrote:
> Hi There
>
> We are trying to search for the exact word "Lowe's" across a large set of
> indexed data. Our results include everything with "low" in it. Thus, we are
> receiving a much larger data set than we expected. The data is indexed
> using the analyzer:
Hi There
We are trying to search for the exact word "Lowe's" across a large set
of indexed data. Our results include everything with "low" in it. Thus,
we are receiving a much larger data set than we expected. The data is
indexed using the analyzer:
TokenStream result = new Standa
This is truly a bug. The outputUnigram internally only works if you
request bi-grams.
If the outputUnigram is set to false the filter increments the shingle
position by one and therefore skips every even shingle. The position
should only be incremented if shingleBufferPosition %
maxShingle == 0.
: I am using lucene 2.9.1 and I was trying to understand the ShingleFilter and
wrote the code below.
...
: I was expecting the output as follows with maxShingleSize=3 and
outputUnigrams=false :
...
: Am I missing something or this is the expected behavior?
I'm not very familiar
What are the associated Analyzers for your Gene and Token?
Because if they're NOT something akin to KeywordAnalyzer, you
have a problem. Specifically, most of the "regular" tokenizers will
break this stream up into three separate terms,
"brain", "natriuretic", and "peptide". If that's the case, the
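For the record, a small sketch of routing a KeywordAnalyzer to just the Gene field while other fields are analyzed normally (analyzer choice and version are assumptions, Lucene 2.9-era API):

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// Keep each Gene value as a single term ("brain natriuretic peptide"
// stays one token) while everything else is tokenized normally.
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
analyzer.addAnalyzer("Gene", new KeywordAnalyzer());
// Pass `analyzer` to the IndexWriter, and use the same wrapper at query
// time so the query term is tokenized the same way.
```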
I'm not going to go into too much code level detail, however I'd index
the phrases using tri-gram shingles, and as uni-grams. I think
this'll give you the results you're looking for. You'll be able to
quickly recall the count of a given phrase aka tri-gram such as
"blue_shorts_burough"
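A sketch of the shingle idea using the contrib ShingleFilter (Lucene 2.9-era API; the tokenizer choice and `text` variable are placeholders):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Emit unigrams plus shingles up to size 3, so a phrase like
// "brain natriuretic peptide" is also indexed as a single tri-gram term.
TokenStream ts =
    new StandardTokenizer(Version.LUCENE_29, new StringReader(text));
ShingleFilter shingles = new ShingleFilter(ts, 3); // max shingle size
shingles.setOutputUnigrams(true); // keep the single words as well
```

With the tri-gram in the index as one term, getting its document frequency or term frequency is an ordinary single-term lookup.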
@All: Elaborating on the problem.
The phrase is being indexed as a single token ...
I have a Gene tag in the xml document which is like
brain natriuretic peptide
This phrase is present in the abstract text for the given document.
The code is:
doc.add(new Field("Gene", geneName, Field.Store.YES
When do you detect that they are phrases? During indexing or during search?
On Jan 8, 2010, at 5:16 AM, hrishim wrote:
>
> Hi .
> I have phrases like brain natriuretic peptide indexed as a single token
> using Lucene.
> When I calculate the term frequency for the same the count is 0 since the
Mike,
thanks a lot!
That's exactly what we'll do.
Actually we have a lot of dynamic fields which are not analyzed and not
involved in field/document boosting, so we can disable norms on these fields
without problems.
Thanks again.
Yuliya
> -----Original Message-----
> From: Michael
On a quick read, your statements are contradictory.
Either "brain natriuretic peptide" is a single token/term or it's not.
Are you sure you're not confusing indexing and storing? What
analyzer are you using at index time?
Erick
On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:
Lucene stores 1 byte (disk and RAM, when searching that field) per
document for any field that has norms enabled, even for documents that
do not contain that field.
In your case, that's ~20 MB per field (once optimize is done), times
559 fields = ~11 GB of storage.
You should index these fields with norms omitted.
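A sketch of turning norms off for such fields (Lucene 2.4+ API; field names and the `value` variable are placeholders):

```java
import org.apache.lucene.document.Field;

// For fields that need no length normalization or index-time boosting,
// the *_NO_NORMS index modes skip the 1-byte-per-document norm entirely.
doc.add(new Field("someDynamicField", value,
                  Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));

// Equivalently, on an existing Field instance before adding it:
Field f = new Field("other", value, Field.Store.NO, Field.Index.ANALYZED);
f.setOmitNorms(true);
```

Note that norms are "sticky": once any document in a segment has norms for a field, merging can reintroduce them, so all writers should omit them consistently.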
Thanks Michael.
You are probably right.
The unoptimized size is 4.1 GB; the optimized index is about 15 GB.
Yes, our documents do have many different indexed fields and norms are enabled.
Nr of fields: 559
Nr of documents: 20845906
Nr of terms: 25615389
Could you please give me a more detailed explanation?
Normally, this (using an IndexReader, [re-]opening a new IndexReader
while an IndexWriter is committing) is perfectly fine. The reader
searches the point-in-time snapshot of the index as of when it was
opened.
But: what filesystem are you using? NFS presents challenges, for example.
Mike
On Fr
One technique I've seen commonly used is to index both stemmed and
unstemmed fields, and during search query both and boost the unstemmed
field matches higher.
Erik
On Jan 8, 2010, at 4:05 AM, Yannick Caillaux wrote:
Hi,
I index 2 documents. The first contains the word "Wallis" in the title field.
Hi,
I often get a FileNotFoundException when my single IndexWriter commits while
an IndexReader also tries to read. My application is multithreaded (Tomcat
uses the business APIs); I first thought the read/write access was
thread-safe, but I am probably forgetting something.
Please help me understand.
On Fri, Jan 8, 2010 at 1:22 AM, Babak Farhang wrote:
>>> I wonder if renaming that to maxSegSizeMergeMB would make it more obvious
>>> what this does?
>
> How about using the *able* moniker to make it clear we're referring to
> the size of the to-be-merged segment, not the resultant merged
> segment?
Issue a PhraseQuery and count how many hits came back? Is that too
slow? If so, you could detect all phrases during indexing and add
them as tokens to the index?
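A sketch of the PhraseQuery approach, assuming the phrase lives in a field called "abstract" (a guess) and `searcher` is an open IndexSearcher (Lucene 2.9-era API):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;

// Count documents containing the exact phrase "brain natriuretic peptide".
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("abstract", "brain"));
phrase.add(new Term("abstract", "natriuretic"));
phrase.add(new Term("abstract", "peptide"));

TopDocs hits = searcher.search(phrase, 1);
int docCount = hits.totalHits; // documents matching the phrase
```

Note this gives the number of matching documents, not the number of occurrences; counting occurrences per document would need span/position inspection or the shingle approach at index time.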
Mike
On Fri, Jan 8, 2010 at 5:16 AM, hrishim wrote:
>
> Hi .
> I have phrases like brain natriuretic peptide indexed as a single token
Thanks Otis, that's very helpful.
On Fri, Jan 8, 2010 at 2:08 AM, Otis Gospodnetic wrote:
> Ah, well, masking it didn't help. Yes, ignore Bixo, Nutch, and Droids
> then.
> Consider DataImportHandler from Solr or wait a bit for Lucene Connectors
> Framework to materialize. Or use LuSql, or DbSi
Hi .
I have phrases like brain natriuretic peptide indexed as a single token
using Lucene.
When I calculate the term frequency for the same, the count is 0 since the
tokens from the text are indexed separately, i.e. brain, natriuretic,
peptide.
Is there a way to solve this problem and get the term frequency for the
whole phrase?
You need contrib-memory.jar in your classpath to use MemoryIndex.
simon
On Fri, Jan 8, 2010 at 10:42 AM, Li Leon wrote:
> Hi all,
>
> I was able to get a whole sentence(including stop words) highlighted with
> "StandardAnalyzer" and an empty stop words String[].
>
> The current issue I'm having
Hi Paul,
Thanks.
I'll use Nutch to do the crawling, and integrate Lucene into the web
application so that it can do search online.
BTW, Nutch seems to have only a Linux version, while my development is on
Windows. Am I right?
Zhou
--- On Fri, 8/1/10, Paul Libbrecht wrote:
Hi all,
I was able to get a whole sentence(including stop words) highlighted with
"StandardAnalyzer" and an empty stop words String[].
The current issue I'm having is that not only is the whole sentence
highlighted, but tokens that partially match the sentence are also
highlighted. I tried to u
Hi,
I index 2 documents. The first contains the word "Wallis" in the title
field. The second has the same title but "Wallis" is replaced by "Wall".
I execute the query: "title:wallis"
During the search, "Wallis" is cut by the FrenchAnalyzer and becomes
"wall". So both documents appear in the results.
Zhou,
Lucene is a back-end library; it's very useful for developers, but it is
not a complete site-search engine.
A Lucene-based site-search engine is Nutch, which does crawl.
Solr also provides functions close to these, with a lot of thought given
to flexible integration; crawling methods are