Alex Chew wrote:
Hi,
Does anybody have experience building a distributed application with Lucene
and Hadoop/HDFS?
Lucene 2.3.1 does not seem to expose an HDFSDirectory.
Any advice will be appreciated.
Regards,
Alex
Have a look at Nutch.
M.
--
Rafael Turk wrote:
Hi Mathieu,
*What do you want to do?*
A spell checker and related-keyword suggestion
Here is a spell checker which I am trying to finalize:
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java
If you want an ngram => popularity map, just use a berkl
Rafael Turk wrote:
Hi Folks,
I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This corpus
contains English word n-grams and their observed frequency counts. The length
of the n-grams ranges from unigrams (single words) to five-grams.)
I'm loading each ngram (each row is an ngram) as an
Regarding data and its relationships - the use case I
am trying to solve is to partition my data into 2
indexes: a primary index that will contain the majority
of the data and is fairly static. The secondary
index will have related information for the same data
set in primary index and this relate
ild a Filter for the second, just like
with the previous JDBC example.
You can even cache the filter, like Solr does with its faceted search.
M.
Regards,
Rajesh
--- Mathieu Lecarme <[EMAIL PROTECTED]> wrote:
Have a look at Compass 2.0M3
http://www.kimchy.org/searchable-cascading-map
Antony Bowesman wrote:
We're planning to archive email over many years and have been looking
at using a DB to store mail metadata and Lucene for the indexed mail
data, or just Lucene on its own with the email data and structure stored
as XML and the raw message stored in the file system.
For so
Have a look at Compass 2.0M3
http://www.kimchy.org/searchable-cascading-mapping/
Your multiple indexes will be nice for massive writes. With a classical
read/write ratio, Compass will be much easier.
M.
Rajesh Parab wrote:
Hi,
We are using Lucene 2.0 to index data stored inside
relational dat
Have a look at Compass.
M.
Prashant Saraf wrote:
Hi,
We are planning to provide search functionality in a
web-based application. Can we use Lucene to search data from
databases like Oracle and MS SQL?
Thanks and Regards
प्रशांत सराफ
(Prashant Saraf)
S
Allen Atamer wrote:
My dictionary filter currently implements next() and everything works well
when dictionary entries are replaced one-to-one. For example: Can =>
Canada.
A problem arises when I try to replace it with more than one word. Going
through next() I encounter "shutdown". But
I'm cool :) I just think you are overcomplicating things.
Yes... I can use two words and OR
Suppose I query on this
The Lord of Rings: Return of King
The Lord of Rings: Fellowship
The Lord of Rings: The Two towers
The Lord of Weapons
The Lord of War
Suppose a user searches: "The Lord of Rings
On Apr 8, 2008, at 18:34, Karl Wettin wrote:
dreampeppers99 wrote:
1. Why do I need to pass a Directory object (mandatory) to the
constructor of SpellChecker?
Mainly because it is a nasty piece of code. But it does a good job.
Because SpellChecker uses a directory to store data. It can be an
FSDirectory
Use ShingleFilter.
I'm working on a wider SpellChecker, I'll post a third patch soon.
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java
M.
dreampeppers99 wrote:
Hi,
I have two questions about this GREAT tool.. (framework, library...
"whatever")
Well, I decided to put spe
[EMAIL PROTECTED] wrote:
The need is:
I have millions of entries in a database; each entry is in this format (more or
less):
ID, Name, Description, start (number), stop (number)
Currently my application uses the database to do searches; queries are in the
following format:
Select * fr
Marjan Celikik wrote:
Mathieu Lecarme wrote:
However, I don't fully understand what you mean by "iterate over
your query". I would like a conceptual answer for how this is done with
Lucene, not a technical one..
Your query is a tree, with BooleanQuery as branches and other que
Marjan Celikik wrote:
Mathieu Lecarme wrote:
You have to iterate over your query: if it's a BooleanQuery, keep it;
if it's a TermQuery, replace it with a BooleanQuery holding all variants
of the Term with Occur.SHOULD.
M.
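That tree walk can be sketched with simplified stand-ins for Lucene's query classes. These are hypothetical minimal versions, not the real API: actual code would iterate BooleanQuery's clauses and add each variant wrapped in a BooleanClause with Occur.SHOULD.

```java
import java.util.*;
import java.util.function.Function;

// Minimal stand-ins for Lucene's query classes, for illustration only.
abstract class Query {}

class TermQuery extends Query {
    final String term;
    TermQuery(String term) { this.term = term; }
}

class BooleanQuery extends Query {
    final List<Query> shouldClauses = new ArrayList<>();
}

public class QueryExpander {
    // Recursively rewrite the tree: keep BooleanQuery branches, expand each
    // TermQuery leaf into a BooleanQuery holding all variants as SHOULD clauses.
    public static Query expand(Query q, Function<String, List<String>> variants) {
        if (q instanceof TermQuery) {
            TermQuery tq = (TermQuery) q;
            BooleanQuery bq = new BooleanQuery();
            for (String v : variants.apply(tq.term)) {
                bq.shouldClauses.add(new TermQuery(v));
            }
            return bq;
        }
        if (q instanceof BooleanQuery) {
            BooleanQuery in = (BooleanQuery) q, out = new BooleanQuery();
            for (Query child : in.shouldClauses) {
                out.shouldClauses.add(expand(child, variants));
            }
            return out;
        }
        return q; // other query types pass through unchanged
    }
}
```

With a variant function that maps "colour" to ["colour", "color"], a single TermQuery becomes a BooleanQuery of two SHOULD'ed TermQuery clauses.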
Thanks.. however I don't fully understand what
Marjan Celikik wrote:
Hi everyone,
I know that there are packages that support the "Did you mean ...?"
search feature with Lucene, which try to find the most suitable
correct-word query.. however, so far I haven't encountered the opposite
search feature: given a correct query, find all docum
Wojtek H wrote:
Hi all,
Snowball stemmers are part of Lucene, but only for a few languages. We
have documents in various languages and so need stemmers for many
languages (in particular Polish). One of the ideas is to use ispell
dictionaries. There are ispell dicts for many languages and so thi
Ivan Vasilev wrote:
Thanks Mathieu,
I tried to check it out, but without success. Anyway, I can do it manually,
but as the contribution is still not approved by Lucene, our chiefs
will not want it included in our project for now.
It's the right decision. I hope the third patch will be good
Ivan Vasilev wrote:
Thanks Mathieu for your help!
The contribution that you have made to Lucene with this patch seems to
be great, but the hunspell dictionary is under the LGPL, which the lawyer
of our company does not like.
It's the spell tool used by OpenOffice and Firefox. Data must be multi
l
Ivan Vasilev wrote:
Hi Guys,
Has anybody integrated the Spell Checker contributed to Lucene?
http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index
https://issues.apache.org/jira/browse/LUCENE-1190
I need advice on where to get a free dictionary file (one
milu07 wrote:
Hello,
My machine is Ubuntu 7.10. I am working with Apache Lucene. I am done with
the indexer and have tried the command line Searcher (the default command line
Searcher included in the Lucene package: http://lucene.apache.org/java/2_3_1/demo2.html).
When I use this at the command line:
java Searcher
luceneuser wrote:
Hi All,
I need help retrieving results based on relevance + freshness. As of
now, I get results based on either of the fields: either relevance or freshness.
How can I achieve this? Lucene retrieves results by relevance but also
fetches old results. I need more relevan
Raghu Ram wrote:
To complicate it further... the text for which language identification has
to be done is small, in most cases a short sentence like "I like Pepsi".
Can something be done for this?
Drinking water?
More seriously, if ngram-pattern language guessing is too ambiguous,
sear
Itamar Syn-Hershko wrote:
For what it's worth, I did something similar in my BidiAnalyzer so I can
index both Hebrew/Semitic texts and English/Latin words without switching
analyzers, giving each the proper treatment. I did it simply by testing the
first char and looking at its numeric value -
Dragon Fly wrote:
Hi,
I'd like to find out if I can do the following with Lucene (on Windows).
On server A:
- An index writer creates/updates the index. The index is physically stored on
server A.
- An index searcher searches against the index.
On server B:
- Maps to the index directory.
Raghu Ram wrote:
Hi all,
I guess this question is a bit off track. Are there any language
identification modules inside Lucene? If not, can somebody please suggest
a good one.
Thank You.
Nutch provides a tool for that, with ngram patterns, just like OO.o does.
M.
---
Here is a POC about using Lucene, via Compass, from PHP or Python (other
languages will come later), with only XML configuration, object
notation, and native use of the scripting language.
http://blog.garambrogne.net/index.php?post/2008/03/11/Using-Compass-without-dirtying-its-hands-with-java
It's
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java/lexicon/src/java/org/apache/lucene/lexicon/QueryUtils.java
M.
Itamar Syn-Hershko wrote:
Hi all,
I'm looking for the best way to inflate a query, so a query like: "synchronous AND colour" -- will become something lik
Borgman, Lennart wrote:
Is there any possibility to use a thesaurus or an ontology when
indexing/searching with Lucene?
Yes, the WordNet contrib does that. And with a token filter, it's easy to
use your own.
What do you want to do?
M.
---
Not sure, you might want to ask on Nutch. From a strict language
standpoint, the notion of a stopword in my mind is a bit dubious. If
the word really has no meaning, then why does the language have it to
begin with? In a search context, it has been treated as of minimal
use in the early da
And yes, each node contains their own documents and builds index against
them.
On Mon, Mar 3, 2008 at 1:30 AM, Mathieu Lecarme <[EMAIL PROTECTED]>
wrote:
Thanks for your clean and long answer.
With Terms split across different nodes you get several drawbacks:
PrefixQuery and Fu
There's no syntax to restore a stemmed word. Stemming is done while
reading the news, so the index never knows the complete word.
I submitted a patch for that:
https://issues.apache.org/jira/browse/LUCENE-1190
Be careful: RSS Bandit uses Lucene.Net, not the Java version.
M.
secou wrote:
Hi,
The documents to be indexed are not
necessarily web pages. They are mostly files stored on each node's
file
system.
Node failures are also handled by replicas. The index for each term
will be
replicated on multiple nodes, whose nodeIDs are near each other.
This
mechanism is handled
On Mar 2, 2008, at 03:05, 仇寅 wrote:
Hi,
I agree with your point that it is easier to partition an index by
document.
But the partition-by-keyword approach has much greater scalability
than the
partition-by-document approach. Each query involves communicating with a
constant number of nodes; whi
The easiest way is to split the index by Document. In Lucene, an index
contains Documents and an inverted index of Terms. If you want to put Terms
in different places, Documents will be duplicated in each index, with
only a part of their Terms.
How will you manage node failure in your network?
They were so
Hi
Mathieu Lecarme wrote:
On a related topic, I'm also searching for a way to suggest
alternate spellings of words to the user, when we find a word
which is used very infrequently in the index or not in the index
at all. I'm based in Austria; when I e.g. search for
"r
Grant Ingersoll wrote:
On Feb 29, 2008, at 5:39 AM, Mathieu Lecarme wrote:
Petite Abeille wrote:
A proposal for a Lua entry for the "Google Summer of Code" '08:
A Lua implementation of Lucene.
For me, Lua is just glue between C-coded objects, a super config
fil
Petite Abeille wrote:
A proposal for a Lua entry for the "Google Summer of Code" '08:
A Lua implementation of Lucene.
For me, Lua is just glue between C-coded objects, a super config file.
Like it's used in lighttpd or WoW.
Lulu will work on top of Lucy?
Did I miss something?
M.
--
Dharmalingam wrote:
I am working on some sort of search mechanism to link a requirement (i.e. a
query) to source code files (i.e., documents). For that purpose, I indexed
the source code files using Lucene. Contrary to the traditional natural-language
search scenario, we search for code files that
[EMAIL PROTECTED] wrote:
If you want something from an index it has to be IN the
index. So, store a
summary field in each document and make sure that field is part of the
query.
And how could one automatically create such a summary?
Have a look at http://alias-i.com/lingpipe/index.h
Yes, I've found a tester!
A patch was submitted for this kind of job:
https://issues.apache.org/jira/browse/LUCENE-1190
And here is the svn work in progress :
https://admin.garambrogne.net/subversion/revuedepresse/trunk/src/java/lexicon
And the web version :
https://admin.garambrogne.net/projets
Markus Fischer wrote:
Hi,
[Resent: guess I sent the first before I completed my subscription,
just in case it comes up twice ...]
the subject may be a bit weird but I couldn't find a better way to
describe a problem I'm trying to solve.
If I'm not mistaken, one factor of scoring is the
christophe blin wrote:
Hi,
thanks for the pointer to the elision filter, but I am currently stuck with
lucene-core-2.2.0 found in the maven2 central repository (it does not contain
this class). I'll watch for an upgrade to 2.3 in the future.
You can backport it easily with copy-paste.
M.
--
Well, javadoc: "prefixLength - length of common (non-fuzzy) prefix".
So, this
is some kind of "wildcard fuzzy" but not real fuzzy anymore.
I understand the optimization, but right now I can hardly imagine a
reasonable
use case. Who cares whether the Levenshtein distance is at the
beginning, middle
fuzzy are simply not indexed.
If you want to search quickly with fuzzy search, you should index words
and their ngrams; it's the "did you mean" pattern.
You first select indexed words which share ngrams with the query word, the
distance is computed with Levenshtein, and you use this word as a
synon
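A minimal sketch of that "did you mean" pattern in plain Java, with an in-memory map standing in for the ngram index (the class name and the trigram size are illustrative assumptions, not the thread's exact scheme):

```java
import java.util.*;

public class DidYouMean {
    // ngram -> words that contain it (a stand-in for the ngram index)
    private final Map<String, Set<String>> grams = new HashMap<>();

    public void index(String word) {
        for (String g : ngrams(word, 3)) {
            grams.computeIfAbsent(g, k -> new TreeSet<>()).add(word);
        }
    }

    // Candidates sharing at least one trigram, ranked by Levenshtein distance.
    public List<String> suggest(String query, int max) {
        Set<String> candidates = new HashSet<>();
        for (String g : ngrams(query, 3)) {
            candidates.addAll(grams.getOrDefault(g, Collections.emptySet()));
        }
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt((String w) -> levenshtein(query, w)));
        return ranked.subList(0, Math.min(max, ranked.size()));
    }

    static List<String> ngrams(String w, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= w.length(); i++) out.add(w.substring(i, i + n));
        return out;
    }

    // Classic two-row dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1], cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }
}
```

The ngram lookup prunes the candidate set so the expensive edit distance is only computed for words that plausibly match.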
Laxmilal Menaria wrote:
> Hello Everyone,
>
> I want to search 'abc-d' as an exact keyword, not 'abc d'. KeywordAnalyzer can
> be used for this purpose. StandardAnalyzer creates different tokens for
> 'abc-d': 'abc' and 'd'.
> But I cannot use this, because I am indexing the content of a text fil
sandeep chawla wrote:
> Hi ,
>
> I am working on a search application. This application requires me to
> implement a stop filter
> using a stop word list. I have implemented a stop filter using Lucene's API.
>
> I want to take my application one step further.
>
> I want to remove all the words
Thomas Arni wrote:
> Hello Luceners
>
> I have started a new project and need to index PDF documents.
> There are several projects around which allow you to extract the content,
> like PDFBox, Xpdf and PJClassic.
>
> As far as I have studied the FAQs and examples, all these
> tools allow simple text ex
With Compass, indexing is linked to your database transaction: when
your object is persisted, it's indexed too.
All your questions are handled cleanly and silently by Compass; just
have a look at the source code if you don't want to use this product.
M.
On Aug 10, 2007, at 12:24, Antonello Prove
Some tools exist for finding duplicated parts in documents.
You split the document into phrases and build ngrams from words. If you want
complete phrases, work with all the words; for partial matches, work with
5-word ngrams, for example. The ngram list is converted to hashes, and each
hash is used as an indexed Field for t
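A rough sketch of that shingle-hashing idea in plain Java. The 5-word window, the use of String.hashCode, and the Jaccard overlap are illustrative choices, not the exact scheme from the thread:

```java
import java.util.*;

public class ShingleHasher {
    // Hash every fixed-size word window; each hash would become one
    // indexed field value on the document.
    public static Set<Integer> shingleHashes(String text, int size) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<Integer> hashes = new HashSet<>();
        for (int i = 0; i + size <= words.length; i++) {
            hashes.add(String.join(" ",
                    Arrays.copyOfRange(words, i, i + size)).hashCode());
        }
        return hashes;
    }

    // Jaccard overlap between two documents' shingle sets: near-duplicate
    // documents share a large fraction of their hashes.
    public static double overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}
```

In an index, matching on the hash field finds candidate duplicates directly; the overlap ratio is only needed to rank them.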
Jens Grivolla wrote:
> Hello,
>
> I'm looking to extract significant terms characterizing a set of
> documents (which in turn relate to a topic).
>
> This basically comes down to functionality similar to determining the
> terms with the greatest offer weight (as used for blind relevance
> feedba
On Tuesday, July 24, 2007 at 13:01 -0700, Shaw, James wrote:
> Hi, guys,
> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
> the Snowball stemmers only include European languages. Does stemming
> not make sense for ideograph-based languages (i.e., no stemming is
> needed for
IMHO, stemming hurts the index. A stemmed index can't be used for
completion or other kinds of search.
But stemming is nice for synonym search. You should look at the
spellchecker code. If you index your words with their stemmed versions, you
can provide a synonym filter, just like the WordNet example.
M.
Le lu
On Sunday, July 22, 2007 at 13:17 -0500, Dmitry wrote:
> Mathieu,
> I never used Compass. I know that there is Shards/Search integration with
> Hibernate, but it is absolutely different from what I actually need; probably
> I can take a look at it anyway. Thanks.
> thanks,
> DT
Not only Hibernate
e a
relational database?
On 7/17/07, Mathieu Lecarme < [EMAIL PROTECTED]> wrote:
http://www.opensymphony.com/compass/
The project is free and follows Lucene versions quickly, the forum is
great, and the lead developer reacts quickly.
M.
Mohammad Norouzi wrote:
> Mathieu,
>
If you want to index Hibernate-persisted data, just use Compass.
M.
On Jul 22, 2007, at 04:19, Dmitry wrote:
Folks,
Trying to integrate a PDM system (WTPart objects) with the Lucene
indexing/search framework.
Part of the work is integration with the persistence layer +
index storage + MySQL.
Could no
http://www.opensymphony.com/compass/
The project is free and follows Lucene versions quickly, the forum is
great, and the lead developer reacts quickly.
M.
Mohammad Norouzi wrote:
> Mathieu,
> I need an object mapper for Lucene; would you please give me the
> Compass web
> site? is it open so
Sorry, I use Compass, an object mapper for Lucene, and it provides a
special field "all", I thought it was a Lucene feature.
M.
Renaud Waldura wrote:
> Often documents can be divided into "metadata" and "contents" sections. Say
> you're indexing Web pages, you could index them with HEAD data all
You can use the "all" special field, but you lose the different
boost values.
M.
On Jul 14, 2007, at 10:50, Mohammad Norouzi wrote:
Hello all
is there any way to search through all the fields without using
MultiFieldQueryParser? Currently I am using this parser but it
requires
passing all
You create your own class which extends SynonymTokenFilter.
You pipe it with your Analyzer.
M.
On Jul 13, 2007, at 19:09, Mohsen Saboorian wrote:
Thanks for quick replies.
Mathieu Lecarme wrote:
Uses synonyms in the query parser
Sorry, I didn't get the point of synonym. Ca
Mohsen Saboorian wrote:
> Is it possible that I inject my own matching mechanism into Lucene
> IndexSearcher? In other words, is it possible that my own method be called
> in order to collect results (hits)? Suppose the case that I want to match -
> for example - "foo" with both "foo" and "oof
Ndapa Nakashole wrote:
> I am considering using Lucene in my mini grid-based search engine. I
> would
> like to partition my index by term as opposed to partitioning by
> document. From
> what I have read in the mailing list so far, it seems like partitioning
> by term
> is impossible with Lucene. am
stop writing
scp index to another computer
play with it
scp indexModified to the server
mv indexModified indexCurrent
all done. mv is atomic.
Jens Grivolla wrote:
> Hi,
>
> I have a Lucene index with a few million entries, and I will need to
> add batches of a few hundred thousand or a few mil
Samuel LEMOINE wrote:
> I'm acutely interested in this issue too, as I'm working on the
> distributed architecture of Lucene. I'm only at the very beginning of
> my study, so I can't help you much, but Hadoop may fit
> your requirements. It's a sub-project of Lucene aiming to paralle
Server One handles the website.
Server Two is a light version of Tomcat which handles the Lucene search.
In front, a lighttpd which uses server two for /search, and server one
for all other things.
You can add Lucene servers with round robin in lighttpd with this scheme.
Be careful with fault tolerance and index
It's not so far from Lucene!
http://en.wikipedia.org/wiki/Sentence_extraction
Have a look at WordNet (http://wordnet.princeton.edu/).
Get some lists of articles, verbs, nouns, and affix rules (like aspell,
myspell ...).
You will use more cooking rules than code.
M.
On Jun 18, 2007, at 20:29, Mordo,
Lee Li Bin wrote:
> Hi,
>
> I still have problems searching for Chinese words.
> The XML file which is the data source and the analyzer have already been
> encoded. I have tested StandardAnalyzer, CJKAnalyzer, and ChineseAnalyzer,
> but I still can't get any results.
>
> 1. do we need any encoding
Compass uses a trick to manage parent-child indexing.
What if you index "collection" with a Date field, which is the newest
picture inside, and put all the pictures' keywords into the collection?
Then, with a keyword search, you will find the collection with the
highest tag occurrence count and date s
Walt explains differently what I said.
Lucene can be efficiently used for selecting objects, without sorting
or scoring anything; then, with ids stored in Lucene, you can sort
them yourself with a simple Sortable implementation.
The only limit is that Lucene doesn't give you too many results, with
your
e rules.
>>
>> M.
>>
>> Antoine Baudoux wrote:
>>> The problem is that I want Lucene to do the sorting, because the query
>>> would return thousands of results, and I'm displaying documents one
>>> page at a time.
>>>
>>>
with at most 300 elements
you can sort it with strange rules.
M.
Antoine Baudoux wrote:
> The problem is that I want Lucene to do the sorting, because the query
> would return thousands of results, and I'm displaying documents one
> page at a time.
>
>
> On 15 Jun 2007, at 17
The first step is to feed a Set with "collection".
The second step is to sort it.
With a SortedSet, you can do that, isn't it?
M.
Antoine Baudoux wrote:
> Could you be more precise? I don't understand what you mean.
>
>
>
> On 15 Jun 2007, at 17:20, Mathieu Lecarme wrote:
Your request seems to be a two-step query.
First step: you select images, and then collections.
Second step: you sort the collections.
BitVector can help you?
M.
Antoine Baudoux wrote:
> Hi,
>
> I'm developping an image database. Each lucene document
> representing an image contains (among ot
If you don't use the same tokenizer for indexing and searching, you will
have troubles like this.
Mixing exact match (with ") and wildcards (*) is a strange idea.
Typographical rules say that you have a space after a comma, no?
Is your field tokenized?
M.
Renaud Waldura wrote:
> My very simple
You can work as the Lucene spell checker does:
a specific index with a word per Document, boosted by something
proportional to the number of occurrences (with log and math magic).
The magical stuff is n Fields with starting ngrams, not stored, not
tokenized.
For example, if you want to index the word "carott",
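As an illustration of such a field layout: the startN/gramN/endN field names below follow the convention of Lucene's contrib SpellChecker, but this is only a sketch of the indexing side, and the boost math is left out.

```java
import java.util.*;

public class NGramFields {
    // For a word, build the map of field name -> values that would be added
    // to its Document, e.g. for "carott" with n=3:
    //   "start3" -> [car], "gram3" -> [car, aro, rot, ott], "end3" -> [ott]
    public static Map<String, List<String>> fieldsFor(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("start" + n, Collections.singletonList(grams.get(0)));
        fields.put("gram" + n, grams);
        fields.put("end" + n, Collections.singletonList(grams.get(grams.size() - 1)));
        return fields;
    }
}
```

A misspelled query word is then turned into a query over these fields, with extra weight on the start and end grams, since spelling errors cluster in the middle of words.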
Why not use Document?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/
org/apache/lucene/document/Document.html
HTMLDocument manages HTML stuff like encoding, headers, and other
specificities.
Nutch uses specific word tools (http://lucene.apache.org/nutch/apidocs/
org/ap
If you do that, you enumerate every term!
If you use an alphabetically sorted collection, you can stop when
matches stop, but you still have to test every term up to the match.
Lucene gives you tools to match the beginning of a term, just use them!
M.
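The stop-at-mismatch idea, sketched on a plain sorted set; in Lucene itself this is what PrefixQuery, or a TermEnum positioned at the prefix, does for you:

```java
import java.util.*;

public class PrefixScan {
    // Walk a sorted term dictionary starting at the prefix and stop at the
    // first term that no longer matches -- no full enumeration needed.
    public static List<String> termsWithPrefix(NavigableSet<String> terms,
                                               String prefix) {
        List<String> out = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) break; // sorted order: safe to stop here
            out.add(t);
        }
        return out;
    }
}
```

Because the dictionary is sorted, all matching terms are contiguous, so the scan touches only the matches plus one extra term.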
On Jun 8, 2007, at 14:57, Patrick Turcotte wrote:
H
Have a look at the opensearch.org specification; your self-completion
will work with IE7 and Firefox 2.
JSON serialization is quicker than XML stuff.
Be careful to limit the number of responses.
A search for "test*" works very well in my project with tens of thousands
of documents.
Begin completion onl