This is a parts-of-speech analyzer for tweets. It would make your index
far more useful.
http://www.ark.cs.cmu.edu/TweetNLP/
On 11/04/2013 11:40 PM, Stéphane Nicoll wrote:
Hi,
I am building an application that indexes tweets and offers some basic
search facilities on them.
I am trying to find
This is very cool! Lemmatization is an important tool for making search
work better.
Would you consider changing the licensing to the Apache 2.0 license?
On 10/23/2013 08:17 AM, Michal Hlavac wrote:
Hi,
I rewrote the lemmatizer project LemmaGen (http://lemmatise.ijs.si/) in Java.
Originally it's
Is there a Trie-based term index? Seems like this would be smaller, and
very fast on non-leading wildcards.
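To make the idea concrete, here is a minimal sketch of a trie over terms, written in plain Python rather than against any Lucene API. The names TermTrie, add and with_prefix are made up for illustration. It shows why a non-leading wildcard such as luc* is cheap: the lookup walks down the prefix once and then enumerates only the subtree below it.

```python
class TrieNode:
    """One node per character; is_term marks the end of an indexed term."""

    def __init__(self):
        self.children = {}
        self.is_term = False


class TermTrie:
    """Toy term dictionary (hypothetical, not a Lucene class)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def with_prefix(self, prefix):
        """Return all terms starting with prefix, i.e. the wildcard prefix*."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no term starts with this prefix
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_term:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = TermTrie()
for term in ["lucene", "lucid", "luck", "solr"]:
    trie.add(term)
print(trie.with_prefix("luc"))
```

A leading wildcard (*ene) gets no such help from this structure, since every branch would have to be visited.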
On 07/09/2013 02:34 PM, Uwe Schindler wrote:
Hi,
You can replace the terms with their hashes directly in the analyzer chain. Just
write a custom TermToBytesRef attribute that hashes the term
an't store anything on disk in the clear.
Lance
On 07/01/2013 07:07 AM, Emmanuel Espina wrote:
Hi Erick! Nice to hear from you again! From time to time my interest
in these "Lucene things" returns and I do some experiments :p
Just to add to this conversation, I found an intere
Solr/Lucene has two features for this:
1) the MoreLikeThis code, and
2) the clustering project in solr/contrib.
Lance
On 06/28/2013 11:15 AM, Luis Carlos Guerrero Covo wrote:
I only have about a million docs right now so scaling is not a big issue.
I'm looking to provide a quick implement
I'm responsible for the OpenNLP wiki page:
https://wiki.apache.org/solr/OpenNLP
Please add me to the list of editors.
flushed to
disk and then the segment* files change. At any point in this sequence,
all of the files in the directory form one consistent index.
This isn't like MySQL or other databases where you have to shut down the
DB to get a safe copy of the files.
Lance
On 04/17/2013 03:57 AM, Ashish
)
On Mon, Jun 3, 2013 at 6:46 AM, Lance Norskog wrote:
What is a Lucene query that will find two words at the same term position?
Is there a class that will do this? Is the feature available from the
Lucene query syntax or any other syntax parsers?
For example, if I'm using synonyms at
t the same position. What is a query that will
find a document with the synonym substituted, but will not find a
document which has the base word and a synonym at two different positions?
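To pin down what "same position" means here, a small plain-Python model of term positions may help. It is not Lucene code and does not answer which query class does this. It only assumes that a synonym is stacked on its base word by giving it a position increment of 0; build_positions and share_position are hypothetical helper names.

```python
from collections import defaultdict


def build_positions(tokens):
    """tokens is a list of (term, position_increment) pairs.

    An increment of 0 stacks a term on the previous token's position,
    which is how a synonym sits on top of its base word.
    """
    positions = defaultdict(set)
    pos = -1
    for term, increment in tokens:
        pos += increment
        positions[term].add(pos)
    return positions


def share_position(positions, term_a, term_b):
    """True only if both terms occur at one and the same position."""
    return bool(positions.get(term_a, set()) & positions.get(term_b, set()))


# "fast" injected as a synonym of "quick" at the same position
with_synonym = build_positions([("quick", 1), ("fast", 0), ("fox", 1)])
# base word and synonym both present, but at different positions
separate = build_positions([("quick", 1), ("and", 1), ("fast", 1)])

print(share_position(with_synonym, "quick", "fast"))
print(share_position(separate, "quick", "fast"))
```

The second document is the case the question wants to exclude: both words are present, but they never share a position.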
Thanks,
Lance.
-
To unsubscribe, e
3.x and 4.0 Solr releases have nice analyzers just for Japanese. In 4.0
they are the "Kuromoji" package.
In 4.0, the JapaneseAnalyzer probably does what you need:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-kuromoji/4.0.0/org/apache/lucene/analysis/ja/Japan
4.x does not promise backwards compatibility with 3.x. Have you made
your own extensions?
On 01/02/2013 04:38 AM, Shai Erera wrote:
There's no specific branch for 4.1 yet. All development still happens on
the 4x branch (
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/).
Note tha
There were memory leak problems with earlier versions of Java. You
should upgrade to Java 6 update 30 (1.6.0_30).
Lance
On 01/02/2013 05:26 AM, Alon Muchnick wrote:
Hello all,
we are using Lucene 3.6.2 in our web application on Tomcat 5.5 and recently
we started testing our application on Tomcat 7
How do you choose t2 and t2a? If you have a full inventory of these
pairs, you can make these multi-word synonyms and use the Synonym filter
to combine them.
On 12/20/2012 11:50 PM, Xi Shen wrote:
Hi,
I am looking for a token filter that can combine 2 terms into 1. E.g.
the input has been to
Go to the top directory and do this:
cp dev-tools/eclipse/dot.project .project
cp dev-tools/eclipse/dot.classpath .classpath
cp -r dev-tools/eclipse/dot.settings .settings
The 'ant eclipse' target does this setup.
On 12/24/2012 10:45 PM, Xi Shen wrote:
Hi Lance,
I got the lucene 4
You need to use an IDE. Find the Attribute type and show all subclasses.
This shows a lot of rare ones and a few which are used a lot. Now, look
at source code for various TokenFilters and search for other uses of the
Attributes you find. This generally is how I figured it out.
Also, after the
n put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position."
This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.
-Glen
On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog wrote:
Parts-of-spe
Parts-of-speech is available now, in the indexer.
LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
Apache project for natural-language processing.
Some parts are in Solr that could be in Lucene.
https://issues
coded by people who think in good grammar, and are perfect spellers.
If you find 'too aggressive' and 'too mild' to be a problem, what you want is
'lemmatization', where you work from a dictionary of word forms. Solr supports
using WordNet for this purpose.
An option: instead of merging continuously as you run, you can optimize with
'maxSegments=10'. This means 'optimize, but only until there are 10 segments'. If
there are 10 or fewer segments, nothing happens. This lets you schedule
merging I/O.
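A toy model of that behaviour, in plain Python: segments are represented only by their sizes, and the smallest two are merged until the target count is reached. Which segments Lucene really picks is up to its merge policy, so the smallest-first rule and the function name optimize are assumptions made for the sketch.

```python
def optimize(segment_sizes, max_segments):
    """Merge segments until at most max_segments remain.

    Assumption for illustration: always merge the two smallest segments.
    If there are already max_segments or fewer, nothing is merged.
    """
    segments = sorted(segment_sizes)
    while len(segments) > max_segments:
        merged = segments[0] + segments[1]
        segments = sorted(segments[2:] + [merged])
    return segments


print(optimize([5, 1, 2, 8], 10))  # already under the limit, no merging
print(optimize([5, 1, 2, 8], 2))   # merged down to two segments
```

The total size is unchanged by merging; only the number of segments, and therefore files, goes down.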
Is the number of files a problem due to file space
Scott, did you mean the Lucene integer id, or the unique id field?
- Original Message -
| From: "Martijn v Groningen"
| To: java-user@lucene.apache.org
| Sent: Sunday, October 28, 2012 2:24:29 PM
| Subject: Re: Lucene 4.0 delete by ID
|
| A top level document ID can change over time. For
> Dawid
>
> -----
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
--
Lance Norskog
goks...@gmail.com
t
>> doesn't help me).
>
>
> I'm wrong, it's there, but Eclipse isn't seeing it (haven't tried javac by
> itself), even though it sees HighFreqTerms just fine.
>
>
> ----
ontaining $$?
>> >
>> >
>> > --
>> > Ian.
>> >
>> >
>> > On Tue, Aug 14, 2012 at 9:13 AM, zhoucheng2008
>> > wrote:
>> > > Hi,
>> > >
>> > >
>> > > I have a big index, and when I s
bucket. SSDs speed up almost everything, save
> RAM and spare a lot of work hours optimizing I/O speed.
>
> Regards,
> Toke Eskildsen
>
>
gh estimate of direct memory usage per GB of
>> indexed data, or per directory/writer instance, if applicable.
>>
>> Thanks,
>> -V
>
anyhow for large mergeFactor and large
> RAMBufferSizeMB.
>
> Maxim
>
>
>
--
Lance Norskog
goks...@gmail.com
And no, RAMDirectory does not help.
On Mon, May 28, 2012 at 5:54 PM, Lance Norskog wrote:
> Can you use filter queries? Filters short-circuit a lot of search
> processing. "City:San Francisco" is a classic filter - it is a small
> part of the documents and it is reused a lot.
it seems that io is not the
>> >>> >> > I am reading
>> >>> > http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf
>> >>> >> > it mentions
>> >>> >> > Size
>> >>> >> > – Stop
the smallest one is byte. It is possible to use only
> ceil(log2(#unique_values)) bits/document, although that requires a bit
> of custom coding.
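To illustrate the ceil(log2(#unique_values)) figure quoted above, here is a short worked calculation in plain Python. The function names are made up for this sketch, and the actual bit-packing (the "custom coding") is not shown.

```python
def bits_per_document(unique_values):
    """ceil(log2(unique_values)), computed with integers to avoid float rounding."""
    if unique_values < 1:
        raise ValueError("need at least one unique value")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1
    return (unique_values - 1).bit_length()


def packed_bytes(unique_values, documents):
    """Total bytes needed if every document stores one packed value."""
    total_bits = bits_per_document(unique_values) * documents
    return (total_bits + 7) // 8  # round up to whole bytes


# 10 distinct values need 4 bits each, so one million documents
# take 500,000 bytes instead of 1,000,000 with one byte per document.
print(bits_per_document(10), packed_bytes(10, 1_000_000))
```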
>
> Regards,
> Toke Eskildsen
>
>
> ------
, 2012 at 1:09 AM, Lance Norskog wrote:
> I would like to remove a payload attribute from a token before it is
> indexed. PayloadAttribute lets you set the payload to null.
> AttributeSource (parent of all Tokens) does not have a 'remove
> Attribute' method. You cannot capture
then monkey with it (at least Eclipse does not show
me its methods).
If I set the payload to null, when the Token is saved in the index,
will a null payload be saved? Or does the payload get quietly dropped?
--
Lance Norskog
goks...@gmail.com
--
Lance Norskog
goks...@gmail.com
-
Lance
Hi Ian,
Thanks for your help. I was just checking to see if my Solr/Lucene developer
understands this. He's in India.
He says this takes place at the time of integration. Is that correct?
-
Lance
On Jan 20, 2012, at 3:40 PM, Ian Lea wrote:
No idea about
Hi,
Could you please help me with a quick question: is there a way to restrict
Lucene/Solr fuzzy search to only analyze words that have more than 5 characters
and to ignore shorter words (i.e. words with fewer than 6 characters)?
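One way to approximate this outside the search engine is to rewrite the query string before it is parsed, adding the fuzzy operator (a trailing ~ in the classic Lucene query syntax) only to words of 6 or more characters. The sketch below is plain Python; fuzzify and min_length are hypothetical names, and whether Lucene or Solr offers a built-in setting for this is not something this sketch assumes.

```python
def fuzzify(query, min_length=6):
    """Append '~' to plain words with at least min_length characters.

    Words containing anything other than letters and digits (field
    prefixes, operators with punctuation, existing '~') are left alone.
    """
    rewritten = []
    for word in query.split():
        if len(word) >= min_length and word.isalnum():
            rewritten.append(word + "~")
        else:
            rewritten.append(word)
    return " ".join(rewritten)


print(fuzzify("apache lucene solr index"))
```

Here "apache" and "lucene" (6 characters each) become fuzzy terms, while "solr" and "index" are searched exactly.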
Thanks
-
Lance
a bit worried about this solution since Yonik has pointed out that
the tier based approach is broken. Yonik, any more info on why this is
broken? Perhaps a bounding box that works is better than a circle that
doesn't ;)
Cheers,
Lance.
On 31 December 2011 18:07, Yonik Seeley wrote:
> O
s)
>>
>>
>> Any other suggestions? I have tried some of the basic ideas on the Lucene
>> wiki, such as leaving the IndexSearcher open for the life of the process (a
>> servlet). Any help would be greatly appreciated!
>>
>>
>> Rob
>
>
--
Lance Norskog
goks...@gmail.com
useless. Maybe the
Instantiated index stuff is more what you want?
Lance
On Tue, Jun 7, 2011 at 2:52 AM, zhoucheng2008 wrote:
> Makes sense. Thanks
>
> -Original Message-
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: Tuesday, June 07, 2011 4:28 PM
blem solving and analytical
abilities. You must have a solid grasp of English – written and verbal.
Please note that I am a start-up and I am not going to be able to pay what a
large established company can pay.
Thank you,
Lance
-----
Lance
>>
>> I'm sure this has quite a simple explanation but I'm unable to find it right
>> now ;-) Perhaps you can help with that.
>>
>> Thanks a lot!
>>
>> Best regards,
>>
>> Erik
>>
--
Lance Norskog
goks...@gmail.com
Check out the Mahout project (mahout.apache.org). There is a
Lucene-based text classifier project in there.
Lance
On Tue, Mar 1, 2011 at 9:25 PM, Sundus Hassan wrote:
> I am doing an MS thesis on content-based text categorization.
> For this purpose I intend to use Lucene. I need so
a time. That being said, basic grep/regex is probably
>> >> fast
>> >> > enough.
>> >> > >
>> >> >
>> >> > In cases where you are doing a 'find' in a document similar to what a
>> >> > wordprocessor would do (especially if you want to iterate
>> >> > forwards/backwards through matches etc), you might want to consider
>> >> > something like
>> >> >
>> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/StringSearch.html
>> >> >
>> --
>> ---
>> Thanks & Regards
>> Umesh Prasad
>>
--
Lance Norskog
goks...@gmail.com
p,
>>> presumably because UCS doesn't have an equals
>>> or hash method.
>>>
>>> Suggestions? I've worked around it by registering a class
>>> based builder, checking for the field name and either
>>> delegating to the original builder or doing my custom
>>> processing, but it'
the Similarity class.
>> >
>> > How can I change the scoring formula? ( by customizing only the
>> Similarity
>> > class? or Scorer?)
>> >
>> > Do you have an example of this use case?
>> >
>> > Thanks for your help.
>> >
>>
I couldn't find it, so...
>>
>> Is it possible / advisable / practical to use Lucene as the basis of a
>> live
>> document search capability? By "live document" I mean a largish document
>> such as a word processor might be able to handle which is
> Any ideas or thoughts, would be very much appreciated.
>
> Thanks in advance
> David
>
The Lucene CheckIndex program does this. It is a class somewhere in
Lucene with a main() method.
Samarendra Pratap wrote:
It is not guaranteed that every term will be indexed. There is a limit on the
maximum number of terms per field (as of Lucene 3.0, and maybe earlier too).
Check out this
http://
document was scored.
After all that, you might have a problem with the PrnP etc. stuff
getting chopped up in weird ways. I don't know how people handle this in
chemistry/bio search.
Lance
Ahmet Arslan wrote:
Example of Question:
- What is the role of PrnP in mad cow disease?
First
The Lucene MoreLikeThis tool in lucene/contrib/similar will do one
variant of what you want.
You can do this particular test in Solr; you'll find it much
easier to put together.
For other text similarities, you'll have to code them directly.
Lance
On Sat, Nov 13, 2010 at 7:07
You would have to control your MergePolicy so it doesn't collapse
everything back to one segment.
On Tue, Nov 2, 2010 at 12:03 PM, Simon Willnauer
wrote:
> On Tue, Nov 2, 2010 at 1:58 AM, Lance Norskog wrote:
>> 2 billion is a hard limit. Usually people split indexes into multiple
--
Lance Norskog
goks...@gmail.com
Tika has some mailbox file parsing that includes metadata parsing.
For POP/IMAP email servers I don't know any tools.
Hasan Diwan wrote:
On 27 October 2010 18:16, Troy Wical wrote:
Depends on what you're trying to index, I suppose. Maildir or mbox? For some time
now, off and on, I have been
already
> does this using Lucene/Solr.
> Thanks!
> Maria
>
The Lucene CheckIndex program opens an index and walks all of the data
structures. It is a good start for you.
Sahin Buyrukbilen wrote:
Thank you Uwe, I will read the docs and try to do it. However, do you have any
example code? I need it because I am not very familiar with Java.
Thank you.
Sahin
off. Usually the data
structures are damaged and Lucene throws CorruptIndexExceptions, NPE or
array out-of-bounds exceptions. There is no checksumming of the index
files.
Lance
Pulkit Singhal wrote:
Hello Everyone,
What happens if:
a) lucene index gets written half-way to the disk and then
This can probably be done. The hardest part is cross-correlating your
Lucene analyzer use with the Solr analyzer stack definition. There are
a few things Lucene does that Solr doesn't: span queries, for one.
Lance
On Fri, Sep 17, 2010 at 12:39 PM, Christopher Gross wrote:
> Yes, I'm
appears puny. (-:
>>>
>>> Thanks,
>>> Chris
>>>
>>
e of the data you're going to index? If you're
> relying on your SOLR index to be your backup, you simply must back it up
> somewhere "often enough" to get by if your building burns down. I'd also
> think about storing your original input...
>
> This is no diffe
fying an
> existing order-array is cheaper than a full re-sort or not depends on
> your batch size.
>
> Regards,
> Toke Eskildsen
>
>
>>
>> When I print synonymMap using synonymMap.toString(), I get the output like
>>
>> <{New York=<{Chicago=<{Seattle=<{New
>> Orleans=<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>}>
>>
>> so it looks like all the synonyms are loaded. But if I search for
>> "CONCEPTcity" then it says no matches found. I am not sure whether I have
>> loaded the synonyms correctly in the synonymMap.
>>
>> Any help will be deeply appreciated. Thanks!
>>
>
--
Lance Norskog
goks...@gmail.com
gle
> MultiThreadedHttpConnectionManager and HttpClient for all the SolrServers,
> and the other with a new MultiThreadedHttpConnectionManager and HttpClient
> for each SolrServer.
>
> Both tries yielded similar performance results.
>
> Also tried to give setMaxTotalConnection
tomScoreProvider. The QWF/CSQ trick is more convenient and used quite
> often inside Lucene, too.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>>
>> > Philippe
>> >
Glen, thank you for this very thorough and informative post.
Lance Norskog
SpanFirstQuery is the clean option. Another option is to add a "start
token" to each title. Then, search for "startToken oil spill". This
will be faster than SpanFirstQuery. But it also requires doing
something weird to the field.
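The start-token trick can be shown with a toy model in plain Python; it does not use SpanFirstQuery or any other Lucene class. The sentinel value "starttoken" and the helper names are placeholders chosen for the sketch. The point is that once every title begins with the sentinel, "title starts with X" becomes an ordinary phrase match.

```python
START = "starttoken"  # placeholder sentinel; any term absent from real titles works


def index_title(title):
    """Tokenize a title and put the sentinel in front of it."""
    return [START] + title.lower().split()


def phrase_match(tokens, phrase):
    """True if the words of phrase occur consecutively in tokens."""
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def starts_with(tokens, phrase):
    """Anchor the phrase to the beginning by prepending the sentinel."""
    return phrase_match(tokens, START + " " + phrase)


leading = index_title("Oil spill in the Gulf")
trailing = index_title("Gulf oil spill")
print(starts_with(leading, "oil spill"), starts_with(trailing, "oil spill"))
```

The cost is the one mentioned above: the field now contains a token that is not part of the original text.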
Lance
On Thu, Jun 17, 2010 at 3:19 PM, Micha
o be caused by the new behavior introduced here?
>> https://issues.apache.org/jira/browse/LUCENE-2386
>> If you open a writer, add docs, and then crash before calling commit?
>
> That could be; Maryam is that what happened?
>
> Mike
>
y
> copying these from another Lucene index directory generated with the same
> Lucene version, or can I merge this index with another index which has
> segments_N to retrieve the data?
>
> Thanks
>
--
Lance Norskog
goks...@gmail.com
es per day. You
> would need 625,000 of the largest iPods to store that much information; if
> these were stacked end-to-end they would go for more than 40 miles
>
>
--
Lance Norskog
goks...@gmail.com
_g9 into a new
directory and generate a segments.gen for just those two segments. Is
this all that's needed?
--
Lance Norskog
goks...@gmail.com
ated.
>>>
>>
Hi-
Around Sept. 20 I started getting Japanese spam to this account. This is
a special account I only use for the Solr and Lucene user mailing
lists. Did anybody else get these, starting around 9/20?
Lance Norskog
solid funding and
are a real business. We have a contract with a large company to lease our
index and provide various services.
Thank you for your time. Please contact me at [EMAIL PROTECTED]
Lance Norskog
650-922-8831
Hi-
The page http://lucene.apache.org/java/docs/queryparsersyntax.html does not
mention that the \u Unicode escape syntax is supported.
For example, \u0048\u0045\u004c\u004c\u004f is HELLO.
Please add this to the page; it took experimentation to discover it.
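As a quick check of the example, this plain-Python sketch decodes \u escapes of exactly four hex digits. It only mimics the behaviour described here and is not the query parser's own code; decode_unicode_escapes is a made-up name.

```python
import re

# a backslash, the letter u, then exactly four hexadecimal digits
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text):
    """Replace each \\uXXXX escape with the character it encodes."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


print(decode_unicode_escapes(r"\u0048\u0045\u004c\u004c\u004f"))
```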
Thanks,
Lance Norskog