This is a parts-of-speech analyzer for tweets. It would make your index
far more useful.
http://www.ark.cs.cmu.edu/TweetNLP/
On 11/04/2013 11:40 PM, Stéphane Nicoll wrote:
Hi,
I am building an application that indexes tweets and offers some basic
search facilities on them.
I am trying to find
This is very cool! Lemmatization is an important tool for making search
work better.
Would you consider changing the licensing to the Apache 2.0 license?
On 10/23/2013 08:17 AM, Michal Hlavac wrote:
Hi,
I rewrote the lemmatizer project LemmaGen (http://lemmatise.ijs.si/) in Java.
Originally it's
Is there a Trie-based term index? Seems like this would be smaller, and
very fast on non-leading wildcards.
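To make the question concrete, here is a minimal sketch of a trie over terms, written in Python and assuming nothing about how Lucene actually stores its term index. TermTrie, add and with_prefix are made-up names for illustration. A trailing wildcard such as luc* becomes a walk down to the prefix node followed by a traversal of its subtree.

```python
class TrieNode:
    """One character position in the trie."""

    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_term = False  # True if a term ends at this node


class TermTrie:
    """Hypothetical in-memory term dictionary; not a Lucene class."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, term):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True

    def with_prefix(self, prefix):
        """Return all terms starting with prefix, i.e. the query prefix*."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_term:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = TermTrie()
for term in ["lucene", "lucid", "luck", "solr"]:
    trie.add(term)
print(trie.with_prefix("luc"))
```

Terms that share a prefix share nodes, which is where the size saving would come from. The cost of a prefix lookup depends on the prefix length plus the size of the matching subtree, not on the total number of terms.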
On 07/09/2013 02:34 PM, Uwe Schindler wrote:
Hi,
You can replace the terms with their hashes directly in the analyzer chain. Just
write a custom TermToBytesRef attribute that hashes the term
My current open source project is a Directory that is just like
RAMDirectory, but everything is memory-mapped. The idea is it creates a
disk file, opens it, and immediately deletes the file. The file still
exists until the IndexReader/Writer/Searcher closes it. But, it cannot
be found from the
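The create/open/delete trick can be sketched outside Lucene. This is a Python illustration for a POSIX system, not the Directory implementation itself, and unlinked_mapped_buffer is a hypothetical helper name. On Windows the delete would typically fail while the file is still open.

```python
import mmap
import os
import tempfile


def unlinked_mapped_buffer(size):
    """Create a file, memory-map it, then delete its name right away.

    Returns the mapping and the (now removed) path. The file's storage
    stays alive until the mapping is closed.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)     # give the file its full length
        buf = mmap.mmap(fd, size)  # shared read/write mapping
    finally:
        os.close(fd)               # the mapping keeps its own reference
        os.unlink(path)            # the name disappears from the directory
    return buf, path


buf, path = unlinked_mapped_buffer(4096)
buf[:5] = b"hello"
print(os.path.exists(path), buf[:5])
```

Because the name is gone, nothing else can open the file, and the operating system reclaims the space once the mapping is closed, even if the process crashes.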
Solr/Lucene has two features for this:
1) the MoreLikeThis code, and
2) the clustering project in solr/contrib.
Lance
On 06/28/2013 11:15 AM, Luis Carlos Guerrero Covo wrote:
I only have about a million docs right now so scaling is not a big issue.
I'm looking to provide a quick implementation
I'm responsible for the OpenNLP wiki page:
https://wiki.apache.org/solr/OpenNLP
Please add me to the list of editors.
The simple answer (that somehow nobody gave) is that you can make a copy
of an index directory at any time. Indexes are changed in "generations".
The segment* files describe the current generation of files. All active
indexing goes on in new files. In a commit, all new files are flushed to
disk
On Mon, Jun 3, 2013 at 6:46 AM, Lance Norskog wrote:
What is a Lucene query that will find two words at the same term position?
Is there a class that will do this? Is the feature available from the
Lucene query syntax or any other syntax parsers?
For example, if I'm using synonyms at
What is a Lucene query that will find two words at the same term
position? Is there a class that will do this? Is the feature available
from the Lucene query syntax or any other syntax parsers?
For example, if I'm using synonyms at index time I should get the base
word and all synonyms at the
3.x and 4.0 Solr releases have nice analyzers just for Japanese. In 4.0
they are the "Kuromoji" package.
In 4.0, the JapaneseAnalyzer probably does what you need:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-analyzers-kuromoji/4.0.0/org/apache/lucene/analysis/ja/Japan
4.x does not promise backwards compatibility with 3.x. Have you made
your own extensions?
On 01/02/2013 04:38 AM, Shai Erera wrote:
There's no specific branch for 4.1 yet. All development still happens on
the 4x branch (
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/).
Note tha
There were memory leak problems with earlier versions of Java. You
should upgrade to Java 6_30.
Lance
On 01/02/2013 05:26 AM, Alon Muchnick wrote:
Hello all,
We are using Lucene 3.6.2 in our web application on Tomcat 5.5, and recently
we started testing our application on Tomcat 7, unfortunate
How do you choose t2 and t2a? If you have a full inventory of these
pairs, you can make these multi-word synonyms and use the Synonym filter
to combine them.
On 12/20/2012 11:50 PM, Xi Shen wrote:
Hi,
I am looking for a token filter that can combine 2 terms into 1? E.g.
the input has been to
from
http://mirror.bjtu.edu.cn/apache/lucene/java/4.0.0/lucene-4.0.0-src.tgz, it
is an Ant project. But I do not know which IDE can import it... I tried Eclipse,
it cannot import the build.xml file.
Thanks,
D.
On Mon, Dec 24, 2012 at 12:02 PM, Lance Norskog wrote:
You need to use an IDE. Find the
You need to use an IDE. Find the Attribute type and show all subclasses.
This shows a lot of rare ones and a few which are used a lot. Now, look
at source code for various TokenFilters and search for other uses of the
Attributes you find. This generally is how I figured it out.
Also, after the
n put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position."
This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.
-Glen
On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog wrote:
Parts-of-spe
Parts-of-speech is available now, in the indexer.
LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
Apache project for natural-language processing.
Some parts are in Solr that could be in Lucene.
https://issues
Nope! This slang term only exists in the plural. The kind of prose with this
usage may not follow standard grammatical and spelling rules anyway.
Historically, text search has been funded mostly by the US intelligence
agencies because they want to analyze formal and technical prose. And, it is
An option: instead of merging continuously as you run, you can optimize with
'maxSegments=10'. This means 'optimize, but only until there are 10 segments'. If
there are already 10 or fewer segments, nothing happens. This lets you schedule
merging I/O.
Is the number of files a problem due to file space
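The stopping rule behind 'maxSegments' can be shown with a toy model. This Python sketch is only an illustration: the function name optimize_to and the rule of always merging the two smallest segments are assumptions, since in Lucene the MergePolicy decides which segments are actually merged.

```python
def optimize_to(segment_sizes, max_segments):
    """Merge segments until at most max_segments remain.

    Toy model: repeatedly merges the two smallest segments. Returns the
    resulting segment sizes in ascending order.
    """
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    segments = sorted(segment_sizes)
    while len(segments) > max_segments:
        merged = segments[0] + segments[1]
        segments = sorted(segments[2:] + [merged])
    return segments


print(optimize_to([5, 1, 1, 2, 8], 3))  # five segments reduced to three
print(optimize_to([3, 4], 10))          # already under the limit, unchanged
```

The second call shows the "nothing happens" case: with two segments and a limit of 10, no merge I/O is done at all.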
Scott, did you mean the Lucene integer id, or the unique id field?
- Original Message -
| From: "Martijn v Groningen"
| To: java-user@lucene.apache.org
| Sent: Sunday, October 28, 2012 2:24:29 PM
| Subject: Re: Lucene 4.0 delete by ID
|
| A top level document ID can change over time. For
> Dawid
>
> ---------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
--
Lance Norskog
goks...@gmail.com
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
t
>> doesn't help me).
>
>
> I'm wrong, its there, but eclipse isn't seeing it (haven't tried javac by
> itself), even though it sees HighFreqTerms just fine.
ontaining $$?
>> >
>> >
>> > --
>> > Ian.
>> >
>> >
>> > On Tue, Aug 14, 2012 at 9:13 AM, zhoucheng2008
>> > wrote:
>> > > Hi,
>> > >
>> > >
>> > > I have a big index, and when I s
bucket. SSDs speed up almost everything, save
> RAM and spare a lot of work hours optimizing I/O-speed.
>
> Regards,
> Toke Eskildsen
gh estimate of direct memory usage per GB of
>> indexed data, or per directory/writer instance, if applicable.
>>
>> Thanks,
>> -V
anyhow for large mergeFactor and large
> RAMBufferSizeMB.
>
> Maxim
>
>
>
--
Lance Norskog
goks...@gmail.com
And no, RAMDirectory does not help.
On Mon, May 28, 2012 at 5:54 PM, Lance Norskog wrote:
> Can you use filter queries? Filters short-circuit a lot of search
> processing. "City:San Francisco" is a classic filter - it is a small
> part of the documents and it is reused a lot.
it seems that io is not the
>> >>> >> > I am reading
>> >>> > http://www.cnlp.org/presentations/slides/AdvancedLuceneEU.pdf
>> >>> >> > it mentions
>> >>> >> > Size
>> >>> >> > – Stop
the smallest one is byte. It is possible to use only
> ceil(log2(#unique_values)) bits/document, although that requires a bit
> of custom coding.
>
> Regards,
> Toke Eskildsen
>
, 2012 at 1:09 AM, Lance Norskog wrote:
> I would like to remove a payload attribute from a token before it is
> indexed. PayloadAttribute lets you set the payload to null.
> AttributeSource (parent of all Tokens) does not have a 'remove
> Attribute' method. You cannot capture
then monkey with it (at least Eclipse does not show
me its methods).
If I set the payload to null, when the Token is saved in the index,
will a null payload be saved? Or does the payload get quietly dropped?
--
Lance Norskog
goks...@gmail.com
s)
>>
>>
>> Any other suggestions? I have tried some of the basic ideas on the Lucene
>> wiki, such as leaving the IndexSearcher open for the life of the process (a
>> servlet). Any help would be greatly appreciated!
>>
>>
>> Rob
--
Lance Norskog
goks...@gmail.com
>>
>> I'm sure this has quite a simple explanation but I'm unable to find it right
>> now ;-) Perhaps you can help with that.
>>
>> Thanks a lot!
>>
>> Best regards,
>>
>> Erik
>
> --
> Regards,
> Sundus Hassan
>
--
Lance Norskog
goks...@gmail.com
a time. That being said, basic grep/regex is probably fast enough.
>> >> > >
>> >> >
>> >> > In cases where you are doing a 'find' in a document similar to what a
>> >> > wordprocessor would do (especially if you want to iterate
>> >> > forwards/backwards through matches etc), you might want to consider
>> >> > something like
>> >> >
>> http://icu-project.org/apiref/icu4j/com/ibm/icu/text/StringSearch.html
>> >> >
>> --
>> ---
>> Thanks & Regards
>> Umesh Prasad
>>
--
Lance Norskog
goks...@gmail.com
p,
>>> presumably because UCS doesn't have an equals
>>> or hash method.
>>>
>>> Suggestions? I've worked around it by registering a class
>>> based builder, checking for the field name and either
>>> delegating to the original builder or doing my custom
>>> processing, but it'
the Similarity class.
>> >
>> > How can I change the scoring formula? (By customizing only the Similarity
>> > class? Or the Scorer?)
>> >
>> > Do you have an example of this use case?
>> >
>> > Thanks for your help.
>> >
I couldn't find it, so...
>>
>> Is it possible / advisable / practical to use Lucene as the basis of a live
>> document search capability? By "live document" I mean a largish document
>> such as a word processor might be able to handle which is
> Any ideas or thoughts, would be very much appreciated.
>
> Thanks in advance
> David
The Lucene CheckIndex program does this. It is the class
org.apache.lucene.index.CheckIndex, which has a main() method.
Samarendra Pratap wrote:
It is not guaranteed that every term will be indexed. There is a limit on the
maximum number of terms per field (as of Lucene 3.0, and maybe earlier too).
Check out this
http://
First, to understand what your query looks like, go to
admin/analysis.jsp. It lets you see what happens to your queries when
they go in. Then, do the query with debugQuery=true. This will add some
complex junk to the end of the XML page that describes in painful detail
exactly how each document
le code for that?
>> Thanks a lot, and I apologize for the fact that for many of you this
>> looks like a stupid post :).
>>
>> Best Regards,
>> Ciprian.
>>
>
--
Lance Norskog
goks...@gmail.com
You would have to control your MergePolicy so it doesn't collapse
everything back to one segment.
On Tue, Nov 2, 2010 at 12:03 PM, Simon Willnauer
wrote:
> On Tue, Nov 2, 2010 at 1:58 AM, Lance Norskog wrote:
>> 2 billion is a hard limit. Usually people split indexes into multiple
Tika has some mailbox file parsing that includes metadata parsing.
For POP/IMAP email servers I don't know any tools.
Hasan Diwan wrote:
On 27 October 2010 18:16, Troy Wical wrote:
Depends on what you're trying to index, I suppose. Maildir or mbox? For some time
now, off and on, I have been
already
> does this using Lucene/Solr.
> Thanks!
> Maria
The Lucene CheckIndex program opens an index and walks all of the data
structures. It is a good start for you.
Sahin Buyrukbilen wrote:
Thank you Uwe, I will read the docs and try to do it. However, do you have any
example code? I need it because I am not very familiar with Java.
Thank you.
Sahin
If an index file is not completely written to disk, it never becomes
available. Lucene has a file describing the current active index
segments. It writes all new files to the disk, and changes the
description file (segments.gen) only after that.
If the index files are corrupted, all bets are of
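The write-everything-first, switch-the-description-last idea can be sketched in a few lines. This is a Python illustration of the general pattern, not Lucene's code: the manifest.json file, its layout, and the names publish_generation and read_current are all assumptions made up for the example.

```python
import json
import os
import tempfile


def publish_generation(index_dir, generation, new_files, live_names):
    """Write new data files fully, then switch the description file."""
    # Step 1: write and sync every new file. Readers cannot see them yet
    # because nothing refers to them.
    for name, data in new_files.items():
        with open(os.path.join(index_dir, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
    # Step 2: write the new description to a temporary name, then rename
    # it over the old one. The rename is the single step that makes the
    # new generation visible.
    tmp = os.path.join(index_dir, "manifest.tmp")
    with open(tmp, "w") as f:
        json.dump({"generation": generation, "files": sorted(live_names)}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(index_dir, "manifest.json"))


def read_current(index_dir):
    """Return the description of the currently published generation."""
    with open(os.path.join(index_dir, "manifest.json")) as f:
        return json.load(f)


def demo():
    with tempfile.TemporaryDirectory() as index_dir:
        publish_generation(index_dir, 1, {"_0.dat": b"first"}, ["_0.dat"])
        publish_generation(index_dir, 2, {"_1.dat": b"second"},
                           ["_0.dat", "_1.dat"])
        return read_current(index_dir)


print(demo())
```

If the process dies during step 1, the old description still points only at complete files, so a reader sees the previous generation and the half-written file is simply ignored.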
appears puny. (-:
>>>
>>> Thanks,
>>> Chris
>>>
e of the data you're going to index? If you're
> relying on your SOLR index to be your backup, you simply must back it up
> somewhere "often enough" to get by if your building burns down. I'd also
> think about storing your original input...
>
> This is no diffe
fying an
> existing order-array is cheaper than a full re-sort or not depends on
> your batch size.
>
> Regards,
> Toke Eskildsen
>>
>> When I print synonymMap using synonymMap.toString(), I get the output like
>>
>> <{New York=<{Chicago=<{Seattle=<{New
>> Orleans=<[(CONCEPTcity,0,0,type=SYNONYM),ORIG],null>}>}>}>}>
>>
>> so it looks like all the synonyms are loaded. But if I search for
>> "CONCEPTcity" then it says no matches found. I am not sure whether I have
>> loaded the synonyms correctly in the synonymMap.
>>
>> Any help will be deeply appreciated. Thanks!
>>
>
--
Lance Norskog
goks...@gmail.com
gle
> MultiThreadedHttpConnectionManager and HttpClient for all the SolrServer’s,
> and the other with a new MultiThreadedHttpConnectionManager and HttpClient
> for each SolrServer.
>
> Both tries yielded similar performance results.
>
> Also tried to give setMaxTotalConnection
tomScoreProvider. The QWF/CSQ trick is more convenient and used quite
> often inside Lucene, too.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>> > Philippe
>> >
Glen, thank you for this very thorough and informative post.
Lance Norskog
o be caused by the new behavior introduced here?
>> https://issues.apache.org/jira/browse/LUCENE-2386
>> If you open a writer, add docs, and then crash before calling commit?
>
> That could be; Maryam is that what happened?
>
> Mike
y
> copying these from another Lucene index directory generated with the same
> Lucene version, or can I merge this index with another index which has
> segments_N to retrieve the data?
>
> Thanks
>
--
Lance Norskog
goks...@gmail.com
es per day. You
> would need 625,000 of the largest iPods to store that much information; if
> these were stacked end-to-end they would go for more than 40 miles
>
_g9 into a new
directory and generate a segments.gen for just those two segments. Is
this all that's needed?
--
Lance Norskog
goks...@gmail.com
ated.
solid funding and
are a real business. We have a contract with a large company to lease our
index and provide various services.
Thanks for your time. Please contact me at [EMAIL PROTECTED]
Lance Norskog
650-922-8831
Hi-
The page http://lucene.apache.org/java/docs/queryparsersyntax.html does not
mention that \u Unicode syntax is supported.
For example, \u0048\u0045\u004c\u004c\u004f is HELLO.
Please add this to the page, it took experimentation to discover it.
Thanks,
Lance Norskog
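To check what a query written with \u escapes stands for, a few lines of Python will decode it. This sketch handles only the four-hex-digit form shown above and says nothing about how the Lucene query parser implements it; decode_unicode_escapes is a made-up helper name.

```python
import re


def decode_unicode_escapes(text):
    """Replace each \\uXXXX (four hex digits) with the character it names."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )


print(decode_unicode_escapes(r"\u0048\u0045\u004c\u004c\u004f"))
```

Text without escapes passes through unchanged, so the same helper can be run over a whole query string.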