Don't commit after adding each and every document.
On Tue, Sep 3, 2013 at 7:20 AM, nischal reddy wrote:
> Hi,
>
> Some more update on my progress,
>
> i have multithreaded indexing in my application, i have used thread pool
> executor and used a pool size of 4 but had a very slight increase in
The easiest way would be to pre-process your input and join those 2 tokens
before splitting them by whitespace.
But from the given context I might be missing some details... still worth a shot.
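A minimal sketch of what I mean (the two-word term "ice cream" is just a made-up example; substitute whatever terms you need to combine):

// Join a known two-word term into a single token before whitespace tokenization.
String raw = "I like ice cream a lot";
String joined = raw.replace("ice cream", "ice_cream");
// Feeding `joined` to a whitespace-based analyzer now yields "ice_cream" as one token.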
On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen wrote:
> Hi,
>
> I am looking for a token filter that can combine 2 terms
Ironically, most of the changes are in Unicode handling and the
standard analyzer ;)
On Tue, Nov 20, 2012 at 12:31 PM, Ramprakash Ramamoorthy <
youngestachie...@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 3:54 PM, Danil ŢORIN wrote:
>
> > However behavior of some analyzers change
However, the behavior of some analyzers changed.
So even though the old index is readable with 4.0 after the upgrade, it doesn't mean
everything still works as before.
On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea wrote:
> You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
> You'll need to do
To avoid wildcard queries, you can write a TokenFilter that will
create both tokens "ADJ" and "ADJ:brown" in the same position,
so you can use your index for both lookups without doing wildcards.
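Something along these lines (untested sketch against the 3.x attribute API; the class name and the ':' convention are just illustrative):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// For an incoming token like "ADJ:brown", also emit the bare prefix "ADJ"
// at the same position, so both lookups work without wildcards.
public final class PosPrefixFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
  private String pendingPrefix;   // prefix waiting to be emitted at the same position

  public PosPrefixFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingPrefix != null) {
      termAtt.setEmpty().append(pendingPrefix);
      posAtt.setPositionIncrement(0);   // same position as the full token
      pendingPrefix = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    int colon = term.indexOf(':');
    if (colon > 0) {
      pendingPrefix = term.substring(0, colon);   // e.g. "ADJ" from "ADJ:brown"
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingPrefix = null;
  }
}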
On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober
wrote:
> Hi Danil,
>
>>> Just transform your input like
le to do phrase queries),
and still maintain join capability.
On Tue, Aug 7, 2012 at 12:13 PM, Carsten Schnober
wrote:
> Am 07.08.2012 10:20, schrieb Danil ŢORIN:
>
> Hi Danil,
>
>> If you do intersection (not join), maybe it makes sense to put
>> everything into one index?
>
If you do intersection (not join), maybe it makes sense to put
everything into one index?
Just transform your input like "brown fox" into "ADJ:brown| NOUN:fox|".
Write a custom tokenizer, some filters, and that's it.
Of course I'm not aware of all the details, so my solution might not
be applicable.
Do you really HAVE to keep all those indexes opened?
You could use an LRU or LFU cache of reasonable size with opened
indexes, and open a new searcher if it's not in the cache.
If your indexes are quite small, the open call shouldn't be too expensive.
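A rough sketch of the cache idea (3.x-era API; the capacity and the naive close-on-evict are placeholders, a real implementation needs reference counting):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.search.IndexSearcher;

// LRU cache of open searchers keyed by index path; evicted searchers are closed.
public class SearcherCache extends LinkedHashMap<String, IndexSearcher> {
  private static final int MAX_OPEN = 50;   // made-up capacity

  public SearcherCache() {
    super(16, 0.75f, true);   // access-order = LRU eviction
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, IndexSearcher> eldest) {
    if (size() > MAX_OPEN) {
      try {
        eldest.getValue().close();   // naive: assumes nobody is still using it
      } catch (Exception e) {
        // ignore in this sketch
      }
      return true;
    }
    return false;
  }
}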
On Mon, Jul 16, 2012 at 11:51 AM, Ian Lea wrote:
Listen to Uwe.
Keeping your date/time in milliseconds is the best solution.
You don't care about how the user likes his dates, DD.MM. (Europe)
or MM.DD. (US), about timezones, daylight saving changes, leap
seconds, or any other complications.
Your dates are simple long numbers; you can easily sort and do range queries on them.
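A rough sketch of what that looks like (Lucene 2.9/3.x-era API; the field name is made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

// Index the date as a plain long (milliseconds since epoch).
Document doc = new Document();
doc.add(new NumericField("timestamp", Field.Store.YES, true)
    .setLongValue(System.currentTimeMillis()));

// Range queries (and sorting) then work directly on the numeric value.
NumericRangeQuery<Long> lastWeek = NumericRangeQuery.newLongRange(
    "timestamp",
    System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000,
    System.currentTimeMillis(),
    true, true);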
If you can afford it, you could add one additional untokenized stored
field that will contain the serialized (one way or another) form of the
document.
Add a FieldCache on top of it, and return it right away.
But we are getting into the area where you basically have to keep all
your documents in memory.
I think you are looking for FieldCache.
I'm not sure of the current status in 4.x, but it worked in 2.9/3.x.
Basically it's an array, so access is quite straightforward, and the
best part is that IndexReader manages those for you, so on reopen only new
segments are read.
Small catch is that FieldCaches are p
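A minimal sketch of the lookup (3.x-era API; the field name "serializedDoc" is made up):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// FieldCache loads every value of an untokenized field into an array
// indexed by Lucene doc id, so retrieval becomes a plain array access.
public class CachedFieldLookup {
  private final String[] values;

  public CachedFieldLookup(IndexReader reader) throws IOException {
    values = FieldCache.DEFAULT.getStrings(reader, "serializedDoc");
  }

  public String get(int docId) {
    return values[docId];   // no stored-field IO at search time
  }
}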
Range queries on Strings are painfully slow.
Format them as MMDD and store them as class="solr.TrieIntField"
precisionStep="8" omitNorms="true" positionIncrementGap="0".
On Thu, Feb 23, 2012 at 10:19, findbestopensource
wrote:
> Yes. By storing as String, You should be able to do range search. I am not
>
It also depends on your queries.
For example, if you only query data in one-month intervals and you
partition by date, you can calculate which shard your data is in, and
query just that shard.
If you can find a partition key that is always present in the query,
you can create a gazillion
Or you may simply store the field as is, but index it in whatever way you
like (replacing some tokens with others, or maybe storing both words with
position increment = 0).
On Mon, Jan 16, 2012 at 13:23, Dmytro Barabash wrote:
> I think you need index this field with
> org.apache.lucene.document
Maybe you could simply use String.replace()?
Or does the text actually need to be tokenized?
On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin wrote:
> I am trying to perform a "translation" of sorts of a stream of text. More
> specifically, I need to tokenize the input stream, look up every term in a
> s
erating one index per file.
>
> Am I right to say that you would definitely not go for one index per file
> solution? is it also due to memory consumption?
>
> Many thanks,
> Rui Wang
>
>
> On 6 Dec 2011, at 10:05, Danil ŢORIN wrote:
>
> > How many documents
How many documents are there in the system?
Approximate it by: 2 files * avg(docs/file)
From my understanding your queries will be just a lookup for a document ID
(Q: are those IDs unique between files? Or do you need to filter by filename?)
If that will be the only use case, then maybe you should
It depends.
If all documents are distinct then, yeah, go for it.
If you have multiple versions of the same document in your data and you
only want to index the latest version... then you need a clever way to
split the data to make sure that all versions of a document will be indexed
on the same host, and you
There are no noticeable performance gains/losses when moving to 64-bit,
assuming it is exactly the same hardware (just a 64-bit OS), the same index and
a reasonable amount of Java heap
(keep in mind that if you had 2GB on 32-bit you'll need almost 3GB on
64-bit due to the larger pointer representation).
But once you
GC times on large heaps are pretty painful right now (I haven't tried the
G1 collector; knowledgeable people, please advise).
Also it's very dependent on your index and query pattern, so you could
improve it by using some -XX magic.
My recommendation is to scale horizontally (split the index into shards),
You could encode the term score as a payload while indexing, and use those
payloads at search time.
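The indexing side could look roughly like this (3.x-era API; the fixed score passed in is a placeholder for whatever weight your model produces). At search time, a PayloadTermQuery plus a Similarity whose scorePayload decodes the float can fold that weight back into scoring.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Attach a per-term score as a payload at index time.
public final class ScorePayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final float score;

  public ScorePayloadFilter(TokenStream input, float score) {
    super(input);
    this.score = score;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    payloadAtt.setPayload(new Payload(PayloadHelper.encodeFloat(score)));
    return true;
  }
}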
On Fri, Oct 15, 2010 at 11:30, Zaharije Pasalic
wrote:
> Hi
>
> my original problem is to index large number of documents which
> contains 360 integers in rage from 0-90K. Searching it's a little bit
> c
I think that StandardAnalyzer will do exactly that.
When you specify your field as STORED, an exact copy of the field is
stored so you can retrieve it later.
The analyzer's job is just to extract tokens (the things that you'll search
for), and that's where you can play with lowercasing/stemming/stopwords.
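For example (3.x-era API; the field name is illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("body", "The Quick Brown Fox",
    Field.Store.YES,            // exact original text, retrievable later
    Field.Index.ANALYZED));     // analyzer output (lowercased etc.) is what gets searched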
n the SAN, but it's only part of the problem IMHO)
> 9-10) Thank you for the information
> 11) On the high end server, after we optimized the index the average search
> time dropped from 10s to below 2s, now (after 2.5 weeks) the average search
> time is 7s. Optimization
Lucene 2.1 is really old... you should be able to migrate to Lucene 2.9
without changing your code (almost a jar drop-in, but be careful with
analyzers), and there could be huge improvements if you use Lucene
properly.
A few questions:
- what does "all data to be indexed is stored in DB fields" mean? you
Is it possible for you to migrate to 2.9.x? Or even 3.x?
There are some huge optimizations in 2.9 on reopening indexes that
significantly improve search speed.
I'm not sure... but I think indexWriter.getReader() for almost-realtime
was added in 2.9, so you can keep your writer always open and get v
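Something like this (sketch against the 2.9/3.0-era API; the helper class is illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

// Keep the writer open and pull near-real-time readers from it
// instead of reopening the index from disk.
public class NrtSearch {
  public static IndexSearcher freshSearcher(IndexWriter writer) throws IOException {
    IndexReader reader = writer.getReader();   // sees changes not yet committed
    return new IndexSearcher(reader);          // caller closes searcher and reader when done
  }
}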
ce on my very first day on this
>> > mailing list...
>> > At end of day, I have very optimistic results. 100bln search in less than
>> > 1ms and the index creation time is not huge either ( close to 15
>> minutes).
>> >
>> > I am now hitting the 1bln mark
It's not optimized, trust me.
An optimized index will contain only one segment and no delete files.
On Mon, Aug 16, 2010 at 04:34, Andrew Bruno wrote:
> The index is optimized every 60 secs... so it must have already been cleaned
> up.
>
> Thanks for feedback.
>
> On Sat, Aug 14, 2010 at 8:15 PM,
gt; user may type any one token, this will not work. I can further tweak this
> such that I index the same document into multiple indices (one for each
> token). So, the same document may be indexed into Shard"A", "M", "N" and "D".
> I am not able to think
I'd second that.
It doesn't have to be a date for sharding. Maybe every query has some
specific field, like a UserId or something, so you can redirect to a
specific shard instead of hitting all 10 indices.
You have to have some kind of narrowing: searching 1bn documents with
queries that may hit all do
The problem actually won't be the indexing part.
Searching such a large dataset will require a LOT of memory.
If you need sorting or faceting on one of the fields, the JVM will explode ;)
Also GC times on a large JVM heap are pretty disturbing (if you care
about your search performance).
So I'd advise
Try the CJK analyzer for both indexing and searching the Chinese language.
Then you won't need the "text"->"*text*" transformation.
There might be some false positives in the results though.
You may also want to try the smartcn analyzer, which is dictionary-based, but
I have no expertise to evaluate the
What will your search look like?
If your document is:
f1:"1"
f2:"2"
f3:"3"
You could create a Lucene document with a single field instead of 20k:
fields:"f1/1 f2/2 f3/3"
I replaced ":" with "/" and let assume you use whitespace analyzer on
indexing.
On search your old query "+f1:1 +f2:2" should
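A minimal sketch of the single-field trick (3.x-era API; a whitespace analyzer is assumed at index time, as above):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

Document doc = new Document();
doc.add(new Field("fields", "f1/1 f2/2 f3/3",
    Field.Store.NO, Field.Index.ANALYZED));

// The old query  +f1:1 +f2:2  becomes term queries on the combined field.
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("fields", "f1/1")), BooleanClause.Occur.MUST);
q.add(new TermQuery(new Term("fields", "f2/2")), BooleanClause.Occur.MUST);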
You can simply index both "files" and "cards" into the same index (no need
for 2 indexes).
Lucene easily supports documents of different structure.
You may add some boosting per field or document, and tune similarity
to get the most important stuff at the top.
On Tue, Jan 19, 2010 at 16:35, Anna Hunecke wrote:
an
> immediate optimize handling the conversion. Can I safely assume that 3.0.0 is
> able to read 2.3.1?
>
> Making code changes to the readers in production is tricky in my
> infrastructure and making one transition rather than two is very desirable.
>
> -Original Message-
eld.Store.COMPRESS
> predecessors, where index reader client use of Field.Store.COMPRESS is in
> transit to the explicit decompression approach.]
> 5. Convert the readers to 3.0.0, which should be able to read 2.9.1, if there
> are no compressed fields (??)
> 6. Convert the wr
You NEED to update your readers first, or else they will be unable to
read files created by the newer version.
And trust me, there are changes in the index format from 2.3 -> 2.9.
On Wed, Dec 9, 2009 at 15:11, Weiwei Wang wrote:
> Hi, Rob,
> I read
> http://wiki.apache.org/lucene-java/BackwardsCompatibili
Run System.gc() right before measuring memory usage.
On the Sun JVM it will FORCE a GC (unless -XX:+DisableExplicitGC is used).
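A rough measurement sketch:

Runtime rt = Runtime.getRuntime();
System.gc();   // on the Sun JVM this forces a full collection unless -XX:+DisableExplicitGC is set
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println("Used heap after GC: " + (usedBytes / (1024 * 1024)) + " MB");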
On Thu, Dec 3, 2009 at 16:30, Ganesh wrote:
> Thanks mike.
>
> I am opening the reader and warming it up and then calculating the memory
> consumed.
> long usedMemory = runt
Try opening with a very large value (Integer.MAX_VALUE); it will load only the first
term and look up the rest from disk.
On Fri, Nov 27, 2009 at 12:24, Michael McCandless
wrote:
> If you are absolutely certain you won't be doing any lookups by term.
>
> The only use case I know of is internal, when Lucene's Seg
There is no such thing in Lucene as a "unique" doc.
They might be unique from your application's point of view (have some ID
that is unique).
From Lucene's point of view it's perfectly fine to have duplicate documents.
So the "deleted" documents in the combined index are coming from your second index.
E
I'd vote A, with the following addition:
What about creating major versions more often?
If there are incremental improvements which don't clutter the code too
much, continue with 3.0 -> 3.1 -> 3.2 -> ... -> 3.X.
Once there are significant changes which are hard to keep backward
compatible, start a 4.0.
There should be no problem with large segments.
Please describe the OS, file system and JDK you are running on.
There might be some problems with files >2GB on Win32/FAT, or on some
ancient Linuxes.
On Tue, Sep 1, 2009 at 12:37, wrote:
> I met a problem to open an index bigger than 8GB and the followi
Try WhitespaceAnalyzer for both indexing and searching.
At search time you may also need to escape "+", "(", ")" with "\".
"#" shouldn't need escaping.
On Thu, Jul 16, 2009 at 17:23, Chris Salem wrote:
> I'm using the StandardAnalyzer for both searching and indexing.
> Here's the code to parse the
The 2GB size is a limitation of the OS and/or file system, not of the index
as supported by Lucene.
There is another kind of limitation in Lucene: the number of documents
must be < 2147483648.
However, the size of a Lucene index may reach tens or hundreds of GB
well before that.
If you are thinking about BIG inde
The iPhone doesn't support Java, so there is no way to run Lucene on it.
Creating a SQLite database and searching inside it is a completely
different solution,
which has nothing to do with Lucene.
On Wed, May 6, 2009 at 13:08, Shashi Kant wrote:
> Hi all,
>
> I am working on an iPhone application where t
If you store data so sensitive that you are thinking about index encryption,
then I may suggest simply isolating the host with the Lucene index:
- ssh only, a VERY limited set of users allowed to log in
- provide Solr over HTTPS to search the index (avoids in-transit interception)
- set up firewall rules
This way Lu
You can use Solr (http://lucene.apache.org/solr/).
Index on one machine and distribute the index to many.
On Wed, Mar 25, 2009 at 18:18, kgeeva wrote:
>
> I have an application clustered on two servers. Is the best practice to have
> two lucene indexes - one on each server for the app or is it bes
The problem you may face with such large documents is that there
is a high probability that most terms will be present in all
documents.
So on search you'll receive a lot of documents (if you need to
retrieve the full text, it will take a while), but the bigger problem is
usability: what a user
It depends what you call a server:
- 4 dual Xeons, 64GB RAM, 1TB of 15000 rpm RAID10 hard disks is one thing
- 1 P4, 512MB RAM, a 40GB 5400 rpm hard disk, Win2K is something else completely.
It depends on the index structure and the size of the documents you index/store.
It depends on the way you query
You can generate n-grams: for example when you index "lucene" you
create tokens "luce", "ucen", "cene".
It will increase the term count (and index size); however, on search you
will simply search for a single term, which will be extremely fast.
It depends how many documents you have, the size of each document...
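A rough sketch of such an analyzer (3.x-era API; NGramTokenFilter comes from the contrib analyzers module):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

// Index fixed-length character 4-grams of each value, so a substring
// search becomes a single exact term lookup.
public class FourGramAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // "lucene" -> "luce", "ucen", "cene"
    return new NGramTokenFilter(new KeywordTokenizer(reader), 4, 4);
  }
}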
According to
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/TopDocCollector.html
it does.
After search, simply retrieve the TopDocs and read the documents you need:
List<Document> result = new ArrayList<Document>(10);
for (ScoreDoc sDoc : collector.topDocs().scoreDocs) {
    result.add(contentSearcher.doc(sDoc.doc));
}