Re: Making lucene indexing multi threaded

2013-09-02 Thread Danil ŢORIN
Don't commit after adding each and every document. On Tue, Sep 3, 2013 at 7:20 AM, nischal reddy wrote: > Hi, > > Some more updates on my progress: > > I have multithreaded indexing in my application; I have used a thread pool executor with a pool size of 4 but had a very slight increase in
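
A minimal sketch of that advice, assuming one IndexWriter shared by all workers (addDocument is thread-safe); buildDocument() and the file list are hypothetical, app-specific pieces:

    import java.io.File;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.index.IndexWriter;

    static void indexAll(final IndexWriter writer, List<File> files) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final File f : files) {
            pool.submit(new Runnable() {
                public void run() {
                    try { writer.addDocument(buildDocument(f)); } // buildDocument(): app-specific
                    catch (Exception e) { e.printStackTrace(); }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        writer.commit(); // one commit at the end, not one per document
    }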

Re: Which token filter can combine 2 terms into 1?

2012-12-21 Thread Danil ŢORIN
The easiest way would be to pre-process your input and join those 2 tokens before splitting by whitespace. But given the context I might be missing some details... still worth a shot. On Fri, Dec 21, 2012 at 9:50 AM, Xi Shen wrote: > Hi, > > I am looking for a token filter that can combine 2 terms

Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Danil ŢORIN
Ironically, most of the changes are in Unicode handling and the standard analyzer ;) On Tue, Nov 20, 2012 at 12:31 PM, Ramprakash Ramamoorthy < youngestachie...@gmail.com> wrote: > On Tue, Nov 20, 2012 at 3:54 PM, Danil ŢORIN wrote: > > > However behavior of some analyzers change

Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Danil ŢORIN
However, the behavior of some analyzers changed. So even though the old index is readable with 4.0 after an upgrade, it doesn't mean everything still works as before. On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea wrote: > You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader. > You'll need to do
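
A minimal sketch of the IndexUpgrader step (3.2+/4.x API; the index path is illustrative). Note that 4.0 only reads 3.x indexes, so a 2.3 index needs an intermediate pass with a 3.x IndexUpgrader first, and upgrading the file format does not re-tokenize anything, so changed analyzer behavior can still require reindexing:

    import java.io.File;
    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class Upgrade {
        public static void main(String[] args) throws Exception {
            // rewrites all segments to the current format, in place
            new IndexUpgrader(FSDirectory.open(new File("/path/to/index")),
                              Version.LUCENE_40).upgrade();
        }
    }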

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
To avoid wildcard queries, you can write a TokenFilter that will create both tokens "ADJ" and "ADJ:brown" in the same position, so you can use your index for both lookups without doing wildcards. On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober wrote: > Hi Danil, >>> Just transform your input like
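
A minimal sketch of such a filter, using the Lucene 3.1+ attribute API (class and field names are illustrative): for each incoming "TAG:word" token it stacks the bare "TAG" at the same position, so both lookups become single-term queries.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class TagExpandFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
        private String pendingTag; // bare tag waiting to be emitted

        public TagExpandFilter(TokenStream input) { super(input); }

        @Override
        public boolean incrementToken() throws IOException {
            if (pendingTag != null) {           // emit the stacked bare tag
                termAtt.setEmpty().append(pendingTag);
                posAtt.setPositionIncrement(0); // same position as "TAG:word"
                pendingTag = null;
                return true;
            }
            if (!input.incrementToken()) return false;
            String term = termAtt.toString();
            int colon = term.indexOf(':');
            if (colon > 0) pendingTag = term.substring(0, colon);
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pendingTag = null;
        }
    }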

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
le to do phrase queries), and still maintain join capability. On Tue, Aug 7, 2012 at 12:13 PM, Carsten Schnober wrote: > Am 07.08.2012 10:20, schrieb Danil ŢORIN: > > Hi Danil, >> If you do intersection (not join), maybe it make sense to put every >> thing into 1 index?

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
If you do intersection (not join), maybe it makes sense to put everything into 1 index? Just transform your input like "brown fox" into "ADJ:brown| NOUN:fox|". Write a custom tokenizer and some filters, and that's it. Of course I'm not aware of all the details, so my solution might not be applicable

Re: many index reader problem

2012-07-16 Thread Danil ŢORIN
Do you really HAVE to keep all those indexes opened? You could use an LRU or LFU cache of reasonable size with opened indexes, and open a new searcher if it's not in the cache. If your indexes are quite small, the open call shouldn't be too expensive. On Mon, Jul 16, 2012 at 11:51 AM, Ian Lea wrot
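
One way to sketch the LRU part with plain java.util (the size and eviction handling are assumptions; real code must also ensure no thread is still using an evicted searcher, e.g. via reference counting, and wrap the map for thread safety):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherCache extends LinkedHashMap<String, IndexSearcher> {
        private static final int MAX_OPEN = 50;            // tune to your memory budget

        public SearcherCache() { super(16, 0.75f, true); } // accessOrder=true -> LRU

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, IndexSearcher> eldest) {
            if (size() <= MAX_OPEN) return false;
            try { eldest.getValue().close(); }             // 3.x API; searcher must be idle
            catch (Exception ignored) {}
            return true;
        }
    }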

Re: about some date store

2012-07-11 Thread Danil ŢORIN
Listen to Uwe. Keeping your date/time in milliseconds is the best solution. You don't care about how the user likes his dates, DD.MM.YYYY (Europe) or MM.DD.YYYY (US), about timezones, daylight saving changes, leap seconds, or any other complications. Your dates are simple long numbers; you can easily
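
A minimal sketch with the 2.9/3.x numeric APIs ("ts", from and to are illustrative): the indexed value is just a long, and any DD.MM.YYYY vs MM.DD.YYYY question becomes a display-time formatting detail.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.search.NumericRangeQuery;

    Document doc = new Document();
    doc.add(new NumericField("ts", Field.Store.YES, true)
                .setLongValue(System.currentTimeMillis()));

    // later: everything between two instants, inclusive
    NumericRangeQuery<Long> q =
        NumericRangeQuery.newLongRange("ts", from, to, true, true);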

Re: any good idea for loading fields into memory?

2012-06-22 Thread Danil ŢORIN
If you can afford it, you could add one additional untokenized stored field that will contain the serialized (one way or another) form of the document. Add a FieldCache on top of it, and return it right away. But we are getting into the area where you basically have to keep all your documents in mem

Re: any good idea for loading fields into memory?

2012-06-20 Thread Danil ŢORIN
I think you are looking for FieldCache. I'm not sure of the current status in 4.x, but it worked in 2.9/3.x. Basically it's an array, so access is quite straightforward, and the best part is that IndexReader manages those for you, so on reopen only new segments are read. A small catch is that FieldCaches are p
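
A minimal sketch of the lookup side (2.9/3.x API; "price" is an illustrative field name):

    import org.apache.lucene.search.FieldCache;

    // one primitive value per document, indexed by docID; loaded lazily,
    // then served from memory until the underlying reader is closed
    int[] prices = FieldCache.DEFAULT.getInts(reader, "price");
    int p = prices[scoreDoc.doc];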

Re: date issues

2012-02-23 Thread Danil ŢORIN
Ranges on String are painfully slow. Format them as YYYYMMDD and store as class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0" On Thu, Feb 23, 2012 at 10:19, findbestopensource wrote: > Yes. By storing as String, You should be able to do range search. I am not >

Re: How best to handle a reasonable amount to data (25TB+)

2012-02-08 Thread Danil ŢORIN
It also depends on your queries. For example if you only query data for 1 month intervals, and you partition by date, you can calculate in which shard your data can be found, and query just that shard. If you can find a partition key that is always present in the query, you can create a gazillion

Re: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Danil ŢORIN
Or you may simply store the field as-is, but index it in whatever way you like (replacing some tokens with others, or maybe indexing both words with position increment = 0). On Mon, Jan 16, 2012 at 13:23, Dmytro Barabash wrote: > I think you need index this field with > org.apache.lucene.document

Re: how to preserve whitespaces etc when tokenizing stream?

2012-01-16 Thread Danil ŢORIN
Maybe you could simply use String.replace()? Or does the text actually need to be tokenized? On Fri, Jan 13, 2012 at 18:44, Ilya Zavorin wrote: > I am trying to perform a "translation" of sorts of a stream of text. More > specifically, I need to tokenize the input stream, look up every term in a > s

Re: Use multiple lucene indices

2011-12-06 Thread Danil ŢORIN
erating one index per file. > > Am I right to say that you would definitely not go for one index per file > solution? is it also due to memory consumption? > > Many thanks, > Rui Wang > > > On 6 Dec 2011, at 10:05, Danil ŢORIN wrote: > > > How many documents

Re: Use multiple lucene indices

2011-12-06 Thread Danil ŢORIN
How many documents are there in the system? Approximate it by: 2 files * avg(docs/file) From my understanding your queries will be just lookups for a document ID (Q: are those IDs unique between files? or do you need to filter by filename?) If that is the only use case then maybe you should

Re: distributing the indexing process

2011-06-30 Thread Danil ŢORIN
It depends. If all documents are distinct then, yeah, go for it. If you have multiple versions of the same document in your data and you only want to index the latest version... then you need a clever way to split the data to make sure that all versions of a document will be indexed on the same host, and you
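
The usual trick is to route by a stable key; a one-line sketch (docId and numShards are illustrative):

    // the same application-level ID always hashes to the same shard/host,
    // so every version of a document lands in one index
    int shard = (docId.hashCode() & 0x7fffffff) % numShards;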

Re: Re: Scale up design

2010-12-21 Thread Danil ŢORIN
There are no noticeable performance gains/losses when moving to 64-bit, assuming it's exactly the same hardware (just a 64-bit OS), the same index, and a reasonable amount of Java heap (keep in mind that if you had 2GB on 32-bit you'll need almost 3GB on 64-bit due to the larger pointer representation). But once you

Re: Scale up design

2010-12-13 Thread Danil ŢORIN
GC times on large heaps are pretty painful right now (haven't tried the G1 collector; knowledgeable people: please advise). Also it's very dependent on your index and query pattern, so you could improve it by using some -XX magic. My recommendation is to scale horizontally (split the index into shards),

Re: Overriding DefaultScore

2010-10-15 Thread Danil ŢORIN
You could encode the term score as a payload while indexing, and use those payloads at search time. On Fri, Oct 15, 2010 at 11:30, Zaharije Pasalic wrote: > Hi > > my original problem is to index large number of documents which > contains 360 integers in range from 0-90K. Searching it's a little bit > c
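
A minimal sketch of the indexing side (2.9/3.x payload API; how the score byte is derived is application-specific): a TokenFilter stamps each term with a one-byte payload, which a PayloadTermQuery plus a Similarity overriding scorePayload() can read back at search time.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.index.Payload;

    public final class ScorePayloadFilter extends TokenFilter {
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
        private final byte score; // however the application derives it

        public ScorePayloadFilter(TokenStream input, byte score) {
            super(input);
            this.score = score;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) return false;
            payloadAtt.setPayload(new Payload(new byte[] { score }));
            return true;
        }
    }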

Re: Case insensitive search

2010-10-08 Thread Danil ŢORIN
I think StandardAnalyzer will do exactly that. When you specify your field as STORED, an exact copy of the field is stored so you can retrieve it later. The analyzer's job is just to extract tokens (the things that you'll search for), and that's where you can play with lower case/stemming/sto

Re: Questions about Lucene usage recommendations

2010-09-27 Thread Danil ŢORIN
n the SAN, but it's only part of the problem IMHO) > 9-10) Thank you for the information > 11) On the high end server, after we optimized the index the average search > time dropped from 10s to below 2s, now (after 2.5 weeks) the average search > time is 7s. Optimization

Re: Questions about Lucene usage recommendations

2010-09-27 Thread Danil ŢORIN
Lucene 2.1 is really old... you should be able to migrate to Lucene 2.9 without changing your code (almost a jar drop-in, but be careful with analyzers), and there could be huge improvements if you use Lucene properly. A few questions: - what does "all data to be indexed is stored in DB fields" mean? you

Re: In lucene 2.3.2, needs to stop optimization?

2010-09-24 Thread Danil ŢORIN
Is it possible for you to migrate to 2.9.x? Or even 3.x? There are some huge optimizations in 2.9 around reopening indexes that significantly improve search speed. I'm not sure... but I think indexWriter.getReader() for near-realtime search was added in 2.9, so you can keep your writer always open and get v
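
A minimal sketch of the 2.9-era near-realtime pattern (directory and analyzer setup are illustrative):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_29), IndexWriter.MaxFieldLength.UNLIMITED);

    IndexReader reader = writer.getReader(); // sees not-yet-committed changes
    // ... add more documents via the same writer ...
    IndexReader fresh = reader.reopen();     // cheap: only new segments are opened
    if (fresh != reader) { reader.close(); reader = fresh; }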

Re: Scaling Lucene to 1bln docs

2010-08-16 Thread Danil ŢORIN
ce on my very first day on this >> > mailing list... >> > At end of day, I have very optimistic results. 100bln search in less than >> > 1ms and the index creation time is not huge either ( close to 15 >> minutes). >> > >> > I am now hitting the 1bln mark

Re: 140GB index directory, what can I do?

2010-08-16 Thread Danil ŢORIN
It's not optimized, trust me. An optimized index will contain only 1 segment and no delete files. On Mon, Aug 16, 2010 at 04:34, Andrew Bruno wrote: > The index is optimized every 60 secs... so it must have already been cleaned > up. > > Thanks for feedback. > > On Sat, Aug 14, 2010 at 8:15 PM,

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
> user may type any one token, this will not work. I can further tweak this > such that I index the same document into multiple indices (one for each > token). So, the same document may be indexed into Shard "A", "M", "N" and "D". > I am not able to think

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
I'd second that. It doesn't have to be the date for sharding. Maybe every query has some specific field, like UserId or something, so you can redirect to a specific shard instead of hitting all 10 indices. You have to have some kind of narrowing: searching 1bn documents with queries that may hit all do

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
The problem actually won't be the indexing part. Searching such a large dataset will require a LOT of memory. If you need sorting or faceting on one of the fields, the JVM will explode ;) Also GC times on a large JVM heap are pretty disturbing (if you care about your search performance). So I'd advise

Re: Lucene and Chinese language

2010-07-01 Thread Danil ŢORIN
Try to use the CJK analyzer for both indexing and searching Chinese. Then you won't need the "text"->"*text*" transformation. There might be some false positives in the results though. You may also want to try the smartcn analyzer, which is dictionary-based, but I have no expertise to evaluate the

Re: custom low-level indexer (to speed things up) when fields, terms and docids are in order

2010-03-26 Thread Danil ŢORIN
What will your search look like? If your document is: f1:"1" f2:"2" f3:"3" You could create a Lucene document with a single field instead of 20k: fields:"f1/1 f2/2 f3/3" I replaced ":" with "/", and let's assume you use the whitespace analyzer at indexing time. At search time your old query "+f1:1 +f2:2" should
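
A minimal sketch of the encoding (field and term names come from the example above):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    doc.add(new Field("fields", "f1/1 f2/2 f3/3",
                      Field.Store.NO, Field.Index.ANALYZED));
    // old query: +f1:1 +f2:2   ->   new query: +fields:f1/1 +fields:f2/2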

Re: Indexing and Searching linked files

2010-01-19 Thread Danil ŢORIN
You can simply index both "files" and "cards" into the same index (no need for 2 indexes); Lucene easily supports documents of different structure. You may add some boosting per field or document, and tune similarity to get the most important stuff to the top. On Tue, Jan 19, 2010 at 16:35, Anna Hunecke wro

Re: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Danil ŢORIN
an > immediate optimize handling the conversion. Can I safely assume that 3.0.0 is > able to read 2.3.1? > > Making code changes to the readers in production is tricky in my > infrastructure and making one transition rather than two is very desirable. > > -Original Message-

Re: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Danil ŢORIN
eld.Store.COMPRESS > predecessors, where index reader client use of Field.Store.COMPRESS is in > transit to the explicit decompression approach.] > 5. Convert the readers to 3.0.0, which should be able to read 2.9.1, if there > are no compressed fields (??) > 6. Convert the wr

Re: Index file compatibility and a migration plan to lucene 3

2009-12-09 Thread Danil ŢORIN
You NEED to update your readers first, or else they will be unable to read files created by the newer version. And trust me, there are changes in the index format from 2.3 -> 2.9. On Wed, Dec 9, 2009 at 15:11, Weiwei Wang wrote: > Hi, Rob, > I read > http://wiki.apache.org/lucene-java/BackwardsCompatibili

Re: IndexDivisor

2009-12-03 Thread Danil ŢORIN
Run System.gc() immediately before measuring memory usage. On the Sun JVM it will FORCE a GC (unless -XX:+DisableExplicitGC is used). On Thu, Dec 3, 2009 at 16:30, Ganesh wrote: > Thanks mike. > > I am opening the reader and warming it up and then calculating the memory > consumed. > long usedMemory   = runt
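
A minimal sketch of that measurement (matching the truncated snippet above):

    Runtime rt = Runtime.getRuntime();
    System.gc(); // forced collection on the Sun JVM (no -XX:+DisableExplicitGC)
    long before = rt.totalMemory() - rt.freeMemory();
    // ... open and warm up the reader ...
    System.gc();
    long usedMemory = (rt.totalMemory() - rt.freeMemory()) - before;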

Re: IndexDivisor

2009-11-27 Thread Danil ŢORIN
Try opening it with a very large value (MAX_INT); it will load only the first term and look up the rest from disk. On Fri, Nov 27, 2009 at 12:24, Michael McCandless wrote: > If you are absolutely certain you won't be doing any lookups by term. > > The only use case I know of is internal, when Lucene's Seg

Re: Adding segments to an optimized index

2009-10-28 Thread Danil ŢORIN
There is no such thing in Lucene as a "unique" doc. They might be unique from your application's point of view (have some ID that is unique). From Lucene's point of view it's perfectly fine to have duplicate documents. So the "deleted" documents in the combined index are coming from your second index. E

Re: Proposal for changing Lucene's backwards-compatibility policy

2009-10-16 Thread Danil ŢORIN
I'd vote A with the following addition: what about creating major versions more often? If there are incremental improvements which don't clutter the code too much, continue with 3.0 -> 3.1 -> 3.2 -> ... -> 3.X. Once there are significant changes which are hard to keep backwards-compatible, start a 4.0

Re: exception to open a large index Insufficient system resources exist

2009-09-01 Thread Danil ŢORIN
There should be no problem with large segments. Please describe the OS, file system and JDK you are running on. There might be some problems with files >2GB on Win32/FAT, or on some ancient Linuxes. On Tue, Sep 1, 2009 at 12:37, wrote: > I met a problem to open an index bigger than 8GB and the followi

Re: searching for c++, c#, etc...

2009-07-16 Thread Danil ŢORIN
Try WhitespaceAnalyzer for both indexing and searching. At search time you may also need to escape "+", "(", and ")" with "\"; "#" shouldn't need escaping. On Thu, Jul 16, 2009 at 17:23, Chris Salem wrote: > I'm using the StandardAnalyzer for both searching and indexing. > Here's the code to parse the
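
QueryParser can do the escaping for you; a minimal sketch (field name and version constant are illustrative):

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    String safe = QueryParser.escape("c++"); // -> "c\+\+"
    Query q = new QueryParser(Version.LUCENE_29, "body",
                              new WhitespaceAnalyzer()).parse(safe);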

Re: Max size of index? How do search engines avoid this?

2009-05-18 Thread Danil ŢORIN
A 2GB size limit is a limitation of the OS and/or file system, not of the index as supported by Lucene. There is another kind of limitation in Lucene: the number of documents must be < 2147483648 (docIDs are signed 32-bit ints). However, the size of a Lucene index may reach tens or hundreds of GB way before that. If you are thinking about BIG inde

Re: Lucene index on iPhone

2009-05-06 Thread Danil ŢORIN
The iPhone doesn't support Java, so there is no way to run Lucene on it. Creating a SQLite database and searching inside it is a completely different solution, which has nothing to do with Lucene. On Wed, May 6, 2009 at 13:08, Shashi Kant wrote: > Hi all, > > I am working on an iPhone application where t

Re: Lucene Index Encryption

2009-05-05 Thread Danil ŢORIN
If you store data so sensitive that you are thinking about index encryption, then I'd suggest simply isolating the host with the Lucene index: - ssh only, with a VERY limited set of users allowed to log in - provide Solr over HTTPS to search the index (avoids in-transit interception) - set up firewall rules. This way Lu

Re: Lucene index architecture question

2009-03-25 Thread Danil ŢORIN
You can use solr (http://lucene.apache.org/solr/) Index on one machine and distribute the index to many. On Wed, Mar 25, 2009 at 18:18, kgeeva wrote: > > I have an application clustered on two servers. Is the best practice to have > two lucene indexes - one on each server for the app or is it bes

Re: index large size file

2009-03-11 Thread Danil ŢORIN
The problem you may face with such large documents is that there is a high probability that most terms will be present in all documents. So on search you'll receive a lot of documents (if you need to retrieve the full text, it will take a while), but the bigger problem is usability: what a user

Re: how many size of the index is the lucene's limit on per server ?

2009-03-02 Thread Danil ŢORIN
It depends what you call a server: - 4 dual Xeons, 64GB RAM, 1TB of 15000 rpm RAID10 hard disks is one thing - 1 P4, 512MB RAM, a 40GB 5400 rpm hard disk, and Win2K is completely something else. It depends on the index structure and the size of the documents you index/store. It depends on the way you query

Re: contains functionality in Lucene

2009-02-26 Thread Danil ŢORIN
You can generate n-grams: for example, when you index "lucene" you create the tokens "luce", "ucen", "cene". It will increase the term count (and index size); however, on search you will simply search for a single term, which will be extremely fast. It depends how many documents you have, the size of each docum
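
A minimal sketch of an analyzer that does this, using the contrib NGramTokenFilter (3.0-era API; gram size 4 to match the example, the class name is illustrative):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;

    public class ContainsAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String field, Reader reader) {
            // "lucene" -> "luce", "ucen", "cene"
            return new NGramTokenFilter(new LowerCaseTokenizer(reader), 4, 4);
        }
    }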

Re: Returning hits by highest score

2008-12-17 Thread Danil ŢORIN
According to http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/TopDocCollector.html it does. After the search, simply retrieve the TopDocs and read the documents you need: List<Document> result = new ArrayList<Document>(10); for (ScoreDoc sDoc : collector.topDocs().scoreDocs) { result.add(contentSearcher.doc(sDoc.doc)); }