RE: Lucene 4.0 scalability and performance.

2012-12-24 Thread Vitaly_Artemov
Thank you

-Original Message-
From: Steve Rowe [mailto:sar...@gmail.com] 
Sent: Sunday, December 23, 2012 8:20 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.

Hi Vitaly,

Anything by Tom Burton-West should interest you - he works on the HathiTrust 
digital library project, which currently indexes 
7 TB of full-length books, e.g.:

"Practical Relevance Ranking for 10 Million Books" (paper) INEX 2012, September 
2012, Rome, Italy 


"HathiTrust Large Scale Search: Scalability meets Usability" (slides) Code4Lib 
2012, February 2012, Seattle, Washington 


"Large-scale Search" (blog)


Steve

On Dec 23, 2012, at 6:11 AM, vitaly_arte...@mcafee.com wrote:

> Hi all,
> We are starting to evaluate Lucene 4.0 for use in a production environment.
> This means that we need to index millions of documents with terabytes of 
> content and search in them.
> For now we want to define only one indexed field, containing the content of 
> the documents, with the possibility of searching for terms and retrieving 
> term offsets.
> Has anybody already tested Lucene with terabytes of data?
> Does Lucene have any known limitations on the number or size of indexed 
> documents?
> What about search performance on a huge data set?
> Thanks in advance, Vitaly
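
[Editor's note: a minimal sketch of the single-field setup described in the question above, against the Lucene 4.0 API. Recording offsets in the postings makes them retrievable at search time. The field name, index path, and analyzer choice are illustrative assumptions, not from the thread:]

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SingleFieldIndexer {
    public static void main(String[] args) throws Exception {
        // One indexed field carrying the document body; asking for
        // DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS stores term offsets
        // directly in the postings list.
        FieldType contentType = new FieldType();
        contentType.setIndexed(true);
        contentType.setTokenized(true);
        contentType.setIndexOptions(
            IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        contentType.freeze();

        Directory dir = FSDirectory.open(new File("/tmp/index")); // illustrative path
        IndexWriterConfig config = new IndexWriterConfig(
            Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        IndexWriter writer = new IndexWriter(dir, config);
        try {
            Document doc = new Document();
            doc.add(new Field("content", "full text of one document", contentType));
            writer.addDocument(doc);
        } finally {
            writer.close();
        }
    }
}
```

[The same FieldType can be reused for every document; freezing it prevents accidental reconfiguration once indexing has started.]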


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.0 scalability and performance.

2012-12-24 Thread Carsten Schnober
On 23.12.2012 12:11, vitaly_arte...@mcafee.com wrote:


> This means that we need to index millions of documents with terabytes of 
> content and search in them.
> For now we want to define only one indexed field, containing the content of 
> the documents, with the possibility of searching for terms and retrieving 
> term offsets.
> Has anybody already tested Lucene with terabytes of data?
> Does Lucene have any known limitations on the number or size of indexed 
> documents?
> What about search performance on a huge data set?

Hi Vitali,
we've been working on a linguistic search engine based on Lucene 4.0 and
have performed a few tests with large text corpora. There are at least
some overlaps in the functionality you mentioned (term offsets). See
http://www.oegai.at/konvens2012/proceedings/27_schnober12p/ (mainly
section 5).
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Lucene 4.0 scalability and performance.

2012-12-24 Thread Vitaly_Artemov
Thank you

-Original Message-
From: Carsten Schnober [mailto:schno...@ids-mannheim.de] 
Sent: Monday, December 24, 2012 3:25 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene 4.0 scalability and performance.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to implement a TokenFilter?

2012-12-24 Thread Xi Shen
Hi Lance,

I got the Lucene 4 source from
http://mirror.bjtu.edu.cn/apache/lucene/java/4.0.0/lucene-4.0.0-src.tgz; it
is an Ant project, but I do not know which IDE can import it... I tried
Eclipse; it cannot import the build.xml file.


Thanks,
D.


On Mon, Dec 24, 2012 at 12:02 PM, Lance Norskog  wrote:

> You need to use an IDE. Find the Attribute type and show all of its subclasses.
> This shows a lot of rare ones and a few that are used a lot. Now, look at the
> source code for various TokenFilters and search for other uses of the
> Attributes you find. This is generally how I figured it out.
>
> Also, after the full Analyzer stack is called, the caller saves the output
> (I guess to codecs?). You can look at which Attributes it saves.
>
>
> On 12/23/2012 06:30 PM, Xi Shen wrote:
>
>> thanks a lot :)
>>
>>
>> On Mon, Dec 24, 2012 at 10:22 AM, feng lu  wrote:
>>
>>  Hi Shen,
>>>
>>> Maybe you can look at some source code in the org.apache.lucene.analysis
>>> package, such as LowerCaseFilter.java, StopFilter.java, and so on.
>>>
>>> Some common attributes include:
>>>
>>> offsetAtt = addAttribute(OffsetAttribute.class);
>>> termAtt = addAttribute(CharTermAttribute.class);
>>> typeAtt = addAttribute(TypeAttribute.class);
>>>
>>> Regards
>>>
>>>
>>> On Sun, Dec 23, 2012 at 4:01 PM, Rafał Kuć  wrote:
>>>
>>>  Hello!

 The simplest way is to look at Lucene javadoc and see what
 implementations of Attribute interface there are -

 http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/util/Attribute.html
>>>
 --
 Regards,
   Rafał Kuć
   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch

> thanks, I read this already. It is useful, but it is too 'small'...
> E.g., for this.charTermAttr = addAttribute(CharTermAttribute.class);
> I want to know what other attributes I need in order to implement
> my function. Where can I find a reference to these attributes? I tried
> the Lucene & Solr wiki, but all I found is a list of the names of these
> attributes, nothing about what they are capable of...
>



  On Sat, Dec 22, 2012 at 10:37 PM, Rafał Kuć  wrote:
>
>> Hello!
>>
>> A small example with some explanation can be found here:
>> http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/
>>
>> --
>> Regards,
>>   Rafał Kuć
>>   Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>
>>> Hi,
>>> I need a guide to implement my own TokenFilter. I checked the wiki,
>>> but I could not find any useful guide :(
>>>
>>
>>
>>
>>
>>




>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>>
>>
>>
>
>
>


-- 
Regards,
David Shen

http://about.me/davidshen
https://twitter.com/#!/davidshen84
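
[Editor's note: to tie the advice in this thread together, here is a minimal sketch of a custom TokenFilter in the attribute style described above. It is a hypothetical example (a filter that reverses each token); the only attribute it touches is CharTermAttribute, and the class and method names are illustrative, not from the thread:]

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Hypothetical filter that reverses the text of each token.
 * CharTermAttribute exposes the token's characters; other attributes
 * (offsets, type, ...) are declared with addAttribute() the same way.
 */
public final class ReverseTokenFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public ReverseTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pull the next token from the upstream tokenizer/filter.
        if (!input.incrementToken()) {
            return false; // end of stream
        }
        // Mutate the term text in place via the attribute's char buffer.
        reverse(termAtt.buffer(), termAtt.length());
        return true;
    }

    // Reverse the first len chars of buf in place.
    static void reverse(char[] buf, int len) {
        for (int i = 0, j = len - 1; i < j; i++, j--) {
            char tmp = buf[i];
            buf[i] = buf[j];
            buf[j] = tmp;
        }
    }
}
```

[A filter like this is typically appended after the tokenizer in an Analyzer's TokenStream chain; the same incrementToken()/addAttribute() pattern applies when reading or writing OffsetAttribute or TypeAttribute.]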