Michael, thanks for your input.

I already extended the default codec to return the
BlockTreeOrdsPostingsFormat for testing; this works nicely and I can
access terms via their ordinals.
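
In case it helps anyone else, roughly what that looks like (just a
sketch; the codec name "OrdsCodec" is made up, and the class needs to be
registered via SPI in META-INF/services/org.apache.lucene.codecs.Codec
so readers can resolve it again):

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.blocktreeords.BlockTreeOrdsPostingsFormat;

// Delegates to the default codec but swaps in the ords-capable postings format.
public class OrdsCodec extends FilterCodec {
    private final PostingsFormat postings = new BlockTreeOrdsPostingsFormat();

    public OrdsCodec() {
        super("OrdsCodec", Codec.getDefault());
    }

    @Override
    public PostingsFormat postingsFormat() {
        return postings;
    }
}

// at index time:
// IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
// iwc.setCodec(new OrdsCodec());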

Speed is not really the issue (some things simply take a while... ;-)).
I also don't want to index shingles, because I can get them via
positions anyway.

So what I'm going to do for a first test is loop over docs/terms +
positions to accumulate shingles of size n as arrays of longs, do the
math, and then retrieve the terms via those ordinals.
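
Something along these lines (just a sketch; "body" is an example field,
positions must have been indexed, ords are only stable per segment, and
the in-memory map obviously won't scale to huge segments):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;

// Counts word n-grams as tuples of term ords within a single segment.
static Map<List<Long>, Long> shingleCounts(LeafReader leaf, String field, int n)
        throws IOException {
    Terms terms = leaf.terms(field);
    // docID -> (position -> term ord)
    Map<Integer, TreeMap<Integer, Long>> perDoc = new HashMap<>();
    TermsEnum te = terms.iterator();
    PostingsEnum pe = null;
    while (te.next() != null) {
        long ord = te.ord(); // supported by BlockTreeOrdsPostingsFormat
        pe = te.postings(pe, PostingsEnum.POSITIONS);
        int doc;
        while ((doc = pe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            TreeMap<Integer, Long> posToOrd =
                perDoc.computeIfAbsent(doc, d -> new TreeMap<>());
            int freq = pe.freq();
            for (int i = 0; i < freq; i++) {
                posToOrd.put(pe.nextPosition(), ord);
            }
        }
    }
    // slide a window of size n over each doc's position-ordered ords
    Map<List<Long>, Long> counts = new HashMap<>();
    for (TreeMap<Integer, Long> posToOrd : perDoc.values()) {
        List<Long> ords = new ArrayList<>(posToOrd.values());
        for (int i = 0; i + n <= ords.size(); i++) {
            counts.merge(new ArrayList<>(ords.subList(i, i + n)), 1L, Long::sum);
        }
    }
    return counts;
}

// and to get a term back from an ord (same segment!):
// TermsEnum lookup = leaf.terms(field).iterator();
// lookup.seekExact(ord);
// String text = lookup.term().utf8ToString();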

Let's see... ;-)

kr j



*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz



PERSONAL INFORMATION
| web       : http://www.turnguard.com
| foaf      : http://www.turnguard.com/turnguard
| g+        : https://plus.google.com/111233759991616358206/posts
| skype     : jakobitsch-punkt
| xmlns:tg  = "http://www.turnguard.com/turnguard#";
| blockchain : https://onename.com/turnguard

2017-03-10 11:41 GMT+01:00 Michael McCandless <luc...@mikemccandless.com>:

> Yes, this is a reasonable way to use Lucene (to see term statistics
> across the corpus), but it may not be performant enough for your needs.
>
> E.g., for one-time or periodic corpus analysis, wasting the memory on a
> giant hash table may be faster.
>
> If you are looking for word n-gram stats, you could index your text with
> ShingleFilter to make it faster to get n-gram counts.
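>
> For example, a minimal sketch (tokenizer choice and shingle sizes are
> placeholders):
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.shingle.ShingleFilter;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
>
> Analyzer shingleAnalyzer = new Analyzer() {
>   @Override
>   protected TokenStreamComponents createComponents(String fieldName) {
>     Tokenizer source = new StandardTokenizer();
>     // emit word bigrams and trigrams in addition to the unigrams
>     return new TokenStreamComponents(source, new ShingleFilter(source, 2, 3));
>   }
> };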
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <
> juergen.jakobit...@semantic-web.com> wrote:
>
>> Hi,
>>
>> I'd like to ask users about their experiences with the fastest way to
>> access the term dictionary.
>>
>> What I want to do is implement some algorithms to find phrases (e.g.
>> mutual rank ratio [1]) and other statistics on term distribution;
>> generally, corpus-related stuff.
>>
>> The idea would be to do statistics on numbers (i.e. longs from the term
>> dictionary) to minimize memory usage. I tried this with a TermsEnum plus
>> the ordinal numbers of the terms, which are easily retrievable, but
>> getting a term by ord then throws an UnsupportedOperationException [2].
>> I see there's also a blocktreeords codec [3].
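>>
>> Roughly what I tried (a sketch; "body" is just an example field):
>>
>> TermsEnum te = MultiFields.getTerms(reader, "body").iterator();
>> long myOrd = -1;
>> while (te.next() != null) {
>>     myOrd++; // counting an ordinal myself while enumerating works fine...
>> }
>> te.seekExact(0L); // ...but seeking by ord throws UnsupportedOperationException [2]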
>>
>> Now, before diving deeper into this (i.e. changing the codecs for my
>> indexes), I'd like to ask whether a workflow like the one described
>> above is considered at least semi-smart, or whether I'm on the wrong
>> track and there's a smarter way to avoid having to calculate
>> collocations based on actual strings or BytesRefs.
>>
>> Any pointers really appreciated.
>>
>> Kind regards, Jürgen
>>
>> [1] http://www.google.ch/patents/US20100250238
>> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
>> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
>>
