Yes, this is a reasonable way to use Lucene (to see term statistics across the corpus), but it may not be performant enough for your needs.
For example, spending the memory on a giant in-memory hash table for one-time or periodic corpus analysis may be faster.

If you are looking for word n-gram stats, you could index your text with ShingleFilter to make it faster to get n-gram counts.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <
juergen.jakobit...@semantic-web.com> wrote:

> hi,
>
> i'd like to ask users for their experiences with the fastest way to access
> the term dictionary.
>
> what i want to do is to implement some algorithms to find phrases (e.g.
> mutual rank ratio [1])
> (and other statistics on term distribution, generally: corpus related
> stuff)
>
> the idea would be to do statistics on numbers (i.e. longs from the term
> dictionary) to minimize memory usage. i did try this with termsEnum +
> ordinal number of terms, which are easily retrievable, but getting a term
> by ord then throws UnsupportedOperationException [2]. i see there's also a
> codec blocktreeords [3].
>
> now before diving deeper into this (i.e. changing codecs for my indexes),
> i'd like to ask if a workflow like the one described above is considered at
> least semi smart, or if i'm on the wrong track with this and there's a
> smarter way to avoid having to calculate collocations based on actual
> strings or BytesRefs?
>
> any pointer really appreciated.
>
> kind regards
> jürgen
>
> [1] http://www.google.ch/patents/US20100250238
> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
>
> Jürgen Jakobitsch
> Innovation Director
> Semantic Web Company GmbH
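
A minimal sketch of the in-memory hash table approach mentioned in the reply, written in plain Java with no Lucene dependency. The class NgramCounter, its methods, and the regex tokenizer are hypothetical stand-ins for illustration only; a real setup would presumably reuse the same analysis chain as the index. It counts word n-grams per document and never lets an n-gram span a document boundary:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Lucene class): counts word n-grams in memory.
class NgramCounter {

    // Lowercases and splits on runs of non-letter, non-digit characters.
    // This is an assumed, simplistic stand-in for a real analyzer.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns a map from n-gram (tokens joined by a single space) to its
    // total number of occurrences across all documents.
    static Map<String, Long> count(List<String> docs, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        Map<String, Long> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            // Slide a window of n tokens over this document only.
            for (int i = 0; i + n <= tokens.size(); i++) {
                String gram = String.join(" ", tokens.subList(i, i + n));
                counts.merge(gram, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

Calling NgramCounter.count(docs, 2) gives bigram counts that a collocation measure can combine with the unigram counts from NgramCounter.count(docs, 1). Memory grows with the number of distinct n-grams, which is the trade-off the reply alludes to.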