Or you could encode those term/n-gram frequencies in one FST and then reuse it. This would save memory and be fairly fast (roughly comparable to a hash table).

Dawid

On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless <luc...@mikemccandless.com> wrote:

> Yes, this is a reasonable way to use Lucene (to see term statistics across
> the corpus) but it may not be performant enough for your needs.
>
> E.g. wasting memory and making a giant hash table for one-time or periodic
> corpus analysis may be faster.
>
> If you are looking for word n-gram stats, you could index your text with
> ShingleFilter to make it faster to get n-gram counts.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch <
> juergen.jakobit...@semantic-web.com> wrote:
>
>> hi,
>>
>> i'd like to ask users for their experiences with the fastest way to access
>> the term dictionary.
>>
>> what i want to do is to implement some algorithms to find phrases (e.g.
>> mutual rank ratio [1])
>> (and other statistics on term distribution, generally: corpus related
>> stuff)
>>
>> the idea would be to do statistics on numbers (i.e. long from the term
>> dictionary) to minimize memory usage. i did try this with termsEnum +
>> ordinal number of terms, which are easily retrievable, but getting a term
>> by ord then throws UnsupportedOperationException [2]. i see there's also a
>> codec blocktreeord [3].
>>
>> now before diving deeper into this (i.e. changing codecs for my indexes),
>> i'd like to ask if a workflow like the one described above is considered at
>> least semi smart, or if i'm on the wrong track with this and there's a
>> smarter way to avoid having to calculate collocations based on actual
>> strings or BytesRefs?
>>
>> any pointer really appreciated.
>>
>> kind regards
>> jürgen
>>
>> [1] http://www.google.ch/patents/US20100250238
>> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
>> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
>>
>> Jürgen Jakobitsch
>> Innovation Director
>> Semantic Web Company GmbH
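
For reference, here is a minimal sketch of the hash-table approach mentioned in the thread, combined with the idea of doing the statistics on numbers rather than strings. It uses only the Java standard library, not Lucene. The class name BigramCounter and its methods are hypothetical, made up for this illustration. It assumes the text is already tokenized, assigns each distinct term an int id in order of first appearance, and packs each adjacent pair of ids into one long that serves as the hash key.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts word bigrams keyed by packed term ids, not strings. */
final class BigramCounter {
    // Term -> int id, assigned in order of first appearance.
    private final Map<String, Integer> termIds = new HashMap<>();
    // Packed (leftId, rightId) -> number of occurrences.
    private final Map<Long, Long> bigramCounts = new HashMap<>();

    // Returns the id of a term, assigning the next free id if it is new.
    private int idOf(String term) {
        Integer id = termIds.get(term);
        if (id == null) {
            id = termIds.size();
            termIds.put(term, id);
        }
        return id;
    }

    // Packs two non-negative int ids into one long: left in the high 32 bits.
    private static long pack(int left, int right) {
        return ((long) left << 32) | (right & 0xFFFFFFFFL);
    }

    // Counts every adjacent token pair in one already-tokenized sentence.
    void addSentence(List<String> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            long key = pack(idOf(tokens.get(i)), idOf(tokens.get(i + 1)));
            bigramCounts.merge(key, 1L, Long::sum);
        }
    }

    // Returns how often `left` was immediately followed by `right`.
    long count(String left, String right) {
        Integer l = termIds.get(left);
        Integer r = termIds.get(right);
        if (l == null || r == null) {
            return 0L; // at least one term was never seen
        }
        return bigramCounts.getOrDefault(pack(l, r), 0L);
    }

    public static void main(String[] args) {
        BigramCounter counter = new BigramCounter();
        counter.addSentence(Arrays.asList("new", "york", "city"));
        counter.addSentence(Arrays.asList("new", "york", "times"));
        System.out.println(counter.count("new", "york")); // prints 2
    }
}
```

The boxed Long keys and values keep the sketch short but cost memory. A primitive long-to-long map, or the FST suggested at the top of this thread, would likely be more compact for a large corpus.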