dawid, thanks for your input. initially i was hoping to be able to use an FST somehow in this process, but my knowledge in this area is fairly limited. i will give it a second thought anyway... ;-)
krj

*Jürgen Jakobitsch*
Innovation Director
Semantic Web Company GmbH
EU: +43-1-4021235-0
Mobile: +43-676-6212710
http://www.semantic-web.at
http://www.poolparty.biz

PERSONAL INFORMATION
| web        : http://www.turnguard.com
| foaf       : http://www.turnguard.com/turnguard
| g+         : https://plus.google.com/111233759991616358206/posts
| skype      : jakobitsch-punkt
| xmlns:tg   = "http://www.turnguard.com/turnguard#"
| blockchain : https://onename.com/turnguard

2017-03-10 11:49 GMT+01:00 Dawid Weiss <dawid.we...@gmail.com>:

> Or you could encode those term/n-gram frequencies into one FST and then
> reuse it. This would be memory-saving and fairly fast (~comparable to
> a hash table).
>
> Dawid
>
> On Fri, Mar 10, 2017 at 11:41 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
> > Yes, this is a reasonable way to use Lucene (to see term statistics
> > across the corpus), but it may not be performant enough for your needs.
> >
> > E.g. spending the memory on a giant hash table for one-time or
> > periodic corpus analysis may be faster.
> >
> > If you are looking for word n-gram stats, you could index your text
> > with ShingleFilter to make it faster to get n-gram counts.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Mar 9, 2017 at 3:22 PM, Jürgen Jakobitsch
> > <juergen.jakobit...@semantic-web.com> wrote:
> >
> >> hi,
> >>
> >> i'd like to ask users about their experiences with the fastest way
> >> to access the term dictionary.
> >>
> >> what i want to do is implement some algorithms to find phrases
> >> (e.g. mutual rank ratio [1]) and other statistics on term
> >> distribution (generally: corpus-related stuff).
> >>
> >> the idea would be to do the statistics on numbers (i.e. longs from
> >> the term dictionary) to minimize memory usage. i did try this with
> >> a TermsEnum plus the terms' ordinal numbers, which are easily
> >> retrievable, but getting a term by ord then throws an
> >> UnsupportedOperationException [2]. i see there's also a
> >> blocktreeords codec [3].
> >>
> >> now, before diving deeper into this (i.e. changing codecs for my
> >> indexes), i'd like to ask whether a workflow like the one described
> >> above is considered at least semi-smart, or whether i'm on the
> >> wrong track and there's a smarter way to avoid calculating
> >> collocations based on actual strings or BytesRefs.
> >>
> >> any pointers really appreciated.
> >>
> >> kind regards,
> >> jürgen
> >>
> >> [1] http://www.google.ch/patents/US20100250238
> >> [2] https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/lucene/core/src/java/org/apache/lucene/codecs/blocktree/SegmentTermsEnum.java
> >> [3] https://github.com/apache/lucene-solr/blob/master/lucene/codecs/src/java/org/apache/lucene/codecs/blocktreeords/OrdsSegmentTermsEnum.java
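For reference, a minimal sketch of what Dawid's FST suggestion could look like on Lucene 6.x (the version the thread's links point at). The class and helper names and the pre-sorted input map are illustrative assumptions, not anything from the thread; since a TermsEnum already walks terms in sorted order, the same loop could feed the builder directly from the index.

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    public class TermFreqFst {

      // Build an FST mapping each term to its frequency. FST construction
      // requires the terms in sorted order, which TreeMap guarantees (a
      // TermsEnum would deliver them already sorted as well).
      static FST<Long> build(TreeMap<BytesRef, Long> sortedTermFreqs)
          throws IOException {
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
        IntsRefBuilder scratch = new IntsRefBuilder();
        for (Map.Entry<BytesRef, Long> e : sortedTermFreqs.entrySet()) {
          builder.add(Util.toIntsRef(e.getKey(), scratch), e.getValue());
        }
        return builder.finish();
      }

      // Look up a term's frequency; returns null if the term is absent.
      static Long freq(FST<Long> fst, String term) throws IOException {
        return Util.get(fst, new BytesRef(term));
      }
    }

The finished FST lives in a compact byte structure, which is where the memory saving over a term-keyed hash table comes from.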
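Mike's ShingleFilter idea, sketched under the same Lucene 6.x assumption: wrap the indexing analyzer so that word n-grams become ordinary terms, and their counts then fall out of the term dictionary for free. The base analyzer and the shingle sizes (2 to 3) are placeholders.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShingleSetup {
      // Emit word bigrams and trigrams alongside unigrams at index time,
      // so every n-gram becomes a regular entry in the term dictionary.
      static Analyzer shingleAnalyzer() {
        return new ShingleAnalyzerWrapper(new StandardAnalyzer(), 2, 3);
      }
    }

Passing this analyzer to the IndexWriterConfig at indexing time is all that is needed; no custom codec is involved.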
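And on the original ord question: one possible way around the UnsupportedOperationException [2] without switching to the blocktreeords codec, assuming a single sequential pass is acceptable, is to assign dense ordinals while enumerating the term dictionary once and run all the collocation math on longs. The field name passed in is a placeholder.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    public class DenseOrds {

      // One sequential pass over the term dictionary: assign our own
      // dense ordinal to each term so downstream statistics operate on
      // longs only. Map back to strings at the very end by
      // re-enumerating, or by storing BytesRef.deepCopyOf(term) for the
      // terms that survive filtering.
      static Map<Long, Long> freqByOrd(IndexReader reader, String field)
          throws IOException {
        Map<Long, Long> freqs = new HashMap<>();
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) {
          return freqs; // field not indexed
        }
        TermsEnum te = terms.iterator();
        long ord = 0;
        BytesRef term;
        while ((term = te.next()) != null) {
          freqs.put(ord++, te.totalTermFreq());
        }
        return freqs;
      }
    }

A primitive-keyed map (e.g. fastutil or HPPC) would shrink the working set further than boxed Longs, if memory is the main concern.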