Lars Gullik Bjønnes wrote:
> These are the token frequencies I get when loading Userguide.lyx.
> 
> 3341 \family
> 1891 \layout
> 1166 \bar
> 874 \emph
> 799 \begin_inset
> 575 \shape
> 547 \noun
> 505 \series
> 502 \color
> 492 \SpecialChar
> 330 \size
> 107 \end_deeper
> 107 \begin_deeper
> 80 \labelwidthstring
> 68 \backslash
> 54 \i
> 39 \hfill
> 25 \align
> 21 \newline
> 16 \added_space_top
> 15 \added_space_bottom
> 5 \bibitem
> 4 \noindent
> 1 \use_numerical_citations
> 1 \use_natbib
> 1 \use_geometry
> 1 \use_amsmath
> 1 \tocdepth
> 1 \the_end
> 1 \textclass
> 1 \spacing
> 1 \secnumdepth
> 1 \quotes_times
> 1 \quotes_language
> 1 \paragraph_separation
> 1 \papersize
> 1 \papersides
> 1 \paperpagestyle
> 1 \paperpackage
> 1 \paperorientation
> 1 \paperfontsize
> 1 \papercolumns
> 1 \line_bottom
> 1 \language
> 1 \inputencoding
> 1 \graphics
> 1 \fontscheme
> 1 \defskip
> 1 \begin_preamble
> 
> 
> To create this I add:
> 
>       {
>               ofstream tofs("/tmp/tokens.txt", std::ios_base::app);
>               tofs << token << "\n";
>       }
> 
> 
> To the top of parseSingleLyXformat2Token in buffer.C
> 
> /tmp/tokens.txt is parsed by this prog:
> 
> bucketcheck.C:
> 
> #include <fstream>
> #include <iostream>
> #include <map>
> #include <string>
> 
> using namespace std;
> 
> 
> int main() 
> {
>       ifstream ifs("/tmp/realtokens.txt");
> 
>       map<string, int> buckets;
>       
>       string line;
>       
>       while (getline(ifs, line)) {
>               ++buckets[line];
>       }
> 
>       map<string, int>::const_iterator cit = buckets.begin();
>       map<string, int>::const_iterator end = buckets.end();
>       
>       for (; cit != end; ++cit) {
>               cout << cit->second << " " << cit->first << "\n";
>       }
> }
> 
> which is run like this:
> 
> egrep "^\\\\" /tmp/tokens.txt > /tmp/realtokens.txt
> ./bucketcheck | sort -g | tac > /tmp/tokenorder.txt
> 
> 
> Results for other largish lyx files would be nice to have.
> 

If I may reattach my earlier script (which is admittedly not 100%
accurate, but who cares?)... It gives similar but different results:
    3335 \family
    1889 \layout
    1398 \begin_inset
    1398 \end_inset
    1166 \bar
     871 \emph
     575 \shape
     547 \noun
     505 \series
     502 \color
     491 \SpecialChar
     330 \size
     107 \begin_deeper
     107 \end_deeper
      80 \labelwidthstring
      67 \backslash
      54 \i
      39 \hfill
      21 \newline
[snip]

If I run it on Herbert's Versuch.lyx (translated to 1.2.0 format)
(this is a file he sent because of performance problems), I get
ginette: ./counttokens fichiers/Versuch120.lyx
   10869 \layout
    9612 \begin_inset
    9612 \end_inset
    4841 \backslash
    2589 \emph
    2506 \family
    2492 \noun
    2486 \bar
    2486 \shape
    2375 \color
    2371 \size
    1684 \series
     192 \SpecialChar
      64 \align
      64 \newline
[snip]

I doubt though that there is still a big gain to be found in this area.

JMarc
#!/bin/sh

grep '\\[a-zA-Z_]* ' "$1" | sed -e 's/^[^\]*\(\\[A-Za-z_]*\) .*$/\1/' | sort | uniq -c | sort -k1,1nr
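
The script above only picks up the first backslash word on a line, and only
when a space follows it, which may be part of why it is "not 100% accurate".
A minimal sketch of a variant that counts every backslash-prefixed word is
given below. The function name count_all_tokens is made up for illustration,
the -o option is assumed to be available (GNU and BSD grep have it, POSIX does
not require it), and the result is not guaranteed to match what the LyX parser
itself treats as a token.

```shell
#!/bin/sh

# Hypothetical helper, not part of the original script.
# Prints "count token" pairs for the file given as $1, most frequent first.
count_all_tokens() {
    # grep -o emits each match on its own line (assumes GNU or BSD grep);
    # the pattern is a backslash followed by one or more letters/underscores.
    grep -o '\\[A-Za-z_][A-Za-z_]*' "$1" |
        sort |            # group identical tokens together
        uniq -c |         # prefix each distinct token with its count
        sort -k1,1nr      # order by count, descending
}
```

Once the function is defined (for example by sourcing the file), it can be
called as: count_all_tokens fichiers/Versuch120.lyx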
