On Thursday 10 November 2016 17:53, Wolfram Hinderer wrote:

[...]

> 1. The startup looks slightly ugly to me.
> 2. If n is large, tee has to maintain a lot of unnecessary state.
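
(For readers without the earlier messages in the thread: the tee-based
sliding-window implementation under discussion looks roughly like the
sketch below. This is my own reconstruction with my own names, not
Wolfram's exact code; the deque version is the standard alternative
that keeps exactly n items of state, which is what point 2 is about.)

    from collections import deque
    from itertools import tee

    def ngrams_tee(iterable, n):
        # n-grams via tee: make n iterators over the input, advance
        # the k-th one by k positions, then zip them back together.
        iterators = tee(iterable, n)
        for k, it in enumerate(iterators):
            for _ in range(k):  # the "slightly ugly" startup
                next(it, None)
        return zip(*iterators)

    def ngrams_deque(iterable, n):
        # n-grams via a bounded deque: the window itself is the only
        # state -- n items, no matter how big n gets.
        window = deque(maxlen=n)
        for item in iterable:
            window.append(item)
            if len(window) == n:
                yield tuple(window)

Both produce the same result, e.g. list(ngrams_tee("abcde", 3)) gives
[('a', 'b', 'c'), ('b', 'c', 'd'), ('c', 'd', 'e')].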
But n should never be large. In practice, n-grams are rarely larger
than n=3. Occasionally you might use n=4 or even n=5, but I can't
imagine using n=20 in practice, let alone the example you show of
n=500. See, for example:

http://stackoverflow.com/a/10382221

In practice, large-n n-grams run into three problems:

- for word-based n-grams, n=3 is about the maximum needed;

- for other applications, n can be moderately large, but since n-grams
  are a kind of auto-correlation function, and few data sets are
  auto-correlated *that* deeply, you still rarely need large values
  of n;

- there is the problem of sparse data and generating a good training
  corpus. For n=10, and just using ASCII letters (lowercase only),
  there are 26**10 = 141167095653376 possible 10-grams. Where are you
  going to find a text that includes more than a tiny fraction of
  those?

-- 
Steven

299792.458 km/s -- not just a good idea, it's the law!