Using the custom `lt` comparison function below for the sort order cut the time in half on a test file. So try:
# "less than" for the sort: descending by count, ascending by word on ties
customlt(a, b) = (b.second < a.second) ? true : b.second == a.second ? a.first < b.first : false

function main()
    wc = Dict{UTF8String,Int64}()
    for l in eachline(STDIN)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end
    v = collect(wc)
    sort!(v, lt=customlt)   # in-place sort saves a memory copy
    for t in v
        println("$(t.first)\t$(t.second)")
    end
end

main()

On Monday, November 30, 2015 at 5:31:20 PM UTC+2, Attila Zséder wrote:
>
> Hi,
>
> The data I'm using is part of a (Hungarian) Wikipedia dump with 5M lines
> of text. On this data, Python runs for 65 seconds, C++ for 35 seconds,
> the Julia baseline for 340 seconds, and Julia with FastAnonymous.jl for
> 280 seconds. (See https://github.com/juditacs/wordcount#leaderboard for
> details.)
>
> Dan:
> I can use external packages; it's not a big issue. However, FastAnonymous
> didn't give results comparable to Python.
> The baseline Python code I compare to is here:
> https://github.com/juditacs/wordcount/blob/master/python/wordcount_py2.py
>
> 2) "The community is part of the language, so it should be regarded when
> making considerations."
> What do you mean by this?
> My (our) purpose is not to judge that this language or that one is
> better/faster/etc. because it is faster at unoptimized word counting. I
> don't want to make any judgements or anything like that. This is just
> for fun. And even though it looks like _my_ Julia implementation of wc
> is not fast right now, I haven't lost interest in following what's going
> on with this language.
>
> Your other points:
> 1) I do this with all the other languages as well. The test runs for
> about 30-300 seconds. If Julia's load time or anything else takes a
> serious amount of time, then so be it. This test is not precise; I
> didn't include C++ compile time, for example, but that took less than a
> second. But my feeling was that my naive implementation is what takes
> the time, not Julia's load.
> 2) What if my test is about IO + dictionary storage? Then I have to
> include the printouts in my test.
> 3) I think a 5M-line text file is enough to avoid such noise.
>
> Tim:
> Yes, I did this code split, and with larger files it looked like, after
> sorting, dictionary manipulation (including hashing) took most of the
> time, and printing was less of an issue. But seeing your numbers, I will
> have to analyze this more precisely.
>
> Thank you all for your help!
>
> Attila
>
> On Mon, Nov 30, 2015 at 4:20 PM, Tim Holy <tim....@gmail.com> wrote:
>
>> If you don't want to figure out how to use the profiler, your next best
>> bet is to split out the pieces so you can understand where the
>> bottleneck is. For example:
>>
>> function docount(io)
>>     wc = Dict{AbstractString,Int64}()
>>     for l in eachline(io)
>>         for w in split(l)
>>             wc[w] = get(wc, w, 0) + 1
>>         end
>>     end
>>     wc
>> end
>>
>> @time open("somefile.tex") do io
>>     docount(io)
>> end;
>> 0.010617 seconds (27.70 k allocations: 1.459 MB)
>>
>> vs
>>
>> @time open("somefile.tex") do io
>>     main(io)
>> end;
>> # < lots of printed output >
>> 1.233154 seconds (330.59 k allocations: 10.829 MB, 1.53% gc time)
>>
>> (I modified your `main` to take an io input.)
>>
>> So it's the sorting and printing that take 99% of the time. Most of
>> that turns out to be the printing.
>>
>> --Tim
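
P.S. The same ordering can also be expressed with the `by` keyword instead of a custom `lt`, which some find easier to read. A minimal sketch, untested on your data, assuming the same `v = collect(wc)` as above:

    # Sort by the key (-count, word): tuples compare lexicographically,
    # so this is descending by count, then ascending by word on ties.
    sort!(v, by = t -> (-t.second, t.first))

Note that `by` builds a tuple for every comparison, and on 0.4 the anonymous function itself carries overhead (the FastAnonymous story again), so I'd expect `lt=customlt` to remain the faster of the two.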
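And since Tim mentioned the profiler: if you do want to try it, the built-in sampling profiler needs only a few lines. A minimal sketch around his `docount` ("somefile.tex" is just his placeholder; `@profile` and the `Profile` module are part of Base, nothing to install):

    open("somefile.tex") do io   # warm-up run, so JIT compilation
        docount(io)              # doesn't dominate the profile
    end
    Profile.clear()              # drop samples from earlier runs
    @profile open("somefile.tex") do io
        docount(io)
    end
    Profile.print()              # tree view of where samples landed

The per-line sample counts in the report point straight at the hot spots, which should settle the hashing-vs-printing question more precisely than timing the pieces by hand.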