Using the `customlt` function for the sort order cut the time in half on a 
test file. So try:

# Sort by descending count; break ties by ascending word.
customlt(a, b) = a.second != b.second ? b.second < a.second : a.first < b.first
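
The same ordering can also be spelled with a `by` key instead of a custom
`lt` (untested sketch; the tuple it allocates on every comparison may
actually make it slower than the `lt` version):

sort!(v, by = p -> (-p.second, p.first))  # descending count, then ascending word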

function main()
    wc = Dict{UTF8String,Int64}()   # word => count
    for l in eachline(STDIN)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end

    v = collect(wc)                 # Vector of word=>count Pairs
    sort!(v, lt=customlt)           # in-place sort saves a memory copy
    for t in v
        println("$(t.first)\t$(t.second)")
    end
end

main()
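
Since Tim's timings below suggest the printing dominates, it might also help
to replace the print loop in main with a buffered version that skips the
per-line string interpolation. Untested sketch (Julia 0.4 API; takebuf_string
was renamed in later versions):

buf = IOBuffer()
for t in v
    print(buf, t.first, '\t', t.second, '\n')  # no temporary "$(...)" string
end
write(STDOUT, takebuf_string(buf))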



On Monday, November 30, 2015 at 5:31:20 PM UTC+2, Attila Zséder wrote:
>
> Hi,
>
>
> The data I'm using is part of a (Hungarian) Wikipedia dump with 5M lines 
> of text. On this data, Python runs for 65 seconds, C++ for 35 seconds, the 
> Julia baseline for 340 seconds, and Julia with FastAnonymous.jl for 280 
> seconds. (See https://github.com/juditacs/wordcount#leaderboard for 
> details.)
>
> Dan:
> I can use external packages; that's not a big issue. However, 
> FastAnonymous didn't bring the results close to Python's.
> The baseline Python code I compare against is here:
> https://github.com/juditacs/wordcount/blob/master/python/wordcount_py2.py
>
> 2) The community is part of the language, so it should be taken into 
> account when making comparisons.
> What do you mean by this?
> My (our) purpose is not to judge that one language is better/faster/etc. 
> than another because it wins at unoptimized word counting, so I don't want 
> to pass any judgements like that. This is just for fun. And even though it 
> looks like _my_ Julia implementation of wc is not fast right now, I 
> haven't lost interest in following what's going on with this language.
>
>
> Your other points:
> 1) I do this with all the other languages as well. The test runs for about 
> 30-300 seconds; if Julia's load time or anything else takes a serious 
> amount of time, then so be it. The test is not precise (I didn't include 
> C++ compile time, for example, though that took less than a second), but 
> my feeling was that my implementation is naive and that other things are 
> eating the time, not Julia's startup.
> 2) What if my test is about IO + dictionary storage? Then I have to 
> include the printouts in my test.
> 3) I think a 5M-line text file is enough to average out this kind of noise.
>
>
>
> Tim:
> Yes, I did split the code like that, and with larger files it looked as 
> if, after sorting, dictionary manipulation (including hashing) took most 
> of the time and printing was less of an issue. But seeing your numbers, 
> I'll have to analyze this more carefully.
>
>
> Thank you all for your help!
>
> Attila
>
> On Mon, Nov 30, 2015 at 4:20 PM, Tim Holy <tim....@gmail.com> wrote:
>
>> If you don't want to figure out how to use the profiler, your next best 
>> bet is
>> to split out the pieces so you can understand where the bottleneck is. For
>> example:
>>
>> function docount(io)
>>     wc = Dict{AbstractString,Int64}()
>>     for l in eachline(io)
>>         for w in split(l)
>>             wc[w]=get(wc, w, 0) + 1
>>         end
>>     end
>>     wc
>> end
>>
>> @time open("somefile.tex") do io
>>            docount(io)
>>        end;
>>   0.010617 seconds (27.70 k allocations: 1.459 MB)
>>
>> vs
>>
>> @time open("somefile.tex") do io
>>            main(io)
>>        end;
>> # < lots of printed output >
>>   1.233154 seconds (330.59 k allocations: 10.829 MB, 1.53% gc time)
>>
>> (I modified your `main` to take an io input.)
>>
>> So it's the sorting and printing that are taking 99% of the time. Most of 
>> that turns out to be the printing.
>>
>> --Tim
>>
>>
>
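
P.S. If you do want to try the profiler Tim mentions, a minimal run could 
look something like this (a sketch; "somefile.txt" is a placeholder, and 
docount is from Tim's message above):

open("somefile.txt") do io
    @profile docount(io)
end
Profile.print()   # print a report of where the time was spent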
