Using Symbols seems to help, i.e.:
using DataStructures
function wordcounter_sym(filename)
    counts = counter(Symbol)      # Symbol keys hash by identity, not by content
    words = split(readall(filename),
                  Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!']),
                  false)          # false: drop empty tokens
    for w in words
        add!(counts, symbol(w))  # intern each word as a Symbol before counting
    end
    return counts
end
On my system this cut the time from 0.67 sec to 0.48 sec, about 30% less.
Memory use is also quite a bit lower.
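The win comes from interning: symbol(w) maps every occurrence of the same word
to a single shared object, so Symbol hashing and equality are essentially
pointer operations, while String hashing has to walk the bytes of each word on
every lookup. For comparison, here is a minimal sketch of the String-keyed
version this replaces (wordcounter_str is just an illustrative name):

using DataStructures
function wordcounter_str(filename)
    counts = counter(String)     # String keys: every lookup rehashes the word's bytes
    words = split(readall(filename),
                  Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!']),
                  false)
    for w in words
        add!(counts, w)          # no interning step; keys stay as strings
    end
    return counts
end

The only differences are the counter's key type and the symbol() call, so the
timing gap between the two roughly isolates the cost of key hashing.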
On Tuesday, March 4, 2014 8:58:51 PM UTC-5, Roman Sinayev wrote:
>
> I updated the gist with times and code snippets
> https://gist.github.com/lqdc/9342237
>
> On Tuesday, March 4, 2014 5:15:29 PM UTC-8, Steven G. Johnson wrote:
>>
>> It's odd that the performance gain you see is so much smaller than the
>> gain on my machine.
>>
>> Try putting @time in front of "for w in words" and also in front of
>> "words=...". That will tell you how much time is being spent in each, and
>> whether the limitation is really hashing performance.
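>>
>> Roughly, that instrumentation might look like this (a sketch only; @time
>> prints the elapsed time and allocations of the expression it prefixes):
>>
>> using DataStructures
>> function wordcounter_timed(filename)   # illustrative name
>>     counts = counter(Symbol)
>>     # time spent reading the file and splitting it into words
>>     @time words = split(readall(filename),
>>                         Set([' ','\n','\r','\t','-','.',',',':','_','"',';','!']),
>>                         false)
>>     # time spent interning, hashing, and counting
>>     @time for w in words
>>         add!(counts, symbol(w))
>>     end
>>     return counts
>> end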
>>
>> On Tuesday, March 4, 2014 7:55:12 PM UTC-5, Roman Sinayev wrote:
>>>
>>> I got to about 0.55 seconds with the above suggestions. Still about 2x
>>> slower than Python, unfortunately.
>>> I need hashing to be fast because I use it heavily, both for NLP and for
>>> serving data from a Julia webserver.
>>>
>>