Re: RFC 283 (v1) C in array context should return a histogram

Bennett Todd Tue, 26 Sep 2000 19:21:56 -0700
2000-09-26-21:56:04 Paris Sinclair:
> A "small" fixed upper bound? It is N that is bounded, that doesn't
> stop it from using N*50 variables to represent N, or N*150
> variables if I'm only matching vs 2 characters.

In big-O notation, the N is the size of the problem; in this case,
it could be the size of the alphabet, or the length of the string.
The rest of big-O notation describes the order of growth, ignoring
constant factors, of the resources consumed by the solution as a
function of the size of the input.

I was coding for reasonable-sized alphabets, so I used a simple
linear array; as I pointed out, if you have an alphabet of
preposterous size, where it's impractical to e.g. build a single
copy of it in memory, and any document's characters from that
alphabet are sparse, then use a hash.

> Perhaps instead of using O() I should have just said, "it is 0 to
> 150 times slower."

You're either going to build a dense array, which you can directly
subscript with some kind of ordinal value of the "letters", or
you're going to build a sparse array holding the chr => count map, a
hash. You can do either one inside the implementation of tr///. You
can do either one outside. There may be a significant performance
difference, there may not, that's not obvious to me.

> The overal algorithm, that is, I am assuming that this list is
> going to be iterated over.

You mean using something like 0..$#hist or keys(%hist)? Yup, I'd
expect that too. If the alphabet is reasonable sized, 0..$#hist is
so teensy it's free. If it's monster big, then keys(%hist) will be
scaling with the size of your input. Whether you do this with a loop
over split// incrementing array/hash buckets or whether it's done by
tr///, the O of the algorithms and their data structures are the
same.

> What's the upper bound in a 16bit language? Or does that case just
> have to break? "Sorry, you're not European. Please be assimilated
> before using this tool. Resistance is futile."

Lordie lordie lordie, you're one of the persecuted minority, and
a brand-waving rioter too. I've clearly stepped on a corn, not to
mention picked the wrong person to persecute. I'll go speak english
to other bigots who only speak english, and leave the future of the
civilized universe in your responsible hands.

Only kidding, I'm not about to let you off the hook that easy.

If you've got a variable-length coding where the values produced by
split /// are themselves strings, and ord() returns bignums, then
you need to

        $hist{$_}++ for split //, $input;

and if you don't care to collect all the data, you're only
interested in xyz, then you can do various tricks; if someone forced
me to work in such a distressing environment, I might express it
something like

        $hist{$_} = 0 for split //, "xyz";
        for (split //, $input) { $hist{$_}++ if exists $hist{$_} }

or then again I might

        for("$input") {
                tr/xyz//cd; $hist{$_}++ for split //;
        }

I dunno. If you expect some tr invocation to do something
reasonable and appropriate with weird multibyte encodings of
gigantic and sparse alphabets, do let us get this stated explicitly
in the RFC; otherwise it's not a reasonable argument for trying to
add this feature.

> > What's all this about eval?
> 
> That was in reference to my previous map example, which is the best
> general way I've seen proposed to handle the specified counting in p5.

Wow, I'd skipped that, now that you force me to review it, I see
why. Maybe it looks better when interpreted as UTF-32.

Perhaps neither of my previous constructs were as clean as

        %hist = $input =~ tr/xyz//;

but then too, it's not awfully obvious just what that does, and it's
not something people need to do every day. If it's really cheap to
toss into the bucket, what the heck, I can't think of anything else
that the syntax would be better for. But I've also yet to hear
anything like a strong case made for it. For sure the ability to gen
up an equivalent expression that uses eval doesn't seem like
appealing grounds.

-Bennett
PGP signature
Re: RFC 283 (v1) C in array context should return a histogram

Reply via email to