A starting point for this is now on gerrit: https://code.wireshark.org/review/#/c/965/
It does all the tvb_get_* routines, but only the proto_add_* routine for guint8. I didn't bother creating the other variants for that routine because there are open questions in it still. It already helped me find one bug in packet-sdp.c. :) -hadriel On Apr 4, 2014, at 5:01 PM, Hadriel Kaplan <hadriel.kap...@oracle.com> wrote: > > On Apr 4, 2014, at 4:04 PM, Guy Harris <g...@alum.mit.edu> wrote: > >> On Apr 4, 2014, at 7:30 AM, Hadriel Kaplan <hadriel.kap...@oracle.com> wrote: >> >>> I might be overlooking something, but I don’t see a tvb_get_* function to >>> get a uint8/16/32/64 that was encoded as a ascii or utf-8 string in the >>> packet. Is there such a thing? >> >> No. >> I've occasionally also thought there should be such a routine. > > I've started coding it today but my real day-job is getting in the way. :) > > >> Note, though, that, whilst tvb_get_guint8() and tvb_get_{n,le}tohXXX() can >> never fail, because every possible sequence of octets is a valid 2's >> complement integral value, routines to get a number encoded as a string >> *can* fail, e.g. 0123xyzw is not a valid number in bases 8, 10, or 16. > > I'm thinking the tvb_get_* routines should just execute strtol() internally > (and the other flavors depending on number type)... and thus return 0 if it > couldn't be read. And it would internally handle out-of-range errors, but > still return 0. > > But I'm also thinking the tvb_get_* would take the same optional char** > endptr param that strtol() takes, so it can pass back the ending offset - > that way you could catch things like "0123xyzw", because if I recall > correctly it actually *is* successfully converted to a number: the long int > 123 for the "0123" portion. (though I could be mis-remembering) > > Fwiw, looking at some of the consumers of such a routine, they don't seem > overly concerned with errors currently when they call atoi/strtol/etc. > > Also, I'm only planning to do this for base 10. A proto needing something > else can do it the long way. > > >> There are other cases where a tvb_get_ routine can return "you lose", e.g. >> tvb_get_string_enc() can fail if there are invalid octet sequences (about >> the only encodings I know of where *every* octet sequence is a valid string >> are some of the ISO 8859-n encodings), and at least some floating-point >> formats probably have invalid values (I guess an IEEE NaN is "valid", at >> least to the extent that if we try to format it it'll show up as "NaN", but >> if we try to do calculations with it we might get a floating-point exception. > > As an aside, I'm *only* thinking of having this for ASCII strings; anything > else needs to do it the long way. > > For protocols which are actually truly UTF-8, I'm planning to just assume > treating them as ASCII is ok, because as far as I know the atoi/strtol/etc. > functions don't actually care: if they see the ASCII characters for digits > (and +/-/etc.) they'll parse it, else not. So any non-ASCII UTF-8 character > in the sequence is meaningless to them and they stop parsing at that > character. > > And since the current potential consumers of these routines only call > atoi/strtol/etc. right now, they're really only doing the conversion for > ASCII anyway, afaik. But I'll test it to verify, in case I'm wrong. > > So what that means is I don't plan to have an encoding ENC_* param for the > tvb_get_* routines, nor need the tvb_get_* routines to internally call > tvb_get_string_enc(). In fact I'm hoping to do without any temp string being > created, even inside the tvb_get_* functions. > > >> And I'd like to see proto_tree_add_XXX_item() routines that add an item with >> a particular type *and* take a pointer argument and return the value for the >> item through that pointer; > > Good idea. > > >>> And if we had common functions handle ascii and utf-8 string-encoded >>> numbers, they could avoid creating temporary strings as well. >> >> The only real encoding issues are "ASCII superset" (so that "0123456789", >> for example, is encoded the same as in ASCII) vs. "2 or more bytes per ASCII >> character" (e.g., UCS-2, UTF-16, and UCS-4) vs. "one of those 7-bit GSM >> character encodings" vs. "EBCDIC". > > See above. The number of proto's that need such a thing for anything but > ASCII and UTF-8 is pretty small I think. And it appears the ones that are > truly UTF-8 do it for ASCII only anyway. > > -hadriel > > ___________________________________________________________________________ > Sent via: Wireshark-dev mailing list <wireshark-dev@wireshark.org> > Archives: http://www.wireshark.org/lists/wireshark-dev > Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev > mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe ___________________________________________________________________________ Sent via: Wireshark-dev mailing list <wireshark-dev@wireshark.org> Archives: http://www.wireshark.org/lists/wireshark-dev Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe