Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-31 Thread Bruno Haible
Paolo Bonzini wrote: > >> The point is that u8_mbtouc will look only one byte past the end of > >> a (valid or invalid) character > > > > This is not guaranteed. If you call u8_mbtouc(&uc, needle, 4) you are > > asserting that needle[0..3] are valid memory addresses. > > In practice that's not wha

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-31 Thread Paolo Bonzini
On 07/31/2010 09:35 PM, Bruno Haible wrote: Hi Paolo, I pushed it now, thanks for the review! After I added some more unit tests for u8_strchr, I see crashes and assertion failures: 1) mem = (UNIT *) (page_boundary - 2 * sizeof (UNIT)); mem[0] = 0x50; mem[1] = 0;

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-31 Thread Bruno Haible
Hi Paolo, > I pushed it now, thanks for the review! After I added some more unit tests for u8_strchr, I see crashes and assertion failures: 1) mem = (UNIT *) (page_boundary - 2 * sizeof (UNIT)); mem[0] = 0x50; mem[1] = 0; ASSERT (u8_strchr (mem, 0x123) == NULL);

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Paolo Bonzini
On 07/29/2010 06:00 PM, Pádraig Brady wrote: On 29/07/10 16:29, Paolo Bonzini wrote: if (!needle[0]) return haystack; n = u8_mbtouc(&uc, needle, 4); ^ u8_strmbtouc here as suggested by Bruno. u8_strmbtouc also checks itself for NULL, and returns zero in that cas

guarantees of u8_mbtouc/u8_strmbtouc (was Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm)

2010-07-29 Thread Paolo Bonzini
On 07/29/2010 05:36 PM, Bruno Haible wrote: You should better use u8_strmbtouc in this case. Aha, that's true. This function is a better match, but only because it supports CONFIG_UNICODE_SAFETY and is faster in that case (because it can just compare with zero). Still, without safety u8_st

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Paolo Bonzini
On 07/29/2010 05:36 PM, Bruno Haible wrote: Paolo Bonzini wrote: The point is that u8_mbtouc will look only one byte past the end of a (valid or invalid) character This is not guaranteed. If you call u8_mbtouc(&uc, needle, 4) you are asserting that needle[0..3] are valid memory addresses. In

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Bruno Haible
Paolo Bonzini wrote: > The point is that > u8_mbtouc will look only one byte past the end of a (valid or invalid) > character This is not guaranteed. If you call u8_mbtouc(&uc, needle, 4) you are asserting that needle[0..3] are valid memory addresses. If it's not the case, u8_mbtouc may crash. Yo

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Paolo Bonzini
On 07/29/2010 04:32 PM, Pádraig Brady wrote: No! You can cheat!:) Just pass 4 to u8_mbtouc. There will be no out-of-bounds access: if the string ends before that, you get either a complete character or an EILSEQ. You don't get EILSEQ for u8_mbtouc(&uc,"aa",6); But you get a complete character

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Paolo Bonzini
On 07/29/2010 12:40 PM, Pádraig Brady wrote: I thought of that but thought it would be too much overhead for the general case since it seemed like it would need strlen, u8_mbsnlen& u8_mbtouc, with conversions back and forth. No! You can cheat! :) Just pass 4 to u8_mbtouc. There will be no o

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Pádraig Brady
On 29/07/10 10:26, Paolo Bonzini wrote: > On 07/29/2010 11:05 AM, Pádraig Brady wrote: >>> I would better recommend to use >>> the u8_strstr function. >> >> I wonder could we speed that up for UTF-8 >> by just deferring to strstr() ? >> I've not tested this so feel free to bin it. > > An alternati

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Paolo Bonzini
On 07/29/2010 11:05 AM, Pádraig Brady wrote: I would better recommend to use the u8_strstr function. I wonder could we speed that up for UTF-8 by just deferring to strstr() ? I've not tested this so feel free to bin it. An alternative idea: check if there is more than 1 character using u8_mb

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-29 Thread Pádraig Brady
On 28/07/10 22:32, Bruno Haible wrote: > Pádraig Brady wrote: >> I would suggest a new function due to the >> way I see this function called most often. >> >> /* definitely not sure of this name */ >> uint8_t * >> u8_str_u8_chr (const uint8_t *s, const uint8_t *c, size_t size) >> { >> switch (siz

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-28 Thread Bruno Haible
Pádraig Brady wrote: > I would suggest a new function due to the > way I see this function called most often. > > /* definitely not sure of this name */ > uint8_t * > u8_str_u8_chr (const uint8_t *s, const uint8_t *c, size_t size) > { > switch (size): > { > case 1: > return (uint8_

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-28 Thread Bruno Haible
Hi Paolo, > I pushed it now, thanks for the review! I thought you were going to add ChangeLog entries right before pushing. I added them for you (from the git commit messages). When compiling a testdir like this: ./gnulib-tool --create-testdir --dir=$HOME/data/tmp/testdir1 --with-tests unist

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-28 Thread Paolo Bonzini
On 07/28/2010 11:01 PM, Bruno Haible wrote: I thought you were going to add ChangeLog entries right before pushing. I added them for you (from the git commit messages). Oops, looks like I was confusing coreutils/grep vs. gnulib habits. Paolo

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-27 Thread Paolo Bonzini
On 07/27/2010 06:28 PM, Pádraig Brady wrote: On 27/07/10 19:14, Paolo Bonzini wrote: On 07/27/2010 06:06 PM, Pádraig Brady wrote: I would suggest a new function due to the way I see this function called most often. I.E. repeatedly with the same character. Is this really a bottleneck? i.e.,

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-27 Thread Pádraig Brady
On 27/07/10 19:14, Paolo Bonzini wrote: > On 07/27/2010 06:06 PM, Pádraig Brady wrote: >> >> I would suggest a new function due to the >> way I see this function called most often. >> I.E. repeatedly with the same character. > > Is this really a bottleneck? i.e., what does u8_uctomb_aux look like

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-27 Thread Paolo Bonzini
On 07/27/2010 06:06 PM, Pádraig Brady wrote: On 23/07/10 13:03, Paolo Bonzini wrote: On 07/23/2010 09:21 AM, Bruno Haible wrote: Otherwise fine, please commit. I can add comments that prove the validity of the code later, after you committed it. v2 already had large comments that detail the a

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-27 Thread Pádraig Brady
On 23/07/10 13:03, Paolo Bonzini wrote: > On 07/23/2010 09:21 AM, Bruno Haible wrote: >> Otherwise fine, please commit. I can add comments that prove the >> validity of >> the code later, after you committed it. > > v2 already had large comments that detail the algorithm used and do > include all

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-23 Thread Pádraig Brady
On 23/07/10 07:53, bonz...@gnu.org wrote: > From: Paolo Bonzini > > v2 of the patch, now including strchr operations too and with the > changes suggested by Bruno. Padraig, can you test it on your > cases too? Cool. There was no measurable change for the 2 byte case, but there was another 6% im

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-23 Thread Paolo Bonzini
On 07/23/2010 09:21 AM, Bruno Haible wrote: Otherwise fine, please commit. I can add comments that prove the validity of the code later, after you committed it. v2 already had large comments that detail the algorithm used and do include all the info in the comments you suggested. Feel free to

Re: [PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-23 Thread Bruno Haible
Hi Paolo, Thanks for the update. In [PATCH v2 3/5] unistr/u*-chr: test multibyte sequences more: > +n = U_UCTOMB(c, uc, 6); Still lacking a space before the opening parenthesis. In [PATCH v2 4/5] unistr/u*-strchr: add tests: > +n = U_UCTOMB(c, uc, 6); Likewise. > +/* Test of u

[PATCH v2 0/5] Speed up uNN_chr and uNN_strchr with Boyer-Moore algorithm

2010-07-22 Thread bonzini
From: Paolo Bonzini v2 of the patch, now including strchr operations too and with the changes suggested by Bruno. Padraig, can you test it on your cases too? Paolo Bonzini (5): unistr/u*-chr: prepare for multibyte tests unistr/u*-chr: test multibyte sequences unistr/u*-chr: test multibyte