Norbert, thanks for the feedback.
2011/1/30 Norbert Thiebaud <nthieb...@gmail.com> > On Sat, Jan 29, 2011 at 5:43 PM, Kenneth Venken > <kenneth.ven...@gmail.com> wrote: > > Hi, > > > > i came across gendict.cxx while fixing a possible memleak. It took me > some > > time to figure out what the code did. > > I notice a lot of very very long function bodies in LO-code, gendict was > no > > exception. So i refactored the code, did some google searches on gendict > and > > was able to fix the memleak. > > I ended up submitting only the fix, not the refactoring, because i didn't > > want to break any de facto coding style guideliness and i am still a > fairly > > new contributer to LO. > > > > I submit these patches now, so you guys can decide if you push them or > not. > > In case it could save some other new contributor some time in > understanding > > gendict ;) > > BTW, i tested the code on ja.dic and output is still the same. > > Even though the program is not _that_ long, I think the intent is > good, but I do have few questions, remarks: > > Preamble: > - Some of the remarks below concern code-choice you made, some of them > concern > things that were already like that in the original code. I did not > distinguished the two because > a/ the remarks are on the resulting code, not on who wrote it nor when > b/ since this is a complete re factoring, minimizing changes are not a > priority; So, existing > oddities should be fixed along the way. > > - This is by now means an exhaustive review. > > 1/ implement before use. > I'd strongly prefer if you implemented-before-use rather than > prototype + use + implementation. > in other words: main() at the end and the function above it, in order > such as they are implemented before they are used. > Ok, i'll do it like that from now on. > 2/ keep private function private: in C that means qualify the print* > and init* functions you created to break-down > the big bad main function, as 'static'. Good point. 3/ patch 0002: you factor printFunctions() but I can't see where it is > being called. ( oh, it 're-appear' in patch 0004, but still) > > yeah, sorry about that, i noticed it was gone after a few commits. > 4/ the indentation of the original was horrible, but if you are going > to refactor, you may want to fix that. > i.e > + if (*u != current) { > + if (*u < current) > + printf("u %x, current %x, count %d, lenArray.size() %d\n", *u, > current, > + sal::static_int_cast<int>(count), > sal::static_int_cast<int>(lenArray.size())); > + current = *u; > + charArray[current] = lenArray.size(); > + } > > + if (*u != current) > + { > + if (*u < current) > + printf("u %x, current %x, count %d, lenArray.size() > %d\n", *u, current, > + sal::static_int_cast<int>(count), > sal::static_int_cast<int>(lenArray.size())); > + current = *u; > + charArray[current] = lenArray.size(); > + } > > Note that *I* prefer systematic {} even for one-liner if/for/while, > but that is not universally shared... so I won't insist on it. > Note2: while we are at it the > sal:static_int_cast<int>(count/lenArray.size()) is pretty stupid. > printf(" %d %d", count, lenArray.size()) works just fine. and btw > there is no reason for count (or in general any local counter) to be a > sal_int32. int is just fine) > the if statement used here should probably be an ASSERT( *u < current ) because the input dictionary file should be a sorted list of words. if *u < current is true for the last word in the dictionary, the line charArray[current+1] = lenArray.size(); (after loop) could produce unintended behaviour. > > 5/ patch 004: > +void printLenArray(FILE* source_fp, const vector<sal_Int32>& lenArray, > + sal_Int32 count) > +{ > + fprintf(source_fp, "static const sal_Int32 lenArray[] = {\n\t"); > + count = 1; > > what's the point passing cont as a parameter if you are going to > override it's value right away ? (note: ok so, patch 0005 actually fix > that...) > these patches should be viewed as a wholel. The refactoring was a process. But you're wright. > > + fprintf(source_fp, "0x%x, ", 0); // insert one slat for skipping > 0 in index2 array. > + for (size_t k = 0; k < lenArray.size(); k++) > + { > + fprintf(source_fp, "0x%lx, ", static_cast<long unsigned > int>(lenArray[k])); > > lenArray, in the generated code, is declared as sal_int32[] so why > casting it's element to long unsigned int ? > > + if (count == 0xf) > + { > + count = 0; > + fprintf(source_fp, "\n\t"); > + } > + else count++; > if( (k & 0xe) == 0xe) > { > fprintf(source_fp, "\n\t"); > } > achieve the same result without the need to maintain 'count' > It is still a bit odd, due to the shift introduce by the special element 0. > > I 'd prefer > + for (size_t k = 0; k < lenArray.size(); k++) > + { > + if (!(k &0xf) > + { > + fprintf(source_fp, "\n\t"); > + } > + fprintf(source_fp, "0x%lx, ", static_cast<long unsigned > int>(lenArray[k])); > + } > > that change a bit the output (by having the first special 0 entry > standing out, but actually that is a good thing since it _is_ special > > +void printIndex2(FILE *source_fp, sal_Int32 *charArray, sal_Int16 *set) > +{ > + fprintf (source_fp, "static const sal_Int32 index2[] = {\n\t"); > + sal_Int32 prev = 0; > + for (sal_Int32 i = 0; i < 0x100; i++) { > + if (set[i] != 0xff) { > + for (sal_Int32 j = 0; j < 0x100; j++) { > + sal_Int32 k = (i*0x100) + j; > > the compiler should be smart enough to detect that multiply by 256 is > left-shift by 8, but still why take a risk: > k = (i << 8 | j); > + if (prev != 0 && charArray[k] == 0) { > + for (k++; k < 0x10000; k++) > + if (charArray[k] != 0) > + break; > + } > + prev = charArray[(i*0x100) + j]; > + fprintf( > + source_fp, "0x%lx, ", > + sal::static_int_cast< unsigned long >( > + k < 0x10000 ? charArray[k] + 1 : 0)); > + if ((j+1) % 0x10 == 0) > > if (j & 0x0f == 0x0f) > does the same thing without an addition AND a modulo operation, note > that you are in the inner loop here. this will > be called up to 65K times. granted that since we use fprintf() to push > constant string to a file, performance is clearly not > very high on the list of concern. > fputs("\n\t", source_fp) is at least an order of magnitude faster than > fprintf(source_cp, "\n\t"); > > > +void printExistMark(FILE *source_fp, sal_Bool *exists, sal_Int32 count) > +{ > + count = 0; > + fprintf (source_fp, "static const sal_uInt8 existMark[] = {\n\t"); > + for (sal_Int32 i = 0; i < 0x1FFF; i++) { > + sal_uInt8 bit = 0; > + for (sal_Int32 j = 0; j < 8; j++) > + if (exists[i * 8 + j]) > + bit |= 1 << j; > + fprintf(source_fp, "0x%02x, ", bit); > + if (count == 0xf) { > + count = 0; > + fprintf(source_fp, "\n\t"); > + } else count++; > + } > + fprintf (source_fp, "\n};\n"); > } > since exist[] is really an array of bool that is init to false and the > set to true when needed, only to be then packed and dump as a > bitfield, you could do > > static inline set_exist(int index) > { > exist[index>>3] |= 1 << (index & 0x07); > } > > then replace the two place that set exist > exist[u[0]] = sal_True; ->set_exist(u[0]); > and > exist[u[i]] = sal_True; -> set_exist(u[i]); > > now exist[] is then defined as > static unsigned char exist[0x2000]; > (note declaring it static means that you don't need to initialize it > anymore since the C standard guarantee that it will be initialized > with 0, much more efficiently than the loop that currently do that in > the code, and the code is not meant to be re-entrant anyway) > > and you can simplify printExistMark in a simple byte dump. > Nice. > > 6/ patch 006 > > - if (argc < 3) exit(-1); > + if (argc < 3) > + { > + printf("2 arguments required: dictionary_file_name > source_file_name"); > + exit(-1); > + } > > The convention is to send error message to stderr. > (and the output -- here a source file -- should default to stdout) > > Ok, i'll stick to that convention from now on. > 7/ patch 008 > /* FIXME?: what happens if in every range i there is at least one charArray > != 0 > + => this will make index1[] = {0x00, 0x01, 0x02,... 0xfe, 0xff } > + => then in index2, the last range will be ignored incorrectly */ > > luckily this is not possible since utf16 characters in the range > D800-DBFF are not supported/valid > see http://en.wikipedia.org/wiki/UTF-16/UCS-2 for a reason why. > > so you will have, at least, 4 ranges without any hit. > > So "It is not possible to encode these code points in UTF-16. The Unicode standard permanently reserves these values for UTF-16 encoding only, so a single 16-bit code unit in the range 0xD800 to 0xDFFF never represents a character in Plane 0." But since we are using the UTF-16 encoding, it could be possible that a sal_Unicode value (a unicode code unit) is in this range. See section about Code points U+10000..U+10FFFF in the link you sent me. Norbert > > > -- Kenneth > > > > > > > > _______________________________________________ > > LibreOffice mailing list > > LibreOffice@lists.freedesktop.org > > http://lists.freedesktop.org/mailman/listinfo/libreoffice > > > > > So, should i make the changes to the code or have you already done them? -- Kenneth
_______________________________________________ LibreOffice mailing list LibreOffice@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/libreoffice