Thanks, Malcom. I reviewed it. The only thing you still have to do is hit "submit patch" to get a Jenkins run. See our HowToContribute wiki page for more details.
wiki.apache.org/hadoop/HowToContribute best, Colin On Sat, Dec 13, 2014 at 9:22 PM, malcolm <malcolm.kaval...@oracle.com> wrote: > I am checking on the latest release of Solaris 11 and yes, it is still > thread safe (or MT Safe as documented on the man page). > > strerror checks the error code, and returns the same "unknown error" string > as terror does, if it receives an invalid code. I checked this on Windows, > Solaris and Linux (though my changes only affect Solaris platforms). > > JIRA newbie question: > > I have filed the JIRA attaching the patch HADOOP-11403 against the trunk, > asking for reviewers in the comments section. > Is there any other protocol I should follow ? > > Thanks, > Malcolm > > > On 12/14/2014 01:08 AM, Asokan, M wrote: >> >> Malcom, >> That's great! Is strerror() thread-safe in the recent version of >> Solaris? In any case, to be correct you still need to make sure that the >> code passed to strerror() is a valid one. For this you need to check errno >> after the call to strerror(). Please check the code snippet I sent earlier >> for HPUX. >> >> -- Asokan >> ________________________________________ >> From: malcolm [malcolm.kaval...@oracle.com] >> Sent: Saturday, December 13, 2014 3:13 PM >> To: common-dev@hadoop.apache.org >> Subject: Re: Solaris Port SOLVED! >> >> Wiping egg off face ... >> >> After consulting with the Solaris team (and looking at the source code >> and man page) , it turns out that strerror itself on Solaris is MT-Safe >> ! (Just like HPUX) >> >> So, after all this effort, all I need to do is modify terror as follows: >> >> const char* terror(int errnum) >> { >> >> #if defined(__sun) >> return strerror(errnum); // MT-Safe under Solaris >> #else >> if ((errnum < 0) || (errnum >= sys_nerr)) { >> return "unknown error."; >> } >> return sys_errlist[errnum]; >> #endif >> } >> >> And in two other files where sys_errlist is referenced directly >> (NativeIO and hdfs_http_client.c), I replaced this direct access instead >> with a call to terror. >> >> Thanks for all your help and patience, >> >> I'll file a JIRA asap, >> >> Cheers, >> Malcolm >> >> On 12/13/2014 05:26 PM, malcolm wrote: >>> >>> Thanks Asokan, >>> >>> Looked up Gcc's thread local variables, seems a bit complex though and >>> quite specific to Gnu. >>> >>> Intialization of the static errlist array should be thread safe i.e. >>> initially the array is nulled out, and afterwards if two threads write >>> to the same address, then they would be writing the same string. >>> >>> But if we are ok with changing 5 files, not just terror, then I would >>> just remove terror completely and use strerror_r (or the alternatives >>> for Windows and HP_UX) in the caller code instead i.e. using your >>> suggestion for a local buffer in the caller, wherever needed. The more >>> I think about it, the more this seems to be the right thing to do. >>> >>> Cheers, >>> Malcolm >>> >>> >>> On 12/13/2014 04:38 PM, Asokan, M wrote: >>>> >>>> Malcom, >>>> Gcc supports thread-local variables. See >>>> >>>> https://gcc.gnu.org/onlinedocs/gcc-3.3.1/gcc/Thread-Local.html >>>> >>>> I am not sure about native compilers on Solaris, HPUX, or AIX. >>>> >>>> In any case, I found out that the Windows native code in Hadoop seems >>>> to handle error messages properly. Here is what I found: >>>> >>>> $ find ~/work/hadoop/hadoop-trunk/ -name '*.c'|xargs grephadoop how to >>>> file a jira >>>> >>>> FormatMessage|awk -F: '{print $1}'|sort -u >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c >>>> >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/JniBasedUnixGroupsMappingWin.c >>>> >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c >>>> >>>> >>>> >>>> $ find ~/work/hadoop/hadoop-trunk/ -name '*.c'|xargs grep terror|awk >>>> -F: '{print $1}'|sort -u >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/exception.c >>>> >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/SharedFileDescriptorFactory.c >>>> >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix/DomainSocket.c >>>> >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/net/unix/DomainSocketWatcher.c >>>> >>>> >>>> /home/asokan/work/hadoop/hadoop-trunk/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/JniBasedUnixGroupsMapping.c >>>> >>>> >>>> >>>> This means you need not worry about the Windows version of terror(). >>>> You need to change five files that contain UNIX specific native code. >>>> >>>> I have a question on your suggested implementation: >>>> >>>> How do you initialize the static errlist array in a thread-safe manner? >>>> >>>> ________________________________ >>>> Here is another thread-safe implementation that I could come up with: >>>> >>>> #include <string.h> >>>> #include <stdlib.h> >>>> #include <errno.h> >>>> #include <stdio.h> >>>> >>>> #define MESSAGE_BUFFER_SIZE 256 >>>> >>>> char * getSystemErrorMessage(char * buf, int buf_len, int code) { >>>> #if defined(_HPUX_SOURCE) >>>> char * msg; >>>> errno = 0; >>>> msg = strerror(code); >>>> if (errno == 0) { >>>> strncpy(buf, msg, buf_len-1); >>>> buf[buf_len-1] = '\0'; >>>> } else { >>>> snprintf(buf, buf_len, "%s %d", >>>> "Can't get system error message for code", code); >>>> } >>>> #else >>>> if (strerror_r(code, buf, buf_len) != 0) { >>>> snprintf(buf, buf_len, "%s %d", >>>> "Can't get system error message for code", code); >>>> } >>>> #endif >>>> return buf; >>>> } >>>> >>>> #define TERROR(code) \ >>>> getSystemErrorMessage(messageBuffer, sizeof(messageBuffer), code) >>>> >>>> int main(int argc, char ** argv) { >>>> if (argc > 1) { >>>> char messageBuffer[MESSAGE_BUFFER_SIZE]; >>>> int code = atoi(argv[1]); >>>> >>>> fprintf(stderr, "System error for code %s: %s\n", argv[1], >>>> TERROR(code)); >>>> } >>>> return 0; >>>> } >>>> >>>> >>>> This changes terror to a macro TERROR and requires all functions that >>>> call TERROR macro to declare the local variable messageBuffer. Since >>>> there are only five files to modify, I think it is not a big effort. >>>> What do you think? >>>> >>>> -- Asokan >>>> >>>> On 12/13/2014 04:29 AM, malcolm wrote: >>>> Colin, >>>> >>>> I am not sure what you mean by a thread-local buffer (in native >>>> code). In Java this is pretty standard, but I couldn't find any >>>> implementation for C code. >>>> >>>> Here is the terror function: >>>> >>>> const char* terror(int errnum) >>>> { >>>> if ((errnum < 0) || (errnum >= sys_nerr)) { >>>> return "unknown error."; >>>> } >>>> return sys_errlist[errnum]; >>>> } >>>> >>>> >>>> The interface is identical to strerror, but the implementation is >>>> actually re-entrant since it returns a pointer to a static string. >>>> >>>> If I understand your suggestion, the new function would look like this: >>>> >>>> const char* terror(int errnum) >>>> { >>>> static char result[65]; >>>> >>>> strerror_r(errnum, result, 64); >>>> >>>> return result; >>>> } >>>> >>>> No need for snprintf, strerror_r has the 'n' bounding built-in. >>>> >>>> Of course, this is still non-re-entrant, so unless the caller copies >>>> the returned buffer, before the function is called again, there is a >>>> problem. >>>> >>>> After considerable thought, I have come up with this version of >>>> terror, tested OK on Windows, Linux and Solaris: >>>> >>>> #if defined(_WIN32) >>>> #define strerror_r(errno,buf,len) strerror_s(buf,len,errno) >>>> #endif >>>> >>>> #define MAX_ERRORS 256 >>>> #define MAX_ERROR_LEN 80 >>>> >>>> char *terror(int errnum) >>>> { >>>> >>>> static char errlist[MAX_ERRORS][MAX_ERROR_LEN+1]; // cache of >>>> error messages >>>> >>>> if ( errnum >= 0 && errnum < MAX_ERRORS ) >>>> { >>>> if ( errlist[errnum][0] == 0 ) >>>> strerror_r( errnum, errlist[errnum], MAX_ERROR_LEN); >>>> >>>> return errlist[errnum]; >>>> } >>>> else >>>> { >>>> return "Unknown error"; >>>> } >>>> } >>>> >>>> This version is portable and re-entrant. >>>> >>>> On windows, the largest errnum is 43, on Ubuntu 14.04 we have 133, >>>> and on Solaris 11.1 we get 151. >>>> >>>> If this is OK with you, I will open a jira for this. >>>> >>>> >>>> Thanks, >>>> Malcolm >>>> >>>> >>>> On 12/12/2014 11:10 PM, Colin McCabe wrote: >>>> Just use snprintf to copy the error message from strerror_r into a >>>> thread-local buffer of 64 bytes or so. Then preserve the existing >>>> terror() interface. >>>> >>>> Can you open a jira for this? >>>> >>>> best, >>>> Colin >>>> >>>> On Thu, Dec 11, 2014 at 8:35 PM, >>>> malcolm<malcolm.kaval...@oracle.com><mailto:malcolm.kaval...@oracle.com> >>>> wrote: >>>> So, turns out that if I had naively changed all calls to terror or >>>> references to sys_errlist, to using strerror_r, then I would have broken >>>> code for Windows and HPUX (and possibly other OSes). >>>> >>>> If we are to assume that current code runs fine on all platforms >>>> (maybe even >>>> AIX an MacOS, for example), then any change/additions made to the >>>> code and >>>> not ifdeffed appropriately can break on other OSes. On the other >>>> hand, too >>>> many ifdefs can pollute the code source and render it less readable >>>> (though >>>> possibly less important). >>>> >>>> In the general case what are code contributors responsibilities to >>>> adding >>>> code regarding OSes besides Linux ? >>>> What OSes does jenkins test on ? >>>> I guess maintainers of code on non-tested platforms are responsible for >>>> their own testing ? >>>> >>>> How do we avoid the ping-pong effect, i.e. I make a generic change to >>>> code >>>> which breaks on Windows, then the Windows maintainer reverts changes to >>>> break on Solaris for example ? Or does this not happen in actuality ? >>>> >>>> >>>> On 12/11/2014 11:25 PM, Asokan, M wrote: >>>> Hi Malcom, >>>> The Windows versions of strerror() and strerror_s() functions are >>>> probably meant for ANSI C library functions that set errno. For core >>>> Windows API calls (like UNIX system calls), one gets the error number by >>>> calling GetLastError() function. In the code snippet I sent earlier, >>>> the >>>> "code" argument is the value returned by GetLastError(). Neither >>>> strerror() >>>> nor strerror_s() will give the correct error message for this error >>>> code. >>>> >>>> You could probably look at libwinutils.c in Hadoop source. It uses >>>> FormatMessageW (which returns messages in Unicode.) My requirement >>>> was to >>>> return messages in current system locale. >>>> >>>> -- Asokan >>>> ________________________________________ >>>> From: malcolm >>>> [malcolm.kaval...@oracle.com<mailto:malcolm.kaval...@oracle.com>] >>>> Sent: Thursday, December 11, 2014 4:04 PM >>>> To:common-dev@hadoop.apache.org<mailto:To:common-dev@hadoop.apache.org> >>>> Subject: Re: Solaris Port >>>> >>>> Hi Asok, >>>> >>>> I googled and found that windows has strerror, and strerror_s (which is >>>> the strerror_r equivalent). >>>> Is there a reason why you didn't use this call ? >>>> >>>> On 12/11/2014 06:27 PM, Asokan, M wrote: >>>> Hi Malcom, >>>> Recently, I had to work on a function to get system error >>>> message on >>>> various systems. Here is the piece of code I came up with. Hope it >>>> helps. >>>> >>>> static void get_system_error_message(char * buf, int buf_len, int code) >>>> { >>>> #if defined(_WIN32) >>>> LPVOID lpMsgBuf; >>>> DWORD status = FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER | >>>> FORMAT_MESSAGE_FROM_SYSTEM | >>>> FORMAT_MESSAGE_IGNORE_INSERTS, >>>> NULL, code, >>>> MAKELANGID(LANG_NEUTRAL, >>>> SUBLANG_DEFAULT), >>>> /* Default >>>> language */ >>>> (LPTSTR) &lpMsgBuf, 0, NULL); >>>> if (status > 0) >>>> { >>>> strncpy(buf, (char *)lpMsgBuf, buf_len-1); >>>> buf[buf_len-1] = '\0'; >>>> /* Free the buffer returned by system */ >>>> LocalFree(lpMsgBuf); >>>> } >>>> else >>>> { >>>> _snprintf(buf, buf_len-1 , "%s %d", >>>> "Can't get system error message for code", code); >>>> buf[buf_len-1] = '\0'; >>>> } >>>> #else >>>> #if defined(_HPUX_SOURCE) >>>> { >>>> char * msg; >>>> errno = 0; >>>> msg = strerror(code); >>>> if (errno == 0) >>>> { >>>> strncpy(buf, msg, buf_len-1); >>>> buf[buf_len-1] = '\0'; >>>> } >>>> else >>>> { >>>> snprintf(buf, buf_len, "%s %d", >>>> "Can't get system error message for code", code); >>>> } >>>> } >>>> #else >>>> if (strerror_r(code, buf, buf_len) != 0) >>>> { >>>> snprintf(buf, buf_len, "%s %d", >>>> "Can't get system error message for code", code); >>>> } >>>> #endif >>>> #endif >>>> } >>>> >>>> Note that HPUX does not have strerror_r() since strerror() itself is >>>> thread-safe. Also Windows does not have snprintf(). The equivalent >>>> function _snprintf() has a subtle difference in its interface. >>>> >>>> -- Asokan >>>> ________________________________________ >>>> From: malcolm >>>> [malcolm.kaval...@oracle.com<mailto:malcolm.kaval...@oracle.com>] >>>> Sent: Thursday, December 11, 2014 11:02 AM >>>> To:common-dev@hadoop.apache.org<mailto:To:common-dev@hadoop.apache.org> >>>> Subject: Re: Solaris Port >>>> >>>> Fine with me, I volunteer to do this, if accepted. >>>> >>>> On 12/11/2014 05:48 PM, Allen Wittenauer wrote: >>>> sys_errlist was removed for a reason. Creating a fake sys_errlist on >>>> Solaris will mean the libhadoop.so will need to be tied a specific build >>>> (kernel/include pairing) and therefore limits upward >>>> mobility/compatibility. >>>> That doesn’t seem like a very good idea. >>>> >>>> IMO, switching to strerror_r is much preferred, since other than the >>>> brain-dead GNU libc version, is highly portable and should work >>>> regardless >>>> of the kernel or OS in place. >>>> >>>> On Dec 11, 2014, at 5:20 AM, >>>> malcolm<malcolm.kaval...@oracle.com><mailto:malcolm.kaval...@oracle.com> >>>> wrote: >>>> >>>> FYI, there are a couple more files that reference sys_errlist directly >>>> (not just terror within exception.c) , but also hdfs_http_client.c and >>>> NativeiO.c >>>> >>>> On 12/11/2014 07:38 AM, malcolm wrote: >>>> Hi Colin, >>>> >>>> Exactly, as you noticed, the problem is the thread-local buffer needed >>>> to return from terror. >>>> Currently, terror just returns a static string from an array, this is >>>> fast, simple and error-proof. >>>> >>>> In order to use strerror_r inside terror, would require allocating a >>>> buffer inside terror and depend on the caller to free the buffer after >>>> using it, or to pass a buffer to terrror (which is basically the same as >>>> strerror_r, rendering terror redundant). >>>> Both cases require modification outside terror itself, as far as I can >>>> tell, no simple fix. Unless you have an alternative which I haven't >>>> thought >>>> of ? >>>> >>>> As far as I can tell, we have two choices: >>>> >>>> 1. Remove terror and replace calls with strerror_r, passing a buffer >>>> from the callee. >>>> Advantage: a more modern portable interface. >>>> Disadvantage: All calls to terror need to be modified, though >>>> all seem to be in a few files as far as I can tell. >>>> >>>> 2. Adding a sys_errlist array (ifdeffed for Solaris) >>>> Advantage: no change to any calls to terror >>>> Disadvantage: 2 additional files added to source tree (.c and >>>> .h) and some minor ifdefs only used for Solaris. >>>> >>>> I think it is more a question of style than anything else, so I leave >>>> you to make the call. >>>> >>>> Thanks for your patience, >>>> Malcolm >>>> >>>> >>>> >>>> >>>> >>>> On 12/10/2014 09:54 PM, Colin McCabe wrote: >>>> On Wed, Dec 10, 2014 at 2:31 AM, malcolm >>>> <malcolm.kaval...@oracle.com><mailto:malcolm.kaval...@oracle.com> wrote: >>>> Hi Colin, >>>> >>>> Thanks for the hints around JIRAs. >>>> >>>> You are correct errno still exists, however sys_errlist does not. >>>> >>>> Hadoop uses a function terror (defined in exception.c) which indexes >>>> sys_errlist by errno to return the error message from the array. >>>> This >>>> function is called 26 times in various places (in 2.2) >>>> >>>> Originally, I thought to replace all calls to terror with strerror, >>>> but >>>> there can be issues with multi-threading (it returns a buffer which >>>> can be >>>> overwritten), so it seemed simpler just to recreate the sys_errlist >>>> message >>>> array. >>>> >>>> There is also a multi-threaded version strerror_r where you pass the >>>> buffer >>>> as a parameter, but this would necessitate changing every call to >>>> terror >>>> with mutiple lines of code. >>>> Why don't you just use strerror_r inside terror()? >>>> >>>> I wrote that code originally. The reason I didn't want to use >>>> strerror_r there is because GNU libc provides a non-POSIX definition >>>> of strerror_r, and forcing it to use the POSIX one is a pain. But you >>>> can do it. You also will require a thread-local buffer to hold the >>>> return from strerror_r, since it is not guaranteed to be static >>>> (although in practice it is 99% of the time-- another annoyance with >>>> the API). >>>> >>>> >>>> ________________________________ >>>> >>>> >>>> >>>> ATTENTION: ----- >>>> >>>> The information contained in this message (including any files >>>> transmitted with this message) may contain proprietary, trade secret or >>>> other confidential and/or legally privileged information. Any pricing >>>> information contained in this message or in any files transmitted >>>> with this >>>> message is always confidential and cannot be shared with any third >>>> parties >>>> without prior written approval from Syncsort. This message is >>>> intended to be >>>> read only by the individual or entity to whom it is addressed or by >>>> their >>>> designee. If the reader of this message is not the intended >>>> recipient, you >>>> are on notice that any use, disclosure, copying or distribution of this >>>> message, in any form, is strictly prohibited. If you have received this >>>> message in error, please immediately notify the sender and/or >>>> Syncsort and >>>> destroy all copies of this message in your possession, custody or >>>> control. >>>> >>>> >>>> >> >> ________________________________ >> >> >> >> ATTENTION: ----- >> >> The information contained in this message (including any files transmitted >> with this message) may contain proprietary, trade secret or other >> confidential and/or legally privileged information. Any pricing information >> contained in this message or in any files transmitted with this message is >> always confidential and cannot be shared with any third parties without >> prior written approval from Syncsort. This message is intended to be read >> only by the individual or entity to whom it is addressed or by their >> designee. If the reader of this message is not the intended recipient, you >> are on notice that any use, disclosure, copying or distribution of this >> message, in any form, is strictly prohibited. If you have received this >> message in error, please immediately notify the sender and/or Syncsort and >> destroy all copies of this message in your possession, custody or control. > >