Dear Guys,

I recently downloaded Trafodion 1.1 from 
https://github.com/apache/incubator-trafodion/tree/stable/1.1 and followed the 
build guide at 
https://wiki.trafodion.org/wiki/index.php/Building_the_Software. After solving 
a number of problems (no need to list all the details), I am now able to run 
Trafodion on a Hadoop sandbox environment.

However, I have hit a serious problem: all Trafodion-related processes go down 
after several minutes (I am not sure exactly how long), and only a few of them 
are left:
[nieyy@redhat-72 ~]$ ps ux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nieyy     76554  0.1  0.1 590988 139768 pts/6   Sl   19:14   0:04 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java 
-XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
nieyy    118833  0.7  0.3 1535452 420996 ?      Sl   19:40   0:12 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_namenode 
-Xmx1000m -Djava.net.prefe
nieyy    119085  0.6  0.2 1572688 367388 ?      Sl   19:40   0:10 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_datanode 
-Xmx1000m -Djava.net.prefe
nieyy    119320  0.4  0.2 1512656 340636 ?      Sl   19:41   0:07 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java 
-Dproc_secondarynamenode -Xmx1000m -Djava.
nieyy    119972  1.2  0.2 1708408 378536 pts/6  Sl   19:41   0:20 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_resourcemanager 
-Xmx1000m -Dhadoop.
nieyy    120133  0.9  0.2 1616388 309976 ?      Sl   19:41   0:16 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_nodemanager 
-Xmx1000m -Dhadoop.log.
nieyy    120371  0.0  0.0   9824  1772 pts/6    S    19:41   0:00 /bin/sh 
./bin/mysqld_safe 
--defaults-file=/home/nieyy/trafodion_build/incubator-trafodion-stable-1.
nieyy    120594  0.0  0.0 452604 89908 pts/6    Sl   19:41   0:01 
/home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/mysql/bin/mysq
nieyy    120789  0.0  0.0   9692  1736 pts/6    S    19:41   0:00 bash 
/home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/local_hadoop/hbase/bin
nieyy    120806  2.0  0.3 1809048 509164 pts/6  Sl   19:41   0:34 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java -Dproc_master 
-XX:OnOutOfMemoryError=kill
nieyy    122554  0.0  0.0  13624  1304 pts/6    S    19:41   0:00 mpirun 
-disable-auto-cleanup -demux select -env SQ_IC TCP -env MPI_ERROR_LEVEL 2 -env 
SQ_PIDMAP 1 -
nieyy    122555  0.0  0.0      0     0 ?        Zs   19:41   0:00 
[hydra_pmi_proxy] <defunct>
nieyy    122556  1.0  0.0 335212 36748 ?        Ssl  19:41   0:17 
/home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor
 COLD
nieyy    122557  0.8  0.0 335212 36768 ?        Ssl  19:41   0:14 
/home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sqf/export/bin64d/monitor
 COLD
nieyy    123946  0.9  0.1 828072 223088 pts/6   Sl   19:42   0:14 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java 
-XX:OnOutOfMemoryError=kill -9 %p -Xmx128m
nieyy    124044  1.0  0.1 629200 187180 pts/6   Sl   19:42   0:16 
/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/bin/java 
-XX:OnOutOfMemoryError=kill -9 %p -Xmx128m

I then have to kill all the processes and run swstartall and sqstart to reset 
the environment. However, the environment still goes down after a while, and I 
have to restart it again.

I found some core files under 
trafodion_build/incubator-trafodion-stable-1.1/core/sqf/sql/scripts, all of 
them generated by mxssmp:
[nieyy@redhat-72 scripts]$ ll core*
…
-rw------- 1 nieyy nieyy 156008448 Sep  7 17:56 core.mxssmp.173357
-rw------- 1 nieyy nieyy 145518592 Sep  7 17:56 core.mxssmp.173372
-rw------- 1 nieyy nieyy 156008448 Sep  7 19:24 core.mxssmp.74146
-rw------- 1 nieyy nieyy 145518592 Sep  7 19:24 core.mxssmp.74197
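
For reference, core files like these can be tallied per program with a small 
shell sketch. It assumes the core.<program>.<pid> naming seen in the listing 
above; count_cores is only an illustrative helper name, not part of Trafodion.

```shell
# Tally core files per program, assuming files are named core.<program>.<pid>.
# count_cores is a hypothetical helper; it takes a directory (default: ".").
count_cores() {
    dir="${1:-.}"
    for f in "$dir"/core.*; do
        [ -e "$f" ] || continue      # no core files: skip the unexpanded glob
        name="${f##*/}"              # strip the directory part
        name="${name#core.}"         # drop the leading "core."
        echo "${name%.*}"            # drop the trailing ".<pid>"
    done | sort | uniq -c            # one line per program, with a count
}
```

Running count_cores . from the scripts directory prints one line per program 
name with the number of cores it produced, which makes it easy to confirm that 
only one program is crashing.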

I used gdb to inspect the stack of one of them:
[nieyy@redhat-72 scripts]$ gdb 
/home/nieyy/trafodion_build/incubator-trafodion-stable-1.1/core/sql/lib/linux/64bit/debug/mxssmp
 ./core.mxssmp.141469
...
(gdb) where
#0  0x000000000044166c in ProcessStats::getHeap (this=0x2000) at 
../runtimestats/SqlStats.h:271
#1  0x000000000043990a in StatsGlobals::removeProcess (this=0x10000000, 
pid=65536, calledAtAdd=0) at ../runtimestats/SqlStats.cpp:276
#2  0x0000000000439e05 in StatsGlobals::checkForDeadProcesses (this=0x10000000, 
myPid=141469) at ../runtimestats/SqlStats.cpp:382
#3  0x00000000004440be in SsmpGlobals::work (this=0x7f062660c7e8) at 
../runtimestats/ssmpipc.cpp:582
#4  0x000000000042f06a in runServer (argc=1, argv=0x7fff5b0e5a48) at 
../bin/ex_ssmp_main.cpp:259
#5  0x000000000042eb12 in main (argc=1, argv=0x7fff5b0e5a48) at 
../bin/ex_ssmp_main.cpp:127
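
The same kind of backtrace can be captured for every thread without an 
interactive session, which is convenient for attaching to a bug report. This is 
a minimal sketch: dump_bt and the GDB override variable are illustrative names, 
and the executable and core paths are whatever applies on the machine in 
question.

```shell
# Write backtraces for all threads of a core file to <core>.bt.txt.
# dump_bt is a hypothetical helper: dump_bt <executable> <core-file>
# GDB may be set to override the debugger binary (an assumption of this sketch).
dump_bt() {
    exe="$1"
    core="$2"
    # -batch exits after running the commands given with -ex
    "${GDB:-gdb}" -batch -ex "thread apply all bt full" "$exe" "$core" \
        > "$core.bt.txt" 2>&1
}
```

Called with the debug mxssmp binary and one of the core files, this leaves the 
full backtrace, including local variables, in a text file next to the core.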

I then searched Google and found 
https://bugs.launchpad.net/trafodion/+bug/1368891, which looks similar. That 
report says the bug was fixed in v0.9, but my version is 1.1.

So could you kindly help me solve this problem? I can't find any more useful 
information via Google.

Thanks a lot.

