Author Topic: A problem half a year in the making...  (Read 2844 times)

Psibyre

  • Traveller
  • *
  • Posts: 23
    • View Profile
A problem half a year in the making...
« on: June 30, 2006, 03:14:49 pm »
Well, for half a year, I couldn't be arsed about this because of school and the like.  But, now that school is out, I can be arsed.  Before, I had hoped this problem would go away with time.  It isn't...

Well, my PS dies everytime at startup.  Here is the debugger output.

Quote
Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread -1208543536 (LWP 3377)]
CS::Debug::AssertMessage (expr=0x835d1c8 "*(CookieType*)p == startCookie",
    filename=0x835d1a9 "libs/csutil/ptmalloc_wrap.cpp", line=98,
    msg=0x835d160 "Memory block has wrong cookie (was probably allocated in another module)") at libs/csutil/csassert.cpp:75
75
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
CS::Debug::AssertMessage (
    expr=0x835d22c "*(CookieType*)((uint8*)P + n) == endCookie",
    filename=0x835d1a9 "libs/csutil/ptmalloc_wrap.cpp", line=103,
    msg=0x835d1e8 "Memory block has wrong cookie (probably corrupted by an overflow)") at libs/csutil/csassert.cpp:75
75
(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0810c74a in mspace_free (msp=0xb7f87028, mem=0xb7f4ef50)
    at libs/csutil/dlmalloc.c:5018
75
(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x0810c74a in mspace_free (msp=0xb7f87028, mem=0xb7f4ef50)
    at libs/csutil/dlmalloc.c:5018
5018                    unlink_chunk(fm, p, prevsize);
Current language:  auto; currently c
(gdb) bt
#0  0x0810c74a in mspace_free (msp=0xb7f87028, mem=0xb7f4ef50)
    at libs/csutil/dlmalloc.c:5018
#1  0x080e06d4 in ptfree (mem=0xb7f4ef50)
    at libs/csutil/ptmalloc/ptmalloc3.c:875
#2  0x080daf3f in ptfree (P=0xb7f4ef58) at libs/csutil/ptmalloc_wrap.cpp:106
#3  0x0804c275 in operator delete (p=0xb7f4ef58)
    at /home/chimp/dev/cs/include/cssysdef.h:716
#4  0x0462bf3a in CalLoader::loadCoreAnimation ()
    at ./include/csutil/reftrackeraccess.h:45
#5  0x0462ce42 in CalLoader::loadCoreAnimation ()
    at ./include/csutil/reftrackeraccess.h:45
#6  0x02412fe3 in CS::Plugin::SprCal3d::csSpriteCal3DMeshObjectFactory::LoadCoreAnimation () at ./include/csutil/reftrackeraccess.h:45
Animation () at ./include/csutil/reftrackeraccess.h:45
#7  0x0193522d in CS::Plugins::SprCal3dLoader::csSpriteCal3DFactoryLoader::Parse () at ./include/csutil/reftrackeraccess.h:45
#8  0x00adf747 in csLoader::LoadMeshObjectFactory ()
    at ./include/csutil/reftrackeraccess.h:45
#9  0x00ae5772 in csLoader::LoadLibrary ()
    at ./include/csutil/reftrackeraccess.h:45
#10 0x00ae786a in csLoader::Load () at ./include/csutil/reftrackeraccess.h:45
#11 0x00ad911c in csLoader::Load () at ./include/csutil/reftrackeraccess.h:45
#12 0x00ad95cc in csLoader::Load () at ./include/csutil/reftrackeraccess.h:45
#13 0x0805fedd in ClientCacheManager::LoadNewFactory (this=0xae028a90,

Your thoughts?  If you look closely, it complains about a single line at libs/csutil/dlmalloc.c:5018

5018 is: unlink_chunk(fm, p, prevsize);

And another prominent thing is this: #0  0x0810c74a in mspace_free (msp=0xb7f87028, mem=0xb7f4ef50)
    at libs/csutil/dlmalloc.c:5018

I think that line 5018 is my problem, yet I dont know what to do.


Edited to quote the debug output, instead of using rafb.  That expires in only a day.  --DaveG
« Last Edit: June 30, 2006, 03:38:42 pm by DaveG »

DaveG

  • Forum Addict
  • *
  • Posts: 2058
    • View Profile
Re: A problem half a year in the making...
« Reply #1 on: June 30, 2006, 03:18:51 pm »
You're hitting ptmalloc crap.  Make sure this is a build of latest PS CVS + CS SVN 24805 + CEL SVN 2018.  If it's not, get exactly those revisions and recompile clean.  If you're still getting ptmalloc barfs you're hitting a pretty bad problem in CS that we'll need to get res to fix...

::  PlaneShift Team Programmer  ::

Psibyre

  • Traveller
  • *
  • Posts: 23
    • View Profile
Re: A problem half a year in the making...
« Reply #2 on: June 30, 2006, 03:20:21 pm »
Yep, latest revisions.  Followed the build guide exactly.

http://planeshift.cvs.sourceforge.net/*checkout*/planeshift/planeshift/docs/compiling.html Used this

DaveG

  • Forum Addict
  • *
  • Posts: 2058
    • View Profile
Re: A problem half a year in the making...
« Reply #3 on: June 30, 2006, 03:30:44 pm »
If you used those exact numbers, something is odd with your system.  Please post system specs, namely your OS/distro, CPU, and compiler.

Edited to add:
Check that you have those revisions (those revisons, not latest) as of today, as we've changed them a few times recently.
« Last Edit: June 30, 2006, 03:34:07 pm by DaveG »

::  PlaneShift Team Programmer  ::

Psibyre

  • Traveller
  • *
  • Posts: 23
    • View Profile
Re: A problem half a year in the making...
« Reply #4 on: June 30, 2006, 03:46:08 pm »
Linux localhost.localdomain 2.6.17-1.2139_FC5 #1 Fri Jun 23 12:40:16 EDT 2006 i686 i686 i386 GNU/Linux

Fedora Core 5.

Intel P4 1.8 GHz.

ATI Radeon 9700 Pro.

gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Quote
[chimp@localhost planeshift]$ cat CVS/Tag
D2006.06.30.00.00.00

[chimp@localhost cs]$ svn info
Path: .
URL: https://svn.sourceforge.net/svnroot/crystal/CS/trunk
Repository Root: https://svn.sourceforge.net/svnroot/crystal
Repository UUID: 8cc4aa7f-3514-0410-904f-f2cc9021211c
Revision: 24805
Node Kind: directory
Schedule: normal
Last Changed Author: caedesv
Last Changed Rev: 24805
Last Changed Date: 2006-06-26 22:05:49 -0400 (Mon, 26 Jun 2006)
Properties Last Updated: 2006-06-29 21:05:40 -0400 (Thu, 29 Jun 2006)

[chimp@localhost cel]$ svn info
Path: .
URL: https://svn.sourceforge.net/svnroot/cel/cel/trunk
Repository Root: https://svn.sourceforge.net/svnroot/cel
Repository UUID: 72a1d917-3814-0410-aa68-8e5598793349
Revision: 2018
Node Kind: directory
Schedule: normal
Last Changed Author: dfryer
Last Changed Rev: 2018
Last Changed Date: 2006-06-28 04:08:33 -0400 (Wed, 28 Jun 2006)
Properties Last Updated: 2006-06-29 21:08:14 -0400 (Thu, 29 Jun 2006)
« Last Edit: June 30, 2006, 03:47:45 pm by Psibyre »

DaveG

  • Forum Addict
  • *
  • Posts: 2058
    • View Profile
Re: A problem half a year in the making...
« Reply #5 on: June 30, 2006, 03:49:11 pm »
Damn.

Ok, we're going to have to grab res... though I think res is going on vacation now or something...

Edited to add:
Confirmed.  Other people with GCC 4 are complaining of this with current CS.

After futzing with a few things, even if we bypass the asserts it still dies.  We'll have to wait until res gets back to fix this.  Linux people who get this will have to play using the Linux package after it's released.
« Last Edit: June 30, 2006, 04:32:40 pm by DaveG »

::  PlaneShift Team Programmer  ::

Bereror

  • Hydlaa Notable
  • *
  • Posts: 773
    • View Profile
    • Planeshift API
Re: A problem half a year in the making...
« Reply #6 on: June 30, 2006, 05:44:32 pm »
One option would be rebuilding everything without the ptmalloc support. Configure cs with the --disable-ptmalloc switch and rebuild cs, cel and planeshift. I also had to modify cs/libs/csutil/Jamfile and comment out the line with "SubInclude TOP libs csutil ptmalloc ;" when I tried it last time, but it might be changed now.
PlaneShift Sources
PlaneShift API
"Words never spoken
Are the strongest resounding"

DaveG

  • Forum Addict
  • *
  • Posts: 2058
    • View Profile
Re: A problem half a year in the making...
« Reply #7 on: June 30, 2006, 06:04:06 pm »
It compiles for him, so just disabling it should hopefully work.

::  PlaneShift Team Programmer  ::

Psibyre

  • Traveller
  • *
  • Posts: 23
    • View Profile
Re: A problem half a year in the making...
« Reply #8 on: June 30, 2006, 06:28:48 pm »
Yeah, I am compiling with that switch on now.  If this works I will give Bereror's character all of the money I have.  It probably isnt a lot though.  I don't remember what I have.

Oh no, still I get a segmentation fault.

Quote
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208985904 (LWP 29071)]
0x4fb06411 in free () from /lib/libc.so.6

(gdb) bt
#0  0x4fb06411 in free () from /lib/libc.so.6
#1  0x412d76c1 in operator delete () from /usr/lib/libstdc++.so.6
#2  0x0238af3a in CalLoader::loadCoreAnimation ()
    at ./include/csutil/reftrackeraccess.h:45
#3  0x0238be42 in CalLoader::loadCoreAnimation ()
    at ./include/csutil/reftrackeraccess.h:45
#4  0x02eb8b31 in CS::Plugin::SprCal3d::csSpriteCal3DMeshObjectFactory::LoadCoreAnimation () at ./include/csutil/reftrackeraccess.h:45
#5  0x0066eead in CS::Plugins::SprCal3dLoader::csSpriteCal3DFactoryLoader::Parse () at ./include/csutil/reftrackeraccess.h:45
#6  0x00291ffb in csLoader::LoadMeshObjectFactory ()
    at ./include/csutil/reftrackeraccess.h:45
#7  0x00297fce in csLoader::LoadLibrary ()
    at ./include/csutil/reftrackeraccess.h:45
#8  0x00299fff in csLoader::Load () at ./include/csutil/reftrackeraccess.h:45
#9  0x0028baba in csLoader::Load () at ./include/csutil/reftrackeraccess.h:45
#10 0x0028bf6a in csLoader::Load () at ./include/csutil/reftrackeraccess.h:45
#11 0x0805f8c9 in ClientCacheManager::LoadNewFactory (this=0x98e2640,
    filename=0x98fae48 "/planeshift/models/stonebm/stonebm.cal3d")
    at src/client/clientcachemanager.cpp:91
#12 0x08095a48 in psEngine::PreloadModels (this=0x97f0488)
    at src/client/psengine.cpp:1352
#13 0x08087dd0 in psEngine::HandleEvent (this=0x97f0488, ev=@0x98b41ac)

Hot damn, I was so close I guess.  The problems with free() and delete() were the same ones that I had half a year a go.  I fixed the ptmalloc crap, and now I am back where I started...
« Last Edit: June 30, 2006, 07:33:40 pm by Psibyre »

Psibyre

  • Traveller
  • *
  • Posts: 23
    • View Profile
Re: A problem half a year in the making...
« Reply #9 on: July 01, 2006, 06:36:32 pm »
Hmm, I don't exactly want to double post, but the subject matter is different than my previous post.

I tried compiling statically (What a looong compile that was) and I don't get this error.  I can even get past the loading screen and attempt to join a server.  Sad thing is, it seems a bunch of plugins don't get loaded properly.  For instance, the stonebreaker's skin is instead a checkerboard.  And when I tried to join someone's server (Not main, that's down) I get a fail message saying, "Failed to load a widget."  I will try a clean static compile now, because yesterday I was compiling pretty much everything over and over again in the same folder, so it might have gotten confused.

 :beta:

Wired_Crawler

  • Hydlaa Citizen
  • *
  • Posts: 429
    • View Profile
Re: A problem half a year in the making...
« Reply #10 on: July 01, 2006, 10:03:58 pm »
"Failed to load a widget."

This one may be easy to fix. Look carefuly, what widget fails to load and replace appropriate xml file in data/gui with version from cvs repository. It usually works.

And about problems with compiling recent version of PS suite - I was unable to build working application with gcc 3.4.5, it was crashing every time when cal3d library was touched (PS, viewmesh...), similarly like in last bt from Psibyre. I have switched to gcc 4.0.1 and it gives me much better results. The server is quite unstable though, and I'm not sure what can be a reason (crashing mostly in common/net/message.h).
"Close the world, txEn eht nepO."

DaveG

  • Forum Addict
  • *
  • Posts: 2058
    • View Profile
Re: A problem half a year in the making...
« Reply #11 on: July 02, 2006, 02:24:44 am »
Wired_Crawler:  If message.h is crashing, that's a big problem.  It'd like to see any info you have on that.

::  PlaneShift Team Programmer  ::

Wired_Crawler

  • Hydlaa Citizen
  • *
  • Posts: 429
    • View Profile
Re: A problem half a year in the making...
« Reply #12 on: July 02, 2006, 01:51:20 pm »
I start server, load npcroom and podium (or all maps), spawn NPCs, unlocking server. Then I log in. Then I just run around npcroom, targeting NPCs, saying hello to them, going near them and then away (out of visibility range), sometimes I enter admin mode, sometimes teleport to other map and back etc, etc... I don't do any unusual things. It seems crashes happen randomly, I can't find any regularity. I use the same machine for server and client, CVS version from July the 1st.

Here are two backtraces, they differ slightly, I can't find any relations between my actions and those cases:

Case 1:
Code: [Select]
Core was generated by `./psserver'.
Program terminated with signal 11, Segmentation fault.
(...)
#0  0x08070736 in ~csSyncRefCount (this=0xb07d7aa8) at ./src/common/net/message.h:54
54        virtual ~csSyncRefCount () {}
(gdb) bt
#0  0x08070736 in ~csSyncRefCount (this=0xb07d7aa8) at ./src/common/net/message.h:54
#1  0x08070860 in ~MsgEntry (this=0xb07d7aa8) at ./src/common/net/message.h:192
#2  0x080706ec in csSyncRefCount::Delete (this=0xb07d7aa8) at ./src/common/net/message.h:51
#3  0x08053a74 in csSyncRefCount::DecRef (this=0xb07d7aa8) at ./src/common/net/message.h:76
#4  0x082c3ee1 in csRef<MsgEntry>::operator= (this=0xb4a59408, newobj=0x0)
    at /mnt/lin_g/cvs_planeshift/CS/include/csutil/ref.h:235
#5  0x082c367a in EventManager::Run (this=0xb71a09a0) at src/common/util/eventmanager.cpp:193
#6  0x08169a4e in csPosixThread::ThreadRun () at ./src/common/net/sockuni.h:56
#7  0x00000000 in ?? ()

Case 2:
Code: [Select]
Program terminated with signal 11, Segmentation fault.
(...)
#0  0x08070851 in ~MsgEntry (this=0xaffc2280) at ./src/common/net/message.h:191
191             free ((void*) bytes);
(gdb) bt
#0  0x08070851 in ~MsgEntry (this=0xaffc2280) at ./src/common/net/message.h:191
#1  0x080706ec in csSyncRefCount::Delete (this=0xaffc2280) at ./src/common/net/message.h:51
#2  0x08053a74 in csSyncRefCount::DecRef (this=0xaffc2280) at ./src/common/net/message.h:76
#3  0x082c3ee1 in csRef<MsgEntry>::operator= (this=0xb4a71408, newobj=0x0)
    at /mnt/lin_g/cvs_planeshift/CS/include/csutil/ref.h:235
#4  0x082c367a in EventManager::Run (this=0xb71b89a0) at src/common/util/eventmanager.cpp:193
#5  0x08169a4e in csPosixThread::ThreadRun () at ./src/common/net/sockuni.h:56
#6  0x00000000 in ?? ()

And the following is another crash, probably completely unrelated (but trace starts from sockuni.h too). It happens after I try to spawn any kind of money, other items are OK (doing /admin, then /item, then selecting coins/hexa and clcking spawn). I am adding it for you so you can compare behaviour of your installation, maybe something is wrong with mine:

Code: [Select]
Core was generated by `./psserver'.
Program terminated with signal 11, Segmentation fault.
(...)
#0  0x0805396c in csRefCount::IncRef (this=0xdddddddd) at /mnt/lin_g/cvs_planeshift/CS/include/csutil/refcount.h:66
66          csRefTrackerAccess::TrackIncRef (this, ref_count);
(gdb) bt
#0  0x0805396c in csRefCount::IncRef (this=0xdddddddd) at /mnt/lin_g/cvs_planeshift/CS/include/csutil/refcount.h:66
#1  0x080846f3 in csRef<psItemSafe>::operator= (this=0xb26b69d8, newobj=0xdddddddd)
    at /mnt/lin_g/cvs_planeshift/CS/include/csutil/ref.h:234
#2  0x081e7cfb in psDelaySave (this=0xb26b6980, length=500, item=0xdddddddd) at src/server/bulkobjects/psitem.cpp:2454
#3  0x081e7da9 in psItem::Save (this=0xb26db334) at src/server/bulkobjects/psitem.cpp:440
#4  0x08118a58 in AdminManager::SpawnItemInv (this=0xb6fbe3e0, me=0xb26c8168, msg=@0xb4affd8c, client=0xb6c31298)
    at src/server/adminmanager.cpp:3255
#5  0x08120084 in AdminManager::HandleMessage (this=0xb6fbe3e0, me=0xb26c8168, client=0xb6c31298)
    at src/server/adminmanager.cpp:188
#6  0x082497c3 in MsgHandler::Publish (this=0xb72479a0, me=0xb26c8168) at src/common/net/msghandler.cpp:94
#7  0x082c3524 in EventManager::Run (this=0xb72479a0) at src/common/util/eventmanager.cpp:161
#8  0x08169a4e in csPosixThread::ThreadRun () at ./src/common/net/sockuni.h:56
#9  0x00000000 in ?? ()

EDIT:
-------

I have recompiled CS and CEL in debug mode to have more complete backtraces. After doing this server was crashing during any network activity, first "ping" sent from client causes this:

Code: [Select]
Core was generated by `./psserver'.
Program terminated with signal 6, Aborted.
(...)
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7d66ef1 in raise () from /lib/tls/libc.so.6
#2  0xb7d6883b in abort () from /lib/tls/libc.so.6
#3  0xb7d9cff5 in __fsetlocking () from /lib/tls/libc.so.6
#4  0xb7da3587 in malloc_usable_size () from /lib/tls/libc.so.6
#5  0xb7da3a02 in free () from /lib/tls/libc.so.6
#6  0x08072431 in ~MsgEntry (this=0xb402c760) at ./src/common/net/message.h:191
#7  0x080722cc in csSyncRefCount::Delete (this=0xb402c760) at ./src/common/net/message.h:51
#8  0x08055654 in csSyncRefCount::DecRef (this=0xb402c760) at ./src/common/net/message.h:76
#9  0x082480f9 in ~csRef (this=0xb6b3c2d8) at /mnt/lin_g/cvs_planeshift/CS/include/csutil/ref.h:184
#10 0x08245b26 in NetBase::BuildMessage (this=0xb7024ba0, pkt=0xb40303b8, connection=@0xb6b3c34c, addr=0xb6b3c354)
    at src/common/net/netbase.cpp:837
#11 0x082460c2 in NetBase::CheckIn (this=0xb7024ba0) at src/common/net/netbase.cpp:256
#12 0x082465da in NetBase::ProcessNetwork (this=0xb7024ba0, timeout=34514) at src/common/net/netbase.cpp:136
#13 0x080c298a in NetManager::Run (this=0xb7024ba0) at src/server/netmanager.cpp:295
#14 0x0816af81 in csPosixThread::ThreadRun (param=0xb724ffe0) at libs/csutil/generic/cspthrd.cpp:531
#15 0xb7fabc40 in start_thread () from /lib/tls/libpthread.so.0
#16 0xb7e0b0ee in clone () from /lib/tls/libc.so.6

Setting MALLOC_CHECK_=0 before starting server was preventing those crashes, it also seems to prevent crashes in message.h (or at least cause them to happen more rarely).
With CS/CEL built in optimized mode MALLOCK_CHECK_=0 does not help.
« Last Edit: July 02, 2006, 05:41:36 pm by Wired_Crawler »
"Close the world, txEn eht nepO."

DaveG

  • Forum Addict
  • *
  • Posts: 2058
    • View Profile
Re: A problem half a year in the making...
« Reply #13 on: July 03, 2006, 10:39:09 pm »
Any time you see pointers to 0xdddddddd that means the object containing them was deleted already, and accessed afterwards.  I fixed that issue just now.  (money items are deleted when adding to inventory, as they're stored as a number instead of item objects; it wasn't NULLing the pointer outside of that scope and it needed to NULL check before trying to dereference)

Not sure what's going on with your message dtors there, however.

::  PlaneShift Team Programmer  ::

Wired_Crawler

  • Hydlaa Citizen
  • *
  • Posts: 429
    • View Profile
Re: A problem half a year in the making...
« Reply #14 on: July 05, 2006, 01:08:45 am »
(Sorry Psibyre for hijacking your thread)

Segfaults seem to happen during execution of destructors ~MsgEntry or (following?) ~csSyncRefCount.

Code: [Select]
#0  0x08070851 in ~MsgEntry (this=0xaffc2280) at ./src/common/net/message.h:191
191             free ((void*) bytes);

I have commented out the line visible above and... all started to work smoothly. If the change causes any memory leaks, they are too small for me to notice (how to check this?). So my guess is, that either freeing the memory in ~MsgEntry is not needed (auto-done when object is deleted? explicit "free" is duplicate free?) or size of memory allocated for "bytes" is calculated incorrectly.
Or maybe it happens only in my particular environment ?
You are programmer(s), you should know better...
"Close the world, txEn eht nepO."