PlaneShift
Support => Server Status => Topic started by: acraig on October 26, 2005, 08:46:26 pm
-
For all of you that\'s wondering what\'s going on with the server here\'s an example of what we are trying to deal with. After a crash a core dump is generated. This is a large file ( 500+MB ) that contains what the program was doing up to that point. So after every crash we can examine that and try to figure out what went wrong. Think of it as a black box like on an airplane.
This core file contains a back trace which is how the programming was flowing before the crash and looks something like this:
0 0x08255a55 in psCharacter::CheckQuestCompleted(psQuest*) (this=0xabb0228,
quest=0x8674770) at src/server/bulkobjects/pscharacter.cpp:2820
2820 if (IsQuestAssigned(quest->GetID())->status == \'C\')
(gdb) bt
#0 0x08255a55 in psCharacter::CheckQuestCompleted(psQuest*) (this=0xabb0228,
quest=0x8674770) at src/server/bulkobjects/pscharacter.cpp:2820
#1 0x0823127a in VerifyQuestCompletedResponseOp::Run(gemNPC*, Client*, NpcResponse*) (this=0x8676b90, who=0xaecaf48, target=0x11348488, owner=0x86763e8)
at src/server/bulkobjects/dictionary.cpp:957
#2 0x08230f92 in NpcResponse::ExecuteScript(Client*, gemNPC*) (
this=0x86763e8, client=0x11348488, target=0xaecaf48)
at src/server/bulkobjects/dictionary.cpp:902
#3 0x081b2a89 in PlayerToNPCExchange::HandleAccept(Client*) (this=0x13774c30,
client=0x11348488) at src/server/exchangemanager.cpp:1200
#4 0x081b3926 in ExchangeManager::HandleMessage(MsgEntry*, Client*) (
this=0x8488c30, me=0xc438a78, client=0x11348488)
at src/server/exchangemanager.cpp:1369
#5 0x082fc3e9 in MsgHandler::Publish(MsgEntry*) (this=0x8529cb8, me=0xc438a78)
at src/common/net/msghandler.cpp:97
#6 0x0830c3f6 in EventManager::Run() (this=0x8529cb8)
at src/common/util/eventmanager.cpp:154
#7 0x08282b4c in csPosixThread::ThreadRun(void*) (param=0x84a3980)
at libs/csutil/generic/cspthrd.cpp:479
#8 0x40021941 in pthread_start_thread () from /lib/i686/libpthread.so.0
Most crashes are a result of null pointer errors. So in this case the line:
if (IsQuestAssigned(quest->GetID())->status == \'C\')
has 2 potential pointers to check, quest and the result from IsQuestAssigned()
(gdb) print quest
$1 = (psQuest *) 0x8674770
Seems ok.
(gdb) print quest->id
$2 = 13479
Seems ok
So lets look at the code a bit here now:
bool psCharacter::CheckQuestCompleted(psQuest *quest)
{
if (IsQuestAssigned(quest->GetID())->status == \'C\')
return true;
else return false;
}
So now we should check IsQuestAssigned to see what that is doing:
QuestAssignment *psCharacter::IsQuestAssigned(int id)
{
for (size_t i=0; i {
if (assigned_quests[i]->quest->GetID() == id &&
assigned_quests[i]->status != \'D\')
return assigned_quests[i];
}
return NULL;
}
So this returns NULL if the quest is not found in the assigned quests list.
So lets take a look at that:
(gdb) p assigned_quests
$3 = {count = 5, capacity = 16, threshold = 16, root = 0x12204fa0}
Ok there is only 5 elements in there. Lets take a look and see if we
can find the one with our id in it.
(gdb) p assigned_quests.root[0]->quest->id
$4 = 11
(gdb) p assigned_quests.root[1]->quest->id
$5 = 15
(gdb) p assigned_quests.root[2]->quest->id
$6 = 19
(gdb) p assigned_quests.root[3]->quest->id
$7 = 20
(gdb) p assigned_quests.root[4]->quest->id
$8 = 23
Aha so we can see that our quest we are looking for is not found so
IsQuestAssigned(quest->GetID() is returning NULL therefore when we try
to use ->status on this it falls down and goes boom.
Solution?
Fix the
bool psCharacter::CheckQuestCompleted(psQuest *quest) function to check
for null pointer first.
QuestAssignment* assignment = IsQuestAssigned(quest->GetID());
if (assignment && assignment->status == \'C\')
...
Now this will make sure assignment exists first before trying to use ->status on it.
So that bug should be fixed ( I hope ) and I can delete that core file and move on to the next one.
-
Next core dump:
#0 0x0816206c in csStringBase::Length() const (this=0x718)
at /home/planeshift/psserver/CS/include/csutil/csstring.h:362
#1 0x0827258f in csStringBase::Append(csStringBase const&, unsigned) (
this=0x413214ac, iStr=@0x718, iCount=4294967295)
at libs/csutil/csstring.cpp:283
#2 0x08163a7e in csStringBase& csStringBase::Replace(csString const&) (this=0x413214ac, val=@0x718)
at /home/planeshift/psserver/CS/include/csutil/csstring.h:572
#3 0x081623aa in csString::operator=(csString const&) (this=0x413214ac,
copy=@0x718) at /home/planeshift/psserver/CS/include/csutil/csstring.h:1143
#4 0x08194f0e in GuildManager::SendMemberData(Client*, bool) (this=0x84be920,
client=0xdb30578, onlineOnly=true) at src/server/guildmanager.cpp:862
#5 0x08193267 in GuildManager::HandleSubscribeGuildData(Client*, iDocumentNode*) (this=0x84be920, client=0xdb30578, root=0xd2a2838)
at src/server/guildmanager.cpp:470
#6 0x08192a62 in GuildManager::HandleGUIMessage(psGUIGuildMessage&, Client*) (
this=0x84be920, msg=@0x413216ac, client=0xdb30578)
at src/server/guildmanager.cpp:359
#7 0x081923fb in GuildManager::HandleMessage(MsgEntry*, Client*) (
this=0x84be920, me=0xcf2ab80, client=0xdb30578)
at src/server/guildmanager.cpp:240
#8 0x082fc3e9 in MsgHandler::Publish(MsgEntry*) (this=0x852bc98, me=0xcf2ab80)
at src/common/net/msghandler.cpp:97
---Type to continue, or q to quit---
#9 0x0830c3f6 in EventManager::Run() (this=0x852bc98)
at src/common/util/eventmanager.cpp:154
#10 0x08282b4c in csPosixThread::ThreadRun(void*) (param=0x8491990)
at libs/csutil/generic/cspthrd.cpp:479
#11 0x40021941 in pthread_start_thread () from /lib/i686/libpthread.so.0
Looking at this I can guess that \'this\' is screwed up. 0x718 does not look like a valid pointer. Looking down the trace I can see this pointer all the way down until #4. So we go down to that frame in the trace.
#4 0x08194f0e in GuildManager::SendMemberData(Client*, bool) (this=0x84be920,
client=0xdb30578, onlineOnly=true) at src/server/guildmanager.cpp:862
862 lastOnline = memberClient->GetCharacterData()->lastlogintime;
(gdb) print memberClient
$14 = (Client *) 0xcf06ba8
That seems ok. Lets check the actor on this client since the character data is attached to the actor.
(gdb) print memberClient->actor
$15 = (gemActor *) 0x0
Aha! No actor therefore trying to get the last logintime will be messed up. The most likely reason for this is that the client has made a connection and has not fully logged on yet to get their actor created. So we need to see if the client was ready or not.
(gdb) print memberClient->ready
$16 = false
Ok so this confirms my thoughts. The client is not yet ready to start using
it\'s actor data.
So lets go look at the code here:
Client * memberClient = clients->FindPlayer(member->char_id);
if (memberClient != NULL)
This is trying to make sure that the client we are looking for is online. What we need to do here is to make sure that the client exists AND is ready.
So we can change this to:
if (memberClient != NULL && memberClient->IsReady() )
So another rebuild and this core dump can be deleted.
Deactivated smileys ;) --Kary
-
Seems like most of our crashes are due to bad pointers. I\'ve updated the CVS docs, to include guide lines and precautions.
The lesson is that we should test the pointer returned by a function before using it for NULL, if there\'s a chance of it being NULL.
-
*collapses to floor, writhing in pain and screaming:*
MY BRAAAAAIN!!!!!!
more power to you guys. that looks insanely difficult. i\'m sure i\'m not alone in thinking that, either. XD
-
acraig, thank you for posting this. It does help that we now have some idea of just what you folks have been doing over the last few days. We all knew you were \"troubleshooting\", but having some concrete examples helps everyone to understand how excruciating such work can be.
It also helps to explain why the server has bouncing like a big red rubber ball for the past few days. Knowing what I do of programming (which is just enough to know that I could never have the patience for it), I do not envy you your task.
Good luck, and don\'t forget to breathe.
-
That is very interesting. With so many variables it must be hard to keep track of the flow. Do you guys write down a logic flow chart ahead of time before creatig the code or do you kind of expand on the core for each function on the fly?
I would think a small \"this has to happen before this happens\" kind of chart might be helpful to avoid events taking place out of order, but I am no coder nor do I envy your position. I do however appreciate the time and effort you guys and gals have put into it.
Thanks and good luck...Jinkeen (newbie)
-
WOW...i for one am very impressed...My brain sees letters and numbers but yours sees the world of plane shift...amazing...hopefully everyone will read this and stop wondering why the server is down so long...i have found new respect for programmers...
-
(gdb) bt
#0 0x08224eda in QuestionManager::HandleMessage(MsgEntry*, Client*) (
this=0x85c9f80, me=0xeb396c8, client=0xe6bddd8)
at src/server/questionmanager.cpp:117
117 question->event->valid = false; // This keeps the cancellation timeout from firing.
#1 0x082fc411 in MsgHandler::Publish(MsgEntry*) (this=0x8523e48, me=0xeb396c8)
at src/common/net/msghandler.cpp:97
#2 0x0830c41e in EventManager::Run() (this=0x8523e48)
at src/common/util/eventmanager.cpp:154
#3 0x08282b74 in csPosixThread::ThreadRun(void*) (param=0x8491990)
at libs/csutil/generic/cspthrd.cpp:479
#4 0x40021941 in pthread_start_thread () from /lib/i686/libpthread.so.0
Again, first thing is to check the null pointers.
1) question
2) question->event
gdb) print question
$3 = (PendingQuestion *) 0xe42cc00
(gdb) print question->event
$4 = (psQuestionCancelEvent *) 0x0
Well, there is our null pointer. So trying to use ->valid on this will cause
the server to crash. So now we look at the code to see what it\'s trying to do:
...
PendingQuestion *question = questions.Find(&key);
if (!question)
{
Error2(\"Received psQuestionResponseMsg from client %u that was not questioned.\",me->clientnum);
return;
}
question->event->valid = false; // This keeps the cancellation timeout from firing.
...
So we need to check to make sure the event is ok as well before doing the call to question->event->valid:
if ( !question->event )
{
questions.Delete(question);
delete question;
return;
}
Rebuild and delete core file.
-
I agree with the replies to this thread, I thoroughly enjoy playing PS and seeing what\'s underneath is truly mind blowing. You guys are very talented and to think that you do this in your spare time makes me appreciate PS so much more.
Keep up the good work and I promise never to scream at my monitor again :)
-
Ah yes, Null Pointer Exceptions, I just LOVE them so much...
Ack.
Probably one of the hardest errors to track down.
At least you guys have a way of tracing these errors... probably saves a lot of time.
And yes... it is always good to check anything that could ever be NULL... even if you don\'t think it will. But you guys know that, of course.
The server seems to be getting better day by day, we all have to thank you for taking the time to work through this, I know it can be rough looking through all that code.
Now you just have to get NPC mobs working again :)
-
Alas, my head is too sore to continue slamming it into the wall... Maybe it is just to be assumed that I am not destined to play tonight...
OK, well good luck with the servers, and I harbor no jealousy to the plight that you are abviously experiencing!
-
(http://www.planeshift3d.com/wbboard/images/icons/thumb_up.gif) Excellent basic gdb tutorial :)
Helpful also or especially usefule for people who know what the problem is, but are used to a fance GUI and doubleclicks and whatnotmore to track problems down.
-
Originally posted by DivineLight
Seems like most of our crashes are due to bad pointers. I\'ve updated the CVS docs, to include guide lines and precautions.
The lesson is that we should test the pointer returned by a function before using it for NULL, if there\'s a chance of it being NULL.
what have been deemed \"proper\" C coding practices suggest checking for Null on any returned pointer. games should only make an exception on a case such as incredibly time sensitive operations.
edit:
not that I\'m an expert. That\'s just what has been pretty much accepted. even if the NULL is supposed to be propogated up the return stack, the conditional statement will set asside the possibility for any programmer reading the same code later on without the need for confusing comments.
-
Originally posted by Coroveir
Ah yes, Null Pointer Exceptions, I just LOVE them so much...
Ack.
Probably one of the hardest errors to track down.
No, they\'re actually one of the more easy ones to track down, because they take effect immediately. What is a real pain to track down are buffer overflow-induced errors, because these can easily take effect hours or even weeks after the actual error occured, and also in a completely different part of the program, and furthermore depend on the state the program was in when the error occured, as well as the data it contained by that time, which can also change the effect to something completely different that looks like a separate, unrelated bug. Boundary checks are the only thing to effectively prevent these from going undetected, as reproduction of these is near impossible.
Next are pointers that aren\'t initialised at all which just happen to point to something that looks usable most of the time or all the time on a specific machine. Both cases can easily cause silent, creeping data corruption. That\'s why I always initialise pointers, so that they\'re either valid or NULL.
@ ikanreed: That is true. However, it is also true that in highly experimental and evolving code, this practice tends to be ignored. :) AFAICS, these checks usually are introduced only after the code has been mostly frozen. With the way PS works, this method seems to not be viable, as DivineLight noted.
Also, for speed you can in some cases omit the check if you have other means in place to ensure that the pointer is valid in all cases, like the object that contains the pointer being created only if the pointer is valid. However, in this case you must make sure that those who come after you won\'t accidently invelidate this assumption by changing the crucial methods.
-
Andrew, your patience and steadfastness, in both
troubleshooting the PS code and also putting up with
a horde of \"gotta-get-my-PS-fix\" gamers (myself
included), is nothing short of astounding. Thanks
for the enlightening primer, and Bless You for your
contributions to the game and to the PS community
in general!
-
Originally posted by Pogopuschel
(http://www.planeshift3d.com/wbboard/images/icons/thumb_up.gif) Excellent basic gdb tutorial :)
Helpful also or especially usefule for people who know what the problem is, but are used to a fance GUI and doubleclicks and whatnotmore to track problems down.
How dare you say basic! All I got out of those entire posts was that something nulled up the server and caused it to crash. My head hurts from seeing so many jumbled letters and numbers, and I think I need to take a motrin now.
Thank you for setting off a 3 hour long headache, small letters and numbers, you never fail!
-
Originally posted by ikanreed
what have been deemed \"proper\" C coding practices suggest checking for Null on any returned pointer. games should only make an exception on a case such as incredibly time sensitive operations.
You are certainly correct ikanreed. But unfortunately, in practical terms, that is a very hard guideline to follow. Maybe it?s just me but there always seems to be something lacking in those pointer-less computer languages.
Just to show you how bad that could be I checked the source for the most uses of pointer references in a single line of code. Of the 25,409 lines of code that specifically use \"->\" ?
the winner is ?
its a tie! At nine pointer references in one line:
client\\pscelclient.cpp:758:
if(pos->entity->pcmesh && pos->entity->pcmesh->GetMesh() && pos->entity->pcmesh->GetMesh()->GetMovable())
common\\net\\netbase.cpp:235:
pkt->packet->pktid, pkt->packet->offset, pkt->clientnum, pkt->packet->pktsize,packetlen, pkt->packet->flags);
At least the pscelclient code is trying to check as it goes along.
-
The true \'proper\' way would be to use exception handling ( the whole try,catch,throw system ). But that would require a fair bit of work and would add a fair bit of bulk. BUT it would make crashing a bit better and more informative. But it\'s just a big headache to add in now.
-
My head hurts...Can\'t you just say that the server has a lot of mixed numbers and letters that need shook up a few times to restart? >_>
-
Wow... I now remeber why I got out of programming and into networking. I respect anyone who has the patients for that stuff... and to think I am going into secruity logs and the such.... Oh well.
Thanks for the updates.
-
acraig: that\'s one reason I like working with Java. Built-in exceptions with informative messages. It\'s rare that I come up against a run-time error I can\'t solve within ten minutes. Logic errors are another story. And yes, I know and agree that Java would not be good for the game... Java\'s graphics are still lagging in speed.
-
Please ignor the complaints about headaches ( even though I am getting one too ). I think this is realy informative for those people who do not appriciate this game enough. I\'ve done some coding, but your my new hero, acraig! :D Keep the info coming!
-
Posting info such as this probably brings a great sense of \"something is being done\" about the game crashes... and it is difficult to always trace, requiring much time and effort. This is in comparison to people saying \"well, it crashes because it\'s pre-alpha\", which makes people think that the devs are just going \"um, server crashed again, should I try tweaking this variable and cross our fingers?\". ha ha
-
Where can I find the code in it\'s entirety, I am sure it is posted somewhere simple that I have not seen.
-
Originally posted by Andrek
Where can I find the code in it\'s entirety, I am sure it is posted somewhere simple that I have not seen.
=> CVS (http://cvs.sourceforge.net/viewcvs.py/planeshift/planeshift/#dirlist) <=
You can browse the code via that link by looking in the src dir, or use a CVS app to download it all and build from it. There are various guides in docs to help you out.
-
I\'m a programmer, and I gotta say, \"what\'s that jarble?\" This would have been much easier using the Python programming language (http://www.python.org/) . It is super easy.
-
Originally posted by defender43
I\'m a programmer, and I gotta say, \"what\'s that jarble?\" This would have been much easier using the Python programming language (http://www.python.org/) . It is super easy.
Beauty is in the eye of the beholder (http://www.angelfire.com/mn2/bjkap/images/beholder2.jpg) :)