Author Topic: Example server crash.  (Read 6417 times)

acraig

  • Administrator
  • Veteran
  • *
  • Posts: 1562
    • View Profile
Example server crash.
« on: October 26, 2005, 01:46:26 pm »
For all of you that\'s wondering what\'s going on with the server here\'s an example of what we are trying to deal with.   After a crash a core dump is generated.  This is a large file ( 500+MB ) that contains what the program was doing up to that point.  So after every crash we can examine that and try to figure out what went wrong.  Think of it as a black box like on an airplane.

This core file contains a back trace which is how the programming was flowing before the crash and looks something like this:
Code: [Select]

0  0x08255a55 in psCharacter::CheckQuestCompleted(psQuest*) (this=0xabb0228,
    quest=0x8674770) at src/server/bulkobjects/pscharacter.cpp:2820
2820        if (IsQuestAssigned(quest->GetID())->status == \'C\')
(gdb) bt
#0  0x08255a55 in psCharacter::CheckQuestCompleted(psQuest*) (this=0xabb0228,
    quest=0x8674770) at src/server/bulkobjects/pscharacter.cpp:2820
#1  0x0823127a in VerifyQuestCompletedResponseOp::Run(gemNPC*, Client*, NpcResponse*) (this=0x8676b90, who=0xaecaf48, target=0x11348488, owner=0x86763e8)
    at src/server/bulkobjects/dictionary.cpp:957
#2  0x08230f92 in NpcResponse::ExecuteScript(Client*, gemNPC*) (
    this=0x86763e8, client=0x11348488, target=0xaecaf48)
    at src/server/bulkobjects/dictionary.cpp:902
#3  0x081b2a89 in PlayerToNPCExchange::HandleAccept(Client*) (this=0x13774c30,
    client=0x11348488) at src/server/exchangemanager.cpp:1200
#4  0x081b3926 in ExchangeManager::HandleMessage(MsgEntry*, Client*) (
    this=0x8488c30, me=0xc438a78, client=0x11348488)
    at src/server/exchangemanager.cpp:1369
#5  0x082fc3e9 in MsgHandler::Publish(MsgEntry*) (this=0x8529cb8, me=0xc438a78)
    at src/common/net/msghandler.cpp:97
#6  0x0830c3f6 in EventManager::Run() (this=0x8529cb8)
    at src/common/util/eventmanager.cpp:154
#7  0x08282b4c in csPosixThread::ThreadRun(void*) (param=0x84a3980)
    at libs/csutil/generic/cspthrd.cpp:479
#8  0x40021941 in pthread_start_thread () from /lib/i686/libpthread.so.0

Most crashes are a result of null pointer errors.  So in this case the line:

if (IsQuestAssigned(quest->GetID())->status == \'C\')

has 2 potential pointers to check, quest and the result from IsQuestAssigned()


(gdb) print quest
$1 = (psQuest *) 0x8674770

Seems ok.

(gdb) print quest->id
$2 = 13479

Seems ok

So lets look at the code a bit here now:
Code: [Select]

bool psCharacter::CheckQuestCompleted(psQuest *quest)
{
    if (IsQuestAssigned(quest->GetID())->status == \'C\')
        return true;
    else return false;
}

So now we should check IsQuestAssigned to see what that is doing:
Code: [Select]

QuestAssignment *psCharacter::IsQuestAssigned(int id)
{
    for (size_t i=0; i    {
        if (assigned_quests[i]->quest->GetID() == id &&
            assigned_quests[i]->status != \'D\')
            return assigned_quests[i];
    }
    return NULL;
}

So this returns NULL if the quest is not found in the assigned quests list.  

So lets take a look at that:
(gdb) p assigned_quests
$3 = {count = 5, capacity = 16, threshold = 16, root = 0x12204fa0}

Ok there is only 5 elements in there.  Lets take a look and see if we
can find the one with our id in it.
Code: [Select]

(gdb) p assigned_quests.root[0]->quest->id
$4 = 11
(gdb) p assigned_quests.root[1]->quest->id
$5 = 15
(gdb) p assigned_quests.root[2]->quest->id
$6 = 19
(gdb) p assigned_quests.root[3]->quest->id
$7 = 20
(gdb) p assigned_quests.root[4]->quest->id
$8 = 23


Aha so we can see that our quest we are looking for is not found so
IsQuestAssigned(quest->GetID() is returning NULL therefore when we try
to use ->status on this it falls down and goes boom.


Solution?
Fix the
bool psCharacter::CheckQuestCompleted(psQuest *quest) function to check
for null pointer first.
Code: [Select]

QuestAssignment* assignment = IsQuestAssigned(quest->GetID());
    if (assignment && assignment->status == \'C\')
    ...
   
Now this will make sure assignment exists first before trying to use ->status on it.

So that bug should be fixed ( I hope ) and I can delete that core file and move on to the next one.

----------
Andrew
"For all I know, she's lying, everyone's lying; welcome to the Internet"

acraig

  • Administrator
  • Veteran
  • *
  • Posts: 1562
    • View Profile
(No subject)
« Reply #1 on: October 26, 2005, 02:09:54 pm »
Next core dump:

Code: [Select]

#0  0x0816206c in csStringBase::Length() const (this=0x718)
    at /home/planeshift/psserver/CS/include/csutil/csstring.h:362
#1  0x0827258f in csStringBase::Append(csStringBase const&, unsigned) (
    this=0x413214ac, iStr=@0x718, iCount=4294967295)
    at libs/csutil/csstring.cpp:283
#2  0x08163a7e in csStringBase& csStringBase::Replace(csString const&) (this=0x413214ac, val=@0x718)
    at /home/planeshift/psserver/CS/include/csutil/csstring.h:572
#3  0x081623aa in csString::operator=(csString const&) (this=0x413214ac,
    copy=@0x718) at /home/planeshift/psserver/CS/include/csutil/csstring.h:1143
#4  0x08194f0e in GuildManager::SendMemberData(Client*, bool) (this=0x84be920,
    client=0xdb30578, onlineOnly=true) at src/server/guildmanager.cpp:862
#5  0x08193267 in GuildManager::HandleSubscribeGuildData(Client*, iDocumentNode*) (this=0x84be920, client=0xdb30578, root=0xd2a2838)
    at src/server/guildmanager.cpp:470
#6  0x08192a62 in GuildManager::HandleGUIMessage(psGUIGuildMessage&, Client*) (
    this=0x84be920, msg=@0x413216ac, client=0xdb30578)
    at src/server/guildmanager.cpp:359
#7  0x081923fb in GuildManager::HandleMessage(MsgEntry*, Client*) (
    this=0x84be920, me=0xcf2ab80, client=0xdb30578)
    at src/server/guildmanager.cpp:240
#8  0x082fc3e9 in MsgHandler::Publish(MsgEntry*) (this=0x852bc98, me=0xcf2ab80)
    at src/common/net/msghandler.cpp:97
---Type to continue, or q to quit---
#9  0x0830c3f6 in EventManager::Run() (this=0x852bc98)
    at src/common/util/eventmanager.cpp:154
#10 0x08282b4c in csPosixThread::ThreadRun(void*) (param=0x8491990)
    at libs/csutil/generic/cspthrd.cpp:479
#11 0x40021941 in pthread_start_thread () from /lib/i686/libpthread.so.0

Looking at this I can guess that \'this\' is screwed up. 0x718 does not look like a valid pointer.  Looking down the trace I can see this pointer all the way down until #4.  So we go down to that frame in the trace.

Code: [Select]

#4  0x08194f0e in GuildManager::SendMemberData(Client*, bool) (this=0x84be920,
    client=0xdb30578, onlineOnly=true) at src/server/guildmanager.cpp:862
862                      lastOnline = memberClient->GetCharacterData()->lastlogintime;


(gdb) print memberClient
$14 = (Client *) 0xcf06ba8

That seems ok.  Lets check the actor on this client since the character data is attached to the actor.

(gdb) print memberClient->actor
$15 = (gemActor *) 0x0

Aha! No actor therefore trying to get the last logintime will be messed up.  The most likely reason for this is that the client has made a connection and has not fully logged on yet to get their actor created.  So we need to see if the client was ready or not.

(gdb) print memberClient->ready
$16 = false

Ok so this confirms my thoughts.  The client is not yet ready to start using
it\'s actor data.

So lets go look at the code here:

Code: [Select]

Client * memberClient = clients->FindPlayer(member->char_id);
if (memberClient != NULL)


This is trying to make sure that the client we are looking for is online.  What we need to do here is to make sure that the client exists AND is ready.

So we can change this to:
if (memberClient != NULL && memberClient->IsReady() )


So another rebuild and this core dump can be deleted.


Deactivated smileys ;) --Kary
« Last Edit: October 26, 2005, 02:14:50 pm by acraig »
----------
Andrew
"For all I know, she's lying, everyone's lying; welcome to the Internet"

DivineLight

  • Hydlaa Resident
  • *
  • Posts: 181
    • View Profile
(No subject)
« Reply #2 on: October 26, 2005, 10:41:30 pm »
Seems like most of our crashes are due to bad pointers. I\'ve updated the CVS docs, to include guide lines and precautions.

The lesson is that we should test the pointer returned by a function before using it for NULL, if there\'s a chance of it being NULL.
Every one will die in pain except those who were less pain for the others, while they were.....

.~\'| MyBlog | FireFox | Gaim | \'~.

TheMinority

  • Hydlaa Resident
  • *
  • Posts: 170
    • View Profile
(No subject)
« Reply #3 on: October 26, 2005, 10:59:33 pm »
*collapses to floor, writhing in pain and screaming:*


MY BRAAAAAIN!!!!!!


more power to you guys. that looks insanely difficult. i\'m sure i\'m not alone in thinking that, either. XD

Sekhemet Basek, Depthseeker in the Explorer\'s Guild

SuburbanPlankton

  • Guest
(No subject)
« Reply #4 on: October 26, 2005, 11:02:32 pm »
acraig, thank you for posting this.  It does help that we now have some idea of just what you folks have been doing over the last few days.    We all knew you were \"troubleshooting\", but having some concrete examples helps everyone to understand how excruciating such work can be.

It also helps to explain why the server has bouncing like a big red rubber ball for the past few days.  Knowing what I do of programming (which is just enough to know that I could never have the patience for it), I do not envy you your task.

Good luck, and don\'t forget to breathe.

Jinkeen

  • Hydlaa Resident
  • *
  • Posts: 68
    • View Profile
(No subject)
« Reply #5 on: October 27, 2005, 12:47:07 am »
That is very interesting.  With so many variables it must be hard to keep track of the flow.  Do you guys write down a logic flow chart ahead of time before creatig the code or do you kind of expand on the core for each function on the fly?

I would think a small \"this has to happen before this happens\" kind of chart might be helpful to avoid events taking place out of order, but I am no coder nor do I envy your position.  I do however appreciate the time and effort you guys and gals have put into it.  

Thanks and good luck...Jinkeen (newbie)

Grndmasta

  • Hydlaa Resident
  • *
  • Posts: 58
    • View Profile
(No subject)
« Reply #6 on: October 27, 2005, 12:53:23 pm »
WOW...i for one am very impressed...My brain sees letters and numbers but yours sees the world of plane shift...amazing...hopefully everyone will read this and stop wondering why the server is down so long...i have found new respect for programmers...
Gian Flyus

Dragon Sibling in the Dragon Council

Purple Dragon

acraig

  • Administrator
  • Veteran
  • *
  • Posts: 1562
    • View Profile
(No subject)
« Reply #7 on: October 27, 2005, 01:39:56 pm »
Code: [Select]

(gdb) bt
#0  0x08224eda in QuestionManager::HandleMessage(MsgEntry*, Client*) (
    this=0x85c9f80, me=0xeb396c8, client=0xe6bddd8)
    at src/server/questionmanager.cpp:117
117         question->event->valid = false;  // This keeps the cancellation timeout from firing.

#1  0x082fc411 in MsgHandler::Publish(MsgEntry*) (this=0x8523e48, me=0xeb396c8)
    at src/common/net/msghandler.cpp:97
#2  0x0830c41e in EventManager::Run() (this=0x8523e48)
    at src/common/util/eventmanager.cpp:154
#3  0x08282b74 in csPosixThread::ThreadRun(void*) (param=0x8491990)
    at libs/csutil/generic/cspthrd.cpp:479
#4  0x40021941 in pthread_start_thread () from /lib/i686/libpthread.so.0


Again, first thing is to check the null pointers.

1) question
2) question->event

gdb) print question
$3 = (PendingQuestion *) 0xe42cc00
(gdb) print question->event
$4 = (psQuestionCancelEvent *) 0x0

Well, there is our null pointer.  So trying to use ->valid on this will cause
the server to crash.  So now we look at the code to see what it\'s trying to do:

Code: [Select]

...
 PendingQuestion *question = questions.Find(&key);
    if (!question)
    {
        Error2(\"Received psQuestionResponseMsg from client %u that was not questioned.\",me->clientnum);
        return;
    }

    question->event->valid = false;  // This keeps the cancellation timeout from firing.
...



So we need to check to make sure the event is ok as well before doing the call to question->event->valid:

Code: [Select]

if ( !question->event )
{

    questions.Delete(question);
    delete question;    
    return;
}


Rebuild and delete core file.
----------
Andrew
"For all I know, she's lying, everyone's lying; welcome to the Internet"

nexevo

  • Wayfarer
  • *
  • Posts: 9
    • View Profile
(No subject)
« Reply #8 on: October 27, 2005, 02:54:56 pm »
I agree with the replies to this thread, I thoroughly enjoy playing PS and seeing what\'s underneath is truly mind blowing. You guys are very talented and to think that you do this in your spare time makes me appreciate PS so much more.

Keep up the good work and I promise never to scream at my monitor again :)
Crystal Knight in Guild Knights

 

Only two things are infinite, the universe and human stupidity, and I\'m not sure about the former - Albert Einstein 1879-1955
 

Coroveir

  • Wayfarer
  • *
  • Posts: 2
    • View Profile
(No subject)
« Reply #9 on: October 27, 2005, 07:33:53 pm »
Ah yes, Null Pointer Exceptions, I just LOVE them so much...

Ack.

Probably one of the hardest errors to track down.

At least you guys have a way of tracing these errors... probably saves a lot of time.

And yes... it is always good to check anything that could ever be NULL... even if you don\'t think it will.  But you guys know that, of course.

The server seems to be getting better day by day, we all have to thank you for taking the time to work through this, I know it can be rough looking through all that code.

Now you just have to get NPC mobs working again :)

Grateeome

  • Wayfarer
  • *
  • Posts: 1
    • View Profile
DOH!
« Reply #10 on: October 27, 2005, 10:09:41 pm »
Alas, my head is too sore to continue slamming it into the wall...  Maybe it is just to be assumed that I am not destined to play tonight...  

OK, well good luck with the servers, and I harbor no jealousy to the plight that you are abviously experiencing!

Uyaem

  • Hydlaa Notable
  • *
  • Posts: 747
    • View Profile
(No subject)
« Reply #11 on: October 28, 2005, 05:04:07 am »
Excellent basic gdb tutorial :)
Helpful also or especially usefule for people who know what the problem is, but are used to a fance GUI and doubleclicks and whatnotmore to track problems down.
The internet is "the terrorists'" most important weapon, they say.
Wrong.
Fear is their most important weapon.
Ours is our freedom.

ikanreed

  • Wayfarer
  • *
  • Posts: 1
    • View Profile
(No subject)
« Reply #12 on: November 06, 2005, 04:08:20 pm »
Quote
Originally posted by DivineLight
Seems like most of our crashes are due to bad pointers. I\'ve updated the CVS docs, to include guide lines and precautions.

The lesson is that we should test the pointer returned by a function before using it for NULL, if there\'s a chance of it being NULL.


what have been deemed \"proper\" C coding practices suggest checking for Null on any returned pointer.  games should only make an exception on a case such as incredibly time sensitive operations.

edit:
not that I\'m an expert.  That\'s just what has been pretty much accepted.  even if the NULL is supposed to be propogated up the return stack, the conditional statement will set asside the possibility for any programmer reading the same code later on without the need for confusing comments.
« Last Edit: November 06, 2005, 04:10:42 pm by ikanreed »

Seytra

  • Forum Addict
  • *
  • Posts: 2049
  • No system can compensate lack of common sense.
    • View Profile
(No subject)
« Reply #13 on: November 06, 2005, 04:32:33 pm »
Quote
Originally posted by Coroveir
Ah yes, Null Pointer Exceptions, I just LOVE them so much...

Ack.

Probably one of the hardest errors to track down.

No, they\'re actually one of the more easy ones to track down, because they take effect immediately. What is a real pain to track down are buffer overflow-induced errors, because these can easily take effect hours or even weeks after the actual error occured, and also in a completely different part of the program, and furthermore depend on the state the program was in when the error occured, as well as the data it contained by that time, which can also change the effect to something completely different that looks like a separate, unrelated bug. Boundary checks are the only thing to effectively prevent these from going undetected, as reproduction of these is near impossible.

Next are pointers that aren\'t initialised at all which just happen to point to something that looks usable most of the time or all the time on a specific machine. Both cases can easily cause silent, creeping data corruption. That\'s why I always initialise pointers, so that they\'re either valid or NULL.

@ ikanreed: That is true. However, it is also true that in highly experimental and evolving code, this practice tends to be ignored. :) AFAICS, these checks usually are introduced only after the code has been mostly frozen. With the way PS works, this method seems to not be viable, as DivineLight noted.

Also, for speed you can in some cases omit the check if you have other means in place to ensure that the pointer is valid in all cases, like the object that contains the pointer being created only if the pointer is valid. However, in this case you must make sure that those who come after you won\'t accidently invelidate this assumption by changing the crucial methods.
« Last Edit: November 06, 2005, 04:41:11 pm by Seytra »

Danok

  • Traveller
  • *
  • Posts: 41
    • View Profile
(No subject)
« Reply #14 on: November 12, 2005, 01:54:40 am »
Andrew, your patience and steadfastness, in both
troubleshooting the PS code and also putting up with
a horde of \"gotta-get-my-PS-fix\" gamers (myself
included), is nothing short of astounding.  Thanks
for the enlightening primer, and Bless You for your
contributions to the game and to the PS community
in general!
-----Danok Danoi
Dragon Master of The White Dragons' Clan
The Dragon Council