Thursday, August 28, 2014

Crisis Day



It was business as usual one day, and then it happened. It did not start like a jungle fire, but rather slowly. A key application used by many countries in a region had decided to take it easy. Gradually, users started complaining that the application was not responding as fast as it used to; for some it was not responding at all. The business panicked over the possible loss of time and the inability to carry on with routine activities. Multiple high-priority incidents were raised, and the final blow came as an urgent incident. The moment of truth had arrived: a formal P1 was declared, which meant almost everyone needed to jump on the crisis call: application support, application/DB administrators, managers, and product SMEs.


Initially everything was a suspect. All the teams came back with their analysis, but no one could identify a cause for the odd system behavior. The temperature on the conference call was rising, and the business was waiting to move on. In such a case the responsibility landed with the product vendor: they had designed it, so they would surely know. As always, they were full of suggestions: we can try this or that, the issue seems to be here or a little over there.


IT realized it was better to try something and move forward than to try nothing, so they started working through the options. First came repeated restarts, but they did not help, so it was time for the second option. Meanwhile, given the criticality of the issue, some more resources were pumped in, in the hope of reviving the system, and that seemed to work for a while.


The teams went back to investigate further, and monitoring was put in place on all parameters. Meanwhile the application kept showing its mood swings, and then one day it was business as usual again: the problem stopped as abruptly as it had started. Some had an inkling that the actions taken may have improved the condition; everyone could only hope that it would not happen again.