Later that day, I was informed that the changes had been put into effect. I went home for the weekend happy, thinking our company's servers would be much safer.
The fix that wasn't
When I arrived at work on Monday, I found that disaster had struck. My inbox had messages from Mr_Roboto showing the evening batch jobs he'd attempted had failed. There were open tickets from multiple users, reporting their batch jobs had not run. I had users calling to ask what was happening. I checked, and all of the batch processes had failed.
My team went into crisis mode immediately and tried to figure out what had happened. Looking over the errors, it became apparent that Mr_Roboto no longer had access to the servers he was supposed to be running on. In fact, Mr_Roboto was no longer in Active Directory.
Horrified, we called up the system admin to find out what was going on. The response: "Yes, I changed Mr_Roboto to Mr_RobotoA and Mr_RobotoB and only gave them access to one of your two processing servers, respectively."
Our displeasure with this situation was immediately and loudly communicated to the system admin. After a few minutes he agreed to change his "security upgrades" back to the way they were before. It wouldn't fix the security log-in problem, but at that point we had a larger issue on our hands: A whole weekend's worth of batch processing still needed our attention.
As a last resort, my boss had the developers use their machines to run the batch processes. Thankfully, by the end of the day we had cleared up the backlog, and our users ran all their reports and sent them to the correct parties.
It'll be better next time - right?
Our team conducted a postmortem on the situation and came to the following conclusions:
First, as easy as it was to blame the system admin, I should have requested more details before allowing the change. I had wanted to hear -- and took away from the conversation -- that the system admin accounts were being changed, but that was not what the system admin was saying. Also, any future changes needed to come with an email listing what was changing and why. Going forward, we resolved not to stand in the way of any positive changes, but we wanted a clear explanation of what was being changed before anything was done.
Second, we also realized that the system admins didn't know enough about what we were doing. This was handled by a two-hour meeting with the admin team in which we brought donuts and explained how our application worked and described the incident as "the weekend in which Mr_Roboto got fired."