If the mobile apps supporting half your online betting business fail on the day of the Melbourne Cup, one of the biggest gambling events in Australia, then you can bet your heart would be racing faster than Phar Lap's.
That's what Alan Alderson experienced in 2013 as head of IT infrastructure and operations at William Hill, an online sports betting business. For six hours, Alderson and his team worked to get server infrastructure back up and running smoothly so punters could bet using the company's mobile apps just before the race started.
"At 6:30am, I got a phone call to say there is high CPU usage with the main transactional database. It was a very frustrating ... we didn't have any real indication of the problem.
"The business at that time was a 50/50 split between mobile and web browsers, so it was a big chunk that was down. CPU returned to normal at about 2:00pm, an hour before the race. IT confidence across the business was probably at an all-time low after that," he said during a webinar on Tuesday (24 November 2015).
The problem, he said, was that the business lacked tools that gave clear visibility of its systems and detailed monitoring, so issues could not be spotted and acted on early.
There were also too many siloed views and non-integrated tools, with no real unified visibility, he said.
"Back in the day, we were very reactive. Customers generally told us first there was a problem - [our] internal and external customers."
Over the past two years, William Hill has transformed its IT operations, implementing the CA Unified Infrastructure Management, App Synthetic Monitor and Server Management tools, as well as Splunk.
Alderson said these tools have improved monitoring, visibility and reporting across the company's server infrastructure, showing CPU, disk and memory usage, as well as website availability.
Alderson said he looked to CA because the tools required minimal management overhead. They automatically identify any part of William Hill's infrastructure or website performance that deviates from the norm, and send out alerts to operations staff so they can act on issues quickly, he said.
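As a rough illustration of what "deviates from the norm" can mean in practice, the sketch below flags CPU readings that sit far outside a rolling baseline. It is not CA's or Splunk's method or API; the function name `find_deviations`, the window size and the threshold are all assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def find_deviations(samples, window=5, threshold=3.0):
    """Return (index, value) pairs that stray from a rolling baseline.

    Hypothetical helper for illustration only: a reading is flagged when
    it lies more than `threshold` standard deviations from the mean of
    the previous `window` normal readings.
    """
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            spread = stdev(history)
            # A flat baseline has zero spread, so skip the comparison.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                alerts.append((i, value))
                continue  # keep the outlier out of the baseline
        history.append(value)
    return alerts


# Made-up CPU percentages with one obvious spike.
cpu_percent = [20, 22, 21, 23, 22, 21, 95, 22, 20]
for index, value in find_deviations(cpu_percent):
    print(f"ALERT: sample {index} at {value}% CPU is outside the norm")
```

A production monitoring tool would typically also account for time-of-day patterns and route alerts to on-call staff, which this sketch leaves out.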
Alderson said he wanted to be proactive rather than reactive in solving mission-critical issues, so that problems are fixed early, before they grow into larger ones.
"It's about knowing before our business and customers know. With all the monitoring and alerting we have in place, we are on it straight away and can get things fixed and sorted out quickly," he said.
There is also more visual, real-time dashboarding of website performance, covering everything from uptime and availability to downloading.