Internet giants such as Google and Amazon run IT operations that are far larger than most enterprises even dream of, but lessons they learn from managing those humongous systems can benefit others in the industry.
At a few conferences in recent weeks, engineers from Google and Amazon revealed some of the secrets they use to scale their systems with a minimum of administrative headache.
At the Usenix LISA (Large Installation Systems Administration) conference in Washington, Google site reliability engineer Todd Underwood highlighted one of the company's imperatives that may be surprising: frugality.
"A lot of what Google does is about being super-cheap," he told an audience of systems administrators.
Google is forced to maniacally control costs because it has learned that "anything that scales with demand is a disaster if you are not cheap about it."
As a service grows more popular, its costs must grow in a "sub-linear" fashion, he said.
"Add a million users, you really have to add less than a 1,000 quanta of whatever expense you are incurring," Underwood said. A "quantum" of expense, in this sense, could be a unit of people's time, compute resources, or power.
That thinking is behind Google's efforts not to purchase off-the-shelf routing equipment from companies such as Cisco or Juniper. Google would need so many ports that it's more cost-effective to build its own, Underwood said.
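One rough way to check whether a service's costs are growing in the "sub-linear" fashion Underwood describes is to estimate a scaling exponent from two measurements of users and expense. The sketch below assumes cost follows a simple power law (cost proportional to users raised to some exponent); the function name and the sample figures are hypothetical and do not come from Google.

```python
import math


def scaling_exponent(users_before: float, cost_before: float,
                     users_after: float, cost_after: float) -> float:
    """Estimate alpha in cost ~ users**alpha from two observations.

    alpha < 1 means cost grows more slowly than the user base (sub-linear);
    alpha == 1 is linear; alpha > 1 is super-linear.
    """
    return math.log(cost_after / cost_before) / math.log(users_after / users_before)


# Hypothetical figures: users double from 1M to 2M while expense
# rises from 1,000 to 1,500 quanta.
alpha = scaling_exponent(1_000_000, 1_000, 2_000_000, 1_500)
print(f"estimated exponent: {alpha:.3f}")
print("sub-linear" if alpha < 1 else "not sub-linear")
```

With these made-up numbers the exponent comes out below 1, which is the kind of growth curve the frugality argument calls for. Real cost data is noisier, so an estimate from only two points should be treated as a rough indicator at best.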
He rejected the idea that the challenges Google faces are unique to a company of its size. For one thing, Google is composed of many smaller services, such as Gmail and Google+.
"The scale of all of Google is not what most application developers inside of Google deal with. They run these things that are comprehensible to each and every one of you," he told the audience.
Another technique Google employs is to automate everything possible. "We're doing too much of the machines' work for them," he said.
Ideally, an organization should get rid of its system administration altogether, and just build and innovate on existing services offered by others, Underwood said, though he admitted that's not feasible yet.
Underwood, who has a flair for the dramatic, stated: "I think system administration is over, and I think we should stop doing it. It's mostly a bad idea that was necessary for a long time but I think it has become a crutch."
Google's biggest competitor is not Bing or Apple or Facebook. Rather, it is itself, he said. The company's engineers aim to make its products as reliable as possible, but that's not their sole task. If a product is too reliable -- which is to say, beyond the five 9's of reliability (99.999 percent) -- then that service is "wasting money" in the company's eyes.
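To put the "five 9's" figure in concrete terms, the short sketch below converts a percentage into a yearly downtime allowance. It assumes that "reliability" here means availability (the share of time a service is up) and that a year has 365 days; the function name is illustrative only.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Five nines leaves only a few minutes of downtime per year.
for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Under these assumptions, 99.999 percent works out to roughly 5.26 minutes of downtime a year, which helps explain why pushing beyond that level can cost more than it returns.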