Imagine for a moment that the millions of
computer chips inside the servers
that power the largest data centers in the world had rare, almost undetectable
flaws. And the only way to find the flaws was to throw those chips at giant
computing problems that would have been unthinkable just a decade ago.
As the tiny switches in computer chips have shrunk
to the width of a few atoms, the reliability of chips has become another worry
for the people who run the biggest networks in the world. Companies like Amazon, Facebook and Twitter, along with many other large sites, have experienced surprising outages over the last year.
The outages have had several causes, like
programming mistakes and congestion on the networks. But there is growing
anxiety that as cloud-computing networks have become larger and more complex,
they are still dependent, at the most basic level, on computer chips that are
now less reliable and, in some cases, less predictable.
In the past year, researchers at both Facebook and
Google have published studies describing computer hardware failures whose
causes have not been easy to identify. The problem, they argued, was not in the
software — it was somewhere in the computer hardware made by various companies.
Google declined to comment on its study, while Facebook did not return requests for comment.
“They’re seeing these silent errors, essentially
coming from the underlying hardware,” said Subhasish Mitra, a
Stanford University electrical engineer who specializes in testing computer hardware.
Increasingly, Mitra said, people believe that manufacturing defects are tied to
these so-called silent errors that cannot be easily caught.
Researchers worry that they are finding rare defects
because they are trying to solve bigger and bigger computing problems, which
stresses their systems in unexpected ways.
Companies that run large data centers began
reporting systematic problems more than a decade ago. In 2015, in the
engineering publication
IEEE Spectrum, a group of computer scientists who study
hardware reliability at the University of Toronto reported that each year as
many as 4 percent of Google’s millions of computers had encountered errors that
could not be detected and that caused them to shut down unexpectedly.
In a microprocessor that has billions of transistors
— or a computer memory board composed of trillions of the tiny switches that
can each store a 1 or 0 — even the smallest error can disrupt systems that now
routinely perform billions of calculations each second.
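To make that concrete with a hypothetical illustration (the numbers below are arbitrary and not drawn from the article), a single mis-stored bit in one 64-bit value can change a result by orders of magnitude:

```python
# Hypothetical illustration: one flipped bit in a stored 64-bit value.
# The values are arbitrary examples, not data from the article.

def flip_bit(value: int, bit: int) -> int:
    """Return `value` with a single bit inverted, as a faulty switch might."""
    return value ^ (1 << bit)

stored = 1_000                      # the number a program believes it saved
corrupted = flip_bit(stored, 40)    # one transistor silently misreads one bit

print(stored)     # 1000
print(corrupted)  # 1099511628776 -- a single silent flip, a wildly different answer
```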
At the beginning of the semiconductor era, engineers
worried about the possibility of cosmic rays occasionally flipping a single
transistor and changing the outcome of a computation. Now they are worried that the switches themselves are becoming steadily less reliable. The Facebook
researchers even argue that the switches are becoming more prone to wearing out
and that the life span of
computer memories or processors may be shorter than
previously believed.
There is growing evidence that the problem is
worsening with each new generation of chips. A report published in 2020 by chip
maker Advanced Micro Devices found that the most advanced computer memory chips
at the time were approximately 5.5 times less reliable than the previous
generation. AMD did not respond to requests for comment on the report.
Until now, computer designers have tried to deal with hardware flaws by adding special circuits to chips that correct errors.
The circuits automatically detect and correct bad data. It was once considered
an exceedingly rare problem. But several years ago, Google production teams
began to report errors that were maddeningly difficult to diagnose. Calculation
errors would happen intermittently and were difficult to reproduce, according
to their report.
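A minimal software sketch of the idea behind such circuits, using a simple parity bit rather than the stronger error-correcting codes real server memory uses, shows both how errors are caught and how some can slip through silently; the functions below are illustrative, not taken from any vendor's design:

```python
# Sketch of error detection with a parity bit, the simplest relative of the
# error-correcting codes built into server hardware. Real chips use stronger
# codes that can also correct single-bit errors; this analogy only detects them.

def with_parity(bits: list[int]) -> list[int]:
    """Append a parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word: list[int]) -> bool:
    """Check that the stored word still has even parity."""
    return sum(word) % 2 == 0

stored = with_parity([1, 0, 1, 1])   # becomes [1, 0, 1, 1, 1]
assert parity_ok(stored)

stored[2] ^= 1                       # a single bit silently flips
print(parity_ok(stored))             # False -- the flip is detected

stored[0] ^= 1                       # a second flip in the same word
print(parity_ok(stored))             # True -- the two flips cancel out and the
                                     # corruption passes unnoticed, the kind of
                                     # "silent" miss the researchers describe
```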
A team of researchers attempted to track down the
problem, and last year they published their findings. They concluded that the
company’s vast data centers, composed of computer systems based upon millions
of processor “cores,” were experiencing new errors that were probably a
combination of a couple of factors: smaller transistors that were nearing
physical limits and inadequate testing.
In their paper “Cores That Don’t Count,” the Google
researchers noted that the problem was challenging enough that they had already
dedicated the equivalent of several decades of engineering time to solving it.
Modern processor chips are made up of dozens of
processor cores, calculating engines that make it possible to break up tasks
and solve them in parallel. The researchers found a tiny subset of the cores
produced inaccurate results infrequently and only under certain conditions.
They described the behavior as sporadic. In some cases, the cores would produce
errors only when computing speed or temperature was altered.
Increasing complexity in processor design was one
important cause of failure, according to Google. But the engineers also said
that smaller transistors, three-dimensional chips and new designs that create
errors only in certain cases all contributed to the problem.
In a similar paper released last year, a group of
Facebook researchers noted that some processors would pass manufacturers’ tests
but then begin exhibiting failures once they were in the field.
Intel executives said they were familiar with the
Google and Facebook research papers and were working with both companies to
develop new methods for detecting and correcting hardware errors.
Bryan Jorgensen, vice president of
Intel’s data
platforms group, said that the assertions the researchers made were correct and
that “the challenge that they are making to the industry is the right place to
go.”
He said that Intel recently started a project to
help create standard, open-source software for data center operators. The
software would make it possible for them to find and correct hardware errors
that were not being detected by the built-in circuits in chips.
Computer engineers are divided over how to respond
to the challenge. One widespread response is demand for new kinds of software
that proactively watch for hardware errors and make it possible for system
operators to remove hardware when it begins to degrade. That has created an
opportunity for new startups offering
software that monitors the health of the
underlying chips in data centers.
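One plausible shape for that kind of monitoring, sketched here with simulated cores and made-up reference values because the article does not describe any particular product's internals, is a periodic self-test that reruns a known calculation on every core and flags any core whose answer drifts from the expected result:

```python
# Sketch of fleet screening: rerun a known calculation on each core and flag
# any core that disagrees with the expected answer. The "cores" are simulated
# with plain Python functions (one deliberately faulty), since pinning work to
# physical cores is operating-system specific; the overall structure is the point.

EXPECTED = sum(i * i for i in range(10_000))   # known-good reference answer

def healthy_core() -> int:
    return sum(i * i for i in range(10_000))

def faulty_core() -> int:
    # Simulates a core that computes a subtly wrong value under load.
    return sum(i * i for i in range(10_000)) + 1

cores = {0: healthy_core, 1: healthy_core, 2: faulty_core, 3: healthy_core}

suspect = [core_id for core_id, run in cores.items() if run() != EXPECTED]
print(f"cores to drain and replace: {suspect}")   # -> cores to drain and replace: [2]
```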
One such operation is TidalScale, a company in Los
Gatos, California, that makes specialized software for companies trying to
minimize hardware outages. Its chief executive, Gary Smerdon, suggested that
TidalScale and others faced an imposing challenge.
“It will be a little bit like changing an engine while an
airplane is still flying,” he said.