reghardware wrote:
AMD Opteron CPUs hit by heat stroke
By Tony Smith
28th April 2006 11:59 GMT
Exclusive AMD today admitted it has inadvertently allowed a number of 2.6GHz and 2.8GHz single-core Opteron x52 and x54 processors that could corrupt data under extreme conditions to escape into the wild.
It is believed that the glitch is triggered when the affected chip's FPU is made to loop through a series of memory-fetch, multiplication and addition operations without any condition checks on the result of the calculations. The loop has to run over and over again for long enough to cause localised heating, which, combined with high ambient temperatures, could cause the result of the operation to be recorded incorrectly, leading to data corruption.
To trigger the effect, the loop has to be run millions of times, an AMD customer source told Reg Hardware, potentially for hours at a time with no other operations being introduced during the run.
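For readers wondering what that kind of loop looks like, here is a rough sketch of the pattern described above: fetch two values, multiply them, add to a running total, and never look at the total until the end. The function names are made up for illustration, this is not AMD's synthetic benchmark, and since the Python interpreter runs plenty of branches of its own it would not reproduce the fault on an affected chip. It only shows the shape of the arithmetic, next to a variant that does compare the result on each step.

```python
def multiply_accumulate(a, b, passes):
    """Fetch a[i] and b[i], multiply, add to a running total; repeat.

    The running total is never inspected inside the loop, matching the
    "no condition checks on the result" part of the description.
    """
    total = 0.0
    for _ in range(passes):
        for x, y in zip(a, b):
            total += x * y  # fetch, multiply, add
    return total


def multiply_accumulate_checked(a, b, passes, limit):
    """Same arithmetic, but the result is compared against a limit each step."""
    total = 0.0
    for _ in range(passes):
        for x, y in zip(a, b):
            total += x * y
            if total > limit:  # condition check on the result
                return total
    return total


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [4.0, 5.0, 6.0]
    # Illustrative pass count only; the article talks about millions of
    # iterations running for hours.
    print(multiply_accumulate(a, b, passes=2))
```

The first function is the unchecked multiply-add pattern; the second is the sort of loop with a compare on the result that, according to the article, would not be expected to show the problem.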
According to the source - who claimed to be party to emails highlighting the issue and sent by AMD to a number of the chip maker's major customers and partners - AMD has investigated the problem and found it was only able to reproduce the bug's effects in a synthetic benchmark test.
The problem is believed to affect only a fraction - perhaps no more than 3,000 individual CPUs - which managed to slip through AMD's screening net. It is not known how this so-called 'test escape' occurred, but it took place "in part of 2005 and early 2006", an AMD spokesman said.
AMD said it has introduced another screening test to catch any further affected parts. Chips caught in this test in future will be re-rated at a lower clock speed to prevent the problem. The company is also working with OEMs to identify affected parts and contact customers who could be affected - if they are, they will be offered free replacements.
AMD stressed the problem was due to "a convergence of three specific simultaneous conditions", not a fault with the Opteron architecture. The company claimed the issue had not been observed on systems running commercially available applications.
"It's very hard to imagine this type of [tight FP loop] code in our [financial services] environment," Reg Hardware's source said. "The only thing I could think that would be coded this way would be some type of strange cipher code. For example, any type of 'for' loop that uses a compare operation would not have the problem." ®
Aye, and most companies asked by AMD if they wish to have a replacement have said no, it wasn't worth the hassle. AMD found the problem in its lab, and no customer ever made mention of it. They're just trying to prevent such things from happening, you know, by warning the customers ahead of time.