Intel has disclosed that its sixth-generation Core products (known as Skylake) suffer from a CPU bug
that can cause a system to hang. So far, the company has publicly identified only one application family
that triggers the problem: Prime95.
The Prime95 thread on Skylake instability dates back to early December, when testers noted that
running the 768K test on the latest Intel processors would cause the application to fail — sometimes
within minutes, sometimes only after hours. The forum users collectively worked through the usual suspects
and double-checked RAM, motherboard vendors, voltage levels, clock speeds, Prime95 software versions,
and whether the CPU was overclocked or not.
Based on user reports, disabling Hyper-Threading appears to fix the problem, but none of the other
variables had a measurable impact on the issue. If you run Prime95 on a Skylake CPU using the maximum number
of threads the processor supports, with the “CpuSupportsFMA3=0” setting (which forces the use of AVX) and
the 768K FFT size, the system will eventually crash.
Unfortunately, Intel’s current disclosure is vague at best. The complete statement reads:
“Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products.
This issue only occurs under certain complex workload conditions, like those that may be encountered when
running applications like Prime95. In those cases, the processor may hang or cause unpredictable
system behavior. Intel has identified and released a fix and is working with external business partners
to get the fix deployed through BIOS.”
It’s not yet clear what the fix entails, or whether it will require end users to avoid certain code paths
or features when testing processors. Niche cases like this can have enormous impacts on companies. In
the early 1990s, Intel’s Pentium processors suffered from what became known as the FDIV bug.
The chips worked perfectly in the vast majority of cases, but would return an incorrect value in
specific floating-point division cases. The returned values were off by roughly 0.000061.
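To put that number in perspective, here is a minimal Python sketch built around the most widely cited FDIV test case, 4195835 divided by 3145727. Neither that test case nor the flawed quotient comes from Intel's statement or this article's sources; both are assumptions drawn from commonly repeated reports of the bug, and the flawed value is hard-coded because a modern CPU will not reproduce it. The helper names are purely illustrative.

```python
# Widely cited FDIV test case (an assumption, not from this article).
NUMERATOR = 4195835.0
DENOMINATOR = 3145727.0

# Quotient commonly reported for flawed Pentiums. It is hard-coded because
# a modern CPU will not reproduce the bug.
FLAWED_QUOTIENT = 1.333739068902037589

# What a correctly working FPU returns for the same division.
correct_quotient = NUMERATOR / DENOMINATOR


def relative_error(observed, expected):
    """Return how far off `observed` is, as a fraction of `expected`."""
    return abs(observed - expected) / abs(expected)


def residual(quotient):
    """Return x - (x / y) * y, which is close to 0 when the quotient is right."""
    return NUMERATOR - quotient * DENOMINATOR


print(f"correct quotient:   {correct_quotient:.12f}")
print(f"flawed quotient:    {FLAWED_QUOTIENT:.12f}")
print(f"relative error:     {relative_error(FLAWED_QUOTIENT, correct_quotient):.6f}")
print(f"residual (correct): {residual(correct_quotient):.1f}")
print(f"residual (flawed):  {residual(FLAWED_QUOTIENT):.1f}")
```

Under these assumptions, the relative error works out to about 0.000061, and the residual check lands near 256 instead of 0. The quotient is wrong only from the fifth significant digit onward, which is why most users would never have noticed.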
Nonetheless, the bug caused serious headaches for Intel. The company took a hammering in the press and a
charge of $475 million against earnings to resolve the problem. Since then, we’ve seen a number of
high-profile errors: AMD had its TLB bug with the original Phenom, and Intel’s first iteration of TSX
(Transactional Synchronization Extensions) was disabled via a microcode update.
There’s a bug in Intel’s VM implementation that can allow a guest VM to fault in a way that traps the
CPU in an infinite loop.
Intel turned some of the flawed Pentium chips into keychains.
We think of processors as essentially flawless devices that “just work,” but reality tells a different story.
Check out Intel’s list of errata in Haswell — there’s a five-page list of flaws and issues,
virtually all of which are labeled as “No fix.” The solution, in the majority of cases,
is “Don’t do it like that.” AMD chips aren’t immune to these kinds of issues by any means,
but they have drawn less scrutiny since AMD no longer has the enterprise market share it
once commanded.
Sometimes bugs are disclosed, sometimes they aren’t — Piledriver has a significant problem with 256-bit AVX
instructions, for example, that injects an 18-20 cycle delay into executing multiple consecutive instructions.
Every original Intel Atom (before Bay Trail) had a floating point flaw that could insert a NOP
(no operation) into every other cycle, effectively doubling FPU compute time. No one bought an Atom for
its FPU performance, so the bug didn’t get talked about.
We’ll have to wait and see what Intel’s solution for this problem is. The simplest way to fix it might be
to tell the CPU to avoid using AVX in specific instances, but the FDIV bug demonstrated that users often
demand 100% compatible CPUs — even if they aren’t using the functions that actually trigger a bug.
The problem is, as CPUs add more features and capabilities, it takes longer and longer to adequately
test those functions.