Or: Nvidia has a brain fart…
I’d not be so worried about this were it not for two things: Climate Models being ported to CUDA code that runs on NVIDIA GPUs, and NVIDIA chips being widely used in the vision and control systems of self-driving cars.
Seems the new ones have sporadically wrong math… They have “reproducibility” problems with their math.
(Bolding done by me)
2 + 2 = 4, er, 4.1, no, 4.3… Nvidia’s Titan V GPUs spit out ‘wrong answers’ in scientific simulations
Fine for gaming, not so much for modeling, it is claimed
By Katyanna Quach 21 Mar 2018 at 17:03
Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists.
The Titan V is the Silicon Valley giant’s most powerful GPU board available to date, and is built on Nv’s Volta technology. Gamers and casual users will not notice any errors or issues, however folks running intensive scientific software may encounter occasional glitches.
One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values run after run. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we’re told.
All in all, it is bad news for boffins as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions within the virtual world, not rare glitches in the underlying hardware.
An industry veteran, who alerted us to the issue, reckoned this is due to a memory issue. Chip companies normally push their high-end silicon to the limit to maximize performance. Nvidia may be overclocking or red-lining its Titan V in some way, causing read errors from memory. These mistakes are carried forward in calculations, resulting in numerical errors. Another cause could be a design blunder.
It is not down to random defects in the chipsets or a bad batch of products, since Nvidia has encountered this type of cockup in the past, we are told. The moneybags biz released patches for some of its older GeForce and Titan models that exhibited similar problems. There was no issue with its Titan X card based on its Pascal architecture, we’re told.
Unlike previous GeForce and Titan GPUs, the Titan V is geared not so much for gamers but for handling intensive parallel computing workloads for data science, modeling, and machine learning.
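
To make that “reproducibility” complaint concrete, here’s a minimal CUDA sketch of the kind of test being described. It’s my own illustration, not the engineer’s actual protein/enzyme simulation: run one deterministic kernel over and over on a fixed input and check that the output bits never change. Since the kernel does only per-element math (no atomics, no reductions, so thread scheduling can’t alter the answer), any run-to-run difference points at the hardware.

// Hypothetical reproducibility harness -- not the actual test from the article.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

__global__ void work(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = in[i];
        for (int k = 0; k < 100; ++k)     // some nontrivial, fully deterministic arithmetic
            x = sin(x) * 1.000001 + 1e-9;
        out[i] = x;
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    double *h_in  = (double*)malloc(bytes);
    double *h_out = (double*)malloc(bytes);
    double *h_ref = (double*)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (double)i / n;   // fixed input every run

    double *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    for (int run = 0; run < 100; ++run) {
        work<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        if (run == 0)
            memcpy(h_ref, h_out, bytes);                  // run 0 is the reference
        else if (memcmp(h_ref, h_out, bytes) != 0)
            printf("run %d differs from run 0 -- not reproducible!\n", run);
    }
    printf("done\n");
    return 0;
}

Build it with nvcc and let it loop; a healthy card should print nothing but “done”.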
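As for the memory-read-error theory: a single mis-read bit in a floating-point value is not a small mistake. This little host-only demo (any C++ compiler or nvcc will build it; the value is just for illustration) flips one exponent bit of a double and the number doubles. Carry that forward through a long calculation and the output is garbage.

// One flipped bit in an IEEE-754 double: a large error, not a rounding wobble.
#include <cstdio>
#include <cstdint>
#include <cstring>

int main() {
    double x = 3.141592653589793;
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);          // view the double as raw bits
    bits ^= 1ULL << 52;                           // flip the lowest exponent bit
    double corrupted;
    std::memcpy(&corrupted, &bits, sizeof corrupted);
    std::printf("original:  %.15g\n", x);         // 3.14159265358979
    std::printf("corrupted: %.15g\n", corrupted); // 6.28318530717959 -- doubled
    return 0;
}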
Well, it’s not like the models are right anyway… /sarc;
It’s the self driving cars that have me worried…
Still, having your math sporadically wander when doing intensive iterative calculations in a model is likely to cause all manner of unexpected divergences…
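
Here’s a toy sketch of that, using the logistic map as a stand-in for any feedback-heavy model (not any actual climate code): iterate the same nonlinear update from two starting values that differ by one ULP, as if one run had taken a one-bit glitch at the start, and watch the trajectories part company.

// A one-ULP "glitch" amplified by an iterative nonlinear model.
#include <cstdio>
#include <cmath>

int main() {
    double a = 0.4;
    double b = std::nextafter(a, 1.0);   // a plus one ULP: the "glitched" run
    const double r = 3.9;                // chaotic regime of the logistic map
    for (int i = 1; i <= 80; ++i) {
        a = r * a * (1.0 - a);           // same deterministic update...
        b = r * b * (1.0 - b);           // ...applied to both trajectories
        if (i % 10 == 0)
            std::printf("iter %2d: a=%.15f  b=%.15f  diff=%.3g\n",
                        i, a, b, std::fabs(a - b));
    }
    return 0;
}

From a difference in the seventeenth decimal place to trajectories that bear no resemblance to each other, in under a hundred steps. Scale that up to a model iterating over simulated centuries and “unexpected divergences” is putting it mildly.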