Perils of GPU Math For Scientific Computing

Or: Nvidia has a brain fart…

I’d not be so worried about this were it not for Climate Models being ported to CUDA code that uses NVIDIA GPUs for processing, and NVIDIA being widely used in the vision and control systems of self-driving cars.

Seems the new ones have sporadically wrong math… They have “reproducibility” problems with their math.

(Bolding done by me)

https://www.theregister.co.uk/2018/03/21/nvidia_titan_v_reproducibility/

2 + 2 = 4, er, 4.1, no, 4.3… Nvidia’s Titan V GPUs spit out ‘wrong answers’ in scientific simulations

Fine for gaming, not so much for modeling, it is claimed

By Katyanna Quach 21 Mar 2018 at 17:03

Nvidia’s flagship Titan V graphics cards may have hardware gremlins causing them to spit out different answers to repeated complex calculations under certain conditions, according to computer scientists.

The Titan V is the Silicon Valley giant’s most powerful GPU board available to date, and is built on Nv’s Volta technology. Gamers and casual users will not notice any errors or issues, however folks running intensive scientific software may encounter occasional glitches.

One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we’re told.
[…]
All in all, it is bad news for boffins as reproducibility is essential to scientific research. When running a physics simulation, any changes from one run to another should be down to interactions within the virtual world, not rare glitches in the underlying hardware.

An industry veteran, who alerted us to the issue, reckoned this is due to a memory issue. Chip companies normally push their high-end silicon to the limit to maximize performance. Nvidia may be overclocking or red-lining its Titan V in some way, causing read errors from memory. These mistakes are carried forward in calculations, resulting in numerical errors. Another cause could be a design blunder.

It is not down to random defects in the chipsets nor a bad batch of products, since Nvidia has encountered this type of cockup in the past, we are told. The moneybags biz released patches for some of its older GeForce and Titan models that exhibited similar problems to address these errors. There was no issue with its Titan X card based on its Pascal architecture, we’re told.

Unlike previous GeForce and Titan GPUs, the Titan V is geared not so much for gamers but for handling intensive parallel computing workloads for data science, modeling, and machine learning.

Well, it’s not like the models are right now though, anyway… /sarc;

It’s the self driving cars that have me worried…

Still, having your math sporadically wander when doing intensive iterative calculations in a model is likely to cause all manner of unexpected divergences…
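
As a toy illustration (my own sketch in Python, nothing from the article; the map, step count, and 1e-12 glitch size are all made up for the demo), here is the same chaotic iteration run twice, identical except for one injected error:

```python
# A toy sketch (not from the article): the same chaotic iteration run
# twice, identical except for one injected error of 1e-12 at step 100.

def logistic_run(steps, glitch_at=None, glitch=1e-12, x=0.4, r=3.9):
    """Iterate the logistic map x -> r*x*(1-x); optionally inject one tiny error."""
    for i in range(steps):
        x = r * x * (1.0 - x)
        if i == glitch_at:
            x += glitch          # simulate a one-off hardware read/compute error
    return x

clean    = logistic_run(1000)
glitched = logistic_run(1000, glitch_at=100)
print(clean, glitched, abs(clean - glitched))
# After the remaining ~900 iterations the two runs typically no longer
# agree to even one significant figure, despite the injected error
# being only 1e-12.
```

That is roughly the position an iterative model is in if the hardware quietly hands back a slightly wrong number once in a while: the run still completes, it just wanders off onto a different trajectory.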


About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in AGW Science and Background, GCM, News Related, Tech Bits. Bookmark the permalink.

25 Responses to Perils of GPU Math For Scientific Computing

  1. spetzer86 says:

    Wonder what effect that has with bitcoin miners? Don’t they use high-end, overclocked video cards to handle most of the calculations?

  2. jim2 says:

    This problem is just a reflection of the world at large going to hell.

  3. E.M.Smith says:

    There are many bitcoin mining codes. The current hot crop is dedicated ASIC miners on a USB stick.

    In the beginning it was regular CPUs, mostly. Then some folks started using GPUs too.

    Now pretty much CPUs take too much power and mine too few coins to be worth it in electricity cost (unless you are a government weenie with a supercomputer supposed to be doing something else and OPM (Other People’s Money) is paying the power bill).

    So now it’s moved on to a lot of GPUs and ASICs.

    The particular GPU with this issue is a new one, and it came out after the move to ASICs was well established. It is (was?) intended specifically for scientific computing (as opposed to graphics), so this is a big Aw-Shit for NVIDIA. Cost is over $2,000 each… But being new, not a lot of them are in the field yet and they are mostly sold to institutions (or insanely rich gamers).

    In short, not much quantity of impact on coin miners – yet…

    Now, what would that impact be?

    The mining process consists of computing a whole bunch of hashes and then finding out if you got a coin, and if so, you get added to the block-chain and that gets communicated to others as the block chain syncs. So what’s likely to be computed on a GPU and what would a failure do? Well, the hashes are the heavy lifting, and a hash that was wrong would fail to validate. IMHO most of the bogus computes would do nothing as they were going to fail to validate anyway. Under very rare conditions you would be computing a valid coin hash, and get it wrong. Then continue to hash away looking for more without collecting the coin. (Someone else would find it in some other run elsewhere).

    So my best guess is that there would be something like a 10% slower mining process as some valid coins were missed. It is also possible that there would be a 100% failure to find coins IFF the hash process had internal dependencies such that the product “rolled forward” into future hashing in a way such that one error corrupts all future computes on that path. In that case I think you would have a miner that would work fine for a while, then just not find anything until a new hash sync happened and it started over.

    There’s some amount of guessing in this as I’ve not read the mining code. I’m working from what I think it would do and how I’d tend to write it.
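
    For the curious, a rough sketch of that guess (my own toy proof-of-work loop in Python, NOT real miner code; the header, target, and bit-flip are made up for illustration):

    ```python
    # A rough sketch of the guess above (my own toy proof-of-work loop,
    # NOT real miner code): a corrupted hash almost always just fails the
    # difficulty check, so bad computes are mostly wasted work -- unless
    # the corruption lands on a nonce that would have been a winner.

    import hashlib

    def pow_hash(header: bytes, nonce: int) -> bytes:
        """Double SHA-256 of header+nonce, Bitcoin-style."""
        data = header + nonce.to_bytes(8, "little")
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    def mine(header: bytes, target: bytes, max_nonce: int, corrupt_nonce=None):
        """Scan nonces; optionally flip a bit in one hash to mimic a glitch."""
        winners = []
        for nonce in range(max_nonce):
            h = pow_hash(header, nonce)
            if nonce == corrupt_nonce:
                h = bytes([h[0] ^ 0x01]) + h[1:]   # the simulated GPU error
            if h < target:                          # toy difficulty test
                winners.append(nonce)
        return winners

    header = b"example block header"                # made-up data
    target = b"\x00\x01" + b"\xff" * 30             # very easy toy difficulty
    good = mine(header, target, 200_000)
    bad  = mine(header, target, 200_000,
                corrupt_nonce=good[0] if good else None)
    print(len(good), len(bad))                      # the glitched run misses one "coin"
    ```

    In this toy version the corrupted hash simply fails the difficulty test, so the damage is one missed “coin” plus some wasted electricity, which is consistent with the “somewhat slower mining” guess above.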

  4. John F. Hultquist says:

    Luboš Motl @ the reference frame has several posts on “alternative math” that is being promoted by Mr Milan Hejný along SJW lines: LINK

    Right answers. Who cares? Feel good.

  5. Larry Ledwick says:

    New generation math and science – or Let’s just make up numbers and feel good.

  6. Stewart Pid says:

    4 … 4.1 …. 4.3 … close enough for the Muskster ;-)

  7. Chris in Calgary says:

    Despite all the new computational uses for these cards, they are Graphics Processing Units. Most of the time, the requirement for numeric exactitude in graphics is much lower than in scientific computation. I wonder if that has any bearing on the designs here.

    Possibly this exposes small numeric errors in older designs reused here? Not unlike how the Ariane V rocket reused Ariane IV software and crashed because of an unanticipated overflow error that only occurred on Ariane V due to its higher velocity (etc.).
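
    (A hedged little sketch of that Ariane-type trap, with made-up velocity numbers rather than actual flight values: a value that is perfectly comfortable as a wide float blows up when force-fit into a 16-bit signed slot.)

    ```python
    # A hedged illustration of the Ariane-style reuse trap (my own sketch;
    # the velocity numbers are made up, not actual flight values).

    import struct

    def to_int16(x: float) -> int:
        """Force a value into a signed 16-bit slot, as fixed-size flight code might."""
        return struct.unpack("<h", struct.pack("<h", int(x)))[0]

    for velocity in (30_000.0, 40_000.0):   # "old profile" vs "new, faster profile"
        try:
            print(velocity, "->", to_int16(velocity))
        except struct.error as err:
            print(velocity, "-> conversion blows up:", err)   # the Ariane V analogue
    ```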

  8. ossqss says:

    Oh no! That could mean that ECS could be calculated as high as 4.5 – 8.5 C!

    Ohhhh, wait a minute……..

  9. philjourdan says:

    Computers have always had problems dealing with decimals. I learned that early. It is not really a bug per se, but just the rounding error between binary and decimal. I long ago left the programming field, but before I did, learned to convert all decimal to whole numbers when storing them. They were only converted back to decimal for printout.
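
    A minimal sketch of that approach (my illustration; cents are the whole-number unit here):

    ```python
    # A minimal sketch of the approach above (my illustration): keep values
    # as whole integer units internally, convert to decimal only for printout.

    # Binary floating point cannot represent 0.1 exactly:
    print(0.1 + 0.2)           # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)    # False

    # Store as whole cents (integers) instead; the arithmetic stays exact:
    total_cents = 10 + 20
    print(f"${total_cents / 100:.2f}")   # $0.30 -- converted back only for display
    ```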

  10. ossqss says:

    Nvidia suspended their self driving car program today.

  11. E.M.Smith says:

    @PhilJourdan:

    I remember my Engineering FORTRAN class emphasizing that we should convert any input values to integer if possible, and in any case do the conversion only once. We spent a lot of time talking about mantissa, characteristic, and exponents of things…

    We were also admonished to try to always remember we were doing math in binary not decimal, and that underflow and overflow were our constant companions…

    In these days of “computer science” taking over the programming classes, they spend much more time on the syntax of high end language structures and not nearly enough time on what the hardware is doing to your numbers… IMHO.
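
    A quick sketch of those constant companions (mine, in Python/NumPy rather than FORTRAN) in 32-bit floating point:

    ```python
    # A quick sketch (mine, NumPy rather than FORTRAN) of overflow,
    # underflow, and swallowed small terms in 32-bit floating point.

    import numpy as np

    big  = np.float32(1e38)
    tiny = np.float32(1e-38)

    print(big * np.float32(10.0))              # inf -- overflow past ~3.4e38
    print(tiny * np.float32(1e-10))            # 0.0 -- underflows below the smallest subnormal
    print(np.float32(1.0) + np.float32(1e-8))  # 1.0 -- the small term vanishes entirely
    # (NumPy may also print an overflow RuntimeWarning for the first multiply.)
    ```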

    @ossqss:

    Oh really? Now that’s interesting…

    As all they do is make and sell compute stuff, one wonders what all they had in the back room under devo… or if they just want to assure that any ‘death liability’ attaches to their customers and no lawyer can bring them in as part of the “self driving development process”… “No, esteemed counsel, we have no such self driving involvement. All we do is sell compute engines. What someone does with them is not in our area of work.”

  12. E.M.Smith says:

    https://www.nvidia.com/en-us/self-driving-cars/drive-px/ Just so it can’t evaporate:

    NVIDIA DRIVE

    Scalable AI platform for Autonomous Driving
    World’s First Functionally Safe AI Self-Driving Platform

    NVIDIA DRIVE™ is the AI platform that enables automakers, truck makers, tier 1 suppliers, and startups to accelerate production of automated and autonomous vehicles. The platform architecture allows our partners to build and deploy self-driving cars, trucks and shuttles that are functionally safe and can be certified to international safety standards.

    The architecture is available in a variety of configurations. These range from one passively cooled mobile processor operating at 10 watts, to a multi-chip configuration with four high performance AI processors — delivering 320 trillion deep learning operations per second (TOPS) — that enable Level 5 autonomous driving.

    The NVIDIA DRIVE platform combines deep learning, sensor fusion, and surround vision to change the driving experience. It is capable of understanding in real-time what’s happening around the vehicle, precisely locating itself on an HD map, and planning a safe path forward. Designed around a diverse and redundant system architecture, the platform is built to support ASIL-D, the highest level of automotive functional safety.
    SENSOR FUSION

    NVIDIA DRIVE systems can fuse data from multiple cameras, as well as lidar, radar, and ultrasonic sensors. This allows algorithms to accurately understand the full 360-degree environment around the car to produce a robust representation, including static and dynamic objects. Use of deep neural networks for the detection and classification of objects dramatically increases the accuracy of the fused sensor data.

    ARTIFICIAL INTELLIGENCE AND DEEP LEARNING

    NVIDIA AI platforms are built around deep learning. With a unified architecture, deep neural networks can be trained on a system in the datacenter, and then deployed in the car. NVIDIA DGX Systems can reduce neural network training in the data center from months to just days. The resulting neural net model runs in real-time on DRIVE hardware inside the vehicle.
    NVIDIA DRIVE SOFTWARE

    The DRIVE platform software enables our partners to develop applications to accelerate production of automated and autonomous vehicles. It contains software libraries, frameworks and source packages that developers and researchers can use to optimize, validate and deploy their work.
    NVIDIA DRIVE IX SOFTWARE

    The NVIDIA DRIVE IX software development kit (SDK) enables AI assistants for both drivers and passengers, using sensors inside and outside the car. DRIVE IX leverages data from the microphone and cameras to track the environment around the driver. Even when the car isn’t driving itself, it is looking out for you.
    END-TO-END HD MAPPING

    NVIDIA offers an end-to-end mapping technology for self-driving cars, designed to help automakers, map companies, and startups rapidly create HD maps and keep them updated. This state-of-the-art technology uses an NVIDIA AI supercomputer in the car, coupled with NVIDIA Tesla GPUs in the data center, to create highly detailed maps.
    THE NVIDIA DRIVE FAMILY
    DRIVE PX Pegasus

    With an unprecedented 320 TOPS of deep learning calculations and the ability to run numerous deep neural networks at the same time, this high-performance AI computer will provide everything needed for safe autonomous driving. No steering wheel or pedals required. Pegasus will be available to NVIDIA automotive partners mid 2018.
    DRIVE Xavier

    DRIVE Xavier, the world’s highest performance system-on-a-chip, delivers 30 TOPS of performance, while consuming only 30 watts of power. It’s 15 times more energy efficient than our previous generation architecture, and with our unified architecture, all previous NVIDIA DRIVE software development carries over and runs. Now available to DRIVE partners.
    DRIVE PX Parker AutoChauffeur

    DRIVE PX configuration with two SoCs and two discrete GPUs is available today for point-to-point travel. Available today.
    DRIVE PX Parker AutoCruise

    Small form factor DRIVE PX for AutoCruise is designed to handle functions including highway automated driving, as well as HD mapping. Available today.
    NVIDIA SHARED ITS VISION FOR REINVENTING TRANSPORTATION AT THE CES 2018
    CONTACT US

    NVIDIA automotive solutions are available to automakers, tier 1 suppliers, startups, and research institutions working on the future of transportation.

    Automotive Partners

    Audi
    Mercedes-Benz
    Tesla
    Toyota
    Uber
    Volvo
    VW

  13. LOL@Klimate Katastrophe Kooks says:

    In a world where 1 + 1 != 2… scientific reproducibility becomes an impossibility, turning science into mythomaniacal guesswork rife with bias and preconception.

    Perils of GPU Math For Scientific Computing


    “One engineer told The Register that when he tried to run identical simulations of an interaction between a protein and enzyme on Nvidia’s Titan V cards, the results varied. After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again.”

    The really scary thing is that the Nvidia GPUs in question are specifically designed for ‘scientific computing’, and are used for the vision and control systems of self-driving cars… and they run the climate models of the climate alarmists… there’s a reason not to blindly trust in technology, nor in the so-called ‘experts’. Because in both cases, they can easily run you (or all of humanity) straight into a brick wall.

    Compound the guileful perfidy of the climate ‘scientists’ with GPU-based data corruption… and what’ve you got? A big steaming pile of CAGW, based wholly upon bits and pieces of reality-based data taken out of context or so ‘adjusted’ as to be outright falsified, strung together with miles of politicized fantasy.

    CAGW is the ‘pink unicorn farting pixie dust’ of our age.

  14. R.de Haan says:

    One economic rule never to ignore: never make yourself dependent on a single supplier, especially not in different markets.

    I know it is against the trend; see Google, see Amazon, see…

    Time to pull the rip cord.

  15. R.de Haan says:

    Does anybody here remember anything about the design objectives and aspirations behind the X-15 project, to this day the fastest aircraft ever to have flown? The program was designed with men in control because, as was said: “Men would never be satisfied sitting in the nose cone of a rocket as a biological specimen”. People love to be in control of Moon landers, planes, cars, motorcycles, you name it. Call me a skeptic for saying that I don’t buy the self-driving car hype. It is simply no fun and a lot of people will lose their income.
    Besides that, how vulnerable are systems like that?
    Try America’s roads this winter…
    Besides that, the entire scheme collapses the moment we lose the grid or our GPS system.
    Park and Brake Assist as an option, that’s it.

  16. R.de Haan says:

    The opening dialogue of the movie above….
    Absolutely priceless.

  17. R.de Haan says:

    “Men will never be satisfied in the undignified position of sitting in a nose cone acting as a biological specimen”.

  18. R.de Haan says:

    X-15 propelled by NH3/LOX as the designated rocket fuel, really remarkable for those days. Mach 6 with zero CO2 emissions. We could run our cars and heat our homes on NH3 if we need to; current price approx. $425 per ton.
    Much better than freaking batteries which simply lack the needed power density.

  19. E.M.Smith says:

    I’d love to have a self driving car…

    To get me home from the pub when I’m over the limit. Otherwise, hand me the G.D. steering wheel!

    Unfortunately, current case law is finding that if your self driving car “has issues” you are up for DUI if over the limit. (Recently in San Francisco a guy was busted for exactly that when his Tesla stopped mid-span of the bridge. Something about hands not on the wheel… so nobody hurt. No bad thing happened. But he’s got a DUI.)

  20. R. de Haan says:

    I wouldn’t seat myself in the back seat of a self-driving car if I am sober, let alone if I’m drunk.
    The reality is that Uber and Co. are lobbying to get exclusive rights to operate and exploit self-driving cars. The consequence could be that politicians somewhere in the future make a choice and abolish private car ownership. I wouldn’t take that risk. In Europe we already know they will.
    That’s why I see no future on the Old Continent, at least not for me.

  21. Steve C says:

    Meanwhile, as sober technical types mull over the implications of faulty chip design, others continue to take a more direct, “hands-on” approach to life: “Eighteen bitcoin mining machines stolen from a property in Derbyshire”.

    (It happened in Chesterfield, which is the home town of the church {St. Mary and All Saints} with the famous twisted spire. Many pics online!)

  22. E.M.Smith says:

    @Steve C:

    What I find interesting is that the theft made the news. We have the theft of 18 things about the size of a shoe box, with a cost between about $1,000 and $1,400 each, so a total of about $18,000 to $25,200, or way less than the value of most stolen new cars.

    Oh, and they become obsolete faster than cars too…

    Do we even know if the thieves knew what the boxes were?…

  23. R. de Haan says:

    Holy Boxes?

  24. philjourdan says:

    @R. de Haan

    I wouldn’t seat myself in the back seat of a self-driving car if I am sober, let alone if I’m drunk.

    A recurring nightmare as a child was being in the back seat of a car, and then realizing there is no driver. And trying to climb over the seat to gain control of the vehicle.

    I would never sit in the back seat either.

Comments are closed.