It’s all over the news and search pages. Once again, a massive outage of “the internet” has been traced back to a fault at AWS (Amazon Web Services).
Lots of newspapers and videos have the generic “it is out, it is in an AWS center, they are working on it”. Oddly, it looks like Zerohedge was on it early and with some technical details. Likely due to the impact on the SEC. Yup, the Securities and Exchange Commission depends on Amazon. (Surely there are no worries about conflicts of interest there… since Trump isn’t involved and it is only $Billions of publicly traded securities involved…)
They seem to have run their article in reverse chronological order (i.e. put updates incrementally at the top), so I’m going to quote some bits “out of order” to put them back more in time order… S3 is Amazon’s storage bucket system. When your data storage stops serving, you don’t get much else done.
Amazon S3 (Simple Storage Service) is a web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces (REST, SOAP, and BitTorrent).
On November 1, 2008, pricing moved to tiers where end users storing more than 50 terabytes receive discounted pricing. Amazon says that S3 uses the same scalable storage infrastructure that Amazon.com uses to run its own global e-commerce network.
Amazon S3 is reported to store more than 2 trillion objects as of April 2013. This is up from 102 billion objects as of March 2010.
S3 uses include web hosting, image hosting, and storage for backup systems. S3 guarantees 99.9% monthly uptime service-level agreement (SLA), that is, not more than 43 minutes of downtime per month.
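For anyone who wants to check where that 43 minutes comes from, here is a quick sketch of the arithmetic. It assumes a 30-day month, and the function name is mine, not anything from Amazon’s SLA text.

```python
def allowed_downtime_minutes(uptime_percent: float, days: float = 30.0) -> float:
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1.0 - uptime_percent / 100.0)


# Compare a few "nines" side by side.
for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime over 30 days allows about {minutes:.1f} minutes down")
```

Each extra 9 cuts the allowance by a factor of ten, so an outage of several hours uses up many months of a 99.9% budget in one go.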
So it is the big bucket in which lots of folks put their always to be available “Cloud” stuff. Well, guess what, disks do not magically become more reliable by putting them on the other side of the nation in a data center run by someone else. “Stuff” happens. And it did. So 2 trillion “objects” were unobtainable for a while… which will have impacted a lot of folks. Back at 0Hedge:
According to The Register, one chief technology officer reported that “we are experiencing a complete S3 outage and have confirmed with several other companies as well that their buckets are also unavailable. At last check S3 status pages were showing green on AWS, but it isn’t even available through the AWS console.”
Indicatively, Amazon had a similar “Increased Error Rate” event several years ago, which led to a hard reboot and an outage which lasted for several hours. It is unclear if Vladimir Putin was blamed for that particular incident.
I note with admiration their snark about Putin. But clearly they missed that Trump was responsible as Putin’s Man and just wanted the Internet shut down for his speech… /sarc; and /snark;
They then quote a rather long description of the 2008 S3 outage, as though that matters…
Amazon S3 Availability Event: July 20, 2008
We wanted to provide some additional detail about the problem we experienced on Sunday, July 20th.
At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.
And on and on…
Further up thread, they have more nearly current information.
A disturbance that caused several prominent websites, including Imgur and Medium, to go offline, miss images or run slow has been tracked to storage buckets hosted by Amazon’s AWS, which, while not reporting any explicit failures, has posted a notice on its service health dashboard that it has identified “Increased Error Rates” and adds that “We’ve identified the issue as high error rates with S3 in US-EAST-1, which is also impacting applications and services dependent on S3. We are actively working on remediating the issue.”
The abnormal state reportedly kicked off around 0944 Pacific Time (1744 UTC) today.
Update at 10:33 AM PST: We’re continuing to work to remediate the availability issues for Amazon S3 in US-EAST-1. AWS services and customer applications depending on S3 will continue to experience high error rates as we are actively working to remediate the errors in Amazon S3.
Update: According to BGR, if it seems like your internet browsing is hitting more walls than possible today, you’ll be happy to know that it’s not your computer. A massive Amazon Web Services (AWS) outage is striking down lots and lots of web pages, leading to huge hiccups on a number of domains. Amazon is reporting the issue on its AWS dashboard, citing “Increased Error Rates,” which is a fancy way of saying that something is seriously broken.
Amazon Web Services is the cloud services arm of Amazon, and its Amazon Simple Storage Service (S3) is used by everyone from Netflix to Reddit. When it goes down — or experiences any type of increased latency or errors — it causes major issues downstream, preventing content from loading on web pages and causing requests to fail.
These instances are always a great reminder of how much of the internet relies on just a handful of huge companies to keep it up and running. An issue with Amazon’s S3 service creates a problem for countless websites that rely on their storage product to be up and running every second of the day. Unfortunately, there’s always ghosts in the machine, and downtime is inevitable. Let’s all pray that Amazon gets everything sorted out in short order.
Update 2: according to the latest, at 11:35 AM PST: “We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.”
Gee, I’ve had zero data outages from my disk farm in over… over… well, I don’t know how long it’s been. At least a decade. More likely 2 or 3 decades. I have lost disks, but never data or a data access outage. Why? Simple redundancy. I have 2 copies of everything. 99% of the time my disks are shut down too. Makes for decent lifetimes on them ;-) Every few years I copy all the data to a new disk. (The old disk still gets used, I’m just refreshing the magnetic strength of the bits on the platters and doing reorganizing.)
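Two copies only help if they actually match, so it is worth checking now and then. A minimal sketch of such a check, assuming the two copies are directory trees mounted at the same time; the function names are my own illustration, and this is not the exact procedure I run.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(primary: Path, mirror: Path) -> List[str]:
    """List files under primary that are missing from, or differ in, mirror."""
    problems = []
    for src in sorted(primary.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary)
        dst = mirror / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems


# Hypothetical usage; the mount points are placeholders:
# for line in compare_copies(Path("/mnt/disk_a"), Path("/mnt/disk_b")):
#     print(line)
```

It only looks one way (primary to mirror), which is enough to tell whether the second copy can stand in for the first.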
The only data loss I’ve had was when AOL decided to toss out my saved email (and accumulated incoming) due to not logging in as often as they would like (i.e. changed terms of service poorly communicated). I now don’t depend on them for email storage…
Simply put: You can’t escape systems failures by third-party blame. You can only escape them by good practices and good design. It IS up to you. Nobody else can do it for you.
Some more history. So not just 2008, but here is an article from 2015 that also references a 2011 outage. Looking like about every 3 years they take a dive…
News 9/22/2015 09:31 AM
Amazon Disruption Produces Cloud Outage Spiral
Amazon DynamoDB failure early Sunday set off cascading slowdowns and service disruptions that illustrate the highly connected nature of cloud computing.
Amazon Web Services had a rare instance of a cascading set of disruptions that rippled through a core set of services early Sunday, September 20. The failure appears to have begun in DynamoDB and spread from there.
The incident fell well short of the disruptions and outright outages that afflicted Amazon over the Easter weekend of 2011. Many fewer customers seem to have been affected, and it appeared slowdowns or stalls were mainly limited to a data-oriented set of services. Nevertheless, this episode illustrates how in the automated cloud one service is linked to another in ways that the average customer may have a hard time predicting.
It is FINE to use cloud services, provided you know what you are getting and plan accordingly. Have redundant facilities. Have automated fail over for routing and services. Have some kind of in-house capability to respond. (You don’t really think that external service providers will be putting YOU at the top of their service list when “the internet is broken” and all their customers are calling… do you?)
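In code, “automated fail over” for reads can be as simple as trying each copy in turn. A rough sketch, assuming each storage location is wrapped in a function that takes a key and returns bytes; `cloud_bucket` and `in_house_copy` are made-up stand-ins, not real client calls.

```python
from typing import Callable, Sequence


def fetch_with_failover(key: str, sources: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each source in order and return the first successful result."""
    errors = []
    for source in sources:
        try:
            return source(key)
        except Exception as exc:  # real code should catch the client's specific errors
            name = getattr(source, "__name__", repr(source))
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all sources failed for {key!r}: " + "; ".join(errors))


# Hypothetical stand-ins for a cloud bucket that is down and a local copy that is up.
def cloud_bucket(key: str) -> bytes:
    raise ConnectionError("increased error rates")


def in_house_copy(key: str) -> bytes:
    return b"data for " + key.encode()


print(fetch_with_failover("drawing-release", [cloud_bucket, in_house_copy]))
```

The ordering of the list is the whole policy here: primary first, fallback second. Anything fancier (timeouts, health checks, routing changes) builds on the same idea.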
Heck, I’ve pondered making a version of GISTemp and the Climate Models that runs on AWS. That would let folks run on a very large cluster supercomputer facility if desired (and wallet willing), as they like it. But if that service is down for a day, nobody will really notice. Your hospital medical records are a different thing.
My company uses their services to store copies of the BIM models we use. Four or five engineers sitting around not working on a major drawing release due THAT DAY did not charm my boss.
On the subject of backing up your data…
I had 6+ years of credit card transactions on the bank’s on-line system which they had categorised into 20+ groups like Groceries, Phone etc. I had spent many hours over that time correcting these guessed groups – their system did a good job but did get some guesses wrong.
These transactions gave the bank the ability to interactively show my spending history and seasonal patterns – useful for me to check when budgeting.
Recently I got a notification that my card number had been used in an AT&T scam and it was being closed down and a new card was being sent to me.
When the new card went active, all the old transactions were ‘wiped’ :(
All my data had been lost in an instant – despite lots of pleading to the bank.
If I had known I could have copied these transactions and made my own ‘interactive’ graphing system, but there was no warning from the bank.
This note is a warning to anyone who uses such a bank generated service. As EMS says “you must keep your data safe yourself”.
BTW I am doing the same with medical records, X-rays etc. – I keep these myself.
A lot of venom in this one. But it is accurate. AWS is a big player and as such they need to get their crap together and get to those 9s. For a large organization, it is the “security” of knowing your data is always safe. That is the selling point. These outages not only cost them in terms of SLAs, but in reputation.
I am a fan of services like that. Only because it means I do not have to worry about AC and power and extraneous stuff, I just have to worry about my areas of expertise. But I am no fan if those 9s do not materialize.
Reposting this in the proper thread
Recent Amazon cloud outage explained — human error typo took too much capacity off line at one time.
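One common guard against that kind of typo is to have the tool refuse any request that removes too much at once or drops capacity below a floor. A minimal sketch of the idea; the function name and thresholds are invented for illustration and are not Amazon’s actual tooling.

```python
def servers_to_remove(requested: int, in_service: int,
                      min_required: int, max_fraction: float = 0.05) -> int:
    """Return requested if it is safe to remove that many servers, else raise."""
    if requested < 0:
        raise ValueError("requested must not be negative")
    # Cap on how much may be removed in a single step.
    step_cap = int(in_service * max_fraction)
    # Never go below the minimum the service needs to keep running.
    floor_cap = max(in_service - min_required, 0)
    allowed = min(step_cap, floor_cap)
    if requested > allowed:
        raise ValueError(
            f"refusing to remove {requested} servers; at most {allowed} allowed in one step"
        )
    return requested


print(servers_to_remove(3, in_service=100, min_required=80))
# servers_to_remove(30, in_service=100, min_required=80) would raise ValueError
```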
I work for a company that supplies 3G/4G M2M modem/routers. One of the services we offer is a shared private APN. While there are not that many people with both the equipment and skills required to even get a glimpse at data traversing such networks, for mission-critical applications where secure comms is a must-have, I always recommend using IPSec as well – perhaps a little OTT, but you did say “mission-critical” and “secure”, didn’t you? But I digress…
We moved our Cisco ASA and RADIUS server to the cloud – mainly because we got better bandwidth, but also to keep costs under control (SLAs meant we needed to carry spares, have someone on call, etc.). Now we are looking at moving back – costs didn’t drop as much as we hoped, uptime got worse not better, and this is costing us in lost business and reputation. Bean counters are suitably hang-dog looking and management is paying more attention to those of us techies that warned about free lunches. This sort of thing (Amazon) will be noticed too – I’ll make sure of it, in case some bright spark decides we just need a “better” (read bigger) cloud provider.