GISS Watch – Wonder What Is Happening

I wonder what is happening at GISS

(Goddard Institute for Space Studies)

UPDATE2:

With a h/t to Peter O’Neill in comments under the GIStemp tab up top, I think we have our answer. Looks like they decided to put back the USA thermometers. From:

http://data.giss.nasa.gov/gistemp/updates/


November 13, 2009: NOAA is no longer updating the original version of the USHCN data; it ended in May 2007. The new version 2 currently extends to July 2009. Starting today, these newer data will be used in our analysis. Documentation and programs will be updated correspondingly.

So it looks like it was “maintenance” to do what ought to have been done a year or two ago… OK, better late than never. Oh, and now that you’ve put back the USA thermometers you can go into GHCN and start putting back the rest of the world…

UPDATE: Well, they are back up. No notice as to why.

Their web services have been down substantially all weekend. And I wonder why?…

This is often done when some kind of major “staffing change” is in the works; but can also just be a ‘server outage’.

Yet I’ve built enough “non-stop computing environments” to know that there are many ways to prevent a weekend long outage (or even a 5 minute outage).

And I’ve built enough “fail over co-location redundant” facilities to know that there are many ways to prevent a weekend long outage.

And I’ve been the I.T. guy asked to “lock down the site” while H.R. did the “keepers and tossers” sort…

And I’ve been the I.T. guy on a pager who got the call Saturday Night to come fix the meltdown and had it back up in under 4 hours (despite no new hardware… repurposing what was on site…)

So I’m left wondering: Why has NASA GISS been down for the weekend?

http://www.giss.nasa.gov

is now not responding.

http://data.giss.nasa.gov/gistemp/station_data/

is now not responding.

I even went into the net maps and found the direct names of their web servers (at least 3 in “round robin” with 2 significantly different IP number groups which ought to mean 2 different Internet Service Providers or 2 different POPs serving their account).

So, IMHO, it is time to start a “GISS Watch”, and wait to see what tomorrow brings.

All we can do is “watchful waiting”… and now, some wondering too…

As of Monday Morning it looks more like they just have poor weekend coverage for their site support.

About E.M.Smith

A technical managerial sort interested in things from Stonehenge to computer science. My present "hot buttons" are the mythology of Climate Change and ancient metrology; but things change...
This entry was posted in AGW GIStemp Specific.

10 Responses to GISS Watch – Wonder What Is Happening

  1. j ferguson says:

    It’s ok, E.M. They’re back as of 1200Z. How about site maintenance?

  2. juanslayton says:

    I see the GHCN station lists are back, too. Beats me, what’s going on.
    jws

  3. E.M.Smith says:

    If it is “site maintenance” it was particularly “ham handed”.

    They have at least 3 web servers behind a front end (you get re-directed to 3 different server IP numbers from the same DNS lookup):

    $ host data.giss.nasa.gov
    data.giss.nasa.gov is an alias for web2.giss.nasa.gov.
    web2.giss.nasa.gov has address 169.154.204.81
    $

    Last time I did this it was “web3” that got returned.

    And behind the scenes:

    $ host web3.giss.nasa.gov
    web3.giss.nasa.gov has address 169.154.204.35
    $ host web1.giss.nasa.gov
    web1.giss.nasa.gov has address 169.154.204.33
    $
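For anyone who would rather script the lookups above than type them, here is a minimal Python sketch. It assumes nothing beyond the standard library. The function name `resolve_ipv4` and its `resolver` parameter are my own illustration (the parameter exists only so the function can be exercised without a live network), and the host names are simply the ones quoted in this comment, which may no longer resolve.

```python
import socket


def resolve_ipv4(name, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses a name resolves to."""
    # getaddrinfo follows aliases (such as data -> web2) on our behalf.
    infos = resolver(name, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); for IPv4
    # the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_ipv4("data.giss.nasa.gov")` at different times and comparing the answers is one rough way to see whether a name is being rotated between servers. The sort is plain string order, used only to make the output stable.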

    Though interestingly enough, right now, doing a direct entry of the ‘web3’ name gives :

    “Dummy index file.”

    Which does look like web3 “had issues” and is being fixed as I type…

    The way you usually handle things like site maintenance with a set-up like that is to ‘offline’ one of the three (and the other two servers then do the load share). You update that one, and put it back into the rotation. Then offline the second one and update it. Then the third. If one is sick (like web3) you offline it and point its DNS entry at one of the other two while you work on it.
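The one-at-a-time procedure just described can be sketched as a toy simulation. This is only an illustration in Python: the function `rolling_update`, the "old" / "new" labels and the dictionary layout are my own assumptions, not anything GISS actually runs.

```python
def rolling_update(servers, new_version):
    """Update servers one at a time; return the in-rotation snapshot at each step.

    `servers` maps a server name to its current version and is modified in place.
    """
    snapshots = []
    for name in list(servers):
        # Take one server out of the rotation; the others carry the load.
        in_rotation = {n: v for n, v in servers.items() if n != name}
        snapshots.append(in_rotation)
        # Update the offline server, then let it rejoin the rotation.
        servers[name] = new_version
    return snapshots


pool = {"web1": "old", "web2": "old", "web3": "old"}
steps = rolling_update(pool, "new")
```

Each entry in `steps` shows what visitors could reach while one machine was being worked on. Two servers are always in rotation, and the middle step serves a mix of old and new.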

    The only impact is that during the middle of the process, you might randomly get either the one old or the one new server. If for some reason this is not acceptable (such as announcing a product recall – don’t want “it’s fine” and “it’s broke” alternating…) you can put the one “new” server back up at the lowest load hour and swap out the two old ones. You then have a little bit of a ‘race condition’ to get them all updated before the load rises, but that’s very ‘doable’.

    FWIW, I once managed the “rotation” of a major entertainment company web presence. We had 6 machines running and at peak the minimum that would work with acceptable response time was 4. Off peak, we could run on 3 … barely. Since it was a ‘tens of thousands of dollars an hour’ booking operation, “downtime” was not acceptable… (IIRC it was somewhere around $1 Million a day loss for any downtime.) And the system / operating system / application updates we were doing were a “mutually exclusive set”. (That is, you could not have 3 old and 1 new running together…) We took 3 down “off peak” and updated them, then swapped 3 for 3 as a hot cutover, then had the race condition to get 1 more updated and added to the cluster before load ramped up.

    Whole swap done in about 3 hours (in the dead of night…) and not a ripple seen on the outside. (We did have a couple of months prep time going into it, though. I’d have been happier with a spare machine added to the project plan, but folks didn’t want to buy one… they were kind of large and expensive.)
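The arithmetic behind that 3-for-3 swap can be laid out as a small Python sketch. The machine counts (6 total, 4 needed at peak, 3 off peak) come from the story above. The step wording and the names `cutover_plan` and `check_plan` are my own illustration, not the actual runbook.

```python
OFF_PEAK_MIN = 3  # fewest machines that cope off peak (from the comment)
PEAK_MIN = 4      # fewest machines that cope at peak (from the comment)


def cutover_plan(total=6):
    """Return (step description, live old count, live new count) tuples."""
    half = total // 2
    return [
        ("all machines live on the old release", total, 0),
        ("take half offline and update them", total - half, 0),
        ("hot cutover: updated half in, old half out", 0, half),
        ("update one more machine and add it", 0, half + 1),
        ("update the remaining machines and add them", 0, total),
    ]


def check_plan(plan, minimum):
    """True if no step mixes releases or drops below `minimum` live machines."""
    for _, old, new in plan:
        if old and new:
            return False
        if old + new < minimum:
            return False
    return True
```

`check_plan(cutover_plan(), OFF_PEAK_MIN)` holds, while the same plan fails against `PEAK_MIN`. That gap is the race: the steps with only 3 machines live have to finish before the load climbs back to peak.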

    The other possibility is that the DNS rotation on the front end got messed up, or the front end is only a single router / ISP connection and that “had issues”. But I’ve done a “live cutover” for a major “coffee company” that booked orders for all their regional stores through such a front end (i.e. being down was not an option… even ‘flicker outages’ would be a bad thing…) We put in a load balancer and added a second ISP connection with redundant routers (and, incidentally, renumbered all their internal IP space… live…) While it did feel a bit like working the “high wire” without a net it was an interesting ‘bit of work’. So they ought to have been able to ‘stay up’ even if they were working on the network side of things.

    Heck, I was once on site at a small hardware company when their main router died (about 2 am …) and they had something like a 1 week response time service contract on it. I got to cobble together a new “main router” and get them back live from ‘what was laying around’… A minor department ended up on a switch/hub, and their router got ‘promoted’ to the ISP connection. It was a different brand with different i/o cards, but with a bit of imagination I found a way to make it work. We were ‘live’ again before start of business in the morning. Then I had a discussion with their management about the value of spare hardware and vendor service contracts …

    The bottom line, for me, is that if a 50 person start-up scale shoestring hardware company can ‘be back up’ in less than 8 hours; and if real world multi-server operations can be ‘never down’: For someone swimming in Federal Bucks (like NASA does) to be down for the weekend is just smarmy.

    Then again, I suppose I ought not to expect so much from these folks…

    At any rate, given the “web3” response as I type, it looks like “web3” was sick and yet as far as the network gear was concerned “looked fine” so was being sent the load. You connected, but to a ‘dead box’. The network gear would report “All is well” and you would see “connection made” being reported, just nothing coming from the server. I would have expected a better ‘rotation’ between the servers, but then again, I’m assuming they are a ‘rotation’ and not just ‘spares’.
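The difference between “connection made” and “the server actually answered” can be sketched as two health checks. This is a hedged illustration in Python, not a description of the GISS network gear. The function names, the injected `connect` / `fetch` callables and the placeholder string test are all my own assumptions.

```python
def tcp_style_check(connect):
    """Shallow check: healthy if a connection can be opened at all."""
    try:
        connect()
        return True
    except OSError:
        return False


def content_check(fetch, placeholder="Dummy index file."):
    """Deeper check: healthy only if the server returns a real page.

    `fetch` is a caller-supplied function returning (status, body text).
    """
    try:
        status, body = fetch()
    except OSError:
        return False
    text = body.strip()
    # An empty page or the placeholder page counts as a sick server.
    return status == 200 and bool(text) and placeholder not in text
```

A box that accepts connections but sends nothing back passes the first check and fails the second. A rotation driven only by the shallow check would keep sending load to a ‘dead box’.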

    For now, I’m going to assume “web3” was sick but also was “the server” or that the rotation stuck on it due to low load level reported … And it does look like they are working on it now. Maybe weekend coverage is just not a priority for them… Or Hansen is directing budget to his favorite new projects and not to the I.T. Guys (hey, I’ve been there… it’s not fun being beaten up for ‘a poor product’ that you didn’t make and the guy in charge will not fund; but you’ve been handed this bucket of spit at 2 am and told “Fix it, it is a crisis!!! Oh, money? Look at the time… gotta go, I’ll get back to you on that money thing…”)

  4. Level_Head says:

    And now it seems apparent what was going on. Damage control.

    When you get the FOIA zip file, note carefully the file called HARRY_READ_ME.txt.

    That file contains an attempt at code reconstruction that you will find quite amusingly familiar, I think. And a few revelations. Look for “unsettling” for example.

    ===|==============/ Level Head

  5. e.m.smith says:

    Um, what “FOIA zip” file? And where is “HARRY_READ_ME.txt”?

    When I hit the sources link, it’s a busted link (no sources right now?) and I’m not sure where else to look. The “documentation” is just the usual more or less pointless minimal orientation…

  6. Ellie says:

    Hi E.M.,

    I was sure you’d have something on this – where have you been? It’s all over the web (hackers at Hadley CRU). Lots of juicy stuff.

  7. E.M.Smith says:

    @Ellie:

    I’ve been ill. (I think it was something I ate… but who knows). So I’ve been a bit “off line” for a couple of days.

    I’m still coming back up to speed. I’m just a bit fuzzy and my energy is still low. So I’d spent a couple of days doing very little (other than the one WSW posting that was mostly ready to go anyway and was waiting for Friday to release.)

    Basically, I’ve been away from the keyboard and near the bed and “little room”; so I’m completely out of sync with recent issues… (And not feeling energetic enough yet to try to catch up with internet searches… )

    Ah, a quick google turned up the WUWT page… see what I get for not reading Anthony’s stuff for a few days ;-)

  8. Ellie says:

    Oh dear – was it Hansen’s Revenge?

    All sorts of stuff has been coming out. You won’t even need to search – CA is completely blocked. I’d be surprised if your inbox hasn’t a few interesting files – I’m sure someone would have sent you the Harry_read_me file.

  9. Ripper says:

    I have a copy, E.M. Email me if you need it and I will email it.

  10. E.M.Smith says:

    I think I’ll wander through the online comments for a while before asking for a personal copy in the email… More than enough to keep me busy for a long time…

    BTW, what I’ve seen already does sound painfully familiar. A lot of the “They did what? And nothing documented?” is very familiar. And the general broken status of the code base.

    At least now we know why GIStemp and CRUT agree ;-)

Comments are closed.