This posting is just for the purpose of making a top level entry point to the various postings about the comparison of GHCN version 3.3 to version 4. It is, in essence, a set of links to prior articles.
I’ve taken “My own sweet time” to make this “aggregator” posting, but in looking just at the number of links, I can see why I was feeling a bit “burnt out” on the topic / efforts. That’s one heck of a lot of work and learning embodied in all those links!
Well, the good news is I’m getting over it now. Enough that I was ready to dig back through it all and collect all the “stuff” in one easy to find posting. With this posting, I’m almost ready to take on adding version 2 to the mix ;-)
That will require adding another set of v2 tables to all this below, and would likely also be a good time to gather all the “how to build” stuff from scattered through these postings (and in some cases in comments in the postings) into one neat “How To Build” posting with scripts. If anyone else is thinking of building one of these systems and wants a single posting “how to”, post a comment to that effect and I’ll bring it all together (and likely add scripts for chunks of it).
All of these analysis postings, along with a lot more related to GHCN, can be found in the category:
I started the comparison of v3.3 to v4 by looking at some regions in total (the “continents” just below) and then at some selected countries, just to get a feel for what was coming and to assure I was looking at things the right way with tools that worked. If you don’t want to see those 2 first steps, just skip on down to “Around The World” for every country in the world represented in GHCN. Here are those two postings:
Around The World
These are the collections of graphs comparing v3.3 to v4 in all the various countries in the world represented in the GHCN Global Historical Climate Network data sets. They are not presented in the order created (shortest to longest list of countries) but rather in the order of the continents as numbered by GHCN.
Africa – 1
Asia – 2
South America – 3 with Antarctica – 7
North America – 4
Australia & Pacific Islands – 5
Europe – 6
Antarctica – 7
There are no countries in Antarctica, just one continent graph, so I put that at the end of South America (see just above).
QA and Technical Housekeeping
In this posting, I test the sensitivity of the “anomalies” to the final date used in computing them, using a “baseline” period that ends in 2015, the same year GHCN v3.3 ends, rather than using all the data to the end of v4. This assures only the same time periods are used for computing differences. An important sanity check that shows it is NOT the inclusion of 3 more years of data in v4 causing the changes in the comparisons.
Some general complaints about the kinds of change made from data set version to version, and how they make effective cross version comparisons deliberately and unnecessarily difficult.
At one point I attempted to port the system to an Odroid N2 (which had just arrived on the market as a new system with a barely ported operating system). This worked well right up to the “plot the data” step, when the Python Matplotlib had a bug in it that screwed up headings. Here’s the story of the process / experiment:
That’s why a lot of these graphs and this process were done on much slower boards with more stable operating systems. Having things work right often means using systems of hardware and software that are more fully debugged and NOT the “latest and greatest”.
Along the way, this also showed that while the faster computers were significantly faster, even a Raspberry Pi Model 3 was sufficient to do this work. (Just have a box of candy bars and extra coffee available for the waits ;-)
Close Ups & Odd Bits
Here I’m putting links to minor investigations I did, or things I looked at “up close” along the way. Things that might need similar investigation for far more parts of the world.
How does the inventory of thermometers change over time? In nice globe graphs:
What is the global distribution of thermometers in the GHCN v4 set NOW as compared to the time window used by GIStemp for computing a “Baseline”?
How did the “high altitude” stations represented change over time? This matters rather a lot since, as we are experiencing now, when the Sun has a major quiet time, the UV drops, total atmospheric height shrinks, and all the “high cold places” end up at, effectively, higher and colder density altitudes (i.e. thinner air). Change when they are in vs out of the data set and you are indirectly adding / removing those solar changes.
Peculiar things about Djibouti. Just why would the SAME historical data from the ONE instrument, change? Eh?
I’d often rejected the idea that “The Anomaly Fixes Instrument Change”. Using this database I was finally able to assess that belief. I found that the use of anomaly processing is NOT sufficient to correct for instrument changes. As the entire GHCN is one giant mass of instrument changes, this means that there’s no validity to the data even when processed as anomalies, for saying anything about 1/10 C scale (or even whole C IMHO) “trends”. It could just as easily be an artifact of instrument changes and data series length changes.
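The core of that finding is easy to demonstrate with invented numbers. Here a station with a dead flat temperature record gets a new instrument in 2000 that reads 0.5 C higher; anomaly processing vs. the station’s own mean does not remove the step:

```python
# Toy illustration: anomaly processing does NOT remove an instrument change.
# A flat-temperature station swaps instruments in 2000; the new one reads
# 0.5 C higher.  All numbers here are invented for the demonstration.
temps = {y: 10.0 for y in range(1990, 2000)}
temps.update({y: 10.5 for y in range(2000, 2010)})   # instrument swap

baseline = sum(temps.values()) / len(temps)          # 10.25
anoms = {y: t - baseline for y, t in temps.items()}

# The step survives anomaly processing: -0.25 before the swap, +0.25 after,
# which a trend fit happily reads as 0.5 C of "warming" that never happened.
assert anoms[1995] == -0.25
assert anoms[2005] == 0.25
```

Scale that up to thousands of stations each with their own history of swaps and record-length changes, and you have the problem described above.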
When I first had this running, I did Australia Pacific Islands region first as a kind of test case. Here’s that posting:
The Code & Computer Stuff
There is a general introduction to this code, the database, and all here:
Various technical changes were made over the course of the investigation. Here’s the last iteration of the scripts to load the database:
The original database and related code install scripts and process:
Here is the original build of the statistics and anomaly tables:
Where I got the data, how to unpack it, and the first look at it:
The GHCN v3.3 Steps
This comparison of v3.3 to v4 started with getting v3.3 loaded and kicking it around a little. Could not compare v3.3 to v4 until they were both loaded, so had to start somewhere. I started on v3.3, and here are those 3.3 only postings:
I’d compared the seasons in Australia, then had a request to do a comparison of the seasons using the months as defined in Oz for each season:
Here’s the various continents with standard seasons used on all of them:
Do thermometer anomalies by season by continent tell us anything interesting? I think so… Graphs of same, and conclusions that altitude in winter and asphalt in summer sun matter to volatility:
Anomalies as computed and graphed, by Continent in v3.3:
GHCN v3.3 thermometer change over time, plotted on a global graph:
Comparing the altitude of stations overall, with the altitudes used in the GIStemp baseline time period. Let’s just say they are not really comparable…
How long do thermometers “live” in the data set? (It would be very interesting to compare this with how long they exist in the real world…)
Where are the thermometers NOW in v3.3 vs where they were during the Hadley baseline period? Global graph:
What might be the impact of particular months on how the anomaly looks? What do anomalies look like, month by month, and does this have implications for volatility of data and what months might have more data dropped for being “out of range” if you use a “one size fits all” volatility screen?
How does Station at Altitude change over time?
A close up look at where thermometers are, by continent, in the baseline period vs now:
Where are all the thermometers? Why, where all the people are (and all their industrial and household and transport and asphalt heat are located):
I take a look at ALL the data for Australia as a “scatter plot”:
Scatter Plots for all the continents:
A close up look at San Francisco that asks if maybe, just maybe a 5 C change is not due to CO2 but “something else”? Like maybe that giant airport expansion for the Jet Age?
The first GHCN sin based globe I plotted with labels:
The earlier attempt with a rectangular globe projection:
Then there are the Pre-Plotting Postings
Version 3.3 Related Technical Bits
A crude look at how “country code” changes between versions as I tried to figure out how to work around that:
My first step: building a MySQL database. (Later changed to MariaDB, way up above.)
Pingback: GHCN v3.3 vs v4 – Top Level Entry Point – Climate Collections
Hi E M
Yes, it’s good that you have done this… It also shows the amount of work you have done over the past year…
Meanwhile I accidentally discovered a CSIRO research paper from 2003 which might give some insight into how the data diddling happened here in Oz.
[ Without the “chrome Extension” stuff: http://www.cmar.csiro.au/e-print/internal/braganza_x2004a.pdf -E.M.Smith]
Lots of mathematical equations summoned out of thin air… Just so the non mathematically minded among us cannot challenge the logic…
I am also wondering why Murray Bridge (a nearby town) gets completely omitted in the discussion… along with quite a few other BOM stations.
@EMSmith; For the last few years I have had the pleasure of “watching” you forge the tools necessary for this undertaking. Applying knowledge and talent in place of massive funding and man power to create elegant solutions to the problems posed in evaluating this massive data trove.
From my point of view the tools created for this project are more important than the outcome because it places massive but inexpensive computing power in the hands of independent thinkers that are not part of the accepted Elite…pg .
One of my goals has been to make everything available for others to replicate and make it run on equipment cheap enough anyone can afford it. I think I’ve done that.
It has been made a lot easier by the incredibly fast improvement in the SBC market in total computes / $ spent. Less than the cost of one tank of gasoline buys one of the top end hot boards that is way more than what is needed.
On the software front, I really ought to go back through it from front to end and polish things up. Make it less “what works” and more “what is well designed while working”. As it stands now, it is just the first working model. One step up from prototype, but with each bit worked out as it came due. Usually once you have done it all, you finally know enough to go back and re-do the whole thing in a better way. (You have the ability to “look ahead” and do some bits earlier in the rewrite, or structure a more integrated database instead of a bunch of “glue ons” as each step was worked out).
I don’t know if I’ll do that or not. It may be best left as “An Exercise For The Students” ;-) so they will have some value add to motivate them. (IF there are any…)
I’ve hoped more folks would pick up the stuff I’ve written and published and use it, improve it; but I’ve not seen much evidence for that. I probably need to do the whole hog “define a project” and put source on git-hub or git-lab and all that.
Well, it is what it is. It works. It has illuminated some things that I think are rather important. If someone REALLY wants it improved they can DIY it (or give me a grant… maybe I need to make a 501c3 “charity” and apply for “Climate Change” grants ;-)
If nothing else, it has done a good job of keeping my Linux / Database / Programming skills in good working order. Nothing like learning another database product / syntax, a new computer programming language, and some graphics libraries to make you feel good about still “having it” ;-)
EM, you asked: ” If anyone else is thinking of building one of these systems and wants a single posting “how to”, post a comment to that effect and I’ll bring it all together (and likely add scripts for chunks of it).”
I would like to replicate your work.
A hearty thank you for the amount of effort you have already done.
EM, congratulations this is the War & Peace and definitive evaluation of GHCN data.
It is a pity that President Trump has backed off from challenging the CAGW science because this would have been a good place to start.
Being an Access & VBA programmer I can easily imagine how much time and effort has gone in to it.
OK! Then instead of just gluing on a v2 chunk, I’ll walk through the design first to see if I can improve things.
Then I’ll make a “soup to nuts” re-install / test and post it up.
Thanks for that! I’d gotten close to burn out on the global graphs. Part of why I’ve done a lot of unrelated stuff since. Just getting recovery time from the big push…
Now I’m ready to move the comparison back another decade of diddle to v2.
As I recall it, v1 to v2 was a minor diddle, it was with v3 & 4 that it went way off. Maybe I’ll work in both v1 and v2 in one go… V2 uses the same basic structure and countries / station codes as v3 so ought to be trivial to add. V1 is very different though. We’ll see when I get into it.
I’d definitely like a “how to” posting. I have done some work on this myself; yet having a “been there, done that” overview helps. The thing I am interested in changing is plotting box-whisker graphs. I do have some time using R, so I am likely going that way, rather than playing with Python.
I only used Python to get to the matplotlib plotting package.
The database is all MySQL / MariaDB (same thing, really, differences are small) and most of the heavy lifting is done with SQL scripts.
There is a (nearly trivial) FORTRAN loading step that could easily be redone in a dozen different languages. Primarily I used FORTRAN just because the original data were written by FORTRAN so are in a “FORTRAN Friendly” fixed format. Very UN_Fortunately, many “modern” languages have become “Fixed Format Hostile” or at least unfriendly… (I’ve never liked the hoops one must jump through to get C to do plain fixed format. It just smells of Kludge compared to FORTRAN.)
Essentially that step is taking fixed format and making it TAB separated (as MySQL / MariaDB are in that fixed format hostile category and want TAB or CSV…). Then I added to it the computing of COS(latitude) as it was easy to do it there and makes globe shaped (SIN) plots easier. That’s pretty much it. (I do break up some composite fields into their parts so you get the whole and the parts as discrete items). Any of Perl, Python, C, and a dozen other languages CAN do this, as long as you are willing to tackle how they treat fixed format data… But really, FORTRAN is a dead simple and clear language to use and is ideal for this kind of thing.
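For concreteness, here is roughly what that reformatting step does, sketched in Python rather than the actual FORTRAN. The column positions below are illustrative, not the exact GHCN record layout, and the function name is mine:

```python
# Stand-in sketch for the FORTRAN "fixed format -> TAB separated" step.
# Hypothetical fixed columns: station id [0:11], year [11:15],
# latitude [15:23], longitude [23:32].  Not the real GHCN layout.
import math

def reformat_line(line):
    stn  = line[0:11].strip()
    year = line[11:15].strip()
    lat  = float(line[15:23])
    lon  = float(line[23:32])
    # Pre-compute COS(latitude) once here, so the SIN-projection globe
    # plots don't have to redo it every time a graph is made.
    coslat = math.cos(math.radians(lat))
    return "\t".join([stn, year, f"{lat:.2f}", f"{lon:.2f}", f"{coslat:.4f}"])

row = reformat_line("USW000230632015 37.6189-122.3750")
# row is now TAB separated, ready for MariaDB's LOAD DATA style import
```

Any of the languages mentioned can do this; the whole job is “slice fixed columns, add one cosine, emit TABs”.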
Then it gets chucked into a Database. The Database is NOT a normalized Relational design. I treat it more like a series of “built for purpose” storage structures, that can also be used in JOIN and related ways to get other structures. As storage is so cheap and this data not that large, I trade a fair amount of storage and redundancy for fewer computes (which also makes it easier to use a dinky board like a Raspberry Pi Model 3). Were I designing this for a general purpose R&D system on Big Iron, the database would be quite different.
That’s one of the aspects I’m going to review in the next few days. Is that REALLY the way I want this done? Now?
Basically, at times I just move the data along a step and do a set of computes so that they will not need to be done over again each time you run a report. (Like that computing of COS(LAT) done once and stored for each station, instead of done de novo each time you make a graph). Similarly, I compute all the anomalies in one big step and move that into another table with more redundant data. That makes working with all the anomalies easier and faster (as most of it is in one table) at the expense of more data storage space and redundant copies of data items.
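That “compute once, store in a fatter table” pattern can be sketched as below, with sqlite3 standing in for MariaDB. The table and column names here are illustrative, not the actual schema from the postings:

```python
# "Move the data along a step" pattern: build monthly stats once, then
# build an anomaly table once, so reports never recompute either.
# sqlite3 stands in for MariaDB; schema names are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE temps4  (stnID TEXT, year INT, month INT, deg_C REAL);
CREATE TABLE mstats4 (stnID TEXT, month INT, mean_C REAL);
CREATE TABLE anom4   (stnID TEXT, year INT, month INT, deg_C REAL);
""")
db.executemany("INSERT INTO temps4 VALUES (?,?,?,?)",
               [("STN1", 2014, 1, 10.0), ("STN1", 2015, 1, 12.0)])

# One big pass builds the per-station monthly means...
db.execute("""INSERT INTO mstats4
              SELECT stnID, month, AVG(deg_C) FROM temps4
              GROUP BY stnID, month""")

# ...then one big pass stores the anomalies, redundantly but conveniently.
db.execute("""INSERT INTO anom4
              SELECT t.stnID, t.year, t.month, t.deg_C - m.mean_C
              FROM temps4 t JOIN mstats4 m
                ON t.stnID = m.stnID AND t.month = m.month""")

rows = db.execute("SELECT year, deg_C FROM anom4 ORDER BY year").fetchall()
# rows -> [(2014, -1.0), (2015, 1.0)]
```

Storage is spent twice (temps and anomalies both live on disk), computes are spent once. That is exactly the trade that keeps it runnable on a Pi M3.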
Since the data always flow through one way, there isn’t much (any?) update going on, so that aspect of Normalized Structure isn’t really valuable. (Redundant data complicates data updates / changes) As the cost of a 32 GB “chip” to hold all of this (or a 32 GB patch on your 1 TB USB disk) is nearly nil, the storage cost doesn’t matter.
But that’s another of those things I want to double check. Is it REALLY the best way to do this? I had ONE goal in mind. Those final graphs of v3.3 vs v4 showing changes. This whole structure was designed to reliably get there in incremental steps with break points (redundant data structures one step further along the path). Now that I’ve done that: Would a different approach be better for longer term more generalized goals?
Since some of the design choices can have a 10 hours vs 10 minutes time to compute, it may require trying some of the alternative ideas to see if they are losers, or not. I found that I could cut a 10 hour “update” dramatically with some small tweaks to the database and update syntax. Avoiding that kind of thing is important to keeping this running on minimal hardware…
Well, I’m starting to ramble (quelle surprise ;-) and it would be better for me to be doing instead of musing about doing…
With that: I had originally thought I’d have this designed with the v2, v3, and v4 data for a given site all in one table. Along the way I shifted to unique tables by version. An expedient way to get past some annoying differences in the versions. I think I’ll start by looking at that choice again. Now that I’ve gotten past those differences, is there some way to return to one Grand Unified Data Table… More “updates in place” and fewer “Join, Extract, Load new Table”… And if it is doable, is it valuable?….
As you have some R experience, and I think it can do matplotlib also, you might want to just look over one of the Python report / graphing programs and make the R equivalent. As R is more terse than Python, it ought to be about a dozen lines long. I’d be happy to run it and report results. (And maybe learn some R in the process ;-)
I think I’ve annotated what the Python is doing on various graphing steps such that it ought to be easy to rewrite, but if you want a “what’s this do” on any given program, give a holler. As I recall it, the first 1/4 is “set up Python” with libraries and to use the database, then it calls the database with an SQL statement. Then there’s a 1/4 that stuffs the return from the DB into something python likes and gets the axis oriented right. Then 1/4 that does the plots (basically calls to matplotlib that ought to be similar for R), followed by a clean up step. Fairly trivial code, really. I don’t do anything even approaching Apprentice Level in Python. Strictly NOOB level. A rewrite in anything else would be simpler, IMHO.
Oh, and another thing:
I did some walking into walls (OOOF! My Nose! ;-) along the way. One was finding out which Python to use and another was the MySQL vs MariaDB issue. A re-write would all be in the final choices. MySQL has been, effectively, ditched by the Linux community for MariaDB. It is ALMOST the same. Just a slightly newer version with a name change. BUT some of the version change includes changes in libraries called and how they work… so some of the older examples are from MySQL days. So basically I need to bring it all up to one level. The most recent Python and MariaDB.
For example, in this posting:
I plot the GHCN thermometers NOW vs those recorded only in the BASELINE period of Hadley. ( 1960-1990 ish). That code used MySQL. Why does that matter? The “how to open the database” is a bit different with MariaDB.
From that posting:
Python has this notion of a Database “connector” you get to import. Some kind of OO library indirection. I’d be happier with a simple set of procedural calls / library, but it is what it is. Clearly you don’t open MariaDB with a mysql.connector in the same way as they are different version levels.
That first set of “import” lines is just loading into Python all the parts it will use (libraries and Object Oriented modules it needs to actually be useful…).
Then all the “plt.(whatever)” bits are just loading values into variables that matplotlib wants to use for things like axis labels and range limits.
That’s the first 1/2 of the program. For R you would need to do something similar to get a connection to a given database and set up details for the graph in toto. Then there’s the 1/4 that actually makes a plot:
The “db=” line just opens the database and provides credentials. For MariaDB this has a different number of parameters or different order… I had to change it somewhat.
Then we stuff an SQL statement and send that off to the database. More SQL than Python…
The SQL says to pick the LAT and COS(LONG) for every station that existed in the baseline period between 1960 and 1990 but is NOT present in 2015. You get a set of X / Y coordinates for each such station.
MySQL returns a result into “cursor”. Then you get some Object Oriented junk… I hate how OO does things, but it is what it is. “cursor.fetchall” stuffs the data you got into “stn” but a bit differently… then “np.array” rearranges it some more; all so you can hand it to matplotlib in a way it likes. Most likely R can bypass all this cruft and just plot the returned data. I hope…
Finally we load the x and y values for the plot using “data.transpose” to pick the bits we want plotted. First (the zero fields) or second (the 1 fields) from the array.
ALL that just to say “Put a DOT at the x/y coordinate pairs you got from the database”.
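The whole flow above, condensed into one runnable sketch. Here sqlite3 stands in for MariaDB, the station inventory and its columns are invented, and matplotlib is driven headless; so treat this as the shape of the program, not the real code:

```python
# Condensed shape of the plotting programs: query for baseline-only
# stations, then hand matplotlib a set of x/y dots.
# sqlite3 stands in for MariaDB; table, columns, and data are invented.
import sqlite3
import matplotlib
matplotlib.use("Agg")           # headless: render to a file, no display
import matplotlib.pyplot as plt

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE invent4 (stnID TEXT, cos_long REAL, lat REAL, "
           "first_yr INT, last_yr INT)")
db.executemany("INSERT INTO invent4 VALUES (?,?,?,?,?)", [
    ("BASE_ONLY",  0.50, 45.0, 1960, 1990),   # in baseline, gone by 2015
    ("STILL_HERE", 0.25, 10.0, 1960, 2018),   # still reporting
])

# Stations seen in the 1960-1990 baseline but NOT present in 2015:
cur = db.execute("""SELECT cos_long, lat FROM invent4
                    WHERE first_yr <= 1990 AND last_yr >= 1960
                      AND last_yr < 2015""")
stns = cur.fetchall()
xs = [row[0] for row in stns]   # pre-computed cosine longitude -> x
ys = [row[1] for row in stns]   # latitude -> y

plt.scatter(xs, ys, s=4, color="blue")
plt.title("Baseline-only stations (sketch)")
plt.savefig("baseline_only.png")   # plt.show() on a desktop instead
```

Everything the walkthrough describes (connector, cursor, fetchall, np.array, transpose) collapses to the query plus two list comprehensions here; the OO cruft is mostly packaging around that.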
I really really think R will do that a whole lot more “directly”…
Personally, I find it excessively wordy, highly obscure for no reason, and with a LOT of “hidden knowledge” needed to choose and use a “method” to get something trivial done.
It then does all that again for the 2015 stations that will be compared, but they are “red” in the color setting line.
Then the last bit of the code just cleans up:
You just issue the “plt.show” command to get the plot to actually be drawn. All that other stuff was just building it in memory somewhere…
Then there is an exception branch for error cases and a clean up and close the database branch for cleaning up at the end. Again, R will have some ways of doing this that are familiar to you.
So the point of this?
Just to point out that the bulk of the Python I use is irrelevant to writing the same thing in R. Just open the database as you would in R and similarly handle errors and close it. Also the set-up and plot will be however they are done in R.
The only “special” bit is that an SQL statement is issued to get a set of Lat / Long values to plot (and where the LONGITUDE chosen is the pre-computed COS(LONG) version to make a simple x/y plot give a SIN projection global map).
Hopefully that makes it pretty clear that to re-write this in R mostly consists of ignoring all the Python and just doing a database call to get an array of X, Y pairs and plot them.
For an R programmer, that ought to be darned simple (unless my understanding of how R gets into a MySQL / MariaDB database is broken and how it does a plot is worse than for Python…)
EM, do you intend to do another scrape of V4 in the future to see what has changed, now that you have the code it should be quite quick to do (I hope)?
There has been previous evidence that the values change virtually every month.
Yes, I do intend a new scrape / download. Not particularly soon though. Likely at the end of the year.
As to things changing “monthly”: Yes, they do. I ran into this back on v2 and v3. The claim is that the various national BOMs don’t always send timely data and that’s a big part of why. So say you get August data. It can have “catching up” values for July for any given country. Sometimes they come in a few months later…
In fact, the data change every day. Say you have data arriving for Paris. Today you have 3 out of 6 thermometers reported. Tomorrow you might get another reported and some missing data for the first one filled in. The next day the second thermometer (that was running a week late) fills in a few more days and also 1/2 of the 6th thermometer data shows up… Yes, it is THAT chaotic. Often data is missing for weeks to months.
Now think about that. How hard would it be to hold back some cold data for a month to “do more QA checking” and suddenly make THIS month “the hottest ever”? Then in a month or two release that data as “acceptable” and now that month is no longer an embarrassment by being hotter than the current “Hottest month evah!”…
Similarly, the various national Bureaus of Met. can change their QA process, run QA over previously passed data and change it, do “homogenizing” and infill real missing data with hypothetical constructed values (one of the changes for the USA was doing just that…) and more.
I did some postings on this back about 2010 I think…
Another example is Australia / New Zealand. I have forgotten if it was one or both… but they instituted new procedures for data manicuring and basically changed all of history. That, then, gets handed to GHCN and changes all of history there, too.
Part of why I’m so interested in doing “by country” A/B compares of v1, v2, v3, v4 as it ought to highlight just which countries do the data diddle and even roughly when they started.
THE most pernicious thing about all this is that as you get more “fellow travelers” involved in the diddling, the location and timing of the corruption gets more diffuse, so hiding it somewhat more. This “by version by country” comparison ought to uncover some of that covering.
What really needs to be done is to collect ALL the various origin data sets from all the various countries in all their versions and then do a comparison over time. But that takes more than one old retired guy and a Raspberry Pi computer doing volunteer work for free…
So I guess my point is that since I (we?) already have an answer to “Do the data change monthly?” it isn’t a high priority for me to demonstrate it once again…
Not GHCN but temperatures and models
“China is warming fastest where the cities are, not where the models predicted – classic UHI”
Yeah, IMHO between UHI and Airport Tarmac (seasoned with a good load of dropping “high cold stations” during low volatility times and leaving them in during the high volatility baseline) you can account for ALL of “Global Warming” and then some.
I suspect we are actually cooling, globally, and it is all down to the adjustments to create “warming” as the UHI is just hiding the cooling real trend. But that’s just my opinion.
I’ve added a pointer to the “documentation” posting under the “Code” & such heading.
Also, a modest review of the present Database Structure has me thinking there really isn’t all that much I can change. The problem is time scope.
Each of the different tables (other than the descriptive tables like inventory or continents) tends to be keyed based on a given scope of time.
temp3 and temp4 are ALL data with monthly granularity. So key is both station and month across all years.
mstats3 and mstats4 are Monthly Statistics. So key is Station BY month. No years.
anom3 and anom4 are keyed by Station, Year, Month so might also be integrated with temps3 and temps4 IF I get rid of the idea of having multiple versions of the data stored in temps3 and temps4. (At present they are set up to hold different types – min, max, average – as well as different versions (ascension, or when I made the copy). I’m clearly not doing that at present, so it COULD be deprecated out… so some thought required here.)
yrcastats – this is the Yearly by Country, Anomaly Statistics. Key is by year and country.
So basically the various key values limit what can be mixed in a table (or ought to be).
The biggest issues are just:
To keep Ascension or not? and
To keep type as a key, or not?
At present, the notion of doing this all over again for a dozen different copies of “the same data set” seems a bit beyond the realistic… It is likely better to just do a “diff” on the relevant parts of the input files to find if things changed over time inside one type of one version… (avg of v3.3 vs v3.2 for example).
So Ascension is on the chopping block.
Then, for min vs max vs avg:
Am I really going to load all three into one table? Would it be better to keep them as three different tables? (IFF I ever get to doing them…)
The only other interesting “Mods” I was thinking about were the date format and input formatting.
Dates are in ddMONyyyy form where I use an abbreviation for the month. I rather like that form of date as it can NOT be ambiguous as to what is a day vs a month. However, it is a bit of a pain on some selections / screens as you can’t use > or < or numeric comparisons. So I’m thinking of reverting to a simple number as date format.
This was a bit of a thorn in the side writing the reports as “JAN” is different from ” JAN” that’s different from “JAN ” and I had to fiddle to get that right. Yet it is right now, so changing it all just makes it easier for “next time”… maybe.
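The conversion being considered is straightforward either way. A toy sketch of going from the ddMONyyyy form to a plain yyyymmdd integer that sorts and compares numerically (helper names are mine, not from the postings):

```python
# Toy converter: unambiguous ddMONyyyy dates -> yyyymmdd integers that
# work with < and > in SQL.  Names here are invented for illustration.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def to_numeric(ddmonyyyy):
    """ '15JAN1975' -> 19750115 """
    s = ddmonyyyy.strip().upper()   # soaks up the ' JAN' vs 'JAN ' padding thorn
    day, mon, year = int(s[0:2]), s[2:5], int(s[5:9])
    return year * 10000 + (MONTHS.index(mon) + 1) * 100 + day

assert to_numeric("15JAN1975") == 19750115
assert to_numeric(" 15jan1975 ") == 19750115        # padding and case handled
assert to_numeric("01FEB1975") > to_numeric("15JAN1975")   # numeric compare works
```

The strip/upper line is exactly the fiddling the text describes; pushing it into one converter means the reports never see the padded variants.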
On the input formatting:
While I love FORTRAN, I know many folks no longer know it or use it. For a huge number of numeric problems it is ideal (that was the design goal after all) and it also excels at parsing fixed format data. For a few decades it became trendy to use various “delimited” formats instead, despite them being a pain (you must pick a delimiter that will never be in the data and never be a ‘reserved char’ to the OS, and they take up more storage space), so the languages of that era followed suit and dropped built in fixed FORMAT statements.
A quick look showed that the newer languages have realized this is a royal pain in the patoot and have added some parsing methods. Python can do it, so I could rewrite that part in Python, but I really didn’t like HOW it did it. Similarly, R has a fixed format reading routine glued on that is much more direct.
So I’m tempted to rewrite the FORTRAN parts in R (with an eye to the report writers / graphing also moving to R “someday”). But I’m not all that fond of learning Yet Another Language…
My first impulse is to do it in C, despite C parsing of fixed format being a pain and a bit of a kludge, mostly just because C is everywhere and I’ve used it before. That seems to be insufficient reason…
So, that said, anyone have a preference for what language to use to do the “change fixed format to comma delimited format and compute COS(LAT)” step? It really is trivial code and could be done in just about any language (with various levels of annoyance at how to do it).
As much as I like FORTRAN and it works FINE, I know it will be a barrier for many other folks and can easily be side-stepped.