GHCN in MySQL – Preliminary Inventory Load

Just a quick “Hey, I’m half done!” on loading the detailed “inventory” file into the MySQL database. This is preliminary, as there are a few numeric fields I’ve just loaded as type CHARACTER for now. I’ll need to cast them into numbers later / someday. (We saw that done with the temperature data earlier, inside the FORTRAN TAB insertion program for the Station Temperature Data.) I’m thinking of making Latitude and Longitude Spatial Data Type fields, but that’s a new concept for me. They are “Real” in the original description (though they look like text in the file…). But for now I just wanted to get it in and kick it around.
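
Just to park the idea where I can find it later: here’s a rough sketch of what the Spatial notion might look like against the inventory table described below, once the lat / long characters get cast to proper numbers. I have NOT run this yet, it is just the shape of the thing (and note a SPATIAL index wants a NOT NULL column, and on older MySQL a MyISAM table):

-- Sketch only, not run yet: build a POINT column from the CHAR lat / long
ALTER TABLE inventory ADD COLUMN location POINT;

UPDATE inventory
SET    location = POINT(CAST(longitude AS DECIMAL(9,4)),
                        CAST(latitude  AS DECIMAL(8,4)));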

So here’s the place where I picked up the original description for the V3 inventory file, the README from the data dump:

chiefio@odroidxu4:~/SQL/v3/ghcnm.v3.3.0.20150907$ vi ../README

 2.1 METADATA

       The metadata has been carried over from GHCN-Monthly v2.  This would
       include basic geographical station information such as latitude,
       longitude, elevation, station name, etc., and also extended metadata
       information, such as surrounding vegetation, etc.

    2.1.1 METADATA FORMAT

       Variable          Columns      Type
       --------          -------      ----

       ID                 1-11        Integer
       LATITUDE          13-20        Real
       LONGITUDE         22-30        Real
       STNELEV           32-37        Real
       NAME              39-68        Character
       GRELEV            70-73        Integer
       POPCLS            74-74        Character
       POPSIZ            75-79        Integer
       TOPO              80-81        Character
       STVEG             82-83        Character
       STLOC             84-85        Character
       OCNDIS            86-87        Integer
       AIRSTN            88-88        Character
       TOWNDIS           89-90        Integer
       GRVEG             91-106       Character
       POPCSS            107-107      Character

       Variable Definitions:

       ID: 11 digit identifier, digits 1-3=Country Code, digits 4-8 represent
           the WMO id if the station is a WMO station.  It is a WMO station if
           digits 9-11="000".

       LATITUDE: latitude of station in decimal degrees

       LONGITUDE: longitude of station in decimal degrees

       STNELEV: is the station elevation in meters. -999.0 = missing.

       NAME: station name

       GRELEV: station elevation in meters estimated from gridded digital
               terrain data

       POPCLS: population class
               (U=Urban (>50,000 persons);
               (S=Suburban (>=10,000 and <= 50,000 persons);
               (R=Rural (<10,000 persons)
               For cities > 100,000 persons, population data were provided by
               the United Nations Demographic Yearbook. For smaller cities and
               towns several atlases were used to determine population.

       POPSIZ: the population of the city or town the station is located in
               (expressed in thousands of persons).

       TOPO: type of topography in the environment surrounding the station,
             (Flat-FL,Hilly-HI,Mountain Top-MT,Mountainous Valley-MV).

       STVEG: type of vegetation in environment of station if station is Rural
              and when it is indicated on the Operational Navigation Chart
              (Desert-DE,Forested-FO,Ice-IC,Marsh-MA).

       STLOC: indicates whether station is near lake or ocean (<= 30 km of
              ocean-CO, adjacent to a lake at least 25 square km-LA).

       OCNDIS: distance to nearest ocean/lake from station (km).

       AIRSTN: airport station indicator (A=station at an airport).

       TOWNDIS: distance from airport to center of associated city or town (km).

       GRVEG: vegetation type at nearest 0.5 deg x 0.5 deg gridded data point of
              vegetation dataset (44 total classifications).

[ 44 types skipped -EMS]

      POPCSS: population class as determined by Satellite night lights
               (C=Urban, B=Suburban, A=Rural)

I tossed together a schema file of roughly the same shape, but added some fields for things like ascension (which dated copy of the data set this is) and type (QCU vs. QCA):

root@odroidxu4:/SG2/ext/chiefio/SQL/Table.Schemas#  cat inventory.sql 
CREATE TABLE inventory (
    version    CHAR(5) NOT NULL,
    ascension  CHAR(10),
    type       CHAR(4),
    region     CHAR(1) NOT NULL,
    country    CHAR(3) NOT NULL,
    wmo        CHAR(5) NOT NULL,
    wno_near   CHAR(3) NOT NULL,
    latitude   CHAR(8),
    longitude  CHAR(9),
    stn_elev   CHAR(6),
    name       CHAR(30) NOT NULL,
    grid_elev  CHAR(4),
    pop_class  CHAR(1),
    pop_size   CHAR(5),
    topo_type  CHAR(2),
    veg        CHAR(2),
    prox_water CHAR(2),
    dist_water CHAR(2),
    airport    CHAR(1),
    dist_A_to_U CHAR(2),
    veg_grid    CHAR(16),
    pop_class_nitelites CHAR(1)
    ) ;

I tried to make a few of the names more understandable, and also used slightly different names in the FORTRAN that inserts TABs into the data so it will load. Yeah, now I’ve got 3 names for the same thing for some fields, bad Mikey… Cleanup on Aisle Later… I’ve not put keys / index fields in place yet. I have a good idea what I want to do, but for this new table that’s still for later.
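
When I do get to keys, it will most likely be something simple on the station identity bits plus the fields I expect to query on a lot. Just to sketch the idea (index names and column choices are placeholders, not run yet):

-- Rough idea only; index names and column choices are not final
ALTER TABLE inventory
    ADD INDEX idx_station (region, country, wmo, wno_near),
    ADD INDEX idx_airport (airport);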

Here’s the FORTRAN that reads a line of their text file, adds tabs, and spits it back out. Note that I’ve left in a testing WRITE statement that just echoes the line without tabs. Useful when doing the first debugging and you have off-by-one format problems… but it is commented out with a leading “C” in the first column:

root@odroidxu4:/SG2/ext/chiefio/SQL/v3# cat inven.f
C FORTRAN to read the inventory files v3 GHCN file and insert Tabs
C in the output.  Also divides "country" into Continent and Country
C
C Variable declarations...
C
      CHARACTER * 1  TAB
      CHARACTER * 5  VERSION
      CHARACTER * 10 ASCEN
      CHARACTER * 4  TYPE, GRIDELEV
      CHARACTER * 1  CONT,POPC,AIR,POPNL
      CHARACTER * 2  COUNTRY,TOPOT,VEG,PROXW,DISTW,DISTAU
      CHARACTER * 5  WMO,PSIZE
      CHARACTER * 3  NEAR
      CHARACTER * 6  STNELEV
      CHARACTER * 8  LATITUDE
      CHARACTER * 9  LONGITUDE
      CHARACTER * 16 VEGGRID
      CHARACTER * 30 NAME

C
C Set the TAB character
      TAB=CHAR(09)
C
C Set some constants
      VERSION="Gv3"
      ASCEN="7Sept2015"
      TYPE="QCU"
C
C Read in one line of data...
C
    9 OPEN(1, FILE='inventory.in', STATUS='OLD', ACTION='READ')
   10 READ (1, 11, END=99) CONT, COUNTRY, WMO, NEAR,LATITUDE,           &
     &LONGITUDE,STNELEV,NAME,GRIDELEV,POPC,PSIZE,TOPOT,VEG,PROXW,       &
     &DISTW,AIR,DISTAU,VEGGRID,POPNL
C
   11 FORMAT (A1,A2,A5,A3,X,A8,X,A9,X,A6,X,A30,X,A4,A1,A5,A2,A2,A2,     &
     &A2,A1,A2,A16,A1)
C
C Convert CHAR  to Float 

C         READ (T(I),*,END=20) F(I)
C
C Write out one line of data with TAB between fields
C
C      WRITE (6, 6) VERSION,ASCEN,TYPE,CONT,COUNTRY,WMO,NEAR,LATITUDE,   &
C     &LONGITUDE,STNELEV,NAME,GRIDELEV,POPC,PSIZE,TOPOT,VEG,PROXW,DISTW, &
C     &AIR,DISTAU,VEGGRID,POPNL

C    6 FORMAT (A5,A10,A4,A1,A2,A5,A3,A8,A9,A6,A30,A4,A1,A5,A2,A2,A2,     &
C     &A2,A1,A2,A16,A1)

      WRITE (6, 7) VERSION,TAB,ASCEN,TAB,TYPE,TAB,CONT,TAB,CONT,COUNTRY,&
     &TAB,WMO,TAB,NEAR,TAB,LATITUDE,TAB,LONGITUDE,TAB,STNELEV,TAB,NAME, &
     &TAB,GRIDELEV,TAB,POPC,TAB,PSIZE,TAB,TOPOT,TAB,VEG,TAB,PROXW,TAB,  &
     &DISTW,TAB,AIR,TAB,DISTAU,TAB,VEGGRID,TAB,POPNL

    7 FORMAT (A5,A1,A10,A1,A4,A1,A1,A1,A1,A2,A1,A5,A1,A3,A1,A8,A1,A9,A1,&
     &A6,A1,A30,A1,A4,A1,A1,A1,A5,A1,A2,A1,A2,A1,A2,A1,                 &
     &A2,A1,A1,A1,A2,A1,A16,A1,A1)

C Retrieve another line of data...
C
      GO TO 10
C
C If end of file, then stop.
C
   99 STOP
      END

Yeah, almost exactly like the others, just with different READ / WRITE lists and FORMAT statements… it was made by taking a former one and changing the READ / WRITE lists and FORMAT statements to match this data file…

Here’s what it looks like after a “SOURCE inventory.sql” command:

mysql> describe inventory;
+---------------------+----------+------+-----+---------+-------+
| Field               | Type     | Null | Key | Default | Extra |
+---------------------+----------+------+-----+---------+-------+
| version             | char(5)  | NO   |     | NULL    |       |
| ascension           | char(10) | YES  |     | NULL    |       |
| type                | char(4)  | YES  |     | NULL    |       |
| region              | char(1)  | NO   |     | NULL    |       |
| country             | char(3)  | NO   |     | NULL    |       |
| wmo                 | char(5)  | NO   |     | NULL    |       |
| wno_near            | char(3)  | NO   |     | NULL    |       |
| latitude            | char(8)  | YES  |     | NULL    |       |
| longitude           | char(9)  | YES  |     | NULL    |       |
| stn_elev            | char(6)  | YES  |     | NULL    |       |
| name                | char(30) | NO   |     | NULL    |       |
| grid_elev           | char(4)  | YES  |     | NULL    |       |
| pop_class           | char(1)  | YES  |     | NULL    |       |
| pop_size            | char(5)  | YES  |     | NULL    |       |
| topo_type           | char(2)  | YES  |     | NULL    |       |
| veg                 | char(2)  | YES  |     | NULL    |       |
| prox_water          | char(2)  | YES  |     | NULL    |       |
| dist_water          | char(2)  | YES  |     | NULL    |       |
| airport             | char(1)  | YES  |     | NULL    |       |
| dist_A_to_U         | char(2)  | YES  |     | NULL    |       |
| veg_grid            | char(16) | YES  |     | NULL    |       |
| pop_class_nitelites | char(1)  | YES  |     | NULL    |       |
+---------------------+----------+------+-----+---------+-------+
22 rows in set (0.01 sec)

mysql> 
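
The data themselves go in with the usual TAB delimited LOAD DATA incantation. I’m not reproducing my exact command line here, but it is roughly of this shape (with ‘inventory.tab’ standing in for wherever you redirected the a.out output, and assuming LOCAL infile is enabled):

-- General shape of the load; the file name is just a stand-in
LOAD DATA LOCAL INFILE 'inventory.tab'
INTO TABLE inventory
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';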

The table has 7280 rows in it:

mysql> SELECT COUNT(wmo) FROM inventory;
+------------+
| COUNT(wmo) |
+------------+
|       7280 |
+------------+
1 row in set (0.05 sec)

Of which 2379 currently have AIRPORT designation:

mysql> SELECT COUNT(wmo) FROM inventory WHERE airport='A';
+------------+
| COUNT(wmo) |
+------------+
|       2379 |
+------------+
1 row in set (0.02 sec)

As you can see, even with this preliminary format there are things you can report from the data ;-)

I do want to get the numeric data types cast into their proper forms, then put on a key / index set. Also, I just did a “diff” in Linux Land on the QCA and QCU inventory file versions and they were the same, so I’m going to do that same test on the rest of the V3 inventory copies. IFF all of them match, and they match V2 (as is claimed), then there’s no utility in loading the rest of them. I know the v1 inventory file is different, so it will need some different fields. It might make sense to just toss it into a dedicated table (and lose the ascension and type fields from this one) and then do the compare and contrast between the two tables. Like I said, this is preliminary…
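
For the cast step, the likely shape of it is just some ALTER TABLE MODIFY statements, roughly like this (NOT run yet; the precisions are a first guess, and the blank / -999 style “missing” markers will need some thought before getting strict about types):

-- Sketch of the someday cast to numeric types; precisions are a guess
ALTER TABLE inventory
    MODIFY latitude    DECIMAL(8,4),
    MODIFY longitude   DECIMAL(9,4),
    MODIFY stn_elev    DECIMAL(7,1),
    MODIFY grid_elev   SMALLINT,
    MODIFY pop_size    INT,
    MODIFY dist_water  SMALLINT,
    MODIFY dist_A_to_U SMALLINT;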

Well, I think that covers it for tonight. A basic “load table” done.

Oh, FWIW, to compile the FORTRAN on Linux, of the Debian / Devuan sort, you just install the FORTRAN compiler with “apt-get install gfortran” and then do a “gfortran inven.f” to make the a.out executable file. Pretty darned quick and easy really. (Of course you can rename a.out to anything, so like “inventory_tab_monster” or whatever…)

In Conclusion

At this point I’ve got a workable if crude layout for all the basic data tables and a demonstration load from one instance of the V3 data. Next steps are just a LOT of polish to make the database layout more efficient / faster / cleaner; and then load more data instances from V2 and V3, and then take on that integration load of the significantly different v1. (Then think about USHCN & Hadley, maybe…)

Along the way I’ll be working up some interesting “compare and contrast reports” and such. Also I need to learn how to make MySQL calls from inside programs (like C, Python, whatever…) so I can use those tools on this data if needed.

The point being that there’s a long tune, test, extend, and use path stretching to the horizon now. The really nice thing is that now all sorts of “ad hoc queries” become very easy to do! So now I need to think about what I want to kick around ;-) Things like “how many thermometers per country?” and “What is the highest elevation thermometer in each country” become relatively easy to answer. Any ideas welcome ;-)
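
Just to show the flavor, the “thermometers per country” one is about as simple as it gets (the elevation one will have to wait for the numeric cast, since MAX() on a CHAR field sorts as text, not as a number):

-- Stations per country, most instrumented countries first
SELECT country, COUNT(*) AS stations
FROM   inventory
GROUP  BY country
ORDER  BY stations DESC;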


36 Responses to GHCN in MySQL – Preliminary Inventory Load

  1. rms says:

    Following up on my previous suggestion, the query on Stackoverflow https://stackoverflow.com/questions/37730243/importing-data-from-a-mysql-database-into-a-pandas-data-frame-including-column-n/37730334 gives good guidance (based on what I know) for reading *all* the data from the database into a Pandas dataframe. My hunch is your computer is good enough to do this as Pandas is designed for that.

    Reading the data into a home-made programme using SQL gets complicated really fast. Been there, done that. A little bit of learning of Pandas, and then growing skills from there by using it, has paid big dividends. Again, been there, done that.

    R of course probably has the same capability of grabbing everything into its environment, but I couldn’t get my head around the “language” and trying to do so conflicted with so much of what I already had going in my head with Python. Willis likes R and so gotta respect that.

    I like “simple enough” with power.

  2. A C Osborn says:

    It would be interesting to compare local pairs, triplets etc of Site data to look at differences, with regard to spatial kriging etc over 1200km.
    Should show it up as totally non-scientific.

  3. Gary says:

    One thing to check is data integrity. Anthony Watts’ Surface Stations project found anomalies in the raw data from instrumental error.

  4. E.M.Smith says:

    @RMS:

    I’m going to look at Pandas… but things on my “look into it” list don’t end up in “I’m likely to do this” comments. When I don’t know about the details of something, I don’t make assertions about it. Yet that absence of comments does not take them off my “look into it” list…

    I could likely have avoided the issue by phrasing it as “need to pull the data into some procedural or object oriented or functional or {list of programming styles / languages} language system for detailed processing” but that’s a bit obscure… and jargon rich.

    As I mentioned before, my only real complaint about Pandas is that it is tied to Python and I don’t like the lack of {} or BEGIN / END marking. Position dependent syntax is fundamentally screwed as soon as you “cut / paste” either into another bit of code or into a comment / posting (been there, done that…) so I find the language fundamentally broken by design. I’ll use it if I must, but it is never going to be a first, second, or third choice. (I went through that experience about 2009 with the porting of GIStemp. STEP2 has a Python section and it was painful to post comments about bits of code.)

    Per R: I’ve read the “How to use R” tutorials. It looks reasonable. It is a bit, erm, “different”; in that it is intended for use by non-programmers. So inside the problem domain it addresses and inside the thinking style it expects, it is quite good. Step out of that and it’s an issue. As my programming thinking is usually “stepped out of that”, well, R is a bit of an issue for me, too. I do want to get good at it some day; just don’t know if this is the day and the dataset…. I posted a video in a comment here:

    https://chiefio.wordpress.com/2019/01/23/procedural-languages-mysql-massively-parallel-and-climate-modeling/#comment-106760

    That looks at 4 programming languages and styles in 40 minutes. (Well worth the time, BTW.) It gives a great POV on what it is like to think in a different style (the good and the bad). I wish she had used something more approachable for the “Procedural Language” segment (she uses a simplified assembler – not really representative but interesting). Still, as an exercise, it demonstrates what folks go through when they shift languages or programming paradigms… I especially found “son of LISP” Racket an interesting experience. Didn’t know LISP lived on in a descendant… Kind of liked it ;-) as long as you like long lists of close parentheses ;-) ))))))))

    For me, I’ve decided to not ever care about what language is “trendy” ever again. I’ll just use whatever tool works best in the time I’ve got. That does bias a little toward using languages and tools I already know / have installed / am currently active in; but not a whole lot. Clearly: I’m “loading up MySQL” at the moment as a new tool for me. So IF I run into some problem case where it is faster / easier / better to absorb some new tool and language I’ll do so. BUT IFF it is faster / easier / better to just use what I’ve already got I’ll do that instead. (This “rule” also biases FOR Python over things like Ruby & Rust as I’ve already used Python… just not fond of it.. but “fondness” is not in the rule…)

    Thus the FORTRAN for doing the TAB buffering. Not only is it a language I’ve been using lately, so it is fully in my brain ATM, but the FORMAT statement is just gold for formatting things. I’ve done the same thing in C and it is just a kludge. First load some libraries, then read the input into a string, then parse the string into bits and pack your variables. Really? I have to do the string parsing myself? Yes, it is in functions that are already written in the libraries, but really, why not just use a FORMAT statement… FORTRAN is much much better at fixed format data entry, writing, and handling; so I use it. It was designed for that. These files were written in that style by FORTRAN, so just “use it and move on”.

    Why? Because what happens when you are driven by “trendy” is you get nutty things like:
    https://www.quora.com/What-is-Object-Oriented-COBOL

    Yeah, “Object Oriented COBOL”… that’s the ticket! /sarc;

    Just So OMG.

    I’ve written COBOL (once…) in a very short class in it. Way too wordy, but great for taking in large amounts of sequential data, diddling some fields, and spitting it back out again. What a lot of business programming requires. Good for “word not math” oriented people to program with. Now you want to layer on all the stuff of OO into that? Just walk away… no, RUN!

    I’ve unfortunately had the experience (a few times) of learning some hot new language / paradigm, and then a decade later folks are saying “What was that? Oh, nobody does that anymore.”. (I’ve written reports in RPG II, for example. Done ADA and PL1 programming. Used Perl. etc.) Now folks have the hots for other languages. OK, I’ll just wait a few years and see if they last… FWIW, it looks like OO is reaching End Of Trendy Life. Several videos by Big Names pointing out it doesn’t add anything over Procedural or Functional for many / most classes of problem… a thing I figured out when it was first being shoved at me in the early ’80s. I’ve managed product creation in OO languages (to final ship and use) so it isn’t like I’m a bigot against it… just don’t believe it really gains you anything much. Now that OO_COBOL is here, I’m pretty sure that’s the end of the line for “trendy”. ;-)

    So short form: Yes, I’m going to look at using Pandas. No, I have no knowledge base yet from which to construct comment about it, so I don’t comment on it. YMMV so feel free to comment on it.

    Sidebar on old languages:

    From that video I learned that LISP is the oldest higher level language and FORTRAN only the 2nd oldest. Now I’m a bit bummed that I’m not writing in the oldest language… I like old languages of all sorts ;-) Maybe I need to refresh my understanding of LISP and use it a bit ;-) It looks like there are several choices:

    https://wiki.debian.org/CommonLisp

    CommonLisp

    The language

    Common Lisp (CL) is a general purpose, multi-paradigm programming language, member of the Lisp family. It supports a combination of procedural, functional and object-oriented programming paradigms. It has a dynamic typing system, but also supports optional type annotations for performance. It is extensible through the use of Lisp macros and reader macros. It has been standardized via an ANSI document.

    Various resources about the language can be found in the external links below.

    Implementations

    Since Common Lisp is a standard, it can be implemented in different ways.

    The following Common Lisp implementations are currently available in Debian:

    sbcl – Steel Bank Common Lisp

    ecl – Embeddable Common Lisp

    clisp – a GNU Common Lisp implementation

    cmucl – Carnegie Mellon University Common Lisp

    gcl – another GNU Common Lisp implementation (not fully ANSI-compliant)

    Some LISPs have automatic close parenthesis generation (so you can avoid tediously counting out those 2 dozen close parentheses at the end…)

    I really liked the power of Racket… but why write a new modern instantiation when you can dust off Grandpa In The Corner and write in the Old Greek Of Computing? ;-)

  5. E.M.Smith says:

    @AC Osborn:
    I can likely find that via a compare of v1 to v2 to v3 to v4 for the same station, and for sets of ‘nearby’ stations with a set comparing the “unadjusted” to the “adjusted”. Just a small matter of programming ;-)

    @Gary:

    A compare of unadjusted to adjusted for a given station / version ought to show some of that. Then there is just “sanity checking”.

    I’ve pondered doing a statistical compare:

    For each instrument, for each month/year: Compute statistics (MAX, MIN, AVE, MEAN, MODEs etc) then compare those over time. That would require using the daily data (a future project… as the data are huge) but would be interesting. IF, for example, the MAX is not rising but the MIN and AVERAGE are, and it ALL comes out of removing ‘outliers’ from the unadjusted to make the adjusted, is that VALID? Hmmm? I know the “volatility” in the data is being suppressed (I’ve read their QA code and it is fundamentally biased that way) and I know that the “adjusted” data shows more rise of the lows and basically flat highs in the recent data. (From prior explorations with other tools). But is there a statistical way to show that it’s all from the adjustments? OR that some is from locating so many thermometers at airports that grow over time, add acres of tarmac, and hundreds of jets spewing hot exhaust along with snow removal equipment?

    I suspect comparing Airstation data to Rural and “Airstation over time” statistics (as the airport comes into existence in the early 1900s and grows into the jet age) might be enlightening.

    Not sure the same approach would work with Monthly Average data.

  6. Steven Fraser says:

    @EM: I believe that playing with different ways to survey the data, slice-and-dice, ad-hoc-report… (Yul Brynner moment) “etc., etc., etc.” will help to create some understanding of the character of the set, and the particulars.

    For example, for any given station you can calculate your own anomalies, over whatever baseline periods you want, from the raw (unadjusted) values, and compare with those with the NOAA/GISS/BEST/Whomever anomalies you care to include.

    Or, you could apply your own geographic /temp filters, and model the progression of stormfronts as they move across the country.

    One of my interests would be to ignore all airport and ‘City’ sites entirely, and see if any trends are detectable, and then select for only Cities and Airports to do the same. I bet they would be very interesting plotted against one another.

    And, from a data quality perspective, you could show the number of bad/missing values, and their distribution through time and geography.

    Of course, you’d be able to profile high, low and average (normal?) temps from the readings yourself, if you cared to. However, I do not see an element for ‘observation time-of-day’. Is there metadata that indicates, for each station, for various calendar years (or particular days) when the obs took place?

    Oh, and you could use the actual temps for analysis and reporting.

    Sounds like a fun thing you are up to.

  7. Larry Ledwick says:

    You could also filter and plot for the stations that WUWT found actually met proper site standards in their site survey.

  8. jim2 says:

    I’ve played around with R a bit. Lots of libraries, but as noted, R-cane.

  9. E.M.Smith says:

    @Steven Fraser:

    As far as I know, TOBS data are not available to the public (and maybe only estimated to the insiders). As I recall the TOBS adjustment discussions, an ASSUMED TOD is applied up to one point in time, then a different ASSUMED value is used. So TOBS Adjustment may well be applied to observations that didn’t need it. The earliest data had no standard TOD, the middle had a recommendation in some geographies, and only the latest has any decent TOD data. I have never seen a TOD in the data, though, so it isn’t commonly available (though the recommended TOD is published).

    BTW , part of why I’m publishing the code is so anyone can join in the fun ;-)

  10. cdquarles says:

    @EM, well, from a statistical point of view, averages alone, without ranges and/or other measures of dispersion, simply can’t provide you with information stripped out by averaging.

    Hmm, maybe a point about Time of observation bias … in that while weather does move in a rather regular fashion, locality effects *do* matter, especially for low resolution data. And dang it, these should *not* be thrown out when that is what happened! Say a stronger than ‘average’ system moves in, yet it is also moving relatively slowly. Depending on your location, you might see a different temperature or humidity curve that depends on the wind vector and the amount of time that wind vector spends in a particular quadrant (or even smaller section of the compass rose). I have seen strong southerly fetches make temperatures rise more or less steadily over as much as 48 to 60 hours. I have seen northerly fetches do the opposite. Diurnal variation depends on wind and clouds, too. Diurnal variation can range from none to more than 40 degrees F where I live; even though most of the time they’ll be between 10 and 30 degrees and most commonly between 15 and 25 degrees F.

  11. jim2 says:

    TOBS, GHCN:

    GHCN Daily

    TOBS
    TMAX
    TMIN
    Source: National Climatic Data Center

    http://berkeleyearth.org/source-files/

    This directory: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/superghcnd/

    “The following information serves as a definition of each field in one line of data covering one station-day.
    Each field described below is separated by a comma ( , ) and follows the order below:

    ID = 11 character station identification code
    YEAR/MONTH/DAY = 8 character date in YYYYMMDD format (e.g. 19860529 = May 29, 1986)
    ELEMENT = 4 character indicator of element type
    DATA VALUE = 5 character data value for ELEMENT
    M-FLAG = 1 character Measurement Flag
    Q-FLAG = 1 character Quality Flag
    S-FLAG = 1 character Source Flag
    OBS-TIME = 4-character time of observation in hour-minute format (i.e. 0700 =7:00 am)

    See section III of the GHCN-Daily readme.txt file (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt)
    for an explanation of ELEMENT codes and their units as well as the M-FLAG, Q-FLAGS and S-FLAGS.

    The OBS-TIME field is populated with the observation times contained in NOAA/NCEI’s HOMR station history database. ”

    Or were you discussing a different TOBS?

  12. jim2 says:

    Nick Stokes has a new article on WUWT. It is relevant to your efforts here.

    [https://wattsupwiththat.com/2019/01/25/nyquist-sampling-anomalies-and-all-that/]

  13. E.M.Smith says:

    @Jim2:

    Nice that TOBS is available somewhere. Looks like it is a GO FISH in the daily data. I would want confirmation it is real TOBS and not an implied fabricated TOBS based on standards, but that could easily be seen in the data. Thanks for the pointer!

  14. jim2 says:

    EMS: There are yearly files containing TOBS also. Over a gig each. I almost crashed my desktop trying to open one of them with a regular text editor. VI opened it OK, and I can use head or tail to sample the data from the terminal. Also, there is a reference to the HOMR station database. That web site is shut down at the moment.

  15. Graeme No.3 says:

    Most of this goes right over my head but at least….

    Millennials try to work out an old phone
    https://www.bitchute.com/video/2GNUJY0h3elT/

    Problem solving under pressure.
    Pointman

  17. richard verney says:

    @EM

    Have you reviewed Tony Heller’s site (Steven Goddard) and his source code?

    He analyses the USGHCN network in great detail, and has published his source code that he uses for this purpose. See for example:
    https://realclimatescience.com/unhiding-the-decline-for-windows/

    You will note that there are a lot of useful links.

    It would be interesting to see whether your different coding gets the same sort of results as Tony Heller publishes.

  18. richard verney says:

    Regarding TOBS

    Tony Heller has also looked into TOBS. I link to a summary page:
    https://realclimatescience.com/?s=TOBS

    For example, one sees very similar trends no matter what time the measurements are recorded (the comparison charts are at that link).

    Certainly it would be very interesting to see another analysis of the impact of TOBS and whether there really needs to be an adjustment made because of changes to TOBS. I hope that at some future date you analyse this.

  19. A C Osborn says:

    The problem with TOBs, as I pointed out yesterday at WUWT is that they apply it whether it was “actually required” or not.
    So one station may have a TOB problem for one day and they apply it to the monthly data, another station may not have had a TOB problem but still gets it applied.

  20. E.M.Smith says:

    Seems to me that TOBS issues can be dodged by NOT making daily averages that are then averaged, but instead making an average of highs, then an average of lows, then a monthly average from those two. Since nobody has shown why one average is better than another, any are allowed.

  21. jim2 says:

    This site has a list of data sources, but no links, and papers discussing various issues of taking a pig’s ear to make a silk purse.

    http://www.surfacetemperatures.org/home

    Here’s a good example of a zit on the pig’s ear:

    https://docs.google.com/viewer?a=v&pid=sites&srcid=c3VyZmFjZXRlbXBlcmF0dXJlcy5vcmd8aG9tZXxneDo0ZjA1NWRkNGIxMTQ0Y2M5

  22. jim2 says:

    There is a link contained in this page with raw data:

    http://www.surfacetemperatures.org/databank

    Some of this data are actual images of temperature records – can’t get much rawer than that! Also contains other raw data in non-image form. Pretty interesting collection from around the world.

  23. E.M.Smith says:

    Interesting point in that link to the “Databank” created in 2014-15. Looks like the complaints about lack of provenance and history had some effect!

    I’ll need to snag some copies from there to compare to my very old sources (sets of v1 v2 etc. copied as far back as about 2008? or so…) just to make sure any changes are noted…. I’m not a fan of changeable history….

    Having a bit of an allergy day today, so not soon, though. Need to button up the house and turn on the air filters and…and…and… then catch up the must do’s and the honey do’s and…

  24. Steven Fraser says:

    @EM: You said, above: ‘ Since nobody has shown why one average is better than another, any are allowed.’

    ‘Betterness’ could be tested, I think, using both methods mentioned: For a year of months, calculate each month’s 1) average of daily hi/lo, then average the 28-31 values, and 2) average of daily highs, average of daily lows, and then average those 2 numbers to get the monthly number.

    Using those 2 monthly numbers, for each day of the month, calculate the hightemp and lowtemp anomalies.

    Repeat the process, but using meteorological quarters, rather than months, and then an entire year.

    I think the results would be revealing of a bias inherent in the grouping of average into the monthly, quarterly and yearly perspectives. Perhaps the most interesting would be how the year’s high average and the year’s low average could be related to the seasonal and monthly values.

    My expectation would be that somewhere in there would be seasonal patterns, perhaps of anomalies, that would indicate the variation in the gap between the shorter-term low and high temps when compared with the longer terms. IMO, the longer-term averaging (i.e., year’s worth of highs) does automatic smoothing out of the ‘weather noise’, and reveals the longer-term (phased or unphased) changes.

    I used to have a copy of the data. Maybe I should revisit this myself, and try some of what I have described.

  25. Larry Ledwick says:

    I still think averages are simply useless for the intended purpose. If you average high and low, you get a mythical number but it has no meaning. If the high is actually a 12:01 pm high as the temperature falls into the basement due to a cold front then the whole day could be considerably colder than implied by a simple average of high and low.

    A much better method in my estimation is something like heating and cooling degree days. Multiply degrees above or below the local ground water temperature and hours that that temperature was maintained. Then you get an actual area under the curve measurement of how far “and how long” the temperature deviated from the long term local average temperature (which is happily integrated for us in the local ground water temperature).

    Here in the west we can have 40-50 degree temperature excursions in less than an hour due to Chinook winds. Subzero all day except two hours at 28 degrees F does not give an average that has any value for measuring the earth’s warming or cooling – it literally has no meaning.

  26. jim2 says:

    Here is the HOMR station database. It’s still not open due to the shutdown.

    I’m interested to see what’s in it.

    https://www.ncdc.noaa.gov/homr/

  27. A C Osborn says:

    Larry Ledwick says: 26 January 2019 at 7:31 pm

    But we do not have that historical data, so there is no point in starting it now.
    We have perfectly adequate Max & Min, so that is what we should use and plot them on the same graph.
    There is much more data there than by graphing a false average.

  28. A C Osborn says:

    jim2 says: 26 January 2019 at 2:43 pm

    That “zit on a pig’s ear” was very enlightening; everything before 1989 was rubbish according to that.
    I suspect all the Thermometer manufacturers and all the people who dedicated their time to taking all those CRAP readings are really impressed by that analysis.

  29. E.M.Smith says:

    The fallacy is that temperature data products have any meaning at less than full degree precision.

    Original data in the early years were recorded in full degrees (of various sorts, even Reaumur, not just F, C and K) and explicit direction was given in the USA at least that you could just guess a value if needed… From this, the self styled masters of the universe think they can divine 0.1 to 0.001 C precision comparisons. It is crazy talk.

    My interest in the averages is not to endorse the concept, but to illustrate the variance just from changes of method… when it is like 0.4 C then what meaning is in 0.5 C other than the method chosen?

    I very much want to show MIN MAX directly, not least to show the known lack of higher highs and that all the “warming” is from fewer very low lows (IMHO enhanced by their “QA” process that tosses “outliers”, like every sudden cold plunge…).

    But as the AVE is shoved in all folks’ faces, it must be addressed too.

  30. H.R. says:

    @E.M. – If you plot Min/Max as separate lines on the same graph, and you had the time stamp for the reading to color code the point as Day or Night (dream on, H.R. 😞) one interesting thing to look at would be what Larry L. brought up; that the Min and the Max would often be inverted from day to night. I doubt that any particular pattern would be revealed so perhaps the randomness of it would be the interesting feature.

    If a pattern emerged, then you would get to use the current catchphrase, “E.M. was surprised to find…”and with any luck, “It’s worse than E.M. thought.” 😜 (Why are they always “surprised?” It is always worse than they thought. That’s just climate research boilerplate.)

  31. E.M.Smith says:

    @H.R.:

    As in: “It is much much worse than I thought; I had to code it in C++!”

    8-)

    Been looking at Go as a language for parallel programming. It claims to fix many C++ ills. Unfortunately, it has its own issues, like enforced formatting that forbids a curly brace {} on a line by itself. A practice I use often… Then in C tradition, IO is off in a library package somewhere…

    A pet peeve of mine. Most computer programs take in data, do something, then write it out. What silly idea says to leave 2/3 of the process out of the language? I know Algol started it and Algol is one of my favorite languages, but at U C it was implemented with FORTRAN like I/O features including a Write statement with a Format statement… which helped form my like…

    It is like designing a car, but shipping it without doors or wheels. Nice engine, but how do you get stuff in, have it not fall out, and then get you somewhere?

  32. Larry Ledwick says:

    Just for what it is worth:

  33. E.M.Smith says:

    As the S.H. is mostly water and very little land, and that land has very poor historical data, the whole S.H. is largely made up… I’ve sporadically wondered how much of “global warming” is really the known Polar Oscillation but where the S.H. data are effectively missing in the first half. So cold N.H. with hot S.H. data not in the average, then N.H. warms while S.H. suddenly shows up as cold, but defined as the “normal”…

  34. E.M.Smith says:

    I’ve spent the day looking at Python and Pandas. It looks like Pandas does what I want to do, but also includes a lot of the SQL functionality. Essentially it can suck in a whole table and then deal with it… The only question is does the data size blow some limit (and I’m pretty sure it won’t).

    This is a good news bad news situation for me. Good because I can see the path to completion for what I want to do. Bad because I still don’t like Python very much… then Pandas looks Object Oriented to me and I’m also not keen on the OO Way…

    So OK, I can get over it….

    The interpreted Python is not as annoying as editing pages of it, where screwing up alignment can change your program in an unexpected way… Yes, I’ve seen the editors that auto indent, but I have my own preference for formatting…

    Guess it is just a question of which is more annoying: Using some other harder product (and searching for it), or getting over my biases….

  35. Pingback: GHCN v3.3 vs v4 – Top Level Entry Point | Musings from the Chiefio

Comments are closed.