Procedural Languages, MySQL, Massively Parallel, and Climate Modeling

This posting will be a bit eclectic. Several threads of the day pulled together into one spot.

First off, most of the day was spent reading an SQL book, some of the MySQL Manual, and more. Then an hour went into “relaxing” with a video about Procedural Languages that turned out to be more interesting than I’d expected.

In the first bit of reading I found out that MySQL has a “Spatial” or Geographic data type, or rather a set of them. It also has a parallel server model. Ponder what that implies for a moment…

Then, in the second bit, not only was there a fun stroll down memory lane (at about the 9 minute mark, where Algol and all it begat is discussed), but at about the 50-something minute mark there’s an interesting graph about mutable data vs. shared data: only data that is both mutable and shared needs a control structure to coordinate access. Thus ‘multi-threaded’ and parallel processing is hard and needs some kind of coordination mechanism, but straight procedural code on immutable data or non-shared data does not…

Now glue those thoughts together…

What does a database system do? It provides concurrent access to a large collection of data for many independent machines and processes, along with a locking mechanism.

Now say you wanted to build a geographic model of the world. Wouldn’t it be nice to have a data structure that was designed to represent the surface of a globe, with a coordinate data type? One with parallel data access and locking built in? Wouldn’t it be nice to separate the procedures of finding and presenting and storing and locking data items from the code that does the computation? Then add a database that can be distributed over many servers for much faster operation and much higher bandwidth (presuming a switched network, as in the back end of a Beowulf cluster) and a cluster of small computers also on a switch fabric:

One could design a data layout of cells with properties. Each compute engine could just be handed a cell, check the “work needed” flag field, and if it shows new data has arrived (input of new solar heating, or new mass of air from a neighbor, or…) do whatever computations were needed, put the output data in the appropriate cell “inbox”, update the present cell, and move on.

So each cell becomes a database data object, with independent access. Each operation done on it becomes an independent operation done by any worker board that finds the work flag set, all running in parallel. Coordination and locking are provided by the combination of the database update locking facility and the “inbox” of data computed from some other cell.
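To make that concrete, here’s a minimal sketch of what such a cell table might look like in MySQL. All the table and column names here are made up for illustration; a real model would carry far more state per cell:

    -- Hypothetical cell table: one row per grid cell, InnoDB for row locking.
    CREATE TABLE cell (
        cell_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        lat          DECIMAL(6,3) NOT NULL,      -- degrees North
        lon          DECIMAL(6,3) NOT NULL,      -- degrees East
        air_mass     DOUBLE NOT NULL DEFAULT 0,  -- present state of the cell
        inbox_mass   DOUBLE NOT NULL DEFAULT 0,  -- mass arriving from neighbors
        work_needed  TINYINT(1) NOT NULL DEFAULT 0,
        locked_by    VARCHAR(32) NULL,           -- which worker holds the cell
        last_touched TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                     ON UPDATE CURRENT_TIMESTAMP,
        UNIQUE KEY uk_latlon (lat, lon),         -- one row per grid point
        KEY idx_work (work_needed, last_touched) -- fast "what needs doing" scan
    ) ENGINE=InnoDB;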

For example, the cell at 35 N x 56 E has a computation that shows 10% of the mass moves SW, so it puts 5% of the mass in the inbox of the cell directly SW, and 2.5% each in the inboxes of the cells directly S and W. Then it sets its own inbox to zero (having done the processing on it) and moves on. Then some worker board can “discover” that the cell directly SW has new mass entering and needs computing; so it sets the locked flag and computes things like pressure, mass flow out to other cells, wind speed, changes of enthalpy from altitude changes, whatever. It writes the appropriate stuff into the “inbox” of the appropriate cells, unsets the lock flag, and moves on.

Essentially, put the data storage, structure, locking, and management into a database engine, and then the very simplified “per cell” computation into a gaggle of independent computation engines that wander around looking for work to do. (I’d likely make a table in the DB where LRU (Least Recently Used) cells were at the top; when a cell was locked for processing it would be moved to the end of the list, so each worker could just pop a cell off the top of the list, lock it, and go check it out / work on it.)
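In MySQL terms a worker’s inner loop might go about like this. It’s only a sketch: SKIP LOCKED needs MySQL 8.0 (on older versions you’d fall back to the locked_by flag column), and the @-variables are placeholders the worker program would fill in from its own computation:

    START TRANSACTION;

    -- Pop the least-recently-touched cell with work pending; SKIP LOCKED
    -- lets other workers step around rows already being processed.
    SELECT cell_id, inbox_mass
      FROM cell
     WHERE work_needed = 1
     ORDER BY last_touched ASC
     LIMIT 1
     FOR UPDATE SKIP LOCKED;

    -- ...compute pressure, winds, outflows for that cell, then distribute,
    -- e.g. 5% of the mass to the SW neighbor (and likewise S and W):
    UPDATE cell
       SET inbox_mass = inbox_mass + @outflow_sw, work_needed = 1
     WHERE cell_id = @sw_neighbor_id;

    -- Finally fold the inbox into the cell state and clear the flag; the
    -- ON UPDATE clause bumps last_touched, moving it to the "list end".
    UPDATE cell
       SET air_mass   = air_mass + @processed_inbox,
           inbox_mass = 0,
           work_needed = 0
     WHERE cell_id = @this_cell_id;

    COMMIT;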

With the flow of control and data management all off-loaded from the “climate programming”, those bits of code ought to become much simpler. Furthermore, the only limits to how many independent processors you could have working on the system would be the number of cells (after one processor is dedicated to each cell, you don’t gain anything from another one being locked out…) and the switch fabric bandwidth in aggregate.

No, I don’t know if it would actually be faster. That depends on too much that is not known. But network speeds are now in the Gbit range, and with a switch fabric that gets multiplied by each switch channel (so a 24-port switch can have 12 x 1 Gbit simultaneous transfers happening). It ought to scale fairly linearly. At some point the limits of the parallel distributed database engine will start to bite, but I don’t know where that is. There are folks running truly gigantic database engines, though, so that limit is very, very large.

If nothing else, it would be a fun experiment in a different way to do parallel programming solutions.

So that’s the Bright Idea Of The Day. BIOTD? ;-)

The Video

Here’s the 1 hour video on Procedural Programming. It does a nice job of covering the history, and lets you see how central ALGOL was to the “modern” languages. Unfortunately it tosses some barbs at FORTRAN, despite it being a very nice, very workable language that does many things better than any other language; thus its continuing to be the Go-To-Language (pun intended) for Scientific Programming and much of Engineering (and climate models). He also shows a lot of other fun bits of code to demonstrate how “modern” concepts show up in other languages, from Simula to JavaScript. He even manages to toss in some (weak) humor on things like Europeans preferring to start arrays at 0 (so the “1st Floor” is up one flight of stairs from the ground) while Americans like to start arrays with 1, one being the value of the first positive integer…

It isn’t too technical for anyone with some computer background to follow (and those with zero background can marvel at the crap we programmers think about, and wonder why we create a whole new crop of languages every decade… {think “a Ph.D. in Comp. Sci. requires doing something ‘new’”…}). Yet it also has enough tech chops in it to be interesting to more experienced folks.

While much of it was a walk down ‘memory refresher lane’ for me, I learned a couple of new bits. I also found some justification for my feelings about O.O. Programming, as he points out that most of its properties are available in procedural languages if done ‘right’, and that the concepts date from that era. Just that nobody bothers to do them much ;-) It was also fun to see a couple of languages I haven’t used (Simula, JavaScript) and see how familiar they looked… Seems many of the “new” languages are mostly old languages with a facelift of sorts, or a new concept grafted on.

Some MySQL References

So here are a couple of links to bits about MySQL that started me pondering. I’ve pretty much soaked up most of the language now, and I have a good grasp on how they use keys and indexes; and on how they are mushy on the two terms and mix their meanings…

Along with that, I have a nicer appreciation for how they do JOIN; though my old 2007-or-so book claims MySQL doesn’t do an “OUTER JOIN”, only a “LEFT JOIN” and a “RIGHT JOIN” – and I was hoping it had been added since then. This manual page talks about LEFT and RIGHT as “outer joins”, but that is different from the “give me all records, even those with blanks in the join” FULL OUTER JOIN that Oracle and Microsoft have in addition to LEFT and RIGHT.

https://dev.mysql.com/doc/refman/8.0/en/outer-join-simplification.html
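For what it’s worth, the usual workaround seems to be faking the FULL OUTER JOIN with a UNION of a LEFT and a RIGHT join. A minimal sketch with made-up station / temps tables (the UNION de-duplicates the rows that matched on both sides):

    -- All stations and all readings, with NULL blanks where either side
    -- has no match:
    SELECT s.station_id, s.name, t.year, t.temp
      FROM station s
      LEFT JOIN temps t ON t.station_id = s.station_id
    UNION
    SELECT s.station_id, s.name, t.year, t.temp
      FROM station s
      RIGHT JOIN temps t ON t.station_id = s.station_id;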

Since the ability to create a VIEW does a very nice job of hiding the JOIN complexity, and via INDEX fields can be nicely efficient (as it pulls indexes into a hidden table for faster processing – and there’s an Optimizer that figures out the best access strategy), I’m now more willing to factor the GHCN and other temperature data into more fragmented tables (“normalize” the data more).

THE big pain of fully normalized DB design is the need to pull data from 10+ tiny tables to do anything, Every Single Time you want to knock out a report. The ability to create and store a VIEW means that can be done once and is then hidden from you when you create your report. Nice. So I’m going to revisit my “toy” Temperature DB layout with the idea in mind of both more, smaller tables and more “VIEW”s of them as one big flat table. Plus add more keys (things that make each row a unique entry and prevent duplicates) and indexes (tables of entries that let you go right to a specific datum from a known value) – even if MySQL treats those as synonyms in some contexts…
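Something along these lines, with made-up table names, just to show the shape of it:

    -- Normalize into small tables, then flatten once with a VIEW; reports
    -- after that just query flat_temps as if it were one big table.
    CREATE VIEW flat_temps AS
    SELECT st.station_id,
           st.name,
           co.country_name,
           t.year,
           t.month,
           t.temp
      FROM station st
      JOIN country co USING (country_id)
      JOIN temps   t  USING (station_id);

    SELECT * FROM flat_temps WHERE year = 1998;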

Given that even the existing load of data, largely treated as one large non-keyed, non-indexed flat file, is “plenty fast”, everything else will be faster. So, OK, I’ll tune it up some more and use more of the SQL language features.

Parallel Servers

MySQL has a bunch of different “Storage Engines” you may choose from. These give you methods of storing data optimized for different uses. Mostly they recommend the InnoDB engine (which assures rollbacks and crash safety and such) but other engines are also included.

https://dev.mysql.com/doc/refman/8.0/en/storage-engines.html

Chapter 16 Alternative Storage Engines

Table of Contents

16.1 Setting the Storage Engine
16.2 The MyISAM Storage Engine
16.3 The MEMORY Storage Engine
16.4 The CSV Storage Engine
16.5 The ARCHIVE Storage Engine
16.6 The BLACKHOLE Storage Engine
16.7 The MERGE Storage Engine
16.8 The FEDERATED Storage Engine
16.9 The EXAMPLE Storage Engine
16.10 Other Storage Engines
16.11 Overview of MySQL Storage Engine Architecture

Storage engines are MySQL components that handle the SQL operations for different table types. InnoDB is the default and most general-purpose storage engine, and Oracle recommends using it for tables except for specialized use cases. (The CREATE TABLE statement in MySQL 8.0 creates InnoDB tables by default.)

MySQL Server uses a pluggable storage engine architecture that enables storage engines to be loaded into and unloaded from a running MySQL server.

To determine which storage engines your server supports, use the SHOW ENGINES statement. The value in the Support column indicates whether an engine can be used. A value of YES, NO, or DEFAULT indicates that an engine is available, not available, or available and currently set as the default storage engine.
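For example (the throwaway load table here is just my own illustration):

    -- See what this particular build supports:
    SHOW ENGINES;

    -- And pick an engine per table; InnoDB is the 8.0 default anyway.
    CREATE TABLE scratch_import (
        raw_line TEXT
    ) ENGINE=MyISAM;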

Further down, the manual has another two chapters covering two cluster-oriented bits: the NDB Cluster, which holds everything in memory (so with huge machines with oodles of memory you can have lots of GoFast), and the (more interesting to me) InnoDB Cluster option.

https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster-userguide.html

Chapter 21 InnoDB Cluster

Table of Contents

21.1 Introducing InnoDB Cluster
21.2 Creating an InnoDB Cluster
21.3 Using MySQL Router with InnoDB Cluster
21.4 Working with InnoDB Cluster
21.5 Known Limitations

This chapter covers MySQL InnoDB cluster, which combines MySQL technologies to enable you to create highly available clusters of MySQL server instances.

21.1 Introducing InnoDB Cluster

MySQL InnoDB cluster provides a complete high availability solution for MySQL. MySQL Shell includes AdminAPI which enables you to easily configure and administer a group of at least three MySQL server instances to function as an InnoDB cluster. Each MySQL server instance runs MySQL Group Replication, which provides the mechanism to replicate data within InnoDB clusters, with built-in failover. AdminAPI removes the need to work directly with Group Replication in InnoDB clusters, but for more information see Chapter 18, Group Replication which explains the details.

MySQL Router can automatically configure itself based on the cluster you deploy, connecting client applications transparently to the server instances. In the event of an unexpected failure of a server instance the cluster reconfigures automatically.

In the default single-primary mode, an InnoDB cluster has a single read-write server instance – the primary. Multiple secondary server instances are replicas of the primary. If the primary fails, a secondary is automatically promoted to the role of primary. MySQL Router detects this and forwards client applications to the new primary. Advanced users can also configure a cluster to have multiple-primaries.

Important

InnoDB cluster does not provide support for MySQL NDB Cluster. NDB Cluster depends on the NDB storage engine as well as a number of programs specific to NDB Cluster which are not furnished with MySQL Server 8.0; NDB is available only as part of the MySQL NDB Cluster distribution. In addition, the MySQL server binary (mysqld) that is supplied with MySQL Server 8.0 cannot be used with NDB Cluster. For more information about MySQL NDB Cluster, see Chapter 22, MySQL NDB Cluster 8.0. Section 22.1.6, “MySQL Server Using InnoDB Compared with NDB Cluster”, provides information about the differences between the InnoDB and NDB storage engines.

[The manual shows a diagram here giving an overview of how these technologies work together.]

Now, if I’ve got that right, with multiple primaries you can have multiple read/write engines at one time, and they replicate between themselves. At least, I hope that is what it is saying ;-)

Note that this is in MySQL 8.0, and I know it is not in MySQL 5.7. I also don’t know what version is presently available in Debian / Devuan, so it might be something that is “buy it only”…

Geographic Data Types

https://dev.mysql.com/doc/refman/8.0/en/data-types.html

Chapter 11 Data Types

Table of Contents

11.1 Data Type Overview
11.2 Numeric Types
11.3 Date and Time Types
11.4 String Types
11.5 Spatial Data Types
11.6 The JSON Data Type
11.7 Data Type Default Values
11.8 Data Type Storage Requirements
11.9 Choosing the Right Type for a Column
11.10 Using Data Types from Other Database Engines

MySQL supports a number of SQL data types in several categories: numeric types, date and time types, string (character and byte) types, spatial types, and the JSON data type. This chapter provides an overview of these data types, a more detailed description of the properties of the types in each category, and a summary of the data type storage requirements. The initial overview is intentionally brief. The more detailed descriptions later in the chapter should be consulted for additional information about particular data types, such as the permissible formats in which you can specify values.

https://dev.mysql.com/doc/refman/8.0/en/spatial-types.html

11.5 Spatial Data Types

11.5.1 Spatial Data Types
11.5.2 The OpenGIS Geometry Model
11.5.3 Supported Spatial Data Formats
11.5.4 Geometry Well-Formedness and Validity
11.5.5 Spatial Reference System Support
11.5.6 Creating Spatial Columns
11.5.7 Populating Spatial Columns
11.5.8 Fetching Spatial Data
11.5.9 Optimizing Spatial Analysis
11.5.10 Creating Spatial Indexes
11.5.11 Using Spatial Indexes

The Open Geospatial Consortium (OGC) is an international consortium of more than 250 companies, agencies, and universities participating in the development of publicly available conceptual solutions that can be useful with all kinds of applications that manage spatial data.

The Open Geospatial Consortium publishes the OpenGIS® Implementation Standard for Geographic information – Simple feature access – Part 2: SQL option, a document that proposes several conceptual ways for extending an SQL RDBMS to support spatial data. This specification is available from the OGC website at [http://www.opengeospatial.org/standards/sfs].

Following the OGC specification, MySQL implements spatial extensions as a subset of the SQL with Geometry Types environment. This term refers to an SQL environment that has been extended with a set of geometry types. A geometry-valued SQL column is implemented as a column that has a geometry type. The specification describes a set of SQL geometry types, as well as functions on those types to create and analyze geometry values.

MySQL spatial extensions enable the generation, storage, and analysis of geographic features:

Data types for representing spatial values

Functions for manipulating spatial values

Spatial indexing for improved access times to spatial columns

The spatial data types and functions are available for MyISAM, InnoDB, NDB, and ARCHIVE tables. For indexing spatial columns, MyISAM and InnoDB support both SPATIAL and non-SPATIAL indexes. The other storage engines support non-SPATIAL indexes, as described in Section 13.1.15, “CREATE INDEX Syntax”.

A geographic feature is anything in the world that has a location. A feature can be:

An entity. For example, a mountain, a pond, a city.

A space. For example, town district, the tropics.

A definable location. For example, a crossroad, as a particular place where two streets intersect.

Some documents use the term geospatial feature to refer to geographic features.

Geometry is another word that denotes a geographic feature. Originally the word geometry meant measurement of the earth. Another meaning comes from cartography, referring to the geometric features that cartographers use to map the world.
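So, as I read sections 11.5.6 and 11.5.10, creating a spatial column with an index on it goes about like this in MySQL 8.0 (SRID 4326 being plain old latitude / longitude; station_geo is my made-up name):

    -- A POINT column with a spatial index; SPATIAL indexes want NOT NULL.
    CREATE TABLE station_geo (
        station_id INT UNSIGNED NOT NULL PRIMARY KEY,
        location   POINT NOT NULL SRID 4326,
        SPATIAL INDEX (location)
    ) ENGINE=InnoDB;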

It also has a nice set of functions that work on Spatial data types:

https://dev.mysql.com/doc/refman/8.0/en/spatial-analysis-functions.html

12.16 Spatial Analysis Functions

12.16.1 Spatial Function Reference
12.16.2 Argument Handling by Spatial Functions
12.16.3 Functions That Create Geometry Values from WKT Values
12.16.4 Functions That Create Geometry Values from WKB Values
12.16.5 MySQL-Specific Functions That Create Geometry Values
12.16.6 Geometry Format Conversion Functions
12.16.7 Geometry Property Functions
12.16.8 Spatial Operator Functions
12.16.9 Functions That Test Spatial Relations Between Geometry Objects
12.16.10 Spatial Geohash Functions
12.16.11 Spatial GeoJSON Functions
12.16.12 Spatial Convenience Functions

MySQL provides functions to perform various operations on spatial data. These functions can be grouped into several major categories according to the type of operation they perform:

Functions that create geometries in various formats (WKT, WKB, internal)

Functions that convert geometries between formats

Functions that access qualitative or quantitative properties of a geometry

Functions that describe relations between two geometries

Functions that create new geometries from existing ones

For general background about MySQL support for using spatial data, see Section 11.5, “Spatial Data Types”.

I find that a rather interesting data type!
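For instance, if I’m reading the function list right, great-circle distances come nearly for free (ST_Distance_Sphere returns meters; station_geo is the sketch table from above):

    -- Distance in meters between pairs of stored station points:
    SELECT a.station_id, b.station_id,
           ST_Distance_Sphere(a.location, b.location) AS meters_apart
      FROM station_geo a
      JOIN station_geo b ON b.station_id > a.station_id
     LIMIT 10;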

I didn’t find any kind of “altitude” in the Spatial data types, so it looks like they expect a smooth, single-altitude sphere. So, OK, you need to do the altitude layers as a set of repeated fields inside the cell, and do altitude processing on that cell in layers. Got it. Temp_ground, Temp_5km, Temp_10km, Temp_15km, Temp_20km, etc.
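So the cell sketch from earlier would just pick up one column per layer, something like (hypothetical names, matching the list above):

    -- Altitude as repeated per-layer fields, since POINT carries no altitude:
    ALTER TABLE cell
      ADD COLUMN temp_ground DOUBLE,
      ADD COLUMN temp_5km    DOUBLE,
      ADD COLUMN temp_10km   DOUBLE,
      ADD COLUMN temp_15km   DOUBLE,
      ADD COLUMN temp_20km   DOUBLE;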

It still moves that whole “lay out a global grid and process through it” chore out of the flow-of-control logic of the program and into the data structure.

Also, since you can write modules to extend MySQL, the really hard-core folks could extend the Spatial data type to include an altitude axis…

In Conclusion:

So there you have it. Some “chewing gum for the mind” to work on for a while… I know it’s made my mind a bit tired today, chewing on all of it.

OTOH, I’m much more prepared to make a decent go of getting a proper data structure into the GHCN / USHCN etc. DB I’m building. I’m still not keen on their DATE data type (it is fairly inflexible about the YYYY-MM-DD order, though the Oracle version supports my preferred MMM-as-letters format), but I may try loading the STATION LAT / LONG data into a Spatial data type, just for grins.
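If I do, the load would presumably go something like this (assuming the station table already carries plain latitude / longitude number columns; note that MySQL 8.0’s SRID 4326 takes latitude first, which is worth double-checking against the manual):

    -- Add a spatial column, then pack the existing coordinates into it:
    ALTER TABLE station ADD COLUMN location POINT SRID 4326;

    UPDATE station
       SET location = ST_SRID(POINT(latitude, longitude), 4326);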

I’m also ready to take a stab at some keys, indexes and views. But not, I think, tonight… It’s already been a long day slaving over a hot computer ;-)


7 Responses to Procedural Languages, MySQL, Massively Parallel, and Climate Modeling

  1. A C Osborn says:

    Tiring, picking up all that new or revisited knowledge. My problem is I no longer remember it all well enough to make the programming (VBA) as natural as it used to be, and I have to keep looking it up again.
    One thing I would recommend is to have a Word (or equivalent) document to paste your code and SQL into, with a simple description, to make it quickly searchable; so that next time you want that “bit of code” with a slight mod you don’t have to go hunting for it all over the place.

  2. H.R. says:

    @E.M. – Nice write-up of what’s going on. Pretty easy to follow by me, and I lost interest in programming and making computers do things in the ’80s. Most of the programming I have done has been machine specific – grinders, mills, bending machines, PLCs and whatnot.

    I do read your geeky posts but I can’t recall any specific comment, if any, I’ve made on any of them over the years. You post details, but it is helpful to me as the details are always about broader issues that are good to be aware of.

    For whatever reason (fishing has been spotty… yeah, that’s the ticket 😜), this project has piqued my interest a bit and your time-outs for a summary-to-date have been much appreciated and quite accessible to this reader.

  3. Soronel Haetir says:

    I suspect that (1) MySQL’s (or any other SQL) evaluation would not be efficient enough to replace the typical compiled code used by GCMs, if that is what you are thinking, and (2) that SQL’s model of producing output from a number of like inputs would be painful if instead applied to a system with 1 input row producing changes in multiple output rows. It is possible to write compiled code and have MySQL run that instead, but I suspect point (2) would still be a killer if you were to try to do this for large scale scientific computing.

  4. Gary P Smith says:

    I used MySQL quite a bit in the past and found it did everything the big boys do (DB2 / MS SQL Server / Oracle) including table partitioning. Inner join is definitely supported although I never used those key words. I used table aliases and a.keycolumn1 = b.keycolumn1. I always stopped at 3rd normal form. At the end of the day, probably not much extra programming work between 3rd and 4th normal form, but I’ve generally found that you can’t UPDATE through a view, so a “logical transaction with COMMIT” gets busted up on the INSERT or UPDATE into a bunch of tiny transactions. 3rd normal form just usually seems to fit my “interpretive mind” better.

    4th normal form however, fits Codd’s mandate better: use the key, the whole key, and nothing but the key, so help me Codd.

    I wish I had time to collaborate with you on this project. It would be fun.

  5. E.M.Smith says:

    @A.C. Osborn:

    Nice tip. Unfortunately for me, most of what I’ve done has never had cause for “reuse”. It was specific to some company or some project, then gone. Or it was in some old crusty language that I’d moved on from. Or…

    For some odd reason everyone gets all “Panties in a bunch” about the latest idea for “Code Re-usability” as though it were a Brand New Idea, but it isn’t. FORTRAN hangs around partly due to the enormous library of Math Routines that are crazy fast written in it. The OO folks think Inheritance is the Cat’s Meow (snide side ref to their tendency to demonstrate inheritance with Animal – Mammal – Cat paradigm) as it lets you reuse code… (then forget it means you must memorize the entire library of Classes and Objects to do that…) Sigh. Everything old is new again… Kind of the point of the video.

    I was surprised to see that Lisp lives on as Son-Of-Lisp, the Racket language… (in another video…) Gotta love those parentheses…

    My code re-usability is largely just grabbing some module in the current effort that uses a lot of the same data (so that part is mostly done as is the main loop) and then hack it up. ;-) Once something is a decade in the past, I rarely have any use for it (but maybe that will change with this climate stuff ;-)

    @H.R.:

    Glad you are finding interest in it! I’m having some fun waking up those old DBA skills (it’s been about 35 years…), as for a few decades I’ve wished for some quick and easy report writer that could query data files (like the old mainframe FOCUS and RAMIS II products I used), and SQL comes close. Not quite as general purpose a tool, and not as English-like a report writer, but good enough.

    @Gary P. Smith:

    The book I’ve got has comparisons for MS SQL, Oracle SQL and MySQL. Pretty much they all have all the same bits. The only thing that was a “maybe matters” (or maybe not…) was the lack of an “OUTER JOIN” command in MySQL. This is called a “Full Outer Join” and only really differs from LEFT and RIGHT in that it doesn’t care if there are missing fields on either side; you still get the record. I can see that as being useful for gluing tables back together to make the exact original loaded flat data file, all blanks included; but I can’t think of why I’d need to do that ;-)

    Otherwise, it looks pretty complete.

    The only other Wha? is the occasional “That feature isn’t implemented in…” for the free MySQL in Debian / Devuan. So far none have bitten me, but when looking at more exotic stuff like Spatial Data Types I’m always just kind of wondering if they were implemented or not…

    Good point on the updates. I’ll have to make sure I don’t “over normalize” and make the data loads a PITA ;-) I usually don’t…

    I’m still a bit miffed that they call the key the Primary Key and the Indexes “Foreign Keys”… (and maybe in Views it comes up…) and then talk about key and index as synonyms. They are not. But OK, I can keep a Tidy Mind while doing an inner translate…

    Per collaboration: My intent is to just toss stuff to the public. Then if someone wants to do something trick with it, they can. 100% voluntary, asynchronous, and mahalo driven… Liberty, it’s a thing ;-)

  6. jim2 says:

    Just to break some stereotypes, functional programming with Java.

    [https://www.javaworld.com/article/3314640/learn-java/java-101-functional-programming-for-java-developers-part-1.html]

  7. Roberto says:

    Concerning normalized database layouts. These are super for flexibility when you have to adapt to unexpected variations in the input data. But after a bunch of code has baked in your design, this concrete hardens. Good luck with that balance.
