Smoothing Data and not Distorting Data

I’ve been downloading a bunch of the daily (‘original’) temperature records and pondering better ways to ‘smooth’ the series than those used by the “climate scientists”. In particular, avoiding “averages” as much as possible.

Along the way, I ran into these nice resources. I’m still working through them. The problems of discontinuous data and time series data are all over the place in Economics, so much work has been done on that kind of problem in my field. (Part of why it rankles me a little when folks disparage someone as ‘just an Economist’ in the context of ‘climate science’… as Econometrics and Economic Modeling have likely done more to advance the methods of analysis of such crappy data than just about anyone…)

I’m just going to document these two resources here with a couple of comments (so they are off of my ‘preserve’ list on the notepad and easy to link from other browsers du jour).

Handling Noisy Non-Smooth Data

http://www.hindawi.com/journals/isrn/2011/164564/

ISRN Applied Mathematics

Volume 2011 (2011), Article ID 164564, 11 pages

http://dx.doi.org/10.5402/2011/164564

Research Article

Numerical Differentiation of Noisy, Nonsmooth Data

Rick Chartrand
Theoretical Division, MS B284, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
Received 8 March 2011; Accepted 4 April 2011

Academic Editors: L. Marin and D. Xiao

Copyright © 2011 Rick Chartrand. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We consider the problem of differentiating a function specified by noisy data. Regularizing the differentiation process avoids the noise amplification of finite-difference methods. We use total-variation regularization, which allows for discontinuous solutions. The resulting simple algorithm accurately differentiates noisy functions, including those which have a discontinuous derivative.

If ever there was a “Noisy, Nonsmooth Data” series, it is the temperature series, and ‘differentiating’ it to get the delta-temp/delta-time is exactly the goal.

I hope at some point to try their method on temperature series. When time permits. Probably a couple of years out at this point. They have a discussion of using MATLAB to do the calcs… oh boy yet another language to learn… Sigh.

As the body of the paper is just chock full of Greek letters and math symbols, I’m not going to try putting a sample here. Just takes too much time to get the characters right in unicode.

5. Conclusions

We presented a method for regularizing the numerical derivative process, using total-variation regularization. Unlike previously developed methods, the TV method allows for discontinuities in the derivatives, as desired when differentiating data corresponding to nonsmooth functions. We used the lagged diffusivity algorithm, which enjoys proven convergence properties, with one implementation that works rapidly for small problems, and a second more suitable for large problems. The TV regularization allows the derivative to capture more features of the data, while adjusting the regularization parameter controls the scale of fluctuations in the data that are ignored.
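
That said, the core of their “lagged diffusivity” iteration is short enough to sketch in code without any Greek at all. Here is a rough R version I worked up from the description; it is my own toy construction (my own function name, defaults, and a crude Riemann-sum integration operator), not a port of the authors’ MATLAB code, and it is only meant for small series:

R code (my sketch, not the paper’s)
# Total-variation regularized differentiation via the lagged diffusivity iteration.
# f = noisy data on an evenly spaced grid; alpha = regularization strength
# (bigger alpha, smoother derivative). Dense-matrix version for small problems.
tv_diff <- function(f, alpha = 0.1, dx = 1, iters = 40, eps = 1e-8) {
  n   <- length(f)
  fh  <- f - f[1]                                # remove the integration constant
  A   <- dx * lower.tri(matrix(0, n, n), diag = TRUE)  # antiderivative (Riemann sum)
  D   <- matrix(0, n - 1, n)                     # forward-difference operator
  D[cbind(1:(n - 1), 1:(n - 1))] <- -1 / dx
  D[cbind(1:(n - 1), 2:n)]       <-  1 / dx
  AtA <- crossprod(A)                            # t(A) %*% A
  Atf <- crossprod(A, fh)                        # t(A) %*% fh
  u   <- c(diff(fh) / dx, 0)                     # naive finite-difference start
  for (k in seq_len(iters)) {
    w <- as.vector(1 / sqrt((D %*% u)^2 + eps))  # lagged diffusivity weights
    L <- t(D) %*% (w * D)                        # D' diag(w) D
    u <- solve(alpha * L + AtA, Atf)             # next estimate of the derivative
  }
  as.vector(u)
}

# Toy check: the derivative of |x| recovered from noisy samples should come out
# looking (roughly) like a step from -1 to +1; alpha will need hand tuning.
x  <- seq(-1, 1, length.out = 101)
y  <- abs(x) + rnorm(101, sd = 0.02)
du <- tv_diff(y, alpha = 0.05, dx = x[2] - x[1])
plot(x, du, type = "l")

The appeal for temperature work is that the recovered derivative is allowed to jump, so a genuine step change would show up as a step instead of being smeared across years.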

Online Textbook for Forecasting Using R

From the “Far Simpler but Complicated In Its Own Way” department, this is a nice online introductory textbook, “Forecasting: Principles and Practice”:

https://www.otexts.org/fpp

Forecasting: principles and practice

Rob J Hyndman
George Athanasopoulos

Welcome to our online textbook on forecasting. This textbook is intended to provide a comprehensive introduction to forecasting methods and to present enough information about each method for readers to be able to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details. The book is written for three audiences: (1) people finding themselves doing forecasting in business when they may not have had any formal training in the area; (2) undergraduate students studying business; (3) MBA students doing a forecasting elective. We use it ourselves for a second-year subject for students undertaking a Bachelor of Commerce degree at Monash University, Australia.

For most sections, we only assume that readers are familiar with algebra, and high school mathematics should be sufficient background. Readers who have completed an introductory course in statistics will probably want to skip some of Chapters 2 and 4. There are a couple of sections which require knowledge of matrices, but these are flagged.

At the end of each chapter we provide a list of “further reading”. In general, these lists comprise suggested textbooks that provide a more advanced or detailed treatment of the subject. Where there is no suitable textbook, we suggest journal articles that provide more information.

We use R throughout the book and we intend students to learn how to forecast with R. R is free and available on almost every operating system. It is a wonderful tool for all statistical analysis, not just for forecasting. See Using R for instructions on installing and using R. The book is different from other forecasting textbooks in several ways.

It is free and online, making it accessible to a wide audience.
It uses R, which is free, open-source, and extremely powerful software.
It is continuously updated. You don’t have to wait until the next edition for errors to be removed or new methods to be discussed. We will update the book frequently.
There are dozens of real data examples taken from our own consulting practice. We have worked with hundreds of businesses and organizations helping them with forecasting issues, and this experience has contributed directly to many of the examples given here, as well as guiding our general philosophy of forecasting.
We emphasise graphical methods more than most forecasters. We use graphs to explore the data, analyse the validity of the models fitted and present the forecasting results.

Use the table of contents on the right to browse the book. If you have any comments or suggestions on what is here so far, feel free to add them on the book page.

Happy forecasting!
Rob J Hyndman
George Athanasopoulos
May 2012.

I find it a fairly readable way to learn R, specifically directed at Forecasting problems. Since market prediction is a very specific kind of Forecasting, as is weather, as is economic performance trend, as is Operations Planning in production, as is… {long list of Economics topics elided…} and, oh yeah, as is ‘climate science’; it ought to be a very useful thing to read. If nothing else, the primitive nature of the Forecasting done by ‘climate science’ stands in stark relief against the topics here…

There are little ‘triangle topic markers’ on the Table Of Contents to the right of the page. Here’s a look into just ONE of the chapters:

Time series decomposition

Time series components
Moving averages
Classical decomposition
X-12-ARIMA decomposition
STL decomposition
Forecasting with decomposition
Exercises
Further reading

Here is a sample from the “Exponential Smoothing” chapter:

7.1 Simple exponential smoothing

The simplest of the exponentially smoothing methods is naturally called “simple exponential smoothing” (SES). (In some books, it is called “single exponential smoothing”.) This method is suitable for forecasting data with no trend or seasonal pattern. For example, the data in Figure 7.1 do not display any clear trending behaviour or any seasonality, although the mean of the data may be changing slowly over time. We have already considered the naïve and the average as possible methods for forecasting such data (Section 2/3).

Saudi Oil Production 1996 to 2007

[Graph of Saudi Oil Production: Figure 7.1: Oil production in Saudi Arabia from 1996 to 2007.]

R output
oildata <- window(oil,start=1996,end=2007)
plot(oildata, ylab="Oil (millions of tonnes)",xlab="Year")

Using the naïve method, all forecasts for the future are equal to the last observed value of the series.
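
If I read the book right, their forecast package even has a naive() helper that does exactly this. Something along these lines (my own snippet, not copied from the book, and assuming the oildata object from the code above):

R code
library(forecast)            # Hyndman's forecast package
fc <- naive(oildata, h = 3)  # every forecast is just the last observed value
plot(fc)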

Further down they have the exponentially smoothed results:

Simple Exponential Smoothing of Saudi Oil Data 1996 to 2007

R output
fit1 <- ses(oildata, alpha=0.2, initial="simple", h=3)
fit2 <- ses(oildata, alpha=0.6, initial="simple", h=3)
fit3 <- ses(oildata, h=3)
plot(fit1, plot.conf=FALSE, ylab="Oil (millions of tonnes)",
  xlab="Year", main="", fcol="white", type="o")
lines(fitted(fit1), col="blue", type="o")
lines(fitted(fit2), col="red", type="o")
lines(fitted(fit3), col="green", type="o")
lines(fit1$mean, col="blue", type="o")
lines(fit2$mean, col="red", type="o")
lines(fit3$mean, col="green", type="o")
legend("topleft",lty=1, col=c(1,"blue","red","green"),
  c("data", expression(alpha == 0.2), expression(alpha == 0.6),
  expression(alpha == 0.89)),pch=1)

In this example, simple exponential smoothing is applied to forecast oil production in Saudi Arabia. The black line in Figure 7.2 is a plot of the data over the period 1996–2007, which shows a changing level over time but no obvious trending behaviour.

In Table 7.2 we demonstrate the application of simple exponential smoothing. The last three columns show the estimated level for times t=0 to t=12, then the forecasts for h=1,2,3, for three different values of α.
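
For the curious, the recursion behind ses() is tiny: each new level is just a weighted blend of the newest observation and the old level. Here is a hand-rolled version (my own toy code, not the book’s; the real ses() also estimates alpha and the initial level, which this skips):

R code
# Simple exponential smoothing by hand:
# level <- alpha * (new observation) + (1 - alpha) * (old level)
ses_by_hand <- function(y, alpha, l0 = y[1]) {
  level <- l0
  for (t in seq_along(y)) {
    level <- alpha * y[t] + (1 - alpha) * level
  }
  level                              # final level = forecast for every horizon
}
ses_by_hand(oildata, alpha = 0.2)    # should line up with fit1$mean above

Notice that only one number gets carried forward from step to step, rather than a whole window of history.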

It gets increasingly complicated as you go through the book. Even has a bit on neural networks…

Advanced forecasting methods

    Dynamic regression models
    Vector autoregressions
    Neural network models
    Forecasting hierarchical or grouped time series
    Further reading

Anyone want to try it on Sunspots? Would be interesting to set this up and run it in parallel with the “professional” predictions and see who wins ;-)

Example 9.6: Sunspots

The surface of the sun contains magnetic regions that appear as dark spots. These affect the propagation of radio waves and so telecommunication companies like to predict sunspot activity in order to plan for any future difficulties. Sunspots follow a cycle of length between 9 and 14 years. In Figure 9.11, forecasts from an NNAR(9,5) are shown for the next 20 years.

Figure 9.11: Forecasts from a neural network with nine lagged inputs and one hidden layer containing five neurons.

R code
fit <- nnetar(sunspotarea)
plot(forecast(fit,h=20))

The forecasts actually go slightly negative, which is of course impossible. If we wanted to restrict the forecasts to remain positive, we could use a log transformation (specified by the Box-Cox parameter λ=0):
R code
fit <- nnetar(sunspotarea,lambda=0)
plot(forecast(fit,h=20))

In Conclusion

So now you know what I read when I’m not posting, doing stock stuff, or doing computer stuff ;-)

It is my belief that using the methods either in that simple text, or in that paper, or both, ought to yield a far better result than the junk that passes for “forecasting” (pardon, “projection”) from the ‘climate science’ crowd.

On the ‘long lead time very slow project list’ is to do just that with the daily temperature data and see where it leads when looked at ‘unadorned’ by adjustments. For example, if you just trend “daily max” and, separately, “daily min”, then compare the two curves, I see zero reason to need TOBS. Since there is no “daily average” made, you cannot bias it via TOBS.
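
As a (grossly simplified) sketch of what I mean, assuming a data frame already parsed out of the GHCN Daily files with year, month, day, tmax, tmin columns (hypothetical names; my own toy code, not something I have run against the real archive yet):

R code
# Trend daily max and daily min separately, one calendar day at a time,
# with no (max+min)/2 step anywhere for TOBS to bite on.
# 'station' is a hypothetical data frame with columns year, month, day,
# tmax, tmin (temperatures in tenths of a degree C, as in GHCN Daily).
sep15     <- subset(station, month == 9 & day == 15)
max_trend <- lm(tmax ~ year, data = sep15)
min_trend <- lm(tmin ~ year, data = sep15)
coef(max_trend)["year"]   # slope of the Sept 15 max series, tenths of a degree per year
coef(min_trend)["year"]   # same for the Sept 15 min series

Do that for every calendar day (or month, or season) and you get two families of trends, maxes and mins, to compare; the compositing step is the part still to be worked out.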

So that’s basically where I’m heading.

(Anyone else wants to run out ahead of me, please do 8-)


14 Responses to Smoothing Data and not Distorting Data

  1. Pete Russell says:

    Take a look at Shumway & Stoffer (http://www.stat.pitt.edu/stoffer/tsa3/) for time series analysis with examples in R if you haven’t seen this resource.

  2. Richard Ilfeld says:

    Not Math, But.
    Since you are an avid gardener and bunny fan, I’d hope there is profit in relating temperature series to agricultural events:

    First and last frost, first and last killing frost, growing season in days.

    Agricultural relationships might be extended well back in time, perhaps even to before instrument records. Can’t be worse than tree rings, and nobody has cared more about climate than farmers or gardeners.

    I’ve known window-box ladies I’d trust more than climate scientists on when to plant and harvest their herbs.

  3. Chuck Bradley says:

    The virtue of exponential smoothing is that it avoids hauling around a lot of data. With a moving average you have n periods of historic data, all shifted to make room for the new period, and the computed forecast. With exponential smoothing, the old forecast is replaced by the new forecast.
    I did a lot of exponential smoothing for computerized inventory control systems back in the mid 1960s. We needed the cycles, so exponential smoothing was a win for us, for stable or seasonal products. But it is bad when the underlying distribution changes. An instant or ramp change to a new stable level leaves the system playing catch up for a long time. In an inventory control system, that means large safety stocks driven by a beta similar to the stock volatility measure.
    I suspect there is another reason for the popularity of exponential smoothing. It is easy to teach and looks so awesome and sciencey to the average business student.
    I have not tried to keep up with the field, but I have heard there are methods that can detect a change in the distribution, at least some of the time for some kinds of changes.
    The most fun of the exponential smoothing projects was implementing x to the 0.7 power on an IBM 1401 (no floating point).

  4. Paul Hanlon says:

    I too don’t like this business of averaging and then averaging those averages. That’s why in the code I sent you, I first of all leave the temp measurements as integers, so there’s minimal chance of rounding errors and it’s faster.

    I consider a valid day to be one that has both a min and a max, so I add those together and increment an observation count by two. So if you want monthly resolution you will have a possible count of 62 and a total temp of say, 6200 (tenths of a degree) where a station has an average temp of 10 and you have the maximum number of observations. Same with yearly, except it would be 730 observations giving a total temp of 73000 in the same circumstances. Only when I decide the resolution I need, do I do the averaging *once*, be it for a given year, month, week or day.

    It actually simplifies things a lot, because it means the first step is purely parsing the ghcnd file into a form for further processing. It is at that stage that you can filter things for quality and the resolution you want and from that derive charts etc. That’s where R scores big. But on the initial parsing it struggles.

    I haven’t used R much. I found it very similar to Javascript, which isn’t surprising because both of them take elements out of Scheme (which took its influence from Lisp). With all the moving over to “Big Data”, R programmers are very much in demand, and with all the open source modules for R (I believe there is even one for clustered computing), there’s pretty much nothing you cannot do statistically with R. It’s a great choice for a language to learn.

  5. Svend Ferdinandsen says:

    TOBS is real if you try to construct a daily mean temperature from max/min measurements.
    I downloaded CNRH data taken each hour and found that the mean based on (max+min)/2 taken two times a day at specific times would depend very much on the timing of the readings.
    The error depends on the variations from day to day and could be 0.5C.

  6. omanuel says:

    E.M. Smith et al.

    Thanks to your keen analytical mind and brave tenacity in confronting experts, it now appears that politicians and technocrats handled by communists are all going down now!

    In the interest of concluding the AGW debate before the UN’s International IPCC Convention in Paris, two errors inserted in foundations of nuclear and solar physics after WWII have been concisely identified here:

    https://dl.dropboxusercontent.com/u/10640850/STALINS_SCIENCE.pdf

    This document was also posted on ResearchGate to allow everyone an opportunity to publicly comment on errors that appear to undermine the basis of the AGW position:

    https://www.researchgate.net/publication/281017812_STALIN'S_SCIENCE

  7. E.M.Smith says:

    @Svend: And that is why I am looking at ways to eliminate the daily min max average step…

    Now all we have is a set of mins and maxes, so my idea is to make a trend of mins and a separate trend of maxes and then composite the trends. Details still murky :-)

  8. p.g.sharrow says:

    As a matter of energy net loading, I would suggest that daily minimums would be the better indication of the 24-hour trends. Highs could be spiked by air movements as well as direct and indirect heating. Thermal loss afterwards would tend to flatten the curve of the 24-hour net gain or loss. Local heat island causes would create maximum spikes that raise the daily average. In the old days, the surrounding areas would have much the same conditions as the recording site. Today the recording sites are often near heat generators or storage masses…pg

  9. Svend Ferdinandsen says:

    E.M.
    That seems simple, but the bias is in the min and max readings themselves, because they are taken at a certain time each day, covering the previous 12 hours. One explanation is that you can measure the same max twice: if the temperature drops just after the first measurement, the second measurement could have the same max, or close to it.
    Another explanation is that it is a sampled system where you don’t know where in the 12 hour period you actually measured the max or min. And the sampling depends on the variation in temperature.
    I had to check it myself, to see what really went on with temperature measured every hour. Then I could make my own result of max/min measurement at different observation times and compare them.
    Sorry I did not make that clear from the beginning.
    I fear that this built-in “error” in a max/min system is hard to avoid.
    Another point is whether any correction is needed at all. But some stations at some time changed their observation times, and that is the basis for TOBS, because it might give some changes.
    I was amazed at the changes TOBS could mean, especially for Boulder, CO.
    (Mountains, big difference day/night and big difference over weeks and months.)

  10. E.M.Smith says:

    @Svend: I got your point the first time, and have looked at TOBS before.

    Yes, using a min max thermometer you can get the max duplicated as one example.

    My assertion is just this: take a thermometer with 100 years of data. Take Sept 15 and you have a 100 data point series. Put a trend to it. That’s the Sept 15th max trend. Shifting those points one to the right doesn’t change much. Having a few of them off by some tenths doesn’t change much. The min max thermometer is supposedly getting the max of some day, and if you avoid averaging it with anything, that ought to have nearly no error impact. While still speculative, I would assert it is about a 1/180th error. (One day off out of a 1/2 seasonal cycle.)

    Do the same with min. Now find the trend line between those two lines.

  11. Neil Jordan says:

    In water resources (hydrology, streamflow, etc.) it is a common reality check to cut the data set in half, for example, and work one’s mystical statistical magic on one half. Then do the extrapolation over the part of the data withheld. Then lift the curtain to see how well the extrapolation compares with reality.

    In a related matter, see
    http://acwi.gov/hydrology/Frequency/B17bFAQ.html#data
    Question: What is the relationship of the Federal Data Quality Act to flood data and flood-frequency analysis?
    Answer: The “Federal Data Quality Act” (officially known as Section 515 of Public Law 106-554, the Treasury and General Government Appropriations Act for Fiscal Year 2001) requires the Office of Management and Budget (OMB) and, through it, all Federal agencies to issue guidelines to ensure the “quality, objectivity, utility, and integrity” of information issued by the government. . .

  12. Paul Hanlon says:

    I thought ghcn-daily comes with the TOBs correction already done. I’ve looked here, and although it states that these are min-max values, they are not clear as to how they were derived, but I’m sure I saw a reference somewhere that these are after TOBs adjustments as part of their quality assurance.

    Unless you are talking about summer temps measured hourly creating a higher average than just using min / max. This is only a concern if you are trying to get an average temp over a specific area.

    If you are trying to get a global temperature, then the bias would be balanced out by measurements done on the other hemisphere, because winter temps measured hourly would have a lower average than just using max-min. Over a year the biases would balance out for any given station.

  13. Svend Ferdinandsen says:

    Interesting subject. It is in a way quite telling that all those small adjustments are needed to find an expected temperature change. The adjustments are nearly the same size as the change, so nobody would care about these small adjustments if the change were clearly visible.
    As some have said, if it were not for climate science, nobody would know the temperature has changed.
    By the way, I think that automatic electric thermometers with their fast response could give more bias than TOBS. I remember an old meteorologist from Germany who estimated 0.5 to 1K by comparing old to new, but it was to the warm side, so nothing to see.

  14. beng135 says:

    I took min/max temps for nearly 20 yrs. Perhaps a few times a yr, as a severe cold front was moving thru @ 10 pm, the low was the next day’s high. But I doubt that had any significant effect.
