Stats miscellany

Posted on 27 Feb 2015
Tags: statistics, modeling

Here are some lukewarm takes on recent events in statistics and associated domains. At risk of jinxing the endeavor, I’ll try to make this a semi-regular feature.

Null hypothesis testing

In statistical methodology, Basic and Applied Social Psychology has banned the use of inferential statistics in its publications. The announcement is short and interesting enough that I’ve excerpted it in its entirety here:

The Basic and Applied Social Psychology (BASP) 2014 Editorial emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid […] From now on, BASP is banning the NHSTP.

[…] authors will have to remove all vestiges of the NHSTP (p-values, t-values, F-values, statements about “significant” differences or lack thereof, and so on).

[…] Confidence intervals suffer from an inverse inference problem that is not very different from that suffered by the NHSTP. […] Therefore, confidence intervals also are banned from BASP.

Bayesian procedures are more interesting. The usual problem with Bayesian procedures is that they depend on some sort of Laplacian assumption to generate numbers where none exist. […] However, there have been Bayesian proposals that at least somewhat circumvent the Laplacian assumption, and there might even be cases where there are strong grounds for assuming that the numbers really are there (see Fisher, 1973, for an example). […] Bayesian procedures are neither required nor banned from BASP.

[…] BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible. Finally, we encourage the use of larger sample sizes than is typical in much psychology research […]

A subtlety I didn’t include in the above is that the ban is strict only at publication, not at submission/review. Redditor wil_dogg speculates that this would allow the NHSTP to keep being used editorially as long as it doesn’t make it into print. I’m unconvinced.

Gelman’s response is disappointingly all object-level, but the emails he was responding to don’t do any better. The comments are equally unhelpful.

So what am I to make of a not super-prestigious but not bogus scientific journal simultaneously:

  • calling “invalid” and banning a procedure (significance testing) that I consider valid if flawed when used in practice,
  • banning a procedure (confidence intervals) that does about as well as anything else in reliably producing knowledge about the world,
  • discouraging, or at least failing to endorse, a procedure (Bayesian inference) for entirely non-pragmatic reasons,
  • encouraging larger sample sizes in an era when all the cool data is too large to jam into a researcher’s mind?

I could try to make sense of the parts I cut out, but they’re out-of-focus stuff about “increasing the quality of submitted manuscripts” and “creative thinking”. If nothing else, Trafimow and Marks treat the NHSTP with clarity and precision. Structurally and rhetorically, though, I found it interesting how the justifications for the significance-testing ban are pushed into later, unrelated parts of the editorial. If they’re placed back where they make more sense, the case against confidence intervals nearly disappears. So maybe this is about defense in depth?

Boston weather

I jumped in on a Twitter melee over just how extremely bad the snow in Boston has been over the last few weeks. ClarkHat had expressed his skepticism about the results.

His concerns fall in three directions: appropriateness of the model, adequacy of the data, and politics. Addressing the first of these frames the second, and screens off the last from mattering at all, so let’s start there. Here’s the Telephone game description:

That set him on a statistical coding spree aimed at generating a million-year synthetic dataset of plausible winters based on actual historical data in Boston back to 1938. To do this, he parsed through every three-day period (to maintain meteorological plausibility and prevent the possibility of back-to-back-to-back 20-inch snowstorms) and then randomly generated a set of hypothetical winters consistent with the city’s climate history. His analysis shows that given a static climate, Boston can expect a winter with a 30-day stretch like this one only once approximately every 26,315 years — 38 out of a million.
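As far as I can tell, the procedure described is a block bootstrap: resample contiguous three-day blocks from the historical record and stitch them into synthetic winters. A minimal sketch with made-up data (the winter length, block length, threshold, and snowfall record are all placeholders, not his actual choices):

```python
import random

def synthetic_winter(daily_snow, winter_len=90, block_len=3):
    """Build one synthetic winter from contiguous 3-day blocks of the
    historical daily-snowfall record.  Resampling whole blocks is what
    keeps back-to-back-to-back 20-inch storms implausible."""
    winter = []
    while len(winter) < winter_len:
        start = random.randrange(len(daily_snow) - block_len + 1)
        winter.extend(daily_snow[start:start + block_len])
    return winter[:winter_len]

def worst_30_days(winter):
    """The statistic at issue: the snowiest 30-day stretch."""
    return max(sum(winter[i:i + 30]) for i in range(len(winter) - 29))

random.seed(0)
# Toy stand-in for the 1938-onward record (inches per day): mostly dry
# days with occasional storms.  Not real Boston data.
history = [random.choice([0, 0, 0, 0, 0, 1, 2, 5, 12]) for _ in range(77 * 90)]

totals = [worst_30_days(synthetic_winter(history)) for _ in range(1000)]
extreme = sum(t >= 90 for t in totals)  # threshold is a placeholder
```

Counting how many of a million such winters clear the observed stretch is what yields a waiting-time estimate of the kind quoted above.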

To draw a really unsupportable analogy with trigraph models of English, this is like approximating the real thing with:

    P(c_1 ... c_n) ≈ ∏_i P(c_i | c_{i-2}, c_{i-1})

Without getting too technical or precise, you don’t need to hit this model with too much data for it to say that “QWYZ BSARW XAXPI TRUHG” is very unlikely to be English. The model will even give you a probability (equivalently, waiting time) of seeing that nonsense. You don’t need to think the point estimate is very precise to believe it qualitatively. And if the model is trained on a corpus of, say, 1,000 words, you would be out of your mind to clamp the probability of “QWYZ” from something like 1e-10 back up to something like 1e-3.
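To make that concrete, here’s a toy character-trigram scorer; the training corpus and the smoothing floor are invented for illustration:

```python
from collections import Counter
from math import log10

def train_trigrams(corpus):
    """Count character trigrams in a training corpus."""
    padded = "  " + corpus
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def log_prob(text, counts, floor=1e-6):
    """Crude log10 probability of `text` under the trigram counts.
    Unseen trigrams get the smoothing floor -- the 'clamp' in question.
    A floor of 1e-3 would wildly overstate how English-like 'QWYZ' is."""
    total = sum(counts.values())
    lp = 0.0
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        p = counts[tri] / total if counts[tri] else floor
        lp += log10(p)
    return lp

counts = train_trigrams("the quick brown fox jumps over the lazy dog " * 25)
english = log_prob("the lazy dog", counts)
noise = log_prob("qwyz bsarw xaxpi truhg", counts)
# The gibberish scores far lower, even though neither point estimate
# deserves much precision.
```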

If anything, the problem is what you conclude from the analysis. If you actually see a word with an absurdly low probability (is there a list of high surprisal words out there?), the right inference is that your language model is bad. Back in the weather case, you’d revisit the assumptions of three-day dependence or of the climate being stationary over the last seven decades. This is where politics may come back into the discussion.

The predictive check perspective might also be clarifying here.
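What I have in mind, with entirely made-up numbers: simulate the statistic under the fitted model and see where the observed winter lands among the replicates.

```python
import random

def predictive_p(observed, simulated):
    """Fraction of model-simulated values at least as extreme as the
    observed one.  When this is absurdly small, the first suspect is
    the model, not the weather."""
    return sum(s >= observed for s in simulated) / len(simulated)

random.seed(1)
# Placeholder: pretend these are worst-30-day totals (inches) simulated
# from some fitted model of Boston winters.
simulated = [random.gauss(30, 10) for _ in range(10000)]
p = predictive_p(90, simulated)  # effectively zero: doubt the model first
```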

Adaptation and independence

Talking with Sarah Perry about music and constraints led to this page from Christopher Alexander’s Notes on the Synthesis of Form:

There’s no reason not to interpret the figures as graphical models. You can do useful things with subsystems:

  • dynamic programming allows global computation by grouping over states that look the same at inter-subsystem interfaces,
  • the model can be coarsened to the subsystem level and analyzed there to provide initial guesses at the fine scale,
  • when the variables are hidden/unobserved but have associated observed variables, the subsystems pick out pools of observed variables with different populations of underlying hidden variables.
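As a toy instance of the first bullet, take a chain of binary variables with pairwise potentials: eliminating variables one at a time passes a small message across the one-variable interface between neighbors, so the global sum takes linear rather than exponential work. The potential here is invented for illustration:

```python
from itertools import product

def psi(a, b):
    """Toy pairwise 'agreement' potential on a chain x1 - x2 - ... - xn."""
    return 2 if a == b else 1

def chain_partition(n):
    """Sum the product of potentials over all 2**n assignments by
    passing a message over the single-variable interface between
    subsystems: linear work instead of exponential."""
    msg = {0: 1.0, 1: 1.0}  # message flowing into x1
    for _ in range(n - 1):
        msg = {b: sum(msg[a] * psi(a, b) for a in (0, 1)) for b in (0, 1)}
    return sum(msg.values())

def brute_force(n):
    """Exponential check: enumerate every assignment directly."""
    total = 0
    for xs in product((0, 1), repeat=n):
        weight = 1
        for a, b in zip(xs, xs[1:]):
            weight *= psi(a, b)
        total += weight
    return total
```

Only the interface variable’s state matters to what lies downstream, which is exactly the “grouping over states that look the same at inter-subsystem interfaces” above.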

The first two ideas correspond to different mechanisms of evolution/design, bottom-up and top-down, respectively. (And when evolvability is in play, the former more aggressively selects for simpler interfaces?) The third idea is one I’m currently writing up, but it’s not yet clear if it has any meaningful design interpretation.