Let’s go back to some good old science subjects and take some notes about sediments, something I am supposed to be an expert in.

One of my favorite pastimes lately is collecting examples from the geological literature in which the statistical analysis went incredibly wrong. Take for example the papers dealing with grain-size distributions that advertise cumulative probability plots as the best technique to identify subpopulations in a mixed distribution. Here is what G.S. Visher says in his 1969 paper on “Grain size distributions and depositional processes” (Journal of Sedimentary Petrology, v. 39, p. 1074-1106):

“The most important aspect in analysis of textural patterns is the recognition of straight line curve segments. In figure 3 four such segments occur on the log-probability curve, each defined by at least four control points. The interpretation of this distribution is that it represents four separate log-normal populations. Each population is truncated and joined with the next population to form a single distribution. This means that grain size distributions do not follow a single log-normal law, but are composed of several log-normal populations each with a different mean and a standard deviation. These separate populations are readily identifiable on the log-probability plot, but they are difficult to precisely define on the other two curves.” (p. 1079)

I wonder whether this tendency to see straight line segments in cumulative probability plots, and to give them some special significance, is a syndrome restricted to geologists – whose abilities for pattern recognition are excellent in general – or whether one could find such examples in other fields as well. The fact that a certain distribution plots as a straight line on a cumulative probability plot does not mean that a mixture of several distributions of the same type will plot as a series of straight line segments. The excellent sedimentologist Robert Folk pointed this out in a 1977 discussion of a paper coauthored by Visher (in which they try to prove that the Navajo Sandstone is not an eolian deposit – yeah, right):

“A general defect of the Visher method is exemplified by Kane Creek #2, which is shown as consisting of four straight line segments, implying that it is a mixture of four populations. It can be proved by anyone using probability paper and ordinary arithmetic that such kinky curves can be made by a simple mixing of two (not four) populations that are widely separated; the ‘flat’ portions represent the gaps in the distribution. Furthermore, mixing of populations on probability paper results in smoothly curving inflexions, not angularly joined straight-line segments.”
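Folk's "ordinary arithmetic" is easy to redo on a computer. What follows is only a minimal sketch, not anything taken from Folk or Visher: the two populations, their weights, means and standard deviations (in phi units) are numbers I made up for illustration. It mixes two widely separated normal populations, maps the cumulative curve onto the probability-paper axis, and prints the local slope along the curve.

```python
from statistics import NormalDist

# Hypothetical mixture: two well-separated normal populations in phi units
# (normal in phi means log-normal in grain diameter). Each entry is
# (weight, mean, standard deviation); the numbers are invented for illustration.
COMPONENTS = [(0.5, 1.0, 0.5), (0.5, 4.0, 0.5)]


def mixture_cdf(x, components):
    """Cumulative probability of the mixture at grain size x (phi)."""
    return sum(w * NormalDist(mu, sd).cdf(x) for w, mu, sd in components)


def probit(p):
    """Map a cumulative probability onto the probability-paper axis."""
    return NormalDist().inv_cdf(p)


# Evaluate the curve every quarter phi between 0 and 5 phi.
xs = [0.25 * i for i in range(21)]
zs = [probit(mixture_cdf(x, COMPONENTS)) for x in xs]

# Finite-difference slope of the curve as it would appear on probability paper.
slopes = [(zs[i + 1] - zs[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]

for x, s in zip(xs, slopes):
    print(f"{x:4.2f} phi  local slope = {s:5.2f}")
```

With these made-up parameters the slope is steep near the two means, almost flat in the gap around 2.5 phi, and changes gradually in between. An eye looking for straight lines would happily draw three segments through a curve that was built from only two populations.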

Despite this, fitting multiple straight lines to cumulative probability plots is fashionable again, although this time it is done on log-log plots of exceedance probability for either bed thickness or fault size data. But this is going to be part of a paper that I am working on right now (in the evenings and on weekends…), so more about this later.