When is a test "biased"?

Data Colada recently posted a comment (written by Uri Simonsohn) on a supposed ‘bias’ lurking in a common default Bayesian alternative to the t-test. Point #1 was that the Bayesian t-test is ‘biased’ against low-power results. Several smart people have made sophisticated and pretty sensible critiques of Simonsohn’s arguments, which I won’t rehash here. Instead, I want to point out the obvious problem: the claim of ‘bias’ is based on choosing the sample size in a way that is, well, biased.

Here’s the central graph under contention.

Now, there are several things to notice. First, as the effect size goes down, not only does the probability that the results support the null increase, but the probability that the results support the alternative goes down as well. This is barely noticeable in this graph, but that’s just a function of the particular power level the author chose. With a power of 0.85 (which is a better estimate of what a person ought to be aiming for in their experiment anyway), the relative size of these changes is reversed: the probability of supporting the alternative decreases 5 points, from .73 to .68, while the probability of supporting the null increases just 4 points, from .005 to .042. (I got these numbers using Simonsohn’s publicly available R code.) So it seems that only really low-power tests are ‘biased’ against small effect sizes?
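For readers who want to poke at these probabilities themselves, here is a rough Python sketch. It is not Simonsohn's R code and should not be expected to reproduce the numbers quoted above. It assumes the JZS Bayes factor of Rouder et al. (2009) with a Cauchy prior scale of r = 1, equal group sizes, and a Bayes factor above 3 as the cutoff for "support"; all function names are made up for illustration.

```python
import math
import random


def jzs_bf01(t, n, r=1.0, steps=2000):
    """Bayes factor (null over alternative) for a two-sample t statistic.

    Assumes n observations per group and the JZS prior with scale r.
    The integral over g is done on u = log(g) with the trapezoid rule.
    """
    n_eff = n / 2.0            # n1 * n2 / (n1 + n2) when both groups have n
    nu = 2 * n - 2             # degrees of freedom
    null_like = (1 + t * t / nu) ** (-(nu + 1) / 2)

    def integrand(u):
        g = math.exp(u)
        a = 1 + n_eff * r * r * g
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        like = a ** -0.5 * (1 + t * t / (a * nu)) ** (-(nu + 1) / 2)
        return like * prior * g          # trailing g is the Jacobian of exp(u)

    lo, hi = -12.0, 30.0                 # integrand is negligible outside
    h = (hi - lo) / steps
    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + i * h) for i in range(1, steps))
    return null_like / (total * h)


def critical_t(target_bf01, n, r=1.0):
    """|t| at which BF01 equals target_bf01 (BF01 shrinks as |t| grows)."""
    lo, hi = 0.0, 20.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if jzs_bf01(mid, n, r) > target_bf01:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def support_rates(d, n, sims=2000, seed=1, r=1.0, threshold=3.0):
    """Simulated P(BF01 > threshold) and P(BF10 > threshold) at true effect d."""
    rng = random.Random(seed)
    t_null = critical_t(threshold, n, r)       # below this |t|: supports null
    t_alt = critical_t(1 / threshold, n, r)    # above this |t|: supports alt
    null_hits = alt_hits = 0
    for _ in range(sims):
        x = [rng.gauss(d, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        mx, my = math.fsum(x) / n, math.fsum(y) / n
        vx = math.fsum((v - mx) ** 2 for v in x) / (n - 1)
        vy = math.fsum((v - my) ** 2 for v in y) / (n - 1)
        t = (mx - my) / math.sqrt((vx + vy) / 2 * 2 / n)
        null_hits += abs(t) < t_null
        alt_hits += abs(t) > t_alt
    return null_hits / sims, alt_hits / sims


if __name__ == "__main__":
    # Per-group n chosen for roughly 80% power at each d (normal approximation).
    for d, n in ((0.8, 25), (0.5, 63), (0.2, 393)):
        p_null, p_alt = support_rates(d, n, sims=1000)
        print(f"d={d}  n={n}  P(support null)={p_null:.3f}  P(support alt)={p_alt:.3f}")
```

Changing the per-group n in the last loop to match a different power level is the quickest way to see how much the picture depends on that one choice.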

But this brings up the deeper question: why make this comparison at all? In particular, why use the notion of ‘power’ to set your sample size? ‘Power’ is supposed to control for type II errors in a t-test. Why should it have any particular privileged status in setting the size of an experiment that will be analyzed with a different test? The answer is that it shouldn’t. These two concepts have been juxtaposed here without any clear justification. One could just as easily choose sample sizes to keep a constant probability that the default test favors the null. Then one would find that the t-test is ‘biased’ against large effect sizes, in that the power would be higher for small effects.
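To make concrete what "using power to set the sample size" involves, here is a minimal Python sketch. It assumes a two-sided, two-sample test with equal groups and uses the normal approximation, which comes out slightly below the exact t-based answer; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_for_power(d, power=0.8, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample test.

    d is the standardized effect size; uses the normal approximation
    n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        print(d, n_per_group_for_power(d))
```

Holding power fixed forces n to grow roughly as 1/d², so any comparison across effect sizes at constant power is also a comparison across very different sample sizes.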

I might be missing something, but I don’t see what. You have different ways of using models to choose the parameters of an experiment (such as its sample size) so as to control the probabilities of certain kinds of errors. Parameters that control for one type of error won’t necessarily control for another, but it’s rather extreme to call this a ‘bias’.