You have probably heard that you should never go data fishing, meaning that you should not repeatedly test the same data. In the case of statistical significance tests, perhaps you have heard the reason: by the nature of these tests, you will find an effect at the 5% significance level in 5% of cases where there actually is no effect, an effect at the 2% level in 2% of such cases, and so on. What you are less likely to have heard is that you should not keep looking for an effect after your current test has concluded there is none. Here is why.

Let’s say a researcher has an initial data set of ten samples. The effect they are looking for is not present in this data. Of course, it is possible that they were just unlucky, and that more data is all that is needed to find the effect. The researcher now starts drawing more samples, testing for a significant effect after each new sample. Once they find a significant effect, they stop looking. They continue until the data set hits 100 data points, after which the researcher is convinced there is no effect. When the statistical test is performed at the 5% significance level, one might think that this procedure still gives a probability of just 5% of making a mistake. That is wrong.

Instead, by doing this the researcher ends up with an approximately 28% chance of finding an effect where there is none. This can be confirmed by writing a bit of code that performs the procedure for a large number of initial samples drawn from a distribution without the hypothesized effect, which is precisely what I have done to find this number. I’ve included the code I wrote below. By running this code in MATLAB, you will find that about 2,800 out of 10,000 experiments (eventually) yield a significant result. Restricted to the data sets whose initial test came up insignificant, that means in about 24% of them you will find a significant effect just by continuing to look for it. The issue is easily remedied: choose a hypothesis and a sample size, and stick with them.

A related issue can pop up when doing other types of analyses. In data mining, a researcher is usually interested in finding novel patterns in data. This means the researcher cannot choose one hypothesis before testing; they will be forming and testing a great number of hypotheses as they explore the data. One way to do this without fishing is to split the data randomly into two sets: *exploring*, and *testing*. The researcher uses the *exploring* set to formulate hypotheses they want to test, and they test all the chosen hypotheses on the *testing* set. Compare this with the training, testing and development sets I posted about earlier. Note that one still has to account for the number of tests they are performing.
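To make the split concrete, here is a minimal Python sketch. The 50/50 split ratio, the NumPy array standing in for a real data set, the hypothesis count, and the Bonferroni correction are illustrative choices of mine, not part of the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal(200)      # stand-in for a real data set

# Randomly split the data into an exploring half and a testing half
shuffled = rng.permutation(data)
exploring, testing = shuffled[:100], shuffled[100:]

# Hypotheses are formulated by inspecting only the exploring set;
# each chosen hypothesis is then tested once on the testing set.
# To account for the number of tests performed, one simple option
# is a Bonferroni correction on the significance level:
num_hypotheses = 4                   # illustrative count
alpha = 0.05
alpha_per_test = alpha / num_hypotheses
```

Because every hypothesis is tested exactly once, on data that played no part in forming it, the testing step is no longer fishing; the corrected per-test level then keeps the family-wise error rate in check.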

```matlab
pd = makedist('Normal'); % Make a normal distribution (mu=0, stdev=1)
numExperiments = 10000;  % Number of experiments we want to run
numAdditions = 90;       % Number of samples to add iteratively to find a
                         % significant result
significances = 0;
for i = 1:numExperiments
    data = random(pd, 1, 10);
    % Perform a one-sample t-test (null hyp. mu == 0)
    h = ttest(data);
    if h == 1
        % Sample mean differs significantly from 0
        significances = significances + 1;
    else
        % Sample mean does not differ significantly from 0
        for j = 1:numAdditions
            r = random(pd);
            data = [data r]; % Draw and add new sample
            % Perform a one-sample t-test (null hyp. mu == 0)
            h = ttest(data);
            if h == 1
                % Sample mean significantly differs from 0
                significances = significances + 1;
                break;
            end
        end
    end
    display(['Significances found ' num2str(significances) '/'...
        num2str(i)]);
end
```
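For readers without MATLAB, here is a rough Python equivalent of the same simulation. A few assumptions on my part: SciPy's `ttest_1samp` stands in for MATLAB's `ttest`, the experiment count is scaled down from 10,000 to 1,000 to keep the run short, and a fixed seed is used for reproducibility:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
num_experiments = 1000   # scaled down from the 10,000 used above
num_additions = 90       # extra samples drawn one at a time
alpha = 0.05

significances = 0
for _ in range(num_experiments):
    data = list(rng.standard_normal(10))    # ten initial samples, true mean 0
    if ttest_1samp(data, 0).pvalue < alpha:
        significances += 1                  # "effect" found immediately
        continue
    for _ in range(num_additions):
        data.append(rng.standard_normal())  # draw and add one new sample
        if ttest_1samp(data, 0).pvalue < alpha:
            significances += 1              # "effect" found by peeking
            break

print(f"{significances}/{num_experiments} experiments reached significance")
```

With these settings the fraction of significant results lands far above the nominal 5% level, consistent with the roughly 28% found above.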