The Neurophysiological Biomarker Toolbox (NBT)

# Variability and statistics exercise

Version: 2016

Approximate time needed to complete the tutorial: 20 min.

Aim

This tutorial will give you a quick and non-mathematical introduction to statistical tests. We hope to give you a more “empirical” understanding of statistics than what you may have gained from following courses in statistics. The tutorial also serves to refresh your understanding of statistics and its limitations, because next week you will perform a lot of analyses.
As a secondary aim, you will get a flavor of programming in Matlab.
Several questions are asked. Discuss the answers with the student sitting next to you and ask the tutors if you run into problems.

## First Install NBT

Specific for the Human Neurophysiology course

1. Go to Course Documents / NBT Material and download NBT.zip by clicking on NBT.
2. A window will pop up which asks you whether you want to save or open the file. Click on open with and select the program 7-zip, which should be already set as the default program.
3. Once 7-zip opens, click the Extract button and save the extracted NBT folder in the Documents folder on the local C: drive.

How to install NBT in Matlab

Installing NBT is very easy!

1. Start Matlab (skip this step if Matlab is already open)
2. Set your Current folder to the NBT folder
3. Run installNBT.m (type installNBT in the command window and press Enter)

## What is a statistical test?

A statistical test is a method for evaluating the risk that the results you find arose by chance alone. For example, let us say you measure the amplitude in the alpha frequency band (8-13 Hz) using EEG in two groups A and B, and you find that the average in group A is 1 μV lower than in group B. Is this difference significant, i.e., can you expect to find a difference again if you measured more people from the two groups? A statistical test tells you whether a difference is statistically significant, meaning that a difference this large would occur by chance only with a small probability. That probability is the p-value; the threshold below which you call a result significant is the significance level. It is common to set the significance level at 5%.

## Simple statistics

Once installNBT has run, Matlab will embed all NBT functions, so you do not need to repeat the installation every time you open a new Matlab window (just the first time!).

We will now simulate some data in Matlab to see how a commonly used statistical test works, namely Student's t-test. [You can read more about Student's t-test on Wikipedia].

The p-value

1. Start Matlab. If you do not know Matlab, read the Getting started with Matlab tutorial first.
2. Make two sets of data, A and B. Type A = randn(20,1); B = randn(20,1); in the Matlab command window. This creates two data sets of 20 values each, both drawn from a distribution with an average of 0 (the averages of the 20 values themselves will be close to 0, but not exactly 0). Imagine these two data sets are the measurements (e.g., EEG amplitudes) on two groups (group A and group B) of students (e.g., males and females). In the next step we will test if there is a difference in amplitude between the groups.
3. A and B are by definition drawn from distributions with the same average (which is 0). Type [h, p] = ttest2(A,B); this is Student's t-test. p is the p-value: the probability of finding a difference between the averages of A and B at least as large as the one observed when there is in fact no difference between the groups. h is 1 if the test finds a significant difference at the 5% level and 0 otherwise. If you run steps 2-3 several times, you will not get the same p-value.
4. Run the code 1000 times (type the code below in the Matlab command window).

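p = zeros(1000,1); % room for the 1000 p-values
for i=1:1000
    A = randn(20,1); B = randn(20,1); % two groups with no true difference
    [h, p(i)] = ttest2(A,B); % store the p-value of run i
end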

What is the percentage of p-values less than 0.05? Write length(find(p<0.05))/10 in the Matlab command window to compute this percentage (the number of p-values below 0.05, divided by the 1000 runs and multiplied by 100). You should get around 5%. Do you understand why?
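One way to explore this question is to look at all 1000 p-values at once. The sketch below repeats the simulation and plots a histogram of the p-values. It assumes the Statistics Toolbox (for ttest2) and a Matlab release that provides histogram; the names nRuns and fpPercent are only illustrative.

```matlab
nRuns = 1000;                      % number of simulated experiments
p = zeros(nRuns,1);                % room for the p-values
for i = 1:nRuns
    A = randn(20,1);               % group A, true mean 0
    B = randn(20,1);               % group B, true mean 0
    [h, p(i)] = ttest2(A,B);
end
histogram(p, 0:0.05:1)             % 20 bins, each 0.05 wide
xlabel('p-value'); ylabel('number of simulations')
fpPercent = 100*mean(p < 0.05);    % same value as length(find(p<0.05))/10
disp(fpPercent)
```

When there is no true difference, the bars are typically of roughly equal height, so about one in twenty p-values ends up in the first bin (below 0.05).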

Power: Is a difference always significant?

We will now simulate two data sets that do differ, to see whether the t-test finds a significant difference. Do it 1000 times (use the code below).

for i=1:1000
    A = randn(20,1); B = 1 + randn(20,1); % B is shifted up by 1
    [h, p(i)] = ttest2(A,B); % store the p-value of run i
end

A and B by definition have a difference in average B-A = 1. We should find that A and B are significantly different. What is the percentage of p-values less than 0.05? Write length(find(p<0.05))/10 in the Matlab command window. You will find a percentage around 86%. It means that in about 86% of all cases the t-test found a significant difference between A and B; this is called the statistical power of your test.

We see that a statistical test will not always find a significant difference even if two data sets are different. The reason is that the overall distributions of A (red in the figure below) and B (blue in the figure below) overlap. Since we are only sampling 20 values from each of these two distributions, it is very easy to get two data sets that are so close that the t-test is not significant. Can you think of a way to get a higher power?
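The overlap can also be drawn directly. The following sketch plots the two theoretical distributions used in the simulation (mean 0 and mean 1, both with standard deviation 1) using normpdf from the Statistics Toolbox. The colours follow the text (A red, B blue); the variable names are illustrative.

```matlab
x = linspace(-4, 5, 200);              % values on the x-axis
pdfA = normpdf(x, 0, 1);               % distribution of A: mean 0, sd 1
pdfB = normpdf(x, 1, 1);               % distribution of B: mean 1, sd 1
plot(x, pdfA, 'r', x, pdfB, 'b')
legend('A', 'B')
xlabel('value'); ylabel('probability density')
overlap = trapz(x, min(pdfA, pdfB));   % area shared by the two curves
disp(overlap)
```

The shared area is about 0.62, so well over half of each distribution lies in a region that the other distribution also covers.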

One way of increasing the power of a statistical test is to increase the number of samples (i.e., subjects). A major concern in the design of an experimental study is, therefore, to ensure sufficient power. But how many subjects are enough? We do not want too many subjects, because each subject we include in a study costs both money and time. To get an estimate you can do a power analysis.

Power analysis: How many subjects do I need?

In a power analysis we want to map the relationship between power and number of subjects. We, therefore, generate data sets with different numbers of subjects, and calculate the power. Type the following code in the Matlab command window.

for n=2:50 % number of subjects per group
    disp(n) % show progress
    for i=1:1000
        A = randn(n,1); B = 1 + randn(n,1);
        [h, p(i)] = ttest2(A,B);
    end
    ttestpower(n) = length(find(p<0.05))/10; % power (in %) for n subjects
end

When the code is finished, type plot(2:50, ttestpower(2:50)). (The loop starts at n=2, so ttestpower(1) is never computed and is left out of the plot.) You will get a figure with the number of subjects per group on the x-axis and the statistical power (in %) on the y-axis. You will see that you need a certain number of subjects to get a good power, but also that it is not necessary to use a huge number of subjects.

The power depends not only on the number of subjects, but also on the difference in mean between the two data sets (in our case the difference between the mean of A and the mean of B) and on the standard deviations of A and B. You can use the function nbt_PowerAnalysis to try different means and standard deviations of the data sets. If you want to analyse the power of a test between two data sets, one with mean 0 and standard deviation 1.5, and the other with mean 2 and standard deviation 1, you should write:

nbt_PowerAnalysis(0,1.5,2,1)
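The same scenario can also be simulated by hand, in the style of the loops used earlier. This is only a sketch: it does not show how nbt_PowerAnalysis works internally, and the names muA, sdA, muB, sdB, nList and powerPercent are made up for the example.

```matlab
muA = 0; sdA = 1.5;                % mean and standard deviation of A
muB = 2; sdB = 1;                  % mean and standard deviation of B
nList = [5 10 20 40];              % numbers of subjects per group to try
nRuns = 1000;
powerPercent = zeros(size(nList));
for k = 1:numel(nList)
    n = nList(k);
    p = zeros(nRuns,1);
    for i = 1:nRuns
        A = muA + sdA*randn(n,1);  % scale and shift the standard normal values
        B = muB + sdB*randn(n,1);
        [h, p(i)] = ttest2(A,B);
    end
    powerPercent(k) = 100*mean(p < 0.05);
end
plot(nList, powerPercent, 'o-')
xlabel('number of subjects per group'); ylabel('power (%)')
```

Changing muB or the two standard deviations and rerunning the loop shows how the curve shifts when the groups are closer together or more variable.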

## Is it that simple? Can I always trust a statistical test?

No. You cannot always trust a statistical test. Statistical tests have certain assumptions, e.g., a t-test assumes that the data are normally distributed, and if these assumptions are not met then the test is inaccurate or not valid at all.
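As an illustration of a violated assumption: by default, ttest2 also assumes that both groups have the same variance. The sketch below simulates two groups with equal means but very different sizes and standard deviations (the values 5, 50 and 5 are arbitrary choices for the example) and compares the default test with the 'Vartype','unequal' option of ttest2, which drops the equal-variance assumption.

```matlab
nRuns = 1000;
pPooled = zeros(nRuns,1);          % default test (equal variances assumed)
pWelch  = zeros(nRuns,1);          % test without that assumption
for i = 1:nRuns
    A = 5*randn(5,1);              % small group, large spread, mean 0
    B = randn(50,1);               % large group, small spread, mean 0
    [h, pPooled(i)] = ttest2(A,B);
    [h, pWelch(i)]  = ttest2(A,B,'Vartype','unequal');
end
fpPooled = 100*mean(pPooled < 0.05);
fpWelch  = 100*mean(pWelch  < 0.05);
disp([fpPooled fpWelch])           % percentage of false positives for each test
```

Because there is no true difference between the groups, both percentages should be near 5%. In this setting the default test usually ends up far above that, which is what an inaccurate test means in practice.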

Usually, however, a t-test is a good first choice for testing simple differences between two groups.

## What is a correlation?

In your analysis you will test if different biomarkers are correlated. A and B are said to be correlated if there is a linear relation between them. For example, you might want to know if there is a relation between the amplitude of the EEG data and the rating of the ARSQ item "I felt sleepy" (on a scale from 1 to 5). In this case you have paired observations: for every subject there is the EEG amplitude and the rating of the ARSQ statement. The i-th place in the vector A contains the amplitude of subject i, and the i-th place in B the ARSQ score of subject i.

If you plot A on the x-axis and B on the y-axis, you get a first impression of the correlation between the two variables. In other words, you visualize how brain activity relates to cognition. The more the data points are concentrated on a line, the more strongly the variables are correlated.

The following code gives an example of two uncorrelated variables. The correlation coefficient is computed with the function corrcoef, which also generates a P-value to test whether there is a correlation: a low P-value corresponds to a significant correlation. The output of corrcoef is a matrix; when running the following code, only the relevant information is displayed.

A = randn(100,1);
B = randn(100,1);
plot(A,B,'.')
[c,p]=corrcoef(A,B);
disp(c(1,2))
disp(p(1,2))
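Running the code above by hand many times can also be automated. The sketch below repeats the uncorrelated example 1000 times and counts how often corrcoef reports a P-value below 0.05; nRuns, r and pc are illustrative names.

```matlab
nRuns = 1000;
r  = zeros(nRuns,1);               % correlation coefficients
pc = zeros(nRuns,1);               % corresponding P-values
for i = 1:nRuns
    A = randn(100,1);
    B = randn(100,1);              % generated independently of A
    [c, pv] = corrcoef(A,B);
    r(i)  = c(1,2);
    pc(i) = pv(1,2);
end
disp(100*mean(pc < 0.05))          % percentage of "significant" correlations
disp(max(abs(r)))                  % largest correlation found by chance
```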

Note that even uncorrelated variables can show a high correlation by chance. Run the uncorrelated example several times to see that the correlation coefficient usually stays close to 0 and that the P-value is only rarely low. Now we create two correlated variables by setting A equal to B plus some noise:

A=B+randn(100,1)/10;
plot(A,B,'.')
[c,p]=corrcoef(A,B);
disp(c(1,2))
disp(p(1,2))
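To get a feeling for how the amount of noise affects the result, the same construction can be repeated with different noise levels. This is a sketch; the three levels in noiseLevels are arbitrary choices, with 0.1 matching the division by 10 used above.

```matlab
B = randn(100,1);
noiseLevels = [0.1 1 3];           % standard deviation of the added noise
r = zeros(size(noiseLevels));
for k = 1:numel(noiseLevels)
    A = B + noiseLevels(k)*randn(100,1);
    c = corrcoef(A,B);
    r(k) = c(1,2);                 % correlation coefficient for this level
    subplot(1, numel(noiseLevels), k)
    plot(A, B, '.')
    title(sprintf('noise %.1f, r = %.2f', noiseLevels(k), r(k)))
end
```

The more noise is added, the more the points spread away from a line and the lower the correlation coefficient tends to be.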

The correlation coefficient is a measure which tells you how strong the correlation is. In the figure below you see how the correlation coefficient changes for different distributions of data points. In the lowest row you see that the correlation coefficient might not reveal more complex relations between A and B.
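A relation that the correlation coefficient misses can also be simulated. In the sketch below B is the square of A, so B is completely determined by A, but the relation is not linear. The 10000 points are an arbitrary choice to make the result stable.

```matlab
A = randn(10000,1);
B = A.^2;                          % B depends fully on A, but not linearly
plot(A, B, '.')
[c, p] = corrcoef(A,B);
disp(c(1,2))                       % typically close to 0
```

The plot shows a clear curve, yet the correlation coefficient stays near zero, because it only measures linear relations.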

Go to the next tutorial: Getting started with EEG analysis in NBT