1 | // $Id: Statistics.doxygen 2119 2009-12-12 23:11:43Z peter $ |
---|
2 | // |
---|
3 | // Copyright (C) 2005 Peter Johansson |
---|
4 | // Copyright (C) 2006 Jari Häkkinen, Peter Johansson, Markus Ringnér |
---|
5 | // Copyright (C) 2007, 2008 Jari Häkkinen, Peter Johansson |
---|
6 | // |
---|
7 | // This file is part of the yat library, http://dev.thep.lu.se/yat |
---|
8 | // |
---|
9 | // The yat library is free software; you can redistribute it and/or |
---|
10 | // modify it under the terms of the GNU General Public License as |
---|
11 | // published by the Free Software Foundation; either version 3 of the |
---|
12 | // License, or (at your option) any later version. |
---|
13 | // |
---|
14 | // The yat library is distributed in the hope that it will be useful, |
---|
15 | // but WITHOUT ANY WARRANTY; without even the implied warranty of |
---|
16 | // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU |
---|
17 | // General Public License for more details. |
---|
18 | // |
---|
19 | // You should have received a copy of the GNU General Public License |
---|
20 | // along with yat. If not, see <http://www.gnu.org/licenses/>. |
---|
21 | |
---|
22 | |
---|
23 | /** |
---|
24 | \page weighted_statistics Weighted Statistics |
---|
25 | |
---|
26 | \section Introduction |
---|
27 | There are several different reasons why a statistical analysis needs |
---|
28 | to adjust for weighting. In literature reasons are mainly diveded in |
---|
29 | to groups. |
---|
30 | |
---|
31 | The first group is when some of the measurements are known to be more |
---|
32 | precise than others. The more precise a measurement is, the larger |
---|
33 | weight it is given. The simplest case is when the weight are given |
---|
34 | before the measurements and they can be treated as deterministic. It |
---|
35 | becomes more complicated when the weight can be determined not until |
---|
36 | afterwards, and even more complicated if the weight depends on the |
---|
37 | value of the observable. |
---|
38 | |
---|
39 | The second group of situations is when calculating averages over one |
---|
40 | distribution and sampling from another distribution. Compensating for |
---|
41 | this discrepency weights are introduced to the analysis. A simple |
---|
42 | example may be that we are interviewing people but for economical |
---|
43 | reasons we choose to interview more people from the city than from the |
---|
44 | countryside. When summarizing the statistics the answers from the city |
---|
45 | are given a smaller weight. In this example we are choosing the |
---|
46 | proportions of people from countryside and people from city being |
---|
47 | intervied. Hence, we can determine the weights before and consider |
---|
48 | them to be deterministic. In other situations the proportions are not |
---|
49 | deterministic, but rather a result from the sampling and the weights |
---|
50 | must be treated as stochastic and only in rare situations the weights |
---|
51 | can be treated as independent of the observable. |
---|
52 | |
---|
53 | Since there are various origins for a weight occuring in a statistical |
---|
54 | analysis, there are various ways to treat the weights and in general |
---|
55 | the analysis should be tailored to treat the weights correctly. We |
---|
56 | have not chosen one situation for our implementations, so see specific |
---|
57 | function documentation for what assumtions are made. Though, common |
---|
58 | for implementations are the following: |
---|
59 | |
---|
60 | - Setting all weights to unity yields the same result as the |
---|
61 | non-weighted version. |
---|
62 | - Rescaling the weights does not change any function. |
---|
63 | - Setting a weight to zero is equivalent to removing the data point. |
---|
64 | |
---|
65 | The last point implies that a data point with zero weight is ignored |
---|
66 | also when the value is NaN. |
---|
67 | An important case is when weights are binary (either 1 or 0). Then we |
---|
68 | get the same result using the weighted version as using the data with |
---|
69 | weight not equal to zero and the non-weighted version. Hence, using |
---|
70 | binary weights and the weighted version missing values can be treated |
---|
71 | in a proper way. |
---|
72 | |
---|
73 | \section AveragerWeighted |
---|
74 | |
---|
75 | |
---|
76 | |
---|
77 | \subsection Mean |
---|
78 | |
---|
79 | For any situation the weight is always designed so the weighted mean |
---|
80 | is calculated as \f$ m=\frac{\sum w_ix_i}{\sum w_i} \f$, which obviously |
---|
81 | fulfills the conditions above. |
---|
82 | |
---|
83 | |
---|
84 | |
---|
85 | In the case of varying measurement error, it could be motivated that |
---|
86 | the weight shall be \f$ w_i = 1/\sigma_i^2 \f$. We assume measurement error |
---|
87 | to be Gaussian and the likelihood to get our measurements is |
---|
88 | \f$ L(m)=\prod |
---|
89 | (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}} \f$. We |
---|
90 | maximize the likelihood by taking the derivity with respect to \f$ m \f$ on |
---|
91 | the logarithm of the likelihood \f$ \frac{d\ln L(m)}{dm}=\sum |
---|
92 | \frac{x_i-m}{\sigma_i^2} \f$. Hence, the Maximum Likelihood method yields |
---|
93 | the estimator \f$ m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2} \f$. |
---|
94 | |
---|
95 | |
---|
96 | \subsection Variance |
---|
97 | In case of varying variance, there is no point estimating a variance |
---|
98 | since it is different for each data point. |
---|
99 | |
---|
100 | Instead we look at the case when we want to estimate the variance over |
---|
101 | \f$f\f$ but are sampling from \f$ f' \f$. For the mean of an observable \f$ O \f$ we |
---|
102 | have \f$ \widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum |
---|
103 | w_i} \f$. Hence, an estimator of the variance of \f$ X \f$ is |
---|
104 | |
---|
105 | \f$ |
---|
106 | s^2 = <X^2>-<X>^2= |
---|
107 | \f$ |
---|
108 | |
---|
109 | \f$ |
---|
110 | = \frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}= |
---|
111 | \f$ |
---|
112 | |
---|
113 | \f$ |
---|
114 | = \frac{\sum w_i(x_i^2-m^2)}{\sum w_i}= |
---|
115 | \f$ |
---|
116 | |
---|
117 | \f$ |
---|
118 | = \frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}= |
---|
119 | \f$ |
---|
120 | |
---|
121 | \f$ |
---|
122 | = \frac{\sum w_i(x_i-m)^2}{\sum w_i} |
---|
123 | \f$ |
---|
124 | |
---|
125 | This estimator fulfills that it is invariant under a rescaling and |
---|
126 | having a weight equal to zero is equivalent to removing the data |
---|
127 | point. Having all weights equal to unity we get \f$ \sigma=\frac{\sum |
---|
128 | (x_i-m)^2}{N} \f$, which is the same as returned from Averager. Hence, |
---|
129 | this estimator is slightly biased, but still very efficient. |
---|
130 | |
---|
131 | \subsection standard_error Standard Error |
---|
132 | The standard error squared is equal to the expexted squared error of |
---|
133 | the estimation of \f$m\f$. The squared error consists of two parts, the |
---|
134 | variance of the estimator and the squared bias: |
---|
135 | |
---|
136 | \f$ |
---|
137 | <m-\mu>^2=<m-<m>+<m>-\mu>^2= |
---|
138 | \f$ |
---|
139 | \f$ |
---|
140 | <m-<m>>^2+(<m>-\mu)^2 |
---|
141 | \f$. |
---|
142 | |
---|
143 | In the case when weights are included in analysis due to varying |
---|
144 | measurement errors and the weights can be treated as deterministic, we |
---|
145 | have |
---|
146 | |
---|
147 | \f$ |
---|
148 | Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}= |
---|
149 | \f$ |
---|
150 | \f$ |
---|
151 | \frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}= |
---|
152 | \f$ |
---|
153 | \f$ |
---|
154 | \frac{\sigma_0^2}{\sum w_i}, |
---|
155 | \f$ |
---|
156 | |
---|
157 | where we need to estimate \f$ \sigma_0^2 \f$. Again we have the likelihood |
---|
158 | |
---|
159 | \f$ |
---|
160 | L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{(-\frac{w_i(x-m)^2}{2\sigma_0^2})} |
---|
161 | \f$ |
---|
162 | and taking the derivity with respect to |
---|
163 | \f$\sigma_o^2\f$, |
---|
164 | |
---|
165 | \f$ |
---|
166 | \frac{d\ln L}{d\sigma_i^2}= |
---|
167 | \f$ |
---|
168 | \f$ |
---|
169 | \sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2} |
---|
170 | \f$ |
---|
171 | |
---|
172 | which |
---|
173 | yields an estimator \f$ \sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2 \f$. This |
---|
174 | estimator is not ignoring weights equal to zero, because deviation is |
---|
175 | most often smaller than the expected infinity. Therefore, we modify |
---|
176 | the expression as follows \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum |
---|
177 | w_i\right)^2}\sum w_i(x-m)^2\f$ and we get the following estimator of |
---|
178 | the variance of the mean \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum |
---|
179 | w_i\right)^3}\sum w_i(x-m)^2\f$. This estimator fulfills the conditions |
---|
180 | above: adding a weight zero does not change it: rescaling the weights |
---|
181 | does not change it, and setting all weights to unity yields the same |
---|
182 | expression as in the non-weighted case. |
---|
183 | |
---|
184 | In a case when it is not a good approximation to treat the weights as |
---|
185 | deterministic, there are two ways to get a better estimation. The |
---|
186 | first one is to linearize the expression \f$\left<\frac{\sum |
---|
187 | w_ix_i}{\sum w_i}\right>\f$. The second method when the situation is |
---|
188 | more complicated is to estimate the standard error using a |
---|
189 | bootstrapping method. |
---|
190 | |
---|
191 | \section AveragerPairWeighted |
---|
192 | Here data points come in pairs (x,y). We are sampling from \f$f'_{XY}\f$ |
---|
193 | but want to measure from \f$f_{XY}\f$. To compensate for this decrepency, |
---|
194 | averages of \f$g(x,y)\f$ are taken as \f$\sum \frac{f}{f'}g(x,y)\f$. Even |
---|
195 | though, \f$X\f$ and \f$Y\f$ are not independent \f$(f_{XY}\neq f_Xf_Y)\f$ we |
---|
196 | assume that we can factorize the ratio and get \f$\frac{\sum |
---|
197 | w_xw_yg(x,y)}{\sum w_xw_y}\f$ |
---|
198 | \subsection Covariance |
---|
199 | Following the variance calculations for AveragerWeighted we have |
---|
200 | \f$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}\f$ where |
---|
201 | \f$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}\f$ |
---|
202 | |
---|
203 | \subsection Correlation |
---|
204 | |
---|
205 | As the mean is estimated as |
---|
206 | \f$ |
---|
207 | m_x=\frac{\sum w_xw_yx}{\sum w_xw_y} |
---|
208 | \f$, |
---|
209 | the variance is estimated as |
---|
210 | \f$ |
---|
211 | \sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y} |
---|
212 | \f$. |
---|
213 | As in the non-weighted case we define the correlation to be the ratio |
---|
214 | between the covariance and geometrical average of the variances |
---|
215 | |
---|
216 | \f$ |
---|
217 | \frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum |
---|
218 | w_xw_y(y-m_y)^2}} |
---|
219 | \f$. |
---|
220 | |
---|
221 | |
---|
222 | This expression fulfills the following |
---|
223 | - Having N equal weights the expression reduces to the non-weighted expression. |
---|
224 | - Adding a pair of data, in which one weight is zero is equivalent |
---|
225 | to ignoring the data pair. |
---|
226 | - Correlation is equal to unity if and only if \f$x\f$ is equal to |
---|
227 | \f$y\f$. Otherwise the correlation is between -1 and 1. |
---|
228 | |
---|
229 | \section Score |
---|
230 | |
---|
231 | \subsection Pearson |
---|
232 | |
---|
233 | \f$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}\f$. |
---|
234 | |
---|
235 | See AveragerPairWeighted correlation. |
---|
236 | |
---|
237 | \subsection ROC |
---|
238 | |
---|
239 | An interpretation of the ROC curve area is the probability that if we |
---|
240 | take one sample from class \f$+\f$ and one sample from class \f$-\f$, what is |
---|
241 | the probability that the sample from class \f$+\f$ has greater value. The |
---|
242 | ROC curve area calculates the ratio of pairs fulfilling this |
---|
243 | |
---|
244 | \f$ |
---|
245 | \frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}. |
---|
246 | \f$ |
---|
247 | |
---|
248 | An geometrical interpretation is to have a number of squares where |
---|
249 | each square correspond to a pair of samples. The ROC curve follows the |
---|
250 | border between pairs in which the samples from class \f$+\f$ has a greater |
---|
251 | value and pairs in which this is not fulfilled. The ROC curve area is |
---|
252 | the area of those latter squares and a natural extension is to weight |
---|
253 | each pair with its two weights and consequently the weighted ROC curve |
---|
254 | area becomes |
---|
255 | |
---|
256 | \f$ |
---|
257 | \frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j} |
---|
258 | \f$ |
---|
259 | |
---|
260 | This expression is invariant under a rescaling of weight. Adding a |
---|
261 | data value with weight zero adds nothing to the exprssion, and having |
---|
262 | all weight equal to unity yields the non-weighted ROC curve area. |
---|
263 | |
---|
264 | \subsection tScore |
---|
265 | |
---|
266 | Assume that \f$x\f$ and \f$y\f$ originate from the same distribution |
---|
267 | \f$N(\mu,\sigma_i^2)\f$ where \f$\sigma_i^2=\frac{\sigma_0^2}{w_i}\f$. We then |
---|
268 | estimate \f$\sigma_0^2\f$ as |
---|
269 | \f$ |
---|
270 | \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} |
---|
271 | {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ |
---|
272 | \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} |
---|
273 | \f$ |
---|
274 | The variance of difference of the means becomes |
---|
275 | \f$ |
---|
276 | Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum |
---|
277 | w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}= |
---|
278 | \frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i}, |
---|
279 | \f$ |
---|
280 | and consequently the t-score becomes |
---|
281 | \f$ |
---|
282 | \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} |
---|
283 | {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ |
---|
284 | \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} |
---|
285 | \left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right), |
---|
286 | \f$ |
---|
287 | |
---|
288 | For a \f$w_i=w\f$ we this expression get condensed down to |
---|
289 | \f$ |
---|
290 | \frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2} |
---|
291 | {n_x+n_y-2} |
---|
292 | \left(\frac{1}{wn_x}+\frac{1}{wn_y}\right), |
---|
293 | \f$ |
---|
294 | in other words the good old expression as for non-weighted. |
---|
295 | |
---|
296 | \subsection FoldChange |
---|
297 | Fold-Change is simply the difference between the weighted mean of the |
---|
298 | two groups \f$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}\f$ |
---|
299 | |
---|
300 | \subsection WilcoxonFoldChange |
---|
301 | Taking all pair samples (one from class \f$+\f$ and one from class \f$-\f$) |
---|
302 | and calculating the weighted median of the distances. |
---|
303 | |
---|
304 | \section Distance |
---|
305 | |
---|
306 | A \ref concept_distance measures how far apart two ranges are. A Distance should |
---|
307 | preferably meet some criteria: |
---|
308 | |
---|
309 | - It is symmetric, \f$ d(x,y) = d(y,x) \f$, that is distance from \f$ |
---|
310 | x \f$ to \f$ y \f$ equals the distance from \f$ y \f$ to \f$ x \f$. |
---|
311 | - Zero self-distance: \f$ d(x,x) = 0 \f$ |
---|
312 | - Triangle inequality: \f$ d(x,z) \le d(x,y) + d(y,z) \f$ |
---|
313 | |
---|
314 | \subsection weighted_distance Weighted Distance |
---|
315 | |
---|
316 | Weighted Distance is an extension of usual unweighted distances, in |
---|
317 | which each data point is accompanied with a weight. A weighted |
---|
318 | distance should meet some criteria: |
---|
319 | |
---|
320 | - Having all unity weights should yield the unweighted case. |
---|
321 | - Rescaling the weights, \f$ w_i = Cw_i \f$, does not change the distance. |
---|
322 | - Having a \f$ w_x = 0 \f$ the distance should ignore corresponding |
---|
323 | \f$ x \f$, \f$ y \f$, and \f$ w_y \f$. |
---|
324 | - A zero weight should not result in a very different distance than a |
---|
325 | small weight, in other words, modifying a weight should change the |
---|
326 | distance in a continuous manner. |
---|
327 | - The duplicate property. If data is coming in duplicate such that |
---|
328 | \f$ x_{2i}=x_{2i+1} \f$, then the case when \f$ w_{2i}=w_{2i+1} \f$ |
---|
329 | should equal to if you set \f$ w_{2i}=0 \f$. |
---|
330 | |
---|
331 | The last condition, duplicate property, implies that setting a weight |
---|
332 | to zero is not equivalent to removing the data point. This behavior is |
---|
333 | sensible because otherwise we would have a bias towards having ranges |
---|
334 | with small weights being close to other ranges. For a weighted |
---|
335 | distance, meeting these criteria, it might be difficult to show that |
---|
336 | the triangle inequality is fulfilled. For most algorithms the triangle |
---|
337 | inequality is not essential for the distance to work properly, so if |
---|
338 | you need to choose between fulfilling triangle inequality and these |
---|
339 | latter criteria it is preferable to meet the latter criteria. |
---|
340 | |
---|
341 | In test/distance_test.cc there are tests for testing these properties. |
---|
342 | |
---|
343 | \section Kernel |
---|
344 | \subsection polynomial_kernel Polynomial Kernel |
---|
345 | The polynomial kernel of degree \f$N\f$ is defined as \f$(1+<x,y>)^N\f$, where |
---|
346 | \f$<x,y>\f$ is the linear kernel (usual scalar product). For the weighted |
---|
347 | case we define the linear kernel to be |
---|
348 | \f$<x,y>=\frac{\sum {w_xw_yxy}}{\sum{w_xw_y}}\f$ and the |
---|
349 | polynomial kernel can be calculated as before |
---|
350 | \f$(1+<x,y>)^N\f$. |
---|
351 | |
---|
352 | \subsection gaussian_kernel Gaussian Kernel |
---|
353 | We define the weighted Gaussian kernel as \f$\exp\left(-N\frac{\sum |
---|
354 | w_xw_y(x-y)^2}{\sum w_xw_y}\right)\f$. |
---|
355 | |
---|
356 | \section Regression |
---|
357 | \subsection Naive |
---|
358 | \subsection Linear |
---|
359 | We have the model |
---|
360 | |
---|
361 | \f$ |
---|
362 | y_i=\alpha+\beta (x-m_x)+\epsilon_i, |
---|
363 | \f$ |
---|
364 | |
---|
365 | where \f$\epsilon_i\f$ is the noise. The variance of the noise is |
---|
366 | inversely proportional to the weight, |
---|
367 | \f$Var(\epsilon_i)=\frac{\sigma^2}{w_i}\f$. In order to determine the |
---|
368 | model parameters, we minimimize the sum of quadratic errors. |
---|
369 | |
---|
370 | \f$ |
---|
371 | Q_0 = \sum \epsilon_i^2 |
---|
372 | \f$ |
---|
373 | |
---|
374 | Taking the derivity with respect to \f$\alpha\f$ and \f$\beta\f$ yields two conditions |
---|
375 | |
---|
376 | \f$ |
---|
377 | \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - |
---|
378 | \beta (x_i-m_x)=0 |
---|
379 | \f$ |
---|
380 | |
---|
381 | and |
---|
382 | |
---|
383 | \f$ \frac{\partial Q_0}{\partial \beta} = -2 \sum |
---|
384 | w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0 |
---|
385 | \f$ |
---|
386 | |
---|
387 | or equivalently |
---|
388 | |
---|
389 | \f$ |
---|
390 | \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y |
---|
391 | \f$ |
---|
392 | |
---|
393 | and |
---|
394 | |
---|
395 | \f$ \beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum |
---|
396 | w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} |
---|
397 | \f$ |
---|
398 | |
---|
399 | Note, by having all weights equal we get back the unweighted |
---|
400 | case. Furthermore, we calculate the variance of the estimators of |
---|
401 | \f$\alpha\f$ and \f$\beta\f$. |
---|
402 | |
---|
403 | \f$ |
---|
404 | \textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}= |
---|
405 | \frac{\sigma^2}{\sum w_i} |
---|
406 | \f$ |
---|
407 | |
---|
408 | and |
---|
409 | \f$ |
---|
410 | \textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}} |
---|
411 | {(\sum w_i(x_i-m_x)^2)^2}= |
---|
412 | \frac{\sigma^2}{\sum w_i(x_i-m_x)^2} |
---|
413 | \f$ |
---|
414 | |
---|
415 | Finally, we estimate the level of noise, \f$\sigma^2\f$. Inspired by the |
---|
416 | unweighted estimation |
---|
417 | |
---|
418 | \f$ |
---|
419 | s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2} |
---|
420 | \f$ |
---|
421 | |
---|
422 | we suggest the following estimator |
---|
423 | |
---|
424 | \f$ s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum |
---|
425 | w_i-2\frac{\sum w_i^2}{\sum w_i}} \f$ |
---|
426 | |
---|
427 | */ |
---|
428 | |
---|
429 | |
---|
430 | |
---|