1 | // $Id: Statistics.doxygen 1639 2008-12-11 17:08:11Z peter $ |
---|
2 | // |
---|
3 | // Copyright (C) 2005 Peter Johansson |
---|
4 | // Copyright (C) 2006 Jari Häkkinen, Peter Johansson, Markus Ringnér |
---|
5 | // Copyright (C) 2007 Jari Häkkinen, Peter Johansson |
---|
6 | // Copyright (C) 2008 Peter Johansson |
---|
7 | // |
---|
8 | // This file is part of the yat library, http://dev.thep.lu.se/yat |
---|
9 | // |
---|
10 | // The yat library is free software; you can redistribute it and/or |
---|
11 | // modify it under the terms of the GNU General Public License as |
---|
12 | // published by the Free Software Foundation; either version 3 of the |
---|
13 | // License, or (at your option) any later version. |
---|
14 | // |
---|
15 | // The yat library is distributed in the hope that it will be useful, |
---|
16 | // but WITHOUT ANY WARRANTY; without even the implied warranty of |
---|
17 | // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU |
---|
18 | // General Public License for more details. |
---|
19 | // |
---|
20 | // You should have received a copy of the GNU General Public License |
---|
21 | // along with yat. If not, see <http://www.gnu.org/licenses/>. |
---|
22 | |
---|
23 | |
---|
24 | /** |
---|
25 | \page weighted_statistics Weighted Statistics |
---|
26 | |
---|
27 | \section Introduction |
---|
28 | There are several different reasons why a statistical analysis needs |
---|
29 | to adjust for weighting. In literature reasons are mainly diveded in |
---|
30 | to groups. |
---|
31 | |
---|
32 | The first group is when some of the measurements are known to be more |
---|
33 | precise than others. The more precise a measurement is, the larger |
---|
34 | weight it is given. The simplest case is when the weight are given |
---|
35 | before the measurements and they can be treated as deterministic. It |
---|
36 | becomes more complicated when the weight can be determined not until |
---|
37 | afterwards, and even more complicated if the weight depends on the |
---|
38 | value of the observable. |
---|
39 | |
---|
40 | The second group of situations is when calculating averages over one |
---|
41 | distribution and sampling from another distribution. Compensating for |
---|
42 | this discrepency weights are introduced to the analysis. A simple |
---|
43 | example may be that we are interviewing people but for economical |
---|
44 | reasons we choose to interview more people from the city than from the |
---|
45 | countryside. When summarizing the statistics the answers from the city |
---|
46 | are given a smaller weight. In this example we are choosing the |
---|
47 | proportions of people from countryside and people from city being |
---|
48 | intervied. Hence, we can determine the weights before and consider |
---|
49 | them to be deterministic. In other situations the proportions are not |
---|
50 | deterministic, but rather a result from the sampling and the weights |
---|
51 | must be treated as stochastic and only in rare situations the weights |
---|
52 | can be treated as independent of the observable. |
---|
53 | |
---|
54 | Since there are various origins for a weight occuring in a statistical |
---|
55 | analysis, there are various ways to treat the weights and in general |
---|
56 | the analysis should be tailored to treat the weights correctly. We |
---|
57 | have not chosen one situation for our implementations, so see specific |
---|
58 | function documentation for what assumtions are made. Though, common |
---|
59 | for implementations are the following: |
---|
60 | |
---|
61 | - Setting all weights to unity yields the same result as the |
---|
62 | non-weighted version. |
---|
63 | - Rescaling the weights does not change any function. |
---|
64 | - Setting a weight to zero is equivalent to removing the data point. |
---|
65 | |
---|
66 | The last point implies that a data point with zero weight is ignored |
---|
67 | also when the value is NaN. |
---|
68 | An important case is when weights are binary (either 1 or 0). Then we |
---|
69 | get the same result using the weighted version as using the data with |
---|
70 | weight not equal to zero and the non-weighted version. Hence, using |
---|
71 | binary weights and the weighted version missing values can be treated |
---|
72 | in a proper way. |
---|
73 | |
---|
74 | \section AveragerWeighted |
---|
75 | |
---|
76 | |
---|
77 | |
---|
78 | \subsection Mean |
---|
79 | |
---|
80 | For any situation the weight is always designed so the weighted mean |
---|
81 | is calculated as \f$ m=\frac{\sum w_ix_i}{\sum w_i} \f$, which obviously |
---|
82 | fulfills the conditions above. |
---|
83 | |
---|
84 | |
---|
85 | |
---|
86 | In the case of varying measurement error, it could be motivated that |
---|
87 | the weight shall be \f$ w_i = 1/\sigma_i^2 \f$. We assume measurement error |
---|
88 | to be Gaussian and the likelihood to get our measurements is |
---|
89 | \f$ L(m)=\prod |
---|
90 | (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}} \f$. We |
---|
91 | maximize the likelihood by taking the derivity with respect to \f$ m \f$ on |
---|
92 | the logarithm of the likelihood \f$ \frac{d\ln L(m)}{dm}=\sum |
---|
93 | \frac{x_i-m}{\sigma_i^2} \f$. Hence, the Maximum Likelihood method yields |
---|
94 | the estimator \f$ m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2} \f$. |
---|
95 | |
---|
96 | |
---|
97 | \subsection Variance |
---|
98 | In case of varying variance, there is no point estimating a variance |
---|
99 | since it is different for each data point. |
---|
100 | |
---|
101 | Instead we look at the case when we want to estimate the variance over |
---|
102 | \f$f\f$ but are sampling from \f$ f' \f$. For the mean of an observable \f$ O \f$ we |
---|
103 | have \f$ \widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum |
---|
104 | w_i} \f$. Hence, an estimator of the variance of \f$ X \f$ is |
---|
105 | |
---|
106 | \f$ |
---|
107 | s^2 = <X^2>-<X>^2= |
---|
108 | \f$ |
---|
109 | |
---|
110 | \f$ |
---|
111 | = \frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}= |
---|
112 | \f$ |
---|
113 | |
---|
114 | \f$ |
---|
115 | = \frac{\sum w_i(x_i^2-m^2)}{\sum w_i}= |
---|
116 | \f$ |
---|
117 | |
---|
118 | \f$ |
---|
119 | = \frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}= |
---|
120 | \f$ |
---|
121 | |
---|
122 | \f$ |
---|
123 | = \frac{\sum w_i(x_i-m)^2}{\sum w_i} |
---|
124 | \f$ |
---|
125 | |
---|
126 | This estimator fulfills that it is invariant under a rescaling and |
---|
127 | having a weight equal to zero is equivalent to removing the data |
---|
128 | point. Having all weights equal to unity we get \f$ \sigma=\frac{\sum |
---|
129 | (x_i-m)^2}{N} \f$, which is the same as returned from Averager. Hence, |
---|
130 | this estimator is slightly biased, but still very efficient. |
---|
131 | |
---|
132 | \subsection standard_error Standard Error |
---|
133 | The standard error squared is equal to the expexted squared error of |
---|
134 | the estimation of \f$m\f$. The squared error consists of two parts, the |
---|
135 | variance of the estimator and the squared bias: |
---|
136 | |
---|
137 | \f$ |
---|
138 | <m-\mu>^2=<m-<m>+<m>-\mu>^2= |
---|
139 | \f$ |
---|
140 | \f$ |
---|
141 | <m-<m>>^2+(<m>-\mu)^2 |
---|
142 | \f$. |
---|
143 | |
---|
144 | In the case when weights are included in analysis due to varying |
---|
145 | measurement errors and the weights can be treated as deterministic, we |
---|
146 | have |
---|
147 | |
---|
148 | \f$ |
---|
149 | Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}= |
---|
150 | \f$ |
---|
151 | \f$ |
---|
152 | \frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}= |
---|
153 | \f$ |
---|
154 | \f$ |
---|
155 | \frac{\sigma_0^2}{\sum w_i}, |
---|
156 | \f$ |
---|
157 | |
---|
158 | where we need to estimate \f$ \sigma_0^2 \f$. Again we have the likelihood |
---|
159 | |
---|
160 | \f$ |
---|
161 | L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{(-\frac{w_i(x-m)^2}{2\sigma_0^2})} |
---|
162 | \f$ |
---|
163 | and taking the derivity with respect to |
---|
164 | \f$\sigma_o^2\f$, |
---|
165 | |
---|
166 | \f$ |
---|
167 | \frac{d\ln L}{d\sigma_i^2}= |
---|
168 | \f$ |
---|
169 | \f$ |
---|
170 | \sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2} |
---|
171 | \f$ |
---|
172 | |
---|
173 | which |
---|
174 | yields an estimator \f$ \sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2 \f$. This |
---|
175 | estimator is not ignoring weights equal to zero, because deviation is |
---|
176 | most often smaller than the expected infinity. Therefore, we modify |
---|
177 | the expression as follows \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum |
---|
178 | w_i\right)^2}\sum w_i(x-m)^2\f$ and we get the following estimator of |
---|
179 | the variance of the mean \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum |
---|
180 | w_i\right)^3}\sum w_i(x-m)^2\f$. This estimator fulfills the conditions |
---|
181 | above: adding a weight zero does not change it: rescaling the weights |
---|
182 | does not change it, and setting all weights to unity yields the same |
---|
183 | expression as in the non-weighted case. |
---|
184 | |
---|
185 | In a case when it is not a good approximation to treat the weights as |
---|
186 | deterministic, there are two ways to get a better estimation. The |
---|
187 | first one is to linearize the expression \f$\left<\frac{\sum |
---|
188 | w_ix_i}{\sum w_i}\right>\f$. The second method when the situation is |
---|
189 | more complicated is to estimate the standard error using a |
---|
190 | bootstrapping method. |
---|
191 | |
---|
192 | \section AveragerPairWeighted |
---|
193 | Here data points come in pairs (x,y). We are sampling from \f$f'_{XY}\f$ |
---|
194 | but want to measure from \f$f_{XY}\f$. To compensate for this decrepency, |
---|
195 | averages of \f$g(x,y)\f$ are taken as \f$\sum \frac{f}{f'}g(x,y)\f$. Even |
---|
196 | though, \f$X\f$ and \f$Y\f$ are not independent \f$(f_{XY}\neq f_Xf_Y)\f$ we |
---|
197 | assume that we can factorize the ratio and get \f$\frac{\sum |
---|
198 | w_xw_yg(x,y)}{\sum w_xw_y}\f$ |
---|
199 | \subsection Covariance |
---|
200 | Following the variance calculations for AveragerWeighted we have |
---|
201 | \f$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}\f$ where |
---|
202 | \f$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}\f$ |
---|
203 | |
---|
204 | \subsection Correlation |
---|
205 | |
---|
206 | As the mean is estimated as |
---|
207 | \f$ |
---|
208 | m_x=\frac{\sum w_xw_yx}{\sum w_xw_y} |
---|
209 | \f$, |
---|
210 | the variance is estimated as |
---|
211 | \f$ |
---|
212 | \sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y} |
---|
213 | \f$. |
---|
214 | As in the non-weighted case we define the correlation to be the ratio |
---|
215 | between the covariance and geometrical average of the variances |
---|
216 | |
---|
217 | \f$ |
---|
218 | \frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum |
---|
219 | w_xw_y(y-m_y)^2}} |
---|
220 | \f$. |
---|
221 | |
---|
222 | |
---|
223 | This expression fulfills the following |
---|
224 | - Having N equal weights the expression reduces to the non-weighted expression. |
---|
225 | - Adding a pair of data, in which one weight is zero is equivalent |
---|
226 | to ignoring the data pair. |
---|
227 | - Correlation is equal to unity if and only if \f$x\f$ is equal to |
---|
228 | \f$y\f$. Otherwise the correlation is between -1 and 1. |
---|
229 | |
---|
230 | \section Score |
---|
231 | |
---|
232 | \subsection Pearson |
---|
233 | |
---|
234 | \f$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}\f$. |
---|
235 | |
---|
236 | See AveragerPairWeighted correlation. |
---|
237 | |
---|
238 | \subsection ROC |
---|
239 | |
---|
240 | An interpretation of the ROC curve area is the probability that if we |
---|
241 | take one sample from class \f$+\f$ and one sample from class \f$-\f$, what is |
---|
242 | the probability that the sample from class \f$+\f$ has greater value. The |
---|
243 | ROC curve area calculates the ratio of pairs fulfilling this |
---|
244 | |
---|
245 | \f$ |
---|
246 | \frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}. |
---|
247 | \f$ |
---|
248 | |
---|
249 | An geometrical interpretation is to have a number of squares where |
---|
250 | each square correspond to a pair of samples. The ROC curve follows the |
---|
251 | border between pairs in which the samples from class \f$+\f$ has a greater |
---|
252 | value and pairs in which this is not fulfilled. The ROC curve area is |
---|
253 | the area of those latter squares and a natural extension is to weight |
---|
254 | each pair with its two weights and consequently the weighted ROC curve |
---|
255 | area becomes |
---|
256 | |
---|
257 | \f$ |
---|
258 | \frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j} |
---|
259 | \f$ |
---|
260 | |
---|
261 | This expression is invariant under a rescaling of weight. Adding a |
---|
262 | data value with weight zero adds nothing to the exprssion, and having |
---|
263 | all weight equal to unity yields the non-weighted ROC curve area. |
---|
264 | |
---|
265 | \subsection tScore |
---|
266 | |
---|
267 | Assume that \f$x\f$ and \f$y\f$ originate from the same distribution |
---|
268 | \f$N(\mu,\sigma_i^2)\f$ where \f$\sigma_i^2=\frac{\sigma_0^2}{w_i}\f$. We then |
---|
269 | estimate \f$\sigma_0^2\f$ as |
---|
270 | \f$ |
---|
271 | \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} |
---|
272 | {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ |
---|
273 | \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} |
---|
274 | \f$ |
---|
275 | The variance of difference of the means becomes |
---|
276 | \f$ |
---|
277 | Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum |
---|
278 | w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}= |
---|
279 | \frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i}, |
---|
280 | \f$ |
---|
281 | and consequently the t-score becomes |
---|
282 | \f$ |
---|
283 | \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} |
---|
284 | {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ |
---|
285 | \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} |
---|
286 | \left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right), |
---|
287 | \f$ |
---|
288 | |
---|
289 | For a \f$w_i=w\f$ we this expression get condensed down to |
---|
290 | \f$ |
---|
291 | \frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2} |
---|
292 | {n_x+n_y-2} |
---|
293 | \left(\frac{1}{wn_x}+\frac{1}{wn_y}\right), |
---|
294 | \f$ |
---|
295 | in other words the good old expression as for non-weighted. |
---|
296 | |
---|
297 | \subsection FoldChange |
---|
298 | Fold-Change is simply the difference between the weighted mean of the |
---|
299 | two groups \f$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}\f$ |
---|
300 | |
---|
301 | \subsection WilcoxonFoldChange |
---|
302 | Taking all pair samples (one from class \f$+\f$ and one from class \f$-\f$) |
---|
303 | and calculating the weighted median of the distances. |
---|
304 | |
---|
305 | \section Distance |
---|
306 | |
---|
307 | A \ref concept_distance measures how far apart two ranges are. A Distance should |
---|
308 | preferably meet some criteria: |
---|
309 | |
---|
310 | - It is symmetric, \f$ d(x,y) = d(y,x) \f$, that is distance from \f$ |
---|
311 | x \f$ to \f$ y \f$ equals the distance from \f$ y \f$ to \f$ x \f$. |
---|
312 | - Zero self-distance: \f$ d(x,x) = 0 \f$ |
---|
313 | - Triangle inequality: \f$ d(x,z) \le d(x,y) + d(y,z) \f$ |
---|
314 | |
---|
315 | \subsection weighted_distance Weighted Distance |
---|
316 | |
---|
317 | Weighted Distance is an extension of usual unweighted distances, in |
---|
318 | which each data point is accompanied with a weight. A weighted |
---|
319 | distance should meet some criteria: |
---|
320 | |
---|
321 | - Having all unity weights should yield the unweighted case. |
---|
322 | - Rescaling the weights, \f$ w_i = Cw_i \f$, does not change the distance. |
---|
323 | - Having a \f$ w_x = 0 \f$ the distance should ignore corresponding |
---|
324 | \f$ x \f$, \f$ y \f$, and \f$ w_y \f$. |
---|
325 | - A zero weight should not result in a very different distance than a |
---|
326 | small weight, in other words, modifying a weight should change the |
---|
327 | distance in a continuous manner. |
---|
328 | - The duplicate property. If data is coming in duplicate such that |
---|
329 | \f$ x_{2i}=x_{2i+1} \f$, then the case when \f$ w_{2i}=w_{2i+1} \f$ |
---|
330 | should equal to if you set \f$ w_{2i}=0 \f$. |
---|
331 | |
---|
332 | The last condition, duplicate property, implies that setting a weight |
---|
333 | to zero is not equivalent to removing the data point. This behavior is |
---|
334 | sensible because otherwise we would have a bias towards having ranges |
---|
335 | with small weights being close to other ranges. For a weighted |
---|
336 | distance, meeting these criteria, it might be difficult to show that |
---|
337 | the triangle inequality is fulfilled. For most algorithms the triangle |
---|
338 | inequality is not essential for the distance to work properly, so if |
---|
339 | you need to choose between fulfilling triangle inequality and these |
---|
340 | latter criteria it is preferable to meet the latter criteria. |
---|
341 | |
---|
342 | In test/distance_test.cc there are tests for testing these properties. |
---|
343 | |
---|
344 | \section Kernel |
---|
345 | \subsection polynomial_kernel Polynomial Kernel |
---|
346 | The polynomial kernel of degree \f$N\f$ is defined as \f$(1+<x,y>)^N\f$, where |
---|
347 | \f$<x,y>\f$ is the linear kernel (usual scalar product). For the weighted |
---|
348 | case we define the linear kernel to be |
---|
349 | \f$<x,y>=\frac{\sum {w_xw_yxy}}{\sum{w_xw_y}}\f$ and the |
---|
350 | polynomial kernel can be calculated as before |
---|
351 | \f$(1+<x,y>)^N\f$. |
---|
352 | |
---|
353 | \subsection gaussian_kernel Gaussian Kernel |
---|
354 | We define the weighted Gaussian kernel as \f$\exp\left(-N\frac{\sum |
---|
355 | w_xw_y(x-y)^2}{\sum w_xw_y}\right)\f$. |
---|
356 | |
---|
357 | \section Regression |
---|
358 | \subsection Naive |
---|
359 | \subsection Linear |
---|
360 | We have the model |
---|
361 | |
---|
362 | \f$ |
---|
363 | y_i=\alpha+\beta (x-m_x)+\epsilon_i, |
---|
364 | \f$ |
---|
365 | |
---|
366 | where \f$\epsilon_i\f$ is the noise. The variance of the noise is |
---|
367 | inversely proportional to the weight, |
---|
368 | \f$Var(\epsilon_i)=\frac{\sigma^2}{w_i}\f$. In order to determine the |
---|
369 | model parameters, we minimimize the sum of quadratic errors. |
---|
370 | |
---|
371 | \f$ |
---|
372 | Q_0 = \sum \epsilon_i^2 |
---|
373 | \f$ |
---|
374 | |
---|
375 | Taking the derivity with respect to \f$\alpha\f$ and \f$\beta\f$ yields two conditions |
---|
376 | |
---|
377 | \f$ |
---|
378 | \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - |
---|
379 | \beta (x_i-m_x)=0 |
---|
380 | \f$ |
---|
381 | |
---|
382 | and |
---|
383 | |
---|
384 | \f$ \frac{\partial Q_0}{\partial \beta} = -2 \sum |
---|
385 | w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0 |
---|
386 | \f$ |
---|
387 | |
---|
388 | or equivalently |
---|
389 | |
---|
390 | \f$ |
---|
391 | \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y |
---|
392 | \f$ |
---|
393 | |
---|
394 | and |
---|
395 | |
---|
396 | \f$ \beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum |
---|
397 | w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} |
---|
398 | \f$ |
---|
399 | |
---|
400 | Note, by having all weights equal we get back the unweighted |
---|
401 | case. Furthermore, we calculate the variance of the estimators of |
---|
402 | \f$\alpha\f$ and \f$\beta\f$. |
---|
403 | |
---|
404 | \f$ |
---|
405 | \textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}= |
---|
406 | \frac{\sigma^2}{\sum w_i} |
---|
407 | \f$ |
---|
408 | |
---|
409 | and |
---|
410 | \f$ |
---|
411 | \textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}} |
---|
412 | {(\sum w_i(x_i-m_x)^2)^2}= |
---|
413 | \frac{\sigma^2}{\sum w_i(x_i-m_x)^2} |
---|
414 | \f$ |
---|
415 | |
---|
416 | Finally, we estimate the level of noise, \f$\sigma^2\f$. Inspired by the |
---|
417 | unweighted estimation |
---|
418 | |
---|
419 | \f$ |
---|
420 | s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2} |
---|
421 | \f$ |
---|
422 | |
---|
423 | we suggest the following estimator |
---|
424 | |
---|
425 | \f$ s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum |
---|
426 | w_i-2\frac{\sum w_i^2}{\sum w_i}} \f$ |
---|
427 | |
---|
428 | */ |
---|
429 | |
---|
430 | |
---|
431 | |
---|