# source:trunk/doc/Statistics.doxygen@2353

Last change on this file since 2353 was 2119, checked in by Peter, 12 years ago

converted files to utf-8. fixes #577

• Property svn:eol-style set to native
• Property svn:keywords set to Author Date Id Revision
File size: 14.8 KB
Line
1// $Id: Statistics.doxygen 2119 2009-12-12 23:11:43Z peter$
2//
3// Copyright (C) 2005 Peter Johansson
4// Copyright (C) 2006 Jari Häkkinen, Peter Johansson, Markus Ringnér
5// Copyright (C) 2007, 2008 Jari Häkkinen, Peter Johansson
6//
7// This file is part of the yat library, http://dev.thep.lu.se/yat
8//
9// The yat library is free software; you can redistribute it and/or
10// modify it under the terms of the GNU General Public License as
11// published by the Free Software Foundation; either version 3 of the
13//
14// The yat library is distributed in the hope that it will be useful,
15// but WITHOUT ANY WARRANTY; without even the implied warranty of
16// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
17// General Public License for more details.
18//
19// You should have received a copy of the GNU General Public License
20// along with yat. If not, see <http://www.gnu.org/licenses/>.
21
22
23/**
24\page weighted_statistics Weighted Statistics
25
26\section Introduction
27There are several different reasons why a statistical analysis needs
28to adjust for weighting. In literature reasons are mainly diveded in
29to groups.
30
31The first group is when some of the measurements are known to be more
32precise than others. The more precise a measurement is, the larger
33weight it is given. The simplest case is when the weight are given
34before the measurements and they can be treated as deterministic. It
35becomes more complicated when the weight can be determined not until
36afterwards, and even more complicated if the weight depends on the
37value of the observable.
38
39The second group of situations is when calculating averages over one
40distribution and sampling from another distribution. Compensating for
41this discrepency weights are introduced to the analysis. A simple
42example may be that we are interviewing people but for economical
43reasons we choose to interview more people from the city than from the
44countryside. When summarizing the statistics the answers from the city
45are given a smaller weight. In this example we are choosing the
46proportions of people from countryside and people from city being
47intervied. Hence, we can determine the weights before and consider
48them to be deterministic. In other situations the proportions are not
49deterministic, but rather a result from the sampling and the weights
50must be treated as stochastic and only in rare situations the weights
51can be treated as independent of the observable.
52
53Since there are various origins for a weight occuring in a statistical
54analysis, there are various ways to treat the weights and in general
55the analysis should be tailored to treat the weights correctly. We
56have not chosen one situation for our implementations, so see specific
57function documentation for what assumtions are made. Though, common
58for implementations are the following:
59
60 - Setting all weights to unity yields the same result as the
61non-weighted version.
62 - Rescaling the weights does not change any function.
63 - Setting a weight to zero is equivalent to removing the data point.
64
65The last point implies that a data point with zero weight is ignored
66also when the value is NaN.
67An important case is when weights are binary (either 1 or 0). Then we
68get the same result using the weighted version as using the data with
69weight not equal to zero and the non-weighted version. Hence, using
70binary weights and the weighted version missing values can be treated
71in a proper way.
72
73\section AveragerWeighted
74
75
76
77\subsection Mean
78
79For any situation the weight is always designed so the weighted mean
80is calculated as \f$m=\frac{\sum w_ix_i}{\sum w_i} \f$, which obviously
81fulfills the conditions above.
82
83
84
85In the case of varying measurement error, it could be motivated that
86the weight shall be \f$w_i = 1/\sigma_i^2 \f$. We assume measurement error
87to be Gaussian and the likelihood to get our measurements is
88\f$L(m)=\prod 89(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}} \f$.  We
90maximize the likelihood by taking the derivity with respect to \f$m \f$ on
91the logarithm of the likelihood \f$\frac{d\ln L(m)}{dm}=\sum 92\frac{x_i-m}{\sigma_i^2} \f$. Hence, the Maximum Likelihood method yields
93the estimator \f$m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2} \f$.
94
95
96\subsection Variance
97In case of varying variance, there is no point estimating a variance
98since it is different for each data point.
99
100Instead we look at the case when we want to estimate the variance over
101\f$f\f$ but are sampling from \f$f' \f$. For the mean of an observable \f$O \f$ we
102have \f$\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum 103w_i} \f$. Hence, an estimator of the variance of \f$X \f$ is
104
105\f$106s^2 = <X^2>-<X>^2= 107\f$
108
109\f$110 = \frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}= 111\f$
112
113\f$114 = \frac{\sum w_i(x_i^2-m^2)}{\sum w_i}= 115\f$
116
117\f$118 = \frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}= 119\f$
120
121\f$122 = \frac{\sum w_i(x_i-m)^2}{\sum w_i} 123\f$
124
125This estimator fulfills that it is invariant under a rescaling and
126having a weight equal to zero is equivalent to removing the data
127point. Having all weights equal to unity we get \f$\sigma=\frac{\sum 128(x_i-m)^2}{N} \f$, which is the same as returned from Averager. Hence,
129this estimator is slightly biased, but still very efficient.
130
131\subsection standard_error Standard Error
132The standard error squared is equal to the expexted squared error of
133the estimation of \f$m\f$. The squared error consists of two parts, the
134variance of the estimator and the squared bias:
135
136\f$137<m-\mu>^2=<m-<m>+<m>-\mu>^2= 138\f$
139\f$140<m-<m>>^2+(<m>-\mu)^2 141\f$.
142
143In the case when weights are included in analysis due to varying
144measurement errors and the weights can be treated as deterministic, we
145have
146
147\f$148Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}= 149\f$
150\f$151\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}= 152\f$
153\f$154\frac{\sigma_0^2}{\sum w_i}, 155\f$
156
157where we need to estimate \f$\sigma_0^2 \f$. Again we have the likelihood
158
159\f$160L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{(-\frac{w_i(x-m)^2}{2\sigma_0^2})} 161\f$
162and taking the derivity with respect to
163\f$\sigma_o^2\f$,
164
165\f$166\frac{d\ln L}{d\sigma_i^2}= 167\f$
168\f$169\sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2} 170\f$
171
172which
173yields an estimator \f$\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2 \f$. This
174estimator is not ignoring weights equal to zero, because deviation is
175most often smaller than the expected infinity. Therefore, we modify
176the expression as follows \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum 177w_i\right)^2}\sum w_i(x-m)^2\f$ and we get the following estimator of
178the variance of the mean \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum 179w_i\right)^3}\sum w_i(x-m)^2\f$. This estimator fulfills the conditions
180above: adding a weight zero does not change it: rescaling the weights
181does not change it, and setting all weights to unity yields the same
182expression as in the non-weighted case.
183
184In a case when it is not a good approximation to treat the weights as
185deterministic, there are two ways to get a better estimation. The
186first one is to linearize the expression \f$\left<\frac{\sum 187w_ix_i}{\sum w_i}\right>\f$. The second method when the situation is
188more complicated is to estimate the standard error using a
189bootstrapping method.
190
191\section AveragerPairWeighted
192Here data points come in pairs (x,y). We are sampling from \f$f'_{XY}\f$
193but want to measure from \f$f_{XY}\f$. To compensate for this decrepency,
194averages of \f$g(x,y)\f$ are taken as \f$\sum \frac{f}{f'}g(x,y)\f$. Even
195though, \f$X\f$ and \f$Y\f$ are not independent \f$(f_{XY}\neq f_Xf_Y)\f$ we
196assume that we can factorize the ratio and get \f$\frac{\sum 197w_xw_yg(x,y)}{\sum w_xw_y}\f$
198\subsection Covariance
199Following the variance calculations for AveragerWeighted we have
200\f$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}\f$ where
201\f$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}\f$
202
203\subsection Correlation
204
205As the mean is estimated as
206\f$207m_x=\frac{\sum w_xw_yx}{\sum w_xw_y} 208\f$,
209the variance is estimated as
210\f$211\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y} 212\f$.
213As in the non-weighted case we define the correlation to be the ratio
214between the covariance and geometrical average of the variances
215
216\f$217\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum 218w_xw_y(y-m_y)^2}} 219\f$.
220
221
222This expression fulfills the following
223 - Having N equal weights the expression reduces to the non-weighted expression.
224 - Adding a pair of data, in which one weight is zero is equivalent
225to ignoring the data pair.
226 - Correlation is equal to unity if and only if \f$x\f$ is equal to
227\f$y\f$. Otherwise the correlation is between -1 and 1.
228
229\section Score
230
231\subsection Pearson
232
233\f$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}\f$.
234
235See AveragerPairWeighted correlation.
236
237\subsection ROC
238
239An interpretation of the ROC curve area is the probability that if we
240take one sample from class \f$+\f$ and one sample from class \f$-\f$, what is
241the probability that the sample from class \f$+\f$ has greater value. The
242ROC curve area calculates the ratio of pairs fulfilling this
243
244\f$245\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}. 246\f$
247
248An geometrical interpretation is to have a number of squares where
249each square correspond to a pair of samples. The ROC curve follows the
250border between pairs in which the samples from class \f$+\f$ has a greater
251value and pairs in which this is not fulfilled. The ROC curve area is
252the area of those latter squares and a natural extension is to weight
253each pair with its two weights and consequently the weighted ROC curve
254area becomes
255
256\f$257\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j} 258\f$
259
260This expression is invariant under a rescaling of weight. Adding a
261data value with weight zero adds nothing to the exprssion, and having
262all weight equal to unity yields the non-weighted ROC curve area.
263
264\subsection tScore
265
266Assume that \f$x\f$ and \f$y\f$ originate from the same distribution
267\f$N(\mu,\sigma_i^2)\f$ where \f$\sigma_i^2=\frac{\sigma_0^2}{w_i}\f$. We then
268estimate \f$\sigma_0^2\f$ as
269\f$270\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} 271{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ 272\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} 273\f$
274The variance of difference of the means becomes
275\f$276Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum 277w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}= 278\frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i}, 279\f$
280and consequently the t-score becomes
281\f$282\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} 283{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ 284\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} 285\left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right), 286\f$
287
288For a \f$w_i=w\f$ we this expression get condensed down to
289\f$290\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2} 291{n_x+n_y-2} 292\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right), 293\f$
294in other words the good old expression as for non-weighted.
295
296\subsection FoldChange
297Fold-Change is simply the difference between the weighted mean of the
298two groups \f$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}\f$
299
300\subsection WilcoxonFoldChange
301Taking all pair samples (one from class \f$+\f$ and one from class \f$-\f$)
302and calculating the weighted median of the distances.
303
304\section Distance
305
306A \ref concept_distance measures how far apart two ranges are. A Distance should
307preferably meet some criteria:
308
309 - It is symmetric, \f$d(x,y) = d(y,x) \f$, that is distance from \f$310 x \f$ to \f$y \f$ equals the distance from \f$y \f$ to \f$x \f$.
311 - Zero self-distance: \f$d(x,x) = 0 \f$
312 - Triangle inequality: \f$d(x,z) \le d(x,y) + d(y,z) \f$
313
314\subsection weighted_distance Weighted Distance
315
316Weighted Distance is an extension of usual unweighted distances, in
317which each data point is accompanied with a weight. A weighted
318distance should meet some criteria:
319
320 - Having all unity weights should yield the unweighted case.
321 - Rescaling the weights, \f$w_i = Cw_i \f$, does not change the distance.
322 - Having a \f$w_x = 0 \f$ the distance should ignore corresponding
323    \f$x \f$, \f$y \f$, and \f$w_y \f$.
324 - A zero weight should not result in a very different distance than a
325   small weight, in other words, modifying a weight should change the
326   distance in a continuous manner.
327 - The duplicate property. If data is coming in duplicate such that
328   \f$x_{2i}=x_{2i+1} \f$, then the case when \f$w_{2i}=w_{2i+1} \f$
329   should equal to if you set \f$w_{2i}=0 \f$.
330
331The last condition, duplicate property, implies that setting a weight
332to zero is not equivalent to removing the data point. This behavior is
333sensible because otherwise we would have a bias towards having ranges
334with small weights being close to other ranges. For a weighted
335distance, meeting these criteria, it might be difficult to show that
336the triangle inequality is fulfilled. For most algorithms the triangle
337inequality is not essential for the distance to work properly, so if
338you need to choose between fulfilling triangle inequality and these
339latter criteria it is preferable to meet the latter criteria.
340
341In test/distance_test.cc there are tests for testing these properties.
342
343\section Kernel
344\subsection polynomial_kernel Polynomial Kernel
345The polynomial kernel of degree \f$N\f$ is defined as \f$(1+<x,y>)^N\f$, where
346\f$<x,y>\f$ is the linear kernel (usual scalar product). For the weighted
347case we define the linear kernel to be
348\f$<x,y>=\frac{\sum {w_xw_yxy}}{\sum{w_xw_y}}\f$ and the
349polynomial kernel can be calculated as before
350\f$(1+<x,y>)^N\f$.
351
352\subsection gaussian_kernel Gaussian Kernel
353We define the weighted Gaussian kernel as \f$\exp\left(-N\frac{\sum 354w_xw_y(x-y)^2}{\sum w_xw_y}\right)\f$.
355
356\section Regression
357\subsection Naive
358\subsection Linear
359We have the model
360
361\f$362y_i=\alpha+\beta (x-m_x)+\epsilon_i, 363\f$
364
365where \f$\epsilon_i\f$ is the noise. The variance of the noise is
366inversely proportional to the weight,
367\f$Var(\epsilon_i)=\frac{\sigma^2}{w_i}\f$. In order to determine the
368model parameters, we minimimize the sum of quadratic errors.
369
370\f$371Q_0 = \sum \epsilon_i^2 372\f$
373
374Taking the derivity with respect to \f$\alpha\f$ and \f$\beta\f$ yields two conditions
375
376\f$377\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - 378\beta (x_i-m_x)=0 379\f$
380
381and
382
383\f$\frac{\partial Q_0}{\partial \beta} = -2 \sum 384w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0 385\f$
386
387or equivalently
388
389\f$390\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y 391\f$
392
393and
394
395\f$\beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum 396w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} 397\f$
398
399Note, by having all weights equal we get back the unweighted
400case. Furthermore, we calculate the variance of the estimators of
401\f$\alpha\f$ and \f$\beta\f$.
402
403\f$404\textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}= 405\frac{\sigma^2}{\sum w_i} 406\f$
407
408and
409\f$410\textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}} 411{(\sum w_i(x_i-m_x)^2)^2}= 412\frac{\sigma^2}{\sum w_i(x_i-m_x)^2} 413\f$
414
415Finally, we estimate the level of noise, \f$\sigma^2\f$. Inspired by the
416unweighted estimation
417
418\f$419s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2} 420\f$
421
422we suggest the following estimator
423
424\f$s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum 425w_i-2\frac{\sum w_i^2}{\sum w_i}} \f$
426
427*/
428
429
430
Note: See TracBrowser for help on using the repository browser.