// $Id: Statistics.doxygen 1487 2008-09-10 08:41:36Z jari $
//
// Copyright (C) 2005 Peter Johansson
// Copyright (C) 2006 Jari Häkkinen, Peter Johansson, Markus Ringnér
// Copyright (C) 2007 Jari Häkkinen, Peter Johansson
// Copyright (C) 2008 Peter Johansson
//
// This file is part of the yat library, http://dev.thep.lu.se/yat
//
// The yat library is free software; you can redistribute it and/or
// modify it under the terms of the GNU General Public License as
// published by the Free Software Foundation; either version 3 of the
// License, or (at your option) any later version.
//
// The yat library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
// General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with yat. If not, see <http://www.gnu.org/licenses/>.


/**
\page weighted_statistics Weighted Statistics

\section Introduction
There are several different reasons why a statistical analysis needs
to adjust for weighting. In the literature, the reasons are mainly
divided into two groups.

The first group is when some of the measurements are known to be more
precise than others. The more precise a measurement is, the larger
weight it is given. The simplest case is when the weights are given
before the measurements and can be treated as deterministic. It
becomes more complicated when the weights cannot be determined until
afterwards, and even more complicated if the weights depend on the
value of the observable.

The second group of situations is when averages are calculated over one
distribution while sampling from another distribution. To compensate for
this discrepancy, weights are introduced into the analysis. A simple
example is when we are interviewing people, but for economic
reasons choose to interview more people from the city than from the
countryside. When summarizing the statistics, the answers from the city
are given a smaller weight. In this example we choose the
proportions of people from the countryside and from the city being
interviewed. Hence, we can determine the weights beforehand and consider
them to be deterministic. In other situations the proportions are not
deterministic, but rather a result of the sampling, and the weights
must be treated as stochastic; only in rare situations can the weights
be treated as independent of the observable.

Since a weight can have various origins in a statistical
analysis, there are various ways to treat the weights, and in general
the analysis should be tailored to treat the weights correctly. We
have not chosen one situation for our implementations, so see the
specific function documentation for what assumptions are made. However,
the following is common to all implementations:

 - Setting all weights to unity yields the same result as the
non-weighted version.
 - Rescaling the weights does not change any function.
 - Setting a weight to zero is equivalent to removing the data point.

An important case is when weights are binary (either 1 or 0). Then the
weighted version gives the same result as running the non-weighted
version on only the data points with non-zero weight. Hence, missing
values can be treated properly by using binary weights together with
the weighted version.

\section AveragerWeighted

\subsection Mean

In every situation the weight is designed so that the weighted mean
is calculated as \f$ m=\frac{\sum w_ix_i}{\sum w_i} \f$, which clearly
fulfills the conditions above.
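
To make the formula concrete, here is a minimal C++ sketch of the
weighted mean (the function name and layout are ours for illustration;
the AveragerWeighted class provides the actual yat implementation):

\code
#include <cstddef>
#include <vector>

// Weighted mean: m = sum(w_i * x_i) / sum(w_i).
// Illustrative sketch only; not the yat implementation.
double weighted_mean(const std::vector<double>& x,
                     const std::vector<double>& w)
{
  double sum_wx = 0;
  double sum_w = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_wx += w[i] * x[i];
    sum_w += w[i];
  }
  return sum_wx / sum_w;
}
\endcode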

In the case of varying measurement errors, it can be motivated that
the weight should be \f$ w_i = 1/\sigma_i^2 \f$. We assume the measurement
errors to be Gaussian, and the likelihood of obtaining our measurements is
\f$ L(m)=\prod
(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}} \f$.  We
maximize the likelihood by setting the derivative of the logarithm of
the likelihood with respect to \f$ m \f$ to zero, \f$ \frac{d\ln L(m)}{dm}=\sum
\frac{x_i-m}{\sigma_i^2}=0 \f$. Hence, the Maximum Likelihood method yields
the estimator \f$ m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2} \f$,
i.e., the weighted mean with weights \f$ w_i=1/\sigma_i^2 \f$.


\subsection Variance
In the case of varying variance, there is no point in estimating a single
variance, since it is different for each data point.

Instead we look at the case where we want to estimate the variance over
\f$f\f$ but are sampling from \f$ f' \f$. For the mean of an observable \f$ O \f$ we
have \f$ \widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
w_i} \f$. Hence, an estimator of the variance of \f$ X \f$ is

\f$
s^2 = <X^2>-<X>^2=
\f$

\f$
 = \frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
\f$

\f$
 = \frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
\f$

\f$
 = \frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
\f$

\f$
 = \frac{\sum w_i(x_i-m)^2}{\sum w_i}
\f$

This estimator is invariant under a rescaling of the weights, and
setting a weight to zero is equivalent to removing the data
point. With all weights equal to unity we get \f$ s^2=\frac{\sum
(x_i-m)^2}{N} \f$, which is the same as returned from Averager. Hence,
this estimator is slightly biased, but still very efficient.
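
Reusing weighted_mean from the sketch above, this variance estimator
could be written as follows (again only an illustration, not the yat
code):

\code
// Weighted variance: s^2 = sum(w_i * (x_i - m)^2) / sum(w_i),
// where m is the weighted mean. Slightly biased, as noted above.
double weighted_variance(const std::vector<double>& x,
                         const std::vector<double>& w)
{
  const double m = weighted_mean(x, w);
  double sum_w = 0;
  double sum_wd2 = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - m;
    sum_wd2 += w[i] * d * d;
    sum_w += w[i];
  }
  return sum_wd2 / sum_w;
}
\endcode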

\subsection standard_error Standard Error
The standard error squared is equal to the expected squared error of
the estimate of \f$ m \f$. The squared error consists of two parts, the
variance of the estimator and the squared bias:

\f$
<(m-\mu)^2>=<(m-<m>+<m>-\mu)^2>=
\f$
\f$
<(m-<m>)^2>+(<m>-\mu)^2
\f$.

In the case when weights are included in the analysis due to varying
measurement errors and the weights can be treated as deterministic, we
have

\f$
Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
\f$
\f$
\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
\f$
\f$
\frac{\sigma_0^2}{\sum w_i},
\f$

where we need to estimate \f$ \sigma_0^2 \f$. Again we have the likelihood

\f$
L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{(-\frac{w_i(x-m)^2}{2\sigma_0^2})}
\f$
and taking the derivative with respect to
\f$\sigma_0^2\f$,

\f$
\frac{d\ln L}{d\sigma_0^2}=
\f$
\f$
\sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^4}
\f$

which
yields the estimator \f$ \sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2 \f$. This
estimator does not ignore weights equal to zero: a data point with zero
weight contributes nothing to the sum, but still counts in \f$ N \f$, so
it is not equivalent to removing the point. Therefore, we modify
the expression as follows \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
w_i\right)^2}\sum w_i(x-m)^2\f$ and we get the following estimator of
the variance of the mean \f$\frac{\sum w_i^2}{\left(\sum
w_i\right)^3}\sum w_i(x-m)^2\f$. This estimator fulfills the conditions
above: adding a data point with zero weight does not change it; rescaling
the weights does not change it; and setting all weights to unity yields
the same expression as in the non-weighted case.
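
A sketch of this estimator, reusing weighted_mean from above (the
function name is ours):

\code
#include <cmath>

// Estimated variance of the weighted mean:
//   Var(m) = sum(w_i^2) / (sum(w_i))^3 * sum(w_i * (x_i - m)^2)
// The standard error is the square root of this quantity.
double weighted_standard_error(const std::vector<double>& x,
                               const std::vector<double>& w)
{
  const double m = weighted_mean(x, w);
  double sum_w = 0, sum_w2 = 0, sum_wd2 = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - m;
    sum_w += w[i];
    sum_w2 += w[i] * w[i];
    sum_wd2 += w[i] * d * d;
  }
  return std::sqrt(sum_w2 / (sum_w * sum_w * sum_w) * sum_wd2);
}
\endcode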

In cases where it is not a good approximation to treat the weights as
deterministic, there are two ways to get a better estimate. The
first is to linearize the expression \f$\left<\frac{\sum
w_ix_i}{\sum w_i}\right>\f$. The second method, for when the situation is
more complicated, is to estimate the standard error using a
bootstrapping method.

\section AveragerPairWeighted
Here data points come in pairs (x,y). We are sampling from \f$f'_{XY}\f$
but want to measure from \f$f_{XY}\f$. To compensate for this discrepancy,
averages of \f$g(x,y)\f$ are taken as \f$\sum \frac{f}{f'}g(x,y)\f$. Even
though \f$X\f$ and \f$Y\f$ are not independent \f$(f_{XY}\neq f_Xf_Y)\f$, we
assume that we can factorize the ratio and get \f$\frac{\sum
w_xw_yg(x,y)}{\sum w_xw_y}\f$.
\subsection Covariance
Following the variance calculations for AveragerWeighted, we have
\f$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}\f$, where
\f$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}\f$.
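
A sketch of this covariance with pairwise weights \f$ w_xw_y \f$
(illustrative only; names are ours):

\code
#include <cstddef>
#include <vector>

// Weighted covariance: the pairwise weight for (x_i, y_i) is
// w_x[i] * w_y[i], as in the text above.
double weighted_covariance(const std::vector<double>& x,
                           const std::vector<double>& wx,
                           const std::vector<double>& y,
                           const std::vector<double>& wy)
{
  double sum_w = 0, sum_wx = 0, sum_wy = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    sum_w += w;
    sum_wx += w * x[i];
    sum_wy += w * y[i];
  }
  const double mx = sum_wx / sum_w;
  const double my = sum_wy / sum_w;
  double sum_wcov = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum_wcov += wx[i] * wy[i] * (x[i] - mx) * (y[i] - my);
  return sum_wcov / sum_w;
}
\endcode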

\subsection Correlation

As the mean is estimated as
\f$
m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}
\f$,
the variance is estimated as
\f$
\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}
\f$.
As in the non-weighted case, we define the correlation to be the ratio
between the covariance and the geometric mean of the variances

\f$
\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
w_xw_y(y-m_y)^2}}
\f$.


This expression fulfills the following:
 - Having all weights equal, the expression reduces to the non-weighted expression.
 - Adding a pair of data points in which one weight is zero is equivalent
to ignoring the data pair.
 - The correlation is always between -1 and 1, and it is equal to unity if
\f$x\f$ is equal to \f$y\f$.
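
A sketch of this weighted correlation under the same conventions as the
covariance sketch above:

\code
#include <cmath>

// Weighted correlation: weighted covariance divided by the geometric
// mean of the weighted variances, all with pairwise weights w_x * w_y.
double weighted_correlation(const std::vector<double>& x,
                            const std::vector<double>& wx,
                            const std::vector<double>& y,
                            const std::vector<double>& wy)
{
  double sum_w = 0, sum_wx = 0, sum_wy = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    sum_w += w;
    sum_wx += w * x[i];
    sum_wy += w * y[i];
  }
  const double mx = sum_wx / sum_w;
  const double my = sum_wy / sum_w;
  double sxy = 0, sxx = 0, syy = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    sxy += w * (x[i] - mx) * (y[i] - my);
    sxx += w * (x[i] - mx) * (x[i] - mx);
    syy += w * (y[i] - my) * (y[i] - my);
  }
  return sxy / std::sqrt(sxx * syy);
}
\endcode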

\section Score

\subsection Pearson

\f$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}\f$.

See AveragerPairWeighted correlation.

\subsection ROC

An interpretation of the ROC curve area is that it is the probability
that, if we take one sample from class \f$+\f$ and one sample from class
\f$-\f$, the sample from class \f$+\f$ has the greater value. The
ROC curve area is the fraction of pairs fulfilling this:

\f$
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
\f$

A geometrical interpretation is to have a number of squares, where
each square corresponds to a pair of samples. The ROC curve follows the
border between pairs in which the sample from class \f$+\f$ has the greater
value and pairs in which this is not fulfilled. The ROC curve area is
the area of the squares of the former kind, and a natural extension is
to weight each pair with its two weights, so the weighted ROC curve
area becomes

\f$
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
\f$

This expression is invariant under a rescaling of the weights. Adding a
data value with weight zero adds nothing to the expression, and having
all weights equal to unity yields the non-weighted ROC curve area.
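
A brute-force sketch of the weighted ROC curve area; the double loop
mirrors the sum over all pairs (an \f$ O(n^2) \f$ illustration, not the
yat implementation):

\code
#include <cstddef>
#include <vector>

// Weighted ROC curve area: weighted fraction of (negative, positive)
// pairs in which the positive sample has the greater value.
double weighted_roc_area(const std::vector<double>& x_neg,
                         const std::vector<double>& w_neg,
                         const std::vector<double>& x_pos,
                         const std::vector<double>& w_pos)
{
  double num = 0;
  double den = 0;
  for (std::size_t i = 0; i < x_neg.size(); ++i)
    for (std::size_t j = 0; j < x_pos.size(); ++j) {
      const double w = w_neg[i] * w_pos[j];
      if (x_neg[i] < x_pos[j])
        num += w;
      den += w;
    }
  return num / den;
}
\endcode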

\subsection tScore

Assume that \f$x\f$ and \f$y\f$ originate from the same distribution
\f$N(\mu,\sigma_i^2)\f$, where \f$\sigma_i^2=\frac{\sigma_0^2}{w_i}\f$. We then
estimate \f$\sigma_0^2\f$ as
\f$
\frac{\sum w_x(x-m_x)^2+\sum w_y(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}.
\f$
The variance of the difference of the means becomes
\f$
Var(m_x)+Var(m_y)=\frac{\sum w_x^2Var(x_i)}{\left(\sum
w_x\right)^2}+\frac{\sum w_y^2Var(y_i)}{\left(\sum w_y\right)^2}=
\frac{\sigma_0^2}{\sum w_x}+\frac{\sigma_0^2}{\sum w_y},
\f$
and consequently the estimated variance of the difference of the means
becomes
\f$
\frac{\sum w_x(x-m_x)^2+\sum w_y(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
\left(\frac{1}{\sum w_x}+\frac{1}{\sum w_y}\right).
\f$
The t-score is the difference of the means, \f$ m_x-m_y \f$, divided by
the square root of this variance.

For constant weights, \f$ w_i=w \f$, the variance expression condenses to
\f$
\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
{n_x+n_y-2}
\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
\f$
in other words, the familiar expression from the non-weighted case.
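
A sketch of the weighted t-score, reusing weighted_mean from above
(names are ours, not the yat API):

\code
#include <cmath>

// Weighted t-score, following the derivation above:
//   s0^2 = (sum wx*(x-mx)^2 + sum wy*(y-my)^2) / dof,
//   dof  = (sum wx)^2/sum wx^2 + (sum wy)^2/sum wy^2 - 2,
//   t    = (mx - my) / sqrt(s0^2 * (1/sum wx + 1/sum wy)).
double weighted_tscore(const std::vector<double>& x,
                       const std::vector<double>& wx,
                       const std::vector<double>& y,
                       const std::vector<double>& wy)
{
  const double mx = weighted_mean(x, wx);
  const double my = weighted_mean(y, wy);
  double swx = 0, swx2 = 0, ssx = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    swx += wx[i];
    swx2 += wx[i] * wx[i];
    ssx += wx[i] * (x[i] - mx) * (x[i] - mx);
  }
  double swy = 0, swy2 = 0, ssy = 0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    swy += wy[i];
    swy2 += wy[i] * wy[i];
    ssy += wy[i] * (y[i] - my) * (y[i] - my);
  }
  const double dof = swx * swx / swx2 + swy * swy / swy2 - 2;
  const double s0_2 = (ssx + ssy) / dof;
  return (mx - my) / std::sqrt(s0_2 * (1 / swx + 1 / swy));
}
\endcode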

\subsection FoldChange
The fold-change is simply the difference between the weighted means of
the two groups, \f$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}\f$.
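
A one-line sketch, reusing weighted_mean from above:

\code
// Weighted fold-change: the difference between the weighted means of
// the two groups.
double weighted_fold_change(const std::vector<double>& x,
                            const std::vector<double>& wx,
                            const std::vector<double>& y,
                            const std::vector<double>& wy)
{
  return weighted_mean(x, wx) - weighted_mean(y, wy);
}
\endcode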

\subsection WilcoxonFoldChange
Take all pairs of samples (one from class \f$+\f$ and one from class \f$-\f$)
and calculate the weighted median of the differences.

\section Distance

A \ref concept_distance measures how far apart two ranges are. A Distance should
preferably meet some criteria:

 - It is symmetric, \f$ d(x,y) = d(y,x) \f$, that is, the distance from \f$
   x \f$ to \f$ y \f$ equals the distance from \f$ y \f$ to \f$ x \f$.
 - Zero self-distance: \f$ d(x,x) = 0 \f$
 - Triangle inequality: \f$ d(x,z) \le d(x,y) + d(y,z) \f$

\subsection weighted_distance Weighted Distance

Weighted Distance is an extension of the usual unweighted distances, in
which each data point is accompanied by a weight. A weighted
distance should meet some criteria:

 - Having all weights equal to unity should yield the unweighted case.
 - Rescaling the weights, \f$ w_i = Cw_i \f$, does not change the distance.
 - If \f$ w_x = 0 \f$, the distance should ignore the corresponding
    \f$ x \f$, \f$ y \f$, and \f$ w_y \f$.
 - A zero weight should not result in a very different distance than a
   small weight; in other words, modifying a weight should change the
   distance in a continuous manner.
 - The duplicate property: if data comes in duplicates, such that
   \f$ x_{2i}=x_{2i+1} \f$, then the case \f$ w_{2i}=w_{2i+1} \f$
   should give the same distance as setting \f$ w_{2i}=0 \f$.

The last condition, the duplicate property, implies that setting a weight
to zero is not equivalent to removing the data point. This behavior is
sensible, because otherwise we would have a bias towards ranges
with small weights being close to other ranges. For a weighted
distance meeting these criteria, it might be difficult to show that
the triangle inequality is fulfilled. For most algorithms the triangle
inequality is not essential for the distance to work properly, so if
you need to choose between fulfilling the triangle inequality and the
latter criteria, it is preferable to meet the latter criteria.

In test/distance_test.cc there are tests for these properties.

\section Kernel
\subsection polynomial_kernel Polynomial Kernel
The polynomial kernel of degree \f$N\f$ is defined as \f$(1+<x,y>)^N\f$, where
\f$<x,y>\f$ is the linear kernel (the usual scalar product). For the weighted
case we define the linear kernel to be
\f$<x,y>=\frac{\sum {w_xw_yxy}}{\sum{w_xw_y}}\f$, and the
polynomial kernel can be calculated as before,
\f$(1+<x,y>)^N\f$.

\subsection gaussian_kernel Gaussian Kernel
We define the weighted Gaussian kernel as \f$\exp\left(-N\frac{\sum
w_xw_y(x-y)^2}{\sum w_xw_y}\right)\f$.
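
A sketch of these weighted kernels. In the Gaussian kernel we take
\f$N\f$ to be the number of data points, which is our reading: with
unity weights the argument then reduces to \f$-\sum(x-y)^2\f$.

\code
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted linear kernel: <x,y> = sum(w_x * w_y * x * y) / sum(w_x * w_y).
double weighted_linear_kernel(const std::vector<double>& x,
                              const std::vector<double>& wx,
                              const std::vector<double>& y,
                              const std::vector<double>& wy)
{
  double num = 0;
  double den = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    num += w * x[i] * y[i];
    den += w;
  }
  return num / den;
}

// Polynomial kernel of degree N: (1 + <x,y>)^N.
double weighted_polynomial_kernel(const std::vector<double>& x,
                                  const std::vector<double>& wx,
                                  const std::vector<double>& y,
                                  const std::vector<double>& wy,
                                  unsigned degree)
{
  return std::pow(1.0 + weighted_linear_kernel(x, wx, y, wy), degree);
}

// Gaussian kernel: exp(-N * sum(w_x * w_y * (x-y)^2) / sum(w_x * w_y)),
// where N is taken as the number of data points (our assumption).
double weighted_gaussian_kernel(const std::vector<double>& x,
                                const std::vector<double>& wx,
                                const std::vector<double>& y,
                                const std::vector<double>& wy)
{
  double num = 0;
  double den = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    num += w * (x[i] - y[i]) * (x[i] - y[i]);
    den += w;
  }
  return std::exp(-static_cast<double>(x.size()) * num / den);
}
\endcode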

\section Regression
\subsection Naive
\subsection Linear
We have the model

\f$
y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
\f$

where \f$\epsilon_i\f$ is the noise. The variance of the noise is
inversely proportional to the weight,
\f$Var(\epsilon_i)=\frac{\sigma^2}{w_i}\f$. In order to determine the
model parameters, we minimize the weighted sum of squared errors,

\f$
Q_0 = \sum w_i\epsilon_i^2.
\f$

Taking the derivative with respect to \f$\alpha\f$ and \f$\beta\f$ yields two conditions

\f$
\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
\beta (x_i-m_x))=0
\f$

and

\f$ \frac{\partial Q_0}{\partial \beta} = -2 \sum
w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
\f$

or equivalently

\f$
\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
\f$

and

\f$ \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
\f$

Note, by having all weights equal we get back the unweighted
case. Furthermore, we calculate the variance of the estimators of
\f$\alpha\f$ and \f$\beta\f$.

\f$
\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
\frac{\sigma^2}{\sum w_i}
\f$

and
\f$
\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
{(\sum w_i(x_i-m_x)^2)^2}=
\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
\f$

Finally, we estimate the level of noise, \f$\sigma^2\f$. Inspired by the
unweighted estimator

\f$
s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
\f$

we suggest the following estimator:

\f$ s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
w_i-2\frac{\sum w_i^2}{\sum w_i}} \f$
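
A sketch of the weighted fit following the derivation above (the struct
and names are ours for illustration, not the yat regression API):

\code
#include <cstddef>
#include <vector>

// Weighted least-squares fit of y = alpha + beta * (x - mx).
struct WeightedLinearFit {
  double mx;    // weighted mean of x
  double alpha; // intercept at x = mx (weighted mean of y)
  double beta;  // slope, Cov(x,y)/Var(x)
  double s2;    // estimated noise level sigma^2
};

WeightedLinearFit weighted_linear_fit(const std::vector<double>& x,
                                      const std::vector<double>& y,
                                      const std::vector<double>& w)
{
  double sw = 0, sw2 = 0, swx = 0, swy = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sw += w[i];
    sw2 += w[i] * w[i];
    swx += w[i] * x[i];
    swy += w[i] * y[i];
  }
  const double mx = swx / sw;
  const double alpha = swy / sw;
  double sxy = 0, sxx = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sxy += w[i] * (x[i] - mx) * (y[i] - alpha);
    sxx += w[i] * (x[i] - mx) * (x[i] - mx);
  }
  const double beta = sxy / sxx;
  double ss = 0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double r = y[i] - alpha - beta * (x[i] - mx);
    ss += w[i] * r * r;
  }
  const double s2 = ss / (sw - 2 * sw2 / sw); // suggested estimator above
  return WeightedLinearFit{mx, alpha, beta, s2};
}
\endcode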

*/