source: tags/0.4.1/doc/Statistics.doxygen @ 1629

Last change on this file since 1629 was 1275, checked in by Jari Häkkinen, 14 years ago

Updating copyright statements.

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 14.8 KB
Line 
1// $Id: Statistics.doxygen 1275 2008-04-11 06:10:12Z jari $
2//
3// Copyright (C) 2005 Peter Johansson
4// Copyright (C) 2006 Jari Häkkinen, Peter Johansson, Markus Ringnér
5// Copyright (C) 2007 Jari Häkkinen, Peter Johansson
6// Copyright (C) 2008 Peter Johansson
7//
8// This file is part of the yat library, http://trac.thep.lu.se/yat
9//
10// The yat library is free software; you can redistribute it and/or
11// modify it under the terms of the GNU General Public License as
12// published by the Free Software Foundation; either version 2 of the
13// License, or (at your option) any later version.
14//
15// The yat library is distributed in the hope that it will be useful,
16// but WITHOUT ANY WARRANTY; without even the implied warranty of
17// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
18// General Public License for more details.
19//
20// You should have received a copy of the GNU General Public License
21// along with this program; if not, write to the Free Software
22// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
23// 02111-1307, USA.
24
25
26/**
27\page weighted_statistics Weighted Statistics
28
29\section Introduction
30There are several different reasons why a statistical analysis needs
31to adjust for weighting. In literature reasons are mainly diveded in
32to groups.
33
34The first group is when some of the measurements are known to be more
35precise than others. The more precise a measurement is, the larger
36weight it is given. The simplest case is when the weight are given
37before the measurements and they can be treated as deterministic. It
38becomes more complicated when the weight can be determined not until
39afterwards, and even more complicated if the weight depends on the
40value of the observable.
41
42The second group of situations is when calculating averages over one
43distribution and sampling from another distribution. Compensating for
44this discrepency weights are introduced to the analysis. A simple
45example may be that we are interviewing people but for economical
46reasons we choose to interview more people from the city than from the
47countryside. When summarizing the statistics the answers from the city
48are given a smaller weight. In this example we are choosing the
49proportions of people from countryside and people from city being
50intervied. Hence, we can determine the weights before and consider
51them to be deterministic. In other situations the proportions are not
52deterministic, but rather a result from the sampling and the weights
53must be treated as stochastic and only in rare situations the weights
54can be treated as independent of the observable.
55
56Since there are various origins for a weight occuring in a statistical
57analysis, there are various ways to treat the weights and in general
58the analysis should be tailored to treat the weights correctly. We
59have not chosen one situation for our implementations, so see specific
60function documentation for what assumtions are made. Though, common
61for implementations are the following:
62
63 - Setting all weights to unity yields the same result as the
64non-weighted version.
65 - Rescaling the weights does not change any function.
66 - Setting a weight to zero is equivalent to removing the data point.
67
68An important case is when weights are binary (either 1 or 0). Then we
69get the same result using the weighted version as using the data with
70weight not equal to zero and the non-weighted version. Hence, using
71binary weights and the weighted version missing values can be treated
72in a proper way.
73
74\section AveragerWeighted
75
76
77
78\subsection Mean
79
80For any situation the weight is always designed so the weighted mean
81is calculated as \f$ m=\frac{\sum w_ix_i}{\sum w_i} \f$, which obviously
82fulfills the conditions above.
83
84
85
86In the case of varying measurement error, it could be motivated that
87the weight shall be \f$ w_i = 1/\sigma_i^2 \f$. We assume measurement error
88to be Gaussian and the likelihood to get our measurements is
89\f$ L(m)=\prod
90(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}} \f$.  We
91maximize the likelihood by taking the derivity with respect to \f$ m \f$ on
92the logarithm of the likelihood \f$ \frac{d\ln L(m)}{dm}=\sum
93\frac{x_i-m}{\sigma_i^2} \f$. Hence, the Maximum Likelihood method yields
94the estimator \f$ m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2} \f$.
95
96
97\subsection Variance
98In case of varying variance, there is no point estimating a variance
99since it is different for each data point.
100
101Instead we look at the case when we want to estimate the variance over
102\f$f\f$ but are sampling from \f$ f' \f$. For the mean of an observable \f$ O \f$ we
103have \f$ \widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
104w_i} \f$. Hence, an estimator of the variance of \f$ X \f$ is
105
106\f$
107s^2 = <X^2>-<X>^2=
108\f$
109
110\f$
111 = \frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
112\f$
113
114\f$
115 = \frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
116\f$
117
118\f$
119 = \frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
120\f$
121
122\f$
123 = \frac{\sum w_i(x_i-m)^2}{\sum w_i}
124\f$
125
126This estimator fulfills that it is invariant under a rescaling and
127having a weight equal to zero is equivalent to removing the data
128point. Having all weights equal to unity we get \f$ \sigma=\frac{\sum
129(x_i-m)^2}{N} \f$, which is the same as returned from Averager. Hence,
130this estimator is slightly biased, but still very efficient.
131
132\subsection standard_error Standard Error
133The standard error squared is equal to the expexted squared error of
134the estimation of \f$m\f$. The squared error consists of two parts, the
135variance of the estimator and the squared bias:
136
137\f$
138<m-\mu>^2=<m-<m>+<m>-\mu>^2=
139\f$
140\f$
141<m-<m>>^2+(<m>-\mu)^2
142\f$. 
143
144In the case when weights are included in analysis due to varying
145measurement errors and the weights can be treated as deterministic, we
146have
147
148\f$
149Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
150\f$
151\f$
152\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
153\f$
154\f$
155\frac{\sigma_0^2}{\sum w_i},
156\f$
157
158where we need to estimate \f$ \sigma_0^2 \f$. Again we have the likelihood
159
160\f$
161L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{(-\frac{w_i(x-m)^2}{2\sigma_0^2})}
162\f$
163and taking the derivity with respect to
164\f$\sigma_o^2\f$,
165
166\f$
167\frac{d\ln L}{d\sigma_i^2}=
168\f$
169\f$
170\sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2}
171\f$
172
173which
174yields an estimator \f$ \sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2 \f$. This
175estimator is not ignoring weights equal to zero, because deviation is
176most often smaller than the expected infinity. Therefore, we modify
177the expression as follows \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
178w_i\right)^2}\sum w_i(x-m)^2\f$ and we get the following estimator of
179the variance of the mean \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
180w_i\right)^3}\sum w_i(x-m)^2\f$. This estimator fulfills the conditions
181above: adding a weight zero does not change it: rescaling the weights
182does not change it, and setting all weights to unity yields the same
183expression as in the non-weighted case.
184
185In a case when it is not a good approximation to treat the weights as
186deterministic, there are two ways to get a better estimation. The
187first one is to linearize the expression \f$\left<\frac{\sum
188w_ix_i}{\sum w_i}\right>\f$. The second method when the situation is
189more complicated is to estimate the standard error using a
190bootstrapping method.
191
192\section AveragerPairWeighted
193Here data points come in pairs (x,y). We are sampling from \f$f'_{XY}\f$
194but want to measure from \f$f_{XY}\f$. To compensate for this decrepency,
195averages of \f$g(x,y)\f$ are taken as \f$\sum \frac{f}{f'}g(x,y)\f$. Even
196though, \f$X\f$ and \f$Y\f$ are not independent \f$(f_{XY}\neq f_Xf_Y)\f$ we
197assume that we can factorize the ratio and get \f$\frac{\sum
198w_xw_yg(x,y)}{\sum w_xw_y}\f$
199\subsection Covariance
200Following the variance calculations for AveragerWeighted we have
201\f$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}\f$ where
202\f$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}\f$
203
204\subsection Correlation
205
206As the mean is estimated as
207\f$
208m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}
209\f$,
210the variance is estimated as
211\f$
212\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}
213\f$.
214As in the non-weighted case we define the correlation to be the ratio
215between the covariance and geometrical average of the variances
216
217\f$
218\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
219w_xw_y(y-m_y)^2}}
220\f$.
221
222
223This expression fulfills the following
224 - Having N equal weights the expression reduces to the non-weighted expression.
225 - Adding a pair of data, in which one weight is zero is equivalent
226to ignoring the data pair.
227 - Correlation is equal to unity if and only if \f$x\f$ is equal to
228\f$y\f$. Otherwise the correlation is between -1 and 1.
229
230\section Score
231
232\subsection Pearson
233
234\f$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}\f$.
235
236See AveragerPairWeighted correlation.
237
238\subsection ROC
239
240An interpretation of the ROC curve area is the probability that if we
241take one sample from class \f$+\f$ and one sample from class \f$-\f$, what is
242the probability that the sample from class \f$+\f$ has greater value. The
243ROC curve area calculates the ratio of pairs fulfilling this
244
245\f$
246\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
247\f$
248
249An geometrical interpretation is to have a number of squares where
250each square correspond to a pair of samples. The ROC curve follows the
251border between pairs in which the samples from class \f$+\f$ has a greater
252value and pairs in which this is not fulfilled. The ROC curve area is
253the area of those latter squares and a natural extension is to weight
254each pair with its two weights and consequently the weighted ROC curve
255area becomes
256
257\f$
258\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
259\f$
260
261This expression is invariant under a rescaling of weight. Adding a
262data value with weight zero adds nothing to the exprssion, and having
263all weight equal to unity yields the non-weighted ROC curve area.
264
265\subsection tScore
266
267Assume that \f$x\f$ and \f$y\f$ originate from the same distribution
268\f$N(\mu,\sigma_i^2)\f$ where \f$\sigma_i^2=\frac{\sigma_0^2}{w_i}\f$. We then
269estimate \f$\sigma_0^2\f$ as
270\f$
271\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
272{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
273\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
274\f$
275The variance of difference of the means becomes
276\f$
277Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum
278w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}=
279\frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i},
280\f$
281and consequently the t-score becomes
282\f$
283\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
284{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
285\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
286\left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right),
287\f$
288
289For a \f$w_i=w\f$ we this expression get condensed down to
290\f$
291\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
292{n_x+n_y-2}
293\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
294\f$
295in other words the good old expression as for non-weighted.
296
297\subsection FoldChange
298Fold-Change is simply the difference between the weighted mean of the
299two groups \f$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}\f$
300
301\subsection WilcoxonFoldChange
302Taking all pair samples (one from class \f$+\f$ and one from class \f$-\f$)
303and calculating the weighted median of the distances.
304
305\section Distance
306
307A \ref concept_distance measures how far apart two ranges are. A Distance should
308preferably meet some criteria:
309
310 - It is symmetric, \f$ d(x,y) = d(y,x) \f$, that is distance from \f$
311   x \f$ to \f$ y \f$ equals the distance from \f$ y \f$ to \f$ x \f$.
312 - Zero self-distance: \f$ d(x,x) = 0 \f$
313 - Triangle inequality: \f$ d(x,z) \le d(x,y) + d(y,z) \f$
314
315\subsection weighted_distance Weighted Distance
316
317Weighted Distance is an extension of usual unweighted distances, in
318which each data point is accompanied with a weight. A weighted
319distance should meet some criteria:
320
321 - Having all unity weights should yield the unweighted case.
322 - Rescaling the weights, \f$ w_i = Cw_i \f$, does not change the distance.
323 - Having a \f$ w_x = 0 \f$ the distance should ignore corresponding
324    \f$ x \f$, \f$ y \f$, and \f$ w_y \f$.
325 - A zero weight should not result in a very different distance than a
326   small weight, in other words, modifying a weight should change the
327   distance in a continuous manner.
328 - The duplicate property. If data is coming in duplicate such that
329   \f$ x_{2i}=x_{2i+1} \f$, then the case when \f$ w_{2i}=w_{2i+1} \f$
330   should equal to if you set \f$ w_{2i}=0 \f$.
331
332The last condition, duplicate property, implies that setting a weight
333to zero is not equivalent to removing the data point. This behavior is
334sensible because otherwise we would have a bias towards having ranges
335with small weights being close to other ranges. For a weighted
336distance, meeting these criteria, it might be difficult to show that
337the triangle inequality is fulfilled. For most algorithms the triangle
338inequality is not essential for the distance to work properly, so if
339you need to choose between fulfilling triangle inequality and these
340latter criteria it is preferable to meet the latter criteria.
341
342In test/distance_test.cc there are tests for testing these properties.
343
344\section Kernel
345\subsection polynomial_kernel Polynomial Kernel
346The polynomial kernel of degree \f$N\f$ is defined as \f$(1+<x,y>)^N\f$, where
347\f$<x,y>\f$ is the linear kernel (usual scalar product). For the weighted
348case we define the linear kernel to be
349\f$<x,y>=\frac{\sum {w_xw_yxy}}{\sum{w_xw_y}}\f$ and the
350polynomial kernel can be calculated as before
351\f$(1+<x,y>)^N\f$.
352
353\subsection gaussian_kernel Gaussian Kernel
354We define the weighted Gaussian kernel as \f$\exp\left(-N\frac{\sum
355w_xw_y(x-y)^2}{\sum w_xw_y}\right)\f$.
356
357\section Regression
358\subsection Naive
359\subsection Linear
360We have the model
361
362\f$
363y_i=\alpha+\beta (x-m_x)+\epsilon_i,
364\f$
365
366where \f$\epsilon_i\f$ is the noise. The variance of the noise is
367inversely proportional to the weight,
368\f$Var(\epsilon_i)=\frac{\sigma^2}{w_i}\f$. In order to determine the
369model parameters, we minimimize the sum of quadratic errors.
370
371\f$
372Q_0 = \sum \epsilon_i^2
373\f$
374
375Taking the derivity with respect to \f$\alpha\f$ and \f$\beta\f$ yields two conditions
376
377\f$
378\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
379\beta (x_i-m_x)=0
380\f$
381
382and
383
384\f$ \frac{\partial Q_0}{\partial \beta} = -2 \sum
385w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0
386\f$
387
388or equivalently
389
390\f$
391\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
392\f$
393
394and
395
396\f$ \beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum
397w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
398\f$
399
400Note, by having all weights equal we get back the unweighted
401case. Furthermore, we calculate the variance of the estimators of
402\f$\alpha\f$ and \f$\beta\f$.
403
404\f$
405\textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
406\frac{\sigma^2}{\sum w_i}
407\f$
408
409and
410\f$
411\textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
412{(\sum w_i(x_i-m_x)^2)^2}=
413\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
414\f$
415
416Finally, we estimate the level of noise, \f$\sigma^2\f$. Inspired by the
417unweighted estimation
418
419\f$
420s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
421\f$
422
423we suggest the following estimator
424
425\f$ s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
426w_i-2\frac{\sum w_i^2}{\sum w_i}} \f$
427
428*/
429
430
431
Note: See TracBrowser for help on using the repository browser.