\documentclass[12pt]{article}

% $Id: Statistics.tex 675 2006-10-10 12:08:45Z jari $
%
% Copyright (C) The authors contributing to this file.
%
% This file is part of the yat library, http://lev.thep.lu.se/trac/yat
%
% The yat library is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of the
% License, or (at your option) any later version.
%
% The yat library is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program; if not, write to the Free Software
% Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
% 02111-1307, USA.

\usepackage{html}


\flushbottom
\footskip 54pt
\headheight 0pt
\headsep 0pt
\oddsidemargin 0pt
\parindent 0pt
\parskip 2ex
\textheight 230mm
\textwidth 165mm
\topmargin 0pt

\renewcommand{\baselinestretch} {1.0}
\renewcommand{\textfraction} {0.1}
\renewcommand{\topfraction} {1.0}
\renewcommand{\bottomfraction} {1.0}
\renewcommand{\floatpagefraction} {1.0}

\renewcommand{\d}{{\mathrm{d}}}
\newcommand{\nd}{$^{\mathrm{nd}}$}
\newcommand{\eg}{{\it {e.g.}}}
\newcommand{\ie}{{\it {i.e., }}}
\newcommand{\etal}{{\it {et al.}}}
\newcommand{\eref}[1]{Eq.~(\ref{e:#1})}
\newcommand{\fref}[1]{Fig.~\ref{f:#1}}
\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}

\begin{document}

\large
{\bf Weighted Statistics}
\normalsize
\begin{htmlonly}
This document is also available in
\htmladdnormallink{PDF}{Statistics.pdf}.
\end{htmlonly}

\tableofcontents
\clearpage

\section{Introduction}
There are several different reasons why a statistical analysis needs
to adjust for weighting. In the literature, the reasons are mainly
divided into two groups.

The first group is when some of the measurements are known to be more
precise than others. The more precise a measurement is, the larger
weight it is given. The simplest case is when the weights are given
before the measurements and can be treated as deterministic. It
becomes more complicated when the weights cannot be determined until
afterwards, and even more complicated if the weights depend on the
values of the observable.

The second group of situations is when calculating averages over one
distribution while sampling from another distribution. To compensate for
this discrepancy, weights are introduced into the analysis. A simple
example may be that we are interviewing people, but for economical
reasons we choose to interview more people from the city than from the
countryside. When summarizing the statistics, the answers from the city
are given a smaller weight. In this example we choose the
proportions of people from the countryside and people from the city
being interviewed. Hence, we can determine the weights beforehand and
consider them to be deterministic. In other situations the proportions
are not deterministic, but rather a result of the sampling, and the
weights must be treated as stochastic; only in rare situations can the
weights be treated as independent of the observable.

Since there are various origins for a weight occurring in a statistical
analysis, there are various ways to treat the weights, and in general
the analysis should be tailored to treat the weights correctly. We
have not chosen one single situation for our implementations, so see
the documentation of each specific function for what assumptions are
made. However, the following holds for all implementations:
\begin{itemize}
\item Setting all weights to unity yields the same result as the
non-weighted version.
\item Rescaling the weights does not change any function.
\item Setting a weight to zero is equivalent to removing the data point.
\end{itemize}
An important case is when weights are binary (either 1 or 0). Then the
weighted version gives the same result as applying the non-weighted
version to the data points with non-zero weight. Hence, missing values
can be treated properly by using binary weights together with the
weighted version.

\section{AveragerWeighted}



\subsection{Mean}

In all situations the weights are designed such that the weighted mean
is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously
fulfills the conditions above.
In the case of varying measurement errors, it can be motivated that the
weight should be $w_i = 1/\sigma_i^2$. We assume the measurement errors
to be Gaussian, so the likelihood of obtaining our measurements is
$L(m)=\prod
(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$.  We
maximize the likelihood by taking the derivative with respect to $m$ of
the logarithm of the likelihood, $\frac{d\ln L(m)}{dm}=\sum
\frac{x_i-m}{\sigma_i^2}=0$. Hence, the Maximum Likelihood method yields
the estimator $m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}$,
\ie the weighted mean with $w_i=1/\sigma_i^2$.
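
As an illustration, a minimal standalone C++ sketch of this weighted
mean (not the yat interface; the function name is ours) could read:
\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted mean m = sum(w_i*x_i) / sum(w_i).
// Assumes x and w have equal size and sum(w) > 0.
double weighted_mean(const std::vector<double>& x,
                     const std::vector<double>& w)
{
  assert(x.size() == w.size());
  double sum_wx = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_wx += w[i] * x[i];
    sum_w  += w[i];
  }
  return sum_wx / sum_w;
}
\end{verbatim}
Setting all weights to unity reduces this to the ordinary arithmetic mean.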


\subsection{Variance}
In the case of varying variance, there is no point in estimating a
variance since it is different for each data point.

Instead we look at the case when we want to estimate the variance over
$f$ but are sampling from $f'$. For the mean of an observable $O$ we
have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
w_i}$. Hence, an estimator of the variance of $X$ is
\begin{eqnarray}
\sigma^2=<X^2>-<X>^2=
\\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
\\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
\\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
\\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
\end{eqnarray}
This estimator is invariant under a rescaling of the weights, and
setting a weight equal to zero is equivalent to removing the data
point. Setting all weights equal to unity we get $\sigma^2=\frac{\sum
(x_i-m)^2}{N}$, which is the same as returned from Averager. Hence,
this estimator is slightly biased, but still very efficient.
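
A corresponding sketch of this variance estimator, assuming the
weighted\_mean function from the previous example, could be:
\begin{verbatim}
#include <cstddef>
#include <vector>

// Weighted variance sigma^2 = sum(w_i*(x_i-m)^2) / sum(w_i),
// where m is the weighted mean defined in the previous sketch.
double weighted_variance(const std::vector<double>& x,
                         const std::vector<double>& w)
{
  double m = weighted_mean(x, w);
  double sum_wd2 = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double d = x[i] - m;
    sum_wd2 += w[i] * d * d;
    sum_w   += w[i];
  }
  return sum_wd2 / sum_w;
}
\end{verbatim}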

\subsection{Standard Error}
The standard error squared is equal to the expected squared error of
the estimate of $m$. The squared error consists of two parts, the
variance of the estimator and the squared
bias: $<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$.  In the
case when weights are included in the analysis due to varying
measurement errors and the weights can be treated as deterministic, we have
\begin{equation}
Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
\frac{\sigma_0^2}{\sum w_i},
\end{equation}
where we need to estimate $\sigma_0^2$. Again we have the likelihood
$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp\left(-\frac{w_i(x-m)^2}{2\sigma_0^2}\right)$
and taking the derivative with respect to $\sigma_0^2$, $\frac{d\ln
L}{d\sigma_0^2}=\sum
-\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^4}$, which
yields the estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This
estimator does not ignore weights equal to zero: a data point with zero
weight contributes nothing to the sum, since its observed deviation is
finite rather than the expected infinity, but it is still counted in
$N$. Therefore, we modify the expression to $\sigma_0^2=\frac{\sum
w_i^2}{\left(\sum w_i\right)^2}\sum w_i(x-m)^2$ and get the following
estimator of the variance of the mean: $Var(m)=\frac{\sum
w_i^2}{\left(\sum w_i\right)^3}\sum w_i(x-m)^2$. This estimator
fulfills the conditions above: adding a weight equal to zero does not
change it, rescaling the weights does not change it, and setting all
weights to unity yields the same expression as in the non-weighted case.
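
As a sketch (again assuming the weighted\_mean function above), this
estimator of the variance of the mean could be computed as:
\begin{verbatim}
#include <cstddef>
#include <vector>

// Estimated variance of the weighted mean,
// Var(m) = sum(w_i^2) / (sum(w_i))^3 * sum(w_i*(x_i-m)^2).
double variance_of_mean(const std::vector<double>& x,
                        const std::vector<double>& w)
{
  double m = weighted_mean(x, w);
  double sum_w = 0.0, sum_w2 = 0.0, sum_wd2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double d = x[i] - m;
    sum_w   += w[i];
    sum_w2  += w[i] * w[i];
    sum_wd2 += w[i] * d * d;
  }
  return sum_w2 / (sum_w * sum_w * sum_w) * sum_wd2;
}
\end{verbatim}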

In cases where it is not a good approximation to treat the weights as
deterministic, there are two ways to get a better estimate. The
first is to linearize the expression $\left<\frac{\sum
w_ix_i}{\sum w_i}\right>$. The second method, for more complicated
situations, is to estimate the standard error using a
bootstrapping method.

\section{AveragerPairWeighted}
Here data points come in pairs $(x,y)$. We are sampling from $f'_{XY}$
but want averages over $f_{XY}$. To compensate for this discrepancy,
averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even
though $X$ and $Y$ are not independent ($f_{XY}\neq f_Xf_Y$), we
assume that we can factorize the ratio and get $\frac{\sum
w_xw_yg(x,y)}{\sum w_xw_y}$.
\subsection{Covariance}
Following the variance calculations for AveragerWeighted we have
$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$, where
$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$.

\subsection{Correlation}

As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$,
the variance is estimated as $\sigma_x^2=\frac{\sum
w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define
the correlation to be the ratio between the covariance and the
geometric mean of the variances,

$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
w_xw_y(y-m_y)^2}}$.

This expression fulfills the following
\begin{itemize}
\item Setting all weights to unity reduces the expression to the
non-weighted expression.
\item Adding a data pair in which one of the weights is zero is
equivalent to ignoring the pair.
\item The correlation is equal to unity if and only if $x$ is equal to
$y$. Otherwise the correlation is between $-1$ and $1$.
\end{itemize}
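
The following standalone sketch (not the yat interface) computes the
weighted correlation above; note that the common factor $\sum w_xw_y$
cancels between numerator and denominator:
\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted correlation with pair weights w_i = wx_i*wy_i.
double weighted_correlation(const std::vector<double>& x,
                            const std::vector<double>& y,
                            const std::vector<double>& wx,
                            const std::vector<double>& wy)
{
  double sum_w = 0.0, mx = 0.0, my = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double w = wx[i] * wy[i];
    sum_w += w;
    mx += w * x[i];
    my += w * y[i];
  }
  mx /= sum_w;                       // weighted mean of x
  my /= sum_w;                       // weighted mean of y
  double cov = 0.0, vx = 0.0, vy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double w = wx[i] * wy[i];
    cov += w * (x[i] - mx) * (y[i] - my);
    vx  += w * (x[i] - mx) * (x[i] - mx);
    vy  += w * (y[i] - my) * (y[i] - my);
  }
  return cov / std::sqrt(vx * vy);   // common factor sum_w cancels
}
\end{verbatim}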
\section{Score}


\subsection{Pearson}

$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.

See AveragerPairWeighted correlation.

\subsection{ROC}

An interpretation of the ROC curve area is the probability that, if we
take one sample from class $+$ and one sample from class $-$, the
sample from class $+$ has the greater value. The ROC curve area
calculates the ratio of pairs fulfilling this,

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
\end{equation}

A geometrical interpretation is to have a number of squares where
each square corresponds to a pair of samples. The ROC curve follows the
border between pairs in which the sample from class $+$ has the greater
value and pairs in which this is not fulfilled. The ROC curve area is
the area of the squares in which the sample from class $+$ has the
greater value, and a natural extension is to weight each pair with its
two weights. Consequently, the weighted ROC curve area becomes

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
\end{equation}

This expression is invariant under a rescaling of the weights. Adding a
data value with weight zero adds nothing to the expression, and having
all weights equal to unity yields the non-weighted ROC curve area.
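
A direct, quadratic-time sketch of the weighted ROC curve area
(illustrative only; ties are counted as not fulfilling $x^-_i<x^+_j$)
could be:
\begin{verbatim}
#include <cstddef>
#include <vector>

// Weighted ROC curve area:
// sum of wneg_i*wpos_j over pairs with xneg_i < xpos_j,
// divided by the total pair weight.
double weighted_roc_area(const std::vector<double>& xpos,
                         const std::vector<double>& wpos,
                         const std::vector<double>& xneg,
                         const std::vector<double>& wneg)
{
  double numerator = 0.0;
  double denominator = 0.0;
  for (std::size_t i = 0; i < xneg.size(); ++i)
    for (std::size_t j = 0; j < xpos.size(); ++j) {
      double w = wneg[i] * wpos[j];
      denominator += w;
      if (xneg[i] < xpos[j])
        numerator += w;
    }
  return numerator / denominator;
}
\end{verbatim}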

\subsection{tScore}

Assume that $x$ and $y$ originate from the same distribution
$N(\mu,\sigma_i^2)$, where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We then
estimate $\sigma_0^2$ as
\begin{equation}
\frac{\sum w_x(x-m_x)^2+\sum w_y(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
\end{equation}
The variance of the difference of the means becomes
\begin{eqnarray}
Var(m_x)+Var(m_y)=\\\frac{\sum w_x^2Var(x_i)}{\left(\sum
w_x\right)^2}+\frac{\sum w_y^2Var(y_i)}{\left(\sum w_y\right)^2}=
\frac{\sigma_0^2}{\sum w_x}+\frac{\sigma_0^2}{\sum w_y},
\end{eqnarray}
and consequently, inserting the estimate of $\sigma_0^2$, the estimated
variance of the difference of the means becomes
\begin{equation}
\frac{\sum w_x(x-m_x)^2+\sum w_y(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
\left(\frac{1}{\sum w_x}+\frac{1}{\sum w_y}\right).
\end{equation}
The t-score is the difference of the means divided by the square root
of this expression. For $w_i=w$ the expression condenses to
\begin{equation}
\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
{n_x+n_y-2} 
\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
\end{equation}
in other words, the familiar expression from the non-weighted case.
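
Putting the pieces together, a sketch of the weighted t-score (assuming
the weighted\_mean function above; not the yat interface) could look like:
\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted t-score: (m_x - m_y) / sqrt(s0^2*(1/sum(wx) + 1/sum(wy))),
// where s0^2 is the pooled estimate defined in the text.
double weighted_tscore(const std::vector<double>& x,
                       const std::vector<double>& wx,
                       const std::vector<double>& y,
                       const std::vector<double>& wy)
{
  double mx = weighted_mean(x, wx);
  double my = weighted_mean(y, wy);
  double ssx = 0.0, swx = 0.0, swx2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    ssx  += wx[i] * (x[i] - mx) * (x[i] - mx);
    swx  += wx[i];
    swx2 += wx[i] * wx[i];
  }
  double ssy = 0.0, swy = 0.0, swy2 = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    ssy  += wy[i] * (y[i] - my) * (y[i] - my);
    swy  += wy[i];
    swy2 += wy[i] * wy[i];
  }
  // Effective degrees of freedom (sum w)^2/sum(w^2) per group, minus 2.
  double dof = swx * swx / swx2 + swy * swy / swy2 - 2.0;
  double s0_sq = (ssx + ssy) / dof;
  return (mx - my) / std::sqrt(s0_sq * (1.0 / swx + 1.0 / swy));
}
\end{verbatim}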

\subsection{FoldChange}
The fold-change is simply the difference between the weighted means of
the two groups, $\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$.

\subsection{WilcoxonFoldChange}
Take all sample pairs (one sample from class $+$ and one from class $-$)
and calculate the weighted median of the differences.
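
One possible sketch, defining the weighted median as the smallest
difference at which the cumulative pair weight reaches half the total
(other conventions exist):
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// WilcoxonFoldChange sketch: weighted median of all pairwise
// differences x_j - y_i, each pair weighted by wx_j*wy_i.
double wilcoxon_foldchange(const std::vector<double>& x,
                           const std::vector<double>& wx,
                           const std::vector<double>& y,
                           const std::vector<double>& wy)
{
  std::vector<std::pair<double, double> > diff;  // (difference, weight)
  double total = 0.0;
  for (std::size_t j = 0; j < x.size(); ++j)
    for (std::size_t i = 0; i < y.size(); ++i) {
      double w = wx[j] * wy[i];
      diff.push_back(std::make_pair(x[j] - y[i], w));
      total += w;
    }
  std::sort(diff.begin(), diff.end());
  double cum = 0.0;
  for (std::size_t k = 0; k < diff.size(); ++k) {
    cum += diff[k].second;
    if (cum >= 0.5 * total)
      return diff[k].first;    // first value reaching half the weight
  }
  return diff.back().first;
}
\end{verbatim}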

\section{Kernel}
\subsection{Polynomial Kernel}
The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where
$<x,y>$ is the linear kernel (the usual scalar product). For the weighted
case we define the linear kernel to be $<x,y>=\sum {w_xw_yxy}$, and the
polynomial kernel can be calculated as before,
$(1+<x,y>)^N$. Is this kernel a proper kernel (always positive
semi-definite)? Yes: $<x,y>$ is obviously a proper kernel,
as it is a scalar product. Adding a positive constant to a kernel
yields another kernel, so $1+<x,y>$ is still a proper kernel. Then
$(1+<x,y>)^N$ is also a proper kernel, because taking a proper kernel to
the $N$th power yields a new proper kernel (see any good book on SVM).
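
A sketch of the weighted linear and polynomial kernels (illustrative,
not the yat interface):
\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted linear kernel <x,y> = sum(wx_i*wy_i*x_i*y_i).
double linear_kernel(const std::vector<double>& x,
                     const std::vector<double>& wx,
                     const std::vector<double>& y,
                     const std::vector<double>& wy)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum += wx[i] * wy[i] * x[i] * y[i];
  return sum;
}

// Weighted polynomial kernel of degree N, (1 + <x,y>)^N.
double polynomial_kernel(const std::vector<double>& x,
                         const std::vector<double>& wx,
                         const std::vector<double>& y,
                         const std::vector<double>& wy,
                         unsigned int N)
{
  return std::pow(1.0 + linear_kernel(x, wx, y, wy),
                  static_cast<double>(N));
}
\end{verbatim}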
\subsection{Gaussian Kernel}
We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum
w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions
listed in the introduction.

Is this kernel a proper kernel? Yes. Following the proof for the
non-weighted kernel we see that $K=\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$,
which is a product of two proper kernels. $\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)$ is a proper kernel, because it is a scalar product, and
$\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper
kernel, because its series expansion is a polynomial in the linear
kernel with positive coefficients. As the product of two kernels is
also a kernel, the Gaussian kernel is a proper kernel.
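
Correspondingly, a sketch of the weighted Gaussian kernel:
\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted Gaussian kernel
// exp(-sum(wx_i*wy_i*(x_i-y_i)^2) / sum(wx_i*wy_i)).
double gaussian_kernel(const std::vector<double>& x,
                       const std::vector<double>& wx,
                       const std::vector<double>& y,
                       const std::vector<double>& wy)
{
  double sum_wd2 = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double w = wx[i] * wy[i];
    sum_wd2 += w * (x[i] - y[i]) * (x[i] - y[i]);
    sum_w   += w;
  }
  return std::exp(-sum_wd2 / sum_w);
}
\end{verbatim}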

\section{Distance}

\section{Regression}
\subsection{Naive}
\subsection{Linear}
We have the model

\begin{equation} 
y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
\end{equation} 

where $\epsilon_i$ is the noise. The variance of the noise is
inversely proportional to the weight,
$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
model parameters, we minimize the weighted sum of quadratic errors,

\begin{equation}
Q_0 = \sum w_i\epsilon_i^2.
\end{equation}

Taking the derivatives with respect to $\alpha$ and $\beta$ yields two conditions,

\begin{equation} 
\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
\beta (x_i-m_x))=0
\end{equation}

and

\begin{equation} \frac{\partial Q_0}{\partial \beta} = -2 \sum
w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
\end{equation}

or equivalently

\begin{equation}
\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
\end{equation}

and

\begin{equation} \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} 
\end{equation}

Note that by setting all weights equal we get back the unweighted
case. Furthermore, we calculate the variance of the estimators of
$\alpha$ and $\beta$,

\begin{equation} 
\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
\frac{\sigma^2}{\sum w_i}
\end{equation}

and
\begin{equation}
\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
{(\sum w_i(x_i-m_x)^2)^2}=
\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
\end{equation}

Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
unweighted estimation

\begin{equation}
s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
\end{equation}

we suggest the following estimator

\begin{equation} s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
w_i-2\frac{\sum w_i^2}{\sum w_i}} \end{equation}
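
Collecting the formulas above, a standalone sketch of the weighted
linear fit (not the yat interface) could be:
\begin{verbatim}
#include <cstddef>
#include <vector>

// Weighted fit of y = alpha + beta*(x - m_x) + noise.
// Returns alpha, beta and the noise estimate s^2 via references.
void weighted_linear_fit(const std::vector<double>& x,
                         const std::vector<double>& y,
                         const std::vector<double>& w,
                         double& alpha, double& beta, double& s2)
{
  double sum_w = 0.0, sum_w2 = 0.0, mx = 0.0, my = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_w  += w[i];
    sum_w2 += w[i] * w[i];
    mx += w[i] * x[i];
    my += w[i] * y[i];
  }
  mx /= sum_w;
  my /= sum_w;
  double sxx = 0.0, sxy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sxx += w[i] * (x[i] - mx) * (x[i] - mx);
    sxy += w[i] * (x[i] - mx) * (y[i] - my);
  }
  alpha = my;                 // alpha = weighted mean of y
  beta  = sxy / sxx;          // beta  = Cov(x,y)/Var(x)
  double ss = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double r = y[i] - alpha - beta * (x[i] - mx);
    ss += w[i] * r * r;
  }
  s2 = ss / (sum_w - 2.0 * sum_w2 / sum_w);  // suggested noise estimator
}
\end{verbatim}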

\section{Outlook}
\subsection{Hierarchical clustering}
A hierarchical clustering consists of two repeated steps: finding the
two closest data points, and merging these two data points into a new
data point and calculating the new distances from this point to all
other points.

For the first step we need a distance matrix, and if we use Euclidean
distances the natural modification of the expression would be

\begin{equation}
d(x,y)=\frac{\sum w_i^xw_i^y(x_i-y_i)^2}{\sum w_i^xw_i^y} 
\end{equation} 

For the second step, inspired by average linkage, we suggest

\begin{equation} 
d(xy,z)=\frac{\sum w_i^xw_i^z(x_i-z_i)^2+\sum
w_i^yw_i^z(y_i-z_i)^2}{\sum w_i^xw_i^z+\sum w_i^yw_i^z} 
\end{equation}

to be the distance between the new merged point $xy$ and $z$, and we
also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$.
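
A sketch of the two suggested distances and the weight update for the
merged point (illustrative only):
\begin{verbatim}
#include <cstddef>
#include <vector>

// Weighted squared Euclidean distance between points x and y.
double weighted_distance(const std::vector<double>& x,
                         const std::vector<double>& wx,
                         const std::vector<double>& y,
                         const std::vector<double>& wy)
{
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    double w = wx[i] * wy[i];
    num += w * (x[i] - y[i]) * (x[i] - y[i]);
    den += w;
  }
  return num / den;
}

// Distance between the merged point xy and a point z, following the
// average-linkage inspired rule above.
double merged_distance(const std::vector<double>& x,
                       const std::vector<double>& wx,
                       const std::vector<double>& y,
                       const std::vector<double>& wy,
                       const std::vector<double>& z,
                       const std::vector<double>& wz)
{
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < z.size(); ++i) {
    double wxz = wx[i] * wz[i];
    double wyz = wy[i] * wz[i];
    num += wxz * (x[i] - z[i]) * (x[i] - z[i])
         + wyz * (y[i] - z[i]) * (y[i] - z[i]);
    den += wxz + wyz;
  }
  return num / den;
}

// Weights of the merged point: w_xy_i = w_x_i + w_y_i.
std::vector<double> merged_weights(const std::vector<double>& wx,
                                   const std::vector<double>& wy)
{
  std::vector<double> w(wx.size());
  for (std::size_t i = 0; i < wx.size(); ++i)
    w[i] = wx[i] + wy[i];
  return w;
}
\end{verbatim}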

\end{document}