\documentclass[12pt]{article}

% $Id: Statistics.tex 831 2007-03-27 13:22:09Z peter $
%
% Copyright (C) 2005 Peter Johansson
% Copyright (C) 2006 Jari Häkkinen, Markus Ringnér, Peter Johansson
% Copyright (C) 2007 Peter Johansson
%
% This file is part of the yat library, http://lev.thep.lu.se/trac/yat
%
% The yat library is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of the
% License, or (at your option) any later version.
%
% The yat library is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program; if not, write to the Free Software
% Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
% 02111-1307, USA.



\flushbottom
\footskip 54pt
\headheight 0pt
\headsep 0pt
\oddsidemargin 0pt
\parindent 0pt
\parskip 2ex
\textheight 230mm
\textwidth 165mm
\topmargin 0pt

\renewcommand{\baselinestretch} {1.0}
\renewcommand{\textfraction} {0.1}
\renewcommand{\topfraction} {1.0}
\renewcommand{\bottomfraction} {1.0}
\renewcommand{\floatpagefraction} {1.0}

\renewcommand{\d}{{\mathrm{d}}}
\newcommand{\nd}{$^{\mathrm{nd}}$}
\newcommand{\eg}{{\it {e.g.}}}
\newcommand{\ie}{{\it {i.e., }}}
\newcommand{\etal}{{\it {et al.}}}
\newcommand{\eref}[1]{Eq.~(\ref{e:#1})}
\newcommand{\fref}[1]{Fig.~\ref{f:#1}}
\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}

\begin{document}

\large
{\bf Weighted Statistics}
\normalsize

\tableofcontents
\clearpage
\section{Introduction}
There are several reasons why a statistical analysis may need to
adjust for weighting. In the literature, these reasons are mainly
divided into two groups.

The first group covers cases in which some of the measurements are
known to be more precise than others. The more precise a measurement
is, the larger weight it is given. The simplest case is when the
weights are given before the measurements and can be treated as
deterministic. It becomes more complicated when the weights cannot be
determined until afterwards, and even more complicated if the weights
depend on the value of the observable.

The second group covers cases in which averages are calculated over
one distribution while sampling from another. Weights are introduced
into the analysis to compensate for this discrepancy. A simple example
is that we are interviewing people, but for economic reasons we choose
to interview more people from the city than from the countryside. When
summarizing the statistics, the answers from the city are given a
smaller weight. In this example we choose the proportions of people
interviewed from the countryside and from the city; hence, we can
determine the weights beforehand and consider them deterministic. In
other situations the proportions are not deterministic, but rather a
result of the sampling, so the weights must be treated as stochastic,
and only in rare situations can they be treated as independent of the
observable.

Since weights can enter a statistical analysis for various reasons,
there are various ways to treat them, and in general the analysis
should be tailored to treat the weights correctly. We have not chosen
one single situation for our implementations, so see the documentation
of each specific function for the assumptions made. However, the
following holds for all implementations:
\begin{itemize}
\item Setting all weights to unity yields the same result as the
non-weighted version.
\item Rescaling the weights does not change any function.
\item Setting a weight to zero is equivalent to removing the data
point.
\end{itemize}
An important case is when the weights are binary (either 1 or 0). Then
the weighted version applied to all data gives the same result as the
non-weighted version applied to the data points with non-zero
weight. Hence, missing values can be treated properly using binary
weights and the weighted version.
\section{AveragerWeighted}

\subsection{Mean}

In every situation the weight is designed so that the weighted mean is
calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously
fulfills the conditions above.

In the case of varying measurement errors, it can be motivated that
the weight should be $w_i = 1/\sigma_i^2$. We assume the measurement
errors to be Gaussian, so the likelihood of our measurements is
$L(m)=\prod
(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$. We
maximize the likelihood by setting the derivative of its logarithm
with respect to $m$ to zero, $\frac{\d\ln L(m)}{\d m}=\sum
\frac{x_i-m}{\sigma_i^2}=0$. Hence, the Maximum Likelihood method
yields the estimator $m=\frac{\sum x_i/\sigma_i^2}{\sum
1/\sigma_i^2}$, which is the weighted mean with $w_i=1/\sigma_i^2$.

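As an illustration, here is a minimal C++ sketch of the weighted mean
computation; the function name is ours and not part of the yat API:

\begin{verbatim}
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Weighted mean m = sum(w_i*x_i) / sum(w_i).
double weighted_mean(const std::vector<double>& x,
                     const std::vector<double>& w)
{
  assert(x.size() == w.size());
  double sum_wx = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_wx += w[i] * x[i];
    sum_w += w[i];
  }
  return sum_wx / sum_w;
}

int main()
{
  std::vector<double> x = {1.0, 2.0, 4.0};
  std::vector<double> w = {1.0, 1.0, 2.0}; // last point doubly weighted
  std::cout << weighted_mean(x, w) << "\n"; // prints 2.75
}
\end{verbatim}

The sketches later in this document reuse these includes and
\texttt{weighted\_mean}.
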
\subsection{Variance}
In the case of varying variance, there is no point in estimating a
single variance, since it is different for each data point.

Instead we look at the case when we want to estimate the variance over
$f$ but are sampling from $f'$. For the mean of an observable $O$ we
have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
w_i}$. Hence, an estimator of the variance of $X$ is
\begin{eqnarray}
\sigma^2=<X^2>-<X>^2
&=&\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}
\nonumber\\
&=&\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}
=\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}
=\frac{\sum w_i(x_i-m)^2}{\sum w_i}.
\end{eqnarray}
This estimator is invariant under a rescaling of the weights, and
setting a weight to zero is equivalent to removing the data
point. Setting all weights to unity we get $\sigma^2=\frac{\sum
(x_i-m)^2}{N}$, which is the same as returned from Averager. Hence,
this estimator is slightly biased, but still very efficient.

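A corresponding sketch of this variance estimator, reusing the
hypothetical \texttt{weighted\_mean} above:

\begin{verbatim}
// Weighted variance sum(w_i*(x_i-m)^2) / sum(w_i), where m is the
// weighted mean. Slightly biased, as discussed above.
double weighted_variance(const std::vector<double>& x,
                         const std::vector<double>& w)
{
  const double m = weighted_mean(x, w);
  double sum_wd2 = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - m;
    sum_wd2 += w[i] * d * d;
    sum_w += w[i];
  }
  return sum_wd2 / sum_w;
}
\end{verbatim}
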
\subsection{Standard Error}
The standard error squared is equal to the expected squared error of
the estimate of $m$. The squared error consists of two parts, the
variance of the estimator and the squared bias:
$<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$. In the case when
weights are included in the analysis due to varying measurement errors
and the weights can be treated as deterministic, we have
\begin{equation}
Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
\frac{\sigma_0^2}{\sum w_i},
\end{equation}
where we need to estimate $\sigma_0^2$. Again we have the likelihood
$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp\left(-\frac{w_i(x_i-m)^2}{2\sigma_0^2}\right)$,
and taking the derivative with respect to $\sigma_0^2$, $\frac{\d\ln
L}{\d\sigma_0^2}=\sum\left(
-\frac{1}{2\sigma_0^2}+\frac{w_i(x_i-m)^2}{2\sigma_0^4}\right)$,
yields the estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x_i-m)^2$. This
estimator does not properly ignore data points with zero weight: such
a point contributes nothing to the sum (its actual deviation is
finite, while its expected variance is infinite), but it still
increases $N$. Therefore, we modify the expression to
$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^2}\sum
w_i(x_i-m)^2$ and get the following estimator of the variance of the
mean: $Var(m)=\frac{\sum w_i^2}{\left(\sum w_i\right)^3}\sum
w_i(x_i-m)^2$. This estimator fulfills the conditions above: adding a
data point with weight zero does not change it, rescaling the weights
does not change it, and setting all weights to unity yields the same
expression as in the non-weighted case.

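A sketch of this estimator of the variance of the mean (the standard
error is its square root); again the name is ours:

\begin{verbatim}
// Variance of the weighted mean:
// Var(m) = sum(w_i^2) / (sum(w_i))^3 * sum(w_i*(x_i-m)^2).
double weighted_mean_variance(const std::vector<double>& x,
                              const std::vector<double>& w)
{
  const double m = weighted_mean(x, w);
  double sum_w = 0.0, sum_w2 = 0.0, sum_wd2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double d = x[i] - m;
    sum_w += w[i];
    sum_w2 += w[i] * w[i];
    sum_wd2 += w[i] * d * d;
  }
  return sum_w2 / (sum_w * sum_w * sum_w) * sum_wd2;
}
\end{verbatim}
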
In cases where it is not a good approximation to treat the weights as
deterministic, there are two ways to get a better estimate. The first
is to linearize the expression $\left<\frac{\sum w_ix_i}{\sum
w_i}\right>$. The second, for more complicated situations, is to
estimate the standard error using a bootstrapping method.

\section{AveragerPairWeighted}
Here data points come in pairs $(x,y)$. We are sampling from $f'_{XY}$
but want to measure from $f_{XY}$. To compensate for this discrepancy,
averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even
though $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$, we
assume that we can factorize the ratio and get $\frac{\sum
w_xw_yg(x,y)}{\sum w_xw_y}$.
\subsection{Covariance}
Following the variance calculations for AveragerWeighted, we have
$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$, where
$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$.

\subsection{Correlation}

As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$,
the variance is estimated as $\sigma_x^2=\frac{\sum
w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case, we define
the correlation to be the ratio between the covariance and the
geometrical average of the variances,

$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
w_xw_y(y-m_y)^2}}$.

This expression fulfills the following:
\begin{itemize}
\item Having all weights equal to unity, the expression reduces to the
non-weighted expression.
\item Adding a pair of data points in which one weight is zero is
equivalent to ignoring the data pair.
\item The correlation is equal to unity if and only if $x$ is equal to
$y$. Otherwise the correlation is between $-1$ and $1$.
\end{itemize}
\section{Score}


\subsection{Pearson}

$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.

See AveragerPairWeighted correlation.

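A sketch of the weighted Pearson score; here $w_i$ is the per-pair
weight, for instance the product $w_xw_y$ from AveragerPairWeighted:

\begin{verbatim}
// Weighted Pearson correlation:
// sum(w*(x-mx)*(y-my)) / sqrt(sum(w*(x-mx)^2) * sum(w*(y-my)^2)).
double weighted_pearson(const std::vector<double>& x,
                        const std::vector<double>& y,
                        const std::vector<double>& w)
{
  const double mx = weighted_mean(x, w);
  const double my = weighted_mean(y, w);
  double sxy = 0.0, sxx = 0.0, syy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double dx = x[i] - mx;
    const double dy = y[i] - my;
    sxy += w[i] * dx * dy;
    sxx += w[i] * dx * dx;
    syy += w[i] * dy * dy;
  }
  return sxy / std::sqrt(sxx * syy);
}
\end{verbatim}
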
\subsection{ROC}

An interpretation of the ROC curve area is that, if we take one sample
from class $+$ and one sample from class $-$, it is the probability
that the sample from class $+$ has the greater value. The ROC curve
area is the fraction of pairs fulfilling this,

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
\end{equation}

A geometrical interpretation is to have a number of squares, where
each square corresponds to a pair of samples. The ROC curve follows
the border between pairs in which the sample from class $+$ has the
greater value and pairs in which this is not fulfilled. The ROC curve
area is the area of the former squares, and a natural extension is to
weight each pair with its two weights, so the weighted ROC curve area
becomes

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}.
\end{equation}

This expression is invariant under a rescaling of the weights. Adding
a data value with weight zero adds nothing to the expression, and
having all weights equal to unity yields the non-weighted ROC curve
area.

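A direct (quadratic-time) sketch of the weighted ROC curve area:

\begin{verbatim}
// Weighted ROC curve area: the weight w-_i * w+_j of each pair in
// which the class + value is the greater, divided by the total
// weight of all pairs.
double weighted_roc_area(const std::vector<double>& xneg,
                         const std::vector<double>& wneg,
                         const std::vector<double>& xpos,
                         const std::vector<double>& wpos)
{
  double hit = 0.0, total = 0.0;
  for (std::size_t i = 0; i < xneg.size(); ++i) {
    for (std::size_t j = 0; j < xpos.size(); ++j) {
      const double wij = wneg[i] * wpos[j];
      if (xneg[i] < xpos[j])
        hit += wij;
      total += wij;
    }
  }
  return hit / total;
}
\end{verbatim}
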
\subsection{tScore}

Assume that $x$ and $y$ originate from the same distribution
$N(\mu,\sigma_i^2)$, where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We
then estimate $\sigma_0^2$ as
\begin{equation}
\frac{\sum w_x(x-m_x)^2+\sum w_y(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}.
\end{equation}
The variance of the difference of the means becomes
\begin{eqnarray}
Var(m_x)+Var(m_y)&=&\frac{\sum w_x^2Var(x)}{\left(\sum
w_x\right)^2}+\frac{\sum w_y^2Var(y)}{\left(\sum w_y\right)^2}
\nonumber\\
&=&\frac{\sigma_0^2}{\sum w_x}+\frac{\sigma_0^2}{\sum w_y},
\end{eqnarray}
and consequently the estimated variance of the difference becomes
\begin{equation}
\frac{\sum w_x(x-m_x)^2+\sum w_y(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
\left(\frac{1}{\sum w_x}+\frac{1}{\sum w_y}\right);
\end{equation}
the t-score is the difference $m_x-m_y$ divided by the square root of
this expression.

For constant weights, $w_i=w$, the variance expression condenses to
\begin{equation}
\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
{n_x+n_y-2}
\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
\end{equation}
in other words, the familiar expression from the non-weighted case.

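A sketch putting the pieces of the weighted t-score together, reusing
the hypothetical \texttt{weighted\_mean}:

\begin{verbatim}
// Weighted t-score:
// t = (m_x - m_y) / sqrt(s2*(1/sum(w_x) + 1/sum(w_y))),
// with the pooled noise level s2 (sigma_0^2) estimated as above.
double weighted_tscore(const std::vector<double>& x,
                       const std::vector<double>& wx,
                       const std::vector<double>& y,
                       const std::vector<double>& wy)
{
  const double mx = weighted_mean(x, wx);
  const double my = weighted_mean(y, wy);
  double ssx = 0.0, swx = 0.0, swx2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    ssx += wx[i] * (x[i] - mx) * (x[i] - mx);
    swx += wx[i];
    swx2 += wx[i] * wx[i];
  }
  double ssy = 0.0, swy = 0.0, swy2 = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    ssy += wy[i] * (y[i] - my) * (y[i] - my);
    swy += wy[i];
    swy2 += wy[i] * wy[i];
  }
  // Effective degrees of freedom and pooled sigma_0^2 estimate.
  const double dof = swx * swx / swx2 + swy * swy / swy2 - 2.0;
  const double s2 = (ssx + ssy) / dof;
  return (mx - my) / std::sqrt(s2 * (1.0 / swx + 1.0 / swy));
}
\end{verbatim}
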
\subsection{FoldChange}
Fold-change is simply the difference between the weighted means of the
two groups: $\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$.

\subsection{WilcoxonFoldChange}
Take all pairs of samples (one from class $+$ and one from class $-$)
and calculate the weighted median of the differences.

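A sketch of this computation. The pair weight is not specified above;
we assume, by analogy with the weighted ROC curve area, that each
difference carries the weight $w^+w^-$:

\begin{verbatim}
#include <algorithm>
#include <utility>

// Weighted median: sort the values and return the first value at
// which the cumulative weight reaches half of the total weight.
double weighted_median(std::vector<std::pair<double, double> > vw)
{
  std::sort(vw.begin(), vw.end());
  double total = 0.0;
  for (std::size_t i = 0; i < vw.size(); ++i)
    total += vw[i].second;
  double cum = 0.0;
  for (std::size_t i = 0; i < vw.size(); ++i) {
    cum += vw[i].second;
    if (cum >= 0.5 * total)
      return vw[i].first;
  }
  return vw.back().first;
}

// WilcoxonFoldChange: weighted median over all pairwise differences
// x+_j - x-_i, each pair assumed to carry the weight w+_j * w-_i.
double wilcoxon_fold_change(const std::vector<double>& xpos,
                            const std::vector<double>& wpos,
                            const std::vector<double>& xneg,
                            const std::vector<double>& wneg)
{
  std::vector<std::pair<double, double> > dw;
  for (std::size_t i = 0; i < xneg.size(); ++i)
    for (std::size_t j = 0; j < xpos.size(); ++j)
      dw.push_back(std::make_pair(xpos[j] - xneg[i],
                                  wpos[j] * wneg[i]));
  return weighted_median(dw);
}
\end{verbatim}
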
\section{Kernel}
\subsection{Polynomial Kernel}
The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where
$<x,y>$ is the linear kernel (the usual scalar product). For the
weighted case we define the linear kernel to be $<x,y>=\sum
{w_xw_yxy}$, and the polynomial kernel can be calculated as before,
$(1+<x,y>)^N$. Is this kernel a proper kernel (always positive
semi-definite)? Yes: $<x,y>$ is obviously a proper kernel, as it is a
scalar product. Adding a positive constant to a kernel yields another
kernel, so $1+<x,y>$ is still a proper kernel. Then $(1+<x,y>)^N$ is
also a proper kernel, because taking a proper kernel to the $N$th
power yields a new proper kernel (see any good book on SVMs).
\subsection{Gaussian Kernel}
We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum
w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions
listed in the introduction.

Is this kernel a proper kernel? Yes. Following the proof for the
non-weighted kernel, we see that $K=\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$,
which is a product of two proper kernels: $\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)$ is a proper kernel, because it is a scalar product,
and $\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper
kernel, because it is a power series of the linear kernel with
positive coefficients. As the product of two kernels is also a kernel,
the Gaussian kernel is a proper kernel.

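Sketches of the three weighted kernels above; the function names are
ours:

\begin{verbatim}
// Weighted linear kernel <x,y> = sum(wx_i*wy_i*x_i*y_i).
double linear_kernel(const std::vector<double>& x,
                     const std::vector<double>& wx,
                     const std::vector<double>& y,
                     const std::vector<double>& wy)
{
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum += wx[i] * wy[i] * x[i] * y[i];
  return sum;
}

// Weighted polynomial kernel of degree N: (1 + <x,y>)^N.
double polynomial_kernel(const std::vector<double>& x,
                         const std::vector<double>& wx,
                         const std::vector<double>& y,
                         const std::vector<double>& wy,
                         unsigned N)
{
  return std::pow(1.0 + linear_kernel(x, wx, y, wy), N);
}

// Weighted Gaussian kernel:
// exp(-sum(wx_i*wy_i*(x_i-y_i)^2) / sum(wx_i*wy_i)).
double gaussian_kernel(const std::vector<double>& x,
                       const std::vector<double>& wx,
                       const std::vector<double>& y,
                       const std::vector<double>& wy)
{
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    num += wx[i] * wy[i] * (x[i] - y[i]) * (x[i] - y[i]);
    den += wx[i] * wy[i];
  }
  return std::exp(-num / den);
}
\end{verbatim}
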
\section{Distance}

\section{Regression}
\subsection{Naive}
\subsection{Linear}
We have the model

\begin{equation}
y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
\end{equation}

where $\epsilon_i$ is the noise. The variance of the noise is
inversely proportional to the weight,
$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
model parameters, we minimize the weighted sum of quadratic errors,

\begin{equation}
Q_0 = \sum w_i\epsilon_i^2.
\end{equation}

Taking the derivative with respect to $\alpha$ and $\beta$ yields two
conditions,

\begin{equation}
\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
\beta (x_i-m_x))=0
\end{equation}

and

\begin{equation} \frac{\partial Q_0}{\partial \beta} = -2 \sum
w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0,
\end{equation}

or equivalently

\begin{equation}
\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
\end{equation}

and

\begin{equation} \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}.
\end{equation}

Note that by setting all weights equal we get back the unweighted
case. Furthermore, we calculate the variance of the estimators of
$\alpha$ and $\beta$,

\begin{equation}
\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
\frac{\sigma^2}{\sum w_i}
\end{equation}

and
\begin{equation}
\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
{(\sum w_i(x_i-m_x)^2)^2}=
\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}.
\end{equation}

Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
unweighted estimator

\begin{equation}
s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2},
\end{equation}

we suggest the following estimator:

\begin{equation} s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
w_i-2\frac{\sum w_i^2}{\sum w_i}}. \end{equation}

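A sketch of the weighted linear fit derived above, with hypothetical
names and reusing \texttt{weighted\_mean}:

\begin{verbatim}
// Weighted linear regression y = alpha + beta*(x - m_x) + noise,
// with Var(noise_i) = sigma^2 / w_i.
struct LinearFit { double alpha, beta, s2; };

LinearFit weighted_linear_fit(const std::vector<double>& x,
                              const std::vector<double>& y,
                              const std::vector<double>& w)
{
  const double mx = weighted_mean(x, w);
  const double my = weighted_mean(y, w); // equals alpha
  double sxy = 0.0, sxx = 0.0, sum_w = 0.0, sum_w2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sxy += w[i] * (x[i] - mx) * (y[i] - my);
    sxx += w[i] * (x[i] - mx) * (x[i] - mx);
    sum_w += w[i];
    sum_w2 += w[i] * w[i];
  }
  const double beta = sxy / sxx;
  // Noise level, using the suggested estimator with denominator
  // sum(w) - 2*sum(w^2)/sum(w).
  double ss = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double r = y[i] - my - beta * (x[i] - mx);
    ss += w[i] * r * r;
  }
  const double s2 = ss / (sum_w - 2.0 * sum_w2 / sum_w);
  LinearFit fit = {my, beta, s2};
  return fit;
}
\end{verbatim}
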
\section{Outlook}
\subsection{Hierarchical clustering}
A hierarchical clustering consists of two steps: finding the two
closest data points, and merging these two data points into a new data
point and calculating the new distances from this point to all other
points.

For the first step we need a distance matrix, and if we use Euclidean
distances the natural modification of the expression would be

\begin{equation}
d(x,y)=\frac{\sum w_i^xw_i^y(x_i-y_i)^2}{\sum w_i^xw_i^y}.
\end{equation}

For the second step, inspired by average linkage, we suggest

\begin{equation}
d(xy,z)=\frac{\sum w_i^xw_i^z(x_i-z_i)^2+\sum
w_i^yw_i^z(y_i-z_i)^2}{\sum w_i^xw_i^z+\sum w_i^yw_i^z}
\end{equation}

to be the distance between the new merged point $xy$ and $z$, and we
also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$.

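A sketch of the weighted distance and of the merge rule for the
weights (the names are ours):

\begin{verbatim}
// Weighted squared Euclidean distance
// d(x,y) = sum(wx_i*wy_i*(x_i-y_i)^2) / sum(wx_i*wy_i).
double weighted_distance(const std::vector<double>& x,
                         const std::vector<double>& wx,
                         const std::vector<double>& y,
                         const std::vector<double>& wy)
{
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double wi = wx[i] * wy[i];
    num += wi * (x[i] - y[i]) * (x[i] - y[i]);
    den += wi;
  }
  return num / den;
}

// Weights of the merged point xy: w_xy_i = w_x_i + w_y_i.
std::vector<double> merged_weights(const std::vector<double>& wx,
                                   const std::vector<double>& wy)
{
  std::vector<double> w(wx.size());
  for (std::size_t i = 0; i < wx.size(); ++i)
    w[i] = wx[i] + wy[i];
  return w;
}
\end{verbatim}
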
\end{document}