source: trunk/doc/Statistics.tex @ 763

Last change on this file since 763 was 744, checked in by Peter, 15 years ago

corrected some typos

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 15.5 KB
Line 
1\documentclass[12pt]{article}
2
3% $Id: Statistics.tex 744 2007-02-10 20:16:11Z peter $
4%
5% Copyright (C) The authors contributing to this file.
6%
7% This file is part of the yat library, http://lev.thep.lu.se/trac/yat
8%
9% The yat library is free software; you can redistribute it and/or
10% modify it under the terms of the GNU General Public License as
11% published by the Free Software Foundation; either version 2 of the
12% License, or (at your option) any later version.
13%
14% The yat library is distributed in the hope that it will be useful,
15% but WITHOUT ANY WARRANTY; without even the implied warranty of
16% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
17% General Public License for more details.
18%
19% You should have received a copy of the GNU General Public License
20% along with this program; if not, write to the Free Software
21% Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
22% 02111-1307, USA.
23
24
25
26\flushbottom
27\footskip 54pt
28\headheight 0pt
29\headsep 0pt
30\oddsidemargin 0pt
31\parindent 0pt
32\parskip 2ex
33\textheight 230mm
34\textwidth 165mm
35\topmargin 0pt
36
37\renewcommand{\baselinestretch} {1.0}
38\renewcommand{\textfraction} {0.1}
39\renewcommand{\topfraction} {1.0}
40\renewcommand{\bottomfraction} {1.0}
41\renewcommand{\floatpagefraction} {1.0}
42
43\renewcommand{\d}{{\mathrm{d}}}
44\newcommand{\nd}{$^{\mathrm{nd}}$}
45\newcommand{\eg}{{\it {e.g.}}}
46\newcommand{\ie}{{\it {i.e., }}}
47\newcommand{\etal}{{\it {et al.}}}
48\newcommand{\eref}[1]{Eq.~(\ref{e:#1})}
49\newcommand{\fref}[1]{Fig.~\ref{f:#1}}
50\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}
51
52\begin{document}
53
54\large
55{\bf Weighted Statistics}
56\normalsize
57
58\tableofcontents
59\clearpage
60
61\section{Introduction}
62There are several different reasons why a statistical analysis needs
63to adjust for weighting. In literature reasons are mainly diveded in
64to groups.
65
66The first group is when some of the measurements are known to be more
67precise than others. The more precise a measurement is, the larger
68weight it is given. The simplest case is when the weight are given
69before the measurements and they can be treated as deterministic. It
70becomes more complicated when the weight can be determined not until
71afterwards, and even more complicated if the weight depends on the
72value of the observable.
73
74The second group of situations is when calculating averages over one
75distribution and sampling from another distribution. Compensating for
76this discrepency weights are introduced to the analysis. A simple
77example may be that we are interviewing people but for economical
78reasons we choose to interview more people from the city than from the
79countryside. When summarizing the statistics the answers from the city
80are given a smaller weight. In this example we are choosing the
81proportions of people from countryside and people from city being
82intervied. Hence, we can determine the weights before and consider
83them to be deterministic. In other situations the proportions are not
84deterministic, but rather a result from the sampling and the weights
85must be treated as stochastic and only in rare situations the weights
86can be treated as independent of the observable.
87
88Since there are various origins for a weight occuring in a statistical
89analysis, there are various ways to treat the weights and in general
90the analysis should be tailored to treat the weights correctly. We
91have not chosen one situation for our implementations, so see specific
92function documentation for what assumtions are made. Though, common
93for implementations are the following:
94\begin{itemize}
95\item Setting all weights to unity yields the same result as the
96non-weighted version.
97\item Rescaling the weights does not change any function.
98\item Setting a weight to zero is equivalent to removing the data point.
99\end{itemize}
100An important case is when weights are binary (either 1 or 0). Then we
101get the same result using the weighted version as using the data with
102weight not equal to zero and the non-weighted version. Hence, using
103binary weights and the weighted version missing values can be treated
104in a proper way.
105
106\section{AveragerWeighted}
107
108
109
110\subsection{Mean}
111
112For any situation the weight is always designed so the weighted mean
113is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously
114fulfills the conditions above.
115
116In the case of varying measurement error, it could be motivated that
117the weight shall be $w_i = 1/\sigma_i^2$. We assume measurement error
118to be Gaussian and the likelihood to get our measurements is
119$L(m)=\prod
120(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$.  We
121maximize the likelihood by taking the derivity with respect to $m$ on
122the logarithm of the likelihood $\frac{d\ln L(m)}{dm}=\sum
123\frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields
124the estimator $m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2}$.
125
126
127\subsection{Variance}
128In case of varying variance, there is no point estimating a variance
129since it is different for each data point.
130
131Instead we look at the case when we want to estimate the variance over
132$f$ but are sampling from $f'$. For the mean of an observable $O$ we
133have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
134w_i}$. Hence, an estimator of the variance of $X$ is
135\begin{eqnarray}
136\sigma^2=<X^2>-<X>^2=
137\\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
138\\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
139\\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
140\\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
141\end{eqnarray}
142This estimator fulfills that it is invariant under a rescaling and
143having a weight equal to zero is equivalent to removing the data
144point. Having all weights equal to unity we get $\sigma=\frac{\sum
145(x_i-m)^2}{N}$, which is the same as returned from Averager. Hence,
146this estimator is slightly biased, but still very efficient.
147
148\subsection{Standard Error}
149The standard error squared is equal to the expexted squared error of
150the estimation of $m$. The squared error consists of two parts, the
151variance of the estimator and the squared
152bias. $<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$.  In the
153case when weights are included in analysis due to varying measurement
154errors and the weights can be treated as deterministic ,we have
155\begin{equation}
156Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
157\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
158\frac{\sigma_0^2}{\sum w_i},
159\end{equation}
160where we need to estimate $\sigma_0^2$. Again we have the likelihood
161$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{-\frac{w_i(x-m)^2}{2\sigma_0^2}}$
162and taking the derivity with respect to $\sigma_o^2$, $\frac{d\ln
163L}{d\sigma_i^2}=\sum
164-\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2}$ which
165yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This
166estimator is not ignoring weights equal to zero, because deviation is
167most often smaller than the expected infinity. Therefore, we modify
168the expression as follows $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
169w_i\right)^2}\sum w_i(x-m)^2$ and we get the following estimator of
170the variance of the mean $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
171w_i\right)^3}\sum w_i(x-m)^2$. This estimator fulfills the conditions
172above: adding a weight zero does not change it: rescaling the weights
173does not change it, and setting all weights to unity yields the same
174expression as in the non-weighted case.
175
176In a case when it is not a good approximation to treat the weights as
177deterministic, there are two ways to get a better estimation. The
178first one is to linearize the expression $\left<\frac{\sum
179w_ix_i}{\sum w_i}\right>$. The second method when the situation is
180more complicated is to estimate the standard error using a
181bootstrapping method.
182
183\section{AveragerPairWeighted}
184Here data points come in pairs (x,y). We are sampling from $f'_{XY}$
185but want to measure from $f_{XY}$. To compensate for this decrepency,
186averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even
187though, $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$ we
188assume that we can factorize the ratio and get $\frac{\sum
189w_xw_yg(x,y)}{\sum w_xw_y}$
190\subsection{Covariance}
191Following the variance calculations for AveragerWeighted we have
192$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where
193$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$
194
195\subsection{correlation}
196
197As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$,
198the variance is estimated as $\sigma_x^2=\frac{\sum
199w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define
200the correlation to be the ratio between the covariance and geometrical
201average of the variances
202
203$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
204w_xw_y(y-m_y)^2}}$.
205
206This expression fulfills the following
207\begin{itemize}
208\item Having N weights the expression reduces to the non-weighted expression.
209\item Adding a pair of data, in which one weight is zero is equivalent
210to ignoring the data pair.
211\item Correlation is equal to unity if and only if $x$ is equal to
212$y$. Otherwise the correlation is between -1 and 1.
213\end{itemize}
214\section{Score}
215
216
217\subsection{Pearson}
218
219$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.
220
221See AveragerPairWeighted correlation.
222
223\subsection{ROC}
224
225An interpretation of the ROC curve area is the probability that if we
226take one sample from class $+$ and one sample from class $-$, what is
227the probability that the sample from class $+$ has greater value. The
228ROC curve area calculates the ratio of pairs fulfilling this
229
230\begin{equation}
231\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
232\end{equation}
233
234An geometrical interpretation is to have a number of squares where
235each square correspond to a pair of samples. The ROC curve follows the
236border between pairs in which the samples from class $+$ has a greater
237value and pairs in which this is not fulfilled. The ROC curve area is
238the area of those latter squares and a natural extension is to weight
239each pair with its two weights and consequently the weighted ROC curve
240area becomes
241
242\begin{equation}
243\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
244\end{equation}
245
246This expression is invariant under a rescaling of weight. Adding a
247data value with weight zero adds nothing to the exprssion, and having
248all weight equal to unity yields the non-weighted ROC curve area.
249
250\subsection{tScore}
251
252Assume that $x$ and $y$ originate from the same distribution
253$N(\mu,\sigma_i^2)$ where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We then
254estimate $\sigma_0^2$ as
255\begin{equation}
256\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
257{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
258\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
259\end{equation}
260The variance of difference of the means becomes
261\begin{eqnarray}
262Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum
263w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}=
264\frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i},
265\end{eqnarray}
266and consequently the t-score becomes
267\begin{equation}
268\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
269{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
270\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
271\left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right),
272\end{equation}
273
274For a $w_i=w$ we this expression get condensed down to
275\begin{equation}
276\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
277{n_x+n_y-2} 
278\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
279\end{equation}
280in other words the good old expression as for non-weighted.
281
282\subsection{FoldChange}
283Fold-Change is simply the difference between the weighted mean of the
284two groups //$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$
285
286\subsection{WilcoxonFoldChange}
287Taking all pair samples (one from class $+$ and one from class $-$)
288and calculating the weighted median of the distances.
289
290\section{Kernel}
291\subsection{Polynomial Kernel}
292The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where
293$<x,y>$ is the linear kernel (usual scalar product). For the weighted
294case we define the linear kernel to be $<x,y>=\sum {w_xw_yxy}$ and the
295polynomial kernel can be calculated as before
296$(1+<x,y>)^N$. Is this kernel a proper kernel (always being semi
297positive definite). Yes, because $<x,y>$ is obviously a proper kernel
298as it is a scalar product. Adding a positive constant to a kernel
299yields another kernel so $1+<x,y>$ is still a proper kernel. Then also
300$(1+<x,y>)^N$ is a proper kernel because taking a proper kernel to the
301$Nth$ power yields a new proper kernel (see any good book on SVM).
302\subsection{Gaussian Kernel}
303We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum
304w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions
305listed in the introduction.
306
307Is this kernel a proper kernel? Yes, following the proof of the
308non-weighted kernel we see that $K=\exp\left(-\frac{\sum
309w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
310w_xw_y}\right)\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$,
311which is a product of two proper kernels. $\exp\left(-\frac{\sum
312w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
313w_xw_y}\right)$ is a proper kernel, because it is a scalar product and
314$\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper
315kernel, because it a polynomial of the linear kernel with positive
316coefficients. As product of two kernel also is a kernel, the Gaussian
317kernel is a proper kernel.
318
319\section{Distance}
320
321\section{Regression}
322\subsection{Naive}
323\subsection{Linear}
324We have the model
325
326\begin{equation} 
327y_i=\alpha+\beta (x-m_x)+\epsilon_i,
328\end{equation} 
329
330where $\epsilon_i$ is the noise. The variance of the noise is
331inversely proportional to the weight,
332$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
333model parameters, we minimimize the sum of quadratic errors.
334
335\begin{equation}
336Q_0 = \sum \epsilon_i^2
337\end{equation}
338
339Taking the derivity with respect to $\alpha$ and $\beta$ yields two conditions
340
341\begin{equation} 
342\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
343\beta (x_i-m_x)=0
344\end{equation}
345
346and
347
348\begin{equation} \frac{\partial Q_0}{\partial \beta} = -2 \sum
349w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0
350\end{equation}
351
352or equivalently
353
354\begin{equation}
355\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
356\end{equation}
357
358and
359
360\begin{equation} \beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum
361w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} 
362\end{equation}
363
364Note, by having all weights equal we get back the unweighted
365case. Furthermore, we calculate the variance of the estimators of
366$\alpha$ and $\beta$.
367
368\begin{equation} 
369\textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
370\frac{\sigma^2}{\sum w_i}
371\end{equation}
372
373and
374\begin{equation}
375\textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
376{(\sum w_i(x_i-m_x)^2)^2}=
377\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
378\end{equation}
379
380Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
381unweighted estimation
382
383\begin{equation}
384s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
385\end{equation}
386
387we suggest the following estimator
388
389\begin{equation} s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
390w_i-2\frac{\sum w_i^2}{\sum w_i}} \end{equation}
391
392\section{Outlook}
393\subsection{Hierarchical clustering}
394A hierarchical clustering consists of two things: finding the two
395closest data points, merge these two data points two a new data point
396and calculate the new distances from this point to all other points.
397
398In the first item, we need a distance matrix, and if we use Euclidean
399distanses the natural modification of the expression would be
400
401\begin{equation}
402d(x,y)=\frac{\sum w_i^xw_j^y(x_i-y_i)^2}{\sum w_i^xw_j^y} 
403\end{equation} 
404
405For the second item, inspired by average linkage, we suggest
406
407\begin{equation} 
408d(xy,z)=\frac{\sum w_i^xw_j^z(x_i-z_i)^2+\sum
409w_i^yw_j^z(y_i-z_i)^2}{\sum w_i^xw_j^z+\sum w_i^yw_j^z} 
410\end{equation}
411
412to be the distance between the new merged point $xy$ and $z$, and we
413also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$
414
415\end{document}
416
417
418
Note: See TracBrowser for help on using the repository browser.