\documentclass[12pt]{article}

\usepackage{html}


\flushbottom
\footskip 54pt
\headheight 0pt
\headsep 0pt
\oddsidemargin 0pt
\parindent 0pt
\parskip 2ex
\textheight 230mm
\textwidth 165mm
\topmargin 0pt

\renewcommand{\baselinestretch} {1.0}
\renewcommand{\textfraction} {0.1}
\renewcommand{\topfraction} {1.0}
\renewcommand{\bottomfraction} {1.0}
\renewcommand{\floatpagefraction} {1.0}

\renewcommand{\d}{{\mathrm{d}}}
\newcommand{\nd}{$^{\mathrm{nd}}$}
\newcommand{\eg}{{\it {e.g.}}}
\newcommand{\ie}{{\it {i.e., }}}
\newcommand{\etal}{{\it {et al.}}}
\newcommand{\eref}[1]{Eq.~(\ref{e:#1})}
\newcommand{\fref}[1]{Fig.~\ref{f:#1}}
\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}

\begin{document}

\large
{\bf Weighted Statistics}
\normalsize
\begin{htmlonly}
This document is also available in
\htmladdnormallink{PDF}{Statistics.pdf}.
\end{htmlonly}

\tableofcontents
\clearpage
\section{Introduction}
There are several different reasons why a statistical analysis needs
to adjust for weighting. In the literature, the reasons are mainly
divided into two groups.

The first group is when some of the measurements are known to be more
precise than others. The more precise a measurement is, the larger
weight it is given. The simplest case is when the weights are given
before the measurements and can be treated as deterministic. It
becomes more complicated when the weights cannot be determined until
afterwards, and even more complicated if the weights depend on the
values of the observable.

The second group of situations is when calculating averages over one
distribution while sampling from another distribution. To compensate
for this discrepancy, weights are introduced into the analysis. A
simple example is an interview study in which, for economic reasons,
we choose to interview more people from the city than from the
countryside. When summarizing the statistics, the answers from the
city are given a smaller weight. In this example we choose the
proportions of people from the countryside and from the city being
interviewed. Hence, we can determine the weights beforehand and
consider them to be deterministic. In other situations the
proportions are not deterministic, but rather a result of the
sampling, and the weights must be treated as stochastic; only in rare
situations can the weights be treated as independent of the
observable.

Since a weight occurring in a statistical analysis can have various
origins, there are various ways to treat the weights, and in general
the analysis should be tailored to treat the weights correctly. We
have not chosen one situation for our implementations, so see the
specific function documentation for what assumptions are made.
However, the following holds for all implementations:
\begin{itemize}
\item Setting all weights to unity yields the same result as the
non-weighted version.
\item Rescaling the weights does not change any function.
\item Setting a weight to zero is equivalent to removing the data point.
\end{itemize}
An important case is when weights are binary (either 1 or 0). Then
the weighted version gives the same result as applying the
non-weighted version to only the data points with non-zero weight.
Hence, missing values can be treated properly using binary weights
and the weighted version.
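
As an illustration, the following minimal Python sketch (assuming
NumPy; the function name \texttt{weighted\_mean} is ours, not part of
the library) checks the three conditions above for the simplest
example, a weighted mean:

\begin{verbatim}
import numpy as np

def weighted_mean(x, w):
    # m = sum(w_i x_i) / sum(w_i)
    return np.sum(w * x) / np.sum(w)

x = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 0.5, 0.0])

# Unit weights reproduce the non-weighted mean.
assert np.isclose(weighted_mean(x, np.ones(3)), x.mean())
# Rescaling the weights changes nothing.
assert np.isclose(weighted_mean(x, w), weighted_mean(x, 10 * w))
# A zero weight is equivalent to removing the data point.
assert np.isclose(weighted_mean(x, w), weighted_mean(x[:2], w[:2]))
\end{verbatim}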

\section{AveragerWeighted}


\subsection{Mean}

In all situations, the weights are designed such that the weighted
mean is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which
obviously fulfills the conditions above.

In the case of varying measurement errors, it can be motivated that
the weight should be $w_i = 1/\sigma_i^2$. We assume the measurement
errors to be Gaussian, so the likelihood of our measurements is
$L(m)=\prod
(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$.  We
maximize the likelihood by setting the derivative of its logarithm
with respect to $m$ to zero, $\frac{d\ln L(m)}{dm}=\sum
\frac{x_i-m}{\sigma_i^2}=0$. Hence, the Maximum Likelihood method
yields the estimator $m=\frac{\sum x_i/\sigma_i^2}{\sum
1/\sigma_i^2}$, \ie the weighted mean with $w_i=1/\sigma_i^2$.
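
As a minimal sketch (Python with NumPy; the function name is ours),
the inverse-variance weighted mean can be computed as:

\begin{verbatim}
import numpy as np

def inverse_variance_mean(x, sigma):
    # ML estimator: m = sum(x_i / sigma_i^2) / sum(1 / sigma_i^2),
    # i.e. a weighted mean with weights w_i = 1 / sigma_i^2.
    w = 1.0 / sigma**2
    return np.sum(w * x) / np.sum(w)

x = np.array([10.1, 9.8, 10.5])
sigma = np.array([0.1, 0.1, 0.5])  # last measurement is least precise
print(inverse_variance_mean(x, sigma))  # dominated by the precise points
\end{verbatim}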


\subsection{Variance}
In the case of varying variance, there is no point in estimating a
variance since it is different for each data point.

Instead we look at the case when we want to estimate the variance over
$f$ but are sampling from $f'$. For the mean of an observable $O$ we
have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
w_i}$. Hence, an estimator of the variance of $X$ is
\begin{eqnarray}
\sigma^2=<X^2>-<X>^2=
\\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
\\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
\\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
\\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
\end{eqnarray}
This estimator is invariant under a rescaling of the weights, and
having a weight equal to zero is equivalent to removing the data
point. Having all weights equal to unity we get $\sigma^2=\frac{\sum
(x_i-m)^2}{N}$, which is the same as returned from Averager. Hence,
this estimator is slightly biased, but still very efficient.
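
A sketch of this estimator (Python with NumPy; names are ours):

\begin{verbatim}
import numpy as np

def weighted_variance(x, w):
    # sigma^2 = sum(w_i (x_i - m)^2) / sum(w_i)
    m = np.sum(w * x) / np.sum(w)
    return np.sum(w * (x - m)**2) / np.sum(w)

x = np.array([1.0, 2.0, 3.0, 4.0])
# With unit weights this is the slightly biased 1/N estimator,
# the same value as returned from Averager.
assert np.isclose(weighted_variance(x, np.ones(4)), x.var())
\end{verbatim}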

\subsection{Standard Error}
The standard error squared is equal to the expected squared error of
the estimate of $m$. The squared error consists of two parts, the
variance of the estimator and the squared bias:
$<(m-\mu)^2>=<(m-<m>+<m>-\mu)^2>=<(m-<m>)^2>+(<m>-\mu)^2$.  In the
case when weights are included in the analysis due to varying
measurement errors and the weights can be treated as deterministic, we
have
\begin{equation}
Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
\frac{\sigma_0^2}{\sum w_i},
\end{equation}
where we need to estimate $\sigma_0^2$. Again we have the likelihood
$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp\left(-\frac{w_i(x-m)^2}{2\sigma_0^2}\right)$
and taking the derivative with respect to $\sigma_0^2$, $\frac{d\ln
L}{d\sigma_0^2}=\sum\left(
-\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^4}\right)$, which
yields the estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This
estimator does not ignore weights equal to zero: such a data point
contributes nothing to the sum, but still increases $N$ and thereby
biases the estimate. Therefore, we modify the expression as follows:
$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
w_i\right)^2}\sum w_i(x-m)^2$, and we get the following estimator of
the variance of the mean: $Var(m)=\frac{\sum w_i^2}{\left(\sum
w_i\right)^3}\sum w_i(x-m)^2$. This estimator fulfills the conditions
above: adding a weight equal to zero does not change it, rescaling the
weights does not change it, and setting all weights to unity yields the
same expression as in the non-weighted case.
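
A sketch of the resulting standard error (Python with NumPy; the
function name is ours):

\begin{verbatim}
import numpy as np

def weighted_standard_error(x, w):
    # Var(m) = sum(w_i^2) / (sum(w_i))^3 * sum(w_i (x_i - m)^2)
    m = np.sum(w * x) / np.sum(w)
    var_m = np.sum(w**2) / np.sum(w)**3 * np.sum(w * (x - m)**2)
    return np.sqrt(var_m)

x = np.array([1.0, 2.0, 3.0, 4.0])
# Rescaling the weights leaves the standard error unchanged.
assert np.isclose(weighted_standard_error(x, np.ones(4)),
                  weighted_standard_error(x, 5.0 * np.ones(4)))
\end{verbatim}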

In a case when it is not a good approximation to treat the weights as
deterministic, there are two ways to get a better estimate. The
first is to linearize the expression $\left<\frac{\sum
w_ix_i}{\sum w_i}\right>$. The second method, for when the situation
is more complicated, is to estimate the standard error using a
bootstrapping method.

\section{AveragerPairWeighted}
Here data points come in pairs $(x,y)$. We are sampling from $f'_{XY}$
but want to measure from $f_{XY}$. To compensate for this discrepancy,
averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even
though $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$, we
assume that we can factorize the ratio and get $\frac{\sum
w_xw_yg(x,y)}{\sum w_xw_y}$.
\subsection{Covariance}
Following the variance calculations for AveragerWeighted, we have
$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$, where
$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$.

\subsection{Correlation}

As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$,
the variance is estimated as $\sigma_x^2=\frac{\sum
w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case, we define
the correlation to be the ratio between the covariance and the
geometric mean of the variances

$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
w_xw_y(y-m_y)^2}}$.

This expression fulfills the following
\begin{itemize}
\item Having all weights equal to unity, the expression reduces to the
non-weighted expression.
\item Adding a pair of data points in which one weight is zero is
equivalent to ignoring the data pair.
\item The correlation is equal to unity if and only if $x$ is equal to
$y$. Otherwise the correlation is between $-1$ and $1$.
\end{itemize}
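
A sketch of the weighted correlation (Python with NumPy; names are
ours). With unit weights it reduces to the ordinary Pearson
correlation:

\begin{verbatim}
import numpy as np

def weighted_correlation(x, y, wx, wy):
    w = wx * wy  # pair weight
    mx = np.sum(w * x) / np.sum(w)
    my = np.sum(w * y) / np.sum(w)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx)**2) *
                         np.sum(w * (y - my)**2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 8.0])
w = np.ones(3)
assert np.isclose(weighted_correlation(x, y, w, w),
                  np.corrcoef(x, y)[0, 1])
\end{verbatim}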
\section{Score}


\subsection{Pearson}

$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.

See AveragerPairWeighted correlation.

\subsection{ROC}

An interpretation of the area under the ROC curve is the probability
that, if we take one sample from class $+$ and one sample from class
$-$, the sample from class $+$ has the greater value. The ROC curve
area calculates the fraction of pairs fulfilling this,

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
\end{equation}

A geometrical interpretation is to have a number of squares where
each square corresponds to a pair of samples. The ROC curve follows
the border between pairs in which the sample from class $+$ has the
greater value and pairs in which this is not fulfilled. The ROC curve
area is the area of the former squares, and a natural extension is to
weight each pair with its two weights, so the weighted ROC curve area
becomes

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
\end{equation}

This expression is invariant under a rescaling of the weights. Adding
a data value with weight zero adds nothing to the expression, and
having all weights equal to unity yields the non-weighted ROC curve
area.
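
A direct sketch of the weighted ROC curve area (Python with NumPy;
the quadratic pair loop is written for clarity, not efficiency):

\begin{verbatim}
import numpy as np

def weighted_roc_area(x_neg, w_neg, x_pos, w_pos):
    # Sum w^-_i * w^+_j over pairs with x^-_i < x^+_j,
    # divided by the total weight of all pairs.
    num = 0.0
    for xn, wn in zip(x_neg, w_neg):
        for xp, wp in zip(x_pos, w_pos):
            if xn < xp:
                num += wn * wp
    return num / (np.sum(w_neg) * np.sum(w_pos))

x_neg = np.array([0.10, 0.40, 0.35])
x_pos = np.array([0.80, 0.90, 0.50])
w = np.ones(3)
print(weighted_roc_area(x_neg, w, x_pos, w))  # 1.0, fully separated
\end{verbatim}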

\subsection{tScore}

Assume that $x$ and $y$ originate from the same distribution
$N(\mu,\sigma_i^2)$, where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We then
estimate $\sigma_0^2$ as
\begin{equation}
s^2=\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
\end{equation}
The variance of the difference of the means becomes
\begin{eqnarray}
Var(m_x)+Var(m_y)=\\\frac{\sum w_x^2Var(x_i)}{\left(\sum
w_x\right)^2}+\frac{\sum w_y^2Var(y_i)}{\left(\sum w_y\right)^2}=
\frac{\sigma_0^2}{\sum w_x}+\frac{\sigma_0^2}{\sum w_y},
\end{eqnarray}
and consequently the squared standard error of the difference of the
means is estimated as
\begin{equation}
s^2
\left(\frac{1}{\sum w_x}+\frac{1}{\sum w_y}\right).
\end{equation}
The t-score is the difference of the means divided by the square root
of this expression, $t=\frac{m_x-m_y}{\sqrt{s^2\left(\frac{1}{\sum
w_x}+\frac{1}{\sum w_y}\right)}}$.

For $w_i=w$, this expression condenses to
\begin{equation}
\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
{n_x+n_y-2} 
\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
\end{equation}
in other words, the familiar expression from the non-weighted case.
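
A sketch of the weighted t-score (Python with NumPy; names are ours):

\begin{verbatim}
import numpy as np

def weighted_tscore(x, wx, y, wy):
    mx = np.sum(wx * x) / np.sum(wx)
    my = np.sum(wy * y) / np.sum(wy)
    # Pooled estimate of sigma_0^2 with effective degrees of freedom.
    dof = (np.sum(wx)**2 / np.sum(wx**2) +
           np.sum(wy)**2 / np.sum(wy**2) - 2)
    s2 = (np.sum(wx * (x - mx)**2) + np.sum(wy * (y - my)**2)) / dof
    se2 = s2 * (1.0 / np.sum(wx) + 1.0 / np.sum(wy))
    return (mx - my) / np.sqrt(se2)

x = np.array([3.1, 2.9, 3.3])
y = np.array([1.0, 1.2, 0.9])
w = np.ones(3)
print(weighted_tscore(x, w, y, w))  # ordinary two-sample t-score
\end{verbatim}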

\subsection{FoldChange}
Fold-change is simply the difference between the weighted means of the
two groups, $\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$.

\subsection{WilcoxonFoldChange}
Take all pairs of samples (one from class $+$ and one from class $-$)
and calculate the weighted median of the differences.
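
A sketch of a weighted median over all pairwise differences (Python
with NumPy; this uses one common convention for the weighted median,
the smallest value at which the cumulative weight reaches half of the
total):

\begin{verbatim}
import numpy as np

def weighted_median(values, weights):
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

# All pairwise differences between class + and class -,
# each pair weighted by the product of its two weights.
xp, wp = np.array([2.0, 3.0]), np.array([1.0, 1.0])
xn, wn = np.array([1.0, 1.5]), np.array([1.0, 0.5])
diffs = (xp[:, None] - xn[None, :]).ravel()
wpair = (wp[:, None] * wn[None, :]).ravel()
print(weighted_median(diffs, wpair))
\end{verbatim}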

\section{Kernel}
\subsection{Polynomial Kernel}
The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where
$<x,y>$ is the linear kernel (the usual scalar product). For the
weighted case we define the linear kernel to be $<x,y>=\sum
{w_xw_yxy}$, and the polynomial kernel can be calculated as before:
$(1+<x,y>)^N$. Is this kernel a proper kernel (\ie always positive
semi-definite)? Yes, because $<x,y>$ is obviously a proper kernel, as
it is a scalar product. Adding a positive constant to a kernel yields
another kernel, so $1+<x,y>$ is still a proper kernel. Then
$(1+<x,y>)^N$ is also a proper kernel, because taking a proper kernel
to the $N$th power yields a new proper kernel (see any good book on
SVM).
\subsection{Gaussian Kernel}
We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum
w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions
listed in the introduction.

Is this kernel a proper kernel? Yes. Following the proof for the
non-weighted kernel, we see that $K=\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$,
which is a product of two proper kernels. $\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)$ is a proper kernel, because it is a scalar product,
and $\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper
kernel, because it is a polynomial of the linear kernel with positive
coefficients. As the product of two kernels is also a kernel, the
Gaussian kernel is a proper kernel.
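
A sketch of both weighted kernels (Python with NumPy; names are
ours):

\begin{verbatim}
import numpy as np

def weighted_linear_kernel(x, wx, y, wy):
    # <x,y> = sum(w_x w_y x y)
    return np.sum(wx * wy * x * y)

def weighted_polynomial_kernel(x, wx, y, wy, N):
    return (1.0 + weighted_linear_kernel(x, wx, y, wy))**N

def weighted_gaussian_kernel(x, wx, y, wy):
    w = wx * wy
    return np.exp(-np.sum(w * (x - y)**2) / np.sum(w))

x = np.array([1.0, 0.0])
y = np.array([0.5, 0.5])
w = np.ones(2)
print(weighted_polynomial_kernel(x, w, y, w, N=2))  # (1 + 0.5)^2
print(weighted_gaussian_kernel(x, w, y, w))         # exp(-0.25)
\end{verbatim}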

\section{Distance}

\section{Regression}
\subsection{Naive}
\subsection{Linear}
We have the model

\begin{equation} 
y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
\end{equation} 

where $\epsilon_i$ is the noise. The variance of the noise is
inversely proportional to the weight,
$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
model parameters, we minimize the weighted sum of quadratic errors,

\begin{equation}
Q_0 = \sum w_i\epsilon_i^2.
\end{equation}

Taking the derivative with respect to $\alpha$ and $\beta$ yields two
conditions

\begin{equation} 
\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
\beta (x_i-m_x))=0
\end{equation}

and

\begin{equation} \frac{\partial Q_0}{\partial \beta} = -2 \sum
w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
\end{equation}

or equivalently

\begin{equation}
\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
\end{equation}

and

\begin{equation} \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} 
\end{equation}

Note that having all weights equal gives back the unweighted
case. Furthermore, we calculate the variance of the estimators of
$\alpha$ and $\beta$,

\begin{equation} 
\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
\frac{\sigma^2}{\sum w_i}
\end{equation}

and
\begin{equation}
\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
{(\sum w_i(x_i-m_x)^2)^2}=
\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}.
\end{equation}

Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
unweighted estimator

\begin{equation}
s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
\end{equation}

we suggest the following estimator

\begin{equation} s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
w_i-2\frac{\sum w_i^2}{\sum w_i}} \end{equation}
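
A sketch of the weighted fit (Python with NumPy; the function name is
ours):

\begin{verbatim}
import numpy as np

def weighted_linear_fit(x, y, w):
    mx = np.sum(w * x) / np.sum(w)
    alpha = np.sum(w * y) / np.sum(w)          # alpha = m_y
    beta = (np.sum(w * (x - mx) * (y - alpha)) /
            np.sum(w * (x - mx)**2))           # Cov(x,y) / Var(x)
    # Noise estimate with the weighted degrees-of-freedom correction.
    res = y - alpha - beta * (x - mx)
    s2 = (np.sum(w * res**2) /
          (np.sum(w) - 2 * np.sum(w**2) / np.sum(w)))
    return alpha, beta, s2

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(weighted_linear_fit(x, y, np.ones(4)))  # slope close to 2
\end{verbatim}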

\section{Outlook}
\subsection{Hierarchical clustering}
A hierarchical clustering consists of two steps: finding the two
closest data points, and merging these two data points into a new data
point and calculating the new distances from this point to all other
points.

For the first step, we need a distance matrix, and if we use Euclidean
distances the natural modification of the expression would be

\begin{equation}
d(x,y)=\frac{\sum w_i^xw_i^y(x_i-y_i)^2}{\sum w_i^xw_i^y} 
\end{equation} 

For the second step, inspired by average linkage, we suggest

\begin{equation} 
d(xy,z)=\frac{\sum w_i^xw_i^z(x_i-z_i)^2+\sum
w_i^yw_i^z(y_i-z_i)^2}{\sum w_i^xw_i^z+\sum w_i^yw_i^z} 
\end{equation}

to be the distance between the new merged point $xy$ and $z$, and we
also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$.
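
A sketch of the two distance expressions (Python with NumPy; names
are ours):

\begin{verbatim}
import numpy as np

def weighted_sq_euclidean(x, wx, y, wy):
    # d(x,y) = sum(w^x_i w^y_i (x_i - y_i)^2) / sum(w^x_i w^y_i)
    w = wx * wy
    return np.sum(w * (x - y)**2) / np.sum(w)

def merged_distance(x, wx, y, wy, z, wz):
    # Average-linkage-inspired distance between merged point xy and z.
    num = np.sum(wx * wz * (x - z)**2) + np.sum(wy * wz * (y - z)**2)
    den = np.sum(wx * wz) + np.sum(wy * wz)
    return num / den

x, y, z = np.array([0.0, 1.0]), np.array([0.2, 1.1]), np.array([2.0, 3.0])
w = np.ones(2)
print(weighted_sq_euclidean(x, w, y, w))
print(merged_distance(x, w, y, w, z, w))
# The merged point keeps per-coordinate weights w_xy = w_x + w_y.
w_xy = w + w
\end{verbatim}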

\end{document}