# source:trunk/doc/Statistics.tex@492

Last change on this file since 492 was 492, checked in by Peter, 16 years ago

updated Weighted Statistics document and reimplemented few stuff. NOTE, this revision may not compile. Cannot test compilation on Le Mac.

File size: 14.9 KB
Line
1\documentstyle[12pt]{article}
2
3\flushbottom
4\footskip 54pt
7\oddsidemargin 0pt
8\parindent 0pt
9\parskip 2ex
10\textheight 230mm
11\textwidth 165mm
12\topmargin 0pt
13
14\newcommand{\bea} {\begin{eqnarray}}
15\newcommand{\eea} {\end{eqnarray}}
16\newcommand{\beq} {\begin{equation}}
17\newcommand{\eeq} {\end{equation}}
18\newcommand{\bibl}[5]
19        {#1, {\it #2} {\bf #3} (#4) #5}
20\newcommand{\ol}{\overline}
21
22\renewcommand{\baselinestretch} {1.0}
23\renewcommand{\textfraction} {0.1}
24\renewcommand{\topfraction} {1.0}
25\renewcommand{\bottomfraction} {1.0}
26\renewcommand{\floatpagefraction} {1.0}
27
28\renewcommand{\d}{{\mathrm{d}}}
29\newcommand{\nd}{$^{\mathrm{nd}}$}
30\newcommand{\eg}{{\it {e.g.}}}
31\newcommand{\ie}{{\it {i.e., }}}
32\newcommand{\etal}{{\it {et al.}}}
33\newcommand{\eref}[1]{Eq.~(\ref{e:#1})}
34\newcommand{\fref}[1]{Fig.~\ref{f:#1}}
35\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}
36
37% Use these to include comments and remarks into the text, these will
38% obviously appear as footnotes in the final output.
39\newcommand{\CR}[1]{\footnote{CR: #1}}
40\newcommand{\JH}[1]{\footnote{JH: #1}}
41\newcommand{\PE}[1]{\footnote{PE: #1}}
42\newcommand{\PJ}[1]{\footnote{PJ: #1}}
43
44\begin{document}
45
46\large
47{\bf Weighted Statistics}
48\normalsize
49
50\tableofcontents
51\clearpage
52
53\section{Introduction}
54There are several different reasons why a statistical analysis needs to adjust for weighting. In literature reasons are mainly diveded in to groups.
55
56The first group is when some of the measurements are known to be more precise than others. The more precise a measuremtns is the larger weight it is given. The simplest case is when the weight are given before the measurements and they can be treated as deterministic. It becomes more complicated when the weight can be determined not until afterwards, and even more complicated if the weight depends on the value of the observable.
57
58The second group of situations is when calculating averages over one distribution and sampling from another distribution. Compensating for this discrepency weights are introduced to the analysis. A simple example may be that we are interviewing people but for economical reasons we choose to interview more people from the city than from the countryside. When summarizing the statistics the answers from the city are given a smaller weight. In this example we are choosing the proportions of people from countryside and people from city being intervied. Hence, we can determine the weights before and consider them to be deterministic. In other situations the proportions are not deterministic, but rather a result from the sampling and the weights must be treated as stochastic and only in rare situations the weights can be treated as independent of the observable.
59
60Since there are various origin for a weight occuring in a statistical analysis, there are various way to treat the weights and in general the analysis should be tailored to treat the weights correctly. We have not chosen one situation for our implementations, so see specific function documentation for what assumtions are made. Though, common for implementationare the following:
61\begin{itemize}
62\item Setting all weights to unity yields the same result as the non-weighted version.
63\item Rescaling the weights does not change any function.
64\item Setting a weight to zero is equivalent to removing the data point.
65\end{itemize}
66An important case is when weights are binary (either 1 or 0). Then we get same result using the weighted version as using the data with weight not equal to zero and the non-weighted version. Hence, using binary weights and the weighted version missing values can be treated in a proper way.
67
68\section{AveragerWeighted}
69
70
71
72\subsection{Mean}
73
74For any situation the weight is always designed so the weighted mean is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously fulfills the conditions above.
75
76In the case of varying measurement error, it could be motivated that the weight shall be $w_i = 1/\sigma_i^2$. We assume measurement error to be Gaussian and the likelihood to get our measurements is
77$L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$.
78We maximize the likelihood by taking the derivity with respect to $m$ on the logarithm of the likelihood
79$\frac{d\ln L(m)}{dm}=\sum \frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields the estimator
80$m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2}$.
81
82
83\subsection{Variance}
84In case of varying variance, there is no point estimating a variance since it is different for each data point.
85
86Instead we look at the case when we want to estimate the variance over $f$ but are sampling from $f'$. For the mean of an observable $O$ we have
87$\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum w_i}$. Hence, an estimator of the variance of $X$ is
88\begin{eqnarray}
89\sigma^2=<X^2>-<X>^2=
90\\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
91\\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}
92\\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}
93\\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
94\end{eqnarray}
95This estimator fulfills that it is invariant under a rescaling and having a weight equal to zero is equivalent to removing the data point. Having all weight equal to unity we get $\sigma=\frac{\sum (x_i-m)^2}{N}$, which is the same as returned from Averager. Hence, this estimator is slightly biased, but still very efficient.
96
97\subsection{Standard Error}
98The standard error squared is equal to the expexted squared error of the estimation of $m$. The squared error consists of two parts, the variance of the estimator and the squared bias. $<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$.
99In the case when weights are included in analysis due to varying measurement errors and the weights can be treated as deterministic ,we have
100\begin{eqnarray}
101Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
102\\\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}
103\frac{\sigma_0^2}{\sum w_i},
104\end{eqnarray}
105where we need to estimate $\sigma_0^2$. Again we have the likelihood
106$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{-\frac{w_i(x-m)^2}{2\sigma_0^2}}$ and taking the derivity with respect to $\sigma_o^2$,
107$\frac{d\ln L}{d\sigma_i^2}=\sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2}$
108which yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This estimator is not ignoring weights equal to zero, because deviation is most often smaller than the expected infinity. Therefore, we modify the expression as follows $\sigma_i^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^2}\sum w_i(x-m)^2$ and we get the following estimator of the variance of the mean
109 $\sigma_i^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^3}\sum w_i(x-m)^2$. This estimator fulfills the conditions above: adding a weight zero does not change it: rescaling the weights does not change it, and setting all weights to unity yields the same expression as in the non-weighted case.
110
111In a case when it is not a good approximation to treat the weights as deterministic, there are two ways to get a better estimation. The first one is to linearize the expression $\left<\frac{\sum w_ix_i}{\sum w_i}\right>$. The second method when the situation is more complicated is to estimate the standard error using a bootstrapping method.
112
113\section{AveragerPairWeighted}
114Here data points come in pairs (x,y). We are sampling from $f'_{XY}$ but want to measure from $f_{XY}$. To compensate for this decrepency, averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even though, $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$ we assume that we can factorize the ratio and get $\frac{\sum w_xw_yg(x,y)}{\sum w_xw_y}$
115\subsection{Covariance}
116Following the variance calculations for AveragerWeighted we have $Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$
117
118\subsection{correlation}
119
120As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$, the variance is estimated as
121$\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define the correlation to be the ratio between the covariance and geomtrical avergae of the variances
122
123$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum w_xw_y(y-m_y)^2}}$.
124
125This expression fulfills the following
126\begin{itemize}
127\item Having N weights the expression reduces to the non-weighted expression.
128\item Adding a pair of data, in which one weight is zero is equivalent to ignoring the data pair.
129\item Correlation is equal to unity if and only if $x$ is equal to $y$. Otherwise the correlation is between -1 and 1.
130\end{itemize}
131\section{Score}
132
133
134\subsection{Pearson}
135
136$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.
137
138See AveragerPairWeighted correlation.
139
140\subsection{ROC}
141
142An interpretation of the ROC curve area is the probability that if we take one sample from class $+$ and one sample from class $-$, what is the probability that the sample from class $+$ has greater value. The ROC curve area calculates the ratio of pairs fulfilling this
143
144\beq
145\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
146\eeq
147
148An geometrical interpretation is to have a number of squares where each square correspond to a pair of samples. The ROC curve follows the border between pairs in which the samples from class $+$ has a greater value and pairs in which this is not fulfilled. The ROC curve area is the area of those latter squares and a natural extension is to weight each pair with its two weights and consequently the weighted ROC curve area becomes
149
150\beq
151\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
152\eeq
153
154This expression is invariant under a rescaling of weight. Adding a data value with weight zero adds nothing to the exprssion, and having all weight equal to unity yields the non-weighted ROC curve area.
155
156\subsection{tScore}
157
158Assume that $x$ and $y$ originate from the same distribution $N(\mu,\sigma_i^2)$ where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We then estimate $\sigma_0^2$ as
159\begin{equation}
160\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
161{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
162\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
163\end{equation}
164The variance of difference of the means becomes
165\begin{eqnarray}
166Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}=
167\frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i},
168\end{eqnarray}
169and consequently the t-score becomes
170\begin{equation}
171\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
172{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
173\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
174\left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right),
175\end{equation}
176
177For a $w_i=w$ we this expression get condensed down to
178\begin{equation}
179\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
180{n_x+n_y-2}
181\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
182\end{equation}
183in other words the good old expression as for non-weighted.
184
185\subsection{FoldChange}
186Fold-Change is simply the difference between the weighted mean of the two groups //$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$
187
188\subsection{WilcoxonFoldChange}
189Taking all pair samples (one from class $+$ and one from class $-$) and calculating the weighted median of the distances.
190
191\section{Kernel}
192\subsection{Polynomial Kernel}
193The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where $<x,y>$ is the linear kenrel (usual scalar product). For weights we define the linear kernel to be $<x,y>=\frac{\sum w_xw_yxy}{\sum w_xw_y}$ and the polynomial kernel can be calculated as before $(1+<x,y>)^N$. Is this kernel a proper kernel (always being semi positive definite). Yes, because $<x,y>$ is obviously a proper kernel as it is a scalar product. Adding a positive constant to a kernel yields another kernel so $1+<x,y>$ is still a proper kernel. Then also $(1+<x,y>)^N$ is a proper kernel because taking a proper kernel to the $Nth$ power yields a new proper kernel (see any good book on SVM).
194\subsection{Gaussian Kernel}
195We define the weighted Gaussian kernel as
196$\exp\left(-\frac{\sum w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions listed in the introduction.
197
198Is this kernel a proper kernel? Yes, following the proof of the non-weighted kernel we see that $K=\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$, which is a product of two proper kernels. $\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)$ is a proper kernel, because it is a scalar product and $\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper kernel, because it a polynomial of the linear kernel with positive coefficients. As product of two kernel also is a kernel, the Gaussian kernel is a proper kernel.
199
200\section{Distance}
201
202\section{Regression}
203\subsection{Naive}
204\subsection{Linear}
205We have the model
206
207\beq
208y_i=\alpha+\beta (x-m_x)+\epsilon_i,
209\eeq
210
211where $\epsilon_i$ is the noise. The variance of the noise is
212inversely proportional to the weight,
213$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
214model parameters, we minimimize the sum of quadratic errors.
215
216\beq
217Q_0 = \sum \epsilon_i^2
218\eeq
219
220Taking the derivity with respect to $\alpha$ and $\beta$ yields two conditions
221
222\beq
223\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - \beta (x_i-m_x)=0
224\eeq
225
226and
227
228\beq
229\frac{\partial Q_0}{\partial \beta} = -2 \sum w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0
230\eeq
231
232or equivalently
233
234\beq
235\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
236\eeq
237
238and
239
240\beq
241\beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
242\eeq
243
244Note, by having all weights equal we get back the unweighted
245case. Furthermore, we calculate the variance of the estimators of
246$\alpha$ and $\beta$.
247
248\beq
249\textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
250\frac{\sigma^2}{\sum w_i}
251\eeq
252
253and
254\beq
255\textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
256{(\sum w_i(x_i-m_x)^2)^2}=
257\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
258\eeq
259
260Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
261unweighted estimation
262
263\beq
264s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
265\eeq
266
267we suggest the following estimator
268
269\beq
270s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum w_i-2\frac{\sum w_i^2}{\sum w_i}}
271\eeq
272
273\section{Outlook}
274\subsection{Hierarchical clustering}
275\label{hc}
276A hierarchical clustering consists of two things: finding the two
277closest data points, merge these two data points two a new data point
278and calculate the new distances from this point to all other points.\\
279
280In the first item, we need a distance matrix, and if we use Euclidean
281distanses the natural modification of the expression would be
282
283\beq
284d(x,y)=\frac{\sum w_i^xw_j^y(x_i-y_i)^2}{\sum w_i^xw_j^y} \eeq \\
285
286For the second item, inspired by average linkage, we suggest
287
288\beq
289d(xy,z)=\frac{\sum w_i^xw_j^z(x_i-z_i)^2+\sum w_i^yw_j^z(y_i-z_i)^2}{\sum w_i^xw_j^z+\sum w_i^yw_j^z}
290\eeq
291
292to be the distance between the new merged point $xy$ and $z$, and we
293also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$
294
295\end{document}
296
297
298
Note: See TracBrowser for help on using the repository browser.