1 | \documentclass[12pt]{article} |
---|
2 | |
---|
3 | \usepackage{html} |
---|
4 | |
---|
5 | |
---|
6 | \flushbottom |
---|
7 | \footskip 54pt |
---|
8 | \headheight 0pt |
---|
9 | \headsep 0pt |
---|
10 | \oddsidemargin 0pt |
---|
11 | \parindent 0pt |
---|
12 | \parskip 2ex |
---|
13 | \textheight 230mm |
---|
14 | \textwidth 165mm |
---|
15 | \topmargin 0pt |
---|
16 | |
---|
17 | \renewcommand{\baselinestretch} {1.0} |
---|
18 | \renewcommand{\textfraction} {0.1} |
---|
19 | \renewcommand{\topfraction} {1.0} |
---|
20 | \renewcommand{\bottomfraction} {1.0} |
---|
21 | \renewcommand{\floatpagefraction} {1.0} |
---|
22 | |
---|
23 | \renewcommand{\d}{{\mathrm{d}}} |
---|
24 | \newcommand{\nd}{$^{\mathrm{nd}}$} |
---|
25 | \newcommand{\eg}{{\it {e.g.}}} |
---|
26 | \newcommand{\ie}{{\it {i.e., }}} |
---|
27 | \newcommand{\etal}{{\it {et al.}}} |
---|
28 | \newcommand{\eref}[1]{Eq.~(\ref{e:#1})} |
---|
29 | \newcommand{\fref}[1]{Fig.~\ref{f:#1}} |
---|
30 | \newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)} |
---|
31 | |
---|
32 | \begin{document} |
---|
33 | |
---|
34 | \large |
---|
35 | {\bf Weighted Statistics} |
---|
36 | \normalsize |
---|
37 | \begin{htmlonly} |
---|
38 | This document is also available in |
---|
39 | \htmladdnormallink{PDF}{Statistics.pdf}. |
---|
40 | \end{htmlonly} |
---|
41 | |
---|
42 | \tableofcontents |
---|
43 | \clearpage |
---|
44 | |
---|
45 | \section{Introduction} |
---|
46 | There are several different reasons why a statistical analysis needs |
---|
47 | to adjust for weighting. In literature reasons are mainly diveded in |
---|
48 | to groups. |
---|
49 | |
---|
50 | The first group is when some of the measurements are known to be more |
---|
51 | precise than others. The more precise a measuremtns is the larger |
---|
52 | weight it is given. The simplest case is when the weight are given |
---|
53 | before the measurements and they can be treated as deterministic. It |
---|
54 | becomes more complicated when the weight can be determined not until |
---|
55 | afterwards, and even more complicated if the weight depends on the |
---|
56 | value of the observable. |
---|
57 | |
---|
58 | The second group of situations is when calculating averages over one |
---|
59 | distribution and sampling from another distribution. Compensating for |
---|
60 | this discrepency weights are introduced to the analysis. A simple |
---|
61 | example may be that we are interviewing people but for economical |
---|
62 | reasons we choose to interview more people from the city than from the |
---|
63 | countryside. When summarizing the statistics the answers from the city |
---|
64 | are given a smaller weight. In this example we are choosing the |
---|
65 | proportions of people from countryside and people from city being |
---|
66 | intervied. Hence, we can determine the weights before and consider |
---|
67 | them to be deterministic. In other situations the proportions are not |
---|
68 | deterministic, but rather a result from the sampling and the weights |
---|
69 | must be treated as stochastic and only in rare situations the weights |
---|
70 | can be treated as independent of the observable. |
---|
71 | |
---|
72 | Since there are various origin for a weight occuring in a statistical |
---|
73 | analysis, there are various way to treat the weights and in general |
---|
74 | the analysis should be tailored to treat the weights correctly. We |
---|
75 | have not chosen one situation for our implementations, so see specific |
---|
76 | function documentation for what assumtions are made. Though, common |
---|
77 | for implementationare the following: |
---|
78 | \begin{itemize} |
---|
79 | \item Setting all weights to unity yields the same result as the |
---|
80 | non-weighted version. |
---|
81 | \item Rescaling the weights does not change any function. |
---|
82 | \item Setting a weight to zero is equivalent to removing the data point. |
---|
83 | \end{itemize} |
---|
84 | An important case is when weights are binary (either 1 or 0). Then we |
---|
85 | get same result using the weighted version as using the data with |
---|
86 | weight not equal to zero and the non-weighted version. Hence, using |
---|
87 | binary weights and the weighted version missing values can be treated |
---|
88 | in a proper way. |
---|
89 | |
---|
90 | \section{AveragerWeighted} |
---|
91 | |
---|
92 | |
---|
93 | |
---|
94 | \subsection{Mean} |
---|
95 | |
---|
96 | For any situation the weight is always designed so the weighted mean |
---|
97 | is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously |
---|
98 | fulfills the conditions above. |
---|
99 | |
---|
100 | In the case of varying measurement error, it could be motivated that |
---|
101 | the weight shall be $w_i = 1/\sigma_i^2$. We assume measurement error |
---|
102 | to be Gaussian and the likelihood to get our measurements is |
---|
103 | $L(m)=\prod |
---|
104 | (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$. We |
---|
105 | maximize the likelihood by taking the derivity with respect to $m$ on |
---|
106 | the logarithm of the likelihood $\frac{d\ln L(m)}{dm}=\sum |
---|
107 | \frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields |
---|
108 | the estimator $m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2}$. |
---|
109 | |
---|
110 | |
---|
111 | \subsection{Variance} |
---|
112 | In case of varying variance, there is no point estimating a variance |
---|
113 | since it is different for each data point. |
---|
114 | |
---|
115 | Instead we look at the case when we want to estimate the variance over |
---|
116 | $f$ but are sampling from $f'$. For the mean of an observable $O$ we |
---|
117 | have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum |
---|
118 | w_i}$. Hence, an estimator of the variance of $X$ is |
---|
119 | \begin{eqnarray} |
---|
120 | \sigma^2=<X^2>-<X>^2= |
---|
121 | \\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}= |
---|
122 | \\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i} |
---|
123 | \\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i} |
---|
124 | \\\frac{\sum w_i(x_i-m)^2}{\sum w_i} |
---|
125 | \end{eqnarray} |
---|
126 | This estimator fulfills that it is invariant under a rescaling and |
---|
127 | having a weight equal to zero is equivalent to removing the data |
---|
128 | point. Having all weight equal to unity we get $\sigma=\frac{\sum |
---|
129 | (x_i-m)^2}{N}$, which is the same as returned from Averager. Hence, |
---|
130 | this estimator is slightly biased, but still very efficient. |
---|
131 | |
---|
132 | \subsection{Standard Error} |
---|
133 | The standard error squared is equal to the expexted squared error of |
---|
134 | the estimation of $m$. The squared error consists of two parts, the |
---|
135 | variance of the estimator and the squared |
---|
136 | bias. $<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$. In the |
---|
137 | case when weights are included in analysis due to varying measurement |
---|
138 | errors and the weights can be treated as deterministic ,we have |
---|
139 | \begin{equation} |
---|
140 | Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}= |
---|
141 | \frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}= |
---|
142 | \frac{\sigma_0^2}{\sum w_i}, |
---|
143 | \end{equation} |
---|
144 | where we need to estimate $\sigma_0^2$. Again we have the likelihood |
---|
145 | $L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{-\frac{w_i(x-m)^2}{2\sigma_0^2}}$ |
---|
146 | and taking the derivity with respect to $\sigma_o^2$, $\frac{d\ln |
---|
147 | L}{d\sigma_i^2}=\sum |
---|
148 | -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2}$ which |
---|
149 | yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This |
---|
150 | estimator is not ignoring weights equal to zero, because deviation is |
---|
151 | most often smaller than the expected infinity. Therefore, we modify |
---|
152 | the expression as follows $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum |
---|
153 | w_i\right)^2}\sum w_i(x-m)^2$ and we get the following estimator of |
---|
154 | the variance of the mean $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum |
---|
155 | w_i\right)^3}\sum w_i(x-m)^2$. This estimator fulfills the conditions |
---|
156 | above: adding a weight zero does not change it: rescaling the weights |
---|
157 | does not change it, and setting all weights to unity yields the same |
---|
158 | expression as in the non-weighted case. |
---|
159 | |
---|
160 | In a case when it is not a good approximation to treat the weights as |
---|
161 | deterministic, there are two ways to get a better estimation. The |
---|
162 | first one is to linearize the expression $\left<\frac{\sum |
---|
163 | w_ix_i}{\sum w_i}\right>$. The second method when the situation is |
---|
164 | more complicated is to estimate the standard error using a |
---|
165 | bootstrapping method. |
---|
166 | |
---|
167 | \section{AveragerPairWeighted} |
---|
168 | Here data points come in pairs (x,y). We are sampling from $f'_{XY}$ |
---|
169 | but want to measure from $f_{XY}$. To compensate for this decrepency, |
---|
170 | averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even |
---|
171 | though, $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$ we |
---|
172 | assume that we can factorize the ratio and get $\frac{\sum |
---|
173 | w_xw_yg(x,y)}{\sum w_xw_y}$ |
---|
174 | \subsection{Covariance} |
---|
175 | Following the variance calculations for AveragerWeighted we have |
---|
176 | $Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where |
---|
177 | $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$ |
---|
178 | |
---|
179 | \subsection{correlation} |
---|
180 | |
---|
181 | As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$, |
---|
182 | the variance is estimated as $\sigma_x^2=\frac{\sum |
---|
183 | w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define |
---|
184 | the correlation to be the ratio between the covariance and geomtrical |
---|
185 | avergae of the variances |
---|
186 | |
---|
187 | $\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum |
---|
188 | w_xw_y(y-m_y)^2}}$. |
---|
189 | |
---|
190 | This expression fulfills the following |
---|
191 | \begin{itemize} |
---|
192 | \item Having N weights the expression reduces to the non-weighted expression. |
---|
193 | \item Adding a pair of data, in which one weight is zero is equivalent |
---|
194 | to ignoring the data pair. |
---|
195 | \item Correlation is equal to unity if and only if $x$ is equal to |
---|
196 | $y$. Otherwise the correlation is between -1 and 1. |
---|
197 | \end{itemize} |
---|
198 | \section{Score} |
---|
199 | |
---|
200 | |
---|
201 | \subsection{Pearson} |
---|
202 | |
---|
203 | $\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$. |
---|
204 | |
---|
205 | See AveragerPairWeighted correlation. |
---|
206 | |
---|
207 | \subsection{ROC} |
---|
208 | |
---|
209 | An interpretation of the ROC curve area is the probability that if we |
---|
210 | take one sample from class $+$ and one sample from class $-$, what is |
---|
211 | the probability that the sample from class $+$ has greater value. The |
---|
212 | ROC curve area calculates the ratio of pairs fulfilling this |
---|
213 | |
---|
214 | \begin{equation} |
---|
215 | \frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}. |
---|
216 | \end{equation} |
---|
217 | |
---|
218 | An geometrical interpretation is to have a number of squares where |
---|
219 | each square correspond to a pair of samples. The ROC curve follows the |
---|
220 | border between pairs in which the samples from class $+$ has a greater |
---|
221 | value and pairs in which this is not fulfilled. The ROC curve area is |
---|
222 | the area of those latter squares and a natural extension is to weight |
---|
223 | each pair with its two weights and consequently the weighted ROC curve |
---|
224 | area becomes |
---|
225 | |
---|
226 | \begin{equation} |
---|
227 | \frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j} |
---|
228 | \end{equation} |
---|
229 | |
---|
230 | This expression is invariant under a rescaling of weight. Adding a |
---|
231 | data value with weight zero adds nothing to the exprssion, and having |
---|
232 | all weight equal to unity yields the non-weighted ROC curve area. |
---|
233 | |
---|
234 | \subsection{tScore} |
---|
235 | |
---|
236 | Assume that $x$ and $y$ originate from the same distribution |
---|
237 | $N(\mu,\sigma_i^2)$ where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We then |
---|
238 | estimate $\sigma_0^2$ as |
---|
239 | \begin{equation} |
---|
240 | \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} |
---|
241 | {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ |
---|
242 | \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} |
---|
243 | \end{equation} |
---|
244 | The variance of difference of the means becomes |
---|
245 | \begin{eqnarray} |
---|
246 | Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum |
---|
247 | w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}= |
---|
248 | \frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i}, |
---|
249 | \end{eqnarray} |
---|
250 | and consequently the t-score becomes |
---|
251 | \begin{equation} |
---|
252 | \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2} |
---|
253 | {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+ |
---|
254 | \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2} |
---|
255 | \left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right), |
---|
256 | \end{equation} |
---|
257 | |
---|
258 | For a $w_i=w$ we this expression get condensed down to |
---|
259 | \begin{equation} |
---|
260 | \frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2} |
---|
261 | {n_x+n_y-2} |
---|
262 | \left(\frac{1}{wn_x}+\frac{1}{wn_y}\right), |
---|
263 | \end{equation} |
---|
264 | in other words the good old expression as for non-weighted. |
---|
265 | |
---|
266 | \subsection{FoldChange} |
---|
267 | Fold-Change is simply the difference between the weighted mean of the |
---|
268 | two groups //$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$ |
---|
269 | |
---|
270 | \subsection{WilcoxonFoldChange} |
---|
271 | Taking all pair samples (one from class $+$ and one from class $-$) |
---|
272 | and calculating the weighted median of the distances. |
---|
273 | |
---|
274 | \section{Kernel} |
---|
275 | \subsection{Polynomial Kernel} |
---|
276 | The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where |
---|
277 | $<x,y>$ is the linear kenrel (usual scalar product). For weights we |
---|
278 | define the linear kernel to be $<x,y>=\frac{\sum w_xw_yxy}{\sum |
---|
279 | w_xw_y}$ and the polynomial kernel can be calculated as before |
---|
280 | $(1+<x,y>)^N$. Is this kernel a proper kernel (always being semi |
---|
281 | positive definite). Yes, because $<x,y>$ is obviously a proper kernel |
---|
282 | as it is a scalar product. Adding a positive constant to a kernel |
---|
283 | yields another kernel so $1+<x,y>$ is still a proper kernel. Then also |
---|
284 | $(1+<x,y>)^N$ is a proper kernel because taking a proper kernel to the |
---|
285 | $Nth$ power yields a new proper kernel (see any good book on SVM). |
---|
286 | \subsection{Gaussian Kernel} |
---|
287 | We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum |
---|
288 | w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions |
---|
289 | listed in the introduction. |
---|
290 | |
---|
291 | Is this kernel a proper kernel? Yes, following the proof of the |
---|
292 | non-weighted kernel we see that $K=\exp\left(-\frac{\sum |
---|
293 | w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum |
---|
294 | w_xw_y}\right)\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$, |
---|
295 | which is a product of two proper kernels. $\exp\left(-\frac{\sum |
---|
296 | w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum |
---|
297 | w_xw_y}\right)$ is a proper kernel, because it is a scalar product and |
---|
298 | $\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper |
---|
299 | kernel, because it a polynomial of the linear kernel with positive |
---|
300 | coefficients. As product of two kernel also is a kernel, the Gaussian |
---|
301 | kernel is a proper kernel. |
---|
302 | |
---|
303 | \section{Distance} |
---|
304 | |
---|
305 | \section{Regression} |
---|
306 | \subsection{Naive} |
---|
307 | \subsection{Linear} |
---|
308 | We have the model |
---|
309 | |
---|
310 | \begin{equation} |
---|
311 | y_i=\alpha+\beta (x-m_x)+\epsilon_i, |
---|
312 | \end{equation} |
---|
313 | |
---|
314 | where $\epsilon_i$ is the noise. The variance of the noise is |
---|
315 | inversely proportional to the weight, |
---|
316 | $Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the |
---|
317 | model parameters, we minimimize the sum of quadratic errors. |
---|
318 | |
---|
319 | \begin{equation} |
---|
320 | Q_0 = \sum \epsilon_i^2 |
---|
321 | \end{equation} |
---|
322 | |
---|
323 | Taking the derivity with respect to $\alpha$ and $\beta$ yields two conditions |
---|
324 | |
---|
325 | \begin{equation} |
---|
326 | \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - |
---|
327 | \beta (x_i-m_x)=0 |
---|
328 | \end{equation} |
---|
329 | |
---|
330 | and |
---|
331 | |
---|
332 | \begin{equation} \frac{\partial Q_0}{\partial \beta} = -2 \sum |
---|
333 | w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0 |
---|
334 | \end{equation} |
---|
335 | |
---|
336 | or equivalently |
---|
337 | |
---|
338 | \begin{equation} |
---|
339 | \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y |
---|
340 | \end{equation} |
---|
341 | |
---|
342 | and |
---|
343 | |
---|
344 | \begin{equation} \beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum |
---|
345 | w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)} |
---|
346 | \end{equation} |
---|
347 | |
---|
348 | Note, by having all weights equal we get back the unweighted |
---|
349 | case. Furthermore, we calculate the variance of the estimators of |
---|
350 | $\alpha$ and $\beta$. |
---|
351 | |
---|
352 | \begin{equation} |
---|
353 | \textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}= |
---|
354 | \frac{\sigma^2}{\sum w_i} |
---|
355 | \end{equation} |
---|
356 | |
---|
357 | and |
---|
358 | \begin{equation} |
---|
359 | \textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}} |
---|
360 | {(\sum w_i(x_i-m_x)^2)^2}= |
---|
361 | \frac{\sigma^2}{\sum w_i(x_i-m_x)^2} |
---|
362 | \end{equation} |
---|
363 | |
---|
364 | Finally, we estimate the level of noise, $\sigma^2$. Inspired by the |
---|
365 | unweighted estimation |
---|
366 | |
---|
367 | \begin{equation} |
---|
368 | s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2} |
---|
369 | \end{equation} |
---|
370 | |
---|
371 | we suggest the following estimator |
---|
372 | |
---|
373 | \begin{equation} s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum |
---|
374 | w_i-2\frac{\sum w_i^2}{\sum w_i}} \end{equation} |
---|
375 | |
---|
376 | \section{Outlook} |
---|
377 | \subsection{Hierarchical clustering} |
---|
378 | A hierarchical clustering consists of two things: finding the two |
---|
379 | closest data points, merge these two data points two a new data point |
---|
380 | and calculate the new distances from this point to all other points. |
---|
381 | |
---|
382 | In the first item, we need a distance matrix, and if we use Euclidean |
---|
383 | distanses the natural modification of the expression would be |
---|
384 | |
---|
385 | \begin{equation} |
---|
386 | d(x,y)=\frac{\sum w_i^xw_j^y(x_i-y_i)^2}{\sum w_i^xw_j^y} |
---|
387 | \end{equation} |
---|
388 | |
---|
389 | For the second item, inspired by average linkage, we suggest |
---|
390 | |
---|
391 | \begin{equation} |
---|
392 | d(xy,z)=\frac{\sum w_i^xw_j^z(x_i-z_i)^2+\sum |
---|
393 | w_i^yw_j^z(y_i-z_i)^2}{\sum w_i^xw_j^z+\sum w_i^yw_j^z} |
---|
394 | \end{equation} |
---|
395 | |
---|
396 | to be the distance between the new merged point $xy$ and $z$, and we |
---|
397 | also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$ |
---|
398 | |
---|
399 | \end{document} |
---|
400 | |
---|
401 | |
---|
402 | |
---|