Changeset 494 for trunk/doc/Statistics.tex
Timestamp: Jan 10, 2006, 2:44:14 PM
File: 1 edited
trunk/doc/Statistics.tex
\documentclass[12pt]{article}

\usepackage{html}

\flushbottom
…
\topmargin 0pt

\renewcommand{\baselinestretch} {1.0}
\renewcommand{\textfraction} {0.1}
…
\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}

\begin{document}

…
{\bf Weighted Statistics}
\normalsize
\begin{htmlonly}
This document is also available in
\htmladdnormallink{PDF}{Statistics.pdf}.
\end{htmlonly}

\tableofcontents
…

\section{Introduction}
There are several different reasons why a statistical analysis needs
to adjust for weighting. In the literature the reasons are mainly
divided into two groups.

The first group is when some of the measurements are known to be more
precise than others. The more precise a measurement is, the larger the
weight it is given. The simplest case is when the weights are given
before the measurements and can be treated as deterministic. It
becomes more complicated when the weights cannot be determined until
afterwards, and even more complicated if the weight depends on the
value of the observable.
The second group of situations is when calculating averages over one
distribution while sampling from another distribution. To compensate
for this discrepancy, weights are introduced into the analysis. A
simple example is that we are interviewing people, but for economic
reasons we choose to interview more people from the city than from the
countryside. When summarizing the statistics, the answers from the
city are given a smaller weight. In this example we choose the
proportions of people from the countryside and people from the city
being interviewed. Hence, we can determine the weights beforehand and
consider them to be deterministic. In other situations the proportions
are not deterministic, but rather a result of the sampling, and the
weights must be treated as stochastic; only in rare situations can the
weights be treated as independent of the observable.

Since a weight can enter a statistical analysis for various reasons,
there are various ways to treat the weights, and in general the
analysis should be tailored to treat the weights correctly. We have
not chosen one situation for our implementations, so see the specific
function documentation for what assumptions are made. However, the
implementations have the following in common:
\begin{itemize}
\item Setting all weights to unity yields the same result as the
non-weighted version.
\item Rescaling the weights does not change any function.
\item Setting a weight to zero is equivalent to removing the data point.
\end{itemize}
An important case is when weights are binary (either 1 or 0). Then the
weighted version gives the same result as running the non-weighted
version on the data points with non-zero weight. Hence, using binary
weights and the weighted version, missing values can be treated in a
proper way.

\section{AveragerWeighted}
…

\subsection{Mean}

For any situation the weight is designed so that the weighted mean is
calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously
fulfills the conditions above.
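To make the formula concrete, the following C++ sketch computes the
weighted mean $m=\frac{\sum w_ix_i}{\sum w_i}$. It is only an
illustration of the expression above, not the interface of the
AveragerWeighted class; the function name and types are invented for
this example.

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <vector>

// m = sum(w_i * x_i) / sum(w_i).  Unit weights give the ordinary mean,
// a zero weight removes that point, and rescaling all weights by a
// common factor leaves the result unchanged.
double weighted_mean(const std::vector<double>& x,
                     const std::vector<double>& w)
{
  assert(x.size() == w.size());
  double sum_wx = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_wx += w[i] * x[i];
    sum_w += w[i];
  }
  return sum_wx / sum_w;
}
\end{verbatim}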
In the case of varying measurement error, it can be motivated that the
weight should be $w_i = 1/\sigma_i^2$. We assume the measurement errors
to be Gaussian, and the likelihood of obtaining our measurements is
$L(m)=\prod
(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$. We
maximize the likelihood by taking the derivative of the logarithm of
the likelihood with respect to $m$, $\frac{d\ln L(m)}{dm}=\sum
\frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields
the estimator $m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}$.

\subsection{Variance}
In the case of varying variance, there is no point in estimating a
single variance since it is different for each data point.

Instead we look at the case when we want to estimate the variance over
$f$ but are sampling from $f'$. For the mean of an observable $O$ we
have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
w_i}$. Hence, an estimator of the variance of $X$ is
\begin{eqnarray}
\sigma^2=<X^2>-<X>^2=
\cdots
=\frac{\sum w_i(x_i-m)^2}{\sum w_i}
\end{eqnarray}
This estimator is invariant under a rescaling of the weights, and a
weight equal to zero is equivalent to removing the data point. With
all weights equal to unity we get $\sigma^2=\frac{\sum (x_i-m)^2}{N}$,
which is the same as returned from Averager. Hence, this estimator is
slightly biased, but still very efficient.
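A corresponding sketch of the variance estimator
$\frac{\sum w_i(x_i-m)^2}{\sum w_i}$ follows; as before it is an
illustration of the formula, not the library interface, and the
function name is invented for this example.

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <vector>

// sigma^2 = sum(w_i * (x_i - m)^2) / sum(w_i), with m the weighted mean.
// With unit weights this reduces to sum((x_i - m)^2) / N, i.e. the
// slightly biased estimator returned by Averager.
double weighted_variance(const std::vector<double>& x,
                         const std::vector<double>& w)
{
  assert(x.size() == w.size());
  double sum_w = 0.0;
  double sum_wx = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_w += w[i];
    sum_wx += w[i] * x[i];
  }
  const double m = sum_wx / sum_w;
  double sum_wdev2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum_wdev2 += w[i] * (x[i] - m) * (x[i] - m);
  return sum_wdev2 / sum_w;
}
\end{verbatim}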
\subsection{Standard Error}
The standard error squared is equal to the expected squared error of
the estimate of $m$. The squared error consists of two parts, the
variance of the estimator and the squared bias:
$<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$. In the case when
weights are included in the analysis due to varying measurement errors
and the weights can be treated as deterministic, we have
\begin{equation}
Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
\frac{\sigma_0^2}{\sum w_i},
\end{equation}
where we need to estimate $\sigma_0^2$. Again we have the likelihood
$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp\left(-\frac{w_i(x_i-m)^2}{2\sigma_0^2}\right)$,
and taking the derivative with respect to $\sigma_0^2$,
$\frac{d\ln L}{d\sigma_0^2}=\sum
-\frac{1}{2\sigma_0^2}+\frac{w_i(x_i-m)^2}{2\sigma_0^4}$,
yields the estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x_i-m)^2$. This
estimator does not properly ignore weights equal to zero, because the
observed deviation of such a point is most often smaller than the
expected infinite deviation. Therefore, we modify the expression as
follows, $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^2}\sum
w_i(x_i-m)^2$, and we get the following estimator of the variance of
the mean, $Var(m)=\frac{\sum w_i^2}{\left(\sum w_i\right)^3}\sum
w_i(x_i-m)^2$. This estimator fulfills the conditions above: adding a
data point with weight zero does not change it, rescaling the weights
does not change it, and setting all weights to unity yields the same
expression as in the non-weighted case.

When it is not a good approximation to treat the weights as
deterministic, there are two ways to get a better estimate. The first
is to linearize the expression $\left<\frac{\sum w_ix_i}{\sum
w_i}\right>$. The second, for more complicated situations, is to
estimate the standard error using a bootstrapping method.
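The estimator of the variance of the mean can be sketched in the same
spirit (illustrative only; the function name is invented for this
example):

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <vector>

// Var(m) = sum(w_i^2) / (sum(w_i))^3 * sum(w_i * (x_i - m)^2),
// treating the weights as deterministic.  The square root of this
// value is the standard error of the weighted mean.
double weighted_mean_variance(const std::vector<double>& x,
                              const std::vector<double>& w)
{
  assert(x.size() == w.size());
  double sum_w = 0.0;
  double sum_w2 = 0.0;
  double sum_wx = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    sum_w += w[i];
    sum_w2 += w[i] * w[i];
    sum_wx += w[i] * x[i];
  }
  const double m = sum_wx / sum_w;
  double sum_wdev2 = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    sum_wdev2 += w[i] * (x[i] - m) * (x[i] - m);
  return sum_w2 / (sum_w * sum_w * sum_w) * sum_wdev2;
}
\end{verbatim}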
\section{AveragerPairWeighted}
Here data points come in pairs $(x,y)$. We are sampling from $f'_{XY}$
but want to measure from $f_{XY}$. To compensate for this discrepancy,
averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even
though $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$, we
assume that we can factorize the ratio and get $\frac{\sum
w_xw_yg(x,y)}{\sum w_xw_y}$.

\subsection{Covariance}
Following the variance calculations for AveragerWeighted, we have
$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$, where
$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$.

\subsection{Correlation}

As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$,
the variance is estimated as $\sigma_x^2=\frac{\sum
w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case, we define
the correlation to be the ratio between the covariance and the
geometrical average of the variances,

$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
w_xw_y(y-m_y)^2}}$.

This expression fulfills the following:
\begin{itemize}
\item With all weights equal to unity the expression reduces to the
non-weighted expression.
\item Adding a data pair in which one weight is zero is equivalent
to ignoring the data pair.
\item Correlation is equal to unity if and only if $x$ is equal to
$y$. Otherwise the correlation is between $-1$ and $1$.
\end{itemize}
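For the pair case, a minimal sketch of the weighted correlation
defined above (illustrative only; the function name is invented for
this example, and the two weight vectors play the roles of $w_x$ and
$w_y$):

\begin{verbatim}
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted correlation: the covariance divided by the geometrical
// average of the variances, where every sum uses the combined weight
// wx[i] * wy[i].
double weighted_correlation(const std::vector<double>& x,
                            const std::vector<double>& wx,
                            const std::vector<double>& y,
                            const std::vector<double>& wy)
{
  assert(x.size() == y.size() && x.size() == wx.size() && y.size() == wy.size());
  double sum_w = 0.0, sum_x = 0.0, sum_y = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    sum_w += w;
    sum_x += w * x[i];
    sum_y += w * y[i];
  }
  const double mx = sum_x / sum_w;
  const double my = sum_y / sum_w;
  double sxy = 0.0, sxx = 0.0, syy = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    sxy += w * (x[i] - mx) * (y[i] - my);
    sxx += w * (x[i] - mx) * (x[i] - mx);
    syy += w * (y[i] - my) * (y[i] - my);
  }
  return sxy / std::sqrt(sxx * syy);
}
\end{verbatim}

Note that the normalization $\sum w_xw_y$ cancels in the ratio, which
is why the sketch works with unnormalized sums.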
\section{Score}
…

\subsection{ROC}

An interpretation of the ROC curve area is the following: if we take
one sample from class $+$ and one sample from class $-$, what is the
probability that the sample from class $+$ has the greater value? The
ROC curve area calculates the fraction of pairs fulfilling this,

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
\end{equation}

A geometrical interpretation is to have a number of squares, where
each square corresponds to a pair of samples. The ROC curve follows
the border between pairs in which the sample from class $+$ has the
greater value and pairs in which this is not fulfilled. The ROC curve
area is the area covered by the squares of the former kind, and a
natural extension is to weight each pair with its two weights, so the
weighted ROC curve area becomes

\begin{equation}
\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
\end{equation}

This expression is invariant under a rescaling of the weights. Adding
a data value with weight zero adds nothing to the expression, and
having all weights equal to unity yields the non-weighted ROC curve
area.
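A brute-force sketch of the weighted ROC curve area, looping over all
pairs (quadratic in the number of samples; illustrative only, with
invented names):

\begin{verbatim}
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted ROC curve area: for every pair (i from class -, j from
// class +) the pair weight w_neg[i]*w_pos[j] is added to the
// denominator, and to the numerator as well whenever neg[i] < pos[j].
double weighted_roc_area(const std::vector<double>& neg,
                         const std::vector<double>& w_neg,
                         const std::vector<double>& pos,
                         const std::vector<double>& w_pos)
{
  assert(neg.size() == w_neg.size() && pos.size() == w_pos.size());
  double numerator = 0.0;
  double denominator = 0.0;
  for (std::size_t i = 0; i < neg.size(); ++i) {
    for (std::size_t j = 0; j < pos.size(); ++j) {
      const double w = w_neg[i] * w_pos[j];
      denominator += w;
      if (neg[i] < pos[j])
        numerator += w;
    }
  }
  return numerator / denominator;
}
\end{verbatim}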
\subsection{tScore}

Assume that $x$ and $y$ originate from the same distribution
$N(\mu,\sigma_i^2)$, where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We
then estimate $\sigma_0^2$ as
\begin{equation}
\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
{\cdots}
\end{equation}

The variance of the difference of the means becomes
\begin{eqnarray}
Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum
w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}=
\frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i},
\end{eqnarray}
…

\subsection{FoldChange}
Fold-change is simply the difference between the weighted means of the
two groups, $\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$.

\subsection{WilcoxonFoldChange}
Take all sample pairs (one from class $+$ and one from class $-$) and
calculate the weighted median of the distances.

\section{Kernel}
\subsection{Polynomial Kernel}
The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where
$<x,y>$ is the linear kernel (the usual scalar product). For weights
we define the linear kernel to be $<x,y>=\frac{\sum w_xw_yxy}{\sum
w_xw_y}$, and the polynomial kernel can be calculated as before,
$(1+<x,y>)^N$. Is this kernel a proper kernel (i.e., always positive
semi-definite)? Yes: $<x,y>$ is obviously a proper kernel, as it is a
scalar product. Adding a positive constant to a kernel yields another
kernel, so $1+<x,y>$ is still a proper kernel. Then $(1+<x,y>)^N$ is
also a proper kernel, because taking a proper kernel to the $N$th
power yields a new proper kernel (see any good book on SVM).

\subsection{Gaussian Kernel}
We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum
w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions
listed in the introduction.

Is this kernel a proper kernel? Yes. Following the proof for the
non-weighted kernel, we see that $K=\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$,
which is a product of two proper kernels. $\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)$ is a proper kernel, because it is a scalar product,
and $\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper
kernel, because it is a polynomial of the linear kernel with positive
coefficients. As the product of two kernels is also a kernel, the
Gaussian kernel is a proper kernel.
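A sketch of the weighted Gaussian kernel between two data vectors
(illustrative only; the function name is invented for this example):

\begin{verbatim}
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// k(x,y) = exp( -sum(wx_i * wy_i * (x_i - y_i)^2) / sum(wx_i * wy_i) )
double weighted_gaussian_kernel(const std::vector<double>& x,
                                const std::vector<double>& wx,
                                const std::vector<double>& y,
                                const std::vector<double>& wy)
{
  assert(x.size() == y.size() && x.size() == wx.size() && y.size() == wy.size());
  double sum_wd2 = 0.0;
  double sum_w = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double w = wx[i] * wy[i];
    sum_wd2 += w * (x[i] - y[i]) * (x[i] - y[i]);
    sum_w += w;
  }
  return std::exp(-sum_wd2 / sum_w);
}
\end{verbatim}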
\section{Distance}
…

We have the model

\begin{equation}
y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
\end{equation}

where $\epsilon_i$ is the noise. The variance of the noise is
inversely proportional to the weight,
$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. To determine the model
parameters, we minimize the sum of quadratic errors,

\begin{equation}
Q_0 = \sum w_i\epsilon_i^2.
\end{equation}

Taking the derivative with respect to $\alpha$ and $\beta$ yields two
conditions,

\begin{equation}
\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
\beta (x_i-m_x))=0
\end{equation}

and

\begin{equation}
\frac{\partial Q_0}{\partial \beta} = -2 \sum
w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0,
\end{equation}

or equivalently

\begin{equation}
\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
\end{equation}

and

\begin{equation}
\beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}.
\end{equation}

Note that by having all weights equal we get back the unweighted
expressions. We also calculate the variance of the estimators of
$\alpha$ and $\beta$,

\begin{equation}
\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
\frac{\sigma^2}{\sum w_i}
\end{equation}

and

\begin{equation}
\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
{(\sum w_i(x_i-m_x)^2)^2}=
\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}.
\end{equation}

Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
unweighted estimator

\begin{equation}
s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
\end{equation}

we suggest the following estimator

\begin{equation}
s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
w_i-2\frac{\sum w_i^2}{\sum w_i}}.
\end{equation}

\section{Outlook}
\subsection{Hierarchical clustering}
A hierarchical clustering consists of two things: finding the two
closest data points, and merging these two data points into a new
data point and calculating the new distances from this point to all
other points.

For the first item we need a distance matrix, and if we use Euclidean
distances the natural modification of the expression would be

\begin{equation}
d(x,y)=\frac{\sum w_i^xw_i^y(x_i-y_i)^2}{\sum w_i^xw_i^y}.
\end{equation}

For the second item, inspired by average linkage, we suggest

\begin{equation}
d(xy,z)=\frac{\sum w_i^xw_i^z(x_i-z_i)^2+\sum
w_i^yw_i^z(y_i-z_i)^2}{\sum w_i^xw_i^z+\sum w_i^yw_i^z}
\end{equation}

to be the distance between the new merged point $xy$ and $z$, and we