 Timestamp:
 Jan 9, 2006, 10:38:12 AM (17 years ago)
 File:

 1 edited

trunk/doc/Statistics.tex
r489 r492
   1    1 \documentstyle[12pt]{article}
   2      %\documentstyle[slidesonly]{seminar}
   3
   4      % $Id: error.tex,v 1.11 2004/10/01 09:08:39 peter Exp $
   5
   6      %\input{psfig}
   7
   8      %  This bit puts ``draft'' over everything!
   9      %\special{!userdict begin /bophook{gsave 200 30 translate
  10      %65 rotate /TimesRoman findfont 216 scalefont setfont
  11      %0 0 moveto 0.93 setgray (DRAFT) show grestore}def end}
  12      %  end of this bit that puts `draft' over everything
  13
  14    2
  15    3 \flushbottom
   …    …
  64   52
  65   53 \section{Introduction}
  66
  67
  68      \section{WeightedAverager}
  69
  70      Building an estimator $T$ of the parameter $\Theta$ there
  71      are a number of criteria that it is desirable to fulfill.
  72
       54 There are several reasons why a statistical analysis may need to account for weights. In the literature, these reasons are mainly divided into two groups.
       55
       56 The first group is when some of the measurements are known to be more precise than others. The more precise a measurement is, the larger the weight it is given. The simplest case is when the weights are given before the measurements and can be treated as deterministic. It becomes more complicated when the weights cannot be determined until afterwards, and even more complicated if the weight depends on the value of the observable.
       57
       58 The second group of situations is when averages are calculated over one distribution while sampling from another distribution. To compensate for this discrepancy, weights are introduced into the analysis. A simple example is that we are interviewing people but, for economic reasons, choose to interview more people from the city than from the countryside. When summarizing the statistics, the answers from the city are given a smaller weight. In this example we choose the proportions of people from the countryside and from the city being interviewed. Hence, we can determine the weights beforehand and consider them to be deterministic.
          In other situations the proportions are not deterministic but rather a result of the sampling, and the weights must be treated as stochastic; only in rare situations can the weights be treated as independent of the observable.
       59
       60 Since there are various origins for a weight occurring in a statistical analysis, there are various ways to treat the weights, and in general the analysis should be tailored to treat the weights correctly. We have not chosen one situation for our implementations, so see the specific function documentation for what assumptions are made. However, the following is common to all implementations:
  73   61 \begin{itemize}
  74      \item The bias $b=<T> - \Theta$ should be zero. The
  75      estimator is unbiased.
  76
  77      \item The estimator is efficient if the mean squared error
  78      $<(T-\Theta)^2>$ is small. If the estimator is unbiased the
  79      mean squared error is equal to the variance
  80      $<(T-<T>)^2>$ of the estimator.
  81
  82      \item The estimator is consistent if the mean squared error goes to
  83      zero in the limit of infinite number of data points.
  84
  85      \item Adding a data point with weight zero should not change the
  86      estimator.
  87
       62 \item Setting all weights to unity yields the same result as the nonweighted version.
       63 \item Rescaling the weights does not change any function.
       64 \item Setting a weight to zero is equivalent to removing the data point.
  88   65 \end{itemize}
  89
  90      We will use these criteria to find the estimator. We will minimize
  91      the variance with the constraint that the estimator must be
  92      unbiased. Also, we check that the estimator is consistent and handles
  93      zero weight in a desired way.
       66 An important case is when weights are binary (either 1 or 0). Then the weighted version gives the same result as the nonweighted version applied to only the data with nonzero weight. Hence, using binary weights and the weighted version, missing values can be treated in a proper way.
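The three conditions listed for the added introduction can be checked numerically. The following is a minimal sketch in Python, not the library's implementation: the function name `weighted_mean` and the sample data are hypothetical, and the weighted mean $\sum w_ix_i/\sum w_i$ is taken from the Mean subsection of the changed file.

```python
def weighted_mean(x, w):
    """Return sum(w_i * x_i) / sum(w_i) for equally long sequences x and w."""
    total = sum(w)
    if total == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(wi * xi for wi, xi in zip(w, x)) / total


# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 4.0]
w = [1.0, 3.0, 2.0]

# Unit weights reproduce the ordinary (nonweighted) mean.
unit = weighted_mean(x, [1.0] * len(x))
plain = sum(x) / len(x)

# Rescaling every weight by the same factor leaves the result unchanged.
base = weighted_mean(x, w)
rescaled = weighted_mean(x, [10.0 * wi for wi in w])

# A data point with weight zero behaves as if it had been removed.
padded = weighted_mean(x + [100.0], w + [0.0])
```

With binary weights the same function skips missing values: give each missing entry a weight of 0 and any placeholder value.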
       67
       68 \section{AveragerWeighted}
       69
  94   70
  95   71
  96   72 \subsection{Mean}
  97      We start building an estimator of the mean in the case with varying
  98      variance.
  99
 100      The likelihood is
 101
 102      \beq
 103      L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}},
 104      \eeq
 105
 106      which we want to maximize, but can equally well minimize
 107
 108      \beq
 109      -\ln L(m) = \sum \frac{1}{2}\ln2\pi\sigma_i^2+\frac{(x_i-m)^2}{2\sigma_i^2}.
 110      \eeq
 111
 112      Taking the derivative yields
 113
 114      \beq
 115      -\frac{d\ln L(m)}{dm}=-\sum \frac{x_i-m}{\sigma_i^2}
 116      \eeq
 117
 118      Hence, the Maximum Likelihood method yields the estimator
 119
 120      \beq
 121      m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}
 122      \eeq
 123
 124      Let us check the criteria defined above.
 125      First, we want the estimator to be unbiased.
 126
 127      \beq
 128      b=<m>-\mu=\frac{\sum <x_i>/\sigma_i^2}{\sum 1/\sigma_i^2}-\mu=\frac{\sum \mu/\sigma_i^2}{\sum 1/\sigma_i^2}-\mu=0,
 129      \eeq
 130
 131      so the estimator is unbiased.
 132
 133      Second, we examine how efficient the estimator is, \ie how small the
 134      variance is.
 135
 136      \beq
 137      V(m)=\frac{\sum V(x_i)/\sigma_i^4}{\left(\sum 1/\sigma_i^2\right)^2}=\frac{1}{\sum 1/\sigma_i^2},
 138      \eeq
 139
 140      which obviously goes to zero when the number of samples (with finite
 141      $\sigma$) goes to infinity. The estimator is consistent.
 142
 143      Trivially we can see that a zero weight data point does not change the
 144      estimator, which was our last condition.
       73
       74 In every situation the weight is designed so that the weighted mean is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously fulfills the conditions above.
       75
       76 In the case of varying measurement error, it can be motivated that the weight should be $w_i = 1/\sigma_i^2$. We assume the measurement error to be Gaussian, and the likelihood of our measurements is
       77 $L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$.
       78 We maximize the likelihood by taking the derivative with respect to $m$ of the logarithm of the likelihood,
       79 $\frac{d\ln L(m)}{dm}=\sum \frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields the estimator
       80 $m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}$.
       81
 145   82
 146   83 \subsection{Variance}
 147      Let us now examine the case when we do not know the variance, but only
 148      the weight $w_i$ that is proportional to the inverse
 149      of the variance $\sigma_i^2$
 150
 151      \beq
 152      w_i=\frac{\kappa}{\sigma_i^2},
 153      \eeq
 154
 155      and the variance is
 156
 157      \beq
 158      V(m)=\frac{1}{\sum 1/\sigma_i^2}=\frac{\kappa}{\sum w_i}
 159      \eeq
 160
 161      so we need to estimate $\kappa$. The likelihood is now
 162
 163      \beq
 164      L(k)=P(k)\prod \frac{\sqrt{w_i}} {\sqrt{2\pi k}}e^{-\frac{w_i(x_i-m)^2}{2k}},
 165      \eeq
 166
 167      where $P(k)$ is the prior probability distribution. If we have no prior knowledge
 168      about $k$, $P(k)$ is constant.
 169
 170      Taking the derivative of the logarithm again yields
 171
 172      \beq
 173      \frac{d\ln L(k)}{dk}=\frac{d\ln P(k)}{dk}+\sum \left(-\frac{1}{2k}+\frac{w_i(x_i-m)^2}{2k^2}\right)
       84 In the case of varying variance, there is no point in estimating a variance since it is different for each data point.
       85
       86 Instead we look at the case when we want to estimate the variance over $f$ but are sampling from $f'$. For the mean of an observable $O$ we have
       87 $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum w_i}$. Hence, an estimator of the variance of $X$ is
       88 \begin{eqnarray}
       89 \sigma^2=<X^2>-<X>^2=
       90 \\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
       91 \\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
       92 \\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
       93 \\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
       94 \end{eqnarray}
       95 This estimator is invariant under a rescaling of the weights, and having a weight equal to zero is equivalent to removing the data point.
          Having all weights equal to unity we get $\sigma^2=\frac{\sum (x_i-m)^2}{N}$, which is the same as returned from Averager. Hence, this estimator is slightly biased, but still very efficient.
       96
       97 \subsection{Standard Error}
       98 The standard error squared is equal to the expected squared error of the estimation of $m$. The squared error consists of two parts, the variance of the estimator and the squared bias: $<(m-\mu)^2>=<(m-<m>+<m>-\mu)^2>=<(m-<m>)^2>+(<m>-\mu)^2$.
       99 In the case when weights are included in the analysis due to varying measurement errors and the weights can be treated as deterministic, we have
      100 \begin{eqnarray}
      101 Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
      102 \\\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
      103 \\\frac{\sigma_0^2}{\sum w_i},
      104 \end{eqnarray}
      105 where we need to estimate $\sigma_0^2$. Again we have the likelihood
      106 $L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{\left(-\frac{w_i(x_i-m)^2}{2\sigma_0^2}\right)}$ and taking the derivative with respect to $\sigma_0^2$,
      107 $\frac{d\ln L}{d\sigma_0^2}=\sum \left(-\frac{1}{2\sigma_0^2}+\frac{w_i(x_i-m)^2}{2\sigma_0^2\sigma_0^2}\right)$
      108 which yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x_i-m)^2$. This estimator does not ignore weights equal to zero: a data point with zero weight still counts in $N$, although its deviation is most often smaller than the expected infinity. Therefore, we modify the expression as follows $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^2}\sum w_i(x_i-m)^2$ and we get the following estimator of the variance of the mean
      109 $\sigma_m^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^3}\sum w_i(x_i-m)^2$. This estimator fulfills the conditions above: adding a weight zero does not change it, rescaling the weights does not change it, and setting all weights to unity yields the same expression as in the nonweighted case.
      110
      111 In a case when it is not a good approximation to treat the weights as deterministic, there are two ways to get a better estimation.
          The first one is to linearize the expression $\left<\frac{\sum w_ix_i}{\sum w_i}\right>$. The second method, for more complicated situations, is to estimate the standard error using a bootstrapping method.
      112
      113 \section{AveragerPairWeighted}
      114 Here data points come in pairs $(x,y)$. We are sampling from $f'_{XY}$ but want to measure from $f_{XY}$. To compensate for this discrepancy, averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even though $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$, we assume that we can factorize the ratio and get $\frac{\sum w_xw_yg(x,y)}{\sum w_xw_y}$.
      115 \subsection{Covariance}
      116 Following the variance calculations for AveragerWeighted we have $Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$.
      117
      118 \subsection{Correlation}
      119
      120 As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$, the variance is estimated as
      121 $\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the nonweighted case we define the correlation to be the ratio between the covariance and the geometric average of the variances
      122
      123 $\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum w_xw_y(y-m_y)^2}}$.
      124
      125 This expression fulfills the following
      126 \begin{itemize}
      127 \item Having N equal weights the expression reduces to the nonweighted expression.
      128 \item Adding a pair of data in which one weight is zero is equivalent to ignoring the data pair.
      129 \item Correlation is equal to unity if and only if $x$ is equal to $y$. Otherwise the correlation is between $-1$ and $1$.
      130 \end{itemize}
      131 \section{Score}
      132
      133
      134 \subsection{Pearson}
      135
      136 $\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.
      137
      138 See AveragerPairWeighted correlation.
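The weighted correlation defined for AveragerPairWeighted can be sketched as follows. This is an illustrative Python sketch under the assumption that each pair is weighted by the product $w_xw_y$, as in the formulas of that section; the function name `weighted_correlation` and the sample data are hypothetical and are not taken from the library.

```python
import math


def weighted_correlation(x, y, wx, wy):
    """Correlation of x and y where pair i carries the weight wx[i] * wy[i]."""
    w = [a * b for a, b in zip(wx, wy)]
    sw = sum(w)
    # Weighted means m_x and m_y.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sums of products; the common factor 1 / sum(w) cancels in the ratio.
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0]
wx = [1.0, 2.0, 1.0, 2.0]
wy = [2.0, 1.0, 1.0, 1.0]

r = weighted_correlation(x, y, wx, wy)

# A pair in which one of the two weights is zero is ignored.
r_padded = weighted_correlation(x + [9.0], y + [-7.0], wx + [0.0], wy + [3.0])

# Rescaling either set of weights leaves the correlation unchanged.
r_rescaled = weighted_correlation(x, y, [5.0 * a for a in wx], wy)
```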
      139
      140 \subsection{ROC}
      141
      142 An interpretation of the ROC curve area is the probability that, if we take one sample from class $+$ and one sample from class $-$, the sample from class $+$ has the greater value. The ROC curve area calculates the ratio of pairs fulfilling this
      143
      144 \beq
      145 \frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
      146 \eeq
      147
      148 A geometrical interpretation is to have a number of squares where each square corresponds to a pair of samples. The ROC curve follows the border between pairs in which the sample from class $+$ has a greater value and pairs in which this is not fulfilled. The ROC curve area is the area of the former squares, and a natural extension is to weight each pair with its two weights. Consequently, the weighted ROC curve area becomes
      149
      150 \beq
      151 \frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
      152 \eeq
      153
      154 This expression is invariant under a rescaling of the weights. Adding a data value with weight zero adds nothing to the expression, and having all weights equal to unity yields the nonweighted ROC curve area.
      155
      156 \subsection{tScore}
      157
      158 Assume that $x$ and $y$ originate from the same distribution $N(\mu,\sigma_i^2)$ where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$.
          We then estimate $\sigma_0^2$ as
      159 \begin{equation}
      160 \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
      161 {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
      162 \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
      163 \end{equation}
      164 The variance of the difference of the means becomes
      165 \begin{eqnarray}
      166 Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}=
      167 \frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i},
      168 \end{eqnarray}
      169 and consequently the squared denominator of the tscore becomes
      170 \begin{equation}
      171 \frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
      172 {\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
      173 \frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
      174 \left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right),
      175 \end{equation}
      176
      177 For constant weights $w_i=w$ this expression condenses to
      178 \begin{equation}
      179 \frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
      180 {n_x+n_y-2}
      181 \left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
      182 \end{equation}
      183 in other words the good old expression as for the nonweighted case.
      184
      185 \subsection{FoldChange}
      186 FoldChange is simply the difference between the weighted means of the two groups: $\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$
      187
      188 \subsection{WilcoxonFoldChange}
      189 Taking all pairs of samples (one from class $+$ and one from class $-$) and calculating the weighted median of the distances.
      190
      191 \section{Kernel}
      192 \subsection{Polynomial Kernel}
      193 The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where $<x,y>$ is the linear kernel (usual scalar product). For weights we define the linear kernel to be $<x,y>=\frac{\sum w_xw_yxy}{\sum w_xw_y}$ and the polynomial kernel can be calculated as before $(1+<x,y>)^N$. Is this kernel a proper kernel (always being positive semidefinite)? Yes, because $<x,y>$ is obviously a proper kernel as it is a scalar product. Adding a positive constant to a kernel yields another kernel so $1+<x,y>$ is still a proper kernel.
          Then also $(1+<x,y>)^N$ is a proper kernel because taking a proper kernel to the $N$th power yields a new proper kernel (see any good book on SVM).
      194 \subsection{Gaussian Kernel}
      195 We define the weighted Gaussian kernel as
      196 $\exp\left(-\frac{\sum w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions listed in the introduction.
      197
      198 Is this kernel a proper kernel? Yes, following the proof of the nonweighted kernel we see that $K=\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$, which is a product of two proper kernels. $\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)$ is a proper kernel, because it is a scalar product, and $\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper kernel, because it is a polynomial of the linear kernel with positive coefficients. As a product of two kernels is also a kernel, the Gaussian kernel is a proper kernel.
      199
      200 \section{Distance}
      201
      202 \section{Regression}
      203 \subsection{Naive}
      204 \subsection{Linear}
      205 We have the model
      206
      207 \beq
      208 y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
      209 \eeq
      210
      211 where $\epsilon_i$ is the noise. The variance of the noise is
      212 inversely proportional to the weight,
      213 $Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
      214 model parameters, we minimize the weighted sum of quadratic errors.
      215
      216 \beq
      217 Q_0 = \sum w_i\epsilon_i^2
      218 \eeq
      219
      220 Taking the derivative with respect to $\alpha$ and $\beta$ yields two conditions
      221
      222 \beq
      223 \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - \beta (x_i-m_x))=0
      224 \eeq
      225
      226 and
      227
      228 \beq
      229 \frac{\partial Q_0}{\partial \beta} = -2 \sum w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
 174  230 \eeq
 175  231
   …    …
 177  233
 178  234 \beq
 179      k=\frac{1}{N}\sum w_i(x_i-m)^2+2k^2\frac{d\ln P(k)}{dk}
 180      \eeq
 181
 182      In principle, any prior probability distribution $P(k)$ could be
 183      used. Here we, for simplicity, focus on the one where the last term
 184      becomes a constant, namely
 185
 186      \beq
 187      P(k)=\exp(-\lambda/k).
 188      \eeq
 189
 190      One problem with this choice is that we have to truncate the
 191      distribution in order to normalize it.
 192
 193      The estimator $k$ becomes
 194
 195      \beq
 196      k=\frac{1}{N}\sum w_i(x_i-m)^2+A,
 197      \eeq
 198
 199      where $A$ is constant (depending on $\lambda$ and the truncation point).
 200
 201      Having an estimation of $\kappa$ we can calculate the variance of $m$
 202
 203      \beq
 204      V(m)=\frac{\frac{1}{N}\sum w_i(x_i-m)^2+A}{\sum w_i}=\frac{\frac{1}{N}\sum (x_i-m)^2/\sigma_i^2}{\sum 1/\sigma_i^2}+\frac{A}{\sum 1/\sigma_i^2}
 205      \eeq
 206
 207      Let us now look at the estimation of $\kappa$. Are the criteria above
 208      fulfilled?
          We start by looking at the bias
 209
 210      \beq
 211      b=<k>-\kappa=\frac{1}{N}\sum w_i\left<(x_i-m)^2\right>-\kappa
 212      \eeq
 213
 214      Let us look at
 215
 216      \bea
 217      \left<(x_i-m)^2\right>=
 218      \\\left<(x_i-\mu)^2+(m-\mu)^2-2(x_i-\mu)(m-\mu)\right>=
 219      \\V(x_i)+V(m)-2\left<(x_i-\mu)(m-\mu)\right>=
 220      \\V(x_i)+V(m)-2\left<(x_i-\mu)(\frac{\sum_j x_j/\sigma_j^2}{\sum_k 1/\sigma_k^2}-\mu)\right>=
 221      \\V(x_i)+V(m)-2\left<(\frac{(x_i-\mu)^2/\sigma_i^2}{\sum_k 1/\sigma_k^2})\right>=
 222      \\V(x_i)+V(m)-2\frac{1}{\sum_k 1/\sigma_k^2}=
 223      \\\sigma_i^2-\frac{1}{\sum_k 1/\sigma_k^2}
 224      \eea
 225
 226      so the bias is
 227
 228      \bea
 229      b=<k>-\kappa=
 230      \\\frac{1}{N}\sum_i (w_i\sigma_i^2-\frac{w_i}{\sum_k 1/\sigma_k^2})-\kappa=
 231      \\\frac{\kappa}{N}\sum_i (1-\frac{w_i}{\sum_k w_k})-\kappa=
 232      \\\frac{\kappa}{N}(N-1)-\kappa
 233      \eea
 234
 235      so the estimator is asymptotically unbiased. If we want the estimation
 236      to be unbiased we could \eg modify $N$ to $N-1$, exactly as in the
 237      unweighted case, and we get the following estimator of $\kappa$
 238
 239      \beq
 240      k=\frac{1}{N-1}\sum w_i(x_i-m)^2
 241      \eeq
 242
 243      One problem with this estimator is that it is sensitive to weight zero
 244      samples due to the $N$. To solve that we have to express $N$ using
 245      $w$, which will make the estimator biased. We suggest the substitution
 246
 247      \beq
 248      N\rightarrow\frac{(\sum w_i)^2}{\sum w_i^2}
 249      \eeq
 250
 251      so the estimator finally becomes
 252
 253      \beq
 254      k=\frac{\sum w_i^2}{(\sum w_i)^2-\sum w_i^2}\sum w_i(x_i-m)^2
 255      \eeq
 256
 257      and the variance is
 258
 259      \beq
 260      V(m)=\frac{\sum w_i^2}{(\sum w_i)^2-\sum w_i^2}\frac{\sum w_i(x_i-m)^2}{\sum w_i}+\frac{A}{\sum 1/\sigma_i^2}
 261      \eeq
 262
 263
 264      \section{Score}
 265      \subsection{Pearson}
 266
 267      Pearson correlation is defined as:
 268      \beq
 269      \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2\sum_i (y_i-\bar{y})^2}}.
 270      \eeq
 271      The weighted version should satisfy the following conditions:
 272
 273      \begin{itemize}
 274      \item Having N equal weights the expression reduces to the unweighted case.
 275      \item Adding a pair of data where one of the weights is zero does not change the expression.
 276      \item When $x$ and $y$ are identical, the correlation is one.
 277      \end{itemize}
 278
 279      Therefore we define the weighted correlation to be
 280      \beq
 281      \frac{\sum_iw_i^xw_i^y(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_iw_i^xw_i^y(x_i-\bar{x})^2\sum_iw_i^xw_i^y(y_i-\bar{y})^2}},
 282      \eeq
 283      where
 284      \beq
 285      \bar{x}=\frac{\sum_i w^x_iw^y_ix_i}{\sum_i w^x_iw^y_i}
 286      \eeq
      235 \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
      236 \eeq
      237
 287  238 and
 288      \beq
 289      \bar{y}=\frac{\sum_i w^x_iw^y_iy_i}{\sum_i w^x_iw^y_i}.
 290      \eeq
 291
 292      \subsection{ROC}
 293      If we have a set of values $x^+$ from class $+$ and a set of values
 294      $x^-$ from class $-$, the ROC curve area is equal to
 295
 296      \beq
 297      \frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}
 298      \eeq
 299
 300      so a natural extension using weights could be
 301
 302      \beq
 303      \frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
 304      \eeq
 305
 306      \section{Hierarchical clustering}
      239
      240 \beq
      241 \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
      242 \eeq
      243
      244 Note, by having all weights equal we get back the unweighted
      245 case. Furthermore, we calculate the variance of the estimators of
      246 $\alpha$ and $\beta$.
      247
      248 \beq
      249 \textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
      250 \frac{\sigma^2}{\sum w_i}
      251 \eeq
      252
      253 and
      254 \beq
      255 \textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
      256 {(\sum w_i(x_i-m_x)^2)^2}=
      257 \frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
      258 \eeq
      259
      260 Finally, we estimate the level of noise, $\sigma^2$.
          Inspired by the
      261 unweighted estimation
      262
      263 \beq
      264 s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
      265 \eeq
      266
      267 we suggest the following estimator
      268
      269 \beq
      270 s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum w_i-2\frac{\sum w_i^2}{\sum w_i}}
      271 \eeq
      272
      273 \section{Outlook}
      274 \subsection{Hierarchical clustering}
 307  275 \label{hc}
 308  276 A hierarchical clustering consists of two things: finding the two
   …    …
 325  293 also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$
 326  294
 327      \section{Regression}
 328      We have the model
 329
 330      \beq
 331      y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
 332      \eeq
 333
 334      where $\epsilon_i$ is the noise. The variance of the noise is
 335      inversely proportional to the weight,
 336      $Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
 337      model parameters, we minimize the weighted sum of quadratic errors.
 338
 339      \beq
 340      Q_0 = \sum w_i\epsilon_i^2
 341      \eeq
 342
 343      Taking the derivative with respect to $\alpha$ and $\beta$ yields two conditions
 344
 345      \beq
 346      \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - \beta (x_i-m_x))=0
 347      \eeq
 348
 349      and
 350
 351      \beq
 352      \frac{\partial Q_0}{\partial \beta} = -2 \sum w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
 353      \eeq
 354
 355      or equivalently
 356
 357      \beq
 358      \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
 359      \eeq
 360
 361      and
 362
 363      \beq
 364      \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum w_i(x_i-m_x)^2}
 365      \eeq
 366
 367      Note, by having all weights equal we get back the unweighted
 368      case. Furthermore, we calculate the variance of the estimators of
 369      $\alpha$ and $\beta$.
 370
 371      \beq
 372      \textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
 373      \frac{\sigma^2}{\sum w_i}
 374      \eeq
 375
 376      and
 377      \beq
 378      \textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
 379      {(\sum w_i(x_i-m_x)^2)^2}=
 380      \frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
 381      \eeq
 382
 383      Finally, we estimate the level of noise, $\sigma^2$.
          Inspired by the
 384      unweighted estimation
 385
 386      \beq
 387      s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
 388      \eeq
 389
 390      we suggest the following estimator
 391
 392      \beq
 393      s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum w_i-2\frac{\sum w_i^2}{\sum w_i}}
 394      \eeq
 395
 396  295 \end{document}
 397  296
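The weighted linear regression described in the Regression section of the changed file can be illustrated with a short Python sketch. It assumes the model $y_i=\alpha+\beta(x_i-m_x)+\epsilon_i$ with $Var(\epsilon_i)=\sigma^2/w_i$ and uses the estimators for $\alpha$, $\beta$ and $s^2$ as written in the diff; the function name `weighted_linear_fit` and the sample data are hypothetical and are not the library's API.

```python
def weighted_linear_fit(x, y, w):
    """Fit y = alpha + beta * (x - m_x) by weighted least squares.

    Returns (alpha, beta, m_x, s2), where s2 estimates the noise level sigma^2.
    """
    sw = sum(w)
    m_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    # alpha is the weighted mean of y.
    alpha = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - m_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - m_x) * (yi - alpha) for wi, xi, yi in zip(w, x, y))
    beta = sxy / sxx
    # Weighted residual sum of squares.
    rss = sum(
        wi * (yi - alpha - beta * (xi - m_x)) ** 2 for wi, xi, yi in zip(w, x, y)
    )
    # Denominator from the diff; it reduces to n - 2 when all weights are one.
    s2 = rss / (sw - 2.0 * sum(wi * wi for wi in w) / sw)
    return alpha, beta, m_x, s2


# Hypothetical noise-free data on the line y = 3 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [3.0, 5.0, 7.0, 9.0]
w = [1.0, 2.0, 1.0, 2.0]
alpha, beta, m_x, s2 = weighted_linear_fit(x, y, w)

# Hypothetical noisy data with unit weights, for comparison with the unweighted case.
alpha_u, beta_u, m_xu, s2_u = weighted_linear_fit(x, [1.0, 3.0, 2.0, 5.0], [1.0] * 4)
```

Because the model is centered at $m_x$, the fitted value at a point $x_0$ is `alpha + beta * (x_0 - m_x)`.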