# Changeset 492 for trunk/doc/Statistics.tex

Ignore:
Timestamp:
Jan 9, 2006, 10:38:12 AM (17 years ago)
Message:

updated Weighted Statistics document and reimplemented few stuff. NOTE, this revision may not compile. Cannot test compilation on Le Mac.

File:
1 edited

### Legend:

Unmodified
 r489 \documentstyle[12pt]{article} %\documentstyle[slidesonly]{seminar} % $Id: error.tex,v 1.11 2004/10/01 09:08:39 peter Exp$ %\input{psfig} % --- This bit puts draft'' over everything! --- %\special{!userdict begin /bop-hook{gsave 200 30 translate %65 rotate /Times-Roman findfont 216 scalefont setfont %0 0 moveto 0.93 setgray (DRAFT) show grestore}def end} %  --- end of this bit that puts `draft' over everything --- \flushbottom \section{Introduction} \section{WeightedAverager} Building an estimator $T$ of the parameter $\Theta$ there is a number of criterias that is welcome to be fullfilled. There are several different reasons why a statistical analysis needs to adjust for weighting. In literature reasons are mainly diveded in to groups. The first group is when some of the measurements are known to be more precise than others. The more precise a measuremtns is the larger weight it is given. The simplest case is when the weight are given before the measurements and they can be treated as deterministic. It becomes more complicated when the weight can be determined not until afterwards, and even more complicated if the weight depends on the value of the observable. The second group of situations is when calculating averages over one distribution and sampling from another distribution. Compensating for this discrepency weights are introduced to the analysis. A simple example may be that we are interviewing people but for economical reasons we choose to interview more people from the city than from the countryside. When summarizing the statistics the answers from the city are given a smaller weight. In this example we are choosing the proportions of people from countryside and people from city being intervied. Hence, we can determine the weights before and consider them to be deterministic. In other situations the proportions are not deterministic, but rather a result from the sampling and the weights must be treated as stochastic and only in rare situations the weights can be treated as independent of the observable. Since there are various origin for a weight occuring in a statistical analysis, there are various way to treat the weights and in general the analysis should be tailored to treat the weights correctly. We have not chosen one situation for our implementations, so see specific function documentation for what assumtions are made. Though, common for implementationare the following: \begin{itemize} \item The bias $b= - \Theta$ should be zero. The estimator is unbiased. \item The estimator is efficient if the mean squared error $<(T-\Theta)^2>$ is small. If the estimator is unbiased the mean squared error is equal to the variance $<(T-)^2>$ of the estimator. \item The estimator is consistent if the mean squared error goes to zero in the limit of infinite number of data points. \item adding a data point with weight zero should not change the estimator. \item Setting all weights to unity yields the same result as the non-weighted version. \item Rescaling the weights does not change any function. \item Setting a weight to zero is equivalent to removing the data point. \end{itemize} We will use these criterias to find the estimator. We will minimize the variance with the constraint that the estimator must be unbiased. Also, we check that the estimator is consistent and handles zero weight in a desired way. An important case is when weights are binary (either 1 or 0). Then we get same result using the weighted version as using the data with weight not equal to zero and the non-weighted version. Hence, using binary weights and the weighted version missing values can be treated in a proper way. \section{AveragerWeighted} \subsection{Mean} We start building an estimator of the mean in the case with varying variance. The likelihood is \beq L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}, \eeq which we want to maximize, but can equally well minimize \beq -\ln L(m) = \sum \frac{1}{2}\ln2\pi\sigma_i^2+\frac{(x_i-m)^2}{2\sigma_i^2}. \eeq Taking the derivity yields \beq \frac{d-\ln L(m)}{dm}=\sum - \frac{x_i-m}{\sigma_i^2} \eeq Hence, the Maximum Likelihood method yields the estimator \beq m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2} \eeq Let us check the criterias defined above. First, we want the estimator to be unbiased. \beq b=-\mu=\frac{\sum /\sigma_i^2}{\sum 1/\sigma_i^2}-\mu=\frac{\sum \mu/\sigma_i^2}{\sum 1/\sigma_i^2}-\mu=0, \eeq so the estimator is unbiased. Second, we examine how efficient the estimator is, \ie how small the variance is. \beq V(m)=\frac{\sum V(x_i)/\sigma_i^2}{\left(\sum 1/\sigma_i^2\right)^2}=\frac{1}{\sum 1/\sigma_i^2}, \eeq which obviously goes to zero when number of samples (with finite $\sigma$) goes to infinity. The estimator is consistent. Trivially we can see that a zero weight data point does not change the estimator, which was our last condition. For any situation the weight is always designed so the weighted mean is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously fulfills the conditions above. In the case of varying measurement error, it could be motivated that the weight shall be $w_i = 1/\sigma_i^2$. We assume measurement error to be Gaussian and the likelihood to get our measurements is $L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$. We maximize the likelihood by taking the derivity with respect to $m$ on the logarithm of the likelihood $\frac{d\ln L(m)}{dm}=\sum \frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields the estimator $m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2}$. \subsection{Variance} Let us now examine the case when we do not know the variance, but only the weight $w_i$ that is proportional to the inverse of the variance $\sigma_i^2$ \beq w_i=\frac{\kappa}{\sigma_i^2}, \eeq and the variance is \beq V(m)=\frac{1}{\sum 1/\sigma_i^2}=\frac{\kappa}{\sum w_i} \eeq so we need to estimate $\kappa$. The likelihood is now \beq L(k)=P(k)\prod \frac{\sqrt{w_i}} {\sqrt{2\pi k}}e^{-\frac{w_i(x_i-m)^2}{2k}}, \eeq where $P(k)$ is the prior probabilty distribution. If we have no prior knowledge about $k$, $P(k)$ is constant. Taking the derivity of the logarithm again yields \beq \frac{d-\ln L(k)}{dk}=-\frac{d\ln P(k)}{dk}+\sum \left(\frac{1}{2k}-\frac{w_i(x_i-m)^2}{2k^2}\right) In case of varying variance, there is no point estimating a variance since it is different for each data point. Instead we look at the case when we want to estimate the variance over $f$ but are sampling from $f'$. For the mean of an observable $O$ we have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum w_i}$. Hence, an estimator of the variance of $X$ is \begin{eqnarray} \sigma^2=-^2= \\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}= \\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i} \\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i} \\\frac{\sum w_i(x_i-m)^2}{\sum w_i} \end{eqnarray} This estimator fulfills that it is invariant under a rescaling and having a weight equal to zero is equivalent to removing the data point. Having all weight equal to unity we get $\sigma=\frac{\sum (x_i-m)^2}{N}$, which is the same as returned from Averager. Hence, this estimator is slightly biased, but still very efficient. \subsection{Standard Error} The standard error squared is equal to the expexted squared error of the estimation of $m$. The squared error consists of two parts, the variance of the estimator and the squared bias. $^2=+-\mu>^2=>^2+(-\mu)^2$. In the case when weights are included in analysis due to varying measurement errors and the weights can be treated as deterministic ,we have \begin{eqnarray} Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}= \\\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2} \frac{\sigma_0^2}{\sum w_i}, \end{eqnarray} where we need to estimate $\sigma_0^2$. Again we have the likelihood $L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{-\frac{w_i(x-m)^2}{2\sigma_0^2}}$ and taking the derivity with respect to $\sigma_o^2$, $\frac{d\ln L}{d\sigma_i^2}=\sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2}$ which yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This estimator is not ignoring weights equal to zero, because deviation is most often smaller than the expected infinity. Therefore, we modify the expression as follows $\sigma_i^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^2}\sum w_i(x-m)^2$ and we get the following estimator of the variance of the mean $\sigma_i^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^3}\sum w_i(x-m)^2$. This estimator fulfills the conditions above: adding a weight zero does not change it: rescaling the weights does not change it, and setting all weights to unity yields the same expression as in the non-weighted case. In a case when it is not a good approximation to treat the weights as deterministic, there are two ways to get a better estimation. The first one is to linearize the expression $\left<\frac{\sum w_ix_i}{\sum w_i}\right>$. The second method when the situation is more complicated is to estimate the standard error using a bootstrapping method. \section{AveragerPairWeighted} Here data points come in pairs (x,y). We are sampling from $f'_{XY}$ but want to measure from $f_{XY}$. To compensate for this decrepency, averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even though, $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$ we assume that we can factorize the ratio and get $\frac{\sum w_xw_yg(x,y)}{\sum w_xw_y}$ \subsection{Covariance} Following the variance calculations for AveragerWeighted we have $Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$ \subsection{correlation} As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$, the variance is estimated as $\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define the correlation to be the ratio between the covariance and geomtrical avergae of the variances $\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum w_xw_y(y-m_y)^2}}$. This expression fulfills the following \begin{itemize} \item Having N weights the expression reduces to the non-weighted expression. \item Adding a pair of data, in which one weight is zero is equivalent to ignoring the data pair. \item Correlation is equal to unity if and only if $x$ is equal to $y$. Otherwise the correlation is between -1 and 1. \end{itemize} \section{Score} \subsection{Pearson} $\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$. See AveragerPairWeighted correlation. \subsection{ROC} An interpretation of the ROC curve area is the probability that if we take one sample from class $+$ and one sample from class $-$, what is the probability that the sample from class $+$ has greater value. The ROC curve area calculates the ratio of pairs fulfilling this \beq \frac{\sum_{\{i,j\}:x^-_i)^N$, where $$is the linear kenrel (usual scalar product). For weights we define the linear kernel to be =\frac{\sum w_xw_yxy}{\sum w_xw_y} and the polynomial kernel can be calculated as before (1+)^N. Is this kernel a proper kernel (always being semi positive definite). Yes, because$$ is obviously a proper kernel as it is a scalar product. Adding a positive constant to a kernel yields another kernel so$1+$is still a proper kernel. Then also$(1+)^N$is a proper kernel because taking a proper kernel to the$Nth$power yields a new proper kernel (see any good book on SVM). \subsection{Gaussian Kernel} We define the weighted Gaussian kernel as$\exp\left(-\frac{\sum w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions listed in the introduction. Is this kernel a proper kernel? Yes, following the proof of the non-weighted kernel we see that$K=\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$, which is a product of two proper kernels.$\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)$is a proper kernel, because it is a scalar product and$\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$is a proper kernel, because it a polynomial of the linear kernel with positive coefficients. As product of two kernel also is a kernel, the Gaussian kernel is a proper kernel. \section{Distance} \section{Regression} \subsection{Naive} \subsection{Linear} We have the model \beq y_i=\alpha+\beta (x-m_x)+\epsilon_i, \eeq where$\epsilon_i$is the noise. The variance of the noise is inversely proportional to the weight,$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the model parameters, we minimimize the sum of quadratic errors. \beq Q_0 = \sum \epsilon_i^2 \eeq Taking the derivity with respect to$\alpha$and$\beta$yields two conditions \beq \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - \beta (x_i-m_x)=0 \eeq and \beq \frac{\partial Q_0}{\partial \beta} = -2 \sum w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0 \eeq \beq k=\frac{1}{N}\sum w_i(x_i-m)^2+2k^2\frac{d\ln P(k)}{dk} \eeq In principle, any prior probabilty distribution$P(k)$could be used. Here we, for simplicity, focus on the one where the last term becomes a constant, namely \beq P(k)=\exp(-\lambda/k). \eeq One problem with this choice is that we have to truncate the distribution in order to normalize it. The estimator$k$becomes \beq k=\frac{1}{N}\sum w_i(x_i-m)^2+A, \eeq where$A$is constant (depending on$\lambda$and the truncation point). Having an estimation of$\kappa$we can calculate the variance of$m$\beq V(m)=\frac{\frac{1}{N}\sum w_i(x_i-m)^2+A}{\sum w_i}=\frac{\frac{1}{N}\sum (x_i-m)^2/\sigma_i^2}{\sum 1/\sigma_i^2}+\frac{A}{\sum 1/\sigma_i^2} \eeq Let us now look at estimation of$\kappa$. Is the criterias above fullfilled? We start looking at the bias \beq b=-\kappa=\frac{1}{N}\sum w_i\left<(x_i-m)^2\right>-\kappa \eeq Let us look at \bea \left<(x_i-m)^2\right>= \\\left<(x_i-\mu)^2+(m-\mu)^2-2(x_i-\mu)(m-\mu)\right> \\V(x_i)+V(m)-2\left<(x_i-\mu)(m-\mu)\right> \\V(x_i)+V(m)-2\left<(x_i-\mu)(\frac{\sum_j x_j/\sigma_j^2}{\sum_k 1/\sigma_k^2}-\mu)\right> \\V(x_i)+V(m)-2\left<(\frac{(x_i-\mu)^2/\sigma_j^2}{\sum_k 1/\sigma_k^2})\right> \\V(x_i)+V(m)-2\frac{1}{\sum_k 1/\sigma_k^2} \\\sigma_i^2-\frac{1}{\sum_k 1/\sigma_k^2} \eea so the bias is \bea b=-\kappa= \\\frac{1}{N}\sum_i (w_i\sigma_i^2-\frac{w_i}{\sum_k 1/\sigma_k^2})-\kappa= \\\frac{\kappa}{N}\sum_i (1-\frac{w_i}{\sum_k w_k})-\kappa= \\\frac{\kappa}{N}(N-1)-\kappa \eea so the estimator is asymptotically unbiased. If we want the estimation to be unbiased we could \eg modify$N$to$N-1$, exactly as in the unweighted case, and we get the following estimator of$\kappa$\beq k=\frac{1}{N-1}\sum w_i(x_i-m)^2 \eeq One problem with this estimator is that it is sensitive to weight zero samples due to the$N$. To solve that we have to express$N$using$w$, wich will make the estimator biased. We suggest the substitution \beq N\rightarrow\frac{(\sum w_i)^2}{\sum w_i^2} \eeq so the estimator finally becomes \beq k=\frac{\sum w_i^2}{(\sum w_i)^2-\sum w_i^2}\sum w_i(x_i-m)^2 \eeq and the variance is \beq V(m)=\frac{\sum w_i^2}{(\sum w_i)^2-\sum w_i^2}\frac{\sum w_i(x_i-m)^2}{\sum w_i}+\frac{A}{\sum 1/\sigma_i^2} \eeq \section{Score} \subsection{Pearson} Pearson correlation is defined as: \beq \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2\sum_i (x_i-\bar{x})^2}}. \eeq The weighted version should satisfy the following conditions: \begin{itemize} \item Having N equal weights the expression reduces to the unweighted case. \item Adding a pair of data where one the weight is zero does not change the expression. \item When$x$and$y$are identical, the correlation is one. \end{itemize} Therefore we define the weighted correlation to be \beq \frac{\sum_iw_i^xw_i^y(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_iw_i^xw_i^y(x_i-\bar{x})^2\sum_iw_i^xw_i^y(x_i-\bar{x})^2}}, \eeq where \beq \bar{x}=\frac{\sum_i w^x_iw^y_ix_i}{\sum_i w^x_iw^y_i} \eeq \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y \eeq and \beq \bar{y}=\frac{\sum_i w^x_iw^y_iy_i}{\sum_i w^x_iw^y_i}. \eeq \subsection{ROC} If we have a set of values$x^+$from class + and a set of values$x^-\$ from class -, the ROC curve area is equal to \beq \frac{\sum_{\{i,j\}:x^-_i