Ignore:
Timestamp:
Jan 9, 2006, 10:38:12 AM (16 years ago)
Author:
Peter
Message:

updated Weighted Statistics document and reimplemented few stuff. NOTE, this revision may not compile. Cannot test compilation on Le Mac.

File:
1 edited

Legend:

Unmodified
Added
Removed
  • trunk/doc/Statistics.tex

    r489 r492  
    11\documentstyle[12pt]{article}
    2 %\documentstyle[slidesonly]{seminar}
    3 
    4 % $Id: error.tex,v 1.11 2004/10/01 09:08:39 peter Exp $
    5 
    6 %\input{psfig}
    7 
    8 % --- This bit puts ``draft'' over everything! ---
    9 %\special{!userdict begin /bop-hook{gsave 200 30 translate
    10 %65 rotate /Times-Roman findfont 216 scalefont setfont
    11 %0 0 moveto 0.93 setgray (DRAFT) show grestore}def end}
    12 %  --- end of this bit that puts `draft' over everything ---
    13 
    142
    153\flushbottom
     
    6452
    6553\section{Introduction}
    66 
    67 
    68 \section{WeightedAverager}
    69 
    70 Building an estimator $T$ of the parameter $\Theta$ there
    71 is a number of criterias that is welcome to be fullfilled.
    72 
     54There are several different reasons why a statistical analysis needs to adjust for weighting. In literature reasons are mainly diveded in to groups.
     55
     56The first group is when some of the measurements are known to be more precise than others. The more precise a measuremtns is the larger weight it is given. The simplest case is when the weight are given before the measurements and they can be treated as deterministic. It becomes more complicated when the weight can be determined not until afterwards, and even more complicated if the weight depends on the value of the observable.
     57
     58The second group of situations is when calculating averages over one distribution and sampling from another distribution. Compensating for this discrepency weights are introduced to the analysis. A simple example may be that we are interviewing people but for economical reasons we choose to interview more people from the city than from the countryside. When summarizing the statistics the answers from the city are given a smaller weight. In this example we are choosing the proportions of people from countryside and people from city being intervied. Hence, we can determine the weights before and consider them to be deterministic. In other situations the proportions are not deterministic, but rather a result from the sampling and the weights must be treated as stochastic and only in rare situations the weights can be treated as independent of the observable.
     59
     60Since there are various origin for a weight occuring in a statistical analysis, there are various way to treat the weights and in general the analysis should be tailored to treat the weights correctly. We have not chosen one situation for our implementations, so see specific function documentation for what assumtions are made. Though, common for implementationare the following:
    7361\begin{itemize}
    74 \item The bias $b=<T> - \Theta$ should be zero. The
    75 estimator is unbiased.
    76 
    77 \item The estimator is efficient if the mean squared error
    78 $<(T-\Theta)^2>$ is small. If the estimator is unbiased the
    79 mean squared error is equal to the variance
    80 $<(T-<T>)^2>$ of the estimator.
    81 
    82 \item The estimator is consistent if the mean squared error goes to
    83 zero in the limit of infinite number of data points.
    84 
    85 \item adding a data point with weight zero should not change the
    86 estimator.
    87 
     62\item Setting all weights to unity yields the same result as the non-weighted version.
     63\item Rescaling the weights does not change any function.
     64\item Setting a weight to zero is equivalent to removing the data point.
    8865\end{itemize}
    89 
    90 We will use these criterias to find the estimator. We will minimize
    91 the variance with the constraint that the estimator must be
    92 unbiased. Also, we check that the estimator is consistent and handles
    93 zero weight in a desired way.
     66An important case is when weights are binary (either 1 or 0). Then we get same result using the weighted version as using the data with weight not equal to zero and the non-weighted version. Hence, using binary weights and the weighted version missing values can be treated in a proper way.
     67
     68\section{AveragerWeighted}
     69
    9470
    9571
    9672\subsection{Mean}
    97 We start building an estimator of the mean in the case with varying
    98 variance.
    99 
    100 The likelihood is
    101 
    102 \beq
    103 L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}},
    104 \eeq
    105 
    106 which we want to maximize, but can equally well minimize
    107 
    108 \beq
    109 -\ln L(m) = \sum \frac{1}{2}\ln2\pi\sigma_i^2+\frac{(x_i-m)^2}{2\sigma_i^2}.
    110 \eeq
    111 
    112 Taking the derivity yields
    113 
    114 \beq
    115 \frac{d-\ln L(m)}{dm}=\sum - \frac{x_i-m}{\sigma_i^2}
    116 \eeq
    117 
    118 Hence, the Maximum Likelihood method yields the estimator
    119 
    120 \beq
    121 m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}
    122 \eeq
    123 
    124 Let us check the criterias defined above.
    125 First, we want the estimator to be unbiased.
    126 
    127 \beq
    128 b=<m>-\mu=\frac{\sum <x_i>/\sigma_i^2}{\sum 1/\sigma_i^2}-\mu=\frac{\sum \mu/\sigma_i^2}{\sum 1/\sigma_i^2}-\mu=0,
    129 \eeq
    130 
    131 so the estimator is unbiased.
    132 
    133 Second, we examine how efficient the estimator is, \ie how small the
    134 variance is.
    135 
    136 \beq
    137 V(m)=\frac{\sum V(x_i)/\sigma_i^2}{\left(\sum 1/\sigma_i^2\right)^2}=\frac{1}{\sum 1/\sigma_i^2},
    138 \eeq
    139 
    140 which obviously goes to zero when number of samples (with finite
    141 $\sigma$) goes to infinity. The estimator is consistent.
    142 
    143 Trivially we can see that a zero weight data point does not change the
    144 estimator, which was our last condition.
     73
     74For any situation the weight is always designed so the weighted mean is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously fulfills the conditions above.
     75
     76In the case of varying measurement error, it could be motivated that the weight shall be $w_i = 1/\sigma_i^2$. We assume measurement error to be Gaussian and the likelihood to get our measurements is
     77$L(m)=\prod (2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$.
     78We maximize the likelihood by taking the derivity with respect to $m$ on the logarithm of the likelihood
     79$\frac{d\ln L(m)}{dm}=\sum \frac{x_i-m}{\sigma_i^2}$. Hence, the Maximum Likelihood method yields the estimator
     80$m=\frac{\sum w_i/\sigma_i^2}{\sum 1/\sigma_i^2}$.
     81
    14582
    14683\subsection{Variance}
    147 Let us now examine the case when we do not know the variance, but only
    148 the weight $w_i$ that is proportional to the inverse
    149 of the variance $\sigma_i^2$
    150 
    151 \beq
    152 w_i=\frac{\kappa}{\sigma_i^2},
    153 \eeq
    154 
    155 and the variance is
    156 
    157 \beq
    158 V(m)=\frac{1}{\sum 1/\sigma_i^2}=\frac{\kappa}{\sum w_i}
    159 \eeq
    160 
    161 so we need to estimate $\kappa$. The likelihood is now
    162 
    163 \beq
    164 L(k)=P(k)\prod \frac{\sqrt{w_i}} {\sqrt{2\pi k}}e^{-\frac{w_i(x_i-m)^2}{2k}},
    165 \eeq
    166 
    167 where $P(k)$ is the prior probabilty distribution. If we have no prior knowledge
    168 about $k$, $P(k)$ is constant.
    169 
    170 Taking the derivity of the logarithm again yields
    171 
    172 \beq
    173 \frac{d-\ln L(k)}{dk}=-\frac{d\ln P(k)}{dk}+\sum \left(\frac{1}{2k}-\frac{w_i(x_i-m)^2}{2k^2}\right)
     84In case of varying variance, there is no point estimating a variance since it is different for each data point.
     85
     86Instead we look at the case when we want to estimate the variance over $f$ but are sampling from $f'$. For the mean of an observable $O$ we have
     87$\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum w_i}$. Hence, an estimator of the variance of $X$ is
     88\begin{eqnarray}
     89\sigma^2=<X^2>-<X>^2=
     90\\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
     91\\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}
     92\\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}
     93\\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
     94\end{eqnarray}
     95This estimator fulfills that it is invariant under a rescaling and having a weight equal to zero is equivalent to removing the data point. Having all weight equal to unity we get $\sigma=\frac{\sum (x_i-m)^2}{N}$, which is the same as returned from Averager. Hence, this estimator is slightly biased, but still very efficient.
     96
     97\subsection{Standard Error}
     98The standard error squared is equal to the expexted squared error of the estimation of $m$. The squared error consists of two parts, the variance of the estimator and the squared bias. $<m-\mu>^2=<m-<m>+<m>-\mu>^2=<m-<m>>^2+(<m>-\mu)^2$.
     99In the case when weights are included in analysis due to varying measurement errors and the weights can be treated as deterministic ,we have
     100\begin{eqnarray}
     101Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
     102\\\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}
     103\frac{\sigma_0^2}{\sum w_i},
     104\end{eqnarray}
     105where we need to estimate $\sigma_0^2$. Again we have the likelihood
     106$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{-\frac{w_i(x-m)^2}{2\sigma_0^2}}$ and taking the derivity with respect to $\sigma_o^2$,
     107$\frac{d\ln L}{d\sigma_i^2}=\sum -\frac{1}{2\sigma_0^2}+\frac{w_i(x-m)^2}{2\sigma_0^2\sigma_o^2}$
     108which yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x-m)^2$. This estimator is not ignoring weights equal to zero, because deviation is most often smaller than the expected infinity. Therefore, we modify the expression as follows $\sigma_i^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^2}\sum w_i(x-m)^2$ and we get the following estimator of the variance of the mean
     109 $\sigma_i^2=\frac{\sum w_i^2}{\left(\sum w_i\right)^3}\sum w_i(x-m)^2$. This estimator fulfills the conditions above: adding a weight zero does not change it: rescaling the weights does not change it, and setting all weights to unity yields the same expression as in the non-weighted case.
     110
     111In a case when it is not a good approximation to treat the weights as deterministic, there are two ways to get a better estimation. The first one is to linearize the expression $\left<\frac{\sum w_ix_i}{\sum w_i}\right>$. The second method when the situation is more complicated is to estimate the standard error using a bootstrapping method. 
     112
     113\section{AveragerPairWeighted}
     114Here data points come in pairs (x,y). We are sampling from $f'_{XY}$ but want to measure from $f_{XY}$. To compensate for this decrepency, averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even though, $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$ we assume that we can factorize the ratio and get $\frac{\sum w_xw_yg(x,y)}{\sum w_xw_y}$ 
     115\subsection{Covariance}
     116Following the variance calculations for AveragerWeighted we have $Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$
     117
     118\subsection{correlation}
     119
     120As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$, the variance is estimated as
     121$\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define the correlation to be the ratio between the covariance and geomtrical avergae of the variances 
     122
     123$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum w_xw_y(y-m_y)^2}}$.
     124
     125This expression fulfills the following
     126\begin{itemize}
     127\item Having N weights the expression reduces to the non-weighted expression.
     128\item Adding a pair of data, in which one weight is zero is equivalent to ignoring the data pair.
     129\item Correlation is equal to unity if and only if $x$ is equal to $y$. Otherwise the correlation is between -1 and 1.
     130\end{itemize}
     131\section{Score}
     132
     133
     134\subsection{Pearson}
     135
     136$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.
     137
     138See AveragerPairWeighted correlation.
     139
     140\subsection{ROC}
     141
     142An interpretation of the ROC curve area is the probability that if we take one sample from class $+$ and one sample from class $-$, what is the probability that the sample from class $+$ has greater value. The ROC curve area calculates the ratio of pairs fulfilling this
     143
     144\beq
     145\frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}.
     146\eeq
     147
     148An geometrical interpretation is to have a number of squares where each square correspond to a pair of samples. The ROC curve follows the border between pairs in which the samples from class $+$ has a greater value and pairs in which this is not fulfilled. The ROC curve area is the area of those latter squares and a natural extension is to weight each pair with its two weights and consequently the weighted ROC curve area becomes
     149
     150\beq
     151\frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
     152\eeq
     153
     154This expression is invariant under a rescaling of weight. Adding a data value with weight zero adds nothing to the exprssion, and having all weight equal to unity yields the non-weighted ROC curve area.
     155
     156\subsection{tScore}
     157
     158Assume that $x$ and $y$ originate from the same distribution $N(\mu,\sigma_i^2)$ where $\sigma_i^2=\frac{\sigma_0^2}{w_i}$. We then estimate $\sigma_0^2$ as
     159\begin{equation}
     160\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
     161{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
     162\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
     163\end{equation}
     164The variance of difference of the means becomes
     165\begin{eqnarray}
     166Var(m_x)+Var(m_y)=\\\frac{\sum w_i^2Var(x_i)}{\left(\sum w_i\right)^2}+\frac{\sum w_i^2Var(y_i)}{\left(\sum w_i\right)^2}=
     167\frac{\sigma_0^2}{\sum w_i}+\frac{\sigma_0^2}{\sum w_i},
     168\end{eqnarray}
     169and consequently the t-score becomes
     170\begin{equation}
     171\frac{\sum w(x-m_x)^2+\sum w(y-m_y)^2}
     172{\frac{\left(\sum w_x\right)^2}{\sum w_x^2}+
     173\frac{\left(\sum w_y\right)^2}{\sum w_y^2}-2}
     174\left(\frac{1}{\sum w_i}+\frac{1}{\sum w_i}\right),
     175\end{equation}
     176
     177For a $w_i=w$ we this expression get condensed down to
     178\begin{equation}
     179\frac{w\sum (x-m_x)^2+w\sum (y-m_y)^2}
     180{n_x+n_y-2}
     181\left(\frac{1}{wn_x}+\frac{1}{wn_y}\right),
     182\end{equation}
     183in other words the good old expression as for non-weighted.
     184
     185\subsection{FoldChange}
     186Fold-Change is simply the difference between the weighted mean of the two groups //$\frac{\sum w_xx}{\sum w_x}-\frac{\sum w_yy}{\sum w_y}$
     187
     188\subsection{WilcoxonFoldChange}
     189Taking all pair samples (one from class $+$ and one from class $-$) and calculating the weighted median of the distances.
     190
     191\section{Kernel}
     192\subsection{Polynomial Kernel}
     193The polynomial kernel of degree $N$ is defined as $(1+<x,y>)^N$, where $<x,y>$ is the linear kenrel (usual scalar product). For weights we define the linear kernel to be $<x,y>=\frac{\sum w_xw_yxy}{\sum w_xw_y}$ and the polynomial kernel can be calculated as before $(1+<x,y>)^N$. Is this kernel a proper kernel (always being semi positive definite). Yes, because $<x,y>$ is obviously a proper kernel as it is a scalar product. Adding a positive constant to a kernel yields another kernel so $1+<x,y>$ is still a proper kernel. Then also $(1+<x,y>)^N$ is a proper kernel because taking a proper kernel to the $Nth$ power yields a new proper kernel (see any good book on SVM).
     194\subsection{Gaussian Kernel}
     195We define the weighted Gaussian kernel as
     196$\exp\left(-\frac{\sum w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions listed in the introduction.
     197
     198Is this kernel a proper kernel? Yes, following the proof of the non-weighted kernel we see that $K=\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$, which is a product of two proper kernels. $\exp\left(-\frac{\sum w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum w_xw_y}\right)$ is a proper kernel, because it is a scalar product and $\exp\left(\frac{\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper kernel, because it a polynomial of the linear kernel with positive coefficients. As product of two kernel also is a kernel, the Gaussian kernel is a proper kernel.
     199
     200\section{Distance}
     201
     202\section{Regression}
     203\subsection{Naive}
     204\subsection{Linear}
     205We have the model
     206
     207\beq
     208y_i=\alpha+\beta (x-m_x)+\epsilon_i,
     209\eeq
     210
     211where $\epsilon_i$ is the noise. The variance of the noise is
     212inversely proportional to the weight,
     213$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
     214model parameters, we minimimize the sum of quadratic errors.
     215
     216\beq
     217Q_0 = \sum \epsilon_i^2
     218\eeq
     219
     220Taking the derivity with respect to $\alpha$ and $\beta$ yields two conditions
     221
     222\beq
     223\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - \beta (x_i-m_x)=0
     224\eeq
     225
     226and
     227
     228\beq
     229\frac{\partial Q_0}{\partial \beta} = -2 \sum w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0
    174230\eeq
    175231
     
    177233
    178234\beq
    179 k=\frac{1}{N}\sum w_i(x_i-m)^2+2k^2\frac{d\ln P(k)}{dk}
    180 \eeq
    181 
    182 In principle, any prior probabilty distribution $P(k)$ could be
    183 used. Here we, for simplicity, focus on the one where the last term
    184 becomes a constant, namely
    185 
    186 \beq
    187 P(k)=\exp(-\lambda/k).
    188 \eeq
    189 
    190 One problem with this choice is that we have to truncate the
    191 distribution in order to normalize it.
    192 
    193 The estimator $k$ becomes
    194 
    195 \beq
    196 k=\frac{1}{N}\sum w_i(x_i-m)^2+A,
    197 \eeq
    198 
    199 where $A$ is constant (depending on $\lambda$ and the truncation point).
    200 
    201 Having an estimation of $\kappa$ we can calculate the variance of $m$
    202 
    203 \beq
    204 V(m)=\frac{\frac{1}{N}\sum w_i(x_i-m)^2+A}{\sum w_i}=\frac{\frac{1}{N}\sum (x_i-m)^2/\sigma_i^2}{\sum 1/\sigma_i^2}+\frac{A}{\sum 1/\sigma_i^2}
    205 \eeq
    206 
    207 Let us now look at estimation of $\kappa$. Is the criterias above
    208 fullfilled? We start looking at the bias
    209 
    210 \beq
    211 b=<k>-\kappa=\frac{1}{N}\sum w_i\left<(x_i-m)^2\right>-\kappa
    212 \eeq
    213 
    214 Let us look at
    215 
    216 \bea
    217 \left<(x_i-m)^2\right>=
    218 \\\left<(x_i-\mu)^2+(m-\mu)^2-2(x_i-\mu)(m-\mu)\right>
    219 \\V(x_i)+V(m)-2\left<(x_i-\mu)(m-\mu)\right>
    220 \\V(x_i)+V(m)-2\left<(x_i-\mu)(\frac{\sum_j x_j/\sigma_j^2}{\sum_k 1/\sigma_k^2}-\mu)\right>
    221 \\V(x_i)+V(m)-2\left<(\frac{(x_i-\mu)^2/\sigma_j^2}{\sum_k 1/\sigma_k^2})\right>
    222 \\V(x_i)+V(m)-2\frac{1}{\sum_k 1/\sigma_k^2}
    223 \\\sigma_i^2-\frac{1}{\sum_k 1/\sigma_k^2}
    224 \eea
    225 
    226 so the bias is
    227 
    228 \bea
    229 b=<k>-\kappa=
    230 \\\frac{1}{N}\sum_i (w_i\sigma_i^2-\frac{w_i}{\sum_k 1/\sigma_k^2})-\kappa=
    231 \\\frac{\kappa}{N}\sum_i (1-\frac{w_i}{\sum_k w_k})-\kappa=
    232 \\\frac{\kappa}{N}(N-1)-\kappa
    233 \eea
    234 
    235 so the estimator is asymptotically unbiased. If we want the estimation
    236 to be unbiased we could \eg modify $N$ to $N-1$ , exactly as in the
    237 unweighted case, and we get the following estimator of $\kappa$
    238 
    239 \beq
    240 k=\frac{1}{N-1}\sum w_i(x_i-m)^2
    241 \eeq
    242 
    243 One problem with this estimator is that it is sensitive to weight zero
    244 samples due to the $N$. To solve that we have to express $N$ using
    245 $w$, wich will make the estimator biased. We suggest the substitution
    246 
    247 \beq
    248 N\rightarrow\frac{(\sum w_i)^2}{\sum w_i^2}
    249 \eeq
    250 
    251 so the estimator finally becomes
    252 
    253 \beq
    254 k=\frac{\sum w_i^2}{(\sum w_i)^2-\sum w_i^2}\sum w_i(x_i-m)^2
    255 \eeq
    256 
    257 and the variance is
    258 
    259 \beq
    260 V(m)=\frac{\sum w_i^2}{(\sum w_i)^2-\sum w_i^2}\frac{\sum w_i(x_i-m)^2}{\sum w_i}+\frac{A}{\sum 1/\sigma_i^2}
    261 \eeq
    262 
    263 
    264 \section{Score}
    265 \subsection{Pearson}
    266 
    267 Pearson correlation is defined as:
    268 \beq
    269 \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2\sum_i (x_i-\bar{x})^2}}.
    270 \eeq
    271 The weighted version should satisfy the following conditions:
    272 
    273 \begin{itemize}
    274 \item Having N equal weights the expression reduces to the unweighted case.
    275 \item Adding a pair of data where one the weight is zero does not change the expression.
    276 \item When $x$ and $y$ are identical, the correlation is one.
    277 \end{itemize}
    278 
    279 Therefore we define the weighted correlation to be
    280 \beq
    281 \frac{\sum_iw_i^xw_i^y(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_iw_i^xw_i^y(x_i-\bar{x})^2\sum_iw_i^xw_i^y(x_i-\bar{x})^2}},
    282 \eeq
    283 where
    284 \beq
    285 \bar{x}=\frac{\sum_i w^x_iw^y_ix_i}{\sum_i w^x_iw^y_i}
    286 \eeq
     235\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
     236\eeq
     237
    287238and
    288 \beq
    289 \bar{y}=\frac{\sum_i w^x_iw^y_iy_i}{\sum_i w^x_iw^y_i}.
    290 \eeq
    291 
    292 \subsection{ROC}
    293 If we have a set of values $x^+$ from class + and a set of values
    294 $x^-$ from class -, the ROC curve area is equal to
    295 
    296 \beq
    297 \frac{\sum_{\{i,j\}:x^-_i<x^+_j}1}{\sum_{i,j}1}
    298 \eeq
    299 
    300 so a natural extension using weights could be
    301 
    302 \beq
    303 \frac{\sum_{\{i,j\}:x^-_i<x^+_j}w^-_iw^+_j}{\sum_{i,j}w^-_iw^+_j}
    304 \eeq
    305 
    306 \section{Hierarchical clustering}
     239
     240\beq
     241\beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
     242\eeq
     243
     244Note, by having all weights equal we get back the unweighted
     245case. Furthermore, we calculate the variance of the estimators of
     246$\alpha$ and $\beta$.
     247
     248\beq
     249\textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
     250\frac{\sigma^2}{\sum w_i}
     251\eeq
     252
     253and
     254\beq
     255\textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
     256{(\sum w_i(x_i-m_x)^2)^2}=
     257\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
     258\eeq
     259
     260Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
     261unweighted estimation
     262
     263\beq
     264s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
     265\eeq
     266
     267we suggest the following estimator
     268
     269\beq
     270s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum w_i-2\frac{\sum w_i^2}{\sum w_i}}
     271\eeq
     272
     273\section{Outlook}
     274\subsection{Hierarchical clustering}
    307275\label{hc}
    308276A hierarchical clustering consists of two things: finding the two
     
    325293also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$
    326294
    327 \section{Regression}
    328 We have the model
    329 
    330 \beq
    331 y_i=\alpha+\beta (x-m_x)+\epsilon_i,
    332 \eeq
    333 
    334 where $\epsilon_i$ is the noise. The variance of the noise is
    335 inversely proportional to the weight,
    336 $Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
    337 model parameters, we minimimize the sum of quadratic errors.
    338 
    339 \beq
    340 Q_0 = \sum \epsilon_i^2
    341 \eeq
    342 
    343 Taking the derivity with respect to $\alpha$ and $\beta$ yields two conditions
    344 
    345 \beq
    346 \frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha - \beta (x_i-m_x)=0
    347 \eeq
    348 
    349 and
    350 
    351 \beq
    352 \frac{\partial Q_0}{\partial \beta} = -2 \sum w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x)=0
    353 \eeq
    354 
    355 or equivalently
    356 
    357 \beq
    358 \alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
    359 \eeq
    360 
    361 and
    362 
    363 \beq
    364 \beta=\frac{\sum w_i(x_i-m_x)(y-m_y)}{\sum w_i(x_i-m_x)^2}
    365 \eeq
    366 
    367 Note, by having all weights equal we get back the unweighted
    368 case. Furthermore, we calculate the variance of the estimators of
    369 $\alpha$ and $\beta$.
    370 
    371 \beq
    372 \textrm{Var}(\alpha )=\frac{w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
    373 \frac{\sigma^2}{\sum w_i}
    374 \eeq
    375 
    376 and
    377 \beq
    378 \textrm{Var}(\beta )= \frac{w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
    379 {(\sum w_i(x_i-m_x)^2)^2}=
    380 \frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
    381 \eeq
    382 
    383 Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
    384 unweighted estimation
    385 
    386 \beq
    387 s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
    388 \eeq
    389 
    390 we suggest the following estimator
    391 
    392 \beq
    393 s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum w_i-2\frac{\sum w_i^2}{\sum w_i}}
    394 \eeq
    395 
    396295\end{document}
    397296
Note: See TracChangeset for help on using the changeset viewer.