Index: /trunk/doc/Makefile.am
===================================================================
--- /trunk/doc/Makefile.am (revision 1108)
+++ /trunk/doc/Makefile.am (revision 1109)
@@ -25,11 +25,11 @@
 # 02111-1307, USA.
-doc: doxygen.config doxygen-local Statistics-local
+doc: doxygen.config doxygen-local
-dvi-local: Statistics.dvi
+dvi-local:
-pdf-local: Statistics.pdf
+pdf-local:
-html-local: doxygen.config doxygen-local html/Statistics/Statistics.html
+html-local: doxygen.config doxygen-local
 mostlyclean-local:
@@ -48,28 +48,2 @@
-html/Statistics/Statistics.html: Statistics.tex
-	@$(install_sh) -d html/Statistics
-	@if $(HAVE_LATEX2HTML); then \
-	latex2html -t "Weighted Statistics used in yat." \
-	-dir html/Statistics Statistics.tex;\
-	fi
-
-Statistics-local: html/Statistics/Statistics.html \
-	html/Statistics/Statistics.pdf
-
-Statistics.dvi: Statistics.tex
-	@if $(HAVE_LATEX); then \
-	@latex Statistics.tex; \
-	@latex Statistics.tex; \
-	fi
-
-Statistics.pdf: Statistics.dvi
-	@if $(HAVE_DVIPDFM); then \
-	dvipdfm Statistics.dvi; \
-	fi
-
-html/Statistics/Statistics.pdf: $(pdf)
-	@if test -f Statistics.pdf; then \
-	$(install_sh) -d html/Statistics; \
-	cp Statistics.pdf html/Statistics/.; \
-	fi
Index: /trunk/doc/Statistics.doxygen
===================================================================
--- /trunk/doc/Statistics.doxygen (revision 1109)
+++ /trunk/doc/Statistics.doxygen (revision 1109)
@@ -0,0 +1,409 @@
+// $Id$
+//
+// Copyright (C) 2005 Peter Johansson
+// Copyright (C) 2006 Jari Häkkinen, Markus Ringnér, Peter Johansson
+// Copyright (C) 2007, 2008 Peter Johansson
+//
+// This file is part of the yat library, http://trac.thep.lu.se/yat
+//
+// The yat library is free software; you can redistribute it and/or
+// modify it under the terms of the GNU General Public License as
+// published by the Free Software Foundation; either version 2 of the
+// License, or (at your option) any later version.
+//
+// The yat library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+// General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with this program; if not, write to the Free Software
+// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
+// 02111-1307, USA.
+
+
+/**
+\page weighted_statistics Weighted Statistics
+
+\section Introduction
+There are several different reasons why a statistical analysis needs
+to adjust for weighting. In the literature, the reasons are mainly divided
+into two groups.
+
+The first group is when some of the measurements are known to be more
+precise than others. The more precise a measurement is, the larger
+weight it is given. The simplest case is when the weights are given
+before the measurements and can be treated as deterministic. It
+becomes more complicated when the weights cannot be determined until
+afterwards, and even more complicated if the weight depends on the
+value of the observable.
+
+The second group of situations is when calculating averages over one
+distribution while sampling from another distribution. To compensate for
+this discrepancy, weights are introduced into the analysis. A simple
+example may be that we are interviewing people but for economic
+reasons we choose to interview more people from the city than from the
+countryside. When summarizing the statistics the answers from the city
+are given a smaller weight. In this example we are choosing the
+proportions of people from the countryside and people from the city being
+interviewed. Hence, we can determine the weights beforehand and consider
+them to be deterministic. In other situations the proportions are not
+deterministic, but rather a result of the sampling, and the weights
+must be treated as stochastic. Only in rare situations can the weights
+be treated as independent of the observable.
+
+Since there are various origins for a weight occurring in a statistical
+analysis, there are various ways to treat the weights, and in general
+the analysis should be tailored to treat the weights correctly. We
+have not chosen one situation for our implementations, so see the specific
+function documentation for what assumptions are made. However, the
+following is common to all implementations:
+
+ - Setting all weights to unity yields the same result as the
+   non-weighted version.
+ - Rescaling the weights does not change any function.
+ - Setting a weight to zero is equivalent to removing the data point.
+
+An important case is when weights are binary (either 1 or 0). Then the
+weighted version gives the same result as the non-weighted version
+applied to only the data points with non-zero weight. Hence, using
+binary weights and the weighted version, missing values can be treated
+in a proper way.
+
+\section AveragerWeighted
+
+
+
+\subsection Mean
+
+For any situation the weights are designed so that the weighted mean
+is calculated as \f$ m=\frac{\sum w_ix_i}{\sum w_i} \f$, which obviously
+fulfills the conditions above.
+
+
+
+In the case of varying measurement error, it can be motivated that
+the weight should be \f$ w_i = 1/\sigma_i^2 \f$. We assume the measurement error
+to be Gaussian, so the likelihood of our measurements is
+\f$ L(m)=\prod
+(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}} \f$. We
+maximize the likelihood by taking the derivative with respect to \f$ m \f$ of
+the logarithm of the likelihood, \f$ \frac{d\ln L(m)}{dm}=\sum
+\frac{x_i-m}{\sigma_i^2} \f$, and setting it to zero. Hence, the Maximum Likelihood method yields
+the estimator \f$ m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2} \f$.
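The weighted mean and the three conditions from the introduction can be checked numerically. The following is a minimal Python sketch, not part of the yat API; the function name `weighted_mean` and the sample values are hypothetical and chosen only for illustration.

```python
def weighted_mean(x, w):
    """Weighted mean m = sum(w_i * x_i) / sum(w_i)."""
    total = sum(w)
    if total == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(wi * xi for wi, xi in zip(w, x)) / total


# Hypothetical measurements with assumed known standard deviations.
x = [1.0, 2.0, 4.0]
sigma = [1.0, 2.0, 0.5]
w = [1.0 / s ** 2 for s in sigma]  # inverse-variance weights w_i = 1/sigma_i^2

m = weighted_mean(x, w)
```

Rescaling all weights by a common factor, or appending a data point with weight zero, should leave `m` unchanged, and unit weights should reproduce the ordinary mean.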
+
+
+\subsection Variance
+In the case of varying variance, there is no point in estimating a variance
+since it is different for each data point.
+
+Instead we look at the case when we want to estimate the variance over
+\f$f\f$ but are sampling from \f$ f' \f$. For the mean of an observable \f$ O \f$ we
+have \f$ \widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
+w_i} \f$. Hence, an estimator of the variance of \f$ X \f$ is
+
+\f$
+s^2 = \langle X^2\rangle-\langle X\rangle^2=
+\f$
+
+\f$
+ = \frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
+\f$
+
+\f$
+ = \frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
+\f$
+
+\f$
+ = \frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
+\f$
+
+\f$
+ = \frac{\sum w_i(x_i-m)^2}{\sum w_i}
+\f$
+
+This estimator is invariant under a rescaling of the weights, and
+having a weight equal to zero is equivalent to removing the data
+point. Having all weights equal to unity we get \f$ s^2=\frac{\sum
+(x_i-m)^2}{N} \f$, which is the same as returned from Averager. Hence,
+this estimator is slightly biased, but still very efficient.
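The last form of the estimator can be written directly as code. This is a minimal Python sketch under the assumption that the weights are non-negative and do not sum to zero; `weighted_variance` is a hypothetical name and not a yat function.

```python
def weighted_variance(x, w):
    """Weighted variance s^2 = sum(w_i * (x_i - m)^2) / sum(w_i)."""
    sum_w = sum(w)
    if sum_w == 0:
        raise ValueError("the weights must not sum to zero")
    m = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum_w


# Illustrative data; with unit weights this is the biased 1/N estimator.
x = [1.0, 2.0, 3.0, 4.0]
s2_unit = weighted_variance(x, [1.0, 1.0, 1.0, 1.0])
```

With unit weights the result equals the mean of the squares minus the square of the mean, which matches the first line of the derivation.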
+
+\subsection standard_error Standard Error
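+The standard error squared is equal to the expected squared error of
+the estimation of \f$m\f$. The squared error consists of two parts, the
+variance of the estimator and the squared bias:
+
+\f$
+\langle (m-\mu)^2\rangle=\langle (m-\langle m\rangle+\langle m\rangle-\mu)^2\rangle=
+\f$
+\f$
+\langle (m-\langle m\rangle)^2\rangle+(\langle m\rangle-\mu)^2
+\f$.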
+
+In the case when weights are included in analysis due to varying
+measurement errors and the weights can be treated as deterministic, we
+have
+
+\f$
+Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
+\f$
+\f$
+\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
+\f$
+\f$
+\frac{\sigma_0^2}{\sum w_i},
+\f$
+
+where we need to estimate \f$ \sigma_0^2 \f$. Again we have the likelihood
+
+\f$
+L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{\left(-\frac{w_i(x_i-m)^2}{2\sigma_0^2}\right)}
+\f$
+and taking the derivative with respect to
+\f$\sigma_0^2\f$,
+
+\f$
+\frac{d\ln L}{d\sigma_0^2}=
+\f$
+\f$
+\sum \left(-\frac{1}{2\sigma_0^2}+\frac{w_i(x_i-m)^2}{2\sigma_0^2\sigma_0^2}\right)
+\f$
+
+which
+yields an estimator \f$ \sigma_0^2=\frac{1}{N}\sum w_i(x_i-m)^2 \f$. This
+estimator does not ignore data points with weight equal to zero: such a
+point adds nothing to the sum, because its deviation is most often
+smaller than the expected infinity, but it is still counted in \f$N\f$.
+Therefore, we modify
+the expression as follows \f$\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
+w_i\right)^2}\sum w_i(x_i-m)^2\f$ and we get the following estimator of
+the variance of the mean \f$Var(m)=\frac{\sum w_i^2}{\left(\sum
+w_i\right)^3}\sum w_i(x_i-m)^2\f$. This estimator fulfills the conditions
+above: adding a weight zero does not change it, rescaling the weights
+does not change it, and setting all weights to unity yields the same
+expression as in the non-weighted case.
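The modified estimator of the variance of the mean, and the three conditions it is claimed to fulfill, can be verified numerically. The sketch below is plain Python written for illustration; `variance_of_weighted_mean` is a hypothetical name and the data are made up.

```python
def variance_of_weighted_mean(x, w):
    """Var(m) = sum(w_i^2) / (sum(w_i))^3 * sum(w_i * (x_i - m)^2)."""
    sum_w = sum(w)
    if sum_w == 0:
        raise ValueError("the weights must not sum to zero")
    sum_w2 = sum(wi * wi for wi in w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    weighted_ss = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x))
    return sum_w2 / sum_w ** 3 * weighted_ss


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 1.0, 2.0, 1.5]
var_m = variance_of_weighted_mean(x, w)
```

With unit weights the expression reduces to the sum of squared deviations divided by N squared, that is, the biased variance divided by N.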
+
+In a case when it is not a good approximation to treat the weights as
+deterministic, there are two ways to get a better estimation. The
+first one is to linearize the expression \f$\left<\frac{\sum
+w_ix_i}{\sum w_i}\right>\f$. The second method when the situation is
+more complicated is to estimate the standard error using a
+bootstrapping method.
+
+\section AveragerPairWeighted
+Here data points come in pairs (x,y). We are sampling from \f$f'_{XY}\f$
+but want to measure from \f$f_{XY}\f$. To compensate for this discrepancy,
+averages of \f$g(x,y)\f$ are taken as \f$\sum \frac{f}{f'}g(x,y)\f$. Even
+though \f$X\f$ and \f$Y\f$ are not independent \f$(f_{XY}\neq f_Xf_Y)\f$, we
+assume that we can factorize the ratio and get \f$\frac{\sum
+w_xw_yg(x,y)}{\sum w_xw_y}\f$.
+\subsection Covariance
+Following the variance calculations for AveragerWeighted we have
+\f$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}\f$ where
+\f$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}\f$.
+
+\subsection Correlation
+
+As the mean is estimated as
+\f$
+m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}
+\f$,
+the variance is estimated as
+\f$
+\sigma_x^2=\frac{\sum w_xw_y(x-m_x)^2}{\sum w_xw_y}
+\f$.
+As in the non-weighted case we define the correlation to be the ratio
+between the covariance and the geometrical average of the variances
+
+\f$
+\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
+w_xw_y(y-m_y)^2}}
+\f$.
+
+
+This expression fulfills the following
+ - Having N equal weights the expression reduces to the non-weighted expression.
+ - Adding a pair of data, in which one weight is zero, is equivalent
+   to ignoring the data pair.
+ - Correlation is equal to unity if and only if \f$x\f$ is equal to
+   \f$y\f$. Otherwise the correlation is between -1 and 1.
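The weighted correlation can be sketched in a few lines of Python. This is an illustration only, assuming the pair weight is the product of the two per-value weights as in the formulas above; `weighted_correlation` is a hypothetical helper, not the AveragerPairWeighted interface.

```python
import math


def weighted_correlation(x, y, wx, wy):
    """Correlation with pair weights w = wx * wy."""
    w = [a * b for a, b in zip(wx, wy)]
    sum_w = sum(w)
    if sum_w == 0:
        raise ValueError("the pair weights must not sum to zero")
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    my = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / math.sqrt(sxx * syy)


# Illustrative data with unit weights.
x = [1.0, 2.0, 3.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0]
ones = [1.0, 1.0, 1.0, 1.0]
r = weighted_correlation(x, y, ones, ones)
```

Appending a pair in which one of the two weights is zero gives a pair weight of zero, so the pair drops out of every sum.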
+
+\section Score
+
+\subsection Pearson
+
+\f$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}\f$.
+
+See AveragerPairWeighted correlation.
+
+\subsection ROC
+
+The ROC curve area can be interpreted as follows: if we
+take one sample from class \f$+\f$ and one sample from class \f$-\f$, it is
+the probability that the sample from class \f$+\f$ has the greater value. The
+ROC curve area calculates the ratio of pairs fulfilling this
+
+\f$
+\frac{\sum_{\{i,j\}:x^_i)^N\f$, where
+\f$\langle x,y\rangle\f$ is the linear kernel (usual scalar product). For the weighted
+case we define the linear kernel to be \f$\langle x,y\rangle=\sum {w_xw_yxy}\f$ and the
+polynomial kernel can be calculated as before,
+\f$(1+\langle x,y\rangle)^N\f$. Is this kernel a proper kernel (always being positive
+semi-definite)? Yes, because \f$\langle x,y\rangle\f$ is obviously a proper kernel
+as it is a scalar product. Adding a positive constant to a kernel
+yields another kernel, so \f$1+\langle x,y\rangle\f$ is still a proper kernel. Then also
+\f$(1+\langle x,y\rangle)^N\f$ is a proper kernel because taking a proper kernel to the
+\f$N\f$th power yields a new proper kernel (see any good book on SVM).
+\subsection gaussian_kernel Gaussian Kernel
+We define the weighted Gaussian kernel as \f$\exp\left(-\frac{\sum
+w_xw_y(x-y)^2}{\sum w_xw_y}\right)\f$, which fulfills the conditions
+listed in the introduction.
+
+Is this kernel a proper kernel? Yes, following the proof of the
+non-weighted kernel we see that \f$K=\exp\left(-\frac{\sum
+w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
+w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)\f$,
+which is a product of two proper kernels. \f$\exp\left(-\frac{\sum
+w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
+w_xw_y}\right)\f$ is a proper kernel, because it is a scalar product, and
+\f$\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)\f$ is a proper
+kernel, because it is a polynomial of the linear kernel with positive
+coefficients. As the product of two kernels is also a kernel, the Gaussian
+kernel is a proper kernel.
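As a sketch of the definition above, the weighted Gaussian kernel can be evaluated as follows. The code is illustrative Python with a hypothetical function name, it implements the formula exactly as written in this section (no width parameter), and it is not the yat kernel interface.

```python
import math


def weighted_gaussian_kernel(x, y, wx, wy):
    """exp(-sum(wx*wy*(x-y)^2) / sum(wx*wy)) for two weighted vectors."""
    w = [a * b for a, b in zip(wx, wy)]
    sum_w = sum(w)
    if sum_w == 0:
        raise ValueError("the pair weights must not sum to zero")
    d2 = sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)) / sum_w
    return math.exp(-d2)


# Illustrative vectors and weights.
x = [0.0, 1.0, 3.0]
y = [1.0, 1.0, 0.0]
wx = [1.0, 0.5, 2.0]
wy = [2.0, 1.0, 0.5]
k = weighted_gaussian_kernel(x, y, wx, wy)
```

The kernel is symmetric in its arguments, equals one for identical vectors, and is unchanged when either weight vector is rescaled.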
+
+\section Distance
+
+\section Regression
+\subsection Naive
+\subsection Linear
+We have the model
+
+\f$
+y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
+\f$
+
+where \f$\epsilon_i\f$ is the noise. The variance of the noise is
+inversely proportional to the weight,
+\f$Var(\epsilon_i)=\frac{\sigma^2}{w_i}\f$. In order to determine the
+model parameters, we minimize the weighted sum of quadratic errors.
+
+\f$
+Q_0 = \sum w_i\epsilon_i^2
+\f$
+
+Taking the derivative with respect to \f$\alpha\f$ and \f$\beta\f$ yields two conditions
+
+\f$
+\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
+\beta (x_i-m_x))=0
+\f$
+
+and
+
+\f$ \frac{\partial Q_0}{\partial \beta} = -2 \sum
+w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
+\f$
+
+or equivalently
+
+\f$
+\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
+\f$
+
+and
+
+\f$ \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
+w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
+\f$
+
+Note, by having all weights equal we get back the unweighted
+case. Furthermore, we calculate the variance of the estimators of
+\f$\alpha\f$ and \f$\beta\f$.
+
+\f$
+\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
+\frac{\sigma^2}{\sum w_i}
+\f$
+
+and
+\f$
+\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
+{(\sum w_i(x_i-m_x)^2)^2}=
+\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
+\f$
+
+Finally, we estimate the level of noise, \f$\sigma^2\f$. Inspired by the
+unweighted estimation
+
+\f$
+s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
+\f$
+
+we suggest the following estimator
+
+\f$ s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
+w_i-2\frac{\sum w_i^2}{\sum w_i}} \f$
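The closed-form expressions for the weighted linear fit can be collected into one small routine. The Python below is a minimal sketch for illustration, assuming the weighted mean of x is used for the centering; `weighted_linear_fit` is a hypothetical name and not part of the yat regression classes.

```python
def weighted_linear_fit(x, y, w):
    """Fit y = alpha + beta * (x - m_x) with weights w.

    Returns (alpha, beta, s2), where s2 is the suggested noise estimate.
    """
    sum_w = sum(w)
    if sum_w == 0:
        raise ValueError("the weights must not sum to zero")
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    my = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    alpha = my          # alpha equals the weighted mean of y
    beta = sxy / sxx    # weighted covariance over weighted variance
    resid_ss = sum(
        wi * (yi - alpha - beta * (xi - mx)) ** 2
        for wi, xi, yi in zip(w, x, y)
    )
    # Denominator reduces to n - 2 when all weights are unity.
    s2 = resid_ss / (sum_w - 2.0 * sum(wi * wi for wi in w) / sum_w)
    return alpha, beta, s2


# Illustrative data with unit weights.
alpha, beta, s2 = weighted_linear_fit([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], [1.0, 1.0, 1.0])
```

Multiplying all weights by the same constant leaves all three returned values unchanged in this sketch, since numerator and denominator of each ratio scale identically.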
+
+*/
+
+
+
Index: /trunk/doc/Statistics.tex
===================================================================
--- /trunk/doc/Statistics.tex (revision 1108)
+++ (revision )
@@ -1,420 +1,0 @@
\documentclass[12pt]{article}

% $Id$
%
% Copyright (C) 2005 Peter Johansson
% Copyright (C) 2006 Jari Häkkinen, Markus Ringnér, Peter Johansson
% Copyright (C) 2007 Peter Johansson
%
% This file is part of the yat library, http://trac.thep.lu.se/yat
%
% The yat library is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of the
% License, or (at your option) any later version.
%
% The yat library is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%
% You should have received a copy of the GNU General Public License
% along with this program; if not, write to the Free Software
% Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
% 02111-1307, USA.



\flushbottom
\footskip 54pt
\headheight 0pt
\headsep 0pt
\oddsidemargin 0pt
\parindent 0pt
\parskip 2ex
\textheight 230mm
\textwidth 165mm
\topmargin 0pt

\renewcommand{\baselinestretch} {1.0}
\renewcommand{\textfraction} {0.1}
\renewcommand{\topfraction} {1.0}
\renewcommand{\bottomfraction} {1.0}
\renewcommand{\floatpagefraction} {1.0}

\renewcommand{\d}{{\mathrm{d}}}
\newcommand{\nd}{$^{\mathrm{nd}}$}
\newcommand{\eg}{{\it {e.g.}}}
\newcommand{\ie}{{\it {i.e., }}}
\newcommand{\etal}{{\it {et al.}}}
\newcommand{\eref}[1]{Eq.~(\ref{e:#1})}
\newcommand{\fref}[1]{Fig.~\ref{f:#1}}
\newcommand{\ovr}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)}

\begin{document}

\large
{\bf Weighted Statistics}
\normalsize

\tableofcontents
\clearpage

\section{Introduction}
There are several different reasons why a statistical analysis needs
to adjust for weighting. In the literature, the reasons are mainly divided
into two groups.

The first group is when some of the measurements are known to be more
precise than others. The more precise a measurement is, the larger
weight it is given. The simplest case is when the weights are given
before the measurements and can be treated as deterministic. It
becomes more complicated when the weights cannot be determined until
afterwards, and even more complicated if the weight depends on the
value of the observable.

The second group of situations is when calculating averages over one
distribution and sampling from another distribution. To compensate for
this discrepancy, weights are introduced into the analysis. A simple
example may be that we are interviewing people but for economical
reasons we choose to interview more people from the city than from the
countryside. When summarizing the statistics the answers from the city
are given a smaller weight. In this example we are choosing the
proportions of people from the countryside and people from the city being
interviewed. Hence, we can determine the weights beforehand and consider
them to be deterministic. In other situations the proportions are not
deterministic, but rather a result of the sampling, and the weights
must be treated as stochastic. Only in rare situations can the weights
be treated as independent of the observable.

Since there are various origins for a weight occurring in a statistical
analysis, there are various ways to treat the weights, and in general
the analysis should be tailored to treat the weights correctly. We
have not chosen one situation for our implementations, so see the specific
function documentation for what assumptions are made. However, the
following is common to all implementations:
\begin{itemize}
\item Setting all weights to unity yields the same result as the
non-weighted version.
\item Rescaling the weights does not change any function.
\item Setting a weight to zero is equivalent to removing the data point.
\end{itemize}
An important case is when weights are binary (either 1 or 0). Then the
weighted version gives the same result as the non-weighted version
applied to only the data points with non-zero weight. Hence, using
binary weights and the weighted version, missing values can be treated
in a proper way.

\section{AveragerWeighted}



\subsection{Mean}

For any situation the weight is always designed so the weighted mean
is calculated as $m=\frac{\sum w_ix_i}{\sum w_i}$, which obviously
fulfills the conditions above.

In the case of varying measurement error, it can be motivated that
the weight should be $w_i = 1/\sigma_i^2$. We assume the measurement error
to be Gaussian, so the likelihood of our measurements is
$L(m)=\prod
(2\pi\sigma_i^2)^{-1/2}e^{-\frac{(x_i-m)^2}{2\sigma_i^2}}$. We
maximize the likelihood by taking the derivative with respect to $m$ of
the logarithm of the likelihood, $\frac{d\ln L(m)}{dm}=\sum
\frac{x_i-m}{\sigma_i^2}$, and setting it to zero. Hence, the Maximum Likelihood method yields
the estimator $m=\frac{\sum x_i/\sigma_i^2}{\sum 1/\sigma_i^2}$.


\subsection{Variance}
In the case of varying variance, there is no point in estimating a variance
since it is different for each data point.

Instead we look at the case when we want to estimate the variance over
$f$ but are sampling from $f'$. For the mean of an observable $O$ we
have $\widehat O=\sum\frac{f}{f'}O_i=\frac{\sum w_iO_i}{\sum
w_i}$. Hence, an estimator of the variance of $X$ is
\begin{eqnarray}
\sigma^2=\langle X^2\rangle-\langle X\rangle^2=
\\\frac{\sum w_ix_i^2}{\sum w_i}-\frac{(\sum w_ix_i)^2}{(\sum w_i)^2}=
\\\frac{\sum w_i(x_i^2-m^2)}{\sum w_i}=
\\\frac{\sum w_i(x_i^2-2mx_i+m^2)}{\sum w_i}=
\\\frac{\sum w_i(x_i-m)^2}{\sum w_i}
\end{eqnarray}
This estimator is invariant under a rescaling of the weights, and
having a weight equal to zero is equivalent to removing the data
point. Having all weights equal to unity we get $\sigma^2=\frac{\sum
(x_i-m)^2}{N}$, which is the same as returned from Averager. Hence,
this estimator is slightly biased, but still very efficient.

\subsection{Standard Error}
The standard error squared is equal to the expected squared error of
the estimation of $m$. The squared error consists of two parts, the
variance of the estimator and the squared
bias: $\langle (m-\mu)^2\rangle=\langle (m-\langle m\rangle+\langle m\rangle-\mu)^2\rangle=\langle (m-\langle m\rangle)^2\rangle+(\langle m\rangle-\mu)^2$. In the
case when weights are included in analysis due to varying measurement
errors and the weights can be treated as deterministic, we have
\begin{equation}
Var(m)=\frac{\sum w_i^2\sigma_i^2}{\left(\sum w_i\right)^2}=
\frac{\sum w_i^2\frac{\sigma_0^2}{w_i}}{\left(\sum w_i\right)^2}=
\frac{\sigma_0^2}{\sum w_i},
\end{equation}
where we need to estimate $\sigma_0^2$. Again we have the likelihood
$L(\sigma_0^2)=\prod\frac{1}{\sqrt{2\pi\sigma_0^2/w_i}}\exp{\left(-\frac{w_i(x_i-m)^2}{2\sigma_0^2}\right)}$
and taking the derivative with respect to $\sigma_0^2$, $\frac{d\ln
L}{d\sigma_0^2}=\sum
\left(-\frac{1}{2\sigma_0^2}+\frac{w_i(x_i-m)^2}{2\sigma_0^2\sigma_0^2}\right)$ which
yields an estimator $\sigma_0^2=\frac{1}{N}\sum w_i(x_i-m)^2$. This
estimator does not ignore data points with weight equal to zero: such a
point adds nothing to the sum, because its deviation is most often
smaller than the expected infinity, but it is still counted in $N$.
Therefore, we modify
the expression as follows $\sigma_0^2=\frac{\sum w_i^2}{\left(\sum
w_i\right)^2}\sum w_i(x_i-m)^2$ and we get the following estimator of
the variance of the mean $Var(m)=\frac{\sum w_i^2}{\left(\sum
w_i\right)^3}\sum w_i(x_i-m)^2$. This estimator fulfills the conditions
above: adding a weight zero does not change it, rescaling the weights
does not change it, and setting all weights to unity yields the same
expression as in the non-weighted case.

In a case when it is not a good approximation to treat the weights as
deterministic, there are two ways to get a better estimation. The
first one is to linearize the expression $\left<\frac{\sum
w_ix_i}{\sum w_i}\right>$. The second method when the situation is
more complicated is to estimate the standard error using a
bootstrapping method.

\section{AveragerPairWeighted}
Here data points come in pairs (x,y). We are sampling from $f'_{XY}$
but want to measure from $f_{XY}$. To compensate for this discrepancy,
averages of $g(x,y)$ are taken as $\sum \frac{f}{f'}g(x,y)$. Even
though $X$ and $Y$ are not independent $(f_{XY}\neq f_Xf_Y)$, we
assume that we can factorize the ratio and get $\frac{\sum
w_xw_yg(x,y)}{\sum w_xw_y}$.
\subsection{Covariance}
Following the variance calculations for AveragerWeighted we have
$Cov=\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sum w_xw_y}$ where
$m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$.

\subsection{correlation}

As the mean is estimated as $m_x=\frac{\sum w_xw_yx}{\sum w_xw_y}$,
the variance is estimated as $\sigma_x^2=\frac{\sum
w_xw_y(x-m_x)^2}{\sum w_xw_y}$. As in the non-weighted case we define
the correlation to be the ratio between the covariance and the geometrical
average of the variances

$\frac{\sum w_xw_y(x-m_x)(y-m_y)}{\sqrt{\sum w_xw_y(x-m_x)^2\sum
w_xw_y(y-m_y)^2}}$.

This expression fulfills the following
\begin{itemize}
\item Having N equal weights the expression reduces to the non-weighted expression.
\item Adding a pair of data, in which one weight is zero, is equivalent
to ignoring the data pair.
\item Correlation is equal to unity if and only if $x$ is equal to
$y$. Otherwise the correlation is between $-1$ and 1.
\end{itemize}
\section{Score}


\subsection{Pearson}

$\frac{\sum w(x-m_x)(y-m_y)}{\sqrt{\sum w(x-m_x)^2\sum w(y-m_y)^2}}$.

See AveragerPairWeighted correlation.

\subsection{ROC}

The ROC curve area can be interpreted as follows: if we
take one sample from class $+$ and one sample from class $-$, it is
the probability that the sample from class $+$ has the greater value. The
ROC curve area calculates the ratio of pairs fulfilling this

\begin{equation}
\frac{\sum_{\{i,j\}:x^_i)^N$, where
$\langle x,y\rangle$ is the linear kernel (usual scalar product). For the weighted
case we define the linear kernel to be $\langle x,y\rangle=\sum {w_xw_yxy}$ and the
polynomial kernel can be calculated as before,
$(1+\langle x,y\rangle)^N$. Is this kernel a proper kernel (always being positive
semi-definite)? Yes, because $\langle x,y\rangle$ is obviously a proper kernel
as it is a scalar product. Adding a positive constant to a kernel
yields another kernel, so $1+\langle x,y\rangle$ is still a proper kernel. Then also
$(1+\langle x,y\rangle)^N$ is a proper kernel because taking a proper kernel to the
$N$th power yields a new proper kernel (see any good book on SVM).
\subsection{Gaussian Kernel}
We define the weighted Gaussian kernel as $\exp\left(-\frac{\sum
w_xw_y(x-y)^2}{\sum w_xw_y}\right)$, which fulfills the conditions
listed in the introduction.

Is this kernel a proper kernel? Yes, following the proof of the
non-weighted kernel we see that $K=\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$,
which is a product of two proper kernels. $\exp\left(-\frac{\sum
w_xw_yx^2}{\sum w_xw_y}\right)\exp\left(-\frac{\sum w_xw_yy^2}{\sum
w_xw_y}\right)$ is a proper kernel, because it is a scalar product, and
$\exp\left(\frac{2\sum w_xw_yxy}{\sum w_xw_y}\right)$ is a proper
kernel, because it is a polynomial of the linear kernel with positive
coefficients. As the product of two kernels is also a kernel, the Gaussian
kernel is a proper kernel.

\section{Distance}

\section{Regression}
\subsection{Naive}
\subsection{Linear}
We have the model

\begin{equation}
y_i=\alpha+\beta (x_i-m_x)+\epsilon_i,
\end{equation}

where $\epsilon_i$ is the noise. The variance of the noise is
inversely proportional to the weight,
$Var(\epsilon_i)=\frac{\sigma^2}{w_i}$. In order to determine the
model parameters, we minimize the weighted sum of quadratic errors.

\begin{equation}
Q_0 = \sum w_i\epsilon_i^2
\end{equation}

Taking the derivative with respect to $\alpha$ and $\beta$ yields two conditions

\begin{equation}
\frac{\partial Q_0}{\partial \alpha} = -2 \sum w_i(y_i - \alpha -
\beta (x_i-m_x))=0
\end{equation}

and

\begin{equation} \frac{\partial Q_0}{\partial \beta} = -2 \sum
w_i(x_i-m_x)(y_i-\alpha-\beta(x_i-m_x))=0
\end{equation}

or equivalently

\begin{equation}
\alpha = \frac{\sum w_iy_i}{\sum w_i}=m_y
\end{equation}

and

\begin{equation} \beta=\frac{\sum w_i(x_i-m_x)(y_i-m_y)}{\sum
w_i(x_i-m_x)^2}=\frac{Cov(x,y)}{Var(x)}
\end{equation}

Note, by having all weights equal we get back the unweighted
case. Furthermore, we calculate the variance of the estimators of
$\alpha$ and $\beta$.

\begin{equation}
\textrm{Var}(\alpha )=\frac{\sum w_i^2\frac{\sigma^2}{w_i}}{(\sum w_i)^2}=
\frac{\sigma^2}{\sum w_i}
\end{equation}

and
\begin{equation}
\textrm{Var}(\beta )= \frac{\sum w_i^2(x_i-m_x)^2\frac{\sigma^2}{w_i}}
{(\sum w_i(x_i-m_x)^2)^2}=
\frac{\sigma^2}{\sum w_i(x_i-m_x)^2}
\end{equation}

Finally, we estimate the level of noise, $\sigma^2$. Inspired by the
unweighted estimation

\begin{equation}
s^2=\frac{\sum (y_i-\alpha-\beta (x_i-m_x))^2}{n-2}
\end{equation}

we suggest the following estimator

\begin{equation} s^2=\frac{\sum w_i(y_i-\alpha-\beta (x_i-m_x))^2}{\sum
w_i-2\frac{\sum w_i^2}{\sum w_i}} \end{equation}

\section{Outlook}
\subsection{Hierarchical clustering}
A hierarchical clustering consists of two things: finding the two
closest data points, and merging these two data points into a new data point
and calculating the new distances from this point to all other points.

In the first item, we need a distance matrix, and if we use Euclidean
distances the natural modification of the expression would be

\begin{equation}
d(x,y)=\frac{\sum w_i^xw_j^y(x_i-y_i)^2}{\sum w_i^xw_j^y}
\end{equation}

For the second item, inspired by average linkage, we suggest

\begin{equation}
d(xy,z)=\frac{\sum w_i^xw_j^z(x_i-z_i)^2+\sum
w_i^yw_j^z(y_i-z_i)^2}{\sum w_i^xw_j^z+\sum w_i^yw_j^z}
\end{equation}

to be the distance between the new merged point $xy$ and $z$, and we
also calculate new weights for this point: $w^{xy}_i=w^x_i+w^y_i$.

\end{document}



Index: /trunk/doc/doxygen.config.in
===================================================================
--- /trunk/doc/doxygen.config.in (revision 1108)
+++ /trunk/doc/doxygen.config.in (revision 1109)
@@ -350,5 +350,5 @@
 # with spaces.
-INPUT = first_page.doxygen namespaces.doxygen concepts.doxygen ../yat
+INPUT = first_page.doxygen namespaces.doxygen concepts.doxygen Statistics.doxygen ../yat
# If the value of the INPUT tag contains directories, you can use the
@@ -874,20 +874,4 @@
 DOTFILE_DIRS =
-# The MAX_DOT_GRAPH_WIDTH tag can be used to set the maximum allowed width
-# (in pixels) of the graphs generated by dot. If a graph becomes larger than
-# this value, doxygen will try to truncate the graph, so that it fits within
-# the specified constraint. Beware that most browsers cannot cope with very
-# large images.
-
-MAX_DOT_GRAPH_WIDTH = 1024
-
-# The MAX_DOT_GRAPH_HEIGHT tag can be used to set the maximum allows height
-# (in pixels) of the graphs generated by dot. If a graph becomes larger than
-# this value, doxygen will try to truncate the graph, so that it fits within
-# the specified constraint. Beware that most browsers cannot cope with very
-# large images.
-
-MAX_DOT_GRAPH_HEIGHT = 1024
-
# If the GENERATE_LEGEND tag is set to YES (the default) Doxygen will
# generate a legend page explaining the meaning of the various boxes and
Index: /trunk/doc/first_page.doxygen
===================================================================
--- /trunk/doc/first_page.doxygen (revision 1108)
+++ /trunk/doc/first_page.doxygen (revision 1109)
@@ -36,9 +36,4 @@
href="namespacemembers.html">Namespace Members link above.
 There is a document on the weighted statistics included in the
 package with underlying theory and more detailed motivations [ html  pdf ].

Future development
We use trac as issue tracking system. Through the