# Changeset 2594 for trunk/yat/statistics/ROC.h

Timestamp:
Oct 30, 2011, 3:36:17 AM (11 years ago)
Message:

improve docs for ROC and sister class AUC. closes #144

File:
1 edited

r2592

    ///
    /// @brief Class for Receiver Operating Characteristic.
    ///
    /// @brief Receiver Operating Characteristic.
    ///
    /// As the area under an ROC curve is equivalent to the Mann-Whitney U
    /// statistic, this class can be used to perform a Mann-Whitney
    /// U-test (aka Wilcoxon).
    ///
    /// \see AUC
    ///
    class ROC

    /**
       Adding a data value to ROC.
       \brief Add a data value.

       \param value data value
       \param target \c true if value belongs to class positive
       \param weight indicating how important the data point is. A zero
       weight implies the data point is ignored. A negative weight
       should be understood as removing a data point and thus typically
       only makes sense if there is a previously added data point with
       same \a value and \a target.
    */
    void add(double value, bool target, double weight=1.0);

    /**
       The area is defined as \f$\frac{\sum w^+w^-} {\sum w^+w^-}\f$, where
       the sum in the numerator goes over all pairs where value+ is larger
       than value-. The denominator goes over all pairs.

       \brief Area Under Curve, AUC

       \see AUC for how the area is calculated

       @return Area under curve.
    */
    double area(void);

    ///
    /// minimum_size is the threshold for when a normal
    /// approximation is used for the p-value calculation.
    ///
    /// @return reference to minimum_size
    ///
    /**
       \brief threshold for p_value calculation

       Function can be used to change the minimum_size.

       \return reference to threshold minimum size
    */
    unsigned int& minimum_size(void);

    /**
       minimum_size is the threshold for when a normal
       approximation is used for the p-value calculation.
       \brief threshold for p_value calculation

       Threshold deciding whether the p-value is computed using the exact
       method or a Gaussian approximation. If both the number of positive
       samples, n_pos(void), and the number of negative samples,
       n_neg(void), are smaller than minimum_size, the exact method is
       used.

       @return const reference to minimum_size
       \see p_value

       \return const reference to minimum_size
    */
    const unsigned int& minimum_size(void) const;

    ///
    /// \brief number of samples
    ///
    /// @return sum of weights
    ///

    ///
    /// \brief number of negative samples
    ///
    /// @return sum of weights with negative target
    ///

    ///
    /// \brief number of positive samples
    ///
    /// @return sum of weights with positive target
    ///
    double n_pos(void) const;

    ///
    ///Calculates the p-value, i.e. the probability of observing an
    ///area equally or larger if the null hypothesis is true. If P is
    ///near zero, this casts doubt on this hypothesis. The null
    ///hypothesis is that the values from the 2 classes are generated
    ///from 2 identical distributions. The alternative is that the
    ///median of the first distribution is shifted from the median of
    ///the second distribution by a non-zero amount. If the smallest
    ///group size is larger than minimum_size (default = 10), then P
    ///is calculated using a normal approximation.
    ///
    /// \note Weights should be either zero or unity, else present
    /// implementation is nonsense.
    ///
    /// @return One-sided p-value.
    ///
    /**
       \brief One-sided P-value

       Calculates the one-sided p-value, i.e., the probability to get this
       area (or greater) given that there is no difference
       between the two classes.

       \b Exact \b method: In the exact method the function goes
       through all permutations and counts the fraction for which the
       area is greater than (or equal to) the area in the original
       permutation.

       \b Large-sample \b Approximation: When many data points are
       available, see minimum_size(), a Gaussian approximation is used
       and the p-value is calculated as
       \f[
       P = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^z
       \exp{\left(-\frac{t^2}{2}\right)} dt
       \f]

       where
       \f[
       z = \frac{\textrm{area} - 0.5 - 0.5/(n^+ \cdot n^-)}{s}
       \f]
       and
       \f[
       s^2 = \frac{n+1+\sum \left(n_x \cdot (n_x^2-1)\right)}
       {12\cdot n^+\cdot n^-}
       \f]
       where the sum runs over different data values (of ties) and
       \f$n_x \f$ is the number of data points with that value. The sum is
       a correction term for ties and is zero if there are no ties.

       \return \f$P(a \ge \textrm{area}) \f$

       \note Weights should be -1, 0, or 1; otherwise the p-value is
       undefined and may change in future versions.
    */
    double p_value_one_sided(void) const;

    /**
       @brief Two-sided p-value.

       @return min(2*p_value_one_sided, 2-2*p_value_one_sided)
       \brief Two-sided p-value.

       Calculates the probability to get an area, \c a, equal or more
       extreme than \c area
       \f[
       P(a \ge \textrm{max}(\textrm{area},1-\textrm{area})) +
       P(a \le \textrm{min}(\textrm{area}, 1-\textrm{area})) \f]

       If there are no ties, the distribution of \a a is symmetric, so if
       area is greater than 0.5, this boils down to \f$P = 2*P(a \ge
       \textrm{area}) = 2*P_\textrm{one-sided}\f$.

       \return two-sided p-value

       \see p_value_one_sided
    */
    double p_value(void) const;
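The area definition in the comments above (weighted positive/negative pairs in which value+ is larger than value-, divided by all weighted pairs) can be illustrated with a small standalone sketch. This is not the yat implementation: `Sample` and `weighted_area` are hypothetical names, the quadratic loop is chosen for readability only, and ties are given no credit because the wording above only mentions pairs where value+ is larger.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the arguments of one add(value, target, weight) call.
struct Sample {
  double value;
  bool target;    // true if the sample belongs to the positive class
  double weight;
};

// Weighted area: sum of w+ * w- over pairs with value+ > value-,
// divided by the sum of w+ * w- over all positive/negative pairs.
// Assumption: tied pairs contribute nothing to the numerator.
double weighted_area(const std::vector<Sample>& data)
{
  double numerator = 0.0;
  double denominator = 0.0;
  for (const Sample& pos : data) {
    if (!pos.target) continue;
    for (const Sample& neg : data) {
      if (neg.target) continue;
      const double w = pos.weight * neg.weight;
      denominator += w;
      if (pos.value > neg.value) numerator += w;
    }
  }
  assert(denominator != 0.0);  // needs at least one weighted pos/neg pair
  return numerator / denominator;
}
```

With unit weights this reduces to the fraction of positive/negative pairs that are ordered correctly; a zero weight removes a sample from both sums, matching the description of the `weight` parameter.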
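The exact method described above (go through the permutations and count the fraction whose area is at least the observed one) might look roughly like the following for unweighted data. The names `pair_area` and `exact_p_one_sided` are hypothetical, labels are stored as 0/1 integers, and the sketch enumerates distinct labellings rather than every permutation, which gives the same fraction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of (positive, negative) pairs in which the positive value is larger.
double pair_area(const std::vector<double>& value, const std::vector<int>& positive)
{
  double larger = 0.0;
  double pairs = 0.0;
  for (std::size_t i = 0; i < value.size(); ++i) {
    if (!positive[i]) continue;
    for (std::size_t j = 0; j < value.size(); ++j) {
      if (positive[j]) continue;
      pairs += 1.0;
      if (value[i] > value[j]) larger += 1.0;
    }
  }
  assert(pairs > 0.0);  // both classes must be present
  return larger / pairs;
}

// One-sided exact p-value: fraction of labellings whose area is
// greater than or equal to the area of the observed labelling.
double exact_p_one_sided(const std::vector<double>& value, std::vector<int> positive)
{
  assert(value.size() == positive.size());
  const double observed = pair_area(value, positive);
  // Start from the lexicographically smallest labelling so that
  // std::next_permutation visits every distinct labelling exactly once.
  std::sort(positive.begin(), positive.end());
  double hits = 0.0;
  double total = 0.0;
  do {
    total += 1.0;
    if (pair_area(value, positive) >= observed) hits += 1.0;
  } while (std::next_permutation(positive.begin(), positive.end()));
  return hits / total;
}
```

The cost grows combinatorially with the number of samples, which is presumably why the comments above switch to a Gaussian approximation once the groups exceed `minimum_size`.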
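For the large-sample approximation and the two-sided combination described above, a minimal sketch under several stated assumptions: there are no ties (so the tie sum in s² vanishes), the continuity term is read as 0.5/(n⁺·n⁻), and the upper tail of the normal distribution is used so that the result matches the stated return value P(a ≥ area), whereas the integral as printed runs from −∞ to z. The function names are hypothetical, and the two-sided helper simply applies the `min(2*p, 2-2*p)` rule quoted in the older comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// One-sided p-value from the Gaussian approximation, assuming no ties.
// n_pos and n_neg are the numbers of positive and negative samples.
double p_one_sided_gauss(double area, double n_pos, double n_neg)
{
  assert(n_pos > 0.0 && n_neg > 0.0);
  const double n = n_pos + n_neg;
  // s^2 = (n + 1) / (12 * n+ * n-) when the tie correction is zero.
  const double s = std::sqrt((n + 1.0) / (12.0 * n_pos * n_neg));
  // Assumption: continuity correction of 0.5 / (n+ * n-).
  const double z = (area - 0.5 - 0.5 / (n_pos * n_neg)) / s;
  // Upper tail of the standard normal: P(Z >= z) = erfc(z / sqrt(2)) / 2.
  return 0.5 * std::erfc(z / std::sqrt(2.0));
}

// Two-sided p-value for a symmetric null distribution.
double p_two_sided(double p_one_sided)
{
  return std::min(2.0 * p_one_sided, 2.0 - 2.0 * p_one_sided);
}
```

An area close to 0.5 gives a one-sided value near 0.5 and hence a two-sided value near 1, while areas far from 0.5 in either direction give small two-sided values.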