Outlier

In statistics, an outlier is an anomalous element that falls outside the characteristic pattern of a given set or structure. Across statistical studies, mathematicians have developed algorithms that reduce the influence of outliers, neutralize them, or even remove them, using the methods of robust statistics. In some cases, however, their presence is useful for understanding the behavior of a structure or system.

An impossible value (anomaly), by contrast, is considered an erroneous reading because it indicates a phenomenon that cannot occur.

Detection

In anomaly detection

Main article: Anomaly detection

Modified Thompson Tau test

The modified Thompson Tau test^[citation needed] is a method used to determine whether an outlier exists in a data set. Its strength is that it takes into account the data set's standard deviation and average and provides a statistically determined rejection zone, giving an objective way to decide whether a data point is an outlier. Note: Although intuitively appealing, this method appears to be unpublished (it is not described in Thompson (1985)^[1]) and should be used with caution.

How it works: First, the data set's average is determined. Next, the absolute deviation between each data point and the average is determined. Third, a rejection region is determined using the formula $\text{Rejection Region} = \frac{t_{α / 2} (n - 1)}{\sqrt{n} \sqrt{n - 2 + t_{α / 2}^{2}}}$, where $t_{α / 2}$ is the critical value from the Student t distribution and n is the sample size. To determine whether a value is an outlier, calculate δ = |(X - mean(X)) / s|, where s is the sample standard deviation. If δ > Rejection Region, the data point is an outlier. If δ ≤ Rejection Region, the data point is not an outlier.

The modified Thompson Tau test finds one outlier at a time: the point with the largest value of δ is removed if it is an outlier. The test is then applied again to the reduced data set with a new average and rejection region. This process continues until no outliers remain in the data set.
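The iterative procedure described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names and the small critical-value table are hypothetical, the table covers only a two-sided significance level of 0.05, and the choice of n - 2 degrees of freedom for the t critical value is an assumption, since the text above does not state the degrees of freedom.

```python
import math
import statistics

# Two-sided critical values of Student's t at alpha = 0.05,
# keyed by degrees of freedom (standard table values, small samples only).
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def rejection_region(n, t):
    """Rejection region from the formula above, for sample size n and critical value t."""
    return t * (n - 1) / (math.sqrt(n) * math.sqrt(n - 2 + t * t))


def thompson_tau_outliers(data, t_table=T_CRIT_05):
    """Remove one outlier at a time until the most extreme point passes the test.

    Returns (outliers, remaining). Assumes n - 2 degrees of freedom
    for the t critical value (an assumption, not stated in the text).
    """
    remaining = list(data)
    outliers = []
    while len(remaining) >= 3:
        n = len(remaining)
        df = n - 2
        if df not in t_table:
            raise ValueError(f"no critical value tabulated for {df} degrees of freedom")
        mean = statistics.fmean(remaining)
        s = statistics.stdev(remaining)  # sample standard deviation
        if s == 0:
            break  # all points identical, nothing to reject
        # Only the point with the largest absolute deviation is tested each round.
        candidate = max(remaining, key=lambda x: abs(x - mean))
        delta = abs(candidate - mean) / s
        if delta <= rejection_region(n, t_table[df]):
            break
        remaining.remove(candidate)
        outliers.append(candidate)
    return outliers, remaining


readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 25.0]
outliers, remaining = thompson_tau_outliers(readings)
print(outliers, remaining)
```

For larger samples or other significance levels, the critical value would have to come from a fuller t table or a statistics library rather than the short table used here.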

Some work has also examined outliers for nominal (or categorical) data. In the context of a set of examples (or instances) in a data set, instance hardness measures the probability that an instance will be misclassified ( $1 - p (y | x)$ where $y$ is the assigned class label and $x$ represents the input attribute value for an instance in the training set $t$ ).^[2] Ideally, instance hardness would be calculated by summing over the set of all possible hypotheses $H$ :

$\begin{aligned} IH (⟨ x, y ⟩) &= \sum_{H} (1 - p (y, x, h)) \, p (h | t) \\ &= \sum_{H} \left( p (h | t) - p (y, x, h) \, p (h | t) \right) \\ &= 1 - \sum_{H} p (y, x, h) \, p (h | t) . \end{aligned}$

In practice, this formulation is infeasible, as $H$ is potentially infinite and $p (h | t)$ is unknown for many algorithms. Thus, instance hardness can be approximated using a diverse subset $L \subset H$ :

$IH_{L} (⟨ x, y ⟩) = 1 - \frac{1}{| L |} \sum_{j = 1}^{| L |} p (y | x, g_{j} (t, α))$

where $g_{j} (t, α)$ is the hypothesis induced by learning algorithm $g_{j}$ trained on training set $t$ with hyperparameters $α$ . Instance hardness provides a continuous value for determining if an instance is an outlier instance.
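The approximation $IH_{L}$ can be illustrated with a short Python sketch. Everything in it is hypothetical: each hypothesis is modeled as a plain function that returns class probabilities for an input, standing in for a model $g_{j} (t, α)$ that has already been trained, and the toy threshold classifiers exist only to make the example runnable.

```python
def instance_hardness(x, y, hypotheses):
    """Approximate IH_L for instance (x, y).

    hypotheses: non-empty list of callables; each maps x to a dict of
    class label -> probability, playing the role of p(y | x, g_j(t, alpha)).
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    # Average the probability each hypothesis assigns to the given label y.
    total = sum(h(x).get(y, 0.0) for h in hypotheses)
    return 1.0 - total / len(hypotheses)


def make_threshold_hypothesis(threshold, confidence):
    """Toy stand-in for a trained model: predicts 'pos' above the threshold."""
    def predict(x):
        if x > threshold:
            return {"pos": confidence, "neg": 1.0 - confidence}
        return {"pos": 1.0 - confidence, "neg": confidence}
    return predict


# A small, deliberately varied set L of hypotheses.
L = [
    make_threshold_hypothesis(0.0, 0.9),
    make_threshold_hypothesis(1.0, 0.8),
    make_threshold_hypothesis(-1.0, 0.7),
]

easy = instance_hardness(2.0, "pos", L)  # label agrees with every hypothesis
hard = instance_hardness(2.0, "neg", L)  # label contradicts every hypothesis
print(easy, hard)
```

An instance whose label most hypotheses reject scores close to 1 and is a candidate outlier, while one that most hypotheses classify correctly scores close to 0.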

See also

References

[1] Thompson, R. (1985). "A Note on Restricted Maximum Likelihood Estimation with an Alternative Outlier Model". Journal of the Royal Statistical Society, Series B (Methodological), 47(1): 53-55.
[2] Smith, M.R.; Martinez, T.; Giraud-Carrier, C. (2014). "An Instance Level Analysis of Data Complexity". Machine Learning, 95(2): 225-256.

ISO 16269-4, Statistical interpretation of data — Part 4: Detection and treatment of outliers
Strutz, Tilo (2010). Data Fitting and Uncertainty - A practical introduction to weighted least squares and beyond. Vieweg+Teubner. ISBN 978-3-8348-1022-9.

External links

Renze, John, "Outlier" at MathWorld.
Balakrishnan, N.; Childs, A. (2001), "Outlier", Encyclopedia of Mathematics, EMS Press
Grubbs test described by NIST manual
How to detect univariate outliers, how to detect multivariate outliers, and how to deal with outliers
