Monday 19 October 2015

Reverse Nearest Neighbors In Unsupervised Distance-Based Outlier Detection



ABSTRACT:
Outlier detection in high-dimensional data presents various challenges resulting from the “curse of dimensionality.” A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points’ reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique (ABOD) designed for high-dimensional data, the density-based local outlier factor (LOF) and influenced outlierness (INFLO) methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
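As a rough, self-contained illustration of the distance-concentration phenomenon the abstract refers to (a sketch of the general effect, not code from the paper), the following Python snippet measures how the relative contrast between a query point's nearest and farthest neighbors shrinks as dimensionality grows:

import numpy as np

# Illustrative sketch (not from the paper): relative contrast
# (d_max - d_min) / d_min between a query point and a sample of
# uniform random points, for increasing dimensionality.
rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    data = rng.random((1000, dim))    # 1000 uniform points in [0, 1]^dim
    query = rng.random(dim)
    dists = np.linalg.norm(data - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")

For uniform data the printed contrast drops sharply with dimensionality, which is the concentration effect; the paper's point is that this alone does not make distance-based outlier scores useless.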
AIM
The aim of this paper is to show that high dimensionality can have a different impact than commonly assumed, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context.
SCOPE
The scope of this project is to evaluate the classic k-NN method, the angle-based technique (ABOD) designed for high-dimensional data, the density-based local outlier factor (LOF) and influenced outlierness (INFLO) methods, and antihub-based methods on various synthetic and real-world data sets, offering novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
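Of the evaluated baselines, LOF has a readily available implementation in scikit-learn; the minimal usage sketch below uses synthetic data and parameter choices of our own, not the paper's experimental setup:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 10)),   # inliers
               rng.normal(5, 1, (5, 10))])    # a few shifted outliers

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_        # larger => more outlying
print("flagged outliers:", np.where(labels == -1)[0])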

EXISTING SYSTEM
Prior work distinguishes three problems brought about by the “curse of dimensionality” in the general context of search, indexing, and data mining applications: poor discrimination of distances caused by concentration, the presence of irrelevant attributes, and the presence of redundant attributes, all of which hinder the usability of traditional distance and similarity measures. The authors conclude that, despite such limitations, common distance/similarity measures still form a good foundation for secondary measures, such as shared-neighbor distances, which are less sensitive to the negative effects of the curse. Subsequent work extends the discussion of problems relevant to unsupervised outlier-detection methods in high-dimensional data by identifying seven issues in addition to distance concentration: noisy attributes, definition of reference sets, bias (comparability) of scores, interpretation and contrast of scores, exponential search space, data-snooping bias, and hubness. This article focuses on the aspect of hubness, and assumes that all attributes carry useful information, i.e., are not overly noisy.
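To make the secondary-measure idea concrete, the following sketch computes a shared-neighbor similarity as the overlap between two points' k-nearest-neighbor lists. This is one common formulation, assumed here for illustration rather than taken from the cited work:

import numpy as np

def knn_lists(data, k):
    """k-nearest-neighbor index lists for every point (self excluded)."""
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # exclude the point itself
    return np.argsort(dists, axis=1)[:, :k]

def shared_neighbor_similarity(nn, i, j):
    """Overlap of the k-NN lists of points i and j, normalized to [0, 1]."""
    k = nn.shape[1]
    return len(set(nn[i]) & set(nn[j])) / k

Because the similarity depends on neighborhood membership rather than raw distance values, it degrades more gracefully when distances concentrate.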
DISADVANTAGES:

  1. The “curse of dimensionality” hinders the usability of traditional distance and similarity measures.
  2. Distance concentration: the tendency of distances in high-dimensional data to become indiscernible, causing distance-based methods to label all points as almost equally good outliers.

PROPOSED SYSTEM
In this paper, reverse nearest-neighbor counts are revisited. They have been proposed in the past as a method for expressing the outlierness of data points, but no insight apart from basic intuition was offered as to why these counts should represent meaningful outlier scores. Recent observations that reverse-neighbor counts are affected by increased dimensionality of data warrant their reexamination for the outlier-detection task. In this light, we revisit the previously proposed ODIN method and explore two ways of using k-occurrence information for expressing the outlierness of points. Our main goal is to provide insight into the behavior of k-occurrence counts in different realistic scenarios (high and low dimensionality, multimodality of data) that would assist researchers and practitioners in using reverse neighbor information in a less ad hoc fashion. We describe experiments with synthetic and real data sets, the results of which illustrate the impact of factors such as dimensionality, cluster density, and antihubs on outlier detection, demonstrating the benefits of the methods and the conditions under which those benefits are expected.
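The core ODIN idea, scoring a point by how rarely it appears in other points' k-NN lists (its k-occurrence count N_k(x)), can be sketched as follows. This is a minimal brute-force illustration; the paper's exact scoring and thresholding may differ:

import numpy as np

def odin_scores(data, k):
    """ODIN-style outlierness sketch: a low k-occurrence count N_k(x),
    i.e. a low in-degree in the k-NN graph, marks x as an antihub and
    hence a likely outlier."""
    n = len(data)
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    nn = np.argsort(dists, axis=1)[:, :k]         # k-NN list of each point
    n_k = np.bincount(nn.ravel(), minlength=n)    # reverse-neighbor counts N_k
    return n_k  # smaller N_k(x) => stronger outlier (antihub)

Under ODIN, points whose count falls below a chosen threshold would be flagged as outliers; returning the raw counts instead lets the caller rank points, with antihubs (the smallest counts) ranked as the strongest outliers.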
 
ADVANTAGES
  1. Focuses on the effects of high dimensionality on unsupervised outlier-detection methods and the hubness phenomenon, extending previous examinations of (anti)hubness to large values of k, and exploring the relationship between hubness and data sparsity.
  2. The approach could be extended to supervised and semi-supervised methods as well.

SYSTEM CONFIGURATION

HARDWARE REQUIREMENTS:

·      Processor       :  Pentium III
·      Speed           :  1.1 GHz
·      RAM             :  256 MB (minimum)
·      Hard Disk       :  20 GB
·      Floppy Drive    :  1.44 MB
·      Keyboard        :  Standard Windows keyboard
·      Mouse           :  Two- or three-button mouse
·      Monitor         :  SVGA

SOFTWARE REQUIREMENTS:

·      Operating System  :  Windows 7
·      Front End         :  JSP and Servlets
·      Database          :  MySQL
REFERENCE:
M. Radovanović, A. Nanopoulos, and M. Ivanović, “Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, 2015.




