Sunday 18 October 2015

Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data

Abstract
The past decade has seen a dramatic increase in the amount of data captured and made available to scientists for research. This increase amplifies the difficulty scientists face in finding the data most relevant to their information needs. In prior work, we hypothesized that Information Retrieval-style ranked search can be applied to data sets to help a scientist discover the most relevant data amongst the thousands of data sets in many formats, much like text-based ranked search helps users make sense of the vast number of Internet documents. To test this hypothesis, we explored the use of ranked search for scientific data using an existing multi-terabyte observational archive as our test-bed. In this paper, we investigate whether the concept of varying relevance, and therefore ranked search, applies to numeric data; that is, are data sets enough like documents for Information Retrieval techniques and evaluation measures to apply? We present a user study that demonstrates that data set similarity resonates with users as a basis for relevance and, therefore, for ranked search. We evaluate a prototype implementation of ranked search over data sets with a second user study and demonstrate that ranked search improves a scientist’s ability to find needed data.
Aim
The main aim is to improve a scientist’s ability to find needed data using ranked search.
Scope
The scope is to explore the use of ranked search for scientific data using an existing multi-terabyte observational archive.
Existing system
At first, the comparison of data sets to documents may seem strange. However, if a feature-space model can be used to calculate an overall similarity score between a search consisting of several words and a document containing hundreds or thousands of words, then adapting that model to compare numeric search conditions against numeric data with hundreds or thousands of attribute values seems viable.
To adapt IR techniques to scientific data set search, we need three things: a way to express a scientific information need as a set of search conditions; a method for extracting features from data sets; and a similarity measure to compare search conditions to the extracted features.
Further, we must validate that any proposed set of features and similarity measure resonates with potential searchers; that is, we must show that the search system has utility and that the similarity measure embodies a notion of relevance that mimics the judgment of potential users. As noted, the notion of relevance differentiates IR from database retrieval (although databases may be used to implement IR). The concept of different levels of relevance for different items, and the approximation of those levels via a similarity measure, supports ranked retrieval based on relative similarity scores. We could thus present a research scientist with a ranked list of all available data sets, ordered by decreasing estimated relevance to a posed search. If these concepts can be confirmed, then applying IR measures, such as mean average precision, to the resulting approaches should also be valid.

Traditional text IR treats a document as a bag of words, with each distinct word a feature; further, a frequently used word is given less weight than a rarely used word, leading to the tf-idf similarity measure. A text IR query also consists of a bag of words, so each search term can be matched directly to a document feature. Our scientists, however, do not search for specific values found in a data set (“air temperature = 14.93615 °C”), but rather express their information needs in terms of an observational variable with values in some range (“water temperature between 5 and 10 °C”). Thus, we rejected the bag-of-words model and the tf-idf measure in favor of using variable names and value ranges as our features, and developed a similarity measure that allows us to compare them.
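To make the feature model concrete, the following C# sketch is our own simplification, not the exact measure from the paper: each data set is summarized by the observed value range of each variable, a query is a set of desired ranges, and the similarity score is the mean fraction of each requested range that the data set covers. The type and method names are hypothetical.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only (not the paper's exact measure): a data set is summarized
// by the observed value range of each variable, a query is a set of desired ranges,
// and the similarity score is the mean fraction of each requested range covered.
public sealed class ValueRange
{
    public double Min { get; }
    public double Max { get; }
    public ValueRange(double min, double max) { Min = min; Max = max; }
}

public static class RangeSimilarity
{
    // Fraction of the query range covered by the data set's observed range.
    static double Overlap(ValueRange query, ValueRange data)
    {
        if (query.Max <= query.Min)                         // point query: covered or not
            return (data.Min <= query.Min && query.Min <= data.Max) ? 1.0 : 0.0;
        double lo = Math.Max(query.Min, data.Min);
        double hi = Math.Min(query.Max, data.Max);
        return hi > lo ? (hi - lo) / (query.Max - query.Min) : 0.0;
    }

    // Average per-variable overlap; variables absent from the data set contribute 0.
    public static double Score(IDictionary<string, ValueRange> query,
                               IDictionary<string, ValueRange> dataSetFeatures)
    {
        return query.Average(q =>
            dataSetFeatures.TryGetValue(q.Key, out var range) ? Overlap(q.Value, range) : 0.0);
    }
}

Ranking is then a matter of computing this score for every data set in the catalog and sorting by decreasing value; the features and measure used in the actual system are richer, but the structure is the same.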
Disadvantages
·      Metadata collection, curation, and maintenance are acknowledged, ongoing problems, and reliance on manual collection of metadata is considered a prescription for failure.
·      Both manual navigation and metadata-query approaches often result in time-consuming, repeated actions.
Proposed System
We demonstrate via our first user study that the concepts of “data set relevance” and “data set similarity” are meaningful, implying that Information-Retrieval-style ranked search over scientific data is reasonable.
We show that we can directly map these principles into a ranked retrieval system for data sets, and we implemented these principles in a prototype.
We present a second user study that demonstrates the prototype improves scientists’ ability to find relevant data, thus removing a significant impediment to research productivity.
We demonstrate that IR measures, such as rank-biased precision (RBP) and discounted cumulative gain (DCG), are applicable to data set search, and that they show our candidate similarity measure performing well compared to several alternatives.
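For readers unfamiliar with these measures, the following minimal C# sketch (ours, not code from the paper) shows how DCG and RBP are computed from a ranked list of user-assigned relevance gains; the method names and the default persistence parameter p = 0.8 are illustrative assumptions.

using System;
using System.Linq;

// Illustrative implementations of two rank-based effectiveness measures.
public static class RankMetrics
{
    // Discounted Cumulative Gain: the gain at rank i (1-based) is divided by log2(i + 1),
    // so relevant items found near the top of the list count for more.
    public static double Dcg(double[] gains) =>
        gains.Select((g, i) => g / Math.Log(i + 2, 2)).Sum();

    // Rank-Biased Precision: gains are weighted by a geometric "persistence" model,
    // where p is the probability that the user continues to the next result.
    // RBP is normally defined over gains scaled to [0, 1].
    public static double Rbp(double[] gains, double p = 0.8) =>
        (1 - p) * gains.Select((g, i) => g * Math.Pow(p, i)).Sum();
}

For example, graded judgments {3, 2, 0, 1, 0} for the top five results of one search give DCG = 3/1 + 2/log2(3) + 0 + 1/log2(5) + 0 ≈ 4.69.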
Advantages
·      The Internet has seen similarly explosive growth, and web search techniques now allow users to easily find relevant documents despite that growth.
·      The catalog can incorporate data sets from other sources, allowing users to search for data across multiple organizations’ archives.
·      These techniques have broad applicability, and address a need by scientists that will only become greater as data volumes and heterogeneity continue to grow.
System Architecture



System Specification

Hardware Requirements
  • Speed            -  1.1 GHz
  • Processor        -  Pentium IV
  • RAM              -  512 MB (minimum)
  • Hard Disk        -  40 GB
  • Keyboard         -  Standard Windows keyboard
  • Mouse            -  Two- or three-button mouse
  • Monitor          -  LCD/LED

Software Requirements
  • Operating System -  Windows 7
  • Front End        -  ASP.Net and C#
  • Database         -  MSSQL
  • Tool             -  Microsoft Visual Studio

Reference
V. M. Megler and D. Maier, "Are Data Sets Like Documents?: Evaluating Similarity-Based Ranked Search over Scientific Data," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, Apr. 2014.

