With the constant increase in the volume of information available on the Web, it is increasingly difficult to find specific information related to a given domain. Users face the problem of information overload: a query about a specialized subject (local information; e-commerce: hotels, airlines, car rental; science: biology, mathematics, medicine, etc.) on a web search engine returns many web pages or results that in most cases fall outside the domain of interest. This is one reason why vertical search tools have become a necessity for users who seek domain-specific information from the different databases available on the Web through input sources called Web Query Interfaces (ICWs). This paper describes an approach for the automatic integration of ICWs, a crucial task in constructing vertical search tools. The proposed methodology is validated through a vertical search prototype called VSearch that allows users to transparently query multiple web databases in a specific domain through a unified ICW. The proposed approach for automatic ICW integration is based on: i) a hierarchical model called AEV for modeling the visual content of an ICW; ii) semantic clustering for identifying relationships between fields in ICWs; and iii) a field homogenization and unification process over AEV schemas for the construction of a unified ICW. The VSearch prototype was implemented and evaluated using a case study. The experimental results demonstrate high precision in the integration phase and an effective methodology for creating a functional vertical search tool for a given domain.
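As a minimal sketch of the field-clustering step, the following Python fragment groups field labels drawn from several query interfaces by lexical similarity, a simple stand-in for the semantic clustering described above; the interfaces, labels, and 0.6 threshold are illustrative assumptions, not the actual AEV implementation.

from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    # Treat two field labels as related if they are lexically close
    # (a crude proxy for the semantic matching used by the paper).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster_fields(interfaces):
    # Group related fields drawn from several query interfaces.
    clusters = []
    for labels in interfaces:
        for label in labels:
            for cluster in clusters:
                if any(similar(label, member) for member in cluster):
                    cluster.append(label)
                    break
            else:
                clusters.append([label])
    return clusters

# Two hypothetical book-domain interfaces with differently named fields.
interfaces = [
    ["title", "author", "price"],
    ["book title", "author name", "price range"],
]
print(cluster_fields(interfaces))
# [['title', 'book title'], ['author', 'author name'], ['price', 'price range']]

Each resulting cluster corresponds to one field of the unified interface, so the unification step only has to pick a representative label and widget per cluster.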
A huge amount of data is present in the hidden web; to access this data from deep web sites, forms need to be filled in and submitted so that the data can be retrieved from the underlying web databases. This task is performed by deep web crawlers. General deep web crawlers do a breadth-oriented search on the deep web to retrieve general web data, whereas vertical deep web crawlers do a depth-oriented search, focusing on a particular domain to extract deep web sites related to a specific topic.
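A minimal sketch of the form-filling step such a crawler performs is given below; the page URL, field names, and query values are hypothetical, and a real crawler would discover forms while crawling and draw values from a prepared value set.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def submit_search_form(page_url, query_values):
    # Locate the first HTML form on a page, fill it in, and submit it.
    page = requests.get(page_url, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    form = soup.find("form")
    if form is None:
        return None
    action = urljoin(page_url, form.get("action", ""))
    method = form.get("method", "get").lower()
    # Start from the form's default values, then overlay our query values.
    data = {i["name"]: i.get("value", "")
            for i in form.find_all("input") if i.get("name")}
    data.update(query_values)
    if method == "post":
        return requests.post(action, data=data, timeout=10)
    return requests.get(action, params=data, timeout=10)

# Hypothetical usage against a book-search form:
# response = submit_search_form("http://example.com/search",
#                               {"title": "deep web", "author": ""})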
In schema matching, instead of filling in the form of a deep web site and then extracting the data to determine whether it is relevant to the search, a schema of the required data is prepared and only those sites that match the schema are retrieved. This technique greatly reduces the cost of extracting web pages and then processing them. Schema matching can be performed by a web-source virtual integration system.
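The following sketch illustrates the selection idea: only sources whose query-interface fields sufficiently cover a target schema are retrieved. The target schema, source field lists, and 0.6 coverage threshold are assumptions made for illustration.

def matches_schema(target_schema, source_fields, min_coverage=0.6):
    # A source is selected if enough target attributes appear among its fields.
    source = {f.lower() for f in source_fields}
    covered = sum(1 for attr in target_schema if attr.lower() in source)
    return covered / len(target_schema) >= min_coverage

target = ["title", "author", "isbn", "price"]
sources = {
    "site_a": ["title", "author", "isbn", "publisher"],
    "site_b": ["departure", "arrival", "date"],
}
relevant = [name for name, fields in sources.items()
            if matches_schema(target, fields)]
print(relevant)  # ['site_a'] -- only the book-domain source matches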
Various techniques can be used to extract relevant information from the deep web. In the vision-based approach, the web page is assumed to be divided into sections, each containing a particular type of information. Rather than extracting the complete web page and then parsing it, only the section that contains the relevant information is extracted.
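The published vision-based approaches segment the rendered page by visual blocks; the sketch below only approximates that idea at the DOM level by pulling a single region instead of parsing the whole page. The HTML and the "results" section id are illustrative assumptions.

from bs4 import BeautifulSoup

html = """
<html><body>
  <div id="nav">Home | About | Contact</div>
  <div id="results">
    <p>Item 1: Deep Web Survey</p>
    <p>Item 2: Hidden Web Crawler</p>
  </div>
  <div id="footer">Copyright 2013</div>
</body></html>
"""

# Extract only the section assumed to hold the relevant records,
# ignoring the navigation and footer blocks entirely.
section = BeautifulSoup(html, "html.parser").find(id="results")
for record in section.find_all("p"):
    print(record.get_text(strip=True))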
The deep Web is qualitatively different from the surface Web. The term "Deep Web" refers to web pages that are not accessible to search engines. Existing automated web crawlers cannot index these pages, so they remain hidden from web search engines.
The data held by digital libraries, government organizations, and companies is available through search forms. A deep web site is a web server that provides information maintained in one or more back-end web databases, each of which is searchable through one or more HTML forms serving as its query interfaces.
The process of data source selection can be automated by periodically analyzing different deep web sources, so that users can be given recommendations about a small number of data sources that will be most appropriate for their query. A data mining method has been proposed to extract a high-level summary of the differences in the data provided by different deep web data sources.
Patterns of values are considered with respect to the same entity, and a new data mining problem, referred to as differential rule mining, is formulated. An algorithm for mining such rules is developed; it includes a pruning method to summarize the identified differential rules. For efficiency, a hash table is used to accelerate the pruning process.
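A highly simplified sketch of the differential-rule idea follows: for records describing the same entity in two sources, the difference pattern of an attribute is recorded, and a hash table (a Python dict here) collapses duplicate patterns during pruning. The data, rule form, and support threshold are assumptions for illustration, not the paper's actual algorithm.

from collections import defaultdict

# Hypothetical matched records for the same entities in two sources.
source_a = {"flight101": {"price": 300}, "flight102": {"price": 410}}
source_b = {"flight101": {"price": 330}, "flight102": {"price": 451}}

# Hash table keyed by the observed difference pattern; values count support.
rule_support = defaultdict(int)
for entity, record in source_a.items():
    other = source_b.get(entity)
    if other:
        ratio = round(other["price"] / record["price"], 1)
        rule_support[("price", "ratio", ratio)] += 1

# Prune: keep only rules supported by at least two matched entities.
rules = {r: s for r, s in rule_support.items() if s >= 2}
print(rules)  # {('price', 'ratio', 1.1): 2} -- source_b is ~10% pricier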
The paper discusses how traditional web crawlers can be extended to surface the Deep Web. Hidden web content can be accessed by deep web crawlers that fill in and submit forms to query online databases for information extraction; in this technique, the extracted content is analyzed to check whether it is relevant. Schema matching has proved to be an efficient technique for extracting relevant content. Data from the Deep Web can also be extracted by applying other techniques, such as data mining and building ontologies to assist domain-specific data retrieval.
The visual approach is an efficient technique for extracting only the required data. The paper also presents a comparative analysis of the two techniques widely used for surfacing the hidden web: form processing and querying of the deep web by hidden web crawlers, and schema matching for virtual integration systems. Depending on the application area, a surfacing technique can be selected and combined with other techniques to overcome the drawbacks of the original method.