Towards Automatic Web Data Scraper and Aligner (WDSA)

The Web is an immense and rapidly growing source of information. Web browsers, together with search engines, have become popular tools for retrieving and accessing information on the Web. However, the enormous growth of the Web has made data extraction harder than ever. This paper presents the Automatic Web Data Scraper and Aligner (WDSA). Automatic WDSA extracts the data of interest from the dynamically generated web page returned by a search engine in response to a user query. Automatic web data scraping is necessary because, although a human can easily identify the query-relevant content in a query result page, doing so is difficult for computer applications. The extracted web data can then be transformed into a format suitable for applications such as comparison shopping, data integration and value-added services. WDSA does this by aligning the extracted data in a table, both pairwise and holistically. The novelty of Automatic WDSA is that both the Data Scraper and the Aligner use a new approach that combines tag similarity and value similarity in the extraction and alignment process. In addition, the Data Scraper handles data that appears in non-contiguous regions because of auxiliary information such as advertisement banners, navigational links and pop-ups. Experimental results show that Automatic WDSA achieves high precision and recall. Automatic WDSA is also compared with widely used existing tools such as Helium Scraper, OutWit Hub and Screen Scraper. The comparison shows that the existing tools require manual labeling or user-specified extraction patterns for the desired data, whereas Automatic WDSA requires no user involvement and is therefore fully automatic.


INTRODUCTION
The quantity of information available on the Web in HTML format is growing at a very fast rate. This makes the Web the largest "knowledge base" ever developed and made available to the public. However, HTML sites can be regarded as modern legacy systems, because such a huge volume of data cannot be accessed and manipulated easily. The reason is that web data sources are intended to be browsed by humans, not computed over by applications. As a consequence, extracting data from web pages and making it available to computer applications is a relevant and complex task.
Software modules called wrappers are mainly used to extract data from HTML pages. Early approaches to wrapping web sites were based on manual techniques [12], [13], [16], [17]. A key problem with manually coded wrappers is that writing them is usually a difficult and labor-intensive task, and by their nature wrappers tend to be brittle and difficult to maintain. To overcome these problems, several unsupervised learning methods [3], [5], [7], [10], [14] have been proposed to extract data from web pages automatically. Such methods rely mainly on the tag structure of HTML pages, which may lead to inaccurate extraction.
This paper focuses on the problem of automatically extracting the data records relevant to a user query that are encoded in the query result pages generated dynamically by web databases.
When a user query is submitted through the query interface of a web database, the deep web generates web pages dynamically, unlike the surface web, where each page is accessed through a unique URL. On receiving a user's query, a web database returns the relevant data encoded in HTML pages in either structured or semi-structured format. Many web applications, such as comparison shopping, meta-querying and data integration or aggregation, require data from multiple web databases. These applications need to exploit the data embedded in HTML pages, which makes automatic data extraction crucial. Moreover, data can be aggregated and compared only after it has been extracted and organized in a structured form such as a table. Hence, accurate data extraction, or scraping, is essential for these applications to perform correctly.
This paper contributes a new approach to the web data extraction problem that fully automates the wrapper generation process and makes it independent of any prior knowledge of the target pages and their contents.
In general, a query result page contains not only the actual data but also other decorative information, such as advertisements, navigational panels, comments and information about the hosting site. The aim of web data extraction is to remove irrelevant information from the query result page, extract the Query Relevant Records (QRRs) from the page, and align the extracted QRRs in a table such that data values belonging to the same attribute are placed in the same column.
Automatic WDSA uses two novel steps, QRR scraping (extraction) and QRR alignment, to extract the records from a query result page and then align their data values in a table.
1. QRR scraping or extraction: This step identifies the QRRs in the query result page by identifying the data regions and then segmenting them into records.
2. QRR alignment: This step aligns the data values of the QRRs in a table, pairwise and holistically, so that the data values of the same attribute share the same column.
The rest of the paper is organized as follows. Section 2 reviews recent work on data extraction. Section 3 describes the system architecture of Automatic WDSA along with the main steps of our method: data scraping and data alignment. Section 4 shows screenshots of the implementation results. Section 5 describes the performance evaluation of our method. Finally, Section 6 concludes the paper.

RELATED WORK
Improving the wrapper generation techniques used for web data extraction has been an active research field for many years. Many approaches have been proposed, and [14] provides a comparison of the surveyed techniques.
Conventional wrapper generation methods require some human involvement to build and configure the wrapper. In applications where the sources are not known in advance, this approach is not feasible, so automatic wrapper generation techniques have been introduced. Several works have addressed the problem of performing web data extraction without human input.
IEPAD [3] uses techniques such as Patricia trees and string alignment to search the HTML tag string of a page for repetitive patterns. However, there is a high probability that IEPAD generates incorrect patterns along with the correct ones, so human involvement is required to post-process the output.
RoadRunner [15] takes multiple pages conforming to the same template as input and induces from them a union-free regular expression (UFRE), which can then be used to extract data from pages conforming to that template. The basic idea behind RoadRunner is an iterative process in which the system takes the first page as the initial UFRE and then, for each subsequent page, tests whether the page can be generated by the current template. If not, the template is modified to represent the new page. The drawbacks of this method are that it requires multiple pages conforming to the same template as input and that it cannot deal with disjunctions in the input schema.
April 23, 2014
DEPTA [2] uses information about the visual layout of the page together with tree edit-distance techniques to identify lists of records in the page, and then extracts the structured data records that make up the page. DEPTA needs only a single page containing a list of structured data records as input and relies on the observation that, in the DOM tree of a page, each record in a list is generated by a set of consecutive sibling subtrees.
On the other hand, DEPTA makes the following additional assumptions: 1) all records are formed by exactly the same number of subtrees, and 2) the visual gap between two data records in a list is larger than the gap between any two data values within the same record. These assumptions do not hold for all web sources.
DeLa [7] models the structured data in template-generated web pages as string instances, encoded in HTML tags, of the implied nested type of the underlying web database. A regular expression is used to model the HTML-encoded version of the nested type. If a page contains more than one instance of the data, the HTML tag structure enclosing the data appears repeatedly. The page is therefore first converted into a token sequence composed of HTML tags and a special token "text", which represents any text string enclosed by a pair of HTML tags. Continuous repeated substrings are then extracted from the token sequence, and a regular expression wrapper is induced from the repeated substrings according to the hierarchical relationships among them. The main problem with this method is that it often produces multiple patterns (rules), and it is hard to decide which one is correct.
Since deriving accurate wrappers based exclusively on HTML tags is very difficult [2], further techniques have been introduced that use additional information from the query result page.
ViPER [10] uses both visual data value similarity features and the HTML tag structure to first identify and rank potential repetitive patterns. Then, matching subsequences are aligned with global matching information. But ViPER suffers from poor results for nested structured data.
ViNTs [5] learns a wrapper from a set of training pages from a website by using both visual and tag features. It first utilizes the visual data value similarity without considering the tag structure to identify data value similarity regularities, denoted as data value similarity lines, and then combines them with the HTML tag structure regularities to generate wrappers. Both visual and non visual features are used to weight the relevance of different extraction rules. Several result pages, each of which must contain at least four QRRs, and one no-result page are required to build a wrapper.
Among the web data extraction methods discussed above, some extract only flat records, while others cannot handle non-contiguous data regions. Some techniques also require training pages along with a pre-learned wrapper for a website. DeLa extracts records using wrapper induction, whereas others rely on operations on the tree structure of the page, such as tree alignment, tree merging and tree matching. In DEPTA, extraction is performed mainly by partial tree alignment. ViPER uses an extraction method based on visual perception. Most of the wrappers are based exclusively on HTML tag structure.
In contrast, Automatic WDSA uses the similarity of both tags and data values, and it also handles non-contiguous data regions. Furthermore, it requires neither training pages nor a pre-learned wrapper for a website.

SYSTEM ARCHITECTURE
Automatic WDSA is designed to automatically extract the Query Relevant Records (QRRs) from a page and align the data values of the QRRs in a table. The Automatic WDSA architecture is shown in Figure 1.
The system takes as input a query result page containing at least two QRRs together with auxiliary information. The query result page passes through two phases, data scraping and data alignment, which output the data records (QRRs) and the aligned data values, respectively. The data scraping phase consists of four steps: DOM tree builder, data region identification, record segmentation and query result section identification. The data alignment phase consists of two steps: pairwise alignment and holistic alignment.
When a query result page is given as input, the DOM Tree Builder module constructs the tag tree for the page, rooted at the <HTML> tag. Next, the Data Region Identification module identifies all possible data regions, which usually contain dynamically generated data, working top-down from the root node. The Record Segmentation module then segments the identified data regions into data records according to the tag patterns in the data regions. Given the segmented data records, the Query Result Section Identification module selects the one data region that contains the QRRs. Finally, the QRRs are passed to the Pairwise Alignment module, which aligns their data values; these are then aligned holistically by the Holistic Alignment module.
1) Property 1: Each QRR is represented in the DOM tree by a set of consecutive sibling subtrees.
2) Property 2: The occurrences of each attribute in different QRRs share the same path from the root of the DOM tree.
For a given input query result HTML page, the Tag Tree Construction module builds the tag (DOM) tree rooted at the <html> tag. Every node represents a tag in the HTML page, and its children are the tags enclosed within it. Each internal node n of the tag tree has a tag string tsn, which includes the tag of n and the tags of all of n's descendants.
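The tag tree and tag strings described above can be sketched as follows. This is a minimal illustration built on Python's standard `html.parser`, not the authors' implementation: the names `Node`, `TagTreeBuilder` and `build_tag_tree` are hypothetical, and the pre-order ordering of a tag string is an assumption, since the text only says that the tag string includes the tags of a node and of its descendants.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One tag in the tag tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def tag_string(self):
        """Tags of this node and all its descendants, in pre-order (assumed order)."""
        tags = [self.tag]
        for child in self.children:
            tags.extend(child.tag_string())
        return tags


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from the start and end tags reported by the parser."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        node = Node(tag, parent)
        if parent is not None:
            parent.children.append(node)
        elif self.root is None:
            self.root = node
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ("<html><body><table>"
        "<tr><td>a</td></tr><tr><td>b</td></tr>"
        "</table></body></html>")
root = build_tag_tree(page)
print(root.tag_string())
```

Text content is ignored here because only the tags contribute to a tag string; a fuller sketch would also keep the data values for the alignment phase.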

Data Region Identification:-
According to Property 1, each QRR is composed of one or more consecutive sibling subtrees that share the same parent node and are direct descendants of the root node of the data region. Here we propose a new data region identification algorithm that handles QRRs located in non-contiguous regions. Given a query result page containing at least two QRRs, the algorithm finds data regions in a top-down fashion.
Algorithm:
/* The DataRegionIdentification procedure finds data regions by applying the following steps recursively to the children of every node in T, only if the node does not have similar siblings. */
Proc DataRegionIdentification (Tag Tree T)
1. Calculate the similarity simij of each pair of nodes ni and nj, i, j = 1…m and i ≠ j, using the similarity computation method.
2. Group the nodes according to their similarity and assign each group a sibling identifier, sib_id.
Finally, before an iteration completes, the grouped nodes that share the same parent are clustered under the same data region identifier. As shown in Figure 2, the first and third TR nodes get sib_id 1, and the second and fourth TR nodes get sib_id 2. These nodes are clustered under Region 1. Similarly, Region 2 consists of two TD nodes with sib_id 3.
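The grouping in step 2 can be sketched for the children of a single parent as below. This is only an illustration under stated assumptions: `assign_sibling_ids` and `is_data_region` are hypothetical names, the similarity test is passed in as a function, and each child is greedily compared against the first member of every existing group, which is one possible reading of "group the nodes according to their similarity".

```python
def assign_sibling_ids(tag_strings, similar, first_id=1):
    """Assign a sib_id to each sibling; similar siblings share an id.

    tag_strings: tag strings of the children of one parent, in document order.
    similar:     function(ts_a, ts_b) -> bool (assumed to be supplied by the caller).
    first_id:    first sib_id to hand out, so that ids can stay unique across regions.
    """
    sib_ids = []
    representatives = []  # (sib_id, tag string) of the first member of each group
    for ts in tag_strings:
        for sib_id, rep in representatives:
            if similar(ts, rep):
                sib_ids.append(sib_id)
                break
        else:
            # No existing group matches, so this child starts a new group.
            new_id = first_id + len(representatives)
            representatives.append((new_id, ts))
            sib_ids.append(new_id)
    return sib_ids


def is_data_region(sib_ids):
    """Children form a data region when at least one sib_id occurs more than once."""
    return len(set(sib_ids)) < len(sib_ids)


# Toy stand-in for the similarity test: exact equality of tag strings.
same = lambda a, b: a == b

tr_a = ("tr", "td")
tr_b = ("tr", "td", "td")
region1 = assign_sibling_ids([tr_a, tr_b, tr_a, tr_b], same)
region2 = assign_sibling_ids([("td",), ("td",)], same, first_id=3)
print(region1, region2)
```

With these toy inputs the ids mirror the Figure 2 example: the four TR nodes alternate between two groups, and the two TD nodes share a third id.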

Fig. 2 Artificial Tag Tree
Similarity computation method [6]:-Two nodes are similar if their similarity is greater than or equal to a threshold, which is set to 0.6. The node similarity is calculated from the tag strings of the nodes. The edit distance between nodes ni and nj, with tag strings tsi and tsj, is computed with a dynamic programming approach that obtains the final solution ed(ni,nj) by incrementally filling an (m + 1) × (n + 1) dynamic programming table. For greater accuracy, the normalized edit distance is used, calculated as Ned(ni,nj) = 1 - ed(ni,nj) / (length(tsi) + length(tsj))
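A minimal sketch of this similarity computation is given below. The recurrence is not spelled out in the text, so the sketch assumes the standard unit-cost edit distance (insertion, deletion and substitution each cost 1) over sequences of tags; the function names are hypothetical, while the normalization and the 0.6 threshold follow the description above.

```python
THRESHOLD = 0.6  # similarity threshold stated in the text


def edit_distance(ts_i, ts_j):
    """Unit-cost edit distance between two tag strings (assumed cost model)."""
    m, n = len(ts_i), len(ts_j)
    # d[a][b] = distance between the first a tags of ts_i and the first b tags of ts_j
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for a in range(m + 1):
        d[a][0] = a
    for b in range(n + 1):
        d[0][b] = b
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            cost = 0 if ts_i[a - 1] == ts_j[b - 1] else 1
            d[a][b] = min(d[a - 1][b] + 1,          # delete a tag
                          d[a][b - 1] + 1,          # insert a tag
                          d[a - 1][b - 1] + cost)   # keep or substitute a tag
    return d[m][n]


def normalized_similarity(ts_i, ts_j):
    """Ned = 1 - ed / (length(ts_i) + length(ts_j))."""
    total = len(ts_i) + len(ts_j)
    if total == 0:
        return 1.0  # two empty tag strings are treated as identical
    return 1.0 - edit_distance(ts_i, ts_j) / total


def are_similar(ts_i, ts_j, threshold=THRESHOLD):
    return normalized_similarity(ts_i, ts_j) >= threshold


row_a = ["tr", "td", "td"]
row_b = ["tr", "td", "a", "td"]
print(normalized_similarity(row_a, row_b), are_similar(row_a, row_b))
```

Here the two rows differ by a single inserted tag, so the edit distance is 1 and the similarity is 1 - 1/7, which is above the threshold.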

Record Segmentation:-
This module divides each identified data region into a set of candidate records based on the pattern governing the region. To segment a region into records, we build a sequence by listing the nodes of the data region in order, representing each node by the sib_id assigned to it. For example, for the tag tree in Fig. 2, the algorithm generates the sequence 1212 for Region 1 and 33 for Region 2.
By Property 1, each QRR (record) is formed by a list of consecutive subtrees, i.e., records are encoded consistently. The sequence will therefore tend to consist of tandem repeats, i.e., a repeated subsequence of sib_ids, each repetition corresponding to one QRR. Thus, in Region 1 each tandem repeat 12 corresponds to a record, and in Region 2 each tandem repeat 3 corresponds to a record.
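The tandem-repeat idea can be sketched as follows. This is a simplified illustration, not the paper's segmentation procedure: it assumes the whole sib_id sequence of a region is an exact repetition of one unit and picks the shortest such unit, whereas real pages may contain irregular records. The name `segment_records` is hypothetical.

```python
def segment_records(sib_ids):
    """Split a region's sib_id sequence into records via its shortest tandem repeat.

    Returns a list of records, each a list of node positions within the region.
    """
    n = len(sib_ids)
    if n == 0:
        return []
    for unit_len in range(1, n // 2 + 1):
        if n % unit_len:
            continue  # the unit must tile the whole sequence
        unit = sib_ids[:unit_len]
        if all(sib_ids[k:k + unit_len] == unit for k in range(0, n, unit_len)):
            return [list(range(k, k + unit_len)) for k in range(0, n, unit_len)]
    # No exact repeat found: treat the whole region as a single record.
    return [list(range(n))]


print(segment_records([1, 2, 1, 2]))  # Region 1 of the example
print(segment_records([3, 3]))        # Region 2 of the example
```

For the sequence 1212 the shortest repeating unit is 12, so nodes 0-1 and 2-3 form two records; for 33 each node is its own record.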

Query Result Section Identification:-
The data region identification module identifies multiple data regions in the input query result page, but we assume that only one data region contains the QRRs. The following three rules are used to find this region, called the Query Result Section.
1. The data region with the largest area in the query result page is the Query Result Section.
2. The data region located at the center of the query result page is the Query Result Section.
3. The data region with more data strings than the others is the Query Result Section.
The data region that satisfies the above three rules is selected as the Query Result Section and is assumed to contain the required QRRs.
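One way to apply the three rules is sketched below. The text does not say how area and position are measured or what happens when no single region wins all three rules, so everything beyond the rules themselves is an assumption: the `Region` fields are hypothetical inputs (for example, taken from a rendering engine), and a simple vote count is used as a fallback when the rules disagree.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    area: float           # rendered area of the region (hypothetical input)
    center_offset: float  # distance from the region centre to the page centre
    n_strings: int        # number of data strings inside the region


def select_query_result_section(regions):
    """Pick the region favoured by the three rules (vote count is an assumption)."""
    largest = max(regions, key=lambda r: r.area)                   # rule 1
    most_central = min(regions, key=lambda r: r.center_offset)     # rule 2
    most_strings = max(regions, key=lambda r: r.n_strings)         # rule 3
    votes = {r.name: 0 for r in regions}
    for winner in (largest, most_central, most_strings):
        votes[winner.name] += 1
    # A region that satisfies all three rules gets 3 votes and is always chosen.
    return max(regions, key=lambda r: votes[r.name])


candidates = [
    Region("navigation", area=2000.0, center_offset=400.0, n_strings=6),
    Region("results", area=90000.0, center_offset=20.0, n_strings=48),
    Region("ads", area=15000.0, center_offset=350.0, n_strings=4),
]
print(select_query_result_section(candidates).name)
```

In this made-up example the "results" region wins all three rules, which is the case the text describes.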

Data Alignment:-
QRRs are aligned in two steps: pairwise alignment and holistic alignment.
Figure 4 shows the entry page of the Automatic WDSA system, where the user selects the input query result page whose QRRs are to be extracted and aligned in a table.

Fig. 4 Selection of query result page
The tag tree for the selected query result page is shown in Figure 5.

Fig. 5 Tag or DOM Tree
Figure 6 shows the results of the data scraping steps, i.e., data region identification, record segmentation and query result section identification.

PERFORMANCE EVALUATION
The system was successfully tested on Data set 1 (TBDW version 1.02) and Data set 2 (www.tribebyamarpali.com). Two sets of evaluation metrics are used to evaluate performance. The first set consists of record-level precision and recall, defined as

Pr = Nc / Ne and Rr = Nc / Nr
where Nc is the number of correctly extracted and aligned QRRs, Ne is the number of extracted QRRs, and Nr is the actual number of QRRs in the query result pages.
The number of QRRs in different query result pages varies from a few to hundreds. Consequently, pages with many QRRs will dominate the record-level metrics. To deal with this problem, we also use a page-level metric, namely, page-level precision defined as,

Pp = Np / Na
where Np is the number of correctly extracted pages, which means that all the QRRs in the pages are correctly extracted and aligned, and Na is the number of all the pages from which QRRs are extracted. The page-level recall is always equal to the page-level precision because we assume that each of the input pages contains at least two QRRs and the data extraction is performed on all input pages.
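The record-level and page-level metrics defined above can be computed as in the following sketch. The function names are hypothetical, and the counts in the usage lines are made-up illustrative values, not results from the paper's data sets.

```python
def record_level_metrics(n_correct, n_extracted, n_actual):
    """Return (Pr, Rr) = (Nc / Ne, Nc / Nr)."""
    precision = n_correct / n_extracted if n_extracted else 0.0
    recall = n_correct / n_actual if n_actual else 0.0
    return precision, recall


def page_level_precision(pages_correct, pages_total):
    """Return Pp = Np / Na."""
    return pages_correct / pages_total if pages_total else 0.0


# Illustrative counts only (not taken from the paper's experiments).
pr, rr = record_level_metrics(n_correct=49, n_extracted=50, n_actual=50)
pp = page_level_precision(pages_correct=19, pages_total=20)
print(pr, rr, pp)
```

A page counts toward Np only if every QRR on it is correctly extracted and aligned, so a single error on a page with hundreds of records lowers Pp by one full page while barely changing Pr and Rr.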
Table 1 shows the experimental results for Automatic WDSA. It shows that Automatic WDSA can extract and align QRRs very effectively, with record-level precision and recall both around 98 percent on Data set 1 and 100 percent on Data set 2, and page-level precision of 95 and 100 percent on Data sets 1 and 2, respectively.

CONCLUSION
In this paper, we addressed the problem of extracting data from web pages and making it usable by representing it in a format suitable for applications such as value-added services, data integration and comparison shopping. In particular, the main aim is to automatically extract query-relevant data and convert it into a standard format such as a table. Automatic Web Data Scraper and Aligner (WDSA) works in two steps: data scraping and data alignment. Data scraping performs automatic data extraction to obtain the QRRs from the input query result page. The novel data alignment step then aligns the data values of the QRRs in pairwise and holistic fashion using the similarity of both tags and values, which differentiates our method from others. Experimental results on Data set 1 and Data set 2 demonstrate the effectiveness of our method.