A deep search method to survey data portals in the whole web: toward a machine learning classification model
Time: 2022-11-12


Andreiwid Sheffer Correa, Alencar Melo Jr., Flavio Soares Correa da Silva


The emergence of standardized open data software platforms has provided a common set of features to sustain the lifecycle of open data practices, which includes storing, managing, publishing, and visualizing data, in addition to providing an out-of-the-box solution for data portals. Accordingly, the dissemination of data portals that implement such platforms has paved the way for automation, wherein (meta)data extraction supplies the demand for quantity-oriented metrics, mainly for benchmarking purposes. This has given rise to the issue of how to survey data portals globally, particularly how to reduce manual effort while covering a wide variety of sources that may not implement standardized solutions. Thus, this study raises two main problems: searching for standardized open data software platforms and identifying custom-developed web-based software operated as data portals. This study aims to develop a method that deeply searches each web page on the internet and formalizes a machine learning classification model to improve the identification of data portals, irrespective of whether these data portals implement a standardized open data software platform and comply with open data technical guidelines. The following subsections discuss the main processes in detail:

First, deep searching the web. If a data portal operates on the internet and implements one of the open data platforms CKAN, Socrata, OpenDataSoft, or ArcGIS Open Data, there is a high chance that it will be discovered through the proposed method. The deep search method is organized into four main processes and presupposes access to Common Crawl WARC files, which contain the full content of crawled web pages, including the HTML source code. The first process, keyword search, expects an input text file named warc.paths, obtained from the Common Crawl project; it contains the paths to all WARC files located in the Amazon Public Datasets repository. This process is conducted in the following four steps: 1) download files, 2) decompress files, 3) iterate through their content while searching for the keywords, and 4) write a result file that stores the URLs whose web pages match the keywords. In detail, Step 3) considers a predefined list of keywords, such as open data, opendata, ckan, socrata, opendatasoft, and arcgis, which are normally found in data portal web pages and were part of this experiment. These keywords were selected after a careful analysis of the HTML of each software platform, in which we noticed at least one occurrence of the respective product reference in the source code. Both the opendata and open data strings were selected based on an internationally agreed list of terms that usually describe an open data portal. All steps within the keyword search process occur sequentially, from one to four, for each WARC file, and the processed files are deleted from the machine after execution.
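The keyword matching in Step 3) can be illustrated with a minimal sketch. The function name and the sample page below are hypothetical; an actual run would apply this check to the HTML of each record extracted from a decompressed WARC file rather than to a single string.

```python
# Keyword list taken from the method description; matching is case-insensitive.
KEYWORDS = ["open data", "opendata", "ckan", "socrata", "opendatasoft", "arcgis"]

def match_keywords(html, keywords=KEYWORDS):
    """Return the keywords found in a web page's HTML source."""
    lowered = html.lower()
    return [kw for kw in keywords if kw in lowered]

# Hypothetical HTML snippet standing in for one crawled page.
page = '<html><body><a href="https://demo.ckan.org">Open Data portal</a></body></html>'
match_keywords(page)  # → ['open data', 'ckan']
```

A page that yields a non-empty match list would have its URL written to the result file in Step 4).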

Second, generating a machine learning classification model. The first step, extract, involves extracting pieces of information (known as features) from web pages serving as examples of both data and non-data portals. Features include words and information regarding their occurrence, ignoring word order and sentence construction. A machine learning algorithm works with these features to predict whether a given non-classified web page is a data portal, based on the occurrence of words that were previously classified in the training model. The second step, train, consists of using a probabilistic classifier to find the probability, based on the contribution of each feature, of a label defined in the training model. The third and last step, predict, aims to determine the label that should be applied to a given input that has not yet been classified, called the test set. In this step, the algorithm decides whether a web page is a data portal and learns how to generalize the model to new examples. The machine learning algorithm makes decisions based on a probabilistic model, where there is room for mistakes. The best approach is to check whether the algorithm performed well, but this is not as straightforward as it seems. Verifying the accuracy of the results would require checking them manually, which would demand unreasonable effort. This can be performed on a small sample to reduce the effort; however, such a sample cannot guarantee 100% accuracy.
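The extract, train, and predict steps can be sketched with a small bag-of-words naive Bayes classifier. This is an illustrative, stdlib-only sketch: the choice of naive Bayes, the function names, and the toy training texts and labels are assumptions for demonstration, not the paper's actual model or data.

```python
import math
from collections import Counter

def tokenize(text):
    # Extract step: bag-of-words features, ignoring order and sentence construction.
    return [w for w in text.lower().split() if w.isalpha()]

def train(examples):
    """Train step: count word occurrences per label over labeled example pages."""
    labels = Counter(label for _, label in examples)
    counts = {label: Counter() for label in labels}
    for text, label in examples:
        counts[label].update(tokenize(text))
    vocab = {w for c in counts.values() for w in c}
    return {"labels": labels, "counts": counts, "vocab": vocab, "total": len(examples)}

def predict(model, text):
    """Predict step: choose the label with the highest posterior log-probability."""
    best_label, best_logp = None, -math.inf
    for label, n in model["labels"].items():
        logp = math.log(n / model["total"])  # class prior
        words_in_label = sum(model["counts"][label].values())
        for w in tokenize(text):
            # Laplace smoothing so unseen words do not zero out the probability.
            logp += math.log((model["counts"][label][w] + 1)
                             / (words_in_label + len(model["vocab"])))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy training set (hypothetical texts standing in for crawled pages).
training = [
    ("open data portal dataset catalog ckan metadata download", "portal"),
    ("explore datasets published on our socrata open data site", "portal"),
    ("breaking news weather forecast and sports results today", "other"),
    ("online shop cart checkout payment shipping discount", "other"),
]
model = train(training)
predict(model, "browse the open data catalog powered by ckan")  # → 'portal'
```

As the text notes, such a classifier is probabilistic and leaves room for mistakes, which is why its outputs still warrant a manual accuracy check on a sample.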

The study contributes to, and suggests implications for, open data, government organizations, AI, and intelligent automation. For open data, the feasibility of creating a repository with a worldwide list of data portals opens the way for an entirely new source for automation. In particular, manual benchmarking can be improved by relying on a broader variety of evidence automatically collected from such a repository. As the results of benchmarking exercises tend to emphasize strengths and sometimes reveal weaknesses, government institutions and organizations can base their open data action plans on the best-practice models run by others. For AI, the use of machine learning for web page classification in the field of data portals inaugurates a new domain of applications where the underlying techniques can showcase their effectiveness and potential for novel research. Finally, for intelligent automation, the use of massive sources of web data, as demonstrated with the Common Crawl project, shows practical applicability even for newcomers to the domain of web data retrieval.


Translated by JiaXing Ding  Reviewed by ChangHua Chen