Benchmarking our own A.I. Parser

GitHub: https://github.com/scrapinghub/article-extraction-benchmark

Benchmark Ureka dataset

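We labeled articles from the 40 Ureka media outlets, taking at least two articles per outlet (for a few outlets the HTML could not be downloaded or the page was broken), so the dataset contains 78 articles in total.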

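As in the previous evaluation, there are three scores. The first focuses only on how closely the extracted body text matches the labeled text. The second additionally penalizes cases where nothing could be extracted. The third also takes into account cases where the extracted text is completely unrelated to the article.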
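
The notes do not spell out how the three scores are computed, so the following is only one possible reading, written as a minimal sketch. The character-level F1 metric and the names `char_f1` and `three_scores` are assumptions made for illustration, not the benchmark's actual code.

```python
from collections import Counter

def char_f1(predicted: str, reference: str) -> float:
    """Character-level F1 between extracted text and labeled text.
    Whitespace is ignored; character tokens are an assumption that
    suits Chinese text, which has no spaces between words."""
    pred = Counter("".join(predicted.split()))
    ref = Counter("".join(reference.split()))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def three_scores(pairs):
    """pairs: list of (extracted_text, labeled_text) tuples.
    Assumed reading of the three scores:
      score1: mean F1 over articles with related extracted text
      score2: same, but empty extractions count as zero
      score3: mean F1 over all articles, unrelated text included"""
    f1s = [char_f1(pred, ref) for pred, ref in pairs]
    n_empty = sum(1 for pred, _ in pairs if not pred.strip())
    related = [f for f in f1s if f > 0]
    score1 = sum(related) / len(related) if related else 0.0
    denom2 = len(related) + n_empty
    score2 = sum(related) / denom2 if denom2 else 0.0
    score3 = sum(f1s) / len(f1s) if f1s else 0.0
    return score1, score2, score3
```

Under this reading, a parser that returns nothing for some pages keeps a high first score but loses points on the second and third.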

Trafilatura performed very consistently.

[Image: Screen Shot 2021-12-14 at 1.52.19 PM.png]

Trafilatura

The extractor focuses on metadata, main body text, and comments.

The extraction algorithm is based on a cascade of rule-based filters and content heuristics.

Content delimitation is performed by XPath expressions targeting common HTML elements and attributes, as well as idiosyncrasies of the main content management systems (excluding unwanted parts and centering on the desirable content).

In case nothing worked, a baseline extraction is run in order to look for wild text elements.
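
The cascade-then-baseline idea described above can be sketched roughly as follows. This is not Trafilatura's code: it uses the standard library's `xml.etree.ElementTree` (which needs well-formed markup) instead of lxml, and `BODY_XPATHS`, `MIN_LENGTH`, `text_of`, and `extract_body` are hypothetical names and values chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical expressions, tried in order; the real lists are longer.
BODY_XPATHS = [
    ".//article",
    ".//div[@class='article-body']",
    ".//div[@id='content']",
]
MIN_LENGTH = 20  # assumed minimum number of characters for a valid body

def text_of(element) -> str:
    """All text inside the element, with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())

def extract_body(xhtml: str) -> str:
    root = ET.fromstring(xhtml)
    # Cascade: the first expression that yields enough text wins.
    for expr in BODY_XPATHS:
        for candidate in root.findall(expr):
            text = text_of(candidate)
            if len(text) >= MIN_LENGTH:
                return text
    # Baseline: no rule matched, so collect every paragraph in the page.
    paragraphs = (text_of(p) for p in root.iter("p"))
    return " ".join(t for t in paragraphs if t)
```

A page with a known container is handled by the cascade, while a page with no recognizable container still returns its loose paragraph text through the baseline.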

Rough outline of the algorithm

load_html →

tree_cleaning (remove unwanted and empty elements)

Tags are split into kill tags (the tag's entire subtree is removed) and remove tags (only the tag itself is removed)

→

convert tags →

extract comments or exclude comments →

extract content

(cleaned tree → use BODY_XPATH to find the potential body → prune unwanted nodes with DISCARD_XPATH → remove elements by link density → check whether the minimum length is met → convert the result tree into text → if the result text is shorter than the minimum length, go back to the cleaned tree and reconsider the previously discarded nodes)
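
The difference between kill tags and remove tags in the tree_cleaning step can be illustrated with a small sketch. It is written against the standard library's `xml.etree.ElementTree` rather than lxml; the tag sets `KILL_TAGS` and `STRIP_TAGS` and the helpers `_append_text` and `clean` are hypothetical and are not taken from Trafilatura.

```python
import xml.etree.ElementTree as ET

KILL_TAGS = {"script", "aside"}  # hypothetical: drop the tag and its whole subtree
STRIP_TAGS = {"span", "font"}    # hypothetical: drop the tag, keep its content

def _append_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text

def clean(element):
    index = 0
    while index < len(element):
        child = element[index]
        if child.tag in KILL_TAGS:
            # Remove the subtree, but keep the text that followed it.
            element.remove(child)
            _append_text(element, index, child.tail)
        elif child.tag in STRIP_TAGS:
            # Splice the child's text and children into the parent.
            grandchildren = list(child)
            element.remove(child)
            _append_text(element, index, child.text)
            for offset, grandchild in enumerate(grandchildren):
                element.insert(index + offset, grandchild)
            _append_text(element, index + len(grandchildren), child.tail)
            # index is not advanced: the spliced children are cleaned next.
        else:
            clean(child)
            index += 1
    return element
```

With lxml, the same two behaviors are available through `etree.strip_elements` (subtree removal) and `etree.strip_tags` (tag removal that keeps the content).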

link_density_test

input: an element (subtree)

output: Boolean (True → high link density, should be deleted; False → low link density, may be content), mylist (the text of the <ref> elements)

def link_density_test(element):
    links_xpath = element.xpath('.//ref')  # find links
    if links_xpath:
        elemtext = element's text content
        elemlen = length of trim(elemtext)  # trim: remove unnecessary whitespace within a text
        if element.tag == 'p':
            # limitlen: elements with less text than this are candidates for removal as non-content
            # threshold: if the share of link text exceeds this ratio, the element is classified as non-content
            limitlen, threshold = 25, 0.8
        else:
            if element.getnext() is None:
                limitlen, threshold = 200, 0.66
            else:
                limitlen, threshold = 100, 0.66
        if elemlen < limitlen:
            linklen = total length of the words in links_xpath
            elemnum = number of links
            shortelems = number of words in links_xpath with length < 10
            mylist = list of the words in links_xpath
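
The pseudocode above stops before the final decision, so here is a runnable sketch of how the test might be completed. Several parts are assumptions: it uses the standard library's `xml.etree.ElementTree` instead of lxml, the `has_next_sibling` parameter stands in for `element.getnext() is None`, and the final decision rule (link text share or share of short links reaching `threshold`) is inferred from the threshold comment rather than shown in the notes.

```python
import xml.etree.ElementTree as ET

def trim(text: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return " ".join(text.split())

def link_density_test(element, has_next_sibling=True):
    """Return (True, link_texts) if the element looks like link boilerplate."""
    links = element.findall(".//ref")
    mylist = []
    if links:
        elemlen = len(trim("".join(element.itertext())))
        if element.tag == "p":
            limitlen, threshold = 25, 0.8
        elif not has_next_sibling:  # stand-in for element.getnext() is None
            limitlen, threshold = 200, 0.66
        else:
            limitlen, threshold = 100, 0.66
        if elemlen < limitlen:
            texts = (trim("".join(link.itertext())) for link in links)
            mylist = [t for t in texts if t]
            linklen = sum(len(t) for t in mylist)
            elemnum = len(mylist)
            shortelems = sum(1 for t in mylist if len(t) < 10)
            # Assumed decision rule; the notes do not include this part.
            if elemnum == 0:
                return True, mylist
            if linklen >= threshold * elemlen or shortelems / elemnum >= threshold:
                return True, mylist
    return False, mylist
```

For example, a short navigation block made up almost entirely of links is flagged, while a paragraph of 25 characters or more is never tested and is kept.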
