The extractor focuses on metadata, the main body text, and comments.
The extraction algorithm is based on a cascade of rule-based filters and content heuristics.
Content delimitation is performed with XPath expressions targeting common HTML elements and attributes, as well as idiosyncrasies of major content management systems (excluding unwanted parts and centering on the desirable content).
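For illustration, a few XPath expressions of the kind such rules rely on; these are hypothetical examples, not the library's actual BODY_XPATH / DISCARD_XPATH lists:

    # hypothetical content-targeting expressions (BODY_XPATH-style)
    body_xpath_examples = [
        '//article',
        '//main',
        '//*[@id="content" or contains(@class, "post-content")]',
    ]
    # hypothetical discard expressions (DISCARD_XPATH-style)
    discard_xpath_examples = [
        '//*[contains(@class, "sidebar")]',
        '//*[contains(@class, "share") or contains(@id, "related")]',
        '//footer',
    ]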
If nothing worked, a baseline extraction is run in order to look for wild text elements.
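A minimal sketch of what such a baseline pass could look like, assuming an lxml tree; the function name and the min_length parameter are made up for illustration:

    from lxml import html

    def baseline_sketch(tree, min_length=25):
        # last resort: collect stray text from paragraph elements anywhere in the tree
        texts = [t.strip() for t in tree.xpath('//p//text()') if t.strip()]
        result = '\n'.join(texts)
        # return the text only if it is long enough to be plausible content
        return result if len(result) >= min_length else ''

    # usage: baseline_sketch(html.fromstring(downloaded_page))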
(cleaned_tree → use BODY_XPATH to find the potential body → prune unwanted nodes with DISCARD_XPATH → remove elements with high link density → check whether the result meets the minimum length → convert the result tree into text → if the result text is still shorter than the minimum length, go back to the cleaned tree and keep the previously discarded nodes)
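A condensed sketch of that flow under the same assumptions; the function and parameter names (extract_body_sketch, body_exprs, discard_exprs, min_length) are made up, and the link-density step is left to the test described next:

    from copy import deepcopy

    def extract_body_sketch(cleaned_tree, body_exprs, discard_exprs, min_length=100):
        for expr in body_exprs:                      # try each candidate body expression in order
            for candidate in cleaned_tree.xpath(expr):
                subtree = deepcopy(candidate)        # work on a copy so the fallback keeps everything
                for discard in discard_exprs:        # prune unwanted nodes
                    for bad in subtree.xpath(discard):
                        if bad.getparent() is not None:
                            bad.getparent().remove(bad)
                # (the link-density pruning step would run here)
                text = ' '.join(subtree.itertext()).strip()
                if len(text) >= min_length:          # accept the first candidate that is long enough
                    return text
        # nothing met the minimum length: fall back to the whole cleaned tree,
        # which effectively keeps the previously discarded nodes
        return ' '.join(cleaned_tree.itertext()).strip()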
link_density_test
input: an element tree
output: a Boolean (True → high link density, the element should be deleted; False → low density, it may be content) and mylist (the link texts found in <ref> elements)
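Roughly how such a test could be applied while pruning, assuming the function below returns that (Boolean, list) pair; the caller and the tag set are made up for illustration:

    def prune_link_dense_elements(subtree):
        # remove elements whose text is dominated by link text
        for elem in list(subtree.iter('head', 'p', 'table')):   # assumed set of candidate tags
            is_dense, _link_texts = link_density_test(elem)
            if is_dense and elem.getparent() is not None:
                elem.getparent().remove(elem)
        return subtree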
def link_density_test(element):
    links_xpath = element.xpath('.//ref')  # find the links (<ref> elements) inside the element
    mylist = []
    if links_xpath:
        elemtext = element.text_content()
        elemlen = len(trim(elemtext))  # trim: removes unnecessary whitespace within a text
        if element.tag == 'p':
            # limitlen: below this text length the element is classified as non-content
            # threshold: above this share of link text the element is classified as non-content
            limitlen, threshold = 25, 0.8
        else:
            if element.getnext() is None:  # the element is the last among its siblings
                limitlen, threshold = 200, 0.66
            else:
                limitlen, threshold = 100, 0.66
        if elemlen < limitlen:
            linklen = sum(len(trim(link.text_content())) for link in links_xpath)  # total length of the link texts
            elemnum = len(links_xpath)  # how many links there are
            shortelems = sum(1 for link in links_xpath if len(trim(link.text_content())) < 10)  # links with text shorter than 10 characters
            mylist = [trim(link.text_content()) for link in links_xpath]  # the link texts