Benchmarking our own A.I. Parser



Benchmark Ureka dataset

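  • We annotated at least two articles from each of Ureka's 40 media outlets (for a few outlets the HTML could not be downloaded or the page was broken), so the dataset contains 78 articles in total.
  • As in the previous evaluation, the first score focuses on how closely the extracted body text matches the reference, the second score also penalizes cases where nothing could be extracted, and the third additionally accounts for cases where completely unrelated content was extracted.
  • Trafilatura performs very consistently.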
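
The three scores could be computed along the following lines. This is only one possible reading of the description above, not the actual evaluation script: the similarity metric (`difflib.SequenceMatcher`), the function names, the `unrelated_below` threshold, and the `unrelated_penalty` value are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(extracted, reference):
    """Character-level similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, extracted, reference).ratio()


def three_scores(pairs, unrelated_below=0.1, unrelated_penalty=0.5):
    """pairs: non-empty list of (extracted_text_or_None, reference_text).

    unrelated_below and unrelated_penalty are hypothetical values.
    """
    # None marks an article for which nothing could be extracted.
    sims = [similarity(ext, ref) if ext else None for ext, ref in pairs]

    # Score 1: body-text similarity only, over articles that were extracted.
    extracted = [s for s in sims if s is not None]
    score1 = mean(extracted) if extracted else 0.0

    # Score 2: failed extractions count as 0.
    score2 = mean(0.0 if s is None else s for s in sims)

    # Score 3: additionally, clearly unrelated output costs a penalty.
    score3 = mean(
        0.0 if s is None else (-unrelated_penalty if s < unrelated_below else s)
        for s in sims
    )
    return score1, score2, score3
```

Under this reading, score 1 rewards a parser that is accurate when it succeeds, while scores 2 and 3 increasingly punish a parser that fails silently or returns the wrong part of the page.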

Trafilatura

  • The extractor focuses on metadata, main body text, and comments.
  • The extraction algorithm is based on a cascade of rule-based filters and content heuristics.
  • Content delimitation is performed by XPath expressions targeting common HTML elements and attributes as well as idiosyncrasies of the main content management systems (to exclude unwanted parts and center on the desirable content).
  • In case nothing worked, a baseline extraction is run in order to look for wild text elements.
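
The cascade-with-fallback idea in the bullets above can be sketched with the standard library alone. This is a simplified illustration and not Trafilatura's code: `BODY_PATHS`, `MIN_LENGTH`, and the function names are hypothetical, the input is assumed to be well-formed XHTML, and `xml.etree.ElementTree` only supports a small XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much-reduced stand-ins for the real rule sets.
BODY_PATHS = [".//article", ".//div[@class='post-content']", ".//main"]
MIN_LENGTH = 50


def text_of(node):
    """Return the node's text with whitespace runs collapsed."""
    return " ".join("".join(node.itertext()).split())


def extract_main_text(html, min_length=MIN_LENGTH):
    root = ET.fromstring(html)
    # Cascade: try each body expression in order and accept the first
    # candidate whose text is long enough.
    for path in BODY_PATHS:
        for candidate in root.iterfind(path):
            text = text_of(candidate)
            if len(text) >= min_length:
                return text
    # Baseline: nothing matched, so collect whatever paragraph text exists.
    paragraphs = [text_of(p) for p in root.iter("p")]
    return "\n".join(t for t in paragraphs if t)
```

The ordering of `BODY_PATHS` matters in this sketch: more specific expressions go first so that a generic container is only used when nothing better matches.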

Rough outline of the algorithm
load_html → 
tree_cleaning (remove unwanted and empty elements)
split into kill tags (the tag's entire subtree is removed) and remove tags (only the tag itself is removed, its content is kept)
 → 
convert tags → 
extract comments or exclude comments → 
(cleaned_tree → use BODY_XPATH to find the potential body → prune unwanted nodes with DISCARD_XPATH → remove elements by link density → check whether the minimum length is met → convert the result tree into text → if the result text is shorter than the minimum length, go back to the cleaned tree and reconsider the previously discarded nodes)
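
The difference between the two kinds of tag in the tree_cleaning step can be illustrated as follows. This is a minimal standard-library sketch, not Trafilatura's implementation: the tag sets and helper names are hypothetical. In lxml the same two operations are available as `lxml.etree.strip_elements` (drop the subtree) and `lxml.etree.strip_tags` (drop only the tag).

```python
import xml.etree.ElementTree as ET

KILL_TAGS = {"script", "style", "nav", "footer"}  # hypothetical: drop whole subtree
STRIP_TAGS = {"span", "font"}                     # hypothetical: drop tag, keep content


def _append_text(parent, idx, text):
    """Attach text immediately before child position idx of parent."""
    if not text:
        return
    if idx == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + text


def kill(parent, child):
    """Remove child with its whole subtree, keeping the text that follows it."""
    idx = list(parent).index(child)
    _append_text(parent, idx, child.tail)
    parent.remove(child)


def unwrap(parent, child):
    """Remove only the tag: its text and children move up into parent."""
    idx = list(parent).index(child)
    _append_text(parent, idx, child.text)
    grandchildren = list(child)
    parent.remove(child)
    for offset, grandchild in enumerate(grandchildren):
        parent.insert(idx + offset, grandchild)
    _append_text(parent, idx + len(grandchildren), child.tail)


def clean_tree(node):
    for child in list(node):
        if child.tag in KILL_TAGS:
            kill(node, child)
        else:
            clean_tree(child)  # depth-first, so nested tags are handled first
            if child.tag in STRIP_TAGS:
                unwrap(node, child)
    return node
```

For example, cleaning `<div><script>x()</script>Hello <span>big <b>world</b></span>!</div>` with these sets removes the script entirely but keeps "big" and the `<b>` element that were inside the span.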

link_density_test
input: element tree
output: Boolean (True → high link density, should be deleted; False → low link density, may be content), mylist (the text inside the <ref> elements)
def link_density_test(element):
  links_xpath = element.xpath('.//ref') # find links
  if links_xpath:
    elemtext = the element's text content
    elemlen = length of trim(elemtext) # trim: remove unnecessary whitespace within a text
    if element.tag == 'p':
      # limitlen: only elements with less text than this are candidates for non-content
      # threshold: if link text exceeds this share of the element's text, classify it as non-content
      limitlen, threshold = 25, 0.8
    else:
      if element.getnext() is None:
        limitlen, threshold = 200, 0.66
      else:
        limitlen, threshold = 100, 0.66
    if elemlen < limitlen:
      linklen = the total length of the link texts in links_xpath
      elemnum = the number of links
      shortelems = the number of link texts in links_xpath shorter than 10 characters
      mylist = list of the link texts in links_xpath