Extractive Text Summarisation
  • Extractive summarisation frames the task as binary classification: for each sentence, predict whether it belongs in the summary.
  • Encoder-Decoder architecture - the encoder obtains document-level features and domain understanding and produces word- or sentence-level representations (in BERT's case, the sum of token, position and segment embeddings passed through the Transformer layers); these vectors are then fed to the decoder.
  • PreSumm extractive "decoder" - stacked inter-sentence Transformer layers over the sentence vectors to obtain document-level understanding, followed by a logistic (sigmoid) function for the in-summary / not-in-summary label.
  • Decoder used by PreSumm for abstractive summarisation - a randomly initialised Transformer decoder.
  • The optimisers of the encoder and decoder are separate, because the encoder here is pretrained while the new layers are trained from scratch.
  • Two inter-sentence Transformer layers work best; that configuration is called BERTSUMEXT. A rough PyTorch sketch of this setup follows this list.
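A minimal PyTorch sketch of the BERTSUMEXT-style setup above. This is not PreSumm's actual implementation: the use of Hugging Face `transformers`, a plain `nn.TransformerEncoder` in place of PreSumm's inter-sentence layers (which also add positional encodings), and all layer sizes and learning rates are my own assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class BertSumExtSketch(nn.Module):
    """Rough BERTSUMEXT-style extractive classifier:
    BERT encoder -> sentence vectors at the [CLS] positions ->
    stacked inter-sentence Transformer layers -> sigmoid score per sentence."""

    def __init__(self, bert_name: str = "bert-base-uncased", num_layers: int = 2):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)          # pretrained encoder
        hidden = self.bert.config.hidden_size
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, cls_positions):
        # Token-level contextual representations from the pretrained encoder.
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Gather the vector at each sentence's [CLS] token -> one vector per sentence.
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        sent_vecs = hidden[batch_idx, cls_positions]              # (batch, n_sents, hidden)
        # Document-level interaction between sentences.
        sent_vecs = self.inter_sentence(sent_vecs)
        # Sigmoid output = probability that the sentence belongs in the summary.
        return torch.sigmoid(self.classifier(sent_vecs)).squeeze(-1)


model = BertSumExtSketch()
# Separate optimisers because the encoder is pretrained and the new layers are not
# (learning rates here are placeholders, not PreSumm's schedule).
enc_opt = torch.optim.Adam(model.bert.parameters(), lr=2e-5)
dec_opt = torch.optim.Adam(
    list(model.inter_sentence.parameters()) + list(model.classifier.parameters()), lr=2e-3
)
loss_fn = nn.BCELoss()  # one binary label per sentence
```

Training would minimise `loss_fn` between the per-sentence probabilities and the 0/1 oracle labels, stepping both optimisers on each batch.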

Math that I don’t think I completely understand.

Some Data: 

ROUGE scores (CNN/DailyMail):

| Model              | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ------------------ | ------- | ------- | ------- |
| TransformerExt     | 40.90   | 18.02   | 37.17   |
| BertSumExt         | 43.23   | 20.24   | 39.63   |
| BertSumExt (large) | 43.85   | 20.34   | 39.90   |
| SummaRunner        | 39.60   | 16.20   | 35.30   |
| BanditSum          | 41.50   | 18.70   | 37.60   |
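The papers compute these metrics with the official ROUGE toolkit; for intuition, comparable ROUGE-1/2/L numbers can be computed with Google's `rouge-score` package. The package choice and the example strings below are mine, not from the papers.

```python
from rouge_score import rouge_scorer

# Hypothetical reference/model summaries, just to show what R-1 / R-2 / R-L measure.
reference = "the cat sat on the mat and watched the birds outside"
prediction = "the cat sat on the mat watching birds"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)

for name, score in scores.items():
    # Each entry has precision, recall and F1; the table above reports F1 x 100.
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```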

Datasets (a quick CNN/DailyMail loading sketch follows these lists):

  • SummaRunner
  • Dataset: CNN/DailyMail
  • Train: 196,557 documents
  • Val: 12,147 documents
  • Embeddings: 100-dimensional word2vec
  • Vocab: 150k
  • Gradient clipping used

  • BanditSum
  • Dataset: CNN/DailyMail
  • Train: 196,557 documents
  • Val: 12,147 documents
  • Embeddings: 100-dimensional GloVe
  • Gradient clipping used

  • PreSumm
  • Dataset: CNN/DailyMail
  • Train: 90,266 (CNN) + 196,961 (DailyMail) documents
  • Val: 1,200 (CNN) + 12,148 (DailyMail) documents
  • Embeddings: BERT
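The counts above come from the papers' own preprocessed splits. For a quick look at the corpus itself, the combined CNN+DailyMail dataset on the Hugging Face Hub can be loaded as below; the `cnn_dailymail` / `3.0.0` identifier is the Hub convention, not something the papers used, and its combined split sizes will not match the per-paper counts exactly.

```python
from datasets import load_dataset

# Combined CNN + DailyMail corpus as packaged on the Hugging Face Hub.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")

for split in ("train", "validation", "test"):
    print(split, len(cnn_dm[split]))

# Each example has an "article" (source document) and "highlights" (reference summary).
sample = cnn_dm["train"][0]
print(sample["article"][:200])
print(sample["highlights"])
```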

PreSumm - 

  • Takes a pretrained encoder (BERT) and stacks inter-sentence Transformer layers on top of it to capture document-level features.
  • Covers both extractive and abstractive summarisation; there is also a two-stage variant (BERTSUMEXTABS) where the abstractive model is fine-tuned on top of the extractive one.
  • Assigns a score to each sentence, and the top three sentences are selected as the model summary.
  • Trigram blocking is used to reduce redundancy: a candidate sentence is skipped if it shares a trigram with the sentences already selected, so near-duplicate sentences don't end up in the summary. A sketch of this selection procedure follows this list.
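A rough sketch of top-k sentence selection with trigram blocking, as described above. This is a simplified reimplementation, not PreSumm's code: the whitespace tokenisation and the helper names are my own assumptions, while the k = 3 cut-off follows the notes.

```python
def trigrams(sentence: str) -> set[tuple[str, ...]]:
    """All word trigrams in a sentence (lower-cased, whitespace tokenised)."""
    words = sentence.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def select_summary(sentences: list[str], scores: list[float], k: int = 3) -> list[str]:
    """Pick the k highest-scoring sentences, skipping any sentence that
    shares a trigram with the sentences already selected (trigram blocking)."""
    selected: list[str] = []
    seen: set[tuple[str, ...]] = set()
    # Consider sentences from highest to lowest model score.
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        tg = trigrams(sentences[idx])
        if tg & seen:
            continue  # would repeat content already in the summary
        selected.append(sentences[idx])
        seen |= tg
        if len(selected) == k:
            break
    # Present the chosen sentences in their original document order.
    return sorted(selected, key=sentences.index)
```

For example, with sentences `["The cat sat on the mat.", "A dog barked loudly.", "The cat sat on the mat again.", "Birds sang outside."]` and scores `[0.9, 0.7, 0.85, 0.6]`, the third sentence is blocked because it repeats a trigram from the first, even though it scores higher than the second.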