Extractive summarisation is essentially a binary classification task: for each sentence, predict whether it belongs in the summary.
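A toy sketch of that selection step (the scorer `score_fn` is a hypothetical stand-in for a trained sentence classifier, not anything from PreSumm): each sentence gets a probability of being in the summary, and the positively-labelled sentences are kept in document order.

```python
# Minimal sketch of extractive summarisation as per-sentence binary classification.
from typing import Callable, List

def extract_summary(sentences: List[str],
                    score_fn: Callable[[str], float],
                    threshold: float = 0.5,
                    max_sentences: int = 3) -> List[str]:
    scored = [(score_fn(s), i, s) for i, s in enumerate(sentences)]
    # Keep the highest-scoring sentences above the threshold...
    chosen = sorted((t for t in scored if t[0] >= threshold), reverse=True)[:max_sentences]
    # ...then restore original document order for readability.
    return [s for _, _, s in sorted(chosen, key=lambda t: t[1])]

# Example with a dummy scorer that simply favours longer sentences.
doc = ["BERT is a pretrained encoder.", "It was released in 2018.", "Cats are nice."]
print(extract_summary(doc, score_fn=lambda s: min(len(s) / 40, 1.0)))
```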
Encoder-Decoder architecture - the encoder captures document-level features and domain understanding and builds word- or sentence-level representations (in BERT's case, the sum of the token, position and segment embeddings), and these vectors are then fed to the decoder.
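A rough PyTorch sketch of how those three embeddings combine into the encoder input. Sizes are BERT-base defaults; the class name is mine, and PreSumm's trick of alternating segment IDs per sentence is not shown here.

```python
import torch
import torch.nn as nn

class BertStyleEmbeddings(nn.Module):
    """Sum of token, position and segment embeddings, as in BERT's input layer."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.seg = nn.Embedding(n_segments, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)
        return self.norm(x)  # this is what the Transformer encoder layers consume
```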
PreSumm decoder - stacked Transformer layers over the sentence vectors to build document-level understanding, followed by a logistic (sigmoid) classifier for the per-sentence label (see the sketch below the BERTSUMEXT note).
Decoder used by PreSumm - a randomly initialised Transformer decoder (only the encoder is pretrained).
The encoder and decoder use separate optimisers because the encoder is pretrained while the decoder is trained from scratch; the encoder gets a smaller learning rate and a longer warmup so fine-tuning does not destabilise it.
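A minimal sketch of that two-optimiser setup in PyTorch. The encoder/decoder modules below are toy stand-ins, and the learning rates and warmup steps are illustrative placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in PreSumm the encoder is pretrained BERT, the decoder starts from random weights.
encoder = nn.Linear(768, 768)
decoder = nn.Linear(768, 768)

# Separate Adam optimisers so each part gets its own learning rate and warmup:
# gentler for the pretrained encoder, more aggressive for the fresh decoder.
enc_opt = torch.optim.Adam(encoder.parameters(), lr=2e-3)
dec_opt = torch.optim.Adam(decoder.parameters(), lr=0.1)

def noam_lr(base_lr: float, step: int, warmup: int) -> float:
    # Linear warmup, then inverse square-root decay.
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

for step in range(1, 101):
    loss = decoder(encoder(torch.randn(8, 768))).pow(2).mean()  # dummy loss to drive updates
    loss.backward()
    for opt, base_lr, warmup in [(enc_opt, 2e-3, 20_000), (dec_opt, 0.1, 10_000)]:
        for group in opt.param_groups:
            group["lr"] = noam_lr(base_lr, step, warmup)
        opt.step()
        opt.zero_grad()
```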
Stacking two Transformer layers works best; this configuration is called BERTSUMEXT.
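A minimal sketch of those summarisation layers, assuming the encoder has already produced one vector per sentence (the [CLS] positions). The class name and sizes are mine; `n_layers=2` mirrors the BERTSUMEXT setting.

```python
import torch
import torch.nn as nn

class ExtractiveHead(nn.Module):
    """Inter-sentence Transformer layers over sentence vectors, then a sigmoid per sentence."""
    def __init__(self, hidden=768, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.inter_sentence = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, sent_vecs):                      # (batch, n_sentences, hidden)
        h = self.inter_sentence(sent_vecs)             # mixes information across sentences
        return torch.sigmoid(self.classifier(h)).squeeze(-1)  # (batch, n_sentences) in [0, 1]

# Example: 1 document with 5 sentence vectors -> 5 "include in summary?" probabilities.
probs = ExtractiveHead()(torch.randn(1, 5, 768))
print(probs.shape)  # torch.Size([1, 5])
```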
Math that I don't think I completely understand.
Some Data:
ROUGE Scores:
Datasets:
PreSumm -