Official code for Keyphrase Generation Beyond the Boundaries of Title and Abstract (EMNLP Findings 2022)

kgarg8/FullTextKP

FullTextKP

Keyphrase Generation Beyond the Boundaries of Title and Abstract

Create environment

conda create -n FKP_env python=3.6

conda activate FKP_env

conda install pytorch cudatoolkit=11.3 -c pytorch

pip install transformers==4.12.0
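
Once the environment is active, the version pin above can be checked from Python. This is a minimal sketch, not part of the repository: the names EXPECTED, installed_version and check_pins are made up for illustration, and only the transformers pin is checked because the setup commands leave PyTorch unpinned.

```python
from typing import Callable, Dict, Optional

# Version pin copied from the setup commands; torch is not pinned there.
EXPECTED = {"transformers": "4.12.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Python 3.6/3.7 have no importlib.metadata; fall back to setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(
    expected: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Map each package whose installed version differs from its pin to a message."""
    problems = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            problems[package] = "expected {}, found {}".format(wanted, found)
    return problems


if __name__ == "__main__":
    print(check_pins(EXPECTED) or "all pins satisfied")
```

An empty result means the installed packages match the pins.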

Run Commands

Preprocess

cd preprocess

# Stage1
python preprocess_ACM_stage1.py

# Stage2

## Title+Abstract
python preprocess_ACM_stage2_v2.py

## Citations
python preprocess_ACM_stage2_v4.py

## Non-Citations
python preprocess_ACM_stage2_v5.py

## Random
python preprocess_ACM_stage2_v6.py
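
The preprocessing steps above can also be chained from one small driver. This is a sketch under assumptions, not a script shipped with the repository: the variant keys (title_abstract, citations, non_citations, random) and the helper names are hypothetical labels for the four Stage 2 scripts, and it assumes Stage 1 only needs to run once before any Stage 2 script.

```python
import subprocess
import sys
from typing import List

STAGE1 = "preprocess_ACM_stage1.py"

# Stage 2 script for each input variant, as listed in the commands above.
# The dictionary keys are illustrative labels, not names used by the repository.
STAGE2 = {
    "title_abstract": "preprocess_ACM_stage2_v2.py",
    "citations": "preprocess_ACM_stage2_v4.py",
    "non_citations": "preprocess_ACM_stage2_v5.py",
    "random": "preprocess_ACM_stage2_v6.py",
}


def build_commands(variants: List[str]) -> List[List[str]]:
    """Return the Stage 1 command followed by one Stage 2 command per variant."""
    commands = [[sys.executable, STAGE1]]
    for name in variants:
        # An unknown variant name raises KeyError before anything is run.
        commands.append([sys.executable, STAGE2[name]])
    return commands


def run_all(variants: List[str], cwd: str = "preprocess") -> None:
    """Run the commands inside the preprocess folder, stopping at the first failure."""
    for command in build_commands(variants):
        subprocess.run(command, cwd=cwd, check=True)
```

For example, run_all(["title_abstract", "citations"]) called from the repository root would run Stage 1 and then the two corresponding Stage 2 scripts.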

Summarization

The summarization scripts expect processed_data in the main directory and pacssum_models inside the summarization folder.

Download the pretrained BERT models from https://drive.google.com/file/d/1wbMlLmnbD_0j7Qs8YY8cSCh935WKKdsP/view?usp=sharing and place them in pacssum_models.

cd summarization

# Run TF-IDF summarizer
python run.py --rep tfidf

# Run BERT summarizer
python run.py --rep bert
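
The folder layout this section expects (processed_data in the main directory, pacssum_models inside the summarization folder) can be verified before launching a run. The helper below is an illustrative sketch; missing_inputs is a hypothetical name and is not part of the repository.

```python
from pathlib import Path
from typing import List


def missing_inputs(repo_root: str) -> List[str]:
    """List the expected input folders that do not exist under repo_root."""
    root = Path(repo_root)
    required = [
        root / "processed_data",                   # output of the preprocessing step
        root / "summarization" / "pacssum_models"  # downloaded pretrained models
    ]
    return [p.relative_to(root).as_posix() for p in required if not p.is_dir()]


if __name__ == "__main__":
    # Run from the repository root; prints an empty list when everything is in place.
    print(missing_inputs("."))
```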

Abstractive summarization

cd abstractive_summarization

# Stage1
python abs_sum.py

# Stage2 (run from the directory that contains preprocess_abs_sum.py)
python preprocess_abs_sum.py

Retrieval Augmentation

cd specter

python preprocess_ACM.py

./embed.sh
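
The commands above produce paper embeddings, and retrieval then amounts to finding the papers whose embeddings lie closest to a query paper. The sketch below illustrates that idea only and is not the repository's code: it uses brute-force cosine similarity in plain Python where a vector-search library such as FAISS would normally be used, and the vectors, paper IDs and function names are made up.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy two-dimensional vectors; real document embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "paper_a": [1.0, 0.0],
    "paper_b": [0.9, 0.1],
    "paper_c": [0.0, 1.0],
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(
    query_id: str, embeddings: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k papers most similar to query_id, excluding the query itself."""
    query = embeddings[query_id]
    scored = [
        (pid, cosine(query, vec))
        for pid, vec in embeddings.items()
        if pid != query_id
    ]
    # Highest similarity first; ties are broken by paper ID for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]
```

Calling top_k_similar("paper_a", TOY_EMBEDDINGS, k=1) ranks paper_b first because its vector points in nearly the same direction as paper_a's.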

Train & Test

# Train
python train.py

# Train on limited data
python train.py --limit=100

# Load Checkpoint
python train.py --checkpoint=True

# Train for multiple runs after the initial run(s)
python train.py --times=3 --initial_time=1

# Test (assuming that saved weights are present)
python train.py --test=True
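
When several of these runs are scripted, the command lines can be assembled programmatically. The helper below is a hypothetical convenience, not part of the repository; it only emits the flags shown above (--limit, --checkpoint, --times, --initial_time, --test) and makes no assumption about other options train.py may accept.

```python
import sys
from typing import List, Optional


def train_command(
    limit: Optional[int] = None,
    checkpoint: bool = False,
    times: Optional[int] = None,
    initial_time: Optional[int] = None,
    test: bool = False,
) -> List[str]:
    """Build an argument list for train.py using only the flags listed above."""
    command = [sys.executable, "train.py"]
    if limit is not None:
        command.append("--limit={}".format(limit))
    if checkpoint:
        command.append("--checkpoint=True")
    if times is not None:
        command.append("--times={}".format(times))
    if initial_time is not None:
        command.append("--initial_time={}".format(initial_time))
    if test:
        command.append("--test=True")
    return command
```

The returned list can be passed to subprocess.run(..., check=True), for example train_command(times=3, initial_time=1) for additional runs after an initial one.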

Citation

Please consider citing our paper if you find this work useful:

@inproceedings{garg-etal-2022-keyphrase,
    title = "Keyphrase Generation Beyond the Boundaries of Title and Abstract",
    author = "Garg, Krishna  and
      Ray Chowdhury, Jishnu  and
      Caragea, Cornelia",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.427",
    pages = "5809--5821",
    abstract = "Keyphrase generation aims at generating important phrases (keyphrases) that best describe a given document. In scholarly domains, current approaches have largely used only the title and abstract of the articles to generate keyphrases. In this paper, we comprehensively explore whether the integration of additional information from the full text of a given article or from semantically similar articles can be helpful for a neural keyphrase generation model or not. We discover that adding sentences from the full text, particularly in the form of the extractive summary of the article can significantly improve the generation of both types of keyphrases that are either present or absent from the text. Experimental results with three widely used models for keyphrase generation along with one of the latest transformer models suitable for longer documents, Longformer Encoder-Decoder (LED) validate the observation. We also present a new large-scale scholarly dataset FullTextKP for keyphrase generation. Unlike prior large-scale datasets, FullTextKP includes the full text of the articles along with the title and abstract. We release the source code at https://github.com/kgarg8/FullTextKP.",
}

Credits

PacSum Repo for Summarization

Specter

FAISS

Questions

Please contact [email protected] for any questions related to this work.
