Submitted by Spico197 t3_10gp7rm in MachineLearning
Hi guys, thanks for reading this post. I built a simple paper search tool that integrates the ACL Anthology, the arXiv API, and the DBLP API.
GitHub address: Spico197/paper-hero
Motivation: I'm majoring in NLP and I'd like to search for papers with "Event Extraction" in the title in specific proceedings (e.g. ACL, EMNLP).
Challenge: There are lots of search tools and APIs, but few of them support field-specific search over authors, titles, abstracts, or venues.
Methodology: I integrate the ACL Anthology, the arXiv API, and the DBLP API into a two-stage search toolkit, which first stores candidate papers retrieved via the official fuzzy search API, and then matches specific fields against them.
Advantages: This tool meets my need to collect papers, and it can dump checklists in Markdown format or complete paper information in JSONL. Search queries support AND and OR logic.
Limitations: This tool is based on simple string matching, so you need to know some of the terminology used in the target field.
You're warmly welcome to give it a try, and feel free to open an issue!
from src.interfaces.aclanthology import AclanthologyPaperList
from src.utils import dump_paper_list_to_markdown_checklist

if __name__ == "__main__":
    # use `bash scripts/get_aclanthology.sh` to download and prepare anthology data first
    paper_list = AclanthologyPaperList("cache/aclanthology.json")
    ee_query = {
        "title": [
            # Any of the strings below is matched
            ["information extraction"],
            ["event", "extraction"],  # title must include `event` and `extraction`
            ["event", "argument", "extraction"],
            ["event", "detection"],
            ["event", "classification"],
            ["event", "tracking"],
            ["event", "relation", "extraction"],
        ],
        # Besides the title constraint, venue must also meet the needs
        "venue": [
            ["acl"],
            ["emnlp"],
            ["naacl"],
            ["coling"],
            ["findings"],
            ["tacl"],
            ["cl"],
        ],
    }
    ee_papers = paper_list.search(ee_query)
    dump_paper_list_to_markdown_checklist(ee_papers, "results/ee-paper-list.md")
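To make the query semantics above concrete, here is a minimal sketch of how such an AND/OR query could be evaluated. This is not the actual paper-hero implementation: `matches_field`, `matches_query`, and the dict-based paper records are hypothetical names used only for illustration, and the sketch assumes case-insensitive substring matching. Within a field, the inner lists are OR-ed and the terms inside one inner list are AND-ed; different fields are AND-ed together.

```python
def matches_field(text, groups):
    """Return True if at least one group has all of its terms in `text`.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = text.lower()
    return any(all(term.lower() in lowered for term in group) for group in groups)


def matches_query(paper, query):
    """Return True if every field constraint in `query` is satisfied by `paper`.

    `paper` is a hypothetical dict such as {"title": ..., "venue": ...};
    missing or empty fields are treated as empty strings.
    """
    return all(
        matches_field(paper.get(field) or "", groups)
        for field, groups in query.items()
    )


# A reduced version of the query from the post.
example_query = {
    "title": [["information extraction"], ["event", "extraction"]],
    "venue": [["acl"], ["emnlp"]],
}

# Made-up records, for illustration only.
papers = [
    {"title": "Document-level Event Argument Extraction", "venue": "ACL"},
    {"title": "A Survey of Information Extraction", "venue": "ICML"},
    {"title": "Neural Machine Translation", "venue": "EMNLP"},
]

matched = [p["title"] for p in papers if matches_query(p, example_query)]
print(matched)  # ['Document-level Event Argument Extraction']
```

Note that under plain substring matching, as assumed in this sketch, a short term such as "cl" would also match venues like "ACL" or "NAACL", which is one reason it helps to know the terminology of the target field.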
_Arsenie_Boca_ t1_j5492g7 wrote
I think the idea is great! How long does it take to execute a query on the arXiv set? Have you considered making a Hugging Face Space out of this?