Submitted by Spico197 t3_10gp7rm in MachineLearning

Hi guys, thanks for reading this post. I built a simple paper search tool that integrates the ACL Anthology, the arXiv API, and the DBLP API.

GitHub repository: Spico197/paper-hero

Motivation: I'm majoring in NLP and I'd like to search for papers with "Event Extraction" in their titles in specific proceedings (e.g. ACL, EMNLP).

Challenge: There are lots of search tools and APIs, but few of them support field-specific search over authors, titles, abstracts, and venues.

Methodology: I integrate the ACL Anthology, the arXiv API, and the DBLP API into a two-stage search toolkit. It first stores candidate papers retrieved via the official fuzzy search API, and then matches specific fields against the stored papers.
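
As a rough illustration of the two-stage idea for arXiv, here is a minimal sketch. It is not the code in paper-hero: `build_arxiv_url`, `parse_titles`, and `filter_titles` are hypothetical helper names, only the title field is handled, and the all-fields `all:` query is just one plausible choice for the fuzzy first stage.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Public arXiv API endpoint; responses are Atom feeds.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_url(phrase, max_results=50):
    """Stage 1: build an all-fields query URL for the arXiv API."""
    params = {
        "search_query": f'all:"{phrase}"',
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_titles(atom_xml):
    """Extract entry titles from an Atom feed, collapsing whitespace."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(title.split()))
    return titles


def filter_titles(titles, terms):
    """Stage 2: keep titles containing every term (case-insensitive)."""
    return [t for t in titles if all(term.lower() in t.lower() for term in terms)]
```

In this sketch the feed would be fetched with something like `urllib.request.urlopen(build_arxiv_url("event extraction")).read()`, stored, and then narrowed down locally with `filter_titles(titles, ["event", "extraction"])`.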

Advantages: This tool satisfies my need to stockpile papers, and it can dump checklists in markdown format or complete paper information in jsonl. AND and OR logic is supported in search queries.
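
To give an idea of the two output formats, here is a small sketch. The dictionary keys (`title`, `venue`, `year`) and the helper names are assumptions made for this example, not the fields or functions the tool actually uses.

```python
import json


def to_markdown_checklist(papers):
    """Render papers as unchecked markdown task-list items."""
    lines = [f"- [ ] {p['title']} ({p['venue']}, {p['year']})" for p in papers]
    return "\n".join(lines) + "\n"


def to_jsonl(papers):
    """Render papers as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(p, ensure_ascii=False) + "\n" for p in papers)
```

The checklist is handy for ticking off papers as you read them, while the jsonl form keeps every field for later processing.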

Limitations: This tool is based on simple string matching, so you have to know some of the terminology used in the target field.
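
The matching rules could be sketched like this, assuming case-insensitive substring matching and a query that maps each field to a list of term groups. `matches_field` and `matches_query` are hypothetical names rather than functions from the repository. Within a field, any group may match (OR) and all terms inside a group must appear (AND); every field in the query must be satisfied.

```python
def matches_field(text, groups):
    """True if any group has all of its terms in text (OR of ANDs)."""
    lowered = text.lower()
    return any(all(term.lower() in lowered for term in group) for group in groups)


def matches_query(paper, query):
    """True if every queried field of the paper satisfies its term groups."""
    return all(
        matches_field(paper.get(field, ""), groups)
        for field, groups in query.items()
    )


# Illustrative data only; the field names are assumptions.
query = {
    "title": [["event", "extraction"], ["event", "detection"]],
    "venue": [["acl"], ["emnlp"]],
}
paper = {"title": "Joint Event Extraction via RNNs", "venue": "NAACL"}
```

Note that with plain substring matching, `matches_query(paper, query)` is true here even though the venue is NAACL, because "naacl" contains "acl". That is the kind of side effect the limitation above implies.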

You're warmly welcome to give it a try, and feel free to open an issue!

from src.interfaces.aclanthology import AclanthologyPaperList
from src.utils import dump_paper_list_to_markdown_checklist

if __name__ == "__main__":
    # use `bash scripts/get_aclanthology.sh` to download and prepare anthology data first
    paper_list = AclanthologyPaperList("cache/aclanthology.json")
    ee_query = {
        "title": [
            # A title matches if it satisfies any one of the lists below (OR)
            ["information extraction"],
            ["event", "extraction"],    # title must include `event` and `extraction`
            ["event", "argument", "extraction"],
            ["event", "detection"],
            ["event", "classification"],
            ["event", "tracking"],
            ["event", "relation", "extraction"],
        ],
        # Besides the title constraint, the venue must also match one of the lists below
        "venue": [
            ["acl"],
            ["emnlp"],
            ["naacl"],
            ["coling"],
            ["findings"],
            ["tacl"],
            ["cl"],
        ],
    }
    ee_papers = paper_list.search(ee_query)
    dump_paper_list_to_markdown_checklist(ee_papers, "results/ee-paper-list.md")

markdown checklist

Comments

_Arsenie_Boca_ t1_j5492g7 wrote

I think the idea is great! How long does it take to execute a query on the arXiv set? Have you considered making a Hugging Face Space out of this?

Spico197 OP t1_j54pdgj wrote

Thanks very much for your reply.

I haven't measured the query time. This tool doesn't download the whole arXiv dataset; it just calls the official API, so the time depends on the network connection. It shouldn't take long to execute a query, though.

Yes, absolutely! There are some other things to do before making an online demo, e.g. merging the current two-stage search into one step. I'm working on it. Thanks again for the advice!
