Legal datasets are extremely expensive because lawyers are, and this has bottlenecked legal NLP.

To address this, we release the Merger Agreement Understanding Dataset (MAUD), with over 39,000 multiple-choice reading comprehension examples for 152 merger agreements that have been manually labeled by legal experts. The dataset was created with the help of the American Bar Association; without their help, the dataset would have cost over $5,000,000 to create.

Model performance on MAUD has substantial room for improvement, so it can serve as a research challenge for NLP researchers without any legal background.

Dataset and Baselines: https://github.com/TheAtticusProject/maud/

Paper: https://arxiv.org/abs/2301.00876

Comments

lebeaudiable t1_j2yhty4 wrote on January 4, 2023 at 8:53 PM

Attorney here just getting into NLP. What should I be doing to take advantage of this intersection? I am going to use this dataset and explore more.

moopling t1_j2zfgcj wrote on January 5, 2023 at 12:28 AM

What do I know, but I’d suggest the best thing you can bring to the table is identifying worthwhile problems in law which are solvable with AI.

Too often we have ML people picking problems amenable to ML algorithms but which ultimately don’t create a ton of value, or domain experts picking important problems which are unsolvable with current techniques.

Athomas1 t1_j2yl315 wrote on January 4, 2023 at 9:13 PM

What kind of law do you practice?

lebeaudiable t1_j2ze2vm wrote on January 5, 2023 at 12:19 AM

Gov. Lit. I’m trying to make the transition to in-house for a corp. while continuing to build my skills and GitHub as a full-stack dev (JS/Python/SQL). The goal is GC and C-Suite for an F500 and leverage my legal, finance, and dev skills into some weird hybrid in the future.

Dry-Sweet-3008 t1_j327kry wrote on January 5, 2023 at 3:38 PM

Computational Linguist here, currently getting a PhD in NLP. If you want to get into that area, full-stack development isn't going to help (although it's a cool thing to do on its own if you're interested). Web development and Data Science (ML/DL etc.) are very different things. Also, while SQL is helpful in a lot of ML projects, natural language data is unstructured and is not to be stored in SQL databases. Instead, I'd suggest learning the fundamentals of Machine Learning first. Once you're there, you can start specializing in NLP topics. As a lawyer, your strength will probably be understanding the methods enough so you can assess whether or not a certain problem can be solved with NLP. Hope this helps!

lebeaudiable t1_j32cx1s wrote on January 5, 2023 at 4:12 PM

Thank you for the advice. I fell into development during the pandemic and it’s been my new area of interest ever since. Being a full-stack dev is just a personal goal of mine. I am planning on learning more about ML in general after I finish reading/following along with the NLTK book, and I will likely take a course. Do you recommend any SPECIFIC materials? I know what’s commonly recommended via wiki and search, but I’m curious to know what you’re using and reading in your program or what you’d recommend in general, personally.

StackOwOFlow t1_j33kwse wrote on January 5, 2023 at 8:34 PM

As a domain expert, you’d probably want to focus specifically on feature engineering if you’re looking to continue training the existing model or new models. A lot of it comes down to asking good questions and hypothesis testing informed by knowledge of the law that you already have.

Figuring out how to use those models in real-world applications employs a different skillset, however, and that sounds more like what your original question is asking about. You’d probably get a better sense of this through examples of applications that intro to ML courses reference and surveying ML-driven applications in various industries. Here's a good hands-on resource: https://machinelearningmastery.com/start-here

IndustryNext7456 t1_j3hto6k wrote on January 8, 2023 at 6:05 PM

EE here, 25 years in NLP. Working in Prolog, Datalog, Logica for formal verification. Using NLP to extract facts for verification.

StackOwOFlow t1_j2zk4f1 wrote on January 5, 2023 at 1:00 AM

create a lexis-nexis competitor 🤭

EightEqualsEqualsDe t1_j30nzfn wrote on January 5, 2023 at 6:10 AM

The subject matter a company such as Everlaw covers might be of interest

Effective-Victory906 t1_j3pwkx8 wrote on January 10, 2023 at 6:31 AM

Can you contribute datasets?

That would help so many!

cheddacheese148 t1_j2zwbgs wrote on January 5, 2023 at 2:24 AM

I’m not sure if I missed it while skimming the repo and paper but do you have a license on this?

Edit: and I did miss it…in section A.1 they state that it’s under CC-BY-4.0.

stevevaius t1_j30hdld wrote on January 5, 2023 at 5:08 AM

I'm not an expert in law, but does this dataset have any value for British law?

levkin76 t1_j340swm wrote on January 5, 2023 at 10:09 PM

It is a common law dataset, but it is built on American M&A concepts rather than UK ones.

CatalyzeX_code_bot t1_j3yz104 wrote on January 12, 2023 at 12:50 AM

Found relevant code at https://github.com/TheAtticusProject/maud + all code implementations here

To opt out from receiving code links, DM me
