Submitted by Sea-Connection462 t3_103b1ck in MachineLearning

Legal datasets are extremely expensive because lawyers are, and this has bottlenecked legal NLP.

To address this, we release the Merger Agreement Understanding Dataset (MAUD), with over 39,000 multiple-choice reading comprehension examples for 152 merger agreements that have been manually labeled by legal experts. The dataset was created with the help of the American Bar Association; without their help the dataset would have cost over $5,000,000 to create.

MAUD has substantial room for improvement and can serve as a research challenge for NLP researchers without any legal background.
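To make the task format concrete, here is a minimal sketch of how a multiple-choice reading comprehension example and a trivial majority-answer baseline might be represented. The `Example` fields, the toy clauses, and the helper names are all hypothetical and invented for illustration; they are not MAUD's actual schema or the paper's baselines, so check the repository for the real file layout.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """Hypothetical container for one multiple-choice item (not MAUD's schema)."""
    clause: str         # excerpt of contract text to read
    question: str       # question asked about the excerpt
    options: List[str]  # candidate answers
    answer: int         # index of the correct option


def majority_answer(train: List[Example]) -> int:
    """Return the answer index that occurs most often in the training items."""
    counts = Counter(ex.answer for ex in train)
    return counts.most_common(1)[0][0]


def accuracy(examples: List[Example], predict: Callable[[Example], int]) -> float:
    """Fraction of examples where predict(ex) equals the labeled answer."""
    correct = sum(1 for ex in examples if predict(ex) == ex.answer)
    return correct / len(examples)


# Toy items invented for illustration only.
QUESTION = "Is there a termination fee?"
OPTIONS = ["Yes", "No"]
TRAIN = [
    Example("The Company shall pay Parent a termination fee.", QUESTION, OPTIONS, 0),
    Example("Parent shall pay the Company a reverse termination fee.", QUESTION, OPTIONS, 0),
    Example("Neither party owes any fee upon termination.", QUESTION, OPTIONS, 1),
]
HELD_OUT = [
    Example("A termination fee is payable by the Company.", QUESTION, OPTIONS, 0),
    Example("No fee is payable upon termination.", QUESTION, OPTIONS, 1),
]

guess = majority_answer(TRAIN)
baseline_accuracy = accuracy(HELD_OUT, lambda ex: guess)
print(f"majority answer index: {guess}, held-out accuracy: {baseline_accuracy:.2f}")
```

A baseline like this only sets a floor: any model worth reporting should beat always guessing the most common answer.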

Dataset and Baselines: https://github.com/TheAtticusProject/maud/

Paper: https://arxiv.org/abs/2301.00876

306

Comments


lebeaudiable t1_j2yhty4 wrote

Attorney here just getting into NLP. What should I be doing to take advantage of this intersection? I am going to use this dataset and explore more.

21

moopling t1_j2zfgcj wrote

What do I know, but I’d suggest the best thing you can bring to the table is identifying worthwhile problems in law which are solvable with AI.

Too often we have ML people picking problems amenable to ML algorithms but which ultimately don’t create a ton of value, or domain experts picking important problems which are unsolvable with current techniques.

13

Athomas1 t1_j2yl315 wrote

What kind of law do you practice?

7

lebeaudiable t1_j2ze2vm wrote

Gov. Lit. I’m trying to make the transition to in-house for a corp. while continuing to build my skills and GitHub as a full-stack dev (JS/Python/SQL). The goal is GC and C-Suite for an F500 and leverage my legal, finance, and dev skills into some weird hybrid in the future.

9

Dry-Sweet-3008 t1_j327kry wrote

Computational Linguist here, currently getting a PhD in NLP. If you want to get into that area, full-stack development isn't going to help (although it's a cool thing to do on its own if you're interested). Web development and Data Science (ML/DL etc.) are very different things. Also, while SQL is helpful in a lot of ML projects, natural language data is unstructured and is not to be stored in SQL databases. Instead, I'd suggest learning the fundamentals of Machine Learning first. Once you're there, you can start specializing in NLP topics. As a lawyer, your strength will probably be understanding the methods well enough to assess whether or not a certain problem can be solved with NLP. Hope this helps!

3

lebeaudiable t1_j32cx1s wrote

Thank you for the advice. I fell into development during the pandemic and it’s been my new area of interest ever since. Being a full-stack dev is just a personal goal of mine. I am planning on learning more about ML in general after I finish reading/following along with the NLTK book, and I will likely take a course. Do you recommend any SPECIFIC materials? I know what’s commonly recommended via wiki and search, but I’m curious to know what you’re using and reading in your program or what you’d recommend in general, personally.

2

StackOwOFlow t1_j33kwse wrote

As a domain expert, you’d probably want to focus specifically on feature engineering if you’re looking to continue training the existing model or new models. A lot of it comes down to asking good questions and hypothesis testing informed by knowledge of the law that you already have.
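As a rough illustration of domain-informed feature engineering, the sketch below turns a contract clause into a few hand-written features. The cue phrases, feature names, and the `extract_features` helper are assumptions made up for this example, not anything taken from MAUD or its baselines; a practicing attorney would choose far better cues.

```python
import re
from typing import Dict, Union

# Hypothetical cue phrases a lawyer might consider informative.
CUES = {
    "has_materiality_qualifier": r"\bmaterial(ly)?\b",
    "has_knowledge_qualifier": r"\bto the knowledge of\b",
    "has_carve_out": r"\b(except|other than|provided, however)\b",
}


def extract_features(clause: str) -> Dict[str, Union[bool, int]]:
    """Map a clause to boolean cue-phrase flags plus a whitespace token count."""
    text = clause.lower()
    feats: Dict[str, Union[bool, int]] = {
        name: bool(re.search(pattern, text)) for name, pattern in CUES.items()
    }
    feats["n_tokens"] = len(text.split())
    return feats


# Clause text invented for illustration only.
sample = ("To the knowledge of the Company, there is no material "
          "litigation pending, except as disclosed.")
sample_features = extract_features(sample)
print(sample_features)
```

Each flag encodes a hypothesis ("clauses with a knowledge qualifier tend to be answered differently") that can then be tested against the labels.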

Figuring out how to use those models in real-world applications employs a different skillset, however, and that sounds more like what your original question is asking about. You’d probably get a better sense of this through examples of applications that intro to ML courses reference and surveying ML-driven applications in various industries. Here's a good hands-on resource: https://machinelearningmastery.com/start-here

1

IndustryNext7456 t1_j3hto6k wrote

EE here, 25 years in NLP. Working in Prolog, Datalog, Logica for formal verification. Using NLP to extract facts for verification.

2

cheddacheese148 t1_j2zwbgs wrote

I’m not sure if I missed it while skimming the repo and paper but do you have a license on this?

Edit: and I did miss it…in section A.1 they state that it’s under CC-BY-4.0.

6

stevevaius t1_j30hdld wrote

Not an expert in law, but does this dataset have any value for British law?

3

levkin76 t1_j340swm wrote

It is a common law dataset, but it is based on American M&A concepts rather than UK ones.

3