Submitted by 8hubham t3_znkeh2 in MachineLearning

I am planning to take up an interesting NLP project, but due to my limited exposure to NLP I am stuck at the moment. I want to extract 'goal' statements from lengthy reports. For example, goals might be "We would be reducing our carbon footprint by 50% by 2025" or "Our company aims to increase the diversity in the workforce in upcoming months." See the image below for example text with the goals highlighted.

How can I go about the process of goal extraction? I would like some pointers on possible NLP approaches I can start with.

Note that I do not have an annotated dataset with extracted goals.

https://preview.redd.it/z6houyh7ra6a1.png?width=970&format=png&auto=webp&s=b3f6032bf14bff0932a6eee44444f86e5b82c67b

6

Comments


KlutzyLeadership3652 t1_j0hn1ba wrote

You can look up 'extractive text summarization', or, if you're after to-the-point keyphrases within the paragraphs, 'keyphrase extraction'. See how off-the-shelf models work on your examples.
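Before reaching for an off-the-shelf model, the extractive idea can be sketched in a few lines of plain Python: score each sentence by the document-level frequency of its content words and keep the top-scoring ones. The stopword list and the example report text here are illustrative assumptions, not part of any library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "we", "our", "by", "to", "in", "of", "and", "is", "be"}

def top_sentences(text, k=2):
    """Naive extractive summary: score each sentence by the average
    document frequency of its content words, return the k best."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    return sorted(sentences, key=score, reverse=True)[:k]

# Hypothetical report text for demonstration.
report = ("Our company had a strong quarter. We would be reducing our carbon "
          "footprint by 50% by 2025. Revenue grew modestly. Our company aims to "
          "increase diversity in the workforce in upcoming months.")
candidates = top_sentences(report, k=2)
```

This won't distinguish goals from other prominent sentences by itself, but it is a cheap way to shortlist candidates before a more targeted filter.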

6

AlexMourne t1_j0hmd33 wrote

I would recommend looking at NER (though I don't think it will give you good results, since your entities are pretty vague), and maybe trying a classifier on every separate sentence, since it seems like the context is not very important here.

4

8hubham OP t1_j0honkl wrote

Thank you for the suggestion.

If I understand correctly, you suggest that after NER processing, I use a classifier to identify 'goal' statements.

Note that I do not have an annotated dataset for training a classifier, so I don't think I can use this approach.

Please correct me if I am wrong.

3

AlexMourne t1_j0hvlbm wrote

Oh, yes, you are right of course, my fault. I don't actually know what you can do there without a dataset, then. Maybe try clustering and see whether one of the clusters looks like what you need?
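The clustering idea can be tried without any ML libraries: represent each sentence as a bag of words and greedily group sentences whose cosine similarity to a cluster's first member exceeds a threshold. This is a minimal sketch with made-up sentences and a hand-picked threshold, not a tuned method.

```python
import re
from collections import Counter
from math import sqrt

def bow(sentence):
    """Bag-of-words vector for one sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(sentences, threshold=0.3):
    """Assign each sentence to the first cluster whose seed sentence
    is similar enough, else start a new cluster."""
    clusters = []  # each cluster: list of (sentence, bow) pairs
    for s in sentences:
        v = bow(s)
        for c in clusters:
            if cosine(v, c[0][1]) >= threshold:
                c.append((s, v))
                break
        else:
            clusters.append([(s, v)])
    return [[s for s, _ in c] for c in clusters]

# Hypothetical sentences: two goal-like, one not.
sents = [
    "We aim to cut emissions by 50% by 2025.",
    "We aim to increase workforce diversity next year.",
    "Quarterly revenue was 4.2 million dollars.",
]
clusters = greedy_cluster(sents, threshold=0.25)
```

With these inputs the two "We aim to..." sentences land in one cluster and the revenue sentence in another; on real reports you would then inspect clusters by hand to see if one collects goal statements.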

2

NovelspaceOnly t1_j0hxa4a wrote

Potentially you could use coreference resolution and NER to extract them.

2

z0nar t1_j0ic382 wrote

You could use Watchful to do a first pass at identifying likely key phrases, even enriching with an off-the-shelf model to help identify these key spans. You could even do the full annotation of your original corpus.

Let me know if you are a student or researcher - I can set you up with an environment to play with if you are interested, or maybe even a local license.

Disclosure - I am a co-founder @ Watchful.

2

8hubham OP t1_j0iemey wrote

Thank you for the offer.

Yes, I am a student.

I will check out watchful.io and let you know.

1

NinoIvanov t1_j0korab wrote

Classically, you would use some form of "template": in the simplest form, a sort of "anchor word" around which, within a certain radius, other (pre-defined) words are sought. If a "match" is found, a goal is recognized. The difficulty, evidently, is how to drive down false positives and false negatives and how to "estimate" good templates. The advantage, however, is full explainability: "WHY was that goal suggested?" is exactly traceable. The templates can get arbitrarily involved, e.g. with probabilities, conditional probabilities, dependencies between words and goals, etc.
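The anchor-word-plus-radius idea above can be sketched directly; the particular templates here (anchors "reduce" and "aims" with their support words and radii) are made-up examples, and the payoff is exactly the traceability described: each hit records which anchor fired.

```python
import re

# Hypothetical templates: an anchor word plus support words that must
# occur within `radius` tokens of the anchor for a goal to be recognized.
TEMPLATES = [
    {"anchor": "reduce", "support": {"by"}, "radius": 6},
    {"anchor": "aims", "support": {"increase"}, "radius": 6},
]

def matches_template(sentence, template):
    tokens = re.findall(r"[a-z%]+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok.startswith(template["anchor"]):
            lo = max(0, i - template["radius"])
            window = set(tokens[lo: i + template["radius"] + 1])
            if template["support"] <= window:
                return True
    return False

def find_goals(sentences):
    """Return (sentence, anchor) pairs, so WHY each goal was
    suggested is exactly traceable to the template that fired."""
    hits = []
    for s in sentences:
        for t in TEMPLATES:
            if matches_template(s, t):
                hits.append((s, t["anchor"]))
    return hits
```

Tuning then amounts to adding, removing, or weighting templates while watching precision and recall on a held-out sample.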

"With machine learning" you could give it a set of "labelled texts", as in, "this text is about this, that text is about that", and you could have the system reduce the matching words (in the simplest form: simply as a set of words in no particular order) progressively until the ability to recognize a goal given a small "bag of words" has been optimized. You can e.g. use for that random forests, or whatever else you like. Disadvantage: EXPLAINING the goals will be way harder. — EDIT: for this approach, you do need an annotated data set, for the above one — not, there instead you need the templates'

2

8hubham OP t1_j0lpb55 wrote

Thank you for the suggestions.

I would like to learn more about the first approach. Can you share any links/articles explaining it?

2

NinoIvanov t1_j0nl601 wrote

A brief intro using regular expressions, giving you the general idea:

https://www.nzini.com/lessons/NLP2+-+Template+Matching.html

Also, classically, look up the "Message Understanding Conferences", and "Information Extraction" & "Named Entity Recognition" as tasks.

It gets really tricky if the information is "implied": John bought flowers for Lucy -> "Does John like Lucy?" Evidently yes, but nobody SAYS that. Good luck! 😊
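In the spirit of the regex tutorial linked above, a handful of hand-written patterns already captures the example goals from the original post; the specific verbs and shapes below are assumptions to extend for a real corpus.

```python
import re

# Hand-written goal patterns (illustrative; grow this list per corpus).
GOAL_PATTERNS = [
    # "aims to", "plans to", "intends to" ...
    re.compile(r"\b(?:aims?|plans?|intends?)\s+to\b", re.IGNORECASE),
    # "reducing ... by 50%", "increase ... by 30%"
    re.compile(r"\b(?:reduc|increas|improv)\w*\b[^.]*\bby\s+\d+\s*%",
               re.IGNORECASE),
    # "we would be ...", "we will ..."
    re.compile(r"\bwe\s+(?:would\s+be|will)\s+\w+", re.IGNORECASE),
]

def extract_goals(text):
    """Return sentences matching at least one goal pattern."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if any(p.search(s) for p in GOAL_PATTERNS)]
```

As with any template approach, the implied cases (and paraphrases that dodge every pattern) are exactly where this breaks down.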

1

pythoslabs t1_j0pqrsp wrote

Custom NER is the way to go. I believe you will have to run a custom annotation pipeline defining your custom entity types. In your case, fine-tune a model on spans labelled 'Goal' across a few documents. (If you have more than one entity type, add a span categorizer to your pipeline: https://spacy.io/api/spancategorizer)

Check out "training custom NER in spacy" on youtube - you should get plenty of detailed videos.

And if you want to go a step further and extract a cause-and-effect relationship (this is out of scope for your project, but for the benefit of any future reader coming here), in case you have a relation like "Goal" - "Action", you can use the following two methods:

  1. spaCy has a model for this (you can create your own entity relation extractor on top of it). Check out this video: https://www.youtube.com/watch?v=8HL-Ap5_Axo
  2. Kindred is a project aimed specifically at biomedical text, e.g. in case there is a cause-effect relationship. Check it out here: https://spacy.io/universe/project/kindred

DM me in case you need any further points.

2

Hopeful-Yam-1718 t1_j0tb08s wrote

Have you given ChatGPT a chance? It is a game changer and probably the next disruptive technology available to us. Even in its infancy, what it can do is quite amazing. It will be the technology that ignites so very much more, and quickly. I have 30+ years as a mercenary IT consultant, so I've got a little bit of insight. Prompt it with something like: "extract all of the goals in the following text", then paste your data set and hit enter. When I first looked at it, there were options and configurations; now it just gives you a text box, but that still might work. I need to find the site that let you configure it.

So far I've only used it to write content, with queries such as "write me a paragraph about looking for sponsors for a charity music festival on Memorial Day weekend." It cranked out a beautifully written paragraph with plenty of extra material about how festivals are fun and bring the community together, and so on.

However, it can also write code, and this is where it gets interesting, and a bit scary. I have said for decades that the minute we get AI or a computer that can write code, it is game over. Why? Because it can write exponentially better and better code, which could mean improving itself. I'm not saying it would choose to improve itself; operators would basically need to tell it to. But if the operators told it to keep improving itself in some aspect: no, not sentient, but an exponential explosion of technology. And it doesn't have to be relegated to just writing code.

2

Rei1003 t1_j0hn6qv wrote

Maybe you can fine-tune T5 to generate the goal statement, since this sounds like summarization.

1

ZestyData t1_j0jbs10 wrote

If you can label a dataset of sentences where the goal is highlighted, you can just train a classifier.

1

Repulsive_Tart3669 t1_j0jlaa6 wrote

Back in 2012 we were experimenting with an engineering-based approach to extracting relations and events from texts. Examples of events are company announcements, mergers and acquisitions, management position changes, customer complaints about products, etc. Our NLP pipeline included two major steps: named entity recognizers, and a rule-based engine over a graph of annotations. The former extracts various types of entities - names of companies and people, geographical locations, temporal expressions - plus a dictionary-based extractor for anchor verbs (e.g., acquire, purchase, announce, step down). The latter tries to match tokens and named entities into high-level concepts using a regular-expression-type syntax, e.g., 'annotate[COMPANY_ANNOUNCEMENT] if match[COMPANY ANNOUNCEMENT_VERB]'. Then, if I recall correctly, we switched to rules over the dependency structure of sentences (something like subject - verb - object); at slightly lower precision this gave much better recall. But this was 10 years ago, and a lot has changed since then.
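A toy version of that two-stage pipeline fits in a few lines: dictionary-based annotators run first, then a rule fires when the right annotation types co-occur. The company names and verb dictionary below are invented stand-ins for the real gazetteers such a system would use.

```python
import re

# Stage 1 resources: dictionary-based annotators (illustrative entries).
COMPANIES = {"acme corp", "globex"}
ANNOUNCEMENT_VERBS = {"acquire", "acquired", "announce", "announced",
                      "purchase", "purchased"}

def annotate(sentence):
    """Stage 1: emit (type, surface form) annotations for one sentence."""
    low = sentence.lower()
    anns = []
    for name in COMPANIES:
        if name in low:
            anns.append(("COMPANY", name))
    for tok in re.findall(r"[a-z]+", low):
        if tok in ANNOUNCEMENT_VERBS:
            anns.append(("ANNOUNCEMENT_VERB", tok))
    return anns

def company_announcement(sentence):
    """Stage 2 rule: annotate COMPANY_ANNOUNCEMENT if a COMPANY and an
    ANNOUNCEMENT_VERB both occur in the sentence."""
    kinds = {kind for kind, _ in annotate(sentence)}
    return {"COMPANY", "ANNOUNCEMENT_VERB"} <= kinds
```

The dependency-structure variant mentioned above would replace the co-occurrence check with a subject-verb-object constraint, trading a little precision for recall, as described.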

1