dancingnightly t1_je0o082 wrote

The benefit of finetuning or training your own text model (in the olden days on BERT, now through the OpenAI API) over just using contextual semantic search is shrinking day by day... especially with the extended context window of GPT-4.

If you want something in-house, finetuning GPT-J or the like could be the way to go, but it's definitely not the career direction I'd take.


dancingnightly t1_jadj7fa wrote

Edit: Seems like for this one, yes. They do consider human instructions (similar in goal to RLHF, which requires more RAM) by adding them directly to the text dataset, as mentioned in section 3.3, Language-Only Instruction Tuning.

For other models, like the upcoming OpenAssistant, one thing to note is that, although the generative model itself may be runnable locally, the reward model (the bit that "adds finishing touches" and ensures instructions are followed) can be much bigger. Even if the underlying GPT-J model is 6B params and 11GB in RAM, the RLHF reward model could seriously increase that.

This model is in the realm of the smaller T5, BART and GPT-2 models released three years ago, which were runnable even then on decent gaming GPUs.
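
Back-of-envelope check on that RAM figure, assuming fp16 weights and ignoring activations, KV cache and optimizer state:

```python
# Rough memory arithmetic for GPT-J-6B (fp16 assumption; reward model extra):
params = 6e9               # GPT-J-6B parameter count
bytes_per_param = 2        # fp16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~12 GB, near the 11GB quoted
```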


dancingnightly t1_j8y81v9 wrote

Do you know of any similar encoding where you vectorise relative time, as multiple proportions of completeness, if that makes sense?

​

Say, completeness within a paragraph, within a chapter, within a book? (Besides sinusoidal embeddings, which push up the number of examples you need.)
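
Something like this toy sketch is what I have in mind (all names and offsets are illustrative):

```python
# Hypothetical encoding of relative time as proportions of completeness
# at several granularities, given token offsets into the text:
def completeness_features(token_idx, para_start, para_end,
                          chap_start, chap_end, book_len):
    return [
        (token_idx - para_start) / max(para_end - para_start, 1),  # paragraph
        (token_idx - chap_start) / max(chap_end - chap_start, 1),  # chapter
        token_idx / max(book_len, 1),                              # book
    ]

print(completeness_features(1200, 1150, 1300, 1000, 3000, 90000))
```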


dancingnightly t1_j8y7fny wrote

"If you look at the internals, it's a nightmare. A literal nightmare."

Yes, the copy-paste button is heavily rinsed at HF HQ.

But you wouldn't believe how much easier they made it to run, tokenize and train models in 2018-19, and, at that, to train compatible models.

We probably owe a month of NLP progress just to them coming in with those one-liners and sensible argument API surfaces.
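
For anyone who hasn't seen it, this is the kind of one-liner I mean (standard `transformers` usage; it downloads a default model on first run):

```python
from transformers import pipeline

# One line to get a working classifier, tokenizer and model included:
classifier = pipeline("sentiment-analysis")
print(classifier("HuggingFace made this a one-liner."))
```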

​

Now, yes, it's getting crazy, but if there's a new paradigm, a new complex way to code, then a similar library will simplify it, and we'll mostly jump there except for legacy. It'll become like scikit-learn (which still holds up for most real ML tasks): lots of fine-grained detail and a slightly questionable number of edge cases (looking at the clustering algorithms in particular), but as easy as pie to keep going with.

​

I personally couldn't ask for more. I was worried they were going to push auto-switching models into their API at some point, but they've been brilliant. There are bugs, but I've never seen them in inference (besides your classic CUDA OOM), and, like Fit_Schedule5951 says, it's all about that with HF.


dancingnightly t1_j8g0oqx wrote

Hold on, Jurassic-X has been here since April 2022, I believe, with something fairly similar:

https://arxiv.org/pdf/2204.10019.pdf

https://www.ai21.com/blog/jurassic-x-crossing-the-neuro-symbolic-chasm-with-the-mrkl-system

It didn't learn new tools, I think, but it did work well for calculations and wiki search.


dancingnightly t1_j7s355b wrote

In a sense, you can communicate between semantic text embeddings and LM models through this method (it would operate differently to multi-modal embeddings): https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight

This method, which is really only practical for toy problems right now, would allow you to use semantic embeddings to find what to look for when doing SVD on an (autoregressive) LM. You could condition this on the input, for example by transforming your embedding into the keys with which to apply the abduction in that process, thereby impacting the generation of logits. I'm not sure this would behave much differently to altering the logit_bias of tokens, but it would be interesting to hear if it did.
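
To make that concrete, here's a toy sketch of just the SVD side, under my own assumptions (which layer/matrix to decompose is a free choice here, and this isn't the full method from the post):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

with torch.no_grad():
    # SVD of one attention output-projection matrix (768 x 768):
    W = model.transformer.h[0].attn.c_proj.weight
    U, S, Vh = torch.linalg.svd(W)
    # Read off which tokens the top singular directions write towards,
    # by mapping them through the unembedding:
    logits = model.lm_head(Vh[:5] * S[:5, None])   # (5, vocab_size)
    for row in logits.topk(5, dim=-1).indices:
        print(tok.convert_ids_to_tokens(row.tolist()))
```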


dancingnightly t1_j76uuee wrote

In this goal, you may find Mixture of Experts architectures interesting.

I like your idea. I have always thought that in ML we are trying to replicate one human on one task with the world's data for that task, or, more recently, one human on many tasks.

But older ideas, replicating societies and communication for one or many tasks, could be equally or more effective, and this heads in that direction. There is a library called GeNN which is pretty useful for these experiments, although it's a little slow due to its deliberately true-to-biology design.


dancingnightly t1_j76t0gh wrote

In theory, training T5 alongside the image embedding models they use (primarily DETR?) shouldn't take much more than a 3090 or a Colab Pro GPU. You could train T5s even on high-end consumer GPUs in 2020, for example, but the DETR image model probably needs to be run for each image at the same time, which might take up quite a bit of GPU together. The `main.py` script looks like a nice and fairly short typical training script that you could quickly run if you download their repo, pull the ScienceQA dataset and pass in the training args to see if it crashes.
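
As a rough sketch of the kind of single training step involved (hyperparameters and the example pair are placeholders; their `main.py` handles the real ScienceQA formatting and the DETR features):

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is a placeholder

# Placeholder input/target pair in T5's text-to-text format:
inputs = tok(["question: ... context: ..."], return_tensors="pt").to("cuda")
labels = tok(["the answer"], return_tensors="pt").input_ids.cuda()

loss = model(**inputs, labels=labels).loss
loss.backward()
opt.step()
opt.zero_grad()
```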


dancingnightly t1_j6oaxeo wrote

This is commercial, not research, but: a lot of scenarios where explainable AI is needed use simple statistical solutions.

​

For example, a company I knew had to identify people in poverty in order to distribute a large ($M) grant fund to people in need, and they had only basic data on some relatively unrelated things, like how often these people travelled, their age, etc.

​

In order to create an explainable model whose factors can be understood by higher-ups and easily checked for bias, they used a k-means approach with just 3 factors.

​

It captured close to as much information as deep learning, but with more robustness to data drift, and with clear graphs segmenting the target group from the general group. It also reduced the use of data, being pro-privacy.
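
A minimal sketch of that kind of model (the three factors are hypothetical stand-ins for the real features, and the data here is random):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three hypothetical factors per person, e.g. travel frequency, age, activity:
X = rng.normal(size=(1000, 3))

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))

# Each centroid is directly interpretable in the original three factors,
# which is what makes the bias review by higher-ups feasible:
print(km.cluster_centers_)
```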

​

This 30-line solution, with a dozen explanatory EDA output graphs, probably got sold for >500k in fees... but they did make the right choices in this circumstance. They saved themselves a complex ML model, bias/security/privacy/deployment hell, and left a maintainable solution.

​

Now, for research, it's interesting from the perspective of applied AI (which is arguably still dominated by GOFAI/simple statistics) and of communicating about AI with the public, although I wouldn't say it's in vogue.


dancingnightly t1_j5v5zwe wrote

The internet isn't accessed live by most of these models, as others have said.

You can finetune language models, but you don't really add knowledge to them; you bias them to output words in an order more like your sample data. Fine-tuning like this won't add facts as such.

One approach you can take, though, is semantic search through your notes for a given topic/search query: you collect the relevant notes whose meanings are similar to your topic/search query, then populate a prompt with that text. The answer will then use that information and any facts in it, if the model is big enough and RLHF-tuned (like the ChatGPT/Instruct/text-00x models from OpenAI).
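
A minimal sketch of that retrieve-then-prompt loop (the embedding model name is just a common default; the notes and query are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
notes = ["note about topic A ...", "note about topic B ...", "note about topic C ..."]
note_vecs = model.encode(notes, normalize_embeddings=True)

query = "what did I write about topic B?"
q = model.encode([query], normalize_embeddings=True)[0]
top = np.argsort(note_vecs @ q)[::-1][:2]   # cosine similarity (unit vectors)

# Populate the prompt with the retrieved notes:
prompt = ("Answer using only these notes:\n"
          + "\n".join(notes[i] for i in top)
          + f"\n\nQuestion: {query}")
print(prompt)
```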

An open source module for this is GPTIndex; I also work on a commercial solution which encompasses videos etc. too and has some optimisations. You can also add data/facts from the internet to the prompt (context) at generation time; see approaches like WebGPT.


dancingnightly t1_j5c31u6 wrote

Oh OK, thank you for taking the time to explain. I see that this graph approach isn't for extending beyond the existing context of RoBERTa/similar transformer models, but rather for enhancing performance.

I was hoping graphs could capture relational information (in a way compatible with transformer embeddings) between far-apart parts of a document, essentially (like: for each doc.ents, connect them in a fully connected graph). It sounds like this dynamic graph size/structure per document input wouldn't work with the transformer embeddings for now, though.


dancingnightly t1_j58anv8 wrote

That's a great resource, thanks. I have studied how this kind of autoregressive model works and found attention fascinating, but here it's the graph-embedded entities you brought up that sound exciting. I have only skim-read your paper for now, so perhaps I've made a mistake, but what I mean is:

For graph embeddings, could you dynamically capture different entities/tokens over a much broader context than common sense reasoning statements and questions? i.e. do entailment on a whole chapter (or a knowledge base entry with 50 triplets), where the graph embeddings meaningfully represent many entities (perhaps with sine positional embeddings for each additional text mention, in addition to the graph, just like for attention)?

[Why I'm interested: I presume it's impractical to scale this approach up in context, similar to autoregressive models, since a fully connected graph scales quadratically, but I'd love to know your thoughts: can a graph be strategically connected, etc.?]


dancingnightly t1_j583tfa wrote

>we first generate a graph that can capture relationship between entities in the question

This is really impressive. What are your thoughts on the state of this kind of approach? Could it be extended from sentences to whole context paragraphs at some stage, with the entities dynamically becoming different graph items?


dancingnightly t1_iyolsht wrote

For a neural network, no: you want to train it to represent the raw data (or near-raw, like an FFT), as other answers mention.

You could create a simple baseline logistic regression model to check this. When you think about it, that model can already compute the mean of the features (128 electrodes × t timesteps) as a linear combination within its weights (for binary classification). So even in this case, providing the mean isn't useful.

What would benefit from feature engineering?

If you have Excel-table-style data, or a low amount of data.

A classification decision tree is more likely to benefit, but this still usually only works if you can do some preprocessing with distances in embedding or other mathematical spaces, or augment the data format (e.g. appending the presence of POS tags for text data, which is useful for logistic regression too).

A decision tree (usually) can't so easily implement things like totalling individual features, so totals can on occasion be useful when you have few data points (although in theory, an ensemble of trees [which is the default nowadays] can approximate this, and would if it were useful). Another example would be precalculating profit margin from the costs, or net/gross profit, for a company-prediction dataset.
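
A toy illustration of the totalling point on synthetic data: the label depends on the difference of two features, which a shallow tree can't express directly, so precomputing the margin helps.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
costs = rng.uniform(1, 10, 500)
revenue = rng.uniform(1, 10, 500)
y = (revenue > costs).astype(int)   # label depends on the *difference*

raw = np.column_stack([costs, revenue])
engineered = np.column_stack([costs, revenue, revenue - costs])  # add margin

# A depth-2 tree can only make axis-aligned splits, so the diagonal boundary
# is hard with raw features but trivial once the margin is precomputed:
for X, name in [(raw, "raw features"), (engineered, "plus margin feature")]:
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(name, tree.score(X, y))
```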
