dancingnightly t1_je0o082 wrote

The benefit of finetuning or training your own text model (in the olden days on BERT, now through the OpenAI API) over just using contextual semantic search is shrinking day by day... especially with the extended context window of GPT-4.

If you want something in-house, finetuning GPT-J or the like could be the way to go, but it's definitely not the career direction I'd take.


dancingnightly t1_jadj7fa wrote

Edit: Seems like for this one, yes. They do consider human instructions (similar in goal to RLHF, which requires more RAM) by adding them directly to the text dataset, as mentioned in section 3.3, Language-Only Instruction Tuning.

For other models, like the upcoming OpenAssistant, one thing to note is that, although the generative model itself may be runnable locally, the reward model (the bit that "adds finishing touches" and ensures instructions are followed) can be much bigger. Even if the underlying GPT-J model is 6B params and 11GB in RAM, the RLHF reward model could seriously increase that.

This model is in the realm of the smaller T5, BART and GPT-2 models released three years ago, which were runnable even then on decent gaming GPUs.
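
Back-of-envelope check on that RAM figure, assuming fp16 weights and ignoring activations, KV cache and optimizer state:

```python
# Rough memory arithmetic for GPT-J-6B (fp16 assumption; reward model extra):
params = 6e9               # GPT-J-6B parameter count
bytes_per_param = 2        # fp16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~12 GB, near the 11GB quoted
```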


dancingnightly t1_j8y81v9 wrote

Do you know of any similar encoding where you vectorise relative time, as multiple proportions of completeness, if that makes sense?

​

Say, completeness within a paragraph, within a chapter, within a book? (Besides sinusoidal embeddings, which push up the number of examples you need.)
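
Something like this toy sketch is what I have in mind (all names and offsets are illustrative):

```python
# Hypothetical encoding of relative time as proportions of completeness
# at several granularities, given token offsets into the text:
def completeness_features(token_idx, para_start, para_end,
                          chap_start, chap_end, book_len):
    return [
        (token_idx - para_start) / max(para_end - para_start, 1),  # paragraph
        (token_idx - chap_start) / max(chap_end - chap_start, 1),  # chapter
        token_idx / max(book_len, 1),                              # book
    ]

print(completeness_features(1200, 1150, 1300, 1000, 3000, 90000))
```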


dancingnightly t1_j8y7fny wrote

"If you look at the internals, it's a nightmare. A literal nightmare."

Yes, the copy-paste button is heavily rinsed at HF HQ.

But you wouldn't believe how much easier they made it to run, tokenize and train models in 2018-19, and, at that, to train compatible models.

We probably owe a month of NLP progress just to them coming in with those one-liners and sensible argument API surfaces.
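
For anyone who hasn't seen it, this is the kind of one-liner I mean (standard `transformers` usage; it downloads a default model on first run):

```python
from transformers import pipeline

# One line to get a working classifier, tokenizer and model included:
classifier = pipeline("sentiment-analysis")
print(classifier("HuggingFace made this a one-liner."))
```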

​

Now, yes, it's getting crazy, but if there's a new paradigm, a new complex way to code, then a similar library will simplify it, and we'll mostly jump there except for legacy. It'll become like scikit-learn (which still holds up for most real ML tasks): lots of fine-grained detail and a slightly questionable number of edge cases (looking at the clustering algorithms in particular), but as easy as pie to keep going with.

​

I personally couldn't ask for more. I was worried they were going to push auto-switching models into their API at some point, but they've been brilliant. There are bugs, but I've never seen them in inference (besides your classic CUDA OOM), and, like Fit_Schedule5951 says, it's all about that with HF.


dancingnightly t1_j8g0oqx wrote

Hold on, Jurassic-X has been here since April 2022, I believe, with something fairly similar:

https://arxiv.org/pdf/2204.10019.pdf

https://www.ai21.com/blog/jurassic-x-crossing-the-neuro-symbolic-chasm-with-the-mrkl-system

It didn't learn new tools, I think, but it did work well for calculations and wiki search.


dancingnightly t1_j7s355b wrote

In a sense, you can communicate between semantic text embeddings and LM models through this method (it would operate differently to multi-modal embeddings): https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight

This method, which is really only practical for toy problems right now, would allow you to use semantic embeddings to find what to look for when doing SVD on an (autoregressive) LM. You could condition this on the input, for example by transforming your embedding into the keys with which to apply the abduction in that process, thereby impacting the generation of logits. I'm not sure this would behave much differently to altering the logit_bias of tokens, but it would be interesting to hear if it did.
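
To make that concrete, here's a toy sketch of just the SVD side, under my own assumptions (which layer/matrix to decompose is a free choice here, and this isn't the full method from the post):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

with torch.no_grad():
    # SVD of one attention output-projection matrix (768 x 768):
    W = model.transformer.h[0].attn.c_proj.weight
    U, S, Vh = torch.linalg.svd(W)
    # Read off which tokens the top singular directions write towards,
    # by mapping them through the unembedding:
    logits = model.lm_head(Vh[:5] * S[:5, None])   # (5, vocab_size)
    for row in logits.topk(5, dim=-1).indices:
        print(tok.convert_ids_to_tokens(row.tolist()))
```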


dancingnightly t1_j76uuee wrote

In this goal, you may find Mixture of Experts architectures interesting.

I like your idea. I have always thought that in ML we are trying to replicate one human on one task with the world's data for that task, or, more recently, one human on many tasks.

But older ideas, replicating societies and communication for one or many tasks, could be equally or more effective, and this heads in that direction. There is a library called GeNN which is pretty useful for these experiments, although it's a little slow due to its deliberately true-to-biology design.


dancingnightly t1_j76t0gh wrote

In theory, training T5 alongside the image embedding models they use (primarily DETR?) shouldn't take much more than a 3090 or a Colab Pro GPU. You could train T5s even on high-end consumer GPUs in 2020, for example, but the DETR image model probably needs to be run for each image at the same time, which might take up quite a bit of GPU together. The `main.py` script looks like a nice and fairly short typical training script that you could quickly run if you download their repo, pull the ScienceQA dataset and pass in the training args to see if it crashes.
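
As a rough sketch of the kind of single training step involved (hyperparameters and the example pair are placeholders; their `main.py` handles the real ScienceQA formatting and the DETR features):

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is a placeholder

# Placeholder input/target pair in T5's text-to-text format:
inputs = tok(["question: ... context: ..."], return_tensors="pt").to("cuda")
labels = tok(["the answer"], return_tensors="pt").input_ids.cuda()

loss = model(**inputs, labels=labels).loss
loss.backward()
opt.step()
opt.zero_grad()
```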


dancingnightly t1_j6oaxeo wrote

This is commercial, not research, but: a lot of scenarios where explainable AI is needed use simple statistical solutions.

​

For example, a company I knew had to identify people in poverty in order to distribute a large ($M) grant fund to people in need, and they had only basic data on some relatively unrelated things, like how often these people travelled, their age, etc.

​

In order to create an explainable model whose factors can be understood by higher-ups and easily checked for bias, they used a k-means approach with just 3 factors.

​

It captured close to as much information as deep learning, but with more robustness to data drift, and with clear graphs segmenting the target group from the general group. It also reduced the use of data, being pro-privacy.
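
A minimal sketch of that kind of model (the three factors are hypothetical stand-ins for the real features, and the data here is random):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three hypothetical factors per person, e.g. travel frequency, age, activity:
X = rng.normal(size=(1000, 3))

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))

# Each centroid is directly interpretable in the original three factors,
# which is what makes the bias review by higher-ups feasible:
print(km.cluster_centers_)
```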

​

This 30-line solution, with a dozen explanatory EDA output graphs, probably got sold for >500k in fees... but they did make the right choices in this circumstance. They saved themselves a complex ML model, bias/security/privacy/deployment hell, and left a maintainable solution.

​

Now, for research, it's interesting from the perspective of applied AI (which is arguably still dominated by GOFAI/simple statistics) and of communicating about AI with the public, although I wouldn't say it's in vogue.


dancingnightly t1_j5v5zwe wrote

The internet isn't accessed live by most of these models, as others have said.

You can finetune language models, but you don't really add knowledge to them; you bias them to output words in an order more like your sample data. Fine-tuning like this won't add facts as such.

One approach you can take, though, is semantic search through your notes for a given topic/search query: you collect the relevant notes whose meanings are similar to your topic/search query, then populate a prompt with that text. The answer will then use that information and any facts in it, if the model is big enough and RLHF-tuned (like the ChatGPT/Instruct/text-00x models from OpenAI).
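
A minimal sketch of that retrieve-then-prompt loop (the embedding model name is just a common default; the notes and query are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
notes = ["note about topic A ...", "note about topic B ...", "note about topic C ..."]
note_vecs = model.encode(notes, normalize_embeddings=True)

query = "what did I write about topic B?"
q = model.encode([query], normalize_embeddings=True)[0]
top = np.argsort(note_vecs @ q)[::-1][:2]   # cosine similarity (unit vectors)

# Populate the prompt with the retrieved notes:
prompt = ("Answer using only these notes:\n"
          + "\n".join(notes[i] for i in top)
          + f"\n\nQuestion: {query}")
print(prompt)
```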

An open source module for this is GPTIndex; I also work on a commercial solution which encompasses videos etc. too and has some optimisations. You can also add data/facts from the internet to the prompt (context) at generation time; see approaches like WebGPT.


dancingnightly t1_j5c31u6 wrote

Oh OK, thank you for taking the time to explain. I see that this graph approach isn't for extending beyond the existing context of RoBERTa/similar transformer models, but rather for enhancing performance.

I was hoping graphs could capture relational information (in a way compatible with transformer embeddings) between far-apart parts of a document, essentially (like: for each doc.ents, connect them in a fully connected graph). It sounds like this dynamic graph size/structure per document input wouldn't work with the transformer embeddings for now, though.


dancingnightly t1_j58anv8 wrote

That's a great resource, thanks. I have studied how this kind of autoregressive model works and found attention fascinating, but here it's the graph-embedded entities you brought up that sound exciting. I have only skim-read your paper for now, so perhaps I've made a mistake, but what I mean is:

For graph embeddings, could you dynamically capture different entities/tokens over a much broader context than common sense reasoning statements and questions? i.e. do entailment on a whole chapter (or a knowledge base entry with 50 triplets), where the graph embeddings meaningfully represent many entities (perhaps with sine positional embeddings for each additional text mention, in addition to the graph, just like for attention)?

[Why I'm interested: I presume it's impractical to scale this approach up in context, similar to autoregressive models, since a fully connected graph scales quadratically, but I'd love to know your thoughts: can a graph be strategically connected, etc.?]


dancingnightly t1_j583tfa wrote

>we first generate a graph that can capture relationship between entities in the question

This is really impressive. What are your thoughts on the state of this kind of approach? Could it be extended from sentences to whole context paragraphs at some stage, with the entities dynamically becoming different graph items?


dancingnightly t1_iyolsht wrote

For a neural network, no: you want to train it to represent the raw data (or near-raw, like an FFT), as other answers mention.

You could create a simple baseline logistic regression model to check this. When you think about it, that model can already compute the mean of the features (128 electrodes × t timesteps) as a linear combination within its weights (for binary classification). So even in this case, providing the mean isn't useful.

What would benefit from feature engineering?

If you have Excel-table-style data, or a low amount of data.

A classification decision tree is more likely to benefit, but this still usually only works if you can do some preprocessing with distances in embedding or other mathematical spaces, or augment the data format (e.g. appending the presence of POS tags for text data, which is useful for logistic regression too).

A decision tree (usually) can't so easily implement things like totalling individual features, so totals can on occasion be useful when you have few data points (although in theory, an ensemble of trees [which is the default nowadays] can approximate this, and would if it were useful). Another example would be precalculating profit margin from the costs, or net/gross profit, for a company-prediction dataset.
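
A toy illustration of the totalling point on synthetic data: the label depends on the difference of two features, which a shallow tree can't express directly, so precomputing the margin helps.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
costs = rng.uniform(1, 10, 500)
revenue = rng.uniform(1, 10, 500)
y = (revenue > costs).astype(int)   # label depends on the *difference*

raw = np.column_stack([costs, revenue])
engineered = np.column_stack([costs, revenue, revenue - costs])  # add margin

# A depth-2 tree can only make axis-aligned splits, so the diagonal boundary
# is hard with raw features but trivial once the margin is precomputed:
for X, name in [(raw, "raw features"), (engineered, "plus margin feature")]:
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(name, tree.score(X, y))
```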
