trnka t1_jdhvzy3 wrote

Eh, we've gone through a lot of hype cycles before and the field still exists. For example, deep learning was hyped as replacing all feature engineering for all problems, which would supposedly trivialize NLP. In practice that was overhyped: you still need to understand NLP to get value out of deep learning for NLP, and there's still quite a bit of feature engineering (and practices like it).

I think LLMs will turn out to be similar. They'll change the way we approach many problems, but you'll still need to understand both LLMs and more problem-specific aspects of ML.

Back to your question, if you enjoy AI/ML and you're worried about jobs in a few years, I think it's still worth pursuing your interests.

If anything, the bigger challenge in jobs in the next year or two is the current job market.

1

trnka t1_jd82eo1 wrote

If you're using some API, it's probably best to look at the API docs.

If I had to guess, I'd say that top_k is roughly the beam width in beam search, and top_p dynamically adjusts that width to cover the fraction of the probability distribution you specify.

top_k=1 is probably what we'd call a greedy search. It's going left to right and picking the most probable token. The sequence of tokens selected in this way might not be the most probable sequence though.

Again, check the API docs to be sure.

All that said, these are just settings for searching for the most probable sequence in a computationally efficient way. The search is still deterministic. What I was describing in the previous response was adding some randomness so that it's not deterministic.
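
For intuition, here's a toy sketch of the greedy (top_k=1) case: pick the single most probable token at each step, left to right. The `step_probs` function is a hypothetical stand-in for whatever model produces next-token probabilities, not a real API:

```python
import numpy as np

def greedy_decode(step_probs, max_steps=20, eos_id=0):
    """Toy greedy decoding: at each step take the single most probable token."""
    tokens = []
    for _ in range(max_steps):
        probs = step_probs(tokens)           # distribution over the vocabulary
        next_token = int(np.argmax(probs))   # top_k=1: always the single best token
        tokens.append(next_token)
        if next_token == eos_id:             # stop at end-of-sequence
            break
    return tokens
```

The greedy choice at each step is exactly why the overall sequence isn't guaranteed to be the most probable one.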

1

trnka t1_jcyped6 wrote

Some systems output the most probable token in each context, so those will be consistent given a prompt. Traditionally that could lead to very generic responses.

So it's common to add a bit of randomness into it. The simplest approach is to generate tokens according to their probability. There are many other variations on this to allow more control over how "creative" the generator can be.
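
As a rough sketch of the simplest version: sample the next token in proportion to its probability, with a temperature knob for how "creative" it gets. The logits here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    """Sample the next token according to its probability.

    Lower temperature concentrates on the top tokens (less "creative"),
    higher temperature flattens the distribution (more "creative").
    """
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy 4-token vocabulary with made-up scores
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7))
```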

1

trnka t1_jcalqfm wrote

Converting the text to fixed-size windows is done to make training more efficient. If the inputs are shorter, they're padded up to the correct length with null tokens. Otherwise they're clipped. It's done so that you can combine multiple examples into a single batch, which becomes an additional dimension on your tensors. It's a common technique even for LSTMs/CNNs.

It's often possible to take the trained model and apply it to variable-length testing data so long as you're dealing with a single example at a time rather than a batch. But keep in mind with transformers that attention does N^2 comparisons, where N is the number of tokens, so it doesn't scale well to long texts.

It's possible that the positional encoding may be specific to the input length, depending on the transformer implementation. For instance in Karpathy's GPT recreation video he made the positional encoding learnable by position, so it wouldn't have defined values for longer sequences.

One common alternative in training is to create batches of examples that are mostly the same text length, then pad to the max length. You can get training speedups that way but it takes a bit of extra code.
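
A toy sketch of both ideas, padding/clipping to a fixed length and bucketing by length to cut down on padding (nothing framework-specific):

```python
def pad_or_clip(token_ids, max_len, pad_id=0):
    """Pad short examples with a null token and clip long ones."""
    token_ids = token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

def batches_bucketed_by_length(examples, batch_size):
    """Group examples of similar length so each batch needs less padding."""
    examples = sorted(examples, key=len)
    for i in range(0, len(examples), batch_size):
        batch = examples[i:i + batch_size]
        longest = max(len(x) for x in batch)
        yield [pad_or_clip(x, longest) for x in batch]
```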

2

trnka t1_jc8csxm wrote

If you have significant data, I'd suggest starting with BERT (and including some basic baselines).

If you only have a small amount of data, you might be able to use GPT models with a fair amount of prompt engineering.

Also, you'll probably face different challenges if the candidate types the response versus an interviewer summarizing it. If it's an interviewer's notes, you might find simple proxies, like certain interviewers typing more for good candidates.
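
As one of those basic baselines, a bag-of-words model in scikit-learn is quick to set up and surprisingly hard to beat; the data here is obviously made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical interview notes and whether the candidate advanced
texts = ["explained the tradeoffs clearly and gave examples",
         "did not really answer the question",
         "strong on fundamentals, weaker on system design",
         "rambled and missed the point"]
labels = [1, 0, 1, 0]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["clear examples and strong fundamentals"]))
```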

1

trnka t1_j91xnym wrote

In terms of probabilities yeah that's right.

In the actual code, it's most common to do a softmax over the output vocabulary. In practice that means the model computes the probability of every possible next output (whether word or subword), and then we sort it and take the argmax or the top K, depending on the problem.

I think about generating one word at a time as a key part of the way we're searching through the space of probable sentences, because we can't afford to brute-force search.
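
In toy numpy terms, a single step looks something like this (a 5-word vocabulary with made-up scores):

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=np.float64)
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

vocab = ["the", "cat", "sat", "on", "mat"]
probs = softmax([1.2, 3.4, 0.1, 0.5, 2.0])   # probability of every possible next word

best = vocab[int(np.argmax(probs))]                        # argmax
top_k = [vocab[i] for i in np.argsort(probs)[::-1][:3]]    # top K
print(best, top_k)
```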

1

trnka t1_j91sshb wrote

It doesn't look like it's headed that way, no. The set of possible next sentences is just too big to iterate over or to compute a softmax over, so it's broken down into words. In fact, the set of possible words is often too big so it's broken down into subwords with methods like byte pair encoding and WordPiece.

The key when dealing with predicting one word or subword at a time is to model long-range dependencies well enough so that the LM can generate coherent sentences and paragraphs.
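
If you have the Hugging Face transformers library handy, you can see the subword splitting directly; rarer or longer words get broken into WordPiece pieces marked with ##:

```python
from transformers import AutoTokenizer

# WordPiece tokenizer used by BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words usually stay whole; rare words are split into several subword pieces
print(tokenizer.tokenize("unforgettably electrocardiographic"))
```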

1

trnka t1_j8hcpwt wrote

I've been learning more about multilingual neural machine translation models lately such as the one in Google's recent paper:

Bapna, A., Caswell, I., Kreutzer, J., Firat, O., van Esch, D., Siddhant, A., Niu, M., Baljekar, P., Garcia, X., Macherey, W., Breiner, T., Axelrod, V., Riesa, J., Cao, Y., Chen, M. X., Macherey, K., Krikun, M., Wang, P., Gutkin, A., … Hughes, M. (2022). BUILDING MACHINE TRANSLATION SYSTEMS FOR THE NEXT THOUSAND LANGUAGES

I'm not sure I understand why it works for languages with no parallel data with any language, though. For instance, Latinized Hindi doesn't have any parallel data. Why would the encoder or decoder representations of Latinized Hindi be compatible with any other language?

Is it because byte-pair encoding is done across languages, and that Latinized Hindi will have some word overlap with languages that DO have parallel data? So then it's encouraging the learning algorithm to represent those languages in the same latent space?

2

trnka t1_j6d5fbk wrote

I think most people split by participant. I don't remember if there's a name for that, sorry! Hopefully someone else will chime in.

If you have data from multiple hospitals or facilities, it's also common to split by that because there can be hospital-specific things in the data and you really want your evaluation to estimate the quality of the model for patients not in your data at hospitals not in your data.
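
scikit-learn's group-aware splitters handle the split-by-participant (or split-by-hospital) case; a minimal sketch with made-up data:

```python
from sklearn.model_selection import GroupShuffleSplit

# One row per encounter; groups identify the patient (or hospital)
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [0, 1, 0, 1, 0, 1]
groups = ["patient_a", "patient_a", "patient_b", "patient_b", "patient_c", "patient_c"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
# Every row for a given patient lands entirely in train or entirely in test
```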

1

trnka t1_j6ce4td wrote

I try not to think of it as right and wrong, but more about risk. If you have a big data set, do EDA over the full thing before splitting off testing data, and intend to build a model, then yes, you're learning a little about the test data, but it probably won't bias your findings.

If you have a small data set and do EDA over the full thing, there's more risk of it being affected by the not-yet-held-out data.

In real-world problems though, ideally you're getting more data over time so your testing data will change and it won't be as risky.

1

trnka t1_j6583q3 wrote

If you're ingesting from an API, typically the limiting factor is the number of API calls or network round trips. So if there's a "search" API or anything similar that returns paginated data, that'll speed it up a LOT.

If you need to traverse the API to crawl data, that'll slow it down a lot. Like say if there's a "game" endpoint, a "player" endpoint, a "map" endpoint, etc.

If you're working with image data, fetching the images is usually a separate step that can be slow.

After that, if you can fit it in RAM you're good. If you can fit it on one disk, there are decent libraries with each ML framework to efficiently load from disk in batches, and you can probably optimize the disk loading too.
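
For the paginated-API case above, the pattern is usually just "follow the next-page links." A hedged sketch with requests, assuming a hypothetical API that returns a "results" list and a "next" URL:

```python
import requests

def fetch_all(url, params=None):
    """Follow a paginated API; far fewer round trips than one call per record."""
    results = []
    while url:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        results.extend(payload["results"])
        url = payload.get("next")   # None when there are no more pages
        params = None               # the "next" URL already encodes the query
    return results
```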

----

What you're describing is usually called exploratory data analysis, but it depends on the general direction you want to go in. If you're trying to identify people with thyroid cancer earlier, for example, you might want to compare the data of recently-diagnosed people to similar people who have been tested and found not to have thyroid cancer. Personally, in that situation I like to just train a logistic regression model to predict that from various patient properties, then check if it's predictive on a held-out data sample. If it's predictive I'll then look at the coefficients of the features to understand what's going on, then work to improve the features.

Another simple thing you can do, if the data is small enough and tabular rather than text/image/video/audio, is to load it up in Pandas, run .corr, and check correlations with the column you care about (has_thyroid_cancer).
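
A sketch of both of those checks on a hypothetical tabular file (assuming the columns are numeric):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("patients.csv")   # hypothetical file with a has_thyroid_cancer column

# Quick look: correlation of every numeric column with the label
print(df.corr(numeric_only=True)["has_thyroid_cancer"].sort_values())

# Is the label predictable at all from the other columns?
X = df.drop(columns=["has_thyroid_cancer"])
y = df["has_thyroid_cancer"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# If it's predictive, look at the coefficients to see which features matter
print(pd.Series(model.coef_[0], index=X.columns).sort_values())
```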

Hope this helps! Happy to follow up too.

2

trnka t1_j5nukd2 wrote

I'm not sure what you mean by applying a NN to linear regression.

I'll try wording it differently. Sometimes a NN can outperform linear regression on regression problems, for example when there's a nonlinear relationship between some features and car price. But neural networks are also prone to over-fitting, so I recommend against having a NN as one's first attempt to model some data. I recommend starting simple and trying complex models when it gets difficult to improve results with simple models.

I didn't say this before but another benefit of starting simple is that linear regression is usually much faster than neural networks, so you can iterate faster and try out more ideas quickly.

2

trnka t1_j5kksex wrote

Hmm, you might also try feature selection. I'm not sure what you mean by not iterating, unless you mean recursive feature elimination? There are a lot of really fast correlation functions you can try for feature selection -- scikit-learn has some popular options. They run very quickly, and if you have lots of data you can probably do the feature selection part on a random subset of the training data.

Also, you could do things like dimensionality reduction learned from a subset of the training data, whether PCA or a NN approach.
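
A sketch of both ideas, using scikit-learn's fast univariate selectors and PCA fit on a random subset (synthetic data standing in for the real thing):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=5000, n_features=500, random_state=0)

# Fast correlation-style feature selection: one pass, no iterating over models
selector = SelectKBest(score_func=f_classif, k=50).fit(X, y)
X_selected = selector.transform(X)

# Dimensionality reduction learned on a random subset of the training data
subset = np.random.default_rng(0).choice(len(X), size=1000, replace=False)
pca = PCA(n_components=50).fit(X[subset])
X_reduced = pca.transform(X)
```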

1

trnka t1_j5k77wb wrote

The difference from application-level evaluation is a bit vague in that text. I'll use a medical example that I'm more familiar with - predicting the diagnosis from text input.

Application-level evaluation: If the output is a diagnosis code and explanation, I might measure how often doctors accept the recommended diagnosis and read the explanation without checking more information from the patient. And I'd probably want a medical quality evaluation as well, to penalize any biasing influence of the model.

Non-expert evaluation: With the same model, I might compare 2-3 different models and possibly a random baseline model. I'd ask people like myself with some exposure to medicine which explanation is best for a particular case and I could compare against random.

That said, I'm not used to seeing non-experts used as evaluators, though it makes some sense in the early stages, when explanations are still rough.

I'm more used to seeing the distinction between real and artificial evaluation. I included that in my example above -- "real" would be when we're asking users to accomplish some task that relies on explanation and we're measuring task success. "Artificial" is more just asking for an opinion about the explanation but the evaluators won't be as critical as they would be in a task-based evaluation.

Hope this helps! I'm not an expert in explainability though I've done some work with it in production in healthcare tech.

1

trnka t1_j5k5ndr wrote

Yeah you can use a neural network instead of linear regression if you'd like. I usually start with linear regression though, especially regularized, because it usually generalizes well and I don't need to worry about overfitting so much.

Once you're confident that you have a working linear regression model then it can be good to develop the neural network and use the linear regression model as something to compare to. I'd also suggest a "dumb" model like predicting the average car price as another point of comparison, just to be sure the model is actually learning something.
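
As a rough sketch of that comparison, with synthetic data standing in for a real car-price dataset:

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

models = {
    "predict the average": DummyRegressor(strategy="mean"),
    "regularized linear regression": Ridge(alpha=1.0),
    "small neural network": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: R^2 = {score:.3f}")
```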

I'm not familiar with the Levenberg–Marquardt algorithm so I can't comment on that. From the Wikipedia page it sounds like a second-order method, and those can be used if the data set is small but they're uncommon for larger data. Typically with a neural network we'd use an optimizer like plain stochastic gradient descent or a variation like Adam.

1

trnka t1_j5k4ldr wrote

It depends on the data and the problems you're having with high-dimensional data.

  • If the variables are phrases like "acute sinusitis, site not specified" you could use a one hot encoding of ngrams that appear in them.
  • If you have many rare values, you can just retain the top K values per feature.
  • If those don't work, the hashing trick is another great thing to try. It's just not easily interpretable.
  • If there's any internal structure to the categories, like if they're hierarchical in some way, you can cut them off at a higher level in the hierarchy.
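
Rough sketches of the first three options (the phrases here are made up):

```python
from collections import Counter
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import CountVectorizer

phrases = ["acute sinusitis, site not specified",
           "chronic sinusitis",
           "acute bronchitis"]

# 1. One-hot-style encoding of the word ngrams inside each phrase
ngram_features = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(phrases)

# 2. Keep only the top K most frequent values, collapse the rest into OTHER
top_k = {value for value, _ in Counter(phrases).most_common(2)}
collapsed = [p if p in top_k else "OTHER" for p in phrases]

# 3. The hashing trick: fixed-size and fast, but not easily interpretable
hashed = FeatureHasher(n_features=64, input_type="string").transform([[p] for p in phrases])
```
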
2

trnka t1_j4r661s wrote

Think about it more like autocomplete. It's able to complete thoughts coherently enough to fool some people, when provided enough input to complete from. It's often incorrect with very technical facts though.

It's really about how you make use of it. In scientific work, you could present your idea and ask for pros and cons of the idea, or to write a story about how the idea might fail horribly. That can be useful at times. Or to explain basic ideas from other fields.

It's kinda like posing a question to Reddit except that ChatGPT generally isn't mean.

There are other approaches like Elicit or Consensus that use LLMs more for literature review, which is probably more helpful.

1

trnka t1_j488v5u wrote

You might try Snorkel. The gist is that domain experts write rules and those rules are fed into ML. If that company doesn't work, I'm pretty sure there are alternatives. Or maybe they had their work in a Python library... it's been a while.

Compared to traditional ML, the benefit is that you're involving the subject matter experts more and giving them a say more directly. That tends to ensure that they're bought in to the approach. Having been in healthcare ML for a while, getting buy-in can be very challenging.
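
If I remember the library right, the shape of it is roughly this: domain experts write small labeling functions, and Snorkel combines their noisy votes into training labels. The rules and notes below are made up:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def mentions_chest_pain(x):
    # A rule a domain expert might write over a free-text note
    return POSITIVE if "chest pain" in x.text.lower() else ABSTAIN

@labeling_function()
def routine_followup(x):
    return NEGATIVE if "routine follow-up" in x.text.lower() else ABSTAIN

df = pd.DataFrame({"text": ["Patient reports chest pain at rest.",
                            "Routine follow-up, no complaints."]})

applier = PandasLFApplier(lfs=[mentions_chest_pain, routine_followup])
votes = applier.apply(df)                 # one column of votes per rule

label_model = LabelModel(cardinality=2)   # combines the noisy votes
label_model.fit(votes)
training_labels = label_model.predict(votes)
```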

2

trnka t1_j45klqc wrote

No it's not strictly needed, though I haven't seen a course that teaches ML starting from the application and working backwards to the fundamentals. In teaching that's sometimes called "top down" as opposed to starting from fundamentals.

If you're taking courses, you may need to pick up a bit of math along the way. If you're self-taught, you might try starting with tutorials of ML libraries like scikit-learn and keeping a journal of any terms you need to look up later.

1

trnka t1_j40i4fa wrote

Fine-tuning is when you take a pretrained network, change the output layer only, and run the optimizer a little more.

Transfer learning is when you take any sort of pretraining. Fine-tuning is one example of transfer learning. Using pretrained word embeddings is another example of transfer learning.

You can do deep learning without either. It's just that existing pretrained models and components are so good that it's tough to build a competitive model without either.
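
A minimal sketch of that recipe in PyTorch/torchvision: load a pretrained network, swap the output layer, and run the optimizer a little more (here only on the new layer, with random tensors standing in for real data):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained network and freeze its layers
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer for a new 10-class problem
model.fc = nn.Linear(model.fc.in_features, 10)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One hypothetical training step
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```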

2