royalemate357

royalemate357 t1_je34nnn wrote

Not a lawyer, but I don't think it's enough to change the license, as it's still derived from the LLaMA weights and so you'd still have to follow the rules.

>Meta grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Meta’s copyright interests to reproduce, distribute, and create derivative works of the Software solely for your non-commercial research purposes. The foregoing license is personal to you, and you may not assign or sublicense this License or any other rights or obligations under this License without Meta’s prior written consent; any such assignment or sublicense will be void and will automatically and immediately terminate this License.

https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/LICENSE

7

royalemate357 t1_jd1stda wrote

Not OP, but I imagine they're referring to the sampling hyperparameters that control the text generation process. For example, there is a temperature setting: a lower temperature makes the model sample more from its most likely choices, so it would potentially be more precise/accurate but also less diverse and creative in its outputs.
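
For a concrete picture, temperature just rescales the logits before sampling; a minimal sketch (the names here are illustrative, not any specific library's API):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Lower temperature sharpens the distribution (more precise/deterministic picks),
    # higher temperature flattens it (more diverse/creative picks).
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```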

1

royalemate357 t1_jclz4t0 wrote

Hmm, I am not too sure, but their blog post says this:

>TorchInductor uses a pythonic define-by-run loop level IR to automatically map PyTorch models into generated Triton code on GPUs and C++/OpenMP on CPUs.

So it seems like they support CPU. I also tried it briefly on a CPU-only Google Colab instance, and it seems to work (I didn't benchmark speed, though). I doubt it supports non-CUDA GPUs, but then again, support for those isn't very good even in the general case.
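
Something like this is enough to try it on a CPU-only machine (a minimal sketch, not a benchmark):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# torch.compile routes through TorchInductor; on a CPU-only machine it should
# generate C++/OpenMP code rather than Triton/CUDA kernels.
compiled = torch.compile(model)
out = compiled(torch.randn(4, 128))
```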

8

royalemate357 t1_jckqgsr wrote

Pretty sure the main improvement is "torch.compile", which can optimize your code in one nice, easy line. There are some other quality-of-life improvements, like the built-in flash attention OP is using, and I think some distributed training stuff. But it's fully backwards compatible, which is great (looking at you, TensorFlow): https://pytorch.org/get-started/pytorch-2.0/#pytorch-2x-faster-more-pythonic-and-as-dynamic-as-ever
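
As an illustration, the built-in attention is also a one-liner (a sketch, assuming PyTorch 2.0; it falls back to a non-fused implementation when a flash kernel isn't available):

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# (batch, heads, sequence, head_dim)
q, k, v = (torch.randn(1, 8, 1024, 64, device=device) for _ in range(3))

# Dispatches to a fused FlashAttention-style kernel when one is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```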

43

royalemate357 t1_jb1h7wl wrote

Hmm, I very much doubt it could've run 100x faster for the same parameter count, as you are memory-bandwidth bound (both GPT and RWKV have to load the parameters n times to generate n tokens). Also, I'm somewhat skeptical that you only need 3GB for 14B parameters *without offloading the model*, as even 4-bit quantization needs 14B/2 bytes = 7GB. And offloading the model is slow to the point of being unusable, as you need to do CPU<->GPU transfers.
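
Back-of-the-envelope math on the weights alone (ignoring activations, recurrent state, and framework overhead):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    # bits -> bytes -> gigabytes
    return n_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"14B params @ {bits}-bit: {weight_memory_gb(14e9, bits):.1f} GB")
# 14B params @ 32-bit: 56.0 GB
# 14B params @ 16-bit: 28.0 GB
# 14B params @ 8-bit: 14.0 GB
# 14B params @ 4-bit: 7.0 GB
```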

1

royalemate357 t1_jb0smq3 wrote

It's awesome work, but I don't think anyone is claiming anywhere near 100x faster speed and lower VRAM, are they?

>RWKV-3 1.5B on A40 (tf32) = always 0.015 sec/token, tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M
>
>GPT2-XL 1.3B on A40 (tf32) = 0.032 sec/token (for ctxlen 1000), tested using HF, GPU utilization 45% too (interesting), VRAM 9655M

From this it sounds like about a ~2x improvement (don't get me wrong, a 2x improvement at the same quality is great). As for memory, you have to store all the parameters of the RWKV model just like GPT's, and that takes up most of the memory if you're trying to fit models on consumer hardware; memory use is just lower because there's no need for a KV cache.
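
For a rough sense of what the KV cache itself costs (the GPT-2-XL-ish config numbers below are my assumptions, and this ignores framework overhead):

```python
def kv_cache_bytes(n_layers: int, d_model: int, ctx_len: int, bytes_per_elem: int = 4) -> int:
    # Keys + values: one d_model-sized vector each, per layer, per position.
    return 2 * n_layers * d_model * ctx_len * bytes_per_elem

# Assumed GPT-2 XL-like config: 48 layers, d_model=1600; ctxlen 1000, fp32 storage.
print(kv_cache_bytes(48, 1600, 1000) / 1e9)  # ~0.61 GB
```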

7

royalemate357 t1_jabauxj wrote

> I know it’s common for massive projects to use Fortran in order to train NN

Is it? I'm not aware of any high-profile / large-scale ML projects recently written in it. My understanding is that they mostly use Python for model development, and then C/C++ for the actual math. FWIW, I think parts of NumPy are written in Fortran, though.

27

royalemate357 t1_j9s125d wrote

> instead of "maximizing paperclips," "it" is just trying to maximize engagement and click-through rate. and just like the paperclips thing, "it" is burning the world down trying to maximize the only metrics it cares about

Isn't there a difference between the two? The latter concerns a human trying to pursue a certain goal (maximizing user engagement) and giving the AI that goal. So arguably, the latter is "aligned" (for some sense of the word) to the human that's using it to maximize their engagement, in that it's doing what a specific human intends it to do. Whereas the paperclip scenario is more like: the human tells the AI to maximize engagement, yet the AI has a different goal and chooses to pursue that instead.

1

royalemate357 t1_j9rzbbc wrote

>When they scale they hallucinate more, produce more wrong information

Any papers/literature on this? AFAIK they do better and better on fact/trivia benchmarks and whatnot as you scale them up. It's not like smaller (GPT-like) language models are factually more correct ...

1

royalemate357 t1_j9ryzg7 wrote

>I think the whole "paperclip" metaphor descibres problems that are already here

Does it? My understanding of the paperclip metaphor is that an advanced AI will pursue its own goals that are totally unrelated to human goals, e.g. creating as many paperclips as possible. But AIs aren't advanced enough right now to be at this point.

As for what constitutes "x-risks": AFAIK it means "existential risk," which is something like all of humanity going extinct. IMO the reason people consider advanced AGIs an x-risk, and the others not, is that the other problems you mentioned don't result in the extinction of *every* single human on Earth.

12

royalemate357 t1_j9rsqd3 wrote

>We are not even remotely close to anything like actual brain functions.

Intelligence need not look anything remotely close to actual brain functions though, right? Like, a plane's wings don't function anything like a bird's wings, yet it can still fly. In the same sense, why must intelligence not be algorithmic?

At any rate, I feel like saying that probabilistic machine learning approaches like GPT-3 are nowhere near intelligence is a bit of a stretch. If you continue scaling up these approaches, you get closer and closer to the entropy of natural language (or whatever other domain), and if you've learned the exact distribution of language, IMO that would be "understanding".

2

royalemate357 t1_j9rphfc wrote

I think the biggest danger isn't AIs/AGIs pursuing their own goals/utility functions that involve turning all humans into paperclips. I think the "predict-the-next-word" AIs that are currently the closest thing to AGI aren't capable of recursively self-improving arbitrarily, nor is there evidence, AFAIK, that they pursue their own goals.

Instead the danger is in people using increasingly capable AIs to pursue their own goals, which may or may not be benign. Like, the same AIs that can cure cancer can also create highly dangerous bioweapons or nanotechnology.

45

royalemate357 t1_j6eq454 wrote

I tried it with ChatGPT, and it correctly identified the text as AI-generated when I used the output exactly. But then when I changed the capitalization of the first letter in the sentence and removed a few commas, it changed to human-generated (84%). It seems to me it's kind of a superficial detector and is quite easy to fool. Also, what is the false positive rate? If this tool or others are used to flag students for plagiarism, it had better be pretty close to zero.

4

royalemate357 t1_j61k9vy wrote

The speed and quality of score-based/diffusion models depend on what sampler you use. If you're using Euler's method to solve the ODE, for example, that might be slower than some of the newer methods developed for diffusion models, like Tero Karras' ODE solvers. AFAIK there isn't consensus on what the best sampler to use is, though.

I don't think it affects training convergence much, though, since it's more or less the same objective.
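
For concreteness, "Euler's method" here just means stepping the ODE with fixed-size steps; something like this sketch, where `drift_fn` stands in for whatever the score network parameterizes:

```python
import torch

def euler_sampler(drift_fn, x, ts):
    # ts: a decreasing time grid, e.g. torch.linspace(1.0, 1e-3, 100)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * drift_fn(x, t_cur)
    return x
```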

4

royalemate357 t1_j60xuup wrote

There's an implementation of score-based models from the paper that showed how score-based models and diffusion models are the same, here: https://github.com/yang-song/score_sde_pytorch

IMO their implementation is more or less the same as a diffusion model, except score-based models would use a numerical ODE/SDE solver to generate samples instead of the DDPM-based sampling method. It might also train on continuous time, so rather than choosing t ~ randint(0, 1000) it would be t ~ rand_uniform(0, 1).
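
I.e., roughly this difference when sampling training timesteps (a sketch, not their exact code):

```python
import torch

batch = 16
t_discrete = torch.randint(0, 1000, (batch,))  # DDPM-style discrete timesteps
t_continuous = torch.rand(batch)               # score-SDE-style, uniform in [0, 1)
```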

7

royalemate357 t1_j4qdfwj wrote

Tbh I don't think it's an especially good name, but I believe the answer to your question is that it actually uses 32 bits to store a TF32 value in memory. It's just that when the values are passed into the tensor cores to do matmuls, they're temporarily rounded to this 19-bit precision format.

>Dot product computation, which forms the building block for both matrix multiplies and convolutions, rounds FP32 inputs to TF32, computes the products without loss of precision, then accumulates those products into an FP32 output (Figure 1).

(from https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)
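
So in PyTorch the knob only changes how matmuls compute, not how tensors are stored (a sketch, assuming a recent PyTorch and an Ampere-class GPU):

```python
import torch

# Storage is ordinary float32: each of these still takes 4096 * 4096 * 4 bytes.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# Only the matmul compute rounds inputs to TF32 inside the tensor cores;
# memory layout and the output dtype stay float32.
torch.backends.cuda.matmul.allow_tf32 = True
c = a @ b
```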

3

royalemate357 t1_j4migdx wrote

TF32 is TensorFloat-32, which is a relatively new precision format for newer GPUs. Basically, when doing math, it uses the same number of mantissa bits as FP16 (10 bits) and the same number of exponent bits as normal float32 (8 bits). More on it here: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
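
Just to spell out the bit layouts (sign + exponent + mantissa):

```python
formats = {
    # (sign bits, exponent bits, mantissa bits)
    "FP32": (1, 8, 23),
    "TF32": (1, 8, 10),  # FP32-range exponent, FP16-size mantissa -> 19 bits
    "FP16": (1, 5, 10),
}
for name, (s, e, m) in formats.items():
    print(f"{name}: {s + e + m} bits ({e} exponent, {m} mantissa)")
```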

3