Sylv__ t1_j2yrga7 wrote
Reply to comment by faschu in [Discussion]: Quantization in native pytorch for GPUs (Cuda)? by faschu
Well, you can always debug / try out quantization configs with fake quantization on GPU, and once one is good enough for you, move to TensorRT, although AFAIK the support in TRT is quite limited. Of course, this will only let you benchmark configs for prediction quality, not speedup.
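Something like this minimal sketch shows the idea (the tensor, range, and scale choice below are just illustrative, not from any particular model):

```python
import torch

# Minimal sketch: simulate int8 rounding on a weight tensor with PyTorch's
# built-in fake-quantization op (per-tensor affine, symmetric int8 range).
device = "cuda" if torch.cuda.is_available() else "cpu"
w = torch.randn(256, 256, device=device)

# In a real setup the scale / zero-point would come from an observer run on
# calibration data; here we just use the max-abs value for illustration.
scale = (w.abs().max() / 127.0).item()
zero_point = 0

w_fq = torch.fake_quantize_per_tensor_affine(w, scale, zero_point, -128, 127)

# w_fq stays in fp32 but carries the int8 rounding error, so you can measure
# the accuracy impact of a quantization config without any int8 kernels.
print((w - w_fq).abs().max())
```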
Maybe there will be support for quantized kernels in torchinductor? I recall reading about this in a GitHub issue at some point.
Otherwise you could try bitsandbytes, and pass the right argument to do all computations in 8-bit.
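One way to use it is through the transformers integration (a hedged sketch; the model name is only an example, and you need `accelerate` and `bitsandbytes` installed):

```python
# Sketch of int8 loading via the transformers integration of bitsandbytes.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # example model, swap in your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,  # routes nn.Linear matmuls through bitsandbytes int8 (LLM.int8())
)

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```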
The authors of SmoothQuant also implemented torch-int, a wrapper around CUTLASS for int8 GEMM. You can find it on GitHub!
Sylv__ t1_j2sukvs wrote
What's the TL;DR of the novelty here?
Sylv__ t1_j1zznu0 wrote
This is awesome!
Sylv__ t1_iww823x wrote
Reply to [R] RWKV-4 7B release: an attention-free RNN language model matching GPT-J performance (14B training in progress) by bo_peng
Plot twist: the model getting integrated in transformers lib ( ͡° ͜ʖ ͡°)
Sylv__ t1_itu2oar wrote
Reply to [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee
Impressive work! Thank you for open-sourcing it.
Sylv__ t1_j65ib3y wrote
Reply to [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot by Secure-Technology-78
already posted a few weeks ago, thanks for the low-effort post that just links to arXiv