Submitted by Singularian2501 t3_yijfkw in MachineLearning
ServiceNow and Hugging Face have released a 3.1TB dataset of permissively licensed code in 30 programming languages. This is about 4x larger than the dataset used to train GPT-3 (though obviously ‘code only’), and 3x the size of CodeParrot, the next largest released code dataset.
Paper: https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i_7G5Bb/view
Hugging Face: https://huggingface.co/datasets/bigcode/the-stack
Twitter: https://twitter.com/BigCodeProject/status/1585631176353796097
Download The Stack: https://hf.co/BigCode
Source: https://twitter.com/BigCodeProject/status/1585631176353796097
nomadiclizard t1_iujxwax wrote
I'm curious which 'permissive' licenses have terms permitting the use of the code as training data in machine learning algorithms. Are we assuming that licenses which allow code to be modified or redistributed also include this right?
What if a commercial for-profit company trains on a lot of copyleft code, then commercialises the result and refuses to release the model? Is that ethical?