I'd just like you to know that code with permissive licensing with attribution requirements are possibly unsuitable for training set inclusion. I'm bringing this to your attention not as a lawyer, but as a maintainer. Ask your own council. However, attribution requirements usually means derivatives must retain attribution of the original author. LLMs are apparently well-known to occasionally spit out exact derivatives, but without satisfying attribution requirements, which suggests this practice could be illegal.
I therefore request you at the very least process opt-out requests in retrospect for pre-existing data sets to fix this. However, just to stress this again, I'm not a lawyer and this isn't legal advice. But at least from the outside, this looks troubling.
For example, it appears you included repositories of mine that have attribution requirements:

I don't understand how StarCoder would possibly satisfy them.
I'd just like you to know that code with permissive licensing with attribution requirements are possibly unsuitable for training set inclusion. I'm bringing this to your attention not as a lawyer, but as a maintainer. Ask your own council. However, attribution requirements usually means derivatives must retain attribution of the original author. LLMs are apparently well-known to occasionally spit out exact derivatives, but without satisfying attribution requirements, which suggests this practice could be illegal.
I therefore request you at the very least process opt-out requests in retrospect for pre-existing data sets to fix this. However, just to stress this again, I'm not a lawyer and this isn't legal advice. But at least from the outside, this looks troubling.
For example, it appears you included repositories of mine that have attribution requirements:
I don't understand how StarCoder would possibly satisfy them.