@adlrocha - The siege of open source software?

Digressions on Github Copilot and more

Jul 04, 2021

GitHub Copilot’s beta is out! And with it, a heated debate on the use of open source software by big tech companies. For those of you who haven’t read about it yet, GitHub Copilot is an AI tool in the form of (at least for now) a VS Code extension that helps you write code by “giving suggestions for whole lines or entire functions inside your editor”. GitHub Copilot is powered by OpenAI and is trained on billions of lines of public code. It uses an OpenAI engine called OpenAI Codex, which is more capable than GPT-3 at code generation (in the end, it seems to be a GPT-3 model trained on billions of lines of code).

So far so good, it is not as if Github Copilot was the first AI tool of its kind to help developers in their job of writing code. You are probably familiar with Tabnine, which has been around for a few years now and does basically the same as Github Copilot.

What’s all the fuss about then? Well, this may already have been implicit in the case of Tabnine, but since it was a small startup no one really cared that much. In the case of GitHub Copilot it is blatant: these tools have been trained using the code you’ve worked so hard to produce, and they are suggesting snippets of it to other developers.

Initial reactions to GitHub Copilot ranged from the usual “oh man, we are doomed, we’ll be out of our jobs in no time” to “my productivity will skyrocket with this tool”.

Then people started thinking a bit more deeply, and reactions looked a bit more like this:

Kelly Sommers @kellabyte

I heard GitHub is training their co-pilot with other peoples copyrighted source code thus allowing copyrighted source code to be injected into other peoples products. There was a time I felt GitHub was on the ethical side of things but that’s starting to fade.

I sometimes use this joke with my non-engineer friends to explain what I do for a living (and for fun): “we developers are just a dumb interface between StackOverflow and the application/system we want to implement”. So it was only a matter of time before counter-arguments in favor of GitHub Copilot along the following lines arose:

Sam Burns @SamBurnsTech

@kellabyte I hate to say this, but a lot of my code is just sort of borrowed from Stack Overflow...

But there’s a huge difference between using GitHub Copilot-generated code and code snippets from StackOverflow in your program: the source of the code. When you use code from a StackOverflow thread, the person answering the question is willingly sharing their code snippet with you (and others) to help you. Copilot-generated code may be inferred from pieces of code in one of your repositories that, for some reason, you may be quite hesitant to share: either because it is protected by a non-permissive license, or because you worked hard on it and you are too selfish to share it with others. It doesn’t matter. The thing is that you own that code, and you should be free to do whatever you want with it.

The fact that you are hosting the code on GitHub shouldn’t be reason enough for anyone to use it to train their AI. I don’t think we signed up for this when we created a GitHub account (or at least I am personally not aware of it, but maybe I have missed something in GitHub’s terms and conditions. Please let me know if this is the case).

Licenses are there for a reason

GitHub is full of open source code under permissive licenses that anyone can openly read and use in their own projects without having to ask permission. However, depending on the specific license used, there may be constraints, requirements, and limitations on the use of the code. We may be able to use code from a project as long as we don’t make a profit from the derived work, or the license may allow the use of the project’s code only as long as every work derived from it is open source under the same license.

GitHub Copilot would have been a great tool if it had been trained exclusively on code under permissive licenses that don’t require acknowledging the original author of the code (or impose other license-related requirements). Even better, the code used to train the model or make suggestions could be chosen according to the license of the project a developer is working on.
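To make that idea a bit more concrete, here is a minimal sketch of what license-aware filtering could look like. Everything in it is an assumption made for illustration: the `LICENSE_TRAITS` table, the `usable_sources` function, and the `can_attribute` flag are hypothetical names, the table is a heavy simplification of what these licenses actually say (it ignores many compatibility nuances), and nothing here describes how Copilot is really trained.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class LicenseTraits:
    requires_attribution: bool  # the author or license notice must be preserved
    copyleft: bool              # derived work must keep the same license


# Simplified, illustrative table keyed by SPDX identifiers. This is an
# assumption of the sketch, not legal advice and not an exhaustive list.
LICENSE_TRAITS = {
    "CC0-1.0": LicenseTraits(requires_attribution=False, copyleft=False),
    "Unlicense": LicenseTraits(requires_attribution=False, copyleft=False),
    "MIT": LicenseTraits(requires_attribution=True, copyleft=False),
    "Apache-2.0": LicenseTraits(requires_attribution=True, copyleft=False),
    "GPL-3.0-only": LicenseTraits(requires_attribution=True, copyleft=True),
}


def usable_sources(
    repos: Iterable[Tuple[str, Optional[str]]],
    target_license: str,
    can_attribute: bool = False,
) -> List[str]:
    """Return the names of repos whose code could feed suggestions for a
    project released under `target_license`.

    `can_attribute` is a hypothetical switch: True means the tool is able
    to surface the original author and license next to each suggestion.
    """
    allowed = []
    for name, license_id in repos:
        traits = LICENSE_TRAITS.get(license_id)
        if traits is None:
            continue  # unknown or missing license: all rights reserved, skip
        if traits.copyleft and license_id != target_license:
            continue  # copyleft code only flows into same-licensed projects
        if traits.requires_attribution and not can_attribute:
            continue  # the notice cannot be preserved, so skip
        allowed.append(name)
    return allowed


repos = [
    ("pd-utils", "CC0-1.0"),
    ("mit-lib", "MIT"),
    ("gpl-tool", "GPL-3.0-only"),
    ("no-license", None),
]
print(usable_sources(repos, "MIT"))
print(usable_sources(repos, "GPL-3.0-only", can_attribute=True))
```

With the strict default, only code that demands nothing in return is used. Once attribution can be shown and the target project shares the copyleft license, more sources become eligible. Unlicensed code is always excluded, since no license means no permission.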

Developers of open source software use licenses as the communication channel to let other developers and users know what they are allowed to do or not with their work. But my feeling after reading tweets like the following is that Github didn’t pay much attention to this when implementing and training Copilot:

Brian P. Hogan @bphogan

Hi. I know you’re excited about copilot. GitHub scraped your code. And they plan to charge you for copilot after you help train it further. It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.

And don’t get me wrong, I am totally in favor of tools like this that improve our productivity and can make our lives easier, even if they need billions of lines of code to be trained, as long as this is done the right way. If you’ve contributed to open source software, or probably just hosted your code on GitHub even if it is not under a specific license, you are a small part of the reason why GitHub Copilot works. Have you been (or will you be) rewarded in any way for this contribution? Not at all. You’ll probably have to start paying a subscription if you want to use GitHub Copilot.

GitHub will be profiting from your work, probably even in the case where you explicitly stated in the license of your project that no one could profit from work derived from yours. Unfortunately, there is no easy way of enforcing this. And what would happen if everyone started doing things like this?

Brian P. Hogan @bphogan

What would be really funny Is if people who maintain popular repos Started putting in wrong code on purpose.

Many may argue that, in the same way you can use Google services for free in exchange for your data (which is essentially yours, and you are the only owner), GitHub can use your code to train their models in exchange for all that free hosting and unlimited private repositories you get. But while this is quite clear when you create a Google account, I don’t think it is that clear when creating one on GitHub.

Time to self-host our critical services?

All of this makes me quite sad. The fact that we rely more and more on big tech services in our day-to-day lives means that we are quite defenseless against them launching other Copilot-like projects using our hard work and personal data. You want to use GitHub? Then you have to deal with them doing what they want with your code. Period. This is just one more example of how broken the Internet and its dynamics are these days. But what can we do to solve it? I can’t see an easy solution, apart from building an Internet substrate that enables people to escape these twisted dynamics.

Without this substrate, where everyone can own their own piece of the Internet without having to rely on others, the only escape from the influence of big tech, and from identity shifts like GitHub’s, is to self-host all the critical services you rely on, like this person has done. Quoting from that website:

“I do not agree with GitHub's unauthorized and unlicensed use of copyrighted source code as training data for their ML-powered GitHub Copilot product. This product injects source code derived from copyrighted sources into the software of their customers without informing them of the license of the original source code. This significantly eases unauthorized and unlicensed use of a copyright holder's work.
I consider this a severe attack on the rights of copyright holders so therefore I cannot continue to rely on GitHub's services”

But this is not a panacea. Self-hosting every critical online service we depend on day to day is a lot of work. We have to worry about hosting the infrastructure for the service, maintenance, upgrades, security risks, etc. Of course it depends on the level of control you want over your services, but you most probably won’t be able to achieve the SLA of big tech services.

If in spite of all of these inconveniences you still want to start hosting some of these services yourself, this repo is a great start. It walks you through how to deploy a list of super useful services: your own VPN, web hosting, cloud storage, calendar, chat server, and a long list of other self-hosted open source alternatives.

Can open source software be closed?

Unfortunately, GitHub Copilot is not an isolated example of how big tech is besieging open source software. Visual Studio Code is another interesting case I recently learned about. You may be thinking that when you install Visual Studio Code on your machine, what you are using is a build of the open source code hosted in this repo. Well, apparently this is not the case, and you would be better off downloading the code and building it yourself.

“Microsoft’s vscode source code is open source (MIT-licensed), but the product available for download (Visual Studio Code) is licensed under this not-FLOSS license and contains telemetry/tracking. According to this comment from a Visual Studio Code maintainer:
When we [Microsoft] build Visual Studio Code, we do exactly this. We clone the vscode repository, we lay down a customized product.json that has Microsoft specific functionality (telemetry, gallery, logo, etc.), and then produce a build that we release under our license.
When you clone and build from the vscode repo, none of these endpoints are configured in the default product.json. Therefore, you generate a “clean” build, without the Microsoft customizations, which is by default licensed under the MIT license.”

This is why projects like VSCodium, a free/libre open source software binary of VSCode, have to exist. Apparently, every time we use the official VSCode build we are sending data to Microsoft. Some people may be aware of and comfortable with these practices, but others may think this is outrageous. Why aren’t these companies more transparent about what they do with their users’ data, so that users can at least make informed decisions about whether to use their products? Is it because they know what the answer would be?

This is a weaker case of the siege of open source software than GitHub Copilot, but still one worth being aware of. I personally don’t expect these kinds of cases to stop any time soon.

Elastic is another example of a company that has made a related move in this direction of besieging open source software, by changing the license of some of their projects (that millions of people probably depend on) to a more restrictive one to increase their profit. Again, I am not against companies profiting from their work and the projects they create. This is legit and awesome, and I personally would do the same. What I am against is “changing the rules of the game midway”.

I haven’t talked to any contributor to Elasticsearch, for instance, but I am really curious to know how they felt when they learnt that the open source project they had worked so hard on, which they thought was protected under a specific license, had eventually moved to a more restrictive one. They probably shared the values of the project they were contributing to, and overnight, because someone unilaterally chose to, one of the key foundations of the project they had voluntarily worked hard to build changed.

Developers should be more aware of the licenses of the projects they contribute to, and of their consequences, while the companies behind that software should be more respectful of their licenses and of their contributors. It all comes down to fairly rewarding everyone for their hard work, because open source software may seem free by design, but behind it there is a lot of hard work, and ethics should prevail. Even if this reward is just sticking to the initial values of the project out of respect for its contributors. Ask anyone: open source software is almost never about the money.

Can we fix it?

But coming back to potential solutions to the problem at hand: what if you want to self-host your own services without having to worry about the overhead of maintaining them and of guaranteeing a level of service good enough to keep them usable on a daily basis? Here is where a new substrate for the Internet is needed. A substrate where we can be in control of our data and our services. Regular readers of this newsletter know what is coming next: we need to fix the Internet, and decentralizing it to minimize our reliance on big tech is the first step towards this goal.

Filecoin and IPFS are good examples of how decentralization and web3 protocols can help us regain control and build self-hosted services with redundancy and a great SLA without the nightmare of having to maintain the infrastructure. With these protocols we collaboratively maintain the infrastructure. We share the burden between all the participants of the system. It is neither every man and woman for themselves nor delegating everything to big tech giants. It is something in between.

I am really optimistic about the future of the Internet and Web3. We are getting to the point where all the foundations are there. We now have to make it better than Web2, not only for the people behind Web3, but for the users of Web2, i.e. everyone else. You want to join this exciting endeavor? Ping me and let’s have a chat! For the rest, see you next week.

@adlrocha Weekly Newsletter
