AI in software dev: Who owns generated code?

Blog - David Orolin | Senior Programmer

Blog 04. 01. 2024 12 minutes

An increasing number of software developers are embracing generative AI tools in their everyday work. And it’s natural to be concerned and ask questions about AI-generated content, such as:

What happens to the data sent to the AI model? Is code from our closed-source application going to appear in the development tools of a programmer on the other side of the planet? Who owns the content generated by an AI tool? And who is to blame if an AI tool suggests someone else’s code verbatim and breaches their intellectual property?

We usually can’t see behind the closed doors of these tools and, with some exceptions (like Tabnine’s Local Machine Mode in Tabnine Pro), we only get limited control over what happens with the data we share with them. At that point, there is only one place to turn: the user agreements and privacy policies of the individual tools. (In this article, we’re going to focus on ChatGPT, Google Bard, Tabnine and GitHub Copilot.)

I have read through the agreements and other documents that you agree to when you start using these tools. I was pleasantly surprised by how well most of them are written. (Well, except for Tabnine. The Tabnine documents were clearly written by Real Lawyers.) I then checked my notes against a custom GPT I created for this purpose. I’d also like to point out that this article reflects the situation at the time of writing, which is January 2024. The information provided here will evolve along with AI and the legal rulings surrounding it.

Brief introduction of the AI tools

I have chosen these tools due to their popularity and prevalence on today’s market. Two of them are specifically created to increase the effectiveness of writing code and the other two are so widespread that they’re almost certainly used to that end.

GitHub Copilot and Tabnine are tools (copilots) which are directly integrated into the working environment of developers and, when used correctly, can dramatically improve their productivity. They automatically read relevant context (like the currently edited file) and send it to the model, and the developer gets back generated content such as a finished code block.

ChatGPT (by OpenAI) and Google Bard are tools with a broader focus, but they can still be used efficiently when developing software, whether for research or for generating scripts and code snippets. These tools only have access to the data that the user explicitly shares by typing or pasting it into the chat window.

What do they do with user data? Are they learning from it?

Both copilot tools clearly specify how they handle data from the developer’s environment. First, the data is sent to the model, where it is analyzed; once the requested operation has finished, the data is deleted¹,². Additionally, GitHub Copilot claims that the model never trains on user data¹.

The terms for Tabnine are a bit more involved - the user can pick whether they want the model to train on their data (at the time of writing, this option was only available in the Pro version of Tabnine). When a user opts into this feature, the generated content based on that data will only be available to that specific user. The content will never find its way to other Tabnine users²,³.

With individual licenses, OpenAI retains user data and may process it further as de-identified information⁴. ChatGPT does train on user data, but users can disable this by opting out⁵.

Owners of Enterprise licenses get to choose how long user data is retained. OpenAI also claims that the stored data is encrypted and that it’s only used to train the customized enterprise model⁶.

"Plaintiff can point to no case in which a court has recognized copyright in a work originating with a nonhuman. We are approaching new frontiers in copyright."

Judge Beryl A. Howell | US District Court for the District of Columbia

Who owns the generated content?

Before delving deeper into this specific can of worms, I’d like to emphasize that there is no international standard for AI-generated content. There are some court precedents, but we can’t base any assumptions on those yet. Right now a whole array of court cases is underway, and these will dictate the future of the content that AI trains on. This is a quickly evolving topic, so it’s highly likely that the terms for AI tool usage will also change as time goes on.

The only thing we can currently analyze is the intent of creators of these tools. Luckily, that seems quite permissive at this point. GitHub Copilot⁷, Tabnine³ and ChatGPT⁵ all grant users full ownership of the generated content.

Which brings us to the more complex question:

What about ownership of the content the models trained on?

Although the models are supposed to behave in a way that prevents them from reproducing their training data verbatim in generated content, it has turned out that such situations can arise.

Tabnine solved this issue by only training on code with permissive licenses⁸. In other words, even if it generated content that’s a complete copy of source data, there would be no breach of intellectual property laws, because licenses of said data allow its distribution in this manner. However, in spite of that, Tabnine still makes it very clear that they are not to be held responsible for the tool generating copyrighted material⁹.

GitHub claims that Copilot was trained on code available in public repositories¹⁰. On one hand, there’s no mention of the licenses of said repositories. On the other hand, it does allow users to “Block suggestions matching public code”, a setting that should actively compare the generated content with publicly available code and suppress matching suggestions. I strongly suggest using this setting - GitHub even withdraws its Defense of Third Party Claims commitment if this content is not blocked¹¹.

Analyzing what exactly ChatGPT was trained on is wildly out of scope for this article, but at this point it’s more or less certain that the training data also included copyrighted material. Although there is very little chance that such material would ever appear in generated content, OpenAI accommodates this possibility in their “Copyright Shield” program, where they offer to cover the costs of copyright lawsuits for owners of the enterprise license and users of the API. And just like Tabnine, OpenAI holds the user responsible for damages caused by generated content¹². (Incidentally, I have not found a single mention of Copyright Shield in ChatGPT’s customer agreements.)

So what about Google Bard?

Now we’re going to find out who’s been paying attention, because when I listed the tools to talk about in this article, there were four of them, but I have only talked about three in the previous sections. There’s a reason for that: I couldn’t find any official answers to these questions from Google. So I’d lean towards the safe approach and assume the worst, although we don’t know the answers to any of the questions posed here for sure.

Actually, when it comes to the Google Terms of Service, I think what they’re implying is that Google is allowed to use user data for improvement of its services. I could not find anything about ownership of the content generated by Google Bard.

Let’s sum it all up

The first important aspect that (almost) all AI tools share is that all original content they generate belongs exclusively to the user. The vendors are also rather careful and leave all responsibility for the aforementioned content on the shoulders of the user as well, including things the user can’t possibly be aware of, such as the tool generating copyrighted content.

When it comes to the tools aimed directly at programmers (GitHub Copilot, Tabnine), the vendors clearly take data protection very seriously. That includes a guarantee that user data is not stored server-side and that the copilots don’t train on it (unless, in Tabnine’s case, the user opts in). For chat-based tools, data protection is much more complex, although ChatGPT specifically does offer some options and use cases that prevent the model from training on user input.

There’s still no way around the fact that implementation of all of these tools is closed and we can’t access it. Big tech companies have proven many times in the past that there’s a big difference in what they claim to do with our data and what’s actually happening. That is why their promises regarding behavior of AI tools need to be taken with a grain of salt. The best we can do is to meticulously follow security standards and never keep passwords or other secrets in our code bases, never send them to the chat tools and generally make sure that the AI tool doesn’t get access to sensitive information.
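To make the last piece of advice a bit more concrete, here is a minimal Python sketch of masking obvious secrets in a snippet before pasting it into a chat tool. Everything in it is an illustrative assumption on my part: the pattern, the list of sensitive-looking names and the `redact` helper are hypothetical and are not part of any of the tools discussed here. It only catches quoted values assigned to names like `password` or `api_key`; unquoted values, multi-line secrets and unusually named keys will slip through, so treat it as a last-resort safety net rather than a replacement for keeping secrets out of the code base in the first place.

```python
import re

# Hypothetical, deliberately small rule: a sensitive-looking name,
# followed by "=" or ":", followed by a quoted value.
ASSIGNMENT = re.compile(
    r"(?i)(\b\w*(?:password|passwd|secret|token|api_?key)\w*\s*[:=]\s*)([\"'])(.*?)\2"
)


def redact(text: str) -> str:
    """Replace quoted values of sensitive-looking assignments with a placeholder."""
    return ASSIGNMENT.sub(
        lambda m: f"{m.group(1)}{m.group(2)}<REDACTED>{m.group(2)}", text
    )


if __name__ == "__main__":
    snippet = 'DB_PASSWORD = "hunter2"\nretries = 3\n'
    # The password value is masked; the unrelated line is left untouched.
    print(redact(snippet))
```

A check like this can run as a quick manual step or a pre-commit hook, but dedicated secret scanners use far more rules than this sketch does.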

Sources

1. GITHUB, chapter 6. Data, In: GitHub Copilot Product Specific Terms [online], 27-09-2023 (cited 01-01-2024), available at: https://github.com/customer-terms/github-copilot-product-specific-terms
2. TABNINE, chapter 2. Description of the Services, In: Terms of use [online], 31-07-2022 (cited 01-01-2024), available at: https://www.tabnine.com/terms
3. TABNINE, chapter 11. Intellectual Property Ownership, In: Terms of use [online], 31-07-2022 (cited 01-01-2024), available at: https://www.tabnine.com/terms
4. OPENAI, chapter 2. How we use personal information, In: Privacy policy [online], 14-11-2023 (cited 01-01-2024), available at: https://openai.com/policies/privacy-policy
5. OPENAI, chapter Content, In: Terms of use [online], 14-11-2023 (cited 01-01-2024), available at: https://openai.com/policies/terms-of-use
6. OPENAI, Enterprise privacy at OpenAI [online], cited 01-01-2024, available at: https://openai.com/enterprise-privacy
7. GITHUB, chapter 2. Ownership of Suggestions and Your Code, In: GitHub Copilot Product Specific Terms [online], 27-09-2023 (cited 01-01-2024), available at: https://github.com/customer-terms/github-copilot-product-specific-terms
8. TABNINE, chapter Summary, In: How Tabnine protects your code privacy [online], cited 01-01-2024, available at: https://www.tabnine.com/code-privacy
9. TABNINE, chapter 13. Third Party Material, In: Terms of use [online], 31-07-2022 (cited 01-01-2024), available at: https://www.tabnine.com/terms
10. GITHUB, About GitHub Copilot Individual [online], cited 01-01-2024, available at: https://docs.github.com/en/copilot/overview-of-github-copilot/about-github-copilot-individual#about-github-copilot
11. GITHUB, chapter 4. Defense of Third Party Claims, In: GitHub Copilot Product Specific Terms [online], 27-09-2023 (cited 02-01-2024), available at: https://github.com/customer-terms/github-copilot-product-specific-terms
12. OPENAI, chapter Limitation of Liability, In: Terms of use [online], 14-11-2023 (cited 02-01-2024), available at: https://openai.com/policies/terms-of-use
