Copyright is something of a minefield right now when it comes to AI, and there’s a new report claiming that Apple’s generative AI – specifically its ‘Ajax’ large language model (LLM) – may be one of the only ones to have been both legally and ethically trained. It’s claimed that Apple is trying to uphold privacy and legality standards by adopting innovative training methods.
Copyright law in the age of generative AI is difficult to navigate, and it’s becoming increasingly important as AI tools become more commonplace. One of the most glaring issues that comes up, again and again, is that many companies train their large language models (LLMs) using copyrighted works, typically not disclosing whether they license that training material. Sometimes, the outputs of these models include entire sections of copyright-protected works.
The current justification for why copyrighted material is so widely used as far as some of these companies to train their LLMs is that, not dissimilar to humans, these models need a substantial amount of information (called training data for LLMs) to learn and generate coherent and convincing responses – and as far as these companies are concerned, copyrighted materials are fair game.
Many critics of generative AI consider it copyright infringement if tech companies use works in training and output of LLMs without explicit agreements with copyright holders or their representatives. Still, this criticism hasn’t put tech companies off from doing exactly that, and it’s assumed to be the case for most AI tools, garnering a growing pool of resentment towards the companies in the generative AI space.
The forest of legal battles and ethical dilemmas in generative AI
There have even been a growing number of legal challenges mounted in these tech companies’ direction. OpenAI and Microsoft have actually been sued by the New York Times for copyright infringement back in December 2023, with the publisher accusing the two companies of training their LLMs on millions of New York Times articles. In September 2023, OpenAI and Microsoft were also sued by a number of prominent authors, including George R. R. Martin, Michael Connelly, and Jonathan Franzen. In July of 2023, over 15,000 authors signed an open letter directed at companies such as Microsoft, OpenAI, Meta, Alphabet, and others, calling on leaders of the tech industry to protect writers, calling on these companies to properly credit and compensate authors for their works when using them to train generative AI models.
In April of this year, The Register reported that Amazon was hit with a lawsuit by an ex-employee alleging she faced mistreatment, discrimination, and harassment, and in the process, she testified about her experience when it came to issues of copyright infringement. This employee alleges that she was told to deliberately ignore and violate copyright law to improve Amazon’s products to make them more competitive, and that her supervisor told her that “everyone else is doing it” when it came to copyright violations. Apple Insider echoes this claim, stating that this seems to be an accepted industry standard.
As we’ve seen with many other novel technologies, the legislation and ethical frameworks always arrive after an initial delay, but it looks like this is becoming a more problematic aspect of generative AI models that the companies responsible for them will have to respond to.
The Apple approach to ethical AI training (that we know of so far)
It looks like at least one major tech player might be trying to take the more careful and considered route to avoid as many legal (and moral!) challenges as possible – and somewhat surprisingly, it’s Apple. According to Apple Insider, Apple has been pursuing diligently licensing major news publications’ works when looking for AI training material. Back in December, Apple petitioned to license the archives of several major publishers to use these as training material for its own LLM, known internally as Ajax.
It’s speculated that Ajax will be the software for basic on-device functionality for future Apple products, and it might instead license software like Google’s Gemini for more advanced features, such as those requiring an internet connection. Apple Insider writes that this allows Apple to avoid certain copyright infringement liabilities as Apple wouldn’t be responsible for copyright infringement by, say, Google Gemini.
A paper published in March detailed how Apple intends to train its in-house LLM: a carefully chosen selection of images, image-text, and text-based input. In its methods, Apple simultaneously prioritized better image captioning and multi-step reasoning, at the same time as paying attention to preserving privacy. The last of these factors is made all the more possible for the Ajax LLM by it being entirely on-device and therefore not requiring an internet connection. There is a trade-off, as this does mean that Ajax won’t be able to check for copyrighted content and plagiarism itself, as it won’t be able to connect to online databases that store copyrighted material.
There is one other caveat that Apple Insider reveals about this when speaking to sources who are familiar with Apple’s AI testing environments: there don’t currently seem to be many, if any, restrictions on users utilizing copyrighted material themselves as the input for on-device test environments. It’s also worth noting that Apple isn’t technically the only company taking a rights-first approach: art AI tool Adobe Firefly is also claimed to be completely copyright-compliant, so hopefully more AI startups will be wise enough to follow Apple and Adobe‘s lead.
I personally welcome this approach from Apple as I think human creativity is one of the most incredible capabilities we have, and I think it should be rewarded and celebrated – not fed to an AI. We’ll have to wait to know more about what Apple’s regulations regarding copyright and training its AI look like, but I agree with Apple Insider’s assessment that this definitely sounds like an improvement – especially since some AIs have been documented regurgitating copyrighted material word-for-word. We can look forward to learning more about Apple’s generative AI efforts very soon, which is expected to be a key driver for its developer-focused software conference, WWDC 2024.