Welcome everyone, thank you for joining us. Just a reminder, we're in the session handling academic copyright and AI research questions as the law develops. I'm Tim Vollmer, from the University of California, Berkeley Library's Office of Scholarly Communication Services. Our office helps students, researchers, and faculty understand scholarly publishing,
of course, copyright and intellectual property issues, and also the broader sort of sphere of information policy in their research, in their teaching and writing. And at least for a few years now, we've been interested in and also contributing to copyright and licensing issues that intersect with text data mining and more recent discussions
around artificial intelligence in research and scholarship. And now I'd like to have my colleague introduce himself. Hi, I'm Jonathan Band. I'm a copyright lawyer in Washington, D.C. I represent ARL, as well as ARL and ALA together in the context of the Library Copyright Alliance. My practice is more on the policy side of things, and now a lot of it is the policy relating
to the intersection between copyright and generative artificial intelligence. Great. So I just want to provide a little bit of an overview for our session today. Of course, we know that there is a tremendous amount of interest in current artificial intelligence tools and platforms.
And there are also a lot of legal and policy questions around the copyright, and also the contractual, implications of artificial intelligence as it intersects with higher education and academic libraries. And Jonathan and I will walk you through some of the main issues that we see. So first we'll give an example of how AI is being used in scholarly research and how it
aligns with existing practices such as text and data mining and computational analysis. Then we'll touch briefly on some copyright basics, because it's important to understand these fundamentals in the context of this conversation. Next we'll give an overview of three copyright issues that are implicated by generative artificial intelligence.
Then we'll explore why contractual arrangements can be especially concerning, because they can prevent researchers from leveraging their fair use rights under copyright law. Then we'll take a look at what we're seeing in academic libraries with regard to publishers asking libraries to enter into more restrictive licensing terms that could limit scholarly
research using any artificial intelligence application. And we'll outline some approaches that we might take that would allow our researchers to use licensed content for their scholarship without negatively affecting publishers. And then we'll have some time for Q&A at the end. So as we've already seen in various sessions yesterday, there are so many ways that artificial
intelligence is being used in computational research on our campuses. Now, some scholars want to be able to utilize generative AI tools, and we should protect that. And generative AI is perhaps the biggest new thing in the public mind right now, with tools like ChatGPT, of course, and also Gemini, Stability AI, and DALL-E for images, and other platforms
where users input a prompt and it generates new content. But other scholars have relied on non-generative AI tools for many years. And one way we've seen artificial intelligence used in research practices is in extracting information from copyrighted works.
So researchers are using this to categorize or classify relationships in or between sets of data. Now, sometimes this is called analytical AI, and it involves processes that are considered part of text and data mining. So we know that text and data mining research methodologies can, but don't necessarily need to,
rely on artificial intelligence systems to extract this information. So let me give you an example. At UC Berkeley professor David Bamman has been studying trends in literature and film. And right now he's working on a project to look at the representation of firearms in cinema. And in order for professor Bamman to assess how common guns are in films and the types
of circumstances in which they appear he has to find instances of firearms in thousands of hours of movie footage. Now to do that he needs an algorithm to search for and identify those guns. But first he has to show an AI tool what a gun looks like by showing it some pictures of film stills of guns.
So this way the artificial intelligence tool can learn how to identify a gun before it then goes off and looks for other instances of guns in a much larger body of works. So this is a classification technique that involves artificial intelligence, but not generative AI. By that I mean the AI is not creating new images or footage of guns as part of his research.
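To make the distinction concrete, here is a minimal, purely illustrative sketch of the analytical (non-generative) train-then-classify workflow described above. The feature vectors and labels are hypothetical stand-ins; a real pipeline would extract features from film stills with a vision model, but the overall shape is the same:

```python
# Toy sketch of analytical (non-generative) AI classification:
# "train" on labeled example feature vectors, then classify new ones.

def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(labeled_examples):
    """labeled_examples: {label: [feature_vector, ...]} -> {label: centroid}"""
    return {label: centroid(vecs) for label, vecs in labeled_examples.items()}

def classify(model, vector):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vector))

# Hypothetical training data: features extracted from labeled film stills.
examples = {
    "gun":    [[0.9, 0.1], [0.8, 0.2]],
    "no_gun": [[0.1, 0.9], [0.2, 0.8]],
}
model = train(examples)
print(classify(model, [0.85, 0.15]))  # -> gun
```

The point is that the "training" step only summarizes the labeled examples into statistics used to sort new inputs; nothing new is generated.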
And scholars have relied on this kind of non-generative AI training within text and data mining for a long time under the fair use doctrine. So we'll explain the importance of fair use in the artificial intelligence context next. But to understand why fair use is important in the first place, we have to understand what rights copyright actually provides.
Now, copyright is awarded to authors as an incentive to create new things. And the reward for creating this original expression is that the author of that expression is the only person who can undertake various uses of what they create. The specific exclusive rights that a copyright owner gets over their original expression
are reproduction, distribution, public display, public performance and the creation of derivative works. So as an example, if I write a book, I'm the only one who gets to make copies of it, to sell it or give it out to other people, display it online and so on. But of course an important aspect of copyright is that authors receive these exclusive rights
only for a limited period of time. If they got the rights indefinitely then there wouldn't be any new scientific advances or scholarship because existing works could never be used in support of new creations. And then there are also limitations and exceptions, essentially safety valves to the exclusive rights of the copyright owner.
And the most relevant exception, in the United States at least, is fair use. It provides for the ability to use copyrighted works without permission in certain instances, like scholarship, research, and teaching. So how does this apply if you want to mine a corpus or train an artificial intelligence
tool? Well, if the copyright law wasn't flexible at all, you'd need to get permission for each and every piece of content that you want to download and conduct your analysis on. And scholars need to rely on fair use to perform the acts of computational research, with or without artificial intelligence.
And we'll talk more later about why we believe that these computational uses are and should continue to be considered fair uses. But before we circle back on that, Jonathan is going to provide more detail about how copyright issues intertwine with generative artificial intelligence. Thank you very much, Tim.
So as you can tell, we're covering a lot of real estate. And so there'll be time in the Q&A to go back and talk about various things in greater detail. When OpenAI released ChatGPT last year, there was this explosion of interest
in the media about AI, and especially the copyright issues relating to AI, and then there was a whole bunch of cases filed. It seemed like every day a new case was being filed. So there are developments in these various cases, and it seems a little overwhelming and very
confusing. And so I'm going to try to tease out some of the various issues involved in all this litigation. And I think one of the reasons why it's so confusing and it could be, you know, maybe you've experienced this yourself when you sort of read these various articles about
copyright and generative artificial intelligence is that we're really talking about three very different issues. But those three different issues get mixed together, and that's where things get confusing. If we could just talk about one issue at a time, we'd be okay. But these three issues get mixed together, and even when you're reading
articles in reputable magazines and newspapers, they kind of slide from one of these issues to the other, and I as a copyright lawyer realize, okay, now we're switching topics. But I would think that for a lot of readers it's really confusing, because you say, wait a minute, what are we talking about now? So if you get nothing else out of this presentation, get to understand
this slide, okay: there are really three different issues that are being implicated by artificial intelligence. And this is true of all artificial intelligence, not just generative, but it's particularly true of generative AI. So the first issue is: does the ingestion of content for training the AI constitute copyright infringement? And that is to some extent what Tim was talking about: you're ingesting all of these movies so that you can then look for the guns, right? But first you have to ingest the movies. So is that ingestion of the movies
an infringement? The second issue, and this is more specific to generative AI, is: does the output infringe? Now, in the example that Tim gave, there really is no output, or the output is just a list of movies, or data that comes out of them. So that's not an issue there. But certainly with generative AI, when the AI is producing an image or
producing text, then the question is: is that output infringing? And finally, the last issue is: is that output copyrightable? To what extent can that output belong to someone, can it receive copyright protection, particularly
with respect to the user who put in all the prompts and it generates a work: to what extent does the user have a copyright in that? And the Copyright Office has actually done a lot of work particularly in that area. And these are three very, very different questions, right? To some extent, the first two are sort of saying, ooh, maybe
AI is bad and leads to infringement. And the last one is like, ooh, maybe AI is good and leads to me being able to make some money, right? I'll be able to create works and then be able to exercise copyright over them. But even the attitude towards copyright reflected in these questions is different, which all adds to the confusion.
Is this good? Is this bad? Is it pro-copyright or anti-copyright? But anyway, these are the three very different questions. So let's look at the first question. And in many ways, for many people, this is maybe the most interesting question, the one where there might not be clear answers, at least with respect to some
perspectives. So training AI requires the ingestion of a large volume of works. Now, it could be that as we go forward with the technology, maybe it doesn't need as much ingestion as we think it does. But currently, with a lot of the AI systems, the belief or the understanding is that
you really need to ingest lots of works. And one of the reasons you need to ingest a lot of works is to make sure what you're getting is representative, right? You don't want to just ingest, let's say, books that are written in countries in the Global North, or in the West; you want to have broader representation,
and the same thing with images. But again, how many works you really need to ingest is one of those things we're still discovering; it could be that you don't need quite as much as we thought we did. Now, once all these works are ingested into the training database, you have software that analyzes the works and discovers patterns, trends, and relationships
that then get separated from the works, and what you end up with is basically the model. It could be a large language model, for example. Supposedly, no expression, none of the original expression from the works that were ingested, ends up in the model.
The model should just be this kind of mass of relationships and patterns and algorithms and all that kind of stuff, but no expression itself. Now, again, that's one of those issues that we're going to see unfold in litigation. It could be that large language models currently have
this problem of what's called memorization: they seem to regurgitate expression even though they're not supposed to. That might be a problem that gets solved in the near future, but right now there does seem to be this issue that, in some of the results, either intentionally or accidentally, the model does regurgitate expression.
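As a toy illustration of the idea that a trained model stores relationships rather than the works themselves (a drastically simplified sketch, not any vendor's actual architecture), consider a "model" that keeps only word-to-next-word transition counts:

```python
from collections import Counter, defaultdict

def build_model(texts):
    """'Train' by counting word-to-next-word transitions.
    The model stores frequencies and relationships, not the source texts."""
    model = defaultdict(Counter)
    for text in texts:
        words = text.split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def most_likely_next(model, word):
    """Predict the most frequent follower of `word`."""
    return model[word].most_common(1)[0][0]

# Hypothetical two-sentence training corpus.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
model = build_model(corpus)
print(most_likely_next(model, "sat"))  # -> on
```

Note that even this toy model can effectively memorize: if a phrase appears in only one training text, following the most frequent transitions can reproduce it verbatim, a miniature analogue of the regurgitation problem just described.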
And we'll get to that in a minute in more detail. Now, I'll also just note that most countries that have looked at the ingestion issue (not that many have) have more or less decided that ingestion does not constitute infringement.
And I have a huge asterisk next to that statement; I'm sure a lot of people would disagree with it. But just as a simple statement, let's go with it: in most countries, it would be considered not an infringement to ingest. Okay, so let's dig deeper into that notion.
So as Tim indicated, in the U.S. we would be looking at this as a matter of fair use. And there is precedent, not in the generative artificial intelligence area, but in other, you could say more basic, forms of artificial intelligence and related areas, where courts have found the assembly of these large databases for training purposes, such
as for search engines or plagiarism detection software, to be non-infringing. So think of the iParadigms case that involved plagiarism detection software, or Perfect 10 v. Google, or the Google Books cases. In that whole network of litigation,
the courts have tended to find this to be non-infringing, largely because the output was non-infringing. Okay, so they're saying: as long as what's coming out is not expression, we're sort of not going to pay attention to all the copies that are being made inside
the machine. And then the question will be: well, in generative AI, there is expression coming out. Now, it might not be the same as the expression that went in, and we'll get to that in a little bit. But at least the underlying theory in these cases has been:
we're not going to worry too much about what's going on inside the computer, because of the recognition that, gee, computers do a lot of copying. That's how they work. But once this explosion of generative AI happened last year, everyone
started saying, well, are those cases relevant? And a lot of cases were filed. And again, every week or two there are more cases; some of them are class actions. The biggest case pending right now is probably The New York Times v. OpenAI. But there have been several others, probably at least a dozen if not more,
filed where various groups of authors in particular have sued OpenAI and the other vendors. A lot of the cases are being brought by the same law firm. We're starting to see some preliminary decisions in these various cases. So there's a lot going on. And again, on the ingestion side, the argument is: does all this ingestion constitute
fair use? There really has been no definitive decision yet. Interestingly, the Israeli Ministry of Justice issued an opinion letter late last year saying that under their fair use law, which is based on our fair use provisions, all the copying related to machine learning is a fair use.
But so far, that is sort of the most definitive government statement. We will probably be hearing more by the end of the year from the U.S. Copyright Office. The Copyright Office has launched an entire study over the past year looking at different aspects of copyright and AI, breaking it up along the lines of the three different
questions that I suggested before. And one of the issues they will be coming out with a paper on at the end of the year is this ingestion issue. I have a feeling they'll probably say: well, on the one hand, on the other hand; here's the argument some people make that it is a fair use, here's the argument some people make that it isn't, and it's ultimately going to be resolved in the courts.
Now I just want to digress a little bit to talk about what's going on in Europe. A few years ago, more like four or five years ago by now, the European Union came out with the Copyright in the Digital Single Market Directive, which looked at some of these issues. Under Article 3 of the directive, the copying necessary for text and data mining
for non-commercial scientific research purposes is considered not to infringe copyright. And Article 4 basically allows the copying necessary for text and data mining for other purposes, meaning purposes other than non-commercial scientific research.
But that's subject to an opt-out. And that's an important point; I'll get to it in a second. Now, in addition to these two provisions, Article 3 and Article 4 of the directive, there's also a proposed Artificial Intelligence Act that just passed
the Parliament but still hasn't been passed by the Council of Ministers. That's related. It's not copyright-specific, but it would require a disclosure of all the inputs. Now, there's the question of how granular that disclosure would have to be, but someone creating a training database and then coming out with an AI system would
have to provide some indication of what was in the training database, so that, I guess, a person could then figure out whether they want to sue, or whether they opted out, and so forth. Now, why is this opt-out issue so important? It turns out that, at least until relatively recently, the way most of these training
databases were assembled was by having bots crawl the web and just download lots and lots of material. That's how you get the training database. There has been, for a while, this concept of a bot exclusion header, basically a "do not enter" sign that you put on a website.
It's a standard that's been around for decades at this point. But the problem in the context of AI has been: let's say you really didn't want your website crawled for training purposes. If you used the bot exclusion header, that meant your site would not be crawled
for any purpose, either for generative AI or for search. In other words, you would basically be saying: my website is going to be invisible. And people obviously didn't want to do that. So you could say that's forcing a website publisher to make a choice: either you share your material for training purposes, or you become invisible.
That's not a great choice. But it turns out that more and more of the search engines have come up with refinements, extensions to the bot exclusion headers. So Google and Microsoft now have extensions so that if you want to allow your site to be crawled for search purposes, but not for
AI training purposes, you can signal that. And I imagine the other search engine and AI companies are going to be working on that too. Obviously, it would be better if there were one standard, so that you don't have to comply with multiple variants; I'm sure there are differences between what Google requires
and what Microsoft requires. But this is conceivably a way out. At the very least, it will satisfy the requirements in the European Union, so that a person can use a bot exclusion header to opt out from crawling, but at the same time not have the downside of becoming invisible.
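As a sketch of how this opt-out signaling works in practice, here is a hypothetical robots.txt that permits ordinary search crawling but blocks an AI-training crawler, checked with Python's standard-library parser. (Google-Extended is Google's published token for its AI-training crawler; other vendors use their own tokens, which is exactly the multiple-variants problem mentioned above.)

```python
import urllib.robotparser

# Hypothetical robots.txt: opt out of AI-training crawling
# (Google's token for this is "Google-Extended") while leaving
# the site visible to every other crawler, including search bots.
robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ordinary search crawling is still allowed...
print(rp.can_fetch("Googlebot", "https://example.com/article"))        # True
# ...but AI-training crawling is opted out.
print(rp.can_fetch("Google-Extended", "https://example.com/article"))  # False
```

So the site stays in search results while signaling that its content should not be collected for training.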
And then, shifting back to the US. So this is going to become widespread in Europe, right? Under the AI Act, any company that's going to be doing business in Europe is going to have to comply with the standards created by the AI Act, and by the Copyright in the Digital
Single Market Directive. What that means is that this will in effect become a global approach, because all of the US companies that are engaged in AI services want to provide those services in Europe. And so they're going to have to comply with the AI Act, which means they're going to have
to comply with the directive. So then you might ask: well, does that resolve the fair use question? Even to the extent that one can say, well, gee, if you're ingesting someone's website without their permission, is that fair
or is it not fair? Well, it could be that the courts' analysis will be influenced by this opt-out feature, meaning that if a website publisher has the option of opting out and chooses not to, that could really influence the fair use analysis significantly.
And then just one other point from the international perspective: both Singapore and Japan have statutes that specifically deal with text and data mining. So this is in addition to the Copyright in the Digital Single Market Directive, which addresses it. Singapore and Japan also have provisions that basically allow the reproduction necessary
for text and data mining. Again, all of this is on the ingestion side; these are different issues relating to the ingestion piece of the puzzle. So that's actually not an insignificant part of the world, right? You have all of Europe, Singapore, Japan, and Korea apparently is in the process of developing something along these lines.
The UK, even though it's obviously no longer part of the EU, has something very similar to the requirements of the Digital Single Market Directive. So that's a big chunk of the developed world. Okay, let's go to the second question.
Does the output infringe? Now, in many respects, this is an issue that really can be looked at under traditional copyright principles in the US. When you're looking at the output, you could sort of say: forget the fact that it was output from a computer, right?
Let's just say I start distributing an image and someone says, no, no, no, that's my image. You would analyze that under a substantial similarity analysis. Typically you would first ask: did the alleged infringer have access to the original work?
And if they did have access to it, then you get to the second question: is what they've produced substantially similar in protected expression to the original work? And if it is, then it's presumptively infringing. And there's been a lot of case law over the years.
It's very confusing, because determining whether something is substantially similar in protected expression is a very complicated issue. I mean, it's easy when it's verbatim the same or virtually identical; those cases are easy. But there are a lot of hard cases where there's similarity, and then the question is: how similar is it?
Are the similarities functionally dictated? Are the similarities dictated by the genre? This is where things get very murky and metaphysical, and they will likely get murky and metaphysical in the generative AI context.
But that's copyright law. It's always complicated once you get to that issue of whether something is substantially similar in protected expression. And so one can expect that there will be litigation in that area. And in the cases that have been filed so far, the cases have been
focusing on the ingestion issue. But they've also made claims about the output, saying that the output is too similar. So that's one of the things that will have to be resolved. But then the other question, at least in the generative AI context, as opposed to your normal run-of-the-mill litigation, is: what is the liability of the AI provider?
So clearly, if the AI produces, in response to a prompt, something that is substantially similar in protected expression to something that had been ingested, the first question is: what is the liability of the user who put in the prompt? And, especially if their prompts are very detailed, it would really
seem that that person would have direct infringement liability. They're the one who caused or directed the machine to create this image. On the other hand, the question is: what is the liability of the machine itself, or of the person who provided the machine? And that's going to be a little trickier.
Again, there are existing principles of copyright law dealing with secondary liability: contributory infringement, vicarious liability, and the Sony doctrine. Those principles will be applied to these cases, and we'll see how it comes out. Another argument that has been used in some of these cases is, even forgetting
the issue of substantial similarity, the plaintiffs are simply saying that any works coming out of the AI are derivative works, that they are per se derivative works. And that was one of the exclusive rights that Tim mentioned. So in some of the cases, the claim is that those derivative works infringe the derivative
work right. So far, the courts that have looked at this have basically said no: just because the AI has ingested your work does not mean that anything it produces infringes the derivative work right. You have to ask whether the output is substantially similar in protected
expression. If it is, then it might be an infringement. If it isn't, we don't even need to reach the question; we're not going to say that it is a per se derivative work. Now, the last copyright issue is: is the output copyrightable? And again, as I mentioned, this is an issue where the Copyright Office has spent a lot of time.
There have been a couple of decisions, first within the Copyright Office, and then they were appealed and went to court. These are kind of interesting issues. A lot of it has to do with the basic notion under US copyright law that only a work reflecting human originality can receive copyright protection. And that's true in most countries around the world.
So if a work is created entirely, or almost entirely, by the artificial intelligence, then it should not be copyrightable. On the other hand, if it reflects a lot of human creativity, let's say the user used lots and lots of prompts and then there were other refinements and editing after the fact, then that work almost certainly would be protectable.
There are issues, of course, of degree, and of the amount of disclosure. These are very interesting issues, but not the most relevant to this audience, so I don't think we'll deal with them much further, other than to point out something very interesting: even though we have this basic doctrine under copyright law that copyright protection
is only available with respect to works reflecting human originality. Think of the case involving Naruto, the monkey, right? You've probably all heard of that case, where someone set up a camera and a monkey clicked the camera. It's a great-looking picture, but ultimately the court found that the image was not copyrightable.
It was the monkey that posed himself and decided how close he was going to be to the camera, and he's the one who clicked the camera. Yes, the photographer did put the camera on a tripod in the jungle, but beyond that there wasn't any human originality involved in the creation
of the photograph. So no copyright in that image. And that, like I said, is the general rule. But in the UK, interestingly, there is a different rule, or there might be a different rule: the UK statute does provide for copyright protection for computer-generated works without a human creator.
So this doctrine that you need human authorship is different, at least in the UK, but like I said, I think that's the minority position. But again, at the highest level: three very different questions, three different issues. When you're talking about copyright and AI, at the very least, you need to make sure which issue you're talking about. Am I talking about ingestion?
Am I talking about output, or am I talking about copyrightability of the output? Thank you. Thanks, Jonathan. So I want to return to the issue of why training artificial intelligence in a research
context is and should continue to be a fair use. As Jonathan mentioned, there are several court cases right now that are going to opine on this. The US Copyright Office has a major study out that has received comments from thousands of people chiming in on these issues.
And we can look a little deeper into the four factors that go into making a fair use determination. In this context, the most important ones are factor one and factor four. Now, factor one considers the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes.
So, more aligned with what we're thinking about coming from academia with regard to scholarship. And the Supreme Court made clear in the Google V Oracle case that the inquiry into this first factor really depends large in a large measure on whether the copying at issue was what they said transformative.
So whether it adds something new with a further purpose or a different character. And then looking at factor four, it considers whether the effect of the use upon the potential market for or value of the copyrighted work. And we know that the courts and the copyright office have consistently recognized that
the non-expressive use of copyrighted works for purposes like computational analysis do not substitute for the copyright owner's ability to sell or communicate their copyrighted works to the public. So for these reasons, regardless of whether AI training is incorporated or not, the kind
of text data mining research being undertaken, and hopefully AI training undertaken, at our institutions should be treated as a fair use. But there's a question: if fair use is so essential to text data mining research, and if the relevant court cases have consistently found text and data mining research to be
a fair use, then why are libraries still very concerned about this? And one reason is because there's another challenge beyond simply copyright. And that is the various contracts or license agreements that publishers require libraries to sign to access research content for our users.
And this is generally described as the problem of contractual override. So in the United States, even though fair use is a statutory right, which means it's in the law, and even though, as I mentioned, there are court cases that confirm that practices like text data mining fall under fair use, there is no
protection against the practice where private parties can essentially contract around fair use by requiring libraries to negotiate for otherwise lawful activities such as conducting text data mining or training artificial intelligence tools for research. And academic libraries are forced to negotiate or even pay additional sums every year to try
to preserve our fair use rights for our campus scholars. So just to summarize here, even if a researcher's use is a fair use, there may still be an overarching agreement that restricts text data mining or the use of content in artificial intelligence applications.
Now we know libraries are negotiating general clauses in our license agreements that preserve fair use. So we advocate for language in our licenses to protect the ability of our users to use materials under section 107 of the U.S. copyright law, that is, their fair use rights.
And then in other instances, libraries are also negotiating for specific rights to enable, say, text data mining. Now although we're negotiating clauses in our license agreements that preserve fair use, and even though TDM has been confirmed to be a fair use, some publishers are inserting new restrictions either in text data mining clauses or separate artificial intelligence
clauses that could take away uses that would be allowed, or would normally be allowed, by fair use. And so now we're seeing that some academic libraries and also researchers in the U.S. are facing the threat of contracts overriding these statutory fair use rights. And we're already starting to see amendments from vendors that are attempting to impose
bans or otherwise limit the use of artificial intelligence tools with the content that we're licensing from them. So just as one example, here's some language from a recent proposed publisher amendment. And it requires that authorized users may not use the subscribed products in combination with an artificial intelligence tool, including to train an algorithm, test, process, analyze,
generate output, and/or develop any form of artificial intelligence tool. So basically a ban on any researcher's right to use the licensed content for any type of AI practice. So how can libraries negotiate sufficient artificial intelligence rights while still acknowledging
the concerns of publishers? And we believe that some vendors are attempting to curb AI usage because of three primary interests (and money, of course). So the first is the security of their licensed products and the fear that researchers will leak or release content that sits behind a paywall, or potentially train a public version
of an artificial intelligence tool using those subscribed materials. The second is that AI is being used to create a competing product that could substitute for the original licensed product and then undermine their share of the market. And then the third concern is basically a desire for additional revenue by charging extra
for their own licensed artificial intelligence products. Now as for the money issue, we think that academic libraries and consortia can work together to push back against additional charging. But the key point here is that we don't want to agree to contract terms that bypass user rights.
Now as for the first two concerns, security and competing products, these are probably more longstanding fears over the potential misuse of licensed materials. And we know that content providers are already able to impose contractual provisions to forbid,
say, the creation of derivative products or prevent the systematic sharing of licensed content with third parties. So therefore, if those things are already covered in the license agreement, then bringing it up again in the context of attempting to control artificial intelligence is not really required. So we think that libraries can address this by working with publishers to understand what
their concerns are. And as we mentioned, we think the primary issues are about competing products and then content leakage. So a library could come back with licensing language that satisfies those concerns by including text like the following. So in the section of the license which communicates the conditions on the use of the content, where
it says, you know, authorized users may not use the subscribed products in combination with an artificial intelligence tool to the extent doing so would (a) create a competing or commercial product or service for use by third parties, (b) unreasonably disrupt the functionality of the subscribed products,
or (c) reproduce or redistribute the original subscribed products to third parties. And further, a clause that might say something like: artificial intelligence tools may not be used without commercially reasonable information security standards to undertake, mount, load, or integrate the subscribed products on authorized users' servers or equipment.
So these are some things that we can do to make productive suggestions to mitigate some of these concerns from publishers while still allowing our users to leverage licensed content for particular uses. Now that proposed language that we've just reviewed over the last few slides authorizes
the use of a publisher's content for, like, a homegrown non-generative AI tool, or even a homegrown generative AI created by a researcher on our campus. And it would also cover the use of licensed content in an analytical AI tool that might be created by a third party. But of course there's an important final concern that we've heard from publishers.
So basically they don't want their content put into public versions of generative AI platforms, which in theory could then assist in training those tools and platforms. So there are additional measures that could address this concern in the license agreement. And the gist of the language here is that none of the subscribed materials are shared with
these third-party generative AI tools. So thinking about, like, ChatGPT or other public versions of those generative AI platforms. Now we're going to have to see how these other negotiations play out. Right now at UC Berkeley we're trying to build some negotiation case studies with the
license agreements that are coming across our desks. And we're trying to outline what different types of AI restrictions we're being presented with, and what kind of resolution we're able to reach in the different scenarios. Now certainly it's essential that more research libraries across the country become educated on these issues so that they can also negotiate fair terms.
And we need a coordinated front on our negotiating for AI rights as a part of the license agreement itself, rather than as another priced bundle for adding on additional rights. So another piece that could help is having a directive from our faculty or our researchers when it comes to preserving fair use in our license negotiations.
And this would really empower us to be able to take a stand and provide guidance on what to do if publishers won't come to the table on artificial intelligence. So this is something we're collaborating on with our University of California system-wide colleagues, also our library committee and our academic senate. And we really hope that you'll be having these conversations as well.
So we have a few minutes left if you have questions or comments or want to share what's going on on your campus with regard to understanding copyright and licensing issues with artificial intelligence and scholarly communication. So thank you very much. While we're waiting for people to line up at the microphone, just one interesting point.
So I mentioned the EU Directive on Copyright in the Digital Single Market. So it does have a contract override prevention clause with respect to non-commercial uses. So it would basically take care of all of the problems that Tim has worried about on college campuses.
But that override provision does not apply to commercial vendors like OpenAI. So again, the EU sort of anticipated Tim's problem and came up with a solution that distinguishes university uses from commercial uses.
Yeah, please go ahead. Great. Thanks to both of you. I found this really helpful. Two thoughts come to mind. One of them is, in my library, we were recipients of a letter from a vendor saying, please certify that none of your researchers will use our content for generative AI. And of course, that puts us in a very difficult position. One thing I'll note is that the letter, I think, explicitly mentioned Alpaca, which is an
open-source LLM that was retrained from LLaMA on Stanford's campus. So the letter was almost like they thought the library did this, which we didn't. It was real data scientists who did this, but they were asking essentially for us to exercise prior restraint on researchers from doing certain things. Second point is that in the same letter from a vendor, who shall remain anonymous, they said, you can't use any of our stuff in any generative context.
What's really interesting about this technically is, if I were to copy and paste a paragraph from a New York Times article into Gmail or into Google Docs, the spelling and grammar checker that is running in Google Docs and Gmail is based on a Transformer model. I've already violated the copyright, or this alleged, you know, contractual clause, just by pasting it. So I don't know how enforceable all this stuff is going to be.
Yeah, that's a really... is this on? Yeah. That's a really good point. You're right, it's so integrated in so many of our practices so far. You know, that's why we're trying to take a careful approach and trying to understand the publishers' concerns.
It's actually been a great process so far because, yeah, we received those short snippets where they want to just ban everything. And once we dug in a little bit deeper with several of them and explained how we need to protect some of the research uses that are going on on campuses that don't necessarily involve generative AI, they're like, oh, yeah, well, we didn't mean that. You know, so it's really important, I think, to open up these conversations and have
honest conversations with the publishers, to kind of really understand what their intentions are and how we can kind of work together because, you know, for at least a few of them that we've talked to, their big thing is, you know, they just don't want their content dumped into the public versions of these commercial platforms. But they're okay with protecting a lot of the other sort of uses that we're seeing on campuses.
So, yeah, I'll just echo what Tim said, and just add, of course, when you start getting those kinds of letters, you need to get your general counsel's office involved and not start responding to them on your own. But that's the other thing: everyone thinks that AI is new, but it isn't. And, as Tim indicated, it is so integrated in the kinds of things that
we use all the time, like spell check. And similarly, even though people say, well, generative AI is new, any translate function is generative AI. I mean, Google Translate is generative AI; it's generating new text.
And that's been around, you know, at least a decade or longer. And again, no one has ever claimed that there's an issue. No one has tried to assert or claim that that is somehow infringing the right to create a derivative work.
So again, Tim's bottom line is, you know, work with the publishers, or have your general counsel work with the publishers, to understand what they want and make them understand why what they're asking for is not completely reasonable.
Any other questions, comments? Hi, this is more of a comment. I'm Maria Asteroina with UNSW. I want to affirm that what we're seeing is not just new language restricting, but also
this policing requirement. It's going hand in hand. And frankly, I'd love to see that also addressed, because it seems to be becoming inextricable: not just that they want us to restrict, but they also want to hold us liable. And, of course, you know, this sort of goes without saying, the solution to all of these problems is to just go completely open access, right?
I mean, if you could just persuade your faculty to only publish with open access publishers, you wouldn't have to worry about any of this. So, you know, I love that solution. So, Morris York, Big Ten Academic Alliance. I just want to say, building on that, that one of the things that we started to see as
really disturbing is sort of attempts at AI language that say any use by any authorized user of AI will constitute a breach of contract. And we run the scenario in our heads where, like, an undergrad takes an article and throws it into ChatGPT and asks for a summary. Like, how are we supposed to police that, and how could we possibly sign a contract that puts that kind of liability
on us? Have you seen that, or had comments on that kind of approach? Yeah, we've seen that a little bit too. And yeah, you're right, it's impossible to do that and we wouldn't do that. One tack that we've tried to take is just to introduce sort of clarifying language that, like, users can still use commercial products, you know, and we can't tell them
what to do with that. So, you know, users are still going to be using ChatGPT and all the other, like, products. So just trying to, I don't know if it adds clarity, but just introduce the fact that that is going to happen, and that's outside of our negotiation with the vendor on
that. But it ultimately reinforces the notion that you just can't agree to everything, right? I mean, someone needs to look at what these licenses say and not just assume that, you know, just because it's the same vendor as last year, the license is the same. And, yeah, it just makes the whole licensing function that much more important.
Okay, I think unless there are any other questions, we will excuse you all two minutes early. Thank you.
End of transcript