The Copilot discussion got me thinking about training data

honey the codewitch

One issue with training data is that it's often pulled from copyrighted material, whether deliberately or through an automated process. Another issue, probably overlooked though I mentioned it in the discussion, is "model collapse": you can't get good models if your training data is itself AI-generated content. It is essentially incestuous, and it leads to collapse. See "AI models collapse when trained on recursively generated data" (Nature). I'm thinking regulation could actually solve both of these problems. AI companies could avoid using data marked as NOT training data. AI-generated data and copyrighted material could be marked that way and flagged as off limits to automated scrapers. AI companies would have an incentive to comply, not only because of regulatory statutes but, perhaps more importantly, so their own models don't get poisoned. Thoughts?
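To make the tagging idea a bit more concrete, here is a rough sketch of what honoring such a marker could look like on the scraper side. Everything specific in it is an assumption for illustration: the `ai-usage` meta tag and the `no-ai-training` / `ai-generated` tokens are made up, and no standard defines them.

```python
from html.parser import HTMLParser

# Hypothetical opt-out markers; no standard defines these names.
OPT_OUT_TOKENS = {"no-ai-training", "ai-generated"}


class MetaScanner(HTMLParser):
    """Collects tokens from <meta name="ai-usage" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attr.get("name") or "").lower() == "ai-usage":
            content = attr.get("content") or ""
            self.tokens.update(
                t.strip().lower() for t in content.split(",") if t.strip()
            )


def is_off_limits(html):
    """Return True if the page carries any of the opt-out markers."""
    scanner = MetaScanner()
    scanner.feed(html)
    return bool(scanner.tokens & OPT_OUT_TOKENS)


def filter_corpus(pages):
    """Keep only the pages that are not marked off limits."""
    return [page for page in pages if not is_off_limits(page)]


tagged = '<head><meta name="ai-usage" content="no-ai-training"></head>'
untagged = "<head><title>Just a page</title></head>"
kept = filter_corpus([tagged, untagged])
```

The filter only works if the scraper chooses to run it, which is exactly where the incentive argument comes in.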

Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

k5054

Because robots.txt works so well ... Regulation is fine, as long as it can be enforced. If there's an enforcement mechanism with sufficient teeth to discourage abuse of the regulations, then maybe it will encourage ethical behavior on the part of organizations creating and distributing AI. I feel like we need an equivalent of Asimov's "Three Laws of Robotics", but for AI.
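For anyone who hasn't looked at how robots.txt is actually consumed, here is a small sketch using Python's standard `urllib.robotparser`. The agent name `ExampleAIBot` and the example.com URLs are placeholders, not real crawlers or sites. The point is that the check is something the crawler runs on itself; nothing stops a crawler from skipping it.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: block one (hypothetical) AI crawler everywhere,
# and block everyone else only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch_allowed(agent, url):
    """What a well-behaved crawler asks before fetching a URL."""
    return parser.can_fetch(agent, url)


ai_bot_article = polite_fetch_allowed("ExampleAIBot", "https://example.com/articles/1")
other_article = polite_fetch_allowed("SomeOtherBot", "https://example.com/articles/1")
other_private = polite_fetch_allowed("SomeOtherBot", "https://example.com/private/x")
```

A crawler that never calls `can_fetch` sees no difference at all, which is the enforcement gap.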

"A little song, a little dance, a little seltzer down your pants" Chuckles the clown

jochance

I'm not sure how the copyrighted material stuff should be worked out. I categorically disagree with the idea you just can't/shouldn't be able to train on copyrighted data. If it is available to people for consumption then it should be available to feed to an algorithm. Piracy notwithstanding. But I'm not sure why you should have to buy all the media over and over again, even as an individual, and that's kind of how we've been doing things for a bit. I'm skeptical that regulation of the wild west would do more than make the natives restless and the cowboys chuckle. The cost of trying to 'filter' all the content to not scoop that stuff up probably means that something purpose-built (RISC?) to test content for model collapse against a model would be really useful.

honey the codewitch

There's definitely the problem of enforcement, but what I would say is:
A) It already behooves the AI companies to follow suit, as it keeps their models from becoming incestuous and collapsing.
B) As far as people putting invalid tags on things: that's why I suggested tags marking content as NOT for consumption. Again, that incentivizes tagging and doesn't really burden the people doing it. In the first case it protects their copyrighted material, and in the second, since the content is generated anyway, marking it with tags is no bother.
C) With regard to actual teeth and where the regulatory statute might be leveled: if someone is knowingly producing generated content to poison other people's models, and it flies in the face of said regulations, it at the very least exposes the actors to civil, if not criminal, liability. Civil liability in the States isn't as difficult to prove as a criminal case, relying on the standard of "a preponderance of the evidence" rather than "beyond a reasonable doubt". Such a case seems easy enough to make in many instances, although suing Russian actors from the States might be problematic. Still, that's a problem with the internet in general, and the argument that we shouldn't make a law because it can't always be enforced is pretty much a non-starter, as that could apply to many laws already on the books.


honey the codewitch

jochance wrote:

I categorically disagree with the idea you just can't/shouldn't be able to train on copyrighted data.

The danger in that, particularly where it pertains to code, is what if you indirectly rip from IBM's codebase via AI? Do you really want a behemoth like that coming after you with lawyers?


jochance

Recognizing that how things work and how I think they should are two different wildebeests... Of course not, but I somewhat object to the idea that IBM (or anyone else) should own trivial bits of code. I can get on board with a copyright across the whole of a system's code. But we'd never be on board with the idea that someone can "own" single lines or small groupings of lines in isolation, and why would we? You and I have almost definitely typed some of the same lines of code before and we'd never even know it. In my opinion, the ratio of code worthy of the protections of copyright is just super low. If it were a Far Side cartoon it would be a caveman courtroom drama over Thag copying his neighbor's stick figures from their cave drawings. Most of the value isn't really in the code itself but in whatever it is the code is doing. If you can rewrite that in another language, or even the same one, then you've effectively 'stolen' the value legitimately. Some of the best bits of problem-solving code are just going to be the same (or close) no matter who/where they come from, especially if the problem itself can map to code in straightforward ways.

Richard Andrew x64

jochance wrote:

If it were a Far Side cartoon it would be a caveman courtroom drama over Thag copying his neighbor's stick figures from their cave drawings.

:laugh: :laugh:

The difficult we do right away... ...the impossible takes slightly longer.

Greg Utas

Apparently there's a software patent for saving the portion of a screen that will be overlaid by something else so that it can be quickly restored later. Utterly deranged.
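For anyone who hasn't run into it, the technique itself is about this much code. This is only an illustrative sketch on a toy character "screen" (a list of lists); the function names are made up and it says nothing about what the patent actually claims.

```python
def save_under(screen, top, left, height, width):
    """Copy the rectangle that is about to be covered."""
    return [row[left:left + width] for row in screen[top:top + height]]


def blit(screen, top, left, patch):
    """Write a rectangular patch onto the screen in place."""
    for dy, patch_row in enumerate(patch):
        screen[top + dy][left:left + len(patch_row)] = patch_row


# A 4x10 toy screen filled with dots, and a 2x2 popup.
screen = [list("..........") for _ in range(4)]
popup = [list("##"), list("##")]

saved = save_under(screen, 1, 3, 2, 2)   # remember what the popup will hide
blit(screen, 1, 3, popup)                # draw the popup
covered = ["".join(row) for row in screen]

blit(screen, 1, 3, saved)                # dismiss it by restoring the saved pixels
restored = ["".join(row) for row in screen]
```

Restoring from the saved copy avoids asking whatever was underneath to redraw itself.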

Robust Services Core | Software Techniques for Lemmings | Articles
The fox knows many things, but the hedgehog knows one big thing.

jochance

Lol wut? I wonder if that dates to really old ASCII-based UI. It's the only thing off the top of my head that makes it make sense. Hasbro bought a bunch of IP from Atari. Amongst it, I think, was Pac-Man. This led to a short-lived claim and a series of suits premised on the idea that any game featuring a protagonist in a maze-like environment was a derivative work. Bold move, Cotton.