Code Project

The Copilot discussion got me thinking about training data

    honey the codewitch (#1) wrote:

    One issue with training data is that it's often pulled from copyrighted material, whether deliberately or through an automated process. Another issue, probably overlooked, though I mentioned it in the discussion, is "model collapse": you can't get good models if your training data includes AI-generated content. It is essentially incestuous, and it leads to model collapse. See "AI models collapse when trained on recursively generated data" (Nature).

    I'm thinking regulations could actually solve both of these problems. AI could avoid using data marked as NOT training data. AI-generated data and copyrighted material could be marked that way and flagged as off limits to automated scraping processes. AI companies would have an incentive to comply, not only because of regulatory statutes but, perhaps more importantly, so that their own models don't get poisoned. Thoughts?
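To make the "marked as NOT training data" idea concrete, here is a rough Python sketch of what the filtering step on the scraper side might look like. Everything in it is an assumption for illustration: the tag names (`no-train`, `ai-generated`, `copyrighted`), the `Document` type, and the function names are hypothetical, and no existing standard or regulation defines them.

```python
from dataclasses import dataclass, field

# Hypothetical opt-out labels. The thread does not name a tagging standard,
# so these strings are placeholders only.
EXCLUDED_TAGS = frozenset({"no-train", "ai-generated", "copyrighted"})


@dataclass(frozen=True)
class Document:
    """A scraped item plus whatever tags its publisher attached (assumed model)."""
    url: str
    text: str
    tags: frozenset = field(default_factory=frozenset)


def is_trainable(doc: Document) -> bool:
    """Return True only if the document carries none of the opt-out tags."""
    return doc.tags.isdisjoint(EXCLUDED_TAGS)


def filter_corpus(docs):
    """Keep only the documents that are not marked as off limits for training."""
    return [doc for doc in docs if is_trainable(doc)]
```

Note that untagged content passes through by default here, which matches the proposal that only the NOT-for-training case needs to be marked. Whether a scraper actually runs a check like this is the enforcement question.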

    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

      k5054 (#2, in reply to #1) wrote:

      Because robots.txt works so well ... Regulation is fine, as long as it can be enforced. If there's an enforcement mechanism with sufficient teeth to discourage abuse of the regulations, then maybe it will encourage ethical behavior on the part of organizations creating and distributing AI. I feel like we need an equivalent of Asimov's "Three Laws of Robotics", but for AI.
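For readers who have not looked at how robots.txt is honored in practice, the following is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot`, the helper `may_crawl`, and the sample rules are made up for illustration. The point it shows is that the check lives entirely in the crawler's own code, so nothing stops a crawler from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Sample rules a site might publish (hypothetical content and bot name).
SAMPLE_ROBOTS_TXT = """
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(may_crawl(SAMPLE_ROBOTS_TXT, "ExampleAIBot", page))
    print(may_crawl(SAMPLE_ROBOTS_TXT, "SomeOtherAgent", page))
```

A well-behaved crawler calls something like `may_crawl` before every fetch; a badly behaved one simply does not, which is why the file is advisory rather than an enforcement mechanism.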

      "A little song, a little dance, a little seltzer down your pants" Chuckles the clown

        jochance (#3, in reply to #1) wrote:

        I'm not sure how the copyrighted material stuff should be worked out. I categorically disagree with the idea you just can't/shouldn't be able to train on copyrighted data. If it is available to people for consumption, then it should be available to feed to an algorithm, piracy notwithstanding. But I'm not sure why you should have to buy all the media over and over again, even as an individual, and that's kind of how we've been doing things for a bit.

        I'm skeptical that regulation of the wild west would do more than make the natives restless and the cowboys chuckle. The cost of trying to 'filter' all the content to avoid scooping that stuff up probably means that something purpose-built (RISC?) to test content for model collapse against a model would be really useful.

          honey the codewitch (#4, in reply to #2) wrote:

          There's definitely the problem of enforcement, but what I would say is this:

          A) It already behooves the AI companies to follow suit, as it keeps their models from becoming incestuous and collapsing.

          B) As far as people putting invalid tags on things, that's why I suggested tags marked NOT for consumption. Again, that gives people an incentive and otherwise doesn't really burden them. In the first instance it protects their copyrighted material, and in the second, since the content is generated anyway, marking it with tags is no bother.

          C) With regard to actual teeth and where the regulatory statute might be leveled: if someone is knowingly producing generated content to poison other people's models, and it flies in the face of said regulations, it at the very least exposes the actors to civil, if not criminal, liability. Civil liability in the States isn't as difficult to prove as a criminal case, since it relies on the standard of "a preponderance of the evidence" rather than "beyond a reasonable doubt". Such a case seems easy enough to make in many instances, although suing Russian actors from the States might be problematic.

          Still, that's a problem with the internet in general, and the argument that we shouldn't make a law because it can't always be enforced is pretty much a non-starter, as that could apply to many laws already on the books.

            honey the codewitch (#5, in reply to #3) wrote:

            jochance wrote:

            I categorically disagree with the idea you just can't/shouldn't be able to train on copyrighted data.

            The danger in that, particularly where it pertains to code, is what if you indirectly rip from IBM's codebase via AI? Do you really want a behemoth like that coming after you with lawyers?

              jochance (#6, in reply to #5) wrote:

              Recognizing that how things work and how I think they should are two different wildebeests... Of course not, but I somewhat object to the idea that IBM (or anyone else) should own trivial bits of code. I can get on board with a copyright across the whole of a system's code. But we'd never be on board with the idea that someone can "own" single lines or small groupings of lines in isolation, and why would we? You and I have almost definitely typed some of the same lines of code before and we'd never even know it.

              In my opinion, the ratio of code worthy of the protections of copyright is just super low. If it were a Far Side cartoon it would be a caveman courtroom drama over Thag copying his neighbor's stick figures from their cave drawings.

              Most of the value isn't really in the code itself but in whatever it is the code is doing. If you can rewrite that in another language, or even the same one, then you've effectively 'stolen' the value legitimately. Some of the best bits of problem-solving code are just going to be the same (or close) no matter who or where they come from, especially if the problem itself can map to code in straightforward ways.

                Richard Andrew x64 (#7, in reply to #6) wrote:

                jochance wrote:

                If it were a Far Side cartoon it would be a caveman courtroom drama over Thag copying his neighbor's stick figures from their cave drawings.

                :laugh: :laugh:

                The difficult we do right away... ...the impossible takes slightly longer.

                  Greg Utas (#8, in reply to #6) wrote:

                  Apparently there's a software patent for saving the portion of a screen that will be overlaid by something else so that it can be quickly restored later. Utterly deranged.

                  Robust Services Core | Software Techniques for Lemmings | Articles
                  The fox knows many things, but the hedgehog knows one big thing.

                    jochance (#9, in reply to #8) wrote:

                    Lol wut? I wonder if that dates to really old ASCII-based UI. It's the only thing off the top of my head that makes it make sense. Hasbro bought a bunch of IP from Atari. Amongst it, I think, was Pac-Man. This led to a short-lived claim and a series of suits premised on the idea that any game featuring a protagonist in a maze-like environment was a derivative work. Bold move, Cotton.
