this post was submitted on 08 Aug 2023
967 points (97.7% liked)
Privacy
you are viewing a single comment's thread
This is a fundamental misunderstanding of how LLMs actually work. Given a list of previous tokens, a complicated set of linear algebra and normalization operations is applied to yield the “probability” (in quotes because this is a dubious application of the word, imo) that each known token will come next. The model is trained using an equally complicated regression algorithm that slowly adjusts the billions of linear algebra coefficients to more closely match the training data. RLHF is then used to make further adjustments that allow the AI to fulfill its intended purpose (e.g., to reinforce the question-answer format expected of ChatGPT).
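To make the “probability” step concrete, here's a minimal sketch (not any real model's code; the vocabulary and logits are made up): the network produces a raw score, or logit, for every token in its vocabulary, and a softmax turns those scores into the distribution that sampling draws from.

```python
import numpy as np

def next_token_probs(logits):
    """Turn raw per-token scores (logits) into a probability
    distribution via softmax -- the final step of an LLM forward pass."""
    shifted = logits - np.max(logits)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Toy 4-token vocabulary with invented logits for some context.
vocab = ["mat", "dog", "moon", "quickly"]
logits = np.array([4.0, 1.5, 0.5, -2.0])
probs = next_token_probs(logits)
print(dict(zip(vocab, probs.round(3))))
```

The billions of trained coefficients all feed into producing those logits; the softmax at the end is the only place anything resembling a “probability” appears.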
You may recall regression from your first statistics class. Even in the case of simple linear regression, when the input consists of millions of data points, it is essentially impossible to determine which point should be “credited” for any given aspect of the output line. The same is true for AI: you could maybe compile a list of training data that makes a token “likely” to appear after another token, but nothing more complex than that. It is very rare for a small set of sources to be responsible for a sequence longer than a few tokens.
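You can see the attribution problem even in the simple linear case. A hedged toy demo (synthetic data, not a claim about any real model): fit a least-squares line to a million points, then drop a single “training example” and refit; the coefficients barely move, so no individual point can be meaningfully “credited” for the output.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0, 1, n)  # true line plus noise

def fit_line(x, y):
    """Ordinary least squares: returns (slope, intercept)."""
    A = np.column_stack([x, np.ones_like(x)])
    slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope, intercept

full = fit_line(x, y)
without_one = fit_line(x[1:], y[1:])  # remove one data point and refit
print(full, without_one)
```

The two fits differ on the order of 1/n, which is why asking “which point produced this part of the line?” has no useful answer, and the same logic scales up to asking which training document produced a given token.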
I do, however, believe they should be required to provide a very specific list of sources used to train the model. I think it’s ridiculous to claim that generative AI is transformative in a practical sense: I can’t imagine it would be legal for companies to make endless photocopies of copyrighted material and have a computer make fancy scrapbooks out of it, even if “it’s a fledgling industry” or whatever.