> Technically, robot.txt isn't enforcing anything, so it is just trust. There's ...

> Technically, robot.txt isn't enforcing anything, so it is just trust.

There's legal backing to it in the EU, as mentioned. With CommonCrawl you can just download it yourself to check. In other cases it wouldn't necessarily be as immediately obvious, but through monitoring IPs/behavior in access logs (or even prompting the LLM to see what information it has) it would be possible to catch them out if they were lying - like Perplexity were "caught out" in the mentioned case.

> Funny. If I can browse to it, it is public right? That is how some people's logic goes. And how OpenAI argued 2 years ago when GPT3.5/ChatGPT first started getting traction.

If you mean public as in the opposite of private, I think that's pretty much true by definition. Information's no longer private when you're putting it on the public Internet.

If you mean public as in public domain, I don't think that has been argued to be the case. The argument is that it's fair use (that is, the content is still under copyright, but fitting statistical models is substantially transformative/etc.)