this post was submitted on 21 Aug 2024

314 points (100.0% liked)




rulebots.txt (lemmy.world)

submitted 4 weeks ago by [email protected] to c/[email protected]


top 32 comments


[–] [email protected] 110 points 4 weeks ago (1 children)

but not the misuse of public content

[–] [email protected] 12 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

> but not the misuse of public content

Is that an admission that they don't own the content others posted on their site?

[–] [email protected] 2 points 3 weeks ago

you would be a good lawyer

[–] [email protected] 69 points 4 weeks ago (3 children)

I am confused, does this mean Reddit is not going to be searchable on search engines anymore?

[–] [email protected] 83 points 4 weeks ago

Unfortunately yes. It was reported on last month.

[–] [email protected] 63 points 4 weeks ago (4 children)

oh no, Reddit is like, the only way to have google still be useful.

[–] [email protected] 51 points 4 weeks ago

Funnily enough, google is also the only way to have Reddit be useful.

Their own search function has been nothing but garbage.

[–] [email protected] 40 points 4 weeks ago (2 children)

That's the catch: Google made a deal with Reddit and remains the only search engine allowed to access its data for indexing. The deal cuts off every other search engine.

[–] [email protected] 25 points 4 weeks ago (1 children)

Tell me that there is an anti trust suit over this.

[–] [email protected] 23 points 4 weeks ago

There's a suit over google in general so this may well be part of it

[–] [email protected] 1 points 3 weeks ago (1 children)

really? ddg still shows me reddit links. did they have to make a web scraper or something?

[–] [email protected] 2 points 3 weeks ago

There's a cutoff date: anything indexed before the robots.txt was changed stays in the index.

[–] [email protected] 29 points 4 weeks ago (1 children)

We fucked the internet. It’s proprietary now.

[–] [email protected] 10 points 4 weeks ago* (last edited 4 weeks ago) (1 children)

> we fucked the internet

kinky

[–] [email protected] 7 points 4 weeks ago (1 children)

cat5 sounding you say?

[–] [email protected] 2 points 3 weeks ago

cat5-o-nine-tails

[–] [email protected] 8 points 4 weeks ago (1 children)

Good news! Google paid up and still has access I'm pretty sure.

[–] [email protected] 0 points 3 weeks ago (1 children)

That's bad news, that means the internet is dying

[–] [email protected] -1 points 3 weeks ago

Sorry, the /s was sort of implied.

[–] [email protected] 9 points 4 weeks ago (1 children)

Perhaps, likely depends on the crawler though

[–] [email protected] 12 points 4 weeks ago

Yeah, I don't think ignoring robots.txt is even illegal. They can of course just block your crawler's IP, but that would be a cat-and-mouse game that they would lose in the end.
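
To make the "voluntary" part concrete, here is a minimal sketch of what a well-behaved crawler does before requesting a page, using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`PartnerBot`, `ExampleBot`), and the URL are all hypothetical and are not Reddit's actual file. The point is only that the check lives in the crawler's own code, so a crawler that skips the call is not stopped by anything in the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed, everyone else is not.
# This is an illustration, not a copy of any real site's file.
ROBOTS_TXT = """\
User-agent: PartnerBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/some_thread"  # hypothetical URL
    for bot in ("PartnerBot", "ExampleBot"):
        verdict = "may fetch" if is_allowed(ROBOTS_TXT, bot, url) else "should skip"
        print(f"{bot} {verdict} {url}")
```

A polite crawler would call something like `is_allowed` before every request and back off when it returns `False`. Enforcement beyond that (IP blocks, rate limits) has to happen on the server side.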

[–] [email protected] 53 points 4 weeks ago

Not gonna lie, this seems like ultimately a win for the Internet. The years of troubleshooting solutions Reddit provided can be archived (hopefully), but the fewer people rely on the site itself, the better. At least in my opinion.

[–] [email protected] 52 points 4 weeks ago

I remember finding Google's robots.txt when they first came out. It was a cute little text ASCII art of a robot with a heart that said, "We love robots!"

[–] [email protected] 49 points 4 weeks ago (1 children)

An ancient text from the before-fore.

[–] [email protected] 59 points 4 weeks ago (1 children)

this is actually quite recent. the old one was much funnier and clearly had actual soul put into it.

[–] [email protected] 6 points 3 weeks ago

my shiny metal ass

[–] [email protected] 8 points 4 weeks ago (4 children)

As annoying as this is, it's to prevent LLMs from training themselves using Reddit content, and that's probably the greater of the two evils.

[–] [email protected] 37 points 4 weeks ago (1 children)

That's all well and good, but how many LLMs do you think actually respect robots.txt?

[–] [email protected] 13 points 4 weeks ago

from my limited experience, about half? i had to finally set up a robots.txt last month after Anthropic decided it would be OK to crawl my Wikipedia mirror from about a dozen different IP addresses simultaneously, non-stop, without any rate limiting, and bring it to its knees. fuck them for it, but at least it stopped once i added robots.txt.

Facebook, Amazon, and a few others are ignoring that robots.txt, on the other hand. they have the decency to do it slowly enough that i'd never notice unless i checked the logs, at least.
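
For anyone in the same spot, a rough sketch of the site-owner side: a robots.txt that shuts out one crawler entirely and asks everyone else to slow down, checked locally with Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the paths, and the 10-second delay are made up for illustration and are not the commenter's actual configuration. `Crawl-delay` is a non-standard directive, so crawlers that honor `Disallow` may still ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a small mirror site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    page = "/wiki/Main_Page"  # hypothetical path
    for bot in ("ExampleAIBot", "OtherBot"):
        print(bot, "allowed:", rules.can_fetch(bot, page),
              "crawl delay:", rules.crawl_delay(bot))
```

Testing the file this way before deploying it catches the common mistake of a rule group that never matches the user agent it was written for. Crawlers that ignore robots.txt, like the ones mentioned above, still need server-side rate limiting or blocking.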

[–] [email protected] 31 points 4 weeks ago

I thought major LLMs ignored robots.txt

[–] [email protected] 24 points 4 weeks ago

It's to profit from training LLMs: https://arstechnica.com/information-technology/2024/02/your-reddit-posts-may-train-ai-models-following-new-60-million-agreement/

[–] [email protected] 11 points 3 weeks ago

It’s to prevent LLMs from training themselves using reddit content, unless they pay the party that took no part in creating said content

FTFY