Loomio
Mon 13 Jan 2020 10:38PM

Mastodon scraping incident

Nick Sellen

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic: some consider the content public, while others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)

Thoughts?

Nick Sellen Thu 16 Jan 2020 11:15AM

robots.txt is a static file included in the repo (see https://github.com/tootsuite/mastodon/blob/master/public/robots.txt, or the copy for our current version), so it is not configurable within the instance or per user, but we could choose to serve our own file to override the default. I didn't manage to find an instance that has customized it, though, so this would need some research. Maybe a question to #mastoadmins would come up with something.
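As a rough illustration of what an overridden robots.txt could express, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules, the crawler name "ResearchBot" and the path /example-path/ are hypothetical and not taken from Mastodon's default file. Note also that robots.txt is only advisory: it affects crawlers that choose to check it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical override of the default robots.txt: one named research
# crawler is asked to stay away entirely, everyone else only from a
# single (made-up) path.
CUSTOM_ROBOTS_TXT = """\
User-agent: ResearchBot
Disallow: /

User-agent: *
Disallow: /example-path/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://social.coop/@someone/1"
    for agent in ("ResearchBot", "Googlebot"):
        print(agent, is_allowed(CUSTOM_ROBOTS_TXT, agent, url))
```

A scraper that never performs this check is not stopped by the file, so this would complement the other measures rather than replace them.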

Django Wed 15 Jan 2020 6:12PM

Just to be clear, opting out of search engine indexing is insufficient to prevent scraping.

NS

Poll Created Wed 15 Jan 2020 7:21PM

Put a Peer Production License on Social.coop tweets

Closed Sat 18 Jan 2020 7:02PM

Alongside any technical provisions we add about mass scraping of our data, I propose that we should place a peer production license on our content, restricting reuse to nonprofit and cooperative entities. (Of course, we can offer separate licensing to other entities on an ad hoc basis.)

Using the PPL would also be a way of extending solidarity to the broader co-op movement.

https://wiki.p2pfoundation.net/PeerProductionLicense

Results

Option      % of points   Voters   Voter initials
Yes         90.0%         9        LS N JB NS M DM M D COT
No          10.0%         1        AW
Undecided   0%            88       DS ST JD CZ BH F NS SH KT C G AM MSC CCC MC SC PA RB MN JG

10 of 98 people have participated (10%)

Noah Wed 15 Jan 2020 7:42PM

Yes

Without getting into the broader questions about licensing that Aaron has raised, I think a reasonable amendment here might be something along the lines of, "All toots covered by PPL unless specified otherwise by the user - check their profile"

Nick Sellen Thu 16 Jan 2020 10:44AM

Yes

Sounds like a good experiment with this license. The link above is broken; hopefully this one will work: Peer Production License. I tried reading that page, but it's a bit long and full of dense walls of text :/

Aaron Wolf Fri 17 Jan 2020 6:36PM

No

I have mixed feelings and am open to changing my mind, but I'm skeptical of the PPL. I support co-op solidarity and the intention of the PPL 100%, but I'm critical of discriminatory licenses. I prefer the PPL over CC-NC because blanket anti-commerce is even worse. Still, plain copyleft (CC-BY-SA) would accomplish what I see everyone talking about here: getting anyone doing research to publish the research under free terms we could all access.

FWIW, I would like to mark my posts CC-BY-SA

Nick Sellen Thu 16 Jan 2020 10:59AM

I wanted to explore further what the authorized fetches option is about. The Mastodon 3.0 in-depth blog post gives this explanation (for Secure mode, which I presume is the setting that the toot I read before was referring to):

Secure mode

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.

Given we are not on v3.0 yet, maybe we can just wait until then to decide. It might be possible to assess which servers we will not be able to communicate with were the setting on...
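
The access rule described in the quoted blog post can be modelled roughly as follows. This is a toy Python sketch of the policy only, not Mastodon's implementation: the names FetchRequest, signer_domain and allow_fetch are hypothetical, and signature verification itself is assumed to have happened elsewhere.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class FetchRequest:
    """A request for a public resource (hypothetical model)."""
    path: str
    # Domain established from a valid HTTP signature, or None when the
    # request is unsigned. Verifying the signature is out of scope here.
    signer_domain: Optional[str] = None


def allow_fetch(request: FetchRequest, secure_mode: bool,
                blocked_domains: Set[str]) -> bool:
    """Decide whether to serve a public resource under the described policy."""
    if request.signer_domain is None:
        # Anonymous fetches are only served when secure mode is off.
        return not secure_mode
    # A known requester can be refused through a domain block.
    return request.signer_domain.lower() not in blocked_domains


if __name__ == "__main__":
    blocked = {"scraper.example"}
    anonymous = FetchRequest("/users/alice/outbox")
    signed = FetchRequest("/users/alice/outbox", "scraper.example")
    print(allow_fetch(anonymous, secure_mode=False, blocked_domains=blocked))
    print(allow_fetch(anonymous, secure_mode=True, blocked_domains=blocked))
    print(allow_fetch(signed, secure_mode=True, blocked_domains=blocked))
```

The point of the model is that a domain block only has teeth once anonymous fetches are refused; otherwise a blocked server can simply drop its signature.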

mike_hales Thu 16 Jan 2020 1:16PM

@Nick Sellen great to have that greater depth, thanku. On that basis I'm happy to switch to a YES vote. Roll on v3.0!

Django Mon 20 Jan 2020 3:31PM

Thanks for expanding on this. I had made an assumption about it based on some toots, and as @mike_hales pointed out, more info was needed for an informed decision.

Nathan Schneider Thu 16 Jan 2020 4:57PM

@Nick Sellen sorry about the bad link. Here's a nice article on the PPL.
