Loomio
Mon 13 Jan 2020 10:38PM

Mastodon scraping incident

Nick Sellen

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic: some consider the content public, while others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)
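
As a rough illustration of the robots.txt option above: the sketch below uses Python's standard `urllib.robotparser` to show how a crawler that honors robots.txt would read a `Disallow` rule. The rule set, the `research-bot` agent name and the `is_allowed` helper are hypothetical, chosen for illustration; they are not social.coop's actual configuration. Note that robots.txt is advisory, so it only affects crawlers that choose to respect it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not the instance's real file):
# ask all crawlers to stay away from the API endpoints.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""


def is_allowed(path: str, agent: str = "research-bot") -> bool:
    """Return True if a robots.txt-compliant crawler may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/api/v1/timelines/public", "/about"):
        print(path, is_allowed(path))
```

A scraper that ignores robots.txt is not stopped by this, which is why the terms-of-service and authenticated-access options are listed alongside it.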

Thoughts?

D

Poll Created Wed 15 Jan 2020 5:33PM

Enable authorized fetches, disable public access via API (once we are on v3+)
Closed Mon 20 Jan 2020 4:03AM

This discussion is great and shows we as a Coop don't quite have consensus on the issue.

Let's see if we can agree on a few things.

Results

Option     Votes  Voters
No         1      NS
Yes        7      NS N M DM M D COT
Undecided  90     DS ST JD CZ BH F SH KT C G AM MSC CCC AW MC SC PA RB MN JG

8 of 98 people have participated (8%)

M

mike_hales
No
Wed 15 Jan 2020 6:31PM

I don't fully understand the statement (eg 'authorized fetches') and there's no <abstain> option, so must vote a NO.

NS

Nathan Schneider
Yes
Wed 15 Jan 2020 7:18PM

Given the strong concerns raised here, I would be okay with this.

NS

Nick Sellen
No
Thu 16 Jan 2020 10:54AM

I think we need a more informed discussion about what it is first.

M

mike_hales
Yes
Thu 16 Jan 2020 1:18PM

I don't fully understand the statement (eg 'authorized fetches') and there's no <abstain> option, so must vote a NO. Later: Nick provided some detail and I'm happy to vote YES now. Thanku.

D

Django Wed 15 Jan 2020 5:42PM

I would like to create a second poll regarding a change to the Terms of Service, but I fear it might be confusing to have multiple polls at once.

Here are the choices I have in mind. I'm not sure whether it should be a ranked choice or a single choice out of the 4 options:

  • Explicitly prevent scraping [1]

  • Researchers must explicitly ask for access [2, 3]

  • Researchers are not obliged to ask for access [3]

  • Status Quo

  1. This would also require some software to detect and ban IPs attempting to scrape

  2. This would also require software, and a temporary exception would be made

  3. Users who have checked 'Opt-out of search engine indexing' would be automatically excluded from research.

Thoughts?

W

Wooster Wed 22 Jan 2020 8:01AM

How would any of this actually prevent scraping?

N

Noah Wed 15 Jan 2020 6:03PM

Privacy of toots is a complicated question. Obviously "public" means "not private", but the word does not adequately distinguish between "public as in what I say in my yard, or at a restaurant" and "public as in what an elected official says at a meeting." I do believe it should be more difficult to scrape the public timeline; scraping is not required for regular, individual-level interaction and is almost never done with the intention of directly benefiting the people whose data is being scraped. And although the cost is minimal, let's not forget we're paying for the server resources consumed by the scraping!

I'm in favor of all three of the options given by @Nick Sellen, and honestly interested in going further. For example, I hope we can someday discuss the possibility of migrating from vanilla Mastodon to a compatible fork offering a local-only post privacy option (I know of Hometown and glitch-soc; there may be others).

D

Django Wed 15 Jan 2020 6:10PM

Oh yes, regarding "change robots.txt (in some way, not sure precisely, needs research)":

I believe there is an admin setting which turns this on by default for all users, but it is a user setting to opt out of search engine indexing.
