Mastodon scraping incident

Nick Sellen January 13th, 2020 22:38 (Public, seen by 103)

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic, some consider the content public, others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)

Thoughts?

Creature Of The Hill January 14th, 2020 08:40

I post public, and therefore expect anyone to be able to read it. If not, there are other tools and other ways of tooting.

However, this is in the spirit of engaging with individuals. Publicly posting means that others can find and engage with me in the same way I have done with them. It feels safe enough, because of the tools afforded to deal with the rare negative interaction. Positives far outweigh the negatives in my n=1 case.

However, scraping feels very different, and quite negative. It actually has me thinking about what I post at the moment. I wrote a tool (pre-backups) to grab all of my toots so I didn't lose media etc... It would feel extremely intrusive/abusive if I used such a tool against another account to grab all their public toots. I know I could scroll and read them, but automation brings a level of potential abuse that makes it feel more uncomfortable.

So I guess I am in favour of dealing with it somehow.

Terms of service seems right, because those signing up should know where the instance stands. But I personally think that should be backed up by disabling public timeline access (v3 dependent). This would mean that although stopping someone determined will never be possible, we can make it eminently provable that they acted deliberately, and remove the defence of ignorance.

My two-penneth. Interested to see what others think.

mike_hales January 14th, 2020 10:03

Thanks for flagging this. I'm opposed to any actor scraping the entire sphere, for purposes of an analysis that will not be fully returned, mirror-fashion, to the communities whose behaviour traces have been systematically syphoned off . . by an industrial-strength (military-strength?) machine which is not in any way equivalent to the ordinary 'public' access of actual persons to actions of other persons-in-public. In a world with bots (and other asymmetrical real-world surveillance by un-public agents) some defence against this kind of violation of social norms is needed.

I may be missing something here but how do terms of service actually inhibit this kind of practice? Who's gonna sue? Is the fediverse or social.coop really going to take a violator to court? This is something I don't understand in general - so for example, I don't understand how Copyfair is in fact supposed to make any real difference to private abuses of the commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

So seems to me, some act-of-fiat is required, a machine-level fix: disable "Allow unauthenticated access" or modify <robots.txt> or whatever. I guess it's robot wars? Outside the law?

Nathan Schneider January 14th, 2020 16:48

I don't see any problem with scraping of public posts. That's one of the wonderful things about public microblogging; it's a resource for public research that helps us better understand ourselves. If you don't want to be scraped yourself, you can set your posts to private. I think it's within the values of Social.coop to welcome our public data to be available for study.

A colleague of mine has worked on this issue of user perceptions of research quite a bit. Some resources:

https://journals.sagepub.com/doi/10.1177/2056305118763366

https://howwegettonext.com/scientists-like-me-are-studying-your-tweets-are-you-ok-with-that-c2cfdfebf135?gi=269de090d941

mike_hales January 14th, 2020 17:22

I disagree @Nathan Schneider. I don't regard this as a question of personal privacy, but rather a question of all results of analysis being returned to the community that was scraped. It would be great if research "helps us better understand ourselves". But research findings go into the black hole of professional literatures, and research on 'us' (members of 'the public') is only very rarely available to us in any remotely direct way. Just as with environmental commons, so with cultural commons: nothing extracted that's not returned to source.

The problem with academic research results is that although they're nominally ‘public’ they are in fact behind gates (cost of journals, access to journals, kinds of literacy required to read, casual elitism in presentation to specialist peers). Even when governed by ethics committees and formal legal issues of copyright are observed, academic research is basically extractive - that's the politics of the professional-managerial class ("Trust us to figure things out on your behalf. Don't bother yourself with the details, it's too technical for you anyway"). Today, with digital and bots, this has also become the problem of Big Data. 150 years on from the invention of stats and public administration, we still don't have remotely adequate ways of dealing with such issues.

If research were truly a commons, in which the (tacit, passive) contributors of scraped data were also directly participating in both the governance of the results-pool and the mundane enjoying of the results commons, that would be wonderful. But this isn't how it works in 2020. I feel that a university research team scraping my toots for a ‘public’ dataset is about half as bad as Facebook scraping my traffic for commercial exploitation or worse - still not good. Still not properly ‘public’ behaviour, in a society of elites.

This is the stance of a retired insider: a university research professional producing nominally ‘public’, public-funded findings. Bring on the commons! No, don't trust even well-meaning professionals. All power to the general assembly (hmmnn, not that, either! This is a tough one).

Nick Sellen January 14th, 2020 18:49

In this case the researchers have made efforts to comply with terms of service, from the paper:

In the terms of service and privacy policy the gathering and the usage of public available data is never explicitly mentioned, consequently our data collection seems to be complaint with the policy of the instance.

they also said they complied with robots.txt:

we have also respected the limitations imposed by the robots.txt files of the different instances

This type of case seems preventable, if that is desired.

If there were a truly hostile person doing the scraping, I would imagine having those things in place would be a better starting position from a legal perspective - not that I know much about that.

I agree with the distinction between ordinary public access by actual people and machine-enabled public access, especially when you include the ability to analyse the data with current and future algorithms, which is an explicit aim of theirs:

The usage of this dataset empowers researchers to develop new applications as well as to evaluate different machine learning algorithms and methods on different tasks

Nathan Schneider January 14th, 2020 23:45

I agree that research findings based on public data (or really any research from public institutions) should be publicly available. I don't know about the practices of these particular researchers, but most scholars I know at least make available open-access preprints of their research if the journals themselves are not open access. In general, academic research is more available and accessible than it ever has been, even though I believe the open access movement has a long way to go (and I've been trying to advance it through ethicaledtech.info).

But part of the point here is that public data is not just available to researchers at universities. It's available to anyone, in principle. It could be used for a variety of outcomes. One of the values of a truly open commons is that the resource is available to all, and that's the case with our data.

Of course, I believe there are times when we need to protect our data from certain forms of abuse. The recent rise in source-available licenses to prevent cloud software from being abused by Amazon is an example. In the co-op/commons community, we have experiments with the Peer Production License, which limits use to non-profit and cooperative entities. I would be very comfortable with applying the PPL to Social.coop content.

mike_hales January 15th, 2020 10:06

I would be very comfortable with applying the PPL to Social.coop content

Sounds good to me as a principle 🙂 At the same time, I still have the same real-world query as before . .

Is the fediverse or social.coop really going to take a violator to court? . . I don't understand how [PPL] is in fact supposed to make any real difference to private abuses of commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

The Rule of Law isn’t in fact how extractive privatising culture and economy work. They operate on “OK I did it. OK so sue me” - the Rule of Fiat. As if financial compensation could remove harm anyway. Genuinely, I don’t get what Copyfair etc are actually expected to achieve. There seems to be a mistaken equation between libertarianism and lawfulness. It seems to me that law for libertarians is a gun or a bulldozer.

mike_hales January 15th, 2020 10:43

the open access movement has a long way to go

All speed 👍

public data is . . available to anyone, in principle

Principle again. In practice the modus of the professional-managerial class is to pump data and knowledge out of . . let’s say ‘the Public’ . . into a stratum of culture in which it’s routinely mobilised by elites of professional wage labour, intellectual entrepreneurs, States and corporations, in the course of doing things to the public, not with those people who’ve been scraped, or enabling actions by those people. Typically, producing infrastructures for life and work that are not readily open to self-design and redesign by those whose lives and work they shape. This is 150 years of Fordist and post-Fordist capitalist practice. These are the stakes that the fediverse is playing for, and P2P-commons politics more widely.

The locations of ‘public’ datasets should be directly published to those communities from which they were scraped - with data from digital sources like the fediverse that’s not too hard at all, it’s mostly the intention that’s lacking. They should be glossed - in those same locations - so that they are meaningful to and useable by those people, rather than to the professionals the datasets are formulated for. Tools for using on the data should be published with the data - all of this under PPL/Copyfair.

When analyses of the data are published they should be posted to the same locations and notified. When used in designing infrastructures, that designing should be conducted as codesign, with the communities who needs-must inhabit the infrastructures.

This all sounds far fetched, and quite hard to interpret in practical terms (even though practices of codesign - and designs of codesign practices - have made great headway in the past 50 years) which is a good measure of how far we are from having an actual Public, as distinct from various kinds of privatised territory. Let’s rather call it the commons, and let’s take that politicised description seriously, in practices of active and explicit commoning, rather than falling back on myths of publicness and professionalism? It’s quite another kind of practice, and yes, a long way to go.

In my activist peer groups, this past 40 years, there have been two key principles (principles!): in-and-against the State, and in-and-against the professional-managerial class (principles that can be extended in other directions too, with regard to other kinds of oppression and supremacy). I think this is the territory we’re in here. A long way to go.

Bob Haugen January 15th, 2020 13:25

@Nathan Schneider

most scholars I know at least make available open-access preprints of their research if the journals themselves are not open access.

In one case I know about, some researchers had to pay $20,000 to a predatory publisher (a big name) to offer an open-access version of their paper. Academic publishing is a racket.

Nick Sellen January 15th, 2020 13:39

@mike_hales my comment above this one partly answers your real-world query, I think - in this real-world case, having those things in place would have prevented it (for hostile cases it would at least increase the effort required to scrape the content).

Nick Sellen January 15th, 2020 14:04

Some interesting bits from those two papers/links are:

Within our survey sample, few users were previously aware that their public tweets could be used by researchers, and the majority felt that researchers should not be able to use tweets without consent.

and

The problem is that, for some researchers, whether the data is public is the only thing that matters. I suggest (sometimes loudly, to people who don’t want to hear it) that it shouldn’t be.

and

it’s critical that we move beyond simplistic rules and consider each situation individually and holistically. Researchers can’t place the whole burden on users; we can’t expect them to know that we’re out there watching, nor demand that they anticipate any and all potential harms that may come to them now or in the future.

I think people generally have very little awareness of how their data might flow around and be used, and often are not comfortable when they find out. Some people have been very upset by this scraping. I would love people to have more data awareness so they can make informed choices (I think most people don't know that server admins with root access can read all their private messages too... of course they shouldn't, but how does anyone know that?).

@Nathan Schneider said:

If you don't want to be scraped yourself, you can set your posts to private

I, and I think other users, would like more nuance than that. Is it really reasonable to combine two cases as one: posting to the local instance timeline, so individual humans can discover your content, and making your toots available to anyone on the internet to scrape and analyse? It's not a technical limitation (excluding the malicious/hostile case) but a policy choice.

it's a resource for public research that helps us better understand ourselves

In this particular case, I'm kinda doubtful about their approach - content warnings are used for quite a range of purposes, and one of the things I love about the fediverse is the element of human and community curation and moderation. I think Facebook has a hard time moderating its 982497294728472847287424 users - it's an unpleasant full-time job for a lot of people - whereas in the fediverse, moderation can be spread across each community as they wish, which perhaps makes it more manageable.

The model of magic AI to help moderation feels like it comes from the Facebook-type case - automate away this drudgery - but it seems far less appropriate for the fediverse, where data and tools that can empower the human moderators seem more useful to me (and quite distinct from just automated spam/bot detection).

Perhaps this research can support people who want to go in that direction, but acting so disconnected from the communities under study doesn't seem a very good start. I don't really understand their motives.

mike_hales January 15th, 2020 14:17

I don't get it Nick. Aren't these just documents, protocols? Protocol observers will . . observe them. What effort does it take to not-observe them? And if a document has quote-unquote legal force . . legal force costs a lot of money to mobilise. Freedom under law is very skewed. I truly don't see how such things can be seen as practical defences, for distributed or digital commons, against determined abusers.

Nick Sellen January 15th, 2020 14:20

I agree for determined actors, but these particular ones were acting in good faith, believing it was permitted and acceptable, and presumably would not have done it otherwise.

mike_hales January 15th, 2020 14:57

one of the things I love about the fediverse is the element of human and community curation

This is close to the heart I think. Curating is one of three dynamics at the heart of (digital or other) commons, and curating is a practice of valuing. Actual persons in actual communities of commoners, actually practising the valuing, within collectives, of what’s contributed in commons, in actual cases. This is a big evolutionary step we’re contemplating, world scale, digitally facilitated.

Machines can be told to do helpful things - filtering or flagging based on pattern recognition, for example. But to make a closed-loop valuing process (valuing-and-enforcing process?), enacted by machines, is surely something that should be contemplated only rarely? As distinct, for example, from closed-loop processes in real-time engineering systems, put in place to prevent physical hazard.

AIs - well, just jumped-up machines - could do really helpful pattern recognition on ‘public’ Big data. We really could do with mirrors of our own tacit large-scale collective actions - in environmental commons, energy commons, media commons, material commons such as food supply chains or housing stocks, etc etc. Is the fediverse at work on this? Do we have to wait until the Big Data oligarchies are taken into coop ownership? Fat chance! Is anybody today going to trust ‘public’ ownership (the State) to do this? Or the professional communities of Big Data science, like the genome? I don’t think so. Policing the ownership of individuals’ data seems to be about as far as the Free Software and Free Web vision takes us? Have I got that wrong? Not the same thing at all as commoning.

Bob Haugen January 15th, 2020 15:02

@mike_hales

I don't understand how Copyfair is in fact supposed to make any real difference to private abuses of the commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

This is not directly about scraping data, it's about open source software licenses. Large companies with legal departments do not knowingly violate licenses. Which is why lots of companies will not use GPL code. I would expect universities might also want to avoid legal liabilities even if nobody is going to sue them.

Won't deter malicious actors, though...but FB and Goog getting sued by European government agencies for lots of money might put a crimp in their plans...

mike_hales January 15th, 2020 15:03

 I don't really understand their motives

I don’t mean to be snarky here - I’ve earned my living in non-tenured academic contract research too - but . . Career. Publish-or-perish. ‘Interesting problems in the academic field’.

Connecting with communities (being part of non-academic communities, contributing analytical work) is hard work, in relatively unexplored modes. Researchers have only 24 hours in their days too, and mortgages to pay, and if they’re not going to be rewarded for that additional hard work, not much of it is going to get done (and THAT will be in personal spare time?).

Creature Of The Hill January 15th, 2020 15:12

If you don't want to be scraped yourself, you can set your posts to private

I, and I think other users, would like more nuance than that. Is it really reasonable to combine two cases as one: posting to the local instance timeline, so individual humans can discover your content, and making your toots available to anyone on the internet to scrape and analyse? It's not a technical limitation (excluding the malicious/hostile case) but a policy choice.

Any individual is welcome to browse through my profile and public toots. Please, go try it. You will see the effort it takes to build a picture and context. If an individual is willing to do that, they are investing time in understanding the context and motivation for those toots and their connections. You will also notice it's not so easy to see everything in one place, as a single column. So it's public, but in a form designed to be interpreted by actual people.

An entity or organisation or tool is a different matter. To infer - from the public setup of toots, the way a profile is presented, or how they appear in a public timeline - that a user consents to them being used en masse by an entity is, to my eye, wrong.

I would think that most users would see it like this.

That is not to say that I or others might not give consent if informed.

But implying it just feels shady and convenient for those that want the data without having to go to too much effort. Just because I could go looking around the internet for open resources doesn't mean it is right in all cases.

Django started a poll January 15th, 2020 17:33

Enable authorized fetches, disable public access via API (Once we are on v3+) Closed 11:03pm - Sunday 19 Jan 2020

This discussion is great and shows we as a Coop don't quite have consensus on the issue.

Let's see if we can agree on a few things.

7 - Yes
1 - No
Django January 15th, 2020 17:42

I would like to create a second poll regarding a change to the Terms of Service, but fear it might be confusing having multiple polls at once.

Here are the choices I have in mind; not sure if it should be a ranked choice or only choose one out of the 4 options:

  • Explicitly prevent scraping 1

  • Researchers must explicitly ask for access 2, 3

  • Researchers are not obliged to ask for access 3

  • Status Quo

  1. This would also require some software to detect and ban IPs attempting to scrape

  2. This would also require software, and a temporary exception would be made

  3. Users who have checked off 'Opt-out of search engine indexing' would be automatically excluded from research.

Thoughts?

Noah Hall January 15th, 2020 18:03

Privacy of toots is a complicated question. Obviously "public" means "not private", but it does not adequately distinguish between "public as in what I say in my yard, or at a restaurant" and "public as in what an elected official says at a meeting." I do believe it should be more difficult to scrape the public timeline; it's not something required for regular, individual-level interaction, and it's almost never done with intentions of directly benefiting the people whose data is being scraped. And although it's minimal, let's not forget we're paying for the server resources consumed by the scraping!

I'm in favor of all three of the options given by @Nick Sellen, and honestly interested in going further. For example, I hope for us to someday have a discussion on the possibility of migrating from vanilla Mastodon to a compatible fork offering a local-only post privacy option (I know of Hometown and glitch-soc; there may be others).

Django January 15th, 2020 18:10

Oh yes, regarding "change robots.txt (in some way, not sure precisely, needs research)":

I believe there is an admin setting which turns this on by default for all users, but it is a user setting to opt out of search engine indexing.

Django January 15th, 2020 18:10

Yes

Django January 15th, 2020 18:12

Just to be clear, opting out of search engine indexing is insufficient to prevent scraping.
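As far as I understand it, that user setting only adds an advisory hint for search engines to the user's public pages - something like this (illustrative):

<meta name="robots" content="noindex">

Well-behaved indexers will honour it, but it does nothing to stop anyone from fetching the pages or the API.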

Creature Of The Hill January 15th, 2020 18:16

Yes

mike_hales January 15th, 2020 18:31

No

I don't fully understand the statement (eg 'authorized fetches') and there's no <abstain> option, so must vote a NO.

Nathan Schneider January 15th, 2020 19:10

Agreed. That's unusually egregious. But, again, researchers generally make non-paywalled preprint versions of their research available as well. There have been significant gains in the open-access movement. That, again, would be the value of the PPL: researchers using our data would have to publish through a non-profit or cooperative outlet, and those are typically open-access ones.

Nathan Schneider January 15th, 2020 19:18

Yes

Given the strong concerns raised here, I would be okay with this.

Nathan Schneider started a poll January 15th, 2020 19:21

Put a Peer Production License on Social.coop tweets Closed 12:02pm - Saturday 18 Jan 2020

Alongside any technical provisions we add about mass scraping of our data, I propose that we should place a peer production license on our content, restricting reuse to nonprofit and cooperative entities. (Of course, we can offer separate licensing to other entities on an ad hoc basis.)

Using the PPL would also be a way of extending solidarity to the broader co-op movement.

https://wiki.p2pfoundation.net/PeerProductionLicense

9 - Yes
1 - No
mike_hales January 15th, 2020 19:35

Yes

Leo Sammallahti January 15th, 2020 19:39

Yes

Noah Hall January 15th, 2020 19:42

Yes

Without getting into the broader questions about licensing that Aaron has raised, I think a reasonable amendment here might be something along the lines of, "All toots covered by PPL unless specified otherwise by the user - check their profile"

Django January 15th, 2020 20:04

Yes

Jonathan Bean January 15th, 2020 21:44

Yes

Creature Of The Hill January 16th, 2020 08:48

Yes

Nick Sellen January 16th, 2020 10:44

Yes

Sounds like a good experiment with this license. The link above is broken; hopefully this one will work - Peer Production License. I tried reading that page, but it's a bit long and full of dense walls of text :/

Nick Sellen January 16th, 2020 10:54

No

I think we need a more informed discussion about what it is first.

Nick Sellen January 16th, 2020 10:59

I wanted to explore a bit more what the authorized fetches option is about. The Mastodon 3.0 in-depth blog post gives this explanation (for Secure mode, which I presume is the setting that the toot I read before was referring to):

Secure mode

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.

Given we are not on v3.0 yet, maybe we can just wait until then to decide. It might be possible to assess which servers we would no longer be able to communicate with if the setting were on...
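For reference, turning it on looks simple; a minimal sketch, assuming the standard .env.production layout (the restart step and service names are illustrative and vary per deployment):

# .env.production (excerpt)
# Secure mode: require HTTP signature authentication on public resources
# and disable unauthenticated REST API access
AUTHORIZED_FETCH=true

# then restart the Mastodon services, e.g.:
# systemctl restart mastodon-web mastodon-sidekiq mastodon-streaming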

Nick Sellen January 16th, 2020 11:15

robots.txt is a static file included in the repo - see https://github.com/tootsuite/mastodon/blob/master/public/robots.txt (or the equivalent for our current version) - so it's not configurable within the instance, or per user, but we could choose to serve our own one to override the default. I didn't manage to find an instance that has customized it though, so it would need some research; maybe a question to #mastoadmins would come up with something.
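To make the idea concrete, an override could be as blunt as this (a sketch only; robots.txt is purely advisory, so only well-behaved crawlers will honour it):

# public/robots.txt - hypothetical stricter override of the shipped default
# Ask all crawlers to stay away from everything:
User-agent: *
Disallow: /

A softer variant could disallow just the API and profile paths, but either way it does nothing against a scraper that simply ignores the file.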

mike_hales January 16th, 2020 13:16

@Nick Sellen great to have that greater depth, thanku. On that basis I'm happy to switch to a YES vote. Roll on v3.0!

mike_hales January 16th, 2020 13:18

Yes

I don't fully understand the statement (eg 'authorized fetches') and there's no <abstain> option, so must vote a NO. Later: Nick provided some detail and I'm happy to vote YES now. Thanku.

Nathan Schneider January 16th, 2020 16:57

@Nick Sellen sorry about the bad link. Here's a nice article on the PPL.

David Mynors January 17th, 2020 09:36

Yes

David Mynors January 17th, 2020 09:38

Yes

Aaron Wolf January 17th, 2020 18:36

No

I have mixed feelings and am open to changing my mind, but I'm skeptical of the PPL. I support co-op solidarity and the intention of the PPL 100%, but I'm critical of discriminatory licenses. I prefer the PPL over CC-NC because blanket anti-commerce is even worse. But plain copyleft - CC-BY-SA - would accomplish what I see everyone talking about here: getting anyone doing research to publish the research under free terms we could all access.

FWIW, I would like to mark my posts CC-BY-SA

Michael January 17th, 2020 22:07

Yes

Michael January 17th, 2020 22:08

Yes

mike_hales January 19th, 2020 09:55

Clear positive vote on Put a Peer Production License on Social.coop tweets. But only 5% turnout. This needs tooting? A second vote? Its own thread? Authorized fetches is heading the same way. These need much more participation?

Nick Sellen January 19th, 2020 22:43

There is an Open Letter from the Mastodon Community (via https://sunbeam.city/@GwenfarsGarden/103507032332626576), and they are asking if people want to co-sign.

Interestingly, it points out the researchers did not abide by the terms of service, and did not sufficiently anonymize the data. The dataset has been pulled from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/R1HKVS due to legal issues.

Django January 20th, 2020 15:31

Thanks for expanding on this - I had made an assumption about this based on some toots. And as @mike_hales pointed out, more info was needed for an informed decision.

Django January 20th, 2020 15:32

Agreed!

Maybe the official @SocialCoop@social.coop account could announce the polls/discussions to the instance users.

Should we re-roll the 2 polls into 1?

mike_hales January 20th, 2020 23:09

@Matthew Cropp or @Matt Noyes or @emi do - would you announce? But @Nathan Schneider @Django need to float the polls again - new thread to lessen confusion?

mike_hales January 20th, 2020 23:11

Interesting, thorough letter, worth filing. Now closed for signatures.

Matt Noyes January 21st, 2020 02:29

How about this? @Nathan Schneider and @Django combine the polls into one, then announce it together, with a toot from the social.coop account as backup. I am happy to encourage people to participate.

mike_hales January 21st, 2020 10:11

Just a thought on ‘good faith’. From the analysis in the letter of protest that has been written in the fediverse, it seems clear that the researchers were not acting in good faith at all. Rather, they seemingly acted in a pretty crass, ignorant way, didn’t do what they said they did, and weren’t aware of half the things they should have been, if they were fully literate users. So expectations of good faith were no protection in this case.

In something that’s quite technically complex like this, I might expect dumb ignorance to be a pretty widespread possibility (including in fields of casualised, precarious employment in academia), and expectations of good faith to be no defence against harm. Legal action and compensation after the harm is done isn’t a substitute for defence?

Scholar.social seems to be the act to follow on this?

Nick Sellen January 21st, 2020 10:41

So expectations of good faith were no protection in this case.

Indeed, I was too optimistic about that I think, but I still feel it was perhaps just badly implemented good faith ;)

... but the legal side seemed more successful, in that the dataset got removed from where it was hosted on legal grounds.

Wooster January 22nd, 2020 08:00

Any solution short of a technical measure preventing the actual scraping of posts (such as only permitting friended authenticated users to read your toots) will not stop your toots from being scraped and harvested, along with any identifiable information that is available.

Put succinctly, if you make information on the internet available to people without authentication, it can and will be scraped - regardless of any laws, letters, privacy statements, terms of service, strongly-worded posts or anything else. The researchers and scholars who make their intentions public may make some effort to abide by these guidelines and attempt to redact personal information, but the actors doing things you'd really rather they not do with your data will not feel so obligated.

Don't post stuff on the internet if you don't want it to be public information. There's no social mechanism that has enough force to prevent others from accessing it in an automated fashion.
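To make that concrete, the "automated fashion" is nothing exotic; a minimal sketch in Python against the documented Mastodon REST API (the instance name is hypothetical):

import requests  # plain HTTP client - no Mastodon credentials involved

INSTANCE = "https://example.social"  # hypothetical instance with default settings

def fetch_public_timeline(pages=3):
    """Page backwards through the unauthenticated public timeline."""
    toots, max_id = [], None
    for _ in range(pages):
        params = {"limit": 40}  # 40 is the API's per-request maximum
        if max_id:
            params["max_id"] = max_id  # resume from the oldest toot seen so far
        resp = requests.get(f"{INSTANCE}/api/v1/timelines/public", params=params)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break
        toots.extend(page)
        max_id = page[-1]["id"]
    return toots

print(len(fetch_public_timeline()), "toots fetched without logging in")

No open letter or terms-of-service document is consulted anywhere in that loop.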


If you want to post things on Mastodon that others can read, secure mode will likely break that capability. Secure mode does not prevent scraping; it merely lets you see who is doing the fetching, and that can still be an effectively anonymous user. Either your toots are public or they aren't - authorized fetch doesn't do anything to prevent scraping. Anyone can set up a new Mastodon instance and create an HTTP signature to make authorized fetches. The fediverse, like Twitter, is not a place for posting anything you wish to keep private in some fashion. Either people and machines can read your content, or they can't. No open letters or policies will change that.

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.

Wooster January 22nd, 2020 08:01

How would any of this actually prevent scraping?

mike_hales January 22nd, 2020 10:13

It's good to have it stated as plainly as this @Wooster thanks.

if you make information on the internet available to people without authentication, it can and will be scraped. Regardless of any laws, letters, privacy statements, terms of service, strongly-worded posts or anything else

As I said earlier, my concern in this is not privacy: I personally operate in the fediverse with an awareness that careless talk is as unwise here as it is anywhere. My concern is with the uses that are made of material in the commons. I'm concerned that there should be concerted efforts to build real commons of digital media, and that any analysis made of materials in commons should be returned to those who created the materials, and notified to them - as I wrote here. This is a big ask and how to do it is unclear. But potentially this gives us the means of a much more embracing awareness of who 'we' are and what 'we' do . . Silicon Valley oligarchs and State security agencies are not the only people with an interest in knowing the shape and dynamics of our behaviour 'in the large'. This is a kind of literacy that's become possible in the past generation, and it's time it was seriously attended to.

One of the things that's most difficult in getting started on this, is that the ethos of commoning is different from, and tangential to, the basic ethos of the web and free software. These latter are built within an anarcho-libertarian culture of autonomism and complete privacy of and control over individual property. This orientation has brought some very powerful tools and technologies, and there are more in the pipeline - open data, mesh networks, open app ecosystems, whatever. But commons are post-propertarian. They're built within a culture of stewarding, curating and enjoying in which all participants have the same access to the same means, under the governance and policing and common aesthetic of them all. Commoning is associationist rather than libertarian and individualist, and the peer-to-peer culture of free software production - a world of protocol-commons - is a space where the two cultures have an awkward coexistence, which is far from resolved.

From the standpoint of building commons, the persistent concern with privacy is a sideshow and maybe a distraction, and the main game is finding ways of policing and ending extraction from commons, and facilitating and mandating return of value to the commons. It's no less urgent (though less of a life-and-death matter) to start focusing this in digital commons, than it is in the wild commons of air, water, energetics and biosphere. Digital data is one of the 'new wildernesses'; cowboys, frontiersmen, gunslingers and homesteaders are out there (where are the posses and deputies? who shot the sheriff?); and so are industrial-scale, robber-baron, clear-felling, cash-cropping, land-grabbing, financial-capital giants. It's the kind of steampunk world Neal Stephenson might write, but it just so happens that we're in it?

PS: I think the open letter - and the stance of scholar.social - is still interesting. The slack, extractive ethos of academia certainly needs attending to. They (we - I used to be one) need to learn new ways of being in, and serving, communities that are not basically running on academic-elite, publish-or-perish, knowledge-commodity rules.