Loomio
Mon 13 Jan

Mastodon scraping incident

NS
Nick Sellen Public Seen by 194

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic, some consider the content public, others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)

Thoughts?

COT

Creature Of The Hill Tue 14 Jan

I post public, and therefore expect anyone to be able to read it. If not, there are other tools and other ways of tooting.

However, this is in the spirit of engaging with individuals. Publicly posting means that others can find and engage with me in the same way I have done with them. It feels safe enough, because of the tools afforded to deal with the rare negative interaction. Positives far outweigh the negatives in my n=1 case.

However, scraping feels very different, and quite negative. It actually has me thinking about what I post at the moment. I wrote a tool (pre-backups) to grab all of my toots so I didn't lose media etc... It would feel extremely intrusive/abusive if I used such a tool against another account to grab all their public toots. I know I could scroll and read them, but automation brings a level of potential abuse that makes it feel more uncomfortable.

So I guess I am in favour of dealing with it somehow.

Terms of service seems right, because those signing up should know where the instance stands. But I personally think that should be backed up by disabling public timeline access (v3 dependent). This would mean that although stopping someone determined will never be possible, we can make it eminently provable that they acted deliberately, and remove the defence of ignorance.

My two-penneth. Interested to see what others think.

M

mike_hales Tue 14 Jan

Thanks for flagging this. I'm opposed to any actor scraping the entire sphere, for purposes of an analysis that will not be fully returned, mirror-fashion, to the communities whose behaviour traces have been systematically syphoned off . . by an industrial-strength (military-strength?) machine which is not in any way equivalent to the ordinary 'public' access of actual persons to actions of other persons-in-public. In a world with bots (and other asymmetrical real-world surveillance by un-public agents) some defence against this kind of violation of social norms is needed.

I may be missing something here but how do terms of service actually inhibit this kind of practice? Who's gonna sue? Is the fediverse or social.coop really going to take a violator to court? This is something I don't understand in general - so for example, I don't understand how Copyfair is in fact supposed to make any real difference to private abuses of the commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

So seems to me, some act-of-fiat is required, a machine-level fix: disable "Allow unauthenticated access" or modify <robots.txt> or whatever. I guess it's robot wars? Outside the law?

NS

Nathan Schneider Tue 14 Jan

I don't see any problem with scraping of public posts. That's one of the wonderful things about public microblogging; it's a resource for public research that helps us better understand ourselves. If you don't want to be scraped yourself, you can set your posts to private. I think it's within the values of Social.coop to welcome our public data to be available for study.

A colleague of mine has worked on this issue of user perceptions of research quite a bit. Some resources:

https://journals.sagepub.com/doi/10.1177/2056305118763366

https://howwegettonext.com/scientists-like-me-are-studying-your-tweets-are-you-ok-with-that-c2cfdfebf135?gi=269de090d941

M

mike_hales Tue 14 Jan

I disagree @Nathan Schneider I don’t regard this as a question of personal privacy, but rather a question of all results of analysis being returned to the community that was scraped. It would be great if research “helps us better understand ourselves”. But research findings go into the black hole of professional literatures, and research on ‘us’ (members of ’the public’) is only very rarely available to us, in any remotely direct way. Just as with environmental commons, so with cultural commons: Nothing extracted, that’s not returned to source.

The problem with academic research results is that although they're nominally ‘public’ they are in fact behind gates (cost of journals, access to journals, kinds of literacy required to read, casual elitism in presentation to specialist peers). Even when governed by ethics committees and formal legal issues of copyright are observed, academic research is basically extractive - that's the politics of the professional-managerial class ("Trust us to figure things out on your behalf. Don't bother yourself with the details, it's too technical for you anyway"). Today, with digital and bots, this has also become the problem of Big Data. 150 years on from the invention of stats and public administration, we still don't have remotely adequate ways of dealing with such issues.

If research were truly a commons, in which the (tacit, passive) contributors of scraped data were also directly participating in both the governance of the results-pool and the mundane enjoying of the results commons, that would be wonderful. But this isn't how it works in 2020. I feel that a university research team scraping my toots for a ‘public’ dataset is about half as bad as Facebook scraping my traffic for commercial exploitation or worse - still not good. Still not properly ‘public’ behaviour, in a society of elites.

This is the stance of a retired insider: a university research professional producing nominally ‘public’, public-funded findings. Bring on the commons! No, don't trust even well-meaning professionals. All power to the general assembly (hmmnn, not that, either! This is a tough one).

NS

Nick Sellen Tue 14 Jan

In this case the researchers have made efforts to comply with terms of service, from the paper:

In the terms of service and privacy policy the gathering and the usage of public available data is never explicitly mentioned, consequently our data collection seems to be complaint with the policy of the instance.

they also said they complied with robots.txt:

we have also respected the limitations imposed by the robots.txt files of the different instances

This type of case seems preventable, if that is desired.

If there was a truly hostile person doing the scraping I would imagine having those things in place would be a better starting position from a legal perspective, not that I know much about that.

I agree with the distinction between ordinary public access by actual people and machine enabled public access, especially when you include the ability to analyse the data with current and future algorithms, which is an explicit aim of theirs:

The usage of this dataset empowers researchers to develop new applications as well as to evaluate different machine learning algorithms and methods on different tasks

NS

Nathan Schneider Tue 14 Jan

I agree that research findings based on public data (or really any research from public institutions) should be publicly available. I don't know about the practices of these particular researchers, but most scholars I know at least make available open-access preprints of their research if the journals themselves are not open access. In general, academic research is more available and accessible than it ever has been, even though I believe the open access movement has a long way to go (and I've been trying to advance it through ethicaledtech.info).

But part of the point here is that public data is not just available to researchers at universities. It's available to anyone, in principle. It could be used for a variety of outcomes. One of the values of a truly open commons is that the resource is available to all, and that's the case with our data.

Of course, I believe there are times when we need to protect our data from certain forms of abuse. The recent rise in source-available licenses to prevent cloud software from being abused by Amazon is an example. In the co-op/commons community, we have experiments with the Peer Production License, which limits use to non-profit and cooperative entities. I would be very comfortable with applying the PPL to Social.coop content.

M

mike_hales Wed 15 Jan

I would be very comfortable with applying the PPL to Social.coop content

Sounds good to me as a principle 🙂 At the same time, I still have the same real-world query as before . .

Is the fediverse or social.coop really going to take a violator to court? . . I don't understand how [PPL] is in fact supposed to make any real difference to private abuses of commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

The Rule of Law isn’t in fact how extractive privatising culture and economy work. They operate on “OK I did it. OK so sue me” - the Rule of Fiat. As if financial compensation could remove harm anyway. Genuinely, I don’t get what Copyfair etc are actually expected to achieve. There seems to be a mistaken equation between libertarianism and lawfulness. It seems to me that law for libertarians is a gun or a bulldozer.

M

mike_hales Wed 15 Jan

the open access movement has a long way to go

All speed 👍

public data is . . available to anyone, in principle

Principle again. In practice the modus of the professional-managerial class is to pump data and knowledge out of . . let’s say ‘the Public’ . . into a stratum of culture in which it’s routinely mobilised by elites of professional wage labour, intellectual entrepreneurs, States and corporations, in the course of doing things to the public, not with those people who’ve been scraped, or enabling actions by those people. Typically, producing infrastructures for life and work that are not readily open to self-design and redesign by those whose lives and work they shape. This is 150 years of Fordist and post-Fordist capitalist practice. These are the stakes that the fediverse is playing for, and P2P-commons politics more widely.

The locations of ‘public’ datasets should be directly published to those communities from which they were scraped - with data from digital sources like the fediverse that’s not too hard at all, it’s mostly the intention that’s lacking. They should be glossed - in those same locations - so that they are meaningful to and useable by those people, rather than to the professionals the datasets are formulated for. Tools for using on the data should be published with the data - all of this under PPL/Copyfair. When analyses of the data are published they should be posted to the same locations and notified. When used in designing infrastructures, that designing should be conducted as codesign, with the communities who needs-must inhabit the infrastructures. This all sounds far fetched, and quite hard to interpret in practical terms (even though practices of codesign - and designs of codesign practices - have made great headway in the past 50 years) which is a good measure of how far we are from having an actual Public, as distinct from various kinds of privatised territory. Let’s rather call it the commons, and let’s take that politicised description seriously, in practices of active and explicit commoning, rather than falling back on myths of publicness and professionalism? It’s quite another kind of practice, and yes, a long way to go.

In my activist peer groups, this past 40 years, there have been two key principles (principles!): in-and-against the State, and in-and-against the professional-managerial class (principles that can be extended in other directions too, with regard to other kinds of oppression and supremacy). I think this is the territory we’re in here. A long way to go.

BH

Bob Haugen Wed 15 Jan

@Nathan Schneider

most scholars I know at least make available open-access preprints of their research if the journals themselves are not open access.

In one case I know about, some researchers had to pay $20,000 to a predatory publisher (a big name) to offer an open-access version of their paper. Academic publishing is a racket.

NS

Nick Sellen Wed 15 Jan

@mike_hales my comment above this one, to me, partly answers your real-world query - in this real-world case, having these things in place would likely have prevented it (and for hostile cases it would increase the effort required to scrape the content).

NS

Nick Sellen Wed 15 Jan

Some interesting bits from those two papers/links are:

Within our survey sample, few users were previously aware that their public tweets could be used by researchers, and the majority felt that researchers should not be able to use tweets without consent.

and

The problem is that, for some researchers, whether the data is public is the only thing that matters. I suggest (sometimes loudly, to people who don’t want to hear it) that it shouldn’t be.

and

it’s critical that we move beyond simplistic rules and consider each situation individually and holistically. Researchers can’t place the whole burden on users; we can’t expect them to know that we’re out there watching, nor demand that they anticipate any and all potential harms that may come to them now or in the future.

I think people generally have very little awareness of how their data might flow around and be used, and often are not comfortable when they find out. Some people have been very upset by this scraping. I would love people to have more data awareness so they can make informed choices (I think most people don't know that server admins with root access can read all their private messages too... of course they shouldn't but how does anyone know that?).

@Nathan Schneider said:

If you don't want to be scraped yourself, you can set your posts to private

I, and I think other users, would like more nuance than that. Is it really reasonable to treat two cases as one (posting to the local instance timeline, so individual humans can discover your content, versus making your toots available to anyone on the internet to scrape and analyse)? It's not a technical limitation (excluding the malicious/hostile case) but a policy choice.

it's a resource for public research that helps us better understand ourselves

In this particular case, I'm kinda doubtful about their approach - content warnings are used for quite a range of purposes, and one of the things I love about the fediverse is the element of human and community curation and moderation. I think Facebook has a hard time moderating its 982497294728472847287424 users - it's an unpleasant full-time job for a lot of people - whereas in the fediverse, moderation can be spread across each community as they wish, which perhaps makes it manageable again.

The model of magic AI to help moderation feels like it comes from the Facebook-type case, automate away this drudgery, but seems far less appropriate for the fediverse, where data and tools that can empower the human moderators seems more useful to me (and seems quite distinct from just automated spam/bot detection).

Perhaps this research can support people that want to go in that direction, but it doesn't seem a very good start, to act so disconnected from the communities under study. I don't really understand their motives.

M

mike_hales Wed 15 Jan

I don't get it, Nick. Aren't these just documents, protocols? Protocol observers will . . observe them. What effort does it take to not-observe them? And if a document has quote-unquote legal force . . legal force costs a lot of money to mobilise. Freedom under law is very skewed. I truly don't see how such things can be seen as practical defences, for distributed or digital commons, against determined abusers.

NS

Nick Sellen Wed 15 Jan

I agree for determined users, but for these particular ones, they were doing it in good faith that it was permitted and acceptable, and presumably would not have done otherwise.

M

mike_hales Wed 15 Jan

one of the things I love about the fediverse is the element of human and community curation

This is close to the heart I think. Curating is one of three dynamics at the heart of (digital or other) commons, and curating is a practice of valuing. Actual persons in actual communities of commoners, actually practising the valuing, within collectives, of what’s contributed in commons, in actual cases. This is a big evolutionary step we’re contemplating, world scale, digitally facilitated.

Machines can be told to do helpful things - filtering or flagging based on pattern recognition, for example. But to make a closed-loop valuing process (valuing-and-enforcing process?), enacted by machines, is surely something that should be contemplated only rarely? As distinct, for example, from closed-loop processes in real-time engineering systems, put in place to prevent physical hazard.

AIs - well, just jumped-up machines - could do really helpful pattern recognition on ‘public’ Big data. We really could do with mirrors of our own tacit large-scale collective actions - in environmental commons, energy commons, media commons, material commons such as food supply chains or housing stocks, etc etc. Is the fediverse at work on this? Do we have to wait until the Big Data oligarchies are taken into coop ownership? Fat chance! Is anybody today going to trust ‘public’ ownership (the State) to do this? Or the professional communities of Big Data science, like the genome? I don’t think so. Policing the ownership of individuals’ data seems to be about as far as the Free Software and Free Web vision takes us? Have I got that wrong? Not the same thing at all as commoning.

BH

Bob Haugen Wed 15 Jan

@mike_hales

I don't understand how Copyfair is in fact supposed to make any real difference to private abuses of the commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

This is not directly about scraping data, it's about open source software licenses. Large companies with legal departments do not knowingly violate licenses. Which is why lots of companies will not use GPL code. I would expect universities might also want to avoid legal liabilities even if nobody is going to sue them.

Won't deter malicious actors, though...but FB and Goog getting sued by European government agencies for lots of money might put a crimp in their plans...

M

mike_hales Wed 15 Jan

 I don't really understand their motives

I don’t mean to be snarky here - I’ve earned my living in non-tenured academic contract research too - but . . Career. Publish-or-perish. ‘Interesting problems in the academic field’.

Connecting with communities (being part of non-academic communities, contributing analytical work) is hard work, in relatively unexplored modes. Researchers have only 24 hours in their days too, and mortgages to pay, and if they’re not going to be rewarded for that additional hard work, not much of it is going to get done (and THAT will be in personal spare time?).

COT

Creature Of The Hill Wed 15 Jan

If you don't want to be scraped yourself, you can set your posts to private

I, and I think other users, would like more nuance than that. Is it really reasonable to treat two cases as one (posting to the local instance timeline, so individual humans can discover your content, versus making your toots available to anyone on the internet to scrape and analyse)? It's not a technical limitation (excluding the malicious/hostile case) but a policy choice.

Any individual is welcome to browse through my profile and public toots. Please, go try it. You will see the effort it takes to build a picture and context. If an individual is willing to do that, they are investing time in understanding the context and motivation for those toots and their connections. You will also notice it's not so easy to see everything in one place, as a single column. So it's public, but in a form designed to be interpreted by actual people.

An entity, organisation, or tool is a different matter. To infer, from the way toots are publicly presented in a profile or on a public timeline, that a user consents to them being used en masse by such an entity is, to my eye, wrong.

I would think that most users would see it like this.

That is not to say that I or others might not give consent if informed.

But implying it just feels shady and convenient for those that want the data without having to go to too much effort. Just because I could go looking around the internet for open resources doesn't mean it is right in all cases.

D

Django started a check Wed 15 Jan

Enable authorized fetches, disable public access via API (Once we are on v3+) Closed Sun 19 Jan

This discussion is great and shows we as a Coop don't quite have consensus on the issue.

Let's see if we can agree on a few things.

1 - No
7 - Yes
D

Django Wed 15 Jan

I would like to create a second poll regarding a change to the Terms of Service, but fear it might be confusing having multiple polls at once.

Here are the choices I have in mind; I'm not sure if it should be ranked choice or only choosing one of the 4 options:

  • Explicitly prevent scraping 1

  • Researchers must explicitly ask for access 2, 3

  • Researchers are not obliged to ask for access 3

  • Status Quo

  1. This would also require some software to detect and ban IPs attempting to scrape

  2. This would also require software, and a temporary exception would be made

  3. Users who have Checked off 'Opt-out of search engine indexing' would be automatically excluded from research.

Thoughts?

N(@

Noah ( @redoak ) Wed 15 Jan

Privacy of toots is a complicated question. Obviously "public" means "not private" but it does not adequately distinguish between "public as in what I say in my yard, or at a restaurant" and "public as in what an elected official says at a meeting." I do believe it should be more difficult to scrape the public timeline; it's not something required for regular, individual-level interaction and almost never done with intentions of directly benefiting the people whose data is being scraped. And although it's minimal let's not forget we're paying for the server resources consumed by the scraping!

I'm in favor of all three of the options given by @Nick Sellen , and honestly interested in going further. For example, I hope for us to someday have a discussion on the possibility of migrating from vanilla Mastodon to a compatible fork offering a local-only post privacy option (I know of Hometown and glitch-soc, there may be others).

D

Django Wed 15 Jan

Oh yes regarding "change robots.txt (in some way, not sure precisely, needs research)"

I believe there is an admin setting which turns this on by default for all users, but otherwise it is a per-user setting to opt out of search engine indexing.

D

Django
Yes
Wed 15 Jan

D

Django Wed 15 Jan

Just to be clear: opting out of search engine indexing is insufficient to prevent scraping.

COT

Creature Of The Hill
Yes
Wed 15 Jan

M

mike_hales
No
Wed 15 Jan

I don't fully understand the statement (eg 'authorized fetches') and there's no <abstain> option, so must vote a NO.

NS

Nathan Schneider Wed 15 Jan

Agreed. That's unusually egregious. But, again, generally researchers make non-paywalled preprint versions of their research available as well. There have been significant gains in the open-access movement. That, again, would be the value of a PPL license: Researchers using our data would have to publish using a non-profit or cooperative outlet, and those are typically open access ones.

NS

Nathan Schneider
Yes
Wed 15 Jan

Given the strong concerns raised here, I would be okay with this.

NS

Nathan Schneider started a poll Wed 15 Jan

Put a Peer Production License on Social.coop tweets Closed Sat 18 Jan

Alongside any technical provisions we add about mass scraping of our data, I propose that we should place a peer production license on our content, restricting reuse to nonprofit and cooperative entities. (Of course, we can offer separate licensing to other entities on an ad hoc basis.)

Using the PPL would also be a way of extending solidarity to the broader co-op movement.

https://wiki.p2pfoundation.net/PeerProductionLicense

9 - Yes
1 - No
M

mike_hales Wed 15 Jan

Yes

LS

Leo Sammallahti Wed 15 Jan

Yes

N(@

Noah ( @redoak )
Yes
Wed 15 Jan

N(@

Noah ( @redoak ) Wed 15 Jan

Yes

Without getting into the broader questions about licensing that Aaron has raised, I think a reasonable amendment here might be something along the lines of, "All toots covered by PPL unless specified otherwise by the user - check their profile"

D

Django Wed 15 Jan

Yes

JB

Jonathan Bean Wed 15 Jan

Yes

COT

Creature Of The Hill Thu 16 Jan

Yes

NS

Nick Sellen Thu 16 Jan

Yes

Sounds like a good experiment with this license. The link above is broken; hopefully this one will work - Peer Production License - I tried reading that page, but it's a bit long and full of dense walls of text :/

NS

Nick Sellen
No
Thu 16 Jan

I think we need a more informed discussion about what it is first.

NS

Nick Sellen Thu 16 Jan

I wanted to explore more what the authorized fetches option is about, the Mastodon 3.0 in-depth blog post gives this explanation (for Secure mode, which I presume is the setting that the toot I read before was referring to):

Secure mode

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.

Given we are not on v3.0 yet, maybe we can just wait until then to decide. It might be possible to assess which servers we would not be able to communicate with if the setting were on...
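For reference, enabling secure mode once on v3.0+ would be a one-line environment change. This is a sketch assuming a standard non-Docker systemd deployment; the exact service names and file locations vary by setup:

```shell
# Sketch: enable Mastodon secure mode ("authorized fetches") on v3.0+.
# Add this line to the instance's environment file (typically .env.production):
#
#   AUTHORIZED_FETCH=true
#
# Then restart the Mastodon services so the setting takes effect, e.g.:
sudo systemctl restart mastodon-web mastodon-sidekiq mastodon-streaming
```

As the blog post notes, this should only be flipped on with the knowledge that it may break federation with servers running older Mastodon versions.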

NS

Nick Sellen Thu 16 Jan

robots.txt is a static file included in the repo, see https://github.com/tootsuite/mastodon/blob/master/public/robots.txt (or for our current version), so it's not configurable within the instance, or per user, but we could choose to have our own one to override the default. I didn't manage to find an instance that has customized it though, so it would need some research; maybe a question to #mastoadmins would come up with something.
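For illustration, a custom robots.txt served in place of the default could be as blunt as disallowing all crawlers. This is a sketch, not a tested configuration; it would turn away well-behaved crawlers and scrapers that respect the Robots Exclusion Protocol (like the researchers in this case, who said they honoured robots.txt), but it has no effect on hostile ones:

```text
# Hypothetical override of Mastodon's default public/robots.txt.
# Ask all compliant crawlers to stay out of everything:
User-agent: *
Disallow: /
```

A less blunt variant could allow specific well-known crawlers while disallowing the rest, but either way it remains a request, not an enforcement mechanism.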

M

mike_hales Thu 16 Jan

@Nick Sellen great to have that greater depth, thanku. On that basis I'm happy to switch to a YES vote. Roll on v3.0!

M

mike_hales
Yes
Thu 16 Jan

I don't fully understand the statement (eg 'authorized fetches') and there's no <abstain> option, so must vote a NO. Later: Nick provided some detail and I'm happy to vote YES now. Thanku.


NS

Nathan Schneider Thu 16 Jan

@Nick Sellen sorry about the bad link. Here's a nice article on the PPL.

DM

David Mynors Fri 17 Jan

Yes

DM

David Mynors
Yes
Fri 17 Jan

AW

Aaron Wolf Fri 17 Jan

No

Mixed feelings, and I'm open to changing my mind, but I'm skeptical of the PPL. I support co-op solidarity and the intention of the PPL 100%. But I'm critical of discriminatory licenses. I prefer the PPL over CC-NC because blanket anti-commerce is even worse. But plain copyleft, CC-BY-SA, would accomplish what I see everyone talking about here: getting anyone doing research to publish the research under free terms we could all access.

FWIW, I would like to mark my posts CC-BY-SA

M

Michael Fri 17 Jan

Yes

M

Michael
Yes
Fri 17 Jan

M

mike_hales Sun 19 Jan

Clear positive vote on Put a Peer Production License on Social.coop tweets. But only 5% turnout. This needs tooting? A second vote? Its own thread? Authorized fetches is heading the same way. These need much more participation?

NS

Nick Sellen Sun 19 Jan

There is an Open Letter from the Mastodon Community, via https://sunbeam.city/@GwenfarsGarden/103507032332626576 which says they are asking if people want to co-sign.

Interestingly it points out they did not abide by the terms of service, and did not sufficiently anonymize the data. The dataset has been pulled from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/R1HKVS due to legal issues.

D

Django Mon 20 Jan

Thanks for expanding on this. I had made an assumption about it based on some toots. And as @mike_hales pointed out, more info was needed for an informed decision.

D

Django Mon 20 Jan

Agreed!

Maybe the official @SocialCoop@social.coop account could announce the polls/discussions to the instance users.

Should we re-roll the 2 polls into 1?

M

mike_hales Mon 20 Jan

@Matthew Cropp or @Matt Noyes or @emi do Would you announce? But @Nathan Schneider @Django need to float the polls again - new thread to lessen confusion?

M

mike_hales Mon 20 Jan

Interesting thorough letter, worth filing. Closed now for signatures.

MN

Matt Noyes Tue 21 Jan

How about this? @Nathan Schneider and @Django combine the polls into one, then announce it together, with a toot from the social.coop account as back up. I am happy to encourage people to participate.

M

mike_hales Tue 21 Jan

Just a thought on ‘good faith’. From the analysis in the letter of protest that has been written in the fediverse, it seems clear that the researchers were not acting in good faith at all. Rather, they seemingly acted in a pretty crass, ignorant way, didn’t do what they said they did, and weren’t aware of half the things they should have been, if they were fully literate users. So expectations of good faith were no protection in this case.

In something that’s quite technically complex like this, I might expect dumb ignorance to be a pretty widespread possibility (including in fields of casualised, precarious employment in academia), and expectations of good faith to be no defence against harm. Legal action and compensation after the harm is done isn’t a substitute for defence?

Scholar.social seems to be the act to follow on this?

NS

Nick Sellen Tue 21 Jan

So expectations of good faith were no protection in this case.

Indeed, I was too optimistic about that I think, but I still feel it was perhaps just badly implemented good faith ;)

... but the legal side seemed more successful, in that the dataset got removed from where it was hosted due to the legal basis.

W

Wooster Wed 22 Jan

Any solution short of a technical measure preventing the actual scraping of posts (such as only permitting friended authenticated users to read your toots) will not stop your toots from being scraped and harvested, along with any identifiable information that is available.

Put succinctly: if you make information on the internet available to people without authentication, it can and will be scraped. Regardless of any laws, letters, privacy statements, terms of service, strongly-worded posts or anything else. The researchers and scholars who make their intentions public may make some effort to abide by these guidelines and attempt to redact personal information, but the actors who may really be doing things with your data that you'd rather they didn't will not be so obligated.

Don't post stuff on the internet if you don't want it to be public information. There's no social mechanism that has enough force to prevent others from accessing it in an automated fashion.


If you want to post things on Mastodon that others can read, secure mode will likely break that capability. Secure mode does not prevent scraping; it merely lets you see who is doing the scraping, and that can still be an anonymous user. Either your toots are public or they aren't; authorized fetch does nothing to prevent scraping, since anyone can set up a new Mastodon instance and create an HTTP signature to make authorized fetches. The fediverse, like Twitter, is not a place for posting anything you wish to keep private in some fashion. Either people and machines can read your content, or they can't. No open letters or policies will change that.

Normally, all public resources are available without authentication or authorization. Because of this, it is hard to know who (in particular, which server, or which person) has accessed a particular resource, and impossible to deny that access to the ones you want to avoid. Secure mode requires authentication (via HTTP signatures) on all public resources, as well as disabling public REST API access (i.e. no access without access token, and no access with app-only access tokens, there has to be a user assigned to that access token). This means you always know who is accessing any resource on your server, and can deny that access using domain blocks.

Unfortunately, secure mode is not fully backwards-compatible with previous Mastodon versions. For this reason, it cannot be enabled by default. If you want to enable it, knowing that it may negatively impact communications with other servers, set the AUTHORIZED_FETCH=true environment variable.
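For admins running their own instance, the docs quoted above boil down to a single setting. A minimal sketch, assuming a standard non-Docker deployment where configuration lives in `.env.production` and services are managed by systemd (the file path and unit names are assumptions; adjust for your deployment):

```shell
# Enable secure mode (authorized fetch) by adding this line to the
# instance's .env.production file (path assumed, commonly
# /home/mastodon/live/.env.production):
AUTHORIZED_FETCH=true

# Restart the services so the new environment variable takes effect
# (unit names assumed from a typical systemd setup):
sudo systemctl restart mastodon-web mastodon-sidekiq mastodon-streaming
```

As the docs note, this may break federation with older servers, so it is a trade-off to weigh rather than a default to flip.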

W

Wooster Wed 22 Jan

How would any of this actually prevent scraping?

M

mike_hales Wed 22 Jan

It's good to have it stated as plainly as this @Wooster thanks.

if you make information on the internet available to people without authentication, it can and will be scraped. Regardless of any laws, letters, privacy statements, terms of service, strongly-worded posts or anything else

As I said earlier my concern in this is not privacy: I personally operate in the fediverse with an awareness that careless talk is as unwise here as it is anywhere. My concern is with the uses that are made of material in the commons. I'm concerned that there should be concerted efforts to build real commons of digital media, and that any analysis that is made on materials in commons should be returned to those who created the materials, and notified to them - as I wrote here. This is a big ask and how to do it is unclear. But potentially this gives us the means of a much more embracing awareness of who 'we' are and what 'we' do . . Silicon Valley oligarchs and State security agencies are not the only people with an interest in knowing the shape and dynamics of our behaviour 'in the large'. This is a kind of literacy that's become possible in the past generation, and it's time it was seriously attended to.

One of the things that's most difficult in getting started on this, is that the ethos of commoning is different from, and tangential to, the basic ethos of the web and free software. These latter are built within an anarcho-libertarian culture of autonomism and complete privacy of and control over individual property. This orientation has brought some very powerful tools and technologies, and there are more in the pipeline - open data, mesh networks, open app ecosystems, whatever. But commons are post-propertarian. They're built within a culture of stewarding, curating and enjoying in which all participants have the same access to the same means, under the governance and policing and common aesthetic of them all. Commoning is associationist rather than libertarian and individualist, and the peer-to-peer culture of free software production - a world of protocol-commons - is a space where the two cultures have an awkward coexistence, which is far from resolved.

From the standpoint of building commons, the persistent concern with privacy is a sideshow and maybe a distraction, and the main game is finding ways of policing and ending extraction from commons, and facilitating and mandating return of value to the commons. It's no less urgent (though less of a life-and-death matter) to start focusing this in digital commons, than it is in the wild commons of air, water, energetics and biosphere. Digital data is one of the 'new wildernesses'; cowboys, frontiersmen, gunslingers and homesteaders are out there (where are the posses and deputies? who shot the sheriff?); and so are industrial-scale, robber-baron, clear-felling, cash-cropping, land-grabbing, financial-capital giants. It's the kind of steampunk world Neal Stephenson might write, but it just so happens that we're in it?

PS: I think the open letter, and the stance of scholar.social, is still interesting. The slack, extractive ethos of academia certainly needs attending to. They (we - I used to be one) need to learn new ways of being in, and serving, communities that are not basically running on academic-elite, publish-or-perish, knowledge-commodity rules.

NS

Nick Sellen Mon 3 Feb

The researchers and scholars who make their intentions public may make some effort to abide by these guidelines and attempt to redact personal information, but the actors who may be really doing things you'd rather them not with your data will not be so obligated.

Yup, that's my feeling too. I think we are mostly limited to that first category, but that is still very useful to me (e.g. enough to have the dataset pulled from this public Harvard database https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/R1HKVS); there is a clear difference to me between having all this data/content in official, public datasets and having it in secret, illegal ones.

For the second category, the bad actors, we can only make things marginally more difficult (by requiring that they actually have an account or instance; the details are unclear to me). It seems worth weighing the pros and cons and letting the membership decide. But because of the compatibility issues with authorized fetch mode, I voted no on the check, as I'd like that more informed discussion first, plus an assessment of how many of the servers we federate with would be impacted; maybe over time enough instances will have upgraded that it's no longer an issue.

Don't post stuff on the internet if you don't want it to be public information. There's no social mechanism that has enough force to prevent others from accessing it in an automated fashion.

I think it would also be worth making this clearer to people: Mastodon is not a private/secure messaging platform.

DS

Danyl Strype Sat 4 Apr

Last year, the journal 'Information, Communication & Society' published a special issue called 'Locked Out', critically examining the ways the walled garden nature of corporate social media platforms has accelerated the problems of online mis/disinformation:

https://www.tandfonline.com/toc/rics20/22/11

One of the issues they raise is the way those platforms are protecting themselves from accountability by preventing researchers from accessing data, using "privacy" as an excuse. It seems to me that the fediverse community is getting sucked into a manufactured moral panic around this, mistaking privatization for privacy.

Scraping of every piece of public-facing work on the web is totally normal. It's how all search engines work. It's how the Wayback Machine works. What's the difference between scraping the public discussions on public-facing Discourse forums for a search engine index, and scraping the public-facing discussions on the fediverse (or any social media platform) for discursive research? I'm reminded of the debates in the EU about 'Freedom of panorama'.

If you don't want your statements to be recorded for posterity, say them privately. AFAICT it's as simple as that. Lots of people seem to think every conversation is a nail because they like their Mastodon hammer so much. But there are plenty of free code tools that have much better tools for private discussions, even within the fediverse; Diaspora, the Zot apps (Hubzilla and Zap) etc.

DS

Danyl Strype Sat 4 Apr

@mike_hales

the basic ethos of the web and free software ... are built within an anarcho-libertarian culture of autonomism and complete privacy of and control over individual property.

I'm sorry Mike but this is a myth, one that sows confusion and division within the digital commons movements (and IMHO was crafted to do so). People like Adam Curtis and Fred Turner who propagate this just-so story about the origins of personal computing and the net are either confused about the history, or being knowingly deceptive.

If you read the founding documents of the GNU Project and the FSF, it's very clear that the motive is to create a software commons, to protect people's ability to share and cooperate in their use of computers. TBL's earliest descriptions of the web were about the benefits of bringing documents out of the individual silos on people's computers, and similarly sharing them as a commons where everyone can build on each other's work. Same with other early web media projects like Indymedia and Wikipedia. Even EFF founder JPB's 'Declaration of the Independence of Cyberspace', often referenced as the canonical example of this perceived Silicon Valley Randianism, says (emphasis mine):

"It is an act of nature and it grows itself through our collective actions."

"Your legal concepts of property, expression, identity, movement, and context do not apply to us."

The rugged individualist "libertarian" discourse came later, after the invention of HTTPS allowed for "e-commerce", making the net interesting to capitalists. This accompanied (perhaps even led to) the privatization of much of the internet's infrastructure, such as the commercialization of the DNS system, and the rise of silos like Farcebook that use web browsers as a universal UI, but don't respect web standards as an open, shared platform.

Fudging together the hacker ethos represented by Stallman, Berners-Lee, and Barlow, with the corporate apologism of Silicon Valley, is not only wrongheaded, but quite frankly it's deeply insulting to those of us who carry the torch of the former, and utterly repudiate the latter.

AW

Aaron Wolf Sun 5 Apr

Spot on Danyl. I was at LibrePlanet 2015 where an audience member was questioning Richard Stallman about how we should trust the "government take over of the internet" with the net-neutrality FCC stuff. Stallman answered like this (my recollection, video is available if I want it perfect, but it's not important):

> Entities that I trust have told me this approach to net-neutrality through Title II is overall positive. And pointing out the problems with one regulation to oppose all regulation is nonsense. It's like saying "there was a BAD law, so therefore we should not have laws." Oh, and maybe there's some misunderstanding, that people think I'm an anarchist. But I'm not, I think we need governments for many important things. In fact, I have a PRO-STATE gland!

A primary motivation that RMS had in founding the free software movement was to have the sort of community collective that he experienced at the MIT AI lab. He saw proprietary software as undermining a sharing, collaborative society. And his politics are basically Green Party views, quite different from libertarians. And many others in the movement share those views, though not exclusively.

Here's what RMS says about so-called-libertarians: https://stallman.org/glossary.html#anti

M

mike_hales Sun 5 Apr

Thanks @Aaron Wolf @Danyl Strype It's good to get this affirmation. Regarding the myth . . it's not that I've read this in malicious narratives, which Strypey identifies; it's something I've observed. As a relative outsider to hacker culture, and a latecomer, what I do see is a whole lot of libertarian ethos. But yes, the framing as cultural commons is utterly - well, sufficiently - different, and entirely welcome. The commitment of commoning is a deeply transformative one, beyond state and market, beyond consumerist individualism and supremacy of any kind. So it's good to see these affirmed as also being deep threads running in the FOSS (or should I say opensource?) world.

I say ‘also’ because both are de facto presences in the now massive forces of code production and use, and it's a struggle. Origin myths - “back in the day, in the unwalled garden, it was like this” - are comforting, and it's important to have them as counter-stories. But they don't change the present reality, which is that it truly is a struggle to claim the ground for the commons (and not reclaim it, since this developed ground of internet and platforms and data oligarchy that exists today has never yet been in the commons?)

DS

Danyl Strype Sun 5 Apr

@mikeh8

FOSS (or should I say opensource?)

Up to you, but FWIW Stallman prefers "free software" or "software freedom". I like to use "open source" to describe the development methodology, and say "free code" to describe the outputs.

The commitment of commoning is a deeply transformative one, beyond state and market, beyond consumerist individualism and supremacy of any kind.

I agree. Free code, and open source practice, were designed from first principles to be a commons approach to software development. Proprietary software is the market approach. I'm not aware of a state-driven approach.

“back in the day, in the unwalled garden, it was like this”

My point is that the digital commons never went away. It's grown continuously since the GNU Project was founded. If it hadn't, neither Loomio nor the fediverse would exist. The Silicon Valley anti-socialist ideology is parasitic on that commons. It's an artifact of the VC parasitism on the goodwill associated with "open source", as is the promotion of "source available" proprietary licenses, see: https://mjg59.dreamwidth.org/52907.html

AW

Aaron Wolf Sun 5 Apr

I use the term FLO as in Free/Libre/Open because it's all those aspects (and more). See https://wiki.snowdrift.coop/about/free-libre-open

The fact is that people have gotten enough real-world experience to see that the vision is possible. Wikipedia is probably the best example in being uncompromising, completely FLO, community-run, public-facing. It has its problems, nothing is perfect. But it is a proof-of-concept.

The antisocialists (to use Stallman's term, which I like) certainly exist and many are indeed drawn to FLO tech because it doesn't directly conflict with their ideology. There are then many heated debates within FLO between pro-social parts of the movement and the "libertarian" and pro-corporate parts. It's not just FUD, these issues are real, and I've encountered them too.

My platform co-op (still working toward launch) exists specifically to address these things. The most supported FLO is that which serves corporate ends. We need to solve coordination problems and cooperate in order to fund public-focused, downstream public goods. That's the mission of Snowdrift.coop and I would greatly welcome and appreciate your participation, feedback, questions etc. We have a thorough wiki and our own forum etc. And we have done the research on the whole space.

FLO really did start from pro-social foundations, but once Open Source process showed enough dramatic success, it got co-opted. I think this is the best overview: https://mako.cc/copyrighteous/libreplanet-2018-keynote

These are deeply serious political challenges. Snowdrift.coop aims to address coordination around funding but many other elements are needed for the movement to succeed. It is dire right now, not a success yet.

AU

Ana Ulin Sun 5 Apr

The comparison with search engines is a good one: one expects a search engine to respect robots.txt directives. Crawlers that systematically disregard Disallow and noindex directives typically get their IPs and User-Agents blacklisted. It is not a matter of what is technically possible (one can't technically enforce robots.txt, if the pages are still accessible on the web), but of what is accepted as good etiquette and good-will behavior.
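To make the etiquette concrete: a well-behaved crawler fetches robots.txt and checks it before requesting any page. A minimal sketch using Python's standard library, with a hypothetical robots.txt that an instance admin might publish to keep crawlers away from profile and API paths:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt an instance admin might serve
# (the paths here are illustrative, not Mastodon's actual defaults):
ROBOTS_TXT = """\
User-agent: *
Disallow: /users/
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching each URL:
print(parser.can_fetch("*", "https://example.social/users/alice"))  # False
print(parser.can_fetch("*", "https://example.social/about"))        # True
```

Nothing stops a crawler from skipping this check, which is exactly Ana's point: the mechanism is etiquette, not enforcement.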

Similarly, the fediverse has developed an etiquette around respecting #nobots, and thus it is a reasonable expectation on a Mastodon instance to have that be respected.

Yes, everyone posting publicly should be aware that anyone can see their toots, and those could get scraped, screenshotted or what have you. But that does not mean that posting publicly gives anyone the license to aggregate and re-use my content without my consent, any more than the fact that postcards are open gives the postman permission to make copies of all the ones I receive and post them in the local paper.

DS

Danyl Strype Mon 6 Apr

The comparison with search engines is a good one: One expects a search engine to respect robots.txt directives.

The researchers concerned did that.

But that does not mean that posting publicly gives anyone the license to aggregate and re-use my content without my consent, any more than the fact that postcards are open gives the postman permission to make copies of all the ones I receive and post them in the local paper.

A DM is equivalent to a postcard. A public post is equivalent to a poster on a public wall. You can't send a letter to the editor and then get grumpy with people for archiving copies of the newspaper it's published in, or for using them for research purposes. If you don't want your posts treated as published works, you can post them as DMs, even in group conversations where everyone else is posting publicly.

There are a plethora of tools for private social messaging, even within the fediverse. Diaspora was one of the earlier examples, allowing users to give access permission to only one person (like a DM), some people (like group DMs), a group of people defined by the posting user ("aspects"), or everyone (public). Friendica does private messages with the DFRN and Diaspora protocols, and maybe now ActivityPub? Hubzilla and now Zap have been doing federation of private content with Zot, later AP.

You can argue with technical reality all you want, but there's no changing the fact that if you walk around with no clothes on, everyone can see your junk. The solution is to get dressed, not to waste your time and other people's trying to control how other people's eyes work.