Loomio
Mon 13 Jan 2020 10:38PM

Mastodon scraping incident

Nick Sellen

You might have seen this stuff about the fediverse being scraped for some research (see https://sunbeam.city/@puffinus_puffinus/103473239171088670), which included social.coop.

I've seen quite a range of opinions on the topic, some consider the content public, others feel their privacy has been violated.

There are a few measures that could be put into place to prevent/limit this kind of thing in the future (see https://todon.nl/@jeroenpraat/103476105410045435 and the thread around it), specifically:

  • change robots.txt (in some way, not sure precisely, needs research)

  • explicitly prevent it in the terms of service (e.g. https://scholar.social/terms)

  • disable "Allow unauthenticated access to public timeline" in settings (once we are on v3 or higher)
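For the robots.txt option, a minimal sketch of what it could look like (illustrative only - the exact rules need the research mentioned above, the `*` path wildcard is a de-facto extension rather than part of the original standard, and robots.txt is purely advisory):

```
# Illustrative robots.txt for a Mastodon instance (not social.coop's actual file)
User-agent: *
Disallow: /api/      # REST API, including the public timeline endpoints
Disallow: /users/    # ActivityPub user resources
Disallow: /@*        # profile pages; '*' wildcard support varies by crawler
```

A scraper that simply ignores robots.txt is unaffected, which is why the terms-of-service and instance-setting options matter alongside it.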

Thoughts?

mike_hales Tue 14 Jan 2020 5:22PM

I disagree @Nathan Schneider I don’t regard this as a question of personal privacy, but rather a question of all results of analysis being returned to the community that was scraped. It would be great if research “helps us better understand ourselves”. But research findings go into the black hole of professional literatures, and research on ‘us’ (members of ’the public’) is only very rarely available to us, in any remotely direct way. Just as with environmental commons, so with cultural commons: Nothing extracted, that’s not returned to source.

The problem with academic research results is that although they're nominally ‘public’ they are in fact behind gates (cost of journals, access to journals, kinds of literacy required to read, casual elitism in presentation to specialist peers). Even when governed by ethics committees and formal legal issues of copyright are observed, academic research is basically extractive - that's the politics of the professional-managerial class ("Trust us to figure things out on your behalf. Don't bother yourself with the details, it's too technical for you anyway"). Today, with digital and bots, this has also become the problem of Big Data. 150 years on from the invention of stats and public administration, we still don't have remotely adequate ways of dealing with such issues.

If research were truly a commons, in which the (tacit, passive) contributors of scraped data were also directly participating in both the governance of the results-pool and the mundane enjoying of the results commons, that would be wonderful. But this isn't how it works in 2020. I feel that a university research team scraping my toots for a ‘public’ dataset is about half as bad as Facebook scraping my traffic for commercial exploitation or worse - still not good. Still not properly ‘public’ behaviour, in a society of elites.

This is the stance of a retired insider: a university research professional producing nominally ‘public’, public-funded findings. Bring on the commons! No, don't trust even well-meaning professionals. All power to the general assembly (hmmnn, not that, either! This is a tough one).

Nathan Schneider Tue 14 Jan 2020 11:45PM

I agree that research findings based on public data (or really any research from public institutions) should be publicly available. I don't know about the practices of these particular researchers, but most scholars I know at least make available open-access preprints of their research if the journals themselves are not open access. In general, academic research is more available and accessible than it ever has been, even though I believe the open access movement has a long way to go (and I've been trying to advance it through ethicaledtech.info).

But part of the point here is that public data is not just available to researchers at universities. It's available to anyone, in principle. It could be used for a variety of outcomes. One of the values of a truly open commons is that the resource is available to all, and that's the case with our data.

Of course, I believe there are times when we need to protect our data from certain forms of abuse. The recent rise in source-available licenses to prevent cloud software from being abused by Amazon is an example. In the co-op/commons community, we have experiments with the Peer Production License, which limits use to non-profit and cooperative entities. I would be very comfortable with applying the PPL to Social.coop content.

mike_hales Wed 15 Jan 2020 10:06AM

I would be very comfortable with applying the PPL to Social.coop content

Sounds good to me as a principle 🙂 At the same time, I still have the same real-world query as before . .

Is the fediverse or social.coop really going to take a violator to court? . . I don't understand how [PPL] is in fact supposed to make any real difference to private abuses of commons. Can someone explain the practical rationale for a terms-of-service/licensing approach? Who's it gonna stop?

The Rule of Law isn’t in fact how extractive privatising culture and economy work. They operate on “OK I did it. OK so sue me” - the Rule of Fiat. As if financial compensation could remove harm anyway. Genuinely, I don’t get what Copyfair etc are actually expected to achieve. There seems to be a mistaken equation between libertarianism and lawfulness. It seems to me that law for libertarians is a gun or a bulldozer.

mike_hales Wed 15 Jan 2020 10:43AM

the open access movement has a long way to go

All speed 👍

public data is . . available to anyone, in principle

Principle again. In practice the modus of the professional-managerial class is to pump data and knowledge out of . . let’s say ‘the Public’ . . into a stratum of culture in which it’s routinely mobilised by elites of professional wage labour, intellectual entrepreneurs, States and corporations, in the course of doing things to the public, not with those people who’ve been scraped, or enabling actions by those people. Typically, producing infrastructures for life and work that are not readily open to self-design and redesign by those whose lives and work they shape. This is 150 years of Fordist and post-Fordist capitalist practice. These are the stakes that the fediverse is playing for, and P2P-commons politics more widely.

The locations of ‘public’ datasets should be directly published to those communities from which they were scraped - with data from digital sources like the fediverse that’s not too hard at all, it’s mostly the intention that’s lacking. They should be glossed - in those same locations - so that they are meaningful to and usable by those people, rather than to the professionals the datasets are formulated for. Tools for using on the data should be published with the data - all of this under PPL/Copyfair.

When analyses of the data are published they should be posted to the same locations and notified. When used in designing infrastructures, that designing should be conducted as codesign, with the communities who needs-must inhabit the infrastructures.

This all sounds far-fetched, and quite hard to interpret in practical terms (even though practices of codesign - and designs of codesign practices - have made great headway in the past 50 years), which is a good measure of how far we are from having an actual Public, as distinct from various kinds of privatised territory. Let’s rather call it the commons, and let’s take that politicised description seriously, in practices of active and explicit commoning, rather than falling back on myths of publicness and professionalism? It’s quite another kind of practice, and yes, a long way to go.

In my activist peer groups, this past 40 years, there have been two key principles (principles!): in-and-against the State, and in-and-against the professional-managerial class (principles that can be extended in other directions too, with regard to other kinds of oppression and supremacy). I think this is the territory we’re in here. A long way to go.

Bob Haugen Wed 15 Jan 2020 1:25PM

@Nathan Schneider

most scholars I know at least make available open-access preprints of their research if the journals themselves are not open access.

In one case I know about, some researchers had to pay $20,000 to a predatory publisher (a big name) to offer an open-access version of their paper. Academic publishing is a racket.

Nick Sellen Wed 15 Jan 2020 2:04PM

Some interesting bits from those two papers/links are:

Within our survey sample, few users were previously aware that their public tweets could be used by researchers, and the majority felt that researchers should not be able to use tweets without consent.

and

The problem is that, for some researchers, whether the data is public is the only thing that matters. I suggest (sometimes loudly, to people who don’t want to hear it) that it shouldn’t be.

and

it’s critical that we move beyond simplistic rules and consider each situation individually and holistically. Researchers can’t place the whole burden on users; we can’t expect them to know that we’re out there watching, nor demand that they anticipate any and all potential harms that may come to them now or in the future.

I think people generally have very little awareness of how their data might flow around and be used, and often are not comfortable when they find out. Some people have been very upset by this scraping. I would love people to have more data awareness so they can make informed choices (I think most people don't know that server admins with root access can read all their private messages too... of course they shouldn't but how does anyone know that?).

@Nathan Schneider said:

If you don't want to be scraped yourself, you can set your posts to private

I, and I think other users, would like more nuance than that. Is it really reasonable to combine two cases as one: posting to the local instance timeline, so individual humans can discover your content, and making your toots available to anyone on the internet to scrape and analyse? It's not a technical limitation (excluding the malicious/hostile case) but a policy choice.
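To illustrate that last point - that unauthenticated access is a policy choice - here is a sketch of how little a scraper needs. The `/api/v1/timelines/public` endpoint is part of Mastodon's documented REST API; the instance name and helper functions are illustrative:

```python
# Sketch: fetching a Mastodon instance's public timeline with no account
# and no API token. The endpoint is from Mastodon's documented REST API;
# the instance name and helper functions are illustrative.
import json
import urllib.parse
import urllib.request


def public_timeline_url(instance: str, local: bool = True, limit: int = 40) -> str:
    """Build the unauthenticated public-timeline URL for an instance."""
    params = urllib.parse.urlencode({"local": str(local).lower(), "limit": limit})
    return f"https://{instance}/api/v1/timelines/public?{params}"


def fetch_public_timeline(instance: str) -> list:
    """Fetch recent public statuses - no login, no token, no consent step."""
    with urllib.request.urlopen(public_timeline_url(instance)) as resp:
        return json.load(resp)


# Usage (performs a live request, so it needs network access):
#   for status in fetch_public_timeline("social.coop"):
#       print(status["account"]["acct"], status["created_at"])
```

Disabling "Allow unauthenticated access to public timeline" (available from Mastodon v3) is what would close this particular door, without forcing individuals to set their posts to private.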

it's a resource for public research that helps us better understand ourselves

In this particular case, I'm kinda doubtful about their approach - content warnings are used for quite a range of purposes. One of the things I love about the fediverse is the element of human and community curation and moderation. I think Facebook has a hard time moderating its 982497294728472847287424 users - it's an unpleasant full-time job for a lot of people - whereas in the fediverse, moderation can be spread across each community as they wish, which perhaps makes it more manageable again.

The model of magic AI to help moderation feels like it comes from the Facebook-type case, automate away this drudgery, but seems far less appropriate for the fediverse, where data and tools that can empower the human moderators seems more useful to me (and seems quite distinct from just automated spam/bot detection).

Perhaps this research can support people that want to go in that direction, but it doesn't seem a very good start, to act so disconnected from the communities under study. I don't really understand their motives.

mike_hales Wed 15 Jan 2020 2:57PM

one of the things I love about the fediverse is the element of human and community curation

This is close to the heart I think. Curating is one of three dynamics at the heart of (digital or other) commons, and curating is a practice of valuing. Actual persons in actual communities of commoners, actually practising the valuing, within collectives, of what’s contributed in commons, in actual cases. This is a big evolutionary step we’re contemplating, world scale, digitally facilitated.

Machines can be told to do helpful things - filtering or flagging based on pattern recognition, for example. But to make a closed-loop valuing process (valuing-and-enforcing process?), enacted by machines, is surely something that should be contemplated only rarely? As distinct, for example, from closed-loop processes in real-time engineering systems put in place to prevent physical hazard.

AIs - well, just jumped-up machines - could do really helpful pattern recognition on ‘public’ Big data. We really could do with mirrors of our own tacit large-scale collective actions - in environmental commons, energy commons, media commons, material commons such as food supply chains or housing stocks, etc etc. Is the fediverse at work on this? Do we have to wait until the Big Data oligarchies are taken into coop ownership? Fat chance! Is anybody today going to trust ‘public’ ownership (the State) to do this? Or the professional communities of Big Data science, like the genome? I don’t think so. Policing the ownership of individuals’ data seems to be about as far as the Free Software and Free Web vision takes us? Have I got that wrong? Not the same thing at all as commoning.

mike_hales Wed 15 Jan 2020 3:03PM

 I don't really understand their motives

I don’t mean to be snarky here - I’ve earned my living in non-tenured academic contract research too - but . . Career. Publish-or-perish. ‘Interesting problems in the academic field’.

Connecting with communities (being part of non-academic communities, contributing analytical work) is hard work, in relatively unexplored modes. Researchers have only 24 hours in their days too, and mortgages to pay, and if they’re not going to be rewarded for that additional hard work, not much of it is going to get done (and THAT will be in personal spare time?).

Creature Of The Hill Wed 15 Jan 2020 3:12PM

If you don't want to be scraped yourself, you can set your posts to private

I, and I think other users, would like more nuance than that. Is it really reasonable to combine two cases as one: posting to the local instance timeline, so individual humans can discover your content, and making your toots available to anyone on the internet to scrape and analyse? It's not a technical limitation (excluding the malicious/hostile case) but a policy choice.

Any individual is welcome to browse through my profile and public toots. Please, go try it. You will see the effort it takes to build a picture and context. If an individual is willing to do that, they are investing time in understanding the context and motivation for those toots and their connections. You will also notice it's not so easy to see everything in one place, as a single column. So it's public, but in a form designed to be interpreted by actual people.

An entity or organisation or tool is a different matter. To infer, from the way a profile is presented or how toots appear in a public timeline, that a user consents to them being used en masse by an entity is, to my eye, wrong.

I would think that most users would see it like this.

That is not to say that I or others might not give consent if informed.

But implying it just feels shady and convenient for those that want the data without having to go to too much effort. Just because I could go looking around the internet for open resources, doesn't mean it is right in all cases.

Nathan Schneider Wed 15 Jan 2020 7:10PM

Agreed. That's unusually egregious. But, again, researchers generally make non-paywalled preprint versions of their research available as well. There have been significant gains in the open-access movement. That, again, would be the value of the PPL: researchers using our data would have to publish through a non-profit or cooperative outlet, and those are typically open access.
