Loomio
Sun 16 Feb 2014

Statistics and privacy

MB
Manuel Bichler Public Seen by 74

The opt-in statistics.json feature provides real-time sums of the number of users and posts on the D* pod. This might be a privacy issue, especially on small pods.

Is this considered a problem? Should we change the statistics implementation?

JR

Jason Robinson Sun 16 Feb 2014

Also discussion in this post.

MB

Manuel Bichler Sun 16 Feb 2014

I'm really concerned about this feature regarding privacy, which is why I turned it off again to protect my pod's users.

One can easily track when exactly a new user or a new post appears, even though this data should be hidden imho.

For example, say Alice promised me she would register on pod X. If the user count of that pod didn't increase within the last week, I know that Alice didn't register. Or if the user count of pod X increased by only 1, I know exactly when Alice registered. The same is true for posts.
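To make the inference concrete, here is a minimal sketch of what an observer could do with nothing but the published counter. It assumes the observer has already collected (time, total_users) pairs by polling; the function name `registration_windows` and the sample numbers are hypothetical, not part of diaspora*.

```python
from datetime import datetime, timedelta


def registration_windows(snapshots):
    """Return (earlier, later, new_users) for each pair of consecutive
    snapshots between which the user count increased."""
    windows = []
    for (t0, n0), (t1, n1) in zip(snapshots, snapshots[1:]):
        if n1 > n0:
            windows.append((t0, t1, n1 - n0))
    return windows


# Hypothetical observations of a small pod's total_users, one per minute.
start = datetime(2014, 2, 16, 12, 0)
observed = [(start + timedelta(minutes=i), n)
            for i, n in enumerate([41, 41, 42, 42])]

for earlier, later, added in registration_windows(observed):
    print(f"{added} new account(s) between {earlier:%H:%M} and {later:%H:%M}")
```

The shorter the polling interval, the narrower the window in which a registration can be placed, which is why the freshness of the published number matters more than who happens to be polling it.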

This is a common problem within statistical databases (referring to Stallings).

I share the opinion that there should be a way to make good estimates about the number of people/accounts using Diaspora, and obviously the statistics.json method is a good way to provide those figures. But especially for very small pods (like mine) I consider it a privacy concern if they provide real-time sums for posts. Small pods are clearly less important for generating network-wide estimates than big pods like geraspora.de or diasp.org. But since Diaspora is a decentralized network, there will always be small pods, and, who knows, maybe in the future small pods (<100 accounts) will make up the majority of accounts (although the Pareto principle is more realistic imho - https://en.wikipedia.org/wiki/Pareto_principle).

I'm really keen on providing stats on my pod, but - sorry - not in real time (also not when my pod grows). I cannot provide a service to my users in clear conscience as long as it discloses those real-time sums.

MB

Manuel Bichler started a proposal Sun 16 Feb 2014

Provide weekly snapshots instead of real-time data Closed Tue 18 Feb 2014

Outcome
by Manuel Bichler Tue 25 Apr 2017

Only 1 pro in 12. Proposal declined.

The statistics.json info (the fields total_users, active_users_halfyear, active_users_monthly and local_posts) should not contain the current sums, but those of the last Tuesday midnight GMT instead. This makes sure that the numbers are not real-time but weekly snapshots.
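As an illustration of the cutoff this proposal describes, the following sketch computes the most recent Tuesday 00:00 GMT for a given moment. The helper `count_as_of` is a hypothetical addition that assumes creation timestamps are available; it would only cover cumulative counters such as total_users and local_posts, not the active_users fields. None of these names come from the diaspora* code base.

```python
from datetime import datetime, timedelta, timezone

TUESDAY = 1  # datetime.weekday() numbers Monday as 0


def last_tuesday_midnight(now):
    """Most recent Tuesday 00:00 UTC at or before `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_back = (midnight.weekday() - TUESDAY) % 7
    return midnight - timedelta(days=days_back)


def count_as_of(created_at_times, cutoff):
    """Count records created strictly before the snapshot cutoff."""
    return sum(1 for created in created_at_times if created < cutoff)


# Example: the day this proposal was started (Sun 16 Feb 2014).
now = datetime(2014, 2, 16, 15, 30, tzinfo=timezone.utc)
cutoff = last_tuesday_midnight(now)
print(cutoff.isoformat())
```

Counting records up to a fixed cutoff, rather than storing a snapshot, is only one possible way to obtain week-old sums.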

Agree - 1
Abstain - 4
Disagree - 6
Block - 1
12 people have voted (4%)
MB

Manuel Bichler
Agree
Sun 16 Feb 2014

S

StarBlessed
Abstain
Sun 16 Feb 2014

N

NicoAlto
Abstain
Sun 16 Feb 2014

DU

[deactivated account]
Abstain
Sun 16 Feb 2014

d* statistics are a good idea, and privacy isn't compromised currently IMHO (maybe just barely) since the stats are cumulative. But if it encourages more podmins to opt in we can make it weekly, it won't hurt functionality much. podmins should vote

L

lnxwalt
Abstain
Sun 16 Feb 2014

This asks Jason to do a lot of extra work for a very minimal privacy benefit.

F

Flaburgan
Disagree
Sun 16 Feb 2014

I don't see which real problem could appear with the data pulled every day. Those stats are completely anonymous; knowing the number of people registering on a given day is not a privacy leak.
Even if it's 1, it gives no real information on who registered

S

SuperTux88
Disagree
Sun 16 Feb 2014

DB

Dee Baumdeesaster
Disagree
Sun 16 Feb 2014

MP

Mike Powell
Block
Sun 16 Feb 2014

JH

Jonne Haß
Disagree
Sun 16 Feb 2014

This is an abstain but the proposal is overly specific for this case. Possible privacy leaks depend not on interval, but on interval in relation to pod size. If we do cached statistics, the interval should be configurable.
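A rough sketch of what cached statistics with a configurable interval could look like, assuming the pod recomputes the numbers on request once the cached copy is older than the configured interval. The class `CachedStats`, its parameters and the injected clock are hypothetical and not part of diaspora*.

```python
import time


class CachedStats:
    """Serve a cached statistics dict, refreshed at most once per interval."""

    def __init__(self, compute, interval_seconds, clock=time.time):
        self._compute = compute            # callable returning the current stats
        self._interval = interval_seconds  # podmin-configurable refresh interval
        self._clock = clock                # injectable for testing
        self._snapshot = None
        self._taken_at = None

    def get(self):
        now = self._clock()
        # Recompute only if there is no snapshot yet or it has expired.
        if self._taken_at is None or now - self._taken_at >= self._interval:
            self._snapshot = self._compute()
            self._taken_at = now
        return self._snapshot


# Hypothetical usage: a small pod choosing a one-week interval.
stats = CachedStats(lambda: {"total_users": 42}, interval_seconds=7 * 24 * 3600)
print(stats.get())
```

A podmin of a small pod could pick a long interval and a large pod a short one, which is the relation between interval and pod size described above.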

PG

Paul Greindl
Disagree
Mon 17 Feb 2014

JR

Jason Robinson
Disagree
Tue 18 Feb 2014

Just in time to vote! :D

S

StarBlessed Sun 16 Feb 2014

This seems arbitrarily paranoid. While I understand both sides of the story, I will not vote either way on the subject. I don't agree with either side. Honestly, I fought against any statistics in the first place. This, however, is just trying to stuff the genie back into the bottle.

MB

Manuel Bichler Sun 16 Feb 2014

@starblessed Based on what I read in your other posts, you consider a pod's providing statistics about the number of accounts and posts a step into centralizing the network and a step against privacy as such.

I disagree on the decentralization point but I understand your privacy point. You can't have full statistics and full privacy at the same time; those two things are mutually exclusive, just like in Heisenberg's uncertainty principle. We have to balance between statistics and privacy - and for proposing dropping all statistics because of privacy issues, one may call you paranoid. ;)

S

StarBlessed Sun 16 Feb 2014

And you would be right. I am paranoid. I won't connect my pod to FB or Twitter for that very reason. I'm getting close to pulling it away from Tumblr.
If I had my way, there would be no public data about any kind of D* statistics. But that's just me.

MB

Manuel Bichler Sun 16 Feb 2014

@starblessed Well, the nice thing about a decentralized network is that every pod has their own philosophy - some will provide statistics and some won't, and it's your very personal decision which one you prefer to open an account on.

I think we should make it easily possible for podmins without any programming experience to choose their own statistics philosophy, maybe even provide more options than to opt-in or not to opt-in.

Btw. if you really want to be on a pod that does not provide any statistics whatsoever, the podmin must assure that he/she does not even say "well, about a thousand" when being called by media and asked how many accounts he/she serves.

S

StarBlessed Sun 16 Feb 2014

I opted into the stats. Just for now. I want to see how it could possibly affect the value of the data.

L

lnxwalt Sun 16 Feb 2014

In the linked discussion, we learn that there are two separate issues: each pod's statistics.json file and the central stats collector's polling. If Diaspora has no concept of regularly scheduled tasks, this change could require a fairly extensive rewrite.

I'm going to abstain because I do not think there is enough of a privacy benefit to justify the extra work this asks Jason to do (rewriting the stats collection process).

MB

Manuel Bichler Sun 16 Feb 2014

@flaburgan you are referring to Jason's statistics hub that pulls every day, but the data itself is pullable in real-time. Just like Jason did, I could write a bot that pulls the pods' data every second instead of every day. This topic is not about any pulling bot, it's about Diaspora's statistics.json feature.

Stats about non-anonymous data are never "completely anonymous", ask @starblessed about that. ;)

@lnxwalt If the community decides that something has to be done, I could do the programming stuff. No need to burden Jason.

F

Flaburgan Sun 16 Feb 2014

Well, in that case, I think that the statistics.json can be updated every day. That looks precise enough for statistics, and long enough that nobody can tell exactly when someone registered (but seriously, given the massive amount of data online, what's the problem with knowing when "someone" is registering? Believe me, I'm really committed to online privacy, but here I don't get it...)

MB

Manuel Bichler Sun 16 Feb 2014

I know, this seems somewhat paranoid and the D* project probably has many other issues that are much more important than this one.

I just want to raise awareness that real-time sums may lead to data leaks on relatively small pods, which can only be avoided by opting out of statistics altogether. Even on relatively bigger pods, a real-time trend of the number of posts feels somewhat spooky.

We might also only publish the numbers in 100s instead of the exact numbers.
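A minimal sketch of that idea, assuming the published value is simply the exact count rounded down to a multiple of 100. The function name `coarsen` and the `step` parameter are hypothetical.

```python
def coarsen(count, step=100):
    """Round a count down to the nearest multiple of `step`."""
    if step <= 0:
        raise ValueError("step must be positive")
    return (count // step) * step


for exact in (42, 199, 1234):
    print(exact, "->", coarsen(exact))
```

With a step of 100, a pod with fewer than 100 accounts would publish 0, so very small pods would contribute nothing to a network-wide total under this scheme.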

A

Adrenalin Sun 16 Feb 2014

My first thought -- dick swinging

I do not understand at all why and who needs to have these statistics? Are they used to solve concrete problems?

For privacy reasons I would prefer not to analyze anything. Everything else is an invitation for data mining.

MB

Manuel Bichler Sun 16 Feb 2014

@adrenalin the statistics feature is already implemented, see https://www.loomio.org/d/FBjn89X2/central-hub?proposal=1y7tgbVP but it is an opt-in feature, so a fresh installation of D* does not publish any statistics whatsoever.

F

Flaburgan Sun 16 Feb 2014

@adrenalin the statistics are really needed, because we need credibility. To have more people coming to diaspora, more journalists and projects talking about us, more developers helping us to build nice software. Most of the people who don't follow the project just think that "diaspora is dead". We have to fight this idea and we need to show numbers for that.

A

Adrenalin Sun 16 Feb 2014

@manuelbichler
thanks Manuel, I realized that now but missed the discussion :(

I'd sign @rekado 's comment over there

I don't think statistics are at all important. Pod-local stats are important for the pod admin; since there is no "network admin" in a decentralised network I don't see the need for stats at that level. Email didn't need usage stats either.

If we follow the idea of decentralization iow every user setting up her/his own pod they'll know their activity … and I agree with @flaburgan 's comment too

…having a better list of the pods than poduti.me

would be more interesting.

MB

Manuel Bichler Sun 16 Feb 2014

@adrenalin Well, apart from the fact that there is no statistics interface specialized for podmins, don't you think that @flaburgan 's arguments up there pretty much show why a good estimate of the network-wide number of users would be good for pushing the project? I mean, you can't compare Email 1993 with D* 2014. Email became popular because it empowered people to do things they could never do without it, whereas D*, from a functional point of view, is for most users just a feature-poorer version of Facebook.

MB

Manuel Bichler Sun 16 Feb 2014

Oh, and @adrenalin "having a better list of the pods than poduti.me" already happened, and Jason had the motivation to do it because of the new statistics feature ;)

SVB

Steffen van Bergerem Mon 17 Feb 2014

I agree with @jonnehass. Closing the vote and reopening it with a proposal that the update interval should be configurable might be a good idea.

MB

Manuel Bichler Mon 17 Feb 2014

I obviously overreacted by starting the vote (didn't know how to start a discussion on Loomio the best way), so if nobody has objections, I will close the vote. Sorry for taking your time.

Do you think it's still worth discussing the update interval or granularity or should we just totally close this topic? As I hear it, the grand majority is not at all interested in changing the stats behaviour (although I'm still strongly in favour of doing so).

G

goob Mon 17 Feb 2014

We can't close a topic once it's been opened in Loomio. If anyone wants to comment further they can.

I don't think anyone's saying that the topic should not be discussed; it's more that perhaps your proposal contained too many specifics (Tuesday 00:00GMT, etc) when there hasn't been a discussion about even whether there's a need to change the behaviour of the stats feature. It might well be that there's a benefit in providing the option to configure the interval the stats are generated by each pod.

Let's see where the discussion goes, and then when there's a consensus being generated, take a vote on what seems to be a workable solution with some support.

JR

Jason Robinson Tue 18 Feb 2014

My main reason for not wanting to change things is that diaspora* does not have any cron-like scheduling component. Sure, we might need one, but I don't think it's a good idea to add one just for this change, unless it is needed for some other reason.

I find the privacy issues from on-the-fly statistics very very minimal, if at all. But that is why it is opt-in - not opt-out.

But of course, discussion should continue. If we can make the statistics feature more acceptable without losing statistics quality, then by all means. A weekly snapshot is one way but does not really change the privacy fact. For a small pod with little activity, stats once a week can be considered real time too.

JR

Jason Robinson Tue 18 Feb 2014

The question is in the end - how much complexity do we want to build into something that really has very very little gain, in terms of privacy?

G

goob Tue 18 Feb 2014

In terms of stats collection, it's the bigger pods which are more important (because the greatest numbers of users are on them), so if small pods don't join the stats collection, it will have less effect on the accuracy of the numbers than if some big pods don't join.

Rather than producing stats every week, how about building in an option to allow a pod to produce a number range rather than an exact figure for users? We could have ranges of say 1–5, 6–20, 21–50 users, or perhaps more precise than that. But by producing a range rather than an exact figure, even when the stats are produced in real time it wouldn't provide the accuracy of data some podmins are concerned about.
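A small sketch of how such ranges could be produced. The first three upper bounds are the ones named above; the open-ended top range and the handling of an empty pod are assumptions added so the function is total, and the name `user_range` is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the ranges 1-5, 6-20 and 21-50 suggested above.
UPPER_BOUNDS = [5, 20, 50]


def user_range(count):
    """Map an exact user count to a coarse range label."""
    if count < 1:
        return "0"
    i = bisect_left(UPPER_BOUNDS, count)
    if i == len(UPPER_BOUNDS):
        # Assumption: anything above the last bound is reported open-ended.
        return f"{UPPER_BOUNDS[-1] + 1}+"
    lower = 1 if i == 0 else UPPER_BOUNDS[i - 1] + 1
    return f"{lower}-{UPPER_BOUNDS[i]}"


for n in (3, 6, 20, 21, 75):
    print(n, "->", user_range(n))
```

A real deployment would need more bounds for larger pods; extending `UPPER_BOUNDS` is enough for that in this sketch.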

A

Adrenalin Tue 18 Feb 2014

@flaburgan

the statistics are really needed, because we need credibility. To have more people coming in diaspora, more journalist and projects talking about us, more developers helping us to build a nice software.

Except for the point that we need more developers, I think you are wrong.

Following your logic Facebook would have to be the company with the highest credibility worldwide out of all SN's. The opposite is true, it is a total fail in many ways.
Now:
What attracts users?

To know there are 1 or 2 or 3 million users? Guess not. They are registered where they know their friends are and where they find the functionality they think they need.

Usability is important if you want to attract a crowd that isn't geeks. Diaspora seems more complicated for newbies, as they need to make choices (which pod should I register on?) without knowing what is behind them. For non-geeks, wikis aren't as appealing a way to learn how to use a new SN as an easy-to-understand tutorial with pictures or animations. My guess is that quite a number of users or potential users give up because they think things are too complicated and unknown, and for techies only.

Privacy matters?!
For a few yes, for the majority not at all.

How many more users d* has acquired since Snowden and NSA leaks? If that scandal doesn't provoke privacy concerns for Facebook Users what else?

Journalism works in a different way. If you want articles written you need to have a good story to tell. And you need to establish long-lasting contacts, maybe know some journalists in person etc.

No one is interested in those boring statistics; it is not a story that sells. And journalism must sell or raise click rates.

@manuel Bichler

see my reply to flaburgan

JR

Jason Robinson Tue 18 Feb 2014

@adrenalin while I do think you are wrong, I don't think we really need to debate this. You might realize the statistics feature is already done. So you don't have to worry about usability vs statistics.

If your point was that statistics is not needed and usability is.. :)

F

Flaburgan Wed 19 Feb 2014

Snowden and NSA leaks

Tons. Really. But Jason is right, this is not the point here.