tl;dr This guide presents an approach to privacy-aware data processing, based on geotagged photos of the YFCC100M dataset and by using SQL commands on two PostgreSQL databases through pgAdmin, visualized with Python in a Jupyter Notebook.
Imagine someone attempting to analyze geotagged photos from Social Media, such as those found in the YFCC100M dataset, to monitor user frequentation.
This is a typical visual analytics task and it is done in many different application areas. Just one of many examples is the InVEST Project, where Flickr images are used to quantify aesthetic values and nature based tourism.
What are the ethical implications of using this data?
More specifically, the question could also be formulated: Is it ethically correct to use this data, at all?
There is an ongoing and active debate around using User Generated Content (UGC) for applications not directly considered by the people volunteering (for a summary, see See at al. 2016). The problem centers around the very definition of volunteering itself. From a narrow view, data can only be considered "volunteered" when it is used for the specific application for which it was originally collected and contributed by the people volunteering. But most data can be analyzed and used in multiple ways. In other words, data collection and data application are often two quite different things.
Consider, for example, the people volunteering cartographic data on OpenStreetMap. This is a prominent example of Volunteered Geographic Information (VGI). However, OSM data can be used for applications beyond mere cartography. Just one example is the research from Jokar et al. (2015), who analyzed inequalities of contributions on OpenSteetMap, as a means to study social and political dimensions of OpenStreetMap. Clearly, the OSM data was not volunteered for such a specific use. Does this mean that VGI should only be used for the specific context it was volunteered?
Currently speaking, most authors argue that it is not productive to limit evaluations to the original mode of data contribution (Lane et al. 2015; Metcalf & Crawford 2016). Instead, it is suggested to take into account a wider array of measures and, specifically, turn more attention towards the actual applications of data, and whether those would be considered beneficial by the people volunteering.
In the example presented here, broadly speaking, people volunteered Flickr photographs to a general and open cause, by labeling those with a Creative Commons License. We're using this volunteered data here, in a simplified example, to quantify nature based tourism. Such an application was perhaps not anticipated by the people volunteering. However, it appears possible to say that this application is beneficial to individuals or society as a whole because results may be used to support a sustainable management of tourism and nature's resources. Such implicit uses of data appear to dominate (see Ghermandi & Sinclair 2019).
Are such implicit uses of VGI considered legitimate? Since this documentation focuses on technical aspects of data processing, it is not possible to fully portray these conceptual difficulties of using crowdsourced and volunteered data. However, making sure that no personal data is available, at any step of the process, becomes of fundamental importance. The rationale is that people's privacy can be compromised, irregardless of the mode of contribution. Therefore, the process demonstrated in this tutorial applies to both, crowdsourced and volunteered data.
In this specific context of application, making sure that no personal data is available, at any step of the process, becomes of fundamental importance.
There are different techniques available. The primary key to protecting user privacy is to use as little information as is possible. Particularly, not storing original user ids and post ids may efficiently prevent several attack-vectors targeted towards those who perform the data collection.
As a key measure in this example, we make use of a data abstraction called HyperLogLog (Flajolet et al. 2007), to prevent storing original distinct and highly identifiable IDs in the first place. This abstraction results in a privacy–utility tradeoff, by reducing accuracy of the final visualization by 3-5%.
However, HyperLogLog cannot protect user-privacy per se (Desfontaines et al. 2018, Reviriego & Ting 2020). But, the algorithm can be combined with other approaches that, together, largely reduce privacy conflicts.
Not all implications are discussed here. For example, in the case of data published publicly, original data is available anyway. One could argue that such a situation makes any attack vector on collected data redundant. This is not the case. Firstly, it is a responsibility of any data collection authority to take individual precautions. Secondly, data published publicly may be removed at any time, requiring others to also remove data from any collected datasets. The YFCC100M dataset demonstrates that such requirement is typically impractical once original data is published, as it still contains references to many photos that have been removed in the meantime from Flickr.
In this guide, we demonstrate how to make use of a combination of techniques, to reduce the risk of privacy attacks on intermediate data collected and the final visualization published.
- Query limited data (data reduction)
- Reduce spatial granularity on initial query (data accuracy)
- Cardinality estimation, instead of counting original IDs (privacy–utility trade-off, reducing accuracy by 3-5%)
As a result, the visualization and intermediate data created in this example do not require storing any original data, while at the same time retaining the capability of being continuously updated.
To set up your environment for this guide, follow the instructions of the first two sections:
Defining the Query
We're using a two database setup:
- rawdb stands for the original social media data that is publicly available
- hlldb stands for the privacy-aware data collection that is used in the visualization environment
To query original data from rawdb, and convert to hlldb, we're using a Materialized View.
Materialized views are static subsets extracted from larger PostgreSQL tables. MViews provide a performant filter for large queries. In a real-world example, data would typically be directly streamed and filtered, without the need for a Materialized View.
Create Query Schema
Create a new schema called
mviews and update Postgres
search_path, to include new schema:
CREATE SCHEMA mviews; ALTER DATABASE rawdb SET search_path = "$user", social, spatial, temporal, topical, interlinkage, extensions, mviews;
Prepare query and cryptographic hashing
Create the pgcrypto extension:
CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA extensions;
Prepare cryptographic hash function. The following function will take an
id and a
key (the seed value) to produce a new, unique hash that is returned in
/* Produce pseudonymized hash of input id with skey * - using skey as seed value * - sha256 cryptographic hash function * - encode in base64 to reduce length of hash * - remove trailing '=' from base64 string * - return as text */ CREATE OR REPLACE FUNCTION extensions.crypt_hash (id text, skey text) RETURNS text AS $$ SELECT RTRIM( ENCODE( HMAC( id::bytea, skey::bytea, 'sha256'), 'base64'), '=') $$ LANGUAGE SQL STRICT;
Cryptographic hashing alone will only produce pseudonymized data, since any id still relates to a single user (or post, etc.). Therefore, pseudonymization is considered a weak measure, which can be easily reversed, e.g. through rainbow tables or context lookup.
It is used here as an additional means to protect HLL sets from intersection attacks, as illustrated by Desfontaines et al. (2018).
What is a seed value?
The seed value (
skey) is a secret that is used, together with the encryption function (e.g. sha256) to produce a unique, collision-free output value (the hashed id). The same ID will be converted to a different hash if the seed value is changed. Therefore, in our case, the seed value must remain the same during the entire processing of data. In this case, the seed is called a key. This key can be destroyed afterwards, if no subsequent updates are necessary.
Also, create the citex extension, which will be used later to apply a thematic filter to the query.
CREATE EXTENSION IF NOT EXISTS citext schema extensions;
Apply topic query
In this step, several filters are applied to significantly reduce data, as a provisional measure to reduce privacy risks.
For a list of all attributes available, see the LBSN Structure definition for Post.
The topic selection query below the following data reduction steps:
- filter only 'nature based reactions', based on a set of keywords
- filter only photos with geoaccuracy 'place', 'latlng' or 'city', which broadly equals Flickr geoaccuracy levels 8-16
- reduce spatial granularity to about 5km accuracy, by using PostGis GeoHash function, ST_GeoHash
- apply cryptographic hashing to
user_guid, using the
crypt_hashfunction defined earlier
- reduce temporal granularity of
samplekey with secure password.
CREATE MATERIALIZED VIEW mviews.spatiallatlng_raw_nature AS SELECT extensions.crypt_hash(t1.post_guid, 'samplekey') as "post_guid", ST_Y(ST_PointFromGeoHash( ST_GeoHash(t1.post_latlng, 5), 5)) As "latitude", ST_X(ST_PointFromGeoHash( ST_GeoHash(t1.post_latlng, 5), 5)) As "longitude", extensions.crypt_hash(t1.user_guid, 'samplekey') as "user_guid", to_char(t1.post_create_date, 'yyyy-MM-dd') as "post_create_date", t1.post_geoaccuracy FROM topical.post t1 WHERE -- example for a specific query to -- select and monitor "reactions to nature" -- geoaccuracy minimum, equals Flickr geoaccuracy levels 8-16 t1.post_geoaccuracy IN ('place', 'latlng', 'city') AND ( -- tags concat: check if any is contained by -- citext format: case-insensitive text conversion ARRAY['nature', 'natur', 'la_nature',' natuur', 'natura', 'naturo','naturae']::citext && t1.hashtags::citext OR -- tags separated: check if is both contained by ARRAY['green','outside']::citext <@ t1.hashtags::citext OR -- words in title: check if any exists as a part of the text -- note that underscores ( _ ) are replaced by space -- character here lower(t1.post_title) ilike any ( array['% nature %','% natur %','% la_nature %', '% natuur % natura %','% naturo %','% naturae %']) OR -- words in post body (photo description): -- check if any exists as a part of the text -- note that underscores ( _ ) are replaced by space -- character here lower(t1.post_body) ilike any ( array['% nature %','% natur %','% la_nature %', '% natuur % natura %','% naturo %','% naturae %']) )
A GeoHash of
5 means, coordinates are reduced to 5 decimal digits maximum length of lat/lng. Compare the following table:
|Precision (number of digits)||Distance of Adjacent Cell in Meters|
Table: GeoHash length and corresponding geoaccuracy in meters (Source: Wikipedia).
Convert data from rawdb to hlldb
First, create a connection from hlldb to rawdb.
On rawdb, create an
lbsn_reader with read-only privileges for schema
samplekey with secure password.
DROP USER IF EXISTS lbsn_reader; CREATE USER lbsn_reader WITH LOGIN INHERIT PASSWORD 'samplekey'; GRANT CONNECT ON DATABASE rawdb TO lbsn_reader; GRANT USAGE ON SCHEMA mviews TO lbsn_reader; ALTER DEFAULT PRIVILEGES IN SCHEMA mviews GRANT SELECT ON TABLES TO lbsn_reader;
Connect hlldb to rawdb
by using Foreign Table, this step will establish the connection between hlldb to rawdb.
On hlldb, install postgres_fdw extension:
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
Create Foreign Server on hlldb:
-- DROP SERVER lbsnraw CASCADE; CREATE SERVER lbsnraw FOREIGN DATA WRAPPER postgres_fdw OPTIONS ( host 'rawdb', dbname 'rawdb', port '5432', keepalives '1', keepalives_idle '30', keepalives_interval '10', keepalives_count '5', fetch_size '500000'); CREATE USER MAPPING for postgres SERVER lbsnraw OPTIONS (user 'lbsn_reader', password 'samplekey');
Import foreign table definition on the hlldb.
IMPORT FOREIGN SCHEMA mviews LIMIT TO (spatiallatlng_raw_nature) FROM SERVER lbsnraw INTO extensions;
To test the connection, get a sample result.
SELECT * FROM extensions.spatiallatlng_raw_nature LIMIT 1;
Table 2: Sample output from
In case of a "permission denied" error, execute again on raw:
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;
Prepare conversion of raw data to hll
We're going to use
spatial.latlng from the HLL Structure definition. The structure for this table is already available, by default, in hlldb.
Create named references
To create a named table, specifically referencing thematic query context, replace
spatial.latlng with any name (e.g.
latlng_nature), in the query below:
CREATE TABLE spatial.latlng_nature ( latitude float, longitude float, PRIMARY KEY (latitude, longitude), latlng_geom geometry(Point, 4326) NOT NULL) INHERITS ( social.user_hll, -- e.g. number of users/latlng (=upl) topical.post_hll, -- e.g. number of posts/latlng temporal.date_hll, -- e.g. number of dates/latlng topical.utl_hll -- e.g. number of terms/latlng );
Check for previous data
Make sure that the target table
spatial.latlng is empty. The following query should return no results:
SELECT * FROM spatial.latlng LIMIT 1;
Remove previous data
Optionally clean up previous HLL data:
TRUNCATE TABLE spatial.latlng_nature;
Optionally: optimize chunk size
Depending on your hardware, optimizing Postgres
fetch_size may increase processing speed:
ALTER SERVER lbsnraw OPTIONS (SET fetch_size '50000');
The HyperLogLog extension for Postgres from Citus that we're using here, contains several tweaks, to optimize performance, that can affect sensitivity of data.
From a privacy perspective, for example, it is recommended to disable explicit mode.
When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by few users (outliers) would store full user and post IDs.
To disable explicit mode:
SELECT hll_set_defaults(11, 5, 0, 1);
From now on, HLL sets will directly be promoted to sparse.
In Sparse Hll Mode, more data is stored than in Full Hll Mode, as a means to improve accuracy for small sets. As pointed out by Desfontaines et al. 2018, this may make re-identification easier. Optionally disable sparse mode:
SELECT hll_set_defaults(11,5, 0, 0);
Aggregation step: Convert data to Hll
This is the actual data collection and aggregation step. In the query below, different metrics are collected that are typical for LBSM visual analytics (postcount, usercount, userdays).
To test, uncomment
--LIMIT 100 in the query below.
INSERT INTO spatial.latlng_nature( latitude, longitude, user_hll, post_hll, date_hll, latlng_geom) SELECT latitude, longitude, hll_add_agg(hll_hash_text(user_guid)) as user_hll, hll_add_agg(hll_hash_text(post_guid)) as post_hll, hll_add_agg( hll_hash_text(user_guid || post_create_date) ) as date_hll, ST_SetSRID( ST_MakePoint(longitude, latitude), 4326) as latlng_geom FROM extensions.spatiallatlng_raw_nature --LIMIT 100 GROUP BY latitude, longitude;
INSERT 0 62879
Immediately, it is possible to quantify spatial data reduction. The originally 715,692 coordinates (raw), were reduced to 62,879 coordinates (about 9% of original data), by using a GeoHash accuracy of 5 km. Smaller GeoHashes will further improve spatial privacy.
Per coordinate, the following metrics can be estimated with a 3 to 5% accuracy using the hll data:
For example, to select the 100 most frequented locations for the topic "nature" and the YFCC100M dataset:
SELECT latitude, longitude, hll_cardinality(post_hll)::int as postcount, hll_cardinality(date_hll)::int as userdays, hll_cardinality(user_hll)::int as usercount FROM spatial.latlng_nature ORDER BY usercount DESC LIMIT 10;
Interactive visualization in Python
Using the JupyterLab service, we can visualize this data interactively in Python.
Once jupyter lab is running, connect to http://localhost:8888/ and create a new notebook.
In the first cell, load the following visualization dependencies.
import psycopg2 # Postgres API import geoviews as gv import holoviews as hv import pandas as pd from geoviews import opts from holoviews import dim from cartopy import crs as ccrs from IPython.display import display hv.notebook_extension('bokeh')
If you're using the JupyterLab service, all of these dependencies are pre-installed.
In the following Jupyter cell, enter the same SQL query, just without limiting results, to retrieve estimated data from hll db in Python.
db_connection = psycopg2.connect( host=db_host, port=db_port, dbname=db_name, user=db_user, password=db_pass ) db_connection.set_session(readonly=True) db_query = f""" SELECT latitude, longitude, hll_cardinality(post_hll)::int as postcount, hll_cardinality(date_hll)::int as userdays, hll_cardinality(user_hll)::int as usercount FROM spatial.latlng_nature ORDER BY usercount DESC; """ df = pd.read_sql_query(db_query, db_connection) display(df.head())
Create a Geoviews layer:
points_lonlat = gv.Points( df, kdims=['longitude', 'latitude'], vdims=['postcount', 'userdays', 'usercount'])
And output interactive map, with circle size based on the estimation of distinct userdays (e.g).
def set_active_tool(plot, element): """Enable wheel_zoom in bokeh plot by default""" plot.state.toolbar.active_scroll = plot.state.tools hv.Overlay( gv.tile_sources.EsriImagery * \ points_lonlat.opts( tools=['hover'], size=1+dim('userdays')*0.1, line_color='black', line_width=0.1, fill_alpha=0.8, fill_color='yellow') ).opts( width=800, height=480, hooks=[set_active_tool], title='Estimate User Count for distinct yfcc coordinates', projection=ccrs.GOOGLE_MERCATOR, )
Have a look at the full interactive output.
Continuous Social Network data streams require frequent updates to data. Hll offers lossless union operations. These can be used to update data.
In a similar situation, an analyst may realize that some "terms" were missing in the original topic filter for raw data.
Without storing raw data, it is not possible to simply re-process raw data to reflect changes to the topic filter. However, in a real-world-context, the topic filter could be adjusted from any time onwards.
Create a new topic filter for "wood", "forest" and "trees" on rawdb:
CREATE MATERIALIZED VIEW mviews.spatiallatlng_raw_wood AS SELECT extensions.crypt_hash(t1.post_guid, 'samplekey') as "post_guid", ST_Y(ST_PointFromGeoHash( ST_GeoHash(t1.post_latlng, 5), 5)) As "latitude", ST_X(ST_PointFromGeoHash( ST_GeoHash(t1.post_latlng, 5), 5)) As "longitude", extensions.crypt_hash(t1.user_guid, 'samplekey') as "user_guid", to_char( t1.post_create_date, 'yyyy-MM-dd') as "post_create_date", t1.post_geoaccuracy FROM topical.post t1 WHERE t1.post_geoaccuracy IN ('place', 'latlng', 'city') AND ( -- tags concat: check if any is contained by -- citext format: case-insensitive text conversion ARRAY['wood', 'forest', 'trees']::citext && t1.hashtags::citext )
Import to hlldb:
IMPORT FOREIGN SCHEMA mviews LIMIT TO (spatiallatlng_raw_wood) FROM SERVER lbsnraw INTO extensions;
The next step is to update
spatial.latlng_nature (hll) with data from
mviews.spatiallatlng_raw_wood (raw). Metrics for coordinates that already exists will be unioned (
The specific term for ON CONFLICT ... DO UPDATE SET is UPSERT
INSERT INTO spatial.latlng_nature ( latitude, longitude, user_hll, post_hll, date_hll, latlng_geom) SELECT s.latitude, s.longitude, hll_add_agg( hll_hash_text(s.user_guid)) as user_hll, hll_add_agg( hll_hash_text(s.post_guid)) as post_hll, hll_add_agg( hll_hash_text( s.user_guid || post_create_date)) as date_hll, ST_SetSRID(ST_MakePoint( s.longitude, s.latitude), 4326) as latlng_geom FROM extensions.spatiallatlng_raw_wood s GROUP BY s.latitude, s.longitude ON CONFLICT (latitude, longitude) DO UPDATE SET -- if lat/lng already exists, union hll sets user_hll = COALESCE(hll_union(EXCLUDED.user_hll, spatial.latlng_nature.user_hll), hll_empty()), post_hll = COALESCE(hll_union(EXCLUDED.post_hll, spatial.latlng_nature.post_hll), hll_empty()), date_hll = COALESCE(hll_union(EXCLUDED.date_hll, spatial.latlng_nature.date_hll), hll_empty());
INSERT 0 59177
Duplicate user IDs, post IDs or userdays will only be counted once, because this (cardinality estimation) is the fundamental function of HLL.
Retrieve updated metrics:
SELECT latitude, longitude, hll_cardinality(post_hll)::int as postcount, hll_cardinality(date_hll)::int as userdays, hll_cardinality(user_hll)::int as usercount, latlng_geom FROM spatial.latlng_nature ORDER BY usercount DESC LIMIT 10;
The order of coordinates remained roughly the same, but metrics now reflect the increased scope of the update thematic filter.
Have a look at the full interactive output.
Evaluation and discussion
This guide tries not to be exhaustive in what is possible, from a privacy perspective. Further, the example discussed here is very limited and highly speculative. In a real-world visual analytics workflow, a wider range of requirements would affect decisions made.
The primary motivation is to demonstrate how certain measurements can significantly reduce the risk of user re-identification in datasets collected for visual analytic. To generate the same visualization for worldwide userdays based on raw data, despite using a sparse spatial granularity, the only applicable approach would be to store each individual user ID and corresponding date.
As with all tools and approaches, there are caveats that reduce possible applications. A 3 to 5% loss of accuracy, as presented here, may be acceptable only in some contexts. Conversely, the measures taken here to protect user re-identification may be found unsuitable in other contexts. Therefore, certain risks remain.
The intersection capability of Hll sets offers certain attack-vectors, which make it possible for an attacker to 'guess', with some degree of certainty, whether a user is in a set or not Flajolet et al. 2007. Particularly small sets are vulnerable to such intersection attacks. With a spatial aggregation accuracy of 5 km, there are still many small sets with only one or few users. Increasing spatial aggregation distances, or separating small Hll sets may be used to further improve privacy of users.
Desfontaines, D., Lochbihler, A., & Basin, D. (2018). Cardinality Estimators do not Preserve Privacy. 1–21.
Flajolet, P., Fusy, E., Gandouet, O., & Meunier, F. (2007). HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Conference on Analysis of Algorithms, AofA 07. Nancy, France.
Ghermandi, A., & Sinclair, M. (2019). Passive crowdsourcing of social media in environmental research: A systematic map. Global Environmental Change, 55, 36–47.
Jokar Arsanjani, J., Zipf, A., Mooney, P., & Helbich, M. (2015). OpenStreetMap in GIScience. OpenStreetMap in GIScience: Experiences, Research, Applications, 1–20. DOI
Lane, J., Stodden, V., Bender, S., & Nissenbaum, H. (2015). Privacy, Big Data, and the Public Good: Frameworks for Engagement. Cambridge: Cambridge University Press. DOI
Metcalf, J., & Crawford, K. (2016). Where are human subjects in Big Data research? The emerging ethics divide. Big Data & Society, 3(1), 205395171665021. DOI
Reviriego, P., & Ting, D. (2020). Security of HyperLogLog (HLL) Cardinality Estimation: Vulnerabilities and Protection. IEEE Communications Letters, 1–1. DOI
See, L., Mooney, P., Foody, G., Bastin, L., Comber, A., Estima, J., … Rutzinger, M. (2016). Crowdsourcing, citizen science or volunteered geographic information? The current state of crowdsourced geographic information. ISPRS International Journal of Geo-Information, 5(5). DOI