To reproduce the examples used in this guide, complete the steps below.
Download of the YFCC100M data
Please follow the instructions provided on the official site.
Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already, and once logged in you will find it straightforward to submit the request for the YFCC100M. Webscope will ask you to tell them what your plans are with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty at an accredited university, so you will be automatically approved.
Conversion of YFCC100M to LBSN Structure
Any conversion from one structure to another requires the definition of mapping rules.
To demonstrate the mapping of arbitrary LBSN data to the common LBSN structure scheme, we have built lbsntransform, a Python package that includes several pre-defined mapping sets.
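To make the idea of a mapping rule concrete, here is a minimal sketch in Python. It is not lbsntransform's actual mapping code: the column positions in COLUMN_MAP, the target field names, and the helper map_record are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping rule: source column index -> target field name.
# Neither the positions nor the names are taken from lbsntransform.
COLUMN_MAP = {
    0: "post_guid",
    1: "user_guid",
    2: "longitude",
    3: "latitude",
}

# Target fields that should be parsed as floating point numbers.
FLOAT_FIELDS = {"longitude", "latitude"}


def map_record(fields):
    """Apply COLUMN_MAP to one source record (a list of strings)."""
    mapped = {}
    for index, name in COLUMN_MAP.items():
        # Missing trailing columns are treated as empty strings.
        value = fields[index] if index < len(fields) else ""
        if name in FLOAT_FIELDS:
            # Empty coordinates become None instead of raising an error.
            mapped[name] = float(value) if value else None
        else:
            mapped[name] = value
    return mapped


# Made-up sample record, split on tabs like the source data.
record = "123\tuser@N00\t13.7373\t51.0504".split("\t")
print(map_record(record))
```

A real mapping set covers many more fields and target types; the point here is only that each source column needs an explicit rule that says where it goes and how it is converted.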
You'll also need a PostgreSQL database with the SQL implementation of the LBSN Structure.
The easiest way is to use full-stack-lbsn, a shell script that starts the following docker services:
- rawdb: A ready-to-use Docker container with the SQL implementation of LBSN Structure.
- hlldb: A ready-to-use Docker container with a privacy-aware version of LBSN Structure, e.g. for visual analytics.
- pgadmin: A web-based PostgreSQL database interface.
- jupyterlab: A modern web-based user interface for Python visual analytics.
If you're working on Windows, full-stack-lbsn will only work in the Windows Subsystem for Linux (WSL). Although it is possible to run Docker containers natively on Windows, we strongly recommend using WSL or WSL2.
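Once the services are running, it can help to confirm that the database port is reachable before starting an import. The following is a minimal sketch that uses only the Python standard library. It assumes rawdb listens on 127.0.0.1:15432, the local address used in this guide, and the helper name port_is_open is hypothetical. It only tests whether a TCP connection can be opened; it does not verify credentials or the database schema.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed address: the local rawdb setup used in this guide.
    print(port_is_open("127.0.0.1", 15432))
```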
    lbsntransform --origin 21 \
        --file_input \
        --input_path_url "/data/flickr_yfcc100m/" \
        --dbpassword_output "sample-password" \
        --dbuser_output "postgres" \
        --dbserveraddress_output "127.0.0.1:15432" \
        --dbname_output "rawdb" \
        --csv_delimiter $'\t' \
        --file_type "csv" \
        --zip_records
- input_path_url: The path to the folder where
- dbpassword_output: Provide the password to connect to rawdb.
- dbserveraddress_output: This is the default setup of rawdb running locally.
- dbname_output: The default database name of the rawdb service is rawdb.
- csv_delimiter: Flickr YFCC100M data is separated by tabs, which is specified in lbsntransform as $'\t' via the command line.
- file_type: Flickr YFCC100M data format is CSV (line separated).
- zip_records: The length of yfcc100m_places.csv matches that of the main dataset file. This flag tells lbsntransform to concatenate both files on stream read.
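The combined effect of csv_delimiter and zip_records can be illustrated with a minimal Python sketch. This is an assumption about the general technique (reading two tab-separated files of equal length in parallel), not lbsntransform's actual implementation. The function name read_zipped is hypothetical.

```python
import csv


def read_zipped(path_a, path_b, delimiter="\t"):
    """Yield one combined record per line from two files of equal length."""
    with open(path_a, newline="", encoding="utf-8") as file_a, \
            open(path_b, newline="", encoding="utf-8") as file_b:
        # QUOTE_NONE treats quote characters as ordinary text.
        reader_a = csv.reader(file_a, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        reader_b = csv.reader(file_b, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        # zip() stops at the shorter file, so both files must have the
        # same number of lines for every record to find its partner.
        for row_a, row_b in zip(reader_a, reader_b):
            yield row_a + row_b
```

Because the function is a generator, records are produced one at a time while both files are read, so neither file has to fit into memory.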
Reading the full dataset into the database will require at least 50 GB of hard drive space and, depending on your hardware, up to several days of processing. You can read the dataset partially by adding, for example, --transferlimit 10000 to read only the first 10000 entries.
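The two practical points above, limiting the number of records and checking disk space first, can be sketched in Python as follows. This is an illustration under stated assumptions, not how lbsntransform implements --transferlimit. The helper names first_records and has_free_space are hypothetical, and the 50 GB default simply mirrors the estimate given in this guide.

```python
import itertools
import shutil


def first_records(lines, limit=10000):
    """Return an iterator over at most `limit` items of any iterable."""
    return itertools.islice(lines, limit)


def has_free_space(path, required_gb=50):
    """Return True if the drive holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    # Check the current directory's drive before a full import.
    print(has_free_space("."))
    # Take only the first three items of a longer stream.
    print(list(first_records(range(100), limit=3)))
```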