Importing YFCC100m

To follow the examples in this guide, complete the steps below.

Download of the YFCC100m data

Please follow the instructions provided on the official site.

Quote

Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already, and once logged in you will find it straightforward to submit the request for the YFCC100M. Webscope will ask you to tell them what your plans are with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty at an accredited university, so you will be automatically approved.

Conversion of YFCC100m to LBSN Structure

Any conversion from one structure to another requires the definition of mapping rules.

To demonstrate the mapping of arbitrary LBSN data to the common LBSN structure scheme, we have built lbsntransform, a Python package that includes several pre-defined mapping sets.

Note

Have a look at the exact mapping criteria for the Flickr YFCC100M dataset. The package also contains examples for other mappings (e.g. Twitter, Facebook Places), which can be extended further.

You'll also need a PostgreSQL database with the SQL implementation of the LBSN Structure.

The easiest way is to use full-stack-lbsn, a shell script that starts the following Docker services:

  • rawdb: A ready-to-use Docker container with the SQL implementation of the LBSN Structure.
  • hlldb: A ready-to-use Docker container with a privacy-aware version of the LBSN Structure, e.g. for visual analytics.
  • pgadmin: A web-based PostgreSQL database interface.
  • jupyterlab: A modern web-based user interface for Python visual analytics.

Tip

If you're familiar with git and docker, you can also clone the above repositories separately and start individual services as needed.

Windows user?

If you're working on Windows, full-stack-lbsn will only work in Windows Subsystem for Linux (WSL). Although it is possible to run Docker containers natively on Windows, we strongly recommend using WSL or WSL2.

After you have started the rawdb Docker container, import the Flickr YFCC100M CSV files into the database using lbsntransform:

lbsntransform --origin 21 \
    --file_input \
    --input_path_url "/data/flickr_yfcc100m/" \
    --dbpassword_output "sample-password" \
    --dbuser_output "postgres" \
    --dbserveraddress_output "127.0.0.1:15432" \
    --dbname_output "rawdb" \
    --csv_delimiter $'\t' \
    --file_type "csv" \
    --zip_records

  • input_path_url: The path to the folder that contains yfcc100m_places.csv and yfcc100m_dataset.csv.
  • dbpassword_output: The password used to connect to rawdb.
  • dbserveraddress_output: The address and port of the database server. The value shown is the default setup of rawdb running locally.
  • dbname_output: The name of the target database. "rawdb" is the default database name of rawdb.
  • csv_delimiter: Flickr YFCC100M data is tab-separated. On the command line, the tab is passed to lbsntransform as $'\t'.
  • file_type: The Flickr YFCC100M data format is CSV (one record per line).
  • zip_records: yfcc100m_dataset.csv and yfcc100m_places.csv have the same number of lines. This flag tells lbsntransform to concatenate the two files record by record on stream read.
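To make the idea behind zip_records more concrete, the following minimal Python sketch pairs two tab-separated inputs line by line while streaming. It is a conceptual illustration only and not lbsntransform's actual implementation. The function name zip_tsv_records and the sample rows are hypothetical, and the sketch assumes both inputs have the same number of lines in the same order.

```python
import csv
import io
from typing import Iterable, Iterator, List


def zip_tsv_records(dataset: Iterable[str], places: Iterable[str]) -> Iterator[List[str]]:
    """Yield one combined record per line pair from two tab-separated streams.

    Hypothetical helper for illustration; assumes both inputs have the
    same number of lines, in the same order.
    """
    dataset_rows = csv.reader(dataset, delimiter="\t")
    places_rows = csv.reader(places, delimiter="\t")
    # zip() pulls one row from each reader at a time, so neither input
    # has to be loaded into memory as a whole.
    for dataset_row, places_row in zip(dataset_rows, places_rows):
        yield dataset_row + places_row


# Made-up sample rows standing in for the two YFCC100M files.
dataset_sample = io.StringIO("1\tphoto-a\n2\tphoto-b\n")
places_sample = io.StringIO("1\tplace-x\n2\tplace-y\n")
combined = list(zip_tsv_records(dataset_sample, places_sample))
```

With real data, the two StringIO objects would be replaced by open file handles for yfcc100m_dataset.csv and yfcc100m_places.csv.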

Note

Reading the full dataset into the database requires at least 50 GB of disk space and, depending on your hardware, up to several days of processing. To read only part of the dataset, add a transfer limit: for example, --transferlimit 10000 reads only the first 10000 entries.
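For scripted test runs, the import call can be assembled as an argument list with an optional transfer limit. The sketch below is one possible way to do this, not part of lbsntransform. The helper name build_import_command is hypothetical, and the connection values are the sample values from this guide. It only uses the command line options shown above.

```python
import shlex
from typing import List, Optional


def build_import_command(input_path: str, transfer_limit: Optional[int] = None) -> List[str]:
    """Assemble the lbsntransform call from this guide as an argument list.

    Hypothetical helper; the connection values are the guide's sample values.
    """
    args = [
        "lbsntransform", "--origin", "21",
        "--file_input",
        "--input_path_url", input_path,
        "--dbpassword_output", "sample-password",
        "--dbuser_output", "postgres",
        "--dbserveraddress_output", "127.0.0.1:15432",
        "--dbname_output", "rawdb",
        "--csv_delimiter", "\t",  # a literal tab, equivalent to $'\t' in the shell
        "--file_type", "csv",
        "--zip_records",
    ]
    if transfer_limit is not None:
        # Only read the first `transfer_limit` entries.
        args += ["--transferlimit", str(transfer_limit)]
    return args


test_run = build_import_command("/data/flickr_yfcc100m/", transfer_limit=10000)
# Print the command for inspection instead of running it.
print(shlex.join(test_run))
```

To run the import, the list can be passed to subprocess.run(test_run, check=True), provided lbsntransform is installed and rawdb is running.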


Last update: June 19, 2020