Importing YFCC100m

To reproduce the examples used in this guide, complete the steps below.

Download of the YFCC100m data

Please follow the instructions provided on the official site.

Quote

Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already, and once logged in you will find it straightforward to submit the request for the YFCC100M. Webscope will ask you to tell them what your plans are with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty at an accredited university, so you will be automatically approved.

Conversion of YFCC100m to LBSN Structure

Any conversion from one structure to another requires the definition of mapping rules.

To demonstrate the mapping of arbitrary LBSN data to the common LBSN structure scheme, we have built lbsntransform, a Python package that includes several pre-defined mapping sets.

Note

Have a look at the exact mapping criteria for the Flickr YFCC100M dataset. The package also contains examples for other mappings (e.g. Twitter), which can be extended further.

You will also need a Postgres database with the SQL implementation of the LBSN Structure.

See the instructions here, in the base setup example.
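
Before starting the import, it can help to confirm that the database accepts connections. The following is a minimal sketch of such a check, assuming the PostgreSQL client tool pg_isready is installed on the host. The helper names address_host, address_port and check_rawdb are hypothetical, and the address is the local default used in this guide.

```shell
#!/usr/bin/env bash
# Sketch: check that rawdb accepts connections before importing.
# Assumption: pg_isready (PostgreSQL client tools) is on the PATH.

# Print the host part of a "host:port" address.
address_host() {
    printf '%s\n' "${1%%:*}"
}

# Print the port part of a "host:port" address.
address_port() {
    printf '%s\n' "${1##*:}"
}

# Run pg_isready against a "host:port" address.
# Exit status 0 means the server is accepting connections.
check_rawdb() {
    pg_isready -h "$(address_host "$1")" -p "$(address_port "$1")"
}

# Usage:
# check_rawdb "127.0.0.1:15432"
```

If the check fails, the container may still be starting, or the port mapping may differ from the default.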

After you have started the rawdb docker container, import Flickr YFCC CSVs to the database using lbsntransform.

Linux/macOS (bash):

lbsntransform --origin 21 \
    --file_input \
    --input_path_url "/data/flickr_yfcc100m/" \
    --dbpassword_output "sample-password" \
    --dbuser_output "postgres" \
    --dbserveraddress_output "127.0.0.1:15432" \
    --dbname_output "rawdb" \
    --csv_delimiter $'\t' \
    --file_type "csv" \
    --zip_records \
    --mappings_path ./resources/mappings/

Windows (cmd):

lbsntransform --origin 21 ^
    --file_input ^
    --input_path_url "/data/flickr_yfcc100m/" ^
    --dbpassword_output "sample-password" ^
    --dbuser_output "postgres" ^
    --dbserveraddress_output "127.0.0.1:15432" ^
    --dbname_output "rawdb" ^
    --csv_delimiter $'\t' ^
    --file_type "csv" ^
    --zip_records ^
    --mappings_path ./resources/mappings/

Quick-installation

See the lbsntransform docs for installation steps.

To make the YFCC100M mapping available, either clone the entire repository with git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git, or keep only the resources folder, which contains the mappings, with:

Linux/macOS (bash):

git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git \
    && cd lbsntransform \
    && git filter-branch --subdirectory-filter resources

Windows (cmd):

git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git ^
    && cd lbsntransform ^
    && git filter-branch --subdirectory-filter resources

The options used in the lbsntransform import command are:

  • input_path_url: The path to the folder where yfcc100m_places.csv and yfcc100m_dataset.csv are saved.
  • dbpassword_output: The password used to connect to rawdb.
  • dbserveraddress_output: The address and port of the database. 127.0.0.1:15432 is the default setup of rawdb running locally.
  • dbname_output: The name of the target database. rawdb is the default database name.
  • csv_delimiter: Flickr YFCC100M data is separated by tabs, which is specified on the command line as $'\t' (bash ANSI-C quoting).
  • file_type: The Flickr YFCC100M data format is CSV, with one record per line.
  • zip_records: yfcc100m_dataset.csv and yfcc100m_places.csv have the same number of lines. This flag tells lbsntransform to join both files record by record while they are read as a stream.
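
Because zip_records relies on both files having the same number of lines, a quick pre-flight check can catch a truncated download before a long import starts. The sketch below is one way to do this with standard tools. The function names count_lines and same_length are hypothetical, and the paths in the usage comment assume the input folder from the import command.

```shell
#!/usr/bin/env bash
# Sketch: verify that two files have the same number of lines
# before importing them with --zip_records.

# Print the number of lines in a file (digits only, no filename).
count_lines() {
    wc -l < "$1" | tr -d '[:space:]'
}

# Return 0 if both files have the same line count, non-zero otherwise.
same_length() {
    [ "$(count_lines "$1")" -eq "$(count_lines "$2")" ]
}

# Usage (paths are assumptions; adjust to your --input_path_url):
# if same_length /data/flickr_yfcc100m/yfcc100m_dataset.csv \
#                /data/flickr_yfcc100m/yfcc100m_places.csv; then
#     echo "line counts match"
# else
#     echo "line counts differ; check the downloads before using --zip_records"
# fi
```

Counting lines in files of this size takes a while, but far less time than an import that has to be repeated.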

Note

Reading the full dataset into the database requires at least 50 GB of disk space and, depending on your hardware, up to several days of processing. To read the dataset partially, add a transfer limit, for example --transferlimit 10000 to read only the first 10000 entries.
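
A trial run with a transfer limit could look like the following sketch for bash. All flags are taken from the import command in this guide. Only the wrapper function import_sample and its default of 10000 are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# Sketch: import only the first N records as a trial run.
# The flags mirror the full import command; the wrapper function
# name and the default limit of 10000 are illustrative assumptions.

import_sample() {
    local limit="${1:-10000}"
    lbsntransform --origin 21 \
        --file_input \
        --input_path_url "/data/flickr_yfcc100m/" \
        --dbpassword_output "sample-password" \
        --dbuser_output "postgres" \
        --dbserveraddress_output "127.0.0.1:15432" \
        --dbname_output "rawdb" \
        --csv_delimiter $'\t' \
        --file_type "csv" \
        --zip_records \
        --mappings_path ./resources/mappings/ \
        --transferlimit "$limit"
}

# Usage:
# import_sample        # first 10000 entries
# import_sample 500    # first 500 entries
```

Starting with a small limit makes it easy to confirm that the mapping and database connection work before committing to the full import.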


Last update: April 1, 2021