Importing YFCC100m
To reproduce the examples in this guide, complete the steps below.
Download of the YFCC100m data
Please follow the instructions provided on the official site.
Quote
Getting the YFCC100M: The dataset can be requested at Yahoo Webscope. You will need to create a Yahoo account if you do not have one already, and once logged in you will find it straightforward to submit the request for the YFCC100M. Webscope will ask you to tell them what your plans are with the dataset, which helps them justify the existence of their academic outreach program and allows them to keep offering datasets in the future. Unlike other datasets available at Webscope, the YFCC100M does not require you to be a student or faculty at an accredited university, so you will be automatically approved.
Conversion of YFCC100m to LBSN Structure
Any conversion from one structure to another requires the definition of mapping rules.
To demonstrate mapping of arbitrary LBSN data to the common LBSN structure scheme, we have built lbsntransform, a Python package that includes several pre-defined mapping sets.
Note
Have a look at the exact mapping criteria for the Flickr YFCC100M dataset. The package also contains examples for other mappings (e.g. Twitter), which can be extended further.
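As a simplified, hypothetical sketch (not the actual lbsntransform mapping code, and with illustrative column positions and field names), a mapping rule can be thought of as a function that translates one source record into fields of the common structure:

```python
# Hypothetical sketch of a mapping rule: the column positions and the
# simplified "post" dict below are illustrative only, not the actual
# lbsntransform API or the real YFCC100M column layout.
def map_yfcc_row(row):
    """Map one tab-separated YFCC100M record to a minimal post dict."""
    fields = row.rstrip("\n").split("\t")
    return {
        "origin_id": 21,         # 21 = Flickr YFCC100M origin in this guide
        "post_guid": fields[0],  # photo/video identifier (illustrative)
        "user_guid": fields[1],  # uploader identifier (illustrative)
        "post_body": fields[2],  # title (illustrative)
    }

example = "123456\tuser_42\tSunset over Dresden"
post = map_yfcc_row(example)
print(post["post_guid"])  # -> 123456
```

The real mapping sets in lbsntransform follow the same idea, but cover the full record structure and all origins.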
You will also need a Postgres database with the SQL implementation of the LBSN Structure.
See the instructions here, in the base setup example.
After you have started the rawdb docker container, import the Flickr YFCC100M CSV files to the database using lbsntransform.
On Linux/macOS (bash):

```bash
lbsntransform --origin 21 \
    --file_input \
    --input_path_url "/data/flickr_yfcc100m/" \
    --dbpassword_output "sample-password" \
    --dbuser_output "postgres" \
    --dbserveraddress_output "127.0.0.1:15432" \
    --dbname_output "rawdb" \
    --csv_delimiter $'\t' \
    --file_type "csv" \
    --zip_records \
    --mappings_path ./resources/mappings/
```

On Windows (command prompt):

```
lbsntransform --origin 21 ^
    --file_input ^
    --input_path_url "/data/flickr_yfcc100m/" ^
    --dbpassword_output "sample-password" ^
    --dbuser_output "postgres" ^
    --dbserveraddress_output "127.0.0.1:15432" ^
    --dbname_output "rawdb" ^
    --csv_delimiter $'\t' ^
    --file_type "csv" ^
    --zip_records ^
    --mappings_path ./resources/mappings/
```
Quick-installation
See the lbsntransform docs for installation steps.
To make the YFCC100M mapping available, either clone the entire repository:

```bash
git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git
```

or only the resource mappings folder. On Linux/macOS (bash):

```bash
git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git \
&& cd lbsntransform \
&& git filter-branch --subdirectory-filter resources
```

On Windows (command prompt):

```
git clone git@gitlab.vgiscience.de:lbsn/lbsntransform.git ^
&& cd lbsntransform ^
&& git filter-branch --subdirectory-filter resources
```
- `input_path_url`: The path to the folder where `yfcc100m_places.csv` and `yfcc100m_dataset.csv` are saved.
- `dbpassword_output`: The password to connect to rawdb.
- `dbserveraddress_output`: This is the default setup of rawdb running locally.
- `dbname_output`: The default database name of rawdb.
- `csv_delimiter`: Flickr YFCC100M data is tab-separated; the tab character is specified on the command line as `$'\t'`.
- `file_type`: Flickr YFCC100M data format is CSV (line separated).
- `zip_records`: The lengths of `yfcc100m_dataset.csv` and `yfcc100m_places.csv` match. This tells lbsntransform to concatenate both files on stream read.
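The `--zip_records` behavior can be illustrated with a short, hypothetical Python sketch (the sample lines below are invented): two equally long streams are read line by line, and matching lines are joined into one combined record before mapping.

```python
import io

# Hypothetical illustration of --zip_records: both files contain the
# same number of lines, so line N of one file belongs to line N of the
# other, and the two can be concatenated on stream read.
dataset = io.StringIO("photo1\ttitle1\nphoto2\ttitle2\n")  # invented sample
places = io.StringIO("photo1\tplaceA\nphoto2\tplaceB\n")   # invented sample

combined = [
    data_line.rstrip("\n") + "\t" + place_line.rstrip("\n")
    for data_line, place_line in zip(dataset, places)
]
print(combined[0])  # -> photo1	title1	photo1	placeA
```

This only works because both files are aligned line by line; if the lengths differed, records would be joined incorrectly.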
Note
Reading the full dataset into the database requires at least 50 GB of hard drive space and, depending on your hardware, up to several days of processing. To read the dataset only partially, add (for example) `--transferlimit 10000`, which processes only the first 10,000 entries.