
The YFCC100m dataset

The YFCC100m dataset is a typical example of user-generated content that has been made publicly available for anyone to use. It was published by Flickr in 2014 (Thomee et al. [1]). The core dataset is distributed as a compressed archive that contains only the metadata for about 100 million photos and videos from Flickr, all published under a Creative Commons license. About 48 million of the photos are geotagged (Fig. 1).

Geotags worldwide
Fig. 1: "A 1 million photo sample of the 48 million geotagged photos from the dataset plotted around the globe." CC BY-NC 4.0 David Shamma, Flickr.

Even if user-generated data is explicitly made public, as in this case, certain risks to privacy remain. Data may be re-purposed in contexts that the users who published it did not anticipate. IBM, for example, re-purposed the YFCC100m dataset to fuel a facial-recognition project without the consent of the people in the images.

Structure

The core dataset consists of two CSV files of about 14 GB, which are hosted in an Amazon S3 bucket. The dataset contains a list of photos and videos and their related metadata (titles, tags, timestamps, etc.).
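Because of the file size, it is usually more practical to stream the records than to load them all at once. The following is a minimal sketch of such a reader, not the method used by the guides. The function name `iter_rows`, the tab delimiter, and the handling of a `.bz2` suffix are assumptions made for illustration; adjust them to match the files you actually downloaded.

```python
import bz2
import csv
from pathlib import Path
from typing import Iterator, List


def iter_rows(path: str, delimiter: str = "\t") -> Iterator[List[str]]:
    """Yield one record at a time so the whole file never sits in memory."""
    p = Path(path)
    # Assumption: a ".bz2" suffix means the file is still bz2-compressed.
    opener = bz2.open if p.suffix == ".bz2" else open
    with opener(p, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters in titles or descriptions as plain text.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        yield from reader
```

A typical use would be a simple loop such as `for row in iter_rows("yfcc100m_dataset.csv"): ...`, which keeps memory use roughly constant regardless of file size.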

An overview of the columns available in this dataset is provided by Deng et al. [2]. The table below contains a summary of the CSV columns.

| Column | Metadata description | Example |
|---|---|---|
| 0 | Row id | 0 |
| 1 | Photo/video identifier | 6185218911 |
| 2 | User NSID | 4e2f7a26a1dfbf165a7e30bdabf7e72a |
| 3 | User ID | 39019111@N00 |
| 4 | User nickname | guckxzs |
| 5 | Date taken | 2012-02-16 09:56:37.0 |
| 6 | Date uploaded | 1331840483 |
| 7 | Capture device | Canon+PowerShot+ELPH+310+HS |
| 8 | Title | IMG_0520 |
| 9 | Description | ? My vacation |
| 10 | User tags (comma-separated) | canon,canon+powershot+hs+310 |
| 11 | Machine tags (comma-separated) | landscape, hills, water |
| 12 | Longitude | -81.804885 |
| 13 | Latitude | 24.550558 |
| 14 | Accuracy level (see Flickr API) | 12 |
| 15 | Photo/video page URL | https://www.flickr.com/photos/39089491@N00/6985418911/ |
| 16 | Photo/video download URL | https://farm8.staticflickr.com/7205/6985418911_df7747990d.jpg |
| 17 | License name | Attribution-NonCommercial-NoDerivs License |
| 18 | License URL | https://creativecommons.org/licenses/by-nc-nd/2.0/ |
| 19 | Photo/video server identifier | 7205 |
| 20 | Photo/video farm identifier | 8 |
| 21 | Photo/video secret | df7747990d |
| 22 | Photo/video secret original | 692d7e0a7f |
| 23 | Extension of the original photo | jpg |
| 24 | Marker (0 = photo, 1 = video) | 0 |

Table 1: Summary of Metadata for each CSV column available in the core dataset (yfcc100m_dataset.csv). Examples are randomly generated.
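To illustrate how the column indices in Table 1 map onto a usable record, the sketch below converts one already-split row into a dictionary. It is a hedged example only: the field names in `COLUMNS` and the function `parse_record` are hypothetical, and two interpretations are assumptions based on the example values, namely that `+` stands for a space (URL-style encoding) and that the upload date is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone
from typing import Dict, List
from urllib.parse import unquote_plus

# Column index -> short field name, following Table 1 (names are hypothetical).
COLUMNS = {
    1: "photo_id", 6: "date_uploaded", 7: "device", 8: "title",
    10: "user_tags", 12: "longitude", 13: "latitude", 24: "marker",
}


def parse_record(row: List[str]) -> Dict[str, object]:
    """Turn one split row of the core dataset into a dictionary."""
    rec: Dict[str, object] = {name: row[idx] for idx, name in COLUMNS.items()}
    # Assumption: text fields are URL-encoded, so "+" decodes to a space.
    rec["device"] = unquote_plus(row[7])
    rec["title"] = unquote_plus(row[8])
    rec["user_tags"] = [unquote_plus(t) for t in row[10].split(",") if t]
    # Assumption: the upload date is a Unix timestamp in seconds (UTC).
    rec["date_uploaded"] = datetime.fromtimestamp(int(row[6]), tz=timezone.utc)
    rec["is_video"] = rec.pop("marker") == "1"
    # Records without a geotag are assumed to have empty coordinate fields.
    lon, lat = row[12], row[13]
    rec["coordinates"] = (float(lat), float(lon)) if lon and lat else None
    del rec["longitude"], rec["latitude"]
    return rec
```

Keeping the index-to-name mapping in a single dictionary makes it easy to add further columns from Table 1 without touching the parsing logic.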

Next to this core dataset, several expansion packs have been released that provide additional data:

  • Autotags: Tags added automatically by deep learning (e.g. people, animals, objects, food, events, architecture, and scenery).
  • Places: User-provided geotags and automatically associated places.
  • Exif: Additional Exif data for each photo.

To follow the guides herein, only the places expansion set is currently required.

| Column | Metadata description | Example |
|---|---|---|
| 0 | Photo/video identifier | 6985418911 |
| 1 | Place reference (null to multiple) | 24703176:Admiralty:Suburb,24703128:Central+and+Western:Territory |

Table 2: Summary of Metadata for each CSV column available in the places expansion dataset (yfcc100m_places.csv).
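The place reference in Table 2 packs zero or more places into one field. Judging from the example value, each entry appears to follow an `id:name:type` pattern, with entries separated by commas and `+` used in place of spaces. The sketch below parses the field under that assumption; the names `Place` and `parse_places` are hypothetical and not part of the dataset or its tooling.

```python
from typing import List, NamedTuple
from urllib.parse import unquote_plus


class Place(NamedTuple):
    place_id: str
    name: str
    place_type: str


def parse_places(field: str) -> List[Place]:
    """Split a place reference field into its individual places."""
    places: List[Place] = []
    for entry in field.split(","):
        if not entry:
            continue  # an empty field means no associated place
        # Split from both ends so a ":" inside the name would survive.
        place_id, rest = entry.split(":", 1)
        name, place_type = rest.rsplit(":", 1)
        places.append(Place(place_id, unquote_plus(name), place_type))
    return places
```

An entry that does not contain two colons raises a `ValueError` in this sketch, which makes malformed rows visible instead of silently skipping them.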

References


  1. Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L.-J. (2016). YFCC100M: The new data in multimedia research. Commun. ACM, 59(2), 64–73. https://doi.org/10.1145/2812802

  2. Deng, N., & Li, X. R. (2018). Feeling a destination through the "right" photos: A machine learning model for DMOs' photo selection. Tourism Management, 65, 267–278.


Last update: April 14, 2021