The YFCC100m dataset

The YFCC100m dataset is a typical example of user-generated content that is made publicly available for anyone to use. It was published by Flickr in 2014 (Thomee et al. ¹). The core dataset is distributed as a compressed archive that contains only the metadata for about 100 Million photos and videos from Flickr published under a Creative Commons License. About 48 Million of the photos are geotagged (Fig. 1).

Geotagges Worldwide
Fig. 1: „A 1 million photo sample of the 48 million geotagged photos from the dataset plotted around the globe.“ CC BY-NC 4.0 David Shamma, Flickr.

Even if user-generated data is explicitly made public, like in this case, certain risks to privacy exist. Data may be re-purposed in contexts not originally anticipated by the users publishing the data. IBM, for example, re-purposed the YFCC100m dataset to fuel a facial-recognition project, without the consent of the people in the images.

Structure

The core dataset consists of two CSV files of about 14 GB which are hosted on Amazon AWS S3 bucket. This dataset contains a list of photos and videos and related meta data (titles, tags, timestamps etc.).

An overview of available columns in this dataset is provided by Deng et al. ². The table below contain a summary of the CSV columns.

Column	Metadata Description	Example
0	row id	0
1	Photo/video identifier	6185218911
2	User NSID	4e2f7a26a1dfbf165a7e30bdabf7e72a
3	User ID	39019111@N00
4	User nickname	guckxzs
5	Date taken	2012-02-16 09:56:37.0
6	Date uploaded	1331840483
7	Capture device	Canon+PowerShot+ELPH+310+HS
8	Title	IMG_0520
9	Description ?	My vacation
10	User tags (comma-separated)	canon,canon+powershot+hs+310
11	Machine tags (comma-separated)	landscape, hills, water
12	Longitude	-81.804885
13	Latitude	24.550558
14	Accuracy Level (see Flickr API)	12
15	Photo/video page URL	https://www.flickr.com/photos/39089491@N00/6985418911/
16	Photo/video download URL	https://farm8.staticflickr.com/7205/6985418911_df7747990d.jpg
17	License name	Attribution-NonCommercial-NoDerivs License
18	License URL	https://creativecommons.org/licenses/by-nc-nd/2.0/
19	Photo/video server identifier	7205
20	Photo/video farm identifier	8
21	Photo/video secret	df7747990d
22	Photo/video secret original	692d7e0a7f
23	Extension of the original photo	jpg
24	Marker (0 ¼ photo, 1 ¼ video)	0

Table 1: Summary of Metadata for each CSV column available in the core dataset (yfcc100m_dataset.csv). Examples are randomly generated.

Next to this core dataset, several expansion packs have been released that provide additional data:

Autotags: Auto tags added by deep learning (e.g. people, animals, objects, food, events, architecture, and scenery)
Places: User provided geotags and automatically associated places.
Exif: Additional Exif data for each photo

To follow the guides herein, only the places expansion set is is currently required.

Column	Metadata Description	Example
0	Photo/video identifier	6985418911
1	Place reference (null to multiple)	24703176:Admiralty:Suburb,24703128:Central+and+Western:Territory

Table 2: Summary of Metadata for each CSV column available in the places expansion dataset (yfcc100m_places.csv).

References

Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., & Li, L.-J. (2016). YFCC100M: The new data in multimedia research. Commun. ACM, 59(2), 64--73. https://doi.org/10.1145/2812802 ::: ↩
Deng, N., & Li, X. R. (2018). Feeling a destination through the "right" photos: A machine learning model for DMOs' photo selection. Tourism Management, 65, 267--278. ::: ↩

Last update: April 14, 2021