KTS (Korean Tourist Spot) Dataset

Introduction

The KTS dataset was constructed by collecting images, text, hashtags, and other information from Instagram for heterogeneous data analysis and mining.
The KTS dataset contains 10,000 images, each paired with text (a sentence), multiple hashtags, and the number of likes. During pre-processing, we removed sensitive user information (identifiable faces, personal information, advertising posts, etc.). We also removed all hashtags in languages other than Korean and English.

Class Structure

Coarse Label	Fine Label
person-made	amusement park
	palace
	park
	tower
	restaurant
nature-scene	beach
	cave
	island
	lake
	mountain

Table 1. Hierarchical Class Structure of KTS Heterogeneous Dataset

Table 1 shows the class structure of the KTS dataset, which is organized as a two-level hierarchy. At the upper level, the data are divided into two coarse labels: person-made tourist spots and nature-scene tourist spots. Each coarse label has 5 fine labels.
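The hierarchy in Table 1 can be written down as a small lookup structure. The following is a minimal sketch only: the names KTS_HIERARCHY, FINE_TO_COARSE, and coarse_label are hypothetical and not part of the dataset's code, and the exact label spellings used in the dataset folders may differ from the strings shown here.

```python
# Two-level class hierarchy copied from Table 1 (coarse label -> fine labels).
# Label spellings are an assumption; check them against the dataset folders.
KTS_HIERARCHY = {
    "person-made": ["amusement park", "palace", "park", "tower", "restaurant"],
    "nature-scene": ["beach", "cave", "island", "lake", "mountain"],
}

# Reverse lookup: fine label -> coarse label.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in KTS_HIERARCHY.items()
    for fine in fines
}


def coarse_label(fine_label: str) -> str:
    """Return the coarse label of a fine label (case-insensitive)."""
    return FINE_TO_COARSE[fine_label.lower()]
```

With this mapping, coarse_label("island") gives "nature-scene", which is useful when a model is trained on fine labels but evaluated at the coarse level.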

Figure 1. Description of KTS Dataset: person-made

Figure 2. Description of KTS Dataset: nature-scene

Figure 1 illustrates the person-made coarse label. The figure shows that the dataset contains heterogeneous data (image, text, hashtags, likes). For example, in the amusement park class, the image of a Ferris wheel is paired with its corresponding text, hashtags, and like count (45).

Likewise, Figure 2 illustrates the nature-scene coarse label. For example, in the beach class, the image of a beach landscape is paired with its corresponding text, hashtags, and like count (38).

Data Structure

We provide the dataset in two versions: a total version and a split version. The total version contains all the data; the split version divides it into train, valid, and test sets in a 7:1:2 ratio.
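As a rough check of what the 7:1:2 ratio implies, the sketch below computes the split sizes for 10,000 samples. It assumes the ratio is applied to the whole dataset at once; the actual counts in the split version may differ slightly, for example if the split was made per class. The function split_sizes is a hypothetical helper, not part of the dataset's code.

```python
def split_sizes(total, ratios=(7, 1, 2)):
    """Split `total` samples by integer ratios; any remainder goes to the first split."""
    parts = sum(ratios)
    sizes = [total * r // parts for r in ratios]
    # Integer division can drop a few samples; give them to the first split (train).
    sizes[0] += total - sum(sizes)
    return sizes


train, valid, test = split_sizes(10_000)
print(train, valid, test)  # 7000 1000 2000
```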

Figure 3 shows an example of the heterogeneous data. The first picture shows the 64th sample of the island class in the train folder. The class (label) of the image is island, and the “img_name” field in the JSON file gives the image file name. The JSON file also contains the data paired with this image, such as text and likes. This structure lets you load a JSON file and its image file together.

Figure 3. The Example of KTS Heterogeneous Data
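The pairing described above can be sketched in a few lines of Python. This is not the repository's load_data.py. It is an illustration under stated assumptions: each JSON file holds one sample as a single object, the image sits in the same folder as its JSON file, and load_record is a hypothetical helper name. Only the “img_name” key comes from the description above.

```python
import json
from pathlib import Path


def load_record(json_path):
    """Load one JSON record and resolve the image file it refers to."""
    json_path = Path(json_path)
    with json_path.open(encoding="utf-8") as f:
        record = json.load(f)
    # "img_name" is the key named in the text. Resolving it relative to the
    # JSON file's folder is an assumption about the directory layout.
    image_path = json_path.parent / record["img_name"]
    return record, image_path
```

The returned record is a plain dict, so the remaining fields (text, likes, and so on) can be read by key once their names are confirmed from an actual JSON file. The image itself can then be opened from image_path with any image library.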

Usage

The data can be downloaded from our GitHub repository. After unzipping the downloaded file, you can load the dataset with the Python 3 script load_data.py (or the Jupyter notebook load_data.ipynb).

We hope this dataset will be used in various fields, such as machine learning on Korean text, tourist spot recommendation systems, and heterogeneous data analysis.

Copyrights

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.

IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.