Towards Vandalism Detection in OpenStreetMap Through a Data Driven Approach

Vandalism has now reached the digital domain, in particular Volunteered Geographic Information (VGI) projects. This paper proposes a methodology to detect vandalism in the OpenStreetMap (OSM) project. First, an analysis of related work sheds light on the lack of consensus on how to define vandalism in VGI, from both conceptual and practical points of view. Second, we present experiments on the use of clustering-based outlier detection methods to identify vandalism in OSM. The outcome of this study highlights the importance of choosing the right variables for detecting vandalism in OSM.


Introduction
Skepticism toward the use of Volunteered Geographic Information (VGI) stems from the lack of quality assessment in VGI datasets, even though poor-quality contributions are likely to occur. In the case of the OpenStreetMap (OSM) project, allowing anyone to map also adds the risk of welcoming ill-intentioned contributors who degrade the quality of the data through acts of vandalism. For instance, some of the Pokemon Go players who signed up as OSM contributors wrongly mapped geographic elements in order to boost the development of Pokemon nests (footnote 1). But how can actual vandalism be distinguished from unintended mistakes? And how can real vandalism be detected automatically in OSM? This paper's contribution is twofold: first, we highlight the various definitions of vandalism that have been adopted to detect it automatically. Then we investigate the ability of an unsupervised method to detect vandalism in OSM by using clustering-based outlier detection.

Understanding vandalism: related work
Historically, the word vandalism comes from the Vandals, Germanic barbarians who were reputed for sacking artworks and monuments during their invasions of Western Europe [7]. Over time, its meaning broadened, and nowadays vandalism generally refers to material defacement caused by human beings. However, a degradation does not necessarily count as vandalism, because the same act will not bear the same label in every context. For instance, animal slaughter can be labeled as vandalism unless the killer has a license to hunt [7]. Labeling an action as vandalism requires assessing the damage caused, the author's motives and the context of the incident [2]. These notions are already difficult to evaluate legally, as each case has its own elements of context and the author's motives are often not directly accessible. The definition of vandalism is thus quite clear, but because it relies on elements that are hard for human beings to assess, detecting it automatically remains a challenge.
Automatic detection of vandalism has been widely studied in Wikipedia [1] and Wikidata [4], but as these papers dealt with vandalism detection using supervised machine learning, they did not focus on giving a clear definition of vandalism. The existence of a corpus of labeled data on Wikidata and Wikipedia enabled them to sidestep the question. [6] developed a rule-based vandalism detection system for OSM data. The rules mainly take into account user reputation and object history. Objects created by OSM newcomers are therefore at a disadvantage, as they are more likely to be flagged as vandalism. Unlike Wikidata, no corpus of OSM vandalism data is available. This is why our experiment attempts to detect OSM vandalism with an unsupervised method, considering other vandalism metrics that were not tackled in [6].
Through the analysis of intentional vandalism incidents on Wikimapia and OSM, [2] proposed a typology for carto-vandalism composed of six categories: play carto-vandalism, ideological carto-vandalism, fantasy carto-vandalism, artistic carto-vandalism, industrial carto-vandalism and spam. This typology is drawn from experimental observations, so it is quite realistic. However, in some cases it can be difficult to assign an act of vandalism directly to one of these categories. For instance, artistic carto-vandalism can be seen as a sub-category of fantasy carto-vandalism: polygon art is necessarily fictional data. In fact, the proposed typology implies knowing the contributor's intentions, which is a research problem in itself. On this intentionality issue, [6] resolves the problem by stating that "in the case of OSM, vandalism can occur intentional and unintentional, contradicting the traditional definition of the term 'vandalism'". However, the OSM Wiki page about vandalism (footnote 2) does mention the difference between vandalism and bad editing, which lies in the contributor's purpose, although both of them require data repairs. OSM vandalism and bad editing may indeed both result in the same defacement of the dataset. This is why the OSM Wiki page on vandalism does not define what vandalism is, but describes how it manifests in the OSM dataset together with bad editing. Thus another challenge for vandalism detection in OSM is to avoid mistaking bad edits for true vandalism (i.e. minimizing false positives).
As in [6], we analyze some cases of OSM user blocks in the light of the context, the user's motive and the damage caused, in order to better understand what counts as true vandalism and what does not. The case depicted in Figure 1 is true vandalism on versions 6 and 7, as the contributor (footnote 3) completely defaced the nature of a Russian island by changing its name tags and turning it into a park. These edits were obviously made on purpose. We also note that the changes made in version 8 are not vandalism, as the same user restores the values of some previously altered tags. The edit war on an area in Latvia depicted in Figure 2 shows a disagreement about the real nature of this place. The banned user (footnote 4) is the one who added the 'leisure=park' tag to the area. However, further research shows that local people do consider this place a park (footnote 5). Consequently, this case is not truly vandalism but rather highlights the ambiguity of the geographic object. Lastly, adding unconventional tags (footnote 6) can be seen as an abnormal contribution, but in this case it is not vandalism. Some of the tag values are understandable for humans and actually add valuable information to the objects (Figure 3). These examples show that vandalism, according to a data-oriented traditional definition, is less frequent than contributions that do not comply with OSM policy. Due to the scarcity of vandalism in OSM and the difficulty of enumerating all possible cases, this study tackles the vandalism issue with an outlier detection approach.

Methodology
Assuming that vandalized data form outliers in a dataset, our experiment aims at finding out whether clustering-based outlier detection makes it possible to identify vandalized data in an OSM dataset. As vandalism does not often occur in OSM and we do not know where it happened, we cannot choose a study area where we would be certain to find vandalism cases. Therefore, we need to purposely add vandalized data so that the outliers to be detected are known in advance.
Then, every OSM element should be described by variables that will be used as inputs for the clustering algorithm. As a first step, the experiment is limited to the detection of vandalism on buildings. This implies retrieving the OSM ways and OSM relations that carry the 'building' key. To best describe OSM data, several types of descriptors may be considered: geometric variables [3], topological variables [3], historical variables [6] and user variables have been used in the literature to qualify OSM and crowdsourced data in general [1,6]. In this study, we use fantasy and artistic vandalism to deface our dataset, so for now only geometric variables are input into the clustering algorithm, as artistic vandalism is characterized by oddly shaped objects (Table 1). Finally, the clustering algorithm groups similar objects according to their input attributes while setting aside buildings with unusual values.
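To make the notion of geometric descriptors concrete, the following is a minimal sketch of how a few shape variables could be computed for a building footprint. The variables chosen here (area, perimeter, a Polsby-Popper style compactness, and the fill ratio of an axis-aligned bounding box) are illustrative assumptions and are not necessarily the variables of Table 1. In particular, the axis-aligned box is only a simplification of a Minimal Bounding Rectangle, and the function name `polygon_descriptors` is hypothetical.

```python
import math

def polygon_descriptors(ring):
    """Return simple shape descriptors for a closed polygon ring.

    `ring` is a list of (x, y) vertices in a projected (metric) coordinate
    system, without repeating the first vertex at the end.
    """
    n = len(ring)
    twice_area = 0.0   # twice the signed area, accumulated with the shoelace formula
    perimeter = 0.0
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    area = abs(twice_area) / 2.0

    # Axis-aligned bounding box: an assumed stand-in for a true MBR,
    # which would in general be rotated to fit the polygon.
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    bbox_area = (max(xs) - min(xs)) * (max(ys) - min(ys))

    return {
        "area": area,
        "perimeter": perimeter,
        # 1.0 for a circle, lower for elongated or jagged shapes
        "compactness": 4 * math.pi * area / perimeter ** 2 if perimeter else 0.0,
        # share of the bounding box actually covered by the polygon
        "bbox_fill": area / bbox_area if bbox_area else 0.0,
    }

# A 10 m x 10 m square footprint used as a toy example
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
desc = polygon_descriptors(square)
```

Each building would then be represented by one such vector of values, which is what the clustering algorithm receives as input.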

Experiment and initial results
In this study, the dataset is composed of OSM buildings located in Aubervilliers, a suburban town near Paris. The vandalism injected into this dataset includes (Figure 4) 17 fictional buildings of different sizes, mapped in a blank space (the yellow polygon in Figure 4 indicates that this space is currently an area under construction), and 10 artistically shaped buildings, mapped in the middle of a river and over the town's graveyard. The outlier detection was run using the DENCLUE clustering algorithm (Java Smile library) because it is noise-invariant and remains efficient for high-dimensional datasets [5]. It takes a smoothing parameter σ that describes the influence of a data point in the data space, and a parameter m that corresponds to the noise threshold. The algorithm starts by building a clustering model based on the input variables, then predicts the class of each element according to the clustering model. At this point, buildings whose descriptors are totally inconsistent with the clustering model are classified as outliers. The others are classified into clusters. However, some clusters contain only one element, meaning that the descriptor values of these buildings fit into the clustering model but no other building was similar enough to belong to the same cluster. Thus, to some extent, these one-size clusters can be considered outliers too, but to a lesser degree. Table 2 summarizes the number of outliers and one-size clusters that were detected for each kind of data (vandalism or not).
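The role of a smoothing parameter and a noise threshold can be illustrated with a much simpler density-based sketch. The code below is not DENCLUE and does not use the Smile library: it only sums Gaussian influences around each point and flags points whose density falls below a threshold. The names `gaussian_density`, `flag_outliers` and `min_density`, as well as the toy feature values, are assumptions made for illustration.

```python
import math

def gaussian_density(points, sigma):
    """Density at each point: sum of the Gaussian influences of all points."""
    densities = []
    for p in points:
        total = 0.0
        for q in points:
            squared_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-squared_dist / (2 * sigma ** 2))
        densities.append(total)
    return densities

def flag_outliers(points, sigma, min_density):
    """Flag points whose density is below the (assumed) noise threshold."""
    return [d < min_density for d in gaussian_density(points, sigma)]

# Toy feature vectors: four similar buildings and one isolated building
features = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (8.0, 8.0)]
flags = flag_outliers(features, sigma=0.5, min_density=1.5)
```

With these toy values only the last point is flagged, because every point contributes 1.0 to its own density and the isolated point receives almost nothing from the others. In practice the descriptors would need to be scaled to comparable ranges first, otherwise a single large-valued variable such as area would dominate the distances.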
The first 'e' letter-shaped building was the only vandalism case labeled as an outlier, while the remaining artistically vandalized buildings, including the other two 'e' letter-shaped ones, were classified into one-size clusters. We note that 25 cases of vandalism out of 27 (92% of known vandalism) could be retrieved either in the outlier class or in a one-size cluster, which is a high detection rate. Nevertheless, 60% of normal buildings were also classified as outliers or one-element clusters. Looking at the variables of the vandalized buildings, we notice that the geometric descriptors do not bring out the geometric peculiarities of the artistic vandalism that was committed in our dataset. Considering a polygon density variable that accounts for a polygon's number of vertices might have brought out all of the committed artistic vandalism. French OSM buildings were mostly brought in through mass imports from the French cadaster, so many OSM building elements actually map small and oddly shaped pieces of buildings. This is why the tiniest fictional building was not seen as an outlier, given the strong presence of small elements in the dataset. Therefore we should reconsider the geometric attributes that fail to bring out the geometric specificity of geographic objects. Finally, our input variables did not take into account the buildings' spatial relations with other elements. Here, some vandalized buildings are contained inside a river or a construction area, or intersect a cemetery. Considering additional topological variables that express these peculiar situations might improve the detection of vandalized buildings in unusual locations. We did not expect to detect all our vandalism cases without any false positives by simply applying a clustering method to geometric features, so this first result is fairly encouraging.

GIScience 2018
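Two quantities from the discussion above can be sketched in a few lines: the share of injected vandalism that was retrieved (25 out of 27, a ratio of about 0.926, reported above as 92%), and one possible reading of a vertex-based density variable. The text does not define that variable precisely, so the definition below, vertices per unit of perimeter, is an assumption, and `vertex_density` is a hypothetical name.

```python
import math

def recall(detected, total):
    """Share of known vandalism cases that were retrieved."""
    return detected / total

def vertex_density(ring):
    """Number of vertices per unit of perimeter (assumed definition)."""
    n = len(ring)
    perimeter = sum(
        math.hypot(ring[(i + 1) % n][0] - ring[i][0],
                   ring[(i + 1) % n][1] - ring[i][1])
        for i in range(n)
    )
    return n / perimeter if perimeter else 0.0

# 25 of the 27 injected cases ended up as outliers or one-size clusters
detection_rate = recall(25, 27)

# A plain square versus a jagged shape of similar extent (toy coordinates)
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
jagged = [(0, 0), (5, 1), (10, 0), (9, 5), (10, 10), (5, 9), (0, 10), (1, 5)]
square_density = vertex_density(square)
jagged_density = vertex_density(jagged)
```

On these toy shapes the jagged polygon has roughly twice the vertex density of the square, which is the kind of contrast such a variable would be expected to expose for letter-shaped or drawing-like buildings.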

Conclusion and future work
Our work focused on the definition of vandalism and the aspects that challenge its automated detection, such as the contributor's purpose, the context and the harm done. Initial experimental results showed that detecting OSM vandalism with an unsupervised method requires a more careful choice of the attributes input into the clustering algorithm. These attributes cannot be simple data quality assessment features; they have to be specifically designed for vandalism detection. Future work includes exploring the influence of the σ and m parameters of the DENCLUE clustering algorithm on the outlier detection predictions. Other clustering algorithms (e.g. DBSCAN, BIRCH) should also be tested to check whether they perform better at detecting vandalism. Besides, we intend to carry out the same experiment on German OSM buildings because most of them have been mapped by hand, so unlike French OSM buildings, they should not be divided up into small pieces: in this dataset our vandalism cases might be detected. We also intend to deal with other types of vandalism, for instance vandalism through tag edits or object deletion. In this case, other relevant variables should be considered to enrich our dataset: as mentioned previously, the set of input clustering variables should be extended with topological, semantic and historical features, as well as contributor-oriented descriptors and reference data matching indicators. However, we will then have to address the curse of dimensionality. Finally, in the same way as for Wikipedia vandalism, supervised classification techniques may be considered to detect vandalism in OSM.
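As an illustration of one of the alternative algorithms mentioned above, the following is a minimal, quadratic-time sketch of DBSCAN, in which points that do not belong to any dense region are labeled as noise and could be read as outlier candidates. It is a teaching sketch rather than the implementation that would be used in the experiment; `eps` and `min_pts` are DBSCAN's usual parameters, and the toy feature values are assumptions.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, NOISE for outliers."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep growing
    return labels

# Toy feature vectors: four similar buildings and one isolated building
features = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (8.0, 8.0)]
labels = dbscan(features, eps=0.5, min_pts=3)
```

On this toy input the four neighboring points form cluster 0 and the isolated point is labeled as noise. Unlike the setup described in the experiment, this sketch has no notion of one-size clusters: with `min_pts` greater than 1, an isolated point is always noise.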

Figure 1 Tag history of an OSM fantasy vandalism case (source: OSM Deep History application). Each column gives the state of one version of an OSM object, with its metadata and its tag values. Tag keys are on the left. Changes are color-coded: green stands for tag addition, yellow for tag-value edit and red for tag deletion.

Figure 2 Tag history of an ambiguous area in Latvia (source: OSM Deep History application).

Figure 4 Table 2
Figure 4 Fantasy vandalism (left image) and artistic vandalism in Aubervilliers, France.

Table 1 Overview of the OSM building geometric variables used for the experiments. N.B.: MBR stands for Minimal Bounding Rectangle.