
Wikis as Social Networks:

Evolution and Dynamics

Ralf Klamma and Christian Haasler

RWTH Aachen University


Information Systems
Ahornstr. 55, 52056 Aachen, Germany
(klamma|haasler)@dbis.rwth-aachen.de

Abstract. Despite the enormous success of wikis in public and corporate knowledge sharing projects, we do not know much about the evolution and dynamics of wikis. Our approach is to analyze wikis as social networks and to apply dynamic network analysis to them. Our prototypical environment handles the complex data management problems that arise when dealing with different wiki engines and different sizes of wiki dumps. The analysis and visualization of evolving wiki networks allow wiki stakeholders to study the social dynamics of their wikis.

1 Introduction

Wikis have become very successful in huge collaborative projects like Wikipedia, an encyclopedia with millions of entries in hundreds of languages, edited by a countless crowd of editors. But wikis have also been introduced in organizational settings as Web 2.0 tools for knowledge sharing and project management [1–3]. Therefore, questions like what makes a wiki project particularly successful have become very interesting for research and practice. A lot of added-value services and businesses have been built around the wiki concept, which has led to a great variety of wiki software and wiki hosting services. Even if one can start a wiki with a simple ‘wiki-on-a-stick’ solution like TiddlyWiki, the maintenance of a large collaborative wiki demands more elaborate platforms. Two well-known providers of wiki platforms based on the MediaWiki engine are the Wikimedia Foundation and the hosting service Wikia. A variety of wiki projects are hosted on the MediaWiki engine, e.g. the Wikipedia. We concentrate on this engine here. Due to the chosen architecture, other engines can be supported as well. We demonstrate this with the TikiWiki engine.
The enormous number of public and organizational wikis has created a long tail [4] of wikis. Besides the very successful and highly visible wiki-based knowledge creation and sharing projects, there are many others with far fewer editors and edits.
If we want to give wiki stakeholders tools for analyzing the dynamics and the evolution of their wikis, we have to deal with different wiki hosting software, with wikis ranging from a few hundred nodes to millions of nodes, with wiki dumps from unreliable public and corporate sources, with data management problems, and with complex algorithmic problems. In particular, hosting the analysis and visualization of different wiki engines for the whole long tail of wikis is still a true challenge.

Social networks in computer mediated communication have also drawn a lot of scientific attention, e.g. [5–7, 1]. In this paper we concentrate on the dynamic analysis of wikis, especially dynamic network analysis (DNA). DNA [8] is an emerging area of science advancing traditional social network analysis (SNA) by the idea that networks evolve over time in terms of changes of nodes in the networks and changes of links between nodes. We argue in this paper that DNA is applicable to wikis. For wiki users, wiki managers, and wiki hosting services it is extremely important to know whether a wiki is still going to grow in the number of authors, edits, and wiki articles, or whether the wiki is entering a phase of stagnation. It is important to know if and when non-existing articles will be created and edited by users. When a node (a wiki page, an editor, a URL) is ‘important’ at the moment, will it stay important over the lifetime of the wiki or will its importance change over time? If a network is heterogeneous, will it become homogeneous after a while, or will it stay that way forever?

In Web 2.0, not only wikis but also other new media have become tremendously successful [9–11]. By developing standard operations for handling Web 2.0 data analysis and visualization, we hope to encourage communities to apply dynamic network analysis, thus increasing their agency in a world where we leave billions of virtual footprints day by day. To serve the needs of different stakeholders and communities in DNA, we have developed a framework called the MediaBase [12]. A MediaBase consists of three elements: (a) a collection of crawlers specialized for distinct Web 2.0 media like blogs, wikis, pods, feeds, and so on; (b) multimedia databases, fed by the crawlers, with a common metamodel for all the different media, artifacts, actors, and communities, leading to a community-oriented cross-media repository; and (c) a collection of web-based analysis and visualization tools for DNA. Examples of MediaBases are available for technology enhanced learning communities (www.prolearn-academy.org), for German cultural science communities (www.graeclus.de), and for the cultural heritage management of the UNESCO world heritage Bamyian Valley in Afghanistan (www.bamyian-development.org). The WikiWatcher introduced in this paper is part of the MediaBase.

The rest of the paper is organized as follows. In Section 2 we analyze prior approaches and open issues. In Section 3 we characterize wikis as social networks to which DNA is applicable. Section 4 describes the design and architecture of our software prototype WikiWatcher. In Section 5 we present the main results of our analysis of different wikis. We conclude the paper with a discussion and an outlook on further research.
2 Static Analysis of Wikis
There is already an extensive literature on the analysis of wikis. Most studies concentrate on static aspects of wikis. In general, these studies can be classified into studies which make use of the publicly available wiki data (dumps) themselves and studies making use of additional data like access log files [13]. In this paper, we concentrate on the analysis of publicly available wiki dumps. In this regard, we can further distinguish studies concentrating on the static analysis of wiki dumps [14, 15] from those concentrating on dynamic aspects. But first, let us start with a well-known example: the Wikipedia.
The Wikipedia is the most researched wiki. We mention only a few studies to characterize the scientific progress in the DNA of wikis. Among the first comprehensive studies of Wikipedia was the 2005 study by Voß [14], which measured Wikipedia according to its network characteristics. In particular, the article referred to the changes in the size of the wiki database and the number of articles, words, users, and links. Among other things, Voß found that the distribution of links is scale-free with respect to growth and preferential attachment [16]. Wilkinson and Huberman evaluated the collaboration of the Wikipedia community. They showed that the accretion of edits to an article is described by a simple stochastic mechanism, resulting in a heavy tail of highly visible articles with a large number of edits [17]. They found that the quality of an article depends on the number of its modifications. Kittur et al. [7] examined the success of Wikipedia. In particular, they analyzed whether it stems from a great number of contributors, each dealing with only a few articles (‘wisdom of the crowd’), or from a small elite group of contributors that has the lion’s share (‘power of the few’). The latter is true. The work of Priedhorsky et al. [13] builds on this qualitative view of Wikipedia. They dealt with vandalism in Wikipedia articles. For this purpose, two types of information were used: the Wikipedia articles themselves and their log files. By this means it could be measured which article revisions visitors had viewed and whether it was an intact or a damaged version. The researchers aimed to quantify the influence of article edits and revisions on the visitors. The number of vandalized pages viewed by real readers turned out to be extremely low. Further research classified users with respect to their position in online communities like wikis. Despite Wikipedia’s equal treatment of editors, some members seem to acquire a leading role [18].
Most research in this area targets Wikipedia and does not carry over to arbitrary wikis. A general concept for handling and analyzing any wiki is missing. Wikis are applied in a variety of social and organizational environments. It would be useful to obtain methods and tools for interpreting these incidental social structures. Hence, the motivations of this work are to offer a view on wikis as social networks, to build up formal network models, to apply measurements of SNA, to visualize wiki networks, and to consider the dynamic aspect of wiki networks. The data and information basis of most projects and studies is built on wiki log files or direct database access. In keeping with the ‘open’ wiki concept, this work uses only public wiki data. Most wikis offer automatically generated dumps, which can also be requested, e.g. via the MediaWiki page Special:Export. Articles, links, and references as well as authors are treated as network components (actors), not only as a growing number. We apply Actor-Network Theory [19] for the data management, i.e. we do not differentiate between human and non-human actors, and we can aggregate groups of actors into a new actor. Actor relationships and dependencies evolve during any given period. The static network structure of classic SNA is extended accordingly. Qualitative characteristics such as a leading role in the community become measurable by network analysis methods. Characteristics like scale-free networks, hierarchical structures, shortest paths, and the centrality (‘importance’) of network components can be measured and analyzed in a time series context. Thus, it is possible to illustrate social change and evolution in wiki networks.
Collaborative work, which plays a fundamental role in wikis, can now be visualized by means of dynamic network visualization. Not only ‘clinical’ numbers and measured values but also graph visualizations help to identify strategic actors and their activities. These possibilities aid in dealing with wiki actors, e.g. in a social, economic, or security-relevant way. Our concept and our implementation are introduced in the following.

3 Wikis as Dynamic Networks

Social science deals with the analysis of relations between different kinds of actors, such as single persons, interacting groups, or organizations. Social network analysis is concerned with patterns of relationships between social actors [20]. Social networks can be seen as constructs of relations and entities like actors and artifacts. Wikis conform to these qualitative aspects of social networks. The main idea of ‘writing articles in common’ by Ward Cunningham [21] can be realized only if wiki users collaborate in creating, modifying, and maintaining articles. These writing processes imply different kinds of social networks. Wiki users as well as wiki pages can be seen as objects in a social network that help to achieve the aim of establishing the wiki. Writing articles in common creates relations between the participating authors and hence edges between author nodes in the network. Wiki pages (articles) and their link structure can only be maintained by wiki users. Like most networks in social science, our different kinds of wiki networks evolve during the editing process. At this point one can see the limitation of SNA: the lack of dynamic components becomes noticeable. Social networks exhibit characteristics like growth and adjustment. To address this challenge we apply dynamic network analysis. The static view is enhanced by a dynamic one, while the evolution process considers the agency and behavior of the network actors. This is realized by adding one or more time parameters to the networks. The corresponding models are introduced in the next sections.
A well-known classification of network topology will be introduced briefly. Existing empirical and theoretical results indicate that complex networks can be divided into the two major classes of homogeneous and heterogeneous networks [22]. This classification is based on the connectivity distribution P(k), which gives the probability that an arbitrary node is connected to k other nodes [22]. Homogeneous networks are characterized by almost the same number of links at each node.
In contrast, heterogeneous networks are often characterized by the existence of clusters, i.e. aggregations of nodes. Furthermore, they have a degree distribution such that not all nodes in the network have the same number of edges [23]. The corresponding distribution function follows the power law P(k) ∼ k^{−γ} [22, 24].
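The distinction can be made concrete by estimating P(k) empirically from a graph. The following sketch is our own illustration using the Python networkx library (the prototype itself relies on yFiles, see Section 4); the two generated graphs merely stand in for a heterogeneous and a homogeneous topology.

from collections import Counter

import networkx as nx

def connectivity_distribution(G):
    """Estimate P(k): the probability that an arbitrary node has degree k."""
    degree_counts = Counter(d for _, d in G.degree())
    n = G.number_of_nodes()
    return {k: count / n for k, count in sorted(degree_counts.items())}

# A heterogeneous (scale-free) graph shows a heavy-tailed P(k);
# a homogeneous one concentrates around the mean degree <k>.
heterogeneous = nx.barabasi_albert_graph(1000, 2)   # preferential attachment
homogeneous = nx.erdos_renyi_graph(1000, 0.004)     # roughly Poisson degrees
print(connectivity_distribution(heterogeneous))
print(connectivity_distribution(homogeneous))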
Some seminal indices for determining the ‘important’ components in social networks are centrality indices [25]. They quantify central nodes, which can often be spotted intuitively by considering the visualized networks. Degree centrality, which refers to the distribution of links as described above, is one of the simplest measurements for determining the influence of a node on its neighbors. For an undirected graph, d(v) is the number of edges adjacent to node v; analogously, d^−(v) and d^+(v) denote in- and out-degree for directed graphs.
The focus of closeness centrality lies on measuring the closeness of a node to all other nodes in the network [26]. In contrast to degree centrality, there is no local restriction any more. The closeness centrality C_C(v) of a node v is defined as follows, where d(v, t) is the distance from v to a node t ∈ V:

C_C(v) = \frac{1}{\sum_{t \in V} d(v, t)}

Betweenness centrality is based on shortest-path measurements. It indicates which nodes have strong influence on the network: they control the information flow through the network, since many shortest paths go through them [26]. Betweenness centrality is defined as follows, where σ_st denotes the number of shortest paths between nodes s and t, and σ_st(v) the number of those paths passing through v:

C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
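All three indices can be computed with standard libraries. The sketch below is our own illustration using networkx (not the yFiles toolkit the prototype employs); the toy page titles are invented.

import networkx as nx

# Toy article graph: directed edges represent wiki links.
G = nx.DiGraph()
G.add_edges_from([
    ("Hauptseite", "Football"), ("Hauptseite", "Mathematics"),
    ("Football", "Category:Sport"), ("Mathematics", "Category:Science"),
    ("Category:Sport", "Hauptseite"),
])

# Degree centrality, split into in- and out-degree d-(v), d+(v).
in_deg = dict(G.in_degree())
out_deg = dict(G.out_degree())

# Closeness centrality C_C(v). Note: networkx normalizes by (n - 1) and,
# for directed graphs, uses inward distances by default, so values differ
# from the raw 1 / sum d(v, t) above by a constant factor.
closeness = nx.closeness_centrality(G)

# Betweenness centrality C_B(v) = sum over s != v != t of sigma_st(v) / sigma_st.
betweenness = nx.betweenness_centrality(G, normalized=False)

for v in G.nodes:
    print(f"{v}: d-={in_deg[v]} d+={out_deg[v]} "
          f"C_C={closeness[v]:.3f} C_B={betweenness[v]:.1f}")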

With respect to the interdependent wiki actors, two network models are established. While the term network refers to the informal concept describing an object composed of elements and interactions or connections between these elements, the natural means to model networks mathematically is provided by the notion of graphs [27].
An article graph is such a graph with directed edges. It can be constituted intuitively by considering wikis as a part of the World Wide Web (WWW). Each wiki article (‘page’) induces a node that is labeled by the page title and its namespace. As one can observe in the XML dumps as well as in the page URLs, almost every page is denoted by its namespace and title (separated by a colon). Namespaces help to group wiki pages. For example, the Wikipedia page Category:Football denotes the category page of football. Page names without a namespace prefix refer to the main namespace of the wiki (in the following denoted by ARTICLE). As in the WWW, articles are linked among each other. Links can be set arbitrarily by wiki users, either to other articles or to external resources like ‘normal’ web pages. Furthermore, it is possible to set links to wiki articles that do not exist yet. In the standard wiki theme those links are colored red. Due to the evolution of a wiki and its time-dependent graphs, four different types of nodes have to be considered:
– ‘Normal’ article nodes, type existing: They already exist in the wiki and in the XML dump respectively. They have a text body and at least one revision.
– Article nodes of type requested: They refer to requested articles to which a link is set in another article. Requested articles are created later in the time span covered by the XML dump. A usual way to create new articles is to set a link to them on special pages called Seed or Sandbox. Requested articles change their type to existing at some later point in the XML dump.
– Article nodes of type never exists: Wiki dumps correspond to a certain time period that begins at the creation of the wiki and ends at the moment of the dump creation. Never existing articles can be seen as part of the requested articles set, but in contrast to them they are not created before the end of the wiki dump.
– URL nodes: They refer to URL artifacts that are referenced in the text body.
Naturally, the last three node types only possess incoming edges. The set of all nodes is denoted by V_article, the set of edges by E_article. Since the graph depends on a certain timestamp, it is defined as G_article(t) = (V_article, E_article), where t ∈ TS and TS is a set of timestamps. The ‘oldest’ element corresponds to the wiki creation, the ‘youngest’ one to the moment of the wiki dump.
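A minimal sketch of constructing G_article(t) is given below; the tuple layouts for pages and links are our own illustration of what stage 1 might extract from a dump, and networkx is used for brevity.

import networkx as nx

def article_graph(pages, links, t):
    """Build G_article(t) from extracted dump records.

    pages: dict mapping "Namespace:Title" to the timestamp of its first
           revision, for every page that exists anywhere in the dump.
    links: list of (source_page, target, target_kind, timestamp) tuples,
           where target_kind is "page" or "url".
    t:     the timestamp of interest.
    """
    G = nx.DiGraph()
    for page, created in pages.items():
        if created <= t:
            G.add_node(page, type="existing")
    for source, target, kind, stamp in links:
        if stamp > t or source not in G:
            continue  # the link has not been set yet at time t
        if kind == "url":
            G.add_node(target, type="url")
        elif target not in G:
            created = pages.get(target)
            # requested: first revision appears later in the dump;
            # never_exists: no revision anywhere in the dump
            node_type = "requested" if created is not None else "never_exists"
            G.add_node(target, type=node_type)
        G.add_edge(source, target)
    return G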
Author graphs cannot be perceived in such an intuitive way as article graphs. They are built on the collaboration of wiki users (authors). Central to the idea of modifying wiki articles is the equality of wiki authors: every web user is allowed to participate. Of course, due to vandalism there are some exceptions and restrictions for articles with more or less sensitive content. Furthermore, a few users have special admin rights, but this does not affect the model. The way articles are edited classifies authors into the two types anonymous authors and registered authors. Anonymous authors are denoted by their IP address, registered authors by their username. Consequently, the node set V_author contains all authors involved in the wiki. A social relation (undirected edge) between two authors arises when they have worked in common on a wiki article, i.e. an intentional or unintentional collaboration by modifying the text body. E_author denotes the set of collaboration edges. Due to the high dynamics and growth of a wiki, an author graph is time-dependent, too. In contrast to article graphs, it takes two timestamps t_0, t_1 ∈ TS as input parameters. Thus, G_author(t_0, t_1) = (V_author, E_author) determines the graph in which those authors are connected who have worked on a common article during the given time period. According to the introduced wiki graphs and network models, a system database was established. It covers all entities and their dependencies that are required to generate author networks and article networks respectively.
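The collaboration model translates directly into code. The sketch below is our own illustration, assuming revision tuples as stage 1 might extract them; it connects every pair of authors who edited a common article within [t_0, t_1].

from itertools import combinations

import networkx as nx

def author_graph(revisions, t0, t1):
    """Build G_author(t0, t1): authors are connected if they modified
    a common article within the period [t0, t1].

    revisions: list of (article, author, timestamp) tuples.
    """
    G = nx.Graph()
    authors_per_article = {}
    for article, author, stamp in revisions:
        if t0 <= stamp <= t1:
            G.add_node(author)
            authors_per_article.setdefault(article, set()).add(author)
    # Intentional or unintentional collaboration: an undirected edge
    # between every pair of co-authors of the same article.
    for coauthors in authors_per_article.values():
        G.add_edges_from(combinations(sorted(coauthors), 2))
    return G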
Figure 1 shows the corresponding entity relationship diagram.

Fig. 1. Entity relationship diagram of the system database (entities: wiki author, wiki article, article revision, url; relations: isA, modified, refersTo).

The entities wiki author and wiki article take center stage. They correspond to the network nodes
described above. Articles can be identified by their labels, consisting of namespace and title. The attribute type reflects one of the three article node types. An article revision is a previous version of an article, but also the current article itself. Due to the wiki concept, every article revision is saved in the wiki database. Revisions are additionally determined by their revision timestamp. Their disk space can be an interesting attribute, too. Every article revision may contain an arbitrary number of links in its text body to other articles, but not to article revisions. This is clear from the link format, which does not contain any revision or time information, e.g. [[Mathematics]] and [[Category:Science]]. Each article text body may also contain web links of the format [http://example.com] (square brackets are optional). They correspond to the entity url, which can be part of an arbitrary number of revisions. Last but not least, every wiki author can be identified by name and type (anonymous/registered). Every author may have modified or created 0 to n article revisions, but every revision was modified by exactly one author.
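The schema can be rendered as a few SQL statements. The sketch below uses Python’s built-in sqlite3 purely for illustration (the actual system database is realized with IBM DB2, see Section 4); table and column names are our own rendering of the diagram.

import sqlite3

conn = sqlite3.connect("wikiwatcher.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,        -- username or IP address
    type      TEXT NOT NULL         -- 'registered' or 'anonymous'
);
CREATE TABLE IF NOT EXISTS article (
    article_id INTEGER PRIMARY KEY,
    namespace  TEXT NOT NULL,       -- '' denotes the main namespace ARTICLE
    title      TEXT NOT NULL,
    type       TEXT NOT NULL        -- 'existing', 'requested', 'never_exists'
);
CREATE TABLE IF NOT EXISTS revision (
    revision_id INTEGER PRIMARY KEY,
    article_id  INTEGER NOT NULL REFERENCES article,
    author_id   INTEGER NOT NULL REFERENCES author,  -- exactly one author
    timestamp   TEXT NOT NULL,
    size        INTEGER
);
CREATE TABLE IF NOT EXISTS url (
    url_id  INTEGER PRIMARY KEY,
    address TEXT NOT NULL
);
-- n:m link structure: a revision refers to articles and URLs
CREATE TABLE IF NOT EXISTS refers_to_article (
    revision_id INTEGER REFERENCES revision,
    article_id  INTEGER REFERENCES article
);
CREATE TABLE IF NOT EXISTS refers_to_url (
    revision_id INTEGER REFERENCES revision,
    url_id      INTEGER REFERENCES url
);
""")
conn.commit()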

4 Design and Architecture


For the realization of the dynamic network analysis of wikis, a two-stage system was developed. As mentioned before, it is called WikiWatcher. WikiWatcher consists of a prototype that contains parsing modules for extracting the required network data from the dump files. Furthermore, a graphical user interface was built offering functions for the generation, visualization, and measurement of wiki networks. The data used by the system derives from wikis running the MediaWiki engine and conforms to the introduced network models.
The conceptual design of the system is shown in Figure 2.

Fig. 2. Conceptual design of the system (stage 1: generating XML dumps/export files, parsing, data transfer into the relational database; stage 2: generating networks, measurement, visualization, network analysis).

Stage 1 realizes the SAX-based parser tool that takes XML dumps as input data. It extracts the data needed for generating networks later on and transfers it into a system database. The system database serves as an interface between both stages.
Stage 2 uses the network data previously stored in the system database. The prototype at this stage offers functions for generating wiki networks according to input parameters that correspond to the network models described above. It also visualizes networks and applies methods and algorithms of social network analysis to them. These methods support the verification of assumptions and hypotheses about the behavior and characteristics of wiki networks and help to accomplish the dynamic network analysis.
Perl was chosen as the programming language for stage 1 because of its convenient existing modules and its support for regular expressions. One of the most important questions was how to extract network information from a wiki. Due to the tremendous amount of information in wikis like Wikipedia, Wikiversity, or Wikia Search, the most feasible and effective way was to use the SAX standard (Simple API for XML). It allows parsing XML documents in linear time and constant memory space. These properties are essential when treating XML dumps with a disk footprint in the range of several gigabytes.
In general, the SAX parser works event-based, i.e. when a certain XML tag or attribute appears, the parser invokes user-defined methods. Entity or attribute values can be read and prepared for further processing. Memory is then cleared and can be reused. A minimal sketch of this event-based parsing follows.
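The prototype’s parser is written in Perl; purely to illustrate the event-based idea, here is a sketch in Python using the standard library’s xml.sax, assuming the element names of MediaWiki’s export schema (contributor, username, ip).

import xml.sax

class AuthorPass(xml.sax.ContentHandler):
    """First pass over a MediaWiki XML dump: collect all authors.

    Event-based parsing keeps memory constant even for multi-gigabyte
    dumps; only the element currently being read is held in memory.
    """

    def __init__(self):
        super().__init__()
        self.path = []        # stack of currently open element names
        self.buffer = []
        self.authors = set()  # (name, type) pairs

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        text = "".join(self.buffer).strip()
        if name == "username" and "contributor" in self.path:
            self.authors.add((text, "registered"))
        elif name == "ip" and "contributor" in self.path:
            self.authors.add((text, "anonymous"))
        self.path.pop()
        self.buffer = []

handler = AuthorPass()
xml.sax.parse("dump.xml", handler)  # linear time, constant space
print(len(handler.authors), "distinct authors")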
The parser tool of WikiWatcher considers in particular the dynamic aspect of wikis. This means processing only dumps that were generated with the option full. These dumps contain wiki pages with all their revisions and timestamps. The schema of the system database built for storing the wiki information follows the ER diagram. For example, one inter-relational dependency specifies that each revision is handled by exactly one author. Thus, the authors table has to be filled before the revisions table may be filled. Hence, the parser is divided into sub-modules that take care of these restrictions. After extracting some ‘heading information’ like the wiki name, wiki URL, namespaces, etc., the first pass through the XML dump extracts all participating authors. Because redundancies naturally appear in XML documents, some database constraints have to be set; in wiki dumps, for instance, author names and IP addresses are stored redundantly.
Another problem occurs when considering the different types of article nodes arising from the evolution of a wiki. This also affects the parsing sequence. Looking at the text body of a revision, there may be links to other articles (pages) that do not exist at this point in time. Because of the sequential parsing process, two cases can arise: the article occurs later on, or the article never exists in the whole dump. To obtain all articles with their types (existing, requested, never existing) and to store them in the pages table, there must be a first pass that collects the existing pages and a second pass through the text body of each revision that scans for links to requested articles (the timestamp of the current revision must be smaller than the timestamp of the first revision of the requested article) and for links to articles that will not exist at all. This classification rule is sketched below.
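A compact rendering of that rule, assuming a dictionary of first-revision timestamps built in the first pass (our own illustration):

def classify_link_target(target, link_timestamp, first_revision):
    """Classify an article node during the second pass.

    first_revision: dict mapping page title to the timestamp of its
    first revision; pages absent from the dump are not in the dict.
    """
    created = first_revision.get(target)
    if created is None:
        return "never_exists"        # no revision anywhere in the dump
    if link_timestamp < created:
        return "requested"           # linked before its first revision
    return "existing"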
While scanning the revision text, all occurring URLs are stored in the system database. The last step is to store further revision data like timestamps, size, and participating authors (their IDs), and to save the link structure of each revision to other articles and URLs (with references to previously saved data records).
Scanning links and URLs requires appropriate regular expressions. Links may have the form [[Football]], [[Football (Soccer)|Football]] or [http://some.url]. One challenge is that the MediaWiki software allows links to external web pages either with or without square brackets. For URLs without brackets, terminal symbols like question marks, exclamation marks, or commas are not considered part of the generated web page link, but ambiguities may occur. Another challenge is typing or syntax errors introduced by wiki users. Malformed ‘links’ like [[Football] or [[http://www.example.com]] result in incorrect data records. When editing Wikipedia articles, a preview before saving the modification is encouraged, but it is not possible to avoid this problem completely. A sketch of such patterns is given below.
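The following patterns are our own simplified illustration in Python (the prototype uses Perl); real MediaWiki link syntax has more corner cases than shown here.

import re

# Internal links: [[Target]] or [[Target|displayed text]].
INTERNAL_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

# External links in brackets: [http://some.url optional label].
BRACKETED_URL = re.compile(r"\[(https?://[^\s\]]+)(?:\s[^\]]*)?\]")

# Bare URLs: trailing punctuation such as ',', '!', '?' is stripped,
# which is exactly where ambiguities can arise.
BARE_URL = re.compile(r"(?<!\[)\bhttps?://[^\s<>\[\]]+")

def extract_links(wikitext):
    articles = INTERNAL_LINK.findall(wikitext)
    urls = BRACKETED_URL.findall(wikitext)
    urls += [u.rstrip(",!?.") for u in BARE_URL.findall(wikitext)]
    return articles, urls

body = "See [[Football (Soccer)|Football]] and [http://some.url] or http://example.com."
print(extract_links(body))  # (['Football (Soccer)'], ['http://some.url', 'http://example.com'])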
Since XML dumps are treated as sequential data streams, the time complexity is O(n) and the space complexity is O(c), with n the length of the XML document and c constant. For the representation of wiki networks, the Java graph toolkit yFiles has been integrated into WikiWatcher at stage 2. It provides classes and methods for the generation, visualization, and measurement of networks. A graphical user interface was built that provides elements to choose parameters like namespaces, node types (article networks), and author types (author networks), as well as timestamps and measurement methods. It further provides functions to compute mean values, standard deviations, etc., as well as functions to export visualized networks into common formats. A general problem in research is the representation and visualization of huge networks with more than 10,000 nodes and edges.
While the parser at stage 1 is able to process in linear time and constant space, some measurement methods at stage 2, like centrality indices, are costly to compute [28]. For unweighted graphs, the complexity of computing betweenness centrality using yFiles is O(|V| · |E|), and of closeness centrality O(|V|² + |V| · |E|). The system database that serves as an intermediary between both stages is realized with IBM DB2.

5 Dynamic Network Analysis of Wikis

The characteristics of wiki networks, their behavior, rates of change, and distinctive features during the evolution process are the basis of the DNA. The developed prototype WikiWatcher supports conducting the DNA by offering methods for the measurement and representation of wiki networks. It permits the verification of assumptions about network characteristics. Not only the stage 2 prototype but also the system database allows querying and analyzing wiki data. While the measurement data at stage 2 can be classified as information that refers to the network structure, the information gained at the database level refers more to the network dimension of wikis. Structural aspects correspond to characteristics like centrality, clustering, network diameter, or shortest path issues, whereas dimensional aspects cover properties like the number of articles and authors, the number of modifications, and the size of articles, as well as their rates of change. We start with a couple of well-known ideas and come to some newer hypotheses later.
The rate of new authors/articles joining a wiki network falls off after a period of time. The idea is that a wiki starts with a ‘foundation fever’. Figure 3 shows the growth rates of the number of authors and of articles for a few wikis. In general, the assumption could not be verified: a fall-off in the rate of growth could not be determined in either case. The growth characteristics may depend on semantic aspects of a wiki, e.g. current events that prompt new users to write new articles; this has to be examined individually. In the case of Wikia Search the situation seems clear: in January 2008 it went public, observable in the sharp bend in both network types. The measurements of the Wikipedia (Simple English) show a progressive growth rate; in the case of Wikiversity it fluctuates, and other wikis may exhibit leaps and bounds. Because Wikiversity’s articles are strongly categorized, further namespaces were included. There is a remarkable observation that was not intended when considering both growth rates separately: new authors joining a wiki mostly means new articles, not work on already existing articles.

Fig. 3. Rate of growth (author/article networks).
Wiki networks are heterogeneous during the whole evolution process. In homogeneous networks the number of links k per node is about the average ⟨k⟩ [22]. Such a uniform distribution could not be verified in (social) wiki networks. Applying and measuring the degree centrality showed an imbalance between the network nodes in terms of their links. As in many situations in social structures, a small portion of actors have an above-average number of links and do most of the work, i.e. editing articles and establishing new relations. This is shown in Figure 4, where two author networks are given (left, center). To make contact with other users, one needs to edit a lot of articles; but this kind of user is the minority. This distinctive heterogeneity occurs not only in author networks but also in article networks (see Figure 4, right).

Fig. 4. Heterogeneous author/article networks

For article networks this is shown in Figure 5 using the degree centrality. Incoming as well as outgoing article edges and links are observed over a certain time period. The measurements showed, in all considered wikis, a consistently high standard deviation of edges per node. Depending on semantic issues there may be a very high standard deviation of outgoing links. This is the case in the Aachen Wiki, which serves as an information wiki for the city of Aachen and as an index which naturally has many outgoing references.

Fig. 5. Article networks: degree centrality and standard deviation
Central nodes hold their important role during the evolution process. As described, the ‘importance’ of a node can be determined using betweenness centrality; most shortest paths in the network go through such nodes. This measurement was done for Wikia Search for the time period August 2004 to August 2005. The left side of Figure 6 gives, for every registered author, the betweenness centrality depending on time (unnormalized for a better view). As with degree centrality, only a small group of authors has a high betweenness centrality. In general they hold or increase their high value during the evolution process. The same observation can be made in article networks. The right side of Figure 6 shows the betweenness centrality for the Jabber Wiki, a wiki about the Jabber protocol, as the name suggests.

Fig. 6. Betweenness centrality of author and article networks


One of the most evident characteristics of wiki networks is their heterogeneity during the entire evolution process. In homogeneous networks the number of links k per node is about the average ⟨k⟩ [22]. Such homogeneous structures do not appear in wiki author or wiki article networks. In fact, there exist a few nodes with many adjacent edges and plenty of nodes with only a few edges. Figure 7 gives an impression of heterogeneous networks. The author network (circular layout) is a collaboration network of anonymous and registered users of the BerlinWiki hosted on Wikia. The article network (organic layout) gives the status of the German Wikia itself in May 2008, including all namespaces. Requested and never-existing articles as well as URLs are excluded because they have only incoming edges. A strongly unbalanced edge distribution can be observed.

Fig. 7. Heterogeneous author/article networks
The exponential distribution of wiki networks during the whole evolution is also shown by considering the betweenness and degree centrality indices. Furthermore, the standard deviation of the number of edges points out this characteristic. It can also be explained semantically by considering the ‘intention’ of certain articles: for instance, a few articles have an index character with many outgoing links and references. Considering this issue, a natural consequence is to divide (registered) authors into two classes. The distinguishing criterion is how intensively authors participate in articles across the whole wiki. The system database outputs, for every author, the number of revisions. Ordered by this number, a line could be drawn where the discrepancy between the revision numbers allocated to the authors was high. For example, in the Wikipedia (Simple English) only 377 authors did 93% of the work (revisions), while almost 5,000 authors did only a small part of it (7%). This phenomenon could be observed in all considered wikis, independent of their size. A predominant number of revisions can be allocated to a small group of authors. After a short period of time, a small group of users gathers around an article.
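As an illustration of this ‘power of the few’ measurement, the following sketch (our own, assuming revision tuples as extracted by stage 1) counts how many top authors account for a given share of all revisions.

from collections import Counter

def authors_for_share(revisions, share=0.93):
    """How many authors account for a given share of all revisions?

    revisions: list of (article, author, timestamp) tuples.
    """
    counts = Counter(author for _, author, _ in revisions)
    total = sum(counts.values())
    covered = n_authors = 0
    for _, n in counts.most_common():   # authors ordered by revision count
        covered += n
        n_authors += 1
        if covered / total >= share:
            break
    return n_authors  # e.g. 377 for 93% in the Simple English Wikipedia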
Registered authors often serve as ‘connectors’ of anonymous author network components. This is another phenomenon discovered in the connections between registered and anonymous wiki users. Working with a number of wikis has shown that it has to be checked for each wiki individually. In the given example of Wikia Search (see the left side of Figure 8), it is remarkable that anonymous authors can be identified to a certain extent. Although they can only be spotted by their IP addresses, after a period of time one can divide them into single groups or graph components that are completely separated from each other. It would be interesting for further research to decompose the IP addresses and to allocate them geographically. In this way, one could observe which addresses belong to single users. By adding registered authors to an anonymous author network, one obtains only one strongly connected network component. The Wikia Search example gives the state of the anonymous author network with t_0 = July 15, 2004 and t_1 = January 7, 2008 (shortly before the ‘official start’).

Fig. 8. ‘Connectors’ in author and article networks
Nodes with a high betweenness value are gateways to the rest of the web. As shown on the right side of Figure 8, which gives the state of the article-URL network of the AachenWiki in May 2008, there are a few article nodes that have many outgoing edges to external resources (web pages). These articles can be important as ‘connectors’ to the WWW. An interesting question is whether there is a correlation between these articles, with a high degree centrality towards external pages, and articles that have a high betweenness centrality towards other wiki articles. Articles with a high C_B control the information flow within a network. They are important when clicking through the wiki and must be protected against vandalism and other damage. Figure 9 covers ten of the most important article nodes of the Unofficial Google Wiki from January to December 2007; the wiki is hosted on Wikia. First, one can observe an almost constant betweenness centrality for every node during the whole period. At this point a nice side effect emerges: vandalism was detected by using the model. On July 31, 2007 (see month 8) the content of the main article Google Wiki was deleted completely. This of course means the vanishing of all edges to other articles. Hence no shortest path could go through the main page. This implies a betweenness centrality of 0, visualized as a ‘gap’ in the diagram.

Fig. 9. Betweenness and degree centrality of article nodes
The right side of Figure 9 shows a stable number of outgoing edges to URLs. Both measurements show the strong heterogeneity of wiki networks. In general, a correlation between the article–article C_B and the article–URL C_D cannot be assumed. But in the treated wikis, all articles with a C_B greater than 0 contain a number of URLs. Hence, these nodes are important in both ways: for the internal structure of the wiki as well as connectors to the ‘real world’.
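The ‘gap’ pattern can be turned into a simple vandalism indicator. The sketch below is our own illustration (networkx instead of yFiles; the threshold is an arbitrary assumption): it flags articles whose betweenness collapses to zero after having been consistently high, as happened when the Google Wiki main article was blanked.

import networkx as nx

def betweenness_gaps(monthly_graphs, threshold=10.0):
    """Flag articles whose betweenness centrality drops from at least
    `threshold` to zero between consecutive monthly snapshots.

    monthly_graphs: list of (month_label, nx.DiGraph) pairs.
    """
    history = {}
    for month, G in monthly_graphs:
        cb = nx.betweenness_centrality(G, normalized=False)
        for node, value in cb.items():
            history.setdefault(node, []).append((month, value))
    suspects = []
    for node, series in history.items():
        values = [v for _, v in series]
        for prev, (month, value) in zip(values, series[1:]):
            if prev >= threshold and value == 0.0:
                suspects.append((node, month))  # sudden loss of all edges
    return suspects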
Complex networks become denser during their evolution. Work by Leskovec et al. [29, 30] on the Densification Power Law showed that complex networks may become denser during their evolution and growth. In general, this could not be verified for wiki author networks. Figure 10 reflects two essential characteristics. A few wikis were examined with respect to their shortest path lengths. The measurements begin at the creation of the wikis (first month) and end at the moment of the XML dump. At each measurement point the greatest strongly connected component of an author network was considered by computing the average shortest path length from one author node to another. As a consequence of the easy way of (intended or unintended) collaboration, users are connected very quickly to other users. (Remember, one just needs to work on a common article.) On average, the shortest path length to another author is not longer than 3.

Fig. 10. Lengths of shortest paths in author networks
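The measurement just described can be sketched as follows (our own illustration with networkx; since author graphs are undirected, plain connected components coincide with the strongly connected components mentioned above).

import networkx as nx

def monthly_average_distance(monthly_author_graphs):
    """Average shortest path length inside the largest connected
    component of each monthly author network snapshot.

    monthly_author_graphs: list of (month_label, nx.Graph) pairs.
    """
    result = []
    for month, G in monthly_author_graphs:
        if G.number_of_nodes() == 0:
            continue
        largest = max(nx.connected_components(G), key=len)
        if len(largest) < 2:
            continue  # no paths to average over
        component = G.subgraph(largest)
        result.append((month, nx.average_shortest_path_length(component)))
    return result  # stagnates near 2 for the author networks we treated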
After a period of ‘self-discovery’, the average distances stagnate at nearly 2 for all treated author networks. This kind of wiki self-discovery is depicted in Figure 11, which shows the author networks (anonymous and registered authors) of Wikia (de) in July and August 2007 respectively. In the beginning, authors work in small groups on ‘their’ articles. More and more new authors join the network. After this first evolution process is accomplished, strongly connected components merge into one single component (apart from some isolated nodes). The figure shows the important author link between both components. Of course, the average distance increases when a new component is connected (see the peak in Figure 10). But due to more and more interactions between authors, the average distances level off at 2 until the measurement ends. Hence, a growing densification during the evolution process could not be determined.

Fig. 11. Author network evolution

6 Conclusions and Outlook

In this paper, our aim was to establish a dynamic network analysis view on wikis.
Wikis are continuously mutating and growing network structures. We introduced
different quantitative and qualitative characterizations of wiki networks to model
evolution and dynamics of wikis. In the formal network models we introduced
different node types in the network like authors, articles, revisions, and URLs.
Each of the nodes is annotated by a time component allowing us to track com-
plex changes in the structures over time. Due to the limited space, we cannot
present all the hypotheses we tested for the study, cf. [31, 32] for more details.
We highlighted here that a predominant number of revisions can be allocated to
a small group of authors. We described that an anonymous user can be spotted
by her editing behavior regardless of the IP address. We demonstrated that wiki
pages with a high betweenness centrality also contain a lot of external links thus
serving as a gateway to the external web. In the end, we had a closer look at the
assumed densification of wiki networks which could not be affirmed.
The applied DNA refers to structural aspects of wiki networks. Measurements of centrality indices revealed a growing heterogeneity in wiki networks. As in other social networks, we could determine a strong hierarchical structure of important and unimportant nodes. Furthermore, we have built a bridge to the small world phenomenon [33–35] that is frequently found in social science. A continuous growth in the number of authors and articles, with a remarkable correlation, was shown, but no general assertion about the kind of growth could be made; this has to be checked in each particular case. It offers interesting starting points for further research on cross-media network types like author–article networks. What effect does a weighting of edges have? What influence do minor edits have? In addition, a semantic analysis of the corresponding discussion, talk, or user pages in terms of growth and change may be interesting.
What benefits does DNA offer for wikis? Besides providing an overview of hidden interrelationships and pointing out remarkable actors, there are further applications. Vandalism is widespread on the web, and wikis are affected by it, too. So it is necessary to protect particular areas and articles. Wikipedia protects its articles based on semantic decisions, i.e. if an article has sensitive content. By means of network analysis, however, articles could be protected according to their importance, to guarantee a secure information flow in the network. There may be further advantages of considering wiki networks, e.g. economic or social aspects that are based on network measurements. They can provide recommendations to users according to the gained network data, be it commercial advertisement or social information.
In this paper, only wikis based on the MediaWiki software were considered. For generating the networks according to the models, we implemented a two-stage system. It consists of a crawler that takes care of data extraction, transfers the data into a system database, prepares it for generating and visualizing networks, and applies measurement methods. Stage 1 is able to manage XML dumps of arbitrary file size; parsing is done in linear time and constant space using SAX. Stage 2 uses the advantages of existing graph drawing libraries and their network analysis algorithms. One of the biggest problems was handling ‘big’ wikis as measured by their number of nodes. The English Wikipedia contains more than 2 million articles (nodes) and its German counterpart about 1 million articles (nodes). Until now, it has remained a research challenge how to generate, analyze, and visualize such tremendously huge networks. The main aspect of wikis (‘writing articles in common’) echoes in all wiki systems. Every wiki can be represented as a mutating and developing network. Due to the two-stage design approach, modifications for other wiki software are easily done. We already implemented a modified first stage for the content management system TikiWiki; stage 2 can remain untouched. Among other things like the export format, namespaces, and author types, the different tagging of links had to be considered (see Table 1).

MediaWiki                  TikiWiki
[[article]]                ((article))
[[article|description]]    ((article|description))
[http://example.com]       [http://example.com]
[http://example.com eg]    [http://example.com|eg]

Table 1. Tagging of links

By adjusting the parser module at stage 1 to the new requirements, it is possible to adapt most common wiki engines, including TikiWiki, to the system. In this manner it is possible to apply DNA methods to arbitrary wikis based on arbitrary wiki engines. A sketch of such engine-specific link patterns is given below.
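The engine-specific part reduces essentially to swapping patterns such as these (our own illustration in Python, not the prototype’s Perl implementation; cf. Table 1).

import re

# Engine-specific internal-link patterns; the bracketed external-URL
# pattern is shared between the two engines.
LINK_PATTERNS = {
    "mediawiki": re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]"),
    "tikiwiki":  re.compile(r"\(\(([^()|]+)(?:\|[^()]*)?\)\)"),
}

def extract_articles(wikitext, engine):
    return LINK_PATTERNS[engine].findall(wikitext)

print(extract_articles("((article|description))", "tikiwiki"))  # ['article']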
For dynamic network analysis, incremental dumps are more appropriate than the static dumps we used for extracting timed information from the evolving wiki. Future work therefore includes the design of such incremental dump options for existing wiki software.

7 Acknowledgments

This work was supported by the German National Science Foundation (DFG)
within the collaborative research center SFB/FK 427 ‘Media and Cultural Com-
munication’, within the research cluster established under the excellence ini-
tiative of the German government ‘Ultra High-Speed Mobile Information and
Communication (UMIC)’ and within the cluster project CONTICI. We thank
our colleagues for the inspiring discussions.

References
1. Aronsson, L.: Operation of a large scale, general purpose wiki website: Experience
from susning.nu’s first nine months in service. In Carvalho, J.A., Hübler, A.,
Baptista, A.A., eds.: Proceedings of the 6th International ICCC/IFIP Conference
on Electronic Publishing, Karlovy Vary, Czech Republic (November 2002) 27–37
2. Lamb, B.: Wiki open spaces: Wikis, ready or not. Educause Review 39(5) (Septem-
ber/October 2004) http://www.educause.edu/apps/er/erm04/erm045.asp, last ac-
cessed: November 2008.
3. Aguiar, A., David, G.: Wikiwiki: weaving heterogeneous software artifacts. In:
WikiSym ’05: Proceedings of the 2005 international symposium on Wikis, New
York, NY, USA, ACM (2005) 67–74
4. Anderson, C.: The Long Tail: Why the Future of Business Is Selling Less of More.
Hyperion (2006)
5. Vega-Redondo, F.: Complex Social Networks. Econometric Society Monographs.
Cambridge University Press, Cambridge (2007)
6. Adler, B.T., de Alfaro, L.: A content-driven reputation system for Wikipedia.
In: WWW. (2007) 261–270
7. Kittur, A., Chi, E.H., Pendleton, B.A., Suh, B., Mytkowicz, T.: Power of the
few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie. In: 25th
Annual ACM Conference on Human Factors in Computing Systems (CHI 2007);
2007 April 28 - May 3; San Jose, CA. (2007)
8. Carley, K.M.: Dynamic network analysis. In Breiger, R., Carley, K.M., eds.: Sum-
mary of the NRC workshop on Social Network Modeling and Analysis, National
Research Council (2003)
9. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of
blogspace. Communications of the ACM 47(12) (2004) 35–39
10. Vossen, G., Hagemann, S.: Unleashing Web 2.0. - From Concepts to Creativity.
Morgan Kaufmann, Burlington, MA (2007)
11. Aigrain, P.: The individual and the collective in open information com-
munities. In: 16th BLED Electronic Commerce Conference. (June 2003)
http://hdl.handle.net/2038/957, last accessed: November 2008.
12. Klamma, R., Spaniol, M., Jarke, M.: Pattern-based cross media social network
analysis for technology enhanced learning in europe. In Nejdl, W., Tochtermann,
K., eds.: Proceedings of the First European Conference on Technology Enhanced
Learning, Crete, Greece, October 3-5. Volume 4227 of LNCS., Berlin Heidelberg,
Springer-Verlag (2006) 242–256
13. Priedhorsky, R., Chen, J., Lam, S.T.K., Panciera, K., Terveen, L., Riedl, J.: Cre-
ating, destroying, and restoring value in Wikipedia. In: GROUP ’07: Proceedings
of the 2007 international ACM conference on Supporting group work, New York,
NY, USA, ACM (2007) 259–268
14. Voß, J.: Measuring Wikipedia. In: Proceedings of the 10th International Conference
of the International Society for Scientometrics and Informetrics. (2005)
15. Hu, M., Lim, E.P., Sun, A., Lauw, H.W., Vuong, B.Q.: Measuring article quality
in Wikipedia: models and evaluation. In: CIKM ’07: Proceedings of the sixteenth
ACM conference on Conference on information and knowledge management, New
York, NY, USA, ACM (2007) 243–252
16. Barabási, A.L., Albert, R., Jeong, H.: Mean-field theory for scale-free random
networks. Physica A Statistical Mechanics and its Applications 272 (1999) 173–
187
17. Wilkinson, D.M., Huberman, B.A.: Assessing the value of cooperation in
Wikipedia. First Monday, volume 12, number 4 (April 2007) (Feb 2007)
18. Reagle, J.M.: Do as I do: authorial leadership in Wikipedia. In: WikiSym
’07: Proceedings of the 2007 international symposium on Wikis, New York, NY,
USA, ACM (2007) 143–156
19. Latour, B.: On recalling ANT. In Law, J., Hassard, J., eds.: Actor-Network Theory
and After. Oxford (1999) 15–25
20. Breiger, R.L.: The analysis of social networks. In Hardy, M., Bryman, A., eds.:
Handbook of Data Analysis. London, SAGE Publications (2004) 505–526
21. Cunningham, W.: Invitation to the patterns list.
http://c2.com/cgi/wiki?InvitationToThePatternsList (2005)
22. Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex
networks. Nature 406 (2000) 378–382
23. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Reviews of
Modern Physics 74 (2002) 47
24. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286
(1999) 509
25. Koschützki, D., Lehmann, K.A., Peeters, L., Richter, S., Tenfelde-Podehl, D., Zlo-
towski, O.: Centrality indices. In Brandes, U., Erlebach, T., eds.: Network Analysis:
Methodological Foundations. Springer (2005)
26. Brandes, U., Kenis, P., Wagner, D.: Communicating centrality in policy network
drawings. IEEE Transactions on Visualization and Computer Graphics 9(2)
(2003) 241–253
27. Brandes, U., Erlebach, T.: Fundamentals. In Brandes, U., Erlebach, T., eds.:
Network Analysis: Methodological Foundations. Springer (2005)
28. Brandes, U.: A faster algorithm for betweenness centrality. Journal of Mathemat-
ical Sociology 25(2) (2001) 163–177
29. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: Densification and
shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1) (2007) 1–40
30. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws,
shrinking diameters and possible explanations. In: KDD ’05: Proceedings of the
eleventh ACM SIGKDD international conference on Knowledge discovery in data
mining, New York, NY, USA, ACM (2005) 177–187
31. Haasler, C.: Dynamische Netzwerkanalyse von Wikis [Dynamic network analysis of wikis]. Diplomarbeit, RWTH Aachen, Lehrstuhl für Informatik 5 (December 2007)
32. Klamma, R., Haasler, C.: Dynamic network analysis of wikis. In: Proceedings of I-Know’08 and I-Media’08, International Conferences on Knowledge Management and New Media Technology, Graz, Austria, September 3-5, 2008. Journal of Universal Computer Science (J.UCS) (2008) 161–168
33. Milgram, S.: The small-world problem. Psychology Today 1(1) (1967) 60–67
34. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature
393 (1998) 440–442
35. Adamic, L.A.: The small world web. In: ECDL ’99: Proceedings of the Third
European Conference on Research and Advanced Technology for Digital Libraries,
London, UK, Springer-Verlag (1999) 443–452
