Network generation and analysis from text corpuses
Human language involves a nested set of networks. These networks involve words
as basic units, interacting at different levels. The dynamics of such
networks, the constraints acting on them and their evolution define what we know
as human language.
Syntactic or production webs have been studied
over the last few years. To obtain a language network automatically,
we can use the precedence relation
that naturally emerges from finding two words one after the other within (at least)
one sentence. This means that a given text T will be defined as a set of
WIENER can also calculate the distributions that characterize
the large-scale architecture of LPNs. One interesting property of these
networks is their highly heterogeneous distributions. The simplest is
the frequency P(k) of words having k connections with other words (equivalently, having
degree k). This distribution
is scale-free, namely
it falls off with the number of links as a power law, $P(k) \sim k^{-\gamma}$.
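The degree distribution P(k) described above can be estimated directly from a list of links. The following is a minimal sketch, not WIENER's own code: the function names are hypothetical, link direction is ignored when counting connections (an assumption), and the cumulative form is taken to be the fraction of words with degree at least k, which is smoother to plot for heavy-tailed data.

```python
from collections import Counter


def degree_distribution(links):
    """P(k): fraction of words with exactly k distinct neighbours.

    Assumption: direction is ignored and self-links are skipped.
    """
    neighbours = {}
    for a, b in links:
        if a != b:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
    counts = Counter(len(n) for n in neighbours.values())
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}


def cumulative_distribution(p):
    """Fraction of words with degree >= k, for every observed k."""
    cumulative, running = {}, 0.0
    for k in sorted(p, reverse=True):
        running += p[k]
        cumulative[k] = running
    return dict(sorted(cumulative.items()))


# Toy example: one hub word linked to three other words.
links = {("the", "cat"), ("the", "dog"), ("of", "the")}
p = degree_distribution(links)
print(p)
print(cumulative_distribution(p))
```

On a real corpus, plotting the cumulative values against k on log-log axes is the usual way to inspect whether the tail is compatible with a power law.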
Together with words as the basic units (nodes) of the graph, links are defined
as follows: two words are "linked" if they appear one after the other (in a given
order) within at least one sentence. These webs capture what we can call the
production capacity of the underlying syntactic rules. Our network will be
the result of the intersections of many different sentences and their potential
combinations, and thus it provides a glimpse of the architecture of
production and the relative relevance of different words in each language. By using these
Language Production Networks (LPN) we adopt the simplest approach to building language
webs. Such an approach has been shown to be able to capture a large
part of the underlying syntactic structure.
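The link rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not WIENER's implementation: the function name build_lpn is hypothetical, sentences are split on ".", and words are lowercased runs of letters and apostrophes.

```python
import re


def build_lpn(text):
    """Return the set of directed links (w1, w2) such that w2
    immediately follows w1 within at least one sentence."""
    links = set()
    # Assumption: "." is the only sentence delimiter.
    for sentence in text.split("."):
        # Assumption: a word is a lowercased run of letters/apostrophes.
        words = re.findall(r"[a-z']+", sentence.lower())
        # Pair each word with its successor; pairs never cross sentences.
        links.update(zip(words, words[1:]))
    return links


links = build_lpn("Call me Ishmael. Call me anything.")
print(sorted(links))
# [('call', 'me'), ('me', 'anything'), ('me', 'ishmael')]
```

Because the links are stored in a set, a pair that occurs in many sentences still yields a single all-or-none connection.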
Additionally, we can consider the frequency of pairs of words found in the
corpus, which naturally introduces a flow measure (or weight). Thus we expand
the previous all-or-none connections to a weighted graph, in which each link
carries a weight that measures how often the two words appear linked.
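One simple way to obtain such weights is to count how many times each ordered pair of words occurs. The sketch below assumes raw counts as the weight; the exact weight definition used by WIENER is not reproduced here, and the function name link_weights and the tokenisation rule are hypothetical.

```python
import re
from collections import Counter


def link_weights(text):
    """Count how often each ordered word pair appears consecutively
    within a sentence (assumed weight: the raw count)."""
    counts = Counter()
    for sentence in text.split("."):
        words = re.findall(r"[a-z']+", sentence.lower())
        counts.update(zip(words, words[1:]))
    return counts


weights = link_weights("The cat saw the dog. The cat ran.")
print(weights[("the", "cat")])  # 2
print(weights[("the", "dog")])  # 1
```

Dividing each count by the total number of pairs would give relative frequencies instead, if a normalised weight is preferred.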
WIENER takes corpuses as written texts and treats sentences as those sets
of words separated by ".". Links are made only between words within each sentence.
The resulting graph is then systematically analysed, and a number of global and local
measures can be computed on the largest connected subgraph. Among these measures
are the two that characterize the so-called
small-world structure of a graph, namely
the average path length and the clustering coefficient. The first is defined as
the average length of the shortest paths between any pair of words, whereas the second
is defined as the probability that two words linked to a common word
are also linked to each other. Most complex networks are characterized
by a small average path length (much smaller than the system's size) and a high clustering
coefficient. The tools to be provided within this package make it possible to visualize
complex networks and test different properties at local and global scales.
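The two small-world measures can be computed with plain breadth-first search. The sketch below is an illustration only, with hypothetical function names; it assumes links are treated as undirected and takes the clustering coefficient to be the mean over words of the fraction of neighbour pairs that are themselves linked (one common reading of the definition above).

```python
from collections import deque
from itertools import combinations


def undirected(links):
    """Adjacency sets, ignoring link direction and self-links."""
    adj = {}
    for a, b in links:
        if a != b:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable word."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def largest_component(adj):
    """Restrict the graph to its largest connected subgraph."""
    seen, best = set(), set()
    for node in adj:
        if node not in seen:
            comp = set(bfs_distances(adj, node))
            seen |= comp
            if len(comp) > len(best):
                best = comp
    return {n: adj[n] for n in best}


def average_path_length(adj):
    """Mean shortest-path length over all pairs (graph must be connected)."""
    total = pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs


def clustering_coefficient(adj):
    """Mean fraction of each word's neighbour pairs that are linked."""
    values = []
    for nbrs in adj.values():
        if len(nbrs) < 2:
            values.append(0.0)  # assumption: degree-1 words count as 0
            continue
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        values.append(2 * linked / (len(nbrs) * (len(nbrs) - 1)))
    return sum(values) / len(values)


links = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("x", "y")}
g = largest_component(undirected(links))
print(average_path_length(g), clustering_coefficient(g))
```

Running one search per word costs on the order of N times the number of links, which is acceptable for a sketch but may need sampling on very large corpora.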
A screenshot of our algorithm is shown below.
FIGURE: From written or spoken
corpuses, it is possible to build a network representation of language based on
co-occurrence graphs. Here a piece of the core of Moby Dick is shown as obtained
from our software package WIENER. Words are indicated as balls, and a link
indicates that the two linked words appeared (at least once) one after the other within a
sentence. In this screenshot we can see some of the statistics generated by the
algorithm. Here the graph has been coloured by considering the degree of the
WIENER computes the degree distribution P(k) as well as the cumulative one, defined as
$P_{>}(k) = \sum_{k' \geq k} P(k')$,
which is much more convenient when plotting these statistical measures. WIENER
also computes statistical quantities such as
betweenness centrality, defined for the k-th word as
the fraction of shortest paths flowing through it:
the number of minimal paths between pairs of other words that cross this word,
relative to the total number of minimal paths between those pairs.
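Betweenness can be illustrated with a direct, unoptimised computation. The sketch below is not WIENER's algorithm: it assumes an undirected graph given as adjacency sets, sums over unordered pairs of other words the share of their shortest paths that pass through the word in question, and leaves the result unnormalised. The function names are hypothetical, and the cubic cost makes it suitable only for small graphs.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from source: distance to each word and the number of
    distinct shortest paths that reach it."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # every shortest path to u extends to v
    return dist, sigma


def betweenness(adj):
    """For each word k, sum over unordered pairs (s, t) of other words
    the fraction of shortest s-t paths that pass through k."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    score = {k: 0.0 for k in adj}
    nodes = list(adj)
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t lie in different components
            for k in nodes:
                if k == s or k == t or k not in dist_s:
                    continue
                dist_k, sigma_k = info[k]
                # k is on a shortest s-t path iff the distances add up.
                if dist_s[k] + dist_k[t] == dist_s[t]:
                    score[k] += sigma_s[k] * sigma_k[t] / sigma_s[t]
    return score


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(betweenness(path))
```

For larger networks, Brandes' accumulation scheme gives the same values at much lower cost.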
This package provides
a visualization of the global organization of the LPN as well as
several available statistical measures, including different correlations
between statistics. These measures and distributions can be exported
to an external file and used for further calculations.
Corpora and Corpus-based linguistic information
The WIENER package manipulates written compilations of
both texts and transcribed spoken sentences. The main sources for our
analysis are compiled corpuses available online. Here we provide a list
of some of the richest websites where most languages (modern and old)
These corpuses have different levels of detail and texts need to be
filtered (removing numbers and undesired symbols with no relevance)
before the package computes the network structure (small world,
scale-free) and statistical properties of the corpus, both at the
lexicon and network levels.
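A filtering step of this kind might look as follows. This is a hedged sketch, not the filter WIENER applies: the function name clean_corpus is hypothetical, and the choice to keep only letters, apostrophes, whitespace and the "." sentence delimiter is an assumption.

```python
import re


def clean_corpus(text):
    """Replace digits, underscores and symbols with spaces, keeping
    letters (in any script), apostrophes and '.' as sentence delimiter."""
    # [^\w\s.'] matches symbols and punctuation; [\d_] matches digits
    # and underscores, which \w would otherwise let through.
    text = re.sub(r"[^\w\s.']|[\d_]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(clean_corpus("Chapter 1: Loomings -- (1851).\nCall me Ishmael!"))
# Chapter Loomings . Call me Ishmael
```

Note that this sketch discards "!" and "?" rather than converting them to ".", so sentences ending in those marks are merged with the following one; whether that is desirable depends on the corpus.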
ONLINE LANGUAGE CORPUSES
Case studies on language network analysis
Within our research on language networks, we are considering two special
situations in which the size and structure of these networks change
over time. The first involves their generation and growth during language
acquisition in children. The second deals with a complementary situation,
namely the decay of network organization due to cognitive deterioration.
Moreover, we have an ongoing project on the emergence of language networks
in robots, which will be developed in collaboration with the
SONY CSL Lab.
Our goal here would be to explore the structure of growing lexical and grammatical
webs in Aibo robots, and to study their relation with the underlying
Analysing the ontogeny of children language
Language network decay under brain damage
Emergence of artificial
Complexity & Language related WebSites.
Since its founding in 1984, the Santa Fe Institute (SFI) has devoted
itself to fostering a multidisciplinary scientific research community
pursuing frontier science. SFI seeks to catalyze new research
activities and serve as an "institute without walls". The Institute
explores a number of research areas with strong links to ECAgents
topics, including human language, complex networks and computation in
Sony CSL Paris
This research is part of the newly emerging field of evolutionary
linguistics. We are investigating ways in which artificial agents can
self-organise languages with natural-language-like properties and how
meaning can co-evolve with language. Our research is based on the
hypothesis that language is a complex adaptive system that emerges
through adaptive interactions between agents and continues to evolve in
order to remain adapted to the needs and capabilities of the agents.
Evolution and Computation, Edinburgh UK
This group is part of Theoretical and Applied Linguistics, within
the School of Philosophy, Psychology and Language Sciences at the
University of Edinburgh. Their focus is on understanding the origins
and evolution of language and communication. They have pioneered the
application of computational and mathematical modelling techniques to
traditional issues in language acquisition, change and evolution. The
overall goal is to develop a theory of language as a complex adaptive
system operating on multiple time-scales.
MIT Linguistics USA
The research conducted by the MIT Linguistics Program strives to
develop a general theory that reveals the rules and laws that govern
the structure of particular languages, and the general laws and
principles governing all natural languages. The core of the program
includes most of the traditional subfields of linguistics: phonology,
morphology, syntax, semantics, and psycholinguistics, as well as
questions concerning the interrelations between linguistics and other
disciplines such as philosophy and logic, literary studies, the study
of formal languages, acoustics, and computer science.
Institute for Logic, Language and Computation. Amsterdam
The Institute for Logic, Language and Computation (ILLC) is a research
institute of the University of Amsterdam, in which researchers from the
Faculty of Science and the Faculty of Humanities collaborate.
ILLC's central research area is the study of fundamental principles of
encoding, transmission and comprehension of information. Emphasis is on
natural and formal languages, but other information carriers, such as
images and music, are studied as well.
Research at ILLC is interdisciplinary, and aims at bringing together
insights from various disciplines concerned with information and
information processing, such as logic, mathematics, computer science,
linguistics, cognitive science, artificial intelligence and philosophy.
Stanford Department of Linguistics
The Stanford University Department of Linguistics is a vibrant center
of research and teaching. The range of languages studied is diverse and
the scope of active research and teaching is broad, including
acquisition, computational linguistics, historical linguistics,
morphology, phonetics, phonology, pragmatics, psycholinguistics,
semantics, sociolinguistics, syntax, typology and variation.
Berkeley Linguistics Department, California
The Berkeley Linguistics Department has a rich and distinguished tradition.
The Department has strengths in many areas. Phonetics, phonology,
morphology, syntax, semantics, pragmatics, sociolinguistics, historical
linguistics, and cognitive linguistics are all well represented. The
Department emphasizes research that seeks to discover and provide
explanations for general properties of linguistic form, meaning, and
usage. We are also committed to linguistics in the service of
endangered languages, and support a number of language revitalization
programs for Native Americans.
Much of our research is potentially interdisciplinary and/or involves
the careful documentation of individual languages, language families,
and their histories. The Department has always had a strong commitment
to the study of American Indian languages, and also has special
strengths in African, Asian and European languages.
Group of Universitat de Barcelona
The Grup de Biolinguistica (GB) at the
University of Barcelona (UB) was established with the aim of relating
the study of language to the natural sciences in general and biology in
particular. This goal is fully justified, since the language faculty is
one of the most (if not the most) important endowments of our species.
The GB aims to advance this goal by contributing to the consolidation
and diffusion of information on the subject via a number of different
means, such as lectures, seminars and conferences.
The GB regards as necessary the collaborative exchange of ideas
between linguistic theory and other disciplines such as evolutionary
biology, neurolinguistics, psycholinguistics, genetics, complex systems
theory or ethology. Incorporating evidence from each of these
fields allows us to gain a better understanding of the nature, origins
and evolution of language.