Research Project, ICREA/Complex Systems Lab 
Main Researchers: Bernat Corominas, Ricard V. Solé & Sergi Valverde

Related projects: ECAgents 





The WIENER site (for "Word Interactions: Exploring NEtwork Robustness") provides a collection of links to online resources where corpuses from different languages can be downloaded in order to perform network analyses of language architecture. Our definition of network is restricted to the presence of directed co-occurrences within sentences: two words within a sentence are linked provided they appear next to each other. The order of appearance is preserved in the resulting graph, with an arrow pointing from the first word to the next.



WIENER works mainly with written corpuses

Additionally, the WIENER package, developed within the EU project ECAgents, will allow users to analyse and visualize large language networks. The package will help identify relevant words and modules that facilitate communication, and test the impact of their loss on the network's performance.


NEWS


1/03/2005   First public release and publication of this web page.

17/03/2005  Syntax for free? by: Ricard V. Solé.

30/12/2005  Santa Fe Institute Working Paper: Language Networks: Their Structure, function and evolution by: Ricard V. Solé, Bernat Corominas, Sergi Valverde.

30/12/2005  Paper to appear in J. Theor. Biol: Network topology and self-consistency in language games by: Bernat Corominas Murtra and Ricard V. Solé.

06/03/2006  First version of the Wiener Package (WIENER1.0) has been released.
 

Coming Soon:

· A systematic guide to building child syntactic networks.

· A new paper with an extensive analysis of co-occurrence language networks (in preparation).

 

 

Wiener 1.0
Download application
Download manual
This software requires Windows 98, 2000 or XP

 


Network generation and analysis from text corpuses

Human language involves a nested set of networks. These networks involve words as basic units, interacting at different levels. The dynamics of such networks, the constraints acting on them and their evolution define what we know as human language. Semantic, syntactic and production webs have been studied in recent years. To obtain a language network automatically, we can use the precedence relation that naturally emerges from finding two words one after the other within (at least) one sentence. This means that a given text T will be defined as a set of sentences,

and thus

Together with words as the basic units (nodes) of the graph, links are defined as follows: two words are "linked" if they appear one after the other (in a given order) at least within one sentence. These webs capture what we can call the production capacity of the underlying syntactic rules. Our network will be the result of the intersections of many different sentences and their potential combinations, and thus it provides a glimpse of the architecture of production and the relative relevance of different words in each language. By using these Language Production Networks (LPN) we adopt the simplest approach to building language webs. Such an approach has been shown to be able to capture a large part of the underlying syntactic structure.
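The link rule above can be illustrated with a minimal Python sketch. It is not the WIENER implementation: the function name `build_lpn`, the lower-casing, and the letters-only tokenizer are assumptions made for illustration, and sentences are simply split on ".".

```python
import re


def build_lpn(text):
    """Return the set of directed links (first_word, next_word) found in `text`."""
    edges = set()
    # Assumption: sentences are delimited by "." only.
    for sentence in text.split("."):
        # Assumption: a word is a run of lower-cased letters or apostrophes.
        words = re.findall(r"[a-z']+", sentence.lower())
        # Link each word to its immediate successor within the same sentence.
        for first, second in zip(words, words[1:]):
            edges.add((first, second))
    return edges


if __name__ == "__main__":
    for link in sorted(build_lpn("The dog chased the cat. The cat ran.")):
        print(link)
```

Because links are stored in a set, a pair that occurs in several sentences still yields a single all-or-none link, and no link is ever created across a sentence boundary.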

Additionally, we can consider the frequency of pairs of words found in the corpus, which naturally introduces a flow measure (or weight). Thus we expand the previous all-or-none connections to a weighted graph with links given by:

which measures how often two words appear linked. WIENER takes corpuses as written texts and considers sentences to be those sets of words separated by ".". Links are made only between words within each sentence. The resulting graph is then systematically analysed, and a number of global and local measures can be computed on the largest connected subgraph. Among these measures are the two that characterize the so-called small-world structure of a graph, namely the average path length and the clustering coefficient. The first is defined as the average length of the shortest paths between any pair of words, whereas the second is defined as the probability that two words linked to a common word are also linked to each other. Most complex networks are characterized by a small average path length (much smaller than the system's size) and a high clustering coefficient.
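As a rough illustration of these measures, the sketch below counts word pairs and computes the two small-world quantities on the largest connected subgraph. Several choices are assumptions rather than WIENER's documented behaviour: raw pair counts stand in for the weights, the graph is treated as undirected for both measures, and the clustering coefficient is taken as the average of the local values. All function names are hypothetical.

```python
import re
from collections import Counter, deque


def pair_counts(text):
    """Count how often each ordered word pair occurs (raw counts, assumed weights)."""
    counts = Counter()
    for sentence in text.split("."):
        words = re.findall(r"[a-z']+", sentence.lower())
        counts.update(zip(words, words[1:]))
    return counts


def undirected_neighbours(edges):
    """Map each word to the set of words it is linked to, ignoring direction."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set())
        nbrs.setdefault(b, set())
        if a != b:  # ignore self-links such as "the the"
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs


def bfs_distances(nbrs, source):
    """Shortest-path lengths from `source` to every reachable word."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in nbrs[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def largest_component(nbrs):
    """Restrict the graph to its largest connected subgraph."""
    seen, best = set(), set()
    for node in nbrs:
        if node not in seen:
            comp = set(bfs_distances(nbrs, node))
            seen |= comp
            if len(comp) > len(best):
                best = comp
    return {n: nbrs[n] for n in best}


def average_path_length(nbrs):
    """Mean shortest-path length over all pairs of distinct, connected words."""
    total = pairs = 0
    for source in nbrs:
        for target, d in bfs_distances(nbrs, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(nbrs):
    """Average, over words, of the fraction of neighbour pairs that are linked."""
    values = []
    for node, ns in nbrs.items():
        ns = list(ns)
        k = len(ns)
        if k < 2:
            values.append(0.0)
            continue
        linked = sum(1 for i in range(k) for j in range(i + 1, k)
                     if ns[j] in nbrs[ns[i]])
        values.append(2 * linked / (k * (k - 1)))
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    weights = pair_counts("The cat ran. The cat sat. The dog ran.")
    core = largest_component(undirected_neighbours(weights))
    print(weights[("the", "cat")], average_path_length(core), clustering_coefficient(core))
```

The all-pairs breadth-first search is quadratic in the number of words, which is fine for a small example but would need a faster approach for a full corpus.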

The algorithms provided within this package allow users to visualize complex networks and test different properties at local and global scales. A screenshot of the package is shown below.



FIGURE: From written or spoken corpuses, it is possible to build a network representation of language based on co-occurrence graphs. Here a piece of the core of Moby Dick is shown, as obtained from our software package WIENER. Words are shown as balls, and links indicate that the two linked words appeared (at least once) one after the other within a sentence. The screenshot also shows some of the statistics generated by the algorithm. The graph has been coloured according to the degree of the nodes.

Additionally, WIENER can calculate the distributions that characterize the large-scale architecture of LPNs. One interesting property of these networks is that their distributions are highly heterogeneous. The simplest involves the frequency P(k) of words having k connections with other words (equivalently, having degree k). This distribution is scale-free, namely it falls off with the number of links as a power law,

WIENER computes this distribution as well as the cumulative one, defined as

which is much more convenient when plotting these statistical measures. WIENER also computes statistical quantities such as betweenness centrality, defined as the fraction of shortest paths flowing through a given word, namely:

defined here for the k-th word. Here

is the fraction of minimal paths crossing this word starting from any two arbitrary pairs of words, whereas

is the fraction of minimal paths between all other pairs of words. This package provides a visualization of the global organization of the LPN as well as several available statistical measures, including different correlations between statistics. These measures and distributions can be exported to an external file and used for further calculations.
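The statistics described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as a hypothetical dictionary mapping each word to its set of neighbours (undirected), the cumulative distribution is taken as the fraction of words with degree k or larger, and betweenness is computed with Brandes' algorithm as an unnormalised sum over pairs, which may differ from the normalisation WIENER uses.

```python
from collections import Counter, deque


def degree_distribution(nbrs):
    """P(k): fraction of words having exactly k neighbours."""
    n = len(nbrs)
    counts = Counter(len(ns) for ns in nbrs.values())
    return {k: c / n for k, c in sorted(counts.items())}


def cumulative_distribution(p):
    """Fraction of words with degree k or larger (assumed convention)."""
    cum, running = {}, 0.0
    for k in sorted(p, reverse=True):
        running += p[k]
        cum[k] = running
    return dict(sorted(cum.items()))


def betweenness(nbrs):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalised)."""
    score = dict.fromkeys(nbrs, 0.0)
    for s in nbrs:
        order = []                         # words in order of discovery
        preds = {v: [] for v in nbrs}      # predecessors on shortest paths
        sigma = dict.fromkeys(nbrs, 0)     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in nbrs[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions back from the farthest words.
        delta = dict.fromkeys(nbrs, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of words was visited from both endpoints.
    return {v: b / 2 for v, b in score.items()}


if __name__ == "__main__":
    # Toy star graph: "the" is linked to three other words.
    star = {"the": {"cat", "dog", "sea"}, "cat": {"the"}, "dog": {"the"}, "sea": {"the"}}
    p = degree_distribution(star)
    print(p, cumulative_distribution(p), betweenness(star))
```

Plotting the cumulative values on logarithmic axes smooths out the noise in the tail of P(k), which is why the cumulative form is more convenient for heterogeneous distributions.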





Corpora and Corpus-based linguistic information

The WIENER package manipulates written compilations of both texts and transcribed spoken sentences. The main sources for our analysis are compiled corpuses available online. Here we provide a list of some of the richest websites, where most languages (modern and old) are represented. These corpuses have different levels of detail, and texts need to be filtered (removing numbers and irrelevant symbols) before the package computes the network structure (small world, scale-free) and statistical properties of the corpus, at both the lexicon and network levels.
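The filtering step might look like the following Python sketch. The exact rules WIENER expects are not specified here, so the function `clean_corpus` and its choices (lower-casing, keeping letters, apostrophes and "." as the sentence separator, dropping everything else) are assumptions for illustration only.

```python
import re


def clean_corpus(raw):
    """Lower-case `raw` and drop digits and symbols, keeping '.' as sentence boundary."""
    text = raw.lower()
    # Replace anything that is not a letter, whitespace, "." or "'" with a space;
    # \w also matches digits and "_", so those are removed explicitly.
    text = re.sub(r"[^\w\s.']|[\d_]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_corpus("Call me Ishmael (1851)! Some years ago, 3 ships."))
```

Because `\w` is Unicode-aware in Python 3, accented letters in other languages survive the filter. Note that "!" and "?" are removed rather than converted to ".", so sentences ending in those marks merge with the following one; whether that is desirable depends on the corpus.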


DATABASE ON ONLINE LANGUAGE CORPUSES





Case studies on language network analysis

Within our research on language networks, we are considering two special situations in which the size and structure of these networks change over time. The first involves their generation and growth during language acquisition in children. The second deals with a complementary situation, namely the decay of network organization due to cognitive deterioration. Moreover, we have an ongoing project on the emergence of language networks in robots, which will be developed in collaboration with the SONY CSL Lab. Our goal here is to explore the structure of growing lexical and grammatical webs in Aibo robots, and to study their relation with the underlying software network.

Analysing the ontogeny of children's language


Language network decay under brain damage


Emergence of artificial language networks


 



Complexity & Language related WebSites.

Santa Fe Institute

Since its founding in 1984, the Santa Fe Institute (SFI) has devoted itself to fostering a multidisciplinary scientific research community pursuing frontier science. SFI seeks to catalyze new research activities and serve as an "institute without walls". The Institute explores a number of research areas with strong links to ECAgents topics, including human language, complex networks and computation in biology.


Sony CSL Paris
This research is part of the newly emerging field of evolutionary linguistics. We are investigating ways in which artificial agents can self-organise languages with natural-language-like properties and how meaning can co-evolve with language. Our research is based on the hypothesis that language is a complex adaptive system that emerges through adaptive interactions between agents and continues to evolve in order to remain adapted to the needs and capabilities of the agents.


Language Evolution and Computation, Edinburgh UK
This group is part of the Theoretical and Applied Linguistics, within the School of Philosophy, Psychology and Language Sciences at the University of Edinburgh. Their focus is on understanding the origins and evolution of language and communication. They have pioneered the application of computational and mathematical modelling techniques to traditional issues in language acquisition, change and evolution. The overall goal is to develop a theory of language as a complex adaptive system operating on multiple time-scales.


MIT Linguistics USA
The research conducted by the MIT Linguistics Program strives to develop a general theory that reveals the rules and laws that govern the structure of particular languages, and the general laws and principles governing all natural languages. The core of the program includes most of the traditional subfields of linguistics: phonology, morphology, syntax, semantics, and psycholinguistics, as well as questions concerning the interrelations between linguistics and other disciplines such as philosophy and logic, literary studies, the study of formal languages, acoustics, and computer science.


Institute for Logic, Language and Computation. Amsterdam
The Institute for Logic, Language and Computation (ILLC) is a research institute of the University of Amsterdam, in which researchers from the Faculty of Science and the Faculty of Humanities collaborate. ILLC's central research area is the study of fundamental principles of encoding, transmission and comprehension of information. Emphasis is on natural and formal languages, but other information carriers, such as images and music, are studied as well. Research at ILLC is interdisciplinary, and aims at bringing together insights from various disciplines concerned with information and information processing, such as logic, mathematics, computer science, linguistics, cognitive science, artificial intelligence and philosophy.


Stanford Department of Linguistics
The Stanford University Department of Linguistics is a vibrant center of research and teaching. The range of languages studied is diverse and the scope of active research and teaching is broad, including acquisition, computational linguistics, historical linguistics, morphology, phonetics, phonology, pragmatics, psycholinguistics, semantics, sociolinguistics, syntax, typology and variation.


Berkeley Linguistics Department, California
The Berkeley Linguistics Department has a rich and distinguished tradition. The Department has strengths in many areas. Phonetics, phonology, morphology, syntax, semantics, pragmatics, sociolinguistics, historical linguistics, and cognitive linguistics are all well represented. The Department emphasizes research that seeks to discover and provide explanations for general properties of linguistic form, meaning, and usage. We are also committed to linguistics in the service of endangered languages, and support a number of language revitalization programs for Native Americans. Much of our research is potentially interdisciplinary and/or involves the careful documentation of individual languages, language families, and their histories. The Department has always had a strong commitment to the study of American Indian languages, and also has special strengths in African, Asian and European languages.


Biolinguistics Group of Universitat de Barcelona

The Grup de Biolinguistica GB at the University of Barcelona (UB) was established with the aim of relating the study of language to the natural sciences in general and biology in particular. This goal is fully justified, since the language faculty is one of the most (if not the most) important endowments of our species. The GB aims to advance this goal by contributing to the consolidation and diffusion of information on the subject via a number of different means, such as lectures, seminars and conferences.
The GB regards as necessary the collaborative exchange of ideas between linguistic theory and other disciplines such as evolutionary biology, neurolinguistics, psycholinguistics, genetics, complex systems theory or ethology. Incorporating evidence from each of these fields allows us to gain a better understanding of the nature, origins and evolution of language.