Background

Intelligence gathering and documentation in global organizations are maturing from separate, largely manual tasks into organization-wide processes of multilingual content management. Timeliness, efficiency and quality are sought through redistribution of labor, automation of the processes, and standardization of their products.

TF: TermFactory

Concept-based special language terminology work continues to be expensive, largely manual expert work. For some time now, there have been commercial tools for multilingual terminology management (e.g. SDL Multiterm). However, these tools provide scant support for concept analysis. There are commercial solutions for distributing multilingual terminology on the web (e.g. MultiTerm Online), but as yet few examples of truly collaborative distributed terminology work. The semantic web ontology languages (RDF, OWL) are designed for distributed semantic work, but current implementations do not specifically address multilingual terminology work (e.g. Collaborative Protégé, FinnOnto). TermFactory is designed to fill this gap. TermFactory draws on the research partners' long-standing expertise in language technology, terminology theory and practice, and web technology. Specifically, it builds on a number of preparatory projects.

TEKES Fenix program project 4M (Mobile Multilingual Maintenance Man) built a multilingual natural language dialogue system based on Web ontology language techniques (RDF, OWL) with an application in the service business. The core of the system is a hierarchy of generic and domain ontologies which provide explicit language support for natural language processing components (NL parsing and generation plus speech recognition and synthesis) as well as for an array of reasoners (including information retrieval and case- and model-based reasoning). The project also provided offline tools to populate the system components with domain data (including semantic document indexing and information extraction). Research partners: University of Helsinki, Helsinki University of Technology, VTT; technology partner: Lingsoft Inc; business partners: Fujitsu Services, Nokia Business Infrastructures, Wärtsilä.

TEKES Fenix program project FinnOnto (Eero Hyvönen) and its sequel convert national documentation thesauri into open source ontologies for the use of memory organisations (libraries, museums, etc.), health officials and companies. The largest of these ontologies, YSO (Yleinen suomalainen ontologia, General Finnish Ontology), is a conversion of the national library thesaurus, which consists of over 20,000 keywords. These ontologies will be accessible for download from an ontology server. The FinnOnto project has made headway in the problems of distributed web ontology development, versioning, and the presentation of non-discrete data in ontologies (time and space ontologies). Work is in progress toward translating search terms into English and Swedish (http://www.seco.tkk.fi/projects/finnonto).

Units of the University of Helsinki located in Kouvola – Department of Translation Studies and Palmenia Centre for Continuing Education – have long-standing experience in the theory and practice of multilingual terminology work. They have produced a number of special language vocabularies, most recently the Finnish-Russian Forestry Dictionary and the EU-Russia Project Co-operation Glossary (www.projectglossary.eu).

I2I: Information to Intelligence

The Department of Computer Science of the University of Helsinki (HY/CS) has been developing methods for document and text analysis, text mining, Information Extraction (IE), and automatic acquisition of linguistic knowledge for language understanding tasks. HY/CS is building an IE system called PULS – Pattern-based Understanding and Learning System – which will serve as the core IE component in I2I. At the center of the system is the IE engine, which receives plain text, analyzes it, finds facts of a pre-specified kind, and outputs them in structured form into a database table (“extracted facts”).

For example, a system that extracts facts about medical epidemics finds text segments mentioning an epidemic incident (e.g., “A family of 4 contracted bird flu in Thailand last week”) and records the attributes of the incident: the name of the disease, the date and location of the incident, the number of victims, their type (humans, animals), and whether they survived or died. These facts belong to the domain of epidemics. The kind of facts that PULS extracts depends on the knowledge bases, which are customized for each new domain.
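The structured form of such an extracted fact can be pictured as a single record with one field per attribute. The following is a minimal sketch only: the class name EpidemicIncident and all field names are hypothetical illustrations chosen for this page, not the actual PULS schema, and the date is kept as it appears in the text rather than normalized.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EpidemicIncident:
    """One hypothetical row of an "extracted facts" table (not the PULS schema)."""
    disease: str
    location: str
    date: str                    # kept as expressed in the text; no normalization
    victim_count: Optional[int]  # None if the text gives no number
    victim_type: str             # "human" or "animal"
    outcome: Optional[str]       # "survived", "died", or None if not stated


# Attributes recoverable from the example sentence:
# "A family of 4 contracted bird flu in Thailand last week"
incident = EpidemicIncident(
    disease="bird flu",
    location="Thailand",
    date="last week",
    victim_count=4,
    victim_type="human",
    outcome=None,  # the sentence does not say whether the victims survived
)

# A flat dictionary maps directly onto the columns of a database table.
row = asdict(incident)
```

Leaving a field empty when the text is silent, as with the outcome here, keeps the record limited to what the sentence actually states.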

A critical bottleneck in building ontologies is the problem of knowledge acquisition for new domains. The process of populating a new ontology, if performed manually, is so labor-intensive as to become the most costly part of the project, over the long term. Therefore, for an ontology-building project to be truly scalable it should include a component that helps to populate the ontology for a new domain automatically, reducing manual intervention as far as possible.

We propose to explore bootstrapping methods for ontology population, as they offer the lowest cost/benefit ratio. By bootstrapping we mean minimally-supervised machine learning techniques. 

We stratify the process of building an ontology into two sub-problems:

  • a. defining the set of relations and the hierarchy of internal nodes; and
  • b. populating the large classes in the ontology, typically found at the leaf nodes.

Normally one relies for this on the work of ontology builders, lexicographers, etc., who would compile the collections of terms manually, either from pre-existing resources (specialized ontologies) or directly by analyzing text. Sub-problem a. is difficult to automate, since it requires deep, real-world knowledge and understanding of the domain. However, sub-problem b. may be viewed as a classification problem, and hence may be amenable to a computational approach.

For example, in the medical domain we need to identify as many disease names as possible; also essential are disease agents (bacteria, viruses, fungi, algae, parasites, etc.), disease vectors (i.e., organisms that transmit disease: rats, mosquitoes, etc.), drugs used in treatment, etc.  Such classes of terms are more general than traditional proper names and are more difficult to identify (e.g., they are not necessarily capitalized).

We have obtained promising initial results in this area and presented the Nomen algorithm, which learns multiple classes of terms simultaneously. The salient features of Nomen are:

  • it learns terms (or generalized names), with no reliance on capitalization cues (which allows it to go beyond proper names),
  • it learns from an un-annotated corpus, by bootstrapping from a few seeds,
  • it learns several categories simultaneously, and uses additional categories for negative evidence to reduce overgeneration and degradation of precision.

The first point is relevant to our target applications, since most terms are not proper names. Starting from as few as ten sample terms for each category, Nomen is able to learn thousands of terms, with high accuracy.
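The general idea of bootstrapping several categories from a few seeds can be illustrated with a toy sketch. This is not the Nomen algorithm itself: the context definition (the pair of neighbouring tokens), the rule that discards contexts shared between categories as a crude stand-in for negative evidence, the function names, and the tiny corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def contexts(sentences):
    """Yield (token, context) pairs; a context is the (previous, next) token pair."""
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            yield padded[i], (padded[i - 1], padded[i + 1])


def bootstrap(sentences, seeds, rounds=2):
    """Grow each category's term set from its seeds (toy version, no scoring)."""
    learned = {cat: set(terms) for cat, terms in seeds.items()}
    pairs = list(contexts(sentences))
    for _ in range(rounds):
        # Record which categories each context has been seen with.
        ctx_cats = defaultdict(set)
        for token, ctx in pairs:
            for cat, terms in learned.items():
                if token in terms:
                    ctx_cats[ctx].add(cat)
        # Keep only contexts tied to exactly one category; contexts shared
        # by competing categories are treated as unreliable.
        reliable = {ctx: next(iter(cats))
                    for ctx, cats in ctx_cats.items() if len(cats) == 1}
        known = set().union(*learned.values())
        proposals = defaultdict(set)
        for token, ctx in pairs:
            if token not in known and ctx in reliable:
                proposals[token].add(reliable[ctx])
        # Accept a new term only if all its reliable contexts agree.
        for token, cats in proposals.items():
            if len(cats) == 1:
                learned[next(iter(cats))].add(token)
    return learned


corpus = [s.split() for s in [
    "outbreak of cholera in Haiti",
    "outbreak of malaria in Kenya",
    "outbreak of dengue in Brazil",
    "cases of measles in Peru",
]]
seeds = {"disease": {"cholera"}, "location": {"Haiti"}}
result = bootstrap(corpus, seeds)
```

With one seed per category, the sketch picks up the remaining disease and location names in this corpus without relying on capitalization. A realistic system would score patterns and candidates rather than accept them outright.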

In a variant experiment, Nomen was shown to work well with a very large seed as well, finding more unknown names in a raw corpus.  The basic Nomen algorithm achieved close to 90% recall, with good precision (about 70%). In this project, we will investigate improvements to the algorithm, to make it more useful for term discovery.
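For readers unfamiliar with the two measures, the sketch below shows how precision and recall are computed for a set of learned terms against a reference list, and how the two approximate figures quoted above combine into a single F1 score. The F1 value is not reported in the source; it is derived here only from the stated approximate numbers, and the example term sets are invented for illustration.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted term set against a reference set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Invented toy sets: four learned terms, three reference terms, two in common.
predicted_terms = {"cholera", "malaria", "outbreak", "cases"}
gold_terms = {"cholera", "malaria", "dengue"}
toy_precision, toy_recall = precision_recall(predicted_terms, gold_terms)

# Approximate figures quoted in the text: about 70% precision, close to 90% recall.
combined_f1 = f1(0.70, 0.90)
```

On the quoted figures the combined score comes out at roughly 0.79, which is only as precise as the rounded inputs.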

MuTOL – Multilingual Terminology and Ontology Learning

The Computational Cognitive Systems group (COG) at the Department of
Information and Computer Science at the Helsinki University of
Technology (ICS/TKK) conducts research on artificial systems that
combine perception, action, reasoning, learning and communication. A
specific focus is on adaptive language technology. Central research
themes of the group include adaptive machine translation, modeling of
conceptual spaces, automatic and language independent extraction of
terminologies, and learning ontologies from examples. These techniques
can quickly discover and analyze complex patterns and learn from new
data, and they are able to handle textual data independently of the
language and domain used.

In the MuTOL workpackage of the ContentFactory project, candidate
terms are extracted automatically from text in different languages and
hierarchical structures are built based on the candidate terms. The
research group has recently been a partner in an EU-funded project
called MedIEQ. The MedIEQ project has developed quality labeling
methods for multilingual web-based medical text resources using
semantic technologies.

SIS: Service Interoperability Support

Business managers face the daily challenge of how to search for and discover relevant market data on products, services, potential customers, partners, competitors and new entrants. The solution lies in the freely available collective intelligence on the web. Online service-interoperability support has been developed to help business managers do their work more effectively and efficiently. In the framework of ContentFactory (CF), SIS provides support for the case study that involves the industry partners and integrates the ontology solutions from TF and I2I.

SIS evolves from two earlier projects, namely CrossWork and SOAMeS. The first is a completed FP-6 EU research project whose objective was to develop automated mechanisms for dynamic workflow formation and enactment, enabling tight coupling and strong synergies between different organizations. The CrossWork architecture involves the development of ontologies for goal decomposition and team formation, followed by an inter-organizational business-process setup, verification, and enactment environment that integrates legacy systems. Furthermore, this architecture is complemented by visualization tools for the setup and enactment phase of an eSourcing configuration.

The SOAMeS project (Service Oriented Architecture in Multichannel e-Services) set out to create a roadmap for enterprises to adopt SOA and SOA-oriented tools in enterprise computing and inter-enterprise computing. Besides studying the adoption process for present needs, there is a further research agenda for enhancing SOA and related business-network-management facilities.

The SOAMeS project continued the project series on the B2B interoperability middleware of the group for collaborative and interoperable computing (CINCO). The middleware provides generic platform services and infrastructure for managing dynamic collaborations in an open networked business environment. The research goals in the SOAMeS project included interoperability management at the process and pragmatic levels, and non-functional aspects addressing the business strategy needs. Part of the work was performed in industry-partner case projects, studying applicable methodology for SOA-based management of business processes and platforms and applying earlier results.

Latest news

25.5.09 - The ContentFactory website is up and running. It consists of a public WordPress site for general information on the project and public news, maintained by the University of Helsinki Palmenia unit in Kouvola, plus a project-internal TWiki site for day-to-day project business and internal reporting, maintained by the University of Helsinki Department of Computer Science in Kumpula.
