Research system of semantic information in medical videoconference based on conceptual graphs and domain ontologies

The growing number of audiovisual documents (AVDs) makes it difficult to search for information in very large databases whose contents cannot be indexed entirely by hand. These documents raise several difficulties: the quantity of multimedia data to be processed is increasing sharply, and representing and extracting their content, in particular its semantics, is complicated by the fact that they combine three types of media (text, sound and image). AVDs can be classified into professional broadcast videos (films, television programs), sports videos, video surveillance, videoconferences, etc. In this paper, we propose a model for representing the semantic content of medical videoconference documents, based on conceptual graphs and taking the different modalities into account. The model relies on the extraction of concepts and of the semantic relations between them, and draws on domain ontologies.


INTRODUCTION
Information technology is developing continuously. New technologies have enabled rapid progress in the production and management of information. Tools such as videoconferencing produce a huge amount of information, and this rapid growth raises the question of how to find the information of interest within such a mass. To address this problem, Information Retrieval Systems (IRS) have been developed to select, from a volume of information, the information that is relevant to an information need. An IRS aims to match two representations, one of the user's need and one of the documents' content, using a correspondence function.
The system we are building consists of three modules, ontologies, a document database and an index database. The three modules are the analysis module, the indexing module and the search module, as shown in Figure 1.

Figure 1: General architecture
The analysis module (also called the modeling module) is responsible for browsing and analyzing the documents stored in the document database. Its role is to define a model for describing media content that is appropriate for representing the content of videoconferences. This model must take into account the information items at different levels of description and the various media present, in a form that can be integrated into an indexing and retrieval system. The result of the analysis is an XML file.
The indexing module is in charge of browsing and indexing the documents produced by the analysis module, using the concepts of the ontology. The result of the indexing is stored in an index database. Indexing is started by the system administrator, who submits the address of the document to be indexed to the indexing module.
When a user wants to search for a videoconference, the request is sent to the search module.
In what follows, we will describe only the analysis and indexing modules of our system.
Ontologies have considerably improved the relevance of results in the search for audiovisual documents, which is why we opted for ontology-based indexing. The improvement comes from the fact that the indexing process takes into consideration the different concepts and the relations between them, as given by the ontology. Unlike methods based on simple, static keywords, which only check whether a word exists in a document, this method takes into account the semantics of the terms searched for. The rest of the paper first presents the analysis (modeling) module for videoconference documents, and then describes the indexing module, which relies on domain ontologies.

ANALYSIS MODULE
The information in videoconference content can be represented at several levels: physical information, consisting of the binary data of the content, which is not directly exploitable, and description information, which turns the physical information into knowledge that the user can exploit. This strengthens the interface between human and machine and makes the video content easier to exploit. To provide multiple levels of abstraction in the exploitation of videoconference content, we propose a two-level modeling schema: structure and semantics. Modeling the structure of the content describes how the information content is organized. This model is usually based on the classical structure of a video document (sequence, scene, shot). The descriptions at this level are computed automatically using visual descriptors, and this level of modeling is free from any semantic description.
Videoconferences are hierarchically structured into scenes, shots and images. This structure reflects the process of creating the videoconference. We use only visual descriptors to bring out this hierarchical organization, which is defined by a top-down approach through successive refinements. For a user, browsing a hierarchical structure is certainly easier than navigating a flat one.
The main interest of this organization is that it can be extracted automatically, without reference to the semantic content.
Semantic modeling, in turn, is an abstraction that links low-level descriptions to the real world. It is a modeling schema for describing the meaning of the descriptions located at the structure level. We use the notions of concept and conceptual relation to represent the occurrences (information items) described in the structure part.
We use the conceptual graph formalism to represent the different ideas in a videoconference by means of concepts and relations. Describing the content of each structural element makes it possible to extract the excerpts of the videoconference that answer specific requests by navigating the conceptual graphs. The conceptual graph model is formal, represents knowledge, and is concrete in the sense that efficient tools exist for manipulating the modeled knowledge. It models the knowledge of a domain using graphs built on a support. This modeling approach is intentional, has a semantics in first-order logic, and assumes a closed world for its reasoning [4].
In our approach, we develop a modeling schema for semantic knowledge that combines the two types of modeling (hierarchical and semantic) and is independent of the videoconference content, which means that it can be applied to any videoconference.

Modeling of visual content
Each facet corresponds to a type of informative content defined over the set of image objects, and is associated with a model describing the image objects, the relations between them and the operations defined on these descriptions. The facets are complementary, so as to cover the interpretations being modeled, and each instance of the general model is a combination of facets that conveys the richness of the relevant characteristics of the images.
There are two main categories of facets: the physical facet, which represents the entity perceived by the human eye in its planar, two-dimensional representation, and the logical facet, which gathers the interpretations of the image and all of its more semantic descriptors. The logical facet is subdivided into four facets whose combination provides the symbolic characterization of the image: the structural, spatial, symbolic and signal facets.

a) The structural facet
The structural facet represents the decomposition of an image into image objects. Each image object can in turn be decomposed into sub-objects. The composition relation associated with this facet is the relation "contains", which implies spatial inclusion: the regions corresponding to the component objects lie within the geometric boundaries of the region of the decomposed object.
The structural facet is represented by a conceptual graph whose nodes are the image objects and whose arcs are the instances of the composition relation.
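As an illustration, the structural facet can be sketched as a small graph of image objects linked by the "contains" relation. The class name `ConceptualGraph`, its methods and the image object names (`Io_frame`, `Io_expert`, etc.) are hypothetical and chosen only for this example; the sketch does not reproduce the conceptual graph tools used by the system.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptualGraph:
    """Minimal conceptual graph: concept nodes plus labeled binary relations."""
    concepts: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # (source, relation, target)

    def add_relation(self, source, relation, target):
        self.concepts.update((source, target))
        self.relations.add((source, relation, target))

    def components(self, image_object):
        """Return every image object transitively contained in image_object."""
        found, stack = set(), [image_object]
        while stack:
            current = stack.pop()
            for src, rel, dst in self.relations:
                if src == current and rel == "contains" and dst not in found:
                    found.add(dst)
                    stack.append(dst)
        return found


# Hypothetical decomposition of one key frame into image objects.
graph = ConceptualGraph()
graph.add_relation("Io_frame", "contains", "Io_expert")
graph.add_relation("Io_frame", "contains", "Io_operating_table")
graph.add_relation("Io_operating_table", "contains", "Io_patient")
```

Calling `graph.components("Io_frame")` walks the "contains" arcs and returns the three nested image objects, which is the kind of navigation the hierarchical decomposition is meant to support.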

b) The spatial facet
The spatial facet describes the geometrical information about the spatial objects associated with the image objects, as well as the spatial relations between them. This facet characterizes the objects of an image by their shapes and their relative positions. A spatial object is defined by a geometric shape (point, segment, polygon) corresponding to its contour. The spatial facet is represented in classical spaces in order to give the model greater generality. The first of these is Euclidean space, which combines the notions of scalar product, orthogonality, angles and norms. This space allows operations such as computing the centroid of an area, its length, width and height, and its bounding polygon [8].
The spatial sub-facet specifies the spatial relations (relative position, direction) between image objects. Consider two image objects Io1 and Io2 representing respectively the concepts "Expert" and "surgical operation" in the following example: "The videoconference segments show an expert explaining a surgery." The representation in the conceptual graph formalism using a spatial description is given by the graph below:

c) The symbolic facet
The symbolic facet represents the semantic content of an image. It is defined by a set of symbolic objects associated with the image objects, together with the relations that describe the scenes or actions involving those objects. The symbolic facet tries to take into account the multiple interpretations of the semantics conveyed by the image. It is strongly constrained by the application, insofar as the "meaning" depends on the understanding of the application domain and on the indexing language chosen to express the relations between the knowledge elements brought to light [8].

d) The signal facet
The signal facet contains "low-level" information on the visual content, such as color, texture, spatial positions, etc. This information is represented in the form of numerical descriptors. In many cases, it can be used to infer a semantic description by aggregating a number of these low-level criteria.
The modeling of the signal facet is inspired by the work of Belkhatir [3]. The signal facet describes the visual content of the videoconference document in terms of visual perception. It specifies the low-level visual features of the videoconference. Formally, we denote the elements of this facet by image descriptors (Ids). These descriptors are not necessarily symbolic, but they can be used to infer semantic descriptions.
The signal facet is divided into four sub-facets:


The color sub-facet describes the color characteristics of the visual content of the document. It is represented by the following graphs:
[Io] → (has color) → [<col>AND]
[Io] → (has color) → [<col>OR]
where <col>AND and <col>OR each represent a combination of 11 Boolean values corresponding to the color concepts presented in [8].

The texture sub-facet describes the texture properties of the visual content. It is represented by the following graphs:
[Io] → (has texture) → [<tex>AND]
[Io] → (has texture) → [<tex>OR]
where <tex>AND and <tex>OR each represent a combination of 11 Boolean values corresponding to the texture concepts presented in [8].

The motion sub-facet specifies the movements of the camera and of the image objects, together with their trajectories. It is represented by the following graphs:
[Io] → (has motion) → [<mvt>AND]
[Io] → (has motion) → [<mvt>OR]
where <mvt>AND and <mvt>OR each represent a combination of 8 Boolean values corresponding to the motion concepts presented in [8].

Modeling of audio content: Extracting terms
The extraction process consists of detecting terms in a documentary context. A documentary context is a textual unit inside an XML document; it may be a sentence, a paragraph, or an element of the logical structure (the Text nodes of the XML document). First, for each term (or one of its synonyms), we check whether it is present in the processed document.
Then we compute the number of occurrences of each term in the document, which is the cumulative count of the term and its synonyms. The frequency of occurrence of each term in the document is equal to its number of occurrences divided by the total number of terms in the document.
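The counting step can be sketched as follows. This is a minimal illustration under simplifying assumptions: terms and synonyms are single words, the text has already been taken out of the XML Text nodes, and the denominator is the total number of words. The function name `term_frequency` and the sample sentence are hypothetical.

```python
import re


def term_frequency(text, term, synonyms=()):
    """Count a term together with its synonyms and return (occurrences, frequency)."""
    words = re.findall(r"[a-z]+", text.lower())
    variants = {term.lower()} | {s.lower() for s in synonyms}
    occurrences = sum(1 for word in words if word in variants)
    # Frequency = occurrences of the term (and synonyms) / total number of words.
    frequency = occurrences / len(words) if words else 0.0
    return occurrences, frequency


# Hypothetical documentary context.
sample = "The surgeon begins the operation. The intervention lasts one hour."
count, freq = term_frequency(sample, "operation", synonyms=["intervention"])
```

Here the term and its synonym are counted as one and the same entry, so the cumulative count is what feeds the frequency.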
The purpose of this step is to extract all the terms of the document that are likely to represent concepts of the ontology. These terms correspond to different entries (or nodes) of the ontology. For this purpose, we use a technique that consists of projecting the ontology onto the document: a parser developed for this purpose browses the ontology to identify the concepts that occur as terms in the document (detailed in the indexing module). We set three goals for this step: extracting simple terms, compound terms and specific terms.
November, 2013

Extraction of simple terms
a. Extraction of candidate empty words
A candidate empty word is a word likely to be an empty word. In this step, we assign each word to a category: empty or full. Empty words (stop words) are words that are common to all texts in a given language and have a purely functional role. In English, obvious empty words are "the", "of", "this", "in", "to", etc. In a monolingual context, where all the documents of the corpus are in the same language, empty words are mainly words characteristic of the language, such as prepositions and articles; in this context they are also called grammatical words. There is therefore no need to index them or to use them in the information retrieval process. In a text, an empty word is non-significant, unlike a full word.
In contrast to recent work, we keep pronouns and treat them as full words, since they can refer to simple or compound terms that may be semantically rich.
A term is considered semantically rich if it is either frequent or specific. For every non-empty, low-weighted term, we check whether it is a specific word. A specific word has a low weight but is semantically rich. It may even occur only once in the videoconference to be indexed. For example, the term "mini camera" appears only once in the videoconference "laparoscopic surgery in Strasbourg". It is nevertheless considered specific to laparoscopic surgery, because this type of operation cannot be performed without introducing the mini camera into the patient's digestive system. Despite its low weight, the term is therefore considered specific and is added to the list of simple terms.

b. Extraction of simple terms by removing empty words
To extract simple terms, we proceed by eliminating empty words. The words of the corpus fall into two subsets: empty words and full words, the latter being the simple terms. Simple terms are thus identified by removing the empty words from the set of words that make up the vocabulary of the corpus.
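A minimal sketch of this elimination step is given below. The list `EMPTY_WORDS` is a short illustrative sample, not the list used by the system, and the function name `simple_terms` is hypothetical. Pronouns are deliberately left out of the list so that they are kept as full words.

```python
import re

# Illustrative sample of empty words; pronouns such as "it" are NOT listed,
# so they are kept as full words.
EMPTY_WORDS = {"the", "a", "an", "of", "this", "in", "to", "and", "is"}


def simple_terms(text, empty_words=EMPTY_WORDS):
    """Return the full words of the text, in order, after removing empty words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in empty_words]


terms = simple_terms("The expert explains the surgery and it is recorded")
```

The resulting list still contains repeated words, which is what the weighting step needs in order to count occurrences.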

c. Weighting simple terms
In this step, we assign to each term a weight that represents its discriminating power and its representative power in the document where it appears. Indeed, a term represents a document adequately only if its degree of importance in that document is significant. The literature distinguishes two types of weighting: local and global.

The local weighting
The local weighting measures the representative power of a term in a document of the corpus [2]. It uses information local to the term in a given document. This weighting is calculated as follows:
$$w_{i,j} = \frac{n_{i,j}}{\sum_{k \in T} n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of the term $i$ in document $j$, $n_{k,j}$ is the number of occurrences of the term $k$ in document $j$, so that the denominator is the total number of occurrences of words in the document in question, and $T$ is the set of terms in the corpus.

The global weighting
This weighting assigns to a term a measure reflecting its importance in the whole corpus of documents. A term that appears in the majority of documents is less useful for distinguishing documents from each other. In our approach, we use the local weighting because indexing is done document by document. In the videoconference "videosurgery in Strasbourg", the term "act" appears five times and the total number of simple terms is 189, so the weight of the term "act" is 5/189 ≈ 0.026.
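The local weighting and the worked example above can be checked with a few lines of code. The function name `local_weights` is hypothetical, and the token list is synthetic: it only reproduces the counts quoted in the text (5 occurrences of "act" among 189 simple terms), not the actual transcript.

```python
from collections import Counter


def local_weights(terms):
    """Weight of each term = its occurrences / total occurrences in the document."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Synthetic document: "act" occurs 5 times out of 189 simple terms.
document_terms = ["act"] * 5 + ["other_term"] * 184
weights = local_weights(document_terms)
act_weight = round(weights["act"], 3)  # 5 / 189 rounded to three decimals
```

Because the weights are relative frequencies within one document, they sum to 1 and can be computed document by document, which matches the choice of local weighting made above.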

d. Extraction algorithm of simple terms
The algorithm of our approach of extracting simple terms from the documents is as follows:

Extraction of compound terms
a. Extraction of compound terms based on mutual information
To designate a new concept in a field, the principle is to avoid creating an entirely new word, which would lead to a rapid explosion of the lexicon [5]. Instead, a compound term is created from existing lexical data. Compound terms are combinations of two or more words [7]. A new concept thus brings no new words, only new combinations of words to describe it. These combinations are sequences of words that are treated as new terms.
In our method, we adapt the approach of F. Harrathi [5]. To extract compound terms, we use an iterative and incremental process that discovers new terms from existing ones. The process extracts new terms from an initial list of known terms using a statistical measure, the Adapted Mutual Information (AMI). We start from the list of simple terms.
We then calculate the AMI of each pair of words. The frequency of an empty word is not taken into account in this calculation. For example, in the term « hospital of Strasbourg », the frequency of the empty word « of » is not used and is replaced by the frequency of the simple term « hospital ». During the extraction of compound terms, the term « hospital of » is marked as a « construction term » and is deleted at the next iteration. Pairs of terms whose AMI is greater than a threshold value are accepted as compound terms. The process stops when an iteration extracts no new term. For a pair of words ( , ), the adapted mutual information is calculated as follows:
The process of extracting compound terms consists of three steps:
1. Initialization: the list of compound terms is initialized with the contents of the list of simple terms;
2. Discovery of new terms: we calculate the mutual information between an item from the list of compound terms and a word from the corpus;
3. Addition of new terms: if the mutual information is greater than a given threshold, the pair of words is added to the list of compound terms.
If the number of occurrences of the word "surgical" is equal to that of the word "act", these two words are deleted from Lts. Otherwise, their occurrences as simple terms are reduced by the number of occurrences of the compound term "surgical act".
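The iterative process can be sketched as below. Since the AMI formula is not reproduced here, the sketch uses standard pointwise mutual information over adjacent units as a stand-in; this is an assumption, not the measure of [5]. The function name and the parameters `min_count` and `max_len` are hypothetical, and the special handling of empty words and construction terms is not modelled.

```python
import math
from collections import Counter


def extract_compound_terms(tokens, threshold, min_count=2, max_len=4):
    """Iteratively merge adjacent units whose mutual information reaches the threshold."""
    units = [(token,) for token in tokens]  # every unit is a tuple of words
    compounds = set()
    while True:
        total = len(units)
        unit_counts = Counter(units)
        pair_counts = Counter(zip(units, units[1:]))
        accepted = set()
        for (left, right), n in pair_counts.items():
            if n < min_count or len(left) + len(right) > max_len:
                continue
            # Stand-in measure: PMI = log2( p(left, right) / (p(left) * p(right)) ).
            pmi = math.log2(n * total / (unit_counts[left] * unit_counts[right]))
            if pmi >= threshold:
                accepted.add((left, right))
        if not accepted:  # no new term at this iteration: stop
            return compounds
        merged, i = [], 0
        while i < len(units):
            if i + 1 < len(units) and (units[i], units[i + 1]) in accepted:
                merged.append(units[i] + units[i + 1])
                compounds.add(" ".join(units[i] + units[i + 1]))
                i += 2
            else:
                merged.append(units[i])
                i += 1
        units = merged


tokens = "surgical act robot surgical act camera patient surgical act".split()
found = extract_compound_terms(tokens, threshold=1.0)
```

Merging a pair removes its words from the sequence of simple units, so the occurrences of "surgical" and "act" as simple terms decrease by the number of occurrences of the compound, as described above.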

b. Algorithm of extraction of compound terms
The algorithm of our approach of extracting compound terms from the documents is as follows:

c. Weighting compound terms
At this stage, we calculate a weight that reflects the importance of the term in the document. This weight depends on three factors: the frequency of the compound term in the document, the weights of the simple terms that compose it, and the length of the compound term.
The weighting measure for compound terms that we propose in this paper is called Compound Term Frequency (CTF). It is calculated by function 3. To test this function, we apply it to a simple term Ti and find that, in this case, CTij is equal to FTi.

INDEXING MODULE
Indexing consists of analyzing each document, when the document collection is organized, to produce a set of keywords, also called "descriptors", which the system can easily manage and use in the subsequent search process.
This module extracts the concepts of the base documents (see Figure 3). In an IRS, a document is considered a medium that conveys information. The result of the indexing module is a list of descriptors. These descriptors, namely concepts and semantic relations, are extracted by exploring the domain ontology in order to improve the semantic description of the document. They are then represented using the conceptual graph formalism. This representation is called a document index.
The primary criterion in the extraction of descriptors must always be the potential value of a concept as an element for expressing the document content and for retrieving it.

Extracting Concepts
During this step, we extract concepts from medical videoconferences. These concepts are denoted in the XML documents by simple or compound terms, which were extracted in the previous steps. To establish the correspondence between the terms and the concepts associated with them, we use one or more medical ontologies, where an ontology consists of a set of concepts C and a set of relations R between these concepts. In an ontology, each concept is identified by a unique identifier and is associated with one or more terms, called "labels". Labels are divided into "preferred" and "alternative" labels; alternative labels are considered synonyms of the preferred label. For example, the concept "C0001365" in the UMLS has one preferred label and two alternative labels (see Figure 4).

Figure 4: Example of a concept described by DOE
The concept of the figure above is described as follows: In the ontology, a set of terms is used to label the concepts and the relations between concepts. This set forms the ontology vocabulary and is denoted:
$$V_O = V_{OC} \cup V_{OR}$$
where $V_{OC}$ is the set of all terms used to denote the ontology concepts and $V_{OR}$ is the set of all terms used to denote the ontology relations.
Using the set $V_{OC}$, we define the term reference operator. This operator determines the concept(s) denoted by a term.
A concept reference operator is also defined. These two operators are determined by the following equations: Lc ("C0001365") = {"Cerebrovascular disease", "Acute, but ill-defined cerebrovascular disease", "Cva"}
The method we propose for extracting concepts from the XML document of a medical videoconference assigns to each term of the document the concepts to which it is related. To identify the concepts related to each term, we use the relation Sc defined above.
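The two operators can be illustrated with a small lookup table. This sketch assumes that Lc maps a concept identifier to its set of labels (as in the example above) and that Sc maps a term to the concepts having that term among their labels. The dictionary `LABELS` holds only the labels quoted in the text, and case-insensitive matching is an added assumption.

```python
# Labels quoted in the text for the UMLS concept C0001365.
LABELS = {
    "C0001365": {
        "Cerebrovascular disease",
        "Acute, but ill-defined cerebrovascular disease",
        "Cva",
    },
}


def Lc(concept_id):
    """Concept reference operator: labels attached to a concept."""
    return LABELS.get(concept_id, set())


def Sc(term):
    """Term reference operator: concepts denoted by a term (case-insensitive)."""
    wanted = term.lower()
    return {
        concept_id
        for concept_id, labels in LABELS.items()
        if wanted in {label.lower() for label in labels}
    }


concepts_for_cva = Sc("cva")
```

A term that is the label of several concepts would make `Sc` return more than one identifier, which is the ambiguous case.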
For concept-based semantic indexing, the descriptors are obtained from the terms found by the modeling module. These terms are projected onto an external semantic resource to identify the concepts associated with them and the relations between those concepts. In our work, we chose three ontologies (ONTOMÉNÉLAS, UMLS, SNOMED-CT), which we judged the most useful semantic resources for indexing audiovisual documents in the medical field.
In our system, the user chooses the ontology to be used: ONTOMÉNÉLAS for cardiac surgery, SNOMED-CT for clinical terminology and UMLS for general medicine.
If the chosen ontology does not satisfy the user, he or she can choose another ontology for extracting the associated concepts.

Ambiguity of terms
The problem of ambiguity arises when terms are associated with concepts. There are two types of ambiguity: linguistic ambiguity and semantic ambiguity.

a. Linguistic ambiguity
This problem is encountered with multilingual documents. It occurs when two words belonging to different languages have the same form in a text. This kind of ambiguity is not treated in our work because we use monolingual documents.

b. Semantic ambiguity
This case arises when several concepts are denoted by the same term (a term can be the label of several concepts in the ontology). To solve this problem, for each concept C denoted by the ambiguous term t, we look in the ontology for another concept C' that is related to C. If we find such a concept C', we consider C to be the concept denoted by the term t. If not, we use another ontology. If no related concept is found in any of the three ontologies, we keep all the concepts denoted by the term in question.
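The disambiguation rule can be sketched as follows. The sketch assumes that the related concept C' must be one of the other concepts found in the same document; the text does not state this explicitly. The function name, the data layout (one dictionary of related concepts per ontology) and the identifiers `C_A`, `C_B`, `C_X`, `C_Y`, `C_Z` are hypothetical.

```python
def disambiguate(candidates, context_concepts, ontologies):
    """Select the concepts denoted by an ambiguous term.

    candidates: concept ids that the ambiguous term may denote.
    context_concepts: other concept ids found in the document (assumption).
    ontologies: list of dicts {concept id: set of related concept ids},
                tried in order.
    """
    for related_to in ontologies:
        supported = {
            c for c in candidates
            if related_to.get(c, set()) & (set(context_concepts) - {c})
        }
        if supported:
            return supported
    # No ontology relates any candidate to the context: keep all candidates.
    return set(candidates)


first_ontology = {"C_A": {"C_X"}}
second_ontology = {"C_B": {"C_Y"}}
chosen = disambiguate({"C_A", "C_B"}, {"C_Y"}, [first_ontology, second_ontology])
fallback = disambiguate({"C_A", "C_B"}, {"C_Z"}, [first_ontology, second_ontology])
```

In the first call the first ontology gives no support, so the second one is consulted; in the second call no ontology helps and every candidate is kept.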

Weighting concepts
During this stage, we weight the list of extracted concepts. A concept is represented in a videoconference by one or several terms, whose frequencies were already calculated during the extraction of simple and compound terms. The Concept Frequency (CF) is equal to the average of the frequencies of the terms that represent the concept in the document. Weighting the concepts makes it possible to sort them according to their importance in the videoconference document: the most important concept is the one with the highest frequency. We then order the concepts by frequency, from the most to the least frequent. The algorithm of the concept extraction method is as follows:

Algorithm of concept extraction
Inputs
Lts: list of simple terms
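The weighting of concepts described in this section (average of term frequencies, then sorting) can be sketched as below. It is not the algorithm listed above, only an illustration of the CF computation. The function name, the term frequencies and the identifier `C_ACT` are hypothetical; `C0001365` is the UMLS concept used as an example earlier.

```python
def concept_frequencies(term_frequencies, term_to_concepts):
    """Return (concept, CF) pairs sorted from the most to the least frequent.

    CF of a concept = average frequency of the terms that denote it.
    """
    collected = {}
    for term, frequency in term_frequencies.items():
        for concept in term_to_concepts.get(term, ()):
            collected.setdefault(concept, []).append(frequency)
    averages = {c: sum(values) / len(values) for c, values in collected.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical frequencies of terms found in one videoconference.
term_frequencies = {"cva": 0.02, "cerebrovascular disease": 0.04, "act": 0.026}
term_to_concepts = {
    "cva": {"C0001365"},
    "cerebrovascular disease": {"C0001365"},
    "act": {"C_ACT"},
}
ranked = concept_frequencies(term_frequencies, term_to_concepts)
```

Two terms denote the same concept in this example, so their frequencies are averaged before the concepts are ranked.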

Extraction of semantic relations between concepts
Taking semantic relations into account is important because they can considerably improve the efficiency of videoconference search. Indexing with concepts and the relations between them is much more efficient than indexing with concepts alone [Harrathi, 09].
To extract the semantic relations between the concepts extracted in the previous section, we rely on the semantic resources used, in our case the ontologies, in which relations are defined by relation types. We adopt the hypothesis stated in [6]: "a relation exists between two concepts of a document if these two concepts appear in the same sentence and if the semantic resource defines a semantic relation between them".
Taking extract 3 of the XML document of a medical videoconference and using the UMLS thesaurus, we detect the concepts C0334046 and C1302773, denoted by the terms "mild dysplasia" and "squamous low grade intraepithelial lesion".
Applying the hypothesis of Maisonnasse, we find that the concepts C0334046 and C1302773 belong to the same sentence and that they are connected by the relation "is_finding_of_disease". According to UMLS, this relation is a semantic relation; it is the relation R54390434 of UMLS.
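The co-occurrence hypothesis can be sketched as follows: a relation is kept only if both concepts occur in the same sentence and the semantic resource defines a relation between them. The function name and the data layout are hypothetical, and the direction chosen for "is_finding_of_disease" between the two concepts is an assumption made for the example.

```python
from itertools import combinations


def extract_relations(sentences, ontology_relations):
    """Return (concept, relation, concept) triples supported by the ontology.

    sentences: list of sets of concept ids, one set per sentence.
    ontology_relations: dict {(source id, target id): relation name}.
    """
    found = set()
    for concepts in sentences:
        for first, second in combinations(sorted(concepts), 2):
            for pair in ((first, second), (second, first)):
                if pair in ontology_relations:
                    found.add((pair[0], ontology_relations[pair], pair[1]))
    return found


# Direction of the relation is assumed for illustration only.
ontology_relations = {("C0334046", "C1302773"): "is_finding_of_disease"}
sentences = [{"C0334046", "C1302773"}, {"C0334046"}]
relations = extract_relations(sentences, ontology_relations)
```

Concepts that appear in different sentences are never paired, even when the ontology relates them.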

Updating the index base
When a new document arrives, the index is updated according to algorithm 5. The update links the added document to the concepts of the ontology and saves the frequency of occurrence of each concept. This frequency is used to sort the documents found by the search module.

Algorithm 5: Taking into account the addition of a new document.
For a deleted document, its status is changed to « deleted » before the information relating to the document is updated in the database. The document is therefore no longer considered by queries. The corresponding concepts to

Algorithm 6: Editing a document: adding information
Deleting information from a document may generate the following events:
- removal of the existing relations between the concepts present in the deleted information;
- deletion of concepts.
The index is updated by algorithm 8. In every case (adding, deleting or updating a document of the corpus), the index update is carried out only for the document concerned. This reduces indexing time while maintaining the coherence between the corpus and the index, so that the most relevant documents can be found.
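The incremental behaviour described in this section can be sketched with a small in-memory index. This is not a reproduction of algorithms 5 to 8; the class `ConceptIndex`, its methods and the document identifiers `vc1` and `vc2` are hypothetical. Each operation touches only the entries of the document concerned.

```python
class ConceptIndex:
    """Toy index: concept id -> {document id: concept frequency}."""

    def __init__(self):
        self.postings = {}
        self.deleted = set()

    def add_document(self, doc_id, concept_frequencies):
        self.deleted.discard(doc_id)
        for concept, frequency in concept_frequencies.items():
            self.postings.setdefault(concept, {})[doc_id] = frequency

    def delete_document(self, doc_id):
        # The document is only marked; queries then ignore it.
        self.deleted.add(doc_id)

    def update_document(self, doc_id, concept_frequencies):
        # Drop the old entries of this document only, then re-link it.
        for documents in self.postings.values():
            documents.pop(doc_id, None)
        self.add_document(doc_id, concept_frequencies)

    def search(self, concept):
        documents = self.postings.get(concept, {})
        live = [(d, f) for d, f in documents.items() if d not in self.deleted]
        return [d for d, _ in sorted(live, key=lambda item: item[1], reverse=True)]


index = ConceptIndex()
index.add_document("vc1", {"C0001365": 0.03})
index.add_document("vc2", {"C0001365": 0.05})
ranked_before = index.search("C0001365")
index.delete_document("vc2")
ranked_after = index.search("C0001365")
```

The stored frequency is what orders the documents at search time, and a document marked as deleted no longer appears in the results.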

EXAMPLE
Our indexing model has been tested on a corpus containing several medical videoconferences. Among these, we mention "videosurgery in Strasbourg".
After analyzing the videoconference, we obtain the following results:
- Audio: we use the FoxTab Video to MP3 software to convert the videoconference "videosurgery in Strasbourg" into an audio file and the Dragon Naturally Speaking software to transcribe the audio content into text. We obtain the following: « The six heart surgery that occured last week by means of a robot which highlighted the fantastic technical progress made in recent years ...»
- Text: we use the approach of Sophie Schupp to analyze the texts contained in the videoconference, and we detect the following:

- Image: we use the Advanced XVideo Converter software to extract images, we select key frames and we describe them manually.
To extract the simple terms and the compound terms, we apply the extraction algorithms presented in the section « indexing medical content of the videoconferences » of this article. Frequent simple and compound terms and the specific terms of the videoconference are implicitly extracted, and their union forms the list of terms. This list is used by the module for extracting semantic descriptors. Concepts and semantic relations are extracted by the corresponding algorithms presented in the same section. The algorithms for extracting simple and compound terms give a correct result in 90% of cases, that is to say, almost all simple and compound terms are correctly extracted, whatever the number of words in each compound term. In the processed corpus, the maximum number of words in a compound term is four; examples are « Institute of Tele surgery of Strasbourg » and « European Institute of Tele surgery ».
In the following, we give some test results and their percentages for a videoconference selected from our corpus.
All simple terms are extracted, and their occurrences are calculated correctly when the pronouns referring to these terms are taken into account.
Applying the algorithm for extracting compound terms, we obtain 40 terms, three of which are not actually compound terms. The percentage of successful extraction of compound terms is therefore (40-3) / 40 = 92.5%. If the pronouns referring to them are not taken into account, the calculation of the frequencies of these terms is 100% correct. The following figure shows the XML document of the compound terms found. The search module has not yet been tested on our corpus.