An Analysis of Comment-Revision Thresholds in Bilingual Electronic Meetings

Prior studies have shown that providing participants in bilingual or multilingual electronic meetings with the capability of revising comments can increase the accuracy of translations to other languages. This is often done via a round-trip translation (RTT), in which the source text is translated to another language and then translated back into the original language for comparison.


INTRODUCTION
The United States Bureau of Labor Statistics projects that demand for human interpreters in the United States will grow by 29% between 2014 and 2024 due to increased globalization and immigration. As a result, more people are turning to free online translation services for communication because human interpreters are expensive or unavailable for immediate service. However, automatic machine translation continues to suffer from poor accuracy in many cases and must be used with caution. For example, one study [16] reports on the use of Google Translate to provide medical information to patients because no human interpreter was available for the language required; a later analysis revealed that the communication was only about 57% accurate. This could be critical when life-or-death decisions are being made.
On the other hand, translation accuracy in an informal bilingual or multilingual electronic meeting might not be as important. Further, as more languages are added to a meeting, it becomes more difficult to obtain human interpreters conversant in the foreign tongues. Finally, as all group members are typing simultaneously, it becomes nearly impossible for a human, or group of humans, to provide concurrent translation [7]. Thus, automated translation becomes necessary.
Although accuracy can be poor with Web-based translation services, improvements can sometimes be made by the user. For example, if something is not understood in a conversation, people often try to express their thought using different words. In an electronic meeting, typing errors, acronyms, slang, poor grammar, and idioms can adversely affect the accuracy [4,14,15]. If the originator of a comment could have some indication of the translation accuracy before it is submitted to the group, he or she could make revisions. Round-trip translation (RTT) is one technique that could provide this evaluation capability.
Using RTT, text is translated to another language in a forward translation (FT), and then the results of this FT are translated back into the original language in a backward translation (BT). Combined, they form an RTT. If there are differences between the backward translation and the original comment, the content of the message might be misunderstood, as several studies have found significant, positive correlations between FT and BT accuracies (e.g., [2,5,6,10,19,20]).
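As an illustration, the similarity between an original comment and its back translation can be scored with a simple word-overlap measure. The paper does not specify the metric its system used, so the sketch below is a hypothetical stand-in: it reports the percentage of the original comment's words that also appear in the back translation.

```python
from collections import Counter

def rtt_similarity(original: str, back_translation: str) -> float:
    """Score an RTT by bag-of-words overlap, as a percentage (0-100).

    NOTE: this is a hypothetical stand-in metric; the actual similarity
    measure used by the meeting system described here is not specified.
    """
    orig = Counter(original.lower().split())
    back = Counter(back_translation.lower().split())
    overlap = sum((orig & back).values())  # words shared between the two texts
    total = sum(orig.values())             # words in the original comment
    return 100.0 * overlap / total if total else 100.0

# Example: only "bigger" vs. "more" differs, so 3 of 4 words match.
print(rtt_similarity("We need bigger spots", "We need more spots"))  # 75.0
```

A production system would likely use a more robust measure (e.g., edit distance or a learned similarity), but any metric yielding a 0-100 score can feed the threshold comparison described below.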
If group members could be alerted that their comments might not be translated accurately (e.g., when the similarity between the original text and the back translation falls below a certain threshold), they might revise the original text to make it more understandable. However, it is not clear at what limit revision benefits the conversation. This paper describes an experiment with 14 groups in bilingual meetings with automatic translation, using thresholds ranging from 0 to 100.

LITERATURE REVIEW
Early multilingual electronic meeting systems translated among different languages, but it was not easy to determine if all group members understood the text that was exchanged [3]. For example, one study [23] found that 1 out of 20 translated comments in an electronic meeting included misconceptions, and the comment originators did not know they were misunderstood. In an attempt to correct these translation errors, Amikai's AmiChat introduced the capability for group members to indicate whether or not a comment was understood [12]. By clicking an onscreen button, users could send a message to the originator of the text that a particular passage was not clear. However, some users might not want to make the effort or might be too shy to admit miscomprehension.
Instead of relying on the reader to alert the originator of possible muddled text, another electronic meeting system called AnnoChat was developed using RTT for error detection [24]. The back translation was presented to the originator for perusal, and the author could determine if changes might need to be made before final submission. Using this system, Japanese users were able to improve translations to English with RTT, but they were not able to improve translations from Japanese to Chinese and Korean [13]. When more than two languages are used in a meeting, changing a comment could increase the accuracy of a translation into one language, but it might decrease the translation accuracy to others.
In another study [9], 22 students used automatic translation in a bilingual (English-German) meeting that provided an RTT on every comment, while another group of 18 students used the system without automatic comment evaluation. With the first group, an arbitrary threshold of 80% was selected for notification. That is, if the back translation was less than 80% similar to the original, the originator of the comment was alerted and given a chance to revise the text before final submission.
In this way, group members would not need to review a back translation for every comment written, thus saving them time and perhaps increasing satisfaction. Results showed that evaluation and a chance for comment revision improved comprehension of the translated text from English to German from 83.12% to 87.69%, but the improvement was not statistically significant. However, a further analysis revealed that if the threshold had been changed to 50%, there would have been significant improvement. That is, modifying only the most garbled translations yielded meaningful differences in comprehension.

EXPERIMENTAL STUDY

Purpose
The purpose of this study was to further explore RTT thresholds to determine at what point revisions improve translation accuracy.

Subjects and Task Description
A total of 126 business students from a large public university in the southern United States participated in 14 meetings, with group sizes ranging from 4 to 11. Each group discussed (in English) the campus parking problem for about 10 minutes, a topic that has been used frequently in electronic meeting research (e.g., [21]) and a time determined to be sufficient for a full discussion [22]. The students had knowledge of the topic and were motivated to find a solution, but had no ultimate control over the outcome.
Seven similarity thresholds were used (0, 20, 40, 50, 60, 80, and 100), with 0 indicating that no back translation was shown and no revision was possible, because no similarity percentage falls below 0%. On the other hand, groups using a threshold of 100 saw every comment's back translation, as no similarity score exceeds 100. Similarly, groups using the limit of 50 saw back translations only if the similarity score was 50% or below.
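Under this scheme, the decision to show a back translation reduces to a simple comparison. The sketch below is a minimal illustration of the alert rule as described above, assuming the alert fires when the similarity score is at or below the threshold, with a threshold of 0 meaning the back translation is never shown:

```python
def should_show_back_translation(similarity: float, threshold: int) -> bool:
    """Return True if the originator should see the back translation.

    Assumes the rule described in the text: a threshold of 0 never
    alerts (no score falls below 0%), a threshold of 100 always alerts,
    and otherwise the alert fires when similarity <= threshold.
    """
    if threshold == 0:
        return False  # no revision possible in the 0% groups
    return similarity <= threshold

# Thresholds used in the experiment: 0, 20, 40, 50, 60, 80, 100
print(should_show_back_translation(75.0, 0))     # False: never shown
print(should_show_back_translation(50.0, 50))    # True: 50% or below triggers
print(should_show_back_translation(51.0, 50))    # False: above the limit
print(should_show_back_translation(100.0, 100))  # True: always shown
```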
All of the students spoke English fluently and few knew any other language. During the meetings, the group facilitator added pre-written German comments that were automatically translated to English, thus, simulating a bilingual session. A German speaker evaluated the students' comment translations to German after the experiment. Group members answered a short survey after each meeting assessing how useful and easy to use the system was and whether or not they could detect comments translated from German.

Meeting Software
In some earlier studies of bilingual electronic meetings, back translations were shown for all comments, possibly annoying group members and slowing down the meeting. In a more recent study [9], one arbitrary threshold was selected to discriminate among the translations so that only the worst translations were captured, alleviating some of the annoyance. However, results showed that this threshold was probably too high. In addition, the software provided details on the RTT on a separate screen, requiring the meeting participant to switch between pages, further hindering the process.
We used a multilingual electronic meeting system linked with Google Translate, an online service that currently supports 103 languages [18]. Instead of showing the RTT on a separate screen, this system shows the back translation at the bottom of the current screen, but only if it falls below a certain threshold set by the meeting facilitator, as shown in Figure 1. The meeting participant simply types a comment in the textbox at the top of the screen using the selected language (English) and then clicks a button to submit it. The system translates the comment into a target foreign language (German) and then translates this back to English, showing it along with the similarity percentage if it falls below the predetermined limit. If the author decides that the comment is acceptable despite the low score, he or she can click "Submit as is" and the text is made available to the group (in English and German). Otherwise, the user clicks "Revise comment," and the focus is placed back in the textbox.

In Figure 1, the text was translated to German as "Jeder sollte seinen eigenen Parkplatz" or "Everyone should his own parking spot." In German, as in English, verbs are sometimes omitted during informal conversations. In this case, "haben" or "have" was omitted. As a result, the back translation also omitted the "have." The poor back translation is grammatically correct but confusing; however, the forward translation can still be understood. Thus, a BT might be perfect while the FT is misunderstood, and another BT might be inaccurate while the FT is still comprehensible.

Results

Table 1 shows a summary of the experimental results. Students believed the meeting system was useful and easy to use, and there was no significant difference among the threshold groups in terms of these two variables (F = 0.23, p = 0.96 and F = 0.75, p = 0.61, respectively).
However, there was a significant difference in the number of comments written per person in the same amount of time (F = 3.49, p < 0.01). Those in the 0% threshold groups wrote only 1.5 comments per person in comparison with the overall average of 3.95, even though they were not making revisions.
Students generally were not able to recognize those comments translated from German, indicating that perhaps the translations were grammatically correct and comprehensible. Self-assessed comprehension scores of the discussions ranged from 96.1% to 99.5%, significantly above the 72.45% threshold required by many American graduate schools for admission [8] (t = 55.7, p < 0.01). There was also no significant difference among the threshold groups in terms of German recognition (F = 0.86, p = 0.53) and comprehension (F = 0.87, p = 0.52).
There was a significant correlation between comprehension and perceptions of ease of use (R = 0.265, p = 0.01) and usefulness (R = 0.30, p < 0.01), and those who found the system easy to use also found it useful (R = 0.44, p < 0.01). None of the other variables showed significant correlations.

Table 2 shows the numbers of comments written by members in each threshold group, the similarity percentage between the original text and the back translation, the numbers of alerts (messages with BTs falling at or below the threshold), the numbers of revisions, and the percentage of the translations to German understood by the reviewer. The RTT similarity scores were fairly consistent among the groups. As expected, the number of alerts rose steadily with the threshold, but the number of revisions did not rise consistently. Those in the 80% threshold groups revised 12.6% of the alerted comments, while those in the 60% threshold groups revised 39.0% of the alerted comments. The reviewer understood 84.8% of the text, significantly above the 72.45% minimum required (t = 1.8, p = 0.04), but significantly below the 98.6% comprehension of the English-speaking group members (t = -2.2, p = 0.02).

Text analysis
Several examples from the transcripts illustrate problems that occurred when English was translated to German:

"When there are only few slot left, its only up to your luck whether you will get the slot or not."
"Wenn es gibt nur wenige Slot links, es ist nur bis zu Ihrem Glück, ob Sie den Schlitz erhalten wird oder nicht."
The word "left" as in "remaining" was translated to "links" as in direction. However, the grammatical error "its" did not adversely affect the translation.

"Knock down the football stadium and use that space to build additional parking. Roll Tide."
"Klopfen Sie das Fußballstadion nach unten und nutzen diesen Raum zusätzliche Parkplätze zu bauen. Roll Tide."
"Roll Tide" is a rallying cry of a football team in the United States. Although most students in the southern part of the USA probably understood what was meant, a German speaker from another country would probably not recognize it.

"I've got 99 problems but a parking spot aint one."
"Ich habe 99 Probleme bekam aber einen Parkplatz aint ein."
The American vernacular word "ain't" was misspelled, and Google Translate simply repeated the word. If the contraction had been spelled correctly, however, the translation would have been perfect.

"Freshman can't bring their cars on campus."
"Freshman können ihre Autos auf dem Campus zu bringen."
The word "Freshman" was translated correctly in some comments, but not here. Also, the sentence translation gives the direct opposite meaning of what was intended.

"invest in a pair of heeleys"
"investieren in ein Paar heeleys"
This was a good translation, but the German reviewer did not realize that "heeleys" are shoes.
The following examples of round-trip translations provide further details of what occurred in the meetings:

"We need bigger spots."
"Wir brauchen größere Flecken."
"We need more spots." (75%)
In this case, "more" does not provide the same meaning, and "Flecken" means "dirty spots." The student decided it was good enough, however, and submitted it as is.

"having a 10 am is practically having a 8am because of parking."
"a 10.00 mit praktisch, weil der Park ein 8.00 mit."
"a 10:00 with convenient because of a Park with 8:00." (37.5%)
The German FT is as incomprehensible as the BT, but the student decided to submit it as is.

"everyone get a bike"
"jeder bekommen ein Fahrrad"
"each get a bicycle" (75%)
The student revised the comment to "get a bike," receiving a 100% similarity score, less than a minute later. The German comprehension was also improved.

"Build another parking garage"
"Bauen Sie ein anderes Parkhaus"
"Build another park" (50%)
The student submitted it as is, although the back translation was clearly not similar to the original. However, the German translation was comprehensible.
Groups using a threshold RTT similarity score of 50% generated comments with the lowest German comprehension (72.1%), but average comprehension for the groups below that threshold was 89.1%, and for the groups above it, 84.7%. Thus, it is not clear where the threshold should be set based solely upon comprehension. However, groups at the 50% threshold and below revised only 1.7% of all comments, on average, while those above revised 10.9%. Merely moving from the 50% to the 60% threshold more than tripled the revision burden (from 3.7% to 12.3% of all comments). Therefore, a setting of 50% might be optimal based upon comprehension and participant effort, in agreement with what an earlier study [9] predicted.

CONCLUSION

Summary
In this study, group members exchanged comments in English with translations to German via a bilingual electronic meeting system. Seven RTT similarity thresholds were used in an attempt to find the optimal setting for maximum accuracy, but results were not clear. German comprehension was high both above and below the 50% threshold, but the burden of comment revision increased with the threshold. Thus, we conclude that 50% might be optimal.

Limitations
The first limitation is that the bilingual meeting was only simulated, with German comments entered by the facilitator and translations reviewed afterward. Knowing this, perhaps the participants were not sufficiently motivated to submit understandable comments, especially in the case of the 50% threshold groups. However, it is difficult to recruit sufficient foreign-language-speaking subjects for experiments.
A second limitation is that only English and German were used. Other language combinations, for example, Georgian to Swahili, are likely to produce far less translation accuracy [1].
Third, only one reviewer was used to evaluate the translations from English to German. Different reviewers might have comprehended more or less of the comments.

Future Research
Because of the inconsistent results with comprehension below and above the 50% threshold, further study on an optimal setting is necessary. In addition, an evaluation of the experimental subjects' time and effort during comment revision could yield further insight into providing the most productive multilingual meeting.
In addition, prior studies have focused on RTT with just two languages, but the problem is much more difficult when many languages are involved in a meeting [17]. For example, as noted in one study [13], increasing the accuracy of one translation might decrease the accuracy of another. Nevertheless, studies should investigate how translation accuracy can be improved in these multilingual meetings as well.
Finally, the text analysis showed that some users did not revise their comments even when the RTT was poor. This suggests possible poor understanding of the task at hand or a lack of user engagement. Additional work should be done to determine if improved training or increased user engagement improves FT accuracies.