Article
Peer-Review Record

The Role of Machine Translation Quality Estimation in the Post-Editing Workflow

by Hannah Béchara 1,*, Constantin Orăsan 2, Carla Parra Escartín 3, Marcos Zampieri 4 and William Lowe 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 15 July 2021 / Revised: 31 August 2021 / Accepted: 6 September 2021 / Published: 14 September 2021

Round 1

Reviewer 1 Report

The authors present a study on machine translation post-editing, specifically whether quality estimation can be useful in this workflow.

The paper is well presented and readable, and I find no obvious language issues. The motivation, background information, and methodology are sufficiently explained. 

I have only some minor concerns. 

At some points, words are used for length normalization, and at other points, tokens. Please clarify or be consistent.

Since FMS is used, a bit more explanation of how it is computed should be provided.

Tables 8 and 9 stand alone on a page even though there is room for text. The captions for Tables 5, 6, 8, and 9 are below the tables instead of above them, and Table 8 is referenced in the text before Table 7. Please improve the formatting of the tables.

The authors may add FMS to the list of Abbreviations.


Author Response

Thank you for your kind comments and feedback. We have applied the feedback directly in the manuscript and uploaded it with tracked changes.

Author Response File: Author Response.pdf

Reviewer 2 Report

Please, see comments in the attached file. 

Comments for author File: Comments.pdf

Author Response

Please see attachment.

Author Response File: Author Response.pdf

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The main asset of this paper is, in my opinion, its attempt to study the impact
of inaccurate QE on post-editors' workload. This is particularly interesting
since previous studies about the impact of using QE in an MT/PE workflow
do not consider uncertainty in producing the quality estimates. This paper
does not address the latter either, yet it reports on the impact of
inaccurate QE.

The paper is well written and easy to read and understand. I have noticed some
typos that need to be fixed, and I have a few remarks regarding the
figures. I have listed them all below.

The reading starts with a short introduction followed by a background section,
which efficiently positions the work with respect to previous studies. A
quick remark, however, regarding a statement made by the authors (l.102-106):
the authors claim that the results from Moorkens and O'Brien (i.e., 81% of
the translators expressed the need for confidence scores to be displayed in
the interface) validate their findings. They do report that the post-editors
liked getting a first impression via the traffic light system, yet the
feedback in Table 6 indicates a strong feeling against QE.
Could the authors explain/discuss this?

In Section 3, have the authors studied the impact of the three STS features,
or considered training a QE model on a smaller feature set (e.g. the 17
baseline feats + 3 STS feats)? The improvements shown in Table 1 look
marginal.

Also, I assume that the "collection" from which the similar sentences are
extracted is the AutoDesk data introduced in sub-Section 3.2?
The authors also report the performance of their Quest++ model using MAE
only. The performance of QE models is usually reported using the Pearson
correlation coefficient (Pearson r), as well as RMSE, along with MAE. In
addition, Specia et al. (2018)[1] recommend reporting all three metrics,
instead of one of them individually, to prevent bias in their interpretation.
I therefore invite the authors to update Table 1 accordingly.
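
For illustration, a minimal sketch of how all three metrics can be reported side by side (the arrays of predicted and observed quality scores are hypothetical placeholders; it assumes NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical predicted and observed (gold) quality scores for a QE model
predicted = np.array([72.5, 81.0, 64.3, 90.2, 55.7])
observed = np.array([70.0, 85.5, 60.1, 88.0, 62.3])

mae = np.mean(np.abs(predicted - observed))           # Mean Absolute Error
rmse = np.sqrt(np.mean((predicted - observed) ** 2))  # Root Mean Squared Error
r, p_value = pearsonr(predicted, observed)            # Pearson correlation coefficient

print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, Pearson r = {r:.3f} (p = {p_value:.3f})")
```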

In Section 4, my biggest concern relates to the selection of the inaccurate
QE predictions. The authors consider a QE prediction as "good" if it is within
10% of the observed FM score, as "bad"/inaccurate otherwise. How did they
determine this value? Also, what is the distribution of the inaccurate QE
predictions? This is quite important, yet not mentioned in the paper, as a
prediction 12% lower than the observed FM score is not as dramatic (in terms
of effort for the post-editor to fix) as a prediction that would be 40% away
from the expected score. This is also relevant for the discussion in
sub-Section 5.2, where the authors present the results of the effect of good
QE vs. bad QE, as figures 4 and 5 show that there is almost no difference
between the two.
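
As a rough sketch of the classification being questioned here (the scores are hypothetical, and treating "within 10%" as a band of 10 points on the 0-100 FM scale is an assumption):

```python
import numpy as np

# Hypothetical predicted and observed fuzzy-match (FM) scores on a 0-100 scale
predicted = np.array([78.0, 52.0, 90.0, 40.0, 68.0])
observed = np.array([75.0, 80.0, 88.0, 82.0, 70.0])

errors = np.abs(predicted - observed)
labels = np.where(errors <= 10, "good", "bad")  # "good" if within 10 points of the observed score

# Distribution of the errors for the inaccurate ("bad") predictions,
# i.e. how far off the bad predictions actually are
bad_errors = errors[labels == "bad"]
print("labels:", labels.tolist())
print("bad-prediction errors (min / median / max):",
      bad_errors.min(), np.median(bad_errors), bad_errors.max())
```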

Continuing with Section 5, I was surprised to reach the end of the paper
without having read any mention of PE guidelines. Post-editors are usually
provided with guidelines which indicate either a light or full post-editing.
Regarding the authors' statement that "the inclusion of MTQE cuts post-editing
effort by 0.4 keystrokes per word", I wondered whether this is also biased by
some guidelines. In a scenario where post-editors are provided with
guidelines for light PE, or no guideline at all, can one assume that this
would result in a minimal effort from the post-editor to fix the
translation (considering that a translation with light green background tells
the post-editor that the quality is predicted as "good enough for PE", which
corresponds to a FM score > 75%)?

The analysis in sub-Section 5.3 is somewhat confusing. It was not clear at the
beginning whether Table 5 presents another manually-selected subset of
sentences from the AutoDesk dataset, or if it corresponds to the post-edited
data. Further down, the statement "attempting to post-edit still cuts down the
time and effort, despite the low quality" is a bit vague: what does 'low
quality' mean? I guess this is also related to my comment in Section 4 (and
5.2) above.

In sub-Section 5.5, I kind of disagree with the authors when they write that
the time spent getting familiar with PET is responsible for the observed
additional time. PET looks like a pretty simple tool with which it should not
take long to get familiar. More importance should be given to each
post-editor's personal experience with the task of PE (and how defiant they
are, in reference to l.388), their experience using a CAT tool, as well as
the specific domain of the AutoDesk data used in this experiment.

Last but not least, I wish the authors had shared more about what they
think could be the reason why QE is still held in low regard by post-editors.
Although their analysis shows that post-editors benefit from MTQE, they still
express a negative opinion towards QE. Is it because QE is still not
informative enough? What do the authors think about that?

## Typos:
- l. 53: translators, => translators.
- l. 67: traffic lights system => traffic light system
- l. 88: labels the => labels to the
- l.140: The the fuzzy => The fuzzy
- l.156: is MT output => is a MT output
- l.182: an MTQE system => a MTQE system
- l.187: Autodesk dataset => Autodesk dataset,
- l.191: the setting of => the settings of
- l.194: CAT took => CAT tool
- l.194: a user interface => an user interface
- l.201: Section 4 => Section 4.2
- l.216: an MT suggestion => a MT suggestion
- l.246: translators this => translators, this
- l.251: 3 and 14 years => 3 to 14 years
- l.255: hd used => had used
- l.267: takes => took
- l.267: and numbers => and the numbers
- l.268: on the post-editing => on post-editing
- l.287: closer at => closer look at
- l.295: conditions in Figure 3 is => conditions is
- l.297: same carries => same observation carries
- l.300: setting => settings
- l.314: any => no
- l.319: Here Translator => Here, Translator
- l.326: '<'   =>   '<='
- l.329: '<'   =>   '<='
- l.345: still cut down => still cuts down
- l.352: effort => keystrokes
- l.363: Table 8 => Figure 8
- l.363: TER => HTER
- l.364: result => results
- l.368: an MT suggestion => a MT suggestion
- l.387: traffic lights system => traffic light system
- l.403: traffic lights system => traffic light system

## Remarks:
- l. 67: the authors mention a "third" traffic light, but considering that
  yellow, the default color, is listed in l.214, it should be "fourth";  

- Table 2 is redundant with the list in l.214-221; the descriptions should be
  merged, and the table should be removed;  

- l.204: the screenshot of the PET's interface, in figure 1, is a bit too
  big.  Also, since the experiments use a single source for the translations,
  it is not necessary to display the "MTs" menu.  Alternatively, I would rather
  see, if possible, the differences in terms of display introduced by the
  authors' modifications to the default version of PET (e.g. vertical cuts
  showing the "traffic lights"?);  

- l.185: the sentence "We choose a FMS of 75 or higher to be the threshold for
  post-editing" is misleading: a FMS <= 75 triggers PE, but this sentence
  says the opposite;

- l.225-228: the paragraph "In the first category [...] not to post-edit" is
  redundant with the beginning of sub-Section 4.2. Please remove.

- l.281: the sentence "Each translator is identified by a letter" should be
  moved to the end of l.253 (after "Table 4 summarises the details").

- l.288: in the sentence "The normalised [...]", the authors should remove
  "by sentence length" and add "per word" after "2.9 seconds to 2.4", for
  clarity;  

- in figures 2 to 7, the y-axis/caption should mention that the normalised
  time is per word;   

- Figures 2 & 3 should be joint, similarly to figures 6 & 7;  
- Figures 4 & 5 should be joint, similarly to figures 6 & 7;  

- l.306-309: the paragraph should be slightly rephrased to avoid "However
  [...]. Therefore [...]. However [...].";  

- l.311: the sentence "Each translator is identified by a letter." is not
  necessary;  

- Table 6 is not cited in the text;  

----
[1]: https://0-doi-org.brum.beds.ac.uk/10.2200/S00854ED1V01Y201805HLT039

Author Response

Dear Reviewer,

 

Thank you for your time going over our paper and the helpful feedback you provided. 

To address the reviewer's comments, we have made the following changes to the paper:



We addressed all the typographical and grammatical errors suggested by the reviewers. In addition, we carefully checked the paper again. The major changes made to the paper as a response to the reviewers’ feedback are detailed below. The comments from the reviewers are in italics, whilst our responses are in bold.

Furthermore, we have attached a document that shows all changes.

 

A quick remark, however, regarding a statement made by the authors (l.102-106):

the authors claim that the results from Moorkens and O'Brien (i.e., 81% of

the translators expressed the need for confidence scores to be displayed in

the interface) validate their findings. They do report that the post-editors

liked getting a first impression via the traffic light system, yet the

feedback in Table 6 indicates a strong feeling against QE.

Could the authors explain/discuss this?

 

As mentioned in our paper, our participants also had to learn how to interact with a tool different from the one they would usually use, and hence our results may not be generalisable. Besides, we only had 4 participants, while Moorkens and O’Brien had 403 survey participants, of whom 231 answered all sections. It could be that once the participants in Moorkens and O’Brien did an experiment like ours, their initial opinion would change. All in all, more data is needed to really be able to compare initial “wishes” against the feeling of translators after a hands-on task.

 

In Section 3, have the authors studied the impact of the three STS features,

or considered training a QE model on a smaller feature set (e.g. the 17

baseline feats + 3 STS feats)? The improvements shown in Table 1 look

marginal.

 

The improvements are indeed marginal and the comparison is only there to explain why we used the system we did. We have clarified this in the text in Section 3.

 

Also, I assume that the "collection" from which the similar sentences are

extracted is the AutoDesk data introduced in sub-Section 3.2?

 

This is correct and we have clarified this in the text in Section 3.2.




In Section 4, my biggest concern relates to the selection of the inaccurate

QE predictions. The authors consider a QE prediction as "good" if it is within

10% of the observed FM score, as "bad"/inaccurate otherwise. How did they

determine this value? 

 

We have included a justification in the paper. This is based on the work of Parra and Arcedillo (2015) and their study into FM scores and post-editing.

 

Also, what is the distribution of the inaccurate QE

predictions? 

 

We include this distribution in Table 6 in Section 3.5.

 

This is quite important, yet not mentioned in the paper, as a

prediction 12% lower than the observed FM score is not as dramatic (in terms

of effort for the post-editor to fix) as a prediction that would be 40% away

from the expected score. This is also relevant for the discussion in

sub-Section 5.2, where the authors present the results of the effect of good

QE vs. bad QE, as figures 4 and 5 show that there is almost no difference

between the two.

 

It is true that it would be interesting to study what happens when the predicted score is at different distances from the real score. However, in order to obtain meaningful results we would have needed a much bigger experiment. In addition, we revised Section 5.3 and we think it partially answers this observation.

 

Continuing with Section 5, I was surprised to reach the end of the paper

without having read any mention of PE guidelines. Post-editors are usually

provided with guidelines which indicate either a light or full post-editing.

Regarding the authors' statement that "the inclusion of MTQE cuts post-editing

effort by 0.4 keystrokes per word", I wondered whether this is also biased by

some guidelines. In a scenario where post-editors are provided with

guidelines for light PE, or no guideline at all, can one assume that this

would result in a minimal effort from the post-editor to fix the

translation (considering that a translation with light green background tells

the post-editor that the quality is predicted as "good enough for PE", which

corresponds to a FM score > 75%)?

 

As mentioned in Section 4.3, we did prepare guidelines, but they focused on how to use PET and the meaning of the traffic lights. We did not give any guidelines regarding the post-editing operations. 

 

The analysis in sub-Section 5.3 is somewhat confusing. It was not clear at the

beginning whether Table 5 presents another manually-selected subset of

sentences from the AutoDesk dataset, or if it corresponds to the post-edited

data. Further down, the statement "attempting to post-edit still cuts down the

time and effort, despite the low quality" is a bit vague: what does 'low

quality' mean? I guess this is also related to my comment in Section 4 (and

5.2) above.

 

We revised Section 5.3 and made it clearer. 

 

In sub-Section 5.5, I kind of disagree with the authors when they write that

the time spent getting familiar with PET is responsible for the observed

additional time. PET looks like a pretty simple tool with which it should not

take long to get familiar. More importance should be given to each

post-editor's personal experience with the task of PE (and how defiant they

are, in reference to l.388), their experience using a CAT tool, as well as

the specific domain of the AutoDesk data used in this experiment.

 

All our translators had experience with CAT tools and translation of technical documents. For this reason, we don’t think these posed a problem. On the basis of the comments we received from the translators, we believe that the main problem was caused by PET, despite its simple interface. 



Author Response File: Author Response.pdf

Reviewer 2 Report

General:

This paper looks at how providing translators with information on machine translation quality estimation might affect their productivity and technical effort. The paper concludes the following: “Our results show that MTQE, especially good and accurate MTQE, is vital to the efficiency of the translation workflow, and can cut translating time and effort significantly.”

Unfortunately, I am not convinced that the paper presents sufficiently sound evidence to back such a strong statement. On page 9, it says “Individually, however, this drop [in translating time] does not account equally over each translator. For two of the translators, the change is not significant.” Statistical significance is mentioned occasionally, but the actual size of the effects and the p-values are not reported. Later on, the paper states “Once again [i.e. for both time and keystrokes], on average there is little to no difference between ‘Good QE’ and ‘Bad QE’”. These results are described as ‘erratic’ and ‘strange’. The paper then analyses cases of correct and incorrect MTQE labels, though again a thorough statistical analysis is not provided, and the results are presented with rather unconvincing language such as ‘seems to suggest’ and ‘might indicate’.

The analysis of target-text quality could also have been stronger. Showing that all participants achieved similar quality standards compared to Autodesk translators is important, but what really matters here is showing that any improvements in productivity did not occur at the expense of lower target-text quality. The paper should have shown how the sentences post-edited ‘with QE’ differ in quality from the sentences post-edited ‘without QE’.

Another limitation which is not mentioned is the fact that translating difficulty is not accounted for. The paper controls for sentence length, but it apparently disregards other textual characteristics that could make certain sentences intrinsically harder to translate irrespective of the level of MT quality. Ideally the paper should show paired translating difficulty across the conditions it sets out to compare.

The fact that the study has professional translators as participants is a strength, but the small sample (only 4 translators) should have been acknowledged more directly, especially when the results are compared to previous research providing more complete statistical analyses based on larger samples. The study also wishes to claim that it tests MTQE in ‘real-world’ conditions, but it uses a lab tool (PET) that translators had some difficulty getting used to. Note that use of PET is well justified – it is the strong claim of high ecological validity that is the problem.

See other specific issues below.

L53 – HTER needs to be defined and ‘denote’ sounds awkward here.

L55 – ‘will be’ sounds awkward. The description of previous work should ideally be in the past; if the present is used that should be consistent, but the future should not be used at all.

L62 – increase in productivity for the MTQE condition, presumably? This should be clearer.

L66 – the fuzzy match score should also be (briefly) explained.

L78 – Bing not Bing’s

L88 – to the group of participants

L89 – the second group was

L93-94 – repetition of ‘compared’ – needs rewording

L95 – in an IT sense, ‘program’ even in British English

L112 – ‘technical effort’ is the usual term

L187 – worth reminding readers at this point that the dataset included information on how the sentences were translated, which is presumably what allowed accuracy to be calculated here.

L194 – CAT tool

Section 4.1

- The use of PET is well justified, but this is a lab tool, so the study should not claim to carry out a real-world test (as it does, for example, in L179).

- It is unclear why a screenshot of PET was taken from http://www.clg.wlv.ac.uk/projects/PET/ while no screenshot of the actual settings used in the study is provided. What matters here is what the translators in this specific study were exposed to, so the screenshot must be replaced.

L210 – ‘automatically generated’ is unclear. Is this MT?

L227 – ‘.However’ or ‘;however’

L239 – a reference should be provided in relation to the daily throughput estimate

L255 – had

Table 4 – it should be clear what C, M, V, and S stand for. I presume they refer to the translators, but this should be explicated.

L281 – Only here is it explained that the translators are identified by letters – this should come earlier.

L283 – it is better to use ‘translators’ (and not annotators) throughout

L286 – closer look?

L292 – better to say figures ‘show’ or ‘present’

L320 – why ‘strange’ results?

L333 – the the (typo)

L347 – the basis for making this statement should be clearer. V and S were faster when they were told to post-edit poor-quality MT. How is this linked to following or not following the traffic light system? Could V and S simply be faster at post-editing poor-quality MT than M and C? What is the basis for saying that M and C did not follow the traffic lights? Was this checked (PET would have recorded what happened)? Does that matter? If not, why not? A lot is left implicit here.

Section 5.4

At this point I am still unclear about what the key result presented by the study is. If it wishes to claim that the MTQE condition is preferable to the ‘just MT’ condition, then target-text quality should be compared between these two conditions to show that improvements in productivity did not come at the expense of lower quality. In addition, a full analysis with the statistical results that underpinned comments on significance should be provided.

L384 – this is another reason why the study cannot claim to test MTQE under ‘real-world’ conditions

L389-390 – Efficiency is not the only variable of interest here. Could the translator who disliked post-editing produce lower-quality translations in this condition?

L405 – what is mentioned here overstates the results quite substantially. The paper presents rather small effects throughout, sometimes with insignificant differences for some of the translators. In addition, all this is based on 4 translators only, whereas previous research cited in the paper is often based on larger samples.

L410 – ‘win over the hearts and minds of translators’ comes across as dismissive of translators’ views considering the small differences presented in the results and the lack of clarity around target-text quality.

Author Response

 

Dear Reviewer,

 

Thank you for your time going over our paper and the helpful feedback you provided. 

To address the reviewer's comments, we have made the following changes to the paper:


We addressed all the typographical and grammatical errors suggested by the reviewers. In addition, we carefully checked the paper again. The major changes made to the paper as a response to the reviewers’ feedback are detailed below. The comments from the reviewers are in italics, whilst our responses are in bold.

The attached document is a comparison report.

 

This paper looks at how providing translators with information on machine translation quality estimation might affect their productivity and technical effort. The paper concludes the following: “Our results show that MTQE, especially good and accurate MTQE, is vital to the efficiency of the translation workflow, and can cut translating time and effort significantly.” 

Unfortunately, I am not convinced that the paper presents sufficiently sound evidence to back such a strong statement. 

 

We have addressed this by rewording it to make it less strong. However, we still maintain that MTQE can have a positive effect on the post-editing process.

 

On page 9, it says “Individually, however, this drop [in translating time] does not account equally over each translator. For two of the translators, the change is not significant.” Statistical significance is mentioned occasionally, but the actual size of the effects and the p-values are not reported. Later on, the paper states “Once again [i.e. for both time and keystrokes], on average there is little to no difference between ‘Good QE’ and ‘Bad QE’”. These results are described as ‘erratic’ and ‘strange’. The paper then analyses cases of correct and incorrect MTQE labels, though again a thorough statistical analysis is not provided, and the results are presented with rather unconvincing language such as ‘seems to suggest’ and ‘might indicate’.

 

The criticism regarding statistical significance is valid and has been addressed throughout the paper through F-tests performed on all our results. The explanation begins on line 330. These results have been added in Tables 5 and 7.
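
For illustration only (not the authors' actual code), a minimal sketch of this kind of test using SciPy's one-way ANOVA, which yields the F statistic and p-value; the per-word post-editing times below are hypothetical:

```python
from scipy.stats import f_oneway

# Hypothetical normalised post-editing times (seconds per word) under two conditions
no_qe_times = [2.9, 3.1, 2.7, 3.4, 2.8, 3.0]
with_qe_times = [2.4, 2.6, 2.3, 2.9, 2.5, 2.2]

f_stat, p_value = f_oneway(no_qe_times, with_qe_times)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# With a significance level of 0.05: reject the null hypothesis
# (no difference between conditions) when p < 0.05
if p_value < 0.05:
    print("The difference between the conditions is statistically significant.")
```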

 

The analysis of target-text quality could also have been stronger. Showing that all participants achieved similar quality standards compared to Autodesk translators is important, but what really matters here is showing that any improvements in productivity did not occur at the expense of lower target-text quality. The paper should have shown how the sentences post-edited ‘with QE’ differ in quality from the sentences post-edited ‘without QE’.

Another limitation which is not mentioned is the fact that translating difficulty is not accounted for. The paper controls for sentence length, but it apparently disregards other textual characteristics that could make certain sentences intrinsically harder to translate irrespective of the level of MT quality. Ideally the paper should show paired translating difficulty across the conditions it sets out to compare.

The fact that the study has professional translators as participants is a strength, but the small sample (only 4 translators) should have been acknowledged more directly, especially when the results are compared to previous research providing more complete statistical analyses based on larger samples. The study also wishes to claim that it tests MTQE in ‘real-world’ conditions, but it uses a lab tool (PET) that translators had some difficulty getting used to. Note that use of PET is well justified – it is the strong claim of high ecological validity that is the problem.

 

We would have liked to expand the study, but time and cost kept it at 4 participants. We justify the use of the term “real-world” by the fact that we use the Autodesk data, a real-world post-editing dataset. 

 

It is unclear why a screenshot of PET was taken from http://www.clg.wlv.ac.uk/projects/PET/ while no screenshot of the actual settings used in the study is provided. What matters here is what the translators in this specific study were exposed to, so the screenshot must be replaced.

 

We have added screenshots of the modified version of PET. We had initially excluded them for length.



 it is better to use ‘translators’ (and not annotators) throughout

 

We have taken this suggestion on board and now more consistently use “translators” or “post-editors”.




Section 5.4

At this point I am still unclear about what the key result presented by the study is. If it wishes to claim that the MTQE condition is preferable to the ‘just MT’ condition, then target-text quality should be compared between these two conditions to show that improvements in productivity did not come at the expense of lower quality. In addition, a full analysis with the statistical results that underpinned comments on significance should be provided.

 

We have elaborated and tried to be more specific in our conclusions.

 

L405 – what is mentioned here overstates the results quite substantially. The paper presents rather small effects throughout, sometimes with insignificant differences for some of the translators. In addition, all this is based on 4 translators only, whereas previous research cited in the paper is often based on larger samples.

 

We have tried to make the statement less strong and explain the limitations of our study.

 

L410 – ‘win over the hearts and minds of translators’ comes across as dismissive of translators’ views considering the small differences presented in the results and the lack of clarity around target-text quality.

 

This sentence has been removed and rephrased.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I appreciate the attempt to improve the manuscript, but the new version still has several issues that need to be addressed. 

My main comments are that the methods lack clarity and that the argument still overstates the results. 

I don't follow why the actual F statistic and the actual p values were not provided. It is also not clear why the results are sometimes provided per translator and sometimes overall. The explanation for - and + symbols comes towards the end of the paper. The paper also confuses hypotheses with significance levels.

Regarding target-text quality, the text should elaborate on the results shown in the figure. Note that this would help to strengthen the argument. The key finding from this section is that improvements in post-editing effort linked to MTQE are not, it seems, linked to lower levels of target quality. This should be highlighted.

There are also many issues I raised in my previous report that were not addressed. I do not come back to all these here as they are all listed in my previous review.

Some specific comments:

L24-25 – ‘proposing for post-editing sentences which are good enough’ (typo)

L51 – ‘ran’

L54 – ‘denote’ is not the right word to use here

L81 – Bing – errors already pointed out remain in the manuscript

L98 – open-source ‘program’ – again, errors remain (see e.g. https://grammarist.com/spelling/program-programme/)

P8 – An uncaptioned figure is introduced here. And I still don’t see how the out-of-the-box screen shot of PET adds anything to the paper since this is not the interface that was used in the study. If anything, it confuses the description of the methods.

L278 – punctuation error in the use of ‘however’, as previously mentioned

L291 – No reference is provided for the daily throughput estimation

L333 – there is a confusion here between what hypotheses and significance levels mean. Defining 0.05 as the level of significance has nothing to do with the study’s hypotheses. If hypothesis testing is carried out, the study should define the null hypothesis and the alternative hypothesis

Table 5 – what do – and + mean? Why isn’t the F statistic or the p-value provided? Why is this calculated per translator? What is the overall result?

L360 – ‘physical effort’ is not the usual terminology used in this type of study, again as previously pointed out

L386 – ‘erratic’ and ‘strange’ are not helpful terms to use here. If the results are surprising or unexpected, the basis for this interpretation needs to be explicated. Results are not ‘strange’ just because they do not follow expectations.

L490-491 This still overstates the results. It is not ‘especially’ good and accurate MTQE; based on what was presented, only good and accurate MTQE had a positive effect on post-editing effort relative to the ‘No QE’ condition. When looking at results per translator, QE had a positive effect on both temporal and technical effort for only 2 translators (50% of the sample). What was the overall result in the comparison between QE and 'No QE'? This strikes me as an obligatory result to report if the study wants to claim that MTQE is ‘vital’.

 

Author Response

Dear Reviewers,

 

Thank you for your time going over our paper and the helpful feedback you provided. 

To address the reviewer's comments, we have made the following changes to the paper. We have also attached a comparison report.

 

I don't follow why the actual F statistic and the actual p values were not provided. It is also not clear why the results are sometimes provided per translator and sometimes overall. The explanation for - and + symbols comes towards the end of the paper. The paper also confuses hypotheses with significance levels.

We have replaced the symbols with actual numbers and reworded our hypothesis for clarification (lines 331-336).

Regarding target-text quality, the text should elaborate on the results shown in the figure. Note that this would help to strengthen the argument. The key finding from this section is that improvements in post-editing effort linked to MTQE are not, it seems, linked to lower levels of target quality. This should be highlighted.

At the reviewer’s suggestion we have added this at line 458:
We also observe that the category with the highest FMS is the GoodQE category. This is consistent across all translators and on average. As this is the category with the highest improvement in post-editing time and effort, we can safely conclude that this change is not linked to lower levels of target quality. 

There are also many issues I raised in my previous report that were not addressed. I do not come back to all these here as they are all listed in my previous review.

We have gone over the previous comments and addressed any that we missed. Apart from the typographical errors, this includes:

- Elaborating on FMS and, specifically, which FMS algorithm was used.
- Consistent use of the past tense in the literature review.
- Explaining that the letters in the tables stand for the translators' initials.

Some specific comments:

L24-25 – ‘proposing for post-editing sentences which are good enough’ (typo)

L51 – ‘ran’

L54 – ‘denote’ is not the right word to use here

We addressed the three comments above through rewording.

L81 – Bing – errors already pointed out remain in the manuscript

We changed “Bing’s” to “Microsoft Bing”

L98 – open-source ‘program’ – again, errors remain (see e.g. https://grammarist.com/spelling/program-programme/)

We amended all uses of the word “programme”.

P8 – An uncaptioned figure is introduced here. And I still don’t see how the out-of-the-box screen shot of PET adds anything to the paper since this is not the interface that was used in the study. If anything, it confuses the description of the methods.

We removed this figure as the reviewer suggested it does not add anything to the paper.

L278 – punctuation error in the use of ‘however’, as previously mentioned

We did not change this before as we did not agree with the change. 

L291 – No reference is provided for the daily throughput estimation

We have added a link and a reference to justify using this estimation.

L333 – there is a confusion here between what hypotheses and significance levels mean. Defining 0.05 as the level of significance has nothing to do with the study’s hypotheses. If hypothesis testing is carried out, the study should define the null hypothesis and the alternative hypothesis

We have clarified this. We now define the null hypothesis and the alternative hypothesis, along with the significance level of p = 0.05.

Table 5 – what do – and + mean? Why isn’t the F statistic or the p-value provided? Why is this calculated per translator? What is the overall result?

We have edited this to provide the F statistic and added the overall result.

L360 – ‘physical effort’ is not the usual terminology used in this type of study, again as previously pointed out

This is the same terminology used by Teixeira and O’Brien (2017). However, we have replaced “physical effort” with “technical effort” as requested.

L386 – ‘erratic’ and ‘strange’ are not helpful terms to use here. If the results are surprising or unexpected, the basis for this interpretation needs to be explicated. Results are not ‘strange’ just because they do not follow expectations.

We have changed these terms and use “inconsistent” instead.

L490-491 This still overstates the results. It is not ‘especially’ good and accurate MTQE; based on what was presented, only good and accurate MTQE had a positive effect on post-editing effort relative to the ‘No QE’ condition. When looking at results per translator, QE had a positive effect on both temporal and technical effort for only 2 translators (50% of the sample). What was the overall result in the comparison between QE and 'No QE'? This strikes me as an obligatory result to report if the study wants to claim that MTQE is ‘vital’.

We reworded this section:

Our results show that MTQE is important to the efficiency of the translator workflow when the MT suggestion is of low quality (FMS <= 75). Furthermore, we also show that good and accurate MTQE can cut translating time and effort significantly, regardless of the quality of the MT suggestion.




Author Response File: Author Response.pdf
