United States Patent 9,501,470, titled “System and Method for Enriching Spoken Language Translation With Dialog Acts,” was issued to Bangalore and other inventors from AT&T on November 22, 2016. Disclosed herein are systems, computer-implemented methods, and tangible computer-readable media for enriching spoken language translation with dialog acts. The method includes receiving a source speech signal; tagging dialog acts associated with the received source speech signal using a classification model, dialog acts being domain independent descriptions of an intended action a speaker carries out by uttering the source speech signal; producing an enriched hypothesis of the source speech signal incorporating the dialog act tags; and outputting a natural language response of the enriched hypothesis in a target language. Tags can be grouped into sets such as statement, acknowledgement, abandoned, agreement, question, appreciation, and other. The step of producing an enriched translation of the source speech signal uses a dialog act specific translation model containing a phrase translation table.
The patent claims: A method comprising: tagging, via a processor and using a maximum entropy classification model, dialog acts associated with a user utterance in a source natural spoken language, to yield dialog act tags, the dialog act tags being domain independent descriptions of an intended action of a speaker; and outputting, via the processor, an enriched version of a hypothesis translated into a target natural spoken language, to yield a translated speech output signal with a word order determined by the dialog act tags, wherein the enriched version of the hypothesis has a word order distinct from the hypothesis.
The present invention relates to automatic speech recognition and more specifically to recognizing and translating speech. Automatic speech processing has advanced significantly but is still largely compartmentalized. For instance, automatic speech recognition typically transcribes speech orthographically and hence insufficiently captures context beyond words. Enriched transcription combines automatic speech recognition, speaker identification and natural language processing with the goal of producing richly annotated speech transcriptions that are useful both to human readers and to automated programs for indexing, retrieval and analysis. Some examples of enriched transcription include punctuation detection, topic segmentation, disfluency detection and clean-up, semantic annotation, pitch accent, boundary tone detection, speaker segmentation, speaker recognition, and annotation of speaker attributes. These meta-level tags are an intermediate representation of the context of the utterance along with the content provided by the orthographical transcription. Accordingly, what is needed in the art is an improved way to enrich automatic speech translation with information beyond the text to be translated. Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
The features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth herein. Disclosed are systems, computer-implemented methods, and tangible computer-readable media for enriching spoken language translation with dialog acts. The method includes receiving a source speech signal; tagging dialog acts associated with the received source speech signal using a classification model (such as a maximum entropy model), dialog acts being domain independent or domain dependent descriptions of an intended action a speaker carries out by uttering the source speech signal; producing an enriched hypothesis of the source speech signal incorporating the dialog act tags; and outputting a natural language response of the enriched hypothesis in a target language. An example system for translating speech using dialog act tags accepts an incoming speech signal. If the tagger determines that the speech signal contains multiple sentences or multiple dialog acts, a speech segmenter splits the speech into discrete sentences or discrete dialog acts. The tagger then analyzes each sentence or dialog act and can classify it into sets of tags based on the categories described above, such as statement, acknowledgement, abandoned, agreement, question, appreciation, etc. The tagger outputs enriched, dialog-act-tagged speech and sends it to a translation module capable of interpreting and incorporating the dialog act tags. A phrase translation table can assist the translation module in translating the enriched speech. Further, given sufficient training data, dialog act specific translation models can generate more accurate hypotheses than models that do not use dialog acts.
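The segment-then-tag flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the speech has already been transcribed to punctuated text, and the feature names, the hand-set weights, and the helpers `extract_features`, `tag_probabilities`, `segment`, and `enrich` are all hypothetical. The only element taken from the description is the form of the classifier, a maximum entropy (log-linear) model that scores each tag from weighted features and normalizes the scores into probabilities.

```python
import math
import re

# Hypothetical hand-set weights; a real maximum entropy tagger would
# learn these from dialogs annotated with dialog act tags.
WEIGHTS = {
    "question": {"ends_with_qmark": 2.0, "starts_with_wh": 1.5},
    "acknowledgement": {"is_backchannel": 2.5},
    "appreciation": {"has_thanks": 2.5},
    "statement": {"bias": 0.5},
}

WH_WORDS = {"who", "what", "when", "where", "why", "how"}
BACKCHANNELS = {"okay", "ok", "yeah", "yes", "right"}


def extract_features(sentence):
    """Return the set of binary features that fire for one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    feats = {"bias"}
    if sentence.strip().endswith("?"):
        feats.add("ends_with_qmark")
    if words and words[0] in WH_WORDS:
        feats.add("starts_with_wh")
    if words and all(w in BACKCHANNELS for w in words):
        feats.add("is_backchannel")
    if "thanks" in words or "thank" in words:
        feats.add("has_thanks")
    return feats


def tag_probabilities(sentence):
    """Log-linear model: P(tag | sentence) is a softmax over summed weights."""
    feats = extract_features(sentence)
    scores = {
        tag: sum(w for f, w in fw.items() if f in feats)
        for tag, fw in WEIGHTS.items()
    }
    z = sum(math.exp(s) for s in scores.values())
    return {tag: math.exp(s) / z for tag, s in scores.items()}


def segment(turn):
    """Split a dialog turn into sentences (assumes punctuated text)."""
    return [s.strip() for s in re.findall(r"[^.?!]+[.?!]?", turn) if s.strip()]


def enrich(turn):
    """Tag each segment with its most probable dialog act."""
    enriched = []
    for sentence in segment(turn):
        probs = tag_probabilities(sentence)
        enriched.append((sentence, max(probs, key=probs.get)))
    return enriched


if __name__ == "__main__":
    for sentence, tag in enrich("Okay. Where is the station? Thanks a lot."):
        print(f"[{tag}] {sentence}")
```

Each tagged segment would then be handed to a translation module. A real system would work from recognizer output rather than punctuated text, so the punctuation-based segmenter here is only a stand-in.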
The translation module then converts the enriched speech to enriched translated speech in a language other than that of the original speech signal. For example, the original speech signal can be in French and the translation module can output the enriched translated speech in Hindi. In one example, not shown, a single dialog act tagger connects to multiple translation modules, each capable of translating into a different language. In another example, a single translation module contains multiple plug-in modules which translate the speech signal into multiple different languages. The system can output actual speech, or it can output a set of instructions for reproducing speech, such as a lossless or lossy digital audio file or a Speech Synthesis Markup Language (SSML) file. The step of producing an enriched translation of the source speech signal uses a translation model containing a dialog act specific phrase translation table. The method can further include appending to each phrase translation table belonging to a particular dialog act specific translation model those entries from a complete model that are not present in the phrase table of the dialog act specific translation model, and weighting the appended entries by a factor. When the source speech signal is a dialog turn having multiple sentences, the method can further include segmenting the source speech signal, tagging dialog acts in each segment using a maximum entropy model, and producing an enriched translation of each segment in a target language that incorporates the dialog act tags. The method can further include annotating tagged dialog acts.
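The phrase table augmentation step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's implementation: a phrase table is modeled as a dictionary mapping a source phrase to its target phrases and scores, the function name `augment_phrase_table` is hypothetical, and the toy entries and the factor of 0.5 are invented for the example (the summary does not state a value for the factor).

```python
def augment_phrase_table(specific, complete, factor):
    """Return a copy of a dialog act specific phrase table, extended with
    the entries of the complete table that it lacks.

    Tables map source phrase -> {target phrase: score}. Entries already in
    the specific table keep their scores; appended entries are multiplied
    by `factor` (an assumed down-weighting constant).
    """
    augmented = {src: dict(targets) for src, targets in specific.items()}
    for src, targets in complete.items():
        for tgt, score in targets.items():
            if tgt not in augmented.get(src, {}):
                augmented.setdefault(src, {})[tgt] = score * factor
    return augmented


# Toy data, invented for illustration only.
complete_table = {
    "right": {"correcto": 0.5, "derecha": 0.5},
    "thank you": {"gracias": 0.9},
}
agreement_table = {"right": {"correcto": 0.9}}

augmented_table = augment_phrase_table(agreement_table, complete_table, factor=0.5)
```

In this toy case the agreement-specific table keeps its own score for "right" as "correcto", while the entries it lacked ("derecha" and the whole "thank you" phrase) are appended at half their complete-model scores. The specific model keeps its preferences but can still translate phrases it did not see in training.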
