Apple treats security and privacy as an important tenet of its value proposition to customers. So it comes as no surprise that a newly published Apple, Inc. patent application entitled “Speaker Recognition” describes a security and biometric authentication invention for use with Siri on the iPhone family of devices.
Inventors Gunnar Evermann and Donal McAllister have invented for Apple a non-transitory computer-readable storage medium that stores one or more programs including instructions which, when executed by an electronic device, cause the electronic device to receive natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; to determine whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; to invoke a virtual assistant in accordance with a determination that the natural-language speech input corresponds to both the user-customizable lexical trigger and the set of acoustic properties associated with the user; and to forego invocation of a virtual assistant in accordance with a determination that the natural-language speech input either fails to correspond to the user-customizable lexical trigger or fails to have the set of acoustic properties associated with the user.
Intelligent automated assistants (or digital assistants/virtual assistants) provide a beneficial interface between human users and electronic devices. Such assistants allow users to interact with devices or systems using natural language in spoken and/or text forms. For example, a user can access the services of an electronic device by providing a spoken user request to a digital assistant associated with the electronic device. The digital assistant can interpret the user’s intent from the spoken user request and operationalize the user’s intent into tasks. The tasks can then be performed by executing one or more services of the electronic device and a relevant output can be returned to the user in natural language form.
To the extent that a digital assistant has been invoked in the past with a voice command, the digital assistant is responsive to the speech itself, not to the speaker. Consequently, a user other than the owner of the electronic device is able to utilize the digital assistant, which may not be desirable in all circumstances. In addition, due to the prevalence of electronic devices and digital assistants, in some circumstances a user may provide a spoken user request to the digital assistant associated with his or her electronic device, and several electronic devices in the room (such as at a meeting) respond.
Some techniques for recognizing a speaker to invoke a virtual assistant using electronic devices, however, are generally cumbersome and inefficient, as set forth above. For example, existing techniques can require more time than necessary due to lack of specificity between electronic devices, wasting user time and device energy. This latter consideration is particularly important in battery-operated devices. As another example, existing techniques may be insecure, due to the acceptance by the digital assistant of spoken input by any user, instead of responding only to the spoken input of the device owner.
Accordingly, the present technique provides electronic devices with faster, more efficient methods and interfaces for recognizing a speaker to invoke a virtual assistant. Such methods and interfaces optionally complement or replace other methods for recognizing a speaker to invoke a virtual assistant. Such methods and interfaces reduce the cognitive burden on a user and produce a more efficient human-machine interface. For battery-operated computing devices, such methods and interfaces conserve power and increase the time between battery charges, and reduce the number of unnecessary and extraneous received inputs.
The patent application illustrates a process for recognizing a speaker to invoke a virtual assistant, according to various examples.
In some embodiments, a non-transitory computer-readable storage medium stores one or more programs, the one or more programs including instructions, which when executed by an electronic device, cause the electronic device to receive natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; and determine whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; wherein in accordance with a determination that the natural language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user, invoke a virtual assistant; and in accordance with a determination that either the natural language speech input fails to correspond to a user-customizable lexical trigger or the natural-language speech input fails to have a set of acoustic properties associated with the user, forego invocation of a virtual assistant.
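The two-part test described above can be illustrated with a minimal sketch. The patent application does not say how either comparison is carried out, so everything here is an assumption made for illustration: the names `UserProfile`, `matches_trigger` and `should_invoke_assistant` are hypothetical, the acoustic properties are modeled as a plain feature vector, and cosine similarity with a 0.8 threshold stands in for whatever speaker-matching technique an actual implementation would use.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical enrollment record for one user (not from the patent)."""
    lexical_trigger: str                # user-customizable trigger phrase
    acoustic_profile: Sequence[float]   # assumed voice feature vector


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_trigger(transcript: str, trigger: str) -> bool:
    """True if the transcript begins with the trigger phrase, ignoring case and punctuation."""
    def words(text: str) -> list:
        return [w.strip(".,!?") for w in text.lower().split()]

    expected = words(trigger)
    return bool(expected) and words(transcript)[:len(expected)] == expected


def should_invoke_assistant(
    transcript: str,
    acoustic_features: Sequence[float],
    profile: UserProfile,
    threshold: float = 0.8,  # assumed cut-off, chosen only for this example
) -> bool:
    """Invoke only when BOTH the lexical trigger and the speaker's voice match."""
    lexical_ok = matches_trigger(transcript, profile.lexical_trigger)
    acoustic_ok = cosine_similarity(acoustic_features, profile.acoustic_profile) >= threshold
    return lexical_ok and acoustic_ok


owner = UserProfile(lexical_trigger="hey boss", acoustic_profile=[1.0, 0.0, 0.5])
owner_says_trigger = should_invoke_assistant("Hey Boss, what's the weather", [0.9, 0.1, 0.5], owner)
stranger_says_trigger = should_invoke_assistant("Hey Boss, what's the weather", [0.0, 1.0, 0.0], owner)
owner_says_other = should_invoke_assistant("Hello there", [0.9, 0.1, 0.5], owner)
```

In this sketch only the first call returns `True`: the stranger uses the right phrase with the wrong voice, and the owner uses the right voice with the wrong phrase, so in both cases invocation is foregone.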
In other embodiments, a transitory computer-readable storage medium stores one or more programs, the one or more programs including instructions, which when executed by an electronic device, cause the electronic device to receive natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; and determine whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; wherein in accordance with a determination that the natural language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user, invoke a virtual assistant; and in accordance with a determination that either the natural language speech input fails to correspond to a user-customizable lexical trigger or the natural-language speech input fails to have a set of acoustic properties associated with the user, forego invocation of a virtual assistant.
In yet other embodiments, an electronic device includes a memory; a microphone; and a processor coupled to the memory and the microphone, the processor configured to receive natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; and determine whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; wherein in accordance with a determination that the natural language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user, invoke a virtual assistant; and in accordance with a determination that either the natural language speech input fails to correspond to a user-customizable lexical trigger or the natural-language speech input fails to have a set of acoustic properties associated with the user, forego invocation of a virtual assistant.
The invention includes a method of using a virtual assistant that includes, at an electronic device configured to transmit and receive data, receiving natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; and determining whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; wherein in accordance with a determination that the natural language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user, invoking a virtual assistant; and in accordance with a determination that either the natural language speech input fails to correspond to a user-customizable lexical trigger or the natural-language speech input fails to have a set of acoustic properties associated with the user, foregoing invocation of a virtual assistant.
Also included is a system utilizing an electronic device, which includes means for receiving natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; and means for determining whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; wherein in accordance with a determination that the natural language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user, means for invoking a virtual assistant; and in accordance with a determination that either the natural language speech input fails to correspond to a user-customizable lexical trigger or the natural-language speech input fails to have a set of acoustic properties associated with the user, means for foregoing invocation of a virtual assistant.
In some embodiments, an electronic device includes a processing unit that includes a receiving unit, a determining unit, and an invoking unit; the processing unit configured to receive, using the receiving unit, natural-language speech input from one of a plurality of users, the natural-language speech input having a set of acoustic properties; and determine, using the determining unit, whether the natural-language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user; wherein in accordance with a determination that the natural language speech input corresponds to both a user-customizable lexical trigger and a set of acoustic properties associated with the user, invoke, using the invoking unit, a virtual assistant; and in accordance with a determination that either the natural language speech input fails to correspond to a user-customizable lexical trigger or the natural-language speech input fails to have a set of acoustic properties associated with the user, forego, using the invoking unit, invocation of a virtual assistant.
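The division of the processing unit into a receiving unit, a determining unit and an invoking unit could be arranged along the following lines. This is only a structural sketch under stated assumptions: the class and method names are hypothetical, speech input is reduced to a `(transcript, voice_id)` pair, and a simple equality check on `voice_id` stands in for a real comparison of acoustic properties, which the application does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical simplification: (transcript, voice_id). The voice_id label
# stands in for the "set of acoustic properties" of the speech input.
SpeechInput = Tuple[str, str]


@dataclass
class ReceivingUnit:
    """Hands over captured speech inputs one at a time."""
    pending: List[SpeechInput] = field(default_factory=list)

    def receive(self) -> SpeechInput:
        return self.pending.pop(0)


@dataclass
class DeterminingUnit:
    """Checks both the lexical trigger and the (stand-in) acoustic match."""
    trigger: str
    enrolled_voice_id: str

    def determine(self, speech: SpeechInput) -> bool:
        transcript, voice_id = speech
        lexical_ok = transcript.lower().startswith(self.trigger.lower())
        acoustic_ok = voice_id == self.enrolled_voice_id
        return lexical_ok and acoustic_ok


@dataclass
class InvokingUnit:
    """Invokes the assistant or foregoes invocation; counts both outcomes."""
    invoked: int = 0
    foregone: int = 0

    def invoke(self) -> None:
        self.invoked += 1

    def forego(self) -> None:
        self.foregone += 1


@dataclass
class ProcessingUnit:
    receiving: ReceivingUnit
    determining: DeterminingUnit
    invoking: InvokingUnit

    def handle_next(self) -> bool:
        """Process one speech input; return True if the assistant was invoked."""
        speech = self.receiving.receive()
        if self.determining.determine(speech):
            self.invoking.invoke()
            return True
        self.invoking.forego()
        return False


device = ProcessingUnit(
    receiving=ReceivingUnit(pending=[
        ("hey boss set a timer", "owner"),     # right phrase, right voice
        ("hey boss set a timer", "visitor"),   # right phrase, wrong voice
        ("set a timer", "owner"),              # wrong phrase, right voice
    ]),
    determining=DeterminingUnit(trigger="hey boss", enrolled_voice_id="owner"),
    invoking=InvokingUnit(),
)
results = [device.handle_next() for _ in range(3)]
```

Keeping the determination separate from the invocation means the matching technique can be swapped without touching the code that starts the assistant, which is one plausible reason for describing the units separately.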
Executable instructions for performing these functions are, optionally, included in a non-transitory or transitory computer-readable storage medium, or in another computer program product configured for execution by one or more processors.
Thus, devices are provided with faster, more efficient methods and interfaces for recognizing a speaker to invoke a virtual assistant, thereby increasing the effectiveness, efficiency, and user satisfaction with such devices. Such methods and interfaces may complement or replace other methods for recognizing a speaker to invoke a virtual assistant.