VoiceNotes: A Speech Interface for a
Hand-Held Voice Notetaker
This paper originally appeared in Proceedings of INTERCHI (Amsterdam, The
Netherlands, Apr. 24-29), ACM, New York, 1993, pp. 179-186.
Lisa J. Stifelman*+, Barry Arons*, Chris Schmandt*, Eric A. Hulteen+
*Speech Research Group
MIT Media Lab
20 Ames Street, Cambridge, MA 02139
617-253-8026, lisa@media-lab.mit.edu
+Human Interface Group/ATG
Apple Computer, Inc.
20525 Mariani Ave., MS 301-3H
Cupertino, CA 95014
ABSTRACT
VoiceNotes is an application for a voice-controlled hand-held computer that
allows the creation, management, and retrieval of user-authored voice
notes--small segments of digitized speech containing thoughts, ideas,
reminders, or things to do. Iterative design and user testing helped to
refine the initial user interface design. VoiceNotes explores the problem
of capturing and retrieving spontaneous ideas, the use of speech as data,
and the use of speech input and output in the user interface for a hand-held
computer without a visual display. In addition, VoiceNotes serves as a step
toward new uses of voice technology and interfaces for future portable devices.
KEYWORDS
Speech interfaces, speech recognition, non-speech audio, hand-held computers,
speech as data.
INTRODUCTION
How can you capture spontaneous ideas that come to you in the middle of
the night, when you are walking down the street, or driving in your car?
Pen and paper are often used to record this information, but it is difficult
to read and write while driving, and scraps of paper with notes can become
scattered or lost. A portable computer can provide better organization,
but it is impossible to carry a computer, type, and look at the display
while walking down the street. Some people use microcassette™ recorders
since voice is a particularly fast and easy way to record such information.
However, one is left with a long linear stream of audio and cannot randomly
access individual thoughts. In a study of microcassette recorder users,
this lack of random access was found to be the users' worst frustration
[5].
This paper presents a speech interface for a hand-held [footnote-1]
computer that allows users to capture and randomly access voice notes--segments
of digitized speech containing thoughts and ideas. The development of the
VoiceNotes application explores: (1) the problem of capturing and retrieving
spontaneous ideas; (2) the use of speech as data; and (3) the use of speech
input and output in the user interface for a hand-held computer.
WHY VOICE?
With advances in microelectronics, computers are rapidly shrinking in size.
Laptop computers are portable versions of desktop PCs, but the user interface
has remained essentially unchanged. There are also a host of small specialized
electronic organizers, travel keepers, and even pocket-sized PCs that present
the user with a tiny display and a bewildering array of keys. As computers
decrease in size, so does the utility of traditional input and output modalities
(keyboards, mice, and high resolution displays). Functionality and ease-of-use
are limited on these small devices in which designers have tried to `squeeze'
more and more features into an ever decreasing product size. Rather than
simply shrinking the size of traditional interface elements, new I/O modalities
must be explored.[footnote-2]
The work presented in this paper explores the concept of a hand-held computer
that has no keyboard or visual display, but uses a speech interface instead.
Information is stored in an audio format, as opposed to text, and accessed
by issuing spoken commands instead of typing. Feedback is also provided
aurally instead of visually.
Voice technology has been explored for use in desktop computers and telephone
information systems, yet the role of voice in the interface for a hand-held
device has received little attention. There are two important research challenges
for this work: (1) taking advantage of the utility of stored voice as a
data type for a hand-held computer while overcoming its liabilities (speech
is slow, serial, and difficult to manage); (2) determining the role of voice
in the user interface for a hand-held computer given the limitations in
current speech recognition technology.
Research and experience using voice in user interfaces have revealed its
many advantages as well as its liabilities. Voice allows the interface to
be scaled down in size. In the extreme case, the physical interface may
be negligible, requiring only a speaker and microphone. Speech provides
a direct and natural means of input for capturing spontaneous thoughts and
ideas. In comparison to writing, speech can provide faster output rates,
and allows momentary thoughts to be recorded before they are forgotten [8].
In addition, voice can be more `expressive' than text, placing less cognitive
demands on the communicator and allowing more attention to be devoted to
the content of the message [3]. Voice as an access mechanism is also direct
(the user thinks of an action and speaks it), and allows additional tasks
to be performed while the hands or eyes are busy [11].
Speech is a natural means of interaction, yet recording, retrieving, and
navigating among spoken information is a challenging problem. Speech is
fast for the author but slow and tedious for the listener [8]. When reviewing
written information, the eye can quickly scan a page of text, using visual
cues such as highlighting and spatial layout to move from one idea to the
next. Navigating among spoken segments of information is more difficult,
due to the slow, sequential, and transient nature of speech [1][12]. The
research presented in this paper addresses the issue of how voice input
can be used to record, retrieve, and navigate among segments of speech data
using a hand-held device that has no visual display.
RELATED WORK
The work detailed in this section highlights key issues considered in the
development of VoiceNotes including: the use of voice in hand-held environments,
navigating in speech-only interfaces, and notetaking.
Degen added two buttons to a conventional tape recorder that allow users
to `mark' segments of audio while recording [5]. The audio data is then
digitized and stored on a Macintosh® computer for review. In an evaluation
of this prototype, users expressed the desire to customize the meanings
of the marks, for more buttons to uniquely tag audio segments, and the ability
to play back the marked segments directly from the device. VoiceNotes addresses
these issues by using speech recognition and storage technology to allow
users to create and name personal categories, and by providing a user interface
for direct entry, retrieval, and organization of audio data on a hand-held device.
Hyperspeech, a speech-only hypermedia system, addresses important design
considerations for speech-only interfaces [1]. Hyperspeech provides the
ability to navigate through a network of recorded speech segments using
isolated word recognition. The Hyperspeech database was created and organized
by the author of the system. In contrast, VoiceNotes is composed of information
created and organized by the user. Additionally, voice notes are automatically
segmented by the application, while the Hyperspeech audio data was manually
segmented by the author.
Notepad, a visual notetaking program, is a tool for "thought-dumping--the
process of quickly jotting down a flood of fleeting ideas" ([4], p.
260). Cypher emphasizes the importance of allowing users to quickly record
an idea with a minimum amount of interference. VoiceNotes, like Notepad,
is intended to allow `thought-dumping' so the interactions must be efficient--the
`tool' should not impede the user's thought process. However, the considerations
for designing a voice interface are very different from those for a visual
interface. While a visual interface can present information simultaneously
in a multitude of windows, VoiceNotes must be more efficient in its presentation
of speech data.
VOICENOTES
VoiceNotes is an application for a hand-held computer (Figure 1) that allows
the creation, management, and retrieval of user-authored voice notes [15].
For example, "call mom to wish her happy birthday" can be recorded
as a voice note. Voice notes can be categorized according to their content.
For example, the note "call mom..." could be put into a category
of notes named "phone calls."
Figure 1: Photograph of hand-held prototype.
Usage Scenarios
A demonstration of the VoiceNotes application in the context of how it might
be used during the course of a user's day is provided in Figures 2 and 3.
Description of VoiceNotes
VoiceNotes provides a simple digital audio file system for organizing recorded
segments of speech. The user can create lists of voice notes. Each list
has a category name and a collection of associated notes. For example, the
user might create a "things to do" list with associated notes
such as "pay the rent," and "pick up the cleaning" (Figure
4). VoiceNotes allows the user to create multiple lists of notes. Figure
4 shows additional examples of categories such as "calls" to make,
"movies" to rent, and "groceries" to buy. These are
just examples--the names of the categories are defined by the user.
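As a rough illustration (this sketch is not part of the original application), the organization just described--user-named categories, each holding an ordered list of notes--maps onto a simple data model; the class and field names below are invented:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VoiceNote:
        audio: bytes                      # one digitized speech segment

    @dataclass
    class Category:
        name_audio: bytes                 # the spoken category name, kept for playback
        notes: List[VoiceNote] = field(default_factory=list)

    @dataclass
    class SpeechDatabase:
        # One-level hierarchy: categories hold notes, never sub-categories.
        categories: List[Category] = field(default_factory=list)

        def add_category(self, name_audio: bytes) -> Category:
            category = Category(name_audio=name_audio)
            self.categories.append(category)
            return category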
The user selects the "Things to do" category:
User: "THINGS TO DO"
Hand-held: "Moving into Things to do" [footnote-3]
"Pay the rent"
"Pick up clothes from cleaners"
The user interrupts to add a note to the category:
User interrupts: "RECORD"
Hand-held: "Recording note"
User: "Stop at the grocery store"
Hand-held: "New note added"
Figure 2: Waking up in the morning, checking and updating the day's activities.
The user plays the list of categories:
User: "CATEGORIES"
Hand-held: "Moving into Categories"
"Things to do," "Calls"
The user adds a new category called "Groceries":
User interrupts: "RECORD"
Hand-held: "Recording category"
"Groceries"
"New category added"
The user selects the "Groceries" category:
User: "GROCERIES"
Hand-held: "Moving into Groceries,
list is empty"
The user adds notes to the "Groceries" category:
User: "RECORD"
Hand-held: "Recording note"
User: "Milk"
Hand-held: "New note added"
User: "RECORD"
Hand-held: "Recording note"
User: "Orange juice"
Hand-held: "New note added"
Figure 3: In the kitchen making breakfast, creating a grocery list.
The category name provides a method for organizing a collection of notes
as well as a handle for accessing them. When the user speaks a category
name[footnote-4], it is simultaneously recorded for playback
and trained for speech recognition. Category names allow random access across
lists; to select and play back a list of voice notes, the user simply speaks
the category name. Since training the recognizer only requires a single
utterance, the user's spoken category name becomes a voice command without
a separate training process.
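This single-utterance enrollment might be sketched as follows, continuing the data model above (the recognizer interface is invented for illustration; speaker-dependent recognizers of this kind typically accept one recorded example per vocabulary word):

    def add_category_by_voice(db, recognizer, record_utterance):
        """Record one utterance and use it twice: stored for playback,
        and enrolled as a recognition template, so that speaking the
        category name immediately becomes a voice command."""
        utterance = record_utterance()            # capture the spoken name
        category = db.add_category(utterance)     # keep the audio for playback
        recognizer.train_word(word_id=id(category), example=utterance)
        return category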
Figure 4: Sample VoiceNotes speech database.
The VoiceNotes user interface provides a simple set of voice commands for
recording, navigating, and managing voice notes. Figure 5 lists some of
the basic voice commands and their associated actions.
Command            Action
Play               Plays each item in a list
Record             Records an item at the end of a list
Stop               Interrupts the current activity
Next, Previous     Plays the next/previous item
Categories         Plays all of the categories
<Category name>    Selects a category and plays its notes
Delete             Deletes the current item in the list
Undelete           Retrieves the last item deleted
Scan               Plays a portion of each item in a list
First, Last        Plays the first/last item in a list
Stop-listening     Turns recognition off
Pay-attention      Turns recognition on
Where-am-i         Plays the current category name
Move               Moves a note to another list
Figure 5: Basic voice commands.
In this design, the "record", "next", "previous", and "delete" commands
can apply either to voice notes or to categories of notes, depending upon
the user's current position in the speech database.
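One plausible way to realize this overloading is to resolve each command against the user's current position, as in the following sketch (illustrative code, not from VoiceNotes):

    def execute(command, at_top_level):
        """The same command word operates on categories at the top level
        and on notes once the user has moved into a list."""
        target = "category" if at_top_level else "note"
        if command == "record":
            return f"record a new {target}"
        if command == "delete":
            return f"delete the current {target}"
        if command in ("next", "previous"):
            return f"play the {command} {target}"
        return f"unrecognized command: {command}"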
In the first hand-held prototype, there were equivalent button controls
for most of the voice commands. While the goal of this research has been
to explore voice interfaces and applications in hand-held computers, there
are cases in which button input provides a better, or more appropriate,
interface.
Hand-Held Prototype
A prototype was developed to simulate the user interface experience with
such a hand-held device (Figure 1). Although the prototype is tethered to
a PowerBook™ computer, it allows exploration of the interface for a
voice-controlled hand-held device that does not yet exist.
In the prototype, a Motorola® 68HC11 microcontroller was placed
inside the shell of an Olympus® S912 microcassette recorder and
interfaced to its buttons. The prototype communicates with a PowerBook through
a serial connection to indicate button presses, and an analog audio connection
for speech I/O. The original volume control is used for setting the speed
of playback. Microphone input from the prototype is routed to the PowerBook
for digitization and storage, and to a Voice Navigator™ for speech recognition.[footnote-5]
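The paper does not document the wire protocol, but the button path might look like the sketch below, which assumes the pyserial package and a one-byte code per button press (both are assumptions for illustration, not details of the prototype):

    import serial   # pyserial: the 68HC11 reports button presses over this link

    BUTTON_CODES = {b'P': 'play', b'R': 'record', b'S': 'stop'}   # hypothetical codes

    def button_events(port='/dev/ttyS0'):
        """Yield button names as the prototype reports them (assumes a
        one-byte code per press; the actual protocol is not documented)."""
        link = serial.Serial(port, baudrate=9600, timeout=1)
        while True:
            code = link.read(1)
            if code in BUTTON_CODES:
                yield BUTTON_CODES[code]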
INTERFACE DESIGN ISSUES
The following key issues were considered during the initial phase of interface
design, prior to user testing.
User-Definable Category Names
The most important use of speech input in the VoiceNotes application is
for naming new categories and randomly accessing them. Users' ability
to personalize the application by creating their own category names is essential.
Since a category might contain the name of a friend, company, or an acronym,
category names cannot come from a fixed recognition vocabulary. Users must
be able to create these categories in real-time to support the spontaneous
capture of information. Speaker dependent isolated word recognizers typically
allow new words to be added to the recognition vocabulary in real-time based
on acoustic data alone, whereas some speaker independent recognition systems
require a phonetic spelling of the word. Requiring users to spell or type
in new words would defeat the premise underlying the use of voice (e.g.,
spontaneity, speed of entry). In addition, since the hand-held is a personal
device, speaker independent recognition is not necessary.
Navigation
Since there is no visual display, users must be able to maintain a mental
model of the VoiceNotes speech database. Voice notes are organized into
a two-dimensional matrix (Figure 4), allowing the user to navigate within
a particular list of notes, or between categories of notes. It was anticipated
that users would have difficulty keeping track of their position in the
speech database if the organization of notes was too complex. Therefore,
the database was limited to a one-level hierarchy (a category of notes cannot
contain a sub-category).
While graphical hypermedia systems can show the user's navigational path
visually, speech is transient and leaves no trace of its existence. Navigating
between lists of notes and keeping track of one's position is simplified
if there is always an active list. The current list position does not change
unless the user explicitly issues a navigational command. It is important
that the user feel `in control' of the navigation, so automatic actions
are avoided and commands are provided to give users complete control over
their movement.
Voice and Button Input
The user interface for VoiceNotes combines multiple complementary input
and output modalities. Combining voice and button input takes advantage
of the different capabilities provided by each modality while allowing the
limitations of one type of input to be overcome by the other. VoiceNotes
can be operated using voice alone, buttons alone, or any combination of
voice and button input. This flexibility is important, since the user's
selection of how to interact with the application at any given time will
be dependent on several factors.
The task. List selection by voice is direct, fast, and intuitive,
and it gives the user control over the number of lists and the category names.
Given a flexible number of lists, voice can provide a one-to-one correspondence
between each list and the command for accessing it, while buttons cannot
due to space limitations. However, buttons are better for tasks requiring
fine control such as speed and volume adjustment since voice commands such
as "faster, faster..." are awkward.
The context. The acoustic environment, social situation, and current
user activity affect the choice of using voice or button input. Button input
allows the hand-held to be operated when speech recognition accuracy is
degraded due to background noise. Furthermore, button input supports the
use of VoiceNotes in social contexts when it is inappropriate or awkward
to speak aloud to one's hand-held computer. Alternatively, when the user's
hands and eyes are busy (e.g., while driving), or vision is degraded (e.g.,
in darkness), voice input allows users to operate the application without
requiring them to switch their visual attention in order to press a button.
Individual user preference. Some users may prefer to use buttons
rather than speak to the computer while others may prefer to use speech
input all the time.
Speech and Non-Speech Audio Output
Speech and non-speech audio [footnote-6] [7] output are
the primary means of giving feedback to the user--they indicate the current
state of the interface whenever the user issues a voice command or presses
a button. Just as the combination of speech and button input provides the
user with a richer set of interactions, the combination of speech and non-speech
audio output is also powerful.
The type of feedback presented depends on the action being performed, the
type of input used (voice or button), and the user's experience level and
preference. Speech output is used, for example, to play back the contents
of a voice note when it is deleted, while a page-flipping sound [footnote-7]
indicates movement between notes. Speech feedback is used more often in
response to voice input than to button input, since speech recognition
is error prone and requires the system to provide evidence that the correct
command was recognized [9]. However, too much speech output becomes laborious
and slows down the interactions. For example, spoken feedback both before
and after recording a note ("recording note". . . "new note
added") is tedious when recording several notes in a row. Non-speech
audio (i.e., a single beep before recording and a double beep after) is
faster and less intrusive on the user's task.
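A minimal sketch of such a feedback policy (the function names are illustrative): spoken confirmation when the command arrived by voice, where evidence of correct recognition matters most, and terse non-speech audio otherwise.

    def confirm_recording(input_modality, speak, beep, verbose=False):
        """Feedback for the `record' command: spoken confirmation after
        voice input, as evidence of correct recognition [9]; a quick
        beep after button input, which is less intrusive when recording
        several notes in a row."""
        if input_modality == 'voice' or verbose:
            speak("Recording note")
        else:
            beep(count=1)      # single beep: recording has started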
Streamlining the Speech Interaction
In graphical interfaces, screen real estate is the most limited resource;
in speech interfaces, time is the most valuable commodity [14]. Feedback
must be brief, yet informative, to conserve time and to reduce the amount
of information that the user must retain in working memory [17]. Audio output
must be interruptible at all times--VoiceNotes provides the ability to jump
between notes on a particular list, between different lists, or to stop
playback at any instant. According to Waterworth, "If he can stop the
flow, obtain repeats, and move forwards and backwards in the dialogue at
will, he can effectively receive information at a rate that suits his own
needs and capabilities" ([17], p. 167).
In addition, it is valuable to provide users with interactive control of
the rate of playback. There are a range of techniques for time-compressing
speech without changing the pitch (summarized in [2]). VoiceNotes allows
the speed of playback to be increased up to several times the speed of the
original recording. Research suggests that a speed-up of more than two times
the original rate presents too little of the signal in too little time to
be accurately perceived [10]. However, comprehension of time-compressed
speech increases with practice and users tend to adapt quickly [16]. VoiceNotes
allows users to dynamically adjust the speed of playback in order to browse
a list, speeding up during some portions and slowing down when reaching
a note of interest. In addition, users can select a fixed rate of playback
that they find comfortable for normal listening.
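The compression techniques themselves are surveyed in [2]; the interactive control layer VoiceNotes adds on top might be sketched as follows (illustrative code; the rate limits reflect the findings cited above):

    MIN_RATE = 1.0
    MAX_RATE = 3.0   # `several times' the original rate; above ~2x is rarely
                     # comprehended accurately [10], though listeners adapt [16]

    class PlaybackControl:
        def __init__(self, preferred_rate=1.0):
            self.rate = preferred_rate       # user's comfortable fixed rate

        def adjust(self, delta):
            """Nudge the rate while browsing a list: speed up through
            familiar notes, slow down on reaching a note of interest."""
            self.rate = max(MIN_RATE, min(MAX_RATE, self.rate + delta))
            return self.rate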
ITERATIVE DESIGN
The initial design of the VoiceNotes interface described above was developed
through an iterative design process. Each aspect of the interface, especially
navigation and feedback, went through many changes prior to user testing.
Moded vs. Modeless Navigation
The first VoiceNotes interface was moded--only a subset of the voice commands
was valid at each point in the interaction. For example, when the last note
on a list was played, the system would return to a `top level' mode, causing
users to lose their position in the speech database. The user was essentially
`dropped off the end of the list' and commands like "next" and
"previous" were no longer valid.
The interface was redesigned in an attempt to create a modeless interface
and to simplify navigation. In this design, all commands are always valid.
When the last note is reached, if the user says "next" the system
responds "end of list," and retains the user's position on the
last item. Now, the user can issue commands like "next" and "previous"
without fear of `falling off the end of the list'. In this way, the beginning
and end of a list act as `anchors' for navigational control instead of drop
off points.
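A sketch of the redesigned behavior, with the ends of the list acting as anchors (illustrative code, not from the implementation):

    def next_note(index, notes):
        """Modeless `next': at the last note, report the boundary and
        retain position instead of dropping the user off the list."""
        if index >= len(notes) - 1:
            return index, "end of list"      # the end acts as an anchor
        return index + 1, notes[index + 1]   # otherwise advance and play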
Distinct Feedback
There were several problems with the initial design of the feedback provided
by VoiceNotes. One problem was that feedback for different voice commands
was not distinct, making it ambiguous as to whether or not a command was
correctly recognized. For example, when selecting a category (e.g., "things
to do") or saying "where-am-i", the system played the category
name in response to both commands. Another problem with the feedback for
selecting a list was that merely echoing the category name did not indicate
any movement from one list to another.
In order to address these problems, the response to each command was made
distinct and the feedback for selecting a category was changed to indicate
movement ("moving into things to do"). Once this change
was made, however, the feedback became too wordy. Therefore, an option was
added to allow "moving into" to be replaced by a shorter duration
sound effect (auditory icon).
USER TESTING
An informal user test (of the type described in [13]) was performed to help
further refine the initial design of the VoiceNotes interface. The goal
was to observe users to determine those aspects of the interface with which
they had the most difficulty, particularly how well users could navigate
the speech database, given the structure shown in Figure 4. In addition,
we solicited their initial reactions to the application.
Method
Six participants, three male and three female, used VoiceNotes to perform
an inventory task and were subsequently interviewed. Each subject used VoiceNotes
for a one-hour period. The tests were videotaped for later analysis. One
of the participants used a microcassette recorder extensively at home and
in the car for recording things to buy, videos to rent, books to read, etc.
Another participant was considering buying a microcassette recorder to help
keep track of personal information. None of the participants had ever used
a speech recognizer before. Participants were told to `think out loud' as
they performed the different tasks [6].
First, each user trained the speech recognizer [footnote-8]
on the voice commands (Figure 5). Next, the user was briefly instructed
on VoiceNotes operations. Following training, the user walked around an
office[footnote-9], performing an inventory task of several
cubicles. The user created a category for the name of the person occupying
the office and a note for each piece of equipment contained in the office.
While taking inventory, users were interrupted occasionally and asked to
create and add items to a grocery and a to-do list. The user was free to
use either voice or buttons for any task.
Observations
Performance varied widely across the users tested. Some users learned the
application very quickly and had few problems performing any of the tasks,
while others struggled throughout the test. Several problems with the interface
design were consistently observed during the testing.
Navigation. Users sometimes lost track of their position in the VoiceNotes
speech database. This often occurred when selecting a category of notes,
after which the notes in the category would automatically begin to play.
This automatic playback was unexpected and made the user feel out of control
of the interaction. While some used the "where-am-i" command to
determine their location, most wanted some kind of visual indication on
the device of the current list and note.
Despite our efforts to create a `modeless' interface, users still perceived
the interface as moded (users referred to `category' and `notes' modes).
Since the record, delete, previous, and next commands were overloaded (used
for both categories and notes), users were often confused as to whether
they were operating on categories or notes. When playing back the list of
categories, some users stopped when they heard the category they wanted
and attempted to record a note. Since they didn't first move into the list,
their `note' was actually interpreted as a new category. When asked to add
a new category, users would often say "new list" instead of "record".
These `modes' also negated the benefit of list selection by voice, since
most of the users thought they had to be in `category' mode in order to
select a new category.
Interruption. A related problem was that users were unable to determine
how to interrupt the speech output. During the user test, interruption by
voice input was not enabled, although the ability to interrupt was available
using the stop button. Users had a bias towards using voice to interrupt--when
attempting to interrupt, they said "stop" rather than pressing
the button. One user said, "I interrupt people that way [with voice],
so why shouldn't I be able to interrupt this machine the same way."
Voice Input. VoiceNotes always listens for voice input, under the
assumption that this allows more spontaneous use of the application. However,
this makes it difficult to determine when the user is speaking a command
(the system must differentiate between background noise and voice commands).
Therefore, VoiceNotes remains silent unless a word is correctly recognized.
During testing, if a user spoke a command and VoiceNotes did not respond,
the user waited for a response rather than repeating the command, thinking that
the system was still processing the input or busy performing the task. This
left users confused and frustrated. Furthermore, background
conversation often falsely triggered playback, making the user feel out
of control because the device appeared to be operating spontaneously. Users
expressed concern over the embarrassment that would be caused if this happened
during a meeting or when talking to one's supervisor.
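The trade-off can be sketched as follows, assuming a recognizer that returns a candidate word with a confidence score (an invented interface): staying silent on low-confidence input avoids reacting to noise, but it also gives no sign that a genuine command was rejected.

    REJECT_THRESHOLD = 0.8   # illustrative value, not from the paper

    def handle_audio(recognizer, audio, dispatch):
        """Always-listening input: act only on confident recognitions."""
        word, score = recognizer.recognize(audio)
        if score < REJECT_THRESHOLD:
            return None    # stay silent rather than react to noise--but the
                           # user then gets no sign that a command was rejected
        return dispatch(word)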
Interviews
At the end of the test, participants were interviewed about their difficulties
with the user interface, their preference for voice or buttons, and their
potential use of the application.
Feedback. Users perceived the interface as overly talkative or wordy--partly
due to problems with interrupting the output, and partly due to the feedback
initiated by falsely triggered recognitions. One user wanted the ability to turn the
speech feedback off or select an alternative method of response. [footnote-10]
Voice vs. Buttons. When performing tasks, users employed both voice
and button input. Users who obtained poor recognition results simply used
buttons instead. Furthermore, during the test there was often background
noise (e.g., a printer) that interfered with recognition, and users similarly
compensated for this. When asked which input modality was preferred, some
users said they would prefer voice if it was reliable enough, but all the
users tested said they wanted both voice and buttons for communicating with
the device.
Potential Use. All but one user said that they would like to use
a hand-held device for creating personal voice notes. In addition, some
wanted to use the device to listen to voice mail and electronic mail messages
while driving.
Implications for Redesign
The information gathered during user testing uncovered aspects of the VoiceNotes
interface requiring further design development.
Navigation. One solution for addressing users' confusion between
operating on categories versus notes is to provide separate commands for
each (e.g., "new category", "new note"). Another solution
is a one-to-one mapping between categories and buttons on the device. A
visual indicator for each category could also help users keep track of their
position.
Interruption and Voice Input. Although the ultimate goal is to allow
users to pick up the device and speak a command immediately, an alternate
approach must be taken due to problems with speech recognition in noise.
One solution is to provide a `push-to-talk' button. This approach also provides
a consistent mechanism for interrupting the VoiceNotes speech output.
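A sketch of this redesign (hypothetical names): recognition is gated by the button, and pressing it also silences playback, giving one consistent interruption mechanism.

    def on_push_to_talk_pressed(player, recognizer):
        player.stop()           # pressing the button interrupts any output,
        recognizer.enable()     # then recognition runs only while it is held

    def on_push_to_talk_released(recognizer):
        recognizer.disable()    # ignore ambient speech the rest of the time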
Feedback. The type of feedback (i.e., primarily speech or primarily
non-speech) and amount (i.e., verbose or terse) used by VoiceNotes should
be user definable. The perception of VoiceNotes as `wordy' indicates the
need to make these customization capabilities easily accessible to the user.
Voice vs. Buttons. When asked whether they would use the device if
only one input modality was provided, the users consistently responded that
they wanted both voice and buttons. This reinforces our original assumption
about the value of offering both of these input modalities.
CONCLUSIONS
In developing VoiceNotes, many lessons were learned that are applicable
to other speech and small computer interfaces:
- The use of multiple input and output modalities (in this case voice
and button input, speech and non-speech output) combines the capabilities
of each, while allowing limitations of a particular modality to be overcome.
- In speech interfaces like VoiceNotes, time is a valuable commodity.
Feedback must be as brief and responsive as possible, audio output must
be interruptible at all times, and dynamic control over the rate of playback
should be provided. Furthermore, despite the best attempts to design informative,
unambiguous, and brief feedback, it is important to allow users to customize
both the amount and type of system feedback.
- Voice input was found to be especially valuable for categorizing and
randomly accessing information (in this case, small segments of digitized
speech).
- Navigation in speech-only interfaces remains a challenging design
problem. Audio feedback must provide a sense of movement when navigating.
Navigational `anchors' must represent the limits of the information space,
helping users to keep track of their position and maintain control over
their movement.
This work has explored the use of voice, both as the data and the access
mechanism in the user interface for a hand-held computer. In addition to
addressing the problems of capturing and retrieving spontaneous ideas, VoiceNotes
serves as a step toward new uses of voice technology and interfaces for
future portable devices.
ACKNOWLEDGMENTS
We would like to thank several people for their contributions. The user
testing was done in collaboration with Richard Mander, Jesse Ellenbogen,
Eric Gould, and Jonathan Cohen; Andrew Kass and Nicholas Chan contributed
to the software development; Derek Atkins and Lewis Knapp developed the
hand-held prototypes.
This work was sponsored by Apple® Computer, Inc. [footnote-*]
REFERENCES
1. Arons, B. Hyperspeech: Navigating in speech-only hypermedia. In Proceedings
of Hypertext '91, pp. 133-146. ACM, 1991.
2. Arons, B. Techniques, perception, and applications of time-compressed
speech. In Proceedings of AVIOS '92, pp. 169-177. American Voice
I/O Society, 1992.
3. Chalfonte, B.L., Fish, R.S. and Kraut, R.E. Expressive richness: A comparison
of speech and text as media for revision. In Proceedings of CHI '91,
pp. 21-26. ACM, 1991.
4. Cypher, A. The structure of users' activities. In Norman, D.A. and Draper,
S.W., editors, User Centered System Design, chapter 12, pp. 243-263.
Lawrence Erlbaum Associates, 1986.
5. Degen, L., Mander, R. and Salomon, G. Working with audio: Integrating
personal tape recorders and desktop computers. In Proceedings of
CHI '92, pp. 413-418. ACM, 1992.
6. Ericsson, K.A. and Simon, H.A. Protocol Analysis. The MIT Press, 1984.
7. Gaver, W.W. The SonicFinder: An interface that uses auditory icons. Human-Computer
Interaction, 4(1):67-94, 1989.
8. Gould, J.D. An experimental study of writing, dictating, and speaking.
In Requin, J., editor, Attention & Performance VII, pp. 299-319.
Lawrence Erlbaum, 1978.
9. Hayes, P.J. and Reddy, D.R. Steps toward graceful interaction in spoken
and written man-machine communication. International Journal of Man-Machine
Studies, 19:231-284, 1983.
10. Heiman, G.W., Leo, R.J., Leighbody, G. and Bowler, K. Word intelligibility
decrements and the comprehension of time-compressed speech. Perception
and Psychophysics, 40(6):407-411, 1986.
11. Martin, G.L. The utility of speech input in user-computer interfaces.
International Journal of Man-Machine Studies, 30:355-375, 1989.
12. Muller, M.J. and Daniel, J.E. Toward a definition of voice documents.
In Proceedings of COIS '90, pp. 174-182. ACM, 1990.
13. Nielsen, J. Usability engineering at a discount. In Salvendy, G. and
Smith, M.J., editors, Designing and Using Human-Computer Interfaces and
Knowledge Based Systems, pp. 394-401. Elsevier, 1989.
14. Rudnicky, A.I. and Hauptmann, A.G. Models for evaluating interaction
protocols in speech recognition. In Proceedings of CHI '91, pp. 285-291.
ACM, 1991.
15. Stifelman, L.J. VoiceNotes: An application for a voice-controlled hand-held
computer. Master's Thesis. Massachusetts Institute of Technology, 1992.
16. Voor, J.B. and Miller, J.M. The effect of practice upon the comprehension
of time-compressed speech. Speech Monographs, 32:452-455, 1965.
17. Waterworth, J.A. Interaction with machines by voice: A telecommunications
perspective. Behaviour and Information Technology, 3(2):163-177,
1984.
FOOTNOTES
[footnote-1] The term `hand-held' is used to refer to
the size of the device. The device may actually be something worn on a belt.
[footnote-2] Small pen-based computers are an
effort in this direction, but the interface is primarily suited for visual
tasks.
[footnote-3] Note that "moving into"
is replaced by a `list opening' sound effect if non-speech audio feedback
is selected by the user. For the purposes of this demonstration, speech
rather than non-speech feedback is used.
[footnote-4] A category name consists of a single short
utterance.
[footnote-5] The Voice Navigator is a speaker dependent
isolated word recognizer.
[footnote-6] VoiceNotes uses mostly auditory icons,
everyday sounds used to convey information about computer events [7].
[footnote-7] This is an example of an auditory icon.
[footnote-8] Users were prompted to speak each word in
the VoiceNotes vocabulary one time.
[footnote-9] The device was used under realistic ambient
noise conditions.
[footnote-10] Non-speech feedback was not available during
testing.
[footnote-*] Apple, the Apple logo, and Macintosh are
registered trademarks of Apple Computer, Inc. PowerBook is a trademark of
Apple Computer, Inc.