Talking to your computer has been a staple of science fiction since at least the 1960s, but it looks as if it's finally coming within reach.
This week saw the release of the first speech recognition software capable of handling continuous speech without the user having to train it in advance, namely Nuance's Dragon NaturallySpeaking (DNS) version 9.
For anyone else who, like me, tried IBM ViaVoice or Dragon Dictate a few years ago, found it awkward to get the system used to your voice, and even more awkward to speak in a staccato word-by-word fashion, this is a huge leap forward.
It means that anyone can simply start talking to any speech-equipped PC and expect it to recognise what they're saying. That's continuous speech, not one. Word. At. A. Time. It could be text to go into a document or email message, or commands for Windows or a Windows app, to open a file or close a pane, say.
It's still not perfect - it can't handle umms and errs very well, nor can it spot pauses and automatically insert punctuation, so you need to switch your speech pattern into what we might call "dictation mode."
It still needs a good microphone or headset too, and while it now supports Bluetooth wireless headsets, developer Nuance has so far only found two models good enough to certify.
However, for managers and others who are already used to dictating, this could at last relegate the keyboard to second-class status.
So what has changed in the world of speech recognition to make all this possible? It's a convergence of a set of technological advances, says Nuance marketing manager Steven Steenhaut - from DSPs for better microphones and noise cancellation, through cheaper and more plentiful memory, to CPU chips powerful enough to handle the immense processing load, even on a PDA.
And of course, software algorithms and language models keep getting better: for example, software can now learn on the fly, without the need for active training.
"The core concepts haven't changed since the technology's inception - it has to rely on those," Steenhaut says. "The advances are a combination of hardware and software, for example the hardware vendors are very focused on increasing their noise cancellation capabilities, and we have created noise cancelling algorithms too."
Other advances include software to recognise bigrams and trigrams - pairs and triplets of words that commonly occur together, and can therefore be used to improve the recognition process.
"It uses a statistical model on top of the acoustic model. It looks at how you create your vocabulary and adapts the statistical model to you," Steenhaut notes.
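To make the bigram idea concrete, here is a minimal Python sketch of how word-pair statistics could be used to choose between two transcriptions that sound alike. It is an illustration only, not Nuance's implementation: the toy corpus, the function names (train_bigrams, score, pick_best) and the add-one smoothing are all assumptions, and a real recogniser would combine a score like this with the output of the acoustic model.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count single words and adjacent word pairs in a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams):
    """Product of add-one-smoothed bigram probabilities (an assumed scheme)."""
    words = ["<s>"] + sentence.lower().split()
    vocab_size = len(unigrams)
    probability = 1.0
    for prev, cur in zip(words, words[1:]):
        probability *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return probability

def pick_best(candidates, unigrams, bigrams):
    """Return the candidate transcription the bigram model finds most likely."""
    return max(candidates, key=lambda c: score(c, unigrams, bigrams))

# Hypothetical training text standing in for what the user has dictated so far.
corpus = [
    "it can recognise speech",
    "the software can recognise speech",
    "we walked along a nice beach",
]
unigrams, bigrams = train_bigrams(corpus)

# Two candidates that sound alike; the word-pair counts favour the first.
candidates = ["it can recognise speech", "it can wreck a nice beach"]
best = pick_best(candidates, unigrams, bigrams)
print(best)
```

Adapting the model to a particular user, as Steenhaut describes, would amount to updating these counts with the user's own text. A trigram model works the same way but conditions each word on the two words before it.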
The same speech recognition technology is also being applied in call centres, where it allows simple enquiries to be handled by computer, in cars for access to information, and on mobile phones. The limited processing power and memory of the latter mean you need to speak slowly, though, and the vocabulary is more limited, but for text messaging it can be OK.
Nuance now owns Dictaphone as well, and sees a huge opportunity in layering specialist vocabularies - for lawyers, surgeons or radiographers, say - on top of its speech recognition technology. It says $15 billion a year is spent world-wide on manual transcription within healthcare, so anything it can get of that is worthwhile.
Of course, there are still things it can't do - and one of them is transcribe meetings where more than one voice is present. There is technology to recognise voice-prints, and it is already being used in security applications, but Steenhaut says it's not yet ready for a broader market.
In the meantime, he points to an Australian company which has addressed the problem of recording meetings by developing a system with Dragon software and multiple microphones.
"I think acceptance will broaden significantly once people realise there's no need now to sit down and read scripts," he says. "It makes it accessible to a much wider audience.
"At the moment the focus is to improve productivity for people who create documents. The average speed for dictation is 160 words per minute, versus 50 for a typist."
"It has improved so much through the versions, you can be much more relaxed now," he adds. "Conversational speech is still a way off, but that's definitely the way the technology is evolving."