What Is Speech Recognition?
Wikipedia : Speech recognition
Speech recognition systems can be classified along several variables, as follows.
Speaking mode | isolated word recognition, continuous speech recognition, read speech, conversational (spontaneous) speech
Speaker | speaker-dependent, speaker-independent, speaker adaptation
Vocabulary size | small (1~99), medium (100~999), large (1000 or more)
Language model | finite-state network, context-sensitive grammar
Perplexity | low (<10), high (>100): the ambiguity and acoustic confusability of the words
SNR (signal-to-noise ratio) | high (>30 dB), low (<10 dB): the level of noise from the surrounding environment
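The SNR row can be made concrete with a small calculation. The sketch below assumes the usual definition, SNR in dB = 10 log10(signal power / noise power), and applies the two thresholds from the table. The function names and the "intermediate" label for the range the table leaves unnamed are assumptions made for illustration.

```python
import math


def mean_power(samples):
    """Average power of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


def snr_class(db):
    """Apply the table's thresholds; 'intermediate' is an assumed label."""
    if db > 30:
        return "high"
    if db < 10:
        return "low"
    return "intermediate"


# Toy data: a unit-amplitude signal and noise at 1/100 of its amplitude.
signal = [1.0, -1.0] * 50
noise = [0.01, -0.01] * 50
level = snr_db(signal, noise)
print(round(level, 1), snr_class(level))
```

An amplitude ratio of 100 corresponds to a power ratio of 10,000, so the toy signal falls in the table's high-SNR class.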
Speech recognition draws on many different techniques, and speech recognition, analysis, and understanding together involve many processing stages.
Typically, speech recognition starts with digital sampling of the speech signal. The next stage is acoustic signal processing. Most techniques include spectral analysis, for example LPC analysis (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea modelling, and so on.
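The sampling and spectral-analysis stages can be sketched as follows. This is a minimal illustration, not LPC or MFCC: it cuts a sampled signal into overlapping frames, applies a Hamming window, and computes a power spectrum with a naive DFT. The sample rate, test tone, frame length, and hop size are all assumed values chosen to keep the example small.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def split_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Power spectrum of one windowed frame via a naive DFT (bins 0..n/2)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


SAMPLE_RATE = 8000   # Hz, assumed
FRAME_LEN = 64       # samples per frame, assumed
HOP = 32             # samples between frame starts, assumed

# "Digital sampling" of a 1000 Hz test tone.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]
frames = split_frames(tone, FRAME_LEN, HOP)
spec = power_spectrum(frames[0])
peak_bin = max(range(len(spec)), key=lambda k: spec[k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(frames), peak_bin, peak_hz)
```

A real front end would use an FFT and go on to derive features such as MFCCs from each frame's spectrum; the naive DFT here only keeps the example dependency-free.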
The next stage is the recognition of phonemes, which also covers groups of phonemes and words. The techniques used at this stage include DTW (Dynamic Time Warping), HMM (hidden Markov modelling), neural networks, expert systems, and combinations of these techniques.
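Of the techniques listed, DTW is the simplest to show in a few lines. The sketch below assumes one-dimensional feature sequences and an absolute-difference local cost; real systems compare frame-level feature vectors. The template words and their values are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(query, templates):
    """Return the label of the template closest to the query under DTW."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical one-dimensional "feature tracks" for two isolated words.
templates = {
    "yes": [1, 3, 4, 3, 1],
    "no": [4, 2, 1, 1, 4],
}
# The same contour as "yes", spoken more slowly (some frames repeated).
query = [1, 1, 3, 4, 4, 3, 1]
print(recognize(query, templates))
```

Because the warping path may repeat frames, the slower utterance aligns with the "yes" template at zero cost, which is the property that made DTW attractive for isolated word recognition.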
In addition to the recognition process itself, most systems make use of some knowledge of the language. Some systems also try to "understand" the speech, converting the recognized words into a representation of what the speaker intended to say.
Speech recognition algorithms include DTW (Dynamic Time Warping), HMM (hidden Markov modelling), and neural networks. Until recently, the most widely used and most successful of these was the HMM (hidden Markov model). An HMM is a doubly stochastic model: it represents both the generation of the underlying phoneme sequence and its frame-level surface acoustic realization probabilistically, as a Markov process. Neural networks are sometimes used to estimate the frame-level scores, and are also combined with HMM systems to form hybrid models. ...
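The two layers of an HMM, a hidden state sequence and the frame-level observations it emits, can be illustrated with Viterbi decoding, which recovers the most likely hidden sequence for a run of observed frames. This is a toy sketch: the two states, the two observation symbols, and every probability below are invented for illustration and do not come from any real acoustic model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state path for the observations (log domain)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical model: hidden phoneme-like states "a" and "b",
# each tending to emit one of two frame labels "x" and "y".
states = ["a", "b"]
start_p = {"a": 0.5, "b": 0.5}
trans_p = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.3, "b": 0.7}}
emit_p = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.1, "y": 0.9}}

path, log_prob = viterbi(["x", "x", "y", "y"], states, start_p, trans_p, emit_p)
print(path)
```

In a hybrid system of the kind mentioned above, the emission scores in `emit_p` would be supplied by a neural network rather than a fixed table.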
Speech recognition faces the following problems.
Because of problems like these, a speaker may say "노고지리" (nogojiri) and have it heard as "노고리비" (nogoribi), or say "아빠" (appa) and have it interpreted as "아파" (apa), "업어" (eobeo), or "앞발" (apbal). This is what makes speech recognition difficult.
Similarity within the vocabulary directly affects the performance of a recognition system. In general, similarity within the vocabulary is divided into ambiguity and confusability. Acoustic ambiguity refers to words that show similar acoustic characteristics, such as "know" and "no", or "two", "too", and "to". Confusability refers to confusion caused by partial similarity between words, as in "bee", "see", and "pea". Both ambiguity and confusability are amplified further when the speaker is not a native speaker. Because acoustic ambiguity is generally hard to resolve at the acoustic level, it has to be handled at a higher level (that is, the linguistic or prosodic level). Acoustic confusability can be resolved to some extent at the acoustic level, but better performance requires processing at a higher level as well.
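One way to picture "handling ambiguity at a higher level" is to let the preceding word choose among acoustically identical candidates. The sketch below uses a tiny bigram score table for this. The sound labels, the candidate lists, and all scores are hypothetical values made up for the example, not measured statistics.

```python
# Candidate spellings for two acoustically ambiguous sounds (labels are assumed).
HOMOPHONES = {
    "tu": ["two", "too", "to"],
    "no": ["know", "no"],
}

# Hypothetical bigram scores: how plausible a word is after the previous word.
BIGRAM_SCORE = {
    ("want", "to"): 0.6, ("want", "two"): 0.3, ("want", "too"): 0.1,
    ("me", "too"): 0.8, ("me", "two"): 0.1, ("me", "to"): 0.1,
    ("i", "know"): 0.9, ("i", "no"): 0.1,
}


def pick_word(previous_word, sound):
    """Choose the candidate with the highest bigram score after previous_word."""
    candidates = HOMOPHONES[sound]
    return max(candidates,
               key=lambda word: BIGRAM_SCORE.get((previous_word, word), 0.0))


print(pick_word("want", "tu"), pick_word("me", "tu"), pick_word("i", "no"))
```

The acoustic evidence is the same in every call; only the linguistic context changes the outcome, which is the division of labor the paragraph describes.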
Beyond the difficulties described above, many kinds of signal-related variability make speech recognition hard. First, the acoustic realization of a phoneme, the smallest unit that makes up a word, depends heavily on the context in which it appears. This phonetic variability can be illustrated by the different pronunciations of /t/ in the English words "two", "true", and "butter". Contextual variation is even stronger at word boundaries; in Korean, for example, "맛있다" (masitda) is pronounced like "마시따" (masitta) or "마디따" (maditta). Second, there is acoustic variability caused by the position and characteristics of the transducer. Third, there is within-speaker variability, caused by changes in speaking rate or voice quality that depend on the speaker's physical or emotional state. Finally, there is across-speaker variability, which arises from sociolinguistic differences between speakers; a typical example is the difference in the size and shape of the vocal tract.