TY - GEN
T1 - Scalable HMM based inference engine in large vocabulary continuous speech recognition
AU - Chong, Jike
AU - You, Kisun
AU - Yi, Youngmin
AU - Gonina, Ekaterina
AU - Hughes, Christopher
AU - Sung, Wonyong
AU - Keutzer, Kurt
PY - 2009
Y1 - 2009
N2 - Parallel scalability allows an application to efficiently utilize an increasing number of processing elements. In this paper we explore a design space for application scalability for an inference engine in large vocabulary continuous speech recognition (LVCSR). Our implementation of the inference engine involves a parallel graph traversal through an irregular graph-based knowledge network with millions of states and arcs. The challenge is not only to define a software architecture that exposes sufficient fine-grained application concurrency, but also to efficiently synchronize between an increasing number of concurrent tasks and to effectively utilize the parallelism opportunities in today's highly parallel processors. We propose four application-level implementation alternatives we call "algorithm styles", and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest performing algorithm style varies with the implementation platform. On a 44-minute speech data set, we demonstrate substantial speedups of 3.4x on Core i7 and 10.5x on GTX280 compared to a highly optimized sequential implementation on Core i7, without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.
AB - Parallel scalability allows an application to efficiently utilize an increasing number of processing elements. In this paper we explore a design space for application scalability for an inference engine in large vocabulary continuous speech recognition (LVCSR). Our implementation of the inference engine involves a parallel graph traversal through an irregular graph-based knowledge network with millions of states and arcs. The challenge is not only to define a software architecture that exposes sufficient fine-grained application concurrency, but also to efficiently synchronize between an increasing number of concurrent tasks and to effectively utilize the parallelism opportunities in today's highly parallel processors. We propose four application-level implementation alternatives we call "algorithm styles", and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest performing algorithm style varies with the implementation platform. On a 44-minute speech data set, we demonstrate substantial speedups of 3.4x on Core i7 and 10.5x on GTX280 compared to a highly optimized sequential implementation on Core i7, without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, promising scalability and significant potential for further speedup on future platforms.
UR - http://www.scopus.com/inward/record.url?scp=70449559088&partnerID=8YFLogxK
U2 - 10.1109/ICME.2009.5202871
DO - 10.1109/ICME.2009.5202871
M3 - Conference contribution
AN - SCOPUS:70449559088
SN - 9781424442911
T3 - Proceedings - 2009 IEEE International Conference on Multimedia and Expo, ICME 2009
SP - 1797
EP - 1800
BT - Proceedings - 2009 IEEE International Conference on Multimedia and Expo, ICME 2009
T2 - 2009 IEEE International Conference on Multimedia and Expo, ICME 2009
Y2 - 28 June 2009 through 3 July 2009
ER -