The Transactions of the Korean Institute of Electrical Engineers

  1. (Dept. of Computer Science, Sangmyung University, Korea.)



Emotion recognition, Acoustic feature, Facial image, Deep learning

1. Introduction

With recent technological advances in the information society, high-performance personal computers have rapidly become commonplace. Accordingly, human-computer interaction is being actively studied and is evolving into forms that are easier for users to understand and use. Enabling computers to better understand human emotion has therefore become an important problem. Recognizing a user's emotional state requires collecting several kinds of biometric information, such as language, facial expression, speech, gesture, EEG, and heart rate, and designing a recognition system around them. Emotion recognition is accordingly an active research topic, particularly in signal processing (1). An emotion recognition interface aims to accurately extract and recognize the user's emotional state and to provide a service suited to it. Because emotion is an individual's subjective feeling in response to external stimuli such as physical stimulation or psychological experience, recognizing a user's emotional state requires jointly analyzing diverse inputs such as the speech signal, facial expression, and video.

Many recent studies have improved emotion recognition performance by combining facial images with deep-learning models (2-4). Facial emotion recognition from images usually consists of three stages: 1) detecting the face in the input image, 2) extracting facial features, and 3) recognizing the emotion. In conventional methods it was important to extract appropriate emotional features from the face image, and capturing momentary changes in those features, such as movements of the facial muscles, was necessary to raise the recognition rate (5). Convolutional Neural Networks (CNNs) have been the most widely used models in facial emotion recognition. A CNN convolves the input image with multiple filters and automatically produces feature maps. The feature maps are combined by fully connected layers, which classify the emotional expression into classes (6).

The speech signal, another kind of data in which human emotion is revealed, is the most natural medium of communication between people. Along with linguistic content, it carries information that reflects the speaker's emotion, such as intonation, loudness, and speaking rate. The central problem in a speech emotion recognition system is therefore to extract appropriate acoustic features, such as pitch, formants, and energy, from the user's speech signal and to pair them with a suitable classifier. MFCCs (Mel-Frequency Cepstral Coefficients) have been the most common choice of acoustic feature. However, because there is no clear link between the user's emotional state and the acoustic features extracted from the speech signal, the recognition rate is relatively low compared with facial emotion recognition and other forms of emotion recognition. Extracting appropriate acoustic features and reflecting them in the model is therefore important for raising the recognition rate.

People generally recognize the emotions of others, such as happiness, sadness, anger, and neutrality, from their speech and facial expressions. According to earlier studies, verbal components account for one third of human communication and nonverbal components for two thirds (7,8). Facial expression is a representative nonverbal component. From the standpoint of human perception and cognition, it is natural that using the speech signal and facial images together helps a computer recognize human emotion more accurately and more naturally. However, because the methods for recognizing emotion from speech signals and from facial image sequences have different characteristics, combining the two inputs remains an open research question in emotion recognition. This paper therefore proposes a method that fuses the speech signal and the facial image sequence through Joint Fine-Tuning to raise the recognition rate.

To address the problem of combining the two inputs, we design three deep networks. The first, image-based model is trained on facial image sequences so that it can capture changes in facial expression. The second, also image-based, is trained on face landmarks so that it reflects facial movement. The third, speech-based model is trained on acoustic features extracted from the speech signal in synchrony with the image sequence. The three models are integrated by Joint Fine-Tuning. The resulting recognition rate is 86.06%.

The rest of this paper is organized as follows. Section 2 describes the database used in the study. Section 3 describes the preprocessing of the database, the three models, and the Joint Fine-Tuning method that integrates them. Section 4 presents the experimental environment and procedure and compares accuracy across models, and Section 5 concludes.

2. Database

This paper uses the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) (9). The database covers eight emotional states: neutral, calm, happy, sad, angry, fearful, disgust, and surprised. It consists of videos in which an actor speaks a line while making the facial expression for each emotion. The language is North American English, and there are 24 actors in total. Each recording is available in three formats, audio-visual (AV), video-only (VO), and audio-only (AO), and the database is made up of 104 audio-visual (AV) data items and song data.

Of these, 4,320 audio-visual (AV) items were used in this paper. Every emotion except neutral is expressed at two intensity levels, normal and strong, so that the set spans from high-intensity examples to the lower-intensity expressions found in everyday life. Neutral and calm were both chosen as baseline emotions. Neutral expressions are often mixed with a slight negative emotion and so convey neutrality poorly; calm, which is mixed with a slight positive emotion, was added to restore the emotional balance.

The database was validated by 247 raters, each of whom rated a subset of the 7,356 files, and 72 raters provided intra-rater test-retest data for reliability. For validation, raters were asked to label the expressed emotion. Unlike the validation of earlier facial emotion databases, RAVDESS must validate orofacial movements in which movements carrying lexical content interact with movements tied to emotional expression, so emotional accuracy, intensity, and genuineness were measured for every stimulus. To support the selection of suitable stimuli, each stimulus is assigned a Goodness score, which ranges from 0 to 10 and is a weighted sum of mean accuracy, intensity, and genuineness. The score is defined so that stimuli rated higher on accuracy, intensity, and genuineness receive a higher Goodness score.

Fig. 1. Examples from the RAVDESS dataset

3. Proposed Method

3.1 Preprocessing

In Fig. 2, the shaded regions are non-speech intervals in which the actor is preparing to express the emotion or has finished doing so. This paper needs the face images and the speech signal only from the part where the actor expresses the emotion, that is, the spoken part. As Fig. 2 shows, the speech signal divides into parts where the actor expresses the emotion and parts where the actor does not. Because the non-expressive parts are unnecessary information that can change the model's accuracy, a preprocessing step that removes them is needed to improve accuracy. Since speech intervals have higher energy than non-speech intervals, we use the Integral Absolute Value (IAV) feature to separate the two. It is defined as follows (10).

$\bar{X}=\sum_{i=1}^{N}\left| X(i\Delta t)\right|$ (1)

where $X$ is the measured signal, $\Delta t$ is the sampling interval, $N$ is the number of samples, and $i$ is the sample index.

Fig. 2. Speech signal and image sequence from a video

First, the maximum and minimum energy values of the signal are found, and the IAV threshold is set to the minimum plus 10% of the difference between the maximum and the minimum. If 70% of the maximum is smaller than the minimum, the threshold is instead set at 20% of the maximum. Fig. 3 shows an example of this procedure.

Fig. 3. An example of determining the IAV threshold

Speech intervals are found frame by frame: the point at which the IAV value rises above the threshold is taken as the start point, and the first later point at which it falls below the threshold is taken as the end point. Speech intervals were extracted with this method, and the image sequence was sampled to match them. The image sequence is sampled at 30 Hz, so the analysis unit is 33.33 ms. The window size for the speech signal, which is sampled at 48,000 Hz, is therefore set to 1,600 samples to match the sampling frequency of the image sequence.
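The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, frames are assumed to be non-overlapping, and the fallback threshold is read as exactly 20% of the maximum.

```python
from typing import List, Tuple

# 48,000 Hz audio / 30 Hz video = 1,600 audio samples per video frame
FRAME_LEN = 48_000 // 30


def frame_iav(signal: List[float], frame_len: int = FRAME_LEN) -> List[float]:
    """Eq. (1) applied per frame: sum of absolute sample values."""
    n_frames = len(signal) // frame_len
    return [
        sum(abs(x) for x in signal[k * frame_len:(k + 1) * frame_len])
        for k in range(n_frames)
    ]


def iav_threshold(iav: List[float]) -> float:
    """Threshold rule from the text (assumed reading of the fallback case)."""
    lo, hi = min(iav), max(iav)
    if 0.7 * hi < lo:
        return 0.2 * hi
    return lo + 0.1 * (hi - lo)


def speech_segments(iav: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs; end_frame is exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None
    for k, value in enumerate(iav):
        if value > threshold and start is None:
            start = k                      # energy rose above the threshold
        elif value <= threshold and start is not None:
            segments.append((start, k))    # energy fell back below it
            start = None
    if start is not None:                  # signal ended while still in speech
        segments.append((start, len(iav)))
    return segments
```

Because one audio frame spans exactly one video frame, the frame indices returned by `speech_segments` can be used directly to select the matching images from the 30 Hz sequence.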

3.2 Emotion Recognition Models

This paper designs three models and integrates them. Two are image-based: one takes face images as input and the other takes face landmarks. The third is speech-based. To define its input in terms of acoustic features, we started from the acoustic features commonly used in speech emotion recognition studies (15,16) and recombined them into an optimal feature set. The three models are integrated with the Joint Fine-Tuning method (11).

3.2.1 Face Image Model

Fig. 4. Structure of the image-based model for a face image sequence

The model is CNN-based and takes a face image sequence as input to recognize changes in the face. Every image is converted to grayscale and resized to 64x64 px. Each convolution layer is a 2D convolution layer with a (3, 3) kernel, and ReLU is the activation function. The timestep is set to 10, so 10 images are fed to the convolution layers at a time. Because the weights are not shared along the time axis, each kernel produces a different feature map at each time step. The feature maps are stacked and passed directly to an LSTM layer. Its output is connected to a fully connected layer, and a final softmax layer infers the emotion probabilities. Weight decay and dropout are used for regularization.

Here, regularization means modifying the learning algorithm so as to reduce the error on the test data used for evaluation, as distinct from the training error; weight decay and dropout are two such methods. Weight decay limits the growth of the weight values and thereby reduces model complexity. Dropout prevents overfitting by leaving a fixed fraction of the network's nodes out of training.
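As a rough illustration of the two regularizers, the sketch below writes weight decay as an L2 penalty added to the loss and dropout in its common "inverted" form. Both choices, the function names, and the decay coefficient are assumptions; the text does not state which variants or values were used for this model.

```python
import random
from typing import List


def l2_penalty(weights: List[float], decay: float) -> float:
    """Weight-decay term added to the training loss: decay * sum(w^2)."""
    return decay * sum(w * w for w in weights)


def dropout(activations: List[float], rate: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate`,
    and rescale the surviving units so the expected activation is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Dropout is applied only during training; at test time all units are kept, which is why the surviving activations are rescaled by `1 / keep` here.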

3.2.2 Face Landmark Model

Fig. 5. Structure of the image-based model for face landmarks

Face landmarks capture facial movement so that the facial expression can be inferred. They are extracted over the same interval as the face images using dlib, a high-performance C++ face recognition library. Of the 68 landmarks that dlib provides, this paper uses 49, excluding the 17 on the face contour and the 2 on the inner side of the lip corners. The face contour contributes little to recognizing expression, and the outer lip corners together with the middle of the lips are enough to recognize the mouth shape, so the inner lip-corner landmarks were dropped. With a timestep of 10, the 49 landmarks taken from each of 10 images each consist of an (x, y) coordinate pair, and the landmark vector is flattened to one dimension, so a feature vector of 980 values is fed to the fully connected layers. ReLU is the activation function, and dropout is used for regularization.
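The landmark bookkeeping (68 - 17 - 2 = 49 points, and 10 x 49 x 2 = 980 values) can be checked with a small sketch. It assumes the usual 68-point layout in which indices 0-16 trace the face contour and indices 60 and 64 are the inner lip corners; those index values and the helper name `landmark_vector` are assumptions, not details given in the text.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

JAW = set(range(0, 17))           # 17 face-contour points (assumed indices)
INNER_LIP_CORNERS = {60, 64}      # 2 inner lip-corner points (assumed indices)
EXCLUDED = JAW | INNER_LIP_CORNERS
KEPT = [i for i in range(68) if i not in EXCLUDED]   # 49 remaining points


def landmark_vector(frames: Sequence[Sequence[Point]]) -> List[float]:
    """Flatten the kept (x, y) landmarks of every frame into one 1-D vector."""
    vector: List[float] = []
    for points in frames:
        if len(points) != 68:
            raise ValueError("expected 68 landmarks per frame")
        for i in KEPT:
            x, y = points[i]
            vector.extend((float(x), float(y)))
    return vector
```

Passing the 68 detected points for each of 10 consecutive frames yields the 980-value input described above.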

3.2.3 Speech-Based Model

Acoustic features define the input of the speech-based model. When assembling the feature set, we included harmonic features, which reflect the harmonic structure of the voice well but were rarely used in earlier emotion recognition studies. To select the feature set, we surveyed the features widely used in speech emotion recognition studies, analyzed each feature individually, statistically selected those best suited to emotion classification, and recombined them. The 43 selected acoustic features are as follows.

∙ 13 MFCCs

∙ 11 spectral features: spectral centroid, spectral bandwidth, 7 spectral contrast, spectral flatness, spectral roll-off

∙ 12 chroma features: 12-dimensional chroma vector

∙ 7 harmonic features: inharmonicity, 3 tristimulus, harmonic energy, noise energy, noisiness

Fig. 6 shows the structure of the speech-based model.

Fig. 6. Structure of the speech-based model

Like the two preceding models, the speech-based model uses a timestep of 10 and extracts 43 acoustic features from each frame of the signal. The resulting 430 feature values are fed to an LSTM layer. Its output is connected to a fully connected layer, and the final softmax layer infers the probability of each emotion. Dropout with a rate of 0.5 is applied to each layer for regularization.
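The feature layout can be written down explicitly to confirm the counts (43 features per frame, 430 values per 10-step window). The sketch below only groups already-extracted per-frame vectors. The dictionary keys, the helper name, and the use of non-overlapping windows are assumptions made for illustration.

```python
from typing import Dict, List

# Per-frame feature groups listed in the text; the key names are illustrative.
FEATURE_GROUPS: Dict[str, int] = {
    "mfcc": 13,
    "spectral_centroid": 1,
    "spectral_bandwidth": 1,
    "spectral_contrast": 7,
    "spectral_flatness": 1,
    "spectral_rolloff": 1,
    "chroma": 12,
    "inharmonicity": 1,
    "tristimulus": 3,
    "harmonic_energy": 1,
    "noise_energy": 1,
    "noisiness": 1,
}
FEATURES_PER_FRAME = sum(FEATURE_GROUPS.values())
TIMESTEPS = 10


def to_lstm_input(frame_features: List[List[float]]) -> List[List[List[float]]]:
    """Group per-frame feature vectors into windows of TIMESTEPS frames.

    Non-overlapping windows are an assumption; leftover frames are dropped.
    """
    for features in frame_features:
        if len(features) != FEATURES_PER_FRAME:
            raise ValueError("each frame must carry 43 feature values")
    last_start = len(frame_features) - TIMESTEPS
    return [
        frame_features[s:s + TIMESTEPS]
        for s in range(0, last_start + 1, TIMESTEPS)
    ]
```

Each returned window has shape (10, 43), which is the form an LSTM layer with a timestep of 10 would consume.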

3.2.4 Joint Fine-Tuning

Fig. 7. Integration of the three models with Joint Fine-Tuning

To combine the three models, this paper uses the Joint Fine-Tuning method of prior work (11). First, each model is pre-trained separately with a softmax as its last layer. After training, only the fully connected layers of each model are carried into the new integrated model, and the weights of the layers trained earlier are frozen. Finally, the three fully connected layers are retrained and connected to a single softmax layer in the integrated model, which infers the emotion probabilities.
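The forward pass of the integrated model can be sketched as follows. The text only states that the three retrained fully connected layers feed a single softmax layer; combining them by an element-wise sum of their per-class outputs is an assumption made here for illustration, and training (freezing the lower layers, retraining the fully connected layers) is not shown.

```python
import math
from typing import List, Sequence

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(branch_logits: Sequence[Sequence[float]]) -> List[float]:
    """Sum the per-class outputs of each branch (assumed fusion rule),
    then apply one shared softmax to obtain emotion probabilities."""
    summed = [sum(values) for values in zip(*branch_logits)]
    return softmax(summed)
```

With one score vector each from the face image, face landmark, and speech branches, `fuse` returns eight probabilities, and the predicted emotion is the entry of `EMOTIONS` at the index of the largest one.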

4. Experiments

To verify the performance of the proposed model, we ran comparison experiments against models that use images, models that use speech signals, and models that use both. Jung et al. (11) combined face images and face landmarks obtained from image sequence data. Wang et al. (12), Ma et al. (13), and Hossain et al. (14) converted the speech signal to a Mel-frequency spectrum and combined it with face images in a CNN model. Zamil et al. (15) and Shaqra et al. (16) extracted acoustic features from the speech signal and recognized emotion with a logistic model tree and a multilayer perceptron neural network, respectively. All models were evaluated in the same environment; Table 1 lists the software and hardware used.

Table 1. Specifications of the software and hardware used in the experiment

| Item | Specification |
| --- | --- |
| Operating system | Ubuntu 18.04 LTS |
| TensorFlow | 1.15 |
| CUDA | 10.1 |
| CPU | Intel Core i7-4770 |
| GPU | GeForce GTX 1080 Ti x 1 |
| RAM | 16 GB |

Table 2. Model accuracy comparison

| Model | Input | Accuracy |
| --- | --- | --- |
| (11) | Image | 82.816% |
| (12) | Image, Speech | 77.66% |
| (13) | Image, Speech | 77.31% |
| (14) | Image, Speech | 75.62% |
| Proposed model | Image, Speech | 86.06% |
| (15) | Speech | 67.14% |
| (16) | Speech | 74% |

All models were compared on the RAVDESS database. As noted in Section 2, RAVDESS provides three formats: AV, VO, and AO. Each existing model was given the data format that matches its input, and the proposed model, which uses both images and speech, was trained and tested on the AV data. The data comprise 4,320 items. For validation we used 10-fold cross-validation: the data were randomly split into 90% for training and 10% for testing, and this was repeated 10 times. Every item was thus used for both training and evaluation, which improves the reliability of the result. For each model, the accuracies of the 10 runs were averaged to give the final accuracy. Table 2 shows the results.
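The evaluation protocol can be sketched as a plain 10-fold split. With 4,320 items, each fold holds 432 test items and 3,888 training items. The shuffling seed and the function names below are hypothetical, and the model training step is left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every item is tested exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)     # random assignment to folds
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = order[start:start + size]
        train = order[:start] + order[start + size:]
        yield train, test
        start += size


def mean_accuracy(fold_accuracies: List[float]) -> float:
    """Final score: the average of the per-fold accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)
```

A model would be trained on `train` and scored on `test` for each of the 10 folds, and `mean_accuracy` would then give the single figure reported per model.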

The proposed model, which uses Joint Fine-Tuning to integrate face images, face landmarks, and the speech signal, reached an accuracy of 86.06%. The model of (11), which uses the same integration method but no speech signal, reached 82.816%, about 3.2 percentage points below the proposed model, as Table 2 shows. The models of (12-14) use images and speech signals like the proposed model, but with a different integration method and without face landmark data, and reached lower accuracies of 75.62% to 77.66%. As described above, the proposed model uses 43 acoustic features selected through our analysis. By comparison, (15), which uses only the speech signal, reached a lower accuracy of 67.14%. The model of (16) also uses only the speech signal, with features extracted by the openSMILE acoustic feature extraction tool, and reached 74%, again below the proposed model. Table 2 thus shows that the proposed model achieved the highest accuracy.

5. Conclusion

Although there are many emotion recognition methods based on biosignals, in ordinary communication people do not recognize each other's emotions by, for example, measuring one another's heart rate. They rely mainly on what others say and on their facial expressions. Recognizing emotion from a person's voice and facial expression can therefore be called a literally human approach.

This paper went beyond recognizing emotion from face images alone and improved recognition performance by adding the speech signal. We designed a face-image model that detects overall changes in the face, a face-landmark model that tracks the movement of expression-related feature points, and a speech-based model whose input is defined by features selected for emotion classification, and integrated them so that the characteristics of each input are reflected. In future work we will study how the models are combined in more detail in order to raise the recognition rate further.

Acknowledgements

This work was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government(MSIT) (No. NRF-2019R1F1A1050052).

References

1. S. Zhang, S. Zhang, T. Huang, W. Gao, 2018, Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching, IEEE Trans Multimed 20, pp. 1576-1590
2. S. Li, W. Deng, 2020, Deep facial expression recognition: A survey, IEEE Trans Affective Comp (Early Access)
3. N. Sun, L. Qi, R. Huan, J. Liu, G. Han, 2019, Deep spatial-temporal feature fusion for facial expression recognition in static images, Pattern Recognit Lett 119, pp. 49-61
4. Myeong Oh Lee, Ui Nyoung Yoon, Seunghyun Ko, Geun-Sik Jo, 2019, Efficient CNNs with Channel Attention and Group Convolution for Facial Expression Recognition, Journal of KIISE, Vol. 46, No. 12, pp. 1241-1248
5. J. Hamm, C. G. Kohler, R. C. Gur, R. Verma, 2011, Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders, J Neurosci Methods 200, pp. 237-256
6. B. C. Ko, 2018, A brief review of facial emotion recognition based on visual information, Sensors 18
7. A. Mehrabian, 1968, Communication without words, Psychol Today 2, pp. 53-56
8. K. Kaulard, D. W. Cunningham, H. H. Bülthoff, C. Wallraven, 2012, The MPI facial expression database: A validated database of emotional and conversational facial expressions, PLoS ONE 7, e32321
9. S. R. Livingstone, F. A. Russo, 2018, The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English, PLoS ONE, Vol. 13, No. 5, e0196391
10. Sung-Woo Byun, Seok-Pil Lee, 2016, Emotion Recognition Using Tone and Tempo Based on Voice for IoT, The Transactions of the Korean Institute of Electrical Engineers, Vol. 65, No. 1
11. H. Jung, S. Lee, J. Yim, S. Park, J. Kim, 2015, Joint fine-tuning in deep neural networks for facial expression recognition, 2015 IEEE Int Conf Comput Vision (ICCV)
12. Wang Xusheng, Chen Xing, Cao Congjun, Human emotion recognition by optimally fusing facial expression and speech feature
13. Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, A. Kosir, 2019, Audiovisual emotion fusion (AVEF): A deep efficient weighted approach, Inf Fusion 46, pp. 184-192
14. M. S. Hossain, G. Muhammad, 2019, Emotion recognition using deep learning approach from audio-visual emotional big data, Inf Fusion 49, pp. 69-78
15. A. A. A. Zamil, S. Hasan, S. J. Baki, J. Adam, I. Zaman, 2019, Emotion detection from speech signals using voting mechanism on classified frames, 2019 Int Conf Robotics, Electr Signal Processing Technol (ICREST)
16. F. A. Shaqra, R. Duwairi, M. Al-Ayyoub, 2019, Recognizing emotion from speech based on age and gender using hierarchical models, Procedia Comput Sci 151, pp. 37-44

About the Authors

Myoung-jin Son

Myoung-jin Son received the BS degree in Computer Science from Sangmyung University, Seoul, Korea, in 2018. She is now a Master's student in the Department of Computer Science at Sangmyung University. Her main research interests include signal processing, artificial intelligence, and digital audio processing.

Seok-Pil Lee

Seok-Pil Lee received BS and MS degrees in electrical engineering from Yonsei University, Seoul, Korea, in 1990 and 1992, respectively. In 1997, he earned a PhD degree in electrical engineering, also at Yonsei University. From 1997 to 2002, he worked as a senior research staff member at Daewoo Electronics, Seoul, Korea. From 2002 to 2012, he was head of the digital media research center at the Korea Electronics Technology Institute. He also worked as a research staff member at Georgia Tech, Atlanta, USA, from 2010 to 2011. He is currently a professor in the Department of Electronic Engineering, Sangmyung University. His research interests include artificial intelligence, digital audio processing, and multimedia search.