Introduction
Human Perception of Sound

Dataset
Building
- Who are the users
- What do they need
- What task are they trying to solve
- How do they interact with the system
  - Distance
  - Environment
    - Background Noise
    - Reverb
- Quality Control
  - Only keep whatever a human can understand
Industry-Standard
- Google Speech Commands dataset
  - Recorded as individual words, not sentences
  - 1000-4000 examples of each word

Good Characteristics of Model
| Volume Invariance | ![]() |
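
One common way to approach volume invariance is to rescale every clip to the same peak level before it reaches the model. The sketch below is only an illustration under that assumption: `peak_normalize`, the 440 Hz test tone, and the 16 kHz sample rate are hypothetical choices, not something these notes prescribe.

```python
import numpy as np

def peak_normalize(signal: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a waveform so that its largest absolute sample is 1.0."""
    peak = np.max(np.abs(signal))
    # Guard against an all-zero (silent) clip to avoid dividing by zero.
    return signal / max(peak, eps)

# Hypothetical example: the same tone recorded loud and quiet.
sample_rate = 16_000                      # assumed sample rate in Hz
t = np.arange(sample_rate) / sample_rate  # one second of timestamps
loud = 0.9 * np.sin(2 * np.pi * 440 * t)
quiet = 0.1 * loud

loud_norm = peak_normalize(loud)
quiet_norm = peak_normalize(quiet)

# After normalization the two clips are numerically the same signal.
print(np.allclose(loud_norm, quiet_norm))
```

Peak normalization is sensitive to a single loud click. RMS-based normalization is a possible alternative when clips contain such outliers.
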
Pre-Processing
Which aspects of the signal should you send to the neural network?
- Align on start point
- Normalization of amplitude
- Denoise
- Convert to the frequency domain using the fast Fourier transform (FFT)
- Extract features
- Sliding window
- Cut on end point
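
Several of the steps above (amplitude normalization, the sliding window, and the FFT) can be sketched together as a magnitude spectrogram. This is a minimal illustration, not the pipeline these notes assume: the function name `spectrogram`, the 25 ms frame (400 samples), the 10 ms hop (160 samples), the Hann window, and the 16 kHz sample rate are all assumptions. Start and end point alignment and denoising are left out.

```python
import numpy as np

def spectrogram(signal: np.ndarray, frame_len: int = 400, hop_len: int = 160) -> np.ndarray:
    """Magnitude spectrogram from a sliding Hann window followed by an FFT.

    Assumes len(signal) >= frame_len. Returns shape (n_frames, frame_len // 2 + 1).
    """
    # Normalize amplitude so that loud and quiet clips look alike.
    peak = np.max(np.abs(signal))
    signal = signal / max(peak, 1e-12)

    # Slide a window over the signal; trailing samples that do not fill
    # a whole frame are dropped.
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = 0.3 * np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
# Each FFT bin is sample_rate / frame_len = 40 Hz wide in this setup.
peak_bin = int(np.argmax(spec.mean(axis=0)))
print(spec.shape, peak_bin * sample_rate / 400)
```
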
| Word | Volume | Waveform | Spectrogram | MFCC |
|---|---|---|---|---|
| Yes | Loud | ![]() | ![]() | ![]() |
| Yes | Quiet | ![]() | ![]() | |
| No | Loud | ![]() | ![]() | ![]() |
| No | Quiet | ![]() | ![]() | |
Mel Filterbanks
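
A mel filterbank is usually built as a set of overlapping triangular filters whose centers are evenly spaced on the mel scale. The sketch below assumes the widely used conversion m = 2595 * log10(1 + f / 700), which is one of several conventions in use. The function names, the 26 filters, the 400-point FFT, and the 16 kHz sample rate are illustrative assumptions, not values taken from these notes.

```python
import numpy as np

def hz_to_mel(f):
    """Convert Hz to mel using m = 2595 * log10(1 + f / 700) (one common convention)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters: int = 26, n_fft: int = 400, sample_rate: int = 16_000) -> np.ndarray:
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    # n_filters + 2 edge points, evenly spaced in mel from 0 Hz to Nyquist.
    mel_points = np.linspace(float(hz_to_mel(0.0)),
                             float(hz_to_mel(sample_rate / 2)),
                             n_filters + 2)
    hz_points = mel_to_hz(mel_points)

    # Center frequency in Hz of every FFT bin.
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)

    bank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (center - left)     # 0 at left edge, 1 at center
        falling = (right - bin_freqs) / (right - center)  # 1 at center, 0 at right edge
        # The triangle is the lower of the two slopes, clipped at zero.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filterbank()
# Applying the bank to a magnitude spectrogram of shape (n_frames, 201)
# would give mel energies of shape (n_frames, 26): spec @ bank.T
print(bank.shape)
```

Low-frequency filters come out narrow and high-frequency filters wide, which mirrors the finer pitch resolution of human hearing at low frequencies.
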











