Some Acoustic Design Guidelines

This chapter presents a brief guide to a number of introductory acoustic considerations that designers should take into account when integrating the XVF3800 into their end product.

It should be stressed that the closer the acoustic design is to ideal, the fewer compromises will be needed when configuring the XVF3800. Designers should therefore invest time in the acoustic design of the end product in order to optimise overall product performance.

Note that the requirements discussed here are only intended as general guidelines to help devise a more precise product specification. The actual specification will depend on the intended application. For example, even within a telecommunications application, a handset, personal speakerphone, or shared space speakerphone will each have very different technical specifications.

Additionally, many certification requirements have multiple available levels of certification. For example, Microsoft Teams Audio Requirements has a basic and a premium certification level, where the premium level features more stringent requirements. A similar situation with basic and premium criteria can be found in Amazon's certification requirements for voice services.

This document therefore does not attempt to cover every potential requirement or application. These must be decided on a case-by-case basis to match the intended application of the product and the certification level being targeted.

Instead, this document will cover the technical areas that should be considered when designing the product specification. Where numerical values are given, they should not be considered recommended or required, but instead are provided as a “ball-park” figure for context. These figures have been chosen based on a generic product design for a typical smart speaker or personal speakerphone which aims to pass the basic level of certification requirements.

Microphones

The XVF3800 requires 4 microphone inputs. These microphones may be omnidirectional; no additional benefit has been observed from the use of e.g. cardioid polar patterns.

Microphones chosen for a design should exhibit a signal-to-noise ratio (SNR) greater than 64 dB. This ensures that the microphone self-noise, and hence the noise floor, is low enough for the XVF3800 to function effectively. Matched microphones are not necessary, however. Total Harmonic Distortion (THD) should be less than 1%; with modern MEMS microphones this is usually the case so long as the microphone is not operating near its acoustic overload point.

For compatibility with the XVF3800, microphones chosen should be digital MEMS microphones with a PDM output. These will be clocked at 3.072 MHz, with a decimation factor applied in firmware to generate the sampling rate used internally.
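As a quick check of the figures involved, the sketch below computes the overall decimation ratio implied by the 3.072 MHz PDM clock. The 16 kHz output rate is an assumption taken from the 16 kHz captures described later in this chapter, and the function name is purely illustrative; it is not part of the XVF3800 firmware.

```python
PDM_CLOCK_HZ = 3_072_000  # PDM microphone clock stated above
PCM_RATE_HZ = 16_000      # assumed internal rate (matches the 16 kHz captures used later)


def decimation_factor(pdm_clock_hz: int, pcm_rate_hz: int) -> int:
    """Return the integer ratio between the PDM bit rate and the PCM sample rate."""
    if pdm_clock_hz % pcm_rate_hz != 0:
        raise ValueError("PDM clock is not an integer multiple of the PCM rate")
    return pdm_clock_hz // pcm_rate_hz


print(decimation_factor(PDM_CLOCK_HZ, PCM_RATE_HZ))  # 192
```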

Microphones should not reach acoustic overload when the loudspeakers are operating at their loudest volume; a headroom of 6 to 10 dB at that volume is a reasonable goal. It is important that the microphones are not driven into a non-linear response by the loudspeakers in the end product. For far-field voice, a sensitivity of approximately -30 dBFS @ 94 dB SPL would be appropriate. With the SNR of 64 dB given above, this places the noise floor at or below -94 dBFS.
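The level arithmetic in this paragraph can be sketched as follows. The function names are illustrative, and the 116 dB SPL playback level at the microphone is a made-up figure used only to show the headroom calculation.

```python
def noise_floor_dbfs(sensitivity_dbfs: float, snr_db: float) -> float:
    """Self-noise level in dBFS; SNR is referenced to the output at 94 dB SPL."""
    return sensitivity_dbfs - snr_db


def full_scale_spl(sensitivity_dbfs: float, ref_spl_db: float = 94.0) -> float:
    """SPL that would drive the digital output to 0 dBFS (ignores the acoustic overload point)."""
    return ref_spl_db - sensitivity_dbfs


def headroom_db(limit_spl_db: float, loudest_spl_at_mic_db: float) -> float:
    """Margin between the microphone limit and the loudest playback level at the microphone."""
    return limit_spl_db - loudest_spl_at_mic_db


SENSITIVITY_DBFS = -30.0  # at 94 dB SPL, from the text above
SNR_DB = 64.0             # from the text above

print(noise_floor_dbfs(SENSITIVITY_DBFS, SNR_DB))            # -94.0
print(full_scale_spl(SENSITIVITY_DBFS))                      # 124.0
print(headroom_db(full_scale_spl(SENSITIVITY_DBFS), 116.0))  # 8.0 (116 dB SPL is hypothetical)
```

In a real design the limit used for the headroom check should be the lower of the digital full-scale level and the acoustic overload point given in the microphone datasheet.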

The XVF3800 supports both circular and linear microphone arrays. It is important to differentiate between the “spacing” of the microphones and the “aperture” of the array. The spacing is the distance between individual microphones, whereas the aperture is the overall extent of the array. The spacing sets the high frequency limit: the spacing d must satisfy d < wavelength / 2 to avoid spatial aliasing. For example, with a speed of sound of 343 m/s, a 3 cm spacing gives a high frequency limit of approximately 5.7 kHz.
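The half-wavelength condition can be evaluated for other spacings with a short sketch such as the one below. It assumes a speed of sound of 343 m/s (air at roughly 20 °C); the function name is illustrative.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees C


def spatial_aliasing_limit_hz(spacing_m: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Highest frequency for which spacing_m is still below half a wavelength."""
    return c / (2.0 * spacing_m)


print(round(spatial_aliasing_limit_hz(0.03)))  # 5717, i.e. about 5.7 kHz for 3 cm
```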

The aperture defines the low frequency limit, which is set by the ability to measure a phase difference between the microphones. When the wavelength is much larger than the array aperture, the phase (and signal) at each microphone is nearly identical and it becomes impossible to differentiate between the microphone signals.

There is no hard limit at low frequency in the way that there is at high frequency; usually the low frequency limit is defined based on the beam width, which also varies with the number of microphones. As a rough rule of thumb, the array aperture should be at least 5% of the wavelength, e.g., for a 10 cm aperture the low frequency limit is approximately 172 Hz.
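The rule of thumb above can be sketched in the same way. The 5% fraction is the figure quoted in the text, the speed of sound is again assumed to be 343 m/s, and the function name is illustrative.

```python
def low_frequency_limit_hz(aperture_m: float,
                           min_fraction: float = 0.05,
                           c: float = 343.0) -> float:
    """Frequency at which the aperture equals min_fraction of a wavelength."""
    return min_fraction * c / aperture_m


print(round(low_frequency_limit_hz(0.10), 1))  # 171.5, quoted as 172 Hz above
```

Because this is a rule of thumb rather than a hard limit, the result is best read as an indication of where low frequency directivity starts to degrade.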

However, regardless of the geometry chosen, at least 2 (and preferably more) of the microphones in the array should be approximately 10 cm apart. This ensures sufficient phase difference between microphones at lower frequencies to allow the beamformer to function.

The frequency response should cover the desired voice band. For a wideband application, a response from 100 Hz to 10 kHz should be easily obtainable. A frequency mask can be defined to ensure flatness of the response, e.g., ±2 dB from 100 Hz to 6 kHz and ±4 dB from 6 kHz to 10 kHz.
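A frequency mask of this kind can be checked against measured data with a sketch like the one below. The mask segments are the example values from the text; the measured points are made up for illustration, and levels are assumed to be expressed in dB relative to the reference level of the response.

```python
# Mask segments as (f_low_hz, f_high_hz, tolerance_db), from the example above.
MASK = [(100.0, 6000.0, 2.0), (6000.0, 10000.0, 4.0)]


def tolerance_at(freq_hz, mask=MASK):
    """Return the tightest tolerance applying at freq_hz, or None outside the mask."""
    limits = [tol for lo, hi, tol in mask if lo <= freq_hz <= hi]
    return min(limits) if limits else None


def mask_violations(response, mask=MASK):
    """Return the (freq_hz, level_db) points that fall outside the mask."""
    bad = []
    for freq_hz, level_db in response:
        tol = tolerance_at(freq_hz, mask)
        if tol is not None and abs(level_db) > tol:
            bad.append((freq_hz, level_db))
    return bad


# Hypothetical measurement: (frequency in Hz, level in dB relative to reference).
measured = [(50, -9.0), (100, -1.5), (1000, 0.0),
            (6000, 1.8), (8000, -3.1), (10000, -4.5)]
print(mask_violations(measured))  # [(10000, -4.5)]
```

Points outside the masked band (50 Hz here) are ignored, and the tighter tolerance is applied where two segments meet.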

With zero input (i.e. a silent room), there should be low coherence between microphone signals; that is to say, the self-noise of the chosen microphones should not be correlated between microphones. If correlation is observed with zero input, this usually indicates some common-mode interference between the microphone signals. Correlated noise has a negative effect on the performance of the XVF3800, and so it should be kept as low as possible.

To estimate the coherence between pairs of microphones at frequencies up to the Nyquist limit (8 kHz in this system), xvf_tools.py provides the command coherence. This generates a plot similar to that shown in Fig. 32, where the blue line is real data from two microphones in a silent room and the red line is the theoretical coherence between two perfect microphones measuring diffuse noise. The theoretical model is a sinc² function with its maximum at DC and its first zero at f = c / (2d), where c is the speed of sound (in m/s) and d is the distance between the microphones (in m). The script xvf_tools.py is located in sources and can be run using the command:

python3 xvf_tools.py coherence <mic0_1.wav>

The file mic0_1.wav should be a 2-channel, 16 kHz WAV file containing two microphone signals captured in silence. To capture these signals using the XVF3800’s output, issue:

(sudo) xvf_host(.exe) AUDIO_MGR_OP_L 1 0
(sudo) xvf_host(.exe) AUDIO_MGR_OP_R 1 1

Record 30 seconds of output from the device, and repeat for the other microphones:

(sudo) xvf_host(.exe) AUDIO_MGR_OP_L 1 2
(sudo) xvf_host(.exe) AUDIO_MGR_OP_R 1 3

Further information on the use of the host application to capture output can be found in Using the Host Application, and the script itself is documented in its docstring.

For optimal algorithmic performance, the coherence between each possible pair of microphones should be less than 0.1. All possible pairs of microphones should be tested; with four microphones this results in a total of 6 plots. One example plot is shown in Fig. 32.
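The six pairs can be enumerated with a short sketch such as the following, which prints the host commands for each pair. It assumes that the final argument of AUDIO_MGR_OP_L and AUDIO_MGR_OP_R selects the microphone index, as the examples above (0 to 3) suggest; the helper name is illustrative, and the optional sudo prefix and .exe suffix are omitted.

```python
from itertools import combinations

NUM_MICS = 4  # the XVF3800 uses four microphone inputs


def capture_commands(num_mics=NUM_MICS):
    """Yield ((left, right), [left_cmd, right_cmd]) for every unordered microphone pair."""
    for left, right in combinations(range(num_mics), 2):
        yield (left, right), [
            f"xvf_host AUDIO_MGR_OP_L 1 {left}",   # assumed: last argument is the mic index
            f"xvf_host AUDIO_MGR_OP_R 1 {right}",
        ]


for pair, commands in capture_commands():
    print(pair, *commands, sep="\n  ")
```

Each pair still needs its own 30 second silent recording before the coherence command is run on the resulting file.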

Fig. 32 Sample coherence plot between two microphones
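For reference, the theoretical curve (the red line in Fig. 32) can be reproduced with a minimal sketch of the sinc² model described above. This is not the implementation used by xvf_tools.py; the 10 cm spacing and the 343 m/s speed of sound are assumed values, and the function names are illustrative.

```python
import math


def diffuse_coherence(freq_hz: float, spacing_m: float, c: float = 343.0) -> float:
    """Sinc-squared coherence model for two ideal microphones in diffuse noise."""
    x = 2.0 * math.pi * freq_hz * spacing_m / c
    if x == 0.0:
        return 1.0  # maximum at DC
    return (math.sin(x) / x) ** 2


def first_null_hz(spacing_m: float, c: float = 343.0) -> float:
    """Frequency of the first zero of the model, f = c / (2d)."""
    return c / (2.0 * spacing_m)


SPACING_M = 0.10  # hypothetical microphone spacing
print(round(first_null_hz(SPACING_M)))  # 1715
for freq_hz in (0.0, 500.0, 1000.0, 1715.0, 4000.0, 8000.0):
    print(f"{freq_hz:7.0f} Hz  {diffuse_coherence(freq_hz, SPACING_M):.3f}")
```

Measured coherence that sits well above this curve, particularly at frequencies beyond the first zero, is the kind of result that points towards common-mode interference.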

Loudspeaker(s)

The loudspeaker, power amplifier and DAC are considered together as the playback path of the product. The frequency response of this assembly should cover the desired voice band, e.g., wideband should cover 250 Hz to 6.3 kHz. As mentioned in the microphone section, a mask can be defined to ensure desired flatness of the response within the passband. For a product which can also play music etc., the loudspeaker can have a wider frequency response; this should not degrade the performance of the voice pipeline.

The sensitivity should be sufficient for the application; it should be capable of producing 75 dB SPL at the user’s ear location when a -18 dBFS signal is supplied. This can be considered the nominal sensitivity of the product. For a more general application, the maximum volume produced can be higher. However, much higher volumes will impact the ability to detect voice during playback.

The directivity of the loudspeaker should not affect the voice pipeline performance significantly. However, it may be possible to use a directional loudspeaker to achieve required levels at the user position while minimizing the feedback path to the microphones, and hence improving echo performance.

The most pressing consideration when incorporating loudspeakers into a design using the XVF3800 is the minimisation of non-linearities within the design. Whilst the XVF3800 features a linear echo canceller (the AEC), and whilst it can also suppress tail echo and non-linear echo, it is advisable to keep any non-linearities in the design to a minimum in order to guarantee optimal intelligibility and algorithmic performance.

The two main sources of non-linearity in a design arise from mechanical coupling between a loudspeaker and the microphones and from non-linearities present in the loudspeaker/amplifier stage itself. Efforts should be made to ensure that any loudspeakers are appropriately isolated from the microphones and placed physically as far away as feasible. Isolation may take the form of mechanical decoupling from the rest of the enclosure and/or the use of soundproofing material between loudspeakers and the microphones. Additionally, product enclosures should be designed in such a manner as not to introduce non-linear effects; they should not rattle, click, vibrate, or otherwise introduce extraneous noise during normal operation.

Non-linearities present in the loudspeaker/amplifier stage are more difficult to provide generalised advice on.

Loudspeakers and amplifiers should be specified such that at nominal operating volume they are both operating within their linear region; this usually pushes design decisions towards larger or more powerful loudspeakers. As noted in the previous section, the loudspeakers at their maximum level should not be so loud that they push the microphones in the design to acoustic overload.

A THD below 3% to 5%, measured over the full frequency range at the maximum level, is desirable. Designers should note that loudspeaker datasheets typically specify THD only at 1 kHz. THD can also be introduced by the amplifier used; it is important that amplifiers are chosen such that the overall THD of the loudspeaker system is minimised wherever possible.
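As an illustration of the figure being specified, the sketch below computes THD from the RMS amplitude of the fundamental and of its harmonics, using the common definition of THD as the ratio of the RMS sum of the harmonics to the fundamental. The amplitudes and the per-frequency results are made up; in practice they would come from a measurement at each test frequency.

```python
import math


def thd_percent(fundamental_rms: float, harmonic_rms: list) -> float:
    """THD as the RMS sum of the harmonics relative to the fundamental, in percent."""
    return 100.0 * math.sqrt(sum(h * h for h in harmonic_rms)) / fundamental_rms


# Hypothetical measurement at one frequency: fundamental plus 2nd and 3rd harmonics.
print(round(thd_percent(1.0, [0.03, 0.04]), 2))  # 5.0

# Hypothetical sweep: the worst case over frequency is the figure of interest.
sweep = {250: thd_percent(1.0, [0.04, 0.02]),
         1000: thd_percent(1.0, [0.01, 0.005]),
         4000: thd_percent(1.0, [0.02, 0.01])}
worst_freq_hz = max(sweep, key=sweep.get)
print(worst_freq_hz, round(sweep[worst_freq_hz], 2))
```

Checking the worst case across the sweep, rather than the 1 kHz value alone, reflects the point above about datasheet figures.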

Finally, it is important to consider the effect of loudspeaker placement on the far-field sensitivity of the device’s microphones. In general for a given nominal level, the closer a microphone is placed to a loudspeaker the lower its gain must be in order to avoid clipping. This means that the closer a loudspeaker is located to a microphone, the lower the overall system gain will be, and therefore the lower the far-field sensitivity of the device.
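The trade-off can be illustrated with a rough free-field estimate. The sketch below assumes a point source obeying the inverse-square law, a user at 1 m, the nominal 75 dB SPL at -18 dBFS mentioned earlier, and a microphone that reaches digital full scale at 124 dB SPL (from -30 dBFS @ 94 dB SPL). All of these are simplifying assumptions, and real enclosures will also have internal and structural paths that this ignores.

```python
import math


def spl_at_distance(spl_ref_db: float, ref_distance_m: float, distance_m: float) -> float:
    """Inverse-square law for a point source in free field (an idealisation)."""
    return spl_ref_db + 20.0 * math.log10(ref_distance_m / distance_m)


def mic_headroom_db(mic_distance_m: float,
                    nominal_spl_at_user: float = 75.0,
                    user_distance_m: float = 1.0,       # assumed user position
                    nominal_level_dbfs: float = -18.0,
                    mic_full_scale_spl: float = 124.0) -> float:
    """Headroom at the microphone when the playback path is driven at 0 dBFS."""
    full_scale_spl_at_user = nominal_spl_at_user - nominal_level_dbfs
    spl_at_mic = spl_at_distance(full_scale_spl_at_user, user_distance_m, mic_distance_m)
    return mic_full_scale_spl - spl_at_mic


for distance_m in (0.05, 0.10, 0.20):
    print(f"{distance_m:.2f} m: {mic_headroom_db(distance_m):.1f} dB headroom")
```

Under these assumptions the headroom is roughly 5 dB at 5 cm, 11 dB at 10 cm and 17 dB at 20 cm, so halving the loudspeaker-to-microphone distance costs about 6 dB that must be recovered by lowering the microphone gain or the maximum playback level.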

Enclosure and mounting

The mechanical design of the product should account for the following:

The mounting of the loudspeaker and microphones should allow sound to propagate to and from the user without significant impediments, such that the performance of the transducers is not degraded.

The feedback path from the loudspeaker to the microphones should be minimized. The actual value of the attenuation required will vary depending on the application, but it generally should be as high as possible to reduce echo. Several possible paths are present and must be considered:

The acoustic path outside of the enclosure can be minimized by increasing the distance between loudspeakers and microphones or by choosing a more highly directional loudspeaker,
The acoustic path inside the enclosure can be minimized by sealing the rear chambers behind the loudspeakers and microphones, and
The vibration path in the physical structure of the enclosure can be minimized by mechanical design of the housing structure, choice of materials, and mounting of the transducers. Whatever form this path takes, it should also remain constant over time; transducers should be either solidly mounted to the enclosure or fully decoupled with foam.

Vibration of loose parts in the product can cause rattles and buzzes, which create non-linearity and will drastically reduce echo performance. Ensure all components, panels and connectors are securely fixed.

Microphones should be prevented from detecting any additional noise sources from the product, e.g., cooling fans.

The specifics of enclosure design, in particular those relating to the acoustic performance of loudspeakers, are beyond the scope of this document. There is a wealth of literature covering this topic, and design consultancy services are also available.

More detailed information on transducer mounting and handling is available from the component manufacturers, who will also provide guidance on handling during manufacture and other important considerations.

Additionally, specification requirements for various product applications can be found in certification documents. These certifications will allow the product to bear a mark which demonstrates that the product is suitable for use with a particular service.