
Questions and answers about microphone array resources and Unsupervised Spectral Subtraction

Q: AV16.3 Corpus: Where can I access, browse and download files?
Q: AV16.3 Corpus: What does the file "seq01-1p-0000_gt.mat" contain?
Q: MATLAB/C code: Where can I access, browse and download files?
Q: Hardware: How can I get rid of inter-microphone coherence due to circuitry?
Q: Signal: What is a "band-limited" digital signal in discrete time domain?
Q: PHAT: Why upsampling? How was it implemented?
Q: SRP-PHAT, MATLAB/C code: How do I run the SRP-PHAT code?
Q: MATLAB: How to deal with "OUT OF MEMORY" messages?
Q: USS: Can Unsupervised Spectral Subtraction (USS) be applied to non-speech signals, taking into account the corresponding Probability Density Functions (PDFs)?
Q: USS: Are there any limitations for the applicability of the USS algorithm?
Q: USS, EM, priors: What are priors?
Q: USS, EM, priors: How can I select the values for P(I) and P(A)?
Q: USS, license: Am I free to use the MATLAB code provided on your site "http://www.glat.info/ma/2006-CHN-USS/index.html" for my personal research - no commercial usage?

Q: AV16.3 Corpus: Where can I access, browse and download files?
A: There.

Q: AV16.3 Corpus: What does the file "seq01-1p-0000_gt.mat" contain?
A: It contains, the speaker ground-truth annotation for the recording seq01-1p-0000, which means spatial location and speech time segmentation:

static_gt( 1 ).p2d is a 9 by 16 matrix of true 2D speaker locations.

static_gt( 1 ).p3d is a 3 by 16 matrix of true 3D speaker locations, reconstructed from static_gt( 1 ).p2d; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html

static_gt( 1 ).sp_seg is a 2 by 169 matrix of speech segments, one column per segment.

static_gt( 1 ).pos_ind is a 1 by 169 matrix of integers. Each integer is a column index into .p3d: it tells where the speaker was during a given speech segment, since the columns of pos_ind match the columns of sp_seg.

static_gt( 1 ).speaker_id is a 1 by 169 matrix of integers, telling who spoke during each segment of sp_seg (here there is only one speaker in the whole sequence).

static_gt( 1 ).array( 1 ).Pmat and static_gt( 1 ).array( 2 ).Pmat are 4x4 homogeneous 3D transform matrices (rotation + translation), each defining the 3D reference frame of one microphone array; see http://glat.info/ma/av16.3/EXAMPLES/3D-RECONSTRUCTION/index.html for a concrete example.
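As an illustration of how these fields fit together, here is a small Python sketch with made-up values (not the real seq01-1p-0000 annotation, and with tiny matrices instead of 3x16 and 2x169): pos_ind maps each speech segment of sp_seg to a column of p3d.

```python
# Toy values only (made up, *not* the real seq01-1p-0000 annotation).

# p3d: true 3D speaker locations, one column (here: one sublist) per location.
p3d = [[0.1, 0.2, 1.1], [0.3, 0.2, 1.1], [0.5, 0.1, 1.0], [0.7, 0.0, 1.0]]

# sp_seg: speech segments, one (start, end) pair per column, in seconds.
sp_seg = [(1.0, 2.5), (3.0, 4.2), (5.0, 6.1)]

# pos_ind: for each segment, the 1-based (MATLAB-style) column index into p3d.
pos_ind = [1, 1, 3]

for (start, end), k in zip(sp_seg, pos_ind):
    x, y, z = p3d[k - 1]        # convert the 1-based index to 0-based
    print("segment [%.1f, %.1f] s -> speaker at (%.1f, %.1f, %.1f) m"
          % (start, end, x, y, z))
```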


Q: MATLAB/C code: Where can I access, browse and download files?
A: There.

Q: Hardware: How can I get rid of inter-microphone coherence due to circuitry?
A: (email excerpt)

Subject: Query regarding the UCA hardware setup

I am writing to ask you for some suggestions and information regarding the hardware setup you used in your thesis. I saw a picture of it on your personal webpage; what kind of microphones, preamps and sound card were you using for recording?

We actually have a 16-channel UCA in our lab, built along the lines of the NIST Mark III microphone array. We are using an RME 800 sound card for audio capture. We recently realized that the microphones are not of the best quality for Blind Source Separation and that the frequency response is mismatched between microphones, so we are trying to find better replacements.

I do not remember which types of microphones we used - please simply write about this topic to Olivier Masson (olivier.masson@idiap.ch) on my behalf.

More specifically about the NIST Mark III array: other people have already had coherence problems similar to yours. Dr Luca Brayda made modifications to the Mark III to correct the problem [1] [2]. He was at Eurecom at that time; maybe you can try to write to him there.

Good luck,

Dr Lathoud


Q: Signal: What is a "band-limited" digital signal in discrete time domain?
A: (email excerpt)

In your paper "Spatio-temporal Analysis of Spontaneous Speech with Microphone Arrays", Chapter 2, Section 2.2 "Discrete Time Processing of Quasi-Stationary Signals", Note 3 tells me: "All signals x(t), y(t) etc. are assumed to be limited to the frequency band [0, fs/2]".

My question is: have all the *.wav files I downloaded from your website already been limited to the frequency band [0, fs/2], or do I have to limit the signals to the frequency band [0, fs/2] in my MATLAB program?

The wav signals are already band-limited, within [0 fs/2]. You do not need to modify them, and you can use them as they are.

Best regards,

Dr. Lathoud

PS: Another way to understand this is that you can reconstruct the whole, *continuous* signal curve from the *discrete* samples in the .wav, assuming that the signal is band-limited (no frequency above half the sampling frequency: fs/2). This theoretical point becomes particularly practical when doing upsampling (e.g. in time domain GCC-PHAT).
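The PS above can be made concrete with the Whittaker-Shannon interpolation formula: since the signal is band-limited, the continuous curve can be evaluated at any fractional time from the discrete samples alone. A minimal pure-Python sketch (the truncated sinc sum makes it only approximate near the edges):

```python
import math

def sinc(x):
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)

# A band-limited discrete signal: a sine at 0.1 cycles/sample (below fs/2 = 0.5).
N = 1000
x = [math.sin(2.0 * math.pi * 0.1 * n) for n in range(N)]

def reconstruct(x, t):
    """Evaluate the underlying *continuous* curve at fractional time t (in
    samples), via the (truncated) Whittaker-Shannon interpolation formula."""
    return sum(xn * sinc(t - n) for n, xn in enumerate(x))

t = 500.5                                    # halfway between two samples
approx = reconstruct(x, t)
exact = math.sin(2.0 * math.pi * 0.1 * t)    # what the continuous curve really is
```

This is exactly why upsampling in time-domain GCC-PHAT makes sense: the fractional-delay information is already present in the band-limited samples.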


Q: PHAT: Why upsampling? How was it implemented?
A: (email excerpt)

I am now studying your implementation of SRP-PHAT and have several questions.

1. Why does the following code convert the grid into an index?

   % We also convert the time-delays of the grid into a more usable format
   % -> 1-dimensional integer index in up_gccphat(:)

   up_grid_index = up_rowzero - round( grid_td * p.upfactor );
   up_grid_index = up_grid_index + repmat( up_nrows * (0:npairs-1).', 1, size( up_grid_index, 2 ) );

Our target is a matrix of upsampled time-domain GCC-PHAT values, up_gccphat, with one column per microphone pair. "Upsampled" means that the delay axis has a resolution p.upfactor times finer than one sample, so a (generally non-integer) grid delay grid_td lands on the row up_rowzero - round( grid_td * p.upfactor ).

About indexing: MATLAB stores matrices column by column, so up_gccphat(:) is one long column vector; the second line adds the offset up_nrows * (pair - 1) to each pair's row indices, turning (row, pair) subscripts into 1-dimensional indices into up_gccphat(:), so that all pairs can be read in one vectorized operation.
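The column-major indexing trick can be sketched in plain Python (0-based here, whereas MATLAB is 1-based):

```python
# Column-major (MATLAB-style) linear indexing, sketched in Python.
nrows, ncols = 4, 3

# Store a 4x3 matrix column by column in one flat list, like MATLAB's A(:).
cols = [[10 * c + r for r in range(nrows)] for c in range(ncols)]
flat = [v for col in cols for v in col]

def linear_index(row, col):
    # Same offset trick as adding up_nrows * (0:npairs-1) in the code above.
    return row + nrows * col

assert cols[1][2] == flat[linear_index(2, 1)]   # element (row 2, col 1)
```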

2. Why is upsampling necessary?

GCC-PHAT peaks:

 tau = distance difference / speed of sound * sampling frequency

appear at non-integer delay values (e.g. 1.234 samples). So you need an accuracy better than the one-sample resolution that the inverse FFT alone provides (tau = 1 or 2).
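As a quick numeric illustration of the tau equation (assumed values: fs = 16 kHz, c = 343 m/s, a 5 cm path difference; none of these are taken from a specific recording here):

```python
# Delay (in samples) corresponding to a distance difference: rarely an integer.
fs = 16000.0     # sampling frequency in Hz (assumed value)
c = 343.0        # speed of sound in m/s (assumed value)
ddiff = 0.05     # distance difference in meters (assumed value)

tau = ddiff / c * fs    # about 2.33 samples: between the integers 2 and 3
```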

Just play with any two signals: compute the frequency-domain GCC-PHAT, take the inverse FFT, then experiment with upsampling, and you will see the difference.

3. In :

    halfportion = 1 + ceil( max( maxdelay ) ) + ceil( p.filterorder / p.upfactor );
    up_rowzero = halforder + 1 + ( rowzero - rowstart ) * p.upfactor;

Could you please tell me why the left-hand side is equal to the right-hand side?

halfportion: it determines what portion of the upsampled FFT we need (when we know that two microphones are 20 cm apart, we do not need to consider delays bigger than what the tau equation above tells us).

halfportion in turn should determine rowstart, if I remember correctly.

up_rowzero & halforder: halforder compensates for the delay introduced by the low-pass filter used right after upsampling (fir1(...), if I remember correctly).


Q: SRP-PHAT, MATLAB/C code: How do I run the SRP-PHAT code?
A:
  1. Try first to run the baseline example: a simple but very slow method where we evaluate SRP-PHAT at each point of a large grid and pick the best point, for each time frame.

    As you can see in these two lines:

    addpath ../..
    o = seq_av163( 'seq01-1p-0000', '../..' );

    ...you need to install other things, including:

    ../../seq_av163.m      (another MATLAB file)

    ...which you'll find through AV16.3's file index.

    You also need to install the test data:

    ../../session08/seq01-1p-0000/        (a directory)

    which you'll find in AV16.3's session08/seq01-1p-0000/ directory.

    If you encounter any specific trouble running this baseline example, let me know.

  2. Once you have the above (slow) baseline working, try to run a faster, multispeaker version.

Q: MATLAB: How to deal with "OUT OF MEMORY" messages?
A: (email excerpt)

For your information I did these programs under Matlab 6.5.1 (R13).

Now, as an attempt to help you, I offer several solutions:

  1. If you can, try with Matlab 6.5.1 (R13).
  2. If it still does not work:
    • Verify how much memory is available to MATLAB by typing:
      feature memstats
    • Try to increase the memory available to MATLAB. Instructions can be found on the Matlab support website.
  3. If it still does not work, you can try to reduce the block size. For example, in FASTTDE_detect_locate_wrapper.m you can replace this line:
    PAR.BLOCK_SIZE_SEC = 10; % in seconds
    with:
    PAR.BLOCK_SIZE_SEC = 2; % in seconds (any value that you want)
    Similarly, you can also modify PAR.BLOCK_SIZE_SEC in the following two files:

Q: USS: Can Unsupervised Spectral Subtraction (USS) be applied to non-speech signals, taking into account the corresponding Probability Density Functions (PDFs)?
A: (email excerpt)

Yes, because the core assumptions are *not* specific to speech. We assume, in the time domain: (A1) an additive noise signal whose amplitude varies only slowly over time, and (A2) a signal of interest whose amplitudes are bigger than those of the noise.

This is not specific to speech at all. For more details, see Section 5 in [3] (esp. Section 5.1 and Fig. 5).

You are by the way absolutely free to use other PDFs. I tried other, more complex PDFs than the ones in the paper, but that led to overfitting issues and suboptimal results.

Practically, it seems safer to stick to PDFs with very few parameters. In particular, the mathematical structure of the two PDFs should reflect assumptions A1 and A2 (e.g. "the signal of interest has *bigger* amplitudes than the noise signal" -> *shifted* Erlang pdf for the target speech signal).
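For illustration, the two pdf shapes mentioned here can be written down in a few lines of Python; the parameter values below are arbitrary examples, not values fitted to any data:

```python
import math

def rayleigh_pdf(x, sigma):
    """Rayleigh pdf: a natural model for the magnitude of a complex
    Gaussian noise spectrum."""
    if x < 0.0:
        return 0.0
    return (x / sigma ** 2) * math.exp(-x ** 2 / (2.0 * sigma ** 2))

def shifted_erlang_pdf(x, k, lam, shift):
    """Erlang pdf shifted to the right: zero below 'shift', which encodes
    "the target has bigger amplitudes than the noise" (assumption A2)."""
    if x < shift:
        return 0.0
    u = x - shift
    return lam ** k * u ** (k - 1) * math.exp(-lam * u) / math.factorial(k - 1)

# Arbitrary example parameters: the shifted Erlang puts all its probability
# mass above x = 1, i.e. above the bulk of the Rayleigh noise.
noise_at_small_x = rayleigh_pdf(0.5, 1.0)                  # > 0
target_at_small_x = shifted_erlang_pdf(0.5, 2, 1.0, 1.0)   # == 0
```

Both shapes are controlled by very few parameters, in line with the advice above.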


Q: USS: Are there any limitations for the applicability of the USS algorithm?
A: (email excerpt)

If your noise signal does not match Figs. 5a and 5b in [3] well, you may need some pre-processing - some sort of whitening, e.g. channel estimation and de-convolution, which is the idea of CHN in [3].

Now let us assume a noise signal that matches well Figs. 5a and 5b in [3], and a signal of interest that has at least long tails (=bigger amplitudes than noise), as in Fig. 5c in [3]. An important assumption is "slowly-varying" in A1. That's why the code processes the signal in blocks [4]:

  % Size of the block on which we apply CHN / RSE-USS / GMN-USS
  par_default.block_size_sec = 1.0; % 1.0 second

The block should be one or two orders of magnitude larger than a typical variation of the target signal (10 to 20 ms in the case of speech). The noise signal is assumed to have a constant amplitude within one block (e.g. 1 second), which also means that we allow noise amplitude variations from one block to the next (1 second in the speech case).
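The block idea can be sketched in Python (hypothetical helper names, and a deliberately crude per-block estimate; the actual implementation is uss_filter.m [4]):

```python
# Block-wise processing sketch: the noise amplitude is assumed constant
# *within* a block, but may vary from one block to the next.
fs = 16000                      # sampling rate in Hz (assumed value)
block_size_sec = 1.0            # as in par_default.block_size_sec
block_len = int(block_size_sec * fs)

def per_block_noise_amplitude(signal, block_len):
    """Yield one noise-amplitude estimate per block (mean |x|, a crude
    stand-in for the per-block model fitting of USS)."""
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        yield sum(abs(v) for v in block) / len(block)

# Two seconds of constant-amplitude "noise" -> two blocks, two estimates.
signal = [0.5] * (2 * block_len)
estimates = list(per_block_noise_amplitude(signal, block_len))
```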


Q: USS, EM, priors: What are priors?
A: (email excerpt)

P(A) is the probability to observe the target (speech) signal at any given (time, frequency) point of the spectrum.

P(I) is the probability *not* to observe the target speech signal at any given (time, frequency) point of the spectrum.

Hence P(I) + P(A) = 1.

Within the Expectation-Maximization (EM) context, you can find an example in [5] (P(Zi = 1) and P(Zi = 2)).


Q: USS, EM, priors: How can I select the values for P(I) and P(A)?
A: (email excerpt)

We are using the EM algorithm to adjust all parameters together - including the priors - so as to maximize the likelihood of the observed data (again, see [5] for an example). That is what we call "fitting the model (Rayleigh + Shifted Erlang) to the data", and is implemented in two steps in [6], both fully automatic:

Step 1. Automatic initialization using a rough, but reliable initial estimate. In [6]:

  % Init priors

  raylsherl.p0 = max( 0.1, min( 0.9, numel( lsr_ind ) / nx ) );

--> yields automatically a reasonable value for P(I), guaranteed to lie between 0.1 and 0.9.

Step 2. At each iteration of the EM algorithm, in the M step, automatic adjustment of the priors, as explained in [5]. In our case the code [6] is:

      % Update priors

      tmp = my_logsum_fast( log_w );
      
      raylsherl.p0 = min( 1, exp( tmp( 1 ) - my_logsum_fast( tmp ) ) );

Step 1 is particularly crucial, because "an EM algorithm may converge to a local maximum of the observed data likelihood function, depending on starting values" ([5], Section "Properties"). Incompetent researchers choose to ignore this mathematical fact.

In other words, the EM algorithm is not guaranteed to converge to the globally optimal choice of parameters (= priors and pdf parameters), which is why you should pay a lot of attention to Step 1 and have a "rough, but reliable initial estimate" of the priors. "Reliable" means that the initial estimate of the priors should guarantee a decent result of the EM convergence in various situations (= various recordings, various contexts, various background noises).
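To see the Step 1 / Step 2 structure in action, here is a minimal pure-Python EM sketch. It deliberately uses two 1-D Gaussians instead of the Rayleigh + shifted-Erlang pair, purely to keep it short; only the structure (E step responsibilities, M step prior update) mirrors fit_raylsherl.m [6]:

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Deterministic toy data: 30 points near 0, 70 points near 5.
data = [0.01 * i for i in range(30)] + [5.0 + 0.01 * i for i in range(70)]

# Step 1: rough but reliable initialization of the prior and the means
# (sigma is kept fixed at 1.0 for brevity).
p0, mu0, mu1, sigma = 0.5, 1.0, 4.0, 1.0

for _ in range(30):
    # E step: responsibility of component 0 for each data point.
    r = [p0 * gauss(x, mu0, sigma)
         / (p0 * gauss(x, mu0, sigma) + (1.0 - p0) * gauss(x, mu1, sigma))
         for x in data]
    # M step, "Update priors": the new prior is the mean responsibility,
    # the analogue of the raylsherl.p0 update quoted above.
    p0 = sum(r) / len(r)
    # M step, means: responsibility-weighted averages.
    mu0 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu1 = sum((1.0 - ri) * x for ri, x in zip(r, data)) / sum(1.0 - ri for ri in r)

# p0 ends up near the true proportion 30/100 = 0.3.
```

Note how P(I) + P(A) = 1 is preserved automatically: the second prior is simply 1 - p0.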

In practice, initialization may be easier if you keep your PDFs simple (few parameters).

In the worst case, if for some reason you do not (yet) know what a good initialization should be, you can always use p0 = 0.5, which means: P(I) = 0.5 and P(A) = 1 - P(I) = 0.5.

On the other hand, let me advise *against* manual tuning of the initial prior values, because it would mean that you are overfitting your data: you might obtain very good performance on one particular signal and very bad performance on another - a meaningless result.


Q: USS, license: Am I free to use the MATLAB code provided on your site "http://www.glat.info/ma/2006-CHN-USS/index.html" for my personal research - no commercial usage?
A: (email excerpt)

Yes you are. If you only use USS, please cite the ASRU paper [7]. If you use CHN as well, please cite the RR 06-09 [3].


References

[1] L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, P. Svaizer, "Modifications on NIST MarkIII array to improve coherence properties among input signals", AES, 118th Audio Engineering Society Convention, Barcelona, Spain, May 2005.

[2] L. Brayda, C. Bertotti, L. Cristoforetti, M. Omologo, P. Svaizer, "On calibration and coherence signal analysis of the CHIL microphone network at IRST", Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, Piscataway, NJ, March 2005.

[3] "Channel Normalization for Unsupervised Spectral Subtraction", IDIAP Research Report RR 06-09.

[4] USS implementation: uss_filter.m

[5] Wikipedia - Expectation-Maximization algorithm

[6] USS implementation: fit_raylsherl.m

[7] "Unsupervised Spectral Subtraction for Noise-Robust ASR", IEEE ASRU Workshop, 2005.


Produced on 2011-09-27 by qa.scm - by Guillaume Lathoud (glathoud _at_ yahoo _dot_ fr)