
AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking

Two examples: Snapshots, videos, audio and 3D annotation

The red ellipses indicate the locations of the two microphone arrays. The colored balls on the heads were used in the 3-D annotation process.
Click on a snapshot to watch the corresponding video.

Camera 1	Camera 2	Camera 3
seq18-2p-0101: access 16kHz audio and 3D mouth annotation.

Camera 1	Camera 2	Camera 3
seq45-3p-1111: access 16kHz audio and 3D mouth annotation.

High-level description

The AV16.3 corpus is an audio-visual corpus of 43 real indoor multispeaker recordings, designed to test algorithms for audio-only, video-only and audio-visual speaker localization and tracking. Real human speakers were used. The recordings were chosen to push algorithms to their limits and to cover a wide range of application scenarios (meetings, surveillance). The emphasis is on overlapped speech and multiple moving speakers: most recordings are dynamic scenarios with single or multiple moving speakers, and a few meeting scenarios with mostly seated speakers are also included.

Uses

Using the AV16.3 corpus,

Javier Macias-Guarasa made some pretty nice audio tracking demos,
T.W. Pirinen (paper) and Jacek P. Dmochowski (journal paper and paper) published on audio localization,
Nam Truong Pham published on video multi-camera multi-speaker tracking.
Hari K. Maganti used the mouth annotation tool for his own corpus (thesis).
Jacek Dmochowski, Jacob Benesty and Sofiène Affes evaluated the Multichannel Cross-Correlation Coefficient (MCCC)-based acoustic localization (paper, journal paper).
Oscar Varela Serrano proposed and tested new Voice Activity Detection features (paper).
Javed Ahmed tested his Robust and Real-Time Visual Tracking Framework (PhD Thesis).
Peihua Li and Xianzhe Ma developed and evaluated their RANSAC-based acoustic source localization algorithm.
Axel Plinge tested his Robust Neuro-Fuzzy Speaker Localization using a Circular Microphone Array (IWAENC).
I used AV16.3 extensively, e.g. in my thesis and a journal paper.

If you would like to be cited here just drop me a note.

Technical details

Recordings were made with two 8-microphone Uniform Circular Arrays (16 kHz sampling frequency) and three digital cameras (25 frames per second) placed around the meeting room: 16 microphones and 3 cameras, hence the "AV16.3" name. Whenever possible, each speaker also wore a lapel microphone. All sensors were synchronized. The three cameras were calibrated and used to determine the ground-truth 3-D location of the mouth of each speaker, with a maximum error of 1.2 cm. To the best of our knowledge, this was the first annotated audio-visual corpus to be made publicly available (recorded in fall 2003, published in June 2004 at the MLMI'04 workshop).
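
To give a rough idea of how several calibrated cameras can yield a single 3-D point, here is a minimal Python sketch. It is not the procedure used for the corpus (that one is described in the PDF publication). It assumes that each camera has already been reduced to a hypothetical center and a viewing ray pointing towards the mouth, and it returns the point closest to all rays in the least-squares sense. All function names and coordinates are illustrative.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; no unique point")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    x = [0.0] * 3
    for r in (2, 1, 0):
        s = M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))
        x[r] = s / M[r][r]
    return x

def closest_point_to_rays(centers, directions):
    """Least-squares 3-D point nearest to a set of rays.

    centers    -- list of (x, y, z) camera centers (hypothetical values)
    directions -- list of (dx, dy, dz) ray directions, any non-zero length
    Solves  sum_i (I - u_i u_i^T) x = sum_i (I - u_i u_i^T) c_i,
    where u_i is the unit direction of ray i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = math.sqrt(sum(v * v for v in d))
        u = [v / norm for v in d]
        for r in range(3):
            for k in range(3):
                p = (1.0 if r == k else 0.0) - u[r] * u[k]
                A[r][k] += p
                b[r] += p * c[k]
    return solve3(A, b)

# Synthetic check: three made-up cameras looking exactly at one made-up point.
target = (1.0, 0.5, 1.2)
cams = [(0.0, 0.0, 2.0), (4.0, 0.0, 2.0), (2.0, 3.0, 2.0)]
rays = [tuple(t - c for t, c in zip(target, cam)) for cam in cams]
estimate = closest_point_to_rays(cams, rays)
```

With noise-free rays the estimate coincides with the target. With real image measurements the rays no longer intersect exactly, and the residual distance to each ray gives a rough indication of the reconstruction error.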

How to access the corpus (data + annotation + tools)

Freely download it through the various links in the "Contents" section below.
If you want to order the whole corpus (3 DVDs), write to data-manager@idiap.ch.

How to use the corpus

The only requirement is to cite the following paper:
"AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking",
by Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez,
in Proceedings of the MLMI'04 Workshop, 2004.
Look at the documents, watch some recordings.
Check the MATLAB examples (including SRP-PHAT localization & evaluation against the 3-D ground-truth),
annotation tools and a more elaborate implementation of multisource localization.
If you need to extract PPM images from an AVI file you can use mplayer, as in the following example:
mplayer -ss 00:00:00 -vo pnm -vf scale=360:288 seq03-1p-0000_cam1_divx_audio.avi
Compatibility issues: there are a few binary data files ("*.mat", created with MATLAB 6.5.1).
If your MATLAB version cannot read those, use the MATLAB scripts that
recreate their content ("*_mat.m" ASCII files, in the same directory as each "*.mat" file).
For other technical help, write here.
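
If frames are needed from several files, the mplayer call shown above can be generated rather than typed by hand. The Python sketch below only builds the command lines. It assumes that the files of cameras 2 and 3 follow the same "_camN_divx_audio.avi" naming as the cam1 example, which should be checked against the actual file listing; the helper names are hypothetical.

```python
import shlex

def mplayer_ppm_command(avi_name, width=360, height=288, start="00:00:00"):
    """Return the mplayer argument list that dumps the frames of one AVI as PPM images."""
    return [
        "mplayer",
        "-ss", start,
        "-vo", "pnm",
        "-vf", f"scale={width}:{height}",
        avi_name,
    ]

def camera_avi_names(sequence, cameras=(1, 2, 3)):
    # ASSUMPTION: every camera uses the naming pattern of the cam1 example above.
    return [f"{sequence}_cam{n}_divx_audio.avi" for n in cameras]

commands = [mplayer_ppm_command(name) for name in camera_avi_names("seq03-1p-0000")]
for argv in commands:
    # Printing only; to execute, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Running each command in its own working directory keeps the image files of the different cameras apart.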

Documents:
- PDF publication describing the corpus, the camera calibration and the annotation procedure.
- Description of the annotation file formats.
- README, including:
  - General description.
  - Microphone array geometry.
  - File-by-file description.
  - Log of modifications.
Room geometry:
- The room was 8.2 m x 3.6 m x 2.4 m, with a long table in the middle. For more details, consult com02-07.pdf.
- The microphone arrays were 0.04 m above the table. See also section 2 of the README.
Camera calibration:

As explained in the PDF, I used 75 ground-truth 3D points on the tables, walls, and panels to calibrate the 3 cameras.
- The gt.mat MATLAB file here contains the 3D ground-truth.
- 3D measurements were done manually, with less than 1 cm error (great thanks to Olivier Masson for the help).
Calibration result:
- For recording session 08, ./CAL_session08/MANUAL_CENTER_R2_TG/ contains the camera parameters resulting from the calibration.
- The same 3D calibration parameters as in session 08 were used for the following recording sessions (09, 10, 11, 12); only a small 2D shift correction was applied.
Data:
- Browse all files/directories. <- if you cannot find a file, look here.
- The 10 annotated human recordings (3-D mouth location annotated within a 1.2 cm error):
  (files: AVI, 16kHz WAV, annotation)
  - seq01-1p-0000: single human speaker, static, at each of 16 locations,
  - seq02-1p-0000: single human speaker, static, at each of 16 locations,
  - seq03-1p-0000: single human speaker, static, at each of 16 locations,
  - seq11-1p-0100: single human speaker, moving randomly, while seated as well as while walking,
  - seq15-1p-0100: single human speaker, walking around, alternating speech and long silences,
  - seq18-2p-0101: two human speakers, getting closer to & further from each other, speaking simultaneously,
  - seq24-2p-0111: two human speakers, walking across each other, speaking simultaneously,
  - seq37-3p-0001: three human speakers, static, speaking simultaneously,
  - seq40-3p-0111: three human speakers, two seated & one walking behind, speaking simultaneously,
  - seq45-3p-1111: three human speakers, all moving, speaking simultaneously,
  - Detailed description file for all 10 annotated recordings.
- All 43 human recordings (10 annotated + 33 non-annotated):
  (files: AVI, 16kHz WAV)
  - session08: 3 recordings (including 2 annotated),
  - session09: 8 recordings (including 3 annotated),
  - session10: 11 recordings (including 4 annotated),
  - session11: 16 recordings (including 1 annotated),
  - session12: 5 recordings (none annotated),
  - Detailed description file for all 43 recordings.
- Loudspeaker recordings of multiple vowels (synthetic speech), along with spatial and temporal annotation.
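
The recording names listed above follow a visible pattern: "seq", a two-digit sequence number, the number of speakers followed by "p", and four binary digits. The Python sketch below splits a name into these parts, for instance to select all two-speaker recordings. The meaning of the four digits is not stated on this page, so the sketch keeps them as an uninterpreted string (the detailed description files should be consulted for that). The function and field names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the names on this page, e.g. "seq18-2p-0101".
NAME_RE = re.compile(r"^seq(\d{2})-(\d)p-([01]{4})$")

def parse_sequence_name(name):
    """Split a recording name into sequence number, speaker count and flag string."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected recording name: {name!r}")
    return {
        "sequence": int(m.group(1)),
        "speakers": int(m.group(2)),
        "flags": m.group(3),  # left uninterpreted on purpose
    }

ANNOTATED = [
    "seq01-1p-0000", "seq02-1p-0000", "seq03-1p-0000", "seq11-1p-0100",
    "seq15-1p-0100", "seq18-2p-0101", "seq24-2p-0111", "seq37-3p-0001",
    "seq40-3p-0111", "seq45-3p-1111",
]

parsed = [parse_sequence_name(n) for n in ANNOTATED]
speakers_histogram = Counter(p["speakers"] for p in parsed)
two_speaker = [n for n, p in zip(ANNOTATED, parsed) if p["speakers"] == 2]
```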
MATLAB code:
- The annotation tools (MATLAB GUIs):
  - ball marker (on the head),
  - head itself,
  - mouth.
- Examples:
  - 3-D reconstruction of the mouth location (code),
  - Audio localization with SRP-PHAT and evaluation against the 3-D ground-truth (code),
  - Video tracking with color distributions (snapshots only).
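
For readers without MATLAB, the idea behind the audio localization example can be sketched in a few lines of Python. This is not the distributed code: it is a minimal far-field, azimuth-only SRP-PHAT on one 8-microphone circular array, followed by an angular error against a ground-truth 3-D point. The array radius, the speed of sound, the array center and the coordinate frame are all assumed placeholders (the README gives the real geometry), and the input is a synthetic spectrum rather than corpus audio.

```python
import cmath
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed
SAMPLE_RATE = 16000.0   # Hz, as in the corpus
NUM_MICS = 8            # one 8-microphone uniform circular array
RADIUS = 0.10           # m, ASSUMED placeholder; see the README for the real value

def mic_positions(num_mics=NUM_MICS, radius=RADIUS):
    step = 2 * math.pi / num_mics
    return [(radius * math.cos(m * step), radius * math.sin(m * step))
            for m in range(num_mics)]

def far_field_delays(azimuth_deg, mics):
    """Arrival delay (s) at each microphone relative to the array center."""
    a = math.radians(azimuth_deg)
    ux, uy = math.cos(a), math.sin(a)
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mics]

def srp_phat_azimuth(spectra, freqs, mics, step_deg=1):
    """spectra[m][k] is the complex spectrum of microphone m at freqs[k] (Hz)."""
    # PHAT weighting: discard magnitudes, keep phases only.
    phases = [[x / abs(x) if abs(x) > 0 else 0j for x in ch] for ch in spectra]
    best_az, best_power = None, -1.0
    for az in range(0, 360, step_deg):
        delays = far_field_delays(az, mics)
        power = 0.0
        for k, f in enumerate(freqs):
            steered = sum(phases[m][k] * cmath.exp(2j * math.pi * f * delays[m])
                          for m in range(len(mics)))
            # |sum|^2 equals the sum over all microphone pairs of the steered,
            # PHAT-weighted cross-spectra (plus a constant term).
            power += abs(steered) ** 2
        if power > best_power:
            best_az, best_power = az, power
    return best_az

def azimuth_of_point(point_xyz, array_center_xyz):
    dx = point_xyz[0] - array_center_xyz[0]
    dy = point_xyz[1] - array_center_xyz[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

def angular_error_deg(estimate_deg, truth_deg):
    """Absolute angular difference, wrapped to [0, 180]."""
    return abs((estimate_deg - truth_deg + 180.0) % 360.0 - 180.0)

# Synthetic check: a far-field source at 60 degrees with a random spectrum.
random.seed(0)
mics = mic_positions()
freqs = [SAMPLE_RATE * k / 512 for k in range(10, 110, 4)]
source = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in freqs]
true_delays = far_field_delays(60, mics)
spectra = [[s * cmath.exp(-2j * math.pi * f * d) for s, f in zip(source, freqs)]
           for d in true_delays]
estimate = srp_phat_azimuth(spectra, freqs, mics)

# Hypothetical ground truth: a mouth 1 m away from a made-up array center.
center = (0.0, 0.0, 0.8)
mouth = (math.cos(math.radians(60)), math.sin(math.radians(60)), 1.2)
error = angular_error_deg(estimate, azimuth_of_point(mouth, center))
```

On real recordings the spectra would come from short-time Fourier transforms of the 16 kHz array channels, and the ground-truth point from the 3-D mouth annotation expressed in the same frame as the array.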
Misc. PDF documents:


Last updated on 2008-09-17 by Guillaume Lathoud - glathoud at yahoo dot fr