AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking
Two examples: Snapshots, videos, audio and 3D annotation
The red ellipses indicate the locations of the two microphone arrays. The colored balls on the heads were used in the 3-D annotation process.
The AV16.3 corpus is an audio-visual corpus of 43 real indoor
multispeaker recordings, designed to test algorithms for audio-only,
video-only and audio-visual speaker localization and tracking. Real human speakers were used. The variety of recordings was
chosen to test algorithms to their limits, and to cover a wide range of
application scenarios (meetings, surveillance). The emphasis is on
overlapped speech and multiple moving speakers. Most recordings are
dynamic scenarios with single or multiple moving speakers. A
few meeting scenarios, with mostly seated speakers, are also included.
Using the AV16.3 corpus,
If you would like to be cited here, just drop me a note.
Recordings were made with two 8-microphone Uniform Circular Arrays (16
kHz sampling frequency) and three digital cameras (25 frames per
second) around the meeting room, hence the "AV16.3" name (16
microphones, 3 cameras). Whenever
possible, lapel microphones were also worn by each speaker. All
sensors were synchronized. The three cameras were calibrated and
used to determine the ground-truth 3-D location of the mouth of each
speaker, with a maximum error of 1.2 cm. To the best of our knowledge,
this audio-visual annotated corpus was the first to be made publicly
available (recorded in fall 2003, published in June 2004 at the MLMI'04 workshop).
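Since all sensors were synchronized, the two rates stated above (16 kHz audio, 25 frames per second video) give a fixed number of audio samples per video frame. The following is a minimal sketch of that conversion, not part of the corpus tools. It assumes that sample 0 and frame 0 coincide in time and that indices start at 0; both assumptions should be checked against the corpus documentation. The function names are hypothetical.

```python
AUDIO_RATE_HZ = 16000  # microphone array sampling frequency stated above
VIDEO_FPS = 25         # camera frame rate stated above

# 16000 / 25 = 640 audio samples per video frame
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // VIDEO_FPS


def frame_to_sample_range(frame_index):
    """Return (first, last_exclusive) audio sample indices for a video frame.

    Assumption: frame 0 starts at audio sample 0.
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    first = frame_index * SAMPLES_PER_FRAME
    return first, first + SAMPLES_PER_FRAME


def sample_to_frame(sample_index):
    """Return the index of the video frame that contains an audio sample."""
    if sample_index < 0:
        raise ValueError("sample_index must be non-negative")
    return sample_index // SAMPLES_PER_FRAME
```

For example, under these assumptions frame 25 (one second into a recording) covers samples 16000 to 16639.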
How to access the corpus (data + annotation + tools)
- Freely download it through the various links in the "Contents" section below.
- If you want to order the whole corpus (3 DVDs), write to email@example.com.
How to use the corpus
- The only requirement is to cite the following paper:
"AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking",
by Guillaume Lathoud, Jean-Marc Odobez and Daniel Gatica-Perez,
in Proceedings of the MLMI'04 Workshop, 2004.
- Look at the documents, watch some recordings.
- Check the MATLAB examples (including SRP-PHAT localization & evaluation against the 3-D ground-truth),
annotation tools and a more elaborate implementation of multisource localization.
- If you need to extract PPM images from an AVI file you can use mplayer, as in the following example:
mplayer -ss 00:00:00 -vo pnm -vf scale=360:288 seq03-1p-0000_cam1_divx_audio.avi
- Compatibility issues: there are a few binary data files
("*.mat", created with MATLAB 6.5.1).
If your MATLAB version cannot read those, use the MATLAB scripts that
recreate their content ("*_mat.m" ASCII file, in the same
directory as each "*.mat" file).
- For other technical help, write here.
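To apply the mplayer frame extraction shown above to several AVI files, the command can be assembled programmatically. The sketch below only rebuilds the exact options from the example (-ss, -vo pnm, -vf scale=360:288). The helper names are hypothetical, and actually running the command assumes mplayer is installed and on the PATH.

```python
import shlex
import subprocess


def build_mplayer_pnm_command(avi_path, start="00:00:00", width=360, height=288):
    """Build the argument list of the mplayer example above for one AVI file."""
    return [
        "mplayer",
        "-ss", start,                      # start position in the video
        "-vo", "pnm",                      # write frames as PNM/PPM images
        "-vf", f"scale={width}:{height}",  # rescale each frame
        avi_path,
    ]


def extract_frames(avi_path):
    """Run mplayer on one file (assumes mplayer is available on this machine)."""
    subprocess.run(build_mplayer_pnm_command(avi_path), check=True)


# Print the command for inspection without running it.
command = build_mplayer_pnm_command("seq03-1p-0000_cam1_divx_audio.avi")
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems if a file path contains spaces.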
- Room geometry:
- The room was 8.2 m × 3.6 m × 2.4 m, with a long table in the middle. For more details consult com02-07.pdf.
- The microphone arrays were 0.04 m above the table. See also section 2 of the README.
- Camera calibration:
As explained in the PDF, I used 75
ground-truth 3D points on the tables, walls, and panels to
calibrate the 3 cameras.
The gt.mat MATLAB file contains the 3D ground truth.
- 3D measurements were done manually, with less than 1 cm error (many thanks to Olivier Masson for the help).
- For recording session 08, ./CAL_session08/MANUAL_CENTER_R2_TG/ contains the camera parameters resulting from the calibration.
- The same 3D calibration parameters as in session 08 were used for recording sessions 09, 10, 11 and 12, with only a small 2D shift correction applied.
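As an illustration of reusing one 3D calibration with a per-session 2D shift, the sketch below projects a 3D point into an image and then adds a pixel offset. This page does not describe the format of the camera parameter files, so the sketch assumes a standard 3x4 pinhole projection matrix and an additive shift in pixels. The function name, the matrix and all numeric values are hypothetical and are not taken from the corpus.

```python
def project_point(P, xyz, shift_2d=(0.0, 0.0)):
    """Project a 3D point with a 3x4 matrix P, then add a 2D pixel shift.

    Assumptions: P is a pinhole projection matrix given as 3 rows of 4 numbers,
    and the per-session correction is a simple additive offset in the image.
    """
    x, y, z = xyz
    homogeneous = (x, y, z, 1.0)
    u, v, w = (sum(row[i] * homogeneous[i] for i in range(4)) for row in P)
    if w == 0:
        raise ValueError("point cannot be projected (zero homogeneous scale)")
    return (u / w + shift_2d[0], v / w + shift_2d[1])


# Hypothetical camera: focal length 500 px, principal point (180, 144).
P_EXAMPLE = [
    [500.0, 0.0, 180.0, 0.0],
    [0.0, 500.0, 144.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

point = (0.5, -0.25, 2.0)                                     # hypothetical 3D point
reference = project_point(P_EXAMPLE, point)                   # calibration session
shifted = project_point(P_EXAMPLE, point, shift_2d=(1.5, -2.0))  # later session
```

With these made-up values, the later session differs from the reference only by the 2D offset, which is the situation described for sessions 09 to 12.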
- Browse all files/directories.
- The 10 annotated human recordings (3-D mouth location annotated within a 1.2 cm error):
(files: AVI, 16kHz WAV, annotation)
- seq01-1p-0000: single human speaker, static, at each of 16 locations,
- seq02-1p-0000: single human speaker, static, at each of 16 locations,
- seq03-1p-0000: single human speaker, static, at each of 16 locations,
- seq11-1p-0100: single human speaker, moving randomly, while seated as well as while walking,
- seq15-1p-0100: single human speaker, walking around, alternating speech and long silences,
- seq18-2p-0101: two human speakers, getting closer to & further from each other, speaking simultaneously,
- seq24-2p-0111: two human speakers, walking across each other, speaking simultaneously,
- seq37-3p-0001: three human speakers, static, speaking simultaneously,
- seq40-3p-0111: three human speakers, two seated & one walking behind, speaking simultaneously,
- seq45-3p-1111: three human speakers, all moving, speaking simultaneously,
- Detailed description file for all 10 annotated recordings.
- All 43 human recordings (10 annotated + 33 non-annotated):
(files: AVI, 16kHz WAV)
- Loudspeaker recordings of multiple vowels (synthetic speech), along with spatial and temporal annotation.
- MATLAB code:
- Misc. PDF documents:
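The recording names listed above (for example seq18-2p-0101) follow a regular pattern that is convenient to parse when scripting over the corpus. The sketch below is a hypothetical helper, not one of the corpus tools. It assumes that the "Np" field is the number of speakers, which is inferred from the list above. The meaning of the four-digit field is not explained on this page, so it is kept as an opaque string.

```python
import re

# seq<number>-<N>p-<four digits>, optionally followed by a suffix and extension
_RECORDING_RE = re.compile(r"^seq(\d+)-(\d+)p-(\d{4})(.*)$")


def parse_recording_name(name):
    """Split a recording or file name into its fields.

    Assumption: the "Np" field is the number of speakers (inferred from the
    list above). The four-digit code is returned unchanged, uninterpreted.
    """
    match = _RECORDING_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized recording name: {name!r}")
    sequence, speakers, code, rest = match.groups()
    return {
        "sequence": int(sequence),
        "speakers": int(speakers),
        "code": code,   # opaque four-digit field
        "rest": rest,   # e.g. "_cam1_divx_audio.avi", or "" if absent
    }


info = parse_recording_name("seq03-1p-0000_cam1_divx_audio.avi")
```

Such a helper can be used, for instance, to select all three-speaker recordings by filtering on the "speakers" field.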
Last updated on 2008-09-17 by Guillaume Lathoud - glathoud at yahoo dot fr