[ FORMATS of the annotation files ]

Files ending in ".headgt" contain a 7-column matrix, where each row is
a measurement:

- column 1: integer frame index for this camera. Note that this index
  is camera-dependent, while timecodes are not.
- column 2: "is_manual" flag (0 or 1). Indicates whether the 2D
  measurement was provided by a user (1) or by some
  interpolation/tracking procedure (0).
- column 3: "is_visible" flag (0, 0.5 or 1). Indicates whether the
  head is visible (1), partially occluded (0.5) or not visible at
  all (0).
- columns 4 to 7: (x_upleft, y_upleft, x_downright, y_downright)
  pixel coordinates of the head bounding box.

----------------------------------------------------------------------

Files ending in ".ballgt" contain a 7-column matrix, with exactly the
same format as ".headgt" files. Note that the "is_visible = 0.5" value
is generally not used, since the ball marker is small: we usually
simplify by stating that it is either visible or not.

----------------------------------------------------------------------

Files ending in ".mouthgt" contain a 5-column matrix, where each row
is a measurement:

- column 1: integer frame index for this camera. Note that this index
  is camera-dependent, while timecodes are not.
- column 2: "is_manual" flag (0 or 1). Indicates whether the 2D
  measurement was provided by a user (1) or by some
  interpolation/tracking procedure (0).
- column 3: "is_visible" flag (0 or 1). Indicates whether the mouth
  is visible (1) or not (0).
- columns 4 and 5: (x,y) pixel coordinates of the mouth.

----------------------------------------------------------------------

Files ending in ".3dballgt" or ".3dmouthgt" contain a 5-column matrix,
where each row is a measurement:

- column 1: FLOAT, absolute time value in seconds (NOT a frame index).
  Note that this value is NOT camera-dependent: it is basically the
  time code expressed in seconds.
- column 2: "is_manual" flag (0 or 1).
  Indicates whether ALL 2D measurements used to create this 3D
  location estimate had the "is_manual" flag set to 1.
- columns 3 to 5: (X,Y,Z) spatial coordinates in a reference frame
  whose origin is the midpoint between the two microphone arrays, as
  shown in the room layout diagram.

[Diagram: room layout showing the window, the walls, the table, the
people locations, the cameras cam1, cam2 and cam3, and the origin O
with the X axis drawn pointing right and the Y axis drawn pointing up.
(1) and (2) denote the locations of the two circular microphone
arrays.]
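
----------------------------------------------------------------------

As an illustration of the column layouts described above, here is a
minimal Python sketch of a reader for the annotation files. It assumes
that each file is plain text with one measurement per line and
whitespace-separated numeric columns, which is not stated in this
document. The names Box2D, Point2D, Point3D, parse_box_gt,
parse_mouth_gt and parse_3d_gt are hypothetical and only serve this
example.

```python
from typing import List, NamedTuple


class Box2D(NamedTuple):
    """One row of a .headgt or .ballgt file (7 columns)."""
    frame: int          # camera-dependent frame index
    is_manual: bool     # True if provided by a user
    is_visible: float   # 0, 0.5 or 1
    x_upleft: float
    y_upleft: float
    x_downright: float
    y_downright: float


class Point2D(NamedTuple):
    """One row of a .mouthgt file (5 columns)."""
    frame: int          # camera-dependent frame index
    is_manual: bool
    is_visible: bool
    x: float
    y: float


class Point3D(NamedTuple):
    """One row of a .3dballgt or .3dmouthgt file (5 columns)."""
    time_s: float       # absolute time in seconds, not a frame index
    all_manual: bool    # True if ALL contributing 2D measurements were manual
    X: float
    Y: float
    Z: float


def _rows(text: str, n_cols: int) -> List[List[float]]:
    """Split text into numeric rows, checking the column count.

    ASSUMPTION: columns are separated by whitespace.
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != n_cols:
            raise ValueError(
                f"line {line_no}: expected {n_cols} columns, got {len(fields)}"
            )
        rows.append([float(f) for f in fields])
    return rows


def parse_box_gt(text: str) -> List[Box2D]:
    """Parse the contents of a .headgt or .ballgt file."""
    return [
        Box2D(int(r[0]), r[1] == 1, r[2], r[3], r[4], r[5], r[6])
        for r in _rows(text, 7)
    ]


def parse_mouth_gt(text: str) -> List[Point2D]:
    """Parse the contents of a .mouthgt file."""
    return [
        Point2D(int(r[0]), r[1] == 1, r[2] == 1, r[3], r[4])
        for r in _rows(text, 5)
    ]


def parse_3d_gt(text: str) -> List[Point3D]:
    """Parse the contents of a .3dballgt or .3dmouthgt file."""
    return [
        Point3D(r[0], r[1] == 1, r[2], r[3], r[4])
        for r in _rows(text, 5)
    ]
```

A possible use, with made-up values: parse_box_gt("12 1 0.5 100 50 140 95")
returns a single Box2D whose frame field is 12 and whose is_visible
field is 0.5. The contents of an actual file can be passed the same
way, for example parse_box_gt(open(path).read()), where path is the
name of a ".headgt" or ".ballgt" file. The three formats are kept as
separate types because the first column is a camera-dependent frame
index in the 2D files but a time in seconds in the 3D files, so the
two should not be mixed up.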