pda Pitch Detection Algorithm

Table of Contents
Synopsis
Options
Examples

Synopsis

pda [input file] -o [output file] [options] [-h ] [-itype string] [-n int] [-f int] [-ibo string] [-iswap ] [-istype string] [-c string] [-start float] [-end float] [-from int] [-to int] [-L ] [-P ] [-fmin float] [-fmax float] [-shift float] [-length float] [-lpfilter int] [-forder int] [-d float] [-n float] [-h float] [-m float] [-r float] [-t float] [-otype string " {ascii}"] [-S float] [-o ofile]

pda is a pitch detection algorithm that produces a fundamental frequency contour from a speech waveform file. At present only the super resolution pitch determination algorithm is implemented. See (Medan, Yair, and Chazan, 1991) and (Bagshaw et al., 1993) for a detailed description of the algorithm.

The default values given below were found to optimise the performance of the pitch determination algorithm for speech data sampled at 20kHz, using a 16-bit waveform and a low pass filter with a 600Hz cut-off frequency and more than -85dB rejection above 700Hz. The best performance is obtained when the [-P] flag is passed.
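The defaults target 20kHz data, and the -d option below advises a decimation factor of two for 10kHz data (against the default of four). Both cases leave roughly 5kHz after decimation. The sketch below generalises that observation to other sample rates. The helper name decimation_for_rate and the 5kHz target are assumptions made for illustration, not documented behaviour of pda.

```shell
# Hypothetical helper: choose a -d value that leaves about 5 kHz after
# decimation, clamped to the permitted range 1..10. The 5 kHz target is an
# assumption inferred from the two documented cases (20 kHz -> 4, 10 kHz -> 2).
decimation_for_rate() {
    rate=$1
    d=$(( (rate + 2500) / 5000 ))   # integer division, rounded to nearest
    if [ "$d" -lt 1 ]; then d=1; fi
    if [ "$d" -gt 10 ]; then d=10; fi
    echo "$d"
}

decimation_for_rate 20000   # matches the documented default of four
decimation_for_rate 10000   # matches the documented advice of two
```

The result could then be passed on the command line as -d "$(decimation_for_rate 16000)". Whether this gives the best accuracy at other rates would need to be checked against reference data.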

Options

-h

Options help

-itype

string Input file type (optional). If set to raw, this indicates that the input file does not have a header. Although other file types can be specified, this is rarely necessary, because every other supported type can be determined automatically from the file's header. Unheadered input is assumed to contain shorts (16-bit). Supported types are nist, est, esps, snd, riff, aiff, audlab, raw, ascii.

-n

int Number of channels in an unheadered input file

-f

int Sample rate in Hertz for an unheadered input file

-ibo

string Input byte order in an unheadered input file. Possibilities are: MSB, LSB, native or nonnative. Suns, HP, SGI Mips, M68000 are MSB (big endian). Intel, Alpha, DEC Mips, Vax are LSB (little endian).

-iswap

Swap bytes. (For use on an unheadered input file)

-istype

string Sample type in an unheadered input file: short, mulaw, byte, ascii

-c

string Select a single channel (starts from 0). Waveforms can have multiple channels. This option extracts a single channel for processing and discards the rest.

-start

float Extract sub-wave starting at this time, specified in seconds

-end

float Extract sub-wave ending at this time, specified in seconds

-from

int Extract sub-wave starting at this sample point

-to

int Extract sub-wave ending at this sample point

-L

Perform low pass filtering on input. This option should always be used in normal processing as it usually increases performance considerably

-P

perform peak tracking

-fmin

float Minimum F0 value. Sets the minimum allowed F0 in the output track. Default is 40.000. Changing this to suit the speaker usually increases performance. Typical recommended values are 60-90Hz for males and 120-150Hz for females.

-fmax

float Maximum F0 value. Sets the maximum allowed F0 in the output track. Default is 400.000. Changing this to suit the speaker usually increases performance. Typical recommended values are 200Hz for males and 300-400Hz for females.

-shift

float Frame spacing in seconds for fixed frame analysis. This doesn't have to be the same as the output file spacing: the -S option can be used to resample the track before saving. Default is 0.005.

-length

float analysis frame length in seconds. default: 0.010

-lpfilter

int Low pass filter, with cutoff frequency in Hz. Filtering is performed by a FIR filter which is built at run time. The order of the filter can be given by -forder. The default order is 199.

-forder

int Order of FIR filter used for lpfilter and hpfilter. This must be ODD. Sensible values range from 19 (quick but with a shallow rolloff) to 199 (slow but with a steep rolloff). The default is 199.

-d

float Decimation factor. Sets the down-sampling used for quicker computation, so that only one in every decimation-factor samples is used in the first instance. Must be in the range one to ten inclusive. Default is four. For data sampled at 10kHz, a decimation factor of two is advised.

-n

float Noise floor. Sets the maximum absolute signal amplitude that represents silence. If the absolute amplitude of the first segment in a given frame is below this level at all times, then the frame is classified as representing silence. Must be a positive number. Default is 120 ADC units.

-h

float Unvoiced to voiced coeff threshold. Sets the correlation coefficient threshold which must be exceeded in a transition from an unvoiced classified frame of speech to a voiced frame. Must be in the range zero to one inclusive. Default is 0.88.

-m

float Min voiced to unvoiced coeff threshold. Sets the minimum allowed correlation coefficient threshold which must not be exceeded in a transition from a voiced classified frame of speech to an unvoiced frame. Must be in the range zero to the unvoiced to voiced coeff threshold inclusive. Default is 0.75.

-r

float Voiced to unvoiced coeff threshold ratio. Sets the scaling factor used in determining the correlation coefficient threshold which must not be exceeded in a voiced frame to unvoiced frame transition. The voiced to unvoiced coefficient threshold is determined by multiplying this scaling factor by the maximum cross-correlation coefficient of the previously voiced frame. If this product is less than the min voiced to unvoiced coeff threshold (-m), then that minimum is used instead. Must be in the range zero to one inclusive. Default is 0.85.

-t

float Anti pitch doubling/halving threshold. Sets the threshold used in eliminating (as far as possible) pitch doubling and pitch halving errors. Must be in the range zero to one inclusive. Default is 0.77.

-otype

string " {ascii}" Output file type. If unspecified, ascii is assumed. Supported types are: none, esps, est, est_binary, htk, htk_fbank, htk_mfcc, htk_user, htk_discrete, xmg, xgraph, ema, ema_swapped, ascii, label.

-S

float Frame spacing of output in seconds. If this is different from the internal spacing, the contour is resampled at this spacing

-o

ofile Output filename, defaults to stdout
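
The interaction between the voiced to unvoiced threshold ratio (-r) and the minimum threshold (-m) can be illustrated with a small calculation. This is only a sketch of the rule as worded in the option descriptions above, not code taken from pda. The helper name v2uv_threshold is hypothetical.

```shell
# Sketch of the voiced-to-unvoiced threshold rule described for -r and -m.
# v2uv_threshold is a hypothetical helper name, not part of pda.
#   $1: maximum cross-correlation coefficient of the previously voiced frame
#   $2: threshold ratio (-r), defaults to 0.85
#   $3: minimum threshold (-m), defaults to 0.75
v2uv_threshold() {
    awk -v c="$1" -v r="${2:-0.85}" -v m="${3:-0.75}" 'BEGIN {
        t = r * c             # scale the previous maximum coefficient
        if (t < m) t = m      # never fall below the -m floor
        printf "%.4f\n", t
    }'
}

v2uv_threshold 0.95   # 0.85 * 0.95 = 0.8075, above the 0.75 floor
v2uv_threshold 0.80   # 0.85 * 0.80 = 0.68, so the 0.75 floor applies
```

With the default settings the floor takes effect whenever the previous maximum coefficient is below about 0.88 (0.75 / 0.85).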

Examples

Pitch detection on typical male voice, using low pass filtering:

$ pda kdt_010.wav -o kdt_010.f0 -fmin 80 -fmax 200 -L
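
Pitch detection on an unheadered file, with the output contour resampled to a 10ms spacing, might look like the sketch below. The file names, sample rate, byte order and F0 range are assumed values chosen for illustration. The command is built as a string and printed so it can be inspected before it is run.

```shell
# Hypothetical file names; sample rate and byte order are assumed values.
in=speech.raw
out=speech.f0

# -itype raw : the input has no header, so the sample rate (-f), sample type
#              (-istype) and byte order (-ibo) are given explicitly.
# -S 0.01    : resample the contour to a 10 ms output spacing; the internal
#              analysis spacing (-shift) is left at its default.
cmd="pda $in -o $out -itype raw -f 20000 -istype short -ibo LSB -L -fmin 120 -fmax 400 -S 0.01"

# Print the command for inspection. To run it: eval "$cmd"
echo "$cmd"
```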