Purdue Prosodic Feature Extraction Tool on Praat

1. A.2.4 Energy Features
A.3 Statistical Tables
A.4 Derived Features
A.4.1 Normalized Word Duration
A.4.2 Normalized Pause
A.4.3 Normalized Vowel Duration
A.4.4 Normalized Rhyme Duration
A.4.5 F0 Derived Features
A.4.6 Energy Derived Features
A.4.7 Average Phone Duration
A.4.8 Speaker Specific Normalization
Version history
• version 0.1.1: The first public release version.
Introduction
The prosody (i.e., the duration, pitch, and energy) of speech plays an important role in human communication. Research in speech and language processing has shown that the prosodic content of speech can be quite valuable for accurate event tagging. Prosodic cues have been exploited in a variety of spoken language processing tasks such as sentence segmentation and tagging [7, 9], disfluency detection [8], dialog act segmentation and tagging [1], and speaker recognition [12], using the direct modeling approach [11]. An advantage of this approach is that no hand segmentation or intermediate labeling of th
2. or no, similar to the option in stats_batch.praat.
(Footnote 4: In the example output prosodic feature selection list file, FEATURE_NAME is the required column label; WORD$, WAV$, SPKR_ID$, GEN$, PAUSE_DUR, and NORM_LAST_RHYME_DUR are features defined in Appendix A; DERIVE_FEATURE is a feature class that tells the tool to output all the derived prosodic features.)
Here is an example of running main_batch.praat at the command line:
praat main_batch.praat demo-wavinfo_list.txt user_pf_name_table.Tab demo/work_dir/stats_files demo/work_dir yes
Below are the steps to run the same example in the Praat ScriptEditor:
1. Run Praat.
2. Open main_batch.praat from Read > Read from file... on the menu of Praat Objects.
3. Click Run > Run on the menu of ScriptEditor.
4. Enter parameters: type demo-wavinfo_list.txt, user_pf_name_table.Tab, demo/work_dir/stats_files, and demo/work_dir one by one in the four boxes, and then check yes in the radio box.
5. Click OK to start processing with the configurations, or Cancel to close the interface. Clicking the Apply button (if available) also starts processing, but it keeps the interface open after the work is done. The Standards button (if available) gives the option to restore the default configurations. Please refer to the Praat manual [2] for details.
6. Process-related information is displayed in the Praat Info Window. After computation is complete, the prosodic feature files
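The command-line call in this excerpt can also be assembled from a script. The following is a minimal Python sketch and not part of the tool: the helper name build_main_batch_command and the placeholder paths are assumptions made for illustration, and the argument order follows the list of main_batch.praat arguments given in this manual. Newer Praat releases expect --run before the script name, so the sketch makes that flag optional.

```python
from typing import List


def build_main_batch_command(
    wavinfo_list: str,
    feature_list: str,
    stats_dir: str,
    work_dir: str,
    use_existing_params: bool = True,
    praat_exe: str = "praat",
    use_run_flag: bool = False,
) -> List[str]:
    """Assemble the argument vector for main_batch.praat (hypothetical helper)."""
    cmd = [praat_exe]
    if use_run_flag:
        # Assumption: newer Praat builds take "--run" before the script name.
        cmd.append("--run")
    cmd += [
        "main_batch.praat",
        wavinfo_list,   # audio info table
        feature_list,   # output prosodic feature selection list
        stats_dir,      # statistics directory
        work_dir,       # working directory
        "yes" if use_existing_params else "no",
    ]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute the files of your own corpus.
    cmd = build_main_batch_command(
        "wavinfo_list.txt", "pf_name_table.Tab", "work_dir/stats_files", "work_dir"
    )
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once Praat is on the search path.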
3. Chapter 3 gives the structure of the tool as well as an augmentation example that demonstrates the procedures to modify the code in order to extract additional features. An exhaustive list of all of the prosodic features implemented in our tool is given in Appendix A.
(Footnote 1: This tool can be downloaded at ftp://ftp.ecn.purdue.edu/harper/praat-prosody.tar.gz, along with a manual [6].)
Chapter 1: Implementation using Praat
This chapter is mainly about how Praat's data structures and functionality support prosodic feature extraction. The tool described in this document was designed to extract a set of prosodic features given an audio file and its corresponding word and phone alignments. It is assumed that the alignments are in Praat TextGrid format, as in Figure 1.1.
Figure 1.1: An example of a word/phone alignment in Praat TextGrid format, together with the waveform.
Given a corpus with audio and time-aligned words and phones as input, our tool first extracts a set of basic elements (e.g., raw pitch, stylized pitch, voiced/unvoiced segmentation (VUV)) representing duration, F0, and energy information, using the procedures illustrated in Figure 1.2. Then a set of duration statistics (e.g., the means and variances of pause duration, phone duration, and last rhyme duration), F0-related statistics, e.g., the mean and varianc
4. Proceedings of the Empirical Methods in Natural Language Processing, 2004.
[10] B. Pellom. SONIC: The University of Colorado continuous speech recognizer. Technical Report TR-CSLR-2001-01, University of Colorado, 2001.
[11] E. Shriberg and A. Stolcke. Direct modeling of prosody: An overview of applications in automatic speech processing. In International Conference on Speech Prosody, 2004.
[12] K. Sonmez, E. Shriberg, L. Heck, and M. Weintraub. Modeling dynamic prosodic variation for speaker verification. In Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 3189-3192, 1998.
[13] R. Sundaram, A. Ganapathiraju, J. Hamaker, and J. Picone. ISIP 2000 conversational speech evaluation system. In Speech Transcription Workshop 2001, College Park, Maryland, May 2000.
[14] C. Wightman and D. Talkin. The Aligner. Entropic, July 1997.
5. DUR_MEAN
• LAST_VOWEL_DUR_ZSP = (LAST_VOWEL_DUR - SPKR_PHONE_DUR_MEAN) / SPKR_PHONE_DUR_STDEV
• LAST_VOWEL_DUR_NSP = LAST_VOWEL_DUR / SPKR_PHONE_DUR_MEAN
where
• LAST_VOWEL_DUR is a basic duration feature;
• ALL_PHONE_DUR_MEAN and ALL_PHONE_DUR_STDEV are statistics taken from table phone_dur.stats for the line corresponding to LAST_VOWEL (another basic feature);
• SPKR_PHONE_DUR_MEAN and SPKR_PHONE_DUR_STDEV are statistics taken from the speaker's table spkr_phone_dur.stats (the speaker is given by SPKR_ID, a base feature) for the line corresponding to LAST_VOWEL.
A.4.4 Normalized Rhyme Duration
• LAST_RHYME_DUR_PH = LAST_RHYME_DUR / PHONES_IN_LAST_RHYME
• LAST_RHYME_DUR_PH_ND = (LAST_RHYME_DUR / PHONES_IN_LAST_RHYME) - LAST_RHYME_PHONE_DUR_MEAN
• LAST_RHYME_DUR_PH_NR = (LAST_RHYME_DUR / PHONES_IN_LAST_RHYME) / LAST_RHYME_PHONE_DUR_MEAN
• LAST_RHYME_NORM_DUR_PH = NORM_LAST_RHYME_DUR / PHONES_IN_LAST_RHYME
• LAST_RHYME_NORM_DUR_PH_ND = (NORM_LAST_RHYME_DUR / PHONES_IN_LAST_RHYME) - NORM_LAST_RHYME_PHONE_DUR_MEAN
• LAST_RHYME_NORM_DUR_PH_NR = (NORM_LAST_RHYME_DUR / PHONES_IN_LAST_RHYME) / NORM_LAST_RHYME_PHONE_DUR_MEAN
• LAST_RHYME_WHOLE_DUR_ND = LAST_RHYME_DUR - LAST_RHYME_WHOLE_DUR_MEAN
• LAST_RHYME_WHOLE_DUR_NR = LAST_RHYME_DUR / LAST_RHYME_WHOLE_DUR_MEAN
• LAST_RHYME_WHOLE_DUR_Z = (LAST_RHYME_DUR - LAST_RHYME_WHOLE_DUR_MEAN) / LAST_RHYME_WHOLE_DUR_STDEV
where
• LAST_RHYME
6. WIN
MEAN_STYLFIT_F0_NEXT_WIN
FIRST_STYLFIT_F0_NEXT_WIN
LAST_STYLFIT_F0_NEXT_WIN
Stylized F0 contour slope features:
• PATTERN_WORD: This feature is composed of a sequence of f, u, and r, representing a falling slope, an unvoiced section, and a rising slope in the word preceding a boundary. Any slope or unvoiced section that contains fewer than min_frame_length frames is skipped. Note that sequences of f's or r's with different slopes are represented as ff or rr.
• PATTERN_WORD_CALLAPSED: Similar to PATTERN_WORD, except that consecutive f's or r's are combined into one f or r.
• PATTERN_SLOPE: Similar to PATTERN_WORD, but instead of the sequence of f's or r's, a sequence of slope values is listed.
• The following features are the same as the corresponding features without NEXT, except these are computed for the word after a boundary:
PATTERN_WORD_NEXT
PATTERN_WORD_COLLAPSED_NEXT
PATTERN_SLOPE_NEXT
• The following features are the same as the corresponding features without WIN, except in these cases the values are computed over the N frames before or after a boundary. The maximum available number of frames is used if there is not enough data.
PATTERN_WORD_WIN
PATTERN_WORD_COLLAPSED_WIN
PATTERN_SLOPE_WIN
PATTERN_WORD_NEXT_WIN
PATTERN_WORD_CALLAPSED_NEXT_WIN
PATTERN_SLOPE_NEXT_WIN
There are also several features that involve counting:
• NO_PREVIOUS_S
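The slope pattern features in this excerpt can be illustrated with a small Python sketch. It is not the tool's Praat code: the function names, the per-frame input (one slope value per frame, None for an unvoiced frame), and the handling of zero slopes (ignored here) are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Optional


def slope_pattern(slopes: List[Optional[float]], min_frame_length: int = 3) -> str:
    """Build an f/u/r string from per-frame slopes (None marks an unvoiced frame)."""
    symbols = []
    for slope, run in groupby(slopes):
        n_frames = len(list(run))
        if n_frames < min_frame_length:
            continue  # runs shorter than min_frame_length are skipped
        if slope is None:
            symbols.append("u")
        elif slope < 0:
            symbols.append("f")
        elif slope > 0:
            symbols.append("r")
        # Assumption: zero-slope runs contribute no symbol.
    return "".join(symbols)


def collapse_pattern(pattern: str) -> str:
    """Merge consecutive f's or r's into a single f or r."""
    out: List[str] = []
    for ch in pattern:
        if out and ch == out[-1] and ch in "fr":
            continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    frames = [-2.0] * 4 + [-0.5] * 3 + [None] * 5 + [1.0] * 2 + [0.8] * 4
    word_pattern = slope_pattern(frames, min_frame_length=3)
    print(word_pattern, collapse_pattern(word_pattern))
```

In the example, the two falling segments with different slopes yield ff, the two-frame rising run is dropped, and collapsing reduces ffur to fur.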
7. end time of the word following a boundary.
• PAUSE_START: The start time of the pause around a boundary. Its value is set to the end time of the preceding word, since the boundary is defined at the end of the preceding word.
• PAUSE_END: The end time of the pause around a boundary. Its value is set to the start time of the following word. If there is no following word (i.e., the boundary appears at the end of the waveform), then it is set to the end time of the waveform.
• PAUSE_DUR: The duration of the pause around a boundary. PAUSE_DUR = PAUSE_END - PAUSE_START.
• WORD_PHONES: The phones in the WORD with their durations (hereafter, duration is measured by the number of frames). The format is phone1 duration1 phone2 duration2 ...
• FLAG: This feature indicates whether the word before a boundary has reliable phone durations. If the duration of any of the phones in that word is larger than a specific threshold (obtained from the phone_dur.stats file for that phone), then this feature is set to SUSP (suspicious word); if the threshold for any of the phones in the word is missing, or the word does not contain phones, the value is set to the value used for unavailable features; otherwise it is set to 0.
• LAST_VOWEL: The last vowel in the word preceding a boundary. If it doesn't exist, then all the related features are set to 77. This is the default treatment for the features whose values are not available.
• LAST_VOWEL_START: The start time of the last vowel in the word
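The pause features defined in this excerpt follow directly from the word alignment. Below is a minimal Python sketch under stated assumptions: the Word record and the pause_features helper are hypothetical names, and times are kept in seconds for readability, whereas the manual measures durations in frames.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    label: str
    start: float  # start time of the word
    end: float    # end time of the word; the boundary is defined here


def pause_features(words: List[Word], wav_end: float) -> List[Dict[str, object]]:
    """Compute PAUSE_START, PAUSE_END and PAUSE_DUR at every word boundary."""
    feats = []
    for i, word in enumerate(words):
        pause_start = word.end
        # With no following word, the pause runs to the end of the waveform.
        pause_end = words[i + 1].start if i + 1 < len(words) else wav_end
        feats.append(
            {
                "WORD": word.label,
                "PAUSE_START": pause_start,
                "PAUSE_END": pause_end,
                "PAUSE_DUR": pause_end - pause_start,
            }
        )
    return feats


if __name__ == "__main__":
    alignment = [Word("store", 0.0, 0.5), Word("and", 0.75, 1.25)]
    for row in pause_features(alignment, wav_end=2.0):
        print(row)
```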
8. log(MIN_STYLFIT_F0_WIN / MAX_STYLFIT_F0_WIN_NEXT)
F0K_WIN_DIFF_MNMN_N = log(MEAN_STYLFIT_F0_WIN / MEAN_STYLFIT_F0_WIN_NEXT)
• Log ratio of the maximum, minimum, and mean of the stylized F0 values between the previous and the next window, normalized by the pitch range:
F0K_WIN_DIFF_HIHI_NG = (log(MAX_STYLFIT_F0_WIN) / log(MAX_STYLFIT_F0_WIN_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WIN_DIFF_HILO_NG = (log(MAX_STYLFIT_F0_WIN) / log(MIN_STYLFIT_F0_WIN_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WIN_DIFF_LOLO_NG = (log(MIN_STYLFIT_F0_WIN) / log(MIN_STYLFIT_F0_WIN_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WIN_DIFF_LOHI_NG = (log(MIN_STYLFIT_F0_WIN) / log(MAX_STYLFIT_F0_WIN_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WIN_DIFF_MNMN_NG = (log(MEAN_STYLFIT_F0_WIN) / log(MEAN_STYLFIT_F0_WIN_NEXT)) / SPKR_FEAT_F0_RANGE
• Difference and log difference between the last, mean, and minimum of the stylized F0 values in a window and the baseline of F0 values:
F0K_DIFF_LAST_KBASELN = LAST_STYLFIT_F0 - SPKR_FEAT_F0_BASELN
F0K_DIFF_MEAN_KBASELN = MEAN_STYLFIT_F0 - SPKR_FEAT_F0_BASELN
F0K_DIFF_WINMIN_KBASELN = MIN_STYLFIT_F0_WIN - SPKR_FEAT_F0_BASELN
F0K_LR_LAST_KBASELN = log(LAST_STYLFIT_F0 / SPKR_FEAT_F0_BASELN)
F0K_LR_MEAN_KBASELN = log(MEAN_STYLFIT_F0 / SPKR_FEAT_F0_BASELN)
F0K_LR_WINMIN_KBASELN = log(MIN_STYLFIT_F0_WIN / SPKR_FEAT_F0_BASELN)
where LAST_STYLFIT_F0, MEAN_STYLFIT_F0, and MIN_S
9. on the menu of ScriptEditor.
Enter parameters: type demo-wavinfo_list.txt and demo/work_dir in the two boxes, and then check yes in the radio box.
Click OK to start processing with the configurations, or Cancel to close the interface. Clicking the Apply button (if available) also starts processing, but it keeps the interface open after the work is done. The Standards button (if available) gives the option to restore the default configurations. Please refer to the Praat manual [2] for details.
Process-related information is displayed in the Praat Info Window. After computation is complete, the statistics files can be found at demo/work_dir/stats_files.
2.2 Prosodic Feature Extraction
Once the global statistics are computed, the tool can proceed to compute the prosodic features. Although our tool is able to produce all the features described in Appendix A, there is an option to limit the output to a selected set of prosodic features.
(Footnote 3: Each class is defined in a separate file under code/pf_list_files. Currently all the features are computed in the tool even when only a subset of features is selected for output. One may also choose to output all the features and select the desired features by other means.)
Below are several pre-defined feature classes:
FULL_FEATURE
BASIC_FEATURE
BASIC_BASE_FEATURE
BASIC_DUR_FEATURE
BASIC_F0_FEATURE
BASIC_ENERGY_FEATURE
DERIVE_FEATURE
DERIVE_NORMALIZED_WORD
DERIVE_
10. pause_dur.stats: For each audio, the listed features are the mean and standard deviation of the pauses in the training database.
• spkr_feat.stats: This table has one row for each speaker. Each row contains a variety of statistics related to consecutive voiced and unvoiced frames, F0 and F0 slope, and energy and energy slope. Note that these F0 and energy values are in logarithm (base e). These are described below.
MEAN_VOICED: The average length of the voiced sections inside the uttered words for all of the audio corresponding to the speaker (for sequences of voiced frames longer than min_frame_length).
STDEV_VOICED: The standard deviation of the length of the voiced sections inside the uttered words for all of the audio corresponding to the speaker (for sequences of voiced frames longer than min_frame_length).
COUNT_VOICED: The number of voiced sections inside the uttered words for all audio corresponding to the speaker (for sequences of voiced frames longer than min_frame_length).
MEAN_UNVOICED: The average length of the unvoiced sections inside the uttered words for all of the audio corresponding to the speaker (for sequences of unvoiced frames longer than min_frame_length).
STDEV_UNVOICED: The standard deviation of the length of the unvoiced sections inside the uttered words for all of the audio corresponding to the speaker (for sequences of unvoiced frames longer than min_frame_length).
COUNT_UNVOICED: The number of unvoiced sections in
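The voiced-section statistics in this excerpt can be sketched as follows. This is an illustrative Python sketch rather than the tool's implementation: voiced_section_stats is a hypothetical helper, it works on a single frame sequence (the tool pools all audio of a speaker and only looks inside uttered words), and the use of the sample standard deviation is an assumption.

```python
from itertools import groupby
from statistics import mean, stdev
from typing import Dict, Sequence


def voiced_section_stats(vuv: Sequence[int], min_frame_length: int) -> Dict[str, float]:
    """MEAN/STDEV/COUNT of voiced runs longer than min_frame_length frames.

    vuv holds one flag per frame: 1 for voiced, 0 for unvoiced.
    """
    lengths = [len(list(run)) for voiced, run in groupby(vuv) if voiced]
    lengths = [n for n in lengths if n > min_frame_length]
    return {
        "MEAN_VOICED": mean(lengths) if lengths else 0.0,
        # Assumption: sample standard deviation, 0.0 when fewer than two runs.
        "STDEV_VOICED": stdev(lengths) if len(lengths) > 1 else 0.0,
        "COUNT_VOICED": len(lengths),
    }


if __name__ == "__main__":
    frames = [1] * 5 + [0] * 2 + [1] * 2 + [0] + [1] * 7
    print(voiced_section_stats(frames, min_frame_length=3))
```

The unvoiced statistics follow by inverting the flags before calling the same helper.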
11. S_WE_NEXT
NO_SUCCESSOR_SSF_NEXT
NO_SUCCESSOR_VF_NEXT
NO_FRAMES_WS_FS_NEXT
• The following features are the same as the corresponding features without WIN, except in these cases the values are computed over the N frames before or after a boundary. The maximum available number of frames is used if there is not enough data.
NO_PREVIOUS_SSF_WIN
NO_PREVIOUS_VF_WIN
NO_FRAMES_LS_WE_WIN
NO_SUCCESSOR_SSF_WIN
NO_SUCCESSOR_VF_WIN
NO_FRAMES_WS_FS_WIN
NO_PREVIOUS_SSF_NEXT_WIN
NO_PREVIOUS_VF_NEXT_WIN
NO_FRAMES_LS_WE_NEXT_WIN
NO_SUCCESSOR_SSF_NEXT_WIN
NO_SUCCESSOR_VF_NEXT_WIN
NO_FRAMES_WS_FS_NEXT_WIN
Features extracted concerning word boundaries:
• PATTERN_BOUNDARY: The last f, r, or u in PATTERN_WORD, concatenated with the first f, r, or u in PATTERN_WORD_NEXT.
• SLOPE_DIFF: The difference between the last non-zero slope (longer than min_frame_length) of the word and the first non-zero slope (longer than min_frame_length) of the next word. If one of the words does not have a non-zero slope that occurs over more than min_frame_length frames, then this feature receives the default value for unavailable features.
A.2.4 Energy Features
The basic energy features are computed in the same way as the basic F0 features. Below is the list of the basic energy features:
• MIN_ENERGY
• MAX_ENERGY
• MEAN_E
12. DUR, PHONES_IN_LAST_RHYME, and NORM_LAST_RHYME_DUR are duration features;
• LAST_RHYME_PHONE_DUR_MEAN is taken from table last_rhyme_phone_dur.stats, column MEAN, for the line corresponding to the audio;
• NORM_LAST_RHYME_PHONE_DUR_MEAN is taken from table norm_last_rhyme_phone_dur.stats, column MEAN, for the line corresponding to the audio;
• LAST_RHYME_WHOLE_DUR_MEAN and LAST_RHYME_WHOLE_DUR_STDEV are taken from table last_rhyme_dur.stats, columns MEAN and STDEV, for the line corresponding to the audio.
A.4.5 F0 Derived Features
• F0 characteristics of the speaker. The SRI prosodic model uses a pitch model to estimate several values that characterize the speaker's pitch. Since our model is based on Praat's built-in function for stylization, we do not have counterparts for some of the pitch characteristics provided by SRI's model. However, in order to compute derived features similar to those defined in SRI's model, we chose to approximate these characteristic values using the pitch statistics:
SPKR_FEAT_F0_MODE = exp(SPKR_F0_MEAN)
SPKR_FEAT_F0_TOPLN = 0.75 * exp(SPKR_F0_MEAN)
SPKR_FEAT_F0_BASELN = 1.5 * exp(SPKR_F0_MEAN)
SPKR_FEAT_F0_STDLN = exp(SPKR_F0_STDEV)
SPKR_FEAT_F0_RANGE = SPKR_FEAT_F0_TOPLN - SPKR_FEAT_F0_BASELN
• Log difference of the maximum, minimum, and mean stylized F0 values between the previous and the next word:
F0K_WORD_DIFF_HIHI_N = log(MAX_STYLFIT_F0 / MAX_STYLFIT_F0_NEXT)
F0K
13. NERGY
• MIN_ENERGY_NEXT
• MAX_ENERGY_NEXT
• MEAN_ENERGY_NEXT
• MIN_ENERGY_WIN
• MAX_ENERGY_WIN
• MEAN_ENERGY_WIN
• MIN_ENERGY_NEXT_WIN
• MAX_ENERGY_NEXT_WIN
• MEAN_ENERGY_NEXT_WIN
• MIN_STYLFIT_ENERGY
• MAX_STYLFIT_ENERGY
• MEAN_STYLFIT_ENERGY
• FIRST_STYLFIT_ENERGY
• LAST_STYLFIT_ENERGY
• MIN_STYLFIT_ENERGY_NEXT
• MAX_STYLFIT_ENERGY_NEXT
• MEAN_STYLFIT_ENERGY_NEXT
• FIRST_STYLFIT_ENERGY_NEXT
• LAST_STYLFIT_ENERGY_NEXT
• MIN_STYLFIT_ENERGY_WIN
• MAX_STYLFIT_ENERGY_WIN
• MEAN_STYLFIT_ENERGY_WIN
• FIRST_STYLFIT_ENERGY_WIN
• LAST_STYLFIT_ENERGY_WIN
• MIN_STYLFIT_ENERGY_NEXT_WIN
• MAX_STYLFIT_ENERGY_NEXT_WIN
• MEAN_STYLFIT_ENERGY_NEXT_WIN
• FIRST_STYLFIT_ENERGY_NEXT_WIN
• LAST_STYLFIT_ENERGY_NEXT_WIN
• ENERGY_PATTERN_WORD
• ENERGY_PATTERN_WORD_CALLAPSED
• ENERGY_PATTERN_SLOPE
• ENERGY_PATTERN_WORD_NEXT
• ENERGY_PATTERN_WORD_CALLAPSED_NEXT
• ENERGY_PATTERN_SLOPE_NEXT
• ENERGY_PATTERN_WORD_WIN
• ENERGY_PATTERN_WORD_CALLAPSED_WIN
• ENERGY_PATTERN_SLOPE_WIN
• ENERGY_PATTERN_WORD_NEXT_WIN
• ENERGY_PATTERN_WORD_CALLAPSED_NEXT_WIN
• ENERGY_PATTERN_SLOPE_NEXT_WIN
• ENERGY_PATTERN_BOUNDARY
• ENERGY_SLOPE_DIFF
A.3 Statistical Tables
• phone_dur.stats: For each phone, the table contains the mean phone duration, the standard deviation of the phone duration, the number of occurrences of that phone in the training database, and the phone duration threshold, computed as follows:
threshold(phone) = mean(phone) + 10 x std_dev(phone)
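The phone_dur.stats table described in this excerpt can be approximated in a few lines. The sketch below is a Python illustration, not the tool's Praat code: phone_dur_stats is a hypothetical helper, the input is a plain mapping from phone label to observed durations, and both the "+" in the threshold formula and the use of the sample standard deviation are assumptions, since the operator is not legible in the source text.

```python
from statistics import mean, stdev
from typing import Dict, List


def phone_dur_stats(
    durations: Dict[str, List[float]], k: float = 10.0
) -> Dict[str, Dict[str, float]]:
    """Per-phone MEAN, STDEV, COUNT and THRESHOLD = mean + k * stdev."""
    table = {}
    for phone, durs in durations.items():
        m = mean(durs)
        # Assumption: sample standard deviation; 0.0 for a single observation.
        s = stdev(durs) if len(durs) > 1 else 0.0
        table[phone] = {
            "MEAN": m,
            "STDEV": s,
            "COUNT": len(durs),
            "THRESHOLD": m + k * s,
        }
    return table


if __name__ == "__main__":
    observed = {"AY": [8, 10, 12], "B": [5]}
    for phone, row in phone_dur_stats(observed).items():
        print(phone, row)
```

A word whose phone duration exceeds THRESHOLD for that phone is the case the FLAG feature marks as suspicious.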
14. NORMALIZED_PAUSE
DERIVE_NORMALIZED_VOWEL
DERIVE_NORMALIZED_RHYME
DERIVE_F0_FEATURE
DERIVE_ENERGY_FEATURE
DERIVE_AVERAGE_PHONE
The desired output features can be selected by including them in the output prosodic feature selection list file. This file is a one-column table with FEATURE_NAME as the column label in the first line, followed by one feature name or one feature class name per line. By convention, the name of a string feature, such as GEN$, ends with the symbol $, and the name of a numeric feature does not end with $. Below is an example of the output prosodic feature selection list file:
FEATURE_NAME
WORD$
WAV$
SPKR_ID$
GEN$
PAUSE_DUR
NORM_LAST_RHYME_DUR
DERIVE_FEATURE
The main script for computing prosodic features is main_batch.praat, which has the following arguments:
• audio info table: this is the same metadata file used in stats_batch.praat. It contains session ID, speaker ID, gender, and the path to the audio file.
• output prosodic feature selection list: the list file described above.
• statistics directory: the directory containing the statistics files produced by stats_batch.praat.
• working directory: the directory for storing parameter files (under subdirectory param_files), local statistics files (under subdirectory stats_files), and prosodic feature files (under subdirectory pf_files).
• use existing param files: choose yes
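The selection list file described in this excerpt is simple enough to generate and check from a script. The following Python sketch is illustrative only: parse_feature_list is a hypothetical helper, the FEATURE_NAME header is taken from the example in this manual, the class names are the pre-defined feature classes listed in this manual, and the $ suffix for string features is an assumption based on Praat's usual naming of string variables.

```python
from typing import Dict, List

# Pre-defined feature classes named in the manual.
FEATURE_CLASSES = {
    "FULL_FEATURE", "BASIC_FEATURE", "BASIC_BASE_FEATURE", "BASIC_DUR_FEATURE",
    "BASIC_F0_FEATURE", "BASIC_ENERGY_FEATURE", "DERIVE_FEATURE",
    "DERIVE_NORMALIZED_WORD", "DERIVE_NORMALIZED_PAUSE",
    "DERIVE_NORMALIZED_VOWEL", "DERIVE_NORMALIZED_RHYME",
    "DERIVE_F0_FEATURE", "DERIVE_ENERGY_FEATURE", "DERIVE_AVERAGE_PHONE",
}


def parse_feature_list(text: str) -> Dict[str, List[str]]:
    """Split a one-column selection list into classes, string and numeric features."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "FEATURE_NAME":
        raise ValueError("first line must be the column label FEATURE_NAME")
    result: Dict[str, List[str]] = {"classes": [], "string": [], "numeric": []}
    for name in lines[1:]:
        if name in FEATURE_CLASSES:
            result["classes"].append(name)
        elif name.endswith("$"):  # assumed marker of a string feature
            result["string"].append(name)
        else:
            result["numeric"].append(name)
    return result


if __name__ == "__main__":
    example = "\n".join(
        ["FEATURE_NAME", "WORD$", "WAV$", "SPKR_ID$", "GEN$",
         "PAUSE_DUR", "NORM_LAST_RHYME_DUR", "DERIVE_FEATURE"]
    )
    print(parse_feature_list(example))
```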
15. Zhongqiang Huang, Lei Chen, Mary P. Harper
Spoken Language Processing Lab, School of Electrical and Computer Engineering, Purdue University, West Lafayette
June 23, 2006
Contents
1 Implementation using Praat
1.1 Audio and Word and Phone Alignment
1.2 Vowel, Rhyme
1.3 VUV, Raw and Stylized Pitch, and Pitch Slope
1.4 Raw and Stylized Energy, and the Energy Slope
1.5 Statistics
2 Using the Tool
2.1 Global Statistics Computation
2.2 Prosodic Feature Extraction
3 Architecture of the Tool and Its Potential Augmentation
3.1 Structure
3.2 Code Organization
3.3 An Augmentation Example
A Prosodic Feature List
A.1 Introduction
A.2 Basic Features
A.2.1 Base Features
A.2.2 Duration Features
A.2.3 F0 Features
16. SF: Number of previous consecutive frames inside the word which have the same slope as the last voiced frame in the word before a boundary (voiced sequences of fewer than min_frame_length frames are not considered).
• NO_PREVIOUS_VF: Number of consecutive voiced frames inside the word, counted from the last voiced frame in the word backwards (voiced sequences of fewer than min_frame_length frames are not considered).
• NO_FRAMES_LS_WE: Number of consecutive frames between the last voiced frame (which belongs to a sequence of voiced frames larger than min_frame_length) in the word preceding a boundary and the end of that word.
• NO_SUCCESSOR_SSF: Number of successor consecutive frames inside the word which have the same slope as the first voiced frame in the word preceding a boundary (voiced sequences of fewer than min_frame_length frames are not considered).
• NO_SUCCESSOR_VF: Number of consecutive voiced frames inside the word, counted from the first voiced frame in the word forward (voiced sequences of fewer than min_frame_length frames are not considered).
• NO_FRAMES_WS_FS: Number of consecutive frames between the first frame of the word preceding a boundary and the first voiced frame in that word (which belongs to a sequence of voiced frames larger than min_frame_length).
• The following features are the same as the corresponding features without NEXT, except these are computed for the word after a boundary:
NO_PREVIOUS_SSF_NEXT
NO_PREVIOUS_VF_NEXT
NO_FRAMES_L
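Four of the counting features in this excerpt depend only on the voiced/unvoiced flags of the frames inside a word. The Python sketch below shows one possible reading of those definitions; it is not the tool's code. The helper names are hypothetical, a voiced run is kept when it has at least min_frame_length frames (the manual leaves the boundary case open), and None stands in for the tool's marker for unavailable values.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple


def voiced_runs(vuv: List[int], min_frame_length: int) -> List[Tuple[int, int]]:
    """Return (start index, length) of voiced runs that are long enough."""
    runs, pos = [], 0
    for voiced, group in groupby(vuv):
        n = len(list(group))
        if voiced and n >= min_frame_length:
            runs.append((pos, n))
        pos += n
    return runs


def counting_features(vuv: List[int], min_frame_length: int = 3) -> Dict[str, Optional[int]]:
    """Counting features for one word, given its per-frame voicing flags."""
    names = ["NO_PREVIOUS_VF", "NO_FRAMES_LS_WE", "NO_SUCCESSOR_VF", "NO_FRAMES_WS_FS"]
    runs = voiced_runs(vuv, min_frame_length)
    if not runs:
        return {name: None for name in names}  # no usable voiced section
    first_start, first_len = runs[0]
    last_start, last_len = runs[-1]
    return {
        "NO_PREVIOUS_VF": last_len,                              # last voiced run
        "NO_FRAMES_LS_WE": len(vuv) - (last_start + last_len),   # frames after it
        "NO_SUCCESSOR_VF": first_len,                            # first voiced run
        "NO_FRAMES_WS_FS": first_start,                          # frames before it
    }


if __name__ == "__main__":
    word_frames = [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
    print(counting_features(word_frames, min_frame_length=3))
```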
17. TYLFIT_F0 are F0 features.
• Normalization of the mean of the stylized F0 values in the word and next word, using the baseline, topline, and range of F0 values:
F0K_ZRANGE_MEAN_KBASELN = (MEAN_STYLFIT_F0 - SPKR_FEAT_F0_BASELN) / SPKR_FEAT_F0_RANGE
F0K_ZRANGE_MEAN_KTOPLN = (SPKR_FEAT_F0_TOPLN - MEAN_STYLFIT_F0) / SPKR_FEAT_F0_RANGE
F0K_ZRANGE_MEANNEXT_KBASELN = (MEAN_STYLFIT_F0_NEXT - SPKR_FEAT_F0_BASELN) / SPKR_FEAT_F0_RANGE
F0K_ZRANGE_MEANNEXT_KTOPLN = (SPKR_FEAT_F0_TOPLN - MEAN_STYLFIT_F0_NEXT) / SPKR_FEAT_F0_RANGE
• Difference and log difference between the mean and maximum of the stylized F0 values in the next word and the topline of F0 values:
F0K_DIFF_MEANNEXT_KTOPLN = MEAN_STYLFIT_F0_NEXT - SPKR_FEAT_F0_TOPLN
F0K_DIFF_MAXNEXT_KTOPLN = MAX_STYLFIT_F0_NEXT - SPKR_FEAT_F0_TOPLN
F0K_DIFF_WINMAXNEXT_KTOPLN = MAX_STYLFIT_F0_NEXT_WIN - SPKR_FEAT_F0_TOPLN
F0K_LR_MEANNEXT_KTOPLN = log(MEAN_STYLFIT_F0_NEXT / SPKR_FEAT_F0_TOPLN)
F0K_LR_MAXNEXT_KTOPLN = log(MAX_STYLFIT_F0_NEXT / SPKR_FEAT_F0_TOPLN)
F0K_LR_WINMAXNEXT_KTOPLN = log(MAX_STYLFIT_F0_NEXT_WIN / SPKR_FEAT_F0_TOPLN)
• Normalization of the maximum of the stylized F0 values in the word and next word, using the pitch mode and pitch range of F0 values:
F0K_MAXK_MODE_N = log(MAX_STYLFIT_F0 / SPKR_FEAT_F0_MODE)
F0K_MAXK_NEXT_MODE_N = log(MAX_STYLFIT_F0_NEXT / SPKR_FEAT_F0_MODE)
F0K_MAXK_MODE_Z = (MAX_STYLFIT_F0 - SPKR_FEAT_F0_MODE) / SPKR_FEAT_F0_RANGE
18. F0K_MAXK_NEXT_MODE_Z = (MAX_STYLFIT_F0_NEXT - SPKR_FEAT_F0_MODE) / SPKR_FEAT_F0_RANGE
where MAX_STYLFIT_F0 and MAX_STYLFIT_F0_NEXT are F0 features.
• Log difference between the stylized F0 values at the word extremes:
F0K_WORD_DIFF_BEGBEG = log(FIRST_STYLFIT_F0 / FIRST_STYLFIT_F0_NEXT)
F0K_WORD_DIFF_ENDBEG = log(LAST_STYLFIT_F0 / FIRST_STYLFIT_F0_NEXT)
F0K_INWRD_DIFF = log(FIRST_STYLFIT_F0 / LAST_STYLFIT_F0)
where FIRST_STYLFIT_F0, LAST_STYLFIT_F0, and FIRST_STYLFIT_F0_NEXT are F0 features.
• Slope patterns and the normalization:
LAST_SLOPE: The last f or r in the F0 feature PATTERN_SLOPE
FIRST_SLOPE_NEXT: The first f or r in the F0 feature PATTERN_SLOPE_NEXT
SLOPE_DIFF_N = SLOPE_DIFF / SPKR_FEAT_F0_SD_SLOPE
LAST_SLOPE_N = LAST_SLOPE / LAST_STYLFIT_F0
where SLOPE_DIFF, LAST_DIFF, and LAST_STYLFIT_F0 are F0 features, and SPKR_FEAT_F0_SD_SLOPE is obtained from table spkr_feat.stats, column STDEV_SLOPE, for the line corresponding to SPKR.
A.4.6 Energy Derived Features
The derived energy features are computed in the same way as the derived F0 features. The following is a list of the derived energy features:
• ENERGY_WORD_DIFF_HIHI_N
• ENERGY_WORD_DIFF_HILO_N
• ENERGY_WORD_DIFF_LOLO_N
• ENERGY_WORD_DIFF_LOHI_N
• ENERGY_WORD_DIFF_MNMN_N
• ENERGY_WORD_DIFF_HIHI_NG
• ENERGY_WORD_DIFF_HILO_NG
• ENERGY_WORD_DIFF_LOLO_NG
• ENERGY_WORD_DIFF
19. WORD_DIFF_HILO_N = log(MAX_STYLFIT_F0 / MIN_STYLFIT_F0_NEXT)
F0K_WORD_DIFF_LOLO_N = log(MIN_STYLFIT_F0 / MIN_STYLFIT_F0_NEXT)
F0K_WORD_DIFF_LOHI_N = log(MIN_STYLFIT_F0 / MAX_STYLFIT_F0_NEXT)
F0K_WORD_DIFF_MNMN_N = log(MEAN_STYLFIT_F0 / MEAN_STYLFIT_F0_NEXT)
where MAX_STYLFIT_F0, MAX_STYLFIT_F0_NEXT, MIN_STYLFIT_F0, MIN_STYLFIT_F0_NEXT, MEAN_STYLFIT_F0, and MEAN_STYLFIT_F0_NEXT are all F0 features.
• Log ratio of the maximum, minimum, and mean of the stylized F0 values between the previous and the next word, normalized by the pitch range:
F0K_WORD_DIFF_HIHI_NG = (log(MAX_STYLFIT_F0) / log(MAX_STYLFIT_F0_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WORD_DIFF_HILO_NG = (log(MAX_STYLFIT_F0) / log(MIN_STYLFIT_F0_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WORD_DIFF_LOLO_NG = (log(MIN_STYLFIT_F0) / log(MIN_STYLFIT_F0_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WORD_DIFF_LOHI_NG = (log(MIN_STYLFIT_F0) / log(MAX_STYLFIT_F0_NEXT)) / SPKR_FEAT_F0_RANGE
F0K_WORD_DIFF_MNMN_NG = (log(MEAN_STYLFIT_F0) / log(MEAN_STYLFIT_F0_NEXT)) / SPKR_FEAT_F0_RANGE
• Log difference of the maximum, minimum, and mean of the stylized F0 values between the previous and the next window:
F0K_WIN_DIFF_HIHI_N = log(MAX_STYLFIT_F0_WIN / MAX_STYLFIT_F0_WIN_NEXT)
F0K_WIN_DIFF_HILO_N = log(MAX_STYLFIT_F0_WIN / MIN_STYLFIT_F0_WIN_NEXT)
F0K_WIN_DIFF_LOLO_N = log(MIN_STYLFIT_F0_WIN / MIN_STYLFIT_F0_WIN_NEXT)
F0K_WIN_DIFF_LOHI_N
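The two families of word-level comparisons in this excerpt can be written as two small functions. This Python sketch rests on assumptions that should be checked against the tool's derive script: the operators are not legible in the source text, so "log difference" is read here as log(a / b) and "log ratio normalized by the pitch range" as (log(a) / log(b)) / range. The function names are hypothetical.

```python
import math


def log_difference(value: float, value_next: float) -> float:
    """Assumed form of the *_N features: log(a / b) = log(a) - log(b)."""
    return math.log(value / value_next)


def log_ratio_range_normalized(value: float, value_next: float, f0_range: float) -> float:
    """Assumed form of the *_NG features: (log(a) / log(b)) / range."""
    return (math.log(value) / math.log(value_next)) / f0_range


if __name__ == "__main__":
    # Hypothetical stylized F0 maxima (Hz) for a word and the next word.
    max_f0, max_f0_next, f0_range = 220.0, 180.0, 90.0
    print("HIHI_N :", log_difference(max_f0, max_f0_next))
    print("HIHI_NG:", log_ratio_range_normalized(max_f0, max_f0_next, f0_range))
```

The same two functions cover the HILO, LOLO, LOHI, and MNMN variants by swapping in the minimum or mean values.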
20. _LOHI_NG
• ENERGY_WORD_DIFF_MNMN_NG
• ENERGY_WIN_DIFF_HIHI_N
• ENERGY_WIN_DIFF_HILO_N
• ENERGY_WIN_DIFF_LOLO_N
• ENERGY_WIN_DIFF_LOHI_N
• ENERGY_WIN_DIFF_MNMN_N
• ENERGY_WIN_DIFF_HIHI_NG
• ENERGY_WIN_DIFF_HILO_NG
• ENERGY_WIN_DIFF_LOLO_NG
• ENERGY_WIN_DIFF_LOHI_NG
• ENERGY_WIN_DIFF_MNMN_NG
• ENERGY_DIFF_LAST_KBASELN
• ENERGY_DIFF_MEAN_KBASELN
• ENERGY_DIFF_WINMIN_KBASELN
• ENERGY_LR_LAST_KBASELN
• ENERGY_LR_MEAN_KBASELN
• ENERGY_LR_WINMIN_KBASELN
• ENERGY_ZRANGE_MEAN_KBASELN
• ENERGY_ZRANGE_MEAN_KTOPLN
• ENERGY_ZRANGE_MEANNEXT_KBASELN
• ENERGY_ZRANGE_MEANNEXT_KTOPLN
• ENERGY_DIFF_MEANNEXT_KTOPLN
• ENERGY_DIFF_MAXNEXT_KTOPLN
• ENERGY_DIFF_WINMAXNEXT_KTOPLN
• ENERGY_LR_MEANNEXT_KTOPLN
• ENERGY_LR_MAXNEXT_KTOPLN
• ENERGY_LR_WINMAXNEXT_KTOPLN
• ENERGY_MAXK_MODE_N
• ENERGY_MAXK_NEXT_MODE_N
• ENERGY_MAXK_MODE_Z
• ENERGY_MAXK_NEXT_MODE_Z
• ENERGY_WORD_DIFF_BEGBEG
• ENERGY_WORD_DIFF_ENDBEG
• ENERGY_INWRD_DIFF
• ENERGY_LAST_SLOPE
• ENERGY_SLOPE_DIFF_N
• ENERGY_LAST_SLOPE_N
A.4.7 Average Phone Duration
• AVG_PHONE_DUR_Z = (sum over every phone in the word of phone_z(phone)) / #phones
• MAX_PHONE_DUR_Z = max over every phone in the word of phone_z(phone)
• AVG_PHONE_DUR_N = (sum over every phone in the word of phone_n(phone)) / #phones
• MAX_PHONE_DUR_N = max over every phone in the word of phone_n(phone)
where
• #phones is the number of phones in the word;
• phone_z(phone) = (phone_dur(phone) - phone_dur_mean(phone)) / phone_dur_stdev(phon
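The average phone duration features of A.4.7 can be sketched as below. This is an illustrative Python version rather than the tool's code: average_phone_features is a hypothetical helper, the per-phone statistics are passed as (mean, stdev) pairs standing in for the rows of phone_dur.stats, and the word is assumed to contain at least one phone with a non-zero standard deviation.

```python
from typing import Dict, List, Tuple


def average_phone_features(
    word_phones: List[Tuple[str, float]],
    stats: Dict[str, Tuple[float, float]],
) -> Dict[str, float]:
    """AVG/MAX of z-normalized and mean-normalized phone durations in a word.

    word_phones: (phone, duration) pairs, as carried by WORD_PHONES.
    stats: phone -> (mean duration, standard deviation).
    """
    # phone_z = (dur - mean) / stdev ; phone_n = dur / mean
    z = [(dur - stats[p][0]) / stats[p][1] for p, dur in word_phones]
    n = [dur / stats[p][0] for p, dur in word_phones]
    count = len(word_phones)
    return {
        "AVG_PHONE_DUR_Z": sum(z) / count,
        "MAX_PHONE_DUR_Z": max(z),
        "AVG_PHONE_DUR_N": sum(n) / count,
        "MAX_PHONE_DUR_N": max(n),
    }


if __name__ == "__main__":
    phones = [("B", 6.0), ("AY", 14.0)]
    table = {"B": (5.0, 2.0), "AY": (10.0, 4.0)}
    print(average_phone_features(phones, table))
```

Passing the speaker-specific statistics instead of the global ones gives the ZSP and NSP variants of A.4.8.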
21. ations, such as table creation, value updating, value retrieval, etc., to handle these values.
stats/stats.praat: contains routines for computing statistics.
stats/routine.praat: contains routines for obtaining various basic elements.
stats/utils.praat: contains some miscellaneous utility routines.
stats/config.praat: contains the configuration of the pre-defined parameter values, such as frame and window size, default file names, etc.
• Scripts for Extracting Prosodic Features
code/main_batch.praat: interface; accepts inputs and controls the overall operation.
code/operations.praat: the highest level of operation flow.
code/io.praat: contains routines for controlling file input/output in Praat.
code/table.praat: contains routines for controlling Praat Table operations. We use Praat Tables for holding various intermediate values, and have designed various operations, such as table creation, value updating, value retrieval, etc., to handle these values.
code/fetch.praat: contains higher-level routines for extracting basic prosodic features by calling routines in routine.praat.
code/routine.praat: contains routines for obtaining various basic elements, and lower-level routines that implement feature extraction and support the higher-level routines in fetch.praat.
code/derive.praat: contains routines for computing derived features.
code/utils.praat: contains some miscellaneous utility routines.
code/config.praat: contains the configurati
22. ber of F0 values counted over the uttered words in all of the audio corresponding to the speaker.
MEAN_ENERGY_SLOPE: The mean slope over the uttered words in all of the audio corresponding to the speaker. It is computed only over the sequences of frames that have the same slope for more than min_frame_length frames.
STDEV_ENERGY_SLOPE: The standard deviation of the slope over the uttered words in all of the audio corresponding to the speaker. It is computed only over the sequences of frames that have the same slope for more than min_frame_length frames.
COUNT_ENERGY_SLOPE: The number of energy slopes counted over the uttered words in all audio corresponding to the speaker. It is computed only over the sequences of frames that have the same slope for more than min_frame_length frames.
• spkr_phone_dur.stats: One table for each speaker. These tables are similar to phone_dur.stats, but they involve all of the audio corresponding to the speaker.
• last_rhyme_dur.stats: For each audio, the listed features are the mean duration of the last rhyme, the standard deviation of the last rhyme duration, and the number of last rhymes used in the computation of these statistics.
• last_rhyme_phone_dur.stats: For each audio, the listed features are the mean phone duration for the phones in the last rhyme, the standard deviation of the phone duration for the phones in the last rhyme, and the number of last rhymes used in the computation of these statistics.
• pause_dur.sta
23. can be found at demo/work_dir/pf_files.
Chapter 3: Architecture of the Tool and Its Potential Augmentation
Our initial objective in implementing this tool was not only to utilize many of the features that have been used in other research efforts to support our research, but also to make use of the flexibility, popularity, and extensibility of Praat to incorporate other useful features, which can be computed with, or by augmenting, the current or future versions of Praat. In this chapter, we present the current structure of our tool and the organization of the code, and provide an example of adding new features.
3.1 Structure
As discussed in the previous chapters and illustrated in Figure 3.1, the procedures of our tool include:
• Global Statistics Computation: This module should be run prior to the feature extraction process in order to provide the global statistics needed for normalization. Although it is not illustrated in Figure 3.1, it also contains a pre-processing phase in which the basic elements (i.e., the parameter files) are extracted based on the audio files and the alignments. These basic elements can be reused later by activating the "use existing param files" option when configuring main_batch.praat.
• Feature Extraction: After obtaining the global statistics, the tool proceeds to extract the prosodic features by following the procedures below:
- Tool Initialization
- Performing the following steps for each audio file:
24. Initialization; Pre-processing; Basic Feature Extraction; Local Statistics Calculation; Derived Feature Computation; Clean-up
- Tool Clean-up and Termination
Figure 3.1: Data flow diagram for the tool. (Stages shown: Global Statistics Computation from the corpus and the wav info list; Tool Initialization; then, for each audio, Initialization, Pre-processing, Basic Feature Extraction, Local Statistics Computation, Derived Feature Computation, and Clean-up, repeated until all audio is finished; then Tool Clean-up.)
3.2 Code Organization
The Praat code consists of several scripts, each focusing on a certain type of processing. We list each script below, and for each give a brief description of the included routines. For simplicity, we separate the scripts for computing the statistics from those used for feature extraction into different directories.
• Scripts for Computing Global Statistics
stats/stats_batch.praat: interface; accepts inputs and controls the overall operation.
stats/operations.praat: the highest level of operation flow.
stats/io.praat: contains routines for controlling file input/output in Praat.
stats/table.praat: contains routines for controlling Praat Table operations. We use Praat Tables for holding various intermediate values, and have designed various oper
25. e
• phone_n(phone) = phone_dur(phone) / phone_dur_mean(phone)
• phone_dur(phone) is the phone duration for phone, obtained from the feature WORD_PHONES, and phone_dur_mean(phone) and phone_dur_stdev(phone) are taken from table phone_dur.stats.
A.4.8 Speaker Specific Normalization
• AVG_PHONE_DUR_ZSP = (sum over every phone in the word of phone_zsp(phone)) / #phones
• MAX_PHONE_DUR_ZSP = max over every phone in the word of phone_zsp(phone)
• AVG_PHONE_DUR_NSP = (sum over every phone in the word of phone_nsp(phone)) / #phones
• MAX_PHONE_DUR_NSP = max over every phone in the word of phone_nsp(phone)
where
• #phones is the number of phones in the word;
• phone_zsp(phone) = (phone_dur(phone) - spkr_phone_dur_mean(phone)) / spkr_phone_dur_stdev(phone)
• phone_nsp(phone) = phone_dur(phone) / spkr_phone_dur_mean(phone)
• phone_dur(phone) is the phone duration for phone, obtained from the feature WORD_PHONES; spkr_phone_dur_mean(phone) and spkr_phone_dur_stdev(phone) are taken from table spkr_phone_dur.stats.
Below are the features that are similar to the PHONE_DUR features, except these are computed only over the vowels, not over every phone in the word:
• AVG_VOWEL_DUR_Z
• MAX_VOWEL_DUR_Z
• AVG_VOWEL_DUR_N
• MAX_VOWEL_DUR_N
• AVG_VOWEL_DUR_ZSP
• MAX_VOWEL_DUR_ZSP
• AVG_VOWEL_DUR_NSP
• MAX_VOWEL_DUR_NSP
Bibliography
[1] J. Ang, Y. Liu, and E. Shriberg. Automatic dialog act segmentation and classification in mul
26. e of logarithmic F0 values and energy-related statistics are calculated. Given the duration, F0, and energy information, as well as the statistics, it is straightforward to extract the prosodic features at each word boundary according to the definition of features in Appendix A (see also Figure 3.1 for the data flow diagram of the tool). Table 1.1 summarizes the use of raw duration, F0, and energy in the computation of the prosodic features. In the rest of this chapter, we describe the requirements for using the tool; in particular, we discuss audio file and word/phone alignment requirements. We also give details on how vowel, rhyme, VUV, pitch (raw, stylized, and its slope), and energy (raw, stylized, and its slope) are calculated.

[Figure 1.2 diagram. From the audio: Energy Calculation yields Raw Energy; Energy-Pitch conversion and Stylization yield Stylized Energy and Energy Slope; Pitch Tracking yields Raw Pitch; Voiced/Unvoiced Detection yields VUV; Pitch Stylization yields Stylized Pitch and Pitch Slope. From the audio and the transcript: Forced Alignment yields the Word alignment.]

Figure 1.2: Procedures to obtain the basic elements that are directly needed for prosodic feature extraction. The grayed ovals represent operations implemented in the tool, while the grayed rectangles represent the basic elements. Note that Forced Alignment is not a part of th
27. e prosodic content is required, although if it were available it could be used. Instead, the prosodic features are extracted directly from the speech signal given its time alignment to a human-generated transcription or to automatic speech recognition (ASR) output. A prosody model can then be trained using these features and combined with a language model to build an event detection system.

Many of the past efforts on speech event detection utilize simple prosodic features such as pause duration [4]. By contrast, the above direct modeling efforts utilize, to good effect, a large number of features extracted using a proprietary prosodic feature extraction suite developed at SRI [3]. SRI's feature extraction tool is Unix script-based, combining ESPS/Waves for basic prosodic analysis (e.g., preliminary pitch tracking and energy computation, get_F0) with additional software components, such as a piecewise linear model [12] for pitch stylization. Inspired by the SRI suite, we have developed an open source automatic prosodic feature extraction tool [5] based on Praat [2] to extract a wide variety of prosodic features for event detection tasks. By creating this tool, we hope to provide a framework for building stronger baseline comparisons among systems and to support more effective sharing of prosodic features.

This document is organized as follows. Chapter 1 discusses some important implementation details. Chapter 2 is a user manual on the tool
28. e tool and so it appears in white.

Table 1.1: The use of basic elements for extracting various features. For example, the word alignment is used to compute duration features, F0 features, and energy features, while the voiced/unvoiced segmentation (VUV) is only used to compute F0 features.
[Table 1.1; columns: Duration, F0, Energy]

1.1 Audio and Word and Phone Alignment
An audio file can be in any format (e.g., WAV and AIFF files) that can be loaded in Praat entirely using the Read from file... command. If the audio quality is poor, the quality of the pitch and energy features may be degraded.

Word and phone alignments are required to be in Praat TextGrid format, one tier each, in separate files. Silence intervals should be empty. We also require that the timing of the phones align with the timing of the word at the start and end of a word boundary. Since the prosodic features are extracted around the word boundaries, it is important to have high-quality alignments. Different pronunciation dictionaries may use different phoneme sets; however, our system currently assumes that the labels from the phone alignment are consistent with CMU's dictionary or ISIP's dictionary. In addition, our code only supports capitalized phone labels. Users can modify the code to support other phone sets (see Section 3.2).

1.2 Vowel and Rhyme
We extract the vowel and the last rhyme from the phone alignment file and store them in se
29. eded.

SESSION  SPEAKER  GENDER  WAVEFORM
demo_C   C        female  demo/data/demo_C.wav
demo_D   D        male    demo/data/demo_D.wav
demo_E   E        male    demo/data/demo_E.wav
demo_F   F        male    demo/data/demo_F.wav
demo_G   G        male    demo/data/demo_G.wav

The main script for computing statistics, stats_batch.praat, has the following arguments:
• audio info table: the path to the metadata file described above.
• working directory: the directory for storing parameter files (under a subdirectory param_files) and statistics files (under a subdirectory stats_files). If the statistics directory already exists, it is cleaned up. These parameter and statistics files are created during the process of statistics computation.
• using existing param files: choose yes or no. If this option is set to yes, the tool attempts to verify the existence of the parameter files and uses the existing files in the parameter directory, or generates the parameter files on the fly if they don't exist. If this option is set to no, the tool always generates parameter files, whether or not they exist.

Here is an example of running stats_batch.praat at the command line:

praat stats_batch.praat demo-wavinfo_list.txt demo/work_dir yes

Below are the steps to run the same example in the Praat ScriptEditor:
1. Run Praat.
2. Open stats_batch.praat from Read → Read from file... on the menu of Praat Objects.
3. Click Run → Run
30. he corresponding features without WIN, except that in these cases the values are computed over the N frames before or after a boundary. The maximum number of frames is used if there is not enough data.
  MIN_F0_WIN
  MAX_F0_WIN
  MEAN_F0_WIN
  MIN_F0_NEXT_WIN
  MAX_F0_NEXT_WIN
  MEAN_F0_NEXT_WIN

Features computed using stylized F0:
• MIN_STYLFIT_F0: The minimum stylized F0 value for the word preceding a boundary.
• MAX_STYLFIT_F0: The maximum stylized F0 value for the word preceding a boundary.
• MEAN_STYLFIT_F0: The mean stylized F0 value for the word preceding a boundary.
• FIRST_STYLFIT_F0: The first stylized F0 value for the word preceding a boundary.
• LAST_STYLFIT_F0: The last stylized F0 value for the word preceding a boundary.
• The following features are the same as the corresponding features without NEXT, except that these are computed for the word after a boundary.
  MIN_STYLFIT_F0_NEXT
  MAX_STYLFIT_F0_NEXT
  MEAN_STYLFIT_F0_NEXT
  FIRST_STYLFIT_F0_NEXT
  LAST_STYLFIT_F0_NEXT
• The following features are the same as the corresponding features without WIN, except that in these cases the values are computed over the N frames before or after a boundary. The maximum number of frames is used if there is not enough data.
  MIN_STYLFIT_F0_WIN
  MAX_STYLFIT_F0_WIN
  MEAN_STYLFIT_F0_WIN
  FIRST_STYLFIT_F0_WIN
  LAST_STYLFIT_F0_WIN
  MIN_STYLFIT_F0_NEXT_WIN
  MAX_STYLFIT_F0_NEXT
31. ine the feature names and list (append) them in code/pf_list_files/feature_name_table.Tab.
2. Write code in code/routine.praat to perform Fujisaki analysis on the audio, just as we did for stylization. The phrase and accent contours can be stored in two Praat PitchTier objects for later access.
3. Write code in code/operations.praat and code/io.praat to ensure that the Fujisaki analysis is performed appropriately in the Pre-processing step.
4. Write code in code/fetch.praat and code/routine.praat to implement the details of extracting features based on the phrase and accent contours.
5. Write code in code/derive.praat to compute the derived features from the basic features on the contours.

Appendix A Prosodic Feature List

A.1 Introduction
In our tool use scenario, we have a set of audio files, and each audio file has its word and phone alignments. There is exactly one speaker in each audio file, although the same speaker can appear in several audio files. Since most of our current research focuses on sentence boundary detection, the prosodic features are extracted around each word boundary. Here are some definitions that will be used throughout for describing features:
• Frame: In Praat, pitch and energy are calculated on each frame, the length of which is set to 0.01 s by default. The start/end time and the duration of an object are measured by the index or the number of frames in the waveform.
• Boundary: Prosodic features are ca
32. lculated around each boundary, which is the end of a word. Feature extraction is based on the preceding and following words and the preceding and following windows, each of which has a size of N frames.
• Window: Some features are computed within a window preceding or following a boundary. The window size N is set to 0.2 s by default. If there are not enough frames at the beginning or at the end of a waveform to make a full-size window, then the maximum size window is used.
• Missing value: There are situations where some features are not available; e.g., the maximum stylized pitch value is not available for an unvoiced region. When this happens, a special symbol is used to denote the missing value.

A.2 Basic Features
The features described here have been inspired by [3]; however, our implementation may differ. Each feature is computed in terms of a boundary in the waveform under consideration.

A.2.1 Base Features
• WAV: The path to the current audio file.
• SPK_ID: The speaker identification label for the current waveform.
• SPK_GEN: The gender of the speaker.

A.2.2 Duration Features
• WORD: The word preceding a boundary.
• WORD_START: The start time (hereafter, start/end time is measured by the index of frame) of the word preceding a boundary.
• WORD_END: The end time of the word preceding a boundary.
• FWORD: The word following a boundary.
• FWORD_START: The start time of the word following a boundary.
• FWORD_END: The
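The frame and window definitions above can be made concrete with a short Python sketch. With the stated defaults (0.01 s frames, 0.2 s windows), a full window spans 20 frames, and a window near either end of the waveform is clipped to whatever frames exist. The function names, the 0-based frame indices, and the half-open (start, end) intervals are assumptions of this example, not conventions documented for the tool.

```python
FRAME_LENGTH_S = 0.01   # default frame length stated in the text
WINDOW_LENGTH_S = 0.2   # default window length stated in the text


def window_size_in_frames(window_s=WINDOW_LENGTH_S, frame_s=FRAME_LENGTH_S):
    """Convert the window length in seconds to a number of frames (N)."""
    return round(window_s / frame_s)


def windows_around(boundary, total_frames, n=None):
    """Return the (start, end) frame ranges before and after a boundary.

    boundary     -- frame index of the word boundary (assumed 0-based)
    total_frames -- number of frames in the waveform
    Ranges are half-open and clipped to the waveform, so a shorter
    "maximum size" window is returned when a full one does not fit.
    """
    if n is None:
        n = window_size_in_frames()
    before = (max(0, boundary - n), boundary)
    after = (boundary, min(total_frames, boundary + n))
    return before, after
```

For instance, a boundary at frame 5 leaves only five frames for the preceding window, while the following window still gets its full 20 frames.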
33. nformation are also gathered in this step. Please refer to Appendix A.3 for more information. Note that the statistics on each session are computed along with the feature extraction process described in the next section.

A metadata file is needed to provide the session ID, speaker ID, gender, and the path (either absolute or relative to the Praat script) to the audio file. Our tool supports multiple sessions per speaker and makes use of the speaker information across the sessions for normalization. The TextGrid-format word/phone alignments are assumed to be in the same directory as the audio file, and their file names, as well as the names of the other files generated for that audio recording, are hard-coded based on the name of the audio file. For example, if the audio file is demo/data/demo_C.wav, then the word and phone alignment files are both located in the directory demo/data and are named demo_C-word.TextGrid and demo_C-phone.TextGrid, respectively.

Below is an example metadata file, demo-wavinfo_list.txt (footnote 1: the path is relative to the Praat script). This package comes with a demo of running the tool. The default path settings in the Praat scripts and the metadata file demo-wavinfo_list.txt are configured to run the demo on any *nix platform. A slight change of the path delimiter from / to \ is needed if running the demo on a Windows OS. The headings (i.e., SESSION, SPEAKER, GENDER, and WAVEFORM) are ne
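As an illustration of the metadata file described above, the following Python sketch reads a table with the four headings and groups sessions by speaker, which is the information the tool needs for cross-session normalization. It is not part of the tool: the function names are hypothetical, and the sketch assumes whitespace-separated columns and paths without spaces, since the exact delimiter is not stated here.

```python
REQUIRED_HEADINGS = ("SESSION", "SPEAKER", "GENDER", "WAVEFORM")


def parse_wavinfo(text):
    """Parse metadata text into a list of dicts keyed by the headings.

    Assumes whitespace-separated columns (an assumption of this sketch).
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("empty metadata file")
    header, body = rows[0], rows[1:]
    missing = [h for h in REQUIRED_HEADINGS if h not in header]
    if missing:
        raise ValueError(f"missing headings: {missing}")
    return [dict(zip(header, fields)) for fields in body]


def sessions_by_speaker(records):
    """Group session IDs by speaker ID, preserving file order."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["SPEAKER"], []).append(rec["SESSION"])
    return grouped
```

A speaker that appears in several rows ends up with several sessions in the grouping, mirroring the multiple-sessions-per-speaker case.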
34. on of the predefined parameter values, such as frame and window size, default file names, etc.
  pf_list_files/feature_name_table.Tab: contains a list of the feature names that are implemented in our tool. The other files in the same directory as this file contain lists of feature names for different types of features, e.g., basic F0 features and derived F0 features.

While the above brief description reveals some of the main functionality of each script, the boundaries between them are somewhat vague. Some of the routines could have been put into one script or another, and some scripts contain extra operations to simplify our coding effort. We will try to make the organization of the code clearer in a future release so that it is easier for users to modify.

3.3 An Augmentation Example
As we said above, we would like to make this tool easily extensible, and an example is the best way to show this. Currently, Praat doesn't have built-in functions to build a Fujisaki model, which decomposes the pitch contour into phrase contours and accent contours. It is believed that features based on these superpositional contours should be helpful for a variety of prosodic analysis tasks. If Fujisaki analysis became available in Praat, we could use the following steps to implement new features capitalizing on this new functionality. For simplicity, we assume that no statistics are needed for the new features.
1. Def
35. parate TextGrid files, i.e., each vowel or rhyme is considered as an interval with a label of "vowel" or "rhyme" and a starting and an ending time. The other intervals are unlabeled, i.e., they are blank. Currently, our vowel phone set consists of AA, AE, AH, AO, AW, AX, AXR, AY, EH, ER, EY, IH, IX, IY, OW, OY, UH, and UW; changes can be made by modifying the code in routine isVowel in script code/routine.praat (see Section 3.2). We take the last rhyme to be the sequence of phones starting from the last vowel and covering all the remaining phones in the word.

1.3 Raw and Stylized Pitch and Pitch Slope
Praat's existing pitch tracking and stylization functionality is one of the reasons we chose to build the tool on Praat. Additionally, Praat provides various native objects and operations for holding and accessing pitch-related information. Below is a listing of how pitch information is represented in Praat data structures.
• Raw Pitch (PitchTier): We rely on Praat's autocorrelation-based pitch tracking algorithm to extract raw pitch values using a gender-dependent pitch range (75-300 Hz for male and 100-600 Hz for female). This is accomplished by using the command To Pitch (ac)... on the Sound object to obtain a Pitch object. Pitch values are further smoothed by the command Smooth... and stored in a PitchTier by using the command Down to PitchTier. Praat provides several useful functions for opera
36. preceding a boundary.
• LAST_VOWEL_END: The end time of the last vowel in the word preceding a boundary.
• LAST_VOWEL_DUR: The duration of the last vowel in the word preceding a boundary.
• LAST_RHYME_START: The start time of the last rhyme in the word preceding a boundary. The last rhyme is considered to be the sequence of phones starting with the last vowel and running to the end of the word.
• LAST_RHYME_END: The end time of the last rhyme in the word preceding a boundary. It is the same as WORD_END.
• $\text{NORM\_LAST\_RHYME\_DUR} = \sum_{\text{phone\_in\_word}} \dfrac{\text{dur}_{\text{phone}} - \text{mean}_{\text{phone}}}{\text{std\_dev}_{\text{phone}}}$, where dur_phone is the duration of the phone in the current audio, and mean_phone and std_dev_phone are the average duration of the phone and the standard deviation of the duration of that phone in the training data; both values are obtained from the phone_dur.stats file.
• PHONES_IN_LAST_RHYME: The total number of phones in the last rhyme.

A.2.3 F0 Features
Features computed using the raw F0 extracted by Praat:
• MIN_F0: The minimum raw F0 value for the word preceding a boundary.
• MAX_F0: The maximum raw F0 value for the word preceding a boundary.
• MEAN_F0: The mean raw F0 value for the word preceding a boundary.
• The following features are the same as the corresponding features without NEXT, except that these are computed for the word after a boundary.
  MIN_F0_NEXT
  MAX_F0_NEXT
  MEAN_F0_NEXT
• The following features are the same as t
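The last-rhyme definition above (the phones from the last vowel to the end of the word) can be sketched as follows. This is a Python illustration only; the tool itself implements this in Praat. The (label, start_frame, end_frame) tuple layout, the function name, and the use of None for a missing value are assumptions of the example; the vowel set is the one listed in the section on vowel and rhyme.

```python
# Vowel phone set as listed in the manual.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AX", "AXR", "AY", "EH",
    "ER", "EY", "IH", "IX", "IY", "OW", "OY", "UH", "UW",
}


def last_rhyme_features(phones):
    """Locate the last rhyme of a word and report its basic features.

    phones -- list of (label, start_frame, end_frame) tuples for one word,
              in time order (hypothetical layout for this sketch)
    Returns None when the word contains no vowel (missing value).
    """
    # Scan backwards for the last vowel; the rhyme runs from there
    # to the end of the word.
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][0] in VOWELS:
            rhyme = phones[i:]
            return {
                "LAST_RHYME_START": rhyme[0][1],
                "LAST_RHYME_END": rhyme[-1][2],  # coincides with WORD_END
                "PHONES_IN_LAST_RHYME": len(rhyme),
            }
    return None
```

Because the rhyme always extends to the final phone, LAST_RHYME_END equals the end of the word by construction, matching the note in the feature list.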
37. side the uttered words for all of the audio corresponding to the speaker, for sequences of unvoiced frames longer than min_frame_length.
• MEAN_PITCH: The average F0 value over the uttered words in all of the audio corresponding to the speaker.
• STDEV_PITCH: The standard deviation of the F0 value over the uttered words in all of the audio corresponding to the speaker.
• COUNT_PITCH: The number of F0 values counted over the uttered words in all of the audio corresponding to the speaker.
• MEAN_SLOPE: The mean pitch slope over the uttered words in all of the audio corresponding to the speaker. It is computed only over the sequences of frames that have the same slope for more than min_frame_length frames.
• STDEV_SLOPE: The standard deviation of the pitch slope over the uttered words in all of the audio corresponding to the speaker. It is computed only over the sequences of frames that have the same slope for more than min_frame_length frames.
• COUNT_SLOPE: The number of pitch slopes counted over the uttered words in all of the audio corresponding to the speaker. It is computed only over the sequences of frames that have the same slope for more than min_frame_length frames.
• MEAN_ENERGY: The average energy value over the uttered words in all of the audio corresponding to the speaker.
• STDEV_ENERGY: The standard deviation of the energy value over the uttered words in all of the audio corresponding to the speaker.
• COUNT_ENERGY: The num
38. stics need to be accumulated across sessions, we compute them before the feature extraction step. The local statistics are session-dependent statistics, which are computed during the feature extraction process. They include the means and variances of the last rhyme duration, the last rhyme phone duration, the normalized last rhyme duration, and the pause duration.

(Footnote 2: We assume that 1 dB is small enough to replace intensity values lower than 1 dB.)

Chapter 2 Using the Tool
In our code design scheme, stats_batch.praat and main_batch.praat are the two scripts that accept configuration inputs, for global statistics computation and prosodic feature extraction, respectively. Due to the inherent functionality of Praat, both can be launched from the command line of *nix, which is most suitable for batch processing, or run in graphic mode through the Praat ScriptEditor. In this chapter, we focus on the usage of this tool and give step-by-step instructions.

2.1 Global Statistics Computation
Several statistics need to be computed before prosodic feature extraction to enable the normalization of related prosodic features. The mean and variance of the phone duration across the whole data set are examples of these statistics. In addition, statistics with respect to each speaker, e.g., the mean and variance of the phone duration of each speaker across sessions and the statistics related to each speaker's pitch and energy i
39. tensity... on the Sound object; then the tool creates a blank PitchTier object and inserts the energy values into the newly created PitchTier, one frame at a time. Note that some intensity values (in dB) may be negative, which are illegal pitch values. To address this, we reset all intensity values lower than 1 to 1 (footnote 2) to prevent negative pitch values. After this transformation, we process energy similarly to pitch, except that stylization is applied directly to the entire tier rather than separately to segments, since there is no VUV counterpart in the energy case. The stylization command is now Stylize... 3.0 Hz, since energy has a smaller dynamic range.

1.5 Statistics
The statistics used in the model include the means and variances of phone length, vowel length, rhyme length, and pause length, and speaker features related to pitch and energy. There are two different types of statistics, global and local, which are differentiated by the scope of data over which the statistics are computed. See Appendix A.3 for more information about the statistics.

The global statistics relate to all sessions and are either speaker dependent or speaker independent (across all speakers). The speaker-specific phone duration statistics and the pitch- and energy-related statistics are computed for each specific speaker across all the sessions of that speaker, and the global phone duration statistics are computed across all sessions and all speakers. Since global stati
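The flooring step described above, in which intensity values below 1 dB are reset to 1 so that they are legal pitch values, amounts to a one-line clamp. The Python sketch below only illustrates that step; the function name and the list-based input are assumptions, and the tool performs the equivalent operation on Praat objects.

```python
def floor_intensity(values_db, floor_db=1.0):
    """Reset every intensity value lower than floor_db to floor_db.

    values_db -- per-frame intensity values in dB (hypothetical input)
    Values at or above the floor are returned unchanged.
    """
    return [max(value, floor_db) for value in values_db]
```

After this clamp, every value is positive, so the sequence can be stored in a pitch-style tier and stylized with the same routines used for F0.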
40. ti party meetings. In IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, PA, March 2005.
[2] P. Boersma and D. Weenink. Praat, a system for doing phonetics by computer. Technical Report 132, University of Amsterdam, Inst. of Phonetic Sc., 1996.
[3] L. Ferrer. Prosodic features extraction. Technical report, SRI, 2002.
[4] Y. Gotoh and S. Renals. Sentence boundary detection in broadcast speech transcript. In Proc. of the Intl. Speech Communication Association (ISCA) Workshop: Automatic Speech Recognition: Challenges for the new Millennium (ASR-2000), 2000.
[5] Z. Huang, L. Chen, and M. Harper. An open source prosodic feature extraction tool. In Proceedings of LREC, 2006.
[6] Z. Huang, L. Chen, and M. Harper. Purdue Prosodic Feature Extraction Toolkit on Praat. Spoken Language Processing Lab, Purdue University, ftp://ftp.ecn.purdue.edu/harper/praat-prosody.tar.gz, March 2006.
[7] Y. Liu, N. V. Chawla, M. P. Harper, E. Shriberg, and A. Stolcke. A study in machine learning from imbalanced data for sentence boundary detection in speech. To appear in Computer Speech and Language, 2005.
[8] Y. Liu, E. Shriberg, A. Stolcke, and M. Harper. Comparing HMM, maximum entropy, and conditional random fields for disfluency detection. In INTERSPEECH, Lisbon, Spain, September 2005.
[9] Y. Liu, A. Stolcke, E. Shriberg, and M. Harper. Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech. In
41. ting on the PitchTier, which makes it a simple matter to access pitch values for each frame.
• VUV (TextGrid): VUV is the voiced and unvoiced region segmentation. It is obtained by first using the command To PointProcess on the Pitch object and then using the command To TextGrid (vuv)... 0.02 0.01 on the newly generated PointProcess object.
• Stylized Pitch (PitchTier): Praat's pitch stylization function, Stylize... 4.0 Semitones, is used to stylize raw F0 values over each voiced region. After stylization, only the slope-changing points of the pitch contour remain in the PitchTier. Interpolated pitch values between each pair of changing points are inserted back to form the stylized pitch contour.
• Pitch Slope (TextGrid): We use an interval TextGrid to store slope values. Each non-empty interval covers successive frames with the same pitch slope, based on stylization, and is labeled with the slope value.

(Footnote: Researchers can choose from a variety of alignment systems, such as Aligner [14], ISIP ASR [13], and SONIC [10].)

1.4 Raw and Stylized Energy and the Energy Slope
Praat has no built-in functionality for energy stylization. To simplify our implementation, we represent the energy values in PitchTier format so that we are able to use Praat's stylization function to stylize energy values; the routines for extracting F0 features are then reused for extracting energy features. The raw energy is obtained by using the command To In
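The relationship between the stylized contour and the slope intervals described above can be illustrated in Python. Once stylization has reduced a voiced region to its slope-changing points, each pair of neighbouring points defines one interval with a constant slope. The function name, the (time, value) input layout, and the slope unit (value change per second) are assumptions of this sketch; the text does not state the unit the tool stores.

```python
def slope_intervals(turning_points):
    """Turn stylization turning points into constant-slope intervals.

    turning_points -- list of (time_s, value) pairs in increasing time order,
                      i.e. the points left in the tier after stylization
    Returns a list of (start_s, end_s, slope) tuples, one per segment.
    """
    intervals = []
    for (t0, v0), (t1, v1) in zip(turning_points, turning_points[1:]):
        if t1 <= t0:
            raise ValueError("turning points must be strictly increasing in time")
        # Within a segment the stylized contour is a straight line,
        # so a single slope value labels the whole interval.
        intervals.append((t0, t1, (v1 - v0) / (t1 - t0)))
    return intervals
```

Each returned tuple corresponds to one non-empty, slope-labeled interval; because energy is stored in the same tier format, the same computation would apply to the stylized energy contour.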
42. ts. This table has a row for each audio file. The first feature is the speaker session id. The other features in the table are:
• MEAN: The mean duration of the pauses in the audio.
• STDEV: The standard deviation of the duration of the pauses in the audio.
• MEAN_LOG: The mean of the base-e log pause duration in the audio.
• STDEV_LOG: The standard deviation of the log pause duration in the audio.
• COUNT_PAUSE: The number of pauses in the audio.

A.4 Derived Features
Derived features are computed from the previously described basic features and statistics. Some derived features are computed from two basic features, such as the log difference or log ratio of two values. Other derived features are basic features normalized using the computed means and standard deviations.

A.4.1 Normalized Word Duration
• $\text{WORD\_DUR} = \text{WORD\_END} - \text{WORD\_START}$, where WORD_END and WORD_START are basic features.
• $\text{WORD\_AV\_DUR} = \sum_{\text{every\_phone\_in\_word}} \text{mean}_{\text{phone}}$, where mean_phone is obtained from the statistical table phone_dur.stats and the phones are from the basic feature WORD_PHONES.
• $\text{NORM\_WORD\_DUR} = \dfrac{\text{WORD\_DUR}}{\text{WORD\_AV\_DUR}}$

A.4.2 Normalized Pause
• $\text{PAU\_DUR\_N} = \dfrac{\text{PAU\_DUR}}{\text{PAUSE\_MEAN}}$, where PAUSE_MEAN comes from the pause_dur.stats column MEAN, from the line corresponding to the current audio.

A.4.3 Normalized Vowel Duration
• $\text{LAST\_VOWEL\_DUR\_Z} = \dfrac{\text{LAST\_VOWEL\_DUR} - \text{ALL\_PHONE\_DUR\_MEAN}}{\text{ALL\_PHONE\_DUR\_STDEV}}$
• LAST_VOWEL_DUR_N = LAST_VOWEL_DUR / ALL_PHONE
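The normalized word duration and normalized pause features of A.4.1 and A.4.2 can be sketched as below. This is a Python illustration rather than the tool's Praat code: the function names and the dictionary standing in for the phone_dur.stats table are hypothetical, and the sketch assumes that word times and mean phone durations are expressed in the same unit.

```python
def word_duration_features(word_start, word_end, word_phones, mean_phone):
    """Compute WORD_DUR, WORD_AV_DUR and NORM_WORD_DUR for one word.

    word_start, word_end -- basic features WORD_START and WORD_END
    word_phones          -- phone labels of the word (from WORD_PHONES)
    mean_phone           -- dict: phone label -> mean duration
                            (stands in for the phone_dur.stats table)
    """
    word_dur = word_end - word_start
    # Expected word duration: sum of the average durations of its phones.
    word_av_dur = sum(mean_phone[p] for p in word_phones)
    return {
        "WORD_DUR": word_dur,
        "WORD_AV_DUR": word_av_dur,
        "NORM_WORD_DUR": word_dur / word_av_dur,
    }


def normalized_pause(pau_dur, pause_mean):
    """PAU_DUR_N: pause duration divided by the mean pause of the audio."""
    return pau_dur / pause_mean
```

A NORM_WORD_DUR of 1.0 means the word lasted exactly as long as the sum of its phones' average durations; larger values indicate lengthening.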
