Text to Matrix Generator User's Guide

1. Delimiter (default: emptyline): The delimiter between documents, as seen by tmg. Possible values are emptyline, none_delimiter (treats each file as a single document), or any other string.
Stoplist: Name of the file containing stopwords, i.e. common words not used in indexing.
Min Length (default: 3): Minimum term length.
Max Length (default: 30): Maximum term length.
Min Local Frequency (default: 1): Minimum local term frequency.
Max Local Frequency (default: inf): Maximum local term frequency.
Min Global Frequency (default: 1): Minimum global term frequency.
Max Global Frequency (default: inf): Maximum global term frequency.
Local Term Weighting (default: TF): Local term weighting function. Possible values: Term Frequency (TF), Binary, Logarithmic, Alternate Log, Augmented Normalized Term Frequency.
Global Term Weighting (default: None): Global term weighting function. Possible values: None, Entropy, Inverse Document Frequency (IDF), GfIdf, Normal, Probabilistic Inverse.
Database Name: The name of the folder under the data directory where data are to be saved (currently supported only for the Create New tdm module).
Store in MySQL: Checked if results are to be saved into MySQL (currently supported only for the Create New tdm module).
Use Normalization: Indicates the normalization method. Possible values: None, Cosine.
Use Stemming: Indicates if stemming is to be applied. The algorithm currently support
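The local and global weighting options listed above combine in the usual vector space fashion: each entry of the term-document matrix is a local weight of the raw count multiplied by a global weight of the term. The following is a minimal Python sketch of that idea, not TMG's MATLAB code. It assumes the common textbook forms log2(1 + f) for the Logarithmic local weight and log2(n / df) for IDF; this section of the guide does not state TMG's exact formulas, and the function names are hypothetical.

```python
import math

def local_weight(f, scheme="tf"):
    """Local weight of a raw term count f inside one document."""
    if scheme == "tf":
        return float(f)
    if scheme == "binary":
        return 1.0 if f > 0 else 0.0
    if scheme == "logarithmic":
        return math.log2(1 + f)  # assumed form, not confirmed by the guide
    raise ValueError("unknown local scheme: " + scheme)

def idf_weights(counts):
    """Assumed IDF form: log2(n_docs / document frequency), one value per term."""
    n_docs = len(counts[0])
    weights = []
    for row in counts:  # one row per term, one column per document
        df = sum(1 for f in row if f > 0)
        weights.append(math.log2(n_docs / df) if df else 0.0)
    return weights

def weighted_tdm(counts, local="tf", use_idf=False):
    """Entry (i, j) = local weight of count (i, j) times global weight of term i."""
    g = idf_weights(counts) if use_idf else [1.0] * len(counts)
    return [[local_weight(f, local) * g[i] for f in row]
            for i, row in enumerate(counts)]

counts = [[2, 0, 1, 0],   # term 0 occurs in 2 of 4 documents
          [1, 1, 1, 1]]   # term 1 occurs in every document
tdm = weighted_tdm(counts, local="logarithmic", use_idf=True)
```

Under these assumed formulas, a term that occurs in every document receives an IDF of zero, so its whole row vanishes from the weighted matrix.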
2. OPTIONS.global_weights: The vector of term global weights (returned by tmg).
OPTIONS.dsp: Displays results (default 1) or not (0).
REFERENCES
[1] M. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
[2] T. Kolda, Limited-Memory Matrix Methods with Applications, Tech. Report CS-TR-3806, 1997.

tmg_save_results
TMG_SAVE_RESULTS is a graphical user interface, used from TMG_GUI, for saving the results to one or multiple .mat files.

tmg_template
TMG_TEMPLATE: demo script. This is a template script demonstrating the use of TMG, as well as the application of the resulting TDMs in two IR tasks, querying and clustering. The querying models used are the Vector Space Model (see vsm.m) and LSI (see lsi.m), while two versions of the k-means algorithm, Euclidean and spherical (see ekmeans.m and skmeans.m), cluster the resulting matrix (see pddp.m). The user can edit this code in order to change the default OPTIONS of TMG, as well as to apply other IR tasks or use their own implementations of these tasks.

two_means_1d
TWO_MEANS_1D returns the clustering that optimizes the objective function of the k-means algorithm for the input vector. [CUTOFF, CLUSTERS, DISTANCE, OF, MEAN1, MEAN2] = TWO_MEANS_1D(A) returns
3. Next view of tmg_gui according to the user selection
The open_file window
The output .mat files of tmg_gui
The MySQL view upon tmg execution
The GUIs' general help tab
Starting window of dr_gui
Next view of dr_gui according to the user selection
The output .mat files of dr_gui
Starting window of nnmf_gui
Next view of nnmf_gui according to the user selection
The output .mat files of nnmf_gui
Starting window of retrieval_gui
Next view of retrieval_gui according to the user selection
The output of retrieval_gui
Starting window of clustering_gui
Next view of clustering_gui according to the user selection
The output .mat files of clustering_gui
The output of clustering_gui for PDDP
Starting window of classification_gui
Next view of classification_gui according to the user selection

List of Tables
Description of use of tmg_gui components
Description of use of dr_gui components
Description of use of nnmf_gui components
Description of use of retrieval_gui components
Description of use of clustering_gui compone
4. Figure 17: The output .mat files of dr_gui.

6. Press the Reset button in order to change the input.

A.3 Non-Negative Factorizations module (nnmf_gui)

Assume we have processed a collection with tmg_gui, constructed a tdm with 1,033 documents and 12,184 terms (corresponding to the well-known MEDLINE collection), and stored the results to TMG_HOME/data/medline. Assume then that we want to construct a non-negative factorization of the TDM using the Multiplicative Update algorithm, initializing by the block NNDSVD technique, for the following input:
- initialization: Block NNDSVD
- refine factors: yes
- method: Multiplicative update
- number of iterations: 10
- compute SVD with: Propack
- clustering algorithm: PDDP
- principal directions: 1
- maximum number of PCs
- variant: basic
- number of clusters: 10
- number of factors: 10
and that we want to store the results to directory medline.

1. Initially, select the operation you want to perform by pressing the corresponding radio button at the upper left frame.
2. The selection of a radio button activates the required fields in the GUI, while deactivating the rest of the fields.
5. Figure 22: Next view of retrieval_gui according to the user selection.

4. Press the Continue button in order to perform the selected operation.
5. Results have been saved to the workspace.
6. Furthermore, in case data have been stored to MySQL, the user gets an html response.

TMG Retrieval Response
Query: the crystalline lens in vertebrates including humans
Document 212, Similarity 0.39685
experiments dealing with the role played by the aqueous humor and retina in lens regeneration of adult newts. 1. these three groups of experiments involve approximately 140 eyes of adult newts (triturus v. viridescens). they were devised to examine what, if any, role the aqueous humor plays during lens regeneration from the dorsal iris. 2. many daily injections of aqueous humor from normal eyes were made in lentectomized eyes for as long as 96 days in some cases. as controls, some lensless eyes were daily injected with holtfreter's solution; in others, aqueous humor was merely withdrawn. 3. procedures for the injection experiments are difficult to control; however, the most successful cases showed varying degrees of inhibition and retardation of lens regeneration. 4. pairs of eyes were united at large adjacent wound openings to provide a common reservoir of aqueous humor bathing both lenses and dorsal irises. in some cases the eyes were p
6. None; e: Entropy; f: Inverse Document Frequency (IDF); g: GfIdf; n: Normal; p: Probabilistic Inverse.
OPTIONS.normalization: Indicates if we normalize the document vectors (default x). Possible values: x: None; c: Cosine.

myperms
MYPERMS computes all possible combinations of the input. V = MYPERMS(P, L) returns all possible combinations of the input vector of integers with L numbers.

nnmf_gui
NNMF_GUI is a graphical user interface for all non-negative dimensionality reduction techniques implemented in the Text to Matrix Generator (TMG) Toolbox.

nnmf_mul_update
NNMF_MUL_UPDATE applies the multiplicative update algorithm of Lee and Seung [1] for non-negative factorizations. [W, H, S] = nnmf_mul_update(A, W, H, NIT, DSP) produces a non-negative factorization of A ≈ W*H, using as initial factors W and H and applying NIT iterations.
REFERENCES
[1] D. Lee, S. Seung, Algorithms for Non-negative Matrix Factorization, NIPS (2000), 556-562.

open_file
OPEN_FILE is a graphical user interface for selecting a file, directory or variable from the workspace. The function returns the name of the selected file, directory or variable.

opt_2means
OPT_2MEANS: a special case of k-means for k = 2. OPT_2MEANS
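The multiplicative update scheme referenced for nnmf_mul_update can be sketched in a few lines. This is a plain-Python illustration of the Lee and Seung updates for the Frobenius-norm objective, not the TMG implementation; the helper names, the small eps guard in the denominators, and the choice to return the per-iteration error are assumptions made for the example.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def fro_error(A, W, H):
    """Frobenius norm of A - W*H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

def mul_update(A, W, H, nit, eps=1e-12):
    """Run nit multiplicative update iterations starting from factors W and H."""
    errors = [fro_error(A, W, H)]
    for _ in range(nit):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        # H <- H .* (W'A) ./ (W'WH)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        # W <- W .* (AH') ./ (WHH')
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
        errors.append(fro_error(A, W, H))
    return W, H, errors

A = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0]]
W0 = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]]
H0 = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
W, H, errors = mul_update(A, W0, H0, nit=10)
```

Because every update multiplies by a ratio of non-negative quantities, factors that start non-negative stay non-negative, which is why the quality of the initial W and H (for example from NNDSVD) matters.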
7. OPTIONS.epsilon: Value for the epsilon convergence criterion (default 1).
OPTIONS.dsp: Displays results (default 1) or not (0) to the command window.
REFERENCES
[1] I. S. Dhillon and D. M. Modha, Concept Decompositions for Large Sparse Text Data using Clustering, Machine Learning, 42:1, pages 143-175, Jan. 2001.

stemmer
STEMMER applies Porter's stemming algorithm [1]. S = STEMMER(TOKEN, DSP) returns in S the stemmed word of TOKEN. DSP indicates if the function displays the result of each stem (1).
REFERENCES
[1] M. F. Porter, An algorithm for suffix stripping, Program, 14(3): 130-137, 1980.

strip_html
STRIP_HTML removes html entities from an html file. S = STRIP_HTML(FILENAME) parses file FILENAME and removes the html entities, while the result is stored in S as a cell array and written in file FILENAME.TXT.

svd_tmg
SVD_TMG: Singular Value Decomposition. [U, S, V] = SVD_TMG(A, K, METHOD) computes the K-factor truncated Singular Value Decomposition of A using either the svds function of MATLAB or the PROPACK package [1].
REFERENCES
[1] R. M. Larsen, PROPACK: A Software Package for the Symmetric Eigenvalue Problem and Singular Value Problems on Lanczos and Lanczos Bidiagonalization with Partial Reorthogonalization, Stanford University, http://sun.stanford.edu/~rmunk/PROPACK.

svd_update
SVD_UPDATE Singu
8. [1]. [X, Y] = BLOCK_NNDSVD(A, CLUSTERS, L, FUNC, ALPHA_VAL, SVD_METHOD) computes a non-negative rank-L approximation X*Y of the input matrix A with the Clustered Latent Semantic Indexing Method [2] and the Non-Negative Double Singular Value Decomposition Method [1], using the cluster structure information from CLUSTERS [3]. FUNC denotes the method used for the selection of the number of factors from each cluster. Possible values for FUNC:
- f: Selection using a heuristic method from [2] (see KS_SELECTION).
- f1: Same as f, but use at least one factor from each cluster.
- equal: Use the same number of factors from each cluster.
ALPHA_VAL is a value in [0, 1] used in the number-of-factors selection heuristic [2]. Finally, SVD_METHOD defines the method used for the computation of the SVD (svds or propack).
REFERENCES
[1] C. Boutsidis and E. Gallopoulos, SVD-based initialization: A head start on nonnegative matrix factorization, Pattern Recognition, Volume 41, Issue 4, Pages 1350-1362, April 2008.
[2] D. Zeimpekis and E. Gallopoulos, CLSI: A Flexible Approximation Scheme from Clustered Term-Document Matrices, In Proc. 5th SIAM International Conference on Data Mining, pages 631-635, Newport Beach, California, 2005.
[3] D. Zeimpekis and E. Gallopoulos, Document Clustering using NMF based on Spectral Information, In Proc. Text Mining Workshop 2008, held in conjunction with the 8th SIAM International Conference on Data Mining, Atlanta
9. 2008.

classification_gui
CLASSIFICATION_GUI is a graphical user interface for all classification functions of the Text to Matrix Generator (TMG) Toolbox.

clsi
CLSI computes a rank-L approximation of the input matrix using the Clustered Latent Semantic Indexing Method [1]. [X, Y] = CLSI(A, CLUSTERS, L, FUNC, ALPHA_VAL, SVD_METHOD) computes the rank-L approximation X*Y of the input matrix A with the Clustered Latent Semantic Indexing Method [1], using the cluster structure information from CLUSTERS. FUNC denotes the method used for the selection of the number of factors from each cluster. Possible values for FUNC:
- f: Selection using a heuristic method from [1] (see KS_SELECTION).
- f1: Same as f, but use at least one factor from each cluster.
- equal: Use the same number of factors from each cluster.
ALPHA_VAL is a value in [0, 1] used in the number-of-factors selection heuristic [1]. Finally, SVD_METHOD defines the method used for the computation of the SVD (svds or propack).
REFERENCES
[1] D. Zeimpekis and E. Gallopoulos, CLSI: A Flexible Approximation Scheme from Clustered Term-Document Matrices, In Proc. 5th SIAM International Conference on Data Mining, pages 631-635, Newport Beach, California, 2005.

clustering_gui
CLUSTERING_GUI is a graphical user interface for all clustering functions of the Text to
10. Number of factors for preprocessed training data.
Similarity Measure (default: Cosine): The similarity measure to be used.
Continue: Apply the selected operation.
Reset: Reset window to default values.
Exit: Exit window.

Table 6: Description of use of classification_gui components.

Acknowledgments

TMG was conceived after a motivating discussion with Andrew Knyazev regarding a collection of MATLAB tools we had put together to aid in our clustering experiments. We thank our colleagues Ioannis Antonellis, Anastasios Zouzias, Efi Kokiopoulou and Constantine Bekas for many helpful suggestions, Jacob Kogan and Charles Nicholas for inviting us to contribute to [18], Elias Houstis for his help in the initial phases of this research, and Michael Berry, Tamara Kolda, Rasmus Munk Larsen, Christos Boutsidis and Haesun Park for letting us use and distribute SPQR, SDDPACK, PROPACK, NNDSVD and ANLS software, respectively. Special thanks are due to many of the users for their constructive comments regarding TMG. This research was supported in part by a University of Patras "Karatheodori" grant. The first author was also supported by a Bodossaki Foundation graduate fellowship.

References

[1] M. Berry, Z. Drmac, and E. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (1998), 335-362.
[2] M. W. Berry, S. A. Pulatova, and G. W. Stewart, Computing sparse reduced-rank approximations to spar
11. no. 4, 325-344.
[2] D. Zeimpekis, E. Gallopoulos, k-means Steering of Spectral Divisive Clustering Algorithms, Proc. of Text Mining Workshop, Minneapolis, 2007.

pddp_extract_centroids
PDDP_EXTRACT_CENTROIDS returns the cluster centroids of a PDDP clustering result.

pddp_optcut
PDDP_OPTCUT: Hybrid Principal Direction Divisive Partitioning Clustering Algorithm and k-means. PDDP_OPTCUT clusters a term-document matrix (tdm) using a combination of the Principal Direction Divisive Partitioning clustering algorithm [1] and k-means [2]. CLUSTERS = PDDP_OPTCUT(A, K) returns a cluster structure with K clusters for the tdm A. [CLUSTERS, TREE_STRUCT] = PDDP_OPTCUT(A, K) returns also the full PDDP tree, while [CLUSTERS, TREE_STRUCT, S] = PDDP_OPTCUT(A, K) returns the objective function of PDDP. PDDP_OPTCUT(A, K, SVD_METHOD) defines the method used for the computation of the PCA (svds, the default, or propack). PDDP_OPTCUT(A, K, SVD_METHOD, DSP) defines if results are to be displayed to the command window (default 1) or not (0). Finally, PDDP_OPTCUT(A, K, SVD_METHOD, DSP, EPSILON) defines the termination criterion value for the k-means algorithm.
REFERENCES
[1] D. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2 (1998), no. 4, 325-344.
[2] D. Zeimpekis, E. Gallopoulos, k-means Steering of Spectral Divisive Clustering Algorithms, Proc. of Text Mining Workshop, Minneapo
12. using the default web browser.

diff_vector
DIFF_VECTOR returns the vector of differences between consecutive elements of the input vector.

dr_gui
DR_GUI is a graphical user interface for all dimensionality reduction functions of the Text to Matrix Generator (TMG) Toolbox.

ekmeans
EKMEANS: Euclidean k-Means Clustering Algorithm. EKMEANS clusters a term-document matrix using the standard k-means clustering algorithm. CLUSTERS = EKMEANS(A, C, K, TERMINATION) returns a cluster structure with K clusters for the term-document matrix A, using as initial centroids the columns of C (initialized randomly when it is empty). TERMINATION defines the termination method used in k-means: epsilon stops the iteration when the decrease of the objective function falls below a user-defined threshold (see the OPTIONS input argument), while n_iter stops the iteration when a user-defined number of iterations has been reached. [CLUSTERS, Q] = EKMEANS(A, C, K, TERMINATION) returns also the vector of objective function values for each iteration, and [CLUSTERS, Q, C] = EKMEANS(A, C, K, TERMINATION) returns the final centroid vectors. EKMEANS(A, C, K, TERMINATION, OPTIONS) defines optional parameters:
OPTIONS.iter: Number of iterations (default 10).
OPTIONS.epsilon: Value for the epsilon convergence criterion (default 1).
OPTIONS.dsp: Displays results (default 1) or not (0) to the c
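The two termination modes described for EKMEANS can be illustrated with a short standard k-means loop. The sketch below is plain Python, not the TMG code: documents are given as a list of vectors rather than as matrix columns, the function name and return layout are assumptions, and the epsilon test uses "less than or equal" so that a zero threshold still terminates.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(docs, centroids, termination="epsilon", epsilon=1.0, n_iter=10):
    """Euclidean k-means; returns labels, objective history, final centroids."""
    centroids = [list(c) for c in centroids]
    history = []
    while True:
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(d, centroids[j]))
                  for d in docs]
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [d for d, l in zip(docs, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        history.append(sum(sq_dist(d, centroids[l]) for d, l in zip(docs, labels)))
        if termination == "n_iter":
            if len(history) >= n_iter:   # fixed number of iterations
                break
        elif len(history) > 1 and history[-2] - history[-1] <= epsilon:
            break                        # objective decrease fell to epsilon or less
    return labels, history, centroids

docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, history, centroids = kmeans(docs, [[0.0, 0.0], [10.0, 10.0]])
fixed = kmeans(docs, [[0.0, 0.0], [10.0, 10.0]], termination="n_iter", n_iter=3)
```

With the epsilon mode the loop needs at least two iterations, since the criterion compares consecutive objective values.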
13. (A, X) returns the clustering that optimizes the objective function of the k-means algorithm based on the ordering of vector X. [CLUSTERS, S] = OPT_2MEANS(A, X) returns the cluster structure as well as the value of the objective function.

pca
PCA: Principal Component Analysis. [U, S, V] = PCA(A, C, K, METHOD) computes the K-factor Principal Component Analysis of A, i.e. the SVD of A - C*ones(size(A, 2), 1)', using either the svds function of MATLAB or the PROPACK package [1].
REFERENCES
[1] R. M. Larsen, PROPACK: A Software Package for the Symmetric Eigenvalue Problem and Singular Value Problems on Lanczos and Lanczos Bidiagonalization with Partial Reorthogonalization, Stanford University, http://sun.stanford.edu/~rmunk/PROPACK.

pca_mat
PCA_MAT: Principal Component Analysis with MATLAB svds. [U, S, V] = PCA_MAT(A, C, K) computes the K-factor Principal Component Analysis of A, i.e. the SVD of A - C*ones(size(A, 2), 1)', using the svds function of MATLAB.

pca_mat_afun
PCA_MAT_AFUN: Auxiliary function used in PCA_MAT.

pca_propack
PCA_PROPACK: Principal Component Analysis with PROPACK. [U, S, V] = PCA_PROPACK(A, C, K) computes the K-factor Principal Component Analysis of A, i.e. the SVD of A - C*ones(size(A, 2), 1)', using the PROPACK package [1].
REFERENCES
[1] R. M. Larsen, PROPACK: A Software Package for the Symmetric Eigenvalue Problem and Singular Value P
14. Delimiter. Alternatively, each file in the input directory contains a single document.
Create New tdm: Checked if a new tdm is to be created (default: checked).
Create Query Matrix: Checked if a new query matrix is to be created (default: checked).
Update tdm: Checked if an existing tdm is to be updated with new documents. Alternatively, checked if an existing tdm is to be updated using different options (change update_struct).
Downdate tdm: Checked if an existing tdm is to be downdated according to the Document Indices field.
Dictionary: Name of the .mat file or workspace variable containing the dictionary to be used by the tmg_query function, if the Create Query Matrix radio button is checked.
Global Weights: Name of the .mat file or workspace variable containing the vector of global weights to be used by the tmg_query function, if the Create Query Matrix radio button is checked.
Update Struct: Name of the .mat file or workspace variable containing the structure to be updated or downdated by the tdm_update or tdm_downdate function, if the Update tdm or Downdate tdm radio button is checked.
Document Indices: Name of the .mat file or workspace variable containing the document indices marked for deletion, when the Downdate tdm radio button is checked.

Field Name, Default, Description
Line Delimiter: Checked if the Delimiter takes a whole line of text
15. Matrix Generator (TMG) Toolbox.

cm
CM computes a rank-L approximation of the input matrix using the Centroids Method [1]. [X, Y] = CM(A, CLUSTERS) computes the rank-K approximation X*Y of the input matrix A with the Centroids Method [1], using the cluster structure information from CLUSTERS.
REFERENCES
[1] H. Park, M. Jeon, and J. Rosen, Lower Dimensional Representation of Text Data Based on Centroids and Least Squares, BIT Numerical Mathematics, 43(2): 427-448, 2003.

col_normalization
COL_NORMALIZATION normalizes the columns of the input matrix.

col_rearrange
COL_REARRANGE reorders a matrix using a clustering result. [A, N_COLS, COL_INDS] = COL_REARRANGE(A, CLUSTERS) reorders the columns of matrix A using the clustering result represented by the structure CLUSTERS. N_COLS stores the last column index for each column block, while COL_INDS contains the permuted column indices.

column_norms
COLUMN_NORMS returns the column norms of a matrix.

compute_fro_norm
COMPUTE_FRO_NORM returns the Frobenius norm of a rank-l matrix A - W*H.

compute_scat
COMPUTE_SCAT computes the cluster selection criterion value of PDDP. SCAT = COMPUTE_SCAT(A, C) returns the square of the Frobenius norm of A - C*ones(1, size(A, 2)).

create_kmeans_response
CREATE_KMEANS_RESP
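Two of the small helpers in this chunk reduce to one-line formulas, shown here as a plain-Python sketch rather than the TMG code. It assumes that col_normalization scales each column to unit Euclidean norm (the guide only says "normalizes") and that compute_scat evaluates the squared Frobenius norm of A minus the centroid C repeated across all columns; matrices are stored as lists of rows.

```python
def column_norms(A):
    """Euclidean norm of every column of A (A is a list of rows)."""
    return [sum(v * v for v in col) ** 0.5 for col in zip(*A)]

def col_normalization(A):
    """Scale each column to unit Euclidean norm (assumed norm); zero columns stay zero."""
    norms = column_norms(A)
    return [[v / n if n else 0.0 for v, n in zip(row, norms)] for row in A]

def compute_scat(A, c):
    """Squared Frobenius norm of A - c * ones(1, n_cols), with c one value per row."""
    return sum((v - ci) ** 2 for row, ci in zip(A, c) for v in row)

A = [[3.0, 0.0],
     [4.0, 2.0]]
centroid = [sum(row) / len(row) for row in A]   # mean of the columns: [1.5, 3.0]
norms = column_norms(A)
unit = col_normalization(A)
scat = compute_scat(A, centroid)
```

When the second argument is the centroid of the columns, this value is the scatter of the cluster around its mean.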
16. PCA
Clustered Latent Semantic Indexing (CLSI): Apply the CLSI method.
Centroid Method (CM): Apply the CM method.
Semidiscrete Decomposition (SDD): Apply the SDD method.
SPQR: Apply the SPQR method.
MATLAB (svds): Check to use the MATLAB function svds for the computation of the SVD or PCA.
Propack: Check to use the PROPACK package for the computation of the SVD or PCA.
Euclidean k-means: Check to use the Euclidean k-means clustering algorithm in the course of CLSI or CM.
Spherical k-means: Check to use the spherical k-means clustering algorithm in the course of CLSI or CM.
PDDP: Check to use the PDDP clustering algorithm in the course of CLSI or CM.
Initialize Centroids (default: At random): Defines the method used for the initialization of the centroid vector in the course of k-means. Possibilities are: initialize at random, or supply a variable or .mat file with the centroids matrix.
Termination Criterion (default: Epsilon, 1): Defines the termination criterion used in the course of k-means. Possibilities are: use an epsilon value (default 1) and stop the iteration when the objective function improvement does not exceed epsilon, or perform a specific number of iterations (default 10).
Principal Directions (default: 1): Number of principal directions used in PDDP.
Maximum num. of PCs: Check if the PDDP max variant is to be applied.
Variant (default: Basic): A set of PDDP variants. Possible values: Basic, Split with k mean
17. Figure 18: Starting window of nnmf_gui.

3. Fill in the required fields by pressing the check buttons, editing the edit boxes, or selecting the appropriate files/variables by pressing the Browse button.

Figure 19: Next view of nnmf_gui according to the user selection.

4. Press the Continue button in order to perform the selected operation.
5. Results have been saved to the workspace. Furthermore, directory nnmf/k_10/mlup has been created under TMG_HOME/data/medline, with each output variable stored to a single .mat file.

Figure 20: The output .mat files of nnmf_gui.

6. Press the Reset button in order to change the input.

A.4 Retr
18. TMG parses a text collection and generates the term-document matrix. A = TMG(FILENAME) returns the term-document matrix that corresponds to the text collection contained in the files of directory (or file) FILENAME. Each document must be separated by a blank line (or another delimiter that is defined by the OPTIONS argument) in each file.
[A, DICTIONARY] = TMG(FILENAME) returns also the dictionary for the collection, while [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS] = TMG(FILENAME) returns the vectors of global weights for the dictionary and the normalization factor for each document, in case such a factor is used. If normalization is not used, TMG returns a vector of all ones.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC] = TMG(FILENAME) returns statistics for each document, i.e. the number of terms for each document.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC, TITLES, FILES] = TMG(FILENAME) returns in FILES the filenames contained in directory (or file) FILENAME, and a cell array (TITLES) that contains a declaratory title for each document, as well as the document's first line.
Finally, [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC, TITLES, FILES, UPDATE_STRUCT] = TMG(FILENAME) returns a structure that keeps the essential information for the collection's update (or downdate).
TMG(FILENAME, OPTIONS) defines optional parameters:
OPTIONS.use_mysql: Indicates if r
19. if the cosine (1) or Euclidean distance (0) similarity measure is to be used. LABELS_AS contains the assigned labels for the columns of Q.

scut_knn
SCUT_KNN implements the Scut thresholding technique from [1] for the k-Nearest Neighbors classifier. THRESHOLD = SCUT_KNN(A, Q, K, LABELS_TR, LABELS_TE, MINF1, NORMALIZE, STEPS) returns the vector of thresholds for the k-Nearest Neighbors classifier for the collection [A Q]. A and Q define the training and test parts of the validation set, with labels LABELS_TR and LABELS_TE respectively. MINF1 defines the minimum F1 value, and NORMALIZE defines if the cosine (1) or Euclidean distance (0) measure of similarity is to be used. Finally, STEPS defines the number of steps used during thresholding. [THRESHOLD, F, THRESHOLDS] = SCUT_KNN(A, Q, K, LABELS_TR, LABELS_TE, MINF1, NORMALIZE, STEPS) returns also the best F1 value, as well as the matrix of thresholds for each step (row i corresponds to step i).
REFERENCES
[1] Y. Yang, A Study of Thresholding Strategies for Text Categorization, In Proc. 24th ACM SIGIR, pages 137-145, New York, NY, USA, 2001, ACM Press.

scut_llsf
SCUT_LLSF implements the Scut thresholding technique from [2] for the Linear Least Squares Fit classifier [3]. THRESHOLD = SCUT_LLSF(A, Q, CLUSTERS, K, LABELS_TR, LABELS_TE, MINF1, L, METHOD, STEPS, SVD_METHOD, CLSI_METHOD) returns the vector of thresholds for the Linear Least Squares Fit cla
20. the cutoff value of the clustering, the cluster structure, the separation distance, the value of the objective function, and the two mean values.

unique_elements
UNIQUE_ELEMENTS detects all distinct elements of a vector. [ELEMENTS, N] = UNIQUE_ELEMENTS(X) returns in ELEMENTS all distinct elements of vector X, and in N the number of times each element appears in X. A value is repeated if it appears in non-consecutive elements. For no repetitive elements, sort the input vector.

unique_words
UNIQUE_WORDS detects all distinct elements of a cell array of chars (used by tmg.m, tmg_query.m, tdm_update.m). [NEW_WORDS, NEW_DOC_IDS] = UNIQUE_WORDS(WORDS, DOC_IDS, N_DOCS) returns in NEW_WORDS all distinct elements of the cell array of chars WORDS. DOC_IDS is the vector of the document identifiers containing the corresponding words, while N_DOCS is the total number of documents contained in the collection. NEW_DOC_IDS contains the inverted index of the collection as a cell array of 2 x N_DOCS arrays.

vsm
VSM applies the Vector Space Model to a document collection. [SC, DOCS_INDS] = VSM(D, Q, NORMALIZE_DOCS) applies the Vector Space Model to the text collection represented by the term-document matrix D for the query defined by the vector Q [1]. NORMALIZE_DOCS defines if the document vectors are to be normalized (1) or not (0). SC contains the sorted similarity coefficients,
21. while DOCS_INDS contains the corresponding document indices.
REFERENCES
[1] M. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
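The ranking that VSM performs can be written compactly. The following plain-Python sketch is an illustration of the vector space model, not the TMG function: the term-document matrix is a list of rows (terms) with one column per document, document indices are 0-based rather than MATLAB's 1-based, and leaving the query unnormalized is an assumption that does not change the ranking.

```python
def vsm(D, q, normalize_docs=True):
    """Score every document column of D against query q, best match first."""
    scored = []
    for idx, doc in enumerate(zip(*D)):          # iterate over document columns
        score = sum(a * b for a, b in zip(doc, q))
        if normalize_docs:
            norm = sum(a * a for a in doc) ** 0.5
            score = score / norm if norm else 0.0
        scored.append((score, idx))
    scored.sort(key=lambda t: -t[0])             # descending similarity, stable
    return [s for s, _ in scored], [i for _, i in scored]

D = [[1.0, 0.0, 1.0],    # term 0
     [0.0, 1.0, 1.0]]    # term 1
q = [1.0, 0.0]           # query containing only term 0
sc, docs_inds = vsm(D, q)
```

Here the document that contains only the query term ranks first, followed by the document that mixes both terms, and the document without the query term ranks last.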
22. 137-145, New York, NY, USA, 2001, ACM Press.

sdd_tmg
SDD_TMG: interface for SDDPACK. [X, D, Y] = SDD_TMG(A, K) computes a rank-K Semidiscrete Decomposition of A using the SDDPACK [1].
REFERENCES
[1] Tamara G. Kolda and Dianne P. O'Leary, Computation and Uses of the Semidiscrete Matrix Decomposition, Computer Science Department Report CS-TR-4012, Institute for Advanced Computer Studies Report UMIACS-TR-99-22, University of Maryland, April 1999.

skmeans
SKMEANS: Spherical k-Means Clustering Algorithm. SKMEANS clusters a term-document matrix using the spherical k-means clustering algorithm [1]. CLUSTERS = SKMEANS(A, C, K, TERMINATION) returns a cluster structure with K clusters for the term-document matrix A, using as initial centroids the columns of C (initialized randomly when it is empty). TERMINATION defines the termination method used in spherical k-means: epsilon stops the iteration when the increase of the objective function falls below a user-defined threshold (see the OPTIONS input argument), while n_iter stops the iteration when a user-defined number of iterations has been reached. [CLUSTERS, Q] = SKMEANS(A, C, K, TERMINATION) returns also the vector of objective function values for each iteration, and [CLUSTERS, Q, C] = SKMEANS(A, C, K, TERMINATION) returns the final centroid vectors. SKMEANS(A, C, K, TERMINATION, OPTIONS) defines optional parameters:
OPTIONS.iter: Number of iterations (default 10).
23. Figure 25: Next view of clustering_gui according to the user selection.

4. Press the Continue button in order to perform the selected operation.
5. Results have been saved to the workspace. Furthermore, directory kmeans/k_10 has been created under TMG_HOME/data/medline, with each output variable stored to a single .mat file.

Figure 26: The output .mat files of clustering_gui.

6. The user gets an html response that summarizes the clustering result.

TMG PDDP Response
Node 1, Node 2, Node 3
Cluster 1 (306 documents): Document 18, Document 23, Document 27, Document 31, Document 33, Document 34, Document 35, Document 36, Document 49, Document 53, Document 57, Document 75, Document 76, Document 80, Document 84
24. Cluster 1 (documents taken from C:\Program Files\MATLAB\R2007a\work\TMG_working\sample_documents\retrieval\med_text):
18: bilateral popliteal cysts in a patient with rheumatoid arthritis
23: renal amyloidosis, a clinicopathological study
27: amyloid goitre, a case report
31: a case of interventricular septal defect with dextrocardia and situs
33: the localizing significance of limited simultaneous visual form
34: visual anosognosia in cortical blindness, anton's symptom
35: the development of social attachments in infancy
36: separation anxiety as a cause of early emotional problems in children
49: the effect of selenium on the upper respiratory passages
25. 53: hypothermia, physiologic effects and clinical application
57: hyperglycemic coronary perfusion, effect of hypothermia on myocardial
75: postural changes in blood distribution and its relation to the change in
76: comparative studies of the glycogen content of heart, liver and brain
80: rate of change of carbon dioxide tension in arterial blood, jugular
84: heparin levels during and after hypothermic perfusion

Figure 27: The output of clustering_gui for PDDP.

7. Press the Reset button in order to change the input.

A.6 Classification module (classification_gui)

Suppose we have processed a collection with tmg_gui, constructed a tdm with 6,495 documents and 21,764 terms (a single-label dataset corresponding to the well-known ModApte split of the Reuters-21578 collection), and stored the results to TMG_HOME/data/reuters. Assume then that we want to classify the test part of the ModApte split using the k Nearest N
26. MySQL login and password, as well as the root directory of the MySQL Java Connector. The installation script creates all necessary information (including the MySQL database TMG) and adds to the MATLAB path all necessary directories.
- Run gui. Alternatively, use the command line interface: type "help tmg".

TMG requires the MySQL, ANLS, NNDSVD, PROPACK, SDDPACK and SPQR third-party software packages. The PROPACK, SDDPACK and SPQR packages are included in TMG, while the user has to download MySQL. However, we note that MySQL-related software is necessary only if the user intends to use the database support implemented in TMG. Ordinary TMG will run without any problem on a MATLAB 7.0 environment without any other special software.

TMG_ROOT: Indexing module core functions
  classification: Classification module core functions
  clustering: Clustering module core functions
  data: Output data (dataset 1, ..., dataset n)
  documentation: Documentation directory
  dr: Dimensionality Reduction module core functions
  perl: Perl support
  results: Html files resulting from clustering and retrieval modules
  retrieval: Retrieval module core functions
  sample_documents: Sample text collections
  var: Auxiliary files and third-party software (PROPACK, SDDPACK, SPQR)

Figure 2: Structure of TMG root directory.

http://www.mysql.com
http://dev.mysql.com/downloads/connector/j/5.0.html
http://compbio.med.harvard.e
27. ONSE returns an html response for k-means.
CREATE_KMEANS_RESPONSE(CLUSTERS, TITLES) creates a summary html file containing information for the result of the k-means algorithm, defined by CLUSTERS, when applied to the dataset with document titles defined in the TITLES cell array. CREATE_KMEANS_RESPONSE(CLUSTERS, TITLES, VARIANT) additionally defines the k-means variant (possible values: k-means and skmeans). The result is stored in the results directory and displayed using the default web browser.

create_pddp_response
CREATE_PDDP_RESPONSE returns an html response for PDDP.
CREATE_PDDP_RESPONSE(TREE_STRUCT, CLUSTERS, L, TITLES) creates a summary html file containing information for the result of the PDDP algorithm, defined by TREE_STRUCT and CLUSTERS, when applied to the dataset with document titles defined in the TITLES cell array. L defines the maximum number of principal directions used by PDDP. The result is stored in the results directory and displayed using the default web browser.

create_retrieval_response
CREATE_RETRIEVAL_RESPONSE returns an html response for a query.
CREATE_RETRIEVAL_RESPONSE(DATASET, IDS, SIMILARITY, QUERY) creates an html file containing information for the text of the documents of DATASET (stored in MySQL) defined by IDS and having SIMILARITY similarity coefficients against QUERY. The result is stored in the results directory and displayed
28. SI_METHOD defines the method used for the determination of the number of factors from each class used in Clustered Latent Semantic Indexing, in case METHOD equals 'clsi'.

llsf_single
LLSF_SINGLE: Linear Least Squares Fit for single-label collections [2].
LABELS_AS = LLSF_SINGLE(A, Q, CLUSTERS, LABELS, L, METHOD, SVD_METHOD, CLSI_METHOD) classifies the columns of Q with the Linear Least Squares Fit classifier [2], using the pre-classified columns of matrix A with labels LABELS (cell array of vectors of integers). CLUSTERS is a structure defining the classes. METHOD is the method used for the approximation of the rank-l truncated SVD, with possible values:
  'clsi': Clustered Latent Semantic Indexing [3]
  'cm': Centroids Method [1]
  'svd': Singular Value Decomposition
SVD_METHOD defines the method used for the computation of the SVD, while CLSI_METHOD defines the method used for the determination of the number of factors from each class used in Clustered Latent Semantic Indexing, in case METHOD equals 'clsi'.

lsa
LSA applies the Latent Semantic Analysis model to a document collection.
[SC, DOCS_INDS] = LSA(D, P, Q, NORMALIZE_DOCS) applies LSA to the text collection represented by the latent semantic factors D, P of the collection's term-document matrix, for the query defined by the vector Q [1]. NORMALIZE_DOCS defines if the document vectors are to be normalized (1) or n
29. [Contents, continued (Appendix B, Function Reference). Legible entries: block_nndsvd, classification_gui, clustering_gui, col_normalization, create_kmeans_response, create_pddp_response, create_retrieval_response.]
30. Text to Matrix Generator User's Guide
Dimitrios Zeimpekis, Efstratios Gallopoulos
Department of Computer Engineering and Informatics, University of Patras, Greece
December 2008

Work conducted in the context of and supported in part by a University of Patras KARATHEODORI grant.
D. Zeimpekis: e-mail zeimpekis@gmail.com, supported in part by a Bodossaki Foundation graduate fellowship.
E. Gallopoulos: e-mail stratis@ceid.upatras.gr.

Contents
1 Introduction
2 Installation Instructions
3 Graphical User Interfaces
  3.1 Indexing module (tmg_gui)
  3.2 Dimensionality Reduction module (dr_gui)
  3.3 Non-Negative Factorizations module (nnmf_gui)
  3.4 Retrieval module (retrieval_gui)
  3.5 Clustering module (clustering_gui)
  3.6 Classification module (classification_gui)
A Appendix: Demonstration of Use
  A.1 Indexing module (tmg_gui)
  A.2 Dimensionality Reduction module (dr_gui)
  A.3 Non-Negative Factorizations module (nnmf_gui)
  A.4 Retrieval module (retrieval_gui)
  A.5 Clustering module (clustering_gui)
  A.6 Classification module (classification_gui)
B Appendix: Function Reference
  about_tmg_gui
  bisecting_nndsvd
  block_diagonalize
31. _METHOD, DSP) defines if results are to be displayed to the command window (default, 1) or not (0).
REFERENCES:
[1] D. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2 (1998), no. 4, 325-344.
[2] D. Zeimpekis, E. Gallopoulos, PDDP(l): Towards a Flexible Principal Direction Divisive Partitioning Clustering Algorithm, Proc. IEEE ICDM'03 Workshop on Clustering Large Data Sets (Melbourne, Florida), 2003.

pddp_2means
PDDP_2MEANS: Hybrid Principal Direction Divisive Partitioning Clustering Algorithm and k-means.
PDDP_2MEANS clusters a term-document matrix (tdm) using a combination of the Principal Direction Divisive Partitioning clustering algorithm [1] and k-means [2].
CLUSTERS = PDDP_2MEANS(A, K) returns a cluster structure with K clusters for the tdm A.
[CLUSTERS, TREE_STRUCT] = PDDP_2MEANS(A, K) also returns the full PDDP tree, while [CLUSTERS, TREE_STRUCT, S] = PDDP_2MEANS(A, K) also returns the objective function of PDDP.
PDDP_2MEANS(A, K, SVD_METHOD) defines the method used for the computation of the PCA (svds, the default, or propack).
PDDP_2MEANS(A, K, SVD_METHOD, DSP) defines if results are to be displayed to the command window (default, 1) or not (0).
Finally, PDDP_2MEANS(A, K, SVD_METHOD, DSP, EPSILON) defines the termination criterion value for the k-means algorithm.
REFERENCES:
[1] D. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2 (1998
32. [Contents, continued (Appendix B, Function Reference). Legible entries: merge_dictionary, pddp_2means, pddp_extract_centroids, pddp_optcut_2means, pddp_optcutpd, ps_pdf2ascii, retrieval_gui, rocchio_multi, rocchio_single.]
33. alization techniques:
• Non-Negative Double Singular Value Decomposition (NNDSVD) [4]
• Block NNDSVD [20]
• Bisecting NNDSVD [20]
• By clustering [13]
Resulting factors can be further refined by means of two NNMF algorithms:
• Multiplicative Update algorithm by Lee and Seung [9]
• Alternating Non-negativity-constrained Least Squares (NMF/ANLS) [7]

Field Name: Description
  Select Dataset: Select the dataset.
  Random: Initialize at random.
  Nonnegative Double SVD (NNDSVD): Initialize by NNDSVD.
  Block NNDSVD: Initialize by block NNDSVD.
  Bisecting NNDSVD: Initialize by bisecting NNDSVD.
  By Clustering: Initialize by clustering.
  Refine factors: Check to run the refinement algorithm.
  Method: Refinement method (default: Multiplicative Update).
  Number of iterations: Refine by the Multiplicative Update algorithm.
  Display Results: Display results of the refinement method.
  Euclidean k-means: Check to use the euclidean k-means clustering algorithm in the course of CLSI or CM.
  Spherical k-means: Check to use the spherical k-means clustering algorithm in the course of CLSI or CM.
  PDDP: Check to use the PDDP clustering algorithm in the course of CLSI or CM.
  Initialize Centroids (default: At random): Defines the method used for the initialization of the centroid vector in the course of k-means. Possibilities are: initialize at random, and supply a variable or mat file with the ce
34. button.
[Figure 10: Next view of tmg_gui according to the user selection.]
4. The user can select a file or a variable by pressing the corresponding Browse button.
[Figure 11: The open file window.]
5. Press the Continue button in order to perform the selected operation.
6. Results have been saved to the workspace. Furthermore, directory sample1 has been created under TMG_HOME/data, with each output variable stored to a single mat file (dictionary, files, global_weights, normalization_factors, titles, update_struct, words_per_doc).
[Figure 12: The output mat files of tmg_gui.]
7. Results have also been saved in MySQL (used for further processing, e.g. by retrieval_gui).
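As a rough illustration of what the weighting options chosen for the indexing run in this appendix do (logarithmic local weighting, IDF global weighting, cosine normalization), the following Python sketch applies them to a tiny term-document count matrix. It is not TMG code: the function name weight_tdm and the formulas log2(1 + f) and log2(n / df) are assumptions based on common textbook definitions, and TMG's own formulas may differ. The returned idf and norms lists only play a role loosely analogous to the global_weights and normalization_factors outputs.

```python
import math


def weight_tdm(counts):
    """Weight a small term-document matrix given as a list of rows.

    counts[i][j] is the raw frequency of term i in document j.
    Hypothetical helper for illustration only (not part of TMG).
    """
    n_terms = len(counts)
    n_docs = len(counts[0])

    # Global weight (assumed IDF form): log2(n_docs / document frequency).
    idf = []
    for row in counts:
        df = sum(1 for f in row if f > 0)
        idf.append(math.log2(n_docs / df) if df else 0.0)

    # Local weight (assumed logarithmic form): log2(1 + f), times the global weight.
    weighted = [
        [math.log2(1 + f) * idf[i] for f in row]
        for i, row in enumerate(counts)
    ]

    # Cosine normalization: scale each document column to unit Euclidean length.
    norms = []
    for j in range(n_docs):
        norm = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
        norms.append(norm)
        if norm > 0:
            for i in range(n_terms):
                weighted[i][j] /= norm
    return weighted, idf, norms


if __name__ == "__main__":
    counts = [[1, 0, 2], [1, 1, 1], [0, 3, 0]]
    weighted, idf, norms = weight_tdm(counts)
    for row in weighted:
        print(["%.3f" % x for x in row])
```

Note that a term occurring in every document gets a global weight of zero under this IDF form, so it contributes nothing to any document vector.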
35. chio: Check if the Rocchio classifier is to be applied.
  Weight of Positive Examples: The weight of the positive examples in the formation of the centroid vectors in Rocchio.
  Weight of Negative Examples: The weight of the negative examples in the formation of the centroid vectors in Rocchio.
  Linear Least Squares Fit (LLSF): Check if the LLSF classifier is to be applied.
  Number of Factors: Number of factors used in the course of LLSF.
  Multi-Label: Check if the classifier is to be applied for a multi-label collection.
  Single-Label: Check if the classifier is to be applied for a single-label collection.
  Use Thresholds: If the Multi-Label radio button is checked, use a stored vector of thresholds.
  Compute Thresholds: If the Multi-Label radio button is checked, compute thresholds.
  Thresholds: If the Multi-Label and Use Thresholds radio buttons are checked, supply a stored vector of thresholds.
  Min F-value: If the Multi-Label and Compute Thresholds radio buttons are checked, supply the minimum F1 value used in the thresholding algorithm.
  Vector Space Model: Use the basic Vector Space Model.
  Preprocessing by: Use preprocessed training data with:
    • Singular Value Decomposition
    • Principal Component Analysis
    • Clustered Latent Semantic Analysis
    • Centroid Method
    • Semidiscrete Decomposition
    • SPQR
  Number of Factors
36. du hkim nmf index html
http://www.cs.rpi.edu/~boutsc/paper1.html
http://soi.stanford.edu/~rmunk/PROPACK/index.html
http://www.cs.umd.edu/~oleary/SDDPACK/README.html
http://portal.acm.org/citation.cfm?id=1067972

3 Graphical User Interfaces

3.1 Indexing module (tmg_gui)

[Figure 3: The tmg_gui GUI.]

TMG can be used for the construction of new, and the update of existing, term-document matrices (tdms) from text collections, in the form of MATLAB sparse arrays. To this end, TMG implements various steps such as:
• Removal of stopwords.
• Stemming (currently the Porter stemming algorithm [11]).
• Removal of short/long terms.
• Removal of frequent/infrequent terms (locally or globally).
• Term weighting and normalization.
• Html filtering, processing of Postscript and PDF.
• Store in MySQL (optionally).
The resulting tdms can be stored as mat files, while text can also be stored in MySQL for further processing. TMG can also update existing tdms by efficient incremental updating or downdating operations. Finally, TMG can construct query vectors using the existing dictionary, which can then be used by the retrieval and classification modules.

The indexing GUI module is depicted in Figure 3, while Table 1 describes in detail all the tmg_gui fields.

Field Name: Description
  Input File/Directory: Files to be parsed, with resulting documents separated by
37. [Contents, continued (Appendix B, Function Reference). Legible entries: stemmer, svd_update, svd_update_afun, tdm_downdate, tdm_update, tmg, tmg_gui, tmg_query, tmg_save_results, tmg_template, two_means_1d.]

List of Figures
  1 Structure and dependencies of GUI modules of TMG
  2 Structure of TMG root directory
  3 The tmg_gui GUI
  4 The dr_gui GUI
  5 The nnmf_gui GUI
  6 The retrieval_gui GUI
  7 The clustering_gui GUI
  8 The classification_gui GUI
  9 Starting window of tmg_gui
38. e input file does not exist or has a wrong format, -1 if gs is not installed or the path isn't set, 0 if ps2ascii didn't work properly, and 1 if the conversion was successful.

retrieval_gui
RETRIEVAL_GUI is a graphical user interface for all retrieval functions of the Text to Matrix Generator (TMG) Toolbox.

rocchio_multi
ROCCHIO_MULTI: Rocchio classifier for multi-label collections.
LABELS_AS = ROCCHIO_MULTI(A, CLUSTERS, BETA, GAMMA, Q, LABELS, NORMALIZED_DOCS, THRESHOLDS) classifies the columns of Q with the Rocchio classifier, using the pre-classified columns of matrix A with labels LABELS (vector of integers). THRESHOLDS is a vector of class threshold values. BETA and GAMMA define the weight of positive and negative examples in the formation of each class centroid. NORMALIZED_DOCS defines if the cosine (1) or euclidean distance (0) similarity measure is to be used. LABELS_AS contains the assigned labels for the columns of Q.

rocchio_single
ROCCHIO_SINGLE: Rocchio classifier for single-label collections.
LABELS_AS = ROCCHIO_SINGLE(A, CLUSTERS, BETA, GAMMA, Q, LABELS, NORMALIZED_DOCS) classifies the columns of Q with the Rocchio classifier, using the pre-classified columns of matrix A with labels LABELS (vector of integers). BETA and GAMMA define the weight of positive and negative examples in the formation of each class centroid. NORMALIZED_DOCS defines
39. eboulle, eds., Springer, Berlin, 2006, pp. 187-210.
[19] D. Zeimpekis and E. Gallopoulos, k-means steering of spectral divisive clustering algorithms, in Proc. of Text Mining Workshop, Minneapolis, 2007.
[20] D. Zeimpekis and E. Gallopoulos, Document clustering using NMF based on spectral information, in Proc. of Text Mining Workshop, Atlanta, 2008.

A Appendix: Demonstration of Use

A.1 Indexing module (tmg_gui)

Assume we want to run tmg.m for the following input:
• filename: sample_documents/sample1
• delimiter: emptyline
• line_delimiter: yes
• stoplist: common_words
• minimum length: 3
• maximum length: 30
• minimum local frequency: 1
• maximum local frequency: inf
• minimum global frequency: 1
• maximum global frequency: inf
• local term weighting: logarithmic
• global term weighting: IDF
• normalization: cosine
• stemming: -
and store results to directory sample1 and MySQL.

1. Initially, select the operation you want to perform by pressing the corresponding radio button at the upper frame.
2. The selection of a radio button activates the required fields in the GUI, while deactivating the remaining fields.
[Figure 9: Starting window of tmg_gui.]
3. Fill in the required fields by pressing the check buttons, editing the edit boxes or selecting the appropriate files/variables by pressing the Browse
40. ed is due to Porter.
  Display Results: Display results or not to the command window.
  Continue: Apply the selected operation.
  Reset: Reset window to default values.
  Exit: Exit window.
Table 1: Description of use of tmg_gui components.

3.2 Dimensionality Reduction module (dr_gui)

[Figure 4: The dr_gui GUI.]

This module deploys a variety of techniques designed to handle high-dimensional data efficiently. Dimensionality Reduction (DR) is a common and widely used technique. Its goal is twofold: (a) a more economical representation of the data, and (b) a better semantic representation. TMG implements six DR techniques:
• Singular Value Decomposition (SVD).
• Principal Component Analysis (PCA).
• Clustered Latent Semantic Indexing (CLSI) [16, 17].
• Centroids Method (CM) [10].
• Semidiscrete Decomposition (SDD) [8].
• SPQR Decomposition [2].
DR data can be stored as mat files and used for further processing.

The dimensionality reduction GUI module is depicted in Figure 4, while Table 2 describes in detail all the dr_gui fields.

Field Name: Description
  Select Dataset: Select the dataset.
  Singular Value Decomposition (SVD): Apply the SVD method.
  Principal Component Analysis: Apply the PCA method.
41. eighbors classifier for the following input:
• Multiple docs file: yes
• filename: sample_document reuters test
• delimiter: </reuters>
• line delimiter: yes
• use stored global weights: yes
• stoplist: common_words
• local term weighting: Term Frequency
• classification method: k Nearest Neighbors (KNN)
• num of NNs: 10
• collection type: Single-Label
• preprocessed by: Clustered Latent Semantic Indexing
• number of factors: 100
• similarity measure: Cosine

1. Initially, select the classification algorithm you want to apply by pressing the corresponding radio button at the left frame.
2. The selection of a radio button activates the required fields in the GUI, while deactivating the remaining fields.
[Figure 28: Starting window of classification_gui.]
3. Fill in the required fields by pressing the check buttons, editing the edit boxes or selecting the appropriate files/variables by pressing the Browse button.
[Figure 29: Next view of classification_gui according to the user selection.]
4. Press the Continue button in order to perform the selected operation.
5. Results have been saved to the workspace.
6. Press the Reset button in order to change the input.

B Appendix: Function Reference

about_tmg_gui
ABOUT_TMG_GUI displays information for TMG.

bisectin
42. er of the tdm A. N_COLS is a vector containing the last column index for each column block, while ALPHA_VAL is a value in [0, 1].

ks_selection1
KS_SELECTION1 implements the heuristic method from [2] for the selection of the number of factors from each cluster used in the Clustered Latent Semantic Indexing method [1]. The number of factors from each cluster is at least 1.
N_ST = KS_SELECTION1(A, N_COLS, ALPHA_VAL, L) returns in N_ST a vector of integers denoting the number of factors (sum equals L) selected from each cluster of the tdm A. N_COLS is a vector containing the last column index for each column block, while ALPHA_VAL is a value in [0, 1].

llsf_multi
LLSF_MULTI: Linear Least Squares Fit for multi-label collections [2].
LABELS_AS = LLSF_MULTI(A, Q, CLUSTERS, LABELS, L, METHOD, THRESHOLDS, SVD_METHOD, CLSI_METHOD) classifies the columns of Q with the Linear Least Squares Fit classifier [2], using the pre-classified columns of matrix A with labels LABELS (cell array of vectors of integers). THRESHOLDS is a vector of class threshold values, while CLUSTERS is a structure defining the classes. METHOD is the method used for the approximation of the rank-l truncated SVD, with possible values:
  'clsi': Clustered Latent Semantic Indexing [3]
  'cm': Centroids Method [1]
  'svd': Singular Value Decomposition
SVD_METHOD defines the method used for the computation of the SVD, while CL
43. esults are to be stored in MySQL.
  OPTIONS.db_name: The name of the directory where the results are to be saved.
  OPTIONS.delimiter: The delimiter between documents within the same file. Possible values are 'emptyline' (default), 'none_delimiter' (treats each file as a single document) or any other string.
  OPTIONS.line_delimiter: Defines if the delimiter takes a whole line of text (default, 1) or not.
  OPTIONS.stoplist: The filename for the stoplist, i.e. a list of common words that we don't use for the indexing (default: no stoplist used).
  OPTIONS.stemming: Indicates if the stemming algorithm is used (1) or not (0, default).
  OPTIONS.update_step: The step used for the incremental build of the inverted index (default 10,000).
  OPTIONS.min_length: The minimum length for a term (default 3).
  OPTIONS.max_length: The maximum length for a term (default 30).
  OPTIONS.min_local_freq: The minimum local frequency for a term (default 1).
  OPTIONS.max_local_freq: The maximum local frequency for a term (default inf).
  OPTIONS.min_global_freq: The minimum global frequency for a term (default 1).
  OPTIONS.max_global_freq: The maximum global frequency for a term (default inf).
  OPTIONS.local_weight: The local term weighting function (default 't'). Possible values (see [1, 2]):
    't': Term Frequency
    'b': Binary
    'l': Logarithmic
    'a': Alternate Log
    'n': Augmented Normalized Term Frequency
  OPTIONS.g
44. g algorithm: PDDP
• principal directions: 1
• maximum number of PCs: -
• variant: basic
• automatic determination of num of factors from each cluster: yes
• number of clusters: 10
• number of factors: 100
and you want to store results to directory medline.

1. Initially, select the operation you want to perform by pressing the corresponding radio button at the upper left frame.
2. The selection of a radio button activates the required fields in the GUI, while deactivating the remaining fields.
[Figure 15: Starting window of dr_gui.]
3. Fill in the required fields by pressing the check buttons, editing the edit boxes or selecting the appropriate files/variables by pressing the Browse button.
[Figure 16: Next view of dr_gui according to the user selection.]
4. Press the Continue button in order to perform the selected operation.
5. Results have been saved to the workspace. Furthermore, directory clsi/k_100 has been created under TMG_HOME/data/medline, with each output variable stored to a single mat file.
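The demonstration above asks for 100 factors spread over 10 clusters, with the number of factors per cluster determined automatically. The function reference only states that the per-cluster counts are at least 1 and sum to the requested total. The Python sketch below shows one simple way to satisfy those two constraints; it is not the heuristic used by TMG. The function name allocate_factors and the size-proportional rule (with largest-remainder rounding) are assumptions made purely for illustration.

```python
def allocate_factors(cluster_sizes, total_factors):
    """Split total_factors among clusters: at least 1 each, sum == total_factors.

    Hypothetical size-proportional rule, for illustration only.
    """
    k = len(cluster_sizes)
    if total_factors < k:
        raise ValueError("need at least one factor per cluster")

    # Every cluster gets one factor up front.
    allocation = [1] * k
    remaining = total_factors - k
    total_size = sum(cluster_sizes)

    # Distribute the rest proportionally to cluster size, rounding down.
    shares = [remaining * size / total_size for size in cluster_sizes]
    floors = [int(share) for share in shares]
    allocation = [a + f for a, f in zip(allocation, floors)]

    # Hand out what rounding left over, largest fractional part first.
    leftover = remaining - sum(floors)
    order = sorted(range(k), key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        allocation[i] += 1
    return allocation


if __name__ == "__main__":
    print(allocate_factors([50, 30, 20], 10))
```

A practical rule would also need to cap each count at the number of documents in the cluster; that check is left out here to keep the sketch short.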
45. g_nndsvd
BISECTING_NNDSVD: a bisecting form of the Non-Negative Double Singular Value Decomposition method [2].
BISECTING_NNDSVD applies a bisecting form of the Non-Negative Double Singular Value Decomposition method [2], using a PDDP-like [1] clustering method.
[W, H] = BISECTING_NNDSVD(A, k, svd_method) returns a non-negative rank-k approximation of the input matrix A, using svd_method for the SVD (possible values: svds, propack).
REFERENCES:
[1] D. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2 (1998), no. 4, 325-344.
[2] C. Boutsidis and E. Gallopoulos, SVD-based initialization: A head start on nonnegative matrix factorization, Pattern Recognition, Volume 41, Issue 4, Pages 1350-1362, April 2008.

block_diagonalize
BLOCK_DIAGONALIZE reorders a matrix heuristically using a clustering result.
[A, N_ROWS, N_COLS, ROW_INDS, COL_INDS] = BLOCK_DIAGONALIZE(A, CLUSTERS) reorders matrix A using the clustering result represented by the structure CLUSTERS. N_ROWS and N_COLS store the last row and column index for each row and column block respectively, while ROW_INDS and COL_INDS contain the permuted row and column indices.

block_nndsvd
BLOCK_NNDSVD computes a non-negative rank-L approximation of the input matrix using the Clustered Latent Semantic Indexing method [2] and the Non-Negative Double Singular Value Decomposition Method
46. [Figure 13: The MySQL view upon tmg execution.]
8. Press the Reset button in order to change the input.
9. For further documentation, type help tmg_gui at the MATLAB command window or select the Documentation tab from the Help menu.
[Figure 14: The GUI's general help tab.]
10. In order to update a tdm, give the input file/directory and the update_struct corresponding to the initial collection. In case you just want to alter some options, give a blank input file/directory and change the corresponding fields of update_struct.
11. In order to downdate a tdm, give the update_struct corresponding to the initial collection and the vector of indices of the documents you want to remove.
12. In order to construct a term-query matrix, give the dictionary char array of the initial collection and the corresponding vector of global weights (optional).

A.2 Dimensionality Reduction module (dr_gui)

Suppose we have processed a collection with tmg_gui, constructed a tdm with 1,033 documents and 12,184 terms (corresponding to the well-known MEDLINE collection), and stored the results to TMG_HOME/data/medline. Assume then that we want to construct a low-rank approximation of the tdm using the Clustered Latent Semantic Indexing (CLSI) technique for the following input:
• compute SVD with: Propack
• clusterin
47. hms can be combined with the CLSI, CM and SVD DR techniques. The classification GUI module is depicted in Figure 8, while Table 6 describes in detail all the classification_gui fields.

Field Name: Description
  Training Dataset: The training dataset.
  Training Labels: The labels of the training dataset.
  Use Stored Labels: Check to use the stored vector of labels of the training documents in the container folder.
  Insert query: The test document(s).
  Single doc (string): Check if a single test document is to be inserted.
  Multiple docs (file): Check if multiple test documents are to be inserted.
  Filename: If Multiple docs (file) is checked, insert the filename containing the test documents.
  Delimiter: If Multiple docs (file) is checked, insert the delimiter to be used for the test documents.
  Line Delimiter: If Multiple docs (file) is checked, check if the delimiter of the test documents file takes a whole line of text.
  Alternative Global Weights: Global weights vector used for the construction of the test document vectors.
  Use Stored Global Weights: Use the global weights vector found in the container directory of the training dataset.
  Stoplist: Use a stoplist.
  Local Term Weighting (default: TF): The local term weighting to be used.
  k Nearest Neighbors (KNN): Check if the KNN classifier is to be applied.
  Num of NNs: Number of nearest neighbors in the KNN classifier.
  Roc
48. ieval module (retrieval_gui)

Suppose we have processed a collection with tmg_gui, constructed a tdm with 1,033 documents and 12,184 terms (corresponding to the well-known MEDLINE collection), and stored the results to TMG_HOME/data/medline. Assume then that we want to retrieve the documents relevant to a specific query for the following input:
• insert query: 'the crystalline lens in vertebrates including humans'
• use stored global weights: yes
• stoplist: common_words
• local term weighting: Term Frequency
• latent semantic analysis by: Clustered Latent Semantic Indexing
• number of factors: 100
• similarity measure: Cosine
• number of most relevant documents: 5

1. Initially, select the retrieval method you want to apply by pressing the corresponding radio button.
2. The selection of a radio button activates the required fields in the GUI, while deactivating the remaining fields.
[Figure 21: Starting window of retrieval_gui.]
3. Fill in the required fields by pressing the check buttons, editing the edit boxes or selecting the appropriate files/variables by pressing the Browse button.
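The retrieval settings above (cosine similarity measure, a fixed number of most relevant documents) amount to ranking document vectors by their cosine with the query vector and keeping the top few. The Python sketch below illustrates only that ranking step on plain lists; it is not TMG code and skips the Clustered Latent Semantic Indexing projection used in the demonstration. The function name cosine_rank and its arguments are hypothetical.

```python
import math


def cosine_rank(doc_vectors, query, top_n=5):
    """Return the top_n (document index, cosine similarity) pairs.

    doc_vectors: list of document vectors (one list of floats per document).
    query: query vector over the same terms.
    Hypothetical helper for illustration only.
    """
    def norm(vector):
        return math.sqrt(sum(x * x for x in vector))

    query_norm = norm(query)
    scores = []
    for index, doc in enumerate(doc_vectors):
        doc_norm = norm(doc)
        if query_norm == 0 or doc_norm == 0:
            similarity = 0.0  # an empty vector matches nothing
        else:
            dot = sum(d * q for d, q in zip(doc, query))
            similarity = dot / (doc_norm * query_norm)
        scores.append((similarity, index))

    # Highest similarity first; ties broken by document index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, similarity) for similarity, index in scores[:top_n]]


if __name__ == "__main__":
    docs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for index, similarity in cosine_rank(docs, [1.0, 0.0, 0.0], top_n=2):
        print(index, round(similarity, 4))
```

If the document vectors are already cosine-normalized at indexing time, the per-document norm is 1 and the ranking reduces to sorting by dot product.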
49. laced on the side of the body. in others, more successful unions were made by fusing a transplanted eye to the right eye of a host. (5) approximately three months after operation, one of two large lens regenerates in a pair of perfectly fused eyes was removed. six weeks later a new large lens regenerate reappeared in most of the lentectomized units in the presence of the intact lens of the other unit. (6) there is a strong possibility that the more than normal amount of neural retina present provided a more powerful retinal factor for lens regeneration than the inhibiting influence of the intact lens in the environment.

Document 184, Similarity: 0.39233
an investigation of mitotic control in the rabbit lens epithelium. a water soluble substance which inhibits mitosis in the rabbit lens epithelium has been found to be present in young and old rabbit lenses. it has a high molecular weight and is relatively stable at room temperature. the inhibitory factor is associated with the γ-crystallin fraction and exists throughout the young lens, although the activity in the nuclear region on a wet weight basis is less than half that of the cortex and epithelium.

Document 169, Similarity: 0.38649
carbonic anhydrase distribution in rabbit lens. the distribution of carbonic anhydrase activity in the mature rabbit lens was determined. the activities in nucleus, cortex, epithelium with anterior capsule, anterior capsule, and posterior capsule were res
50. lar Value Decomposition of a rank-update matrix with MATLAB eigs.
[U, S, V] = SVD_UPDATE(A, X, Y, K) computes the K-factor SVD of the rank-update matrix defined by A, X and Y, using the eigs function of MATLAB.

svd_update_afun
SVD_UPDATE_AFUN: Auxiliary function used in SVD_UPDATE.

tdm_downdate
TDM_DOWNDATE renews a text collection by downdating the corresponding term-document matrix.
A = TDM_DOWNDATE(UPDATE_STRUCT, REMOVED_DOCS) returns the new term-document matrix of the downdated collection. UPDATE_STRUCT defines the update structure returned by TMG, while REMOVED_DOCS defines the indices of the documents that are to be removed.
[A, DICTIONARY] = TDM_DOWNDATE(UPDATE_STRUCT, REMOVED_DOCS) also returns the dictionary for the updated collection, while [A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS] = TDM_DOWNDATE(UPDATE_STRUCT, REMOVED_DOCS) also returns the vector of global weights for the dictionary and the normalization factor for each document, in case such a factor is used. If normalization is not used, TDM_DOWNDATE returns a vector of all ones.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC] = TDM_DOWNDATE(UPDATE_STRUCT, REMOVED_DOCS) also returns statistics for each document, i.e. the number of terms for each document.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC, TITLES, FILES] = TDM_DOWNDATE(UPDATE_STRUCT, REMOVED_DOCS) returns in FILES the filenames containing the collectio
51. lis, 2007.

pddp_optcut_2means
PDDP_OPTCUT_2MEANS: Hybrid Principal Direction Divisive Partitioning Clustering Algorithm and k-means.
PDDP_OPTCUT_2MEANS clusters a term-document matrix (tdm) using a combination of the Principal Direction Divisive Partitioning clustering algorithm [1] and k-means [2].
CLUSTERS = PDDP_OPTCUT_2MEANS(A, K) returns a cluster structure with K clusters for the tdm A.
[CLUSTERS, TREE_STRUCT] = PDDP_OPTCUT_2MEANS(A, K) also returns the full PDDP tree, while [CLUSTERS, TREE_STRUCT, S] = PDDP_OPTCUT_2MEANS(A, K) also returns the objective function of PDDP.
PDDP_OPTCUT_2MEANS(A, K, SVD_METHOD) defines the method used for the computation of the PCA (svds, the default, or propack).
PDDP_OPTCUT_2MEANS(A, K, SVD_METHOD, DSP) defines if results are to be displayed to the command window (default, 1) or not (0).
Finally, PDDP_OPTCUT_2MEANS(A, K, SVD_METHOD, DSP, EPSILON) defines the termination criterion value for the k-means algorithm.
REFERENCES:
[1] D. Boley, Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery 2 (1998), no. 4, 325-344.
[2] D. Zeimpekis, E. Gallopoulos, k-means Steering of Spectral Divisive Clustering Algorithms, Proc. of Text Mining Workshop, Minneapolis, 2007.

pddp_optcutpd
PDDP_OPTCUTPD: Hybrid Principal Direction Divisive Partitioning Clustering Algorithm and k-means.
PDDP_OPTCUTPD clusters a term-document matrix (tdm) using a co
52. lobal_weight The global term weighting function default x Possible values see 1 2 x None e Entropy f Inverse Document Frequency IDF g GfIdf n Normal p Probabilistic Inverse OPTIONS normalization Indicates if we normalize the document vectors default x Possible values x None c Cosine OPTIONS dsp Displays results default 1 or not 0 to the command window REFERENCES 1 M Berry and M Browne Understanding Search Engines Mathematical Modeling and Text Retrieval Philadelphia PA Society for Industrial and Applied Mathematics 1999 2 T Kolda Limited Memory Matrix Methods with Applications Tech Report CS TR 3806 1997 123 tmg_gui TMG_GUI TMG_GUI is a graphical user interface for all indexing routines of the Text to Matrix Generator TMG Toolbox For full documentation type help tmg help tmg_query help tdm_update or help tdm_downdate For full documentation of the GUI's usage select the help tab of the GUI 124 tmg_query TMG_QUERY Text to Matrix Generator query vector constructor TMG_QUERY parses a query text collection and generates the query vectors corresponding to the supplied dictionary Q TMG_QUERY FILENAME DICTIONARY returns the query vectors that correspond to the text collection contained in files of directory FILENAME DICTIONARY is the array of terms corresponding
53. mbination of the Principal Direction Divisive Partitioning clustering algorithm 1 2 and k means 3 CLUSTERS PDDP_OPTCUTPD A K L returns a cluster structure with K clusters for the tdm A formed using information from the first L principal components of the tdm CLUSTERS TREE_STRUCT PDDP_OPTCUTPD A K L also returns the full PDDP tree while CLUSTERS TREE_STRUCT S PDDP_OPTCUTPD A K L returns the objective function of PDDP PDDP_OPTCUTPD A K L SVD_METHOD defines the method used for the computation of the PCA svds default or propack Finally PDDP_OPTCUTPD A K L SVD_METHOD DSP defines if results are to be displayed to the command window default 1 or not 0 REFERENCES 1 D Boley Principal Direction Divisive Partitioning Data Mining and Knowledge Discovery 2 1998 no 4 325 344 2 D Zeimpekis E Gallopoulos PDDP Towards a Flexible Principal Direction Divisive Partitioning Clustering Algorithm Proc IEEE ICDM 03 Workshop on Clustering Large Data Sets Melbourne Florida 2003 3 D Zeimpekis E Gallopoulos k means Steering of Spectral Divisive Clustering Algorithms Proc of Text Mining Workshop Minneapolis 2007 104 ps_pdf2ascii PS_PDF2ASCII converts the input ps or pdf file to ASCII RESULT PS_PDF2ASCII FILENAME converts the input ps or pdf files to ASCII using ghostscript's utility ps2ascii RESULT returns a success indicator e.g. 2 if th
54. n's documents and a cell array TITLES that contains a declaratory title for each document as well as the document's first line Finally A DICTIONARY GLOBAL_WEIGHTS NORMALIZATION_FACTORS WORDS_PER_DOC TITLES FILES UPDATE_STRUCT TDM_DOWNDATE UPDATE_STRUCT REMOVED_DOCS returns the update structure that keeps the essential information for the collection's update or downdate TDM_DOWNDATE UPDATE_STRUCT REMOVED_DOCS OPTIONS defines optional parameters OPTIONS dsp Displays results default 1 or not 0 to the command window 119 tdm_update TDM_UPDATE renews a text collection by updating the corresponding term document matrix A TDM_UPDATE FILENAME UPDATE_STRUCT returns the new term document matrix of the updated collection FILENAME defines the file or files in case a directory is supplied containing the new documents while UPDATE_STRUCT defines the update structure returned by TMG In case the FILENAME variable is empty the collection is simply updated using the options defined by UPDATE_STRUCT for example to use another term weighting scheme A DICTIONARY TDM_UPDATE FILENAME UPDATE_STRUCT also returns the dictionary for the updated collection while A DICTIONARY GLOBAL_WEIGHTS NORMALIZED_FACTORS TDM_UPDATE FILENAME UPDATE_STRUCT returns the vectors of global weights for the dictionary and the normalization factor for each document in case such a factor is used If normaliza
55. nimum similarity measure value for which a document is treated as relevant to the query Continue Apply the selected operation Reset Reset window to default values Exit Exit window Table 4 Description of use of retrieval_gui components 13 3.5 Clustering module clustering_gui Figure 7 The clustering_gui GUI TMG implements three clustering algorithms: k means, Spherical k means 6, and Principal Direction Divisive Partitioning PDDP 3 15. Regarding PDDP, TMG implements the basic algorithm as well as the PDDP 15 along with some recent hybrid variants of PDDP and kmeans 19. The clustering GUI module is depicted in Figure 7, while Table 5 describes in detail all the clustering_gui fields. 14 Field Name Default Description Select Dataset Select the dataset Euclidean k means Check to use the euclidean k means clustering algorithm Spherical k means Check to use the spherical k means clustering algorithm PDDP Check to use the PDDP clustering algorithm Initialize Centroids At random Defines the method used for the initialization of the centroid vector in the course of k means Possibilities are initialize at random and supply a variable or mat file with the centroids matrix Termination Criterion Epsilon 1 Defines the termination criterion used in the cou
56. ntroids matrix Termination Criterion Epsilon 1 Defines the termination criterion used in the course of k means Possibilities are use an epsilon value default 1 and stop iteration when the objective function improvement does not exceed epsilon or perform a specific number of iterations default 10 Principal Directions 1 Number of principal directions used in PDDP Maximum num of PCs Check if the PDDP max 1 variant is to be applied Variant Basic A set of PDDP variants Possible values Basic Split with k means Optimal Split Optimal Split with k means Optimal Split on Projection 10 MATLAB svds Check to use MATLAB function svds for the computation of the SVD or PCA Propack Check to use PROPACK package for the computation of the SVD or PCA Number of Clusters Number of clusters computed Display Results Display results or not to the command window Store Results Check to store results Continue Apply the selected operation Reset Reset window to default values Exit Exit window Table 3 Description of use of nnmf_gui components 11 3.4 Retrieval module retrieval_gui Figure 6 The retrieval_gui GUI TMG offers two alternatives for Text Mining: Vector Space Model VSM 12, and Latent Semantic Analysis LSA 1 5 using a combinati
57. nts Description of use of classification_gui components 1 Introduction Text to Matrix Generator (TMG) is a MATLAB Toolbox that can be used for various Data Mining (DM) and Information Retrieval (IR) tasks. TMG uses the sparse matrix infrastructure of MATLAB, which is especially suited for Text Mining (TM) applications where data are extremely sparse. Initially built as a preprocessing tool, TMG now offers a wide range of DM tools. In particular, TMG is composed of six Graphical User Interface (GUI) modules, presented in Figure 1 (arrows show module dependencies). Figure 1 Structure and dependencies of GUI modules of TMG. In the sequel, we first discuss the installation procedure of TMG and then describe in some detail the GUI's usage. In Appendix A we give a demonstration of use for all the TMG components, while Appendix B supplies a function reference. 2 Installation Instructions Installation of TMG is straightforward by means of the init_tmg script. In particular, the user has to perform the following steps: for MySQL functionality, install MySQL and Java Connector; download TMG by filling in the form at http://scgroup.hpclab.ceid.upatras.gr/scgroup/Projects/TMG/tmg_request.php; unzip TMG_X.XRX.zip and start MATLAB (Figure 2 depicts the directory structure of the TMG root directory); change path to the TMG root directory; run init_tmg. Give the
58. ommand window 68 entropy ENTROPY computes the entropy of a clustering result VENTROPY CONFUSION_MATRIX MISTAKES ENTROPY CLUSTERS LABELS computes the entropy value of a clustering result represented by the CLUSTERS structure LABELS is a vector of integers containing the true labeling of the objects The entropy value is stored in VENTROPY while CONFUSION_MATRIX is a k x r matrix where k is the number of clusters and r the number of true classes and CONFUSION_MATRIX(i, j) records the number of objects of class j assigned to cluster i Finally MISTAKES contains the number of misassigned objects measured by $m_1 + \dots + m_k$ where $m_i = \sum_{j \neq i} \mathrm{CONFUSION\_MATRIX}(i, j)$ 69 get_node_scat GET_NODE_SCAT returns the PDDP node with the maximum scatter value see PDDP MAX_SCAT_IND M_SCAT GET_NODE_SCAT TREE_STRUCT SPLITTED returns the node index and the scatter value of the PDDP tree defined by TREE_STRUCT SPLITTED is a vector that determines the active nodes 70 gui GUI GUI is a simple top graphical user interface of the Text to Matrix Generator TMG Toolbox Using GUI the user can select any of the four GUI modules indexing dimensionality reduction clustering classification of TMG 71 init_tmg INIT_TMG Installation script of TMG INIT_TMG is the installation script of the Text to Matrix Generator TMG Toolbox INIT_TMG crea
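The ENTROPY entry above can be illustrated with a small Python sketch. It is not TMG's MATLAB code: the function names, the 0-based class labels, the base-2 logarithm, and the size-weighted average over clusters are assumptions chosen for illustration, and MISTAKES is read here as the sum of the off-diagonal entries in each row of the confusion matrix.

```python
import math


def confusion_matrix(clusters, labels, num_classes):
    """Count, for each cluster i and class j, the objects of class j in cluster i.

    clusters: list of lists of object indices (hypothetical layout).
    labels:   true class of each object, 0-based (an assumption).
    """
    cm = [[0] * num_classes for _ in clusters]
    for i, members in enumerate(clusters):
        for obj in members:
            cm[i][labels[obj]] += 1
    return cm


def clustering_entropy(cm):
    """Size-weighted average of the per-cluster class entropies (base 2)."""
    total = sum(sum(row) for row in cm)
    value = 0.0
    for row in cm:
        n_i = sum(row)
        if n_i == 0:
            continue  # an empty cluster contributes nothing
        h = -sum((c / n_i) * math.log2(c / n_i) for c in row if c > 0)
        value += (n_i / total) * h
    return value


def mistakes(cm):
    """Sum of off-diagonal entries: objects of class j != i placed in cluster i."""
    return sum(c for i, row in enumerate(cm) for j, c in enumerate(row) if j != i)
```

The misassignment count only makes sense when cluster i is meant to correspond to class i, which is why it is kept separate from the entropy value.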
59. on of any DR technique and Latent Semantic Indexing LSI. Using the corresponding GUI, the user can apply a query to an existing dataset using any of the aforementioned techniques and get an HTML response. The retrieval GUI module is depicted in Figure 6, while Table 4 describes in detail all the retrieval_gui fields. 12 Field Name Default Description Select Dataset Select the dataset Insert Query The query to be executed Alternative Global Weights Global weights vector used for the construction of the query vector Use Stored Global Weights Use the global weights vector found on the container directory of the dataset Stoplist Use a stoplist Local Term Weighting TF The local term weighting to be used Vector Space Model Apply the Vector Space Model retrieval method Latent Semantic Analysis The method used in the course of the Latent Semantic Analysis technique Possible values Singular Value Decomposition Principal Component Analysis Clustered Latent Semantic Analysis Centroid Method Semidiscrete Decomposition SPQR Number of Factors Select the number of factors used during the retrieval process Similarity Measure Cosine Similarity measure used during the retrieval process Number of most relevant Defines the number of most relevant documents returned for a query Similarity measure exceeds Defines the mi
60. ot 0 SC contains the sorted similarity coefficients while DOC_INDS contains the corresponding document indices 79 make_clusters_multi MAKE_CLUSTERS_MULTI auxiliary function for the classification algorithms CLUSTERS MAKE_CLUSTERS_MULTI LABELS forms the cluster structure of a multi label collection with document classes defined by LABELS cell array of vectors of integers 80 make_clusters_single MAKE_CLUSTERS_SINGLE auxiliary function for the classification algorithms CLUSTERS MAKE_CLUSTERS_SINGLE LABELS forms the cluster structure of a single label collection with document classes defined by LABELS vector of integers 81 make_labels MAKE_LABELS creates a label vector of integers for the input cell array of strings LABELS UNIQUE_LABELS MAKE_LABELS INPUT_LABELS creates a vector of integer labels LABELS for the input cell array of strings INPUT_LABELS UNIQUE_LABELS contains the strings of unique labels of the input cell array 82 make_val_inds MAKE_VAL_INDS auxiliary function for the classification algorithms INDS MAKE_VAL_INDS LABELS constructs an index vector used during the thresholding phase of any classifier for the multi label collection with document classes defined by LABELS cell array of vectors of integers 83 merge_dictionary MERGE_DICTIONARY merges two cell arrays of chars and ret
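The MAKE_LABELS and MAKE_CLUSTERS_SINGLE entries above describe a simple mapping from label strings to integers and from integers to groups of documents. A minimal Python sketch of that idea follows. It is an illustration, not the toolbox code: numbering labels in order of first appearance, the 1-based indices (chosen to mirror MATLAB), and the dictionary used for the cluster structure are all assumptions.

```python
def make_labels(input_labels):
    """Map label strings to 1-based integers, numbering labels as they first appear."""
    unique_labels = []
    index = {}
    labels = []
    for name in input_labels:
        if name not in index:
            unique_labels.append(name)
            index[name] = len(unique_labels)
        labels.append(index[name])
    return labels, unique_labels


def make_clusters_single(labels):
    """Group 1-based document positions by their single integer label."""
    clusters = {}
    for doc, label in enumerate(labels, start=1):
        clusters.setdefault(label, []).append(doc)
    return clusters
```

For a multi-label collection the same grouping would append each document to every class it belongs to, so a document could appear in more than one group.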
61. pectively 2484 256 1571 87 545 93 159 39 and 65 49 moles co kg wet tissue wt per hr at 0 c it was concluded on the basis of the available evidence that carbonic anhydrase cannot play a primary role in the cation transport system of the lens i 170 Figure 23 The output of retrieval_gui 7 Press the Reset button in order to change the input. A.5 Clustering module clustering_gui Suppose we have processed a collection with tmg_gui, constructed a tdm with 1,033 documents and 12,184 terms corresponding to the well known MEDLINE collection, and stored the results to TMG_HOME data medline. Assume then that we want to cluster the TDM using the PDDP clustering algorithm with the following input: principal directions 1, maximum number of PCs, variant basic, number of clusters 5, and that we want to store results to directory medline. 41 1 Initially select the clustering algorithm you want to apply by pressing the corresponding radio button. 2 The selection of a radio button activates the required fields in the GUI while deactivating the remaining fields. Figure 24 Starting window of clustering_gui 42 3 Fill in the required fields by pressing the check buttons, editing the edit boxes, or selecting the appropriate files or variables by pressing the Browse button. Text to Matrix Generator Clustering Program FilesMATLABIR
62. roblems on Lanczos and Lanczos Bidiagonalization with Partial Reorthogonalization Stanford University http://sun.stanford.edu/~rmunk/PROPACK 94 pca_propack_Atransfunc PCA_PROPACK_ATRANSFUNC Auxiliary function used in PCA_PROPACK 95 pca_propack_afun PCA_PROPACK_AFUN Auxiliary function used in TMG_PCA_PROPACK 96 pca_update PCA_UPDATE Principal Component Analysis of a rank 1 updated matrix with MATLAB eigs U S V PCA_UPDATE A W H C K computes the K factor Principal Component Analysis of A W H i.e. SVD of A W H C ones size A 2 1 using the svds function of MATLAB 97 pca_update_afun PCA_UPDATE_AFUN Auxiliary function used in PCA_UPDATE 98 pddp PDDP Principal Direction Divisive Partitioning Clustering Algorithm PDDP clusters a term document matrix tdm using the Principal Direction Divisive Partitioning clustering algorithm 1 2 CLUSTERS PDDP A K L returns a cluster structure with K clusters for the tdm A formed using information from the first L principal components of the tdm CLUSTERS TREE_STRUCT PDDP A K L also returns the full PDDP tree while CLUSTERS TREE_STRUCT S PDDP A K L returns the objective function of PDDP PDDP A K L SVD_METHOD defines the method used for the computation of the PCA svds default or propack while PDDP A K L SVD
63. roc 14th Conference on Computational Linguistics pages 447-453 Morristown NJ USA 1992 4 D Zeimpekis and E Gallopoulos Non Linear Dimensional Reduction via Class Representatives for Text Classification In Proc 2006 IEEE International Conference on Data Mining ICDM 06 Hong Kong Dec 2006 110 scut_rocchio SCUT_ROCCHIO implements the Scut thresholding technique from 1 for the Rocchio classifier THRESHOLD SCUT_ROCCHIO A CLUSTERS BETA GAMMA Q LABELS_TR LABELS_TE MINF1 NORMALIZE STEPS returns the vector of thresholds for the Rocchio classifier for the collection A Q A and Q define the training and test parts of the validation set with labels LABELS_TR and LABELS_TE respectively MINF1 defines the minimum F1 value while NORMALIZE defines if cosine 1 or euclidean distance 0 measure of similarity is to be used CLUSTERS is a structure defining the classes and STEPS defines the number of steps used during thresholding BETA and GAMMA define the weight of positive and negative examples in the formation of each class centroid THRESHOLD F THRESHOLDS SCUT_ROCCHIO A CLUSTERS BETA GAMMA Q LABELS_TR LABELS_TE MINF1 NORMALIZE STEPS also returns the best F1 value as well as the matrix of thresholds for each step row i corresponds to step i REFERENCES 1 Y Yang A Study of Thresholding Strategies for Text Categorization In Proc 24th ACM SIGIR pages
64. rse of k means Possibilities are use an epsilon value default 1 and stop iteration when the objective function improvement does not exceed epsilon or perform a specific number of iterations default 10 Principal Directions 1 Number of principal directions used in PDDP Maximum num of PCs Check if the PDDP max 1 variant is to be applied Variant Basic A set of PDDP variants Possible values Basic Split with k means Optimal Split Optimal Split with k means Optimal Split on Projection MATLAB svds Check to use MATLAB function svds for the computation of the SVD in the course of PDDP Propack Check to use PROPACK package for the computation of the SVD in the course of PDDP Number of Clusters Number of clusters computed Display Results Display results or not to the command window Store Results Check to store results Continue Apply the selected operation Reset Reset window to default values Exit Exit window Table 5 Description of use of clustering_gui components 15 3.6 Classification module classification_gui Figure 8 The classification_gui GUI TMG implements three classification algorithms: k Nearest Neighbors KNN, Rocchio, and Linear Least Squares Fit LLSF 14. All these algorit
65. s Optimal Split Optimal Split with k means Optimal Split on Projection Automatic Determination of Num of factors for each cluster Check to apply a heuristic for the determination of the number of factors computed from each cluster in the course of the CLSI algorithm Number of Clusters Number of clusters computed in the course of the CLSI algorithm Display Results Display results or not to the command window Select at least one factor from each cluster Use this option in case low rank data are to be used in the course of classification Number of factors Rank of approximation Store Results Check to store results Continue Apply the selected operation Reset Reset window to default values Exit Exit window Table 2 Description of use of dr_gui components 3.3 Non Negative Factorizations module nnmf_gui Figure 5 The nnmf_gui GUI This module deploys a set of Non Negative Matrix Factorization NNMF techniques. Since these techniques are iterative, the final result depends on the initialization. A common approach is the random initialization of the non negative factors; however, new approaches appear to result in higher quality approximations. TMG implements four initi
66. se matrices ACM TOMS 31 2005 no 2 3 D Boley Principal direction divisive partitioning Data Mining and Knowledge Discovery 2 1998 no 4 325 344 4 C Boutsidis and E Gallopoulos SVD based initialization A head start on non negative matrix factorization Pattern Recognition 41 2008 no 4 1350 1362 5 S Deerwester S Dumais G Furnas T Landauer and R Harshman Indexing by Latent Semantic Analysis Journal of the American Society for Information Science 41 1990 no 6 391 407 6 I S Dhillon and D S Modha Concept decompositions for large sparse text data using clustering Machine Learning 42 2001 no 1 143 175 7 H Kim and H Park Non negative matrix factorization based on alternating non negativity constrained least squares and the active set method SIAM Journal on Matrix Analysis and Applications 2008 to appear 8 T Kolda and D O'Leary Algorithm 805 computation and uses of the semidiscrete matrix decomposition ACM TOMS 26 2000 no 3 9 D D Lee and H S Seung Algorithms for Non Negative Matrix Factorizations Advances in Neural Information Processing Systems 13 2001 556 562 10 H Park M Jeon and J Rosen Lower dimensional representation of text data based on centroids and least squares BIT 43 2003 11 M F Porter An algorithm for suffix stripping Program 1980 no 3 130 137 19 12 G Salton C Yang and A Wong A Vector Space Model for Au
67. ssifier for the collection A Q A and Q define the training and test parts of the validation set with labels LABELS_TR and LABELS_TE respectively CLUSTERS is a structure defining the classes while MINF1 defines the minimum F1 value and STEPS defines the number of steps used during thresholding METHOD is the method used for the approximation of the rank truncated SVD with possible values clsi Clustered Latent Semantic Indexing 4 cm Centroids Method 1 svd Singular Value Decomposition SVD_METHOD defines the method used for the computation of the SVD while CLSI_METHOD defines the method used for the determination of the number of factors from each class used in Clustered Latent Semantic Indexing in case METHOD equals clsi THRESHOLD F THRESHOLDS SCUT_LLSF A Q CLUSTERS K LABELS_TR LABELS_TE MINF1 L METHOD STEPS SVD_METHOD CLSI_METHOD also returns the best F1 value as well as the matrix of thresholds for each step row i corresponds to step i REFERENCES 1 H Park M Jeon and J Rosen Lower Dimensional Representation of Text Data Based on Centroids and Least Squares BIT Numerical Mathematics 43 2 427-448 2003 2 Y Yang A Study of Thresholding Strategies for Text Categorization In Proc 24th ACM SIGIR pages 137-145 New York NY USA 2001 ACM Press 3 Y Yang and C Chute A Linear Least Squares Fit Mapping Method for Information Retrieval from Natural Language Texts In P
68. tes the MySQL database and adds all TMG directories to the path 72 knn_multi KNN_MULTI k Nearest Neighbors classifier for multi label collections LABELS_AS KNN_MULTI A Q K LABELS NORMALIZED_DOCS THRESHOLDS classifies the columns of Q with the K Nearest Neighbors classifier using the pre classified columns of matrix A with labels LABELS cell array of vectors of integers THRESHOLDS is a vector of class threshold values NORMALIZED_DOCS defines if cosine 1 or euclidean distance 0 similarity measure is to be used LABELS_AS contains the assigned labels for the columns of Q 73 knn_single KNN_SINGLE k Nearest Neighbors classifier for single label collections LABELS_AS KNN_SINGLE A Q K LABELS NORMALIZED_DOCS classifies the columns of Q with the K Nearest Neighbors classifier using the pre classified columns of matrix A with labels LABELS vector of integers NORMALIZED_DOCS defines if cosine 1 or euclidean distance 0 similarity measure is to be used LABELS_AS contains the assigned labels for the columns of Q 74 ks_selection KS_SELECTION implements the heuristic method from 2 for the selection of the number of factors from each cluster used in the Clustered Latent Semantic Indexing method 1 N_ST KS_SELECTION A N_COLS ALPHA_VAL L returns in N_ST a vector of integers denoting the number of factors sum equals L selected from each clust
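The KNN_SINGLE entry above can be sketched in a few lines of Python for a single query. This is an illustration of the general k nearest neighbors idea, not the toolbox routine: documents are plain lists of floats instead of columns of a sparse matrix, the vote is an unweighted majority (TMG may weight votes by similarity), and the function and parameter names are hypothetical.

```python
import math
from collections import Counter


def knn_single(train_docs, query, k, labels, normalized_docs=True):
    """Assign the query the majority label among its k most similar training documents.

    normalized_docs=True ranks by cosine similarity, False by euclidean distance,
    mirroring the 1 / 0 switch described in the reference entry.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def negative_distance(u, v):
        # Negated so that "larger is more similar" holds for both measures.
        return -math.dist(u, v)

    score = cosine if normalized_docs else negative_distance
    ranked = sorted(range(len(train_docs)),
                    key=lambda i: score(train_docs[i], query),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Classifying every column of a query matrix amounts to calling this function once per query vector; a multi-label variant would compare per-class scores against thresholds instead of taking a single majority vote.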
69. tion is not used TDM_UPDATE returns a vector of all ones A DICTIONARY GLOBAL_WEIGHTS NORMALIZATION_FACTORS WORDS_PER_DOC TDM_UPDATE FILENAME UPDATE_STRUCT returns statistics for each document i.e. the number of terms for each document A DICTIONARY GLOBAL_WEIGHTS NORMALIZATION_FACTORS WORDS_PER_DOC TITLES FILES TDM_UPDATE FILENAME UPDATE_STRUCT returns in FILES the filenames contained in directory or file FILENAME and a cell array TITLES that contains a declaratory title for each document as well as the document's first line Finally A DICTIONARY GLOBAL_WEIGHTS NORMALIZATION_FACTORS WORDS_PER_DOC TITLES FILES UPDATE_STRUCT TDM_UPDATE FILENAME UPDATE_STRUCT returns the update structure that keeps the essential information for the collection's update or downdate TDM_UPDATE FILENAME UPDATE_STRUCT OPTIONS defines optional parameters OPTIONS delimiter The delimiter between documents within the same file Possible values are emptyline default none_delimiter treats each file as a single document or any other string OPTIONS line_delimiter Defines if the delimiter takes a whole line of text default 1 or not 120 OPTIONS update_step The step used for the incremental build of the inverted index default 10 000 OPTIONS dsp Displays results default 1 or not 0 to the command window 121 tmg TMG Text to Matrix Generator
70. to a text collection Each query must be separated by a blank line or another delimiter that is defined by the OPTIONS argument in each file Q WORDS_PER_QUERY TMG_QUERY FILENAME DICTIONARY returns statistics for each query i.e. the number of terms for each query Finally Q WORDS_PER_QUERY TITLES FILES TMG_QUERY FILENAME returns in FILES the filenames contained in directory or file FILENAME and a cell array TITLES that contains a declaratory title for each query as well as the query's first line TMG_QUERY FILENAME DICTIONARY OPTIONS defines optional parameters OPTIONS delimiter The delimiter between queries within the same file Possible values are emptyline default none_delimiter treats each file as a single query or any other string OPTIONS line_delimiter Defines if the delimiter takes a whole line of text default 1 or not OPTIONS stoplist The filename for the stoplist i.e. a list of common words that we don't use for the indexing default no stoplist used OPTIONS stemming Indicates if the stemming algorithm is used 1 or not 0 default OPTIONS update_step The step used for the incremental build of the inverted index default 10 000 OPTIONS local_weight The local term weighting function default t Possible values see 1 2 t Term Frequency b Binary l Logarithmic a Alternate Log n Augmented Normalized Term Frequency
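The TMG_QUERY entry above describes splitting a query file on a delimiter and mapping each query onto an existing dictionary. A minimal Python sketch of that flow, under stated assumptions, is given below: it handles only the delimiter option and raw term frequency weighting, it tokenizes with a simple lowercase letters-only pattern, and it ignores stoplists, stemming, and term length limits. The function names are hypothetical and do not belong to TMG.

```python
import re


def split_queries(text, delimiter="emptyline"):
    """Split raw text into queries using the delimiter values named in the entry."""
    if delimiter == "none_delimiter":
        parts = [text]                      # the whole file is one query
    elif delimiter == "emptyline":
        parts = re.split(r"\n\s*\n", text)  # one or more blank lines
    else:
        parts = text.split(delimiter)       # any other string
    return [p.strip() for p in parts if p.strip()]


def query_vectors(text, dictionary, delimiter="emptyline"):
    """Build one term-frequency vector per query, aligned with the dictionary order."""
    position = {term: i for i, term in enumerate(dictionary)}
    vectors = []
    for query in split_queries(text, delimiter):
        vec = [0] * len(dictionary)
        for token in re.findall(r"[a-z]+", query.lower()):
            if token in position:           # terms outside the dictionary are dropped
                vec[position[token]] += 1
        vectors.append(vec)
    return vectors
```

Aligning the query vectors with the dictionary of the indexed collection is what allows them to be compared directly against the columns of the term document matrix.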
71. tomatic Indexing Communications of the ACM 18 1975 no 11 613 620 13 S Wild J Curry and A Dougherty Improving non negative matrix factorizations through structured initialization Pattern Recognition 37 2004 2217 2232 14 Y Yang and C Chute A linear least squares fit mapping method for information retrieval from natural language texts In 14th Conf Comp Linguistics 1992 15 D Zeimpekis and E Gallopoulos PDDP(l) Towards a Flexible Principal Direction Divisive Partitioning Clustering Algorithm Proc IEEE ICDM 03 Workshop on Clustering Large Data Sets Melbourne Florida D Boley I Dhillon J Ghosh and J Kogan eds 2003 pp 26 35 16 D Zeimpekis and E Gallopoulos CLSI A flexible approximation scheme from clustered term document matrices In Proc SIAM 2005 Data Mining Conf Newport Beach California H Kargupta J Srivastava C Kamath and A Goodman eds April 2005 pp 631 635 17 D Zeimpekis and E Gallopoulos Linear and non linear dimensional reduction via class representatives for text classification In Proc of the 2006 IEEE International Conference on Data Mining Hong Kong December 2006 pp 1172 1177 18 4 D Zeimpekis and E Gallopoulos TMG A MATLAB toolbox for generating term document matrices from text collections Grouping Multidimensional Data Recent Advances in Clustering J Kogan C Nicholas and M T
72. urns only the distinct elements of their union used by tmg.m tmg_query.m tdm_update.m ALL_WORDS ALL_DOC_IDS MERGE_DICTIONARY ALL_WORDS NEW_WORDS ALL_DOC_IDS NEW_DOC_IDS returns in ALL_WORDS all distinct elements of the union of the cell arrays of chars ALL_WORDS NEW_WORDS corresponding to two document collections ALL_DOC_IDS and NEW_DOC_IDS contain the inverted indices of the two collections Output argument ALL_DOC_IDS contains the inverted index of the whole collection 84 merge_tdms MERGE_TDMS Merges two document collections A DICTIONARY MERGE_TDMS A1 DICTIONARY1 A2 DICTIONARY2 merges the tdms A1 and A2 with corresponding dictionaries DICTIONARY1 and DICTIONARY2 MERGE_TDMS A1 DICTIONARY1 A2 DICTIONARY2 OPTIONS defines optional parameters OPTIONS min_local_freq The minimum local frequency for a term default 1 OPTIONS max_local_freq The maximum local frequency for a term default inf OPTIONS min_global_freq The minimum global frequency for a term default 1 OPTIONS max_global_freq The maximum global frequency for a term default inf OPTIONS local_weight The local term weighting function default t Possible values see 1 2 t Term Frequency b Binary l Logarithmic a Alternate Log n Augmented Normalized Term Frequency OPTIONS global_weight The global term weighting function default x Possible values see 1 2 x
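The MERGE_DICTIONARY entry above amounts to taking the union of two term lists while concatenating their inverted indices. A minimal Python sketch of that operation follows. It is an illustration only: the sorted output order, the list-of-lists layout for the inverted index, and the assumption that the document ids of the new collection are already offset so that they do not collide with the existing ones are all choices made here, not details taken from the toolbox.

```python
def merge_dictionary(all_words, new_words, all_doc_ids, new_doc_ids):
    """Return the distinct terms of both collections with their merged posting lists.

    all_doc_ids[i] lists the documents containing all_words[i]; likewise for the
    new collection. Document ids are assumed not to overlap between the two.
    """
    merged = {word: list(ids) for word, ids in zip(all_words, all_doc_ids)}
    for word, ids in zip(new_words, new_doc_ids):
        merged.setdefault(word, []).extend(ids)
    words = sorted(merged)  # sorted order is an assumption of this sketch
    return words, [merged[word] for word in words]
```

A term that occurs in both collections keeps a single dictionary entry whose posting list covers documents from both, which is the property an update of the term document matrix relies on.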
