User Manual book 1 version 2.5

1. H_A: Risk of disease is elevated near the focus. Elevation in risk can be estimated with a relative risk function (RRF) that incorporates study subject distance from the focus.

Test statistic

Following Bithell (1995), let $\lambda_{0i}$ denote the relative risk for region $i$ under the null hypothesis and let $\lambda_{1i}$ be the corresponding relative risk under the alternative hypothesis. $x_i$ is the case count in region $i$, $e_i$ is the expected case count in region $i$, and $k$ is the number of regions. A log likelihood test can be used to see which model, the null or the alternative, better fits the data. The log likelihood function, $\log L$, is

$$\log L = \sum_{i=1}^{k} x_i \log\frac{\lambda_{1i}}{\lambda_{0i}} - \sum_{i=1}^{k} e_i\,(\lambda_{1i} - \lambda_{0i})$$

The most powerful test of the null versus the alternative hypothesis is whether T exceeds a critical value $t_p$, chosen based on an appropriate type 1 error (alpha). The second part of the previous equation drops out because it is a constant for fixed values of the null and alternative relative risks (Bithell 1995):

$$T = \sum_{i=1}^{k} x_i \log\frac{\lambda_{1i}}{\lambda_{0i}}$$

Regardless of the assumption about the constant value of $\lambda$, a test based on the sum over all cases can be used in both the conditional and unconditional tests. Each case is assigned a risk score given by the logarithm of the relative risk appropriate for its assigned region, and these scores are summed over all areas:

$$T = \sum_{i=1}^{k} x_i \log \lambda_{1i}$$

Conditional and unconditional tests

There are two forms of the test, conditional and unconditional. The conditional test and the Monte Car
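The risk score sum described above can be sketched in a few lines of Python. This is an illustration only, not ClusterSeer's implementation: the function name `bithell_score` and its arguments are hypothetical, and the null relative risk is assumed to be 1 in every region unless a different value is supplied.

```python
import math


def bithell_score(case_counts, rr_alt, rr_null=None):
    """Sum x_i * log(lambda_1i / lambda_0i) over all regions.

    case_counts: case count x_i for each region
    rr_alt:      relative risk under the alternative for each region
    rr_null:     relative risk under the null (assumed 1.0 if omitted)
    """
    if rr_null is None:
        rr_null = [1.0] * len(case_counts)
    if not (len(case_counts) == len(rr_alt) == len(rr_null)):
        raise ValueError("all inputs must have one value per region")
    return sum(
        x * math.log(alt / null)
        for x, alt, null in zip(case_counts, rr_alt, rr_null)
    )


# Three regions; only the first has elevated risk under the alternative.
T = bithell_score([2, 0, 1], [2.0, 1.5, 1.0])
```

Regions with no cases contribute nothing to T, and regions whose alternative relative risk equals the null contribute zero regardless of their case count.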
2. b. Submit the case data file with the following structure:

region label | temporal interval | case count | population-at-risk count

This file will be checked for duplicate centroid values or temporal intervals for any one region. If you wish, you may use the Select File button to change your file choices.

Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic.

After you hit OK, ClusterSeer will establish nearest neighbor relationships. If you hit Stop at this point, the procedure will cancel. Then ClusterSeer will run the Monte Carlo simulations. You may stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit.

Kulldorff's Scan: Results

Distribution

This histogram shows the reference distribution generated by randomizing the dataset and recalculating the test statistic. To view the Monte Carlo distribution, select MC Distribution from the View menu. The test statistics for the three most likely clusters are illustrated as thin colored bars. Comparing the observed values to the range of maximum values from the simulations provides one-sided upper P-values for each observed value. The second and third most likely clusters are chosen using two criteria: 1) the value of the tes
3. • Higher numbers of bands increase the resolution of the L(h) plot. ClusterSeer defaults the number of distance steps to 10 unless you supplied a different value in the previous analysis.

Choose the number of Monte Carlo runs, the number of simulations that are graphed for comparison with the observed L(h) function shown in the Plot.

Once you hit the OK button, ClusterSeer will run the Monte Carlo simulations. You may stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit.

Then you can view the results of the analysis.

Ripley's K: Results

Map

Choose Map from the View menu. ClusterSeer will display a map of the cases' spatial distribution. If you query one of these points, you'll be able to view its label and spatial coordinates.

Plot

To view the plot, choose Plot from the View menu. The plot displays the observed values of L(h) and the results of the Monte Carlo simulations. The x-axis is distance, with the maximum distance h. The y-axis is the values of L(h) calculated from the data or simulated in Monte Carlo randomizations.

L(h) from dataset: L(h) points (black) show L(h) estimated from the data; the L(h) line connects the L(h) points.
L(h) simulations: individual simulation results; average simulation values (green); L(h) simulation (blue) U
4. Cliff, A. D. and Ord, J. K. 1981. Spatial processes: Models and applications. London: Pion.
Diggle, P. J. and Rowlinson, B. S. 1994. A conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society 157: 433-440.
Diggle, P. J. 1990. A point process modelling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point. Journal of the Royal Statistical Society 153: 349-362.
Fishman, G. S. 1973. Concepts and methods in discrete event digital simulation. New York: John Wiley and Sons.
Hjalmars, U., Kulldorff, M., Gustafsson, G., and Nagarwalla, N. 1996. Childhood leukemia in Sweden: Using GIS and a spatial scan statistic for cluster detection. Statistics in Medicine 15: 707-715.
Holland, B. S. and Copenhaver, M. D. 1987. An improved sequentially rejective Bonferroni test procedure. Biometrics 43: 417-423.
Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6: 65-70.
Jacquez, G. M. and Waller, L. A. 1999. The effect of uncertain locations on disease cluster statistics. In Quantifying Spatial Uncertainty in Natural Resources: Theory and Applications for GIS and Remote Sensing, H. T. Mowrer and R. G. Congalton, eds. Chelsea, Michigan: Sleeping Bear Press, pp. 53-64.
Kulldorff, M. 1999. Spatial scan statistics: models, calculations, and applications. In Scan Statistics and Applications, Glaz, J. & Balakrishnan, eds. B
5. • Planar. This category encompasses all map projections, including UTM (Universal Transverse Mercator) and user coordinates.
• Geographic (latitude, longitude). Within ClusterSeer, data in geographic coordinates are transformed to UTM for calculation and mapping.
  o If your data are in geographic coordinates, you can choose to use a scale of either meters or kilometers. This scale will be used to specify distances on the map and in the analyses.

Missing data

Currently, the only type of missing data ClusterSeer can handle is gaps in temporal intervals. If you have a file with case counts for temporal intervals and you are using census data for population-at-risk counts, then ClusterSeer will interpret the missing intervals as having a case count of zero. Other missing data will prevent file import.

FILE TYPES

Text files

ClusterSeer requires most data in ASCII text file format. ASCII, or plain text, files can be exported from many spreadsheet and data analysis programs, or you can create them directly in a text editor. While the Select File dialog defaults to importing a file with the extension .txt, ClusterSeer will import plain text files with any file extension. To import a file with a different extension, choose All Files after "Files of type" in the Select File dialog to view all files. Then choose the file to import.

Different methods require different file structures. The types of data and their or
6. Local spatial methods

These cluster detection methods are used to investigate spatial disease clusters near a particular area. They can be thought of as methods that attempt to answer the question: Are cases neighboring a particular case closer together than expected by chance? Local cluster detection methods are available for group-level data only.
• Besag and Newell's method
• Turnbull's method
• Local Moran

Focused spatial methods

These cluster detection methods evaluate spatial disease patterns around a particular location, or focus. Candidate locations can be used to represent the position of a proposed risk factor, such as a contaminated well. These methods attempt to answer the question: Is there a cluster of cases around the identified location? The null hypothesis for focused tests is no clustering around the focus.

Focused cluster detection methods available in ClusterSeer:
Individual-level data: Diggle's Method
Group-level data: Bithell's method

Space-time clusters

Spatio-temporal methods detect disease clusters in space that depend on the time period (Space x Time interaction).
• Kulldorff's Spatial Scan

Temporal clusters

[Figure: Temporal Analysis]

Temporal cluster detection methods are used to investigate disease clusters in time, that is, whether cases of disease tend to aggregate in particular periods. All are used on group-level data.
7. In this chapter you can learn about the methods within ClusterSeer:
• retrospective surveillance
• spatial clustering
  o global
  o local
  o focused
• spatio-temporal clustering

Temporal clustering will be included in the next version of ClusterSeer.

Retrospective surveillance

Retrospective surveillance methods monitor changes in the occurrence of some event, such as the temporal or spatial pattern of a disease. Surveillance methods can signal when current conditions differ from a historical baseline (O'Brien and Christie 1997). For surveillance, the important steps are determining the baseline rate and the threshold for alarm: how much change from the baseline is enough for concern? Thus, statistical surveillance methods trade off sensitivity to changes against the likelihood of producing a false alarm. Surveillance methods have the highest accuracy for larger datasets and the highest sensitivity for lower baseline disease rates (Barbujani and Calzolari 1984).

ClusterSeer contains two surveillance methods. Levin and Kline's method analyzes group-level data, and Rogerson's method requires both individual-level and group-level data.
• Levin and Kline's modified CuSum, for temporal surveillance. This method explores changes in the frequency of an event, such as infection or a disease.
• Rogerson's Spatial Pattern Surveillance Technique, for spatial surveillance. This method explores changes in the spatial pattern of an ev
8. [Figure: histograms of two Poisson distributions, Lambda = 2 and Lambda = 5; x-axis: Value]

Poisson point processes

Poisson point process models are used for null and alternative spatial models in Diggle's Method and Ripley's K function. Poisson point processes produce sets of points with a given intensity (the mean and variance of the Poisson distribution), an expected number of points or cases per unit area.

Z-Scores

Z-scores calculate a standardized difference between the observed and expected value of a statistic:

$$Z = \frac{I - E[I]}{\sqrt{\mathrm{Var}(I)}}$$

In this case, $I$ is the statistic, $E[I]$ is the expected value of $I$, and $\mathrm{Var}(I)$ is the variance of $I$. Z-scores are distributed approximately normally, with a mean of 0 and a variance of 1.0.

Interquartile distance

The interquartile distance is used to find outliers in the local Moran test. The interquartile distance is the difference between the values for the 25th percentile and the 75th percentile of the test statistic. To obtain these values, ClusterSeer orders the test statistics from smallest to largest. The 25th percentile value is the test statistic that divides the ordered set such that 25% of the statistics are smaller and 75% are greater than that value. The 75th percentile value is the test statistic that divides the ordered set such that 75% of the statistics are smaller and 25% are greater. If the number of test statistics cannot be evenl
10. If you edit the average disease frequency, the caption for the box will change from "average" to "expected" disease frequency. You can reset the value to the average frequency at any time by clicking the reset button next to the box.

Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons.

Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic.

After you hit OK, ClusterSeer will establish nearest neighbor relationships. If you hit Stop at this point, the procedure will cancel. Then ClusterSeer will run the Monte Carlo simulations. You may stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit.

Besag and Newell: Results

Distribution

You can view the Monte Carlo distribution by choosing MC Distribution from the View menu. This histogram shows the reference distribution generated by randomizing the dataset and recalculating r. r is illustrated in black, and it is compared with the distribution for estimatin
11. Choose Local Moran Test from the QuickStat menu, or from Analysis choose Spatial, then Local I.

In a series of dialogs, ClusterSeer will prompt you for the files it requires. If you submitted suitable datasets in the previous analysis, you will jump directly to step 2.

a. Submit the disease frequency file with the following structure:

region label | disease frequency

This file will be checked for duplicate regions and should follow ClusterSeer data import requirements.

b. Submit the contiguity file (for file structure, see Contiguity files).

If you wish, use the Select File button to change your file choice.

Set the initial alpha level. ClusterSeer will correct this level using the Bonferroni and Sidak adjustments that compensate for the average number of neighboring regions found in the dataset.

Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic.

After you hit OK, ClusterSeer will establish nearest neighbor relationships. If you hit Stop at this point, the procedure will cancel. Then ClusterSeer will run the Monte Carlo simulations. You may stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit.

Local Moran: Results

Distribution

You can view a hist
12. Commonly, k is set to one half the change you would like to detect, measured in standard deviations. Setting k = 0.5 implies that you seek to detect a shift in the mean of the baseline value of one standard deviation from that mean. For a given choice of k, the time required to detect a true change that has a magnitude of 2k standard deviations will be minimized. You do not set k directly in the dialog. Instead, you enter K, and then ClusterSeer uses the following formula to set k:

$$k = \frac{K}{\sqrt{n}}$$

Critical value, h

The term h is a cutoff or critical value that is compared with the cumulative sum. When the cumulative sum exceeds h, ClusterSeer will signal a significant change in the process. The higher the value of h, the lower the false alarm rate (where a change is signalled but has not in fact occurred), at the cost of a longer time to detect a true change. You do not set h directly in the dialog. Instead, you enter H, and then ClusterSeer uses the following formula to set h:

$$h = \frac{H}{\sqrt{n}}$$

Risk weight, Tau

Tau weights the surrounding subregions (see formula); larger values correspond to decreasingly severe declines in risk with distance. Thus, larger values of tau require clusters to be larger or more localized to be noticed.

Batch size, n

The term n is the batch size for accumulating the mean of Z. These batches are used when the underlying data are not normal, as occurs for most case count data.

Rogerson's Method: How to

To run a Rogerson's analysis, choose Rogerson's Surveillan
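The roles of k and h can be illustrated with a minimal one-sided cumulative sum sketch. It is not ClusterSeer's code: the function name `first_cusum_alarm` is hypothetical, the recursion S_t = max(0, S_(t-1) + z_t - k) is the textbook cumulative sum rather than anything confirmed by this manual, and the scaling of K and H by the square root of the batch size n is an assumption about how the dialog values are converted.

```python
import math


def first_cusum_alarm(z_values, K, H, n=1):
    """Return the index of the first value at which the cumulative sum
    exceeds h, or None if it never does.

    z_values: standardized observations (or batch means of size n)
    K, H:     values entered by the user
    n:        batch size; k = K / sqrt(n) and h = H / sqrt(n) are
              ASSUMED scalings, labelled as such
    """
    k = K / math.sqrt(n)
    h = H / math.sqrt(n)
    s = 0.0
    for t, z in enumerate(z_values):
        # Accumulate only the part of each observation that exceeds k;
        # the sum is never allowed to fall below zero.
        s = max(0.0, s + z - k)
        if s > h:
            return t
    return None


alarm_at = first_cusum_alarm([0.2, 0.1, 3.0, 3.0], K=0.5, H=4.0)
```

Raising H in this sketch delays or suppresses the alarm, which is why a larger critical value trades slower detection for fewer false alarms.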
13. These methods can be used to evaluate disease frequency or case counts in a single or in multiple time series. The following methods are currently available in BioMedware's Stat! You may order Stat! or wait until these methods are incorporated into ClusterSeer in the next release.

Methods (table columns: Disease frequency | Case count, single time series | Case count, multiple time series):
• Dat's Method
• Ederer-Myers-Mantel Method
• Empty Cells
• Grimson's Method
• Larsen's Method
• Wallenstein's Scan

Chapter 5: Besag and Newell's Method

Besag and Newell's method can detect local or global spatial clusters in group-level data. When you initiate a Besag and Newell analysis in ClusterSeer, you get both local and global analysis output. While individual- or case-level analysis is theoretically possible with this method, ClusterSeer implements only the region-centered, group-level technique.

This method scans the data for collections of cases that appear to be unusual clusters. To do so, it centers a circular window on each region in turn. This window is then expanded to include neighboring regions until the total number of cases in the window reaches a user-specified threshold, k. Then, the population size inside the window is compared to that expected under an average or expected disease frequency.

Examples

Besag and Newell (1991) use the method to screen for clusters of childhood leukemia in northern England. They found no evidence for clustering of leuke
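The window-growing step described for Besag and Newell's method can be sketched as follows. This is a simplified illustration, not ClusterSeer's implementation: the function name `grow_windows`, the record layout, and the use of straight-line distance between region centroids are all assumptions, and the significance calculation that follows in the real method is left out.

```python
import math


def grow_windows(regions, k):
    """For each region, add nearest neighbors until the window holds
    at least k cases; report the member labels and window population.

    regions: list of dicts with keys 'label', 'x', 'y', 'cases',
             'population' (a hypothetical layout for this sketch)
    """
    windows = {}
    for centre in regions:
        # Order all regions by centroid distance from the window centre;
        # the centre itself is at distance zero and comes first.
        ordered = sorted(
            regions,
            key=lambda r: math.hypot(r["x"] - centre["x"], r["y"] - centre["y"]),
        )
        cases = 0
        population = 0
        members = []
        for region in ordered:
            cases += region["cases"]
            population += region["population"]
            members.append(region["label"])
            if cases >= k:
                windows[centre["label"]] = (members, population)
                break
        # If the whole study area holds fewer than k cases,
        # no window is recorded for this centre.
    return windows


example = [
    {"label": "A", "x": 0.0, "y": 0.0, "cases": 1, "population": 100},
    {"label": "B", "x": 1.0, "y": 0.0, "cases": 2, "population": 200},
    {"label": "C", "x": 5.0, "y": 0.0, "cases": 0, "population": 50},
]
windows = grow_windows(example, k=3)
```

A window that needs only a small population to collect k cases is the kind of result the method treats as a candidate cluster.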
14. ClusterSeer will calculate adjusted alpha levels and combined P-values for all tests considered in the Session Log.

Summary statistics
• Original alpha level
• Number of comparisons, method used

Adjustments
• Bonferroni and Sidak adjustments for the entire set of tests
• A table of all tests ordered from smallest to largest P-value, noting the parameter values used in each, the original P-values for each, and the adjusted significance level using the Simes and the Holm's methods
• You should compare the P-value for each test to recommended adjusted significance levels.

Combined P-value
• A combined P-value for all tests performed. You can compare this value to your original alpha level to see if the set of tests shows significant results.

Resources

Troubleshooting

Data import errors

ClusterSeer will not be able to import data that fail to meet its general import requirements or the specific requirements for the method you chose. When this occurs, it will send an error message identifying the line where it first encountered a problem. Check the dataset at that line number and compare the general requirements and the how-to page for your method to find the problem.

References

Anselin, L. 1995. Local indicators of spatial association: LISA. Geographical Analysis 27: 93-115.
Bailey, T. C. and Gatrell, A. C. 1995. Interactive spatial data analysis. Harlow, UK: Longman Scientific & Technical.
Barbu
15. Information from surveys of population size, reported for various years. Within ClusterSeer, census data can be used to estimate population-at-risk size.

centroid: see region centroid.

cluster: An aggregation of disease in space, in time, or in both space and time; often considered the same as a disease outbreak.

contiguity relationship: Continuity, or the state of being so near as to be touching. Within ClusterSeer, two regions are defined as contiguous if they share a common border. See rook and/or queen.

control: A study subject that has not experienced the health-related event under investigation. These subjects are considered to represent all individuals at risk of illness and are used for comparison purposes to uncover factors that may influence risk of disease.

coordinate system: A method for representing spatial location. Within ClusterSeer, spatial information can be represented using any planar projection and geographic coordinates, though geographic coordinates are transformed to UTM for analysis.

data type: Within ClusterSeer, data type refers to the unit of observation in the dataset, whether it describes individuals or groups.

data format: Within ClusterSeer, data format refers to the data import requirements for different types of data.

dataset: The observations used for analysis. The dataset for a particular method may be found in one or several files.

disease frequency: Measurement
16. P-values

ClusterSeer will also provide a combined P-value for all tests performed at one initial alpha level. This is accomplished for Bonferroni and Holm's adjustments.

Bonferroni: $P_c = j \cdot \min_i(P_i)$

Holm's: $P_c = \min_i\left[(j - i + 1)\,P_i\right]$

In this case, $P_c$ denotes the combined P-value for all tests, $P_i$ the value for an individual test, $j$ is the number of comparisons, and $i$ is the sequential index for the individual test considered.

Multiple Comparisons: How to

Multiple comparisons tests are available for any number of analyses performed in one ClusterSeer session that meet the following criteria:
1. The same dataset and significance level, and
2. Using one method from the following list: Besag and Newell's Method, Bithell's Test, Diggle's Method, Levin and Kline's Modified CuSum, Score Test, or Turnbull's Method.

This menu item is unavailable (displayed in gray) when there is an insufficient number of tests to support multiple comparisons.

When you choose Multiple Comparisons from the Analysis menu, ClusterSeer will present you with a list of all tests that meet the above two criteria. Choose the method of interest; ClusterSeer will then calculate the adjustments and combined P-values and display these results in the Session Log. The Multiple Comparisons menu item will be unavailable until you run more tests that meet criteria 1 and 2.

Multiple Comparisons: Results

When you run a Multiple Comparisons analysis
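The adjusted significance levels named in this section can be illustrated with their standard textbook forms. The sketch below is not taken from ClusterSeer: the function name `adjusted_levels` is hypothetical, and the formulas (Bonferroni alpha/j, Sidak 1 - (1 - alpha)^(1/j), Holm alpha/(j - i + 1), Simes i*alpha/j for the i-th smallest P-value) are the usual definitions, assumed here to match what the program reports.

```python
def adjusted_levels(p_values, alpha):
    """Return the Bonferroni level, the Sidak level, and a table of
    per-test Holm and Simes levels, ordered from smallest to largest P.
    """
    j = len(p_values)
    if j == 0:
        raise ValueError("at least one P-value is required")
    bonferroni = alpha / j
    sidak = 1 - (1 - alpha) ** (1 / j)
    table = []
    for i, p in enumerate(sorted(p_values), start=1):
        table.append(
            {
                "p": p,
                "holm": alpha / (j - i + 1),  # sequentially rejective level
                "simes": i * alpha / j,       # Simes level for rank i
            }
        )
    return bonferroni, sidak, table


bonferroni, sidak, table = adjusted_levels([0.04, 0.01, 0.03], alpha=0.05)
```

Each row's P-value would then be compared with its own adjusted level, which is less strict for the larger P-values than the single Bonferroni level.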
17. This approach is used to redistribute disease frequency values among spatial regions in the Local Moran method (Anselin 1995). In each randomization, the disease frequency is held fixed for one spatial region and the remaining values are randomly assigned new locations. Thus, the randomness is conditional: all regions receive randomized frequencies but one. This process is repeated as each region is evaluated in turn.

Multinomial randomization

A multinomial distribution describes the outcomes of independent trials with two or more possible, mutually exclusive outcomes. This approach is used to redistribute cases of disease among spatially or temporally referenced sub-groups (bins) under analysis. Cases are distributed at random among bins, where the probability of a case being placed in a particular bin is proportional to the population-at-risk size in that bin.

The figure below shows a simple example of this process. There are four bins, a, b, c, and d, that have population sizes of 10, 50, 20, and 20. The interval from 0 to 1 is partitioned among them, with each bin getting an interval proportional to its relative size, so 1/10, 1/2, 1/5, and 1/5, respectively. Then, as a random number generator supplies values between 0 and 1, each value falls into a particular bin and counts as a case in that bin.

This randomization technique is used in Besag and Newell's, Bithell's (conditional), Kulldorff's Scan, and Turnbull's methods.

[Figure: the interval from 0 to 1 partitioned among bins a, b, c, and d]
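The four-bin example above can be reproduced with a short Python sketch. It follows the procedure as described (partition the interval from 0 to 1 in proportion to population, then drop random numbers into it), but the function name `multinomial_randomize` and its arguments are hypothetical, and it is not ClusterSeer's own code.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def multinomial_randomize(populations, n_cases, rng=None):
    """Distribute n_cases among bins with probability proportional to
    each bin's population at risk.

    populations: dict mapping bin label -> population size
    rng:         optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    labels = list(populations)
    total = sum(populations.values())
    # Upper edge of each bin's sub-interval of [0, 1), e.g.
    # 10, 50, 20, 20 -> edges near 0.1, 0.6, 0.8, 1.0
    edges = list(accumulate(populations[label] / total for label in labels))
    counts = {label: 0 for label in labels}
    for _ in range(n_cases):
        u = rng.random()  # uniform value in [0, 1)
        # Guard against floating-point rounding in the last edge.
        index = min(bisect_right(edges, u), len(labels) - 1)
        counts[labels[index]] += 1
    return counts


counts = multinomial_randomize(
    {"a": 10, "b": 50, "c": 20, "d": 20}, n_cases=1000, rng=random.Random(1)
)
```

Over many cases, bin b should receive roughly half of them, since its sub-interval covers half of the range from 0 to 1.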
19. Map
Session log

CHAPTER 8: KULLDORFF'S SCAN
Kulldorff's Scan: Statistic (Poisson)
  Test statistic
Kulldorff's Scan: How to
  Kulldorff's Scan: With census file
  Kulldorff's Scan: With population-at-risk data
Kulldorff's Scan: Results
  Distribution
  Map
  Plot
  Session log

CHAPTER 9: LEVIN AND KLINE'S MODIFIED CUSUM
Levin and Kline's Modified CuSum: Statistic
  Test statistic
Levin and Kline's Modified CuSum: How to
  Levin and Kline's Modified CuSum: Single file
  Levin and Kline's Modified CuSum: Two files
Levin and Kline's Modified CuSum: Results
  Distribution
  Plot

CHAPTER 10: LOCAL MORAN TEST
Local Moran: Statistic
  Test statistic
  Significance
Local Moran: How to
  Local Moran: With Sha
20. for the local Moran test. Once you select the file to use, ClusterSeer will prompt you to choose region labels and disease frequencies from column headings in your .dbf file.

[Figure: dialog with two drop-down lists, "Select column that holds region ID" and "Select column that holds disease frequency", and a Cancel button]

Once you have selected the columns, ClusterSeer loads the data. If you cancel at this point, the procedure will cancel.

Contiguity files

These files are used to define neighbor relationships in local Moran. Contiguity files (.gal) indicate whether areas neighbor each other. Future versions of ClusterSeer will accept general weight files (.gwt) to specify more complex relationships, say based on distance rather than contiguity.

Binary contiguity relationships (.gal)

These files indicate whether a region has any neighbors, identifying them if so. These files can be created within and exported from SpaceStat, or created manually in a text editor that can save unformatted ASCII files. The .gal file has the following structure:

total region count
ego label | neighbor count
neighbor label | neighbor label | ...
ego label | neighbor count
...

a. The first row specifies the total region count. ClusterSeer checks for at least one field in that row, and it verifies that the total region count in the first field matches the total number of regions specified in the disease frequency data file. The second row specifies a targe
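The .gal layout described above is simple enough to read with a few lines of Python. The sketch below is an illustration under stated assumptions, not ClusterSeer's importer: the function name `parse_gal` is hypothetical, fields are assumed to be separated by whitespace, labels are assumed to contain no spaces, and a region with zero neighbors is assumed to have no neighbor row (or a blank one).

```python
def parse_gal(text):
    """Parse .gal-style text into a dict: ego label -> neighbor labels."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("file is empty")
    total_regions = int(rows[0][0])  # first field of the first row
    neighbors = {}
    i = 1
    while i < len(rows):
        ego, count = rows[i][0], int(rows[i][1])
        i += 1
        labels = []
        if count > 0:
            if i >= len(rows):
                raise ValueError(f"missing neighbor row for {ego}")
            labels = rows[i]
            i += 1
        if len(labels) != count:
            raise ValueError(f"{ego}: expected {count} neighbors, found {len(labels)}")
        neighbors[ego] = labels
    if len(neighbors) != total_regions:
        raise ValueError("region count in first row does not match the file body")
    return neighbors


sample = "3\nA 2\nB C\nB 1\nA\nC 1\nA\n"
contiguity = parse_gal(sample)
```

The final check mirrors the kind of consistency test the manual describes, where the count in the first row must agree with the number of regions actually listed.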
21. functions. This can occur when all three lines have the same pattern. For instance, when beta is set to zero (the default), models 2-4 have the same result. As all are drawn in the same place, only the one drawn last is visible.

Choose a relative risk model.

Expected disease frequency (optional). This value can be an expected frequency from another region, a national average, or any external value. As a default, ClusterSeer calculates an internal average from the data file, the average disease frequency. The average disease frequency is the total number of cases divided by the total population at risk.

[Figure: Reset button, "Reset to average frequency"]

If you edit the average disease frequency, the caption for the box will change from "average" to "expected" disease frequency. You can reset the value to the average frequency at any time by clicking the reset button next to the box.

Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons.

Choose whether to run a conditional or an unconditional analysis.
• For a conditional test, the Monte Carlo randomizations are based on a multinomial distribution.
• For the unconditional test, the randomiza
22. group level data

Not uniform.

intensity: Determines the expected number of points or cases per unit area for Poisson point process null models.

interquartile distance: The difference between the values for the 25th percentile and the 75th percentile of a distribution. Used in the local Moran method.

When importing data, labels are used to match data imported in separate files. The term can also refer to editable text labels on the axes of histograms and plots.

As used within ClusterSeer and its help, local clustering methods are tests that evaluate clustering by looking at the level of individual cases or regions within the study area. Contrast with global or focused methods.

Monte Carlo randomization (MCR): A computationally intense method that estimates probability values through resampling the data set. MCR involves repeatedly reassigning observations to sample locations in a random way, according to a particular null hypothesis, and recalculating the statistic for the sets of randomized data.

null distribution: A distribution of the test statistic based on the null hypothesis. It can be derived empirically through Monte Carlo randomization or through distribution theory.

null hypothesis: A prediction based on the null spatial model.

null spatial model: Defines the distribution of cases of the disease expected without clustering.

one tailed P value, P value, point data, polygon data, polygon nested, population at risk, queen contigui
23. into time-dependent collections of individuals. See temporal data formats.

Spatio-temporal data

Study units may have associated spatial and temporal information. In order to minimize data repetition, several input files may be required. See formats for both spatial and temporal data.

Data types

ClusterSeer can analyze individual- and group-level data. Different methods are appropriate to different data and analysis types.

Individual Level

The unit of observation and analysis is the individual study subject. Currently, ClusterSeer offers methods for surveillance and spatial cluster analysis of individual-level data. Data can consist of the locations or time references for individuals with (cases) or at risk for (controls) the health outcome under investigation.

Group Level

The unit of analysis is a group of study subjects aggregated within geographic regions and/or temporal intervals. Spatial and spatio-temporal cluster detection can be conducted on group-level data. ClusterSeer also offers two retrospective surveillance methods for temporal and spatial clustering of group-level data, though Rogerson's Spatial Pattern Surveillance method also requires individual-level data. The data often consist of disease frequency estimates or case and population-at-risk counts for each group.

The location of spatially aggregated data may have to be simplified for analysis. In practice, these areas can be represented with a single point loc
24. large or small for the null distribution. This calculation compares the observed value to the upper and the lower tails of the null distribution. Most tests in ClusterSeer explore whether the observed value is unusually large for the distribution, using the upper P value only: $P_{upper} = \frac{NGE + 1}{N_{runs} + 1}$, $P_{lower} = \frac{NLE + 1}{N_{runs} + 1}$, where $N_{runs}$ is the total number of Monte Carlo simulations, NGE is the number of simulations for which the statistic was greater than or equal to the observed statistic, and NLE is the number of simulations for which the statistic was lower than or equal to the observed statistic. One (1) is added to the numerator and denominator because the observed statistic is included in the reference distribution. Types of randomization: Randomization is a broad term used differently in different contexts. Within ClusterSeer, randomization techniques vary between methods. For the multinomial and Poisson distributions, ClusterSeer generates random values by choosing values from the specified distribution. For conditional randomness, data values are reassigned among sub-groups. Table (Randomization Technique, Cluster Detection Method): Conditional randomness; Drawing from a multinomial distribution; Besag and Newell, Bithell conditional, Kulldorff's Scan; Drawing from a Poisson distribution; Alter distances between points by multiplying their locations by a random number: Ripley's K function; Conditional randomness
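The Monte Carlo P value formulas in item 24 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ClusterSeer's own code: the function name `monte_carlo_p_values` and its arguments are hypothetical, and the simulated statistics are assumed to be available as a plain list.

```python
def monte_carlo_p_values(observed, simulated):
    """Return (P_upper, P_lower) for an observed statistic.

    observed  -- the test statistic computed from the real data
    simulated -- statistics from the Monte Carlo runs (hypothetical input)
    """
    n_runs = len(simulated)
    # NGE: simulations with a statistic >= the observed statistic
    nge = sum(1 for s in simulated if s >= observed)
    # NLE: simulations with a statistic <= the observed statistic
    nle = sum(1 for s in simulated if s <= observed)
    # 1 is added to numerator and denominator because the observed
    # statistic is itself part of the reference distribution
    p_upper = (nge + 1) / (n_runs + 1)
    p_lower = (nle + 1) / (n_runs + 1)
    return p_upper, p_lower
```

With 999 simulations, the smallest attainable P value is 1/1000, which is one reason the number of Monte Carlo runs limits the resolution of the test.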
25. level data Populations within the study area are scanned for clusters of cases A circular window is centered on each region in turn and expanded to include neighboring regions until the total aggregated population within the window equals a user defined threshold R These circular windows may overlap and the counts within the windows will not be independent This method will be most powerful when the population size at elevated risk is known a priori otherwise Kulldorff s Spatial Scan is likely to be more robust Examples Turnbull et al 1990 applied this method to examine the distribution of leukemia cases in upstate New York They called the method the cluster evaluation permutation procedure They varied the size of R to see its effect on the analysis Adjusting their results for multiple comparisons they found no significant clusters in the upstate New York leukemia data 114 Turnbull s Method Statistic H The number of cases in the constant population areas follow a Poisson distribution with a common rate but they are not statistically independent as the areas overlap H The number of cases in the constant population areas exceeds that predicted by a Poisson distribution with a common rate Test statistic The test statistic is Mp the maximum number of cases observed among all windows of population size R The circular windows with fixed population sizes are constructed by visiting each location often region centr
26. them all transparent Transparent fill lets information from underlying map layers come through if more than one layer is present 36 Chapter 3 Submitting Data ClusterSeer provides analytic methods for exploring spatial and temporal trends in health data It offers a number of state of the art methods for cluster detection as well as data and results visualization The method you select determines the data types and format required what parameters you need to enter and what output is available to view Data overview ClusterSeer analyzes pattern in spatial and spatio temporal data These methods analyze study subjects such as cases and susceptible individuals as study units described at the individual or group level Spatial data Study units may have associated spatial information expressed as point locations or areas Data on individuals can be fixed to a point location such as a workplace or residence Group level data is often aggregated over a region a wider spatial area such as a township or county This area may be represented as a point often the region s centroid or an area a polygon See spatial data formats Temporal data Study units may have associated temporal information These temporal references can represent either a point in time or an interval of time For individuals time point may indicate the date of diagnosis or symptom onset For groups time intervals may be used to aggregate study subjects
27. value of U is illustrated in black. Map: You can view the map by choosing Map from the View menu. The map consists of two layers. The focus, illustrated with a red X on the map, can be queried for its coordinates (x, y values). If the coordinates were converted to UTM, the query table will report both latitude/longitude and UTM coordinates. If you query one of these points, you'll be able to view its label, coordinates, case count, population at risk count, and distance to the focus. If the data were transformed from geographic coordinates, the scale for distance is the scale you specified on import. Plot: You can view the plot by choosing Plot from the View menu. The cumulative case plot displays the observed and expected cumulative number of cases with increasing distance from the focus. Divergences between observed and expected cases indicate divergence of the data from the null hypothesis. Session log: After ClusterSeer performs a Score analysis, it will place summary information and results into the session log. Parameters and summary statistics: expected disease frequency (if supplied); x and y coordinates of the focus. Focused cluster detection results: the test statistic U; 2 P values, one approximated from a standard normal distribution and the second from the Monte Carlo randomizations. Chapter 14: Turnbull's Method. Turnbull's method detects local spatial clusters in group
28. 0.01. Bithell's Test: How to. Choose Bithell's Linear Risk Score Test from the QuickStat menu, or from the Analysis menu (Spatial and then Focused). 1. In a series of dialogs, ClusterSeer will prompt you to submit the file to analyze. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4. 2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement. 3. ClusterSeer will prompt you to submit the data file. This file should contain group level data with the following columns in the following order. The file is checked for duplicate centroids, and it must follow general ClusterSeer data requirements. 4. If you wish, you may use the Select File button to change your file choices. 5. Enter the x and y coordinates of the focus; the default is the origin (0, 0). Enter the location in the original coordinate system of your data. If your data were converted from geographic coordinates on import, ClusterSeer will expect focus coordinates in geographic coordinates. 6. Enter the relative risk model parameters. If you click on the Visualize button, ClusterSeer will display a plot of the relative risk function models. The points represent relative risk values at various distances from the focus, calculated from the dataset. For some visualizations, you may not see lines for all four relative risk
29. the values of the fitted raised density model (alpha, beta, and rho); maximized likelihood for the fitted model; original likelihood from the initial values; generalized likelihood ratio; P value from comparing the generalized likelihood ratio value to the chi-squared distribution to assess goodness of fit. Chapter 8: Kulldorff's Scan. Kulldorff's Scan method (Kulldorff and Nagarwalla 1995, Kulldorff 1997) can detect local spatial clusters that depend on time in group level data. The scan statistic uses a cylindrical window to identify excesses of cases in space and time. At each spatio-temporal location, the window increases in size in both space and time until it reaches an upper size limit. The scan statistic provides a measure of whether the observed number of cases is unlikely for a window of that size, using reference values from the entire study area. By searching for clusters without specifying their size or location, the method avoids pre-selection bias. Kulldorff (1997) developed two models, a Poisson model and a Bernoulli model. For a small number of points compared to the expectation under the null hypothesis, the two models are similar. The Bernoulli model is best for questions about binary counts (yes/no), while the Poisson model better describes questions about continuous variables where the degree of exposure matters. At this point, ClusterSeer implements the Poisson method. Examples: The scan statistic has been app
30. 5% of the time if the null hypothesis were true. The figure below graphs 1,000 Poisson random numbers (lambda = 3). The thin line illustrates the P = 0.05 alpha level for a one-tailed test. The P value is less than alpha when the test statistic is higher than that cutoff. In that case, it is customary to reject the null hypothesis and accept an alternative hypothesis that there is clustering. [Figure: histogram of a Poisson distribution, lambda = 3; x axis values 0 to 10.] Most ClusterSeer methods are one-tailed, focusing on the upper tail of the distribution. They test whether the test statistic is higher than expected. Two-tailed tests evaluate whether the statistic diverges from a central value, and the alpha level is divided between the two tails of the distribution. Poisson null models: The null hypothesis of a Poisson disease rate is usually a good representation of randomly distributed, non-infectious, rare diseases (Waller and Jacquez 1995). It is used in many cluster detection methods in ClusterSeer, including Besag and Newell's method. A Poisson function can be described by one parameter, lambda, the mean and variance of the distribution. Two Poisson distributions are illustrated below, each with a different lambda value. Within ClusterSeer, lambda is the average or expected case count, calculated from the average or expected disease frequency multiplied by the population at risk
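The relationship between lambda, the one-tailed alpha level, and the cutoff described in item 30 can be illustrated with a short Python sketch. All function names here are hypothetical, and the cutoff is computed from the theoretical Poisson distribution rather than from 1,000 random draws as in the manual's figure.

```python
import math

def expected_case_count(disease_frequency, population_at_risk):
    """Lambda: expected disease frequency times the population at risk."""
    return disease_frequency * population_at_risk

def poisson_pmf(x, lam):
    """Probability of exactly x cases under a Poisson distribution."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def upper_tail_probability(c, lam):
    """One-tailed (upper) probability P(X >= c)."""
    return 1.0 - sum(poisson_pmf(x, lam) for x in range(c))

def upper_critical_value(lam, alpha=0.05):
    """Smallest count c whose upper-tail probability is at most alpha."""
    c = 0
    while upper_tail_probability(c, lam) > alpha:
        c += 1
    return c
```

For lambda = 3 and alpha = 0.05, this sketch gives a cutoff of 7 cases: the probability of 7 or more cases is about 0.034, while the probability of 6 or more is about 0.084.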
31. Alternative hypothesis 16; Alternative spatial model 16; Autocorrelation 89. B: Besag and Newell's Method 49, 51, 52, 53; Bithell's Linear Risk Score Method 57, 58, 60, 63; Bonferroni 90. C: Case data 39; Census data 23, 39, 41 (methods using 75, 83, 101); Change formatting 28, 29, 35, 36; Cluster detection 12, 14, 48; Contiguity 24, 25, 43; Coordinate system 39, 41; Cumulative sum 83, 101. D: Diggle's Method 67, 68, 72 (Likelihood 71; Relative density function)
32. AL CONCEPTS About statistical methods The methods in ClusterSeer evaluate spatial temporal and spatio temporal disease clusters The fundamental question behind all these methods is whether the pattern of the data is clustered All the methods evaluate hypotheses though these hypotheses are better considered exploratory see Limits of cluster detection The hypotheses differ between methods but all the methods can be characterized using the following structure from Waller and Jacquez 1995 The null spatial model defines the distribution of cases of the disease expected without clustering This distribution may be spatial temporal or spatio temporal depending on the method question and data The null hypothesis is a prediction about spatial pattern based on the null spatial model The test statistic summarizes an aspect of the data of biological or epidemiological interest The null distribution of the test statistic can be derived theoretically or empirically through Monte Carlo randomization Example theoretical null distributions include the Poisson null distribution Either way the null distribution reflects the null spatial model The alternative hypothesis is a counter to the null hypothesis a different prediction defined either in the terms of the null spatial model or in terms of additional parameters to define clustering The alternative spatial model can be very basic and somewhat vague not the null spatial mo
33. Bithell's (1995, 1999) linear risk score test is a spatial, focused cluster detection method appropriate for group level data. This test is sensitive to excess risk near a point source exposure (focus), and it considers the spatial relationship of the cases to the focus. The method scores each disease case with a risk score, the logarithm of the relative risk in that region. The test statistic is the sum of these risk scores. The change in relative risk from the focus can be evaluated graphically in plots of the relative risk function (RRF). Because of the linear structure of the statistic T, Bithell calls this type of test a linear risk score (LRS) test. Example: The test was originally presented in a paper evaluating the pattern of childhood leukemia and non-Hodgkin's lymphoma near nuclear plants in the UK (Bithell 1995). Bithell's Test Statistic. H0: The regional case counts are independent variables that follow a Poisson distribution with a mean determined by region-specific relative risks and expected case counts. For an unconditional test, the relative risk is constant across regions and equals 1. The baseline disease frequency used to calculate expected case counts is appropriate for the study area. For a conditional test, the relative risk is assumed to be constant across regions but not necessarily equal to 1. The baseline disease frequency used to calculate expected case counts is not assumed appropriate for the study area
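The scoring step described in item 33 (each case receives the logarithm of the relative risk for its region, and the scores are summed) can be sketched as follows. This is an illustration only: the function name `linear_risk_score` is hypothetical, and the relative risks are assumed to be the values specified under the alternative hypothesis, one per region.

```python
import math

def linear_risk_score(case_counts, relative_risks):
    """Sum of per-case risk scores: T = sum_i x_i * log(relative risk_i).

    case_counts    -- x_i, the case count in each region
    relative_risks -- assumed relative risk for each region (must be > 0)
    """
    if len(case_counts) != len(relative_risks):
        raise ValueError("one relative risk is required per region")
    return sum(x * math.log(rr) for x, rr in zip(case_counts, relative_risks))
```

If every region has a relative risk of 1, each case scores log(1) = 0 and T is 0; regions close to the focus with elevated relative risk contribute positive scores in proportion to their case counts.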
34. Choosing parameters; Change threshold K; Critical value; Risk weight; Batch size; Rogerson's Method How to; Rogerson's Method Results. CHAPTER 13 SCORE TEST: Score Statistic; Test statistic; Score How to; Distribution; Map; Plot; Session log. CHAPTER 14 TURNBULL'S METHOD: Turnbull's Method Statistic; Test statistic; Turnbull's Method How to; Turnbull's Method Results; Distribution; Map; Session log. CHAPTER 15 MULTIPLE COMPARISONS: Multiple Comparisons Statistics
35. CHAPTER 1 OVERVIEW: About cluster detection; What is a cluster; The classic example; Cluster detection methods; CDC guidelines; CDC multistep approach; Limits of cluster detection; Disease risk and relative risk. STATISTICAL CONCEPTS: About statistical methods; Poisson null models; Poisson point processes; Interquartile distance. MONTE CARLO RANDOMIZATIONS: About Monte Carlo randomization; Calculating Monte Carlo P values; Types of randomization; Conditional randomness; Multinomial randomization; Poisson randomization
36. © BioMedware 2012. ClusterSeer: software for the detection and analysis of event clusters. User Manual, book 1, version 2.5. BioMedware. Copyright 2012 BioMedware, Inc. All rights reserved. ClusterSeer and BoundarySeer are trademarks of BioMedware, Inc. Project Leaders: Geoff Jacquez and Leah Estberg. STTR Collaborating Institutions: BioMedware, Inc., the University of Michigan, and the University of Minnesota. Software developers: Leah Estberg, Andrew Long, Eve Do, and Bob Rommel. Manual and help authors: Dunrie Greiling, Leah Estberg, Andrew Long, and Geoff Jacquez. Advisors: Luc Anselin, Arthur Getis, Dan Griffith, Uriel Kitron, Lance Waller, and Mark Wilson. The following individuals provided suggestions and insights that greatly improved the software: Martin Kulldorff, Peter Diggle, Bruce Levin, Peter Rogerson, and graduate students and instructors in the course Spatial Epidemiology offered at the School of Public Health, University of Michigan. This project was supported by STTR grant CA64979 from the National Cancer Institute to BioMedware, Inc. The software and manual contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Cancer Institute. For updated troubleshooting information and FAQs, please visit ClusterSeer online: http://www.biomedware.com/files/documentation/clusterseer/default.htm Table of Contents: PREFACE
37. Statistic. H0: The distribution of disease cases is a spatial Poisson point process, where $L(h) = h$. Ha: The distribution of disease cases is clustered at some scales, $L(h) > h$. Test statistic: Ripley's K function compares the pattern of the data to that produced by a homogeneous Poisson point process, where cases are considered events. The expected number of other cases within a fixed distance $h$ of one case is $\lambda K(h)$, where $\lambda$ is the intensity, or mean number of cases per unit area. $K(h)$ can be estimated by the following formula from Bailey and Gatrell (1995): $\hat{K}(h) = \frac{R}{n^2} \sum_{i=1}^{n} \sum_{j \ne i} \frac{I_h(d_{ij})}{w_{ij}}$, where $R$ is the area of the region of interest, $n$ is the total number of cases in region $R$, $d_{ij}$ is the distance between the $i$th and $j$th cases, and $I_h(d_{ij})$ is the indicator function, which is 1 if $d_{ij} < h$ and 0 otherwise. Essentially, it sums the cases within distance $h$ of each location in the dataset (each $i$). $w_{ij}$ is an edge correction factor, the conditional probability that a case is observed in the region given that it is $d_{ij}$ from the event $i$. Evaluating the K function: To evaluate clustering, Ripley (1981) compares the estimated distribution of $K(h)$ to that consistent with a homogeneous Poisson point process using another function, $L(h) = \sqrt{K(h)/\pi}$. For the null hypothesis, $K(h) = \pi h^2$, and so $L(h) = h$. ClusterSeer compares $K(h)$ for the observed data to that predicted by the null hypothesis by plotting the observed $L(h)$ against $f(h) = h$. If the pattern under st
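The K function estimate in item 37 can be sketched in Python as below. This is a simplified illustration, not ClusterSeer's implementation: it assumes every edge correction factor equals 1 (no edge correction), and the function names `ripley_k` and `ripley_l` are hypothetical.

```python
import math

def ripley_k(points, h, area):
    """Estimate K(h) = (R / n^2) * sum over i, j != i of I_h(d_ij).

    points -- list of (x, y) case locations
    h      -- distance at which K is evaluated
    area   -- R, the area of the region of interest
    Assumption: all edge correction weights w_ij are 1.
    """
    n = len(points)
    count = 0
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            # the indicator is 1 when the two cases are closer than h
            if i != j and math.dist(p, q) < h:
                count += 1
    return area * count / (n * n)

def ripley_l(points, h, area):
    """L(h) = sqrt(K(h) / pi); equals h under the Poisson null model."""
    return math.sqrt(ripley_k(points, h, area) / math.pi)
```

Plotting `ripley_l` over a range of h against the line L(h) = h gives the comparison the text describes: values above the line at a given distance suggest clustering at that scale.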
38. Shapefile import requirements; Contiguity files; Binary contiguity relationships (GAL). CHAPTER 4 DISEASE CLUSTER METHODS: Retrospective surveillance; Spatial clusters; Global spatial methods; Local spatial methods; Focused spatial methods; Space-time clusters; Temporal clusters. CHAPTER 5 BESAG AND NEWELL'S METHOD: Besag and Newell's method Statistics; Test statistics; Notes; Besag and Newell's method; Besag and Newell's method; Besag and Newell's method How to; Besag and Newell Results; Distribution; Map
39. al population size. Score: How to. Choose Score Test of Lawson and Waller from the QuickStat menu, or from the Analysis menu (Spatial and then Focused). 1. In a series of dialogs, ClusterSeer will prompt you to submit the file to analyze. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4. 2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement. 3. ClusterSeer will prompt you to submit the data file. This file should contain group level data with the following columns in the following order: centroid label | centroid x coordinate | centroid y coordinate | case count | population at risk count. The file is checked for duplicate centroids, and it must follow general ClusterSeer data requirements. 4. If you wish, you may use the Select File button to change your file choices. 5. Enter the x and y coordinates of the focus; the default is the origin (0, 0). Enter the location in the original coordinate system of your data. If your data were converted from geographic coordinates on import, ClusterSeer will expect focus coordinates in geographic coordinates. 6. Expected disease frequency (optional): This value can be an expected frequency from another region, a national average, or any external value. As a default, ClusterSeer calculates an internal average from the data
40. an image editor to view and manipulate it 33 Querying maps Querying calls up information about items on the map Q Click on the query tool and then click on the map This brings up a table of information on the nearest feature in the active map layer the highlighted layer The active layer is queried even if it is not currently displayed on the map checked in red To change the active map layer select a new layer in the map layers pane Once you ve queried a layer the queried feature will be recolored orange and its table will pop up This table lists information about the feature For example if you query a point layer you will get the coordinates of the nearest data point and any associated data If you query a circle layer you will get information on the circle with the nearest center point The queried feature will return to its original color when the query table is closed 34 FORMATTING MAPS Formatting maps To format a map layer select it on the map layer pane the selected layer is highlighted PE Then call up the properties dialog by right clicking on the map with the selector and choosing Properties from the pull down menu Because formatting options change with the layer type read up on formatting individual layers e point and e polygon map layers Point layer properties You can choose the size of the points by specifying their radius in pixels You can change the color of the points by c
41. ary statistics e Number of regions analyzed average or user supplied expected disease frequency population radius R and the alpha level you specified for possible adjustment following multiple comparisons Local cluster detection results e A table summarizes the three highest statistics for the given population radius the first second and third most likely clusters o ClusterSeer lists the regions included in each cluster beginning with the centering region and continuing in order of proximity to the center o It also lists the local disease frequency in the cluster the MR o Last is that cluster s P value P values for the second and third most likely clusters come from comparing their test statistics to the reference distribution of the maximum test statistics for the Monte Carlo simulations a more conservative test 118 Chapter 15 Multiple Comparisons If you perform a statistical test multiple times on the same dataset you may need to adjust your significance level to reflect the number of analyses with different parameters When you interpret the significance of a test statistic you compare the probability of that statistic against a pre determined cutoff your alpha level Alpha is the probability of rejecting the null hypothesis when it is true If you run the test repeatedly with slightly different parameters then you increase the likelihood of wrongly rejecting the null hypothesis In essence to compensat
42. ases than if they were not clustered. $r$ is simply the total number of clusters found in the local scale analysis. Notes: Because of the circular shape of the window, this method is less sensitive to directional exposures such as a plume of airborne or waterborne pollutants (Besag and Newell 1991). Waller and Turnbull (1993) show that the significance of $r$ depends on the level of aggregation and the chosen value of $k$. Besag and Newell's method: $l$. $l$ is the number of regions required for the window centered over an individual region to contain $k$ cases. To evaluate whether the $k$ cases form a cluster, the method looks to see whether the number of cases in the window is unlikely for the window's population at risk. The null hypothesis is that there is no clustering, so that a common Poisson disease rate exists across the study area. Thus, the population at risk inside the window should be proportional to the case count; otherwise, the null hypothesis can be rejected. Following Besag and Newell (1991), the null spatial model is that cases are distributed within the study region proportional to population size and with a common disease rate. ClusterSeer calculates a probability for $l$ under the null spatial model: $P(L \le l) = 1 - \sum_{x=0}^{k-1} \frac{e^{-\lambda} \lambda^{x}}{x!}$. This expression calculates the probability that $l$ has reached or exceeded that predicted by the null hypothesis, $L$. It is 1 minus the probability that $l$ is less than $L$, i.e. that there are fewer
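The probability in item 42 is one minus a Poisson cumulative sum over 0 to k - 1 cases, which can be sketched directly. This is an illustration under stated assumptions: the function name `besag_newell_probability` is hypothetical, and lambda is assumed to be the expected case count for the window, that is, the common disease rate multiplied by the window's population at risk.

```python
import math

def besag_newell_probability(k, expected_cases):
    """P(L <= l) = 1 - sum_{x=0}^{k-1} exp(-lambda) * lambda^x / x!

    k              -- the number of cases that defines a cluster
    expected_cases -- lambda, the assumed expected case count in the window
    """
    fewer_than_k = sum(
        math.exp(-expected_cases) * expected_cases ** x / math.factorial(x)
        for x in range(k)
    )
    # probability of observing at least k cases in the window
    return 1.0 - fewer_than_k
```

A small result means that finding k cases in a window with so few expected cases would be unusual under the common-rate null model.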
43. ation, such as the geographic center (centroid), for group level data. About submitting data: ClusterSeer currently requires specific file structures for each method, though we intend to relax this restriction in future versions. For plain text data files, the data for each unit of analysis (individuals or groups) are stored on separate file lines as records. Currently, ClusterSeer expects the record data in a particular order, such as label first, then x coordinate, then y coordinate, then case count, then population at risk count. Required file structures are detailed in the How to section for each method. Most ClusterSeer data files will be expected in plain text format. Shapefiles and SpaceStat sparse ASCII files are used to specify neighbor relationships for local Moran. SpaceStat was developed by Luc Anselin, and it is distributed by BioMedware, Inc. Data formats, general: Spatial, temporal, and other data must follow specific data formats to be read by ClusterSeer. Duplicate spatial locations and/or temporal references should not be submitted for aggregate data, such as regions and associated centroids or temporal intervals. Additionally, all census years submitted as temporal references for population at risk sizes should be unique. Duplicate points in space and time can be submitted to indicate individual subject locations and times of events. Type | Format | Valid range. Case count or Positive numbers can incl
44. Session log. CHAPTER 6 BITHELL'S LINEAR RISK SCORE TEST: Bithell's Test Statistic; Test statistic; Conditional and unconditional tests; Bithell's Test Relative risk functions; Bithell's Test Choosing parameters; Beta, the intercept; Phi, distance decay; Bithell's Test How to; Bithell's Test Results; Distribution; Map; Plot; Session log. CHAPTER 7 DIGGLE'S METHOD: Diggle's Method Statistic; Test statistic; Diggle's raised density model; Diggle's Method Choosing initial parameters; Diggle's Method MLE; Diggle's Method How to
45. Poisson randomization: This Monte Carlo randomization approach redistributes cases of disease among spatially or temporally referenced sub-groups using Poisson random variables. This approach is used in the Score, Bithell unconditional, and CuSum methods. Generating Poisson random variables: This method generates randomized case counts, drawing from Poisson distributions. The shape of the Poisson distribution depends on one parameter, lambda, its mean and variance (see example Poisson distribution below). In this case, lambda is set using the expected case count for that subgroup (region or time period), the product of the population at risk and the average or user-specified baseline risk. [Figure: example Poisson distribution, lambda = 2; x axis: Value.] SPATIAL AND TEMPORAL CONCEPTS. Extrapolation from census data: ClusterSeer can extrapolate population at risk counts from census data. This feature can be used in Kulldorff's Scan, Rogerson's, and CuSum methods. ClusterSeer offers two extrapolation methods, step and linear extrapolation. [Figure: population size against years (1980, 1990, 2000), showing census values with step and linear extrapolation.] Step: The population at risk count is assumed equal to the immediately preceding census count. It will change with the next provided census value. Linear: The population at risk count is estimated assuming a linear change in population between the two nearest census figures. Pop
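The step and linear extrapolation rules in item 45 can be sketched as two small Python functions. This is an illustration, not ClusterSeer's code: the function names and the dictionary input are hypothetical, and the behavior outside the range of census years (holding the nearest census value) is an assumption, since the source text is cut off before describing it.

```python
from bisect import bisect_right

def step_population(census, year):
    """Population equals the immediately preceding census count.

    census -- dict mapping census year to population count (hypothetical input)
    Assumption: before the first census year, the first count is used.
    """
    years = sorted(census)
    idx = max(bisect_right(years, year) - 1, 0)
    return census[years[idx]]

def linear_population(census, year):
    """Population changes linearly between the two nearest census figures.

    Assumption: outside the census range, the nearest count is held constant.
    """
    years = sorted(census)
    if year <= years[0]:
        return float(census[years[0]])
    if year >= years[-1]:
        return float(census[years[-1]])
    hi = bisect_right(years, year)  # first census year after `year`
    y0, y1 = years[hi - 1], years[hi]
    fraction = (year - y0) / (y1 - y0)
    return census[y0] + fraction * (census[y1] - census[y0])
```

For example, with census counts of 1000 in 1980 and 1200 in 1990, the step rule gives 1000 for 1985 while the linear rule gives 1100.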
46. ce from the QuickStat menu, or from the Analysis menu, Surveillance submenu. For this method, you will need to submit three files. Labels must match between all submitted files. All should follow ClusterSeer data import requirements. 1. In a series of dialogs, ClusterSeer will prompt you for the files it requires. If you submitted suitable datasets in the previous analysis, you will jump directly to step 7. 2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement. 3. Submit the coordinate data file with the following structure: region label | centroid x coordinate | centroid y coordinate. The file will be checked for duplicate centroids. 4. ClusterSeer will ask you how it should extrapolate population at risk counts from census data (step or linear). 5. Next, ClusterSeer will prompt you to import the case data file with the following columns in the following order: case label | case event date | region label. 6. Submit census data file with the following structure: region label | census year | population count. The file will be checked for duplicate census years for any one region. 7. If you wish, you may change your file choice using the Select File button. 8. Choose values for H, K, Tau, and n. 9. After you hit OK, ClusterSeer will calculate distances between region centroids. If you hit Stop at this point, the procedure wil
47. choose a distance measurement. ClusterSeer will prompt you to submit the data file. This file should contain individual-level data with the following columns, in the following order: subject label, case-control status. ClusterSeer will check the file for duplicate subject labels and verify that case-control status values are equal to 0 or 1. The file must follow general ClusterSeer data requirements. If you wish, use the Select File button to change your file choice. Enter the x and y coordinates of the focus (the default is the origin, 0, 0). Enter the location in the original coordinate system of your data: if your data were converted from geographic coordinates on import, ClusterSeer will expect focus coordinates in geographic coordinates. Enter the raised density function parameters. If you click on the Visualize button, ClusterSeer will display a plot of the relative density and the raised density models. Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons. Once you hit OK, you can view the results of the analysis. Diggle's Method Results. Plot: You can view the plot by choosing Plot from the View menu. The plot shows the raised density model and the rati
48. cus, d is the distance from x to the focus, and f(d) is a function describing the change in intensity of the process with distance from the focus. Diggle terms f(d) a raised incidence function. To separate this concept from the epidemiological definition of incidence, we will use the phrase raised density model. The null hypothesis is $f(d) = 1$: no change in density of cases with respect to the focus. The alternative hypothesis is a higher relative density of cases near the focus. ClusterSeer offers one raised density function, from Diggle (1990): $f(d) = 1 + \alpha \exp(-\beta d^2)$, where $d^2$ is the squared distance between the location under consideration and the focus. The raised intensity of cases, represented by the value of f(d), decreases away from the focus (see graph). First, parameter estimates are optimized through maximum likelihood estimation, and then the fit of the case data to the model is compared with a generalized likelihood ratio test. Diggle's raised density model: Diggle's method compares the distribution of case locations to that of controls. The method is based on the idea that the distribution of the control locations has no relationship to the focus, so the raised density model below equals 1 (alpha = 0) and is not important for the control locations. ClusterSeer implements one raised density model, graphed below: $f(d) = 1 + \alpha \exp(-\beta d^2)$. [Figure: graph of the raised density model.] ClusterSeer determines the model parameters using maximum likel
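To make the behavior of the raised density model concrete, the short Python sketch below evaluates it at a few distances. It assumes the Diggle (1990) form f(d) = 1 + alpha * exp(-beta * d^2); the function name and the parameter values are illustrative only and are not taken from ClusterSeer.

```python
import math


def raised_density(distance, alpha, beta):
    """Raised density model f(d) = 1 + alpha * exp(-beta * d**2).

    Assumed form (Diggle 1990); `distance` is the distance to the focus.
    """
    return 1.0 + alpha * math.exp(-beta * distance ** 2)


# Illustrative values only: density is highest at the focus and decays to 1.
for d in (0.0, 1.0, 2.0, 5.0):
    print(d, round(raised_density(d, alpha=2.0, beta=0.5), 4))
```

At the focus the function equals 1 + alpha, and with alpha = 0 it is 1 everywhere, which is the null hypothesis.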
49. del, or it can be a more specific model defining a particular model of disease distribution. Probability values (P values) for the observed test statistics can be obtained by comparing them to the null distribution. This comparison gives a quantitative estimate of the probability of the observed value under the null hypothesis. P values: P values, short for probability values, provide an estimate of how unusual the observed values are. The P value of a test statistic can be obtained by comparing the test statistic to its expected distribution under the null hypothesis (the null distribution). The interpretation of a test statistic balances the possibility of two types of errors. Declaring whether a P value is statistically significant involves choosing the level of error with which you are comfortable. Alpha provides the threshold for significance. If the P value for the observed value falls below alpha, then the observation is termed significant.
Term | Symbol or formula | Meaning
type I error | alpha | also called the significance level; the probability of rejecting the null hypothesis when it is true
type II error | beta | the probability of accepting the null hypothesis when it is false
statistical power | 1 - beta | the power of a test indicates its ability to reject the null hypothesis when it is false
P = 0.05 is the traditional alpha level, which can be interpreted to mean that results that are more extreme would occur by chance less than
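Comparing an observed statistic to a null distribution can be sketched as follows for the Monte Carlo case. This is a minimal Python illustration of a one-sided upper P value; the function name is hypothetical, and the "+1" convention (counting the observed value as one member of the reference set) is a common choice assumed here, not something the manual states.

```python
def monte_carlo_upper_p(observed, simulated):
    """One-sided upper P value from a simulated null distribution.

    Assumption: the observed statistic is counted as part of the reference
    set, so the smallest attainable P value is 1 / (len(simulated) + 1).
    """
    at_least_as_extreme = sum(1 for value in simulated if value >= observed)
    return (at_least_as_extreme + 1) / (len(simulated) + 1)


def is_significant(p_value, alpha=0.05):
    """An observation is termed significant when P falls below alpha."""
    return p_value < alpha
```

With this convention, 999 simulations allow P values as small as 0.001.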
50. der in the file is described in the How to sections for individual methods. Depending on the method, the file may contain some or all of these categories: spatial coordinates, temporal information, and case (disease) data. Text file guidelines: Data for a particular method may be contained in one large file or in several files, depending on the method's requirements. For several files, consistent labeling is required to merge the information between files. Each row in the data file should contain one unit of study. This study unit may be individual data, count data, or frequency data. Data associated with that study unit must be in the same row of the text file, delimited by tabs or spaces. Study units (rows) are separated by a carriage return. If the data file has more columns than the method requires, the additional columns will be ignored. The relevant columns need to be the first ones, as you currently cannot choose which columns to import from a text file. If the data file has fewer columns than the method requires, ClusterSeer will report a data import error. ClusterSeer does not require a header for the text file. Shapefile import requirements: This file format consists of three separate, related files, all with the same file name but different file extensions (.shp, .shx, .dbf). Once you tell ClusterSeer where to find the .shp file, it will look in the same directory for the .shx and .dbf files. You may import a shapefile
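The text file rules above (tab or space delimiters, one study unit per row, extra columns ignored, too few columns an error) can be mimicked with a small Python reader, for example to pre-check a file before import. This is a hypothetical helper, not part of ClusterSeer; skipping blank lines is an assumption.

```python
def read_rows(text, required_columns):
    """Split whitespace-delimited rows and keep the first required columns.

    Raises ValueError when a row has fewer columns than required.
    Assumption: blank lines are skipped.
    """
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on tabs and spaces alike
        if not fields:
            continue
        if len(fields) < required_columns:
            raise ValueError(
                f"line {line_number}: expected {required_columns} columns, "
                f"found {len(fields)}"
            )
        rows.append(fields[:required_columns])  # extra columns are ignored
    return rows
```

For a file that needs three columns, a row such as `A 1 2 extra` is read as `["A", "1", "2"]`, while a row such as `A 1` raises an error.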
51. dy of disease clusters. Many of the methods offered in ClusterSeer are very new, developed in the last decade. Cluster statistics offer criteria to determine when observed patterns of disease significantly depart from expected patterns. ClusterSeer includes methods that explore different kinds of clustering: spatial, temporal, and space-time clusters. Many of the methods in ClusterSeer use Monte Carlo randomization techniques to evaluate observed values. These computationally intense methods are more available now that a computer can quickly randomize datasets and perform the calculations. CDC guidelines: The Centers for Disease Control and Prevention (CDC) advocate a multi-step approach for investigating disease clusters (1990). ClusterSeer offers tools for the cluster assessment stage (steps 2a and 2c). CDC multi-step approach: 1. Initial contact and response. An agency is notified of a perceived cluster; it then decides whether further evaluation is necessary. 2. Cluster assessment. a. Preliminary evaluation. This step provides a rough estimate of the probability of the perceived cluster occurring by chance. In this step, determine the geographic area and time to examine, and find a reference population for comparison. Then calculate statistics for the perceived cluster and compare them to the reference population. b. Case evaluation. Verify the case reports are accurate. c. Occurrence evaluation. A more thorough descriptive e
52. e It also surveys concepts in epidemiology, spatial analysis, temporal analysis, and statistics used in ClusterSeer. Chapter 2 provides an overview of how to use ClusterSeer and what tools are available for viewing your data and results. Chapter 3 details how to submit files, and data file and format requirements. Chapter 4 describes the heart of ClusterSeer: cluster detection methods. You may read this section to choose a method, or you can use the Cluster Advisor available within the software. Chapters 5-14 detail individual statistical methods, while Chapter 15 describes the multiple comparisons feature. The manual also has a resources section that includes a glossary, troubleshooting, references, and an index. For easier differentiation of interface and description, this manual will use the following style conventions. Typeface | Meaning: serif type | explanatory text; sans serif type | part of the ClusterSeer interface, such as menu items or dialogs. This information is also available in online help (CSeer Help.chm), accessible from the Help menu and Help buttons on dialogs in ClusterSeer. The online help has hyperlinks that connect related topics. BioMedware also has a ClusterSeer Online page on its website: http://www.biomedware.com/files/documentation/clusterseer/default.htm. Please check this for updates and additional information. Chapter 1: Overview. ClusterSeer offers statistical methods for the analysis of
53. e active layer (see Querying maps), changing the properties (color, size) of elements of the active (highlighted) layer, or removing the active layer from the map. Use the zoom tool to focus on a section of the dataset: move the tool to where you want to zoom and click to zoom in. Use the zoom out tool to enlarge the field of view: move the tool to where you want the enlargement to be centered and click to zoom out. ClusterSeer will not zoom past the spatial extent of the data. The zoom to fit tool returns the visual display to the full spatial extent of the dataset. The pan tool can be used instead of the scrollbars to move the field of view across the map. This tool only works when the map is zoomed in somewhat from the full spatial extent of the data. Click on the button to activate the tool, and then use it to pan the map across the viewing window. For example, to expose a section to the right of the viewing window, drag the map to the left. Finally, the query button is a method for querying the map: clicking a point with this tool brings up a table of information about the nearest map feature in the active layer. Working with maps: ClusterSeer maps are not simply visual displays of data and results; they provide opportunities for querying the underlying data. Maps are created when ClusterSeer performs spatial and spatio-temporal analyses on data referenced to spatial locations. To view the map, choose Ma
54. e analysis. Turnbull's Method Results. Distribution: You can view the Monte Carlo distribution by choosing MC Distribution from the View menu. This histogram shows the reference distribution generated by randomizing the dataset and recalculating $M_R$. The three highest values of $M_R$ are illustrated as thin colored bars. Comparing the observed values to the range of maximum $M_R$ values from the simulations provides one-sided upper P values for each observed value. The second and third highest $M_R$ values are compared with the highest from the simulations, a more conservative test. Map: To view the map, choose Map from the View menu. The map has four layers: region centroid points and the spatial extent of each of the three most likely clusters, each represented with a circular outline. If you query the region centroid points, you'll be able to view the region label, centering region x,y coordinates, case count, and population-at-risk count. If the dataset was originally in geographic coordinates, ClusterSeer will report the coordinates in UTM first, followed by the original geographic coordinates. If you query a cluster layer, you can view the centering region label, local test statistic, P value, a list of included regions, and the local disease frequency within the window. Session log: After ClusterSeer performs a Turnbull analysis, it will place summary information and results into the session log. Parameters and summ
55. e is how well the model fits the data. If model fitting is appropriate to your analysis, then you may wish to choose a range of values for beta and phi and use the visualization button to compare the fit of different values and models to the data itself. To follow the hypothesis testing approach, you need to choose model parameters objectively. Beta, the intercept: Beta influences the intercept (how high the relative risk is at the focus) of models 2-4. Higher values of beta represent higher relative risks: the relative risk, or f(d), equals 1 + beta when distance is zero or close to it. Beta has no influence on the first model, as it has no intercept (relative risk is infinite at the focus). If you did not supply a different value in a previous Bithell analysis, ClusterSeer defaults beta to 0, making the null and the alternative hypotheses equivalent for models 2-4. Phi, distance decay: All relative risk functions subside to 1 far away from the focus. When RRF = 1, the risk at that location is equal to the baseline or average risk; there is no elevation of risk far from the focus. The value of phi controls how quickly the relative risk returns to 1. At higher values of phi, the RRF returns to 1 more slowly. As phi is an exponent in the first model, that model in particular is sensitive to high values of phi. Phi cannot equal 0 for RRF models 2-4. If you did not supply a different value in a previous Bithell analysis, ClusterSeer defaults phi to
56. e previous analysis, you will jump directly to step 5.
2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement.
3. You will need to indicate the temporal scale for the case data: whether the data represent observations on a daily, weekly, monthly, yearly, or some other user-defined basis.
4. You will be asked to indicate whether you wish to specify study period limits (see Temporal data formats).
5. ClusterSeer will prompt you to submit the data files.
a. Submit the coordinate data file with the following structure: region label | centroid x-coordinate | centroid y-coordinate. The file will be checked for duplicate centroids.
b. Submit the case data file with the following structure: region label | temporal interval | case count. This file will be checked for duplicate temporal intervals for any one region.
c. Submit the census data file with the following structure: region label | census year | population count. The file will be checked for duplicate census years for any one region.
If you wish, you may use the Select File button to change your file choices. Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic. After you hit OK, ClusterSeer will establish nearest-neighbor relationships. If you hit Stop at this point, the procedure will cancel. Then ClusterSe
57. e test statistic and 2) whether they overlap higher-ranking clusters (the second will not overlap the first; the third will not overlap the second or the first). Plot: Choose Plot from the View menu. The plot's x-axis is time, in sequence from the beginning to the end of the study period. The axis itself is in units of a time index representing the sequence of time intervals (1 is the first, etc.). The y-axis is the average disease frequency across the regions included in each of the three most likely clusters. The plot has a line representing each most likely cluster. The average disease frequency is calculated for all time intervals included in the study period. The duration of identified clustering is represented with a thick black line. The lines are color coded: red indicates the most likely cluster, green the second, and blue the third. Session log: Once ClusterSeer has performed a Kulldorff's Scan analysis, it writes information on the procedure and results into the session log. Summary information and parameters:
• number of regions, study period span, number of cases, population-at-risk size, average disease frequency
• maximum population radius, maximum temporal span, number of Monte Carlo simulations
Information on each of the three most likely clusters: The second and third most likely clusters are chosen using two criteria: 1) the value of the test statistic and 2) whether they overlap higher-ranking clusters (the second wi
58. e you must lower your threshold for significance. ClusterSeer contains a multiple comparisons feature that allows you to take multiple testing into account when you run any of the following methods:
• Besag and Newell's Method
• Bithell's Test
• Diggle's Method
• Levin and Kline's Modified CuSum
• Score Test
• Turnbull's Method
Multiple Comparisons Statistics: ClusterSeer offers two ways to evaluate your results after multiple testing: a variety of significance level adjustments, and a combined P value for all the tests.
Adjusted significance levels:
Bonferroni: $\alpha_c = \alpha / j$
Sidak: $\alpha_c = 1 - (1 - \alpha)^{1/j}$
Simes: $\alpha_c(i) = i\alpha / j$
Modified Holm's: $\alpha_c(i) = 1 - (1 - \alpha)^{1/(j - i + 1)}$
The Bonferroni adjustment is the classical approach, but it is known to be overly conservative. Recently, other approaches have been developed that are less conservative and have more power for a large number of comparisons (Sarkar and Chang 1997), such as the Sidak (1967), Simes (1986), and Modified Holm's (Holland and Copenhaver 1987) adjustments. These approaches provide you with an adjusted significance level $\alpha_c$ (c for critical level). This new critical level reflects your initial significance level $\alpha$ and the number of comparisons j conducted at that initial significance level. The Simes and Holm's adjustments are performed for each test sequentially, ordered from lowest to highest P value, with i denoting the sequencing index (range 1 to j) for each individual test. Combined
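The adjusted significance levels can be computed with a few one-line functions. The Python sketch below uses the standard textbook forms of the Bonferroni, Sidak, Simes, and Sidak-based (modified) Holm adjustments; these forms are assumed here because the formulas in the source did not survive extraction intact, and the function names are hypothetical.

```python
def bonferroni(alpha, j):
    """Critical level for j comparisons: alpha / j."""
    return alpha / j


def sidak(alpha, j):
    """Critical level for j comparisons: 1 - (1 - alpha)**(1 / j)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / j)


def simes(alpha, j, i):
    """Critical level for the i-th smallest P value (i = 1..j): i * alpha / j."""
    return i * alpha / j


def modified_holm(alpha, j, i):
    """Sidak-based step-down level for the i-th smallest P value."""
    return 1.0 - (1.0 - alpha) ** (1.0 / (j - i + 1))


def sequential_levels(p_values, alpha, adjustment):
    """Pair each P value (sorted ascending) with its sequential critical level."""
    j = len(p_values)
    ordered = sorted(p_values)
    return [(p, adjustment(alpha, j, i)) for i, p in enumerate(ordered, start=1)]
```

For five tests at alpha = 0.05, the Bonferroni level is 0.01 and the Sidak level is slightly larger, which illustrates why Bonferroni is described as more conservative.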
59. ed an analysis. Graphical views reflect the most recent analysis. No record of maps, histograms, and plots from previous analyses will remain; to view them again, you must recreate them. [Table: the session log is always open and records all activities; graphical views are available after an analysis and display the most recent results.] Session log: ClusterSeer records text-based information from your analyses in the memo screen within the main window, the session log. Information recorded includes the name and date last modified of the data files, results from each analysis, and results from multiple comparison adjustments. During data exploration and analysis, you may find it useful to edit or print the text on this page. You may export the log as a plain text file (.txt) for opening in other applications. Editing: You may also add references or notes directly to the session log page by positioning the cursor and typing. Printing: To print the log, select File, then Print from the menu. Click OK when the dialog box appears. Exporting: You can export the log by choosing Save Log from the File menu. ClusterSeer will export the log as a text file (.txt). Instead, you may choose to copy a piece of the log to paste into another application. You can copy sections by selecting them and choosing Copy from the Edit menu. Plots: You can use plots to view and interpret the results of the most recent analysis. After you initiate a new analysis, Clus
60. emes. Biometrika 41: 100-15.
Ripley, B. D. 1976. The second order analysis of stationary point processes. Journal of Applied Probability 13: 255-66.
Ripley, B. D. 1981. Spatial Statistics. New York: John Wiley & Sons.
Robinson, D. and Williamson, J. D. 1974. Cusum charts. The Lancet i: 317.
Rogerson, P. A. 1997. Surveillance systems for monitoring the development of spatial patterns. Statistics in Medicine 16: 2081-2093.
Rothman, K. J. and Greenland, S. 1998. Measures of Disease Frequency & Measures of Effect and Measures of Association. In Modern Epidemiology. Philadelphia: Lippincott-Raven, pp. 29-64.
Sarkar, S. K. and Chang, C. K. 1997. The Simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association 92: 1601-8.
Schulte, P. A., Ehrenberg, R. L., and Singal, M. 1987. Investigation of occupational cancer clusters: theory and practice. American Journal of Public Health 77: 52-6.
Simes, R. J. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751-4.
Snow, J. 1855. On the Mode of Communication of Cholera. London: John Churchill.
Sokal, R. R., Oden, N. L., and Thomson, B. A. 1988. Local spatial autocorrelation in a biological model. Geographical Analysis 30: 331-354.
Tango, T. 1995. A class of tests for detecting general and focused clustering of rare diseases. Statistics in Medicine 14: 2323-2334.
Turnb
61. en Two options are available, rook and queen; their names come from the movements of chess pieces. The rook can only move to squares that share a border of some length with its current square. In the figure below, the rook, illustrated as the gray circle, can only move to the four black squares. The queen can move to any square that shares even a point-length border. So she can move to the rook's squares and to any square that shares a corner (one vertex) with her current square. If the gray circle illustrated the queen's position, the queen could move to any of the eight adjacent squares. Thus, rook is a more stringent definition of polygon contiguity than queen: for rook, the shared border must be of some length, whereas for queen, the shared border can be as small as one point. Chapter 2: Working in ClusterSeer. ClusterSeer workflow is organized around the methods themselves. The general framework is the same for all methods: you specify a method, you supply data, ClusterSeer performs an analysis, and then you may view the results of the analysis. When you open ClusterSeer, a session log is opened at the same time. It will serve as a text-based view for reporting results of all analyses in a single ClusterSeer session. As you perform new analyses, information on them is appended to the existing log. Graphical views can help visualize the results of an analysis, and so they are only available once you have imported data and perform
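The rook and queen definitions are easiest to see on a regular grid of squares. The Python sketch below lists the neighbors of a grid cell under each rule; it is an illustration on a hypothetical grid only, since real polygon contiguity is determined from shared borders and vertices rather than row and column offsets.

```python
# Offsets to cells that share an edge (rook) or an edge or corner (queen).
ROOK_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
QUEEN_OFFSETS = ROOK_OFFSETS + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def grid_neighbors(row, col, n_rows, n_cols, rule="rook"):
    """Return the (row, col) neighbors of a cell under rook or queen contiguity."""
    if rule == "rook":
        offsets = ROOK_OFFSETS
    elif rule == "queen":
        offsets = QUEEN_OFFSETS
    else:
        raise ValueError("rule must be 'rook' or 'queen'")
    neighbors = []
    for d_row, d_col in offsets:
        r, c = row + d_row, col + d_col
        if 0 <= r < n_rows and 0 <= c < n_cols:  # stay inside the grid
            neighbors.append((r, c))
    return neighbors
```

The center cell of a 3 by 3 grid has four rook neighbors and eight queen neighbors, matching the four black squares and eight adjacent squares described in the text.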
62. ent Spatial clusters: These cluster detection methods evaluate whether cases of a disease tend to aggregate in particular locations. Besag and Newell (1991) classified cluster detection methods into general and focused tests. We further subdivide general methods into local and global categories.
• General methods explore clustering without pre-determined hypotheses about cluster location.
  o Global methods detect clustering throughout the study area, regardless of the specific locations or spatial extent of the clusters.
  o Local methods detect clustering limited to geographically restricted areas within the study.
• Focused methods detect clustering around a specific location, such as a point source exposure to a proposed risk factor.
Global spatial methods: Global cluster detection methods are used to investigate the presence of spatial patterns anywhere within the study area. They attempt to answer the question, "Are there any unusual disease patterns?" These tests focus on whether clustering exists or not, regardless of location or scope. Essentially, each method evaluates whether a spatial pattern exists in the data that is unlikely to have arisen by chance. The null hypothesis for these methods is simply "no clustering exists". Global cluster methods available in ClusterSeer: individual-level data: Ripley's K function; group-level data: Besag and Newell's Method. For retrospective surveillance of spatial data, use Rogerson's Method
63. er will run the Monte Carlo simulations. You may stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit. Kulldorff's Scan: With population-at-risk data. Choose Kulldorff's Scan Method from the QuickStat menu or from the Analysis menu, Spatiotemporal submenu. This analysis requires 2 files: 1) a spatial data file and 2) a case and population-at-risk count data file. All files will be checked for duplicates and should follow ClusterSeer general data requirements. Labels must match between all submitted files. 1. In a series of dialogs, ClusterSeer will prompt you for information about your data and ask which files to use. If you submitted suitable datasets in the previous analysis, you will jump directly to step 5. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement. You will need to indicate the temporal scale for the case data: whether the data represent observations on a daily, weekly, monthly, yearly, or some other user-defined basis. ClusterSeer will prompt you to submit the data files. a. Submit the coordinate data file with the following structure: region label | centroid x-coordinate | centroid y-coordinate. The file will be checked for duplicate centroids.
64. erSeer reports the spatio-temporal location and the extent of the cluster that caused the rejection. Likelihood ratio: The likelihood ratio is
$$\frac{L(Z)}{L_0} = \begin{cases} \dfrac{\left(\frac{n_Z}{\mu(Z)}\right)^{n_Z} \left(\frac{N - n_Z}{\mu(A) - \mu(Z)}\right)^{N - n_Z}}{\left(\frac{N}{\mu(A)}\right)^{N}} & \text{if } n_Z > \mu(Z) \\ 1 & \text{otherwise} \end{cases}$$
where $n_Z$ is the observed number of cases and $\mu(Z)$ is the expected number of cases in cylinder Z. The observed ($N$) and expected ($\mu(A)$) numbers of cases are calculated over the entire study area, across all time periods. Kulldorff's Scan: How to. You can perform a Kulldorff's Scan in one of two ways: submitting population-at-risk counts directly with case counts, or extrapolating population-at-risk counts from census data. If you have data on the population at risk, you will need to import two files. If you intend to extrapolate population-at-risk counts from census data, you will need to import three separate files. Kulldorff's Scan: With census file. Choose Kulldorff's Scan Method from the QuickStat menu or from the Analysis menu, Spatiotemporal submenu. This analysis requires 3 files: 1) a spatial data file, 2) a case data file, and 3) a census file from which to estimate population-at-risk counts. All files will be checked for duplicates and should follow ClusterSeer general data requirements. Labels must match between all submitted files. 1. In a series of dialogs, ClusterSeer will prompt you for information about your data and ask which files to use. If you submitted suitable datasets in th
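A scan likelihood ratio of this kind is usually evaluated on the log scale to avoid overflow. The Python sketch below assumes the Poisson form of Kulldorff's ratio, with observed and expected counts inside the cylinder (n_z, mu_z) and over the whole study area (n_total, mu_total); the function and argument names are hypothetical, and the formula is the standard published form rather than one confirmed from ClusterSeer's code.

```python
import math


def log_likelihood_ratio(n_z, mu_z, n_total, mu_total):
    """Log of the assumed Poisson scan likelihood ratio for one cylinder.

    Returns 0.0 (a ratio of 1) unless the cylinder holds more cases
    than expected.
    """
    if n_z <= mu_z:
        return 0.0
    inside = n_z * math.log(n_z / mu_z)
    outside_cases = n_total - n_z
    outside_expected = mu_total - mu_z
    if outside_cases > 0:
        outside = outside_cases * math.log(outside_cases / outside_expected)
    else:
        outside = 0.0  # limit of x * log(x) as x approaches 0
    baseline = n_total * math.log(n_total / mu_total)
    return inside + outside - baseline
```

The most likely cluster is the cylinder with the largest ratio; its significance is then judged against the maxima obtained from the Monte Carlo runs.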
65. erved and expected proportions. Regions with more cases than expected form part of the cluster that signaled the alarm.
• The case observations in the table, identified by their order of occurrence.
• The census year used to estimate population-at-risk sizes.
Chapter 13: Score Test. The Score test detects focused spatial clusters in group-level data. It was developed independently by Lawson (1989) and Waller et al. (1992). The Score test evaluates the pattern of disease frequency around a point focus. The null hypothesis is no clustering relative to the focus. Each region is scored for the difference between observed and expected disease counts, weighted by degree of exposure to the focus. ClusterSeer estimates exposure strength using the inverse of distance to the focus (1/d). Example: Waller et al. (1992) examined the rate of leukemia near 12 hazardous waste sites in upstate New York. The Score test found some of the foci to be associated with high leukemia risk. The significant foci found by the Score test include, but are not limited to, areas identified by other tests of the same data. Score Statistic. Null hypothesis (H0): Observed numbers of cases in each region are independent Poisson random variables with a common disease frequency. Alternative hypothesis (HA): Observed numbers of cases in each region are independent Poisson random variables, where the disease frequency is a proportionally increasing function of exposure. Test statistic: The te
67. file, the average disease frequency. The average disease frequency is the total number of cases divided by the total population at risk. If you edit the average disease frequency, the caption for the box will change from average to expected disease frequency. You can reset the value to the average frequency at any time by clicking the reset button (Reset to average frequency) next to the box. Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons. Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic. Once you hit OK, you can stop the analysis at any time using the Stop button on the progress bar. The Stop button will halt the analysis, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit. Then you can view the results of the analysis. Score Results. Distribution: You can view the Monte Carlo distribution by choosing MC Distribution from the View menu. This histogram shows the reference distribution generated by randomizing the dataset and recalculating the test statistic. The observed
68. for a number of statistics (Anselin 1995). ClusterSeer implements the LISA for Moran's I. The sum of LISAs for all observations is proportional to Moran's I, an indicator of global pattern. Thus, there can be two interpretations of LISA statistics: as indicators of local spatial clusters and as a diagnostic for outliers in global spatial patterns. Local Moran Statistic. Null hypothesis (H0): There is no association between the disease frequency observed at a location and disease frequencies observed at nearby sites (values of $I_i$ are close to zero). Alternative hypothesis (HA): Nearby sites have either similar or dissimilar disease frequencies ($I_i$ is large and either positive or negative). Test statistic: Spatial association can be evaluated by comparing matrices of similarity, where one matrix expresses spatial similarity (for example, a contiguity or spatial weights matrix) and the other expresses similarity of disease frequency values. Anselin (1995) defines a local Moran statistic for an observation i: $I_i = z_i \sum_j w_{ij} z_j$. The local Moran statistic is based on the gamma index, a general index of matrix association. In this equation, $z_i$ is the difference between the disease frequency in area i and the mean disease frequency, and $w_{ij}$ is a weight denoting the strength of connection between areas i and j, developed from neighbor information. This weight ensures that only neighboring values of $z_j$ are considered in the statistic, and weights are standardized to adjust for the number of ne
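A local Moran statistic of the form I_i = z_i * sum_j(w_ij * z_j) can be computed directly from a list of disease frequencies and a neighbor list. The Python sketch below assumes row-standardized weights (each of a region's k neighbors gets weight 1/k) and does not scale by the variance of z, as some formulations do; the function name and the input layout are hypothetical.

```python
def local_moran(frequencies, neighbors):
    """Local Moran values I_i = z_i * sum_j(w_ij * z_j).

    frequencies: disease frequency per region.
    neighbors: neighbors[i] is the list of region indices adjacent to i.
    Assumption: weights are row-standardized; regions with no neighbors get 0.
    """
    mean = sum(frequencies) / len(frequencies)
    deviations = [value - mean for value in frequencies]
    results = []
    for i, linked in enumerate(neighbors):
        if linked:
            weight = 1.0 / len(linked)
            lag = sum(weight * deviations[j] for j in linked)
        else:
            lag = 0.0
        results.append(deviations[i] * lag)
    return results


# Four regions in a chain: two low-frequency regions next to two high ones.
values = local_moran([1.0, 1.0, 3.0, 3.0], [[1], [0, 2], [1, 3], [2]])
print(values)
```

The end regions sit next to a similar neighbor and get positive values, while the two middle regions, whose neighbors average out to the mean, get values of zero.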
69. g the one-sided P value. Map: You can view the map by choosing Map from the View menu. The map has two layers: region centroid points and a cluster layer illustrating the spatial extent of each cluster. If you query a region centroid, you'll be able to view its label, centroid coordinates, case count, and population-at-risk count. If you query a cluster in the cluster layer, you can view the center area label, center x,y coordinates, local test statistic, P value, local disease frequency, and a list of included regions ordered by distance from the center. Session log: After ClusterSeer performs a Besag and Newell analysis, it will place summary information and results into the session log.
Summary statistics and parameters:
• Total number of regions, cases, and the population-at-risk size
• Disease frequency (average or expected)
• Significance level (alpha)
• Cluster size to detect
Power: A report on whether there was adequate power to find clusters of size k in all regions.
Local results, a table listing individually significant clusters:
• The region label
• The local disease frequency
• The test statistic
• One-sided P value for each cluster
Global results:
• The total number (r) of individually significant clusters of k
• Expected R under the null hypothesis
• P value for r
List of regions without statistical power, if any.
Chapter 6: Bithell's Linear Risk Score Test
70. health data. Using ClusterSeer will draw on your understanding of concepts in epidemiology, spatial analysis, temporal analysis, and statistics. About cluster detection. What is a cluster? A cluster is an aggregation of disease in space, in time, or in both space and time. Cases of a disease can be referenced to a specific location, such as a residence, and time, such as the date of diagnosis. Disease clusters occur when more cases are identified at a particular place and/or time than would otherwise be expected. The study of disease clusters may suggest possible factors and exposures influencing risk for a disease. More likely, cluster identification will provide incentive to undertake a comprehensive epidemiological study. The classic example: Dr. John Snow's study of the 1854 London cholera outbreak is an historic example of a cluster analysis that suggested an effective intervention. In brief, the outbreak of cholera was detected by Dr. Snow even before the bacterium that causes cholera had been identified. He mapped mortality and found that most deaths occurred near the Broad Street Pump. Once the handle of the pump was removed, the outbreak subsided. Cluster detection methods: Since the time of the London cholera outbreaks, more sophisticated statistical analyses have been developed to detect clustering. Advances in computer databases, Geographic Information Systems, and statistical techniques have augmented our toolbox for the stu
71. ighbors. The local Moran statistic $I_i$ will be positive when values at neighboring locations are similar and negative if they are dissimilar. ClusterSeer uses significance values (below), z-scores, and interquartile distance to find extreme local Moran values.
Significance
Statistics tend to be correlated among neighboring locations. Following Anselin (1995), ClusterSeer uses both Bonferroni and Sidak adjustments to correct the alpha level when several locations are considered simultaneously. This technique adjusts the alpha level for significance for the average number of neighbors, n.
Bonferroni adjustment: $\alpha' = \alpha / n$
Sidak adjustment: $\alpha' = 1 - (1 - \alpha)^{1/n}$
The significance of single $I_i$ values can be evaluated with Monte Carlo randomization using conditional randomness. Their significance can also be evaluated analytically by comparing the observed value to a normal approximation for the distribution of expected values under the null hypothesis (Anselin 1995). This second method depends on the assumption that the statistic converges to a normal random variable, an assumption that has not been demonstrated.
Local Moran: How to
ClusterSeer requires information on disease frequencies and neighbor relationships to run a local Moran test. You can submit this data in one of two ways: through submitting a shapefile, or through submitting a disease frequency file and an associated contiguity file.
Local Moran: With Shapefile
Choose Loca
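The Bonferroni and Sidak adjustments named in the local Moran Significance section can be sketched in a few lines. This is only an illustration of the two textbook formulas, not ClusterSeer's own code; the function names are hypothetical, and n stands for the average number of neighbors.

```python
def bonferroni_alpha(alpha: float, n: float) -> float:
    """Bonferroni-adjusted level: alpha divided by the number of comparisons n."""
    return alpha / n


def sidak_alpha(alpha: float, n: float) -> float:
    """Sidak-adjusted level: 1 - (1 - alpha)^(1/n)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)


if __name__ == "__main__":
    # Example: initial alpha of 0.05 and an average of 5 neighbors per region
    # (both values are made up for illustration).
    print(bonferroni_alpha(0.05, 5))
    print(sidak_alpha(0.05, 5))
```

For the same alpha and n, the Sidak level is slightly larger than the Bonferroni level, so it is marginally less conservative.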
72. ight for cases within the gray circle is 0.5. The case count in that area is divided by 0.5, essentially doubling the cases to account for the missing half of the circle.
Ripley's K function: How to
Choose Ripley's K function from the QuickStat menu, or from the Analysis menu choose Spatial and then Global.
1. In a series of dialogs, ClusterSeer will prompt you to submit the file to analyze. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4.
2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement.
3. ClusterSeer will prompt you to submit the case data file. This file should contain individual-level data with the following columns in the following order. ClusterSeer will check the file for duplicate subject labels, and the file must follow general ClusterSeer data requirements. If you wish, you may change your file choice using the Select File button.
4. Choose a distance, h. This sets the spatial extent of the clusters you will find.
• A good rule of thumb is to make h small compared to the scale of the study area.
• ClusterSeer defaults to 1/4 of the maximum interpoint distance, unless you supplied a different value in the previous analysis.
5. Choose the number of distance steps. ClusterSeer calculates the K function over a range of distances up to h you specify
73. ihood estimation, beginning with initial values you specify.
Diggle's Method: Choosing initial parameters
The parameters for the raised density model are determined through maximum likelihood estimation, beginning from parameters you specify.
$f(d) = 1 + \alpha \exp(-\beta d^2)$
alpha, the intercept: Alpha determines the height of the cone, the raised density of cases at the focus. Higher values of alpha represent higher concentration at the focus. The initial default value for alpha is 0, a value that equates the alternative and null hypotheses.
beta, distance decay: The raised density model subsides to 1 far away from the focus. The value of beta controls how quickly the raised density returns to 1. At higher values of beta, the raised density subsides more quickly. Beta must be greater than zero, and its initial default value is 1.
Within one session, subsequent analyses will retain previously fitted alpha and beta values as the defaults.
Diggle's Method: GLRT
The crux of Diggle's method is to compare two spatial models for case locations: one with no relationship to the focus (the null hypothesis) and one where the pattern of the disease depends on the focus. Diggle and Rowlingson (1994) compare the two models using a generalized log likelihood test (GLRT). Essentially, the test evaluates which model better explains the data. The generalized log likelihood test is
$D = 2[L(\hat{\rho}, \hat{\theta}) - L_0(\hat{\rho}_0)]$
Where $L(\hat{\rho}, \hat{\theta})$ is the log likel
74. ihood of the alternative hypothesis and $L_0(\hat{\rho}_0)$ is the log likelihood for the null hypothesis, below.
$L(\rho, \theta) = \sum_{i=1}^{n} \log p(x_i) + \sum_{i=n+1}^{n+m} \log[1 - p(x_i)]$
$L_0(\rho) = n \log \rho - (n + m) \log(1 + \rho)$
The case and control subject locations represent the complete set of locations under study, x. In the above equations, the p(x) functions describe the probability that location i is the location of a case subject:
$p(x_i) = \rho f(d_i) / [1 + \rho f(d_i)]$
The significance of D is obtained with reference to the chi-squared distribution with 2 degrees of freedom.
Diggle's Method: MLE
The parameters for the raised density model are optimized through maximum likelihood estimation (MLE), a general statistical method for estimating parameters. In this case, the process involves maximizing the log likelihood function for rho. Rho is maximized when the raised density model is 1 (i.e., when the density is not elevated, or when the null hypothesis is true) at
$\hat{\rho}_0 = n / m$
Where n = cases and m = controls (Diggle and Rowlingson 1994).
Diggle's Method: How to
Choose Diggle's Method from the QuickStat menu or from the Analysis menu (Analysis > Spatial > Focused).
1. In a series of dialogs, ClusterSeer will prompt you to submit the file. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to
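The GLRT for Diggle's method can be illustrated with a short sketch. It assumes the raised density takes the squared-distance form of Diggle (1990), f(d) = 1 + alpha * exp(-beta * d^2), that p(x) = rho * f(d) / (1 + rho * f(d)), and that rho, alpha, and beta have already been fitted elsewhere (the optimizer itself is not shown). The function names are hypothetical. With 2 degrees of freedom, the chi-squared upper-tail probability has the closed form exp(-D / 2), so no statistics library is needed.

```python
import math


def raised_density(d, alpha, beta):
    # Assumed form: f(d) = 1 + alpha * exp(-beta * d^2); equals 1 when alpha = 0.
    return 1.0 + alpha * math.exp(-beta * d * d)


def case_probability(d, rho, alpha, beta):
    # p(x) = rho * f(d) / (1 + rho * f(d))
    rf = rho * raised_density(d, alpha, beta)
    return rf / (1.0 + rf)


def loglik_alternative(case_d, control_d, rho, alpha, beta):
    # Sum of log p over cases plus sum of log(1 - p) over controls.
    ll = sum(math.log(case_probability(d, rho, alpha, beta)) for d in case_d)
    ll += sum(math.log(1.0 - case_probability(d, rho, alpha, beta)) for d in control_d)
    return ll


def loglik_null(n, m):
    # Null model: f(d) = 1 everywhere, so rho is estimated as n / m.
    rho0 = n / m
    return n * math.log(rho0) - (n + m) * math.log(1.0 + rho0)


def glrt(case_d, control_d, rho, alpha, beta):
    """Return (D, P) where P is the chi-squared upper tail with 2 df."""
    n, m = len(case_d), len(control_d)
    d_stat = 2.0 * (loglik_alternative(case_d, control_d, rho, alpha, beta) - loglik_null(n, m))
    p_value = min(1.0, math.exp(-max(d_stat, 0.0) / 2.0))
    return d_stat, p_value


if __name__ == "__main__":
    # Distances to the focus (made-up values) for 3 cases and 6 controls.
    cases = [0.2, 0.5, 1.0]
    controls = [0.8, 1.5, 2.0, 2.5, 3.0, 4.0]
    print(glrt(cases, controls, rho=0.3, alpha=2.0, beta=1.0))
```

Setting alpha to 0 and rho to n / m reproduces the null model, so D is 0 and P is 1, which is a quick sanity check on the two log likelihoods.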
75. ing the number of disease events by the number of subjects at risk in a specified time interval. Yet drawing individual-level conclusions about risk from group-level data has its limits (Morgenstern 1998). Relative risk (RR) is often estimated for a sub-group of study subjects as the ratio of that group's average risk to a baseline measure of disease risk. In those cases when an appropriate referent group cannot be identified, either the average risk over the entire set of study subjects or a national average may be used as the baseline risk for comparison. Some of the spatial methods require an understanding of risk, or relative risk, as a function of space. Suppose that exposure to a point source (focus) elevated the risk for a particular type of disease, and distance to the point source served as a proxy estimate of the amount of exposure experienced. We could create a function by which degree of exposure would be estimated according to distance from the focus (the postulated degree of exposure). The RR could peak at the point source and decline with increasing distance. It may be difficult to anticipate the appropriate model form, and the fit of the final model to the actual data should be considered. However, please note: using the observed spatial disease pattern to estimate the risk or RR function is circular and invalidates statistical inference. A priori knowledge should contribute to the specification of the function parameters. STATISTIC
76. ively, the data may describe spatial locations for individual data.
The duration of the dataset, the length of the study you wish to analyze.
The focus of study. The study unit can be individuals, either cases or susceptibles, or it can be groups, individuals aggregated within regions or time intervals.
Individuals who could contract the studied disease. These individuals may be included in an analysis as the population at risk or controls.
A value summarizing an aspect of the data.
A P-value obtained by comparing the test statistic to the end of the reference distribution where the statistic's values are highest. Most ClusterSeer methods are one-tailed, focusing on the upper tail. They test for clustering, for where test statistics will be higher than expected.
A value used to alter the influence of another variable. Within ClusterSeer, weights are used for edge correction in Ripley's K function, to specify neighbor relationships for Local Moran, and to include distance from a focus in Lawson and Waller's Score or between neighboring regions in Rogerson's Spatial Pattern Statistic.
z-score: A method of standardization that involves subtracting the expected value (i.e., mean) and dividing by the standard deviation. Z-scores can be interpreted as the number of standard deviation units from the expected value.
77. ividual-level data. It is used to monitor changes in spatial pattern for observations processed sequentially. Essentially, it can be used to determine when a disease shows spatial clustering.
Examples
The method has been used to look at patterns of Burkitt's lymphoma in Uganda (Rogerson 1997). Rogerson reanalyzed data from a previous study (Williams et al. 1978) of cases from 1961-1975. His analysis confirms that spatial clustering in Burkitt's lymphoma did exist in specific time intervals.
Rogerson's Method: Statistic
H0: The number of cases in each area is a Poisson random variable with an expected value equal to the population at risk multiplied by the average disease frequency.
Ha: The number of cases in some regions exceeds the expected value.
Test statistic
Rogerson (1997) developed a cumulative sum approach to Tango's clustering statistic for surveillance. Tango's statistic itself cannot be recalculated after each time period because of the problem of multiple testing.
Modified Tango statistic
This method uses a modified Tango statistic (Tango 1995):
$C_G = (r - p)^T A (r - p)$
Where r is the vector of observed proportions of cases in regions 1, ..., m, and p is the vector of the expected proportions. A is a matrix of the scaled distances of all areas from each other:
$a_{ij} = \exp(-d_{ij} / \tau)$
Where $d_{ij}$ is the distance between area i and j, scaled by tau. To detect larger clusters, choose larger values of tau.
Cumulati
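The modified Tango statistic described for Rogerson's method is a quadratic form, and a small sketch may make its pieces concrete. It assumes the statistic is (r - p)' A (r - p) with a_ij = exp(-d_ij / tau), which is how the extracted equations read; the function name and the toy inputs are hypothetical, and the cumulative sum step built on top of it is not shown.

```python
import math


def tango_statistic(r, p, dist, tau):
    """Quadratic form (r - p)' A (r - p), with A[i][j] = exp(-dist[i][j] / tau).

    r    -- observed proportions of cases per region
    p    -- expected proportions per region
    dist -- square matrix of distances between regions
    tau  -- scale parameter; larger values weight distant regions more
    """
    m = len(r)
    diff = [r[i] - p[i] for i in range(m)]
    total = 0.0
    for i in range(m):
        for j in range(m):
            a_ij = math.exp(-dist[i][j] / tau)
            total += diff[i] * a_ij * diff[j]
    return total


if __name__ == "__main__":
    # Two regions one distance unit apart (made-up numbers).
    observed = [0.7, 0.3]
    expected = [0.5, 0.5]
    distances = [[0.0, 1.0], [1.0, 0.0]]
    print(tango_statistic(observed, expected, distances, tau=1.0))
```

When the observed proportions match the expected ones, the statistic is 0; excesses in regions that lie close together reinforce each other through the off-diagonal weights.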
78. jani, G. and Calzolari, E. 1984. Comparison of two statistical techniques for the surveillance of birth defects through a Monte Carlo simulation. Statistics in Medicine 3: 239-47.
Bender, A. P., Williams, A. N., Johnson, R. A., and Jagger, H. G. 1990. Appropriate public health responses to clusters: the art of being responsibly responsive. American Journal of Epidemiology 132: S48-S52.
Besag, J. and Newell, J. 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society Series A 154: 143-155.
Bithell, J. F. 1995. The choice of test for detecting raised disease risk near a point source. Statistics in Medicine 14: 2309-2322.
Bithell, J. F. 1999. Disease mapping using the relative risk function estimated from areal data. In Disease mapping and risk assessment for public health, A. B. Lawson, A. Biggeri, D. Bohning, E. Lesaffre, J. F. Viel, and R. Bertollini, eds. New York: John Wiley & Sons, pp. 247-55.
Bithell, J. F., Dutton, S. J., Draper, N. M., and Neary, N. M. 1994. Distribution of childhood leukemias and non-Hodgkin's lymphomas near nuclear installations in England and Wales. British Medical Journal 309: 501-505.
Caldwell, G. G. 1990. Twenty-two years of cancer cluster investigations at the Centers for Disease Control. American Journal of Epidemiology 132: S43-47.
Centers for Disease Control. 1990. Guidelines for investigating clusters of health events. Morbidity and Mortality Weekly Report 39: 1-16.
79. l Moran Test from the QuickStat menu, or from Analysis choose Spatial, then Local.
1. In a series of dialogs, ClusterSeer will prompt you for the shapefile. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4.
2. You will need to specify which data columns to analyze and how ClusterSeer should evaluate neighbor relationships.
3. Once you have provided information about your file, ClusterSeer will obtain neighbor information from the shapefile. This will take a short while. If you cancel at this point, the procedure will stop.
4. If you wish, use the Select File button to change your file choice.
5. Set the initial alpha level. ClusterSeer will correct this level using the Bonferroni and Sidak adjustments that compensate for the average number of neighboring regions found in the dataset.
6. Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic.
7. After you hit OK, ClusterSeer will establish nearest neighbor relationships. If you hit Stop at this point, the procedure will cancel. Then ClusterSeer will run the Monte Carlo simulations. You may stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit.
Local Moran: With two files
80. l cancel. Then you can view the results of the analysis.
Rogerson's Method: Results
Map
To see the map, select Map from the View menu. The map displays the region centroid points. If you query a region centroid, you will see that point's label, case count, and population-at-risk count.
Plot
To see the plot, select Plot from the View menu. The plot has two features: the series of cumulative sum values, shown as black points connected by a line, and the alarm threshold, illustrated in red. If the cumulative sum exceeds the alarm threshold, an alarm will be recorded in the session log.
Session log
Once ClusterSeer has performed a Rogerson's analysis, it writes information on the procedure and results into the session log.
Parameters
• The values you entered for n, H, K, and tau, followed by the h and k that ClusterSeer calculated
Summary statistics
• Total number of regions analyzed
• Total number of case events analyzed in the duration of the study
Alarm list
ClusterSeer reports all intervals leading up to an alarm, when the cumulative sum exceeded the alarm threshold. For each alarm, ClusterSeer reports:
• The alarm number, the cumulative sum value, and the batch when it sounded, identified by the case labels and the time intervals beginning and ending the batch
• A table listing regions with their observed case proportion, their expected proportion, and the ratio of the obs
81. l cancer near an industrial incinerator in Lancashire, England. He compares this pattern with the distribution of lung cancer in the area (the control). Diggle and Rowlingson (1994) reanalyze the Lancashire data as well as childhood asthma in Derbyshire, England, in relation to three industrial plants. They found no effect of two of the three plants, but there was modest evidence for an association with one of the plants. ClusterSeer currently supports the investigation of a pattern around a single focus.
Diggle's Method: Statistic
H0: The case and control disease occurrences have the same underlying spatial distribution.
Ha: The case subject locations have a different spatial pattern than the control locations, and the density of the case locations is higher than the control near the focus.
Test statistic
The test is essentially a goodness-of-fit test comparing two spatial models for the case subject locations: a null spatial model developed from control locations, and a model that incorporates distance from the focus. The spatial pattern of control subject locations (also called intensity or density) is modeled as an inhomogeneous spatial Poisson point process. In this case, the process is inhomogeneous because the intensity varies with location, x.
$\lambda(x) = \rho \lambda_0(x) f(d)$
Where rho ($\rho$) is the overall number of events per unit area, $\lambda_0(x)$ is the spatial variation in intensity of the control locations with position, irrespective of the fo
82. lable in ClusterSeer are similar to those described in Bithell (1995), with the difference that the scale parameter is not included.
Model 1: $f(d) = \exp(\phi / d)$
This model has a serious potential problem: it is infinite at the origin (the focus). This model is appropriate if disease risk increases towards certainty towards the focus. Thus, the figure displays the inverse of the additive model; as this surface tends towards zero at the center, the RR is tending towards infinity.
Model 2: $f(d) = 1 + \beta \exp(-d / \phi)$
This model comes to a sharp point at the origin (focus); risk increases more rapidly the closer the subject is to the focus.
Model 3: $f(d) = 1 + \beta \exp[-(d / \phi)^2]$
Much smoother than the other similar models (2 and 4).
Model 4: $f(d) = 1 + \beta / (1 + d / \phi)$
Very similar to Model 2.
Bithell's Test: Choosing parameters
Two approaches are possible for choosing parameters for the relative risk function: (1) hypothesis testing and (2) model fitting. For hypothesis testing, model parameters must be chosen objectively, based on prior knowledge of the system. For model fitting, by contrast, the parameters can be chosen to match the pattern of the data. If you follow the model-fitting approach, the P-value for the statistical test cannot be used for hypothesis testing, as you are testing a hypothesis generated for the data using the data, which is circular reasoning. What the P-value indicates in this cas
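To compare the shapes of the relative risk functions listed for Bithell's test, the sketch below evaluates Models 2, 3, and 4. The functional forms are reconstructions from a damaged extraction and should be read as assumptions: each equals 1 + beta at the focus and decays toward 1 with distance, with phi controlling the rate of decay. Model 1 is left out because it is infinite at the focus. The function names and the `linear_risk_score` helper (a sum of case counts times the log of the relative risk, following the risk score idea attributed to Bithell 1995) are hypothetical.

```python
import math


def rrf_model2(d, beta, phi):
    # Assumed form: 1 + beta * exp(-d / phi); sharp peak at the focus.
    return 1.0 + beta * math.exp(-d / phi)


def rrf_model3(d, beta, phi):
    # Assumed form: 1 + beta * exp(-(d / phi)^2); smooth (Gaussian-like) peak.
    return 1.0 + beta * math.exp(-((d / phi) ** 2))


def rrf_model4(d, beta, phi):
    # Assumed form: 1 + beta / (1 + d / phi); slower, reciprocal decay.
    return 1.0 + beta / (1.0 + d / phi)


def linear_risk_score(case_counts, distances, rrf, beta, phi):
    """Sum over regions of case count times log relative risk at that distance."""
    return sum(x * math.log(rrf(d, beta, phi)) for x, d in zip(case_counts, distances))


if __name__ == "__main__":
    # Made-up parameters: relative risk of 3 at the focus, decay scale of 2 km.
    for d in (0.0, 1.0, 2.0, 5.0, 10.0):
        print(d, rrf_model2(d, 2.0, 2.0), rrf_model3(d, 2.0, 2.0), rrf_model4(d, 2.0, 2.0))
```

Printing the three curves side by side for a few candidate values of beta and phi is one way to check, before any test is run, that the chosen parameters reflect prior knowledge rather than the observed case pattern.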
83. licking the Change Color button and choosing a new color. Hit Update to apply any changes you make. Choose Cancel to keep the current formatting.
Polygon layer properties
You may change the outline style and the fill colors of polygon layers. Hit OK to apply any changes you make. Choose Cancel to keep the current formatting.
Line style
You can choose the width of the lines and their color. Choose line width from the drop-down box and line color using the Change Color button.
Fill color
Single color: Choose this option to color all polygons the same. Change the color by hitting Change Color and picking a new one from the palette.
Categorical: You can choose to color the map based on the values of one categorical variable. Choose the variable from the pull-down list. ClusterSeer will choose the color automatically.
Graduated color: You can choose to display the values of a single variable using a gradient between two colors. You can choose a minimum and a maximum color; the minimum value will be displayed as the minimum color and the maximum value as the maximum color, with intermediate values a blend. To change the variable displayed, choose another from the pull-down list. You also may change the minimum and maximum colors.
RGB: You may choose to represent the values of up to three variables using red, green, and blue. You specify the value associated with each color.
Transparent: You can also color
84. lied to childhood leukemia in Sweden (Hjalmars 1996) and upstate New York (Kulldorff and Nagarwalla 1995), and to breast cancer in the northeastern United States (Kulldorff et al. 1997).
Kulldorff's Scan: Statistic (Poisson)
H0: The null spatial model is an inhomogeneous Poisson point process with an intensity proportional to the population at risk.
Ha: In some locations in the multidimensional space, the number of cases exceeds that predicted under the null model.
Test statistic
A cylindrical window is moved systematically through the study's geographic and temporal space. The window is centered on an individual region centroid at a particular time and expanded to include neighboring regions and time intervals until it reaches a maximum size. The number of cases observed and expected within the window is calculated at each window size. The maximum size will not exceed 50% of the average population-at-risk size for the study period and 50% of the study period span. The window is then centered on the next region centroid, and the process continues. The hypotheses are evaluated with a maximum likelihood ratio test that examines whether the null or alternative model better fits the data (notation follows Kulldorff 1999). The scan statistic is the maximum likelihood ratio over all possible window sizes. Its P-value is obtained through Monte Carlo randomization, based on a multinomial randomization. If the null hypothesis is rejected, Clust
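The likelihood ratio evaluated for each window in the scan statistic is not spelled out in this passage. As a hedged sketch, the code below uses the standard Poisson log likelihood ratio from the scan statistic literature, together with a common rank-based convention for the Monte Carlo upper-tail P-value. Both are assumptions about the general technique rather than a description of ClusterSeer's internals, and the function names are hypothetical.

```python
import math


def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one window under the Poisson model.

    cases_in    -- observed cases inside the window
    expected_in -- expected cases inside the window under the null (> 0)
    total_cases -- total cases in the whole study
    Only excesses count: windows with cases_in <= expected_in score 0.
    """
    c, e, n = cases_in, expected_in, total_cases
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = 0.0 if c == n else (n - c) * math.log((n - c) / (n - e))
    return inside + outside


def scan_statistic(windows, total_cases):
    """Maximum log likelihood ratio over (cases_in, expected_in) pairs."""
    return max(poisson_llr(c, e, total_cases) for c, e in windows)


def monte_carlo_p(observed, simulated):
    """Upper-tail P-value: rank of the observed value among the simulated maxima."""
    at_least = sum(1 for s in simulated if s >= observed)
    return (1 + at_least) / (1 + len(simulated))


if __name__ == "__main__":
    # Three candidate windows (made-up counts) in a study with 100 cases.
    candidates = [(10, 5.0), (4, 5.0), (12, 9.0)]
    print(scan_statistic(candidates, 100))
```

Because each simulated dataset contributes only its own maximum, comparing the observed maximum against those simulated maxima accounts for the many overlapping windows that were examined.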
85. lists all the map layers. You may need to expand the frame to view the full layer names. You may show or hide a map layer by checking or clearing its associated box using the mouse. Displayed layers have a red check in the box next to their name. The active layer is highlighted on the layers list. Click on a layer's name in the pane to activate it. The maps are drawn sequentially, with layers higher on the list drawn over those lower on the list. For instance, if you have a polygon layer, it may obscure a point layer underneath it. To fix this, change the order of layers in the layer list. To change the order of layers on a map, drag layers up or down the list.
The right panel: the map itself
The map panel displays data and results. You may query or reformat active layers.
The map toolbar
The map visualization toolbar appears when the map window is active. To activate the map, click on it.
The selection tool is the default tool. In the map layer pane, it can be used for changing the order of map layers and activating and deactivating map layers (see Maps Overview for details). In the map pane, it can be used to select map features. Using this tool, you can click directly on a feature to select it, or you can click and drag open a rectangle to select all features that intersect the rectangle. If you move the arrow to the map pane and right-click, you will have the option of querying the nearest feature on th
86. ll not overlap the first; the third will not overlap the second or the first.
• regions included, starting with the centering region, with remaining regions ordered from nearest to farthest
• cluster temporal span
• disease frequency averaged over the cluster temporal span
• log likelihood ratio
• upper-tail Monte Carlo P-value
Chapter 9: Levin and Kline's Modified CuSum
Cumulative Sum (CuSum) methods were developed for monitoring industrial production (Page 1954, 1961). They track changes in a variable of interest relative to a baseline value. Levin and Kline (1985) modified Page's CuSum method for use in epidemiological retrospective surveillance. The modified CuSum monitors the pattern of disease over time in group-level data (case and population-at-risk counts). The CuSum accumulates deviations from a baseline disease occurrence over time. It allows rapid measurement of change from historical case counts. The statistic magnifies small, abrupt changes. Only when the CuSum exceeds a chosen threshold, used to create an indifference zone, is the value added to the running cumulative sum. Small rises in disease occurrence do not register, limiting the chance for false positives. Although Levin and Kline used the single maximum CuSum value in the analysis as their test statistic, ClusterSeer finds and tests the three highest CuSum values.
Example
Levin and Kline use the modified CuSum to examine the pattern of spontaneo
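The accumulation rule behind a CuSum can be sketched with Page's classic one-sided recursion. This is an illustration only: it assumes a fixed reference value k that plays the role of the indifference zone, whereas Levin and Kline derive their reference value from the baseline risk, the relative risk to detect, and the population at risk, which is not reproduced here. The function names are hypothetical.

```python
def page_cusum(counts, k):
    """One-sided CuSum: S_t = max(0, S_{t-1} + x_t - k).

    counts -- case counts per time interval
    k      -- reference value; counts at or below k do not raise the sum
    """
    s = 0.0
    series = []
    for x in counts:
        s = max(0.0, s + x - k)
        series.append(s)
    return series


def top_three(series):
    """The three highest CuSum values, largest first."""
    return sorted(series, reverse=True)[:3]


if __name__ == "__main__":
    # Made-up monthly case counts with a short run of elevated values.
    monthly_cases = [1, 2, 5, 6, 1, 0]
    cusum = page_cusum(monthly_cases, k=3)
    print(cusum)
    print(top_three(cusum))
```

Counts below the reference value pull the sum back toward zero but never below it, so an isolated small rise leaves no lasting trace while a sustained run of high counts builds up quickly.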
87. lo randomization process is based on the multinomial distribution. The conditional form evaluates the pattern of the cases. Its advantage is that it can be applied even when the baseline disease frequency may not be accurate for the study population. Yet it can be significant solely through finding fewer than expected cases far from the focus, which is not quite the same as finding a cluster of cases near the focus. The unconditional test and the Monte Carlo randomization process are based on the Poisson distribution, where the mean is the expected risk for the area. This form requires an accurate baseline disease frequency for the study population. In the unconditional version, T increases with increases in case counts across the entire study area and when this excess is concentrated near the focus.
Bithell's Test: Relative risk functions
Bithell's method hinges on relative risk and how it changes over distance from a focus. The Relative Risk Function (RRF) describes this change in mathematical terms. In the null hypothesis for Bithell's method, relative risk is the same regardless of location and equal to 1. In the alternative spatial model, risk depends on distance from the case location to the focus (d), the rate of decay of cases with distance from the source (phi, $\phi$), and the ratio of risk at the focus over that infinitely far from the focus (the parameter $1 + \beta$). It can be represented by a number of different models. The models avai
88. mia cases in the years surveyed (1975-85). Waller et al. (1994) use it to survey patterns in leukemia in upstate New York. They did not find strong evidence for clustering, though there was a suggestion of some clustering in one county. They recommend using the method to prioritize areas for further study. Le, Petkau, and Rosychuk (1996) use a modification of the method to examine whether cancer clusters appear near pulp and paper mills in British Columbia, Canada. The method successfully re-identified several known clusters of different types of cancers.
Besag and Newell's method: Statistics
H0: The number of cases in an area follows a Poisson distribution with a common rate.
Ha: For some areas, the number of cases exceeds that predicted by a Poisson distribution with a common rate.
Test statistics
This method assesses clustering at the local and global scale using two test statistics: $\ell$ for the local scale and r for the global scale. Thus, use $\ell$ to evaluate local-scale clustering and use r to examine global-scale clustering. This method is designed for case and population-at-risk count data aggregated into regions with small population sizes. Regions could be census tracts, zip codes, or towns. $\ell$ describes the extent of local clustering: the number of regions needed to aggregate at least k cases, with k defined by the user. If the cases are in a cluster, you can imagine there would be fewer regions to aggregate to find a set number of c
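The local statistic in Besag and Newell's method counts how many regions must be aggregated, nearest first, to reach k cases. A minimal sketch follows. It assumes the count includes the centering region itself, and it pairs the count with the usual Poisson upper-tail probability of seeing at least k cases given the expected number in the aggregated regions; that significance formula comes from the general Besag and Newell approach and is not quoted from this passage. The function names are hypothetical.

```python
import math


def regions_needed(cases_nearest_first, k):
    """Number of regions (center first, then nearest neighbors) needed to reach k cases.

    Returns None if the whole list holds fewer than k cases.
    """
    total = 0
    for count, cases in enumerate(cases_nearest_first, start=1):
        total += cases
        if total >= k:
            return count
    return None


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for s in range(k):
        cdf += term
        term *= lam / (s + 1)  # step from P(X = s) to P(X = s + 1)
    return 1.0 - cdf


if __name__ == "__main__":
    # Made-up data: cases per region ordered by distance from the center.
    cases = [0, 1, 2, 3]
    ell = regions_needed(cases, k=3)
    # Expected cases in those regions under a common rate (made-up value).
    print(ell, poisson_upper_tail(3, lam=0.9))
```

A small count paired with a small tail probability indicates that k cases were reached in a population where far fewer would be expected under a common rate.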
89. n you run the CuSum analysis, the population-at-risk sizes (n), and the average disease risk calculated from the data. Relative risk is the change in risk after exposure: the risk after exposure divided by the baseline risk. The significance of the three largest CuSum values is determined by comparing these values to the Monte Carlo distribution of the largest test statistic.
Levin and Kline's Modified CuSum: How to
ClusterSeer requires case counts and population-at-risk counts over time to run a CuSum analysis. You can submit this data in one of two ways: as a single file, or as a case file and a census file. To use a census file, your case data must be on a year-based scale (daily, weekly, monthly, or yearly observations).
Levin and Kline's Modified CuSum: Single file
Choose Levin and Kline's Modified CuSum from the QuickStat menu or from the Analysis menu, Surveillance submenu.
1. In a series of dialogs, ClusterSeer will prompt you for information and to submit the file. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 5.
2. You will need to select the temporal unit for the case data: whether the case counts were aggregated on a daily, weekly, monthly, yearly, or other user-defined basis.
3. You will be asked whether you will submit census data; indicate No.
4. ClusterSeer will prompt you to import the case data file with the following columns in the following orde
90. Adjusted significance levels
Combined P-values
Multiple Comparisons: How to
Multiple Comparisons: Results
RESOURCES
Troubleshooting
References
Preface
ClusterSeer supplies data visualization tools and state-of-the-art statistical methods to explore spatial and temporal patterns of disease. ClusterSeer methods can be used to investigate disease clusters in space, in time, and spatial clusters that depend on time (spatio-temporal interaction). Use the method of your choice, or find an appropriate method using the ClusterSeer Advisor.
System requirements
• Windows 95 or Windows NT 4.0, or more recent operating system
• Screen resolution of 800 x 600 or finer for best viewing of the maps and graphics
• 256 colors or better, highly recommended for graphics
Manual overview
This manual outlines how to use ClusterSeer, BioMedware's tool for detecting pattern in health data. Chapter 1 presents the conceptual background for the software. This chapter includes a cluster definition and a perspective on the role of cluster detection in the larger process of identifying the source of diseas
92. o of the observed/expected number of cases, calculated for 10 distance intervals from the focus.
[Figure: relative density (y-axis) plotted against distance from focus in km (x-axis).]
The y-axis shows relative density, the ratio of the two models. The points on the plot illustrate the ratio of the observed density of cases and that expected according to the null model. The line illustrates the ratio of the alternative and null spatial models. As the two models differ only in the raised density model, it is graphed directly.
Map
You can view the map by choosing Map from the View menu. The map has 2 layers. Each can be queried.
focus (illustrated with a red X on the map): When you query the focus, you can view a table holding its coordinates (x, y values). If the coordinate was converted to UTM, the query table will report both latitude-longitude and UTM coordinates.
case and control point locations: If you query one of these points, you'll be able to view its coordinates and distance to the focus. The scale for distance is in the scale specified on import, if the data were transformed from geographic coordinates, or the scale of the data, for planar data.
Session log
After ClusterSeer performs a Diggle analysis, it will place summary information and results into the session log.
Parameters and summary statistics: the coordinates of the focus, the original parameter values you supplied
Cluster detection results 7
93. of a change in health status (disease state), usually calculated as an incidence proportion by dividing the case count by the population-at-risk count. It may be calculated locally (temporally or spatially) for comparison to either the average or expected disease frequency.
A target region in defining spatial weight files.
expected disease frequency: A disease frequency value supplied by you when specifying a ClusterSeer method. It is usually estimated from another population for comparison with the study data.
extrapolation: A set of processes for estimating values in between and outside of samples. Within ClusterSeer, you may extrapolate census data with linear or step methods.
focus: Point location of potential environmental exposure. ClusterSeer offers methods for evaluating the pattern of disease relative to a focus.
global clustering: As used within ClusterSeer and this manual, global clustering methods are tests that evaluate clustering by looking at spatial patterns throughout the entire study area. Contrast with local or focused methods.
group-level data: A data type where units of observation are collections of study subjects aggregated over geographic regions and/or temporal intervals. Compare to individual-level data.
inhomogeneous; intensity; interquartile distance; label; local clustering
individual-level data: A data type where the units of observation are subjects that are cases or controls. Compare to
94. ogram that shows the reference distribution from the Monte Carlo simulations. ClusterSeer has a Monte Carlo distribution for each region in your dataset. Choose MC Distribution from the View menu. Next, ClusterSeer will prompt you to choose a region from the list of regions in your dataset. The distribution of test statistics from the simulations will appear as gray bars, and the observed test statistic will be drawn as a slim black line.

Map
A map is available only if you submitted the data in shapefile format. You can view a map by choosing Map from the View menu. You can view any of four variables displayed as a choropleth (polygons coded with a color gradient). The variables you can display are the local Moran statistic, disease frequency, Monte Carlo P-value, and the normal P-value. The map shows the local Moran statistic as the default choropleth. To change the variable displayed and/or the look of the map, right-click on the map to display a pop-up menu. Choose Properties from the menu. See polygon layer properties for more details on options. If you query the map, you will see a table of the region label, test statistic, disease frequency, and the P-values from the Monte Carlo simulations and the normal approximation.

Session log
Once ClusterSeer has performed a local Moran analysis, it writes information on the procedure and results into the session log.
Summary information and parameters
• total number of region
95. oids and including the nearest-neighbor locations until the total aggregated population in the window equals R. The last region added to the window may contribute only a fraction of its population to the window. The case count occurring in this window is the sum of all cases in included regions. For the farthest region, which may have only a fraction of its population in the window, the same fraction of its cases is included in the window. The significance of M_R is found empirically through Monte Carlo randomization. The reference distribution is generated by randomly distributing the cases among the population at risk, based on a multinomial distribution estimated from relative region-specific population sizes.

Turnbull's Method: How to
Choose Turnbull's method from the QuickStat menu, or from the Analysis menu, Spatial and then Local.
1. In a series of dialogs, ClusterSeer will prompt you to submit the file to analyze. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4.
2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement.
3. ClusterSeer will prompt you to submit the data file. This file should contain group-level data with the following columns in the following order. The file is checked for duplicate centroids and it must follow general ClusterSeer data requi
96. olygon 3, another zero. Lower rows describe polygons 2 and 3 in turn. For local Moran, ClusterSeer row-standardizes spatial weights stored in the contiguity matrix. Row-standardizing matrix a leads to matrix b. For example, as polygon 2 has two neighbors, each neighbor is weighted 1/2, so weights in the row add up to 1 and the statistic is not biased by the number of neighboring regions.
SpaceStat was developed by Luc Anselin and it is distributed by BioMedware, Inc.

Polygon overlap
If your polygons overlap, it may be difficult to view them when mapped or to select them for queries. ClusterSeer will not be able to display properly shaded areas where overlap occurs. Uniquely named polygons completely contained within another polygon will be correctly processed for analysis and display. Relatively smaller, non-uniquely named polygons will be discarded on import and excluded from the analysis.

Polygon contiguity
ClusterSeer can derive neighbor relationships from a file of polygons. In essence, ClusterSeer will evaluate whether the polygons share a border with each other. If they share a border, they are considered neighbors. In order to derive neighbor relationships from polygons in shapefile format, you must specify how ClusterSeer should evaluate these relationships. While it may seem like a trivial concept, the specification of neighbor relationships can in fact influence the outcome of statistical analyses.

Rook vs que
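The row standardization described under the contiguity matrix can be sketched in a few lines of Python. This is a minimal illustration, not ClusterSeer's implementation: the function name `row_standardize` and the three-polygon matrix are assumptions chosen so that polygon 2 has two neighbors, as in the example in the text.

```python
def row_standardize(matrix):
    """Divide each row by its sum so the weights in the row add up to 1."""
    standardized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumption: a region with no neighbors keeps an all-zero row.
            standardized.append([0.0 for _ in row])
        else:
            standardized.append([value / total for value in row])
    return standardized


# Hypothetical binary contiguity matrix for three polygons:
# polygon 1 borders polygon 2; polygon 2 borders 1 and 3; polygon 3 borders 2.
contiguity = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

weights = row_standardize(contiguity)
for row in weights:
    print(row)
```

In this sketch each neighbor of polygon 2 receives a weight of 1/2, while the single neighbor of polygons 1 and 3 receives a weight of 1.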
97. ..... 33
Deleting map layers ..... 33
Removing maps ..... 33
Exporting maps ..... 33
Querying maps ..... 34
Formatting maps ..... 35
Point layer properties ..... 35
Polygon layer properties ..... 36
[four illegible entries] ..... 36
Graduated Color ..... 36
[two illegible entries] ..... 36
CHAPTER 3 SUBMITTING DATA ..... 37
[illegible entry] ..... 37
Spatial data ..... 37
Temporal data ..... 37
Spatio-temporal data ..... 37
[illegible entry] ..... 38
About submitting data ..... 38
Data formats: general ..... 39
Spatial data formats ..... 40
Temporal data formats ..... 40
Coordinate system ..... 41
Missing data ..... 41
FILE TYPES ..... 42
[illegible entry] ..... 42
Text file
98. ononononononnnnnnnnnnnno non nnnnnnncnnnnnnnns 22 Generating Poisson random variables oocconocooocconcccnononononononcnnnononononnncnonnnnns 22 SPATIAL AND TEMPORAL CONCEPTS coccconooccnnnnnnnccnnnnonccnnononnccncnnonoss 23 Extrapolation from census data ooooooooooconncncccnooononnononnnncnnonnnnnnonnnnnccnnnnnnnns 23 Neighbor relationships is ia AA ie 24 COntiguity Matrix ie ff eneen ae na da ccansh cavectce levedess sadedeca te vaansh wadedadedeveauss 24 Polyson overlap ri end eG a Se es 25 POLEO CONAM icons 25 AN O 25 CHAPTER 2 WORKING IN CLUSTERSEER o cocococococononononononononos 26 SESSION LOG ii E EI A a a aa Neg 1s gata ta eas Sows taa ea ass Prot cti tia cali have eevee sted anes Dn ld dt ia os EXPO tail llar lso taleritl eara e a Formatting and editing axis labels Formatting axis scaling and points AR A A A Ste ta Bale ena da PO e ed e de o do ll e o dl EXPO da a hea A A AS a e IA Formatting and editing axis labels Formatting axis scaling and Bars mencionen latinos ana O SA EOS OO RO DI O A E EXPO as dado dades MAPS viii a bd Mia psiO Verve WM 30 The left panel the map layerS ooooooocccccnonooooconononononononononcnnnnnonononnnncnnnnannnnnnnnonno 30 The right panel the map itself i ccceccccccceceeessnseeeeeeceeeestseeeeeeeeeseeseeeeeess 31 Thema toolbartiter t sri e ts hie 32 Working with DPS dt 33 Changing the order of data layers ocooooooccconcccnonononccnnnncnnnonon
99. oston: Birkhäuser, pp. 303-322.
Kulldorff, M. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26: 1481-96.
Kulldorff, M. and Nagarwalla, N. 1995. Spatial disease clusters: detection and inference. Statistics in Medicine 14: 799-810.
Kulldorff, M., Feuer, E. J., Miller, B. A. and Freedman, L. S. 1997. Breast cancer clusters in Northeastern United States: a geographic analysis. American Journal of Epidemiology 146: 161-70.
Lawson, A. B. 1989. Score tests for detection of spatial trend in morbidity data. Dundee: Dundee Institute of Technology.
Le, N. D., Petkau, A. J. and Rosychuk, R. 1996. Surveillance of clustering near point sources. Statistics in Medicine 15: 727-740.
Levin, B. and Kline, J. 1985. The cusum test of homogeneity with an application in spontaneous abortion epidemiology. Statistics in Medicine 4: 469-488.
Moran, P. A. P. 1950. Notes on continuous stochastic phenomena. Biometrika 37: 17-23.
Morgenstern, H. 1998. Chapter 23: Ecologic studies. In Modern Epidemiology, 2nd edition, K. J. Rothman and S. Greenland. Philadelphia: Lippincott-Raven, pp. 459-80.
O'Brien, S. J. and Christie, P. 1997. Do CuSums have a role in routine communicable disease surveillance? Public Health 111: 255-8.
Oden, N. 1995. Adjusting Moran's I for population density. Statistics in Medicine 14: 17-26.
Page, E. S. 1961. Cumulative sum charts. Technometrics 3: 1-9.
Page, E. S. 1954. Continuous inspection sch
100. p from the View menu. If you have performed a sequence of analyses, you can only view the map from the most recent one. If you have a previous map open when you do a new analysis, ClusterSeer will remove the previous map. If you need to recreate a map from an earlier analysis, instruct ClusterSeer to redo the analysis.

Changing the order of data layers
The pane on the left side of the map window lists the map layers. For a layer to be visible in the map window, its associated box must be checked. Click on the box to check or clear it. The data layers appear in the order that they are listed, with the top layer in the list appearing above other layers in the view. To change the order of layers, click on a layer in the list and drag it to where you want it.

Deleting map layers
If you want to completely remove a data layer from a map (not just deactivate it), highlight the name of the layer and then hit the Delete key. You may also remove a layer by right-clicking on the map and choosing to "Remove this layer from the map". This procedure removes the active (highlighted) layer.

Removing maps
If you no longer wish to view a map, click on the close button (x) in the map's upper right corner. You may re-create a map of the most recent analysis by choosing Map from the View menu.

Exporting maps
To capture your map as a bitmap, take a screenshot of the map window using the Print Screen key. You can then paste the screenshot into
101. pefile ..... 91
Local Moran: With two files ..... 92
Local Moran: Results ..... 93
Map ..... 93
Session log ..... 93
CHAPTER 11 RIPLEY'S K FUNCTION ..... 95
Ripley's K function: Statistic ..... 96
Test statistic ..... 96
Evaluating the K function ..... 96
Monte Carlo randomization ..... 97
Ripley's K function: Edge correction ..... 97
Ripley's K function: How to ..... 98
Ripley's K function: Results ..... 99
Map ..... 99
Plot ..... 99
Session log ..... 100
CHAPTER 12 ROGERSON'S METHOD ..... 101
Rogerson's Method: Statistic ..... 102
Test statistic ..... 102
Modified Tango statistic ..... 102
Cumulative sum approach ..... 102
Rogerson's Method
102. pper and lower bounds of envelope (simulations); L(h) = h, the identity function (expectation if the null hypothesis is true). If L(h) and the simulations diverge from the identity function, that indicates that the data diverge from that expected under the null hypothesis. If L(h) is greater than the identity function, that suggests clustering at the spatial scale (distance) where the maximum deviation occurs. When the simulations overlap the identity function, you may not see it on the plot, as it is drawn before the simulations.

Session log
After ClusterSeer performs a Ripley's K function analysis, it will place summary information and results into the session log.
Parameters
• Monte Carlo randomization runs performed
• distance h
• distance steps
• region coordinates
Summary statistics
• total number of points analyzed
• ratio of distance h to the maximum interpoint distance. This ratio of distances provides a check on h, the maximum distance analyzed. Because of edge correction calculations, values of h that are close to the scale of the study are not appropriate.
• minimum interpoint distance
Results
• Maximum deviation of the observed L(h) from the identity function, L(h) = h

Chapter 12 Rogerson's Method
Rogerson (1997) developed a cumulative sum modification of Tango's statistic (Tango 1995) for detecting spatial clustering. Rogerson's Spatial Pattern Surveillance Method detects global spatial clusters in ind
103. r without gaps in temporal intervals: temporal interval, case count, population at risk. This file will be checked for duplicate temporal intervals and should follow ClusterSeer data import requirements.
If you wish, you may use the Select File button to change your file choices.
Choose a relative risk value. This value sets the minimum change in relative risk that the method will detect. This value is used to calculate r in the CuSum equation. Relative risk cannot be less than 1. A relative risk of 1 indicates no elevation of risk, a relative risk of 2 indicates that the risk is doubled, etc. Unless you supplied a different value in a previous CuSum analysis, it defaults to 1.0.
Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons.
Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic.
Once you hit OK, you can stop the analysis at any time using the Stop button on the progress bar. The Stop button will halt the analysis, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit.

Levin and Kline's Modified CuSum: Two files
Choo
104. r evaluating disease clusters quantitatively. Most statisticians and researchers consider cluster detection methods more suitable for exploratory data analysis than for rigorous hypothesis testing. As is clear from the CDC guidelines for cluster investigations, the study of disease clusters often occurs with incomplete knowledge. Spatial locations of cases often simply serve as a proxy, or indirect estimation, for exposure to a risk factor. The causes of a disease cluster may not yet be understood or even identified. Additionally, the precise date of disease onset is often unavailable and may be estimated with the date of diagnosis or onset of symptoms. Because of this incomplete knowledge, cluster detection methods can better help identify patterns and generate hypotheses rather than formally test pre-existing hypotheses. Once the hypotheses are generated, they need to be tested with additional, independent data. Otherwise, the procedure is somewhat circular: testing for patterns we have already identified. Thus, cluster detection assessment is a step towards understanding spatial and temporal patterns in health data rather than an endpoint in the process. It can be used in planning subsequent studies, such as case-control studies and environmental monitoring schemes.

Disease risk and relative risk
Risk may be defined as the average probability of disease developing in an individual during a specified time interval. It may be estimated by divid
105. r will prompt you to submit the data file. This file should contain group-level data with the following columns in the following order:
centroid label | centroid x coordinate | centroid y coordinate | case count | population at risk count
The file is checked for duplicate centroids and it must follow general ClusterSeer data requirements.
4. Use the Select File button to change your file choices.
5. Choose the cluster cutoff size, k. The cutoff must be a positive integer between the minimum number of cases in any one region and the total cumulative case count. The size of the cluster you choose to detect (k) determines, in part, where you can detect significant clusters. For small k, some regions may have too large a population to ever show that small a cluster as significant (Waller and Turnbull 1993). In that case, the test does not have adequate statistical power to reject the null hypothesis. So, in essence, the cluster size you have chosen is too low for that region. The default value is the average number of cases per region or the value you supplied in a previous analysis.
6. Expected disease frequency (optional). This value can be an expected frequency from another region, a national average, or any external value. As a default, ClusterSeer calculates an internal average from the data file, the average disease frequency. The average disease frequency is the total number of cases divided by the total population at risk.
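The defaults described in steps 5 and 6 can be sketched as follows. This is an illustration under stated assumptions, not ClusterSeer code: the function names and the four-region dataset are hypothetical, rounding the average case count to an integer is an assumption, and the bounds on k are treated as inclusive.

```python
def average_disease_frequency(regions):
    """Total number of cases divided by the total population at risk."""
    total_cases = sum(cases for cases, _ in regions)
    total_population = sum(population for _, population in regions)
    return total_cases / total_population


def default_cutoff(regions):
    """Average number of cases per region (rounding to an integer is an assumption)."""
    total_cases = sum(cases for cases, _ in regions)
    return max(1, round(total_cases / len(regions)))


def cutoff_is_valid(k, regions):
    """Check that k is a positive integer between the smallest regional
    case count and the total case count (bounds assumed inclusive)."""
    case_counts = [cases for cases, _ in regions]
    lower = max(1, min(case_counts))
    upper = sum(case_counts)
    return isinstance(k, int) and lower <= k <= upper


# Hypothetical dataset: (case count, population at risk count) per region.
regions = [(3, 1000), (5, 2000), (1, 500), (7, 1500)]

frequency = average_disease_frequency(regions)
k = default_cutoff(regions)
print(frequency, k, cutoff_is_valid(k, regions))
```

With these made-up counts there are 16 cases in a population of 5000, so the average disease frequency is 16/5000 and the default cutoff is 4 cases.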
106. rements.
If you wish, you may change your file choice using the Select File button.
Choose a population radius. Population radius (R) is the constant population size of each circular window. R can be the number of people expected to be exposed to the risk factor under consideration. It must be between the minimum region population size and the total population size aggregated across all regions. If you did not specify a different value in a previous Turnbull analysis, ClusterSeer will default R to the average population size across the sub-regions.
Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons.
Choose the number of Monte Carlo runs, the number of simulations used to determine statistical significance of the test statistic.
8. After you hit OK, ClusterSeer will establish nearest-neighbor relationships. If you hit Stop at this point, the procedure will cancel. Then, ClusterSeer will run the Monte Carlo simulations. You can stop the simulations at any time using the Stop button on the progress bar. The Stop button will halt the simulations, and the session log will list results for the number of Monte Carlo runs completed. Then, you can view the results of th
107. rnative hypothesis: An alternative to the null hypothesis, a different prediction defined either in terms of the null spatial model or in terms of additional parameters to define clustering.
alternative spatial model: An alternative to the null spatial model. It can be very basic (not the null spatial model) or it can be a more specific model defining a particular disease distribution.
average disease frequency: Disease frequency estimated from the dataset itself, the ratio of the total case count over the total population at risk.
baseline disease frequency: Used as a reference to evaluate suspected change in disease frequency. A national or historic frequency may be used as the expected frequency, or it may be estimated as the average frequency calculated for the study population under investigation.
calendar-based intervals: Any method for recording times for temporal data that is based on the calendar year, such as daily, weekly, monthly, or yearly intervals. User-defined data is not directly based on the calendar.
case: A study subject that has experienced a health-related event, usually identified as disease diagnosis. Case data may catalog individuals, or cases may be aggregated into groups for disease frequency or case count data.
case count: The number of cases in a particular location, at a particular time, or both.
case-control status: Indicated with the integer 1 if the subject is a case and 0 if the subject is a control.
census data
108. s
• average disease frequency
• alpha level specified on the dialog
• alpha level adjustments
• Test statistic mean and standard deviation
Tables of outliers found three ways
• Outliers more than 2 standard deviations from the mean
  ◦ This table reports the region label, test statistic, z-score, and two-sided P-value obtained from the normal approximation.
• Outliers more than 1.5 times the interquartile distance
  ◦ This table reports the region label, test statistic, z-score, and two-sided P-value estimated from the normal approximation.
• Significance from Monte Carlo simulations
  ◦ This table reports the region label, test statistic, and two-sided Monte Carlo P-value.
If you wish to see the P-value of a region not reported in any table, and if you submitted a shapefile to run the analysis, you can query the map.

Chapter 11 Ripley's K function
Ripley's K function is used to analyze the spatial pattern of point data. It can detect global spatial clustering in individual-level data. In essence, you can use it to compare the observed pattern of cases with that generated by a homogeneous Poisson process. A K function is estimated for the observed data and then compared to an expected K function for a Poisson distribution using a scaled metric, L(h). Additionally, a P-value for the observed data is obtained by comparing the observed L(h) to Monte Carlo randomizations of the data.

Ripley's K function
109. s a known unit distance between any 2 successive numbers.
Census data must be submitted referenced to yearly time units. Data to be associated with the population at risk counts extrapolated from census data must be referenced to calendar-based units (any system other than user-defined). Case counts intended to be referenced to populations estimated from census data are usually aggregated by time intervals. Those intervals containing zero cases don't have to be specified. If this sort of minimized dataset is submitted and the temporal range does not match the intended study period span, study period limits can be explicitly specified in the Census Data dialog. For analysis, missing time intervals in the submitted dataset will be filled with case counts equal to zero and population counts estimated from census data. This approach can be especially useful for spatio-temporally aggregated data, in which all regions in the dataset must have the same temporal range.
Duplicate time intervals cannot be submitted for purely temporal analysis. For spatio-temporal analysis, time intervals can be duplicated across regions but not within regions.

Coordinate system
ClusterSeer can import data in planar or geographic coordinates. If you perform a focused cluster detection method on your data, specify the location of the focus in the data's original coordinates, i.e., planar coordinates for planar data, geographic coordinates for geographic data.
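The filling of missing time intervals described under temporal data formats can be illustrated with a short Python sketch. It assumes yearly intervals, a linear estimate between two census years, and hypothetical names (`fill_missing_intervals`, `linear_population`) and counts; it is not ClusterSeer's own routine.

```python
def fill_missing_intervals(case_counts, start, end):
    """Return case counts for every year in the study period,
    using zero for any year missing from the submitted data."""
    return {year: case_counts.get(year, 0) for year in range(start, end + 1)}


def linear_population(census, year):
    """Linearly estimate the population for a year that lies between
    two census years (assumption: no estimation outside the census range)."""
    years = sorted(census)
    for lower, upper in zip(years, years[1:]):
        if lower <= year <= upper:
            fraction = (year - lower) / (upper - lower)
            return census[lower] + fraction * (census[upper] - census[lower])
    raise ValueError("year is outside the range of the census data")


# Hypothetical minimized dataset: only years with cases are listed.
submitted_cases = {1991: 2, 1994: 1}
census = {1990: 10000, 2000: 12000}

filled = fill_missing_intervals(submitted_cases, 1990, 1995)
populations = {year: linear_population(census, year) for year in filled}
print(filled)
print(populations[1995])
```

Here the study period limits (1990 and 1995) play the role of the limits specified in the Census Data dialog when the submitted data do not span the whole study period.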
110. se Levin and Kline's Modified CuSum from the QuickStat menu, or from the Analysis menu, Surveillance submenu.
In a series of dialogs, ClusterSeer will prompt you to submit the files it requires. If you submitted suitable datasets in the previous analysis, you will jump directly to step 5.
You will need to select the temporal unit for the case data: whether the case counts were aggregated on a daily, weekly, monthly, yearly, or other (user-defined) basis.
You will be asked whether you will submit census data; indicate Yes.
a. Next, you will choose the extrapolation method, how population at risk counts will be estimated from the census data.
b. You will also indicate whether to specify study period limits (see temporal data formats).
ClusterSeer will prompt you to import the files.
a. Submit case data file with the following structure: temporal interval, case count. This file will be checked for duplicate temporal intervals and should follow ClusterSeer data requirements.
b. Submit census data file with the following structure: census year, population count. The file will be checked for duplicate census years and should follow ClusterSeer data import requirements.
If you wish, you may use the Select File button to change your file choices.
Choose a relative risk value. This value sets the minimum change in relative risk that the method will detect. This value is used to calculate r in the CuSum equation. Relati
111. sequence of the time intervals in the data. You can compare the time period index to those reported in the table in the session log.

Session log
Once ClusterSeer has performed the CuSum analysis, it writes information on the procedure and results into the session log.
Summary statistics
• Relative risk parameter you supplied
• Study period span, in the temporal scale of input. For June 1961 - December 1975 in monthly scale input, that would be 196106 - 197512.
• Average disease frequency calculated from the data
• Monte Carlo simulations performed
Results: A table of the three largest CuSum values
• The time interval that ended the accumulation of the highest (including second and third) statistics, identified as both the time index (numbered in sequence) and the time interval
• The time interval-specific disease frequency
• The CuSum statistic
• The upper-tail P-value determined from the Monte Carlo simulations

Chapter 10 Local Moran Test
The local Moran test (Anselin 1995) detects local spatial autocorrelation in group-level data. It is related to Moran's I (Moran 1950), a test for global spatial autocorrelation. In essence, the local Moran decomposes Moran's I into contributions for each location, termed LISAs, for Local Indicators of Spatial Association. These indicators detect clusters of either similar or dissimilar disease frequency values around a given observation. While LISA statistics can be developed
112. st neighbors until the P-value exceeds the specified significance level (default 0.05). In essence, the cluster is diluted by adding neighboring regions until it is no longer a significant cluster. The P-value for the last significant level of neighbors is calculated for each region in the dataset. If the last significant P-values were equal to 0.05 for each region, then the expected R = 0.05 × N, with N representing the number of regions. In practice, the expected R is often smaller than that maximum, as the last significant P-value can be lower than 0.05. These P-values are summed to create the expected R, which is approximately equal to the average of the Monte Carlo distribution. Those regions that never are the center of a significant cluster are not included in the calculation of R. For these areas, the cluster size (k) is too small to ever detect a significant cluster in those regions (Waller et al. 1994).

Besag and Newell's method: How to
Choose Besag and Newell's method from the QuickStat menu, or from the Analysis menu, Spatial and then Local or Global submenus.
1. In a series of dialogs, ClusterSeer will prompt you to submit the file to analyze. If you submitted a suitable dataset in the previous analysis, you will jump directly to step 4.
2. You will need to specify the coordinate system of the data. If the data are in geographic coordinates, you will also need to choose a distance measurement.
3. ClusterSee
113. st statistic is U, the sum of the differences between the observed (O_i) and expected (E_i) disease counts at each location i, from 1 to I (the total number of locations), weighted by the exposure to the focus. Following Waller et al. (1992), ClusterSeer uses the inverse distance of the location from the focus, 1/d_i, as the weight:

$$U = \sum_{i=1}^{I} \frac{O_i - E_i}{d_i}$$

The closest allowable distance is 1.0 x 10^(…), resulting in a maximum exposure weight of 1.0 x 10^(…). The expected disease count is calculated under the null hypothesis of a Poisson distribution. Under the null hypothesis, U should equal zero. P-values for observed values of U can be calculated for the standardized statistic U*, as U* generally has an asymptotic standard normal distribution, except for very rare diseases:

$$U^{*} = \frac{U}{\sqrt{\operatorname{var}(U)}}$$

Within ClusterSeer, Monte Carlo P-values are also calculated for randomizations of the data, drawing from a Poisson distribution.

U: Variance
The variance of U, var(U), is approximated differently depending on whether the baseline risk is known. You may enter a baseline risk (Expected disease frequency) when you ask ClusterSeer to perform a Score analysis. If you do, ClusterSeer will approximate the variance by

$$\operatorname{var}(U) = \sum_{i=1}^{I} \frac{E_i}{d_i^{2}}$$

If the baseline risk is not known, an average risk can be estimated from the sample population, and the variance of U will be calculated as [equation not recoverable], where n_i is the population in region i and n is the tot
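The Score statistic with inverse-distance weights can be sketched as below for the case where the baseline risk is known. The variance used here, the sum of E_i / d_i^2, is a reading of the damaged formula in the text and should be treated as an assumption, as should the function names and the three-region data.

```python
import math


def score_statistic(observed, expected, distances):
    """U: sum over regions of (O_i - E_i) weighted by 1 / d_i."""
    return sum((o - e) / d for o, e, d in zip(observed, expected, distances))


def score_variance_known_baseline(expected, distances):
    """Assumed variance of U when the baseline risk is known: sum of E_i / d_i^2."""
    return sum(e / d ** 2 for e, d in zip(expected, distances))


def standardized_score(observed, expected, distances):
    """U divided by the square root of its variance."""
    u = score_statistic(observed, expected, distances)
    variance = score_variance_known_baseline(expected, distances)
    return u / math.sqrt(variance)


# Hypothetical data: three regions at 1, 2 and 4 distance units from the focus.
observed = [4, 2, 1]
expected = [2.0, 2.0, 2.0]
distances = [1.0, 2.0, 4.0]

u = score_statistic(observed, expected, distances)
variance = score_variance_known_baseline(expected, distances)
u_star = standardized_score(observed, expected, distances)
print(u, variance, u_star)
```

In this example the excess cases sit in the region closest to the focus, so U is positive; a large positive standardized value would point towards elevated risk near the focus.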
114. t region (called an ego) by its label and a count of its neighbors. The third row lists the identities of those neighbors, with the row continuing until all neighbors have been listed. Egos without neighbors can be specified as having a neighbor count of zero or be omitted from the list. ClusterSeer checks rows with neighbor counts for at least 2 fields, checks that the count value is a positive integer, and checks that the count is less than the total number of areas minus 1 (because a region can't be its own neighbor). The following row specifies the neighbors of the first ego, and there must be at least as many fields in that row as the neighbor count (excess fields will be ignored). Neighbor labels cannot match the ego's label, and there can be no duplicates. If the neighbor count in the previous row is zero, then the next row lists a new ego and the number of its neighbors. All region labels for egos and neighbors must match those in the disease frequency file.
SpaceStat was developed by Luc Anselin and it is distributed by BioMedware, Inc.

Chapter 4 Disease Cluster Methods
ClusterSeer offers data visualization tools and state-of-the-art statistical methods to explore spatial and temporal patterns of disease. ClusterSeer methods can be used to investigate disease clusters in space, in time, or spatial clusters that depend on time (spatio-temporal interaction). To choose a method, you may start with the ClusterSeer Advisor
115. t statistic and 2) whether they overlap higher-ranking clusters (the second will not overlap the first; the third will not overlap the second or the first). The test statistics for these possible clusters are compared with the maximum test statistic from the simulations, a more conservative test.

Map
To see the map, choose Map from the View menu. The map will display two layers: region centroids (shown as points) and cluster extent (shown as a circular outline) for each of the three most likely clusters. The second and third most likely clusters are chosen using two criteria: 1) the value of the test statistic and 2) whether they overlap higher-ranking clusters (the second will not overlap the first; the third will not overlap the second or the first).
If you query the region centroids, you can view the region label, x and y coordinates, case count, and population at risk count. You can query each cluster layer to find its centering region label, x and y coordinates, start and end periods for the cluster, local test statistic, disease frequency, P-value, and a list of other regions included in the cluster.

Plot
Spatio-temporal clustering is defined by two factors: spatial extent and temporal duration of the elevation in disease frequency. You can view a plot of time and disease frequency for all three most likely clusters. The second and third most likely clusters are chosen using two criteria: 1) the value of th
116. terSeer will not retain plots from previous analyses, though you can always recreate them. Once you have performed an analysis that generates a plot, you may view it by choosing Plot from the View menu. Once it is displayed, you may format and edit axis labels, axis scaling, and points. You can also export plots from ClusterSeer.

Formatting and editing axis labels
You can format and edit axis labels by double-clicking on the axis. This will call up a window where you can rename the axis and specify a new font for the label.

Formatting axis scaling and points
You can format the plot by right-clicking it and choosing Change Formatting. This brings up a formatting window that allows you to change the attributes of the axes and points on separate tabs.
Axes: To change the scaling on the axes, set the minimum and maximum value shown for the x and the y axes. You may also specify the number of tick marks for each axis, or you may wish to let ClusterSeer choose the tick marks automatically. To change the thickness of the axes, choose a line thickness from the pull-down box next to Line Thickness.
Points: You may also change the color of the points. A few different types of points may be shown on the same plot. Thus, you may want to change the colors and sizes of the points separately for each kind. Choose the points to change in the pull-down box after Data. You may then specify a size and a color for those points.

Exporting
At
117. than k cases in the area. The probability of 0 through k - 1 cases is found by summing the Poisson term from x = 0 to x = k - 1. Lambda is the average or expected case count: the average or expected disease frequency multiplied by the population at risk. The term e indicates the exponential function. When you perform a Besag and Newell analysis, ClusterSeer will calculate the local statistic and its significance for all clusters. It will list all clusters that have a probability less than the significance level you specify (alpha). The default alpha is P = 0.05. Besag and Newell's method: r. r is simply the total number of clusters found in the local scale analysis. To get the observed r, ClusterSeer counts the number of significant local clusters. As some potential cluster locations will be found significant simply due to multiple testing, more quantitative methods of evaluating r are necessary. ClusterSeer provides two methods for evaluating r. Monte Carlo Randomization: ClusterSeer generates a reference distribution to evaluate r by repeatedly randomizing the data and recalculating r for each randomization. The data are randomized according to a multinomial distribution based on relative population size. Expected R: this is the R expected under the null hypothesis (expressed as uppercase R rather than lowercase r). ClusterSeer calculates expected R using the method from Waller et al. 1994. The local statistic is calculated for each region, expanding the window to include neare
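The Poisson summation described in the Besag and Newell excerpt can be sketched in a few lines of Python. This is a minimal illustration, not ClusterSeer's own code: the function name and the example disease frequency and population values are hypothetical, and it assumes the significance is the probability of at least k cases, that is, one minus the summed probability of 0 through k - 1 cases.

```python
import math


def prob_at_least_k_cases(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam).

    Computed as 1 minus the sum of the Poisson term
    e^(-lam) * lam^x / x! for x = 0 .. k - 1.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # Poisson term for x = 0
    cumulative = term
    for x in range(1, k):
        term *= lam / x        # turns the x - 1 term into the x term
        cumulative += term
    return 1.0 - cumulative


# Hypothetical inputs: expected count = disease frequency * population at risk.
disease_frequency = 0.002
population_at_risk = 1500
lam = disease_frequency * population_at_risk   # about 3 expected cases

p = prob_at_least_k_cases(k=6, lam=lam)
alpha = 0.05                                   # default alpha named in the text
is_significant = p < alpha
```

With these made-up inputs the probability is about 0.084, so the window would not be listed at the default alpha of 0.05.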
118. the null hypothesis, usually obtained by Monte Carlo simulations or from distribution theory. region: Within ClusterSeer and its help file, the term region is used to indicate an area represented by aggregate data. A region may be defined as an area, but its data may be assigned to its centroid. region centroid: A point that informally represents a sample area, used for data aggregated within geographic regions. The observations from that region, such as case count and population-at-risk count, are located to the centroid. Within ClusterSeer, centroids are used to establish inter-region distances. relative risk: The proportional change in risk after exposure: the risk after exposure divided by the baseline risk. risk: The average probability of disease developing in an individual during a specified time interval. rook contiguity: Two regions are defined as contiguous under the rook criteria if they share a border of any length greater than a single point. Compare to queen. significance level: A probability threshold used for evaluating a null hypothesis. spatial weights matrix: A way to represent contiguity relationships between study regions. Each matrix element corresponds to the relationship for a pair of regions. Other terms listed: study time, study unit, susceptible, test statistic, upper-tail P value, weight. study area: The entire geographic extent of the data. The study area may be subdivided into regions represented by aggregate data. Alternat
119. this point, you cannot export directly from ClusterSeer. To capture your histogram as a bitmap, take a screenshot of it using the Print Screen key. You can then paste the screenshot into an image editor to view and manipulate it. Histograms: You can use histograms to view and interpret the results of the most recent Monte Carlo randomizations. After you initiate a new analysis, ClusterSeer will not retain histograms from previous analyses, though you can always recreate them. Once you have performed an analysis that includes Monte Carlo simulations, you may view the histogram by choosing MC Distribution from the View menu. Once you are viewing it, you may format and edit axis labels, axis scaling, and bars. You can also export histograms of Monte Carlo distributions from ClusterSeer. Formatting and editing axis labels: You can format and edit axis labels by double-clicking on the axis. This will call up a window where you can rename the axis and specify a new font for the label. Formatting axis scaling and bars: You can format the histogram by right-clicking it and choosing Change Formatting. This brings up the formatting window that allows you to change the attributes of the axes and the bars on separate tabs. Axes: To change the scaling on the axes, set the minimum and maximum value shown for the x and the y axes. You may also specify the number of tick marks for each axis, or you may let ClusterSeer choose the
120. tick marks automatically. To change the thickness of the axes, choose a line thickness from the pull-down box next to Line Thickness. Bars: You may also change the color of the bars. Up to three colors of bars may be displayed on one histogram, and these can be changed separately (change primary color, secondary color, or tertiary color). You may also change the number of bins into which ClusterSeer divides the data. Exporting: At this point, you cannot export directly from ClusterSeer. To capture your histogram as a bitmap, take a screenshot of it using the Print Screen key. You can then paste the screenshot into an image editor to view and manipulate it. MAPS. Maps overview: Maps are visual representations of data and statistical results. The map displays the data and results from the most recent analysis. After you initiate a new analysis, ClusterSeer will not retain maps from previous analyses, though you can always recreate them. Most ClusterSeer maps are displayed in a two-pane window. The left-hand window lists the active layers in the map, and the right-hand window contains the map itself. [Figure: map window for Turnbull's method, with layers for the largest, 2nd largest, and 3rd largest test statistics and the region centroids.] Some maps, for example those produced by the local Moran method, will have three panes. In the three-pane maps, the rightmost pane is the map legend. The left panel: the map layers. This panel
121. tions are based on a Poisson distribution. Choose the number of Monte Carlo runs: the number of simulations used to determine statistical significance of the test statistic. Once you hit OK, you can stop the analysis at any time using the Stop button on the progress bar. The Stop button will halt the analysis, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit. Then you can view the results of the analysis. Bithell's Test: Results. Distribution: You can view the Monte Carlo distribution by choosing MC Distribution from the View menu. This histogram shows the reference distribution generated by randomizing the dataset and recalculating the observed value. The relative position of the observed value of T is illustrated with a slim vertical black line. Map: You can view the map by choosing Map from the View menu. The map consists of two layers. It can be queried for its coordinates (x, y values). If the coordinates were converted to UTM, the query table will report both latitude/longitude and UTM coordinates. If you query one of these points, you'll be able to view its label, coordinates, case count, population-at-risk count, and distance to the focus. If the data were transformed from geographic coordinates, the scale for distance is the scale you specified on import. Plot: You can view the plot by choosing Plot from the View menu. The cumula
122. tive case plot displays the observed and expected cumulative number of cases with increasing distance from the focus. Divergences between observed and expected cases indicate divergence of the data from the null hypothesis. Session log: After ClusterSeer performs a Bithell analysis, it will place summary information and results into the session log. Parameters and summary statistics: the external relative risk function, if you specified one to use as the baseline relative risk; function parameters; focus location; the type of Monte Carlo technique (conditional or unconditional). Cluster detection results: the value of the test statistic, T. Monte Carlo results: the number of simulations; the P value for the test statistic through comparison with the Monte Carlo distribution. Chapter 7: Diggle's Method. Diggle's method is a spatial, focused cluster detection method appropriate for individual-level data. It was developed in two papers, Diggle (1990) and then Diggle and Rowlingson (1994). The method evaluates the spatial distribution of individuals with the disease of interest (cases). The spatial pattern of case locations is compared with the spatial pattern of control subjects with a more common control disease. The control location pattern is used as a null model of no clustering and should reflect the spatial pattern of the population at risk. Examples: Diggle (1990) evaluates the pattern of laryngea
123. ty, reference distribution, region, region centroid. A P value obtained by comparing the test statistic to one end of the reference distribution. Most ClusterSeer methods are one-tailed, focusing on the upper tail. They test for clustering, for where test statistics will be higher than expected. The probability that the observed test statistic was drawn from the null distribution, or the probability that the null hypothesis is true given the observed statistic. Data from individual spatial locations. Points may represent the locations of individual disease cases, or they may represent region centroids for group-level data. Data representing regions as areas. A polygon completely contained within another polygon; a nested polygon only shares borders with the polygon that contains it. The individuals considered at risk for the health event (i.e., disease) under investigation. This value serves as a reference population during cluster analysis. Populations at risk may also be divided into subpopulations (i.e., based on location or age), and these subpopulation counts can serve as or contribute to the units of analysis. If a disease is rare, the cases may be included in the population at risk, as would be expected with census data. Two regions are defined as contiguous under the queen criteria if they share a border of any length, even a single point such as a corner. Compare to rook. A distribution of the test statistic under
124. ude fractions; 0 to 3.4 x 10^38 (disease frequency). Population-at-risk count: positive numbers, can include fractions; 1 to 3.4 x 10^38. Represented by whole numbers such as 0 or 1. Can not be submitted as decimal values such as 1.000 if they match expected codes once truncated. Labels for regions or individuals: must be unique. Label matching between files is case sensitive. Can be numbers, letters, or a combination. Can include spaces if the label is enclosed in single or double quotation marks. Spatial data formats: Data can be imported in planar or geographic coordinates. Planar coordinates must be expressed as numeric values. Geographic coordinates must fall within the following range: -180 to 180. When the coordinates describe region centroids used to aggregate study units, the data are checked on import for duplicate centroids. Temporal data formats: Yearly: YYYY, e.g. 1998; range 0001 to 9999. Monthly: YYYYMM, e.g. 199801; monthly values (MM) range from 01-12; range 000101 to 999912. Weekly: YYYYWW, e.g. 199843; weekly values (WW) range from 01-52; range 000101 to 999952. Daily: MM/DD/YYYY, e.g. 1/2/2001; month and date values may be expressed as single digits; range 12/30/1899 to 12/31/9999. User-defined: e.g. 5; positive whole numbers that may represent points in time or non-overlapping successive temporal intervals; range 0 to 4.2 billion. In this scale, the intervals are naturally ordered by their magnitude: 5 comes after 4, and there i
125. udy shows clustering, L(h) would exceed the expectation of L(h) = h at some scales. Monte Carlo randomizations: ClusterSeer compares the observed K(h) to that from Monte Carlo randomizations of the data. ClusterSeer randomizes the distance between points (d_ij, above) and then re-estimates K(h). Ripley's K function: Edge correction. Ripley's K function evaluates how many other disease cases are within a specified distance h from each case in turn. If a case is on the edge of the study area, then there will be parts of that distance without data. Instead of no cases in the area outside of R, it should be interpreted as no data at all. For example, a section of a larger gray study area is illustrated below. The edge of the study area is the thin black line. The gray point sits at the edge of the study. The circle of radius h around it is partly outside the study, while the circle around the white point is fully inside the study area. Data on these two points are not entirely comparable. A weighting factor corrects for this. The formula for K(h) divides the case count around a particular region by a weight W. This weight is the conditional probability that points around i will be in the study area. ClusterSeer calculates the weight as the proportion of the circle's area that lies in the study area. The entire white circle is within the study area. That weight is 1. About half of the gray circle is outside the study area, so the we
126. ulation at risk values are estimated along the line connecting the two census values. Dates before the first census value will be set to the first value. Dates after the final census value will be set to the last value. Census dates are specified on a yearly scale. The extrapolation will be estimated at the temporal scale used for the case data (daily, weekly, monthly, or yearly). Neighbor relationships: Neighbor relationships between regions underlie statistical methods such as local Moran. To examine spatial association, you first need to define how ClusterSeer should set neighbor, or contiguity, relationships. Exactly what is next to what? ClusterSeer can set neighbor relationships in two ways: (1) using lists of neighbors for each region from SpaceStat sparse ASCII files, or (2) based on polygon contiguity from a GIS file. Contiguity matrix: ClusterSeer uses either data file to create a contiguity matrix holding binary spatial weights. These weights indicate whether regions neighbor each other. The weight between two areas that share a common border is set to 1. The weight between two areas that do not share a common border is set to 0. The figure below illustrates a simple example of three polygons and their contiguity matrix. The first row in matrix (a) describes neighbor relationships for polygon 1: it cannot neighbor itself, so the first value is zero; it neighbors polygon 2, so the second value is 1; and it does not neighbor p
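The binary contiguity matrix described in the neighbor relationships excerpt can be illustrated with a short Python sketch. It is an assumption-laden example rather than ClusterSeer's implementation: the function name is hypothetical, the neighbor lists are supplied as a plain dictionary instead of a SpaceStat or GIS file, and the relationship between polygons 2 and 3 is invented to complete the three-polygon example (the text only states that polygon 1 neighbors polygon 2 and not itself).

```python
def contiguity_matrix(neighbors: dict[int, list[int]]) -> list[list[int]]:
    """Build a symmetric 0/1 spatial weights matrix from neighbor lists.

    neighbors maps each region label to the labels of regions sharing a
    border with it. Every listed neighbor must also be a key.
    """
    labels = sorted(neighbors)
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    weights = [[0] * n for _ in range(n)]
    for label, adjacent in neighbors.items():
        for other in adjacent:
            if other == label:
                continue                      # a region cannot neighbor itself
            weights[index[label]][index[other]] = 1
            weights[index[other]][index[label]] = 1   # contiguity is mutual
    return weights


# Hypothetical neighbor lists for three polygons in a row: 1 | 2 | 3.
neighbors = {1: [2], 2: [1, 3], 3: [2]}
matrix = contiguity_matrix(neighbors)
```

Here the first row is `[0, 1, 0]`: polygon 1 does not neighbor itself, neighbors polygon 2, and does not neighbor polygon 3.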
127. ull, B. W., Iwano, E. J., Burnett, W. S., Howe, H. L., and Clark, L. C. 1990. Monitoring for clusters of disease: Application to leukemia incidence in upstate New York. American Journal of Epidemiology 132: S136-S143. Waller, L. A., and Jacquez, G. M. 1995. Disease models implicit in statistical tests of disease clustering. Epidemiology 6: 584-90. Waller, L. A., and Turnbull, B. W. 1994. The effect of scale on tests of disease clustering. Statistics in Medicine 12: 1969-84. Waller, L. A., Turnbull, B. W., Clark, L. C., and Nasca, P. 1994. Spatial pattern analyses to detect rare disease clusters. In Case Studies in Biometry, Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L., and Greenhouse, J., eds. New York: John Wiley & Sons, Inc., pp. 13-16. Waller, L. A., Turnbull, B. W., Clark, L. C., and Nasca, P. 1992. Chronic disease surveillance and testing of clustering of disease and exposure: Application to leukemia incidence and TCE-contaminated dumpsites in upstate New York. Environmetrics 3(3): 281-300. Williams, E. H., Smith, P. G., Day, N. E., Geser, A., Ellice, J., and Tukei, P. 1978. Space-time clustering of Burkitt's lymphoma in the West Nile District of Uganda. British Journal of Cancer 37: 109-122. Glossary. alpha level: Synonym for significance level, a probability threshold used for evaluating a null hypothesis. alpha parameter: A parameter used to determine the shape of the raised density function in Diggle's method. alte
128. us abortion, or miscarriages, in the first 7 months of pregnancy, as reported by a New York City hospital over five years. They looked for patterns of fetal chromosomal anomalies in the data. The pattern of spontaneous abortion was not significantly different from the baseline for fetuses with chromosomal anomalies. For those with normal chromosomes, there were significant patterns in the data, with a rise in the frequency of spontaneous abortions of chromosomally normal males during the study. The authors do not speculate on what caused the increase in spontaneous abortion of males. Levin and Kline's Modified CuSum: Statistic. H0: The disease occurs at a homogeneous rate over time. Ha: There are times where disease rates are temporarily elevated. Test statistic: The Levin and Kline (1985) modified Cumulative Sum (CuSum) value is calculated for each time interval in the study period. The value is set to zero at the first interval (t = 0). For each successive interval, the CuSum value $W_t$ is $W_t = \max(0,\; W_{t-1} + Y_t - r)$, $t = 1, 2, \ldots$; $W_0 = 0$, where $Y_t$ is the case count in time interval t, $W_{t-1}$ is the CuSum for the last time interval, and r is the reference value. Levin and Kline use r to create an indifference zone. In essence, r determines the sensitivity of the CuSum to small changes. To show a change in the CuSum, the observed case count $Y_t$ must be greater than r. r is calculated from the relative risk you supply whe
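The modified CuSum recursion can be traced with a small Python sketch. It assumes the recursion reads W_t = max(0, W_{t-1} + Y_t - r) with W_0 = 0, as the surrounding definitions suggest; the function name, the case counts, and the reference value are hypothetical, and r is passed in directly rather than derived from a relative risk.

```python
def modified_cusum(case_counts, r):
    """Return the CuSum value W_t for each time interval t = 1, 2, ...

    W_0 = 0 and W_t = max(0, W_{t-1} + Y_t - r), where Y_t is the case
    count in interval t and r is the reference value.
    """
    w = 0.0                      # W_0
    values = []
    for y in case_counts:
        w = max(0.0, w + y - r)  # the sum only grows when Y_t exceeds r
        values.append(w)
    return values


# Hypothetical case counts per time interval and reference value.
counts = [2, 1, 4, 6, 3, 0]
cusum = modified_cusum(counts, r=2.5)
# cusum is [0.0, 0.0, 1.5, 5.0, 5.5, 3.0]
```

The first two intervals fall below r, so the statistic stays at zero; it rises only once the counts exceed the reference value, which is the indifference zone the text describes.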
129. valuation, repeating the preliminary evaluation with verified data. This step also includes a literature review to investigate an association between the cluster and exposure or source. 3. Major feasibility study: Here, a case-control study is designed and any environmental monitoring scheme planned. 4. Etiologic investigation: This step implements the study planned in Step 3. It evaluates the link between the hypothesized cause of the cluster and the disease. It does not necessarily give information on the causes of the original cluster, but evaluates plausible causes. Most studies of apparent disease clusters are not substantiated after early data exploration. Most end at stage 2 after finding no significant clustering. For example, the Minnesota Department of Health received 420 reports of apparent clusters between 1981-8 (Bender et al. 1990). About 95% of these investigations were ended at stage 2 with no clustering found. Of the remaining 5%, only 1/5, or 1% of the original total, warranted an epidemiological study. A similarly low rate of cluster verification occurred in a study of 61 cluster investigations between 1978-84 at the National Institute for Occupational Safety and Health (Schulte et al. 1987). Most apparent clusters did not have a greater than expected number of cases, and of those that did, most could not be explained by occupational exposure. Limits of cluster detection: ClusterSeer provides statistical methods fo
130. ve risk cannot be less than 1. A relative risk of 1 indicates no elevation of risk, a relative risk of 2 indicates that the risk is doubled, etc. Unless you supplied a different value in a previous CuSum analysis, it defaults to 1.0. Enter the significance level you wish to use for the test. The significance level is the alpha level, the cutoff for statistical significance. If you run multiple tests at the same significance level, you can then choose to run a Multiple Comparisons analysis to determine the proper significance level for all comparisons. Choose the number of Monte Carlo runs: the number of simulations used to determine statistical significance of the test statistic. Once you hit OK, you can stop the analysis at any time using the Stop button on the progress bar. The Stop button will halt the analysis, and the results will be displayed for the number of Monte Carlo runs completed by the time the button was hit. Levin and Kline's Modified CuSum: Results. Distribution: You can view the Monte Carlo distribution by selecting MC Distribution from the View menu. This histogram shows the reference distribution (in gray) generated by randomizing the dataset and recalculating the maximum test statistic. The three highest CuSum statistics are shown as thin colored bars. Plot: You can view a plot of CuSum statistics over time by selecting Plot from the View menu. The x axis shows the time period index, an ordered
131. ve sum approach. In this cusum approach, the expectation of $C_G$ after i observations is conditioned on the previous value observed after i observations: Cg E Co s p u, where u is a vector and r K is the proportion of cases in each region, provided that case i is in region k: Uk 1 K p Alr k p. Z monitors changes in $C_G$ from its expectation. When the statistic differs from its expectation, Z will be large and positive: $Z_i = (C_{G,i} - E[C_{G,i} \mid C_{G,i-1}]) / \sigma_{C_{G,i} \mid C_{G,i-1}}$, where the conditional variance is 2 1 1 1 2 Sc lce P diaguu p u and diag represents the diagonal of the matrix uw. The test can be used on non-normal data by grouping samples into batches of a set size n (Rogerson 1997). Then the Z for these batches are averaged to get Zn. Rogerson uses the cumulative sum statistic based on Page (1954) to detect increases in Zn: $S_t = \max(0,\; S_{t-1} + Z_n - k)$, $S_0 = 0$, where t is the batch number in order. The cumulative sum monitors for deviations larger than k units from the target value of zero. An alarm signal is triggered when S exceeds h, a user-defined threshold. This expression is also used in Levin and Kline's modified CuSum for temporal surveillance. Rogerson's Method: Choosing parameters. To run a Rogerson's Spatial Surveillance Analysis, you need to set four parameters: k, h, n, and tau. Change threshold, k: The term k is the threshold for detecting changes in the cumulative sum statistic
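The alarm rule for the cumulative sum can be sketched as follows. This is only an illustration of the recursion S_t = max(0, S_{t-1} + Z - k) with S_0 = 0 and an alarm when S exceeds h; the function name and the batch-averaged Z values are hypothetical, and the sketch does not compute Z from spatial data.

```python
def first_alarm(batch_z_values, k, h):
    """Return (batch number, S) for the first batch where S exceeds h.

    S_0 = 0 and S_t = max(0, S_{t-1} + Z - k), where Z is the averaged
    statistic for batch t. Returns None if no alarm is triggered.
    """
    s = 0.0                                   # S_0
    for t, z in enumerate(batch_z_values, start=1):
        s = max(0.0, s + z - k)               # deviations below k are ignored
        if s > h:
            return t, s
    return None


# Hypothetical batch-averaged Z values, change threshold k, alarm threshold h.
z_values = [0.25, 1.5, 2.0, 0.5, 3.0]
alarm = first_alarm(z_values, k=0.5, h=2.5)
# alarm is (5, 5.0): S reaches exactly 2.5 at batches 3 and 4, which does
# not exceed h, and first passes it at batch 5.
```

Raising h delays or suppresses the alarm, while raising k makes the sum ignore larger deviations, which is the trade-off behind choosing the two parameters.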
132. y divided by two, these values are calculated as the mean of the two test statistics closest to the appropriate position. ClusterSeer then multiplies the interquartile distance by 1.5. Any values farther from the median than 1.5 times the interquartile distance are considered outliers. [Figure: box plot labeling the median, the 25th and 75th percentiles, and the interquartile distance.] MONTE CARLO RANDOMIZATIONS. About Monte Carlo randomization: Monte Carlo randomization is one way to quantitatively evaluate observed data and test statistics. In general, Monte Carlo Randomization (MCR) procedures follow this sequence: 1. Following the calculation of a statistic from the original dataset, observations are randomized. 2. The statistic is recalculated for the randomized data. 3. Steps 1-2 are repeated a given number of times, amassing distributions that will be used to calculate P values for the observed statistic. 4. P values are calculated by comparing the observed statistic to the reference distribution. ClusterSeer randomizes the original dataset according to the approach recommended for a particular method (see Types of randomization). Null hypotheses and the randomization approach are detailed in individual method descriptions. Calculating Monte Carlo P values: The P value is the relative ranking of the test statistic among the sample values from the Monte Carlo randomization. You can calculate P values to see whether observed values are unusually
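Ranking an observed statistic against a Monte Carlo reference distribution can be sketched in Python as below. The exact formula ClusterSeer uses is not given in this excerpt, so the example assumes a common upper-tail convention in which the observed value is counted as one member of the distribution; the function name and all numbers are hypothetical.

```python
def upper_tail_p_value(observed, simulated):
    """Upper-tail Monte Carlo P value for an observed test statistic.

    Assumed convention: (number of simulated values >= observed, plus 1
    for the observed value itself) divided by (number of simulations + 1).
    """
    at_least_as_large = sum(1 for s in simulated if s >= observed)
    return (at_least_as_large + 1) / (len(simulated) + 1)


# Hypothetical reference distribution from nine randomizations.
simulated_stats = [1.2, 0.8, 2.5, 1.9, 0.3, 3.1, 1.0, 0.6, 2.2]
p_value = upper_tail_p_value(2.4, simulated_stats)
# Two simulated values (2.5 and 3.1) are at least 2.4, so p_value is 3 / 10.
```

Under this convention the smallest attainable P value is 1 / (number of runs + 1), which is one reason the number of Monte Carlo runs limits the precision of the result.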
