
µ-ARGUS 3.1 manual

1. Occupation
2.2 Statistical disclosure control measures
When a file is considered unsafe, it has to be modified in order to produce a safe file. This modification can generally be done using so-called SDC techniques. Examples of such techniques are global recoding (grouping of categories), local suppression, the PostRAndomisation Method (PRAM) and adding noise. The effect of applying such techniques to a microdata set is that its information content is decreased: variables are specified in fewer, less detailed categories; values in certain records are suppressed; or values in certain records are replaced by other values.
2.2.1 Global recoding
In case of global recoding, several categories of a variable are collapsed into a single one. In the above example we can recode the variable Occupation: the categories Statistician and Mathematician can be combined into a single category, Statistician or Mathematician. When the number of female statisticians in Urk plus the number of female mathematicians in Urk is sufficiently high, the combination Place of residence = Urk, Sex = Female and Occupation = Statistician or Mathematician is considered safe for release. Note that instead of recoding Occupation one could also recode Place of residence, for instance. It is important to realise that global recoding is applied to the whole data set, not o
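The effect of a global recoding on the frequency of a combination can be illustrated with a small sketch. The records and the helper `global_recode` below are hypothetical and only mirror the Urk example; they are not part of µ-ARGUS.

```python
from collections import Counter

# Hypothetical toy records: (place of residence, sex, occupation).
records = [
    ("Urk", "Female", "Statistician"),
    ("Urk", "Female", "Mathematician"),
    ("Urk", "Female", "Mathematician"),
    ("Urk", "Male", "Fisherman"),
]

# Recoding scheme: several original categories collapse into one.
RECODE = {
    "Statistician": "Statistician or Mathematician",
    "Mathematician": "Statistician or Mathematician",
}

def global_recode(rows, column, scheme):
    """Apply the scheme to one column of every record (the whole data set)."""
    out = []
    for row in rows:
        row = list(row)
        row[column] = scheme.get(row[column], row[column])
        out.append(tuple(row))
    return out

before = Counter(records)
after = Counter(global_recode(records, 2, RECODE))
print(before[("Urk", "Female", "Statistician")])                  # frequency before recoding
print(after[("Urk", "Female", "Statistician or Mathematician")])  # frequency after recoding
```

In this toy data the rare combination occurs once before recoding and the combined category occurs three times afterwards, which is the mechanism by which a recoding can lift a combination above a safety threshold.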
2. 4.2 The File menu
4.2.1 File|OpenMicrodata
[Screenshot: Open microdata dialog box, with fields for the microdata file and the metadata file]
In this dialog box you can select the microdata file and the corresponding metafile. By default the microdata file has extension .asc and the metafile .rda. When you click the corresponding button, you get an open-file dialog box in which you can search for the files you want to use. You can choose other file types by clicking on the "Files of type" listbox. When you have selected the microdata file, a suggestion for the metafile is given, but only when this file exists. Before you click OK you must have filled in the name of the microdata file. If you change your mind, you can cancel the whole operation by clicking Cancel.
4.2.2 File|Exit
Terminates the µ-ARGUS session.
4.3 The Specify menu
4.3.1 Specify|MetaFile
[Screenshot: Specify metadata dialog box, showing for the variable REGION its attributes (starting position, length, decimals), its kind (HH identifier, HH variable, weight, other) and the options for ARGUS (identification level, weight for local suppression, categorical or numerical, truncation allowed, missing values, codelist file)]
In this dialog box all the attributes of the identifying variables can be specified. Attributes
3. [Screenshot: Global Recode dialog box]
This dialog box offers two possibilities to reduce the number of unsafe cells: global recode and truncate.
Short description
The left pane shows a list of all the variables that can be recoded. If it is possible to truncate a variable, the Truncate button is active. With Read you can open existing recode files. If you click on the right edit box, you can easily change a recoding scheme. With Apply the recoding scheme will be applied to the selected variable. With Truncate the truncation will be applied to the selected variable. Recodings and truncations are applied to the original categories. When a recoding, truncation or replacement is applied, an R or T is placed in front of the variable and the variable is shown in red. With Undo you can undo a global recoding or truncation of the selected variable. The Close button closes the dialog box.
Global Recode
There are some rules on how you have to specify a recode scheme. The codelists are treated as alphanumeric codes, which means that they are not restricted to numbers only. However, this implies that the codes 01 and 1 are considered different, and also that aaa and [...] are different. In a recoding scheme you can specify individual codes separated by a comma, or ranges of codes separated by a hyphen. Special attention should be paid when a range is given without
4. When it has finished making the output file, the program makes a report file. The format of this file is HTML, which makes it possible to view the report with a browser after µ-ARGUS has finished. The report file can be viewed with Output|View report, but will also be shown automatically when µ-ARGUS has made a safe file. This report gives a summary of the actions performed by µ-ARGUS.
4 Description of Menu Items
In this section we give a description of the program by menu item. The information in this section is the same as the information shown when the help facility of µ-ARGUS is invoked.
4.1 Main window
[Screenshot: µ-ARGUS main window, showing the number of unsafe records in every dimension for each variable and, for the selected variable REGION, for each category (Groningen, Friesland, Drenthe, Overijssel, Flevoland, Gelderland, Utrecht, Noord-Holland, Zuid-Holland, Zeeland, Noord-Brabant, Limburg)]
Overview of the menus:
File: Open microdata, Exit
Specify: Metadata, Combinations
Modify: Show Table Collection, Global recode, PRAM, Risk specification, Modify numeric var
Output: Make suppressed file, View report
Help: Contents, About
5. version 3.1 User's Manual
Document: 2 D2
Project: CASC project
Date: January 2002
Statistics Netherlands, P.O. Box 4000, 2270 JM Voorburg, The Netherlands
email: ahnl@krypton.vb.cbs.nl
Contributors: Anco Hundepool and Aad van de Wetering; Luisa Franconi and Alessandra Capobianchi (Individual risk model); Peter-Paul de Wolf (PRAM)
Contents
[...]
1 Introduction
2 Producing safe microdata
2.1 Safe and unsafe microdata
2.2 Statistical disclosure control measures
2.2.1 Global recoding
2.2.2 Local suppression
2.2.3 Top and bottom coding
2.2.4 The Post RAndomisation Method
2.3 Methodology description of the individual risk approach
2.3.1 The traditional model
2.3.2 Notation
2.3.3 Risk model
2.4 Application of Risk model
2.4.1 Introduction
2.4.2 An overview of the use of Individual Risk
6. 13 Universitat Rovira i Virgili (URV), ES
14 Universitat Politècnica de Catalunya (UPC), ES
Although Statistics Netherlands is the main contractor, the management of this project is a joint responsibility of the steering committee. This steering committee consists of 5 partners, representing the 5 countries involved and each bearing responsibility for a specific part of the CASC project.
CASC Steering Committee
Institute                           Country       Responsibility
Statistics Netherlands              Netherlands   Overall manager, Software development
Istituto Nazionale di Statistica    Italy         Testing
Office for National Statistics      UK
Statistisches Bundesamt             Germany       Tabular data
Universitat Rovira i Virgili        Spain         Microdata
The CASC microdata team
Several partners of the CASC team work together and contribute in various stages to the development of µ-ARGUS. These contributions are either the actual software development, the research underlying this software development, or the testing of the software.
[Figure: the CASC microdata team]
1 Introduction
The growing demands from researchers, policy makers and others for more and more detailed statistical information lead to a conflict. The statistical offices collect large amounts of data for statistical purposes. The respondents are only willing to provide a statistical office with the required information if they can be certa
7. [Screenshot: Select combinations dialog box, with the variable list, the Threshold field and the buttons Automatic specification of tables, Calculate tables, Clear and Set table for Risk model]
This dialog box can be used to specify the tables that should be checked. It can be used in one of two ways. First, a user can interactively select variables from the listbox that should span a table. Secondly, it can be used to let µ-ARGUS generate the tables. The package can do that in one of two ways:
1. By generating all combinations of variables up to dimension K. Only the variables with identification level > 0 are used. K is to be specified by the user. For each dimension the threshold value should be specified. This option is the implementation of the Dutch approach for Public Use Files (PUF).
2. By using the identification levels of all variables, provided that they have been specified by the user. In this case µ-ARGUS generates all tables that can be obtained when variables from different levels of identification are chosen, and a single threshold should be specified. This option is the Dutch approach for Microdata files Under Contract (MUC).
There are undo options that allow one to remove variables from a table or to remove tables that have been selected at an earlier stage. Additionally, there is the option of specifying that a table will be used for the new Individual risk
8. also section 2.2.4. Additionally, the parameters of the added noise can be made available to the users of a dataset, who can then make better use of the datafile. At this stage you can specify the amount of noise added to the variable. Typically this is done by specifying the chance of changing a category into another category. In the current implementation you specify the chance of not changing a certain category; the complement is the chance of changing the value into another category. You can limit the spread of this by specifying the bandwidth. This implies that a category can only be changed into one of the nearest n categories. The individual chance per category can be changed by using the slider. Press the Apply button to accept the settings. The Undo button will reset the specification. If a variable is PRAMmed, this is shown in the listbox by a P in the first column, together with an indication of whether the bandwidth has been used.
[Screenshot: PRAM specification window, with the list of variables, the options (default probability, bandwidth, set all codes to default) and, for the variable REGION, the individual chances per code (for example Aalten 80, Ter Aar 80, Aardenburg 80)]
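The PRAM settings described above can be sketched as a transition matrix. This is only an illustration under stated assumptions: the text does not say how the chance of changing is divided over the other categories, so the sketch spreads it evenly, and it reads "the nearest n categories" as n positions on either side in the codelist order. The function names `pram_matrix` and `apply_pram` are hypothetical.

```python
import random

def pram_matrix(n_categories, p_keep, bandwidth=None):
    """Build one row per category: p_keep on the diagonal, the rest spread
    evenly (an assumption) over the categories allowed by the bandwidth."""
    rows = []
    for i in range(n_categories):
        others = [j for j in range(n_categories)
                  if j != i and (bandwidth is None or abs(i - j) <= bandwidth)]
        row = [0.0] * n_categories
        if not others:
            row[i] = 1.0          # nothing to change into
        else:
            row[i] = p_keep
            for j in others:
                row[j] = (1.0 - p_keep) / len(others)
        rows.append(row)
    return rows

def apply_pram(codes, matrix, rng):
    """Replace every category index by a draw from its row of the matrix."""
    categories = list(range(len(matrix)))
    return [rng.choices(categories, weights=matrix[c])[0] for c in codes]

matrix = pram_matrix(5, p_keep=0.8, bandwidth=1)
rng = random.Random(2002)  # fixed seed so that a run is repeatable
print(apply_pram([0, 1, 2, 3, 4, 2, 2], matrix, rng))
```

With a bandwidth of 1, category 2 can only stay as it is or move to category 1 or 3; without a bandwidth, the chance of changing is spread over all other categories.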
9. number of unsafe categories per table via the menu option Modify|Show tables. You will see the following window.
Number of unsafe cells per table
[Screenshot: Overview of table collection window, listing tables such as REGION × SEX, REGION × AGE and REGION × MARSTAT]
3.2.1 Global Recoding
Via the µ-ARGUS main window the user decides which variable will be recoded. Double-clicking on the variable in the main window, or choosing the menu option Modify|Global recode, brings you to the global recode window. In this window the global recodings for each variable can be specified.
Specifying and selecting global recodings
[Screenshot: Global Recode window, with the list of variables, the buttons Read, Apply, Truncate, Undo and Close, the codelist for recode and the missing values]
On the left you see a list of the variables. You can recode a variable by applying a recode scheme or by truncating a variable. Truncation is only possible if i
10. while [definition of the terms $B$: equation not recoverable from the source]. The above formulation works for $f_k \geq 3$; if $f_k = 1$ we use

$$ r_k = \frac{p_k}{1 - p_k} \log\frac{1}{p_k} \qquad (4a) $$

while if $f_k = 2$

$$ r_k = \frac{p_k}{1 - p_k} - \left(\frac{p_k}{1 - p_k}\right)^2 \log\frac{1}{p_k} \qquad (4b) $$

However, we found the task of evaluating formula (2) exceedingly heavy, or even impossible, when observed frequencies are too large. In these cases the introduction of a numerical approximation is convenient. We obtained satisfactory results using

$$ r_k = \frac{p_k}{f_k - (1 - p_k)} \qquad (5) $$

This approximation is used for frequencies greater than 40. We were forced to set this value because of software limitations; however, use of a higher threshold could increase precision. Finally, in order to consider other factors influencing the risk, such as the quality of the key variables, the intruding probability and so on, we use a multiplying factor, so that the final risk $\rho$ is given by the product of this factor and $r$ (formula 6). The factor, set to 1 as the default, may be fixed before the risk computation starts.
2.4.4 Frequencies calculation
A fundamental step for risk estimation is the computation of the frequencies $f_k$ and $F_k$. First of all, we consider the population as partitioned into sub-populations $k = 1, \ldots, K$, defined through all the possible combinations of categories of the key variables. It must be stressed that in the individuation of these sub-populations we us
11. [...]
4.2.2 File|Exit
4.3 The Specify menu
4.3.1 Specify|MetaFile
4.3.2 Specify|Combinations
4.4 Modify
4.4.1 Modify|Show Tables
4.4.2 Global Recode
4.4.3 PRAM specification
4.4.4 Individual risk specification
4.4.5 Modify numerical variables
4.5 Output
4.5.1 Output|Make Suppressed File
4.5.2 Output|View Report
4.6 Help
4.6.1 Help|Contents
4.6.2 Help|About
Preface
This is the user's manual of version 3.1 of µ-ARGUS, a software package to assist in producing safe microdata. The package has been developed under Windows NT and runs under Windows versions from Windows 95 onwards. This is the first version of µ-ARGUS that has been produced in the fifth Framework CASC (Computational Aspects of Statistical Confidentiality) proje
12. If an .rda file has been specified under File|OpenMicrodata, this dialog box shows the contents of this file. If no .rda file has been specified, the information can be specified in this dialog box after pushing the New button. As a default, newvar is substituted as the variable name. Apart from defining a new variable, an existing one can be modified or deleted. The following attributes can be specified: the name of the variable, its first position in the data file, its length and the number of decimal places. Furthermore, the kind of variable can be specified: weight variable, household variable, household identifier, or none of these. A weight variable specifies the weight of the record and is based on the post-sampling design used. A household variable is one that yields the same score for individuals belonging to the same household. A household identifier uniquely identifies households represented in the file.
Options for ARGUS
The identification level of the variable can also be specified. This level is used to let µ-ARGUS generate combinations of variables that should be checked. Identification level 0 means that the variable is not identifying, implying that it does not appear in any of the combinations to be checked. Identification level 1 is the highest level of identifiability, followed by levels 2, 3, etc. Identification levels > 4 lead to very large sets of combinations to be checked. If a variable is identifiable at level i, it is also identifiable at level
13. J. (1998), Estimating the re-identification risk per record in microdata, Journal of Official Statistics 14(4), 361-372.
2.5 Information loss
A measure for disclosure risk, such as a thresholding rule or the more sophisticated individual risk approach, can be used as a criterion to distinguish between safe and unsafe microdata. If unsafe microdata are going to be transformed into safe microdata, it is necessary to have a measure of information loss at one's disposal. This is used to limit the amount of damage done to the data when they are being modified. In practice, if the modifications are being carried out interactively by a data protector, this person is likely to use a crude and intuitive measure of information loss. However, if the modification is partly done automatically using µ-ARGUS, it is necessary that the package uses some measure for the information loss. In case of applying local suppressions only, µ-ARGUS simply counts the number of local suppressions: the more suppressions, the higher the information loss. In case of automatic global recoding, µ-ARGUS uses an information loss measure with the following parameters: a valuation of the importance of each identifying variable according to the data protector, as well as a valuation of each of the possible predefined codings for each identifying variable. Both global recoding and local suppression lead to a loss of information, because
14. Practice, Lecture Notes in Statistics, Vol. 111, Springer-Verlag, New York. See in particular Section 4.2.9.
This concept is developed and used in Gerhard Paass and Udo Wauschkuhn (1985), Datenzugang, Datenschutz und Anonymisierung, R. Oldenbourg Verlag, Munich.
rather sophisticated one. Of the latter type are models that either yield estimates of the number of unsafe combinations or give, for each record in the file, a probability that an intruder will disclose it. In a disclosure scenario, keys are supposed to be used by an intruder to re-identify a respondent. Re-identification of a respondent can occur when this respondent is rare in the population with respect to a certain key value, i.e. a combination of values of identifying variables. Hence, rarity of respondents in the population with respect to certain key values should be avoided. When a respondent appears to be rare in the population with respect to a key value, disclosure control measures should be taken to protect this respondent against re-identification. In practice, however, it is not a good idea to prevent only the occurrence of respondents in the data file who are rare in the population with respect to a certain key. Several reasons can be given for this. Firstly, there is a practical reason: rarity in the population, in contrast to rarity in the data file, is hard to establish. There is generally no way to determine with certainty whether a per
15. The actual top/bottom coding will be done when the safe microdata file is written. In this window µ-ARGUS shows the real minimum and maximum of a variable. You can specify a value above/below which all values are replaced by a specified replacement value. µ-ARGUS is completely flexible as to what replacement you specify; the replacement value can also be alphanumeric. µ-ARGUS only prevents you from entering a top/bottom value above/below the extreme value.
Rounding can be applied to a numerical variable. It will increase the protection of the datafile; however, there are no real measures to calculate the increase in protection.
Weight Noise
As the weight variable might in some cases be identifying (it might, for example, disclose the municipality), adding noise can be a solution. At this stage you can specify the amount of noise to be added as a percentage.
[Screenshot: Modify numerical variables window for the variables WEIGHT, INCOME and DEBTS, showing the minimum (112.7) and maximum (420), the bottom and top values with their replacements, the rounding base and the weight noise percentage]
4.5 Output
4.5.1 Output|Make Suppressed File
When finally all the data modifications have been specified (global recoding, risk specification, PRAM parameters, top/bottom coding, rounding, noise added to the weight variable), it is time to write a protected f
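Top/bottom coding and rounding can be sketched as follows. The function names are hypothetical, and rounding to the nearest multiple of the rounding base is an assumption: the text names a rounding base but does not define the rule.

```python
def top_bottom_code(values, bottom=None, bottom_repl=None, top=None, top_repl=None):
    """Replace values above `top` or below `bottom` by a replacement value.
    The replacement may be alphanumeric, as the dialog allows."""
    coded = []
    for v in values:
        if top is not None and v > top:
            coded.append(top_repl)
        elif bottom is not None and v < bottom:
            coded.append(bottom_repl)
        else:
            coded.append(v)
    return coded

def round_to_base(values, base):
    """Round each value to the nearest multiple of `base` (assumed rule)."""
    return [base * round(v / base) for v in values]

weights = [112.7, 150, 300, 420]
print(top_bottom_code(weights, bottom=120, bottom_repl="<120", top=400, top_repl="400+"))
print(round_to_base([1234, 5678], 100))
```

Note that Python's built-in `round` sends exact halves to the nearest even integer, so a production rule for ties would need to be chosen explicitly.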
16. by the SDC project consortium and is therefore available for this consortium. The main software developments in CASC are µ-ARGUS, the software package for the disclosure control of microdata, and τ-ARGUS, which handles tabular data. The CASC project will involve both research and software development. As far as research is concerned, the project will concentrate on those areas that can be expected to result in practical solutions, which can then be built into future versions of the software. Therefore the CASC project has been designed around this software twin ARGUS. This will make the outcome of the research readily available for application in the daily practice of the statistical institutes.
CASC partners
At first sight the CASC project team has become rather large. However, there is a clear structure in the project, defining which partners are working together for which tasks. Sometimes groups working closely together have been split into independent partners only for administrative reasons.
     Institute                                                          Country
 1   Statistics Netherlands                                             NL
 2   Istituto Nazionale di Statistica                                   I
 3   University of Plymouth                                             UK
 4   Office for National Statistics                                     UK
 5   University of Southampton                                          UK
 6   The Victoria University of Manchester                              UK
 7   Statistisches Bundesamt                                            D
 8   University La Laguna (ULL)                                         ES
 9   Institut d'Estadística de Catalunya                                ES
10   Instituto Nacional de Estadística                                  ES
11   TU Ilmenau                                                         D
12   Institut d'Investigació en Intel·ligència Artificial (CSIC)        ES
17. express the event of the identification of an individual as the fact that record i in the released file and unit i in the register available to the intruder belong to the same unit in the population. In the case of independence among units, the hypothesised way in which an intruder will try to identify the target individual is by comparing the combination of categories of the key variables in his or her register with the observed combinations in the released file. If only one unit in the register and one record in the released file present the same categories of the key variables, then the intruder will link the unit to the record in a deterministic way and therefore identify the individual. Note that records in the file to be released with the same combination of categories of the key variables are identical for the intruder. For those units for which it is not possible to find a one-to-one relationship between the categories of the key variables in the two files, we assume that the intruder will always try to link a record to a unit. Therefore, for those records the link will not be deterministic; the choice will be based on probabilistic reasoning. We denote by $L_i$ the event of making a correct link between record i and individual i. We define the disclosure risk $r_i$ for the target record i as the probability $\Pr(L_i)$ of establishing a correct link between record i and a unit in the population, given the observed sample. Note that records in the file to b
18. is a household identification
<HOUSEHOLD>: a household variable typically contains the same value for each member of a household. When the suppression of the value for one member is necessary, it will be done for each member.
<SUPPRESSWEIGHT>: priority weight for the selection of the suppression pattern (default value 50).
An example of a metadata file:
    REGION 1 4 9999 9998
    <RECODEABLE>
    <CODELIST> Regio.cdl
    <IDLEVEL> 1
    <SUPPRESSWEIGHT> 50
    SEX 5 1 9
    <RECODEABLE>
    <CODELIST> Sex.cdl
    <IDLEVEL> 2
    <SUPPRESSWEIGHT> 50
    AGE 6 2 99
    <RECODEABLE>
    <IDLEVEL> 3
    <SUPPRESSWEIGHT> 50
    <NUMERIC>
    COMPCODE 31 3 999
    <RECODEABLE>
    <IDLEVEL> 3
    <SUPPRESSWEIGHT> 50
    <TRUNCABLE>
    WEIGHT 44 6
    <NUMERIC>
    <DECIMALS> 2
    <WEIGHT>
    INCOME 50 8 99999999
    <NUMERIC>
An example of a codelist file, of the Dutch provinces:
    20 Groningen
    21 Friesland
    22 Drenthe
    23 Overijssel
    24 Flevoland
    25 Gelderland
    26 Utrecht
    27 Noord-Holland
    28 Zuid-Holland
    29 Zeeland
    30 Noord-Brabant
    31 Limburg
3.1.1 Specify tables set
When the metadata is ready, the set of combinations to be inspected by µ-ARGUS can be specified. Either you specify the tables manually or you use one of the two rules to generate this set. These rules are based on the
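A metadata file of this shape can be read with a few lines of code. The sketch below assumes a layout inferred from the example only: a variable line gives the name, starting position, length and up to two missing values, and each following line that starts with `<` is an option of that variable. The function `parse_metadata` is hypothetical and is not the reader used by µ-ARGUS.

```python
def parse_metadata(text):
    """Parse variable lines and their <OPTION> lines into a list of dicts."""
    variables = []
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if tokens[0].startswith("<"):
            if not variables:
                raise ValueError("option line before any variable line")
            key = tokens[0].strip("<>")
            # Options without a value (e.g. <NUMERIC>) are stored as True.
            variables[-1]["options"][key] = tokens[1] if len(tokens) > 1 else True
        else:
            name, start, length, *missing = tokens
            variables.append({
                "name": name,
                "start": int(start),    # first position in the data file
                "length": int(length),
                "missing": missing,     # kept as strings: codes are alphanumeric
                "options": {},
            })
    return variables

SAMPLE = """\
REGION 1 4 9999 9998
<RECODEABLE>
<CODELIST> Regio.cdl
<IDLEVEL> 1
<SUPPRESSWEIGHT> 50
SEX 5 1 9
<IDLEVEL> 2
WEIGHT 44 6
<NUMERIC>
<DECIMALS> 2
<WEIGHT>
"""

for var in parse_metadata(SAMPLE):
    print(var["name"], var["start"], var["length"], var["missing"], var["options"])
```

Keeping the missing values and option values as strings follows the remark elsewhere in the manual that codes are treated as alphanumeric.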
19. is constructed, the threshold for the tables must be specified. This threshold is the maximum value of a cell in a table which is still considered unsafe; above this threshold a cell is considered safe. In case of a sample survey this value is calculated at the level of the sample. To translate it to the level of the population, the sampling fraction should be taken into account by the user. If the new risk model is used, you select a table for this model by pressing the button Set table for Risk model. A description of this model can be found in section 3.2.3. A restriction is that there cannot be an overlap between the tables used for the classical threshold method and those used for the new risk model: mixing the basic model and the new risk approach makes no sense. Therefore the overlapping tables will be removed automatically.
The window for the specification of the combinations
[Screenshot: Select combinations window, with the variable list, the Threshold field and the buttons Calculate tables, Clear and Set table for Risk model]
When you are satisfied with the specified set of tables, you press the button Calculate tables. The calculation is done in three phases. In the first phase µ-ARGUS
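The threshold rule can be illustrated with a short sketch: a cell of a table is unsafe when its frequency in the file is at most the threshold. The records and the function `unsafe_cells` are hypothetical, and the sketch only looks at cells that occur in the data.

```python
from collections import Counter

def unsafe_cells(records, table, threshold):
    """Return the cells of `table` whose frequency is <= threshold."""
    counts = Counter(tuple(rec[var] for var in table) for rec in records)
    return {cell: n for cell, n in counts.items() if n <= threshold}

# Hypothetical toy microdata.
records = [
    {"REGION": "Urk", "SEX": "F", "AGE": "30"},
    {"REGION": "Urk", "SEX": "F", "AGE": "31"},
    {"REGION": "Urk", "SEX": "M", "AGE": "30"},
    {"REGION": "Urk", "SEX": "M", "AGE": "30"},
    {"REGION": "Urk", "SEX": "M", "AGE": "45"},
]

print(unsafe_cells(records, ("REGION", "SEX"), threshold=2))
```

With a threshold of 2, the cell (Urk, F) with frequency 2 is still unsafe, while (Urk, M) with frequency 3 is safe.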
20. number of such combinations, K, can be quite high, in the order of a hundred thousand. Let $f_k$ and $F_k$ be, respectively, the number of records in the released file and the number of units in the population with the k-th combination of categories of the key variables; $F_k$ is unknown for each k. In the sample to be released only a subset of the total number of combinations will be observed, and only this subset, for which $f_k > 0$, is of interest to the disclosure risk estimation problem. We consider the worst case, in which no measurement error in the values of the key variables occurred, an assumption which implies $F_k \geq f_k$. For a particular record i in the sample to be released, let k(i) be the subpopulation it belongs to and, similarly, let $f_{k(i)}$ and $F_{k(i)}$ be, respectively, the number of records in the sample and of units in the population with the same characteristics as i. The ratio $f_{k(i)} / F_{k(i)}$ can be considered a reasonable estimate of the sampling rate relative to the subpopulation k(i) in which the record i is included.
2.3.3 Risk model
To develop a quantitative model for assessing the risk of disclosure, we need some assumptions on the nature of possible attacks on the privacy of an individual. We assume that the person who attempts the identification, herein called the intruder, has an external database or public register available that contains identifiers (for example, names and addresses) of individuals and key variables. We
21. produce safe microdata. It is general within its philosophy, in the same sense as a Volkswagen Beetle is general in its capability of transporting a limited number of persons, but not 10 tons of bricks. So we can say that the development of µ-ARGUS was heavily inspired by the type of rules that Statistics Netherlands uses to protect its microdata, rather than by the precise form of the rules. In order to understand the general ideas that µ-ARGUS uses, the reader therefore needs a good understanding of the type of rules that it is supposed to support. A brief discussion of these is given below. A more extensive one can be found in Willenborg and De Waal (1996). The aim of statistical disclosure control is to limit the risk that sensitive information of individual respondents can be disclosed from data that are released to third-party users. In case of a microdata set, i.e. a set of records containing information on individual respondents, such a disclosure of sensitive information of an individual respondent can occur after this respondent has been re-identified, that is, after it has been deduced which record corresponds to this particular individual. So disclosure control should hamper the re-identification of individual respondents represented in data to be published. An important concept in the theory of re-identification is a key. A key is a combination of identifying variables. An identifying variable or an
22. [Screenshot: Risk chart window for the table regione × relaz × sesso × staciv × eta × nrocompo, showing the cumulative chart, a risk threshold of 0.001508 and 14902 unsafe records]
However, for the convenience of the user, the labels on the axis still report the corresponding $\rho$ value, in order to better evaluate the appropriate value. For the same purpose, µ-ARGUS shows the number of records that present a risk value greater than the threshold. To select the threshold, simply move the bar until the appropriate value is shown in the Risk threshold box below.
2.4.6 Protection phase and creation of safe file
After the final risk has been evaluated for each record and the value of the threshold has been chosen, µ-ARGUS applies the protection step through local suppression. For this protection method an optimised procedure minimises the number of suppressions and reduces the value of the risk below the threshold. Once the suppression algorithm has been applied, the risk calculation can be run again on the output file produced by µ-ARGUS, in order to produce the new values of the risk after the protection step, so that it is possible to analyse the new distribution of the risk. At this point the user can check the information contained in the report file about the protection phase.
[Screenshot: View Report window, showing a µ-ARGUS report file created on 01-16-2002 at 12:42:18]
23. rules which are used in practice at Statistics Netherlands. If you select the tables manually, you select the variables in the left pane and move them to the middle pane with the > button. If the table is ready and the appropriate threshold has been specified, you add the table to the set of tables in the right pane with the > button. Each table can have a different threshold. If you use the generator, you have two options: (1) using the identification level and (2) all tables up to a given dimension.
1. In the first case the set of tables is constructed as follows. For each variable the parameter identification level must be specified. Each variable at level 1 is combined with each variable at level 2, with each variable at level 3, etc. At Statistics Netherlands this rule is used to produce microdata files for researchers. For all tables the same threshold must be specified.
2. In the second case all tables up to a given dimension are used. The selection of variables is restricted to those variables where the identification level is > 0. For each dimension a different threshold can be specified.
When the set of tables is constructed using the generator, it is still possible to make adjustments: you can add additional tables or delete certain tables. If a large number of tables is being generated due to the specific parameters, µ-ARGUS will ask whether it should proceed or not. Independent of the way in which the set of tables
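The two generation rules can be sketched as follows. The second function rests on an assumption about the nesting of identification levels: a variable with identification level j > 0 is treated as identifiable at every level i >= j, and a table takes one distinct variable from each level. Both function names and the sample levels are hypothetical.

```python
from itertools import combinations, product

def tables_up_to_dimension(id_levels, max_dim):
    """All tables of dimension 1..max_dim over variables with level > 0."""
    names = [v for v, level in id_levels.items() if level > 0]
    return [c for d in range(1, max_dim + 1) for c in combinations(names, d)]

def tables_by_level(id_levels):
    """One variable per identification level, levels nested (assumption)."""
    levels = [level for level in id_levels.values() if level > 0]
    if not levels:
        return []
    pools = [[v for v, level in id_levels.items() if 0 < level <= i]
             for i in range(1, max(levels) + 1)]
    tables = set()
    for combo in product(*pools):
        if len(set(combo)) == len(combo):    # a variable is used only once
            tables.add(tuple(sorted(combo)))  # same variables, same table
    return sorted(tables)

id_levels = {"REGION": 1, "SEX": 2, "AGE": 3, "COMPCODE": 3, "INCOME": 0}
print(tables_by_level(id_levels))
print(len(tables_up_to_dimension(id_levels, 2)))
```

For these sample levels the first rule yields the two tables that combine REGION and SEX with either AGE or COMPCODE, while the second rule yields every one- and two-dimensional table over the four identifying variables.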
24. the sub-population, i.e. combinations which completely agree except, at most, for one or more missing categories. In the presence of missing values, computation of $f$ may be pursued by counting the number of units having strings compatible with the sub-population. A similar argument can be applied to $F$. The table below shows how missing values affect computation of the relevant quantities in the context of the previous example (* denotes a missing category).
UnitID   Key_Var1   Key_Var2   Key_Var3   Key_Var4   w_i     f    F
1        1          2          5          1          18      3    149
2        1          2          1          1          45.5    2    84.5
3        1          2          *          1          39      4    194.5
8        1          2          5          1          92      3    149
(The rows for units 4 to 7 are not recoverable from the source.)
The string 1 2 * 1 associated with UnitID 3 is compatible with the sub-populations identified by the strings 1 2 5 1 and 1 2 1 1, and in the same way, in each of these two sub-populations the unit characterised by the string 1 2 * 1 also has to be counted. So

$$ F_1 = F_8 = w_1 + w_8 + w_3 = 18 + 92 + 39 = 149, \qquad f_1 = f_8 = 1 + 1 + 1 = 3 $$

while

$$ F_3 = w_3 + w_1 + w_8 + w_2 = 39 + 18 + 92 + 45.5 = 194.5, \qquad f_3 = 1 + 1 + 1 + 1 = 4 $$

2.4.5 Selection of the threshold
After the evaluation of the final risk $\rho$, the user has to fix the threshold, that is, the maximum tolerable risk. To fix this value µ-ARGUS provides a graphic that shows the frequency histogram of log
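The counting rule for missing values can be checked with a short sketch. It uses only the four units of the example whose rows are legible (units 1, 2, 3 and 8), marks a missing category with `None`, and treats two key strings as compatible when they agree wherever both are present. The function names are hypothetical.

```python
# Units 1, 2, 3 and 8 of the example: (unit id, key values, weight).
# None marks a missing category.
units = [
    (1, (1, 2, 5, 1), 18.0),
    (2, (1, 2, 1, 1), 45.5),
    (3, (1, 2, None, 1), 39.0),
    (8, (1, 2, 5, 1), 92.0),
]

def compatible(a, b):
    """True when two key strings agree wherever both are non-missing."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def frequencies(rows):
    """Return {unit id: (f, F)}: compatible-record count and weight sum."""
    result = {}
    for uid, key, _ in rows:
        matches = [w for _, other, w in rows if compatible(key, other)]
        result[uid] = (len(matches), sum(matches))
    return result

freq = frequencies(units)
for uid in sorted(freq):
    print(uid, freq[uid])
```

On these four units the sketch gives f = 3 and F = 149 for units 1 and 8, f = 2 and F = 84.5 for unit 2, and f = 4 and F = 194.5 for unit 3, in agreement with the table.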
25. will inspect the microdata file and construct lists of existing codes for the active variables. These lists are constructed independently of the specified codelists. This is done to prevent problems arising when some existing codes are missing from the codelists. When the codelists are available, µ-ARGUS will in the second phase calculate the tables to be inspected. In the third and final phase the marginals of the tables will be calculated. The progress is shown by the progress bar. After this you are ready to start the disclosure control process.
3.2 The process of Disclosure Control
When all the preparations have been made, the actual process of Disclosure Control can start. The two main actions that can be performed are Global Recoding and Local Suppression. Additionally you can modify numerical variables by top/bottom coding and rounding. The user can select and perform the global recodings manually via Modify|Global Recode. The central information window is the main window, where for each variable the number of unsafe combinations is shown by dimension of the table. For the variable selected in the left window, the number of unsafe combinations for each category is shown. The higher the number of unsafe combinations for a variable, the more likely it is that the variable needs to be globally recoded in order to achieve a safe microdata file. The main menu for µ-ARGUS showing the numbe
26. [...] In other words, the identification levels for i > 0 are nested. Also the suppression priority (weight for local suppression) can be specified here already. It will play a role when the safe datafile is created at the end of a µ-ARGUS run.
Codelist
The codelist of a variable should be known to µ-ARGUS before it can operate properly. There are two options for obtaining the codelist for each variable. In the first place, the codes can be specified by the user by listing them in a codelist file. Alternatively, such a codelist, or rather the part that appears in the data, can be determined by µ-ARGUS by a run through the microdata. Hierarchical variables can be truncated, that is, some digits of the codes can be chopped off and still yield a meaningful code. In this way one can replace an (n+1)-digit code by an n-digit code by chopping off the least significant, (n+1)th, digit. A variable can have two missing values. The first one is the one that is used by µ-ARGUS when it locally suppresses a value of this variable. The second missing value is only used by µ-ARGUS to know that this value should not be used as a regular (i.e. non-missing) value. In many surveys two missing values are used, for 'don't know' and 'refusal'.
4.3.2 Specify Combinations
[Screenshot: Select combinations dialog box with the list of variables]
27. approach. Because the traditional risk model and the new approach cannot be applied to the same identifying variables, all overlapping combinations are removed when a table is used for the individual risk model.
4.4 Modify
4.4.1 Modify|Show Tables
[Screenshot: Overview of table collection window, with a Select variable box and a list of two-way tables involving REGION]
In this window you can see which tables have been generated and how many unsafe cells each table has. By default only the tables with unsafe cells are shown. You can see all tables when you check 'Show all tables'. If you are only interested in one variable, you can use the Select variable box. The Close button closes the dialog box.
4.4.2 Global Recode
Global recoding, in conjunction with local suppression, is the main option for protecting individual information in a data file.
[Screenshot: Global Recode dialog box]
28. [Screenshot: dialog box listing categories and a bandwidth option]
3.2.3 Risk approach
If for certain combinations the individual risk approach has been selected, then you can specify the level of risk. All records with an individual risk above this level will be treated as unsafe. Local suppression will then be applied in the final stage of µ-ARGUS. Of course it is possible to apply global recoding and re-inspect the risk chart. The level of risk acceptable for making a datafile available depends on the usage of the file: whether it goes to a research institute or is meant as a public use file. When the graph is shown, the slider can be used to adjust the risk level; the number of unsafe records is adjusted continuously as well. If more than one risk table has been specified, you are first asked for which table you want to specify the risk threshold.
[Screenshot: Risk Chart for REGION x SEX x AGE x MARSTAT, cumulative chart; risk threshold 0.002675, unsafe records: 1716]
3.2.4 Numerical variables
For numerical variables you have the possibility of modifying them here. You can simply round the variable or apply top/bottom coding. Top/bottom coding means that al
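The risk-threshold rule of section 3.2.3 (every record whose individual risk lies above the chosen level is treated as unsafe) can be illustrated with a short sketch. The function name and the list of risks are hypothetical; only the two threshold values are taken from the risk charts shown in this manual.

```python
def unsafe_records(risks, threshold):
    """Return the indices of records whose individual risk exceeds the threshold."""
    return [i for i, risk in enumerate(risks) if risk > threshold]

# Hypothetical individual risks, one per record.
risks = [0.0004, 0.0031, 0.0100, 0.0009, 0.0250]

strict = unsafe_records(risks, 0.002675)   # lower threshold: more unsafe records
lenient = unsafe_records(risks, 0.009831)  # higher threshold: fewer unsafe records
```

Moving the slider corresponds to re-evaluating this count for a new threshold, which is why the number of unsafe records changes as the slider moves.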
29. a certain variable has already been recoded so much, or if a user thinks that for a certain variable the imputation of a missing value is very undesirable. Other options concern the treatment of the household identifier: keep it as it is, change it into a simple sequence number, or remove it. It is also possible to change the order of the records in the output file.
[Screenshot: Make the protected file dialog box]
Then you press Make File. The program then makes the new microdata file: it applies the global recodings, replacements and truncations as ordered, and deletes the last unsafe cells by locally suppressing categories in single records or sometimes, in the case of a household variable, in all records belonging to the same household. If there is a weight variable, it will be randomised if requested. If there is a household identifier, there is an option to keep this identifier in the datafile or to change it into a sequence number. Please bear in mind that this variable can be very sensitive. The suggested extension of the filename is .saf. The program also makes a new metafile, which is the same as for the original microdata. Specific meta information for µ-ARGUS will be removed. For the new metafile the extension is .rds.
30. a left or right value. This means every code less or greater than the given code. In the first example below, the new category 1 will contain all the codes less than or equal to 49, and code 4 will contain everything larger than or equal to 150.
Example: for a variable with the categories 1 to 182 a possible recode is:
1: - 49
2: 50 - 99
3: 100 - 149
4: 150 -
For a variable with the categories 01 to 10 a possible recode is:
1: 01 , 02
2: 03 , 04
3: 05 - 07
4: 08 , 09 , 10
Don't forget the colon; if you forget it, the recode will not work. The recode 3: 05 - 07 is the same as 3: 05 , 06 , 07; you can choose what you like best.
Truncate
If you click Truncate, this dialog box will pop up.
[Screenshot: Truncate dialog box with a Number of Digits field]
You can specify how many digits you want to truncate. If you have applied a truncation and you want to truncate another digit, you have to fill in 2 digits in the pop-up dialog box. This is because the recodings are always applied to the original categories.
Codelist
A new codelist can be selected for the recoded variable. Note that, similar to the codelists in the metadata file (RDA), this codelist is only used for enhancing the information reported on a variable in the main window of µ-ARGUS.
Missing values
When a new recoding scheme is selected, a new set of missing values can also be specified. If these boxes are empty, the original missing values will still be used.
Warning
In the warni
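The remark that truncation always refers to the original categories can be made concrete with a small sketch. The function name and the example code "1234" are hypothetical; the point is only that truncating "one more digit" means asking for 2 digits on the original code, not 1 digit on the already truncated code.

```python
def truncate(original_code: str, digits: int) -> str:
    """Chop the given number of least significant digits off the ORIGINAL code."""
    if digits < 0 or digits >= len(original_code):
        raise ValueError("number of digits must leave at least one digit")
    return original_code[: len(original_code) - digits]

original = "1234"               # hypothetical hierarchical code
first = truncate(original, 1)   # one digit removed
second = truncate(original, 2)  # 'another digit': specify 2, still on the original
```

Writing the slice as `len(original_code) - digits` rather than `-digits` keeps the case of zero digits well defined.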
31. an unknown respondent: Place of residence = Urk, Sex = Female and Occupation = Statistician. Urk is a small fishing village in the Netherlands, in which it is unlikely for many statisticians to live, let alone female statisticians. So when we find a statistician in Urk, a female one moreover, in the microdata set, then she is probably the only one. When this is indeed the case, anybody who happens to know this rare female statistician in Urk is able to disclose sensitive information from her record, if such information is contained in this record. There is a practical problem when applying the above rule that the occurrence in the data file of combinations of scores that are rare in the population should be avoided. Namely, it is usually not known how often a particular combination of scores occurs in the population. In many cases one only has the data file itself available to estimate the frequency of a combination of scores in the population. In practice one therefore uses the estimated frequency of a key value k to determine whether or not this key value is safe. When the estimated frequency of a key value, i.e. a combination of scores, is at least equal to the threshold value, then this combination is considered safe. When the estimated frequency of a key value is less than the threshold value $D_k$, then this combination is considered unsafe. An example of such a key is Place of residence
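The threshold rule described above can be sketched as follows. This is an illustration under assumptions: the function name and the records are hypothetical, and the unweighted count in the file is used as the estimated frequency of each key value.

```python
from collections import Counter

def unsafe_combinations(records, key_vars, threshold):
    """Return the key values whose frequency in the file is below the threshold."""
    counts = Counter(tuple(record[v] for v in key_vars) for record in records)
    return {combo: n for combo, n in counts.items() if n < threshold}

# Hypothetical records, for illustration only.
records = [
    {"residence": "Urk", "sex": "F", "occupation": "Statistician"},
    {"residence": "Amsterdam", "sex": "F", "occupation": "Statistician"},
    {"residence": "Amsterdam", "sex": "F", "occupation": "Statistician"},
    {"residence": "Amsterdam", "sex": "F", "occupation": "Statistician"},
]
unsafe = unsafe_combinations(records, ["residence", "sex", "occupation"], threshold=2)
```

With a threshold of 2, only the combination that occurs once in the file is flagged as unsafe.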
32. ands we have developed some guidelines for the set of tables to be inspected. From version 3.1 a new approach for the risk has been included. This will be described in the next sections.
2.3.2 Notation
An individual risk of disclosure makes it possible to estimate, for each record in the released file, a measure of the chance of identification on the basis of the actual values observed on the public variables. In the last few years a number of proposals have been made. Fienberg and Makov (1998) and Skinner and Holmes (1998) propose, with different motivations, a log-linear model for the estimation of the individual risk. Benedetti and Franconi (1998) propose a methodology for individual risk estimation based on the sampling weights. In order to briefly describe the above methodology we present the notation used. Let the released file be a random sample of size n selected from a finite population of N units. For the generic unit i in the population we denote as w its probability to be included in the sample. For each record i the released file contains a set of key variables (i.e. variables that allow identification and are accessible to the public) and the sensitive variables. Under the hypothesis that the key variables are discrete, a situation which is classical in household surveys and in population censuses, we can focus the analysis on each of the k = 1, ..., K subpopulations defined by all the possible combinations of values of such variables (note that the maximum
33. ble click the µ-ARGUS icon, the program starts. To start the disclosure control with µ-ARGUS you need a microdata file and the metadata describing this microdata file. The microdata file must be a fixed-format ASCII file. If you click File|Open microdata, you can specify the name of the microdata file. The program assumes the extension .asc, but you can use your own extension.
[Screenshot: Open micro data dialog box with Microdata and Metadata file name fields]
The metadata describing this ASCII file is stored in a separate file. For this file the program assumes the extension .rda (Record Description for Argus), but you can select another extension. If no metadata file is specified, the program has the facility to specify the metadata interactively via the menu option Specify|Metafile. In this section we will give a description of the metadata file for µ-ARGUS. When you enter or change the metadata interactively using µ-ARGUS, the option Specify|Metafile will bring you to this screen:
[Screenshot: Specify metadata dialog box]
34. ct. The current version builds further on the µ-ARGUS version that was produced in the fourth framework SDC project. The purpose of the present manual is to give a potential user enough information to understand the general principles on which µ-ARGUS is based, and also to allow him to apply the package. So it contains both general background information and detailed program information. µ-ARGUS is one part of a twin, the other one being named τ-ARGUS. τ-ARGUS is a tool to help produce safe tables. Although the twins look quite different from the inside, they have a lot in common. In a sense they are Siamese twins, since their bodies have various organs in common. Nevertheless, a description of τ-ARGUS cannot be found in the present manual, but in a separate one.
About the name ARGUS
Somewhat jokingly, the name ARGUS can be interpreted as the acronym of 'Anti-Re-identification General Utility System'. As a matter of fact, the name ARGUS was inspired by a myth of the ancient Greeks. In this myth Zeus has a girlfriend named Io. Hera, Zeus' wife, did not approve of this relationship and turned Io into a cow. She let the monster Argus guard Io. Argus seemed to be particularly well qualified for this job, because it had a hundred eyes that could watch over Io. If it fell asleep, only two of its eyes were closed. That would leave plenty of eyes to watch Io. Zeus was eager to find a way to get Io back. He hired Hermes, who could make Argus fal
35. [...]
2.4.3 Base Individual Risk computation
2.4.4 Frequencies calculation
2.4.5 Selection of the threshold
2.4.6 Protection phase and creation of safe file
2.5 Information loss
2.6 Sampling weights
2.7 Household variables
2.8 Functional design of µ-ARGUS
3 A tour of µ-ARGUS
3.1 Preparation
Specify tables set
3.2 The process of Disclosure Control
3.2.1 Global Recoding
3.2.2 PRAM specification
3.2.3 Risk approach
3.2.4 Numerical variables
3.3 Local suppression and making a safe microdata file
4 Description of Menu Items
4.1 Window
4.2 The File Menu
4.2.1 File|OpenMicrodata
36. e all the variables defined by the user as key. Suppose we have a file composed of 8 units:

UnitID  Key_Var1  Key_Var2  Key_Var3  Key_Var4  w_i    f_k(i)  F_k(i)
1       1         2         5         1         18     2       110
2       1         2         1         1         45.5   2       84.5
3       1         2         1         1         39     2       84.5
4       3         3         1         5         17     1       17
5       4         3         1         4         541    1       541
6       4         3         1         1         8      1       8
7       6         2         1         5         5      1       5
8       1         2         5         1         92     2       110

With k(i) = k we denote the sub-population defined by the combination of categories of the key variables (string) in the unit i. In our example there are 6 sub-populations, and units 1 and 8 belong to the same sub-population, identified by the string (1, 2, 5, 1). With $f_k$ we represent the frequency count of units in the sub-population that are present in the sample, i.e. in the file. The estimate of these frequencies in the population, $\hat{F}_k$, is given by the sum of the weights associated with the units belonging to that sub-population: $\hat{F}_k = \sum_{i:\,k(i)=k} w_i$. In the example above we get $\hat{F}_{k(1)} = w_1 + w_8 = 18 + 92 = 110$ and $f_{k(1)} = 1 + 1 = 2$. A problem may arise if there are missing values in the key variables. Actually, a missing value could stand for any of the possible categories of the variable considered. Thus, in our opinion, computation of the $f_k$ should take this into account. Consider the set of strings, or combinations, which are compatible with the one characterising
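The two frequencies of the example can be recomputed by grouping the units on their key string. The sketch below is illustrative: the function name is hypothetical, and the row for unit 4 is a reading of a partly garbled table row, so it should be taken as an assumption.

```python
from collections import defaultdict

# (unit id, key string, sampling weight), as read from the example table
units = [
    (1, (1, 2, 5, 1), 18.0),
    (2, (1, 2, 1, 1), 45.5),
    (3, (1, 2, 1, 1), 39.0),
    (4, (3, 3, 1, 5), 17.0),   # assumption: row partly unreadable in the source
    (5, (4, 3, 1, 4), 541.0),
    (6, (4, 3, 1, 1), 8.0),
    (7, (6, 2, 1, 5), 5.0),
    (8, (1, 2, 5, 1), 92.0),
]

def sample_and_population_frequencies(units):
    """Return (f, F): sample counts and summed weights per key string."""
    f = defaultdict(int)
    F = defaultdict(float)
    for _, key, weight in units:
        f[key] += 1
        F[key] += weight
    return dict(f), dict(F)

f, F = sample_and_population_frequencies(units)
```

For the string (1, 2, 5, 1) this gives a sample frequency of 2 and an estimated population frequency of 110, as in the text.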
37. e released with the same combination of categories of the key variables are identical for the intruder in terms of uncertainty in making an identification, so that the risk of each record in k(i) is equal to the risk of the record i. So we define with $r_k$ the risk of each record in the domain k(i), and we have $r_i = r_k$. The aim is thus to find, for each record, a reasonable estimate of such a function. The information given by the sample in terms of identification for the record i consists of the frequency $f_{k(i)}$. So we have [...]. Consider the worst case, when the intruder has the whole population available as external information. As a consequence, if the unit i of the sample is unique, then [...] 1; if two units i and i' exist in the population presenting the same combination of categories of the key variables, and at least one of these units is included in the sample, then, if we suppose that the intruder performs a probabilistic linkage, we obtain [...] 1/2. If the same probabilistic reasoning is applied to higher values of $F_k$, we obtain
$r_k = \sum_{h \ge f_k} \frac{1}{h}\, P(F_k = h \mid f_k)$.
In order to simplify the notation, in what follows we omit the subscript k(i). The idea is to consider F as a random variable distributed according to a negative binomial distribution with f successes and probability of success p. Under such a framework the risk of disclosu
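To make the idea of a risk defined as a sum over possible population frequencies concrete, the sketch below evaluates $\sum_{h \ge f} \frac{1}{h} P(F = h \mid f)$ numerically. It rests on assumptions that the excerpt does not confirm: the negative binomial is taken in the form $P(F = h \mid f) = \binom{h-1}{f-1} p^f (1-p)^{h-f}$ for $h \ge f$, the probability p is simply passed in as a parameter, and the function name is hypothetical.

```python
def individual_risk(f, p, tol=1e-10, max_terms=1_000_000):
    """Numerically sum (1/h) * P(F = h | f) over h >= f.

    Assumes P(F = h | f) = C(h-1, f-1) * p**f * (1-p)**(h-f), i.e. F is the
    number of trials needed to reach f successes (an assumed parameterisation).
    """
    if f < 1 or not 0.0 < p <= 1.0:
        raise ValueError("need f >= 1 and 0 < p <= 1")
    pmf = p ** f          # P(F = f | f)
    mass = 0.0            # probability accumulated so far
    risk = 0.0
    h = f
    for _ in range(max_terms):
        risk += pmf / h
        mass += pmf
        if 1.0 - mass < tol or pmf == 0.0:
            break
        # ratio of successive terms: C(h, f-1) / C(h-1, f-1) = h / (h - f + 1)
        pmf *= h / (h - f + 1) * (1.0 - p)
        h += 1
    return risk
```

Under this assumed form, a sample unique (f = 1) with p = 0.5 has a risk of about 0.69, and with p = 1 (the sample is the whole population) the risk reduces to 1/f.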
38. either less detailed information is provided or some information is not given at all. A balance between global recoding and local suppression has to be found in order to make the information loss due to SDC measures as low as possible. It is recommended to start by recoding some variables globally until the number of unsafe combinations that have to be protected by local suppression is sufficiently low. The remaining unsafe combinations have to be protected by suppressing some values.
2.6 Sampling weights
So far we have only discussed the most direct form of re-identification, namely that of an intruder recognising an individual. Sometimes a more indirect form of disclosure is possible. This is for instance the case if a microdata file contains certain explicit information that allows an intruder to deduce other information that a data releaser would not like to release. It is not so much the danger of a privacy threat that drives the prevention of this type of disclosure, but rather the prevention of an embarrassing situation for the data releaser. For it is somewhat embarrassing if information can be derived from the data that is not supposed to be available. An example of this is post-sampling weights. Such weights are often included in a microdata file so as to correct for all sorts of defects in a sample due to selective response and over- or undersampling of certain subpopulations. However, knowledge of the procedure to calculate such weights, together wit
39. eographic information. If we want to release the regional variable at municipality level, but the sampling design contains this variable at province level, the estimate of the risk using the municipality information contained in the microdata file is not significant. As a result the estimate of the risk is not reliable. Great care should therefore be taken over the level of detail at which the key variables contained in the sampling design are released.
2.4 Application of Risk model approach
2.4.1 Introduction
The purpose of the present chapter is to describe how to use the methodology of individual risk to produce a safe file. In the next section we will describe the application of this methodology at a very general, descriptive level. In section 2.4.3 the estimation of the individual risk is presented. Section 2.4.4 explains how to evaluate the frequencies of combinations of key variables in the sample and discusses the estimation of these frequencies in the population, F. These two processes will also be described in the presence of missing values. In section 2.4.5 we show how to choose the threshold value necessary for the protection step. Section 2.4.6 describes the protection phase and how to produce a safe file.
2.4.2 An overview of the use of Individual Risk
Our approach is based on the need to handle sample data; the data file therefore does not include the whole population but a subset of it, and each unit in the file represents one or more un
40. h some population data (stratum frequencies), can be supposed to be generally known. With this information an intruder might in some cases be able to re-identify the stratum to which a respondent belongs. Knowing the stratum means in fact that certain attributes that define the stratum are known to an intruder. This could for example be a region. If it were the policy of the data releaser not to release any regional data in the file, this would be an embarrassment. This can be avoided in various ways. It is possible to draw a suitable subsample of records from the microdata set, such that the weights for this new sample are not very different. It is also possible to add some noise to the weights in order to hamper the stratum re-identification procedure. In µ-ARGUS it is possible to add noise to the weights.
2.7 Household variables
Some identifiers necessarily yield the same scores for all members of the same household. Such variables are called household variables. Examples of such variables are the size of the household and the occupation of the head of the household. If a household variable determines the households in the population uniquely, it is called a household key. Household variables in a microdata set containing records pertaining to different persons in a household, some of whom are represented in this file, may allow an intruder to group the records of persons belonging to the same household. Th
41. he work by Skinner and Holmes and extended by Luisa Franconi and colleagues at Istat. A full description can be found in sections 2.3 and 2.4. Here you can specify the threshold level for the risk model: records above the chosen risk level are considered unsafe. When generating a safe file in the final stage of µ-ARGUS, the combination of variables on which the risk model is based is considered unsafe and a missing value will be imputed. Use the slider to set a risk level. You will see at the same time the number of records that will be considered unsafe when you fix this risk level. When you have fixed a risk level, you can still go back, select another global recoding and inspect the risk chart again.
[Screenshot: Risk Chart for REGION x SEX x AGE x MARSTAT x KINDPERS, cumulative chart; risk threshold 0.009831, unsafe records: 1117]
4.4.5 Modify numerical variables
Several modifications to a numerical variable are possible:
- Top/bottom coding
- Rounding
- Add noise to the weight variable
Top and bottom coding is a technique to protect the extreme values of a distribution. Typically it is applied to numerical variables. All values above, respectively below, a certain threshold value will be replaced by another value. In this stage of using µ-ARGUS only the instructions will be stored
42. iables must be suppressed. This set is chosen in such a way that all unsafe combinations are protected, but the total information loss is minimised. Note that if a household variable needs to be suppressed in a record, it is suppressed in all records of the same household. If there is a household identifier, there are three possibilities:
- Do not change
- Change the household identifier into a simple sequence number
- Remove the household identifier completely from the data record
Finally, there is an option to write the records in a random order. This could be done to prevent easy linking to other data files. Pressing the Make File button will start the writing process. First you will have the opportunity to specify the name of the output file. A report file will automatically be written at the same time. The name of the report file is the same as that of the output file, but with the extension .html. This HTML format allows you to inspect the report later with a browser. When the datafile is ready, however, µ-ARGUS will show the report directly. In any case, you will now have a safe file, safe according to the modifications that you have applied.
4.5.2 Output|View Report
Views the report file which has been generated with Output|Make suppressed file.
[Screenshot: View Report window showing the µ-ARGUS report]
43. identifier is one that may help an intruder re-identify an individual. Typically an identifying variable is one that describes a characteristic of a person that is observable, that is registered (identification numbers, etc.), or generally that can be known to other persons. This, of course, is not very precise, and relies on one's personal judgement. But once a variable has been declared identifying, it is usually a fairly mechanical procedure how to deal with it in µ-ARGUS. In order to develop a sort of theory of re-identification, one should have a disclosure scenario. Such a scenario is essentially a re-identification model, in which it is stated what sort of information an intruder is supposed to have, how he uses this information to re-identify individuals represented in a given microdata set, what sort of disclosures such a person is likely to make, and what motives he has for making such disclosures (he might want to know something about a particular individual, but his motive could also be to embarrass the data collector/releaser). More formally, such a disclosure scenario may lead to a disclosure risk model, i.e. a statistical model that gives a probability that disclosures are being made by an intruder, with an assumed level of knowledge of the population and for a given data file. Such disclosure risk models can range from a fairly simple, intuitive model to a [...] Leon Willenborg and Ton de Waal (1996), Statistical Disclosure Control in
44. ile
[Screenshot: Make the protected file dialog box, with suppression options (no suppression, use weights, use entropy), a suppression weight per variable, household identifier options, and a 'Write records in random order' check box]
When actually writing the safe file, the data manipulations are applied and the remaining unsafe combinations are protected by local suppression, i.e. certain values are replaced by the first missing value. Which variables are selected is a small optimisation problem. If there is only one unsafe combination, the variable with the lowest information loss is chosen. To calculate the information loss there are two options. Either you select 'use weights', in which case you are free to assign an information loss (suppression) weight to each variable; the variable with the lowest information loss is then suppressed. The alternative option is to use an entropy function; the variable with the lowest value of the entropy function will then be suppressed. This entropy H(X) is defined as
$H(X) = -\frac{1}{N} \sum_{x \in X} f(x) \log \frac{f(x)}{N}$,
where f(x) is the frequency of category x of variable X and N the total number of records. In the case of more than one unsafe combination in a record, a set of var
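The entropy-based choice for a single unsafe combination can be sketched as follows. The sketch assumes the usual Shannon form of the entropy (with natural logarithms) for a variable with category frequencies f(x) out of N records; the function names and the example records are hypothetical.

```python
from collections import Counter
from math import log

def entropy(values):
    """Entropy of one variable: -(1/N) * sum over categories of f(x) * log(f(x)/N)."""
    n = len(values)
    return -sum(count * log(count / n) for count in Counter(values).values()) / n

def variable_to_suppress(records, candidates):
    """Among the variables of an unsafe combination, pick the one with lowest entropy."""
    return min(candidates, key=lambda v: entropy([r[v] for r in records]))

# Hypothetical records, for illustration only.
records = [
    {"SEX": "M", "REGION": "A"},
    {"SEX": "F", "REGION": "B"},
    {"SEX": "M", "REGION": "C"},
    {"SEX": "F", "REGION": "D"},
]
choice = variable_to_suppress(records, ["SEX", "REGION"])
```

In this example SEX has two equally frequent categories and REGION has four, so SEX has the lower entropy and would be the variable suppressed.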
45. in that their data will be used with the greatest care, and in particular will not jeopardise their privacy. So statistical use of the data by outside users should not lead to a compromise of confidentiality. However, making sure that microdata cannot be misused for disclosure purposes requires, generally speaking, that they should be less detailed, or modified in another way that hampers disclosure. This is in direct conflict with the wish of researchers to have data that are as detailed as possible. Much detail allows not only more detailed statistical questions to be answered, but also more flexibility: the user can lump together categories in a way that suits his purposes best. The field of statistical disclosure control in fact feeds on this trade-off: how should a microdata set be modified in such a way that another one is obtained with acceptable disclosure risk and with minimum information loss? And how exactly can one define disclosure risk? And how should one quantify information loss? And once these problems have been solved, no matter how provisionally, the question is how all this wisdom can actually be applied in the case of real microdata. If a certain degree of sophistication is reached, the conclusion is inescapable: specialised software is needed to cope with this problem, and µ-ARGUS is such software. In order to be able to use µ-ARGUS one should understand the ideas about disclosure risk, information loss, etc. on which it is based. Therefore
46. ing these two categories. If a hyphen is only placed before or after a category, this refers to all categories before or after and including this category, respectively.
Example: the line 7: - 4 , 6 , 8 - 10 , 13 - means that the categories 0, 1, 2, 3, 4, 6, 8, 9, 10, 13, 14, ... (if present) are recoded as 7.
Between the two windows there are five buttons:
1. Read allows you to read a previously made recode file into the program.
2. Apply applies the recode in the file to the tables. This will add the corresponding cells. If the cell was unsafe, the new cell might exceed the Safe Limit and become safe.
3. Truncate: variables which are preceded by an asterisk in the left window can be truncated. This means that you are allowed to chop off digits from the end of the code for the categories. When you click on Truncate, you can specify how many digits must be chopped off. This number always applies to the original codes; if you want to truncate the same variable twice, each time by one digit, you have to fill in 2 the second time.
4. Undo will undo any recodes, truncations and replacements performed on a variable. This brings the variable back to its original state.
5. Close will close the dialog box, bring you back to the main window of µ-ARGUS and show the updated overview of the unsafe combinations per variable.
3.2.2 PRAM specification
Post Randomisation is a technique where noise is deliberately added to categorical variables. See
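The meaning of hyphens in a recode line can be illustrated with a small parser. This is a hypothetical sketch, not the µ-ARGUS implementation: it assumes that old categories are separated by commas, that codes are non-negative integers, and that the new code stands before the colon.

```python
def parse_recode_line(line):
    """Parse 'new: spec' into (new_code, [(low, high), ...]); None means open-ended."""
    new_code, _, spec = line.partition(":")
    if not spec.strip():
        raise ValueError("recode line needs a colon followed by categories")
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (s.strip() for s in part.split("-", 1))
            ranges.append((int(low) if low else None, int(high) if high else None))
        else:
            ranges.append((int(part), int(part)))
    return new_code.strip(), ranges

def is_recoded(code, ranges):
    """True if the old code falls in one of the parsed ranges."""
    return any((low is None or code >= low) and (high is None or code <= high)
               for low, high in ranges)

new_code, ranges = parse_recode_line("7: - 4 , 6 , 8 - 10 , 13 -")
recoded = [c for c in range(0, 16) if is_recoded(c, ranges)]
```

For old codes 0 to 15 the sketch selects 0 to 4, 6, 8 to 10 and everything from 13 upwards, which matches the example line. A line without a colon is rejected, in the spirit of the warning not to forget it.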
47. is is an extra disclosure risk that threatens a microdata set containing such variables. The increased risk follows from the fact that households might be more easily recognisable than individual persons. µ-ARGUS is able to deal with household variables. This means that if a value of a household variable in a record is suppressed, then the corresponding values are suppressed in the records referring to the other members of the same household.
2.8 Functional design of µ-ARGUS
[Figure: flowchart of the functional design of µ-ARGUS. Inputs: metadata (microdata description) and microdata. Steps: specify table set; generate tables and marginals; identify unsafe combinations (threshold method); apply base individual risk model; recode; local suppression; generate safe data. Outputs: updated data description, safe microdata, disclosure report]
3 A tour of µ-ARGUS
This section will give the reader a brief introduction to the use of µ-ARGUS. Some Windows experience is assumed. In this section we will use a demo dataset called DEMODATA, which is supplied with the µ-ARGUS software. In section 4 a more systematic description of the different parts of µ-ARGUS will be given. Please note that all options in µ-ARGUS can be accessed via the drop-down menu, but also via the toolbar. In this tour we will use the drop-down menu.
3.1 Preparation
When you dou
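The household rule stated in section 2.7 above (suppressing a household variable in one record also suppresses it in the records of the other household members) can be sketched like this. All names and records are hypothetical, and the missing value code is passed in as a parameter.

```python
def suppress_household_variable(records, hh_id_var, variable, record_index, missing):
    """Suppress `variable` in the given record and in all records of its household."""
    household = records[record_index][hh_id_var]
    return [
        {**record, variable: missing} if record[hh_id_var] == household else dict(record)
        for record in records
    ]

# Hypothetical records: HH_ID identifies the household, HH_SIZE is a household variable.
records = [
    {"HH_ID": 1, "HH_SIZE": 7, "AGE": 41},
    {"HH_ID": 1, "HH_SIZE": 7, "AGE": 39},
    {"HH_ID": 2, "HH_SIZE": 2, "AGE": 67},
]
protected = suppress_household_variable(records, "HH_ID", "HH_SIZE", 0, missing=99)
```

The function returns new records and leaves the input untouched; both members of household 1 receive the missing value, while household 2 is unchanged.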
48. its of the population through its individual weight. It is therefore necessary to have this variable in the input data set. Our method estimates the level of disclosure risk for each unit, defined as the probability of identifying an individual (for the definition of identification see section 2.3). Once the user has chosen the key variables and selected the individual weight variable, it is possible to calculate the individual risk. A schematic representation of the steps to evaluate this risk is given in Figure 2.4.1. After the application of the risk calculation algorithm, each record i will have its own value of the disclosure risk associated with it. At this point the user will input a threshold α that he considers the maximum tolerable risk. This choice can be based on the graph representing the distribution of the individual risk in the file, provided by µ-ARGUS. Once α has been selected, µ-ARGUS applies local suppression only to those records i whose risk is greater than α, following a suppression method that minimises the number of local suppressions. At this point µ-ARGUS shows a report that includes all the information relative to the protection process: the level of the threshold fixed by the user, the number of suppressions for each variable calculated by the protection algorithm, and the record description of the new file. After the protection step, the whole risk calculation algorithm can be run again, producing the c
49. l asleep by the enchanting music on his flute. When Hermes played his flute to Argus this indeed happened: all its eyes closed, one by one. When Hermes had succeeded in making Argus fall asleep, Argus was decapitated. Argus' eyes were planted onto a bird's tail, a type of bird that we now know under the name of peacock. That explains why a peacock has these eye-shaped marks on its tail. This also explains the picture on the cover of this manual. It is a copperplate engraving by Gerard de Lairesse (1641-1711) depicting the process where the eyes of Argus are being removed and placed on the peacock's tail. Like the mythological Argus, the software is supposed to guard something, in this case data. This is where the similarity between the myth and the package is supposed to end, as we believe that the package is a winner and not a loser as the mythological Argus is. Contact Feedback from users will help improve future versions of µ-ARGUS and is therefore greatly appreciated. Suggestions for improvements can be sent to argus@cbs.nl. Acknowledgements µ-ARGUS has been developed as part of the CASC project that was partly sponsored by the EU under contract number IST-2000-25069. This support is highly appreciated. The CASC (Computational Aspects of Statistical Confidentiality) project is part of the Fifth Framework of the European Union. The main part of the ARGUS software has been developed at Statistics Netherlands by Aad van de Wetering, who
50. l values above/below a certain threshold are replaced by a given value. For rounding, the rounding base should be supplied. Additionally, for the weight variable you can add noise. All these modifications only take place when the safe datafile is generated. At this stage the information is only stored. [Screenshot: Modify numerical variables dialog, with a variable list (WEIGHT, ASSETS, DEBTS) and fields for bottom coding and top coding (value and replacement), rounding base, and weight noise percentage.] 3.3 Local suppression and making a safe microdata file When the user is satisfied with the set of global recodings, it is time to solve the remaining unsafe combinations with local suppression. For this you select the suppressed-file option. To protect the remaining unsafe combinations, µ-ARGUS will suppress certain values, i.e. impute a missing value. If there is more than one unsafe combination in a record and the unsafe combinations have a variable in common, then µ-ARGUS will suppress the common variable. If not, µ-ARGUS has to choose one of the variables, minimising the loss of information. This information loss can be based either on the entropy function or on a user-specified priority function. The user-defined priority function can be used e.g. if
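The selection rule for local suppression in excerpt 50 can be illustrated with a minimal Python sketch. It is not the algorithm µ-ARGUS uses: the function name `choose_suppressions`, the `weight` dictionary (standing in for a user-specified priority function, where a lower weight means the variable is suppressed first) and the greedy fallback are all assumptions made for illustration.

```python
def choose_suppressions(unsafe_combinations, weight):
    """Pick variables to suppress in ONE record (illustrative sketch only).

    unsafe_combinations: iterable of tuples of variable names that are unsafe
    in this record. weight: hypothetical priority per variable; the variable
    with the lowest weight is considered the cheapest to suppress.
    """
    combos = [set(c) for c in unsafe_combinations]
    if not combos:
        return set()
    common = set.intersection(*combos)
    if common:
        # All unsafe combinations share a variable: suppressing it protects
        # every combination with a single suppression.
        return {min(sorted(common), key=weight.__getitem__)}
    chosen = set()
    for combo in combos:
        if combo & chosen:
            continue  # already protected by an earlier choice
        # Greedy fallback (an assumption, not guaranteed to be optimal).
        chosen.add(min(sorted(combo), key=weight.__getitem__))
    return chosen
```

With two unsafe combinations that share REGION, the sketch returns only REGION; with disjoint combinations it picks the cheapest variable from each.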
51. ld suppress the value of Place of residence instead of the value of Occupation in the above example to protect the unsafe combination. A local suppression is only applied to a particular value. When, for instance, the value of Occupation is suppressed in a particular record, this does not imply that the value of Occupation has to be suppressed in another record. The freedom that one has in selecting the values to be suppressed allows one to minimise the number of local suppressions. 2.2.3 Top and bottom coding Global recoding is a technique that can be applied to general categorical variables, i.e. without any requirement on the type. In the case of ordinal categorical variables one can apply a particular global recoding technique, namely top coding (for the larger values) or bottom coding (for the smaller values). When, for instance, top coding is applied to an ordinal variable, the top categories are lumped together to form a new category. Bottom coding is similar, except that it applies to the smallest values instead of the largest. In both cases the problem is to calculate tight threshold values. In the case of top coding this means that the threshold value should be chosen in such a way that the top category is as small as possible, under the condition that this category is large enough to guarantee safety. In other See P.P. de Wolf, J.M. Gouweleeuw, P. Kooiman and L.C.R.J. Willenborg, Reflections on PRAM, P
52. [Screenshot: Specify metadata dialog.] In a first tour you should leave the fields as they are, but we explain the meaning of these fields here. In the left pane the available variables of the dataset are shown. In the other fields the various attributes relevant to the disclosure control process are shown. The position in the datafile will be familiar. The role of the variable: • Identifier: the unique identifier of a household. • HH Variable: a variable that by nature has the same value for each member of a household. • Weight: the variable is a sampling weight. • Categorical: a categorical variable can be used as a spanning variable in a table. For µ-ARGUS this variable can be defined as identifying. • Numerical: a numerical variable can be used for top/bottom coding and rounding. The identification level is an option to easily generate the set of tables to be inspected in the disclosure control process. The contents of these sets are based on certain rules used by Statistics Netherlands, but they could easily be used in other situations as well. This procedure is further explained in the next section. 0: an individual cannot be identified by this variable and it will not play a role in the disclosure control process; 1: the variable is most identifying; 2: the variable is
53. [Screenshot: report listing the original and safe data and meta files, and the global recodings that have been applied (REGION).] 4.6 Help 4.6.1 Help|Contents Shows the contents page of the help file. This program has context-sensitive help. Whenever you press F1, you get information about the stage of the program you are in. 4.6.2 Help|About Shows the About box. [Screenshot: About µ-ARGUS box.]
54. more identifying; 3: the variable is identifying. The codelists of the variables used to span the tables are always generated by µ-ARGUS itself. However, the user can supply the name of a codelist file. The labels in this codelist file are then used when displaying information on this variable. Besides the name of the codelist file, it is possible to indicate whether truncation is a feasible way of recoding. This is the case when the codelist is hierarchical. One or two missing values can be specified per variable. Missing values play a specific role in the SDC process, as missing values will be imputed when local suppression is applied. Note that the weight variable cannot have a missing value. The SDC file stores the information in the format of a keyword enclosed by < >, followed by the relevant parameters: • [keyword not recoverable from the extracted text]: this variable may be recoded. • <CODELIST>: name of the codelist file. • <IDLEVEL>: identification level. • <TRUNCABLE>: the codelist is hierarchical and therefore removing one or more digits is a relevant way of recoding. • <NUMERIC>: the variable is numeric and rounding etc. is allowed. • <DECIMALS>: the number of decimal positions for a numeric variable. • <WEIGHT>: the variable contains the sample weights. Randomising the weight is an option. • <HOUSE_ID>: This variable
55. n the metadata description the variable has been designated as hierarchical. 2. If you apply a global recoding or truncate a variable, the colour of the variable changes to red and an R or T is shown in the first column of the list window. 3. If no global recode file is available, you can make one yourself or read an existing one. You can also edit the recoding. 4. In the edit box Codelist you can specify the name of the file containing the codelist labels for the recoded variable. 5. It is also possible to change the codes for the missing values. 6. If you apply a global recoding, µ-ARGUS shows the result of the recoding in the warning window. This can be an error message showing on which line the error occurred, but it can also be a warning, e.g. that some codes of the base code file have not been recoded. However, this could be the purpose of the global recoding. A recoding scheme Global recoding means that certain categories of a variable are collapsed into one new category. The syntax of the recode file is as follows: each line in the file corresponds with one new category. The code of the new category is placed before a colon. The old categories to be grouped into the new category are placed behind the colon. Single categories are separated by commas, and if a hyphen is placed between two categories it refers to all subsequent categories between and includ
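The recode syntax described in excerpt 55 (new code before the colon, old codes after it, commas between single codes, a hyphen for an inclusive range) can be made concrete with a small parser. This is a sketch under stated assumptions, not µ-ARGUS code: the function names are hypothetical, codes are assumed to be non-negative integers, only closed ranges such as `6-7` are handled, and lines starting with `<` (keyword lines such as `<CODELIST>`) are simply skipped.

```python
def parse_recode_line(line):
    """Parse one recode line of the form '<new code>: <old>, <old>, <a>-<b>'."""
    new_code, _, old_part = line.partition(":")
    old_codes = []
    for item in old_part.split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            # A hyphen denotes every code between the two bounds, inclusive.
            low, high = (int(x) for x in item.split("-", 1))
            old_codes.extend(range(low, high + 1))
        else:
            old_codes.append(int(item))
    return new_code.strip(), old_codes


def build_recode_map(text):
    """Map each old code to its new code for a whole recode scheme."""
    mapping = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("<"):
            continue  # skip blank lines and keyword lines
        new_code, old_codes = parse_recode_line(stripped)
        for old in old_codes:
            mapping[old] = new_code
    return mapping
```

For a made-up scheme such as `1: 1-3` followed by `2: 4, 6-7`, the map sends codes 1 to 3 to the new category 1 and codes 4, 6 and 7 to the new category 2; code 5 is left unrecoded, which is the situation the warning in the excerpt refers to.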
56. ng pane µ-ARGUS reports back on a recoding that has been applied. Sometimes these are only warnings, e.g. that only a part of the codelist has been recoded. Sometimes a severe error has occurred, preventing µ-ARGUS from applying a recoding scheme. Typically the syntax of a recoding scheme has been violated. When the focus is moved to another variable or when this window is closed, µ-ARGUS will ask you whether the recoding scheme must be saved. Note that the changed missing codes and the name of the codelist are also stored in this file. A typical extension is GRC. Example of a GRC file: [example garbled in extraction; its last line is <CODELIST> regiop.cdl] 4.4.3 PRAM specification PRAM is a disclosure control technique that can be applied to categorical data. Basically, it is a form of deliberate misclassification using a known probability mechanism. Applying PRAM means that for each record in a microdatafile the score on one or more categorical variables is changed. See also section 2.2.4. At this stage you can specify the probability of changing the score of a category. First you select a variable to be PRAMmed. In the listbox in the middle you will see all the categories. The probability shown is the probability of not changing a score. You can set all probabilities to a certain percentage with
57. nly to the unsafe part of the set. This is done to obtain a uniform categorisation of each variable. Suppose, for instance, that we recode Occupation in the above way. Suppose furthermore that both the combinations Place of residence = Amsterdam, Sex = Female and Occupation = Statistician, and Place of residence = Amsterdam, Sex = Female and Occupation = Mathematician are considered safe. To obtain a uniform categorisation of Occupation we would, however, not publish these combinations, but only the combination Place of residence = Amsterdam, Sex = Female and Occupation = Statistician or Mathematician. 2.2.2 Local suppression When local suppression is applied, one or more values in an unsafe combination are suppressed, i.e. replaced by a missing value. For instance, in the above example we can protect the unsafe combination Place of residence = Urk, Sex = Female and Occupation = Statistician by suppressing the value of Occupation, assuming that the number of females in Urk is sufficiently high. The resulting combination is then given by Place of residence = Urk, Sex = Female and Occupation = missing. Note that instead of suppressing the value of Occupation one could also suppress the value of another variable of the unsafe combination. For instance, when the number of female statisticians in the Netherlands is sufficiently high, then one cou
58. nt true data can still be estimated from the perturbed data file. Another technique that is closely related to PRAM is the technique called Randomised Response (RR), see e.g. Warner (1965). Where RR is applied before the data are obtained, PRAM is applied after the data have been obtained. Both methods use known probability mechanisms to change scores on categorical variables. RR is used when interviewers have to deal with highly sensitive questions, on which the respondent is not likely to report true values in a face-to-face interview setting. By embedding the question in a pure chance experiment, the true value of the respondent is never revealed to the interviewer. PRAM, the method In this section a short theoretical description of PRAM is given. For a detailed description of the method, see e.g. Gouweleeuw et al. (1998a and 1998b). For a discussion of several issues concerning the method and its consequences, see e.g. De Wolf et al. (1998). Consider a categorical variable $\xi$ in the original microdata file to which PRAM will be applied, and denote the same categorical variable in the perturbed file by $X$. Moreover, assume that $\xi$ (and hence $X$ as well) has $K$ categories, numbered $1, \dots, K$. Define the transition probabilities involved in applying PRAM by $p_{kl} = P(X = l \mid \xi = k)$, i.e. the probability that an original score $k$ will be changed into the score $l$. PRAM is then fully described by a $K \times K$ matrix with entries $p_{kl}$, with $k, l = 1, \dots, K$. Ap
59. of Official Statistics, Vol. 14, 4, 463-478. Gouweleeuw, J.M., P. Kooiman, L.C.R.J. Willenborg and P.P. de Wolf (1998b), The post randomisation method for protecting microdata, Qüestiió, Quaderns d'Estadística i Investigació Operativa, Vol. 22, 1, pp. 145-156. Warner, S.L. (1965), Randomized Response: a survey technique for eliminating evasive answer bias, Journal of the American Statistical Association, Vol. 57, pp. 622-627. De Wolf, P.P., J.M. Gouweleeuw, P. Kooiman and L.C.R.J. Willenborg (1998), Reflections on PRAM, Proceedings of the conference Statistical Data Protection, March 25-27, 1998, Lisbon, Portugal. This paper can also be found on the CASC website (http://neon.vb.cbs.nl/casc, Related Papers). 2.3 Methodology description of the individual risk approach 2.3.1 The traditional model To be able to distinguish safe from unsafe microdata, it is necessary that a disclosure risk model is specified. Disclosure models can differ greatly in their degree of sophistication. The basic model in µ-ARGUS is a fairly simple one, namely one based on a thresholding rule. The understanding is that a combination of values is safe only if the estimated frequency of its occurrence in the population (or in the file) is above a certain threshold value. Which combinations to consider is also part of the disclosure risk model that one applies, and should be specified for µ-ARGUS by the data protector. At Statistics Netherl
60. plying PRAM then means that, given the score $k$ in the original file in record $r$, the score will be replaced by a score drawn from the probability distribution $p_{k1}, \dots, p_{kK}$. For each record in the original file this procedure is performed independently of the other records. Note that $p_{kk}$ is the probability that the score $k$ in the original file is left unchanged. Moreover, be aware of the fact that, since PRAM involves a chance experiment, different perturbed microdata files will be obtained when applying the same PRAM matrix several times to a single original microdata file (each time starting with the unperturbed original microdata file). Since the transition probabilities are known, unbiased estimates of contingency tables are easily obtained. Other, more elaborate techniques will be needed to correct for the PRAM perturbation in more complex analyses, e.g. in loglinear models. References on PRAM Footnote 7: For the other values this is generally not feasible, because there are too many. One should first categorise the values and apply top or bottom coding to this new ordinal variable. Note that in practice there is no difference between a continuous variable and an ordinal variable with many categories, since we are dealing with finite populations. Gouweleeuw, J.M., P. Kooiman, L.C.R.J. Willenborg and P.P. de Wolf (1998a), Post Randomisation for Statistical Disclosure Control: Theory and Implementation, Journal
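The PRAM step described in excerpt 60 (each record's score is redrawn, independently of the other records, from the row of the transition matrix that belongs to its original score) can be sketched in a few lines of Python. The function name `apply_pram`, the 0-based category indices and the explicit random generator argument are assumptions for illustration; this is not the µ-ARGUS implementation.

```python
import random


def apply_pram(scores, matrix, rng):
    """Perturb categorical scores with a PRAM transition matrix (sketch).

    scores: original category indices, 0-based (the manual numbers them 1..K).
    matrix: K x K nested list, matrix[k][l] = probability that score k becomes l.
    rng:    a random.Random instance, so that a run can be reproduced.
    """
    categories = list(range(len(matrix)))
    for row in matrix:
        if len(row) != len(matrix) or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row must have K entries that sum to 1")
    # One independent draw per record, from the row of its original score.
    return [rng.choices(categories, weights=matrix[k])[0] for k in scores]
```

Calling `apply_pram(scores, matrix, random.Random(1))` twice with different seeds generally gives different perturbed files, which is the chance-experiment behaviour the excerpt warns about; a matrix with ones on the diagonal leaves every score unchanged.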
61. r of unsafe combinations per variable. [Screenshot: µ-ARGUS main window. Left pane: the number of unsafe records in every dimension (dim1, dim2, dim3) for each variable, from REGION and SEX down to RECSOSEC. Right pane: the codes and labels of the selected variable REGION with the same counts per category.] The unsafe combinations are calculated by checking the threshold value in each combination. Please note that if e.g. a three-dimensional combination is checked, the one- and two-dimensional marginal combinations are checked as well. In this window the scores of these marginal tables are also shown. As the unsafe combinations are calculated for the tables specified, it is possible to inspect the
62. re for each record can be seen as the negative moment of order one of a negative binomial distribution. The substitution of the moment generating function of the negative binomial distribution leads to $$E(F_k^{-1} \mid f_k) = \int_0^{\infty} \left\{ \frac{p_k \exp(-t)}{1 - q_k \exp(-t)} \right\}^{f_k} dt,$$ where $q_k = 1 - p_k$. After some algebra (substituting $y = \exp(-t)$) we obtain $$E(F_k^{-1} \mid f_k) = p_k^{f_k} \int_0^1 \frac{y^{f_k - 1}}{(1 - q_k y)^{f_k}} \, dy.$$ The estimation of the parameter $p_k$, exploiting the sampling design employed, gives as a result $$\hat{p}_k = \frac{f_k}{\sum_{i: k(i) = k} w_i}.$$ The substitution of $\hat{p}_k$ in the formula leads to the estimate of the risk of disclosure in the case of independence. In the formula for the risk it is also possible to include a parameter $m$ relating to the quality of the key variables, the probability of being included in the external file and the probability of an attempt to identify a unit in the released file. If this is the case, the new risk can be written as [formula not recoverable from the extracted text]. It is important to remark that if we want to evaluate the risk of disclosure of a microdata file with the function defined above, the key variables used to calculate that risk obviously have to be coherent with the sample design used to obtain the sample microdata. This implies that if a certain level of detail is used for a key variable in the sampling design, attention should be paid to maintaining this level, or increasing it, when releasing such a variable in the microdata file. For example, let us consider the g
63. roceedings SDP98, Lisbon. words, the threshold value should be chosen as large as possible under the safety constraint. In the case of bottom coding this threshold value should be as small as possible, of course. Safety is checked in terms of the frequencies of the value combinations that have to be checked. Top and bottom coding can also be applied to continuous variables. What is important is that the values of such a variable can be linearly ordered. It is possible to calculate threshold values and lump all values larger than this value together (in the case of top coding) or all smaller values (in the case of bottom coding). Checking whether the top (or bottom) category is large enough is also feasible. 2.2.4 The Post RAndomisation Method (PRAM) General introduction PRAM is a disclosure control technique that can be applied to categorical data. Basically, it is a form of deliberate misclassification using a known probability mechanism. Applying PRAM means that for each record in a microdatafile the score on one or more categorical variables is changed. This is done independently of the other records, using a predetermined probability mechanism. Hence the original file is perturbed, so it will be difficult for an intruder to identify records with certainty as corresponding to certain individuals in the population. Since the probability mechanism that is used when applying PRAM is known, characteristics of the late
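The search for a tight top-coding threshold described in excerpt 63 can be sketched as follows. The safety check is deliberately simplified: the manual checks the frequencies of value combinations, whereas this sketch only requires that the top category holds at least `min_count` values of a single variable. The function names and `min_count` are assumptions for illustration.

```python
def top_coding_threshold(values, min_count):
    """Return the largest observed value t such that at least min_count
    values are strictly larger than t, or None if no such value exists."""
    for t in sorted(set(values), reverse=True):
        # Try the largest candidates first, so the top category stays as
        # small as the (simplified) safety condition allows.
        if sum(1 for v in values if v > t) >= min_count:
            return t
    return None


def top_code(values, threshold, replacement):
    """Lump all values larger than the threshold into one replacement value."""
    return [replacement if v > threshold else v for v in values]
```

For the values 10, 20, 30, 40, 50, 60 and `min_count = 2`, the sketch returns 40, and top coding with that threshold replaces 50 and 60 by the chosen replacement value. Bottom coding is the mirror image: search from the smallest candidate upwards and count the values strictly smaller than it.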
64. [Screenshot: report for an individual risk run. It lists the original data and meta files and the safe data and meta files, states that no global recodings have been applied, that the Base Individual Risk has been applied to the table regione x relaz x sesso x staciv with risk 0.000150, and that there are no other modifications.] That file reports the number of suppressions for each variable, so that it is possible to rate the information contained in the protected file. The user now has two options: a) He/she is satisfied with the result. The safe file has been generated and he/she can leave µ-ARGUS. b) He/she is not satisfied with the result. He/she has two alternatives to get a better solution: 1. Go back to the risk specification part and specify a different threshold value. 2. Go back, recode some variables and pass again through the whole procedure described above. Or do both, i.e. recode, calculate the risk and select a new threshold. References to the individual risk approach Benedetti and Franconi, L. (1998), estimation method for individual risk of disclosure based on sampling design, submitted for publication. Fienberg, S.E. and Makov, U.E. (1998), Confidentiality, Uniqueness and disclosure Limitation for categorical data, Journal of Official Statistics, 14, 4, 385-397. Skinner, J. and Holmes, D.
65. son who is rare in the data file with respect to a certain key is also rare in the population. Secondly, an intruder may use another key than the key(s) considered by the data protector. For instance, the data protector may consider only keys consisting of at most three variables, while the intruder may use a key consisting of four variables. Therefore it is better to avoid the occurrence of combinations of scores that are rare in the population, instead of avoiding only population uniques. To define what is meant by rare, the data protector has to choose a threshold value $D_k$ for each key value $k$, where the index $k$ indicates that the threshold value may depend on the key $k$ under consideration. A combination of scores, i.e. a key value, that occurs not more than $D_k$ times in the population is considered unsafe; a key value that occurs more than $D_k$ times in the population is considered safe. The unsafe combinations must be protected, while the safe ones may be published. Re-identification of an individual can take place when several values of so-called identifying variables, such as Place of residence, Sex and Occupation, are taken into consideration. The values of these identifying variables can be assumed to be known to relatives, friends, acquaintances and colleagues of a respondent. When several values of these identifying variables are combined, a respondent may be re-identified. Consider for example the following record obtained from
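The threshold rule in excerpt 65 (a key value that occurs not more than the threshold number of times is unsafe) can be sketched with a frequency count. The rule is stated for population frequencies; this sketch counts occurrences in the file itself as a stand-in, and the function name, the record layout (one dictionary per record) and the single threshold per key are assumptions for illustration.

```python
from collections import Counter


def unsafe_key_values(records, key, threshold):
    """Return the key values that occur at most `threshold` times (sketch).

    records:   list of dicts, one per record.
    key:       tuple of identifying variable names, e.g. ("Residence", "Sex").
    threshold: the threshold chosen for this key; counts <= threshold are unsafe.
    """
    counts = Counter(tuple(rec[var] for var in key) for rec in records)
    return {value for value, n in counts.items() if n <= threshold}
```

Applied to a toy file with one female statistician in Urk and two male statisticians in Amsterdam, a threshold of 1 on the key (Residence, Sex, Occupation) flags only the Urk combination as unsafe.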
66. the button Set all codes to default. Further, it is possible to change individual scores using the slider. If a category is to be changed, another category is selected at random. However, if you want to restrict this, you can select a bandwidth. In that case the new category will be a neighbouring one. Pressing the Apply button stores the information. The actual PRAMming will only be done when the new datafile is generated. It is still possible to apply global recoding and come back to re-specify your PRAM probabilities. [Screenshot: PRAM specification dialog, with a variable list, the codes and labels of REGION with their probabilities (mostly 80), a default probability setting, a bandwidth option, and Apply, Undo and Close buttons.] 4.4.4 Individual risk specification A new risk model has been incorporated in µ-ARGUS. This method is based on t
67. the section 2 is devoted to giving the necessary background information. In later sections more detailed information about the working and use of µ-ARGUS can be found. 2 Producing safe microdata The purpose of the present chapter is to give the necessary background information for the use of µ-ARGUS. Producing safe microdata is not a trivial issue. It should first be explained when microdata are considered safe or unsafe. It should also be explained how unsafe data can be modified to become safe. 2.1 Safe and unsafe microdata µ-ARGUS is based on a view of safety/unsafety of microdata that is used at Statistics Netherlands. In fact, the incentive to build a package like µ-ARGUS was to allow data protectors at Statistics Netherlands to apply the general rules for various types of microdata easily, and to relieve them from the chore and tedium that producing a safe file in practice can be. Not only should it be easy to produce safe microdata, it should also be possible to generate a logfile that documents the modifications of a microdata file. When implementing µ-ARGUS, the objective was to produce a package that, on the one hand, is able to handle the specific rules that Statistics Netherlands applies to produce safe data and, on the other hand, is more general. Nevertheless, the generality has a bound. Despite unavoidable limitations, we believe that µ-ARGUS can justly be called a general-purpose package to help
68. urrent risk values and another risk graph that shows the new distribution of the risk. At this stage the user has two choices: a) he/she is satisfied with the result in terms of information preserved in the file, and the output file is recorded as the safe file; b) he/she discards the results, choosing to roll back to the previous risk values, e.g. in order to select another threshold level. [Figure 2.4.1 Process Struct: flow diagram with the boxes risk computation, graphics, µ-ARGUS protection, output file, report, more information, and safe file, annotated with section references.] 2.4.3 Base Individual Risk computation This paragraph describes how µ-ARGUS calculates the base individual risk according to the methodology presented in section 2.3.3. First of all, we recall that it is indispensable to have the information on the individual weights. The individual risk represents the base individual risk for a unit i having combination k(i) = k of key variables, and is the same for every unit belonging to the same sub-population. An approximation of formula (1) of section 2.3.3 is given by [formulas not recoverable from the extracted text; their definitions involve the individual weights w]
69. wrote the kernel, and Anco Hundepool, who wrote the interface. However, this software would not have been possible without the contributions of several others, both partners in the CASC project and outsiders. New developments included in this version are the risk approach, coordinated by Luisa Franconi c.s. from Istat, and the Post Randomisation Method (PRAM), based on work by Peter Kooiman, Peter-Paul de Wolf and Ardo van den Hout at CBS. Footnotes: Anco Hundepool et al. (2002), τ-ARGUS user manual, Version 2.1, Department of Statistical Methods, Statistics Netherlands, Voorburg, The Netherlands. This interpretation is due to Peter Kooiman, and dates back to around 1992, when the first prototype of ARGUS was built by Wil de Jong. The original copy of this engraving is in the collection of Het Leidsch Prentenkabinet in Leiden, The Netherlands. The CASC project The CASC project can, on the one hand, be seen as a follow-up of the SDC project of the 4th Framework. It will build further on the achievements of that successful project. On the other hand, it will have new objectives. It will concentrate more on practical tools and the research needed to develop them. For this purpose a new consortium has been brought together. It will take over the results and products emerging from the SDC project. One of the main tasks of this new consortium will be to further develop the ARGUS software, which has been put in the public domain
