Data disclosure control for descriptive statistics

1. Create artificial data

Create artificial data for eleven IDs (1-11), ten years (2001-2010), four countries (BE, CH, US, GB), and continuous variables x.

clear

set obs 301
generate hvland = int((_n-1)/100)+1
generate hv     = _n-100*(hvland-1)
generate id     = int((hv-1)/10)+1
generate year   = 2000+hv-10*(id-1)
drop hv

generate str2 land = "BE" if hvland==1
qui replace   land = "CH" if hvland==2
qui replace   land = "US" if hvland==3
qui replace   land = "GB" if hvland==4
drop hvland

set seed 4869382
generate x = runiform()

Create missing values

Often, yearly dummies are used as catchall dummies. However, a dummy for the year 2002 would identify the ids 1 and 2. Therefore, the coefficient for the 2002-dummy causes a disclosure problem.

qui replace x = . if year==2002 & id > 2  
qui replace x = . if _n==301

Create a problematic id

Create an id that has to be ignored (since x is missing, see Create missing values)

qui replace id = 11 if _n==301

Dummy if id==1. It is not permitted to publish information on specific firms.

generate id1 = id==1

Create problematic dummies

Dummy id/year/land: 1/2005/BE, 6/2007/BE: This dummy is 1 for only two distinct ids. Therefore, it must not be published.

generate dum_2ids   = _n==5|_n==57

Dummy id/year/land: 1/2005/BE, 6/2007/BE, 7/2003/BE, 8/2008/CH, 10/2009/BE: This dummy is 1 for five distinct ids. It can be published.

generate dum_5ids = _n==5|_n==57|_n==63|_n==99|_n==178

Create data for percentiles-option

set seed 9385542
generate u = runiform()
sort u
drop u
generate int  xpc   = _n if _n<=99

Problematic range

generate byte pcdum = (xpc>50 & xpc<=60)

Only four distinct ids in intervall (p50,p60]: 4, 5, 8, 10

qui replace id =  4 if pcdum & (id==1|id==3|id==6|id==7)
qui replace id =  3 if xpc==67|xpc==74|xpc==82|xpc==87
qui replace id =  2 if xpc==17|xpc==33|xpc==93|xpc==94|xpc==99
qui replace id =  6 if xpc==77|xpc==78
qui replace id = 10 if xpc== 1|xpc== 7
qui replace id =  7 if id==5 & xpc!=8 & xpc!=57 & xpc!=.
qui replace id =  1 if id==8 & xpc!=58 & xpc!=.

Trigger 1: if set, less than 15 percent outside (p50,p60]

qui replace id =  9 if xpc==40

Trigger 2: if set, only four distinct ids in (p90,max]

qui replace id =  3 if xpc==92

Create variable for dominance check

qui generate byte xfordom = 5 if id==1
qui replace       xfordom = 4 if id==2
qui replace       xfordom =-4 if id==2 & land=="US"
qui replace       xfordom = 1 if id==3
qui replace       xfordom =10 if id==3 & land=="CH" & year>=2009

Create variable for treatment of zeros

qui generate xwithzeros = runiform()
qui replace  xwithzeros = 0 if id==3 & year<2009

Other

cls
* global endmark "$$"

You will be asked for or be interested in the number of distinct ids unrelated to any other variable only rarely. Normally, confidentiality has to be resolved with respect to a continuous or discrete variable. In the artificial data, there are eleven distinct ids. Since x is set to missing for id==11, there are 11 distinct ids but only 10 distinct ids for nonmissing values of the variable x.

. nobsdes5 id   
Is id an identifier? 

Number of distinct values for identifier  id :   11

. nobsdes5 id x 

No problem with dominance. 
Number of distinct ids (id) of variable  x :   10

How to use the BY - option

In 2002 x is non missing only for id==1 and id==2

To see whether there is a problem at all, you may use for large tables in the first step the notab-option.

. nobsdes5 id x, by(year) notab 
D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Smallest number of distinct ids (id) of variable x for year: too small

Now you know there is a problem, but you do not know the problematic case. Thus, in the second step do not use notab.

. nobsdes5 id x, by(year)

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Number of distinct ids (id) of variable x for each year :

----------------------------------
     year |  sum(nobs)   mean(CR2)
----------+-----------------------
     2001 |         10          33
     2002 |          2         100
     2003 |         10          35
     2004 |         10          39
     2005 |         10          30
     2006 |         10          30
     2007 |         10          32
     2008 |         10          33
     2009 |         10          30
     2010 |         10          31
----------------------------------

Now you know the problem is the year 2002. In this case, there seems to be a simple solution to get the output.

. table year if year!=2002, c(sum x)

----------------------
     year |     sum(x)
----------+-----------
     2001 |   14.71751
     2003 |   14.21137
     2004 |   13.52464
     2005 |    12.1655
     2006 |   16.41351
     2007 |   13.74528
     2008 |   14.25568
     2009 |   14.28874
     2010 |   13.06753
----------------------

However, if you drop just one row, the hidden may be inferred by the disclosed rows and the total.

. table year if year!=2002, c(sum x)

----------------------
     year |     sum(x)
----------+-----------
     2001 |   14.71751
     2003 |   14.21137
     2004 |   13.52464
     2005 |    12.1655
     2006 |   16.41351
     2007 |   13.74528
     2008 |   14.25568
     2009 |   14.28874
     2010 |   13.06753
          | 
    Total |   126.3898     
----------------------

. quietly summarize x
. display "Total: " r(N)*r(mean)
Total: 129.47083

Thus, you have to hide at least two rows, or, better, you aggregate.

qui generate int bisyear = 2002 if (year==2001|year==2002)
qui replace bisyear = 2004 if (year==2003|year==2004)
qui replace bisyear = 2006 if (year==2005|year==2006)
qui replace bisyear = 2008 if (year==2007|year==2008)
qui replace bisyear = 2010 if (year==2009|year==2010)

table bisyear, c(sum x)
nobsdes5 id x, by(bisyear)

The by-option is mostly used with categorical data but it may also be used with “sparse” continuous data, for example external tax data. There may be a few tax rates that apply to only one or two firms so that there is a confidentiality problem. You can check using nobsdes id, by(taxrate). If you include a variable denoting regions you have to use the byregio-option.

GEN - option

If you want to tabulate for example the most important countries or industries, use the gen-option and create two auxiliary variables for the number of distinct ids and the concentration ratio and restrict the output to the admissible cases.

. nobsdes id x, by(year) notab gen replace

D I S C L O S U R E problem: 

Share of largest two ids > 85%  
Smallest number of distinct ids (id) of variable x for year: 2

. qui egen hvnobs = sum(tg_id_x), by(year)
. qui egen hvcr2  = mean(cr2_id_x), by(year)

. table year if hvnobs>4 & hvcr2<=85, c(sum x sum tg_id_x mean cr2_id_x)

----------------------------------------------------------
     year |         sum(x)    sum(tg_id_x)  mean(cr2_id_x)
----------+-----------------------------------------------
     2001 |       14.71751              10            33.0
     2003 |       14.21137              10            35.0
     2004 |       13.52464              10            38.8
     2005 |        12.1655              10            30.2
     2006 |       16.41351              10            29.5
     2007 |       13.74528              10            32.4
     2008 |       14.25568              10            32.9
     2009 |       14.28874              10            29.7
     2010 |       13.06753              10            31.4
----------------------------------------------------------

Please, do not show the concentration ratio CR2 because this refers to only two observations and must not be disclosed and it has to be clear that the hidden cannot be inferred by the disclosed rows and the total. hvnobs has to be substantially larger than 4.

. table year if hvnobs>4 & hvcr2<=85, c(sum x sum tg_id_x)

--------------------------------------
     year |       sum(x)  sum(tg_id_x)
----------+---------------------------
     2001 |     14.71751            10
     2003 |     14.21137            10
     2004 |     13.52464            10
     2005 |      12.1655            10
     2006 |     16.41351            10
     2007 |     13.74528            10
     2008 |     14.25568            10
     2009 |     14.28874            10
     2010 |     13.06753            10
--------------------------------------

BYEST - option

So far, the by-option specified deterministic categories. In this case, the attribution of an id to a category is known in advance. In other cases the range of the categories is estimated.

A particular example is the grouping of a continuous variable where the ranges are not fixed in advance like sale brackets. An example are the bins in a histogram. Normally an ordering is implied.

Reidentification is harder to achieve and the by-option might be unnecessarily restrictive. However, no clear-cut decision rule is available. For example, the grouping of a continuous variable may result in such groups that there is a one-to-one mapping between the groups and the identifiers.

The RDSC requires that the number of distinct ids in the lowest category and the highest category is larger than four. If the number of different ids in another category is smaller than five the share of the variable under scrutiny outside this category has to be at least 15 percent. This can be examined with the byest-option. To be on the save side you may additionally use the by-option.

You may be interested in grouping data based on percentiles. The next two lines would do that

. pctile cutp    = xpc, nquantiles(10) 
. xtile  xpcbins = xpc, cutpoints(cutp)

The ranges of these groups are not known in advance they are estimated. Use

. nobsdes5 id xpc, byest(xpcbins)

D I S C L O S U R E problem: 
No problem with dominance. 
Number of distinct ids (id) of variable xpc for each xpcbins :

----------------------------------
xpc       |
categoriz |
ed by     |
cutp      |  sum(nobs)   mean(CR2)
----------+-----------------------
        1 |          6          58
       10 |          4          66
----------------------------------

To be sure, add

. nobsdes5 id xpc, by(xpcbins)

D I S C L O S U R E problem: 
No problem with dominance. 
Number of distinct ids (id) of variable xpc for each xpcbins :

----------------------------------
xpc       |
categoriz |
ed by     |
cutp      |  sum(nobs)   mean(CR2)
----------+-----------------------
        1 |          6          58
        2 |          5          58
        3 |          6          50
        4 |          4          79
        5 |          5          61
        6 |          4          79
        7 |          5          52
        8 |          5          59
        9 |          6          50
       10 |          4          66
----------------------------------

ZEROS

Zero values are an issue for data disclosure. Assume, there are three different ids. A continuous variable xwithzeros has positive values for two ids and for the third id it is zero, meaning “nothing”. If firm 1 knows that one of the three firms does not produce xwithzeros and the total of xwithzeros is published, firm 1 can subtract its own reported value from the total amount and thus the respective value for firm 2. Therefore, zeros must be treated as missing values for continuous variables. Use in this case the miss-option.

. nobsdes5 id xwithzeros, by(year) 

No problem with dominance. 
Zero values for xwithzeros found -> option miss(0)?
Number of distinct ids (id) of variable xwithzeros for each year :

----------------------
     year |  sum(nobs)
----------+-----------
     2001 |         11
     2002 |         10
     2003 |         10
     2004 |         10
     2005 |         10
     2006 |         10
     2007 |         10
     2008 |         10
     2009 |         10
     2010 |         10
----------------------

. nobsdes5 id xwithzeros, by(year) miss(0)

No problem with dominance. 
Number of distinct ids (id) of variable xwithzeros for each year :

----------------------
     year |  sum(nobs)
----------+-----------
     2001 |         10
     2002 |          9
     2003 |          9
     2004 |          9
     2005 |          9
     2006 |          9
     2007 |          9
     2008 |          9
     2009 |         10
     2010 |         10
----------------------

DOMINANCE

There are 5 different ids with values for xfordom in each year.

. table id land if xfordom!=., c(sum xfordom) row

----------------------------
          |       land      
       id |   BE    CH    US
----------+-----------------
        1 |   55    50    70
        2 |   44    48   -48
        3 |   13    30     9
          | 
    Total |  112   128    31
----------------------------

Assume id==2 knows that there are two others and id==1 being much bigger than id==3. If you publish the sum, id==2 can guess the value of xfordom for id==1.

For country==“BE” it is near 100-40 = 60
For country==“US” it is near 20-(-40) = 60
For country==“CH” it is not the case that id==1 is much bigger than id==3

. nobsdes5 id xfordom, by(land)

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Number of distinct ids (id) of variable xfordom for each land :

----------------------------------
     land |  sum(nobs)   mean(CR2)
----------+-----------------------
       BE |          3          88
       CH |          3          77
       US |          3          93
----------------------------------

The data is now sorted.

. des

Contains data
  obs:           301                          
 vars:            18                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id              float   %9.0g                 
year            float   %9.0g                 
land            str2    %9s                   
x               float   %9.0g                 
id1             float   %9.0g                 
dum_2ids        float   %9.0g                 
dum_5ids        float   %9.0g                 
xpc             int     %8.0g                 
pcdum           byte    %8.0g                 
xfordom         byte    %8.0g                 
xwithzeros      float   %9.0g                 
bisyear         int     %8.0g                 
tg_id_x         byte    %8.0g                 
cr2_id_x        float   %5.1f                 
hvnobs          float   %9.0g                 
hvcr2           float   %9.0g                 
cutp            float   %9.0g                 percentiles of xpc
xpcbins         byte    %8.0g                 xpc categorized by cutp
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: year
     Note: Dataset has changed since last saved.

MAXIMUM and MINIMUM

MAXIMUM and MINIMUM are single observations. Therefore, they must not be shown. If necessary, the RDSC allows publishing approximate values calculated as averages of a sufficient number of observations (top coding). For maximum and minimum at least five distinct entities have to be covered and the share of the two largest ones must not exceed 85 percentages of the total (dominance criterion).

Please use the ado-file maxrdsc to cope with this problem. If a table or cross-tabulation is desired use nobsdes5 after some preparatory steps.

. maxrdsc id x, min(12) max(12)     

No problems for maximum of x. Average maximum based on  5  distinct ids:  .9832249780495962
No problems for minimum of x. Average minimum based on  5  distinct ids:  .0167608045856468

PERCENTILES

If you display percentiles, the number of distinct ids between the minimum and the lowest percentile and the highest percentile and the maximum all have to be larger than four.

Percentiles have to be specified in ascending order. The option seems to have difficulties with ties. E.g. if 10 percent of a variable are zeros, the smallest possible percentile is larger than 10.

. nobsdes5 id xpc if year==2010, pctile(5 50 90)

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Number of distinct ids (id) for each percentile (min<=p5<=p50<=p90<=max) of variable xpc :

----------------------------------
   pctile |  sum(nobs)   mean(CR2)
----------+-----------------------
      max |          1         100
----------------------------------

. nobsdes5 id xpc, pctile(5 50 90) by(year) notab

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Number of distinct ids (id) for the smallest percentile (min<=p5<=p50<=p90<=max) of variable xpc for year :  too small

. nobsdes5 id xpc, pctile(10 20 30 40 50 60 70 80 90) 

D I S C L O S U R E problem: 
Less than 15 percent of variable xpc of at least one id (id) outside a percentile (min<=p10<=p20<=p30<=p40<=p50<=p60<=p70<=p80<=p90<=max) with less than 5 distinct ids.

  +----------------+
  | pctile   value |
  |----------------|
  |    p60      60 |
  +----------------+

Number of distinct ids (id) for each percentile (min<=p10<=p20<=p30<=p40<=p50<=p60<=p70<=p80<=p90<=max) of variable xpc :

----------------------------------
   pctile |  sum(nobs)   mean(CR2)
----------+-----------------------
       10 |          6          58
      max |          4          66
----------------------------------

There are two different violations. The first violation is the “ordinary one”, i.e. in the range between p90 and the maximum value there are less than five distinct ids and the dominance criterion is violated. That is the first message. In the range between p50 and p60 there are only four different ids and the share of xpc for all ids that is outside this range is less than 15 percent. That is the second message.

Creating GRAPHS

If you want to make a graph you have to document that the graph is based on sufficient observations and that no two IDs dominate. You have to use nobsdes5 before you create the graph. We recommend saving the graph as .png-file

. nobsdes5 id x, by(year land) notab

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Smallest number of distinct ids (id) of variable x for year land: too small

. nobsdes5 id x, by(year land) 

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Number of distinct ids (id) of variable x for each year land :

----------------------------
          |       land      
     year |   BE    CH    US
----------+-----------------
     2001 |    8     9     8
          |   48    41    37
          | 
     2002 |    2     2     2
          |  100   100   100
          | 
     2003 |    8     9     9
          |   51    33    39
          | 
     2004 |    9     9     9
          |   44    43    56
          | 
     2005 |    9     9     9
          |   46    41    42
          | 
     2006 |    9    10     8
          |   36    34    39
          | 
     2007 |    9     9    10
          |   30    42    47
          | 
     2008 |    8    10     8
          |   30    37    63
          | 
     2009 |    9    10     7
          |   39    32    43
          | 
     2010 |    9     9     9
          |   43    46    37
----------------------------

. qui egen meanx  = mean(x) if year!=2002, by(year land)
. qui egen tgland = tag(year land)
. gsort -tgland year land

. scatter meanx year if land=="BE" & tgland

Creating HISTOGRAMS

If you want to make a histogram you have to document that the graph is based on sufficient observations and that no two IDs dominate. You have to use nobsdes5 before you create the graph. We recommend saving the graph as .png-file

. nobsdes5 id xpc, histogram(10)           /* e.g. 10 bins */

D I S C L O S U R E problem: 
No problem with dominance. 
Number of distinct ids (id) for the smallest or largest of 10 bins of variable xpc : too small

That differs from the heigth of the bins

. qui histogram xpc, freq bin(10) addlabels

This histogram example and the last percentile example are almost the same. There are 9 ranges with 10 observations and one range with 9 observations. In the percentile example the last range has only nine observations - the last digit runs from 1 to 0 in each range save the last one - while in the histogram example the last digit runs from 1 to 0 in the first four ranges and then from 0 to 9. The range between p40 and p50 has only 9 observations.

Aggregating Data

If you want to aggregate data for further use you have to document that the aggregate consists of enough observations and that no two IDs dominate. You have to use nobsdes5 before you collapse the data.

. qui drop tg_id_x cr2_id_x

. nobsdes5 id x, by(year land) notab

D I S C L O S U R E problem: 
Share of largest two ids > 85%. 
Smallest number of distinct ids (id) of variable x for year land: too small

. qui drop if year==2002

. nobsdes5 id x, by(year land) notab

No problem with dominance. 
Smallest number of distinct ids (id) of variable x for year land: 7

. collapse (sum) x, by(year land)
. table year land, c(mean x) format(%5.1f)

----------------------------------
          |          land         
     year |   BE    CH    GB    US
----------+-----------------------
     2001 |  4.5   5.1   0.0   5.1
     2003 |  5.1   5.2         3.9
     2004 |  5.5   4.2         3.9
     2005 |  3.5   4.4         4.3
     2006 |  5.6   5.5         5.4
     2007 |  6.1   3.7         3.9
     2008 |  5.8   4.6         3.9
     2009 |  4.7   5.2         4.4
     2010 |  4.4   3.2         5.5
----------------------------------

. scatter x year if land=="BE"

/* If you want to document it after the collapse command you have to create a new variable using the generate or name-option.

nobsdes5 id x, by(year) notab generate collapse (sum) x tg_id_x, by(year land) table year, c(sum tg_id_x)
*/