What is Microdata?

Microdata are the units of data from which aggregate statistics are compiled: raw data about individual objects such as people, households, events, transactions or organisations, as opposed to the aggregated statistics appearing in a published report. Microdata consist of sets of records containing information on individual respondents or other entities. Objects have properties which are often expressed as values of variables; for example, a ‘person’ object may have values for variables such as ‘name’, ‘address’, ‘age’ and ‘income’. Microdata represent observed or derived values of these variables for particular objects. Microdata may also describe other characteristics of the Pacific Islands, such as geographical data.

National microdata are usually available from censuses, surveys, and administrative and register data. These data are most commonly collected by national governments or Pacific Island National Statistics Offices (NSOs), and access is provided by the NSO or the national archive. The data are collected at an individual, household, or institution level as appropriate (Desai and Cowell, 2006).

What is the Pacific Data Hub Microdata Library?

The Pacific Data Hub - Microdata Library is a central repository for Pacific Island statistical microdata, reports and documents. It is an online cataloguing and dissemination system for survey and census metadata and microdata, established to facilitate access to microdata that provide information about people living in Pacific Island developing countries, their institutions, their environment, their communities and the operation of their economies. SPC provides safe access to microdata via the Pacific Data Hub - Microdata Library (microdata.pacificdata.org) to enable research and analysis that benefits Pacific Island people. Because microdata create a risk of recognition/identification of individual people, households or organisations, they must be managed carefully to protect against this risk.


Data Acquisition

Data acquisition describes ways to collect and collate microdata and their associated metadata. Microdata and metadata are generated from various data collection activities such as household surveys, population censuses, and administrative recording systems. As part of their work, many organisations in the Pacific not only capture their own microdata but also acquire microdata from other sources. Microdata can be generated by many official and non-official producers, for example Pacific Island National Statistics Offices (NSOs), line ministries, researchers, and the private sector.

In acquiring Pacific Island microdata, SPC shall undertake the following activities:

  1. The process of identifying suitable and dependable data can be complicated. SPC locates suitable datasets and confirms that each dataset exists and is available for use, and may therefore be acquired;
  2. SPC staff establish a Data License Agreement and Terms of Use for acquiring the Dataset that is agreed to by the Data Provider or owner and SPC. SPC and the Data Provider must evaluate the License/Terms of Use to ensure they meet their requirements and that both parties can comply with them; or
  3. SPC staff acquire the Dataset pursuant to either (i) an MoU, (ii) a legal agreement, or (iii) an informal document (such as an email) from an authorised representative of the Data Provider to SPC.
  4. The Data License Agreement and Terms of Use include information about how data will be shared, including when the data will be accessible, how long the data will be available, how access can be gained, and any rights that the data provider reserves for using data. They also describe any obligations that exist for sharing data collected and address any ethical or privacy issues and legal requirements with data sharing.
  5. Data ownership: Data providers should ensure they are the data owners with rights to deposit data to be shared with SPC Pacific Data Hub. People submitting datasets must have the legal authority to do so.
  6. Metadata documentation (such as questionnaires, data descriptions, classifications and definitions) is an important piece of information and must be acquired along with the microdata.
  7. Identify the data format of the dataset being acquired to ensure safe transfer and integrity of the data. In the Pacific Data Hub Microdata Library itself, for example, data are provided in Stata, SPSS and SAS formats. In many cases ASCII versions are also provided, with syntax files included for reading the data into SPSS and SAS. If the demand justifies it, we may consider adding other formats. Data provided by external catalogs are under their control. SPC does not offer a data conversion service; however, software such as Stat/Transfer or the Nesstar Publisher (which is freeware) can be used to convert datasets into other formats (a conversion sketch follows this list).
  8. Indicate how the data should be cited by others and address intellectual property and copyright issues.
  9. Assure data (conduct data quality checks, undertake disclosure control). Double-check the microdata for completeness and quality (tabulation of aggregates and checking for integrity of dataset). Ensuring the quality of the data is a high priority, for we know that good research is only possible with reliable data.
  10. Maintain preservation copies of the data in the long term.
  11. Create and publish metadata to assist researchers to discover and use the data.
  12. If necessary, modify the data to reduce disclosure risk, including removal of any direct identifiers such as names, addresses, telephone numbers or any other linkable variables that point explicitly to particular individuals or units, and removal of any indirect identifiers. Indirect identifiers can be problematic because they may be used together, or in conjunction with other information, to identify individual respondents.
  13. Limit access to datasets for which modifying the data would substantially reduce their utility or for which the risk of disclosure remains high.
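As an illustration of item 7, the sketch below uses the open-source Python package pyreadstat (an assumption for illustration; the tools named above, such as Stat/Transfer or the Nesstar Publisher, serve the same purpose) to read a Stata file and re-export it as SPSS and CSV. All file names are hypothetical.

    # A minimal format-conversion sketch, assuming pyreadstat is installed
    # and using hypothetical file names.
    import pyreadstat

    # Read the original Stata file together with its metadata (labels, formats).
    df, meta = pyreadstat.read_dta("hies_2021_person.dta")

    # Re-export to SPSS, carrying variable and value labels across.
    pyreadstat.write_sav(
        df,
        "hies_2021_person.sav",
        column_labels=meta.column_labels,
        variable_value_labels=meta.variable_value_labels,
    )

    # Plain-text (CSV/ASCII) copy for users without statistical software.
    df.to_csv("hies_2021_person.csv", index=False)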

What is Data Curation and Preservation?

Data curation and preservation is the art of maintaining the value of data. A data curator does this by collecting data from many different sources and then aggregating and integrating it into an information source that is many times more valuable than its independent parts. During this process, data might be annotated, tagged, presented, and published for various purposes. The goal is to keep the data valuable so it can be reused in as many applications as possible. Through the curation process, data are organized, described, cleaned, enhanced, and preserved for public use, much like the work done on paintings or rare books to make the works accessible to the public now and in the future. With modern technology, it's increasingly easy to post and share data. Without curation, however, data can be difficult to find, use, and interpret.

For more information about the process and principles of Data Curation and Preservation 

Cataloging

Cataloging involves publishing detailed metadata in an on-line searchable catalog to make data discoverable. Cataloging also provides information such as creator names, titles, and subject terms that describe resources, typically through the creation of bibliographic records. The records serve as proxies for the stored information resources.

Cataloging a dataset, or information about a data collection, involves several interrelated processes: it is the process of creating metadata representing information resources, such as datasets. To find a particular dataset, interested users must be properly informed about the existence and characteristics of the datasets available, yet many potential users have very little, if any, information about them. Good metadata must therefore be made available, preferably in the form of a searchable on-line catalog.

All datasets deposited with Pacific Data Hub-Microdata Library undergo quality checks to confirm the accuracy and usability of the data. Anomalies in data files and documents are corrected in consultation with data depositors. Missing values, errors and corrections are recorded as Data Quality Notes in the metadata provided with each dataset.

Data structure, completeness and correctness are checked: for example, that the structure, size, type, completeness and correctness of the dataset agree with the description of the dataset content and with the level of data curation within the Pacific Data Hub-Microdata Library repository. It is important to make sure that, in all data files, the identification variable(s) provide a unique identifier; the duplicates function in SPSS or the isid command in Stata can be used to verify this (a sketch of the same check follows this paragraph). Completeness of the data files is verified by comparing their content with the survey questionnaire: data from all sections of the questionnaire should be included in the dataset, and the number of records in each file should correspond to what is expected.
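The same uniqueness check can be scripted; the sketch below uses pandas, with a hypothetical file name and hypothetical identifier variables.

    # A minimal sketch of the unique-identifier check (equivalent in spirit
    # to Stata's isid), using hypothetical file and variable names.
    import pandas as pd

    df = pd.read_stata("hies_2021_person.dta")

    # The combination of household ID and person number should identify each record.
    id_vars = ["hhid", "person_no"]
    duplicates = df[df.duplicated(subset=id_vars, keep=False)]

    if duplicates.empty:
        print("Identification variables form a unique identifier.")
    else:
        print(f"{len(duplicates)} records share an identifier:")
        print(duplicates[id_vars])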

Metadata completeness is also checked for the microdata file: for example, whether a citation exists, including authorship, year, a comprehensive title and a persistent identifier (e.g. a DOI). Any data files with data quality changes receive a new version number, and file naming and versioning follow the Data Documentation Initiative (DDI) standard. It is also verified that all variables are labelled (variable labels) and that the codes for all categorical variables are labelled (value labels).
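A label-completeness check of this kind can also be scripted. The sketch below assumes pyreadstat and a hypothetical SPSS file; it lists variables with no variable label and variables with no value labels (continuous variables will legitimately appear in the second list, so the output needs manual review).

    # A sketch of a label-completeness check; the file name is hypothetical.
    import pyreadstat

    df, meta = pyreadstat.read_sav("hies_2021_person.sav")

    # Variables missing a variable label.
    unlabelled = [name for name, label in zip(meta.column_names, meta.column_labels)
                  if not label]

    # Variables with no value labels (continuous variables will appear here too).
    no_value_labels = [name for name in meta.column_names
                       if name not in meta.variable_value_labels]

    print("Variables without a variable label:", unlabelled)
    print("Variables without value labels:", no_value_labels)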

The objective of Pacific Data Hub Microdata Library is to provide easy access to data and documentation in a format most convenient for users. A survey catalog provides tools for:

  • Finding the dataset most appropriate to the user’s needs. This may be simple when the number of microdata files is small. But, as the number of files increases, a tool that can search data files at the variable level becomes essential.
  • Evaluating information that has been identified to ensure compatibility with the researcher’s needs, e.g., the universe, concepts, and definitions employed in the survey. This role is supported by the metadata used to document the file.
  • Accessing the data, which involves an extraction and/or some type of delivery system. Commonly, such files are delivered for download via a website or portal.
  • Using the data. There is no single tool with which researchers undertake their analytical work; researchers prefer data available in a variety of formats so they can use tools of their choice. Typically, these include formats for SPSS, Stata, SAS, and ASCII.

Data Discovery and re-use (citations)

Data citation refers to the practice of providing a reference to data in the same way as researchers routinely provide a bibliographic reference to other scholarly resources. Citations can be included at the study level and point to published works that have used the data from a particular study, such as a journal article, working paper, or news article. A citation gives credit to the data source and distributor and identifies data sources for validation. Citations are also a good way of showing the funders of surveys that the data are being used for policy and research purposes. They support researchers in managing and sharing data, and enabling data citation and linking data with publications increase the visibility and accessibility of both the data and the research itself.

Bibliographic references are important when you are using the data or ideas of others in your written work: references credit your sources and permit your readers to find those sources. Citing statistics and data has been a neglected grey area in academic publishing, and the citation styles preferred by scientific journals largely ignore datasets and tables.

One of the features of the Pacific Data Hub – Microdata Library is a bibliography of publications that have cited the use of a dataset listed in the catalog. Selecting a publication from the list will show which study dataset was used and provide a link to the study in the catalog. This helps to improve data discovery and re-use. Data citation is an important practice in the publishing of research. As data are shared more frequently, data citation provides numerous advantages, including reproducibility through direct reference to the data used in a research study, credit to data producers and authors, and the ability for researchers to track the use of their datasets in other studies.

The process of citing a dataset from the Pacific Data Hub – Microdata Library consists of the following steps:

  1. Cite data in your references, for example reference the data used in your data tables. Identify which one(s) of the survey datasets is (are) quoted in the paper. Cite the exact version of the data used in your research, to support data discovery.
  2. Verify that the citation is not a “false positive”, i.e. that it indeed quotes one or several of the survey datasets.
  3. Find out if any other dataset is mentioned in the document.
  4. Identify the type of paper (journal article, working paper, etc.).
  5. Identify whether a URL exists. If the URL of the cited website in the bibliography is no longer online, also check whether the content has a Digital Object Identifier (DOI) (see step 9).
  6. Check that the citation is not already entered in the citation catalog.
  7. Add the citation to the catalog by entering its BibTeX record.
  8. When all citations related to the keywords you searched are entered, activate the “Email alert”.
  9. Include a unique identifier in your citation, such as a Digital Object Identifier (DOI). DOIs enable the data to be accessed even if URLs change and thus provide a permanent link to the data.
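For steps 5 and 9, a quick way to confirm that a cited DOI still resolves is sketched below; the DOI shown is a hypothetical placeholder, and some landing pages reject HEAD requests, in which case a GET request can be used instead.

    # A minimal sketch of a DOI resolution check; the DOI is hypothetical.
    import requests

    doi = "10.1234/example-dataset"  # placeholder DOI
    response = requests.head(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)

    if response.ok:
        print("DOI resolves to:", response.url)
    else:
        print("DOI did not resolve, HTTP status:", response.status_code)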

Metadata Documentation

Data documentation is important because it helps the researcher to find the information that is necessary for them to fully exploit the analytic potential of the data. Names, abstracts, keywords and other important metadata elements make it easier for a researcher to locate specific datasets and variables. Documentation also helps the researcher to understand what the data are measuring and how they have been created.

Data documentation also explains how data were created, what the data mean, what their content and structure are, and any data manipulations that may have taken place. Documenting data should be considered best practice when creating, organising and managing data, and is important for data preservation. Whenever data are used, sufficient contextual information is required to make sense of them.

Without proper descriptions of the survey design and the methods used when collecting and processing the data, the risk is high that users will misunderstand and even misuse the data. Good documentation also helps the researcher to assess the quality of the data: information about the data collection standards, as well as any deviations from the planned standards, can help determine whether the data are useful for a research project. Rich metadata also reduce the burden on the data producer, as they reduce the need to provide regular support to users of the data.

The Data Documentation Initiative (DDI) is the international standard for documenting data. It is used by government agencies, universities, research institutions and archives around the world. DDI enables the discovery and use of data by describing how a dataset was created and what it contains. DDI can be read by people, but can also be used by software systems and computers. DDI encodes the description of the dataset in XML, using a schema for each piece of information.
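As a rough illustration of how DDI metadata can be read by software, the sketch below lists variable names and labels from a DDI-Codebook XML file using Python's standard library; the file name is hypothetical and the namespace assumes DDI-Codebook 2.5.

    # A sketch of reading variable-level metadata from a DDI-Codebook 2.5 file.
    import xml.etree.ElementTree as ET

    NS = {"ddi": "ddi:codebook:2_5"}  # DDI-Codebook 2.5 namespace (assumed)

    root = ET.parse("hies_2021_ddi.xml").getroot()

    # Each <var> element in the data description carries a name and a label.
    for var in root.findall(".//ddi:dataDscr/ddi:var", NS):
        label = var.find("ddi:labl", NS)
        print(var.get("name"), ":", label.text if label is not None else "(no label)")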

Evaluation of Microdata and Metadata quality

The Pacific Data Hub Microdata Library (PML) receives datasets from many different sources. As a general rule, we do not modify the datasets unless we work directly with the producer, except to apply statistical disclosure control and to format the data files for the convenience of users. Data files are always preserved in their original format, as well as in a Stata-consistent format. For dissemination, the Library provides for data files to be converted into other commonly used formats such as SPSS, SAS, Stata or ASCII. Documents are stored in their original format but are usually disseminated as PDF files.

SPC works with various data producers to promote better practices of data management including variable and file naming rules and the use of labels. Because generally speaking the PML has no control over the data collection or management procedures used, there is no guarantee that these practices will have been used in any specific survey from its inception.

While an evaluation of microdata and metadata can be undertaken, data are often provided "as is". We make all possible efforts to ensure that the metadata are as comprehensive as possible; this documentation includes, whenever possible, identification of problems and weaknesses in the datasets. It is, however, the responsibility of researchers to make their own assessment of the reliability and suitability of the data for their specific purpose, based on all the information provided.

Evaluating a dataset is a crucial process following its acquisition. This piece of work ensures the dataset is ready for preservation and documentation and consists of the following steps (a sketch of the first check follows the list):

  1. Tabulating aggregates from a report;
  2. Checking the integrity of the dataset;
  3. Checking whether IDs are unique;
  4. De-identifying the dataset;
  5. Identifying sensitive variables;
  6. Labelling all variables and all values.
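As a sketch of the first check, the snippet below tabulates a weighted aggregate from the microdata so it can be compared against the published report; the file, region and weight variable names are hypothetical.

    # Tabulate a weighted population count by region for comparison with the
    # corresponding table in the survey report (hypothetical names throughout).
    import pandas as pd

    df = pd.read_stata("hies_2021_person.dta")

    weighted_counts = df.groupby("region")["sample_weight"].sum().round()
    print(weighted_counts)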

Introduction to Statistical Disclosure Control (anonymisation)

Microdata potentially create a risk of recognition/identification of individual people, households or organisations and as such must be managed carefully to protect against this risk.

When disseminating microdata files, the data producer must safeguard the confidentiality of information about individual respondents. Processes aimed at protecting confidentiality are referred to as Statistical Disclosure Control (SDC). SDC techniques include the removal of direct identifiers (names, phone numbers, addresses, etc) and indirect identifiers (detailed geographic information, names of organizations to which the respondent belongs, exact occupations, exact dates of events such as birth, death and marriage) from the data files.

The confidential nature of the information provided by households is guaranteed in national laws governing census taking and is also one of the United Nations Fundamental Principles of Official Statistics (United Nations, 2014). Information is collected on the understanding that it will be treated as confidential and census respondents are generally given some kind of guarantee or assurance in this regard. Official statisticians are charged with guarding the confidentiality of this information and the respondent’s trust in the statistical office’s adherence to this commitment is an important factor in their willingness to participate in the census.

Statistical disclosure control methods have been developed to make it possible for statistical offices to anonymize and release microdata in a controlled way which protects the privacy and statistical confidentiality of individuals and other entities so that there is a low risk of individuals and households being identified within the data. Such methods make it possible to disseminate microdata to researchers in universities or in government thus more fully exploiting its potential value for social research and policy analysis.

  1. Consider and document the de-identification strategy early. De-identification efforts often require data perturbations, such as suppression of specific variables’ values, top and bottom coding, conversion of continuous variables to categorical variables, or removal of any identifiable variation. Data providers should consider their de-identification strategy early, prior to analysis, and document it. Evaluators are encouraged to share their de-identification strategy to discuss implications for future verification of analysis and for public and/or restricted access to the microdata.
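A minimal sketch of some of these de-identification steps is given below, using pandas; all file and variable names are hypothetical, and real statistical disclosure control requires a careful risk assessment rather than a fixed recipe.

    # Illustrative de-identification steps: removing direct identifiers,
    # top coding, and recoding a continuous variable into categories.
    import pandas as pd

    df = pd.read_stata("hies_2021_person.dta")

    # 1. Remove direct identifiers.
    df = df.drop(columns=["name", "address", "phone_number"])

    # 2. Top-code income at the 99th percentile to mask extreme values.
    income_cap = df["income"].quantile(0.99)
    df["income"] = df["income"].clip(upper=income_cap)

    # 3. Recode exact age (an indirect identifier) into five-year bands.
    df["age_group"] = pd.cut(df["age"], bins=list(range(0, 105, 5)), right=False).astype(str)
    df = df.drop(columns=["age"])

    df.to_stata("hies_2021_person_anon.dta", write_index=False)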

 

http://www.ihsn.org/sites/default/files/resources/ihsn-working-paper-007-Oct27.pdf

Introduction to Statistical Disclosure Control (SDC)

http://ico.org.uk/for_organisations/data_protection/topic_guides/~/media/documents/library/Data_Protection/Practical_application/anonymisation-codev2.pdf

Managing Statistical Confidentiality and Microdata Access - Principles and guidelines of Good Practice

http://www.unece.org/fileadmin/DAM/stats/publications/Managing.statistical.confidentiality.and.microdata.access.pdf

 

Dissemination of Microdata Files

Dissemination of Microdata Files - Principles, Procedures and Practices
http://www.ihsn.org/sites/default/files/resources/IHSN-WP005.pdf

Managing and sharing data - Best practice for researchers

By the UK Data Archive, May 2011

https://ukdataservice.ac.uk/media/622417/managingsharing.pdf
