John S. Bailey Library: Research Data Management: Handling data

Handling data

Decisions you make about your data will affect - and maybe even determine - the research process. Acting wisely and proactively ensures timely and effective management. Data requires careful attention because it is the foundation of all decision-making. When data is accurate and complete, it can be used to identify trends, patterns, and insights that would be difficult to see otherwise. There are several factors to consider when managing data, including:

Data quality: Data must be accurate and complete in order to be useful. This means that it is important to have processes in place to ensure that data is collected and entered correctly, and that it is regularly reviewed for errors and omissions.
Data security: Data must be protected from unauthorized access and use. This means implementing appropriate security measures, such as encryption and access control.
Data governance: Data governance is the process of managing data throughout its lifecycle, from collection to storage to disposal. It is important to have a data governance framework in place to ensure that data is managed consistently and in accordance with organizational policies and regulations.

In addition to these general factors, there are also specific considerations that may be relevant depending on the type of data being managed.

Data considerations simplified

Researchers need to be aware of data-related issues that can arise throughout the research process. These issues range from technical concerns, such as file formats and metadata, to ethical considerations, such as copyright and intellectual property rights. Considering them carefully helps ensure the quality, integrity, and accessibility of the data used and produced, and supports rigorous and reproducible research findings.

Research data come in different formats and types; broadly, they are "any material you use and analyse in your studies. Some disciplines prefer to talk about research materials rather than research data". Research data can be grouped into four major categories: observational, experimental, simulation, and derived / compiled.

Observational data refer to information gathered without the subject of the research (for example an individual customer, patient, or employee) having to be explicitly involved in recording what they are doing. Observational data can be based on a census or on a sample.

Experimental data are produced by research conducted with a scientific approach using variables. They may be qualitative or quantitative, each being appropriate for different investigations. This type of data can typically be generalized to a larger population and can be reproducible.

Simulation data are generated by computer, often in large amounts, to model the mechanisms that control the behavior of a system. Examples include weather forecasts, economic models, chemical reactions, and seismic activity.

Derived data are compiled from different sources in order to produce a new set of data. The research process combines data elements using a mathematical, logical, or other type of transformation (e.g. combining population data with geographic data to create population density data).
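As a rough illustration of derived data, the sketch below combines a population table with an area table to produce population density. The region names, the figures, and the function name `derive_density` are invented for the example.

```python
# Hypothetical source datasets; names and numbers are invented for illustration.
population = {"Region A": 120_000, "Region B": 45_000}  # inhabitants
area_km2 = {"Region A": 300.0, "Region B": 90.0}        # square kilometres


def derive_density(population, area_km2):
    """Combine two datasets into a derived one: inhabitants per square km."""
    density = {}
    # Only regions present in both sources can be combined.
    for region in population.keys() & area_km2.keys():
        density[region] = population[region] / area_km2[region]
    return density


density = derive_density(population, area_km2)
for region in sorted(density):
    print(f"{region}: {density[region]:.1f} inhabitants per square km")
```

Keeping the two source datasets unchanged and storing the derived values separately makes it possible to repeat the transformation later.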

You should be consistent and descriptive in naming and organizing files, depending on the research you are conducting. Determine at an early stage which conventions make the most sense to you (and your team), and document these protocols where they are easily accessible to everyone in your research team and to any external collaborators you may have.

Use folders with meaningful names to group together all the work relevant to your current research/study. This makes the data easy to find. Don't use the same folder for everything. Separate ongoing work from completed work and keep each in its own folder; this will help you keep track of your data.

Make sure you use the same file organization for both your active data and your backup data. Here are some suggestions to consider:

Research name (project name / experiment name / survey / etc.)
Researcher / Research team name
Date or date range of the research
Directory structure (location)
Type of data
Individual file structure
Individual file conditions
Individual file version
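One possible way to apply some of the elements listed above is to assemble file names from them in a fixed order. The sketch below is only an illustration: the function name `build_filename`, the order of the elements, and the separators are assumptions, and your team may prefer a different convention.

```python
import re
from datetime import date


def build_filename(project, researcher, when, data_type, version, extension):
    """Assemble a descriptive file name from a fixed set of elements."""

    def clean(text):
        # Replace runs of characters other than letters and digits with a hyphen.
        return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")

    parts = [
        clean(project),
        clean(researcher),
        when.strftime("%Y%m%d"),  # year-month-day sorts chronologically
        clean(data_type),
        f"v{version:02d}",        # zero-padded version number
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


# Invented example values.
name = build_filename("Reading Survey", "Smith", date(2024, 3, 5), "raw data", 2, "csv")
print(name)
```

Whatever convention you choose, using it for both active and backup data keeps the two copies easy to compare.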

Types of data

In planning a research project, it is important that you consider which file formats you will use to store your data. In some cases, this will be dictated by the software you are using or the conventions of your discipline. In other cases you may have to make a choice between several options. Research data can be anything that is collected, processed and studied for your research purpose:

Documents, spreadsheets
Laboratory notebooks, field notebooks, diaries
Questionnaires, transcripts, codebooks
Films, audio or video tapes/files
Photographs, slides, physical samples
Collections of digital outputs
Data files
Database contents (video, audio, text, images)
Recordings; interview notes
Models, algorithms, scripts
Contents of an application (e.g. log files)
Methodologies and workflows
Procedures and protocols

Here are some good file formats for the preservation of the most common data types that you can use:

Textual data: XML, TXT, HTML, PDF/A (Archival PDF)
Tabular data (including spreadsheets): CSV
Databases: XML, CSV
Images: TIFF, PNG, JPEG (note: JPEG is a 'lossy' format that loses information each time a file is re-saved, so use it only if you are not concerned about image quality)
Audio: FLAC, WAV, MP3
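As an example of saving tabular data in one of the formats listed above, the following sketch writes a small table to a CSV file using Python's standard `csv` module. The records, the file name, and the function name `save_as_csv` are invented for illustration.

```python
import csv
import tempfile
from pathlib import Path


def save_as_csv(rows, path):
    """Write a list of dictionaries to a UTF-8 CSV file with a header row."""
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Invented example records.
rows = [
    {"participant": "P01", "age": 34, "score": 7.5},
    {"participant": "P02", "age": 29, "score": 8.0},
]

# A temporary folder is used here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "scores.csv"
    save_as_csv(rows, csv_path)
    saved_text = csv_path.read_text(encoding="utf-8")

print(saved_text)
```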

When naming and organizing research data and files, you will rarely do so on your own and in one sitting. Usually, several people will be involved in the process and it will take place over an extended period of time. To keep control of the process and avoid confusion over which version is the most recent, what a dataset refers to, and so on, it is imperative that you document your data properly, using consistent metadata. Proper documentation encompasses all the information necessary to interpret, understand and use data or a dataset. Documentation can be embedded, meaning it is included within the data itself, or it can provide additional context and be kept in separate files that accompany the data.

According to the UK Data Archive, metadata are a subset of core data documentation that provides standardized, structured information explaining the

purpose,
origin,
time references,
geographic location,
creator,
access conditions; and
terms of use of a data collection.

There is no single schema defining what metadata elements should be collected and used to document data. There are several general metadata schemas as well as discipline-based ones. To ensure easy access to the metadata used, you can keep them in a README file or in another machine-readable format.
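A minimal sketch of keeping metadata in a machine-readable form is shown below, using JSON. The field names mirror the elements listed above but are an assumption, not an official schema, and the example values are invented.

```python
import json

# Field names follow the elements listed above; they are not an official schema.
REQUIRED_FIELDS = [
    "purpose",
    "origin",
    "time_references",
    "geographic_location",
    "creator",
    "access_conditions",
    "terms_of_use",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Invented example values; terms_of_use is left empty on purpose.
record = {
    "purpose": "Survey of student reading habits",
    "origin": "Online questionnaire",
    "time_references": "2024-01-15/2024-03-01",
    "geographic_location": "Athens, Greece",
    "creator": "Example Research Team",
    "access_conditions": "Open after embargo",
    "terms_of_use": "",
}

print("Missing:", missing_fields(record))
print(json.dumps(record, indent=2))
```

A check like `missing_fields` can be run before sharing a dataset to catch incomplete documentation early.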

As with other aspects of RDM, in funded research the funders require metadata to be made openly available to facilitate easy access and re-use.

The Metadata Standards Catalog, created by the University of Bath, is a collaborative, open directory of metadata standards applicable to research data.

Data management glossaries

It is normal to feel confused and uncertain when trying to name your data. Each word might have a different meaning based on the context and/or the relation with other terms. Here are some resources to enhance your understanding and help you find your way easily:

National Library of Medicine - Data Glossary
DCC - Digital Curation Glossary
Cornell - Data Management Glossary
ICPSR - Social Science Glossary

Renaming Files

There may be cases when you need to rename a large number of files at the same time. Renaming them manually, one file at a time, is difficult and time-consuming. Instead, you could benefit from batch renaming software such as Bulk Rename Utility for Windows (free software) or Renamer for Mac (paid software). These packages allow you to rename multiple files or folders at once. Further options include a list of 16 free file rename tools for Windows, open source bulk rename utility alternatives for Linux, and additional options for Mac.
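If you are comfortable with a little scripting, batch renaming can also be done with a short script. The sketch below is one possible approach using only Python's standard library; the function name `batch_rename`, the prefixes, and the file names are invented for illustration, and the demonstration runs in a temporary folder so no real files are touched.

```python
import tempfile
from pathlib import Path


def batch_rename(folder, old_prefix, new_prefix, dry_run=True):
    """Rename files in `folder` whose names start with `old_prefix`.

    Returns a list of (old_name, new_name) pairs. With dry_run=True nothing
    is changed on disk, so the plan can be reviewed first.
    """
    plan = []
    # sorted() lists the folder once, before any file is renamed.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.name.startswith(old_prefix):
            new_name = new_prefix + path.name[len(old_prefix):]
            target = path.with_name(new_name)
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            plan.append((path.name, new_name))
            if not dry_run:
                path.rename(target)
    return plan


# Demonstration in a temporary folder with invented file names.
with tempfile.TemporaryDirectory() as folder:
    for name in ["draft_01.csv", "draft_02.csv", "notes.txt"]:
        (Path(folder) / name).touch()
    plan = batch_rename(folder, "draft_", "survey2024_", dry_run=False)
    remaining = sorted(p.name for p in Path(folder).iterdir())

print(plan)
print(remaining)
```

Running with `dry_run=True` first and reading the returned plan is a cautious way to avoid renaming the wrong files.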

The importance of README files

A README is a text file that introduces and explains a project. It contains the information commonly required to understand what the project is about. A blog post from the Databrarians explains how a README file communicates important information about a project.
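To make this concrete, the sketch below generates a very small README as plain text. The sections chosen (title, description, contact, file list) and the function name `build_readme` are assumptions for illustration; the appropriate contents depend on your project and discipline.

```python
def build_readme(title, description, contact, files):
    """Return README text with a title, description, contact and file list."""
    lines = [
        title,
        "=" * len(title),  # underline the title
        "",
        description,
        "",
        f"Contact: {contact}",
        "",
        "Files:",
    ]
    for name, purpose in files.items():
        lines.append(f"- {name}: {purpose}")
    return "\n".join(lines) + "\n"


# Invented example values.
readme_text = build_readme(
    title="Reading Survey 2024",
    description="Responses to a questionnaire on student reading habits.",
    contact="research-team@example.org",
    files={
        "scores.csv": "cleaned survey scores",
        "codebook.txt": "explanation of each column",
    },
)
print(readme_text)
```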

Intellectual property is a broad umbrella term for any form of intellectual creation, and it includes various forms such as copyright, patents, and trademarks. Research data is therefore the intellectual property of the researcher, or possibly of their funder or supporting institution, and can be shared and used by other researchers within a specific framework and with appropriate attribution.

Intellectual property rights and attributions should be determined and clarified at the start of the research process, to prevent later limitations and entanglements affecting:

your research
its dissemination
future related research projects
associated profit or credit

Because the benefits of data sharing are well known, a researcher may wish to share their research outcomes with others. Others can only fully use external data if they know the terms of use (if any) for that data. Although data itself cannot be copyrighted, you may be able to own a copyright in the compilation of the data: creative arrangement, annotation, or selection of data can be protected by copyright. Patent law may apply if your data collection leads to new and useful inventions such as machines, processes, manufactures, or improvements. Your data may be protected as a trade secret if your formula, process, design, or method offers a commercial advantage. Note also that some contracts or grants come with non-disclosure agreements or other conditions requiring secrecy.

Keep in mind that some data elements must remain confidential and are protected under specific considerations and laws, locally and internationally.

While working on a research project, it is important to understand whether there are any institutional or funder policies that affect data ownership. Data protection is an ethical issue that includes the right to privacy and respect for how information is used. Data protection issues should be raised and defined at the outset of your project, as they might affect its timing, design or scope. Data can be confidential, private or public, with each category carrying a different set of requirements and needing a different level of attention.

Sensitive data will need to be classified into different categories, each requiring its own level of security. Ensure that access to all confidential and sensitive data is managed appropriately, by using strong passwords and changing any original administrator accounts and passwords. Depending on the data, you can also restrict user access and apply different user permissions.
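As one small example of restricting access, the sketch below limits a file so that only its owner can read and write it. It assumes a POSIX system (Linux or macOS); on Windows, `os.chmod` only controls the read-only flag, so other tools are needed there. The file name and the function name `restrict_to_owner` are invented, and institutional storage may offer its own access controls that should be preferred.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Allow only the file's owner to read and write it (mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demonstration in a temporary folder with an invented file name.
with tempfile.TemporaryDirectory() as folder:
    sensitive_path = os.path.join(folder, "participants.csv")
    with open(sensitive_path, "w", encoding="utf-8") as handle:
        handle.write("id,name\n")
    restrict_to_owner(sensitive_path)
    # Read back the permission bits to confirm the change.
    mode = stat.S_IMODE(os.stat(sensitive_path).st_mode)

print(oct(mode))
```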

Data encryption may be used to further protect confidential and sensitive research data. You need to ensure that the encryption will remain with the data, on all storage areas and throughout all research stages.

Researchers must process all personal data in accordance with the 'data protection principles', unless there is a relevant exemption.

ACG Sensitive Data Protection Policy

Research Data at ACG

To find relevant datasets for your research topic you can contact the librarians who can guide you efficiently to the right resources. Depending on the kinds of data you’re looking for, you may find them in the library's resources:

Surveys and polls

There may be cases when research data are collected in the form of a survey, an interview or a poll. Using an online tool to collect data increases your productivity, as the process is completed easily and in a timely manner. Data are available instantly, they are transferable, and you can generate analytics from them.

ACG supports researchers with Qualtrics.

UK Data Service file formats

LC recommended formats

RDM Glossary

Work with messy data

Ethics & data protection