Collecting data | University of Antwerp

Types and formats

Research generates different types of data. An important question to ask is what type(s) of data you will produce during your research, because this determines how you will capture and process the data.

While actively working on your data, you should use the file format that best suits the way you work. This will also depend on the software you use. However, once you are no longer actively working on your data, you should convert the files to a standard format for archiving. To prevent your files from becoming unreadable in the future, save them in an open format. This way, you and other people with access to the data will be able to open the files at a later date.

The rule of thumb is that you should use a format that (1) can be read by free tools (without an expensive license), (2) is commonly used by the research community, and (3) can be opened by a wide variety of software. This is also how you FAIRify your data.

If you need to use a proprietary format (for example, if converting your files to an open format would cause a loss of data), include a README.txt file that states the name and version of the software used.

Some preferred file formats:

Text: PDF/A, ODT, HTML, TXT, XML (non-preferred: DOC, DOCX, RTF)
Spreadsheets: CSV, ODS (non-preferred: XLS, XLSX)
Databases: SQL, CSV (non-preferred: MDB, DBF)
Statistics: ASCII, DTA, POR, SAS, SAV, R
Images: TIFF, JPEG 2000, PNG
Audio: MXF, BWF (non-preferred: MP3, Wave, AAC)
Video: MXF, MKV (non-preferred: MPEG, AVI)
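
As a rough illustration of how the list above could be used in practice, the following Python sketch flags files by extension. The mapping from format names to file extensions (for example JPEG 2000 to .jp2) and the function name classify_format are assumptions made for this example, not part of the official guidance. The extensions .pdf and .wav are deliberately left out, because an extension alone cannot distinguish PDF/A from ordinary PDF or BWF from plain WAV.

```python
from pathlib import Path

# Extensions transcribed from the list above. Mapping format names to
# extensions (e.g. "JPEG 2000" -> .jp2, "TIFF" -> .tif/.tiff) is an assumption.
# .pdf and .wav are left out on purpose: an extension alone cannot tell
# PDF/A from ordinary PDF, or BWF from plain WAV.
PREFERRED = {
    ".odt", ".html", ".txt", ".xml",      # text
    ".csv", ".ods",                       # spreadsheets
    ".sql",                               # databases
    ".tif", ".tiff", ".jp2", ".png",      # images
    ".mxf", ".mkv",                       # audio / video
}
NON_PREFERRED = {
    ".doc", ".docx", ".rtf",              # text
    ".xls", ".xlsx",                      # spreadsheets
    ".mdb", ".dbf",                       # databases
    ".mp3", ".aac",                       # audio
    ".mpeg", ".mpg", ".avi",              # video
}


def classify_format(filename: str) -> str:
    """Return 'preferred', 'non-preferred', or 'unknown' for a file name."""
    extension = Path(filename).suffix.lower()
    if extension in PREFERRED:
        return "preferred"
    if extension in NON_PREFERRED:
        return "non-preferred"
    return "unknown"


if __name__ == "__main__":
    for name in ["survey.xlsx", "survey.csv", "scan.pdf"]:
        print(name, "->", classify_format(name))
```

Files reported as "unknown" are not necessarily unsuitable; they simply need a manual check against the lists of the repository you intend to use.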

Extensive lists of preferred and non-preferred formats for each file type are available on the DANS and UK Data Service websites. You can also check with the repository of your choice which file formats it accepts.

For further advice on file formats, contact the University Archive.

To formulate a data management plan, ask yourself whether you will collect new data or reuse existing data.

At this stage, you should already think about the size of your data and whether it will grow over time. This is important when planning the storage and backup of your data.
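
To make the size estimate concrete, here is a minimal Python sketch that measures an existing data folder and projects its future size. The linear growth model, the number of backup copies, and all function names are assumptions chosen for illustration; real data volumes may grow unevenly.

```python
from pathlib import Path


def folder_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files below `root`."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def projected_size_bytes(current: int, growth_per_month: int, months: int) -> int:
    """Linear projection (an assumption: real growth may be uneven)."""
    return current + growth_per_month * months


def with_backups(size_bytes: int, copies: int) -> int:
    """Total storage needed for the data plus `copies` backup copies."""
    return size_bytes * (1 + copies)
```

For example, a dataset of 10 GB that grows by 5 GB per month reaches 30 GB after four months under this model, and keeping two backup copies brings the total storage need to 90 GB.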

Organizing data

Always be consistent in organizing your data. Files can very quickly become unmanageable if file names and/or folder structures are not organized consistently and logically. Reorganizing or searching for files takes a lot of time and causes a lot of frustration.

Read some tips to organize your data:

Structure your files hierarchically. Have folders covering broad topics at the highest level with more specific files nested within them. Each folder should not contain more than 100 items. Review your file structure regularly and remove any unnecessary files.
Keep folder and file names brief and clear. Useful elements include, for example, project name, location, personal name, date, type of data, conditions, version number, etc. A maximum of 30 characters for folder names and 60 characters for file names is recommended, while the full path should not exceed 255 characters.
Use the international ISO 8601 standard for dates. That means YYYYMMDD (preferably at the beginning of your file names to maintain chronological order).
Avoid special characters, dots, and spaces.
Use hyphens or underscores to separate words.
Use file or folder naming conventions of your research group or department, if applicable.
Apply version control to track which changes were made to a file during the drafting process, so that you can recall specific versions later or restore earlier versions if needed. Decide how many versions of a file to keep, which ones, and for how long. We advise you to keep a single master file (that is, the first version) in a separate, clearly identified location and to make all changes only on copies of that master. Use consistent numbering (e.g. test.v0_1, test.v0_2, test.v0_3, etc.) and/or identification systems (e.g. the date or a version description such as draft or final) in the file name. You can also create a file history or version control table that records versions, dates, authors, locations, and details of the changes to the file. If you wish, you can use versioning software: Subversion and Git (for all files) or GitHub (only for software development) are frequently used. For researchers in the Life Sciences in particular, user-friendly Electronic Lab Notebook (ELN) software is a good option and has a built-in version control system. It also helps you improve your research efficiency and reproducibility, ensures better protection of your intellectual property, and makes you a more attractive partner for industrial collaborations. For more information on ELN, please contact Siham Benramdane.
Include a README.txt file to explain the naming convention, especially if it is complex.
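
The naming tips above can be sketched as a small Python helper. This is only one possible convention: the order of the elements, the v0_1 style version suffix, the 60-character limit check, and the function name build_file_name are assumptions for illustration, and your research group's own convention takes precedence.

```python
import re
from datetime import date

MAX_FILE_NAME = 60  # characters, following the recommendation above


def build_file_name(day: date, parts: list[str], version: tuple[int, int],
                    extension: str) -> str:
    """Build a name such as 20240315_ProjectX_field-notes_v0_1.csv."""
    cleaned = []
    for part in parts:
        # Spaces inside one element become hyphens.
        part = re.sub(r"\s+", "-", part.strip())
        # Drop special characters and dots; keep letters, digits, hyphens.
        part = re.sub(r"[^A-Za-z0-9-]", "", part)
        if part:
            cleaned.append(part)
    major, minor = version
    # ISO 8601 basic date first, so that files sort chronologically.
    stem = "_".join([day.strftime("%Y%m%d"), *cleaned, f"v{major}_{minor}"])
    name = f"{stem}.{extension.lstrip('.')}"
    if len(name) > MAX_FILE_NAME:
        raise ValueError(f"file name has {len(name)} characters, limit is {MAX_FILE_NAME}")
    return name


if __name__ == "__main__":
    print(build_file_name(date(2024, 3, 15), ["ProjectX", "field notes"], (0, 1), "csv"))
```

Underscores separate the elements and hyphens separate words within an element, so the name can still be split back into its parts later.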

Data Documentation

Data documentation refers to the contextual information needed to discover, access, and reuse data. If you do not provide information about your data, future users, including yourself, may have difficulty understanding the data. If you document your data properly, you prevent misinterpretation of your data and increase the reproducibility of your research.

Data documentation may include questionnaires, surveys, codebooks, ELNs, experimental protocols, methodology reports, information about instrument calibration, information about the software used in the research, analytical and procedural information, data preparation steps, manipulations applied to the raw data, units of measurement and their definitions, data characteristics, definitions of codes or specialist terminology, the structure and organization of data files, file naming conventions, version control, data quality control measures, any comments and remarks, etc.

To record this information, you can use a README.txt file, an Electronic Lab Notebook (ELN), a codebook, a logbook, an electronic diary, etc., and add it to the folder where the data is (or will be) stored. An example of a README file template can be found here.
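
As an illustration, a plain-text README can be generated with a few lines of Python. The section headings below are taken from the kinds of documentation listed above, but their selection, the placeholder texts, and the function name render_readme are assumptions for this sketch and do not reproduce the official template.

```python
def render_readme(title: str, sections: dict[str, str]) -> str:
    """Render a plain-text README with an underlined title and headed sections."""
    lines = [title, "=" * len(title), ""]
    for heading, text in sections.items():
        lines.append(heading.upper())
        lines.append(text.strip())
        lines.append("")
    return "\n".join(lines)


# Placeholder content (hypothetical): replace with your own project details.
EXAMPLE_SECTIONS = {
    "Software used": "<name and version of the software>",
    "File naming convention": "<date>_<project>_<description>_<version>",
    "Units of measurement": "<variable>: <unit and definition>",
    "Version control": "<how versions are numbered and where the master file is kept>",
}

if __name__ == "__main__":
    print(render_readme("Dataset X", EXAMPLE_SECTIONS))
```

The resulting text can be saved as README.txt in the same folder as the data, so that the documentation always travels with the files.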

Metadata

Metadata means “data about data” or “information about data”. It provides structured information about research data, which makes the data easier to find in online databases, repositories, archives, etc. Examples of metadata include creator, title, identifier (e.g. DOI), resource type, publication year, publisher, contributor, date (e.g. of creation), description/abstract, location (of creation), page number, file format, version, language, rights, etc.
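
To show what such a record might look like in machine-readable form, the sketch below stores a few of the properties listed above in a Python dictionary and exports them as JSON. All values are hypothetical placeholders (including the identifier), and the choice of which fields to treat as required is an assumption for this example; the repository you deposit in defines the actual requirements.

```python
import json

# Fields treated as required in this sketch (an assumption; check the
# guidelines of your repository for the actual list).
REQUIRED = ("creator", "title", "identifier", "publisher",
            "publication_year", "resource_type")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in `record`."""
    return [field for field in REQUIRED if not record.get(field)]


# Hypothetical example record; none of these values refer to a real dataset.
record = {
    "creator": "Jane Doe",
    "title": "Example survey dataset",
    "identifier": "doi:10.1234/example",
    "publisher": "Example Repository",
    "publication_year": 2024,
    "resource_type": "Dataset",
    "file_format": "CSV",
    "language": "en",
    "version": "0.1",
}

if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print("Missing:", missing_fields(record))
```

Checking a record for missing fields before depositing helps avoid incomplete entries that are hard to find later.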

A more detailed description of these properties can be found in a paper published by DataCite, a well-known organization. Other RDM service websites, such as the Digital Curation Centre (DCC) and FAIRsharing, also give a good overview of the metadata standards for each discipline.

In most cases, the repositories and other services in which your data are (or will be) stored and/or shared have their own guidelines on metadata standards. It is therefore recommended to first check which standards they have adopted for research data.

For further advice, please contact the RDM team and the University Archive.