Orbis Historical Data Processing Manual

PROCESSING ORBIS HISTORICAL DISK
Sebnem Kalemli-Ozcan, Jingting Fan, and Veronika Penciakova
in cooperation with Bureau van Dijk
I. PRELIMINARIES
This manual relates to the processing of the historical disk delivered by BvD. The user receives an
external hard drive, which contains the following folders:
Financials Histo Dec SQL
Financials Histo Dec text: the contents of this folder, and the processing of the files
contained within it, are described in section III.
Orbis Dec text: the contents of this folder, and the processing of the files contained within
it, are described in section II.
Ownership histo Dec SQL
Ownership histo Dec text: the contents of this folder, and the processing of the files
contained within it, are described in section IV.
This manual details the processing of the files in the “text” folders, since these contain files that can be
read into Stata. The “SQL” folders contain the same information as the “text” folders, but in a
different format (i.e., one that cannot be read into Stata).
This section describes the system requirements and pre-processing steps that facilitate the processing of
the data contained in the historical disk. The hardware used for this task was a Windows 10 Enterprise
workstation with an Intel Xeon CPU E5-2680 @ @.75 GHz and 128 GB of RAM.
1. SYSTEM REQUIREMENTS
The historical disk is around 280 GB. We recommend using an 8 TB drive to process and store the data.
The guide describes how to process and store the BvD data using Stata MP, version 12 or higher.
2. FOLDER STRUCTURE
In the 8 TB drive, create the following folder structure that logically corresponds to the internal
organization of the Orbis database.
1. Firm_description: will contain the processed descriptive firm-level data including name,
address, legal form, industry classification, and national identifiers. The steps for processing
these data are described in section II. Within this folder, create the following sub-folders:
1.1. Codes: stores do-files for processing data.
1.2. Txt: contains unzipped (unprocessed) data text files. Within this folder, create a
subfolder called chunky, which will be referenced during the processing
procedures described below.
1.3. Dta: contains processed data files. Within this folder, create two sub-folders,
which will be needed for processing the data: intermediate and final. Within
each of these folders, create a series of individual folders for each country
covered in the BvD data, using the two-letter country ISO code. This country
ISO code is used by BvD to create the identification numbers for companies,
their owners, and subsidiaries.
2. Financials: will contain the processed firm-level financial data. The steps for processing these
data are described in section III. Within this folder, create the same sub-folders as in folder 1
(Firm_description).
3. Ownership: will contain the processed firm-level ownership data. The steps for processing
these data are described in section IV. Within this folder, create the following sub-folders:
3.1. Codes: stores do-files for processing data.
3.2. Txt: contains unzipped (unprocessed) data text files. Within this folder, create a
subfolder called chunky, which will be referenced during the processing
procedures described below.
3.3. Dta: contains processed data files. Within this folder, create two sub-folders,
which will be needed for processing the data: entity_information and
ownership_structure. Within each of these folders, create two sub-folders:
intermediate and final. Within each of these folders, create a series of individual
folders for each country covered in the BvD data, using the two-letter ISO code.
4. Temp: this folder is needed for running Stata directly from the 8 TB drive, which is detailed in
section I.3.
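The folder layout above can also be created by script rather than by hand. The following is a minimal sketch in Python (the manual itself only uses Stata do-files), assuming the folder names listed above. The function names planned_folders and create_folders, and the country list passed to them, are illustrative and not part of the BvD delivery.

```python
from pathlib import Path

SECTIONS = ("Firm_description", "Financials", "Ownership")
STAGES = ("intermediate", "final")
OWNERSHIP_BLOCKS = ("entity_information", "ownership_structure")


def planned_folders(countries):
    """Return the relative folder paths described in the list above."""
    folders = ["Temp"]
    for section in SECTIONS:
        folders.append(f"{section}/Codes")
        folders.append(f"{section}/Txt/chunky")
        # Ownership has an extra level between Dta and intermediate/final.
        blocks = OWNERSHIP_BLOCKS if section == "Ownership" else ("",)
        for block in blocks:
            for stage in STAGES:
                for iso in countries:
                    parts = [section, "Dta", block, stage, iso]
                    folders.append("/".join(p for p in parts if p))
    return folders


def create_folders(root, countries):
    """Create every planned folder under root; existing folders are left alone."""
    for rel in planned_folders(countries):
        Path(root, *rel.split("/")).mkdir(parents=True, exist_ok=True)
```

A call such as create_folders("E:/Orbis", ["DE", "FR", "IT"]) would build the tree for three countries. The drive letter and root folder here are placeholders, and the country list should be replaced with the two-letter ISO codes actually present in the data.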
3. RUNNING STATA
By default, Stata stores its temporary files on the C:\ drive, which often has less capacity than other
installed drives. When processing the historical disk, Stata should instead be launched so that it uses
the Temp folder on the 8 TB storage disk.
1. Using a text editor (such as Notepad), save the following two lines of code as run_stata.bat
on the 8 TB drive.
    set STATATMP=DRIVE:\FOLDER\Temp
    "C:\Program Files (x86)\Stata14\StataMP-64.exe"
Note that the first line of code references the location of the Temp folder on the 8 TB drive,
and the second line references the location of the Stata executable. Both lines should
therefore be changed to match the user's system.
2. To run the code for processing the historical data, open Stata by double-clicking
run_stata.bat.
II. FIRM DESCRIPTION DATA
The folder Orbis Dec text on the historical drive contains 32 RAR files that need to be processed. This
section describes the content and details the processing of the 12 files that contain firm-level
descriptive data. The remaining files contain financial information and are therefore described in
section III. Each of the 12 descriptive files discussed in this section is static: it contains only the latest
year for which information is available and therefore has no time dimension. All of these files contain
the firm ID (or BVDID) in the first column and can be linked by merging on this identifier.
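To illustrate what linking on BVDID means, the following is a minimal sketch in Python of a many-to-one join between two of the descriptive files once they are loaded as lists of records. The function name merge_on_bvdid and the field name bvdid are assumptions made for illustration; in Stata the same link would be made with the merge command.

```python
def merge_on_bvdid(left_rows, right_rows, key="bvdid"):
    """Attach the matching right-hand record to every left-hand record.

    Assumes right_rows holds at most one record per BVDID; left_rows may
    hold several. Left records without a match are kept unchanged.
    """
    lookup = {}
    for row in right_rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key} in right_rows: {row[key]}")
        lookup[row[key]] = row

    merged = []
    for row in left_rows:
        combined = dict(row)
        # Fields from the right-hand record overwrite same-named left fields.
        combined.update(lookup.get(row[key], {}))
        merged.append(combined)
    return merged
```

The duplicate check matters because several of the files described below can contain more than one entry per BVDID, and such a file has to be reduced to one entry per firm before it can serve as the right-hand side of this kind of merge.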
1. DATA CONTENTS
The following describes the contents of the 12 firm description RAR files.
1. All addresses.txt: contains detailed address information. The variables include the main firm
identifier BVDID, the first four lines of the street address (both in English and the native
language), city (both in English and native language), postcode, country, country ISO code,
region in country, type of region in country, telephone and fax numbers, and address type.
a. There may be multiple entries per BVDID because one firm can have multiple address
types. The most common address type is incorporation address, but others include
previous address, branch address, and postal address.
b. Depending on the purpose of the study, the user can create a dataset that contains only
one observation per BVDID. To do so, first identify cases where there is more than one
entry per BVDID (using the duplicates tag command in Stata). If a BVDID has multiple
entries, apply the following rules in order:
i. If the firm reports an incorporation address, keep it.
ii. Some firms do not report an incorporation address, so their multiple entries
need to be dealt with using different criteria. Among the remaining multiple
entries, drop the previous address.
iii. The remaining cases of multiple entries usually have two types of addresses:
office and postal. As a last step, keep the office address.
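The priority rule above can be sketched as follows. This is an illustrative Python version of the logic only, not the do-file used by the authors. The field names bvdid and address_type and the address type labels "incorporation", "previous", and "office" are assumptions, so the actual values in All addresses.txt should be checked before reusing it.

```python
from collections import defaultdict


def pick_one_address(rows, key="bvdid", type_field="address_type"):
    """Apply the three-step priority rule to firms with multiple entries."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = []
    for entries in groups.values():
        if len(entries) > 1:
            incorporation = [r for r in entries if r[type_field] == "incorporation"]
            if incorporation:
                # Step i: keep the incorporation address.
                entries = incorporation
            else:
                # Step ii: drop previous addresses (unless nothing else is left).
                current = [r for r in entries if r[type_field] != "previous"]
                entries = current or entries
                if len(entries) > 1:
                    # Step iii: prefer the office address over the others.
                    office = [r for r in entries if r[type_field] == "office"]
                    entries = office or entries
        result.extend(entries)
    return result
```

Firms that still have more than one entry after these steps (for example, two entries of the same address type) are returned unchanged, so the result should be checked for remaining duplicates.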
2. Contact info.txt: this file is similar to All addresses.txt, but has only one entry per BVDID. That
is, while the All addresses.txt file contains information on a firm’s previous address, branch
address, etc., the Contact info.txt file only contains the latest address for each firm. The file
contains the firm name, first four lines of the street address (both in English and the native
language), postcode, city (in English and the native language), country, country ISO code,
metropolitan area (for the US), state/province (in US and Canada), county (US and Canada), fax
and telephone number, website, email address, region in the country and region type.
3. Identifiers.txt: contains various firm identifiers for each BVDID, including a national ID number,
the label of that national ID, the national VAT/tax identifier, trade register number, European
VAT number, LEI (legal entity identifier), and ticker symbol.
a. This file often has multiple entries for each BVDID. One common reason is that a
country has more than one type of national identifier. For example, in many countries
firms are assigned both a VAT/tax identifier and an LEI. Since both are types of national
IDs, there will be two observations per firm: one in which the national ID variable is
populated with the VAT/tax identifier, and another in which it is populated with the
LEI.
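The structure described above, with one observation per firm and identifier type, can be illustrated with a short Python sketch that collects all national IDs of a firm under their labels. The field names bvdid, national_id, and national_id_label and the function name ids_by_label are hypothetical, and this regrouping is shown only to make the data shape concrete; it is not a processing step prescribed by the manual.

```python
def ids_by_label(rows):
    """Group national IDs by firm, keyed by the label of each ID."""
    grouped = {}
    for row in rows:
        firm_ids = grouped.setdefault(row["bvdid"], {})
        firm_ids[row["national_id_label"]] = row["national_id"]
    return grouped
```

A firm that appears twice in the input, once with a VAT/tax identifier and once with an LEI, ends up as a single entry holding both values.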
