Orbis Historical Data Processing Manual

PROCESSING ORBIS HISTORICAL DISK

Sebnem Kalemli-Ozcan, Jingting Fan, and Veronika Penciakova

in cooperation with Bureau van Dijk

I. PRELIMINARIES

This manual describes the processing of the historical disk delivered by BvD. The user receives an external hard drive, which contains the following folders:

● Financials Histo Dec SQL

● Financials Histo Dec text: the contents of this folder, and the processing of the files within it, are described in section III.

● Orbis Dec text: the contents of this folder, and the processing of the files within it, are described in section II.

● Ownership histo Dec SQL

● Ownership histo Dec text: the contents of this folder, and the processing of the files within it, are described in section IV.

This manual details the processing of the files in the “text” folders, since these contain files that can be read into Stata. The “SQL” folders contain the same information as the “text” folders, but in a different format (i.e., one that cannot be read into Stata).

This section describes system requirements and pre-processing steps that facilitate the processing of the data contained in the historical disk. The hardware used for this task was a Windows 10 Enterprise workstation with an Intel Xeon CPU E5-2680 @ @.75 GHz and 128 GB of RAM.

1. SYSTEM REQUIREMENTS

The historical disk is around 280 GB. We recommend using an 8 TB drive to process and store the data.

The guide describes how to process and store the BvD data using Stata MP, version 12 or higher.

2. FOLDER STRUCTURE

On the 8 TB drive, create the following folder structure, which mirrors the internal organization of the Orbis database.

1. Firm_description: will contain the processed descriptive firm-level data, including name, address, legal form, industry classification, and national identifiers. The steps for processing these data are described in section II. Within this folder, create the following sub-folders:

1.1. Codes: stores do-files for processing data.

1.2. Txt: contains unzipped (unprocessed) data text files. Within this folder, create a sub-folder called chunky, which will be referenced during the processing procedures described below.

1.3. Dta: contains processed data files. Within this folder, create two sub-folders, which will be needed for processing the data: intermediate and final. Within each of these folders, create an individual folder for each country covered in the BvD data, named with the two-letter country ISO code. This country ISO code is used by BvD to create the identification numbers for companies, their owners, and subsidiaries.

2. Financials: will contain the processed firm-level financial data. The steps for processing these data are described in section III. Within this folder, create the same sub-folders as in folder 1 (Firm_description).

3. Ownership: will contain the processed firm-level ownership data. The steps for processing these data are described in section IV. Within this folder, create the following sub-folders:

3.1. Codes: stores do-files for processing data.

3.2. Txt: contains unzipped (unprocessed) data text files. Within this folder, create a sub-folder called chunky, which will be referenced during the processing procedures described below.

3.3. Dta: contains processed data files. Within this folder, create two sub-folders, which will be needed for processing the data: entity_information and ownership_structure. Within each of these folders, create two sub-folders: intermediate and final. Within each of these, create an individual folder for each country covered in the BvD data, named with the two-letter country ISO code.

4. Temp: this folder is needed for running Stata directly from the 8 TB drive, as detailed in section I.3.
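The folder structure above can be created by hand or scripted. The following is a minimal Stata sketch, not part of the original procedure. It assumes a hypothetical root folder (placed under Stata's temporary directory here so the example runs anywhere) and an illustrative three-country list; both should be replaced with the actual location on the 8 TB drive and the full set of country ISO codes in the BvD data.

```stata
* Sketch only: the root path and the country list are assumptions.
* Replace root with the real location, for example DRIVE:/FOLDER.
local root "`c(tmpdir)'/orbis_demo"
* Illustrative subset of two-letter ISO codes
local countries "DE FR IT"

* mkdir creates one level at a time; capture ignores "already exists"
capture mkdir "`root'"

* Firm_description and Financials share the same layout
foreach top in Firm_description Financials {
    capture mkdir "`root'/`top'"
    capture mkdir "`root'/`top'/Codes"
    capture mkdir "`root'/`top'/Txt"
    capture mkdir "`root'/`top'/Txt/chunky"
    capture mkdir "`root'/`top'/Dta"
    foreach stage in intermediate final {
        capture mkdir "`root'/`top'/Dta/`stage'"
        foreach c of local countries {
            capture mkdir "`root'/`top'/Dta/`stage'/`c'"
        }
    }
}

* Ownership has one extra level under Dta
capture mkdir "`root'/Ownership"
capture mkdir "`root'/Ownership/Codes"
capture mkdir "`root'/Ownership/Txt"
capture mkdir "`root'/Ownership/Txt/chunky"
capture mkdir "`root'/Ownership/Dta"
foreach part in entity_information ownership_structure {
    capture mkdir "`root'/Ownership/Dta/`part'"
    foreach stage in intermediate final {
        capture mkdir "`root'/Ownership/Dta/`part'/`stage'"
        foreach c of local countries {
            capture mkdir "`root'/Ownership/Dta/`part'/`stage'/`c'"
        }
    }
}

* Temp folder used when launching Stata from the drive
capture mkdir "`root'/Temp"
```

Note that capture suppresses every error from mkdir, not only "directory already exists", so a permissions problem would also pass silently; checking a few folders afterwards is a reasonable precaution.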

3. RUNNING STATA

By default, Stata runs from the C:\ drive, which often has less capacity than other installed drives. When processing the historical disk, Stata should be run from the 8 TB storage disk so that its temporary files are written there rather than to C:\.

1. Using a text editor (such as Notepad), save the following two lines of code as a file named run_stata.bat on the 8 TB drive.

set STATATMP=DRIVE:\FOLDER\Temp

"C:\Program Files (x86)\Stata14\StataMP-64.exe"

● Note that the first line of code references the location of the Temp folder on the 8 TB drive (DRIVE:\FOLDER is a placeholder), and the second line references the location of the Stata executable. Both lines should therefore be changed to match the user's system.

2. To run the code for processing the historical data, open Stata by double-clicking on run_stata.bat.
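Once Stata has been opened through the batch file, it can be worth confirming that the temporary directory points to the Temp folder on the 8 TB drive. The short sketch below is an optional check rather than a step from the original procedure. It relies on c(tmpdir), the system value in which Stata reports its temporary directory, and on the fact that STATATMP is the environment variable Stata reads on Windows to set that directory.

```stata
* Show the directory Stata uses for temporary files in this session
display "`c(tmpdir)'"

* Create a small temporary dataset and show where it was written;
* the path should sit under the directory displayed above
tempfile check
clear
set obs 1
generate byte x = 1
save "`check'"
display "`check'"
```

If the displayed path is still on C:\, Stata was most likely started from its usual shortcut rather than from run_stata.bat.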

II. FIRM DESCRIPTION DATA

The folder Orbis Dec text on the historical drive contains 32 RAR files that need to be processed. This section describes the content and details the processing of the 12 files that contain firm-level descriptive data. The remaining files contain financial information and are therefore described in section III. Each of the 12 descriptive files discussed in this section is static: it contains the latest year for which information is available and therefore does not have a time dimension. All of these files contain the firm ID (or BVDID) in the first column and can be linked by merging on this identifier.
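Linking files on the identifier can be sketched in Stata as follows. This is an illustration under stated assumptions: the toy data stand in for two descriptive files that have already been imported, the BVDID values are fictitious, and the variable names city and legal_form are hypothetical.

```stata
* Sketch only: toy data and all variable names except bvdid are assumptions.
clear
input str12 bvdid str20 city
"DE0000000001" "Berlin"
"FR0000000002" "Paris"
"IT0000000003" "Rome"
end
tempfile contact
save "`contact'"

clear
input str12 bvdid str10 legal_form
"DE0000000001" "GmbH"
"FR0000000002" "SA"
end

* Link the two files on the firm identifier
merge 1:1 bvdid using "`contact'"

* _merge shows which firms appear in one file, the other, or both
tabulate _merge
```

A 1:1 merge is only valid when both files have exactly one observation per BVDID. Files that can hold several entries per firm, such as All addresses.txt and Identifiers.txt described below, would first need to be reduced to one entry per firm or merged with m:1 or 1:m instead.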

1. DATA CONTENTS

The following describes the contents of the 12 firm description RAR files.

1. All addresses.txt: contains detailed address information. The variables include the main firm identifier BVDID, the first four lines of the street address (both in English and the native language), city (both in English and the native language), postcode, country, country ISO code, region in country, type of region in country, telephone and fax numbers, and address type.

a. There may be multiple entries per BVDID because one firm can have multiple address types. The most common address type is the incorporation address, but others include previous address, branch address, and postal address.

b. Depending on the purpose of the study, the user can create a dataset that contains only one observation per BVDID. To do so, first identify cases where there is more than one entry per BVDID (using the duplicates tag command in Stata). If a BVDID has multiple entries, apply the following criteria in order:

i. First, if the firm reports an incorporation address, keep it.

ii. Some firms do not report an incorporation address, so their multiple entries need to be resolved using different criteria. Second, among the remaining multiple entries, drop the previous address.

iii. The remaining cases of multiple entries usually have two types of addresses: office and postal. As a last step, keep the office address.
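The three criteria above can be sketched in Stata as follows. This is one possible reading of the steps, not the authors' code: the variable name address_type, the exact label strings, and the toy BVDID values are all assumptions, and the labels in the delivered file should be checked before reusing the logic.

```stata
* Sketch only: variable name address_type and its values are assumptions.
clear
input str12 bvdid str25 address_type
"DE0000000001" "Incorporation address"
"DE0000000001" "Previous address"
"DE0000000001" "Branch address"
"FR0000000002" "Previous address"
"FR0000000002" "Office address"
"IT0000000003" "Office address"
"IT0000000003" "Postal address"
"ES0000000004" "Postal address"
end

* Step i: for firms with several entries, keep the incorporation address
duplicates tag bvdid, generate(dup)
generate byte is_inc = address_type == "Incorporation address"
bysort bvdid: egen byte has_inc = max(is_inc)
drop if dup > 0 & has_inc == 1 & is_inc == 0
drop dup

* Step ii: among firms that still have several entries, drop the previous address
duplicates tag bvdid, generate(dup)
drop if dup > 0 & address_type == "Previous address"
drop dup

* Step iii: among the remaining multiple entries, keep the office address
duplicates tag bvdid, generate(dup)
generate byte is_off = address_type == "Office address"
bysort bvdid: egen byte has_off = max(is_off)
drop if dup > 0 & has_off == 1 & is_off == 0
drop dup is_inc has_inc is_off has_off

* Stops with an error if any BVDID still has more than one entry
isid bvdid
```

The closing isid call flags any firm that these rules do not resolve, for example one with two incorporation addresses. A firm whose only entries are several previous addresses would be dropped entirely by step ii as written here, so that case may need separate handling.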

2. Contact info.txt: this file is similar to All addresses.txt, but has only one entry per BVDID. That is, while the All addresses.txt file contains information on a firm’s previous address, branch address, etc., the Contact info.txt file only contains the latest address for each firm. The file contains the firm name, the first four lines of the street address (both in English and the native language), postcode, city (in English and the native language), country, country ISO code, metropolitan area (for the US), state/province (US and Canada), county (US and Canada), fax and telephone numbers, website, email address, region in the country, and region type.

3. Identifiers.txt: contains various firm identifiers for each BVDID, including a national ID number, the label of that national ID, the national VAT/tax identifier, trade register number, European VAT number, LEI (legal entity identifier), and ticker symbol.

a. This file often has multiple entries for each BVDID. One common reason is that a country has more than one type of national identifier. For example, in many countries firms are assigned both a VAT/tax identifier and an LEI. Since both are types of national IDs, there will be two observations per firm: one where the national ID variable is populated with the VAT/tax identifier, and another where it is populated with the LEI.
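To illustrate what these multiple entries look like and one way they could be handled, the sketch below builds a toy version of the file and reshapes it so that each firm occupies a single row with one column per identifier type. The variable names national_id and national_id_label, the label values, and all identifier values are assumptions made for the example and are fictitious.

```stata
* Sketch only: variable names, labels, and ID values are assumptions.
clear
input str12 bvdid str20 national_id str8 national_id_label
"DE0000000001" "DE123456789" "VAT"
"DE0000000001" "5299000ABCDEFGHIJK12" "LEI"
"FR0000000002" "FR987654321" "VAT"
end

* Report how many BVDIDs have more than one entry
duplicates report bvdid

* One row per firm, one column per identifier type
* (creates national_idVAT and national_idLEI)
reshape wide national_id, i(bvdid) j(national_id_label) string
list
```

This approach requires each firm to have at most one entry per label, and the labels must be usable as variable-name suffixes. Labels that contain spaces or punctuation would need to be recoded before the reshape.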

