Overview
The data is available in two states: as raw data and as pre-processed data. Additionally, there are three reference tables for variable lookup.
Reference tables
- variable reference (*hirid_variable_reference.csv*) - reference table for variables (for raw stage)
- ordinal variable reference (*ordinal_vars_ref.csv*) - reference table for categorical/ordinal variables for string value lookup
- pre-processed variable reference (*hirid_variable_reference_preprocessed.csv*) - reference table for variables (for merged and imputed stage)
Raw data
The raw data was only processed if this was necessary for patient de-identification and otherwise left unchanged compared to the original source. The raw data contains the complete set of available variables (681 variables). It consists of the following tables:
- observations
- pharma records
- general data
Pre-processed data
The pre-processed data consists of intermediary pipeline stages from our original Nature Medicine publication. Source variables representing the same clinical concept were merged into one meta-variable per concept. The data contains only the 18 most predictive meta-variables, as defined in our publication. Two different stages of the pipeline are available:
Merged stage
Source variables are merged into meta-variables by clinical concept, e.g. non-opioid analgesics. The time grid is left unchanged and is sparse.
Imputed stage
The data from the merged stage is downsampled to a five-minute time grid. The time grid is filled with imputed values. The imputation strategy is complex and is discussed in the original publication.
The code used to generate these stages can be found in this GitHub repo under the preprocessing folder.
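To illustrate what placing sparse data on a five-minute grid means, here is a minimal sketch. It is not the imputation strategy from the publication: it simply takes the last value seen in each five-minute slot and forward-fills empty slots. The function name `to_five_minute_grid` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta

GRID = timedelta(minutes=5)

def to_five_minute_grid(samples, start, end):
    """Place sparse (timestamp, value) samples on a regular 5-minute grid.

    Each slot takes the last value observed before the slot ends; slots
    without a new observation are forward-filled (None before the first one).
    This is a simplification, NOT the publication's imputation strategy.
    """
    samples = sorted(samples)
    grid = []
    last_value = None
    i = 0
    slot = start
    while slot < end:
        # Consume every sample that falls before the end of this slot.
        while i < len(samples) and samples[i][0] < slot + GRID:
            last_value = samples[i][1]
            i += 1
        grid.append((slot, last_value))
        slot += GRID
    return grid

# Hypothetical sparse measurements for a single variable.
t0 = datetime(2100, 1, 1, 0, 0)
sparse = [
    (t0 + timedelta(minutes=1), 80.0),
    (t0 + timedelta(minutes=3), 82.0),
    (t0 + timedelta(minutes=12), 90.0),
]
grid = to_five_minute_grid(sparse, t0, t0 + timedelta(minutes=20))
grid_values = [value for _, value in grid]
```

The merged stage corresponds to the sparse input here, and the imputed stage to the regular grid, although the real pipeline uses a considerably more elaborate filling rule.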
Which data to use?
The pre-processed data is intended mainly as a quick way to jump-start a project or for use in a proof of concept. We recommend using the source data whenever possible for regular projects. It is the most flexible form and contains the complete set of variables in the original time resolution.
Data formats
Data is available in two formats: CSV for wide compatibility and Apache Parquet for convenience and performance. Parquet is a strongly typed, binary format that is supported by many major data processing tools such as pandas, Spark, R, MATLAB, etc.
Since the data sets are fairly large, they are split into partitions so that they can be processed in parallel in a straightforward way. The lookup table mapping patient id to partition id is provided in the file named {data_set}_index.csv along with the data. The partitions are aligned between the different data sets and tables, such that the data of a patient can always be found in the partition with the same id. Note, however, that a patient may not occur in all data sets. For example, a patient might be missing from the pre-processed data because they did not meet the demographic criteria for inclusion in the study.
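The partition lookup described above can be sketched as follows. The column names `patientid` and `part` and the inline index content are assumptions made for illustration; check the header of the actual {data_set}_index.csv file before relying on them.

```python
import csv
import io

# Hypothetical index content; the real file ships with each data set.
# The column names "patientid" and "part" are assumptions.
INDEX_CSV = """patientid,part
1,0
2,0
3,1
"""

def load_partition_index(fileobj):
    """Read an index file into a {patientid: partition id} dictionary."""
    reader = csv.DictReader(fileobj)
    return {int(row["patientid"]): int(row["part"]) for row in reader}

def patients_by_partition(index):
    """Group patient ids by partition, e.g. to hand one partition to each worker."""
    groups = {}
    for patientid, part in index.items():
        groups.setdefault(part, []).append(patientid)
    return groups

index = load_partition_index(io.StringIO(INDEX_CSV))
groups = patients_by_partition(index)

# A patient may be absent from a data set, so use .get() instead of indexing.
partition_of_unknown_patient = index.get(99)
```

Because the partitions are aligned across data sets and tables, the partition id found this way applies to every table in which the patient occurs.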
Data schemata
Field Name | Type | Modifications | Comment | Version |
---|---|---|---|---|
patientid | integer | mapped id | ||
admissiontime | timestamp | timeshift | ||
sex | string | none | 'M' or 'F' | |
age | long | ages >89 → 90 | age at admission | |
discharge_status | string | none | ICU (not hospital) discharge alive, dead or unknown | Available from version 1.1.1 |
Field Name | Type | Modifications | Comment |
---|---|---|---|
patientid | integer | mapped id | |
datetime | timestamp | timeshift | Time point the observation was made |
entertime | timestamp | timeshift | Time point the entry was made into the database |
status | short | | 1=out of range; 2=invalidated; 4=first of connection; 8=caused by event; 16=compressed; 32=notified, not measured; 64=is bigger than; 128=is smaller than; 1024=mandatory |
stringvalue | string | | |
type | string | | For lab values only: C=correction; F=final result; P=preliminary result |
value | float | | |
variableid | long | | |
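The status codes listed above are all powers of two, which suggests that the field is a bit mask in which several conditions can be set at once. The source does not state this explicitly, so treat the following decoder as a sketch under that assumption. The class and function names are hypothetical.

```python
from enum import IntFlag

class ObservationStatus(IntFlag):
    """Status codes from the observation schema, ASSUMED to be combinable bit flags."""
    OUT_OF_RANGE = 1
    INVALIDATED = 2
    FIRST_OF_CONNECTION = 4
    CAUSED_BY_EVENT = 8
    COMPRESSED = 16
    NOTIFIED_NOT_MEASURED = 32
    IS_BIGGER_THAN = 64
    IS_SMALLER_THAN = 128
    MANDATORY = 1024

def decode_status(status):
    """Return the names of all documented flags set in a status value."""
    return [flag.name for flag in ObservationStatus if status & flag]

def is_invalidated(status):
    """True if the 'invalidated' bit is set."""
    return bool(status & ObservationStatus.INVALIDATED)

# A status of 10 would mean 2 (invalidated) + 8 (caused by event).
decoded = decode_status(10)
```

Under this reading, a filter such as `not is_invalidated(status)` would drop invalidated observations before analysis.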
Field Name | Type | Comment |
---|---|---|
variableid | long | variableid of a categorical/ordinal variable as found in the observation table |
code | integer | value of the variable |
stringvalue | string | meaning/description |
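As a sketch of the string value lookup, the snippet below maps a (variableid, code) pair to its description. It assumes that the numeric `value` column of an observation holds the code for categorical/ordinal variables. The variable ids, codes, and labels shown are made up for illustration and do not come from ordinal_vars_ref.csv.

```python
import math

# Hypothetical reference rows with the three columns of the schema above.
ordinal_ref = [
    {"variableid": 9001, "code": 1, "stringvalue": "label A"},
    {"variableid": 9001, "code": 2, "stringvalue": "label B"},
    {"variableid": 9002, "code": 1, "stringvalue": "label C"},
]

# Key the lookup on (variableid, code) because codes repeat across variables.
lookup = {(row["variableid"], row["code"]): row["stringvalue"] for row in ordinal_ref}

def label_for(variableid, value):
    """Return the description for an observed value, or None if there is none.

    Observation values are floats while codes are integers, so the value is
    converted only when it is finite and has no fractional part.
    """
    if not math.isfinite(value) or value != int(value):
        return None
    return lookup.get((variableid, int(value)))

label = label_for(9001, 2.0)
```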
Field Name | Type | Modifications | Comment |
---|---|---|---|
patientid | integer | mapped id | |
pharmaid | integer | | See reference file (hirid_variable_reference.csv) for pharma id lookup. |
givenat | timestamp | timeshift | Time of administration |
enteredentryat | timestamp | timeshift | Time point the entry was made into the database |
givendose | float | | For the unit, see doseunit |
cumulativedose | float | | Cumulative dose given since the start of the infusion (see infusionid) |
fluidamount_calc | double | calculated | Unit is milliliters. This is a calculated value from the source system and is not reliable in some cases when drugs (not fluids) are given. |
cumulfluidamount_calc | double | calculated | Unit is milliliters. Sum of all fluidamount_calc for this infusionid. |
doseunit | string | | Unit of givendose and cumulativedose |
route | string | | |
infusionid | long | mapped id | Unique ID of infusion. |
typeid | short | | 0=fluids; 1=drugs |
subtypeid | double | | 0=crystalloid; 1=blood product; 2=colloid; 3=enteral; 4=parenteral; 5=concentrate; 7=other; 8=drug |
recordstatus | short | | 2=invalidated; 4=start; 8=record; 32=notified, not administered; 256=stop; 512=include in record reports |
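The relation between fluidamount_calc and cumulfluidamount_calc can be illustrated by recomputing a running total per infusionid. This is a sketch only: the records are hypothetical, and it assumes that the sum runs in order of givenat and includes every record. Whether invalidated records should be excluded is not stated in the schema.

```python
from collections import defaultdict
from datetime import datetime

def recompute_cumulative_fluid(records):
    """Return (infusionid, givenat, running total of fluidamount_calc) tuples.

    Records are processed in order of givenat (an assumption), and the total
    is kept separately for each infusionid.
    """
    totals = defaultdict(float)
    result = []
    for rec in sorted(records, key=lambda r: r["givenat"]):
        totals[rec["infusionid"]] += rec["fluidamount_calc"]
        result.append((rec["infusionid"], rec["givenat"], totals[rec["infusionid"]]))
    return result

# Hypothetical pharma records for two interleaved infusions (amounts in mL).
records = [
    {"infusionid": 1, "givenat": datetime(2100, 1, 1, 8, 0), "fluidamount_calc": 10.0},
    {"infusionid": 2, "givenat": datetime(2100, 1, 1, 8, 5), "fluidamount_calc": 20.0},
    {"infusionid": 1, "givenat": datetime(2100, 1, 1, 8, 10), "fluidamount_calc": 5.0},
]
running = recompute_cumulative_fluid(records)
running_totals = [total for _, _, total in running]
```

Comparing such a recomputed total with the provided cumulfluidamount_calc could serve as a sanity check, keeping in mind that the calculated fluid values are not always reliable for drugs.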
Field Name | Type | Modifications | Comment |
---|---|---|---|
patientid | integer | mapped id | |
datetime | timestamp | timeshift | |
vm1 | double | | |
vm3 | double | | |
vm4 | double | | |
vm5 | double | | |
vm13 | double | | |
vm20 | double | | |
vm28 | double | | |
vm62 | double | | |
vm136 | double | | |
vm146 | double | | |
vm172 | double | | |
vm174 | double | | |
vm176 | double | | |
pm41 | double | | |
pm42 | double | | |
pm43 | double | | |
pm44 | double | | |
pm87 | long | | |
Field Name | Type | Modifications | Comment |
---|---|---|---|
patientid | integer | mapped id | |
reldatetime | double | | Seconds since admission |
vm1 | double | | |
vm3 | double | | |
vm4 | double | | |
vm5 | double | | |
vm13 | double | | |
vm20 | double | | |
vm28 | double | | |
vm62 | double | | |
vm136 | double | | |
vm146 | double | | |
vm172 | double | | |
vm174 | double | | |
vm176 | double | | |
pm41 | double | | |
pm42 | double | | |
pm43 | double | | |
pm44 | double | | |
pm87 | double | | |
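Since reldatetime is given in seconds since admission, an absolute (time-shifted) timestamp can be recovered by adding it to the patient's admission time. The sketch below assumes that the reference point is the admissiontime field of the general data table. The admission time shown is a hypothetical value.

```python
from datetime import datetime, timedelta

# Hypothetical excerpt of the general data table: patientid -> admissiontime.
admission_times = {1: datetime(2100, 1, 1, 8, 0)}

def absolute_time(patientid, reldatetime):
    """Convert seconds since admission into a timestamp for the given patient."""
    return admission_times[patientid] + timedelta(seconds=reldatetime)

# The third point on a five-minute grid lies 600 seconds after admission.
timestamp = absolute_time(1, 600.0)
```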