Analyzing 21,194,349 Hospitalization Records from Poland's National Health Fund NFZ (2019-2022)

NFZ Hospitalization Data ETL & Visualization | A Low-Level Approach with SQL, Shell Scripts, SQLLoader and Apache Superset

About the Project

This project is based on real NFZ data sourced from the open-data portal, dane.gov.pl. The dataset comprises all hospitalizations covered by the NFZ (National Health Fund of Poland) in 2019-2022 *.

Dane dotyczące hospitalizacji rozliczonych JGP w latach 2019-2021 (containt hospitalizations start with 2017, 2018)
Dane dotyczące hospitalizacji rozliczonych JGP w latach 2022

The dataset comprises over 20 million records, with each record representing an individual patient's hospitalization data.

It's based on the public goverment repositories hub named data.gov.pl.

Source Licence

Used repository is not bad quality, but I saw small differences between files. I want to make some improvements, create DWH star schema, materialized view as fact table, dimensions etc... I will clean and process the data to visualize it with Apache Superset.

Tech stack

This repository showcases my Data Engineering project, highlighting my diverse data-related skills. It includes database administration, data warehousing, and ETL development tasks. My objective is to prepare and process this data for visualization purposes, including the creation of charts and dashboards. I aim to learn modern data visualization BI tool.

The project is currently only available on my localhost, so included some scripts and documentation images for the initial version.

Apache Superset for visualisation
Oracle SQL for cleaning data
VirtualBox, network bridged adapter. VMs:
- VM 1: Oracle Linux with DB 21
- VM 2: Centos, Docker, Apache Superset container with cx_oracle connector
MINGW Bash for shell scripting
SqlLoader
IDE:
- SQLDeveloper 23 for DBA tasks and queries
- Visual Studio Code with new Oracle plugin for new experience :)
- SQLPlus & SQLCli
Others
- PlantUml for diagrams

Model

Figure 1: DWH Model

Database Objects & Names Explanation

Pluggable Database (PDB)

datamart

Application schema

dm_nfzhosp

Users

sysdm: Data Mart Administrator
C##JDOE: Data Analyst
C##JSMITH: Database Developer
sys: Oracle Root User

Roles

Role R_ENGINEER is used for develepers. That's have strong privilages. Granted user: C##JSMITH
R_ANALYST is used for. Granted user: C##JDOE

Domain dictionaries

The data contains some foreign keys pointing to static dictionaries:

dim_discharge_mode_dict
dim_admission_mode_dict (both as general dictionary, no-changing dimension)
dim_nfz_dept_dict (inline yet as CTE)
dim_institution (#todo)

The dictionary source for Polish HL7 implementations includes:

discharge modes https://www.cez.gov.pl/HL7POL-1.3.2/plcda-html-1.3.2/plcda-html/voc-2.16.840.1.113883.3.4424.13.11.36-2015-10-26T000000.html

Installation

Tool

Set up default coding to UTF-8 (SQLDeveloper, Sublime, DB instance)

Data Part:

Set up the Oracle database.
Copy the repository
Review server parameters (TNS, IP, service name) in the sql/ directory scripts and install.sh
Run the install.sh script (the script will prompt for database passwords for security and a better user experience)

Figure 1: Demonstration of the installation process (gif animations)

Visualization Tool:

Configure the Apache Superset datasource and dataset with c##jdoe credentials (analytical role), using SqlAlchemu url

url oracle+cx_oracle://c##jdoe:oracle@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.0.51)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=datamart)))

Set up charts.

Source Data preview

ROK;MIESIAC;OW_NFZ;NIP_PODMIOTU;KOD_PRODUKTU_KONTRAKTOWEGO;KOD_PRODUKTU_JEDNOSTKOWEGO;KOD_TRYBU_PRZYJECIA;KOD_TRYBU_WYPISU;PLEC_PACJENTA;GRUPA_WIEKOWA_PACJENTA;PRZEDZIAL_DLUGOSCI_TRWANIA_HOSPITALIZACJI;LICZBA_HOSPITALIZACJI 2022;4;"07";"1132866688";"03.4580.991.02";"5.51.01.0008013";6;2;"K";"65 i wiecej";"6 i wie™cej dni";"<5" 2022;8;"02";"5562239217";"03.4220.030.02";"5.51.01.0001087";3;2;"K";"45-64";"6 i wiecej dni";"<5" 2022;11;"03";"9462146139";"03.4580.991.02";"5.51.01.0008015";6;2;"K";"65 i wiecej";"6 i wiecej dni";"<5" 2022;9;"15";"7842008454";"03.4450.040.02";"5.51.01.0012014";3;2;"K";"65 i wiecej";"0 dni";"<5" 2022;4;"13";"6572195982";"03.4500.030.02";"5.51.01.0006109";6;2;"K";"65 i wiecej";"3-5 dni";"<5"

My notes

top 5 of disase by age group
increase in hospitalization count for each age group
increase in hospitalization length for each age group

Issues

Gender code: 1 2 K M 0 9

Key words

#nfz #data #etl #elt #bi #BuissnessInteligance #dataVisualisation #healthcare #dataengineering #analyst #pipeline #dataops #utplsql #liquid

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
assets		assets
ctl		ctl
data-demo		data-demo
sql		sql
.gitignore		.gitignore
draft.sql		draft.sql
install.sh		install.sh
readme.md		readme.md
rebuild.sh		rebuild.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assets

assets

ctl

ctl

data-demo

data-demo

sql

sql

.gitignore

.gitignore

draft.sql

draft.sql

install.sh

install.sh

readme.md

readme.md

rebuild.sh

rebuild.sh

Repository files navigation

Analyzing 21,194,349 Hospitalization Records from Poland's National Health Fund NFZ (2019-2022)

Table of Contents

About the Project

Tech stack

Model

Database Objects & Names Explanation

Domain dictionaries

Installation

Source Data preview

My notes

Issues

Key words

About

Releases

Packages

Languages

szwrk/nfz-hospitalization-data

Folders and files

Latest commit

History

Repository files navigation

Analyzing 21,194,349 Hospitalization Records from Poland's National Health Fund NFZ (2019-2022)

Table of Contents

About the Project

Tech stack

Model

Database Objects & Names Explanation

Domain dictionaries

Installation

Source Data preview

My notes

Issues

Key words

About

Topics

Resources

Stars

Watchers

Forks

Languages