Writing production-ready ETL pipelines in Python / Pandas

Who this course is for:

  • Data engineers, scientists and developers who want to write professional production-ready

What you'll learn

  • How to write professional ETL pipelines in Python.
  • Steps to write production-level Python code.
  • How to apply functional programming in data engineering.
  • How to design object-oriented code properly.
  • How to use a meta file for job control.
  • Coding best practices for Python in ETL/Data Engineering.
  • How to implement a pipeline in Python extracting data from an AWS S3 source, transforming and loading the data to another AWS S3 target.
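To illustrate the idea of a meta file for job control, here is a minimal sketch using only the standard library. It assumes the meta file is a small CSV that records which source dates have already been processed, so that a scheduled run can work out which dates are still missing. The column names (`source_date`, `datetime_of_processing`) and the helper names are hypothetical choices for this illustration, and an in-memory buffer stands in for a file stored in S3.

```python
import csv
import io
from datetime import date, timedelta


def read_processed_dates(meta_file):
    """Return the set of source dates already recorded in the meta file."""
    reader = csv.DictReader(meta_file)
    return {date.fromisoformat(row["source_date"]) for row in reader}


def dates_to_process(first_date, today, processed):
    """List every date from first_date to today that is not yet processed."""
    days = (today - first_date).days + 1
    candidates = (first_date + timedelta(days=i) for i in range(days))
    return [d for d in candidates if d not in processed]


def append_processed(meta_file, new_dates, processed_at):
    """Append one meta file row per newly processed date."""
    writer = csv.writer(meta_file, lineterminator="\n")
    for d in new_dates:
        writer.writerow([d.isoformat(), processed_at])


# Hypothetical meta file content; a real job would read it from storage.
meta = io.StringIO(
    "source_date,datetime_of_processing\n"
    "2021-04-12,2021-04-13 06:00:00\n"
    "2021-04-13,2021-04-14 06:00:00\n"
)

processed = read_processed_dates(meta)
todo = dates_to_process(date(2021, 4, 12), date(2021, 4, 15), processed)

# After the (omitted) ETL run succeeds, record the new dates.
append_processed(meta, todo, "2021-04-16 06:00:00")
```

Because the job derives its work list from the meta file rather than from the current date alone, a run that was skipped or failed is picked up automatically the next time the job executes.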

Requirements

  • Basic Python and Pandas knowledge is desirable.
  • Basic ETL and AWS S3 knowledge is desirable.

Description

This course shows each step of writing an ETL pipeline in Python, from scratch to production, using the necessary tools: Python 3.9, Jupyter Notebook, Git and GitHub, Visual Studio Code, Docker and Docker Hub, and the Python packages Pandas, boto3, pyyaml, awscli, jupyter, pylint, moto, coverage and memory-profiler.

Two different approaches to coding in the data engineering field will be introduced and applied: functional and object-oriented programming.

Best practices in developing Python code will be introduced and applied:

  • design principles
  • clean coding
  • virtual environments
  • project/folder setup
  • configuration
  • logging
  • exception handling
  • linting
  • dependency management
  • performance tuning with profiling
  • unit testing
  • integration testing
  • dockerization

What is the goal of this course?

In the course we are going to use the Xetra dataset. Xetra stands for Exchange Electronic Trading and it is the trading platform of the Deutsche Börse Group. This dataset is derived near-time on a minute-by-minute basis from Deutsche Börse’s trading system and saved in an AWS S3 bucket available to the public for free.

The ETL pipeline we are going to create will extract the Xetra dataset from the AWS S3 source bucket on a scheduled basis, create a report using transformations, and load the transformed data into another AWS S3 target bucket.
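The extract, transform and load steps described above can be sketched in a functional style with Pandas. This is only an illustration under stated assumptions, not the course's actual code: in-memory buffers stand in for the S3 source and target buckets (so boto3 is not needed), and the sample rows, column names and report columns are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample standing in for objects read from the source bucket.
SOURCE_CSV = """ISIN,Date,Time,StartPrice,EndPrice,TradedVolume
AT0000A0E9W5,2021-04-15,08:00,10.0,10.5,100
AT0000A0E9W5,2021-04-15,08:01,10.5,11.0,50
DE000A0DJ6J9,2021-04-15,08:00,20.0,19.0,10
"""


def extract(source):
    """Read raw minute-level rows from a file-like object."""
    return pd.read_csv(source)


def transform(df):
    """Aggregate minute-level rows into one report row per ISIN and date."""
    df = df.sort_values(["ISIN", "Date", "Time"])
    return df.groupby(["ISIN", "Date"], as_index=False).agg(
        opening_price=("StartPrice", "first"),
        closing_price=("EndPrice", "last"),
        daily_traded_volume=("TradedVolume", "sum"),
    )


def load(df, target):
    """Write the report to a file-like object as CSV."""
    df.to_csv(target, index=False)


def etl(source, target):
    """Run the three steps in order."""
    load(transform(extract(source)), target)


target = io.StringIO()
etl(io.StringIO(SOURCE_CSV), target)
report_csv = target.getvalue()
```

Keeping each step a small function that takes and returns plain data makes the steps easy to unit test in isolation; swapping the buffers for real S3 reads and writes would only touch `extract` and `load`.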

The pipeline will be written so that it can be deployed easily to almost any production environment that can handle containerized applications. The production environment we are going to write the ETL pipeline for consists of a GitHub code repository, a Docker Hub image repository, an execution platform such as Kubernetes, and an orchestration tool such as Apache Airflow or the container-native Kubernetes workflow engine Argo Workflows.
