Job Title
Lead Data Engineer – Python, PySpark & SQL
Location
Canada
Job Type
Full-time contract
Responsibilities
• Build scalable data ingestion and transformation pipelines using Python, PySpark, and SQL.
• Process raw CSV/text files from AWS S3, including header validation, schema checks, and malformed-file detection.
• Convert raw data into structured DataFrames and implement reusable data quality checks.
• Develop advanced transformations in SQL/PySpark (window functions, LAG(), grouping logic, date gap detection, etc.); an illustrative sketch follows this list.
• Deploy and tune PySpark applications on AWS EMR, optimizing executor memory, cores, shuffle behavior, and cluster performance.
• Work with AWS services such as S3, EMR, Glue, Lambda, and IAM.
• Debug performance issues (OOM errors, shuffle spill, GC problems) and improve pipeline reliability.
• Lead design discussions, code reviews, and mentor junior engineers.
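For illustration only (not part of the job requirements): a minimal PySpark sketch of the date-gap detection pattern referenced above, reading CSV from S3 and using a window with lag(). The bucket path, the events/customer_id/event_date names, and the one-day threshold are assumptions.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("date-gap-detection").getOrCreate()

# Read raw CSV from S3 (hypothetical path) and parse the date column.
events = spark.read.option("header", True).csv("s3://example-bucket/raw/events/")
events = events.withColumn("event_date", F.to_date("event_date"))

# Order each customer's rows by date and compare each row to the previous one.
w = Window.partitionBy("customer_id").orderBy("event_date")
gaps = (
    events
    .withColumn("prev_date", F.lag("event_date").over(w))
    .withColumn("gap_days", F.datediff("event_date", "prev_date"))
    .filter(F.col("gap_days") > 1)  # rows where the daily series has a gap
)
gaps.show()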
Required Skills
• 8+ years of experience in Data Engineering.
• Expert Python (file processing, scripting, validation automation).
• Strong PySpark (DataFrames, job tuning, distributed processing).
• Advanced SQL (analytical functions, performance tuning).
• Hands‑on with AWS data stack: S3, EMR, Glue, Lambda.
• Strong understanding of Spark memory allocation, YARN container usage, and EMR resource tuning (an illustrative configuration sketch follows this list).
• Excellent debugging, communication, and problem‑solving skills.
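For illustration only: a minimal sketch of the Spark resource settings typically tuned on EMR. The concrete values are assumptions for a mid-sized cluster, not recommended defaults, and in practice they are often supplied via spark-submit or EMR cluster configuration rather than hard-coded.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-pipeline")
    .config("spark.executor.memory", "8g")           # heap per executor
    .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom; guards against YARN container kills
    .config("spark.executor.cores", "4")             # concurrent tasks per executor
    .config("spark.sql.shuffle.partitions", "400")   # shuffle parallelism for joins/aggregations
    .getOrCreate()
)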
Nice to Have
• Airflow or Databricks experience.
• Terraform or CloudFormation.
• Experience with data lake formats (Delta, Iceberg, Hudi).
Seniority level
Mid-Senior level
Employment type
Contract
Job function
Information Technology