Data Engineer (Hadoop and PySpark) - Contract

Singapore | Contractor (External)
Negotiable
Roles & Responsibilities
We're looking for a hands-on Data Engineer to design, build, and optimize scalable data pipelines on the Hadoop ecosystem using PySpark. You will partner with data scientists, analytics teams, and product engineering to deliver reliable datasets and batch/streaming pipelines that power business insights and ML use cases.

Key Responsibilities
• Design & build data pipelines using PySpark/Spark (batch and streaming), integrating data from diverse sources (RDBMS, APIs, files, Kafka); see the batch pipeline sketch at the end of this posting.
• Develop and optimize ETL/ELT workflows on the Hadoop ecosystem (HDFS, Hive, Spark), ensuring data quality, lineage, and reliability.
• Model data for analytics/BI/ML (star/snowflake schemas, partitioning, bucketing) and implement efficient storage formats (Parquet/ORC).
• Orchestrate workflows using Airflow (or similar schedulers) with robust dependency management, retries, alerting, and SLA monitoring; see the orchestration sketch at the end of this posting.
• Implement streaming pipelines with Kafka (or similar), including windowed aggregations, exactly-once semantics (where applicable), and schema evolution management; see the streaming sketch at the end of this posting.
• Enable data governance & security (RBAC, masking, encryption at rest/in transit, audit logging, schema registry).
• Performance tuning & cost optimization across Spark configs, shuffle strategies, broadcast joins, caching, and resource sizing; see the performance tuning sketch at the end of this posting.
• Automate CI/CD for data pipelines (unit/integration tests, data quality checks, deployment automation, infrastructure-as-code).
• Collaborate cross-functionally with data scientists, analytics, and platform teams to define SLAs, data contracts, and consumption patterns.
• Documentation & support: maintain runbooks, metadata, and lineage, and provide L2/L3 support for production incidents.

Required Qualifications
• 2-6 years of experience in data engineering, with strong expertise in PySpark and the Hadoop ecosystem (HDFS, Hive, Spark).
• Advanced Python and SQL skills (analytical functions, performance tuning).
• Experience with workflow orchestration (Airflow/Luigi/Prefect) and version control (Git).
• Hands-on experience with data warehousing and modeling concepts; experience optimizing large-scale distributed computations.
• Exposure to streaming (Kafka, Spark Structured Streaming) and schema management.
• Experience with at least one cloud platform (AWS EMR/Glue, Azure HDInsight/Synapse/Databricks, or GCP Dataproc/BigQuery).
• Strong understanding of data quality (DQ rules, Great Expectations or equivalent), metadata, and lineage.
• Familiarity with Linux, shell scripting, and containerization (Docker); basic understanding of CI/CD.
• Excellent communication, stakeholder management, and problem-solving skills.

Skills
Version Control, PySpark, Airflow, Modeling, Pipelines, Hadoop, Data Quality, Data Governance, Data Engineering, SQL, Python, Docker, Metadata, Orchestration, Data Warehousing, Linux
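
Illustrative Sketches
The sketches below are for context only and are not part of the role requirements. All names, paths, schemas, and configuration values in them are assumptions, not details supplied by the hiring team.

Batch pipeline sketch. A minimal example of the kind of PySpark batch work described under Key Responsibilities: load a raw landing file, apply basic cleansing, and write partitioned Parquet for Hive/Spark consumers. The HDFS paths, the order_id key, and the order_ts column are hypothetical.

```python
# Minimal PySpark batch job: load raw CSV, clean it, and write partitioned Parquet.
# Paths, column names, and the "event_date" partition key are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("orders_batch_etl")          # hypothetical job name
    .enableHiveSupport()                  # so the result can be exposed as a Hive table
    .getOrCreate()
)

# Ingest a raw landing file (schema inference kept simple for the sketch).
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///landing/orders/")       # assumed HDFS landing path
)

# Basic cleansing: drop exact duplicates and rows missing the business key.
clean = (
    raw.dropDuplicates(["order_id"])      # assumed business key
       .filter(F.col("order_id").isNotNull())
       .withColumn("event_date", F.to_date("order_ts"))  # assumed timestamp column
)

# Write columnar, partitioned output for downstream Hive/Spark consumers.
(
    clean.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///curated/orders/")   # assumed curated-zone path
)
```

Partitioning by a date column and writing Parquet keeps downstream Hive and Spark queries pruned to the partitions they actually need.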
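
Orchestration sketch. A small Airflow 2.x DAG, assuming BashOperator and a spark-submit step, showing the retries, failure alerting, and SLA monitoring mentioned above. The DAG id, schedule, alert address, and script path are assumptions.

```python
# Illustrative Airflow 2.x DAG wiring a spark-submit step with retries, alerting,
# and an SLA. DAG id, schedule, email, and script path are assumptions for the sketch.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",                       # hypothetical owning team
    "retries": 3,                              # retry transient failures
    "retry_delay": timedelta(minutes=10),
    "email": ["data-eng-alerts@example.com"],  # hypothetical alert address
    "email_on_failure": True,
    "sla": timedelta(hours=2),                 # breach triggers Airflow's SLA-miss handling
}

with DAG(
    dag_id="orders_batch_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",             # daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    run_etl = BashOperator(
        task_id="spark_submit_orders_etl",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/orders_batch_etl.py"    # assumed location of the PySpark job
        ),
    )
```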
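
Streaming sketch. A Spark Structured Streaming job that consumes a Kafka topic, applies a watermark and a 5-minute windowed aggregation, and writes to Parquet with checkpointing. Broker addresses, the topic name, and the message schema are assumptions.

```python
# Sketch of a Spark Structured Streaming job: consume a Kafka topic, compute 5-minute
# windowed aggregates with a watermark, and write to Parquet with checkpointing.
# Broker addresses, topic, schema, and paths are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # assumed brokers
    .option("subscribe", "orders")                      # assumed topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; parse the JSON value into columns.
parsed = events.select(
    F.from_json(
        F.col("value").cast("string"),
        "order_id STRING, amount DOUBLE, event_ts TIMESTAMP",  # assumed message schema
    ).alias("o")
).select("o.*")

# Events up to 10 minutes late are still folded into their 5-minute window.
counts = (
    parsed.withWatermark("event_ts", "10 minutes")
    .groupBy(F.window("event_ts", "5 minutes"))
    .agg(F.count("*").alias("order_count"), F.sum("amount").alias("revenue"))
)

query = (
    counts.writeStream
    .outputMode("append")                               # file sink supports append only
    .format("parquet")
    .option("path", "hdfs:///curated/order_counts/")
    .option("checkpointLocation", "hdfs:///checkpoints/order_counts/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

The watermark bounds the aggregation state, and the checkpoint location lets the query restart from its last committed offsets after a failure.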
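
Performance tuning sketch. Two common levers named in the responsibilities: sizing spark.sql.shuffle.partitions and broadcasting a small dimension table so the large fact table is joined without a shuffle. The partition count and table paths are assumptions.

```python
# Hedged sketch of two common Spark tuning levers: right-sizing shuffle partitions
# and broadcasting a small dimension table to avoid a shuffle join.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("orders_enrichment")
    .config("spark.sql.shuffle.partitions", "400")  # assumed value; size to data volume
    .getOrCreate()
)

orders = spark.read.parquet("hdfs:///curated/orders/")        # large fact table (assumed path)
customers = spark.read.parquet("hdfs:///curated/customers/")  # small dimension (assumed path)

# Broadcasting the small side ships it to every executor, so the large table
# is joined in place instead of being shuffled across the cluster.
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")

# Cache only if the result is reused by several downstream actions.
enriched.cache()
enriched.write.mode("overwrite").parquet("hdfs:///curated/orders_enriched/")
```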