Job Description
Build ingestion pipelines for structured/unstructured data using Python.
Clean, normalize, and prepare data formats suitable for LLM fine-tuning (e.g., JSONL, CSV).
Create high-quality, task-specific datasets for training and evaluation.
Apply versioning to datasets using DVC or LakeFS for reproducibility.
Generate embeddings using HuggingFace or Sentence Transformers.
Manage vector indexes (FAISS, Weaviate) and optimize retrieval workflows.
Tokenize and chunk long-form data for context window optimization.
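The preparation and chunking duties above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the helper names, the word-based chunk size, and the overlap value are all hypothetical stand-ins (a real pipeline would chunk by tokenizer tokens, e.g. with HuggingFace Tokenizers, and size chunks to the target model's context window).

```python
import json

# Toy values for illustration only; real limits depend on the model's
# tokenizer and context window.
CHUNK_SIZE = 8   # words per chunk
OVERLAP = 2      # words shared between consecutive chunks

def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

def chunk_words(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into overlapping word-window chunks so that context
    carries across chunk boundaries."""
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

def to_jsonl(records):
    """Serialize dict records (e.g. prompt/completion pairs) as JSON Lines,
    one JSON object per line, as commonly used for fine-tuning data."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

For example, a 10-word document with the toy settings above yields two chunks whose last two and first two words overlap, and `to_jsonl` emits one record per line ready to write to a `.jsonl` file.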
Requirements
• 10 years' experience in a data engineering role.
• 2 years' experience in an AI-adjacent data role.
• Proficiency in Python, pandas, and text processing tools.
• Familiarity with tokenization libraries (HuggingFace Tokenizers, SentencePiece).
• Experience managing datasets and object storage (MinIO, NFS).
• Understanding of LLM data constraints (context windows, formatting, prompt injection).
Key Skills
Apache Hive, S3, Hadoop, Redshift, Spark, AWS, Apache Pig, NoSQL, Big Data, Data Warehouse, Kafka, Scala.
Employment type: Full-time
Vacancy: 1