Design and Implement Innovative Data Solutions: Develop and maintain advanced ETL pipelines using SQL, Python, and Generative AI, transforming traditional data processes into highly efficient and automated solutions.
Orchestrate Complex Data Workflows: Utilise tools such as Dagster and Airflow for sophisticated pipeline orchestration, ensuring seamless integration and automation of data processes.
Leverage Generative AI for Data Solutions: Create and implement smart data solutions using Generative AI techniques like Retrieval-Augmented Generation (RAG). This includes building solutions that retrieve data from external sources and supply it to large language models (LLMs) so that they provide accurate and contextually enriched responses.
Employ Prompt Engineering: Develop and refine prompt engineering techniques to communicate effectively with LLMs, enhancing the accuracy and relevance of generated responses in various applications.
Utilise Embeddings and Vector Databases: Apply embedding models to convert data into numerical vector representations and store them in vector databases. Perform relevance searches using these embeddings to match user queries with the most relevant data.
Incorporate Semantic Search Techniques: Implement semantic search to enhance the accuracy and relevance of search results, ensuring that data retrieval processes are highly optimised and contextually aware.
Collaborate Across Teams: Work closely with cross-functional teams, including data science and business analytics, to understand and deliver on unique and evolving data requirements.
Ensure High-Quality Data Flow: Leverage stream, batch, and Change Data Capture (CDC) processes to ensure a consistent and reliable flow of high-quality data across all systems.
Enable Business User Empowerment: Use data transformation tools like dbt to prepare and curate datasets, empowering business users to perform self-service analytics.
Maintain Data Quality and Consistency: Implement rigorous standards to ensure data quality and consistency across all data stores, continuously innovating to improve data reliability.
Monitor and Enhance Pipeline Performance: Regularly monitor data pipelines to identify and resolve performance and reliability issues, using innovative approaches to keep systems running optimally.
Desired Candidate Profile
7+ years of experience as a data engineer.
Proficiency in SQL and Python.
Experience with modern cloud data warehousing and data lake solutions such as Snowflake, BigQuery, Redshift, and Azure Synapse.
Expertise in ETL/ELT processes, and experience building and managing batch and streaming data processing pipelines.
Strong ability to investigate and troubleshoot data issues, providing both short-term fixes and long-term solutions.
Experience with Generative AI, including Retrieval-Augmented Generation (RAG), prompt engineering, and embedding techniques for creating and managing vector databases.
Knowledge of AWS services, including DMS, Glue, Bedrock, SageMaker, and Athena.
Familiarity with dbt or other data transformation tools.
Other Desired Experience:
Familiarity with Amazon Bedrock Agents and experience in fine-tuning models for specific use cases to enhance the performance of AI-driven applications.
Proficiency in implementing semantic search to enhance the accuracy and relevance of data retrieval.
Experience with LangChain and related frameworks for building applications that require complex, multi-step reasoning, such as conversational AI, document retrieval, content generation, and automated decision-making processes.
Experience with AWS services and concepts, including Amazon OpenSearch Service, EC2, ECS, EKS, VPC, IAM, and others.
Proficiency with orchestration tools like Dagster, Airflow, AWS Step Functions, and similar platforms.
Experience with pub-sub, queuing, and streaming frameworks such as Amazon Kinesis, Apache Kafka, Amazon SQS, and Amazon SNS.
Familiarity with CI/CD pipelines and automation, ensuring efficient and reliable deployment processes.