We are seeking a skilled Data Engineer to develop and manage the infrastructure required to support data processing, storage, and retrieval for our LLM system. The ideal candidate will have expertise in building scalable, reliable, and high-performance data pipelines and systems. You will collaborate closely with our engineering and research teams to optimize data workflows, ensure data quality and reliability, and drive innovation in the field of language technology.
**Responsibilities:**
- Design, implement, and maintain data pipelines and ETL processes to ingest, process, and transform data for use by our LLM system.
- Build and manage scalable and reliable data storage solutions, including data lakes, data warehouses, and NoSQL databases.
- Ensure seamless integration with either our existing enterprise content management (ECM) system or a new external document storage system as part of data pipeline development and management.
- Optimize data processing and retrieval performance to meet the requirements of our LLM applications.
- Collaborate with cross-functional teams to understand data requirements and develop solutions to meet business needs.
- Implement robust security measures and protocols across data infrastructure to ensure the highest levels of data privacy, integrity, and compliance with industry standards and regulations.
- Monitor and troubleshoot data pipelines and systems to ensure reliability and uptime.
- Continuously evaluate and implement new technologies and tools to improve data infrastructure and workflows.
- Develop and maintain documentation for data infrastructure configurations, processes, and procedures.
**Qualifications:**
- Bachelor's degree in Computer Science, Data Engineering, or a related field (preferred but not required).
- Proficiency in SQL and other data query languages.
- Experience with data modeling, schema design, and optimization techniques.
- Strong proficiency in at least one programming language, such as Python, Java, or Scala.
- Hands-on experience with big data technologies and frameworks, such as Hadoop, Spark, or Kafka.
- Experience with cloud-based data platforms and services, such as AWS, Google Cloud, or Azure.
- Knowledge of data integration and ETL tools, such as Apache NiFi, Talend, or Informatica.
- Familiarity with data governance, security, and compliance frameworks.
- Excellent problem-solving skills and attention to detail.
- Strong communication and collaboration skills.