The CEA LIST Institute and Genoscope are looking for a talented computational biologist to join their laboratory of semantic analysis of texts and images.
In this exciting project, you will join an interdisciplinary team aiming to move closer to the goal of predictive and generative artificial intelligence for biology by exploiting deep contextual language models of biological sequences, whose representations could generalize to several applications, such as predicting the effects of sequence variation.
Exponential growth in sequencing throughput, together with the sampling of natural (uncultured) populations, is providing a deeper view of the diversity of protein sequences across the tree of life. Proteins are the molecular engines sustaining cellular life, and the unobserved determinants of their structure and function are encoded in the distribution of observed natural sequences. Therefore, such vast amounts of (unlabelled) sequences provide evolutionary data that can form the basis for unsupervised learning of predictive and generative models of biological function.
Our focus in this project will be to train high-capacity Transformer-based language models on sequence data, in a way analogous to what is done in natural language understanding, where the semantics of words is determined from the contexts in which they appear in sentences. Intrinsic organizing principles captured in the resulting representations can then be applied in transfer learning settings to different prediction sub-tasks for which only limited experimental data are available, such as predicting the effect of sequence variation on function. Following promising recent results, we also plan to explore zero-shot inference with no additional training and/or supervision from experimental data.
This project will be an excellent opportunity for a candidate who is looking to contribute to cutting-edge research and to train with experts in the field. We are seeking a detail-oriented computer scientist and problem solver who is passionate about science.
- Tune and optimize existing unsupervised transformer-based language models for protein sequences.
- Develop and optimize code and machine learning algorithms for predictive models.
- Integrate and analyze large biological data volumes.
- Interact continuously with scientists in an interdisciplinary team.
- Ph.D. or M.Sc. in a quantitative discipline, e.g. Applied Mathematics, Computer Science, Computational Biology, Physics or a closely related discipline.
- Experience with Python, open-source software libraries for machine learning, and Linux (file systems, shell, hardware/software monitoring, etc.).
- Strong mathematical background and analytical skills.
- Effective organizational skills, e.g. the ability to prioritize work and contribute to the planning of a program of scientific research.
- Demonstrated interpersonal skills, including the ability both to work independently and to perform collaborative research in an interdisciplinary team environment.
- Good oral and written communication skills.
Previous experience with transformer-based techniques for NLP pre-training and unsupervised transformer language models is a plus.
TERMS & COMPENSATION
This 2-year position is open to a range of candidates, from recent college graduates to more experienced scientists (e.g. post-docs). The chosen candidate's salary will be commensurate with their level of education, skills, and experience.
Other benefits include:
- 48 days of paid holidays.
- On-site subsidized restaurant.
- Partial (up to 50%) remote work is possible.
- CEA contribution to the personal company savings plan.
We are located on the Paris-Saclay research campus, south of Paris.
HOW TO APPLY
Interested candidates should submit a resume and short cover letter to deepgenseq « at » saxifrage.saclay.cea.fr
About CEA LIST: https://list.cea.fr/en/
About the LASTI lab: https://kalisteo.cea.fr/index.php/ai/ and https://kalisteo.cea.fr/index.php/textual-and-visual-semantic/
About Genoscope: https://www.genoscope.cns.fr