We warmly invite you to the next Natural Language Processing and Vision (NLPV) 
seminars at the University of Exeter. 
 
Talk 1
Scheduled: Thursday 21 Aug 2025, 16:00 to 17:00 (GMT+1)
Location: 
https://Universityofexeter.zoom.us/j/97587944439?pwd=h4rnPO0PafT9oRrrqQsezGZspPUvdg.1
(Meeting ID: 975 8794 4439, Password: 064414)

Title: Trustworthy Optimization of Pre-Trained Models for Healthcare: 
Generalizability, Adaptability, and Security

Abstract: Pre-trained language models have opened new possibilities in 
healthcare, showing promise in mining scientific literature, analyzing 
large-scale clinical data, identifying patterns in emerging diseases, and 
automating workflows, positioning themselves as intelligent research 
assistants. However, general-purpose models, typically trained on web-scale 
corpora, often lack the clinical grounding necessary for reliable deployment in 
high-stakes domains like healthcare. To be effective, they must be adapted to 
meet domain-specific requirements. My PhD thesis addresses three core 
challenges in leveraging pre-trained models for healthcare: (i) the scarcity of 
labeled data for fine-tuning, (ii) the evolving nature of healthcare data, and 
(iii) the need to ensure transparency and traceability of AI-generated content. 
In this talk, I will focus on the third challenge: enabling traceability of 
content generated by large language models. I will begin with an overview of 
prior watermarking approaches and then present our proposed solution. We 
introduce a watermarking algorithm applied at inference time that perturbs the 
model’s logits to bias generation toward a subset of vocabulary tokens 
determined by a secret key. To ensure that watermarking does not compromise 
generation quality, we propose a multi-objective optimization (MOO) framework 
that employs lightweight networks to produce token-specific watermarking logits 
and splitting ratios, specifying how many tokens to bias and by how much. This 
approach effectively balances watermark detectability with semantic coherence. 
Experimental results show that our method significantly improves detectability 
and robustness against removal attacks while preserving the semantics of the 
generated text, outperforming existing watermarking techniques.
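
For attendees less familiar with logit-based watermarking, the toy Python 
sketch below illustrates the general idea the abstract builds on: a secret key 
determines, at each step, a biased subset of the vocabulary whose logits are 
boosted. This is only a simplified illustration under assumed choices (the 
key, vocabulary, hash scheme, and bias value are all hypothetical), not the 
speaker's method, which instead learns token-specific watermarking logits and 
splitting ratios via multi-objective optimization.

# Toy sketch (illustrative only): key-seeded "green list" watermarking via
# logit perturbation. All names and values below are hypothetical.
import hashlib
import math
import random

SECRET_KEY = "my-secret-key"      # hypothetical secret key
VOCAB = [f"tok{i}" for i in range(16)]
GREEN_FRACTION = 0.5              # splitting ratio: share of vocabulary to bias
BIAS = 2.0                        # watermark strength added to biased logits

def green_list(prev_token):
    """Seed a PRNG with (key, previous token) and pick the biased token subset."""
    seed = int(hashlib.sha256((SECRET_KEY + prev_token).encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(GREEN_FRACTION * len(VOCAB))))

def watermarked_sample(logits, prev_token):
    """Add BIAS to green-token logits, then sample from the softmax."""
    green = green_list(prev_token)
    adjusted = {t: v + (BIAS if t in green else 0.0) for t, v in logits.items()}
    z = sum(math.exp(v) for v in adjusted.values())
    probs = [math.exp(v) / z for v in adjusted.values()]
    return random.choices(list(adjusted.keys()), weights=probs, k=1)[0]

# Detection counts how often generated tokens fall in their step's green list;
# a rate well above GREEN_FRACTION signals the watermark.
example_logits = {t: random.gauss(0.0, 1.0) for t in VOCAB}
print(watermarked_sample(example_logits, prev_token="tok0"))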

Speaker's bio: Dr. Sai Ashish Somayajula is a Senior Applied Scientist in 
Generative AI at Oracle Cloud Infrastructure, where he develops large-scale 
foundation models for enterprise applications. He earned his PhD in Electrical 
and Computer Engineering from the University of California (UC), San Diego. His 
research focused on addressing key challenges in adapting and utilizing 
pre-trained models for healthcare. Specifically, his work spanned three core 
areas: (1) synthetic data generation using meta-learning-based feedback 
mechanisms, (2) continual learning for handling dynamic data streams without 
catastrophic forgetting, and (3) token-level watermarking techniques to ensure 
content provenance and security. His research has been published in premier 
venues, including the International Conference on Machine Learning (ICML), 
Annual Meeting of the Association for Computational Linguistics (ACL), 
Transactions of the Association for Computational Linguistics (TACL), 
Conference of the North American Chapter of the Association for Computational 
Linguistics (NAACL), Scientific Reports (Nature Portfolio), and Transactions of 
Machine Learning Research (TMLR). He is a recipient of the Jacobs School of 
Engineering Departmental Fellowship at UC San Diego. Ashish has collaborated 
with leading industrial research labs through internships at Apple and Tencent 
AI Lab. He holds a Bachelor's degree in Electrical Engineering with a minor in 
Computer Science from the Indian Institute of Technology, Hyderabad, where he 
was twice awarded the Academic Excellence Award, and a Master’s in Intelligent 
Systems and Robotics from UC San Diego.

Talk 2
Scheduled: Thursday 4 Sep 2025, 13:00 to 14:00 (GMT+1)
Location: 
https://Universityofexeter.zoom.us/j/95827730937?pwd=Te1wejfgr68A5lplwLQjxwgcIWGc5K.1
(Meeting ID: 958 2773 0937, Password: 879296)

Title: Towards end-to-end tokenization and adaptive memory in foundation models

Abstract: Foundation models (FMs) process information as a sequence of internal 
representations; however, the length of this sequence is fixed and entirely 
determined by tokenization. This essentially decouples representation 
granularity from information content, which exacerbates the deployment costs of 
FMs and narrows their “horizons” in long sequences. What if, instead, we could 
free FMs from tokenizers by modelling bytes directly, while making them faster 
than current tokenizer-bound FMs? I argue that a recipe to achieve this goal 
already exists. In particular, I helped prototype how to: 1) dynamically pool 
representations in internal layers, progressively learning abstractions from 
raw data; 2) compress the KV cache of Transformers during generation without 
loss of performance; 3) predict multiple bytes per time step in an efficient 
yet expressive way; 4) retrofit existing tokenizer-bound FMs into byte-level 
FMs through cross-tokenizer distillation. By blending these ingredients, we may 
soon witness the emergence of efficient byte-level FMs.
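
As a rough intuition for the first ingredient above (dynamically pooling 
internal representations), the toy Python sketch below merges adjacent 
byte-level vectors into fewer, coarser segment vectors at predicted 
boundaries. The hard-coded boundaries and simple mean-pooling are illustrative 
assumptions standing in for the learned mechanisms discussed in the talk.

# Toy sketch (illustrative only): shorten a byte-level sequence by mean-pooling
# consecutive vectors within predicted segments. Boundaries are hard-coded here;
# in practice they would be predicted by a learned module.

def dynamic_pool(byte_vectors, boundaries):
    """Average consecutive byte vectors within each predicted segment."""
    segments, current = [], []
    for vec, is_boundary in zip(byte_vectors, boundaries):
        current.append(vec)
        if is_boundary:            # a boundary closes the current segment
            segments.append(current)
            current = []
    if current:                    # flush any trailing, unclosed segment
        segments.append(current)
    # mean-pool each segment into a single coarser representation
    return [[sum(dim) / len(seg) for dim in zip(*seg)] for seg in segments]

# Six byte vectors pooled into three segment vectors (sequence length 6 -> 3).
vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [5.0, 1.0], [1.0, 1.0]]
bounds = [False, True, False, True, False, True]
print(dynamic_pool(vectors, bounds))   # -> [[2.0, 1.0], [1.0, 3.0], [3.0, 1.0]]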

Speaker's short bio (adapted from his website): Edoardo Ponti is an assistant professor 
in Natural Language Processing at the University of Edinburgh and a visiting 
professor at NVIDIA. His research focuses on efficient architectures (see 
NeurIPS 2024 tutorial on dynamic sparsity), modular deep learning (designing 
neural architectures that route information to specialised modules, e.g., 
sparse subnetworks), and computational typology (understanding how languages 
vary across the world and its cultures within a computational and mathematical 
framework). Previously, Edoardo was a visiting postdoctoral scholar at Stanford 
University and a postdoctoral fellow in computer science at Mila - Quebec AI 
Institute in Montreal. In 2021, Edoardo obtained his PhD from the University of 
Cambridge, St John's College. Once upon a time, Edoardo studied typological and 
historical linguistics at the University of Pavia. Edoardo's research has been 
featured in The Economist and Scientific American, among others. Edoardo 
received a Google Research Faculty Award and two Best Paper Awards (at EMNLP 
2021 and RepL4NLP 2019). Edoardo is a board member of SIGTYP, the ACL special 
interest group for computational typology, a Scholar of the European Laboratory 
for Learning and Intelligent Systems (ELLIS), and part of the TACL journal 
editorial team.

Future talks will be posted on our website: 
https://sites.google.com/view/neurocognit-lang-viz-group/seminars 
 
Join our *Google group* for future seminar and research information: 
https://groups.google.com/g/neurocognition-language-and-vision-processing-group