Large Language Models (LLMs) have revolutionized natural language processing, but their enormous memory requirements pose significant challenges for both training and inference. In this talk, we delve into techniques and optimizations that mitigate memory constraints across the entire LLM lifecycle.
The first segment explores memory-optimized LLM training. We discuss training challenges and cover techniques under Parameter-Efficient Fine-Tuning (PEFT), such as prompt tuning, LoRA, and adapters; a brief sketch of the LoRA idea follows.
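As a rough illustration of the LoRA idea, the sketch below wraps a frozen linear layer with a low-rank trainable update. This is a minimal example assuming PyTorch; the class name, rank, and scaling are illustrative choices, not the speaker's implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a low-rank trainable update (LoRA sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A maps d_in -> r, B maps r -> d_out; only these train.
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # update starts at zero, preserving the base model
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))
```

Because only the small A and B matrices receive gradients, optimizer state and gradient memory shrink dramatically compared with full fine-tuning, which is the core of the memory savings PEFT provides.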
LLM inference is memory bound rather than compute bound. In this section, we explore inference optimizations, primarily for transformer architectures, including the paged key-value (KV) cache, speculative decoding, quantization, in-flight batching, and FlashAttention, each of which contributes to faster, more efficient inference. A toy KV-cache sketch appears below.
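To make the memory-bound nature of autoregressive decoding concrete, here is a toy KV-cache sketch. It assumes PyTorch and single-head attention with shape (batch, seq, dim); paging, quantization, and batching as implemented in production inference stacks are deliberately omitted.

```python
import torch


def attend_with_kv_cache(q, k_new, v_new, cache):
    """One decode step: append the new token's K/V to the cache and attend over it.

    The cache holds keys/values for every previously generated token, so it grows
    linearly with sequence length; reading it each step is what makes decoding
    memory bound and motivates paged KV caches, offloading, and quantization.
    """
    cache["k"] = k_new if cache["k"] is None else torch.cat([cache["k"], k_new], dim=1)
    cache["v"] = v_new if cache["v"] is None else torch.cat([cache["v"], v_new], dim=1)
    scores = q @ cache["k"].transpose(-2, -1) / cache["k"].shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ cache["v"]


# Usage: one step per generated token; K/V are computed once per token and reused.
cache = {"k": None, "v": None}
q = torch.randn(1, 1, 64)
out = attend_with_kv_cache(q, torch.randn(1, 1, 64), torch.randn(1, 1, 64), cache)
```

Caching avoids recomputing keys and values for the whole prefix at every step, but the cache itself becomes the dominant memory consumer at long context lengths, which is why the optimizations listed above target it.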
Finally, we explore the concept of coherent memory and how it enables inference optimizations such as KV cache offloading and LoRA weight recomputation (see the sketch below).
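A highly simplified sketch of the KV-cache offloading idea follows. The helper function and the explicit CPU copy are hypothetical and for illustration only; on hardware-coherent CPU-GPU memory systems the offloaded tensors can be addressed directly rather than copied back and forth.

```python
import torch


def offload_old_blocks(kv_cache_gpu: dict, keep_last: int):
    """Keep only the most recent tokens' K/V on the GPU; park the rest in host memory.

    Hypothetical helper: freeing GPU memory this way allows larger batches or longer
    contexts, and re-fetching offloaded blocks is cheaper than recomputing the prefix.
    """
    hot = {name: t[:, -keep_last:].contiguous() for name, t in kv_cache_gpu.items()}
    cold = {name: t[:, :-keep_last].to("cpu", non_blocking=True) for name, t in kv_cache_gpu.items()}
    return hot, cold
```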
By illuminating these advancements, this talk aims to provide a comprehensive understanding of state-of-the-art memory optimization techniques for LLMs, empowering practitioners to push the boundaries of natural language processing further.
Arun Raman
Arun Raman is an AI solution architect at NVIDIA, adept at navigating the challenges of deploying AI applications across edge, cloud, and on-premises environments in the consumer internet industry. In his current role, he designs end-to-end accelerated AI pipelines for consumer internet customers, addressing preprocessing, training, and inference optimizations. His experience extends beyond AI to distributed systems and multi-cloud infrastructure. He shares practical strategies and real-world experiences, empowering organizations to leverage AI effectively.