Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
International Symposium on High-Performance Computer Architecture (HPCA), 2025
Keywords
GPU Memory, Large Language Models, Processing Unit, Weight Parameters, Computational Load, Computational Capabilities, Load Balancing, Inference System, Memory Size, Single GPU, Optimal Partition, Efficient Inference, Inference Performance, Neuronal Activity, Batch Size, Data Transfer, Input Sequence, Memory Capacity, State Machine, Computational Overhead, Limited Memory, Limited Bandwidth, Memory Bandwidth, Total Execution Time, Neuronal Computation, Large Batch Size, ReLU Function, Offloading Strategy, Scheduling Scheme, Optimal Mapping