Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

ACL (1) (2024)

Keywords: Language Modeling, Neural Machine Translation