Data Engineering for Large Foundation Models: A Handbook

Jun Yu (editor-in-chief), Chang Wen Chen (editor-in-chief)

Hardback Published on: 14/12/2026

£169.99

UK delivery included

Not available
This product is currently unavailable

Synopsis

Data quality has become a decisive factor for large foundation models, shaping their capability, reliability, alignment, and real-world applicability. Data Engineering for Large Foundation Models: A Handbook provides a systematic, practice-oriented guide to data engineering for these models. Moving beyond a narrow focus on large language models, the book covers the data lifecycle behind language models, vision-language models, multimodal understanding systems, text-to-image and text-to-video generative models, reasoning models, agentic systems, and domain-specific AI applications.

The book presents a full-stack framework for building high-quality data pipelines for foundation-model development. It covers large-scale pre-training data engineering, including data sourcing, acquisition, cleaning, deduplication, decontamination, tokenization, serialization, efficient loading, and quality evaluation. It also addresses multimodal data engineering for image-text, document, video, and audio data, as well as post-training and alignment data construction, including supervised fine-tuning (SFT), preference data, reinforcement learning from human feedback (RLHF), chain-of-thought reasoning data, tool-use data, agent memory, and multi-turn interaction data.

The book further examines data-centric AI systems, including synthetic data factories, knowledge distillation, enterprise-grade retrieval-augmented generation (RAG) and multimodal RAG pipelines, online feedback loops, knowledge updating, DataOps platforms, data governance, privacy protection, federated learning, and compliance-aware data engineering. Through end-to-end projects and reproducible system designs, readers gain hands-on experience with distributed pre-training data pipelines, domain-specific SFT datasets, multimodal instruction data factories, reasoning data flywheels, agent tool-use data factories, enterprise DataOps platforms, privacy-preserving pipelines, open-source model reproduction, and text-to-video training data pipelines. Using modern tools such as Ray, Spark, Dask, Parquet, WebDataset, vector databases, DVC, MLflow, and Airflow, the handbook equips data engineers, MLOps and DataOps professionals, AI researchers, and technical product teams to build reliable, scalable, and continuously improving foundation-model systems.

Publisher information

Publisher: Springer Nature Singapore
ISBN: 9789819228492
Dimensions: 235 x 155 mm
Languages: English