Abstract
Access to large-scale, annotated electronic health records (EHRs) is limited by privacy regulations, which is a major obstacle to training robust clinical NLP models. Synthetic data offers a way to protect privacy, but how well synthetic text
works for fine-tuning LLMs on real-world tasks remains an open question. This thesis presents a
framework that uses synthetic patient summaries to fine-tune a medical LLM for multi-label disease
diagnosis. This approach offers a cost-effective and privacy-focused method for creating clinical diagnostic
tools with minimal use of sensitive real-world data. The results show that synthetic data can be used to
successfully adapt medical LLMs to this task. Such tools could also support hospitals that are struggling with
triage and patient overcrowding.