📝 Abstract:
Access to large-scale, annotated electronic health records (EHRs) is limited by privacy regulations, which is a major obstacle to training robust clinical NLP models. Synthetic data offers a way to protect privacy, but how well synthetic text works for fine-tuning LLMs on real-world tasks remains an open question. This thesis presents a framework that uses synthetic patient summaries to fine-tune a medical LLM for multi-label disease diagnosis. The approach offers a cost-effective, privacy-preserving way to build clinical diagnostic tools with minimal use of sensitive real-world data. The results show that synthetic data can successfully adapt medical models to this task. Such tools could also support hospitals struggling with triage and patient overcrowding.