Synthetic Data for Evaluation
When prototyping a system, you may not have enough real data to thoroughly evaluate how it will behave. This guide walks through two techniques for generating synthetic data to augment your dataset:
Paraphrasing Existing Inputs: Generate semantically similar variations of your existing examples to test your system's consistency.
Generating New Inputs: Create entirely new, plausible inputs to test how your system generalizes to other scenarios.
Synthetic data is not a full substitute for real data. The quality of the data generated by these methods depends on factors like the model, prompt, and existing data. Always inspect synthetic datasets to ensure they capture the information you want to model and align with your use case.
Prerequisites
This guide assumes you've already connected to LangSmith and have a starter dataset. Though we will use LangChain core below, the technique is simple enough to apply using whatever tools you're comfortable with.
pip install langsmith langchain_openai
Paraphrasing
Paraphrasing existing inputs helps check whether your chain's behavior is consistent across similar inputs. Because a paraphrase is meant to preserve meaning, your system's output for each paraphrase should match its output for the original input. Here's how to set it up:
Step 1: Define the Paraphrase Generator
Create a chain for generating paraphrases using the ChatOpenAI model with custom system prompts.
import re
from typing import List

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

paraphrase_llm = ChatOpenAI(temperature=0.5)
prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful paraphrasing assistant tasked with rephrasing text.",
        ),
        ("system", "Input: <INPUT>{query}</INPUT>"),
        (
            "user",
            "What are {n_paraphrases} different ways you could paraphrase the INPUT text?"
            " Do not significantly change the meaning."
            " Respond using numbered bullets. If you cannot think of any,"
            " just say 'I don't know.'",
        ),
    ]
)


def parse(output: str) -> List[str]:
    # Capture the text after each "1.", "2.", ... marker, including the last
    # line even when the response has no trailing newline.
    return re.findall(r"^\s*\d+\.\s+(.*)$", output, flags=re.MULTILINE)


# StrOutputParser converts the chat model's message into the plain string parse expects.
paraphrase_chain = prompt_template | paraphrase_llm | StrOutputParser() | parse
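The numbered-bullet parsing step can be checked on its own, without calling a model. The snippet below is a minimal sketch: `parse_numbered` is a hypothetical standalone helper and the sample response is made up for illustration. It also shows why a pattern that requires a newline after every item silently drops the last one.

```python
import re
from typing import List


def parse_numbered(output: str) -> List[str]:
    """Return the text of each line that starts with a number and a period."""
    matches = re.findall(r"^\s*\d+\.\s+(.+)$", output, flags=re.MULTILINE)
    return [m.strip() for m in matches]


# Made-up model response; note there is no newline after the final item.
sample = (
    "1. How do I reset my password?\n"
    "2. What's the way to change my password?\n"
    "3. Can you help me reset my password?"
)

parsed = parse_numbered(sample)

# The "I don't know." fallback contains no numbered lines, so nothing is parsed.
fallback = parse_numbered("I don't know.")

# A pattern that insists on a trailing newline misses the final item.
newline_only = re.findall(r"\d+\.\s+(.*?)\n", sample)

print(parsed)
print(fallback)
print(len(newline_only))
```

An empty list for an example is worth logging, since it means no paraphrases will be added for that example.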
Step 2: Paraphrase the Dataset
Use the Client from LangSmith to access your dataset and generate paraphrases for it.
from langsmith import Client

client = Client()
n_paraphrases = 3
dataset_name = "Your Dataset Name"
# list_examples returns an iterator; materialize it because it is read twice below.
examples = list(client.list_examples(dataset_name=dataset_name))
results = paraphrase_chain.batch(
    [
        {"query": next(iter(example.inputs.values())), "n_paraphrases": n_paraphrases}
        for example in examples
    ]
)
inputs, outputs = [], []
for example, batch_r in zip(examples, results):
    # Reuse the original example's input key and reference output.
    input_key = next(iter(example.inputs))
    for r in batch_r:
        inputs.append({input_key: r})
        outputs.append(example.outputs)
client.create_examples(
    inputs=inputs,
    outputs=outputs,
    dataset_name=dataset_name,
)
After running, your dataset should be roughly 4x its original size: each original example plus up to 3 paraphrases that share its reference output.
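One way to use the paraphrased examples is to measure how often the system's output for a paraphrase matches its output for the original input. The following is a minimal sketch and not part of LangSmith: `consistency_rate` and `toy_system` are hypothetical names, the toy keyword classifier stands in for your chain, and exact equality is assumed to be a meaningful comparison. Free-text outputs would need a looser, semantic comparison.

```python
from typing import Callable, Dict, List


def consistency_rate(
    system: Callable[[str], str],
    paraphrase_groups: Dict[str, List[str]],
) -> float:
    """Fraction of paraphrases whose output equals the output for the original input."""
    total = 0
    matches = 0
    for original, paraphrases in paraphrase_groups.items():
        expected = system(original)
        for paraphrase in paraphrases:
            total += 1
            if system(paraphrase) == expected:
                matches += 1
    # With no paraphrases there is nothing to disagree with.
    return matches / total if total else 1.0


def toy_system(text: str) -> str:
    """Hypothetical stand-in for the system under test: a keyword intent classifier."""
    lowered = text.lower()
    return "refund" if "refund" in lowered or "money back" in lowered else "other"


groups = {
    "How do I get a refund?": [
        "What is the process for a refund?",
        "How can I get my money back?",
        "Can I return this item?",
    ],
}

rate = consistency_rate(toy_system, groups)
print(f"{rate:.2f}")
```

A low rate points either to a brittle system or to paraphrases that drifted from the original meaning, so inspect the mismatches before drawing conclusions.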
Generating New Inputs
To expand your dataset's semantic range and test your system's robustness, you can generate entirely new, plausible inputs. The method below samples 5 existing examples at random and asks the model for 6 new inputs that fit the same application but are distinct enough to plausibly come from different users.
Step 1: Define the New Input Generator
Create a chain for generating new inputs using the ChatOpenAI model with custom system prompts.
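from langchain_core.output_parsers import StrOutputParser

input_gen_llm = ChatOpenAI(temperature=0.5)
input_gen_prompt_template = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            # Update this prompt to more closely match your use case
            "You are a creative assistant tasked with coming up with new inputs for an application."
            "\nThe following are some examples you can use to understand the domain:\n\n{examples}",
        ),
        (
            "user",
            "Can you generate {n_inputs} unique and plausible inputs that could be asked by different users?"
            # parse only picks up numbered lines, so ask for that format explicitly.
            " Respond using numbered bullets.",
        ),
    ]
)
# Reuses parse from the paraphrasing section; StrOutputParser converts the
# chat model's message into the plain string parse expects.
input_gen_chain = input_gen_prompt_template | input_gen_llm | StrOutputParser() | parse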
Step 2: Generate New Inputs for the Dataset
Use the Client from LangSmith to access your dataset, sample a set of existing inputs, and generate new inputs based on them. Note that new inputs don't come with corresponding outputs, so you may need to manually label them or use a separate model to generate the outputs.
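If you label the new inputs with a separate model or function, the labeling step can be kept apart from generation. The following is a minimal sketch under stated assumptions: `label_inputs` and `toy_labeler` are hypothetical names, and `toy_labeler` is a placeholder for whatever model or human-in-the-loop process produces your reference outputs. Inputs the labeler rejects are left out so they can be reviewed by hand.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


def label_inputs(
    inputs: List[Record],
    label_fn: Callable[[Record], Record],
) -> Tuple[List[Record], List[Record]]:
    """Pair each input with an output from label_fn, skipping inputs it rejects."""
    labeled_inputs: List[Record] = []
    labeled_outputs: List[Record] = []
    for example_input in inputs:
        try:
            output = label_fn(example_input)
        except ValueError:
            # The labeler could not handle this input; leave it for manual review.
            continue
        labeled_inputs.append(example_input)
        labeled_outputs.append(output)
    return labeled_inputs, labeled_outputs


def toy_labeler(example_input: Record) -> Record:
    """Hypothetical placeholder labeler; swap in a model call or manual labels."""
    question = example_input["input"]
    if not question.strip():
        raise ValueError("empty input")
    return {"output": f"TODO: review answer for: {question}"}


new_inputs = [
    {"input": "How do I export my data?"},
    {"input": "   "},
    {"input": "Can I change my plan?"},
]

labeled_inputs, labeled_outputs = label_inputs(new_inputs, toy_labeler)
print(len(labeled_inputs), len(labeled_outputs))
```

The two returned lists stay aligned, so they can be passed as the `inputs` and `outputs` arguments of `client.create_examples` in the same way as in the upload code in this guide.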
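import random

client = Client()
n_inputs = 6
dataset_name = "Your Dataset Name"
sample_size = 5
examples = list(client.list_examples(dataset_name=dataset_name))
# Guard against datasets with fewer than sample_size examples.
sampled = random.sample(examples, min(sample_size, len(examples)))
example_inputs = [next(iter(example.inputs.values())) for example in sampled]
example_inputs_str = "\n".join(f"- {example_input}" for example_input in example_inputs)
results = input_gen_chain.batch(
    [{"examples": example_inputs_str, "n_inputs": n_inputs}]
)
# Reuse the dataset's existing input key so new examples match the original schema.
input_key = next(iter(examples[0].inputs))
inputs = [{input_key: r} for r in results[0]]
# New inputs have no reference outputs yet; give each its own empty dict.
outputs = [{} for _ in inputs]
client.create_examples(
    inputs=inputs,
    outputs=outputs,
    dataset_name=dataset_name,
)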
After running, your dataset should contain new examples that differ more significantly from the original ones, helping you test your system's robustness to a wider range of inputs.
Considerations
Remember that the quality of the paraphrases and generated inputs will depend on the model and prompt used, and these approaches may not be appropriate for all use cases. Always check your augmented data to ensure it maintains the original meaning, aligns with the system's context, and is suitable for your application.
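Part of that check can be automated before anything is uploaded. The sketch below is one possible first pass, not a replacement for manual review: `filter_candidates` is a hypothetical helper that drops empty strings, copies of the original input, and repeated candidates, comparing text after lowercasing and collapsing whitespace. It cannot tell whether a candidate preserves meaning.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def filter_candidates(original: str, candidates: List[str]) -> List[str]:
    """Keep candidates that are non-empty, differ from the original, and are not repeated."""
    seen = {_normalize(original)}
    kept: List[str] = []
    for candidate in candidates:
        key = _normalize(candidate)
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(candidate.strip())
    return kept


kept = filter_candidates(
    "How do I reset my password?",
    [
        "how do I reset my  password?",            # same as the original after normalizing
        "What's the way to change my password?",
        "",                                        # empty model output
        "What's the way to change my password? ",  # repeat of an earlier candidate
    ],
)
print(kept)
```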
Synthetic data is most useful early in the development process, when you're trying to gauge how sensitive your chain or model is to input variations. By combining paraphrasing, new input generation, and other augmentation methods, you can expand and diversify your dataset to verify the feasibility and robustness of a feature before deploying it to production.