Software Engineering for Data Scientists, Part 1: Pydantic Is All You Need for Poor Performance Spaghetti Code
I love Pydantic. And I’ve witnessed some of the worst code written with Pydantic. Pure spaghetti, non-performant code.
There are two major anti-patterns in abusing Pydantic for maximum spaghetti. The first is serdes debt: instead of being used only at service boundaries for validation, Pydantic is used everywhere, incurring heavy serialization, deserialization, and memory allocation costs. The second is inheritance over composition: Pydantic models are commonly built on deep layers of inheritance, which breaks basic OOP and SOLID principles.
In this post, we will discuss the serdes debt anti-pattern from using Pydantic.
Anti-pattern: SerDes Debt
Pydantic is primarily used for data validation, with support for data schema and data serialization and deserialization (serdes).
Serialization is when we need to take an object, in this case a Pydantic object, and convert it into a JSON string. Deserialization is when we need to take a JSON string and deserialize it into an object.
In Python terms, deserialization takes a string and converts it into a nested dictionary of mixed types (yay dynamic typing). Sometimes these conversions need validation to ensure the data is as expected.
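To make the "nested dictionary of mixed types" point concrete, here is a minimal sketch using only the standard library `json` module. The payload is a made-up example, not data from the benchmark.

```python
import json

# Hypothetical payload, for illustration only.
raw = '{"id": 1, "name": "Ada", "is_active": true, "address": {"city": "London"}, "tags": ["admin"]}'

# Deserialization: JSON string -> nested dict of mixed Python types.
data = json.loads(raw)
types = {key: type(value).__name__ for key, value in data.items()}

# Serialization: nested dict -> JSON string.
text = json.dumps(data)

# Nothing is validated along the way: a string where an int was
# expected is accepted without complaint.
unchecked = json.loads('{"id": "one"}')
```

This is the gap that validation fills: the parser only guarantees well-formed JSON, not that `id` is an integer.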
If it’s just for pure serdes, there are far faster and more efficient serdes packages like msgspec, orjson, or attrs. In fact, Pydantic can be set up to use orjson.
The core use case for Pydantic is data validation. Outside of custom data validation, the best practice is to avoid Pydantic.
Here’s a simple benchmark to demonstrate why.
Performance Benchmark
We based our benchmark on a simple two-class data structure. The Python dataclass implementation is shown below, and we benchmark it against the equivalent in Pydantic. We did not implement data validation in either.
from dataclasses import dataclass
from typing import List


@dataclass
class Address:
    street: str
    city: str
    country: str
    postal_code: str


@dataclass
class User:
    id: int
    name: str
    email: str
    age: int
    is_active: bool
    address: Address  # nested dataclass
    tags: List[str]
Based on this simple data model, we observe that Python dataclasses are far superior in both run time and memory usage.
- Creation Performance:
  - Dataclasses are ~6.5x faster for creating instances from dictionaries
- JSON Operations Performance:
  - Serialization: Dataclasses ~1.5x faster
  - Deserialization: Dataclasses ~1.5x faster
  - Full round-trip: Dataclasses ~1.5x faster overall
- Bulk Operations: The performance gap remains consistent at scale
- Field Access Performance: Nearly identical performance between Dataclasses and Pydantic
- Memory Consumption: Dataclasses consume ~2.5x less memory
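The benchmark script itself is not reproduced in this post, so the following is only a minimal sketch of what the dataclass side of the measured operations might look like. The helper names (`user_from_dict`, `user_to_json`, `user_from_json`, `mean_ms`) and the sample record are assumptions for illustration, not the code that produced the numbers above.

```python
import json
import timeit
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Address:
    street: str
    city: str
    country: str
    postal_code: str


@dataclass
class User:
    id: int
    name: str
    email: str
    age: int
    is_active: bool
    address: Address
    tags: List[str]


# Hypothetical sample record; the post does not show its test data.
SAMPLE = {
    "id": 1,
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "age": 36,
    "is_active": True,
    "address": {
        "street": "1 Example Street",
        "city": "London",
        "country": "UK",
        "postal_code": "N1 9GU",
    },
    "tags": ["admin", "beta"],
}


def user_from_dict(data: dict) -> User:
    """Build a User from a nested dict, with no validation at all."""
    fields = dict(data)  # shallow copy so the caller's dict is not mutated
    fields["address"] = Address(**fields["address"])
    return User(**fields)


def user_to_json(user: User) -> str:
    """Serialize via dataclasses.asdict and the standard json module."""
    return json.dumps(asdict(user))


def user_from_json(text: str) -> User:
    """Deserialize a JSON string back into a User."""
    return user_from_dict(json.loads(text))


def mean_ms(func, iterations: int = 10_000) -> float:
    """Mean wall-clock time per call, in milliseconds."""
    total_seconds = timeit.timeit(func, number=iterations)
    return total_seconds / iterations * 1000


if __name__ == "__main__":
    user = user_from_dict(SAMPLE)
    payload = user_to_json(user)
    print(f"create from dict: {mean_ms(lambda: user_from_dict(SAMPLE)):.4f} ms")
    print(f"serialize:        {mean_ms(lambda: user_to_json(user)):.4f} ms")
    print(f"deserialize:      {mean_ms(lambda: user_from_json(payload)):.4f} ms")
```

Absolute timings depend on the machine and Python version, so treat any numbers this prints as relative indicators only.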
Tips on Fixing Pydantic Anti-patterns
- Only use Pydantic at service boundaries, e.g., API request and response validation. Do not use Pydantic within a service itself.
Here’s the Pydantic team themselves:
- Static type checking with mypy. Avoid dynamic type checking. If dynamic type-checking is really needed, rewrite in Rust.
- Composition over Inheritance. Object inheritance creates additional layers of abstraction. Duplication is far cheaper than having more abstraction. Don’t Repeat Yourself (DRY) should be used sparingly.
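The first tip above (Pydantic only at service boundaries) could be sketched as follows. This assumes Pydantic v2 (`model_validate_json` and `model_dump`); `UserRequest`, `User`, `parse_request`, and `is_adult` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

from pydantic import BaseModel, ValidationError


class UserRequest(BaseModel):
    """Boundary schema (hypothetical): validates untrusted input once."""

    id: int
    name: str
    age: int


@dataclass(frozen=True)
class User:
    """Internal representation: a plain dataclass, no validation overhead."""

    id: int
    name: str
    age: int


def parse_request(raw_json: str) -> User:
    """Validate at the edge, then hand a plain object to the rest of the service."""
    request = UserRequest.model_validate_json(raw_json)
    return User(**request.model_dump())


def is_adult(user: User) -> bool:
    """Internal logic works on the dataclass and never touches Pydantic."""
    return user.age >= 18
```

Invalid input raises `ValidationError` inside `parse_request`, so everything past that point can rely on static types instead of re-validating.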
References
- JSON extra uses orjson instead of ujson #599
- Reddit: Should I use pydantic for all my classes?
- X: Developer priorities throughout their career - LeaVerou
Appendix
Time complexity
Performance Comparison: Pydantic vs Dataclasses
============================================================
Test data structure: Nested user profile with address
Iterations per test: 10,000
Python dataclasses: Built-in
Pydantic version: 2.5.3
Warming up...
Running benchmarks...
Benchmarking: Instance Creation from Dict
Instance Creation from Dict
============================================================
Metric               Pydantic        Dataclasses
------------------------------------------------------------
Mean                 0.0543 ms       0.0084 ms
Median               0.0531 ms       0.0082 ms
Min                  0.0497 ms       0.0076 ms
Max                  0.0892 ms       0.0156 ms
Stdev                0.0041 ms       0.0008 ms
Dataclasses is 6.46x faster
Benchmarking: Convert to Dictionary
Convert to Dictionary
============================================================
Metric               Pydantic        Dataclasses
------------------------------------------------------------
Mean                 0.0287 ms       0.0153 ms
Median               0.0282 ms       0.0151 ms
Min                  0.0265 ms       0.0142 ms
Max                  0.0421 ms       0.0234 ms
Stdev                0.0023 ms       0.0011 ms
Dataclasses is 1.88x faster
Benchmarking: Serialize to JSON String
Serialize to JSON String
============================================================
Metric               Pydantic        Dataclasses
------------------------------------------------------------
Mean                 0.0361 ms       0.0247 ms
Median               0.0355 ms       0.0243 ms
Min                  0.0334 ms       0.0228 ms
Max                  0.0512 ms       0.0387 ms
Stdev                0.0029 ms       0.0019 ms
Dataclasses is 1.46x faster
Benchmarking: Deserialize from JSON String
Deserialize from JSON String
============================================================
Metric               Pydantic        Dataclasses
------------------------------------------------------------
Mean                 0.0678 ms       0.0463 ms
Median               0.0669 ms       0.0457 ms
Min                  0.0632 ms       0.0431 ms
Max                  0.0943 ms       0.0612 ms
Stdev                0.0048 ms       0.0027 ms
Dataclasses is 1.46x faster
Benchmarking: Field Access
Field Access
============================================================
Metric               Pydantic        Dataclasses
------------------------------------------------------------
Mean                 0.0013 ms       0.0012 ms
Median               0.0013 ms       0.0011 ms
Min                  0.0011 ms       0.0010 ms
Max                  0.0019 ms       0.0018 ms
Stdev                0.0001 ms       0.0001 ms
Dataclasses is 1.08x faster
Benchmarking: Bulk Creation (100 items)
Bulk Creation (100 items)
============================================================
Metric               Pydantic        Dataclasses
------------------------------------------------------------
Mean                 54.312 ms       8.427 ms
Median               53.867 ms       8.356 ms
Min                  52.145 ms       8.123 ms
Max                  58.923 ms       9.234 ms
Stdev                1.234 ms        0.187 ms
Dataclasses is 6.45x faster
SUMMARY
============================================================
Performance Summary:
- Instance Creation from Dict: Dataclasses is 6.46x faster
- Convert to Dictionary: Dataclasses is 1.88x faster
- Serialize to JSON String: Dataclasses is 1.46x faster
- Deserialize from JSON String: Dataclasses is 1.46x faster
- Field Access: Dataclasses is 1.08x faster
- Bulk Creation (100 items): Dataclasses is 6.45x faster
DETAILED JSON OPERATIONS COMPARISON
============================================================
Round-trip JSON test (dict -> object -> JSON -> object -> dict):
Pydantic: 10.42 ms
Dataclasses: 7.15 ms
Ratio: 0.69x
Tested with Pydantic v2.5.3
Space Complexity
MEMORY USAGE COMPARISON
============================================================
1. Single Instance Memory Usage:
Address object (deep size):
Pydantic: 1,776 bytes
Dataclass: 568 bytes
Difference: 1,208 bytes (212.7% more)
User object (deep size):
Pydantic: 3,424 bytes
Dataclass: 1,312 bytes
Difference: 2,112 bytes (161.0% more)
2. Bulk Creation Memory Usage (1000 instances):
Pydantic: 3,287.45 KB (3,287.45 bytes per instance)
Dataclasses: 1,245.78 KB (1,245.78 bytes per instance)
Difference: 2,041.67 KB (163.9% more)
3. JSON Operations Memory Usage:
Per JSON deserialization:
Pydantic: 4,256 bytes
Dataclasses: 2,184 bytes
Difference: 2,072 bytes
4. Attribute Storage Analysis:
Pydantic User attributes: 8 stored attributes
Dataclass User attributes: 7 stored attributes
Pydantic internals:
__dict__: 296 bytes (dict)
__pydantic_fields_set__: 216 bytes (set)
__pydantic_extra__: 0 bytes (NoneType)
__pydantic_private__: 0 bytes (NoneType)
address: 72 bytes (PydanticAddress)
age: 28 bytes (int)
email: 74 bytes (str)
id: 28 bytes (int)
is_active: 28 bytes (bool)
name: 57 bytes (str)
tags: 88 bytes (list)
Dataclass internals:
address: 72 bytes (DataclassAddress)
age: 28 bytes (int)
email: 74 bytes (str)
id: 28 bytes (int)
is_active: 28 bytes (bool)
name: 57 bytes (str)
tags: 88 bytes (list)
5. Memory Efficiency Summary:
- Dataclasses use ~40-50% less memory per instance
- Pydantic stores additional metadata for validation
- The memory gap increases with more complex models
- Consider memory usage for large-scale applications
6. Visual Memory Comparison (per 1000 instances):
Pydantic:    [████████████████████████████████████████] 3,287.5 KB
Dataclasses: [███████████████                         ] 1,245.8 KB
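The measurement code behind the "deep size" figures above is not shown, so here is one way such a number could be obtained: a recursive walk over `sys.getsizeof`. The `deep_sizeof` helper and the `Point` class are hypothetical and given only as a sketch; the results vary by Python version and platform and will not necessarily match the byte counts reported above.

```python
import sys
from dataclasses import dataclass


def deep_sizeof(obj, _seen=None) -> int:
    """Approximate total size in bytes of obj and everything it references.

    Each object is counted once (tracked by id), so shared references
    are not double-counted.
    """
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            deep_sizeof(key, _seen) + deep_sizeof(value, _seen)
            for key, value in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, _seen) for item in obj)

    # Instances of ordinary classes keep their attributes in __dict__.
    if hasattr(obj, "__dict__"):
        size += deep_sizeof(vars(obj), _seen)
    return size


@dataclass
class Point:
    """Tiny hypothetical class used only to exercise deep_sizeof."""

    x: int
    y: int


point = Point(1, 2)
shallow = sys.getsizeof(point)  # the instance header only
deep = deep_sizeof(point)       # header + __dict__ + keys + values
```

A shallow `sys.getsizeof` ignores the instance `__dict__` and its contents, which is where most of the per-instance cost lives.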
@article{
leehanchung,
author = {Lee, Hanchung},
title = {No Code, Low Code, Full Code},
year = {2025},
month = {07},
howpublished = {\url{https://leehanchung.github.io}},
url = {https://leehanchung.github.io/blogs/2025/06/26/no-code-low-code-full-code/}
}