```python
%xmode minimal
import pandera as pa
from pandera.typing import DataFrame, Series, Index
import pandas as pd

data = pd.DataFrame(
    {
        "name": ["Josh", "Bob", "Mary", "John", "Jane"],
        "score": [4, pd.NA, 6, 7, 8],
    },
    index=pd.Series(["a", "b", "c", "d", "e"], name="user_id"),
).astype({"score": pd.Int64Dtype()})
```
# Pandera Tutorial
Pandera is a package that provides schemas for data validation and type hinting. It provides three main benefits:
- Explicitly define the schema of your data
- Use the schema to type hint your code
- Validate your data against the schema
## Define a schema
While working with a dataset, you will probably ask yourself “what are the columns of this dataset?” several times.
This is especially the case when you are no longer at the initial stage of your project, or when you are working with a dataset created by someone else.
Pandera provides a schema object that lets you explicitly define the schema of your data: a Python object that holds all the information about the columns of the dataset.
Let’s start with a dataset that contains the following columns:
```python
data
```

| user_id | name | score |
|---|---|---|
| a | Josh | 4 |
| b | Bob | <NA> |
| c | Mary | 6 |
| d | John | 7 |
| e | Jane | 8 |
A schema for this dataframe can be defined as follows:
```python
class MyMinimalSchema(pa.SchemaModel):
    name: Series
    score: Series
    user_id: Index
```
Or with additional rules like this:
```python
class MySchema(pa.SchemaModel):
    name: Series[str] = pa.Field(nullable=False)
    score: Series[pd.Int64Dtype] = pa.Field(ge=4, lt=9, nullable=True)
    user_id: Index[str] = pa.Field(check_name=True)
```
## Schema Rules
The schema defines the data types and constraints for each column in the DataFrame.
`MySchema` contains the following information:

- `name` is a `str` column that cannot receive null values
- `score` is an `Int64` column that is nullable, greater than or equal (`ge`) to 4, and less than (`lt`) 9
- `user_id` is a `str` index
## Magic Strings
Not only does `MySchema` work as documentation for your data, with the expected columns and their requirements, it also allows you to use this information in your code.
```python
data[MySchema.name]
```

```
user_id
a    Josh
b     Bob
c    Mary
d    John
e    Jane
Name: name, dtype: object
```
Using this inside your code has two main benefits. First, it makes your life easier through autocomplete, which can suggest the columns of the dataframe as you type.
Second, it avoids magic strings: you no longer refer to a column with a raw string, but with a variable defined in the schema.
Here is a very good explanation of what magic strings are and why we should avoid them.
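As a small sketch of this in practice (assuming the `data` frame defined above, and relying on the fact that schema attributes stand in for their column names, which is what makes `data[MySchema.name]` work):

```python
# Schema attributes compare equal to their column names, so they can be
# used anywhere pandas expects a column label.
assert MySchema.score == "score"

# No raw "score" string anywhere in the call.
top_scores = data.sort_values(MySchema.score, ascending=False)
```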
## Use the schema to type hint your code
Not knowing the contents of your dataframe while reading the code was already bad, but not knowing the output of your functions is just as bad.
Pandera to the rescue! You can also use the schema to type hint your data, so you can know the inputs and outputs of your functions.
```python
def get_score_column(df: DataFrame[MySchema]) -> Series[pd.Int64Dtype]:
    return df[MySchema.score]
```
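A hypothetical caller makes the benefit visible: the annotations document the contract of the function, although by themselves they do not validate anything (that is what `@pa.check_types` adds in the next section).

```python
# The signature already tells us we get the Int64 score column back;
# nothing is validated yet without the check_types decorator.
scores = get_score_column(data)
print(scores.dtype)  # Int64
```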
## Validate your data against the schema
With all this information documented about our data, we can also use it for validation.
The first way we can do this is with the `check_types` decorator, which checks the inputs and outputs of your function. This code is expected to pass:
```python
@pa.check_types
def fill_score_column(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    return df.fillna({MySchema.score: 5})

fill_score_column(data)
```
| user_id | name | score |
|---|---|---|
| a | Josh | 4 |
| b | Bob | 5 |
| c | Mary | 6 |
| d | John | 7 |
| e | Jane | 8 |
But this one is not, because 0 is not a valid score according to the schema:
```python
@pa.check_types
def fill_score_column(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    return df.fillna({MySchema.score: 0})

fill_score_column(data)
```
```
SchemaError: error in check_types decorator of function 'fill_score_column': <Schema Column(name=score, type=DataType(Int64))> failed element-wise validator 0:
<Check greater_than_or_equal_to: greater_than_or_equal_to(4)>
failure cases:
  index  failure_case
0     b             0
```
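If you would rather handle the failure than let it propagate, the exception can be caught like any other; as a sketch, `SchemaError` also carries the offending rows in its `failure_cases` attribute:

```python
# A sketch of reacting to a validation failure programmatically.
try:
    fill_score_column(data)
except pa.errors.SchemaError as err:
    print(err.failure_cases)  # dataframe of the rows that broke the schema
```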
## More usage
### Checks and coercions
By default, `check_types` will check that:
- the data has all the required columns
- all columns match the types
- all values are non-null
It will not:
- check the index name
- coerce the data types
- check that only required columns are present
We can customize most of these behaviors:
- it's not possible to ignore a column, but you can subclass the schema (see below)
- removing the type annotation will remove the type check
- `pa.Field(nullable=True)` will allow null values
- `pa.Field(check_name=True)` will check the index name
- `pa.Field(coerce=True)` will coerce the data types; `Config.coerce` will coerce the whole dataframe
- `strict = True` (in `Config`) will check that only required columns are present
```python
class ExtraSchema(pa.SchemaModel):
    name: Series = pa.Field(nullable=False)  # no type check
    score: Series[pd.Int64Dtype] = pa.Field(ge=4, lt=9, nullable=True)  # null values allowed
    user_id: Index[str] = pa.Field(check_name=True)  # index name check
    date: Series[pd.Timestamp] = pa.Field(nullable=True)  # extra column

    class Config:
        coerce = True  # coerce all values to the specified type
        strict = True  # only the declared columns are allowed
```
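To see what `coerce` buys you, here is a quick sketch with a hypothetical `raw` frame whose `score` arrives as `float64`; validation converts it to the declared `Int64` type:

```python
# Hypothetical raw frame: score is float64, date is already a Timestamp.
raw = pd.DataFrame(
    {"name": ["Josh"], "score": [4.0], "date": [pd.Timestamp("2022-01-01")]},
    index=pd.Series(["a"], name="user_id"),
)

# Config.coerce converts score to Int64 while the other rules are checked.
validated = ExtraSchema.validate(raw)
print(validated.dtypes)  # score is now Int64
```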
### Inheritance
Just like ordinary classes, you can inherit schemas. This can be useful when you are building one dataset on top of another:
```python
class BaseUserSchema(pa.SchemaModel):
    name: Series[str]
    user: Index[str] = pa.Field(check_name=True)


class UserWithScoreSchema(BaseUserSchema):
    score: Series[int]
```
`UserWithScoreSchema` will therefore also have all the properties defined in `BaseUserSchema`.
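A quick way to confirm this (a sketch using the `to_schema()` conversion covered in the next section): the child schema materializes with both the inherited and its own columns.

```python
# The inherited `name` column appears alongside the new `score` column.
print(list(UserWithScoreSchema.to_schema().columns))  # ['name', 'score']
```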
### Programmatic validation
You can turn a schema model into a schema object to access its rules at runtime and validate data at will.
```python
schema_object = MySchema.to_schema()
schema_object
```

```
<Schema DataFrameSchema(columns={'name': <Schema Column(name=name, type=None)>, 'score': <Schema Column(name=score, type=DataType(Int64))>}, checks=[], index=<Schema Index(name=user_id, type=DataType(str))>, coerce=True, dtype=None, strict=False, name=MySchema, ordered=False, unique_column_names=False)>
```
```python
schema_object.columns
```

```
{'name': <Schema Column(name=name, type=None)>,
 'score': <Schema Column(name=score, type=DataType(Int64))>}
```
```python
list(schema_object.columns.keys())
```

```
['name', 'score']
```
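These runtime attributes can also drive ordinary pandas code; for instance (a sketch, assuming the `data` frame from the top of the tutorial), selecting just the columns the schema knows about:

```python
# Subset the frame to the schema's columns using the runtime column mapping.
subset = data[list(schema_object.columns)]
```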
Finally, `validate` applies all of the schema's rules to the dataframe:

```python
schema_object.validate(data)
```

| user_id | name | score |
|---|---|---|
| a | Josh | 4 |
| b | Bob | <NA> |
| c | Mary | 6 |
| d | John | 7 |
| e | Jane | 8 |
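When a frame can break several rules at once, `lazy=True` is worth knowing about: it collects every failure into a single `SchemaErrors` exception instead of stopping at the first one. A sketch with a hypothetical `bad` frame:

```python
# Hypothetical frame with an out-of-range score (99 violates lt=9).
bad = data.assign(score=[4, pd.NA, 6, 7, 99]).astype({"score": pd.Int64Dtype()})

try:
    schema_object.validate(bad, lazy=True)
except pa.errors.SchemaErrors as err:
    # A single report dataframe describing every failure.
    print(err.failure_cases)
```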