
Pandera Tutorial

Pandera is a package that provides schemas for data validation and type hinting. It offers three main benefits:

  1. Explicitly define the schema of your data
  2. Use the schema to type hint your code
  3. Validate your data against the schema

Define a schema

While you are working with a dataset, you will probably encounter the question “what are the columns of this dataset?” several times.

This is especially the case when you are no longer at the initial stage of your project, or when you are working with a dataset that was created by someone else.

Pandera provides a schema object which allows you to explicitly define the schema of your data, creating a python object that can be used to contain all the information about the columns of the dataset.

Let’s start with a dataset that contains the following columns:

%xmode minimal
import pandera as pa
from pandera.typing import DataFrame, Series, Index
import pandas as pd

data = pd.DataFrame(
    {
        "name": [
            "Josh",
            "Bob",
            "Mary",
            "John",
            "Jane",
        ],
        "score": [4, pd.NA, 6, 7, 8],
    },
    index=pd.Series(["a", "b", "c", "d", "e"], name="user_id"),
).astype({"score": pd.Int64Dtype()})
data
        name  score
user_id
a       Josh      4
b        Bob   <NA>
c       Mary      6
d       John      7
e       Jane      8

A schema for this dataframe can be defined as follows:

class MyMinimalSchema(pa.SchemaModel):
    name: Series
    score: Series
    user_id: Index

Or with additional rules like this:

class MySchema(pa.SchemaModel):
    name: Series[str] = pa.Field(nullable=False)
    score: Series[pd.Int64Dtype] = pa.Field(ge=4, lt=9, nullable=True)
    user_id: Index[str] = pa.Field(check_name=True)

Schema Rules

The schema defines the data types and constraints for each column in the DataFrame.

MySchema contains the following information:

  • name is a non-nullable str column
  • score is a nullable Int64 column that must be greater than or equal (ge) to 4 and less than (lt) 9
  • user_id is a str index whose name is checked (check_name)

Magic Strings

Not only does MySchema work as documentation for your data, describing the expected columns and their requirements, it also lets you use that information in your code.

data[MySchema.name]
user_id
a    Josh
b     Bob
c    Mary
d    John
e    Jane
Name: name, dtype: object

Using this inside your code has two main benefits: first, it makes your life easier, since autocomplete can now suggest the columns of the dataframe.

Second, it avoids magic strings: you no longer use a raw string to refer to a column, but a variable defined in the schema.

Here is a very good explanation of what magic strings are and why we should avoid them.

Use the schema to type hint your code

Not knowing the contents of your dataframe while reading the code was already bad, but not knowing the output of your functions is just as bad.

Pandera to the rescue! You can also use the schema to type hint your data, so you can know the inputs and outputs of your functions.

def get_score_column(df: DataFrame[MySchema]) -> Series[pd.Int64Dtype]:
    return df[MySchema.score]

Validate your data against the schema

With all this information we have documented about our data, we can also use it to validate it.

The first way we can do this is with the @pa.check_types decorator, which checks the inputs and outputs of your function. This code is expected to pass:

@pa.check_types
def fill_score_column(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    return df.fillna({MySchema.score: 5})


fill_score_column(data)
        name  score
user_id
a       Josh      4
b        Bob      5
c       Mary      6
d       John      7
e       Jane      8

But this one is not, because 0 is not a valid score according to the schema:

@pa.check_types
def fill_score_column(df: DataFrame[MySchema]) -> DataFrame[MySchema]:
    return df.fillna({MySchema.score: 0})


fill_score_column(data)
SchemaError: error in check_types decorator of function 'fill_score_column': <Schema Column(name=score, type=DataType(Int64))> failed element-wise validator 0:
<Check greater_than_or_equal_to: greater_than_or_equal_to(4)>
failure cases:
  index  failure_case
0     b             0

More usage

Checks and coercions

By default, check_types will check that:

  • the data has all the required columns
  • all columns match the types
  • all values are non-null

It will not:

  • check the index name
  • coerce the data types
  • check only required columns are present

Most of this behavior can be customized:

  1. it's not possible to ignore a column, but you can subclass the schema (see below)
  2. removing the type definition will remove the type check
  3. pa.Field(nullable=True) will allow null values
  4. pa.Field(check_name=True) will check the index name
  5. pa.Field(coerce=True) will coerce that column's data type; Config.coerce = True will coerce the whole dataframe
  6. Config.strict = True will check that only the defined columns are present

class ExtraSchema(pa.SchemaModel):
    name: Series = pa.Field(nullable=False) # no type check
    score: Series[pd.Int64Dtype] = pa.Field(ge=4, lt=9, nullable=True) # null values allowed
    user_id: Index[str] = pa.Field(check_name=True) # index name check
    date: Series[pd.Timestamp] = pa.Field(nullable=True) # extra column

    class Config:
        coerce = True # coerce all values to the specified type
        strict = True # forbid columns not defined in the schema

Inheritance

Just like ordinary classes, you can inherit schemas. This can be useful when you are building one dataset on top of another.

class BaseUserSchema(pa.SchemaModel):
    name: Series[str]
    user: Index[str] = pa.Field(check_name=True)

class UserWithScoreSchema(BaseUserSchema):
    score: Series[int]

UserWithScoreSchema will therefore also have all the properties defined in BaseUserSchema.

Programmatic validation

You can turn schemas into objects to access the rules at runtime and validate the data at will.

schema_object = MySchema.to_schema()
schema_object
<Schema DataFrameSchema(columns={'name': <Schema Column(name=name, type=None)>, 'score': <Schema Column(name=score, type=DataType(Int64))>}, checks=[], index=<Schema Index(name=user_id, type=DataType(str))>, coerce=True, dtype=None, strict=False, name=MySchema, ordered=False, unique_column_names=False)>
schema_object.columns
{'name': <Schema Column(name=name, type=None)>,
 'score': <Schema Column(name=score, type=DataType(Int64))>}
list(schema_object.columns.keys())
['name', 'score']
schema_object.validate(data)
        name  score
user_id
a       Josh      4
b        Bob   <NA>
c       Mary      6
d       John      7
e       Jane      8