This notebook aims to create a Pandas DataFrame containing multiple values in its fields and then query them as if the rows have a many-to-one relationship.
We will begin by generating random test data in the first cell. The output will be a dataframe containing two columns: one for strings and one for integers.
import pandas as pd
import numpy as np
def randomStrings():
return np.random.choice([chr(64+x)*3 for x in range(1,27)],
size=np.random.randint(low=1, high=4), replace=False)
def randomInts():
return np.random.randint(low=1000000, high=9999999,
size=np.random.randint(low=1, high=4))
df = pd.DataFrame({'strings': [randomStrings() for x in range(100)],
'ints': [randomInts() for x in range(100)]})
df.head(3)
The resulting dataframe will look like the table below with a simple incrementing index, a column containing arrays of strings, and a column containing arrays of integers.
strings | ints | |
---|---|---|
0 | [SSS, NNN] | [4529611, 2199346] |
1 | [JJJ] | [7057753, 3754390, 5389197] |
2 | [ZZZ, KKK] | [7894309] |
The fields’ values are Python arrays, and standard Python array operations can be used to create masks to query the columns.
df.loc[df.strings.apply(lambda x: 'FFF' in x)]
df.loc[df.ints.apply(lambda x: np.logical_and(x >= 7000000, x < 8000000).any())]
This works for querying a dataframe, but you have to flatten the dataframe if you are going to sort or group the data. Making changes to those fields can be complicated.
df.loc[df.strings.apply(lambda x: 'TTT' in x), 'strings'] = df.loc[
df.strings.apply(lambda x: 'TTT' in x)].strings.apply(
lambda x: np.select([x == 'TTT'], ['hello'], x))
You can use to_parquet() and read_parquet() to write and read the dataframe in a format that will retain the arrays in the columns. However, if you use to_csv(), it will write the arrays as strings, and read_csv() will parse them as strings. The behavior of the masks will become inconsistent. Some of them, especially those operating on strings, will mostly work. The ones for operating on arrays of integers will likely fail because you will match against strings instead of numbers.