Files
palkkakysely/pulkka/data_utils.py
Aarni Koskela 663cd3d349 Add 2025 survey data support
The 2025 survey uses a single English-only xlsx (instead of separate
fi/en files) with a restructured schema: compensation is split into
base salary, commission, lomaraha, bonus, and equity components;
working time is h/week instead of percentage; and competitive salary
is categorical instead of boolean. Vuositulot is now synthesized
from the component fields.

Drop COLUMN_MAP_2024, COLUMN_MAP_2024_EN_TO_FI, VALUE_MAP_2024_EN_TO_FI,
read_initial_dfs_2024, read_data_2024, map_sukupuoli, map_vuositulot,
split_boolean_column_to_other, apply_fixups, and the associated gender
value lists and boolean text maps. All of this exists in version history.

- KKPALKKA now includes base salary + commission (median 5500 → 5800)
- Apply map_numberlike to tuntilaskutus and vuosilaskutus columns to
  handle string values like "60 000" and "100 000"
- Filter out zeros when computing tunnusluvut on the index page so
  stats reflect actual reported values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:40:13 +02:00

71 lines
1.6 KiB
Python

from typing import Optional
import pandas as pd
def q25(x):
return x.quantile(0.25)
def q50(x):
return x.quantile(0.5)
def q75(x):
return x.quantile(0.75)
def q90(x):
return x.quantile(0.9)
def get_categorical_stats(
df: pd.DataFrame,
category_col: str,
value_col: str,
*,
na_as_category: Optional[str] = None,
) -> pd.DataFrame:
# Drop records where value is not numeric before grouping...
df = df.copy()
df[value_col] = pd.to_numeric(df[value_col], errors="coerce")
df = df[df[value_col].notna() & df[value_col] > 0]
if na_as_category:
rename_na(df, category_col, na_as_category)
# ... then carry on.
group = df[[category_col, value_col]].groupby(category_col, observed=False)
return group[value_col].agg(
["mean", "min", "max", "median", "count", q25, q50, q75, q90],
)
def rename_na(df: pd.DataFrame, col: str, na_name: str) -> None:
df[col] = df[col].astype("string")
df.loc[df[col].isna(), col] = na_name
df[col] = df[col].astype("category")
def explode_multiselect(
series: pd.Series,
*,
sep: str = ", ",
top_n: int | None = None,
) -> pd.Series:
"""
Explode a comma-separated multiselect column into value counts.
Returns a Series of counts indexed by individual values,
sorted descending. Optionally limited to top_n entries.
"""
counts = (
series.dropna()
.str.split(sep)
.explode()
.str.strip()
.loc[lambda s: s != ""]
.value_counts()
)
if top_n is not None:
counts = counts.head(top_n)
return counts