Exploratory Data Analysis and Visualization in Python

A lecture with hands-on Python examples

Lectures

Data Analytics

Author

Богдан Красюк

Published

October 20, 2025

1 Presentation

🎯 Lecture goal. Learn to build informative charts for univariate, bivariate, and time-series data in Python with pandas and matplotlib; master the basic patterns of storytelling, chart selection, and preparing data for visualization.

2 Environment setup

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Default figure size
plt.rcParams["figure.figsize"] = (8, 5)

# Table display options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 20)

DATA_PATH = Path("data/input_data.csv")
print("Data path:", DATA_PATH.resolve())

Data path: E:\websiteDataAnalitycs\bc-2025\bootcamp\course\data\input_data.csv

3 Loading and initial inspection of the data 🗂️

Code

# Load the CSV
df = pd.read_csv(DATA_PATH)

# Preview
display(df.head(10))

# Shape and dtypes
print("shape:", df.shape)
display(df.dtypes)

# General info
df.info()

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2
5	32	No	Travel_Frequently	1005	Research & Development	2	2	Life Sciences	1	8	4	Male	79	3	1	Laboratory Technician	4	Single	3068	11864	0	Y	No	13	3	3	80	0	8	2	2	7	7	3	6
6	59	No	Travel_Rarely	1324	Research & Development	3	3	Medical	1	10	3	Female	81	4	1	Laboratory Technician	1	Married	2670	9964	4	Y	Yes	20	4	1	80	3	12	3	2	1	0	0	0
7	30	No	Travel_Rarely	1358	Research & Development	24	1	Life Sciences	1	11	4	Male	67	3	1	Laboratory Technician	3	Divorced	2693	13335	1	Y	No	22	4	2	80	1	1	2	3	1	0	0	0
8	38	No	Travel_Frequently	216	Research & Development	23	3	Life Sciences	1	12	4	Male	44	2	3	Manufacturing Director	3	Single	9526	8787	0	Y	No	21	4	2	80	0	10	2	3	9	7	1	8
9	36	No	Travel_Rarely	1299	Research & Development	27	3	Medical	1	13	3	Male	94	3	2	Healthcare Representative	3	Married	5237	16577	6	Y	No	13	3	2	80	2	17	3	2	7	7	7	7

shape: (1470, 35)

Age                         int64
Attrition                  object
BusinessTravel             object
DailyRate                   int64
Department                 object
                            ...  
WorkLifeBalance             int64
YearsAtCompany              int64
YearsInCurrentRole          int64
YearsSinceLastPromotion     int64
YearsWithCurrManager        int64
Length: 35, dtype: object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB

Code

# Descriptive statistics
# (include="all" also covers categorical columns)
display(df.describe(include="all").T)

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Age	1470.0	NaN	NaN	NaN	36.92381	9.135373	18.0	30.0	36.0	43.0	60.0
Attrition	1470	2	No	1233	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BusinessTravel	1470	3	Travel_Rarely	1043	NaN	NaN	NaN	NaN	NaN	NaN	NaN
DailyRate	1470.0	NaN	NaN	NaN	802.485714	403.5091	102.0	465.0	802.0	1157.0	1499.0
Department	1470	3	Research & Development	961	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...
WorkLifeBalance	1470.0	NaN	NaN	NaN	2.761224	0.706476	1.0	2.0	3.0	3.0	4.0
YearsAtCompany	1470.0	NaN	NaN	NaN	7.008163	6.126525	0.0	3.0	5.0	9.0	40.0
YearsInCurrentRole	1470.0	NaN	NaN	NaN	4.229252	3.623137	0.0	2.0	3.0	7.0	18.0
YearsSinceLastPromotion	1470.0	NaN	NaN	NaN	2.187755	3.22243	0.0	0.0	1.0	3.0	15.0
YearsWithCurrManager	1470.0	NaN	NaN	NaN	4.123129	3.568136	0.0	2.0	3.0	7.0	17.0

35 rows × 11 columns

4 Preparing for visualization: types, dates/times, missing values 🧰

Code

# Look for potential date/time columns by name. Only text (object) columns are
# candidates: numeric ones such as TrainingTimesLastYear merely contain "time"
# in the name, and pd.to_datetime would silently read their integers as
# nanoseconds since the epoch.
date_keys = ["date", "time", "dt", "timestamp"]
text_cols = df.select_dtypes(include=["object"]).columns
date_like = [c for c in text_cols if any(k in c.lower() for k in date_keys)]
for c in date_like:
    try:
        # infer_datetime_format is deprecated since pandas 2.0 and is omitted here
        df[c] = pd.to_datetime(df[c], errors="raise", dayfirst=True)
        print(f"Column converted to datetime: {c}")
    except Exception as e:
        print(f"Could not convert {c} to datetime: {e}")

# Split into numeric/categorical columns for convenience
num_cols = df.select_dtypes(include=["number"]).columns.tolist()
cat_cols = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
print("Numeric:", num_cols[:8], ("..." if len(num_cols) > 8 else ""))
print("Categorical:", cat_cols[:8], ("..." if len(cat_cols) > 8 else ""))

# Count missing values per column
missing = df.isna().sum().sort_values(ascending=False)
display(missing.head(10))

Could not convert OverTime to datetime: Unknown datetime string format, unable to parse: Yes, at position 0
Numeric: ['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate'] ...
Categorical: ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18'] ...

Age                 0
Attrition           0
BusinessTravel      0
DailyRate           0
Department          0
DistanceFromHome    0
Education           0
EducationField      0
EmployeeCount       0
EmployeeNumber      0
dtype: int64

💡 Tip. Before plotting, make sure the data types are correct (dates in particular) and that missing values are either filled in or handled directly in the visualization (for example, via dropna()).
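
The tip above can be sketched as a small pre-plot check. The helper name `missing_report` and the toy frame are illustrative assumptions, not part of the lecture dataset; the median fill is just one common choice.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing share per column (hypothetical helper)."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "share_missing": frame.isna().mean().round(3),
    })
    return report.sort_values("share_missing", ascending=False)


# Toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "income": [5993.0, np.nan, 2090.0, 2909.0],
    "department": ["Sales", "R&D", None, "R&D"],
    "age": [41, 49, 37, 33],
})

report = missing_report(toy)
print(report)

# Option 1: drop missing values only for the column being plotted
income_for_plot = toy["income"].dropna()

# Option 2: fill them explicitly before plotting
income_filled = toy["income"].fillna(toy["income"].median())
```

Dropping per column keeps as many rows as possible for each chart, whereas a global `dropna()` on the whole frame would discard every row that has a gap anywhere.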

5 Univariate distributions 📊

5.1 Histogram (quantitative variable)

Code

# Use the first numeric column as an example
if num_cols:
    col = num_cols[0]
    series = df[col].dropna()

    plt.figure()
    plt.hist(series, bins=30)
    plt.title(f"Histogram: {col}")
    plt.xlabel(col); plt.ylabel("Count")
    plt.show()
else:
    print("No numeric columns available for a histogram.")

5.2 Density (KDE) / empirical CDF

Code

if num_cols:
    col = num_cols[0]
    series = df[col].dropna().sort_values()

    # KDE via pandas (requires SciPy to be installed)
    plt.figure()
    series.plot(kind="kde")
    plt.title(f"KDE: {col}")
    plt.xlabel(col)
    plt.show()

    # Empirical CDF: the i-th sorted value gets cumulative share i / n
    y = np.arange(1, len(series) + 1) / len(series)
    plt.figure()
    plt.plot(series.values, y)
    plt.title(f"Empirical CDF: {col}")
    plt.xlabel(col); plt.ylabel("F(x)")
    plt.show()

5.3 Categorical frequencies (top-k bar chart)

Code

if cat_cols:
    c = cat_cols[0]
    # dropna=False keeps missing values as their own category
    counts = df[c].value_counts(dropna=False).head(15)

    plt.figure()
    counts.plot(kind="bar")
    plt.title(f"Top-15 categories: {c}")
    plt.xlabel(c); plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()
else:
    print("No categorical columns available for a bar chart.")

6 Bivariate relationships 🔗

6.1 Scatter plot: correlations and trends

Code

# Pick the pair of numeric features with the largest absolute correlation (off-diagonal)
pair = None
if len(num_cols) >= 2:
    corr = df[num_cols].corr().abs()
    # Work on a writable copy: corr.values can be read-only under copy-on-write
    corr_arr = corr.to_numpy(copy=True)
    np.fill_diagonal(corr_arr, 0.0)
    # nanargmax ignores the NaN entries produced by constant columns
    i, j = np.unravel_index(np.nanargmax(corr_arr), corr_arr.shape)
    pair = (corr.columns[i], corr.columns[j])

if pair:
    xcol, ycol = pair
    sub = df[[xcol, ycol]].dropna()

    plt.figure()
    plt.scatter(sub[xcol], sub[ycol], s=12, alpha=0.7)
    plt.title(f"Scatter: {xcol} vs {ycol}")
    plt.xlabel(xcol); plt.ylabel(ycol)
    plt.show()
else:
    print("Not enough numeric features for a scatter plot.")
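
The section title mentions trends, while the plot above shows only the points. A minimal sketch of overlaying a least-squares line with `np.polyfit` follows; the data are synthetic and the variable names are illustrative assumptions, so this is not the lecture's own method.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (assumption): a noisy linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=200)
y = 500 * x + rng.normal(0, 2000, size=200)

# Degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)
# Pearson correlation coefficient for the title
r = np.corrcoef(x, y)[0, 1]

x_line = np.linspace(x.min(), x.max(), 100)

fig, ax = plt.subplots()
ax.scatter(x, y, s=12, alpha=0.7)
ax.plot(x_line, slope * x_line + intercept, color="black", linewidth=1.5)
ax.set_title(f"Scatter with linear trend (r = {r:.2f})")
ax.set_xlabel("x (synthetic)")
ax.set_ylabel("y (synthetic)")
plt.show()
```

A straight line is only a first approximation; if the cloud of points bends, a linear fit can hide the actual shape of the relationship.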

6.2 Boxplot by category (side by side)

Code

# Boxplot of one numeric variable across the top-5 categories
if num_cols and cat_cols:
    vcol, gcol = num_cols[0], cat_cols[0]
    top_cats = df[gcol].value_counts().index[:5]
    data = [df.loc[df[gcol] == k, vcol].dropna().values for k in top_cats]

    plt.figure()
    plt.boxplot(data, showmeans=True)
    # Boxes sit at positions 1..N. Setting ticks explicitly avoids the `labels`
    # argument, which Matplotlib 3.9 deprecated in favor of `tick_labels`.
    plt.xticks(range(1, len(top_cats) + 1), list(top_cats), rotation=30, ha="right")
    plt.title(f"Boxplot: {vcol} by {gcol} (top-5 categories)")
    plt.ylabel(vcol)
    plt.tight_layout()
    plt.show()
else:
    print("At least one numeric and one categorical variable are needed for a boxplot.")

7 Time series ⏱️

Code

# If there is a datetime column, plot the monthly mean of the first numeric column
# "datetime" matches timezone-naive columns, "datetimetz" timezone-aware ones
dt_cols = df.select_dtypes(include=["datetime", "datetimetz"]).columns.tolist()
if dt_cols and num_cols:
    dtc, vcol = dt_cols[0], num_cols[0]
    ts = (
        df[[dtc, vcol]]
        .dropna()
        .sort_values(dtc)
        .set_index(dtc)[vcol]
        .resample("M")  # month-end bins; pandas >= 2.2 prefers the alias "ME"
        .mean()
    )

    plt.figure()
    plt.plot(ts.index, ts.values)
    plt.title(f"Time series (monthly mean): {vcol}")
    plt.xlabel("Date"); plt.ylabel(vcol)
    plt.tight_layout()
    plt.show()
else:
    print("No datetime + numeric column pair found for a time-series plot.")
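
According to the `df.info()` output above, every column in this dataset is either int64 or object, so there is no genuine date column to resample. The sketch below therefore uses a synthetic daily series (an assumption, as are all names in it) to show the same resampling pattern. It uses the `"MS"` (month start) alias, which labels each bin by the first day of the month.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily metric for one year (assumption for illustration)
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
daily = pd.Series(100 + rng.normal(0, 5, size=len(dates)), index=dates, name="metric")

# Aggregate to monthly means; each bin is labeled by the month's first day
monthly = daily.resample("MS").mean()

fig, ax = plt.subplots()
ax.plot(daily.index, daily.values, alpha=0.3, label="daily")
ax.plot(monthly.index, monthly.values, marker="o", label="monthly mean")
ax.set_title("Daily values vs monthly mean (synthetic)")
ax.set_xlabel("Date")
ax.set_ylabel("metric")
ax.legend()
fig.tight_layout()
plt.show()
```

Plotting the raw series faintly behind the aggregate makes it visible how much variation the monthly mean smooths away.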

8 Correlation heatmap 🧭

Code

if len(num_cols) >= 2:
    corr = df[num_cols].corr()

    plt.figure()
    im = plt.imshow(corr.values, interpolation="nearest")
    plt.title("Correlation heatmap")
    plt.colorbar(im, fraction=0.046, pad=0.04)
    plt.xticks(range(len(num_cols)), num_cols, rotation=45, ha="right")
    plt.yticks(range(len(num_cols)), num_cols)
    plt.tight_layout()
    plt.show()
else:
    print("Not enough numeric variables for a correlation heatmap.")
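
A column that never changes has zero variance, so its Pearson correlation is undefined and `corr()` returns NaN for it; in a heatmap this appears as an empty row and column. In the preview above, EmployeeCount and StandardHours look like candidates, since every displayed row holds the same value. The sketch below, on toy data with assumed column names, drops such columns first and pins the color scale to [-1, 1] with a diverging colormap so that zero sits in the middle.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (assumption): three varying columns and one constant column
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "a": rng.normal(size=100),
    "constant": np.full(100, 80),
})
toy["b"] = 2 * toy["a"] + rng.normal(scale=0.5, size=100)
toy["c"] = -toy["a"] + rng.normal(scale=1.0, size=100)

# Keep only columns with more than one distinct value
varying = [col for col in toy.columns if toy[col].nunique(dropna=True) > 1]
corr = toy[varying].corr()

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
ax.set_xticks(range(len(varying)))
ax.set_xticklabels(varying, rotation=45, ha="right")
ax.set_yticks(range(len(varying)))
ax.set_yticklabels(varying)
ax.set_title("Correlation heatmap (constant columns removed)")
fig.tight_layout()
plt.show()
```

With a fixed symmetric scale, the same color means the same correlation across different heatmaps, which makes them comparable.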

9 Storytelling and design 🧩

Start with a question: what exactly do we want to see or prove?
Choose the chart to match the variable types (quantitative/categorical/time-based).
Avoid "fireworks charts": one story, one figure. For an overview, make a series of simple charts.
Axis labels, units of measurement, and data sources are mandatory.
For comparisons use identical scales; for wide value ranges use a log scale.
Do not overload the chart with colors and grids; place emphasis through annotations and takeaways.
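
Two of the points above, identical scales for comparisons and a log scale for wide ranges, can be sketched as follows. The data are synthetic and all names are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (assumption): two groups to compare and one heavily skewed metric
rng = np.random.default_rng(7)
group_a = rng.normal(50, 10, size=300)
group_b = rng.normal(60, 10, size=300)
skewed = rng.lognormal(mean=8, sigma=1.2, size=1000)

# 1) Comparison: shared axes and common bin edges keep both panels on one scale
lo = min(group_a.min(), group_b.min())
hi = max(group_a.max(), group_b.max())
bins = np.linspace(lo, hi, 31)

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8, 4))
axes[0].hist(group_a, bins=bins)
axes[0].set_title("Group A")
axes[1].hist(group_b, bins=bins)
axes[1].set_title("Group B")
axes[0].set_ylabel("Count")
fig.tight_layout()
plt.show()

# 2) Wide range: logarithmically spaced bins plus a log x-axis
log_bins = np.logspace(np.log10(skewed.min()), np.log10(skewed.max()), 31)

fig_log, ax_log = plt.subplots()
ax_log.hist(skewed, bins=log_bins)
ax_log.set_xscale("log")
ax_log.set_title("Skewed metric on a log scale")
ax_log.set_xlabel("value (log scale)")
ax_log.set_ylabel("Count")
plt.show()
```

Without `sharex`/`sharey`, each panel would pick its own limits and the two groups could look more similar or more different than they really are.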

10 Mini-tasks for self-check 🧪

Pick the two numeric measures with the highest correlation and build a scatter plot with a short conclusion.
For one key metric, build a histogram + KDE + CDF and explain how the distribution affects business decisions.
Create a boxplot of this metric for the 3–5 most frequent categories of a grouping variable and formulate a hypothesis about the differences.
If there is a time variable, build a line chart of an aggregated measure by week/month and comment on seasonality.

11 Next steps 🚀

Add data-quality control in the form of checks (value ranges, key uniqueness, share of missing values).
Wrap the code into functions or a notebook for reuse in other projects.
Prepare a version of the charts for reports/dashboards (e.g., Quarto + HTML).

11.1 Resources for further study

Jupyter Notebook from the lecture: Pandas
Lecture materials "Exploratory Data Analysis and Visualization in Python".
Kirthi Raman, Mastering Python Data Visualization, Packt.
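
The data-quality checks from the first item could look roughly like the sketch below. The function name `quality_checks`, its parameters, the 5% missing-value threshold, and the toy frame are all assumptions for illustration; the Age range of 18 to 60 mirrors the min/max shown in the descriptive statistics above.

```python
import pandas as pd


def quality_checks(frame, key, ranges, max_missing_share=0.05):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Key uniqueness
    if frame[key].duplicated().any():
        problems.append(f"key '{key}' is not unique")

    # Value ranges (inclusive on both ends), ignoring missing values
    for col, (low, high) in ranges.items():
        out_of_range = ~frame[col].dropna().between(low, high)
        if out_of_range.any():
            problems.append(f"{col}: {int(out_of_range.sum())} value(s) outside [{low}, {high}]")

    # Share of missing values per column
    missing_share = frame.isna().mean()
    for col, share in missing_share[missing_share > max_missing_share].items():
        problems.append(f"{col}: {share:.0%} missing")

    return problems


# Toy data (assumption) with one duplicate key, one out-of-range age, one gap
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4],
    "Age": [41, 49, 17, 33],
    "MonthlyIncome": [5993.0, None, 2090.0, 2909.0],
})

issues = quality_checks(toy, key="EmployeeNumber", ranges={"Age": (18, 60)})
for issue in issues:
    print(issue)
```

Returning a list of messages instead of raising on the first failure lets a notebook show every problem at once; a pipeline could instead fail when the list is non-empty.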

The project is implemented with the support of the European Union as part of the House of Europe programme.