Python for Data Science: Essential Libraries and Techniques
Python remains the dominant language for data science in 2026. This guide covers essential libraries, techniques, and best practices for modern data analysis and machine learning.
Core Libraries
NumPy - Numerical Computing
import numpy as np
# Array operations
arr = np.array([1, 2, 3, 4, 5])
print(arr.mean()) # 3.0
print(arr.std()) # ~1.41 (population standard deviation, ddof=0)
# Matrix operations
matrix = np.array([[1, 2], [3, 4]])
inverse = np.linalg.inv(matrix)
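The inverse computed above can be sanity-checked by multiplying it back against the original matrix. The sketch below also shows np.linalg.solve, which is usually preferred over forming an explicit inverse when the goal is to solve a linear system. The right-hand side b is a hypothetical vector chosen only for illustration.

```python
import numpy as np

matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
inverse = np.linalg.inv(matrix)

# A matrix times its inverse should be the identity, up to rounding error.
identity_ok = np.allclose(matrix @ inverse, np.eye(2))
print(identity_ok)

# Hypothetical right-hand side, used only for this example.
b = np.array([5.0, 6.0])

# Solve matrix @ x = b directly instead of computing inverse @ b.
x = np.linalg.solve(matrix, b)
print(x)
```

Comparing with np.allclose rather than == avoids false failures from floating-point rounding.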
Pandas - Data Manipulation
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Data cleaning
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])
# Analysis
summary = df.groupby('category')['sales'].agg(['mean', 'sum', 'count'])
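To try the same steps without a file on disk, here is a minimal sketch on a small in-memory frame. The rows are made up and stand in for data.csv; only the column names (date, category, sales) come from the snippet above. Note that dropna() with no arguments drops a row if any column is missing, so this version passes subset to limit the check to the columns the analysis actually uses.

```python
import pandas as pd

# Hypothetical stand-in for data.csv; the values are invented for illustration.
raw = pd.DataFrame({
    "date": ["2026-01-01", "2026-01-02", "2026-01-03", "2026-01-04"],
    "category": ["a", "b", "a", None],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Drop rows missing a value the analysis needs, then parse the date strings.
clean = (
    raw.dropna(subset=["category", "sales"])
       .assign(date=lambda d: pd.to_datetime(d["date"]))
)

# One row per category, with the mean, total, and number of sales.
summary = clean.groupby("category")["sales"].agg(["mean", "sum", "count"])
print(summary)
```

Keeping the raw frame untouched and building a separate clean frame makes it easy to count how many rows the cleaning step removed.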
Data Visualization
Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
# Create visualization
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='age', y='income', hue='category')
plt.title('Income vs Age by Category')
plt.show()
Machine Learning with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Prepare data (assumes every column other than 'target' is a numeric feature)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
predictions = model.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, predictions):.3f}')
print(classification_report(y_test, predictions))
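A single train/test split yields one accuracy number, which can shift noticeably depending on which rows land in the test set. One common complement is k-fold cross-validation, sketched below. The data here comes from make_classification, a synthetic stand-in for the df used above, and the choice of 5 folds is an assumption rather than a recommendation from this guide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used only so the example runs without a CSV file.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit and score the model on 5 different train/test partitions.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread alongside the mean gives a rough sense of how stable the estimate is.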
Best Practices
- Use virtual environments
- Version control your notebooks
- Document your analysis
- Validate your data
- Test your models
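As one way to act on "Validate your data", the sketch below runs a few basic checks before analysis begins. The function name validate_frame and the specific checks (required columns, missing values, duplicate rows) are illustrative assumptions, not a standard API, and real projects will usually need checks tailored to their own data.

```python
import pandas as pd

def validate_frame(df, required_columns):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Check that every expected column exists.
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Count missing values in the required columns that are present.
    present = [c for c in required_columns if c in df.columns]
    for column, count in df[present].isna().sum().items():
        if count > 0:
            problems.append(f"{column}: {count} missing value(s)")

    # Flag rows that exactly repeat an earlier row.
    duplicates = int(df.duplicated().sum())
    if duplicates:
        problems.append(f"{duplicates} duplicate row(s)")

    return problems

# Hypothetical frames, invented for this example.
good = pd.DataFrame({"category": ["a", "b"], "sales": [1.0, 2.0]})
bad = pd.DataFrame({"category": ["a", None, "a"], "sales": [1.0, 2.0, 1.0]})

print(validate_frame(good, ["category", "sales"]))
print(validate_frame(bad, ["category", "sales", "target"]))
```

Returning a list of problems instead of raising on the first one lets a notebook show every issue in a single pass.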
Conclusion
Python’s rich ecosystem makes it ideal for data science. Master these libraries and techniques to unlock powerful insights from your data.