Python Pandas: Creative Data Manipulation and Analysis
Python Pandas offers two primary data structures: DataFrame and Series, which are powerful and flexible for data manipulation.
- Handling Time Zones: This code converts a datetime column to a specific time zone (
US/Eastern
) from UTC.
df['Datetime'] = pd.to_datetime(df['Datetime']).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
2. Creating Time Lags: This adds a new column that contains the previous value of another column.
df['Previous_Value'] = df['Value'].shift(1)
3. Reshaping with pivot_table
: This pivots and aggregates data, creating a pivot table with values summed up for each category on each date.
df_pivot = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
4. Combining String Columns: This combines two string columns to create a full name column.
df['Full_Name'] = df[['First_Name', 'Last_Name']].agg(' '.join, axis=1)
5. Custom Aggregation with agg
: This custom aggregates data using different aggregation functions for a specific column within a group.
custom_agg = {'Value': ['mean', 'std', lambda x: x.max() - x.min()]}
df_agg = df.groupby('Category').agg(custom_agg)
6. Pandas Styling for DataFrames: This code applies custom styling to a DataFrame to highlight the maximum value in each column with a yellow background.
def highlight_max(s):
is_max = s == s.max()
return ['background-color: yellow' if v else '' for v in is_max]
df.style.apply(highlight_max)
7. Sampling Data Randomly: This randomly samples 50% of the data using a specified random seed.
df_sampled = df.sample(frac=0.5, random_state=42)
8. Creating Rolling Windows: This calculates a rolling mean over a specified window size.
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
9. Grouping by Time Periods: This groups data by month and calculates the sum for each month.
df['Month'] = df['Date'].dt.to_period('M')
df_grouped = df.groupby('Month').sum()
10. Using crosstab
for Cross-Tabulation: This creates a cross-tabulation table between two categorical columns.
cross_tab = pd.crosstab(df['Category1'], df['Category2'])
11. Handling Large Files with Chunking: This code reads and processes large CSV files in smaller, manageable chunks to avoid memory issues.
chunk_size = 10000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)
for chunk in chunks:
process_chunk(chunk)
12. Creating Bins with cut
: This code creates categorical bins for numeric data using the cut
function.
bins = [0, 25, 50, 75, 100]
labels = ['Low', 'Medium', 'High', 'Very High']
df['Category'] = pd.cut(df['Value'], bins=bins, labels=labels)
13. Working with MultiIndex DataFrames: This code sets a multi-index and selects data using a multi-index slice.
df.set_index(['Index1', 'Index2'], inplace=True)
df.loc[('A', 'X'):('B', 'Y')]
14. Using merge
for Database-Like Joins: This performs an inner join between two DataFrames using a common key column.
merged_df = pd.merge(df1, df2, on='Key_Column', how='inner')
15. Applying groupby
with Lambda Functions: This groups data by the year extracted from a datetime column.
df.groupby(lambda x: x.year)['Value'].sum()
16. Custom Aggregation with agg
and Named Aggregations: This performs custom aggregation with named aggregations using Pandas’ named aggregation feature.
custom_agg = pd.NamedAgg(column='Value', mean='mean', std='std')
df_agg = df.groupby('Category').agg(**custom_agg)
Additional Blogs by Author
- Python Function: Type of Arguments in a Function
2. Understanding Python’s init Method: Object Initialization in Depth
3. Python’s main: Setting the Stage for Your Code
4. Understanding Python’s Try-Except Statements: A Safety Net for Your Code
5. Exploring Python Classes and Object-Oriented Programming
6. Lambda Functions in Python
7. Python Pandas: Creative Data Manipulation and Analysis
8. Python OOP Concepts Made Simple