Python Pandas: Creative Data Manipulation and Analysis

--

Python Pandas offers two primary data structures: DataFrame and Series, which are powerful and flexible for data manipulation.

  1. Handling Time Zones: This code converts a datetime column to a specific time zone (US/Eastern) from UTC.
df['Datetime'] = pd.to_datetime(df['Datetime']).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

2. Creating Time Lags: This adds a new column that contains the previous value of another column.

df['Previous_Value'] = df['Value'].shift(1)

3. Reshaping with pivot_table: This pivots and aggregates data, creating a pivot table with values summed up for each category on each date.

df_pivot = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')

4. Combining String Columns: This combines two string columns to create a full name column.

df['Full_Name'] = df[['First_Name', 'Last_Name']].agg(' '.join, axis=1)

5. Custom Aggregation with agg: This custom aggregates data using different aggregation functions for a specific column within a group.

custom_agg = {'Value': ['mean', 'std', lambda x: x.max() - x.min()]}
df_agg = df.groupby('Category').agg(custom_agg)

6. Pandas Styling for DataFrames: This code applies custom styling to a DataFrame to highlight the maximum value in each column with a yellow background.

def highlight_max(s):
is_max = s == s.max()
return ['background-color: yellow' if v else '' for v in is_max]

df.style.apply(highlight_max)

7. Sampling Data Randomly: This randomly samples 50% of the data using a specified random seed.

df_sampled = df.sample(frac=0.5, random_state=42)

8. Creating Rolling Windows: This calculates a rolling mean over a specified window size.

df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()

9. Grouping by Time Periods: This groups data by month and calculates the sum for each month.

df['Month'] = df['Date'].dt.to_period('M')
df_grouped = df.groupby('Month').sum()

10. Using crosstab for Cross-Tabulation: This creates a cross-tabulation table between two categorical columns.

cross_tab = pd.crosstab(df['Category1'], df['Category2'])

11. Handling Large Files with Chunking: This code reads and processes large CSV files in smaller, manageable chunks to avoid memory issues.

chunk_size = 10000

chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

for chunk in chunks:
process_chunk(chunk)

12. Creating Bins with cut: This code creates categorical bins for numeric data using the cut function.

bins = [0, 25, 50, 75, 100]
labels = ['Low', 'Medium', 'High', 'Very High']
df['Category'] = pd.cut(df['Value'], bins=bins, labels=labels)

13. Working with MultiIndex DataFrames: This code sets a multi-index and selects data using a multi-index slice.

df.set_index(['Index1', 'Index2'], inplace=True)
df.loc[('A', 'X'):('B', 'Y')]

14. Using merge for Database-Like Joins: This performs an inner join between two DataFrames using a common key column.

merged_df = pd.merge(df1, df2, on='Key_Column', how='inner')

15. Applying groupby with Lambda Functions: This groups data by the year extracted from a datetime column.

df.groupby(lambda x: x.year)['Value'].sum()

16. Custom Aggregation with agg and Named Aggregations: This performs custom aggregation with named aggregations using Pandas’ named aggregation feature.

custom_agg = pd.NamedAgg(column='Value', mean='mean', std='std')
df_agg = df.groupby('Category').agg(**custom_agg)

Additional Blogs by Author

  1. Python Function: Type of Arguments in a Function

2. Understanding Python’s init Method: Object Initialization in Depth

3. Python’s main: Setting the Stage for Your Code

4. Understanding Python’s Try-Except Statements: A Safety Net for Your Code

5. Exploring Python Classes and Object-Oriented Programming

6. Lambda Functions in Python

7. Python Pandas: Creative Data Manipulation and Analysis

8. Python OOP Concepts Made Simple

--

--

Responses (2)