Wednesday, May 31, 2023

Chapter 5: Exploratory Data Analysis for Fashion Management


Exploratory Data Analysis (EDA) is a crucial step in understanding and gaining insights from fashion industry data. In this chapter, we will explore various techniques and visualizations that can be used to analyze and interpret fashion data. By applying EDA, fashion managers can uncover patterns, trends, and relationships within the data, enabling informed decision-making and strategy development.


Understanding the Fashion Dataset:


Data Overview: Provide an overview of the fashion dataset, including the variables, data types, and basic summary statistics.


The following example computes summary statistics in Python.

========================

import pandas as pd


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],

    'brand': ['A', 'B', 'A', 'C', 'B'],

    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],

    'price': [29.99, 49.99, 39.99, 59.99, 69.99],

    'quantity': [10, 5, 8, 3, 12]

})


# Calculate summary statistics for 'price' variable

price_stats = fashion_data['price'].describe()

print("Summary Statistics for 'price' variable:")

print(price_stats)

==============================


In this example, we have a sample fashion dataset with variables such as 'product_id', 'brand', 'color', 'price', and 'quantity'. We focus on calculating summary statistics for the 'price' variable.


Using the describe() function in pandas, we can obtain key summary statistics for the 'price' variable, including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values. The describe() function provides a comprehensive overview of the distribution and central tendency of the numerical variable.


Upon running the code, you will see the summary statistics for the 'price' variable printed, including the count, mean, standard deviation, minimum value, and quartile values.
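
The overview described above also covers the variables and their data types, which describe() on a single column does not show. The following is a minimal sketch, assuming the same sample data; the names numeric_summary and categorical_summary are illustrative and not part of the original example.

```python
import pandas as pd

# Same sample fashion dataset as in the example above
fashion_data = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'A', 'C', 'B'],
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'price': [29.99, 49.99, 39.99, 59.99, 69.99],
    'quantity': [10, 5, 8, 3, 12]
})

# Data type of each variable
print(fashion_data.dtypes)

# Summary statistics for every numeric column at once
numeric_summary = fashion_data.describe()
print(numeric_summary)

# For categorical columns, describe() reports the count, the number of
# unique values, the most frequent value (top) and its frequency (freq)
categorical_summary = fashion_data[['brand', 'color']].describe()
print(categorical_summary)
```

Calling describe() on the whole DataFrame keeps only the numeric columns by default, so the categorical columns are summarized separately here.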



Data Cleaning: Briefly touch upon the importance of data cleaning before conducting EDA, including handling missing values, outliers, and inconsistencies.
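
As a rough illustration of the three cleaning tasks named above, the sketch below works on a small, deliberately messy table. The data is invented for this illustration, and the choices made (median imputation, the 1.5 x IQR rule) are common defaults rather than the only reasonable options.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: inconsistent brand labels, a duplicate row,
# a missing price, and one implausibly high price
raw = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P002', 'P003', 'P004', 'P005', 'P006'],
    'brand': ['A', 'b ', 'b ', 'A', 'C', 'B', 'C'],
    'price': [34.99, 49.99, 49.99, np.nan, 59.99, 69.99, 999.00],
})

clean = raw.copy()

# Inconsistencies: normalize brand labels, then drop exact duplicate rows
clean['brand'] = clean['brand'].str.strip().str.upper()
clean = clean.drop_duplicates()

# Missing values: impute the missing price with the median price
clean['price'] = clean['price'].fillna(clean['price'].median())

# Outliers: flag prices outside 1.5 * IQR of the quartiles
q1 = clean['price'].quantile(0.25)
q3 = clean['price'].quantile(0.75)
iqr = q3 - q1
clean['price_outlier'] = (
    (clean['price'] < q1 - 1.5 * iqr) | (clean['price'] > q3 + 1.5 * iqr)
)

print(clean)
```

The outlier is flagged rather than deleted, so that a manager can decide whether it is a data-entry error or a genuine luxury item before it is removed from the analysis.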


Descriptive Statistics:


Summary Statistics: Calculate and interpret descriptive statistics such as mean, median, mode, standard deviation, and range for relevant fashion variables.


Distribution Analysis: Examine the distributions of numeric variables using histograms, box plots, and density plots. Interpret the skewness, kurtosis, and central tendency of the distributions.


Example

============================

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],

    'brand': ['A', 'B', 'A', 'C', 'B'],

    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],

    'price': [29.99, 49.99, 39.99, 59.99, 69.99],

    'quantity': [10, 5, 8, 3, 12]

})


# Histogram: Distribution of 'price' variable

plt.figure(figsize=(8, 6))

sns.histplot(fashion_data['price'], bins=10, kde=True)

plt.title("Histogram: Distribution of 'price' Variable")

plt.xlabel('Price')

plt.ylabel('Frequency')

plt.show()


# Box plot: Distribution of 'quantity' variable

plt.figure(figsize=(8, 6))

sns.boxplot(x=fashion_data['quantity'])

plt.title("Box Plot: Distribution of 'quantity' Variable")

plt.xlabel('Quantity')

plt.show()


# Density plot: Distribution of 'quantity' variable

plt.figure(figsize=(8, 6))

sns.kdeplot(fashion_data['quantity'], fill=True)  # 'fill' replaces the deprecated 'shade' argument

plt.title("Density Plot: Distribution of 'quantity' Variable")

plt.xlabel('Quantity')

plt.ylabel('Density')

plt.show()


===============================

In this example, we use a sample fashion dataset that includes variables such as 'product_id', 'brand', 'color', 'price', and 'quantity'. We focus on examining the distributions of the numeric variables 'price' and 'quantity' using histograms, box plots, and density plots.


Using the seaborn library, we create visualizations for each variable:


Histogram: We plot the distribution of the 'price' variable using a histogram with 10 bins and a kernel density estimate (KDE) line.


Box plot: We create a box plot to visualize the distribution of the 'quantity' variable, showing the median, quartiles, and any potential outliers.


Density plot: We use a density plot to display the distribution of the 'quantity' variable, highlighting the shape and concentration of data points.


Interpreting the visualizations:


Histogram: The histogram provides an overview of the frequency distribution of prices. It allows us to identify the range of prices and observe the shape of the distribution.


Box plot: The box plot illustrates the median, quartiles, and any outliers in the distribution of quantities. It helps us understand the central tendency and spread of the data.


Density plot: The density plot represents the estimated probability density function of the quantity variable. It allows us to assess the shape and smoothness of the distribution.


By analyzing the histograms, box plots, and density plots, we can gain insights into the central tendency, spread, skewness, and potential outliers of the numeric variables in the fashion dataset.
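
The plots can be complemented with the numeric measures mentioned earlier (mean, median, mode, standard deviation, range, skewness, and kurtosis). The following is a minimal sketch using pandas on the same sample values; the name summary is illustrative. With only five observations, skewness and kurtosis are very unstable and are shown here purely to demonstrate the calls.

```python
import pandas as pd

# Numeric and categorical columns from the sample fashion dataset
fashion_data = pd.DataFrame({
    'brand': ['A', 'B', 'A', 'C', 'B'],
    'price': [29.99, 49.99, 39.99, 59.99, 69.99],
    'quantity': [10, 5, 8, 3, 12]
})

numeric = fashion_data[['price', 'quantity']]

# One row per variable, one column per statistic
summary = pd.DataFrame({
    'mean': numeric.mean(),
    'median': numeric.median(),
    'std': numeric.std(),                    # sample standard deviation
    'range': numeric.max() - numeric.min(),
    'skewness': numeric.skew(),              # > 0 means a longer right tail
    'kurtosis': numeric.kurt(),              # excess kurtosis (normal = 0)
})
print(summary.round(3))

# The mode is most useful for categorical variables; mode() returns
# every value tied for the highest frequency
brand_mode = fashion_data['brand'].mode()
print(brand_mode.tolist())
```

In this sample the prices are evenly spaced around their mean, so the skewness of 'price' is zero, which matches the symmetric shape seen in the histogram.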


Visualizing Fashion Data:


Scatter Plots: Explore the relationships between variables by creating scatter plots and interpreting the correlation or association between them. Discuss the concepts of positive, negative, or no correlation.


Bar Plots: Use bar plots to compare categorical variables such as brand, color, or product category. Analyze the frequency or proportion of each category and identify dominant or popular choices.


Line Plots: Visualize trends over time, such as sales volume, revenue, or customer engagement, using line plots. Discuss the seasonal patterns, growth trends, or irregularities observed.


Heatmaps: Create heatmaps to identify patterns and relationships among multiple variables simultaneously. Discuss the color encoding and interpretation of the heatmap.


Example

========================

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],

    'brand': ['A', 'B', 'A', 'C', 'B'],

    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],

    'price': [29.99, 49.99, 39.99, 59.99, 69.99],

    'quantity': [10, 5, 8, 3, 12]

})


# Scatter plot: Price vs. Quantity

plt.figure(figsize=(8, 6))

sns.scatterplot(x='price', y='quantity', data=fashion_data)

plt.title("Scatter Plot: Price vs. Quantity")

plt.xlabel('Price')

plt.ylabel('Quantity')

plt.show()


# Bar plot: Brand Frequencies

plt.figure(figsize=(8, 6))

sns.countplot(x='brand', data=fashion_data)

plt.title("Bar Plot: Brand Frequencies")

plt.xlabel('Brand')

plt.ylabel('Frequency')

plt.show()


# Line plot: Price Trend

plt.figure(figsize=(8, 6))

sns.lineplot(x=fashion_data.index, y='price', data=fashion_data)

plt.title("Line Plot: Price Trend")

plt.xlabel('Index')

plt.ylabel('Price')

plt.show()


# Heatmap: Correlation Matrix

correlation_matrix = fashion_data[['price', 'quantity']].corr()

plt.figure(figsize=(8, 6))

sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r')

plt.title("Heatmap: Correlation Matrix")

plt.show()


========================================


In this example, we utilize a sample fashion dataset with variables such as 'product_id', 'brand', 'color', 'price', and 'quantity'. We demonstrate how to create scatter plots, bar plots, line plots, and heatmaps using these variables.


Scatter plot: We visualize the relationship between 'price' and 'quantity' using a scatter plot. Each point represents a product, with the x-axis representing the price and the y-axis representing the quantity.


Bar plot: We create a bar plot to show the frequency of each brand in the dataset. The x-axis represents the brand categories, and the y-axis represents the frequency count.


Line plot: We plot the price against the row index of the dataset using a line plot. Each point represents a product's price. The index here only stands in for a time axis; with real sales data, a date or period column would be used on the x-axis instead.


Heatmap: We construct a heatmap to display the correlation matrix between 'price' and 'quantity'. The heatmap provides a visual representation of the correlation strength between the two variables.
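
To connect the scatter plot and heatmap to the idea of positive, negative, or no correlation, the coefficient can also be computed and labeled directly. This is a minimal sketch; the helper describe_correlation and its 0.3 cut-off are assumptions made for illustration, not a standard rule.

```python
import pandas as pd

# Numeric columns from the sample fashion dataset
fashion_data = pd.DataFrame({
    'price': [29.99, 49.99, 39.99, 59.99, 69.99],
    'quantity': [10, 5, 8, 3, 12]
})

# Pearson correlation coefficient (the default method of Series.corr)
r = fashion_data['price'].corr(fashion_data['quantity'])


def describe_correlation(r, threshold=0.3):
    """Label a correlation coefficient; the threshold is an illustrative choice."""
    if r >= threshold:
        return 'positive correlation'
    if r <= -threshold:
        return 'negative correlation'
    return 'little or no linear correlation'


print(f"r = {r:.3f}: {describe_correlation(r)}")
```

For this sample the coefficient is close to zero, so the heatmap cell for 'price' and 'quantity' appears near the neutral color of the RdBu_r palette.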


Segmentation and Profiling:


Customer Segmentation: Utilize clustering techniques such as K-means or hierarchical clustering to segment customers based on their purchasing behavior, preferences, or demographics. Interpret and profile each customer segment.


Example:

==================

import pandas as pd

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

    'age': [25, 35, 45, 30, 20, 40, 50, 55, 28, 32],

    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M'],

    'purchase_frequency': [2, 5, 1, 4, 3, 2, 1, 3, 4, 5],

    'average_spend': [100, 200, 50, 150, 120, 180, 60, 90, 110, 130]

})


# Select relevant features for clustering

features = ['age', 'purchase_frequency', 'average_spend']

X = fashion_data[features]


# Perform K-means clustering

k = 3  # Number of clusters

kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # n_init set explicitly to avoid version-dependent defaults

clusters = kmeans.fit_predict(X)


# Add cluster labels to the dataset

fashion_data['cluster'] = clusters


# Visualize the clusters

plt.figure(figsize=(8, 6))

plt.scatter(X['purchase_frequency'], X['average_spend'], c=clusters, cmap='viridis')

plt.title('Customer Segmentation using K-means Clustering')

plt.xlabel('Purchase Frequency')

plt.ylabel('Average Spend')

plt.show()


# Profile each customer segment

# Average only the numeric features; 'gender' is text and cannot be averaged
cluster_profiles = fashion_data.groupby('cluster')[features].mean()

print("Cluster Profiles:")

print(cluster_profiles)


=======================================

In this example, we have a fashion dataset with variables such as 'customer_id', 'age', 'gender', 'purchase_frequency', and 'average_spend'. We focus on segmenting customers based on their purchasing behavior using K-means clustering.


First, we select relevant features for clustering, which in this case are 'age', 'purchase_frequency', and 'average_spend'. We perform K-means clustering with a specified number of clusters (k) and assign cluster labels to each customer.


Next, we visualize the clusters using a scatter plot, where the x-axis represents purchase frequency and the y-axis represents average spend. Each data point is colored based on its assigned cluster label.


Finally, we profile each customer segment by calculating the mean values of the variables within each cluster. This provides insights into the characteristics and preferences of customers in each segment.


Upon running the code, you will see the scatter plot displaying the customer segments and the cluster profiles printed, indicating the average values of age, purchase frequency, and average spend for each cluster.
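
K-means relies on Euclidean distance, so in the example above 'average_spend' (values from 50 to 200) weighs far more heavily than 'purchase_frequency' (values from 1 to 5). A common remedy is to standardize the features first. The sketch below assumes scikit-learn's StandardScaler and also prints the inertia for several values of k as a rough guide for choosing the number of clusters; the name customers is illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same customer values as in the example above
customers = pd.DataFrame({
    'age': [25, 35, 45, 30, 20, 40, 50, 55, 28, 32],
    'purchase_frequency': [2, 5, 1, 4, 3, 2, 1, 3, 4, 5],
    'average_spend': [100, 200, 50, 150, 120, 180, 60, 90, 110, 130]
})
features = ['age', 'purchase_frequency', 'average_spend']

# Standardize each feature to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(customers[features])

# Inertia (within-cluster sum of squares) for several candidate values of k
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    print(k, round(model.inertia_, 2))

# Fit the final model and profile the clusters in the original units
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers['cluster'] = kmeans.fit_predict(X_scaled)
profiles = customers.groupby('cluster')[features].mean()
print(profiles)
```

The clusters are fitted on the scaled values but profiled on the original columns, so the segment averages stay readable as ages, purchase counts, and spend amounts.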



Market Segmentation:

Conduct market segmentation analysis to identify distinct market segments based on variables like age, gender, location, or shopping preferences. Analyze the characteristics and needs of each segment.


==================================

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

    'age': [25, 35, 45, 30, 20, 40, 50, 55, 28, 32],

    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M'],

    'location': ['City A', 'City B', 'City A', 'City C', 'City C',

                 'City B', 'City C', 'City A', 'City B', 'City C'],

    'shopping_preferences': ['Quality', 'Price', 'Brand', 'Quality', 'Price',

                             'Quality', 'Brand', 'Price', 'Brand', 'Quality']

})


# Perform market segmentation analysis

segmentation = fashion_data.groupby(['location', 'shopping_preferences']).size().unstack()


# Visualize the market segments

plt.figure(figsize=(8, 6))

sns.heatmap(segmentation, cmap='Blues', annot=True, fmt='g')

plt.title('Market Segmentation Analysis')

plt.xlabel('Shopping Preferences')

plt.ylabel('Location')

plt.show()


# Analyze the characteristics and needs of each segment

segment_profiles = fashion_data.groupby(['location', 'shopping_preferences']).size().reset_index(name='count')

print("Segment Profiles:")

print(segment_profiles)

====================================

In this example, we have a fashion dataset with variables such as 'customer_id', 'age', 'gender', 'location', and 'shopping_preferences'. We aim to conduct market segmentation analysis to identify distinct market segments based on these variables.


Using the groupby() function in pandas, we group the data by 'location' and 'shopping_preferences' and calculate the size of each segment. This provides a summary of the number of customers within each combination of location and shopping preference.


Next, we visualize the market segments using a heatmap, where the x-axis represents shopping preferences, the y-axis represents location, and the color intensity represents the size of each segment. This visualization helps identify the composition of different market segments.


Finally, we analyze the characteristics and needs of each segment by examining the segment profiles. We group the data by 'location' and 'shopping_preferences' again and calculate the size of each segment. This allows us to understand the distribution of customers across different segments.


Upon running the code, you will see the heatmap displaying the market segments and the segment profiles printed, indicating the location, shopping preferences, and the count of customers within each segment.
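
Counts alone say little about the characteristics of each segment. As a possible extension, the sketch below uses pd.crosstab to express each location's preferences as shares and adds a simple age profile per location. The names counts, shares, and age_by_location are illustrative.

```python
import pandas as pd

# Same sample data as in the example above
fashion_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'age': [25, 35, 45, 30, 20, 40, 50, 55, 28, 32],
    'location': ['City A', 'City B', 'City A', 'City C', 'City C',
                 'City B', 'City C', 'City A', 'City B', 'City C'],
    'shopping_preferences': ['Quality', 'Price', 'Brand', 'Quality', 'Price',
                             'Quality', 'Brand', 'Price', 'Brand', 'Quality']
})

# Number of customers per location and shopping preference
counts = pd.crosstab(fashion_data['location'],
                     fashion_data['shopping_preferences'])
print(counts)

# Share of each preference within a location (each row sums to 1)
shares = pd.crosstab(fashion_data['location'],
                     fashion_data['shopping_preferences'],
                     normalize='index')
print(shares.round(2))

# Age profile of the customers in each location
age_by_location = fashion_data.groupby('location')['age'].agg(['mean', 'min', 'max'])
print(age_by_location)
```

Row-wise shares make locations of different sizes comparable: in this sample, half of the City C customers prioritize quality, while the other two cities are evenly split.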

======================================================

Data Visualization Tools:


Introduction to Data Visualization Libraries:


Matplotlib:


Matplotlib is a widely used data visualization library in Python. It provides a flexible and comprehensive set of plotting functions and customization options.

Benefits: Matplotlib offers a high level of customization, allowing you to create a wide range of visualizations. It supports various plot types, including line plots, scatter plots, bar plots, histograms, and more.

Use cases: Matplotlib is suitable for basic to advanced visualizations in fashion data analysis. You can use it to plot trends, distributions of variables, correlation matrices, and other standard visualizations.


Seaborn:


Seaborn is a statistical data visualization library built on top of Matplotlib. It provides a higher-level interface for creating attractive and informative visualizations.

Benefits: Seaborn simplifies the creation of complex plots by providing optimized functions for statistical visualization. It offers a wide range of color palettes, themes, and built-in statistical functionalities.

Use cases: Seaborn is suitable for visualizing relationships, distributions, and categorical data in fashion data analysis. It is particularly useful for creating heatmaps, violin plots, box plots, and visualizations involving categorical variables.


Plotly:


Plotly is an interactive data visualization library that allows for the creation of interactive and dynamic visualizations. It offers both Python and JavaScript APIs.

Benefits: Plotly provides interactive features like zooming, panning, and hover tooltips. It supports a wide range of visualizations, including scatter plots, line plots, bar plots, box plots, and 3D visualizations.

Use cases: Plotly is suitable for creating interactive dashboards, exploratory data analysis, and sharing visualizations in fashion data analysis. It allows for creating interactive plots for better understanding and exploration of data.


Example

=======================

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import plotly.express as px


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

    'age': [25, 35, 45, 30, 20, 40, 50, 55, 28, 32],

    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M'],

    'location': ['City A', 'City B', 'City A', 'City C', 'City C',

                 'City B', 'City C', 'City A', 'City B', 'City C'],

    'shopping_preferences': ['Quality', 'Price', 'Brand', 'Quality', 'Price',

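                             'Quality', 'Brand', 'Price', 'Brand', 'Quality'],

    # Needed by the Plotly scatter plot below (values from the clustering example)
    'average_spend': [100, 200, 50, 150, 120, 180, 60, 90, 110, 130]

})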


# Matplotlib: Bar Plot

plt.figure(figsize=(8, 6))

fashion_data['location'].value_counts().plot(kind='bar')

plt.title('Bar Plot: Customer Distribution by Location')

plt.xlabel('Location')

plt.ylabel('Count')

plt.show()


# Seaborn: Box Plot

plt.figure(figsize=(8, 6))

sns.boxplot(x='location', y='age', data=fashion_data)

plt.title('Box Plot: Age Distribution by Location')

plt.xlabel('Location')

plt.ylabel('Age')

plt.show()


# Plotly: Scatter Plot (trendline='ols' requires the statsmodels package)

fig = px.scatter(fashion_data, x='age', y='average_spend', color='gender',

                 title='Scatter Plot: Age vs. Average Spend', trendline='ols')

fig.show()

===========================

In this example, we use a sample fashion dataset containing variables like 'customer_id', 'age', 'gender', 'location', and 'shopping_preferences'. We demonstrate the use of Matplotlib, Seaborn, and Plotly for data visualization.


For Matplotlib, we create a bar plot to visualize the customer distribution by location. This helps understand the distribution of customers across different cities.


Using Seaborn, we create a box plot to visualize the age distribution by location. This provides insights into the age differences among customers from different cities.


Lastly, using Plotly, we create a scatter plot to visualize the relationship between age and average spend, with points colored by gender. Because the points are split by color, Plotly fits a separate OLS trendline for each gender group rather than a single overall line.



Interactive Visualizations:


Showcase interactive visualizations using tools like Plotly or Tableau to enable dynamic exploration and presentation of fashion data.


Example

=====================

import pandas as pd

import plotly.express as px


# Sample fashion dataset

fashion_data = pd.DataFrame({

    'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],

    'age': [25, 35, 45, 30, 20, 40, 50, 55, 28, 32],

    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'F', 'M'],

    'location': ['City A', 'City B', 'City A', 'City C', 'City C',

                 'City B', 'City C', 'City A', 'City B', 'City C'],

    'shopping_preferences': ['Quality', 'Price', 'Brand', 'Quality', 'Price',

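                             'Quality', 'Brand', 'Price', 'Brand', 'Quality'],

    # Needed for the y-axis of the scatter plot below (values from the clustering example)
    'average_spend': [100, 200, 50, 150, 120, 180, 60, 90, 110, 130]

})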


# Plotly: Interactive Scatter Plot

fig = px.scatter(fashion_data, x='age', y='average_spend', color='gender',

                 hover_data=['customer_id', 'location', 'shopping_preferences'],

                 title='Fashion Customer Segmentation',

                 labels={'age': 'Age', 'average_spend': 'Average Spend'},

                 template='plotly_dark')


fig.update_traces(marker=dict(size=10))


fig.show()


==========================

In this example, we use a sample fashion dataset with variables like 'customer_id', 'age', 'gender', 'location', and 'shopping_preferences'.


Using Plotly, we create an interactive scatter plot to visualize the relationship between age and average spend. The data points are color-coded based on gender, and additional hover data is displayed on mouseover, including customer ID, location, and shopping preferences.


The interactive features of Plotly allow users to zoom in and out, pan across the plot, and view specific details by hovering over the data points. This enables dynamic exploration and presentation of the fashion data, facilitating deeper insights and analysis.


Upon running the code, an interactive scatter plot will be displayed, allowing you to interact with the plot and explore the fashion data in a dynamic and engaging way.


Exploratory Data Analysis plays a vital role in uncovering meaningful insights and patterns in fashion industry data. By conducting descriptive statistics, visualizations, and segmentation analysis, fashion managers can gain a deep understanding of their data and make data-driven decisions. The techniques discussed in this chapter provide a foundation for further analysis and modeling in fashion management.


EXERCISES


Part 1: Open-Ended Questions

Describe the importance of Exploratory Data Analysis (EDA) in the fashion industry. How can EDA help fashion managers in decision-making and strategy development?

Why is data cleaning an essential step before conducting EDA? Discuss the potential issues that data cleaning addresses, such as missing values, outliers, and inconsistencies.

Choose one numeric variable from the fashion dataset provided in the example. Calculate and interpret its mean, median, and standard deviation. How do these descriptive statistics help in understanding the variable's distribution?


Part 2: Close-Ended Questions

Which type of visualization would you use to analyze the distribution of a numeric variable in the fashion dataset?
a) Scatter plot
b) Histogram
c) Bar plot
d) Line plot

What can you infer from a box plot?
a) Central tendency and spread of data
b) Correlation between variables
c) Frequency distribution of categories
d) Seasonal patterns in data

How would you interpret a negative correlation between two variables in a scatter plot?
a) As one variable increases, the other variable decreases.
b) As one variable increases, the other variable increases.
c) There is no relationship between the two variables.
d) The correlation is not significant.


Part 3: Multiple Choice Questions

Which type of plot would you use to compare the frequency of different product brands in the fashion dataset?
a) Scatter plot
b) Bar plot
c) Line plot
d) Heatmap

What can you infer from a density plot?
a) The frequency of each category in a categorical variable.
b) The distribution of a numeric variable.
c) The correlation between two numeric variables.
d) The average values of a numeric variable in different segments.

How would you interpret a positive skewness in a histogram of prices?
a) The distribution is symmetric.
b) The majority of prices are higher than the mean.
c) The majority of prices are lower than the mean.
d) The distribution is highly dispersed.


Programming Exercises


Create a dataset using this code:

import pandas as pd
import numpy as np

# Generate random fashion dataset
np.random.seed(1)
brands = ['Brand A', 'Brand B', 'Brand C']
categories = ['Shirts', 'Pants', 'Shoes']
prices = np.random.normal(50, 10, 100)
quantities = np.random.randint(1, 10, 100)
ratings = np.random.randint(1, 6, 100)
group_variable = np.random.choice(['Group1', 'Group2'], 100)

fashion_data = pd.DataFrame({
    'brand': np.random.choice(brands, 100),
    'category': np.random.choice(categories, 100),
    'price': prices,
    'quantity': quantities,
    'rating': ratings,
    'group_variable': group_variable
})

# Save the dataset to a CSV file
fashion_data.to_csv("fashion_dataset.csv", index=False)


Exercise 1: Exploratory Data Analysis (EDA)

Write a Python function that takes a fashion dataset as input and performs the following tasks:
Calculate and print the mean and standard deviation of a numeric variable.
Generate a histogram to visualize the distribution of the variable.
Plot a scatter plot to analyze the relationship between two numeric variables.

Exercise 2: Data Cleaning

Write a Python function that takes a fashion dataset as input and performs the following tasks:
Handle missing values by either removing rows or imputing values.
Detect and handle outliers using appropriate techniques, such as Z-score or IQR.
Resolve inconsistencies, such as duplicate records or mismatched data types.

Exercise 3: Visualization Techniques

Write a Python program that reads a fashion dataset and creates the following visualizations:
Generate a bar plot to compare the frequency of different product brands.
Create a box plot to analyze the distribution of a numeric variable.
Plot a density plot to visualize the distribution of a numeric variable.

Exercise 4: Correlation Analysis

Write a Python function that takes a fashion dataset as input and performs the following tasks:
Calculate and print the correlation matrix of numeric variables.
Create a heatmap to visualize the correlation matrix.
Identify and print the variables with the highest positive and negative correlations.

Exercise 5: Statistical Analysis

Write a Python program that reads a fashion dataset and performs the following tasks:
Conduct a t-test to compare the mean of a numeric variable between two groups.
Perform ANOVA to analyze the differences in means across multiple groups.
Calculate and print the p-values of the statistical tests.

Solutions to the Programming Problems

Exercise 1: Exploratory Data Analysis (EDA)

import pandas as pd
import matplotlib.pyplot as plt

def perform_eda(data, numeric_variable, second_variable):
    # Calculate mean and standard deviation
    mean = data[numeric_variable].mean()
    std = data[numeric_variable].std()
    print("Mean:", mean)
    print("Standard Deviation:", std)

    # Generate a histogram
    data[numeric_variable].hist()
    plt.xlabel(numeric_variable)
    plt.ylabel("Frequency")
    plt.title("Distribution of " + numeric_variable)
    plt.show()

    # Plot a scatter plot of the two numeric variables
    plt.scatter(data[numeric_variable], data[second_variable])
    plt.xlabel(numeric_variable)
    plt.ylabel(second_variable)
    plt.title("Scatter plot: " + numeric_variable + " vs " + second_variable)
    plt.show()

# Example usage
fashion_data = pd.read_csv("fashion_dataset.csv")
perform_eda(fashion_data, "price", "quantity")

Exercise 2: Data Cleaning

import pandas as pd

def clean_data(data, numeric_variable, int_column):
    # Handle missing values (dropna returns a new DataFrame, so the input is not modified)
    data = data.dropna()  # Remove rows with missing values

    # Handle outliers
    z_scores = (data[numeric_variable] - data[numeric_variable].mean()) / data[numeric_variable].std()
    data = data[(z_scores > -3) & (z_scores < 3)]  # Keep only values within 3 standard deviations

    # Resolve inconsistencies
    data = data.drop_duplicates()  # Remove duplicate records
    data[int_column] = data[int_column].astype(int)  # Convert data type to int

    return data

# Example usage: check 'price' for outliers and store 'quantity' as integers
fashion_data = pd.read_csv("fashion_dataset.csv")
cleaned_data = clean_data(fashion_data, "price", "quantity")


Exercise 3: Visualization Techniques

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data(data, numeric_variable):
    # Bar plot
    plt.figure(figsize=(10, 6))
    sns.countplot(x="brand", data=data)
    plt.xlabel("Brand")
    plt.ylabel("Frequency")
    plt.title("Frequency of Different Product Brands")
    plt.xticks(rotation=45)
    plt.show()

    # Box plot (horizontal, so only the x-axis carries values)
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=numeric_variable, data=data)
    plt.xlabel(numeric_variable)
    plt.title("Distribution of " + numeric_variable)
    plt.show()

    # Density plot ('fill' replaces the deprecated 'shade' argument)
    plt.figure(figsize=(10, 6))
    sns.kdeplot(data[numeric_variable], fill=True)
    plt.xlabel(numeric_variable)
    plt.ylabel("Density")
    plt.title("Distribution of " + numeric_variable)
    plt.show()

# Example usage
fashion_data = pd.read_csv("fashion_dataset.csv")
visualize_data(fashion_data, "price")

Exercise 4: Correlation Analysis

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def perform_correlation_analysis(data):
    # Calculate correlation matrix for the numeric columns only
    corr_matrix = data.corr(numeric_only=True)

    # Print correlation matrix
    print("Correlation Matrix:")
    print(corr_matrix)

    # Create heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap="RdYlBu")
    plt.title("Correlation Matrix")
    plt.show()

    # Find the variable pairs with the highest positive and negative correlations
    pairs = corr_matrix.unstack().sort_values(ascending=False)
    # Drop each variable's correlation with itself (always 1)
    pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
    highest_positive_corr = pairs.index[0]   # Largest coefficient
    highest_negative_corr = pairs.index[-1]  # Smallest (most negative) coefficient
    print("Variables with highest positive correlation:", highest_positive_corr)
    print("Variables with highest negative correlation:", highest_negative_corr)

# Example usage
fashion_data = pd.read_csv("fashion_dataset.csv")
perform_correlation_analysis(fashion_data)

Exercise 5: Statistical Analysis

import pandas as pd
import scipy.stats as stats

def perform_statistical_analysis(data, numeric_variable, group_variable):
    # Perform t-test between the two groups
    group1 = data[data[group_variable] == "Group1"][numeric_variable]
    group2 = data[data[group_variable] == "Group2"][numeric_variable]
    t_statistic, p_value = stats.ttest_ind(group1, group2)
    print("T-Statistic:", t_statistic)
    print("P-Value:", p_value)

    # Perform ANOVA across all groups found in the grouping column
    groups = [data[data[group_variable] == group][numeric_variable]
              for group in data[group_variable].unique()]
    f_statistic, p_value = stats.f_oneway(*groups)
    print("F-Statistic:", f_statistic)
    print("P-Value:", p_value)

# Example usage: compare 'price' between the groups in 'group_variable'
fashion_data = pd.read_csv("fashion_dataset.csv")
perform_statistical_analysis(fashion_data, "price", "group_variable")

