
Chapter 4: Data Preprocessing and Cleaning


Data preprocessing and cleaning are essential steps in the data science process. Raw data often contains inconsistencies, errors, missing values, outliers, and noise, which can negatively impact the accuracy and reliability of the analysis. This chapter explores the importance of data preprocessing and cleaning in fashion management and discusses various techniques used to handle these challenges effectively.


Data Cleaning Techniques


Data cleaning involves identifying and rectifying errors, inconsistencies, and inaccuracies in the dataset. Here are some common techniques used for data cleaning:


a) Handling Missing Data: Missing data can occur due to various reasons, such as incomplete surveys, technical issues, or data entry errors. Handling missing data requires careful consideration. Techniques for handling missing data include imputation (replacing missing values with estimated values based on other data), deletion of incomplete records or variables, or using advanced techniques like multiple imputation.


Let us take an example and see how it can be done in Python:


===========================

import pandas as pd

# Create a sample fashion dataset
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'C', None, 'D'],
    'color': ['Red', None, 'Blue', 'Green', 'Black'],
    'price': [29.99, 39.99, None, 19.99, 49.99],
    'quantity': [10, 5, 8, None, 12]
}

fashion_data = pd.DataFrame(data)

# Check for missing data
print(fashion_data.isnull().sum())

# Handling missing data

# 1. Remove rows with missing values
fashion_data_cleaned = fashion_data.dropna()

# 2. Impute missing values with the mean
mean_value = fashion_data['price'].mean()
fashion_data['price'] = fashion_data['price'].fillna(mean_value)

# 3. Impute missing values with the mode
mode_value = fashion_data['color'].mode()[0]
fashion_data['color'] = fashion_data['color'].fillna(mode_value)

# 4. Impute missing values with a custom value
fashion_data['brand'] = fashion_data['brand'].fillna('Unknown')

# Verify that missing data has been handled
print(fashion_data.isnull().sum())

===========================

In this example, we create a sample fashion dataset using a dictionary. The dataset contains information about product IDs, brands, colors, prices, and quantities. Some values are intentionally set as missing (None) to demonstrate the handling of missing data.


We then check for missing data using isnull().sum(), which returns the count of missing values in each column.


Next, we apply various techniques to handle missing data:


Removing rows with missing values: We use the dropna() function to remove any rows that have missing values. This approach reduces the dataset size but ensures that only complete records are used for analysis.


Imputing missing values with the mean: We calculate the mean value of the 'price' column using the mean() function and fill the missing values with the mean value using fillna().


Imputing missing values with the mode: We calculate the mode value of the 'color' column using the mode() function and fill the missing values with the mode value using fillna().


Imputing missing values with a custom value: We fill the missing values in the 'brand' column with the custom value 'Unknown' using fillna().


Finally, we verify if missing data is handled by checking again for missing values using isnull().sum().


Remember to adapt this code to your specific fashion dataset and column names. Additionally, consider the appropriateness of the chosen imputation methods based on the characteristics and distribution of the data.
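

The paragraph above also mentions advanced imputation techniques. As a minimal sketch (assuming scikit-learn is installed), KNNImputer estimates each missing numeric value from the most similar rows; it would be applied to the original fashion_data before the fillna() steps above, and only to the numeric columns:


===========================

from sklearn.impute import KNNImputer

# KNNImputer works on numeric data only, so impute 'price' and 'quantity'
numeric_cols = ['price', 'quantity']
imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array; write it back into the DataFrame
fashion_data[numeric_cols] = imputer.fit_transform(fashion_data[numeric_cols])
print(fashion_data.isnull().sum())

===========================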



b) Removing Duplicate Entries: Duplicates in the dataset can distort the analysis and lead to biased results. Identifying and removing duplicate entries ensures data integrity. Duplicate entries can be detected by comparing records based on unique identifiers or combinations of variables.


===========================

import pandas as pd

# Create a sample fashion dataset with duplicate entries
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005', 'P002'],
    'brand': ['A', 'B', 'C', 'D', 'E', 'B'],
    'color': ['Red', 'Blue', 'Green', 'Yellow', 'Black', 'Blue'],
    'price': [29.99, 39.99, 19.99, 49.99, 59.99, 39.99],
    'quantity': [10, 5, 8, 2, 12, 5]
}

fashion_data = pd.DataFrame(data)

# Check for duplicate entries
print(fashion_data.duplicated().sum())

# Remove duplicate entries
fashion_data_cleaned = fashion_data.drop_duplicates()

# Verify that duplicate entries have been removed
print(fashion_data_cleaned)

===========================


In this example, we create a sample fashion dataset using a dictionary. The dataset contains information about product IDs, brands, colors, prices, and quantities. Some entries are intentionally duplicated to demonstrate the removal of duplicate entries.


We first check for duplicate entries using the duplicated() function, which returns a boolean Series indicating whether each row is a duplicate of a previous row. The sum() function is then used to count the total number of duplicate entries.


Next, we remove the duplicate entries using the drop_duplicates() function. This function removes all duplicated rows, keeping only the first occurrence of each unique row.


Finally, we verify if duplicate entries are removed by printing the cleaned fashion dataset.


Remember to adapt this code to your specific fashion dataset and column names. Additionally, consider the criteria for identifying duplicates based on the unique characteristics of your dataset, such as product ID or a combination of columns.
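

If duplicates should be judged on a key column rather than on entire rows, drop_duplicates() accepts a subset parameter. A short sketch on the dataset above, treating rows with the same product_id as duplicates:


===========================

# Keep only the first occurrence of each product_id
fashion_data_by_id = fashion_data.drop_duplicates(subset=['product_id'], keep='first')
print(fashion_data_by_id)

===========================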


c) Correcting Inconsistencies: Inconsistencies in the dataset can arise from human errors, data entry mistakes, or inconsistencies in data sources. Techniques such as standardization, normalization, or regular expression matching can be used to correct inconsistent values and ensure uniformity in the dataset.


Example

===========================

import pandas as pd
import re

# Create a sample fashion dataset with inconsistent values
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'C', 'D', 'E'],
    'color': ['Red', 'blue', 'GREEN', 'Green', 'black'],
    'price': [29.99, 39.99, 19.99, 49.99, 59.99],
    'quantity': [10, 5, 8, 2, 12]
}

fashion_data = pd.DataFrame(data)

# Standardization example: convert brand names to uppercase
fashion_data['brand'] = fashion_data['brand'].str.upper()

# Normalization example: convert color values to lowercase
fashion_data['color'] = fashion_data['color'].str.lower()

# Regular expression example: remove non-numeric characters from product IDs
fashion_data['product_id'] = fashion_data['product_id'].apply(lambda x: re.sub(r'\D', '', x))

# Verify that inconsistencies have been corrected
print(fashion_data)

===========================


In this example, we create a sample fashion dataset using a dictionary. The dataset contains information about product IDs, brands, colors, prices, and quantities. Some values are intentionally set with inconsistent cases, capitalization, or non-numeric characters to demonstrate the correction of inconsistencies.


To remove inconsistencies, we use the following techniques:


Standardization: We use the str.upper() function to convert all values in the 'brand' column to uppercase, ensuring consistent capitalization.


Normalization: We use the str.lower() function to convert all values in the 'color' column to lowercase, ensuring consistent capitalization.


Regular expression: We use a regular expression pattern (\D) to remove all non-numeric characters from the 'product_id' column using the re.sub() function. This ensures that the product IDs contain only numeric characters.


Finally, we verify if the inconsistencies are corrected by printing the cleaned fashion dataset.




d) Handling Outliers: Outliers are extreme values that deviate significantly from the majority of the data. Outliers can occur due to measurement errors, data entry errors, or genuine extreme observations. It is important to identify and handle outliers appropriately, as they can impact statistical analysis and model performance. Techniques such as filtering based on statistical measures, Winsorizing, or imputation can be used to handle outliers.


Example


===========================

import pandas as pd

# Create a sample fashion dataset with an outlier
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'C', 'D', 'E'],
    'price': [29.99, 39.99, 19.99, 999.99, 59.99],
    'quantity': [10, 5, 8, 2, 12]
}

fashion_data = pd.DataFrame(data)

# Handling outliers

# 1. Identify outliers using the z-score
z_score = (fashion_data['price'] - fashion_data['price'].mean()) / fashion_data['price'].std()

# Note: with only five rows, a z-score can never exceed about 1.8,
# so a lower threshold is used here; 3 is a common choice for larger datasets.
threshold = 1.5

outliers = fashion_data[abs(z_score) > threshold]
print("Outliers:")
print(outliers)

# 2. Replace outliers with the median value
median_value = fashion_data['price'].median()
fashion_data.loc[abs(z_score) > threshold, 'price'] = median_value

# Verify that outliers have been handled
print("Fashion data:")
print(fashion_data)

===========================

In this example, we create a sample fashion dataset using a dictionary. The dataset contains information about product IDs, brands, prices, and quantities. The price column intentionally includes an outlier value to demonstrate the handling of outliers.


To handle outliers, we use the following steps:


Identify outliers using the z-score: We calculate the z-score for the 'price' column by subtracting the mean and dividing by the standard deviation. We then define a threshold to flag values that lie far from the mean; 3 is a common rule of thumb for larger datasets, but with only five rows the z-score cannot exceed roughly 1.8, so this example uses 1.5. Rows whose absolute z-score exceeds the threshold are considered outliers.


Replace outliers with the median value: We calculate the median value of the 'price' column and replace the outliers with the median value. This step helps mitigate the impact of outliers on statistical analyses or models.


Finally, we verify if the outliers are handled by printing the outliers and the modified fashion dataset.


Remember to adapt this code to your specific fashion dataset and column names. Additionally, consider other techniques for outlier detection and handling, such as using the interquartile range (IQR) or domain-specific knowledge to define the outlier threshold.
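

As a brief sketch of the IQR approach mentioned above, the following flags values outside 1.5 times the interquartile range; the 1.5 multiplier is the usual convention, but adjust it to your data:


===========================

# Identify outliers in 'price' using the interquartile range (IQR)
q1 = fashion_data['price'].quantile(0.25)
q3 = fashion_data['price'].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

iqr_outliers = fashion_data[(fashion_data['price'] < lower_bound) | (fashion_data['price'] > upper_bound)]
print(iqr_outliers)

===========================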



Data Transformation Techniques


Data transformation involves modifying the original dataset to improve its suitability for analysis or to meet specific requirements. Here are some common data transformation techniques:


a) Feature Scaling: Feature scaling ensures that variables are on a similar scale to prevent one variable from dominating the analysis due to its magnitude. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling values to a specific range, such as 0 to 1).
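

A minimal sketch of both techniques on a numeric 'price' column, using plain pandas operations:


===========================

# Standardization: zero mean, unit variance
fashion_data['price_standardized'] = (
    fashion_data['price'] - fashion_data['price'].mean()
) / fashion_data['price'].std()

# Normalization: rescale to the range 0 to 1
price_min = fashion_data['price'].min()
price_max = fashion_data['price'].max()
fashion_data['price_normalized'] = (fashion_data['price'] - price_min) / (price_max - price_min)

===========================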


b) Variable Encoding: Categorical variables need to be encoded into numerical form for analysis. Techniques like one-hot encoding, label encoding, or ordinal encoding can be used depending on the nature of the categorical variable and the analysis requirements.
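

A short sketch of two of these options on a 'color' column: pd.get_dummies() performs one-hot encoding, while pandas category codes give a simple label encoding (the integer codes imply no real ordering):


===========================

# One-hot encoding: one binary column per color value
color_dummies = pd.get_dummies(fashion_data['color'], prefix='color')

# Label encoding: map each color to an integer code
fashion_data['color_code'] = fashion_data['color'].astype('category').cat.codes

===========================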


c) Dimensionality Reduction: Dimensionality reduction techniques reduce the number of variables in the dataset while preserving important information. Techniques like principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), or feature selection algorithms can be used to reduce dimensionality and improve computational efficiency or interpretability.
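

As an illustrative sketch (assuming scikit-learn is installed), PCA can compress the numeric columns into fewer components; with only two numeric columns this is purely demonstrative, as dimensionality reduction pays off on datasets with many variables:


===========================

from sklearn.decomposition import PCA

# Reduce the numeric columns to a single principal component
numeric_data = fashion_data[['price', 'quantity']]
pca = PCA(n_components=1)
components = pca.fit_transform(numeric_data)

# Share of the total variance captured by the component
print(pca.explained_variance_ratio_)

===========================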


d) Handling Skewed Data: Skewed data distributions can impact the performance of certain models and statistical tests. Techniques such as logarithmic transformation, square root transformation, or Box-Cox transformation can be applied to reduce skewness and normalize the data distribution.
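

A brief sketch of a logarithmic transformation on the 'price' column; np.log1p computes log(1 + x) and is safe for zero values:


===========================

import numpy as np

# Reduce right skew in 'price' with a log transformation
fashion_data['log_price'] = np.log1p(fashion_data['price'])

===========================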


Data Integration and Transformation


Data integration involves combining data from multiple sources or merging different datasets to create a unified dataset for analysis. This step ensures that relevant information from different sources is appropriately combined. Data transformation techniques, such as aggregation, merging, or joining, are applied during the integration process to align variables, resolve inconsistencies, and handle data format differences.


Example of Data Merging

===========================

import pandas as pd

# Create two sample fashion datasets describing the same products
data1 = {
    'product_id': ['P001', 'P002', 'P003'],
    'brand': ['A', 'B', 'C'],
    'color': ['Red', 'Blue', 'Green'],
    'price': [29.99, 39.99, 19.99],
    'quantity': [10, 5, 8]
}

data2 = {
    'product_id': ['P001', 'P002', 'P004'],
    'category': ['T-Shirt', 'Jeans', 'Dress'],
    'supplier': ['S1', 'S2', 'S3']
}

fashion_data1 = pd.DataFrame(data1)
fashion_data2 = pd.DataFrame(data2)

# Merge the two datasets based on product_id
merged_data = pd.merge(fashion_data1, fashion_data2, on='product_id', how='inner')

# Verify the merged dataset
print(merged_data)

===========================

In this example, we create two sample fashion datasets (fashion_data1 and fashion_data2) using dictionaries. The first dataset contains brand, color, price, and quantity information, while the second contains category and supplier information for an overlapping set of product IDs.


To merge the datasets, we use the merge() function from pandas. In this case, we merge the datasets based on the 'product_id' column, using an inner join (how='inner'). This means only the products present in both datasets (here P001 and P002) will be included in the merged dataset.


The resulting merged dataset (merged_data) contains all the columns from both datasets for the matching records.


Finally, we print the merged dataset to verify the successful merging of data from different fashion industry datasets.


Remember to adapt this code to your specific fashion datasets and column names. Additionally, consider the appropriate merge strategy (how parameter) based on your specific data requirements, such as inner join, outer join, left join, or right join.
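

For instance, a left join keeps every row of the first dataset and fills unmatched columns with NaN, while an outer join keeps all rows from both sides. A quick sketch on the datasets above:


===========================

# Left join: keep all rows from fashion_data1, NaN where no match is found
left_merged = pd.merge(fashion_data1, fashion_data2, on='product_id', how='left')

# Outer join: keep all rows from both datasets
outer_merged = pd.merge(fashion_data1, fashion_data2, on='product_id', how='outer')

===========================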


Example of Data Aggregation


===========================

import pandas as pd

# Create a sample fashion dataset
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'A', 'C', 'C'],
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'price': [29.99, 39.99, 19.99, 49.99, 59.99],
    'quantity': [10, 5, 8, 2, 12]
}

fashion_data = pd.DataFrame(data)

# Aggregating data

# 1. Aggregate total units sold by brand
total_sales_by_brand = fashion_data.groupby('brand')['quantity'].sum()
print("Total Sales by Brand:")
print(total_sales_by_brand)

# 2. Aggregate average price by color
average_price_by_color = fashion_data.groupby('color')['price'].mean()
print("\nAverage Price by Color:")
print(average_price_by_color)

===========================

In this example, we create a sample fashion dataset using a dictionary. The dataset contains information about product IDs, brands, colors, prices, and quantities.


To aggregate the data, we use the following steps:


Aggregating total sales by brand: We group the dataset by the 'brand' column using groupby(), and then calculate the sum of the 'quantity' column to get the total sales for each brand.


Aggregating average price by color: We group the dataset by the 'color' column using groupby(), and then calculate the mean of the 'price' column to get the average price for each color.


Finally, we print the aggregated results, which include the total sales by brand and the average price by color.


Remember to adapt this code to your specific fashion dataset and column names. Additionally, consider other aggregation functions and groupings based on the specific analysis or insights you want to derive from your fashion industry data.
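

For example, agg() with named aggregations can compute several statistics in one pass, as in this sketch:


===========================

# Several aggregations per brand in a single call
brand_summary = fashion_data.groupby('brand').agg(
    total_quantity=('quantity', 'sum'),
    average_price=('price', 'mean'),
    max_price=('price', 'max')
)
print(brand_summary)

===========================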



This chapter has highlighted the importance of data preprocessing and cleaning in the data science process. Data cleaning techniques help to rectify errors, handle missing values, remove duplicates, and correct inconsistencies in the dataset. Data transformation techniques facilitate feature scaling, variable encoding, dimensionality reduction, and handling skewed data. By effectively preprocessing and cleaning data, fashion management professionals can ensure the quality and reliability of their analyses, leading to more accurate and meaningful insights.



Exercises


Consider this dataset, which you can generate by executing this code:

===========================

import pandas as pd

data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'A', 'C', 'C'],
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'price': [29.99, 39.99, None, 49.99, 59.99],
    'quantity': [10, 5, 8, 2, 12]
}

fashion_data = pd.DataFrame(data)

===========================


Exercise 1:

Remove any duplicate entries from the fashion dataset.


Exercise 2:

Handle missing values in the dataset by filling them with the mean price of the available data.


Exercise 3:

Normalize the 'price' column by scaling the values between 0 and 1 using min-max normalization.


Exercise 4:

Create a new column 'total_price' by multiplying the 'price' and 'quantity' columns.


Exercise 5:

Convert the 'color' column to categorical values using one-hot encoding.


Solution:

===========================

import pandas as pd

# Sample dataset
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'brand': ['A', 'B', 'A', 'C', 'C'],
    'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'price': [29.99, 39.99, None, 49.99, 59.99],
    'quantity': [10, 5, 8, 2, 12]
}

fashion_data = pd.DataFrame(data)

# Exercise 1: Remove duplicate entries
fashion_data = fashion_data.drop_duplicates()

# Exercise 2: Handle missing values by filling with the mean price
mean_price = fashion_data['price'].mean()
fashion_data['price'] = fashion_data['price'].fillna(mean_price)

# Exercise 3: Normalize the 'price' column using min-max normalization
min_price = fashion_data['price'].min()
max_price = fashion_data['price'].max()
fashion_data['normalized_price'] = (fashion_data['price'] - min_price) / (max_price - min_price)

# Exercise 4: Create a 'total_price' column by multiplying 'price' and 'quantity'
fashion_data['total_price'] = fashion_data['price'] * fashion_data['quantity']

# Exercise 5: One-hot encode the 'color' column
one_hot_encoded = pd.get_dummies(fashion_data['color'], prefix='color')
fashion_data = pd.concat([fashion_data, one_hot_encoded], axis=1)

# Print the resulting dataset after preprocessing
print(fashion_data)

===========================

This solution demonstrates how to perform the given data preprocessing and cleaning exercises using the sample fashion dataset. Here's a summary of each exercise's solution:


Remove duplicate entries: The drop_duplicates() function is used to remove any duplicate rows from the dataset.


Handle missing values by filling with mean price: The mean price is calculated using the mean() function, and the missing values in the 'price' column are filled using the fillna() function.


Normalize 'price' column using min-max normalization: The minimum and maximum values of the 'price' column are calculated using the min() and max() functions, respectively. Then, the 'price' column is normalized by subtracting the minimum value and dividing by the range (maximum minus minimum).


Create 'total_price' column by multiplying 'price' and 'quantity': The 'total_price' column is created by multiplying the 'price' and 'quantity' columns together.


One-hot encode 'color' column: The get_dummies() function is used to perform one-hot encoding on the 'color' column, creating separate binary columns for each unique color value. The resulting one-hot encoded columns are then concatenated with the original dataset using concat().


After applying these preprocessing steps, the resulting fashion dataset will have removed duplicates, handled missing values, normalized prices, created a 'total_price' column, and one-hot encoded the 'color' column.


Case Studies


Here is a synthetic dataset of 10,000 rows to use in the case studies below:


===========================

import pandas as pd
import numpy as np

# Create a synthetic fashion dataset
np.random.seed(42)

product_ids = ['P' + str(i).zfill(4) for i in range(1, 10001)]
brands = np.random.choice(['A', 'B', 'C'], size=10000)
colors = np.random.choice(['Red', 'Blue', 'Green', 'Yellow'], size=10000)
prices = np.random.uniform(10, 100, size=10000)
quantities = np.random.randint(1, 20, size=10000)

fashion_data = pd.DataFrame({
    'product_id': product_ids,
    'brand': brands,
    'color': colors,
    'price': prices,
    'quantity': quantities
})

# Print the first rows of the synthetic fashion dataset
print(fashion_data.head())

===========================

For all the case studies below, use the dataset above.


Case Study 1: Handling Missing Data

Explore a fashion dataset and identify missing values in important columns such as price or quantity. Implement different strategies to handle missing data, such as dropping rows with missing values, filling missing values with mean or median, or using advanced imputation techniques like K-nearest neighbors or regression-based imputation.


Case Study 2: Outlier Detection and Treatment

Analyze a fashion dataset to identify outliers in numerical columns such as price or quantity. Apply outlier detection techniques, such as Z-score or IQR (Interquartile Range), to identify and remove or correct the outliers. Evaluate the impact of outlier treatment on data distribution and statistical analysis.


Case Study 3: Data Standardization and Normalization

Examine a fashion dataset with numeric columns, such as price or quantity, that have different scales. Implement techniques like standardization or normalization to transform the data to a common scale. Analyze the effects of these techniques on data distribution and model performance.


Case Study 4: Handling Inconsistent Categorical Data

Investigate a fashion dataset with categorical columns like brand or color that may have inconsistent values or spelling variations. Utilize techniques such as data cleaning, string manipulation, or regular expressions to standardize and correct inconsistencies in the categorical data. Assess the impact of this preprocessing step on data quality and analysis outcomes.


Case Study 5: Data Integration and Transformation

Merge multiple fashion datasets from different sources, considering common columns such as product ID or brand. Perform data integration using pandas' merge functionality and handle any inconsistencies or conflicts that may arise during the merging process. Transform the merged dataset to create new derived features or aggregated variables that provide additional insights into the fashion industry.


These case studies allow you to apply the concepts and techniques of data preprocessing and cleaning to real-world scenarios in the fashion industry. Each case study presents a unique challenge and provides an opportunity to gain practical experience in handling data quality issues, ensuring data consistency, and preparing data for further analysis.



Solutions to the Case Studies


Case Study 1: Handling Missing Data


===========================

# Identify missing values
missing_values = fashion_data.isnull().sum()
print("Missing Values:")
print(missing_values)

# Fill any missing prices with the mean price
# (the synthetic dataset is complete, so this is a no-op here; it serves as a template)
mean_price = fashion_data['price'].mean()
fashion_data['price'] = fashion_data['price'].fillna(mean_price)

===========================



Case Study 2: Outlier Detection and Treatment


===========================

# Detect outliers in the 'price' column using the z-score
z_scores = (fashion_data['price'] - fashion_data['price'].mean()) / fashion_data['price'].std()
outliers = fashion_data[abs(z_scores) > 3]
print("Outliers:")
print(outliers)

# Remove outliers from the dataset
# (note: uniformly distributed prices stay within about 1.73 standard deviations
# of the mean, so no rows are flagged on this synthetic data; the code illustrates
# the technique)
fashion_data = fashion_data[abs(z_scores) <= 3]

===========================


Case Study 3: Data Standardization and Normalization


===========================

# Standardize the 'price' column using the z-score
fashion_data['standardized_price'] = (
    fashion_data['price'] - fashion_data['price'].mean()
) / fashion_data['price'].std()

# Normalize the 'quantity' column using min-max normalization
min_quantity = fashion_data['quantity'].min()
max_quantity = fashion_data['quantity'].max()
fashion_data['normalized_quantity'] = (fashion_data['quantity'] - min_quantity) / (max_quantity - min_quantity)

===========================



Case Study 4: Handling Inconsistent Categorical Data


===========================

# Replace shorthand values in the 'brand' column with full brand names
fashion_data['brand'] = fashion_data['brand'].replace({'A': 'Brand A', 'B': 'Brand B', 'C': 'Brand C'})

# Standardize values in the 'color' column using string replacement
fashion_data['color'] = fashion_data['color'].str.replace('Red', 'Red Color').str.replace('Blue', 'Blue Color')

===========================


Case Study 5: Data Integration and Transformation


===========================

# Create a second synthetic fashion dataset
np.random.seed(123)
product_ids = ['P' + str(i).zfill(4) for i in range(10001, 20001)]
brands = np.random.choice(['A', 'D', 'E'], size=10000)
colors = np.random.choice(['Red', 'Green', 'Yellow'], size=10000)
prices = np.random.uniform(50, 150, size=10000)
quantities = np.random.randint(5, 25, size=10000)

fashion_data2 = pd.DataFrame({
    'product_id': product_ids,
    'brand': brands,
    'color': colors,
    'price': prices,
    'quantity': quantities
})

# Combine the two fashion datasets by stacking rows
# (their product IDs do not overlap, so concatenation is the appropriate integration step)
merged_data = pd.concat([fashion_data, fashion_data2], ignore_index=True)

# Group by 'brand' and calculate the average price and total quantity
# (if Case Study 4 was run first, harmonize labels such as 'A' vs 'Brand A' before grouping)
grouped_data = merged_data.groupby('brand').agg({'price': 'mean', 'quantity': 'sum'})
print(grouped_data)

===========================


