
Chapter 6: Predictive Analytics for Fashion Forecasting: Exercises and Solutions


Exercise 1:

Write a program that loads a dataset from a CSV file, splits it into training and testing sets using train_test_split, and fits a Support Vector Machine (SVM) classifier on the training data. Finally, evaluate the model using accuracy_score on the test set.

Dataset 

import pandas as pd

from sklearn.datasets import make_classification


# Generate synthetic dataset

X, y = make_classification(

    n_samples=1000,

    n_features=5,

    n_informative=3,

    n_redundant=2,

    n_classes=2,

    random_state=42

)


# Create a DataFrame from the generated data

df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])

df['target_variable'] = y


# Save the dataset to a CSV file

df.to_csv('dataset.csv', index=False)


Solution

import pandas as pd

from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score


# Load the dataset from CSV file

df = pd.read_csv('dataset.csv')


# Define the predictor variables and target variable

X = df.drop('target_variable', axis=1)

y = df['target_variable']


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Create and fit the Support Vector Machine classifier

svm = SVC()

svm.fit(X_train, y_train)


# Make predictions on the test set

y_pred = svm.predict(X_test)


# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Exercise 2:

Create a program that reads a dataset from a CSV file, preprocesses the data by scaling the numerical features and encoding categorical variables, and then performs dimensionality reduction using Principal Component Analysis (PCA). Fit a logistic regression model on the transformed data and evaluate its performance using cross_val_score.

Dataset

import pandas as pd

from sklearn.datasets import make_classification


# Generate synthetic dataset

X, y = make_classification(

    n_samples=1000,

    n_features=5,

    n_informative=3,

    n_redundant=2,

    n_classes=2,

    random_state=42

)


# Create a DataFrame from the generated data

df = pd.DataFrame(X, columns=['numerical1', 'numerical2', 'numerical3', 'categorical1', 'categorical2'])

df['target_variable'] = y


# Convert two features into categorical string labels by thresholding
# (the values produced by make_classification are continuous, so mapping
#  0/1 directly would yield missing values)

df['categorical1'] = (df['categorical1'] > 0).map({False: 'A', True: 'B'})

df['categorical2'] = (df['categorical2'] > 0).map({False: 'X', True: 'Y'})


# Scaling and one-hot encoding are deliberately left to the solution program,
# so the CSV contains raw numerical columns and categorical string labels


# Save the dataset to a CSV file

df.to_csv('dataset.csv', index=False)


Solution

import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score


# Load the dataset from CSV file

df = pd.read_csv('dataset.csv')


# Separate the predictor variables and target variable

X = df.drop('target_variable', axis=1)

y = df['target_variable']


# Preprocess the data

# Scale the numerical features

numerical_features = X.select_dtypes(include=['float64', 'int64'])

scaler = StandardScaler()

scaled_numerical_features = scaler.fit_transform(numerical_features)


# Encode categorical variables

categorical_features = X.select_dtypes(include=['object'])

encoder = OneHotEncoder(sparse_output=False)

encoded_categorical_features = encoder.fit_transform(categorical_features)


# Combine the scaled numerical and encoded categorical features

preprocessed_X = pd.DataFrame(

    data=scaled_numerical_features,

    columns=numerical_features.columns

).join(

    pd.DataFrame(

        data=encoded_categorical_features,

        columns=encoder.get_feature_names_out(categorical_features.columns)

    )

)


# Perform dimensionality reduction using PCA

pca = PCA(n_components=3)

transformed_X = pca.fit_transform(preprocessed_X)


# Fit a logistic regression model on the transformed data

logreg = LogisticRegression()

logreg.fit(transformed_X, y)


# Evaluate the model using cross_val_score

scores = cross_val_score(logreg, transformed_X, y, cv=5)

average_accuracy = scores.mean()

print("Average Accuracy:", average_accuracy)


Exercise 3:

Write a program that loads a dataset from a CSV file, splits it into training and testing sets, and trains a Random Forest Classifier on the training data. Use GridSearchCV to tune the hyperparameters of the Random Forest Classifier and find the best combination. Finally, evaluate the model's performance on the test set using classification_report.

Dataset 

import pandas as pd

from sklearn.datasets import make_classification


# Generate synthetic dataset

X, y = make_classification(

    n_samples=1000,

    n_features=5,

    n_informative=3,

    n_redundant=2,

    n_classes=2,

    random_state=42

)


# Create a DataFrame from the generated data

df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])

df['target_variable'] = y


# Save the dataset to a CSV file

df.to_csv('dataset.csv', index=False)


Solution

import pandas as pd

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import classification_report


# Load the dataset from CSV file

df = pd.read_csv('dataset.csv')


# Separate the predictor variables and target variable

X = df.drop('target_variable', axis=1)

y = df['target_variable']


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Create a Random Forest Classifier

rf = RandomForestClassifier()


# Define the hyperparameters to tune

param_grid = {

    'n_estimators': [100, 200, 300],

    'max_depth': [None, 5, 10],

    'min_samples_split': [2, 5, 10]

}


# Perform GridSearchCV to find the best combination of hyperparameters

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)

grid_search.fit(X_train, y_train)


# Get the best model

best_model = grid_search.best_estimator_


# Make predictions on the test set

y_pred = best_model.predict(X_test)


# Evaluate the model's performance

report = classification_report(y_test, y_pred)

print("Classification Report:")

print(report)


Exercise 4:

Create a program that reads a dataset from a CSV file, preprocesses the data by imputing missing values and scaling the features, and splits it into training and testing sets. Fit a K-Nearest Neighbors (KNN) classifier on the training data and determine the optimal value of K using cross-validation. Evaluate the model's performance on the test set using accuracy_score.

Dataset 

import pandas as pd

from sklearn.datasets import make_classification

from numpy import nan


# Generate synthetic dataset with missing values

X, y = make_classification(

    n_samples=1000,

    n_features=5,

    n_informative=3,

    n_redundant=2,

    n_classes=2,

    random_state=42

)


# Introduce missing values

X[10:20, 1] = nan

X[50:55, 3] = nan

X[200:210, 2] = nan


# Create a DataFrame from the generated data

df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])

df['target_variable'] = y


# Save the dataset to a CSV file

df.to_csv('dataset.csv', index=False)


Solution

import pandas as pd

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.metrics import accuracy_score


# Load the dataset from CSV file

df = pd.read_csv('dataset.csv')


# Separate the predictor variables and target variable

X = df.drop('target_variable', axis=1)

y = df['target_variable']


# Preprocess the data

# Impute missing values

imputer = SimpleImputer(strategy='mean')

X_imputed = imputer.fit_transform(X)


# Scale the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X_imputed)


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


# Fit a K-Nearest Neighbors (KNN) classifier

k_values = [3, 5, 7, 9, 11]  # Values of K to evaluate

best_accuracy = 0

best_k = 0


for k in k_values:

    knn = KNeighborsClassifier(n_neighbors=k)

    scores = cross_val_score(knn, X_train, y_train, cv=5)

    average_accuracy = scores.mean()


    if average_accuracy > best_accuracy:

        best_accuracy = average_accuracy

        best_k = k


# Fit the best KNN model on the training data

knn = KNeighborsClassifier(n_neighbors=best_k)

knn.fit(X_train, y_train)


# Make predictions on the test set

y_pred = knn.predict(X_test)


# Evaluate the model's performance

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)



Exercise 5:

Write a program that loads a dataset from a CSV file, preprocesses the data by applying feature selection techniques such as SelectKBest or Recursive Feature Elimination (RFE). Split the data into training and testing sets and train a Decision Tree Classifier on the selected features. Evaluate the model's performance using a confusion matrix and plot the decision tree using graphviz.

Dataset

import pandas as pd

from sklearn.datasets import make_classification


# Generate synthetic dataset

X, y = make_classification(

    n_samples=1000,

    n_features=10,

    n_informative=5,

    n_redundant=2,

    n_classes=2,

    random_state=42

)


# Create a DataFrame from the generated data

df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5',

                              'feature6', 'feature7', 'feature8', 'feature9', 'feature10'])

df['target_variable'] = y


# Save the dataset to a CSV file

df.to_csv('dataset.csv', index=False)


Solution

import pandas as pd

from sklearn.feature_selection import SelectKBest, RFE

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix

from sklearn.tree import export_graphviz

import graphviz


# Load the dataset from CSV file

df = pd.read_csv('dataset.csv')


# Separate the predictor variables and target variable

X = df.drop('target_variable', axis=1)

y = df['target_variable']


# Preprocess the data - Apply feature selection

# SelectKBest

kbest = SelectKBest(k=3)  # Select top 3 features

X_selected = kbest.fit_transform(X, y)


# Recursive Feature Elimination (RFE)

# estimator = DecisionTreeClassifier()  # or any other classifier

# rfe = RFE(estimator, n_features_to_select=3)  # Select top 3 features

# X_selected = rfe.fit_transform(X, y)


# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)


# Train a Decision Tree Classifier on the selected features

dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)


# Make predictions on the test set

y_pred = dt.predict(X_test)


# Evaluate the model's performance using a confusion matrix

cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")

print(cm)


# Plot the decision tree using graphviz

dot_data = export_graphviz(dt, out_file=None, filled=True, rounded=True, special_characters=True)

graph = graphviz.Source(dot_data)

graph.render("decision_tree")  # Save the decision tree to a file



Appendix 2: Use of Python for Data Science


In recent years, the fashion industry has witnessed a significant transformation with the integration of data science and analytics. The ability to analyze and interpret vast amounts of data has become crucial for fashion companies to gain a competitive edge. Python, a versatile and powerful programming language, has emerged as a preferred language for data science in the fashion industry. In this appendix, we explore the reasons behind Python's popularity and its applications in the fashion industry.


The Rise of Python in Data Science

Python has gained immense popularity in the field of data science due to its simplicity, flexibility, and extensive ecosystem of libraries and frameworks. The language's clear and readable syntax makes it accessible to both experienced programmers and beginners. Additionally, Python's vast collection of libraries, such as NumPy, Pandas, and Matplotlib, provides a rich set of tools for data manipulation, analysis, and visualization.


Data Collection and Cleaning

Data is the foundation of any data science project. In the fashion industry, data can be collected from various sources, including e-commerce websites, social media platforms, customer feedback, and supply chain systems. Python offers powerful libraries like Beautiful Soup and Scrapy, which assist in web scraping, enabling fashion companies to extract relevant data from websites. Once the data is collected, Python's data manipulation libraries, such as Pandas, allow for efficient cleaning, preprocessing, and transforming of the data to make it suitable for analysis.
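
As a minimal sketch of this workflow, the snippet below scrapes product names and prices from a hypothetical catalogue page and then tidies the results with Pandas. The URL and the CSS class names are placeholders; a real site would need its own selectors, and its terms of use should be respected.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical catalogue page; replace with a real URL before running
url = "https://example.com/new-arrivals"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The tag and class names below are placeholders for whatever the target page uses
records = []
for item in soup.find_all("div", class_="product-card"):
    name = item.find("h2", class_="product-name")
    price = item.find("span", class_="price")
    records.append({
        "name": name.get_text(strip=True) if name else None,
        "price": price.get_text(strip=True) if price else None,
    })

# Clean the scraped data with Pandas: drop incomplete rows, convert prices to numbers
df = pd.DataFrame(records, columns=["name", "price"])
df = df.dropna()
df["price"] = pd.to_numeric(df["price"].str.replace("$", "", regex=False), errors="coerce")
print(df.head())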


Data Analysis and Machine Learning

Python's extensive ecosystem of libraries makes it a go-to language for data analysis and machine learning in the fashion industry. Fashion companies can leverage libraries like Scikit-learn and TensorFlow to build and train machine learning models for various applications, such as customer segmentation, demand forecasting, and trend analysis. These models can provide valuable insights into customer preferences, optimize inventory management, and predict fashion trends.
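
As a brief, self-contained illustration of the kind of analysis Scikit-learn enables, the sketch below segments a synthetic set of customers with K-means clustering. The feature names, the randomly generated values, and the choice of four segments are illustrative rather than drawn from real fashion data.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic customer data; in practice this would come from sales or CRM systems
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "annual_spend": rng.gamma(shape=2.0, scale=500.0, size=300),
    "purchase_frequency": rng.poisson(lam=6, size=300),
    "avg_basket_size": rng.normal(loc=80, scale=20, size=300),
})

# Scale the features so each contributes equally to the distance calculation
X = StandardScaler().fit_transform(customers)

# Group customers into a handful of segments (the choice of 4 is illustrative)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X)

# Inspect the average profile of each segment
print(customers.groupby("segment").mean())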


Image Analysis and Computer Vision

Visual data plays a crucial role in the fashion industry, and Python provides excellent support for image analysis and computer vision tasks. Libraries such as OpenCV, TensorFlow, and Keras enable fashion companies to develop advanced computer vision models for tasks like image classification, object detection, and image generation. These techniques can be applied to analyze product images, identify fashion trends, and create personalized shopping experiences for customers.
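
As an illustrative sketch rather than a production architecture, the following Keras model classifies garment images into ten hypothetical categories. The input size, layer sizes, and number of classes are assumptions, and real use would require a labelled image dataset.

import tensorflow as tf

# A minimal convolutional network for classifying product images
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # ten hypothetical garment categories
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training would then be a call such as:
# model.fit(train_images, train_labels, epochs=10, validation_split=0.2)
model.summary()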


Natural Language Processing

In addition to visual data, textual data is abundant in the fashion industry through customer reviews, social media comments, and fashion articles. Python's Natural Language Processing (NLP) libraries, such as NLTK and SpaCy, allow fashion companies to extract insights from text data. Sentiment analysis can help monitor customer feedback, topic modeling can identify emerging fashion trends, and text generation techniques can be used to create personalized fashion recommendations.
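
The snippet below is a small example of sentiment analysis with NLTK's VADER analyzer; the two sample reviews are invented for illustration.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon on first use
nltk.download("vader_lexicon", quiet=True)

reviews = [
    "Love the fit and the fabric quality, will definitely buy again!",
    "The dress looked nothing like the photos and arrived late.",
]

# Score each review; 'compound' ranges from -1 (most negative) to +1 (most positive)
sia = SentimentIntensityAnalyzer()
for review in reviews:
    scores = sia.polarity_scores(review)
    print(f"{scores['compound']:+.2f}  {review}")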


Data Visualization and Reporting

Effective communication of data insights is crucial in the fashion industry. Python's visualization libraries, such as Matplotlib, Seaborn, and Plotly, provide a wide range of options to create compelling visualizations and interactive dashboards. These visualizations can be used to present trends, sales performance, and consumer behavior to stakeholders, enabling data-driven decision-making.
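
As a simple illustration, the following Matplotlib snippet plots a year of invented monthly sales figures as a line chart; the numbers are placeholders for real sales data.

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative monthly sales figures; in practice these would come from sales records
sales = pd.Series(
    [120, 135, 160, 150, 180, 210, 230, 220, 205, 240, 310, 380],
    index=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
)

# A simple line chart of the seasonal sales pattern
fig, ax = plt.subplots(figsize=(8, 4))
sales.plot(ax=ax, marker="o")
ax.set_title("Monthly Sales (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
plt.tight_layout()
plt.show()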


Collaboration and Community Support

Python's popularity in the data science community ensures a vast pool of resources, tutorials, and forums for fashion professionals to learn and collaborate. The open-source nature of Python encourages the development and sharing of libraries, ensuring continuous innovation and access to cutting-edge techniques.


Case Study: Personalized Fashion Recommendations

To illustrate the power of Python in data science for the fashion industry, let's consider a case study on personalized fashion recommendations. By analyzing customer browsing history, purchase behavior, and preferences, a fashion company can leverage Python's data science capabilities to build a recommendation system. This system can suggest relevant fashion items to individual customers, enhancing the shopping experience and increasing sales.


Using Python's data manipulation libraries, the company can preprocess and clean the customer data. Then, by applying machine learning algorithms from Scikit-learn or deep learning models from TensorFlow, the company can create a personalized recommendation model. Finally, Python's visualization libraries can be used to present the recommendations in an interactive and visually appealing manner.
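
As a highly simplified sketch of one possible approach (content-based similarity, which is only one of several recommendation techniques), the snippet below measures how similar items are from a toy attribute table and suggests items close to a recent purchase. The items, attributes, and values are invented for illustration.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy item-attribute matrix (1 = the item has the attribute); real systems would
# use purchase histories, embeddings, or richer product features
items = pd.DataFrame(
    {
        "casual": [1, 1, 0, 0, 1],
        "formal": [0, 0, 1, 1, 0],
        "summer": [1, 0, 0, 1, 1],
        "winter": [0, 1, 1, 0, 0],
    },
    index=["T-shirt", "Hoodie", "Blazer", "Linen shirt", "Shorts"],
)

# Similarity between every pair of items based on their attributes
similarity = pd.DataFrame(
    cosine_similarity(items), index=items.index, columns=items.index
)

# Recommend the items most similar to something the customer just bought
purchased = "T-shirt"
recommendations = similarity[purchased].drop(purchased).sort_values(ascending=False)
print(recommendations.head(3))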


Python has emerged as a preferred language for data science in the fashion industry due to its simplicity, flexibility, and powerful ecosystem of libraries. From data collection and cleaning to advanced analytics, machine learning, computer vision, and natural language processing, Python provides a wide range of tools and techniques to extract valuable insights from fashion data. By harnessing the power of Python, fashion companies can optimize their operations, enhance customer experiences, and stay ahead in this data-driven industry.

Appendix 1: Analytics Tools Used in the Book


The following are the analytics and machine learning tools used in the book and their applications:

1. Summary Statistics

2. Distribution Analysis

3. K-means Clustering

4. Regression, Regression 2, Regression 3

5. Time Series Analysis

6. Machine Learning - Random Forest, Random Forest 2

7. Machine Learning - Gradient Boosting

8. Machine Learning - Support Vector Machines

9. Machine Learning - Neural Networks

10. Hierarchical Clustering

11. Linear Programming

12. Recommendation Systems

13. Sentiment Analysis, Sentiment Analysis 2

14. Network Analysis

15. Markov Chains



Preface: Data Science for Fashion Management using Python


In today's digital age, data has become a valuable asset for businesses across various industries, and the fashion industry is no exception. Data science, a multidisciplinary field that combines statistical analysis, machine learning, and domain knowledge, offers powerful tools and techniques to extract insights from vast amounts of data. In the context of fashion management, data science plays a pivotal role in driving strategic decision-making, enhancing operational efficiency, and understanding consumer preferences.

The primary objective of this book is to provide a comprehensive introduction to data science and its applications in fashion management. It aims to equip fashion professionals, managers, and aspiring data scientists with the necessary knowledge and skills to leverage data-driven approaches in their decision-making processes. By combining the principles of data science with fashion management expertise, this book aims to bridge the gap between the two domains and foster innovation within the fashion industry.

Data science offers numerous benefits and opportunities for the fashion industry. By analyzing large datasets, fashion businesses can gain valuable insights into consumer behavior, market trends, and product performance. This enables them to make informed decisions regarding product development, inventory management, pricing strategies, marketing campaigns, and more. Additionally, data science techniques can optimize supply chain operations, enhance customer segmentation, and personalize shopping experiences, leading to improved customer satisfaction and loyalty.

To facilitate a comprehensive understanding of data science in fashion management, this book is divided into several chapters, each focusing on different aspects of the field. Here's a brief overview of the chapters:


Chapter 1: Fundamentals of Fashion Management - This chapter provides a foundational understanding of fashion management, covering key areas such as product development, retail operations, supply chain management, merchandising, marketing, and consumer behavior.


Chapter 2: Introduction to Data Science - Here, we introduce the fundamental concepts and techniques of data science, including data collection, preprocessing, exploratory data analysis, statistical modeling, and machine learning.


Chapter 3: Data Sources and Data Collection in Fashion - This chapter explores the various sources of data available in the fashion industry and the process of collecting and organizing relevant data for analysis.


Chapter 4: Data Preprocessing and Cleaning - We delve into the critical steps involved in ensuring data quality through preprocessing and cleaning techniques specifically tailored for fashion data.


Chapter 5: Exploratory Data Analysis in Fashion - In this chapter, we showcase how exploratory data analysis techniques can be applied to gain insights into fashion trends, customer preferences, and market dynamics.


Chapter 6: Predictive Analytics for Fashion Forecasting - Here, we demonstrate how predictive modeling techniques can be used to forecast sales, demand, and consumer behavior in the fashion industry.


Chapter 7: Customer Segmentation and Personalization - This chapter explores the importance of customer segmentation and how data science can enable personalized experiences in the fashion industry.


Chapter 8: Supply Chain Optimization - We discuss how data science techniques can optimize the fashion supply chain, from inventory management to production planning and logistics optimization.


Chapter 9: Pricing and Revenue Optimization - This chapter highlights how data science can inform pricing strategies, dynamic pricing, markdown optimization, and revenue management in the fashion industry.


Chapter 10: Social Media and Fashion Influence - We delve into the role of social media in shaping fashion trends and how data science can analyze social media data to identify influencers and measure brand sentiment.


Chapter 11: Markov Chains in Fashion Management - We discuss how the concept of Markov chains can be used to address some of the most important issues in fashion management.


Chapter 12: Ethical Considerations in Fashion Data Science - We address the ethical implications of data collection, privacy concerns, algorithmic bias, and the fair use of data in the context of fashion management.


Chapter 13: Case Studies and Real-world Examples - This chapter presents practical case studies and real-world examples of successful applications of data science in various aspects of fashion management.


Chapter 14: The Future of Data Science in Fashion - We discuss emerging trends, technologies, and potential future applications of data science in the fashion industry.


Chapter 15: Conclusion - The book concludes by summarizing the key concepts covered and highlighting the transformative potential of data science in driving innovation and success in fashion management.


By the end of this book, readers will have gained a solid foundation in data science principles and a deep understanding of how these principles can be applied to address the unique challenges and opportunities in the field of fashion management.


A Note about the Python Programs Used in the Book

Throughout the book, we primarily use small datasets for illustrative purposes. These datasets are carefully selected to highlight specific concepts and provide a clear understanding of the techniques being discussed. While working with small datasets, it becomes easier for readers to comprehend and reproduce the results presented in the book. However, it is important to note that the techniques and programs can be easily adapted to handle large-scale industry datasets commonly encountered in the fashion industry.

To make the most of the programming examples provided in this book, it is assumed that readers have a basic understanding of Python programming and related libraries such as Pandas, Matplotlib, and machine learning libraries like Scikit-learn or TensorFlow. Familiarity with these libraries will allow readers to grasp the code logic and adapt it to their specific needs. If you are new to Python or these libraries, it is recommended to first acquire the necessary foundational knowledge before diving into the programming examples.

The Python programs presented in this book are designed to be plug and play. This means that readers can simply copy the code provided and use it in a suitable programming environment, such as Jupyter Notebook or any Python Integrated Development Environment (IDE). It is important to note that the programs may have dependencies on specific libraries or packages, which need to be installed beforehand. Instructions for installing the required libraries are typically provided in the introductory chapters or in the program's documentation.

Throughout the book, readers will find programming exercises at appropriate intervals. These exercises are designed to reinforce the concepts covered in the preceding chapters and provide readers with hands-on experience. We strongly encourage readers to attempt these exercises as they offer valuable opportunities to apply the knowledge gained and develop practical skills. Solutions to the exercises are often provided in the book or can be found in supplementary materials or online resources.