Customer Segmentation Using K-Means Clustering with HDFS, MySQL, and PySpark Integration

Overview

This project implements customer segmentation using K-Means clustering, with the results stored in both HDFS and MySQL databases. The solution leverages PySpark for efficient processing and is optimized for a big data environment.

Project Structure

data/: Contains the dataset customer_data.csv.
src/: Contains the implementation code customer_segmentation.py.
README.md: Project documentation.

Installation

Clone the repository:
```
git clone <repository-link>
```

Install the required packages:

pip install pandas scikit-learn matplotlib mysql-connector-python hdfs pyspark

Usage

Run the customer_segmentation.py script to perform clustering and store results:

python src/customer_segmentation.py


# Key Features

- Your specified HDFS path is set as `hdfs://localhost:50000/customer segmentation reult.csv`.
- The code integrates with Hadoop and PySpark, optimized for Ubuntu setup.
- The results are stored in both HDFS and MySQL.

This setup provides a comprehensive solution while utilizing your big data environment.


# License
This project is licensed under the MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Customer Segmentation Using K-Means Clustering with HDFS, MySQL, and PySpark Integration

Overview

Project Structure

Installation

Usage

Files

README.md

Latest commit

History

README.md

File metadata and controls

Customer Segmentation Using K-Means Clustering with HDFS, MySQL, and PySpark Integration

Overview

Project Structure

Installation

Usage