Python: Ecommerce: Part — 3: Remove Unwanted Category (and Products), Also, remove products based on Words in the Title — after Merging All Supplier Data Files into One File


You could also remove products that are not allowed in a particular country or marketplace, as well as products that you are not authorized to sell (for example, certain brands).

All of the code in one block. Please check the other parts of this series/publication for the earlier steps.

The code could be simplified. You could merge multiple blocks into one by keeping the words or category names in a list and then filtering against that list. You could also combine conditions with the and (&) or or (|) operators to reduce the number of lines of code.
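The list-based simplification described above could look roughly like the sketch below. It is only an illustration, not the code used in the notebook: the sample rows, the `build_pattern` helper, and the list names are made up, and the terms are escaped with `re.escape` on the assumption that they should be matched as literal text rather than as regular expressions.

```python
import re

import pandas as pd

# Hypothetical sample rows; the real frame is unique_sorted_data from the earlier parts.
unique_sorted_data = pd.DataFrame({
    "Category Name": ["Men's Apparel", "Toys & Games", "Android TV Box / Stick", None, "Cool Gadgets"],
    "Full Product Name": ["Blue T-shirt", "Laser Pointer Toy", "4K HDMI TV Box", "USB Cable", "Desk Lamp"],
})

unwanted_categories = ["Apparel", "TV Box", "Laser", "Costume"]
unwanted_title_words = ["Tracker", "Laser", "HDMI"]


def build_pattern(terms):
    """Join literal terms into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(term) for term in terms)


category_mask = unique_sorted_data["Category Name"].str.contains(
    build_pattern(unwanted_categories), case=False, na=False
)
title_mask = unique_sorted_data["Full Product Name"].str.contains(
    build_pattern(unwanted_title_words), case=False, na=False
)

# Keep only the rows that match neither list.
filtered = unique_sorted_data[~(category_mask | title_mask)]
print(filtered["Full Product Name"].tolist())
```

Adding or removing a term then means editing one list instead of copying a whole filter block.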

# # Section Remove products that have slang words
# In[24]:
unique_sorted_data['Category Name'].unique()
# In[30]:
# Remove products from a category that you do not want to sell
# Apparel
unique_sorted_data_filter_category = unique_sorted_data[
    ~unique_sorted_data['Category Name'].str.contains("Apparel", case=False, na=False)]
# Android TV Box
unique_sorted_data_filter_category = unique_sorted_data_filter_category[
    ~unique_sorted_data_filter_category['Category Name'].str.contains("TV Box", case=False, na=False)]
# Laser Products
unique_sorted_data_filter_category = unique_sorted_data_filter_category[
    ~unique_sorted_data_filter_category['Category Name'].str.contains("Laser", case=False, na=False)]
# Costume
unique_sorted_data_filter_category = unique_sorted_data_filter_category[
    ~unique_sorted_data_filter_category['Category Name'].str.contains("Costume", case=False, na=False)]
# Show the remaining categories and the row/column count after removal
unique_sorted_data_filter_category['Category Name'].unique(), unique_sorted_data_filter_category.shape
# In[31]:
#################
# Section: select products that have slang/bad words in the name -- you do not need this block; it was for testing only. Another block will do this job
unique_sorted_data_filter_1 = unique_sorted_data[
    unique_sorted_data['Full Product Name'].str.contains("Slang 1", case=False, na=False)
    | unique_sorted_data['Full Product Name'].str.contains("Slang 2", case=False, na=False)
    | unique_sorted_data['Full Product Name'].str.contains("Slang 3", case=False, na=False)
    | unique_sorted_data['Full Product Name'].str.contains("Slang 4", case=False, na=False)
]
unique_sorted_data_filter_1.shape
########################
# In[38]:
# Remove products that have slang words in the product name
unique_sorted_filtered_data = unique_sorted_data_filter_category[
    ~unique_sorted_data_filter_category['Full Product Name'].str.contains("Bad word 1", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 2", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 3", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 4", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 5", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 1", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Tracker", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Laser", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
# Brands that you do not want to sell
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("VKworld", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Samsung", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
# Video streaming, TV Box, HDMI -- these products are sensitive (intellectual property rights)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Car Video", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("Streaming", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~unique_sorted_filtered_data['Full Product Name'].str.contains("HDMI", case=False, na=False)]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data.shape
# In[39]:
# Send the filtered data to a file
unique_sorted_filtered_data.to_csv("../all_supplier_data_unique_sorted_and_filtered.csv")

From Jupyter Notebook: Cell by Cell with Output

Section: Remove products that have slang/bad words in the name, and brands or products that you do not want to sell (for example, restricted products or products that you are not allowed to sell or resell).

In [24]:

unique_sorted_data['Category Name'].unique()

Out[24]:

array(['Toys & Games', 'Drone & Quadcopter', 'Cool Gadgets',
'Office supplies', 'Novelty Costumes & Accessories',
"Women's Jewelry", 'External Parts', 'Vehicle Electronics Devices',
'Replacement Parts', 'Internal Parts', 'Lamps and Accessories',
'Video Games', 'Hair Care', 'Skin Care',
'Makeup Tool & Accessories', 'Household Products',
"Women's Accessories", "Women's Apparel", 'Home accessories',
'Oral Respiratory Protection', "Men's Accessories",
"Men's Apparel", "Girl's Apparel", 'Cell Phone Accessories',
'Health Care', 'Electronic Accessories', 'Health tools',
'Computer Peripherals', 'Audio & Video Gadgets', nan,
'Headrest Monitors & DVD Players', 'Car DVR',
'Camera Equipment / Accessories', 'Personal Care',
'Laser Gadgets & Measuring Tools', 'Accessories',
'Electronic Cigarettes', 'Sports Action Camera',
'Android TV Box / Stick', 'Sports & Body Building',
'Smart Watches', 'Security & Surveillance', 'Android Tablets',
'Musical Instruments & Accessorie', 'LED', 'Outdoor Recrections',
'Tools & Home Decor', 'Home, Kitchen & Garden', 'Home Electrical',
'Bedding & Bath', 'Camping & Hiking', 'Drives & Storage',
'Pet Supplies', 'Hunting & Fishing', 'Garden & Lawn',
'Medical treatments', 'Android Smartphones', 'Car Video',
'Cell Phones', 'Cycling', 'Solar Products', 'Doogee Phones',
'Rugged Phones', 'Ulefone Phones', 'Xiaomi Phones', 'Huawei Phone',
'Lenovo Phones', 'Refurbished iPhones', 'Samsung Phones',
'Water Sport', 'Tools & Equipment', 'Repair Accessories',
'Body protection', 'Disinfection and sterilization', "Men's Care",
'Cleaning Supplies', 'Baby Girls Apparel', "Women's Bags",
"Women's Shoes", "Men's Jewelry", 'Baby Boys Apparel',
"Boy's Apparel", "Girl's Shoes", "Girl's Jewelry", "Boy's Shoes",
'kN95/KF94 Mask', 'Flash Drives + Memory Cards',
'6-7 Inch Android Phones', 'Apple Phones', 'Xiaomi Phone',
'Laptops & Tablets', 'Apple iPad', 'Musical Instruments',
'Computer Accessories', 'Ball Games', "Boy's Jewelry"], dtype=object)
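The category list above contains `nan`, so some rows have no category at all. The small sketch below, on made-up sample rows, shows two things that may be worth checking before filtering: which category names a keyword would actually remove, and that `na=False` treats a missing category as "no match", so those rows are kept rather than dropped. The `sample` frame and the `keyword` variable are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real column is unique_sorted_data['Category Name'].
sample = pd.DataFrame({
    "Category Name": [
        "Men's Apparel",
        "Girl's Apparel",
        None,
        "Laser Gadgets & Measuring Tools",
        "Toys & Games",
    ]
})

keyword = "Apparel"
hits = sample["Category Name"].str.contains(keyword, case=False, na=False)

# Preview the category names that this keyword would remove.
to_remove = sorted(sample.loc[hits, "Category Name"].unique())
print(to_remove)

# Rows with a missing category are not matched, so they survive the filter.
kept = sample[~hits]
missing_kept = int(kept["Category Name"].isna().sum())
print(missing_kept)
```

If rows without a category should be removed as well, that would need a separate step, for example a `dropna` on the category column.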

In [30]:

# Remove products from a category that you do not want to sell
unique_sorted_data_filter_category = unique_sorted_data [ ~( unique_sorted_data['Category Name'].str.contains("Apparel", case = False, na=False ) )];

unique_sorted_data_filter_category = unique_sorted_data_filter_category [ ~( unique_sorted_data_filter_category['Category Name'].str.contains("TV Box", case = False, na=False ) )];

unique_sorted_data_filter_category = unique_sorted_data_filter_category [~( unique_sorted_data_filter_category['Category Name'].str.contains("Laser", case = False, na=False ) )];

unique_sorted_data_filter_category = unique_sorted_data_filter_category [ ~( unique_sorted_data_filter_category['Category Name'].str.contains("Costume", case = False, na=False ) )
];

unique_sorted_data_filter_category['Category Name'].unique(), unique_sorted_data_filter_category.shape

Out[30]:

(array(['Toys & Games', 'Drone & Quadcopter', 'Cool Gadgets',
'Office supplies', "Women's Jewelry", 'External Parts',
'Vehicle Electronics Devices', 'Replacement Parts',
'Internal Parts', 'Lamps and Accessories', 'Video Games',
'Hair Care', 'Skin Care', 'Makeup Tool & Accessories',
'Household Products', "Women's Accessories", 'Home accessories',
'Oral Respiratory Protection', "Men's Accessories",
'Cell Phone Accessories', 'Health Care', 'Electronic Accessories',
'Health tools', 'Computer Peripherals', 'Audio & Video Gadgets',
nan, 'Headrest Monitors & DVD Players', 'Car DVR',
'Camera Equipment / Accessories', 'Personal Care', 'Accessories',
'Electronic Cigarettes', 'Sports Action Camera',
'Sports & Body Building', 'Smart Watches',
'Security & Surveillance', 'Android Tablets',
'Musical Instruments & Accessorie', 'LED', 'Outdoor Recrections',
'Tools & Home Decor', 'Home, Kitchen & Garden', 'Home Electrical',
'Bedding & Bath', 'Camping & Hiking', 'Drives & Storage',
'Pet Supplies', 'Hunting & Fishing', 'Garden & Lawn',
'Medical treatments', 'Android Smartphones', 'Car Video',
'Cell Phones', 'Cycling', 'Solar Products', 'Doogee Phones',
'Rugged Phones', 'Ulefone Phones', 'Xiaomi Phones', 'Huawei Phone',
'Lenovo Phones', 'Refurbished iPhones', 'Samsung Phones',
'Water Sport', 'Tools & Equipment', 'Repair Accessories',
'Body protection', 'Disinfection and sterilization', "Men's Care",
'Cleaning Supplies', "Women's Bags", "Women's Shoes",
"Men's Jewelry", "Girl's Shoes", "Girl's Jewelry", "Boy's Shoes",
'kN95/KF94 Mask', 'Flash Drives + Memory Cards',
'6-7 Inch Android Phones', 'Apple Phones', 'Xiaomi Phone',
'Laptops & Tablets', 'Apple iPad', 'Musical Instruments',
'Computer Accessories', 'Ball Games', "Boy's Jewelry"], dtype=object), (47826, 40))

In [31]:

# sect

In [38]:

# Remove products that have slang words in the product name
unique_sorted_filtered_data = unique_sorted_data_filter_category [ ~( unique_sorted_data_filter_category['Full Product Name'].str.contains("Bad word 1", case = False, na=False ) )];

print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [~( unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 2", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [~( unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 3", case = False, na=False ) )];
# product
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [ ~( unique_sorted_filtered_data['Full Product Name'].str.contains("Tracker", case = False, na=False ) ) ];
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [ ~( unique_sorted_filtered_data['Full Product Name'].str.contains("Laser", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
# remove brands
unique_sorted_filtered_data = unique_sorted_filtered_data [ ~( unique_sorted_filtered_data['Full Product Name'].str.contains("VKworld", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [ ~( unique_sorted_filtered_data['Full Product Name'].str.contains("Samsung", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
# video, streaming, HDMI
unique_sorted_filtered_data = unique_sorted_filtered_data [ ~( unique_sorted_filtered_data['Full Product Name'].str.contains("Car Video", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [ ~( unique_sorted_filtered_data['Full Product Name'].str.contains("Streaming", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data = unique_sorted_filtered_data [~( unique_sorted_filtered_data['Full Product Name'].str.contains("HDMI", case = False, na=False ) )];
print(unique_sorted_filtered_data.shape);
unique_sorted_filtered_data.shape
(47001, 40)
(46988, 40)
(46968, 40)
(46968, 40)
(46959, 40)
(46959, 40)
(46441, 40)
(46388, 40)
(46381, 40)
(45357, 40)
(45349, 40)
(45338, 40)
(44911, 40)

Out[38]:

(44911, 40)
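The repeated `print(...shape)` calls above show how many rows remain after each step. A possible way to get the same kind of audit trail from a loop is sketched below. The helper name `remove_terms_with_report` and the sample titles are hypothetical, and the terms are escaped with `re.escape` on the assumption that they are plain text rather than regular expressions.

```python
import re

import pandas as pd


def remove_terms_with_report(frame, column, terms):
    """Drop rows whose `column` contains any term; also return rows removed per term."""
    report = []
    for term in terms:
        before = len(frame)
        mask = frame[column].str.contains(re.escape(term), case=False, na=False)
        frame = frame[~mask]
        report.append((term, before - len(frame)))
    return frame, report


# Hypothetical sample titles.
products = pd.DataFrame({
    "Full Product Name": [
        "GPS Tracker",
        "Laser Level",
        "Samsung Case",
        "USB Cable",
        "Laser Tracker",
    ]
})

remaining, report = remove_terms_with_report(
    products, "Full Product Name", ["Tracker", "Laser", "Samsung"]
)
for term, removed in report:
    print(f"{term}: removed {removed} row(s)")
print(remaining.shape)
```

The per-term counts depend on the order of the list, because a row that matches two terms is only counted for the first one that removes it.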

In [39]:

# send the filtered data to a file
unique_sorted_filtered_data.to_csv("../all_supplier_data_unique_sorted_and_filtered.csv");
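By default, `to_csv` also writes the DataFrame index as an extra first column, which appears as `Unnamed: 0` when the file is read back. Whether that column is wanted here is not stated, so the following is only a sketch of the alternative: it writes a small made-up frame with `index=False` to a temporary file (a hypothetical path, not the one used above) and reloads it to confirm the shape.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical filtered frame with a non-contiguous index, as left behind by row filtering.
filtered = pd.DataFrame(
    {
        "Full Product Name": ["USB Cable", "Desk Lamp"],
        "Category Name": ["Cool Gadgets", None],
    },
    index=[3, 7],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "filtered_sample.csv"
    # index=False leaves the old row labels out of the file.
    filtered.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(reloaded.shape)
```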

***. ***. ***
Note: Older short-notes from this site are posted on Medium: https://medium.com/@SayedAhmedCanada

*** . *** *** . *** . *** . ***
Sayed Ahmed

BSc. Eng. in Comp. Sc. & Eng. (BUET)
MSc. in Comp. Sc. (U of Manitoba, Canada)
MSc. in Data Science and Analytics (Ryerson University, Canada)
Linkedin: https://ca.linkedin.com/in/sayedjustetc

Blog: http://Bangla.SaLearningSchool.com, http://SitesTree.com
Training Courses: http://Training.SitesTree.com
8112223 Canada Inc/Justetc: http://JustEtc.net
