Python: Ecommerce: Part — 3: Remove Unwanted Category (and Products), Also, remove products based on Words in the Title — after Merging All Supplier Data Files into One File
You could also remove products that are not allowed in a particular country or marketplace, as well as products that you are not authorized to sell (for example, certain brands).
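As a rough sketch of that idea, the restricted terms could be kept per marketplace and applied to the product name column. The marketplace names, the `RESTRICTED_TERMS` dictionary, the `allowed_for` helper, and the sample rows are hypothetical; only the 'Full Product Name' column and the brand names come from the code in this post.

```python
import re
import pandas as pd

# Hypothetical sample rows; the real data comes from the merged supplier file.
products = pd.DataFrame({
    "Full Product Name": [
        "Samsung Galaxy Phone Case",
        "VKworld Rugged Phone",
        "Laser Pointer Pen",
        "USB Desk Lamp",
        None,
    ]
})

# Hypothetical mapping: marketplace -> terms that must not appear in a title.
RESTRICTED_TERMS = {
    "marketplace_a": ["Samsung", "VKworld"],
    "marketplace_b": ["Laser"],
}

def allowed_for(df, marketplace):
    """Return the rows of df whose title contains none of the restricted terms."""
    terms = RESTRICTED_TERMS.get(marketplace, [])
    if not terms:
        return df
    # re.escape keeps any special characters in a term from acting as regex syntax.
    pattern = "|".join(re.escape(term) for term in terms)
    mask = df["Full Product Name"].str.contains(pattern, case=False, na=False)
    return df[~mask]

allowed_a = allowed_for(products, "marketplace_a")
allowed_b = allowed_for(products, "marketplace_b")
print(allowed_a["Full Product Name"].tolist())
print(allowed_b["Full Product Name"].tolist())
```

Rows with a missing product name are kept here because `na=False` treats them as "no match".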
All Code in One Block. Please check the other parts of this series/publication
The code could be simplified. You could merge several blocks into one by keeping the words or category names in a list and filtering against that list. You could also combine conditions with the element-wise operators & (and) and | (or) to reduce the number of lines of code.
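A minimal sketch of that simplification, assuming the same column names as the code in this post. The sample rows and the `contains_any` helper are hypothetical; the word lists are taken from the filters used in this post.

```python
import re
import pandas as pd

# Hypothetical sample rows; column names match the ones used in this post.
unique_sorted_data = pd.DataFrame({
    "Category Name": ["Men's Apparel", "Android TV Box / Stick", "Pet Supplies", "Cycling", None],
    "Full Product Name": ["Cotton T-Shirt", "4K TV Box", "Dog Leash with Tracker", "Bike Pump", "HDMI Cable"],
})

# Words taken from the filters in this post.
unwanted_category_words = ["Apparel", "TV Box", "Laser", "Costume"]
unwanted_title_words = ["Tracker", "Laser", "VKworld", "Samsung", "Car Video", "Streaming", "HDMI"]

def contains_any(series, words):
    """Boolean mask: True where the text contains at least one of the words."""
    pattern = "|".join(re.escape(word) for word in words)
    return series.str.contains(pattern, case=False, na=False)

bad_category = contains_any(unique_sorted_data["Category Name"], unwanted_category_words)
bad_title = contains_any(unique_sorted_data["Full Product Name"], unwanted_title_words)

# Drop a row if either condition holds; | is the element-wise "or".
unique_sorted_filtered_data = unique_sorted_data[~(bad_category | bad_title)]
print(unique_sorted_filtered_data)
```

Adding or removing a word then only means editing one of the two lists.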
# # Section Remove products that have slang words
# In[24]:
unique_sorted_data['Category Name'].unique()
# In[30]:
# Remove products from a category that you do not want to sell
# Apparel
unique_sorted_data_filter_category = unique_sorted_data[
    ~(unique_sorted_data['Category Name'].str.contains("Apparel", case=False, na=False))]
# Android TV Box
unique_sorted_data_filter_category = unique_sorted_data_filter_category[
    ~(unique_sorted_data_filter_category['Category Name'].str.contains("TV Box", case=False, na=False))]
# Laser products
unique_sorted_data_filter_category = unique_sorted_data_filter_category[
    ~(unique_sorted_data_filter_category['Category Name'].str.contains("Laser", case=False, na=False))]
# Costume
unique_sorted_data_filter_category = unique_sorted_data_filter_category[
    ~(unique_sorted_data_filter_category['Category Name'].str.contains("Costume", case=False, na=False))]
# Show the remaining categories and the row count after removal
unique_sorted_data_filter_category['Category Name'].unique(), unique_sorted_data_filter_category.shape
# In[31]:
#################
# Section: products that have slang/bad words in the name. You do not need this block; it was for testing only and just selects the matching rows. Another block does the removal.
unique_sorted_data_filter_1 = unique_sorted_data[
    (unique_sorted_data['Full Product Name'].str.contains("Slang 1", case=False, na=False))
    | (unique_sorted_data['Full Product Name'].str.contains("Slang 2", case=False, na=False))
    | (unique_sorted_data['Full Product Name'].str.contains("Slang 3", case=False, na=False))
    | (unique_sorted_data['Full Product Name'].str.contains("Slang 4", case=False, na=False))
]  # [['Full Product Name', 'Category Name']]
unique_sorted_data_filter_1.shape  # , unique_sorted_data_filter_1.head(1)
########################
# In[38]:
# Remove products that have slang words in the product name
unique_sorted_filtered_data = unique_sorted_data_filter_category[
    ~(unique_sorted_data_filter_category['Full Product Name'].str.contains("Bad word 1", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 2", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 3", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 4", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 5", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 1", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Tracker", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Laser", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
# Brands that you do not want to sell
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("VKworld", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Samsung", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
# Video streaming, TV Box, HDMI: these products are sensitive (intellectual property rights)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Car Video", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("Streaming", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[
    ~(unique_sorted_filtered_data['Full Product Name'].str.contains("HDMI", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data.shape
# In[39]:
unique_sorted_filtered_data.to_csv("../all_supplier_data_unique_sorted_and_filtered.csv")
From Jupyter Notebook: Cell by Cell with Output
Section: Remove products whose names contain slang/bad words or similar, and brands/products that you do not want to sell (for example, restricted products or products you are not allowed to sell or resell).
In [24]:
unique_sorted_data['Category Name'].unique()
Out[24]:
array(['Toys & Games', 'Drone & Quadcopter', 'Cool Gadgets',
'Office supplies', 'Novelty Costumes & Accessories',
"Women's Jewelry", 'External Parts', 'Vehicle Electronics Devices',
'Replacement Parts', 'Internal Parts', 'Lamps and Accessories',
'Video Games', 'Hair Care', 'Skin Care',
'Makeup Tool & Accessories', 'Household Products',
"Women's Accessories", "Women's Apparel", 'Home accessories',
'Oral Respiratory Protection', "Men's Accessories",
"Men's Apparel", "Girl's Apparel", 'Cell Phone Accessories',
'Health Care', 'Electronic Accessories', 'Health tools',
'Computer Peripherals', 'Audio & Video Gadgets', nan,
'Headrest Monitors & DVD Players', 'Car DVR',
'Camera Equipment / Accessories', 'Personal Care',
'Laser Gadgets & Measuring Tools', 'Accessories',
'Electronic Cigarettes', 'Sports Action Camera',
'Android TV Box / Stick', 'Sports & Body Building',
'Smart Watches', 'Security & Surveillance', 'Android Tablets',
'Musical Instruments & Accessorie', 'LED', 'Outdoor Recrections',
'Tools & Home Decor', 'Home, Kitchen & Garden', 'Home Electrical',
'Bedding & Bath', 'Camping & Hiking', 'Drives & Storage',
'Pet Supplies', 'Hunting & Fishing', 'Garden & Lawn',
'Medical treatments', 'Android Smartphones', 'Car Video',
'Cell Phones', 'Cycling', 'Solar Products', 'Doogee Phones',
'Rugged Phones', 'Ulefone Phones', 'Xiaomi Phones', 'Huawei Phone',
'Lenovo Phones', 'Refurbished iPhones', 'Samsung Phones',
'Water Sport', 'Tools & Equipment', 'Repair Accessories',
'Body protection', 'Disinfection and sterilization', "Men's Care",
'Cleaning Supplies', 'Baby Girls Apparel', "Women's Bags",
"Women's Shoes", "Men's Jewelry", 'Baby Boys Apparel',
"Boy's Apparel", "Girl's Shoes", "Girl's Jewelry", "Boy's Shoes",
'kN95/KF94 Mask', 'Flash Drives + Memory Cards',
'6-7 Inch Android Phones', 'Apple Phones', 'Xiaomi Phone',
'Laptops & Tablets', 'Apple iPad', 'Musical Instruments',
'Computer Accessories', 'Ball Games', "Boy's Jewelry"], dtype=object)
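Because `str.contains` matches substrings, one keyword can match several category names. Before removing anything, it may help to list what each keyword would hit. The sketch below uses a few category names copied from Out[24]; the `keywords` list and the `matches` dictionary are illustrative names, not part of the original notebook.

```python
import pandas as pd

# A few category names copied from Out[24] above.
categories = pd.Series([
    "Novelty Costumes & Accessories", "Women's Apparel", "Men's Apparel",
    "Baby Girls Apparel", "Laser Gadgets & Measuring Tools",
    "Android TV Box / Stick", "Car Video", "Pet Supplies",
])

keywords = ["Apparel", "TV Box", "Laser", "Costume"]

# For each keyword, collect the category names it would match.
matches = {
    keyword: categories[categories.str.contains(keyword, case=False, na=False)].tolist()
    for keyword in keywords
}
for keyword, matched in matches.items():
    print(keyword, "->", matched)
```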
In [30]:
# Remove products from a category that you do not want to sell
unique_sorted_data_filter_category = unique_sorted_data [ ~( unique_sorted_data['Category Name'].str.contains("Apparel", case = False, na=False ) )];
unique_sorted_data_filter_category = unique_sorted_data_filter_category [ ~( unique_sorted_data_filter_category['Category Name'].str.contains("TV Box", case = False, na=False ) )];
unique_sorted_data_filter_category = unique_sorted_data_filter_category [~( unique_sorted_data_filter_category['Category Name'].str.contains("Laser", case = False, na=False ) )];
unique_sorted_data_filter_category = unique_sorted_data_filter_category [ ~( unique_sorted_data_filter_category['Category Name'].str.contains("Costume", case = False, na=False ) )
];
unique_sorted_data_filter_category['Category Name'].unique(), unique_sorted_data_filter_category.shape
Out[30]:
(array(['Toys & Games', 'Drone & Quadcopter', 'Cool Gadgets',
'Office supplies', "Women's Jewelry", 'External Parts',
'Vehicle Electronics Devices', 'Replacement Parts',
'Internal Parts', 'Lamps and Accessories', 'Video Games',
'Hair Care', 'Skin Care', 'Makeup Tool & Accessories',
'Household Products', "Women's Accessories", 'Home accessories',
'Oral Respiratory Protection', "Men's Accessories",
'Cell Phone Accessories', 'Health Care', 'Electronic Accessories',
'Health tools', 'Computer Peripherals', 'Audio & Video Gadgets',
nan, 'Headrest Monitors & DVD Players', 'Car DVR',
'Camera Equipment / Accessories', 'Personal Care', 'Accessories',
'Electronic Cigarettes', 'Sports Action Camera',
'Sports & Body Building', 'Smart Watches',
'Security & Surveillance', 'Android Tablets',
'Musical Instruments & Accessorie', 'LED', 'Outdoor Recrections',
'Tools & Home Decor', 'Home, Kitchen & Garden', 'Home Electrical',
'Bedding & Bath', 'Camping & Hiking', 'Drives & Storage',
'Pet Supplies', 'Hunting & Fishing', 'Garden & Lawn',
'Medical treatments', 'Android Smartphones', 'Car Video',
'Cell Phones', 'Cycling', 'Solar Products', 'Doogee Phones',
'Rugged Phones', 'Ulefone Phones', 'Xiaomi Phones', 'Huawei Phone',
'Lenovo Phones', 'Refurbished iPhones', 'Samsung Phones',
'Water Sport', 'Tools & Equipment', 'Repair Accessories',
'Body protection', 'Disinfection and sterilization', "Men's Care",
'Cleaning Supplies', "Women's Bags", "Women's Shoes",
"Men's Jewelry", "Girl's Shoes", "Girl's Jewelry", "Boy's Shoes",
'kN95/KF94 Mask', 'Flash Drives + Memory Cards',
'6-7 Inch Android Phones', 'Apple Phones', 'Xiaomi Phone',
'Laptops & Tablets', 'Apple iPad', 'Musical Instruments',
'Computer Accessories', 'Ball Games', "Boy's Jewelry"], dtype=object), (47826, 40))
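Note that nan is still in the category list after filtering. A small sketch of why, using a made-up three-row Series: with `na=False`, a missing category counts as "does not contain the word", so the row survives the `~` filter. Passing `na=True` instead would drop those rows. Which one you want depends on your data.

```python
import numpy as np
import pandas as pd

# Made-up categories, one of them missing.
categories = pd.Series(["Men's Apparel", np.nan, "Pet Supplies"])

# na=False: missing values are treated as "no match" and are kept after ~.
mask_keep_missing = categories.str.contains("Apparel", case=False, na=False)
# na=True: missing values are treated as "match" and are removed after ~.
mask_drop_missing = categories.str.contains("Apparel", case=False, na=True)

kept_with_na_false = categories[~mask_keep_missing]
kept_with_na_true = categories[~mask_drop_missing]
print(len(kept_with_na_false), len(kept_with_na_true))
```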
In [31]:
# sect
In [38]:
# Remove products that have slang words in the product name
unique_sorted_filtered_data = unique_sorted_data_filter_category[~(unique_sorted_data_filter_category['Full Product Name'].str.contains("Bad word 1", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 2", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Bad word 3", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
# product
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Tracker", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Laser", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
# remove brands
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("VKworld", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Samsung", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
# video, streaming, HDMI
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Car Video", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("Streaming", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data = unique_sorted_filtered_data[~(unique_sorted_filtered_data['Full Product Name'].str.contains("HDMI", case=False, na=False))]
print(unique_sorted_filtered_data.shape)
unique_sorted_filtered_data.shape
(47001, 40)
(46988, 40)
(46968, 40)
(46968, 40)
(46959, 40)
(46959, 40)
(46441, 40)
(46388, 40)
(46381, 40)
(45357, 40)
(45349, 40)
(45338, 40)
(44911, 40)
Out[38]:
(44911, 40)
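In the printed shapes, a value that repeats (for example (46968, 40) twice) means that filter removed no rows. The same step-by-step report could be produced with a loop over the words. This is only a sketch: the sample rows and the names `data`, `words_to_remove`, and `removed_per_word` are hypothetical, and the words are the non-placeholder ones from the cell above.

```python
import pandas as pd

# Hypothetical sample rows standing in for unique_sorted_data_filter_category.
data = pd.DataFrame({
    "Full Product Name": [
        "GPS Tracker for Pets", "Laser Level Tool", "Samsung Charger",
        "HDMI Cable 2m", "Bike Pump", "Yoga Mat",
    ]
})

words_to_remove = ["Tracker", "Laser", "VKworld", "Samsung", "Car Video", "Streaming", "HDMI"]

filtered = data
removed_per_word = {}
for word in words_to_remove:
    before = len(filtered)
    matches = filtered["Full Product Name"].str.contains(word, case=False, na=False)
    filtered = filtered[~matches]
    # Record how many rows this word removed; 0 means it matched nothing.
    removed_per_word[word] = before - len(filtered)
    print(word, filtered.shape)
```

Keeping the per-word counts makes it easy to spot words that never match and can be dropped from the list.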
In [39]:
# send the filtered data to a file
unique_sorted_filtered_data.to_csv("../all_supplier_data_unique_sorted_and_filtered.csv")
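One detail about this step: by default `to_csv` also writes the DataFrame index as the first column, which shows up as "Unnamed: 0" when the file is read back. If the next step does not need the old row numbers, `index=False` leaves that column out. The sketch below uses an in-memory buffer and two made-up rows instead of the real file.

```python
import io
import pandas as pd

# Made-up rows standing in for unique_sorted_filtered_data.
filtered = pd.DataFrame({
    "Full Product Name": ["Bike Pump", "Yoga Mat"],
    "Category Name": ["Cycling", "Sports & Body Building"],
})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
filtered.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
filtered.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```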
***. ***. ***
Note: Older short-notes from this site are posted on Medium: https://medium.com/@SayedAhmedCanada
*** . *** *** . *** . *** . ***
Sayed Ahmed
BSc. Eng. in Comp. Sc. & Eng. (BUET)
MSc. in Comp. Sc. (U of Manitoba, Canada)
MSc. in Data Science and Analytics (Ryerson University, Canada)
Linkedin: https://ca.linkedin.com/in/sayedjustetc
Blog: http://Bangla.SaLearningSchool.com, http://SitesTree.com
Training Courses: http://Training.SitesTree.com
8112223 Canada Inc/Justetc: http://JustEtc.net
Facebook Groups/Forums to discuss (Q & A):
https://www.facebook.com/banglasalearningschool
https://www.facebook.com/justetcsocial