Easy Data Collection for Continuous Machine Learning Development with Listly – DataDrivenInvestor

An easy no-code data scraping tool for your work

Photo by Hal Gatewood on Unsplash

Data is the central part of any machine learning project, which is why so much effort goes into collecting data that a project can use. Without data, a project cannot even start.

Collecting the data is also one of the most complex parts of a data project. Defining the dataset, deciding where to find the data, and choosing the collection methodology all need to be thought through.

However, data collection has become more manageable as technology has developed. Many companies have invested time and money in tools that simplify data collection; one example is Listly.

In this article, I explain how we can keep collecting data with Listly for machine learning development. Let’s get into it.

Listly is an easy-to-use web scraping browser extension that you can set up to collect data automatically. The service is click-and-scrape, so we don’t need to know much about programming to use it.

We only need the web page we want to extract data from (we can control which part of the page is extracted) and the Listly extension installed. The process is automated, and we quickly get the result as an Excel file.

Let’s try to use Listly for our data collection and develop a machine learning model based on the collected data. This article will follow the project outline depicted in the image below.

Image by author

In this article, I want to analyze the data collected with Listly and build a simple regression model that predicts how many units of a product are sold.

Installing Listly

First, we must install the Listly browser extension to start our web scraping process. You can install the extension via this link, and it will then appear in your browser. When you click the extension, you will see a page like the image below.

Image by Author

Next, we need to create a Listly account. Luckily, Listly offers a free account that we can use to scrape various data from web pages. However, you can always try the Business plan if you want.

Image by Author

With all the essential preparation ready, we could try to collect the data with Listly.

Data Collection

Data collection depends on the data project we want to do. In this article, let’s say I want to analyze VR product data from the Aliexpress e-commerce site; we can set that up with Listly.

Let’s take a look at how to start our data scraping. As a starting point, I need the URL of the web page we want to scrape, which is the Aliexpress link here.

GIF by Author

By selecting Listly Part in the extension, we can control which part of the page we want to collect. Listly handles the rest and downloads the data from the web page automatically.

So far, we have downloaded data from a single web page with Listly. However, most e-commerce sites spread their results over several pages. So how do we download many pages quickly with Listly? There are a few steps we need to go through.

First, after running Listly Part on a single web page, we need to select the group button to scrape data from more pages.

Image by Author

On the next page, we need to provide the URLs of all the pages we want to scrape, one per line.

Image by Author

With Python, we can loop over the link we used previously to quickly generate all the page URLs we want. For example, I want to scrape 20 pages of the Aliexpress VR product search.

with open('readme.txt', 'w') as fi:
    # Write one search URL per line, for pages 1 to 20
    for i in range(1, 21):
        fi.write(f"https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText=VR&ltype=wholesale&SortType=default&page={i}")
        fi.write('\n')

Often, on a website with several pages, the URL has a query parameter that specifies the page number. We take advantage of this and use a loop to write all the URLs we need into a text file.
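As a hedged alternative to formatting the URL string by hand, the same list can be built from the query parameters. This is only a sketch: the helper name `build_page_urls` is hypothetical, and the parameter names are copied from the search URL above, so they may change if the site changes its URL scheme.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.aliexpress.com/wholesale"

def build_page_urls(search_text, first_page=1, last_page=20):
    """Return one search URL per page number (inclusive range)."""
    urls = []
    for page in range(first_page, last_page + 1):
        # Parameter names are taken from the URL used in the article
        query = urlencode({
            "trafficChannel": "main",
            "d": "y",
            "CatId": 0,
            "SearchText": search_text,
            "ltype": "wholesale",
            "SortType": "default",
            "page": page,
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls

urls = build_page_urls("VR")
# One URL per line, ready to paste into the group form
url_list_text = "\n".join(urls)
```

Keeping the parameters in a dictionary makes it easier to change the search text or the page range without editing a long string.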

Image by Author

Additionally, because Listly is based in South Korea, some websites, such as Aliexpress, switch their region based on that location. We can change the proxy setting to make sure we obtain data for the correct location. However, this feature is only available on the Business plan.

Image by Author

After submitting the group scraping job, we get the data in Excel format. Click the Group Excel button and download the data.

Image by Author

Let’s check the Excel result before we proceed with additional analysis.

Image by Author

Some of the data can be messy because of the web page structure, which means we need to clean it a little. Let’s start exploring the data we have scraped.

Data Cleaning

The previously scraped data is stored in this GitHub repository for reproducibility. Let’s start by loading the data and doing some data cleaning.

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
df = pd.read_excel('aliexpress_vr_20.xlsx')

As we found out previously, the data requires some cleaning before we can gain meaningful insight from it. I have already gone through the data understanding step, so here I only provide the code for the various cleaning tasks.

Getting the product price

df['LABEL-6'] = df['LABEL-6'].apply(lambda x: x * 0.01) 
df['Price'] = df['LABEL-4'] + df['LABEL-6']

Getting the product rating

def check_rating(x):
    # The rating lands in different columns depending on the page layout;
    # return the first one that is not missing
    if pd.notna(x['LABEL-9']):
        return x['LABEL-9']
    elif pd.notna(x['LABEL-20']):
        return x['LABEL-20']
    elif pd.notna(x['LABEL-24']):
        return x['LABEL-24']
    elif pd.notna(x['LABEL-29']):
        return x['LABEL-29']
    else:
        return np.nan
df['Rating'] = df.apply(check_rating, axis=1)
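The same "first non-missing value" logic can also be written without a row-wise function. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the scraped file) and assumes the rating columns are the four used in `check_rating`.

```python
import numpy as np
import pandas as pd

RATING_COLUMNS = ["LABEL-9", "LABEL-20", "LABEL-24", "LABEL-29"]

# Toy rows standing in for the scraped data (values are made up)
toy = pd.DataFrame({
    "LABEL-9": [4.8, np.nan, np.nan, np.nan],
    "LABEL-20": [np.nan, 4.5, np.nan, np.nan],
    "LABEL-24": [np.nan, np.nan, np.nan, np.nan],
    "LABEL-29": [np.nan, 3.9, 5.0, np.nan],
})

# Back-fill along each row so the first column holds the first
# non-missing value, then keep only that column
toy["Rating"] = toy[RATING_COLUMNS].bfill(axis=1).iloc[:, 0]
```

Rows where every rating column is empty stay missing, which matches the behaviour of the `else` branch above.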

Getting the product sold numbers

def check_sold(x):
    # The "N sold" text lands in different columns; take the number
    # from the first column that contains it
    if pd.notna(x['LABEL-7']) and 'sold' in x['LABEL-7']:
        return int(x['LABEL-7'].split('sold')[0])
    elif pd.notna(x['LABEL-16']) and 'sold' in x['LABEL-16']:
        return int(x['LABEL-16'].split('sold')[0])
    elif pd.notna(x['LABEL-17']) and 'sold' in x['LABEL-17']:
        return int(x['LABEL-17'].split('sold')[0])
    elif pd.notna(x['LABEL-18']) and 'sold' in x['LABEL-18']:
        return int(x['LABEL-18'].split('sold')[0])
    elif pd.notna(x['LABEL-36']) and 'sold' in x['LABEL-36']:
        return int(x['LABEL-36'].split('sold')[0])
    else:
        return 0
df['Sold_number'] = df.apply(check_sold, axis=1)
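The `int(...split('sold')[0])` step works when the text before "sold" is a plain number. If a cell ever contained a thousands separator, `int()` would raise an error. A slightly more defensive parser could look like the sketch below; the helper `parse_sold` and the comma-separated format are assumptions for illustration, not something observed in the scraped file.

```python
import re

# Digits, optionally with thousands separators, followed by "sold"
SOLD_PATTERN = re.compile(r"(\d[\d,]*)\s*sold")

def parse_sold(text):
    """Return the sold count found in `text`, or 0 if there is none."""
    if not isinstance(text, str):
        # Missing cells arrive as NaN (a float), not as a string
        return 0
    match = SOLD_PATTERN.search(text)
    if match is None:
        return 0
    return int(match.group(1).replace(",", ""))
```

It could be applied to each candidate column in turn, in the same order as in `check_sold`.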

Getting the free shipping offer data

def check_ship(x):
    # Look at the first non-missing column; if it lacks the text, the function
    # returns None, which the fillna(0) below turns into 0
    if pd.notna(x['LABEL-10']):
        if 'Free Shipping' in x['LABEL-10']:
            return 1
    elif pd.notna(x['LABEL-21']):
        if 'Free Shipping' in x['LABEL-21']:
            return 1
    elif pd.notna(x['LABEL-25']):
        if 'Free Shipping' in x['LABEL-25']:
            return 1
    elif pd.notna(x['LABEL-30']):
        if 'Free Shipping' in x['LABEL-30']:
            return 1
    elif pd.notna(x['LABEL-32']):
        if 'Free Shipping' in x['LABEL-32']:
            return 1
    else:
        return 0
df['free_shipping_offer'] = df.apply(check_ship, axis=1)
df['free_shipping_offer'] = df['free_shipping_offer'].fillna(0)
df['free_shipping_offer'] = df['free_shipping_offer'].astype(int)

Getting the free return offer data

def check_return(x):
    # Same pattern as check_ship, using the columns where "Free Return" appears
    if pd.notna(x['LABEL-11']):
        if 'Free Return' in x['LABEL-11']:
            return 1
    elif pd.notna(x['LABEL-26']):
        if 'Free Return' in x['LABEL-26']:
            return 1
    elif pd.notna(x['LABEL-31']):
        if 'Free Return' in x['LABEL-31']:
            return 1
    elif pd.notna(x['LABEL-40']):
        if 'Free Return' in x['LABEL-40']:
            return 1
    elif pd.notna(x['LABEL-43']):
        if 'Free Return' in x['LABEL-43']:
            return 1
    elif pd.notna(x['LABEL-44']):
        if 'Free Return' in x['LABEL-44']:
            return 1
    else:
        return 0
df['free_return_offer'] = df.apply(check_return, axis=1)
df['free_return_offer'] = df['free_return_offer'].fillna(0)
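As written, the two offer helpers appear to inspect only the first non-empty column in their list. If the intent is to flag a product whenever any of the candidate columns contains the phrase, a variant like the one below could be used. This is a sketch with a hypothetical helper `flag_any_column` and made-up toy rows, and it deliberately behaves differently from the functions above.

```python
import numpy as np
import pandas as pd

def flag_any_column(frame, columns, phrase):
    """Return 1 where any of `columns` contains `phrase`, else 0."""
    hits = pd.Series(False, index=frame.index)
    for column in columns:
        # astype(str) turns missing cells into the text "nan", which never matches
        contains = frame[column].astype(str).str.contains(phrase, regex=False)
        hits = hits | contains
    return hits.astype(int)

# Toy rows (made up): the second row has the phrase only in the second column
toy = pd.DataFrame({
    "LABEL-10": ["Free Shipping", "4.7", np.nan],
    "LABEL-21": [np.nan, "Free Shipping", np.nan],
})
toy["free_shipping_offer"] = flag_any_column(
    toy, ["LABEL-10", "LABEL-21"], "Free Shipping"
)
```

The same helper would cover the free return flag by passing `"Free Return"` and the corresponding column list.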

If you want, you can add more features and clean the data further, but for now we will go on with the current features. Let’s do one last cleaning step and prepare our data frame.

df = df.rename(columns={'LABEL-2': 'Product_name', 'LABEL-12': 'Store_name'})
data = df[['Product_name', 'Store_name', 'Rating', 'Price', 'Sold_number', 'free_shipping_offer', 'free_return_offer']].copy()
data.head()
Image by Author

Data Analysis

Let’s start exploring the dataset we acquired to gain better insight. First, let’s look at the data’s basic information.

data.info()
Image by Author

As we can see, we have around 1200 rows of data, but only around half of them contain a rating. Most likely, this is because many products have sold too few units for buyers to have left a rating.

Let’s take a look at the basic statistics.
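One way to check that explanation is to measure the share of missing ratings and compare sales between rated and unrated products. The sketch below uses a tiny made-up frame with the same column names as `data`; the numbers are illustrative only and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` (values are made up)
toy = pd.DataFrame({
    "Rating": [4.8, np.nan, 4.5, np.nan],
    "Sold_number": [120, 0, 35, 0],
})

# Fraction of rows without a rating
missing_share = toy["Rating"].isna().mean()

# Median units sold for rated versus unrated products
toy["has_rating"] = toy["Rating"].notna().map({True: "rated", False: "unrated"})
median_sold = toy.groupby("has_rating")["Sold_number"].median()
```

If unrated products show a much lower median on the real data, that would support the idea that low sales explain the missing ratings.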

data.describe()
Image by Author

Let’s use visualization to understand the data even better.

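import matplotlib.pyplot as plt
import seaborn as sns
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(x=data['Rating'], kde=True)
plt.title('Rating Distribution')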
Image by Author

I used the same distribution code for the Price and Sold Number columns.

Image by Author

Then I visualize count plots of the free shipping and free return offers.

sns.countplot(x=data['free_return_offer'])
plt.title('Free Return Offer Count Plot')
Image by Author

From the statistics and plots, we learn the following:

  1. Most ratings given were good (above 4.5)
  2. Prices were below USD 25 for 75% of the data
  3. Some of the products published don’t have any buyers (more than 25%)
  4. Sellers prefer to offer free shipping compared to free returns.
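Points 2 and 3 in the list can be read from `data.describe()`, but they can also be computed directly. A minimal sketch on made-up numbers (not the scraped data), using the same column names:

```python
import pandas as pd

# Toy stand-in for `data` (values are made up)
toy = pd.DataFrame({
    "Price": [5.0, 10.0, 15.0, 20.0, 60.0],
    "Sold_number": [0, 0, 3, 40, 200],
})

# Price below which 75% of the products fall
price_q75 = toy["Price"].quantile(0.75)

# Share of products without a single sale
share_unsold = (toy["Sold_number"] == 0).mean()
```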

Building a simple regression model

After exploring the data, I want to create a simple regression model to predict how many VR products will be sold.

Of course, in actual model building, we would need to explore the data more and check our assumptions more rigorously. For now, we only build a simple model from the data we scraped with Listly.

First, let’s check the correlation between numerical features.

# numeric_only=True skips the text columns (Product_name, Store_name)
sns.heatmap(data.corr(numeric_only=True), annot=True)
Image by Author

It seems the Return Offer variable has the strongest correlation with the Sold Number. Now, let’s split the data into train and test sets. Note that I do not use the Rating column as an independent variable.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Note: train_size=0.3 keeps 30% of the rows for training and 70% for testing
X_train, X_test, y_train, y_test = train_test_split(data[['Price', 'free_shipping_offer', 'free_return_offer']], data['Sold_number'], shuffle=True, train_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

With a few lines, we have managed to build the regression model. Let’s see how the model performs.

from sklearn.metrics import mean_squared_error, r2_score
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
# Take the square root of the MSE; the squared=False argument is deprecated in newer scikit-learn releases
rmse = mean_squared_error(y_test, predictions) ** 0.5
print('The r2 is: ', r2)
print('The rmse is: ', rmse)
Image by Author

As we can see, the evaluation shows that the model is not that good. We could improve it with feature engineering and additional data cleaning, but let’s leave it as it is, because our goal of building a model on data scraped with Listly has been fulfilled.
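To make the two metrics easier to interpret, here is a small sketch that computes them by hand with NumPy on made-up numbers. The function name `r2_and_rmse` is hypothetical; the formulas are the standard definitions of the coefficient of determination and the root mean squared error.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Compute R^2 and RMSE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Squared error left over after the model's predictions
    residual_ss = np.sum((y_true - y_pred) ** 2)
    # Squared error of always predicting the mean of the target
    total_ss = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(residual_ss / len(y_true))
    return r2, rmse

# Made-up targets and predictions
r2_demo, rmse_demo = r2_and_rmse([0, 10, 20, 30], [5, 5, 25, 25])

# Predicting the mean everywhere gives an R^2 of exactly zero
r2_mean, rmse_mean = r2_and_rmse([0, 10, 20, 30], [15, 15, 15, 15])
```

An R² close to zero therefore means the model does little better than predicting the average number sold, while the RMSE is expressed in the same unit as the target (units sold).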

Data Collection Scheduler

When building a model or monitoring data, we know that over time there is always new data to collect. Luckily, Listly offers a scheduler to scrape the data regularly.

To find the scheduler, go to your Listly data board, then find the Schedule button in your scraping job.

Image by Author

After you press Schedule, you can set how often you want to scrape the page (daily, weekly, or monthly) and the time at which the scrape runs.

GIF by Author
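Once the scheduler produces a new export on each run, the files need to be combined before retraining a model. The sketch below shows one possible way to do that with pandas. It assumes each export has been loaded into a data frame and given a `scraped_at` column by us; that column, the helper name `merge_snapshots`, and the toy rows are all hypothetical and not part of Listly’s output.

```python
import pandas as pd

def merge_snapshots(snapshots, key_columns):
    """Stack several exports and keep the most recent row per product."""
    combined = pd.concat(snapshots, ignore_index=True)
    # A stable sort keeps the original order among rows with equal timestamps
    combined = combined.sort_values("scraped_at", kind="stable")
    latest = combined.drop_duplicates(subset=key_columns, keep="last")
    return latest.reset_index(drop=True)

# Two made-up daily exports
monday = pd.DataFrame({
    "Product_name": ["VR Headset A", "VR Glasses B"],
    "Store_name": ["Store 1", "Store 2"],
    "Sold_number": [10, 3],
    "scraped_at": pd.to_datetime(["2022-01-03", "2022-01-03"]),
})
tuesday = pd.DataFrame({
    "Product_name": ["VR Headset A", "VR Controller C"],
    "Store_name": ["Store 1", "Store 3"],
    "Sold_number": [14, 1],
    "scraped_at": pd.to_datetime(["2022-01-04", "2022-01-04"]),
})

latest = merge_snapshots([monday, tuesday], ["Product_name", "Store_name"])
```

Keeping only the latest row per product suits retraining on current data; for trend analysis, you would skip the de-duplication and keep every snapshot.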

Additional Ideas to use with Listly

Given how easy it is to scrape data with Listly, there are various things you could do as a data professional, such as:

  1. Building a Monitoring Dashboard
  2. Social Media Portfolio Optimization
  3. Keyword Research
  4. Search Engine Development

And there are many more ideas you could think of; the sky is the limit. Since Listly makes data collection easy, you should try out whatever ideas come to mind.
