Book Subtitle: With Natural Language Processing and Recommender Systems
Authors: Pramod Singh
DOI: https://doi.org/10.1007/978-1-4842-7777-5
Publisher: Apress Berkeley, CA
eBook Packages: Professional and Applied Computing, Professional and Applied Computing (R0), Apress Access Books
Copyright Information: Pramod Singh 2022
Softcover ISBN: 978-1-4842-7776-8 (Published: 09 December 2021)
eBook ISBN: 978-1-4842-7777-5 (Published: 08 December 2021)
Edition Number: 2
Number of Pages: XVIII, 220
Number of Illustrations: 202 b/w illustrations
Topics: Machine Learning, Python, Artificial Intelligence, Open Source
This article was published as a part of the Data Science Blogathon.
In this article, we will learn about machine learning using Spark. Our previous articles discussed Spark databases, installation, and how Spark works in Python. If you haven't read them yet, here is the link.
In this article, we will mainly talk about implementing a machine learning model using PySpark. We will also build a regressor model and combine it with cross-validation and parameter tuning.
As we already know, Spark is an in-memory data processing tool that can handle petabytes of data in a distributed manner. Implementing a machine learning model on such a large amount of data is possible using the Spark MLlib package, which works in a distributed manner.
The conventional way of implementing a machine learning model on big data was with Apache Mahout, which was comparatively slow and not flexible.
A Spark machine learning pipeline works with real-time data as well as streaming data, and it uses in-memory computation to speed up the process.
The best part of Spark is that it offers various built-in packages for machine learning, making it more versatile.
These built-in machine learning packages are known as MLlib in Apache Spark.
Spark offers a dedicated package for handling all machine learning-related tasks; third-party libraries can also be coupled with Spark.
One important feature of Spark MLlib is that its API resembles the scikit-learn API, so we don't need to learn it separately.
Implementing a machine learning pipeline in Spark requires several stages. Spark supports data only in the form of feature vectors.
For illustration, we will build a regressor model in Python using Spark.
Spark can be easily installed using the pip package manager for Python. Setting up Spark on a cloud-based notebook is recommended, as installing Spark on a local computer might take some time.
These basic libraries need to be imported to start the Spark Cluster.
import pandas as pd
import matplotlib.pyplot as plt
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
After successful installation, we need to create a Spark context and session. Creating a Spark session can be understood as creating a cluster for a specific project: it is the entry point to our Spark cluster and stores the cluster configuration, such as the number of cores to use and the location of the data stream.
from pyspark.sql import SparkSession
# Creating a Spark session
sc = SparkSession.builder.master("local[*]").getOrCreate()
local[*]: runs Spark locally and uses all available CPU cores.
getOrCreate: returns the existing session if one has already been created; otherwise, it creates a new one.
In this article, we will use the US car price dataset, which is publicly available on Kaggle. If you use a cloud notebook, you don't need to download the dataset.
data = sc.read.csv('../input/cars/data_car.csv', inferSchema=True, header=True)
data.show(5)
inferSchema = True: automatically infers the data type of each column instead of reading everything as strings.
read.csv: reads the CSV file into a Spark dataframe.
header = True: treats the first row of the file as the column names.
Loaded CSV file
data.printSchema()
Schema
printSchema(): prints the schema, showing each column name and its data type.
Spark provides built-in methods to view basic statistics for the loaded data.
data.describe().toPandas().transpose()
Statistics
Data cleaning is one of the most important steps in the machine learning lifecycle: we remove unwanted rows and unwanted columns.
At this step, we drop all the useless and redundant information from our dataset. Removing redundant data improves overall model performance and accuracy.
In our case, the dataset has some N/A values, and we aim to drop them.
We write a small helper function to replace the "N/A" values in a column with None.
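A minimal sketch of such a helper, using PySpark's when/otherwise pattern (the exact implementation in the original article may differ):
from pyspark.sql.functions import when

# Return the column value unchanged unless it equals `value`, in which case return None
def replace(column, value):
    return when(column != value, column).otherwise(None)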
data = data.withColumn("Market Category", replace(col("Market Category"), "N/A"))
Before removing anything, let's look at how many null values each column contains.
from pyspark.sql.functions import when, lit, count, isnan, col
data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns]).show()
Null values
As we can see, the Market Category column has 3,742 null values, the most of any column, which means this field is largely redundant and can safely be removed.
# Deleting the Market Category column
data = data.drop("Market Category")
# Deleting all rows that contain null values
data = data.na.drop()
So far, we have cleaned our data, and it is ready to be passed for model training. But first, we need to convert the data into a Spark feature vector.
print((data.count(), len(data.columns)))
Spark supports only the feature vector data format for machine learning tasks; feature vectors also help Spark run inference faster.
Before proceeding, we need to convert our regular dataframe into a feature vector for fast and better inference.
Spark offers a class called VectorAssembler that is used to convert a dataframe into a series of feature vectors.
We must pass the chosen columns as input features for our feature vector and model training. The VectorAssembler will assemble the data from all of these columns into a single vector column that is passed to our model for training in Spark. Its inputCols parameter lists the columns to combine, and outputCol defines the name of the generated feature vector column.
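As a sketch, assuming a set of numeric columns from the cars dataset (the exact column names depend on the CSV), the assembler could be set up like this:
from pyspark.ml.feature import VectorAssembler

# Hypothetical numeric columns from the US car price dataset
feature_cols = ["Year", "Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "Popularity"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="Attributes")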
We will call our VectorAssembler inside our machine learning pipeline. The raw data is not transformed all at once: the pipeline takes data in at one end and produces it at the other end after performing all the preprocessing steps specified inside it.
Spark offers various built-in machine learning models. We simply need to import one and train it on our data.
pyspark.ml.regression: contains all the regression models.
pyspark.ml.classification: contains all the classification models.
When constructing a model, we pass featuresCol, the input feature vector (the combination of all the features), and labelCol, the output column (the target feature).
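For example, a random forest regressor from pyspark.ml.regression could be declared as follows (the column names "Attributes" and "MSRP" are assumptions for this dataset):
from pyspark.ml.regression import RandomForestRegressor

# featuresCol is the assembled feature vector, labelCol is the target column
regressor = RandomForestRegressor(featuresCol="Attributes", labelCol="MSRP")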
A pipeline is a combination of multiple steps, and it works sequentially. Data goes from one end and, after performing all the sequential operations, comes to the other end.
We must create a sequence of all the transformations required in our pipeline.
After building our pipeline object, we can save our Pipeline on disk and load it anytime as required.
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, regressor])
# Saving the pipeline
pipeline.write().overwrite().save("pipeline_saved_model")
stages: the sequence of transformations that must be performed on our data.
We can load the saved pipeline back by using the Pipeline.load method.
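A sketch of reloading and fitting the saved pipeline (train_data and test_data are assumed to come from a randomSplit of the cleaned dataframe):
from pyspark.ml import Pipeline

train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)
# Reload the (unfitted) pipeline from disk, fit it, and generate predictions
pipeline = Pipeline.load("pipeline_saved_model")
model = pipeline.fit(train_data)
predictions = model.transform(test_data)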
R2: higher is better; it tells us the proportion of variance in the target that our model explains.
RMSE: the root mean squared error between the real and predicted values.
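These metrics can be computed with RegressionEvaluator; a sketch, assuming the label column is "MSRP" and that predictions were produced by the pipeline above:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="MSRP", predictionCol="prediction")
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
rmse = evaluator.evaluate(predictions, {evaluator.metricName: "rmse"})
print(r2, rmse)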
In this article, we discussed the Spark MLlib package and learned the steps involved in building an ML pipeline in Spark, covering data cleaning, data transformation, and pipelines in detail.
An MLlib pipeline can be fitted on real-time streaming data as well as static batched data.
Top 3 PySpark Machine Learning Project Ideas for Practice
Loan Default Prediction using Machine Learning
The main idea behind this Machine Learning project is to build a model that classifies whether a user is likely to default on a loan. You can build a PySpark machine learning pipeline from scratch to predict loan default.
Fake News Classification Project
Here you will learn how to classify fake news from real news. You can build a machine learning model based on Naive Bayes or Logistic Regression, implemented using PySpark DataFrames and the built-in algorithm implementations in Spark MLlib.
Bitcoin Price Detector Project
With the rise in the popularity of Bitcoin, create a PySpark ML model to predict the future price of bitcoin using forecasting techniques.
Conclusion
Machine Learning denotes a step forward in how computers can learn and make predictions. It has applications in various sectors and is being used extensively. Knowledge of Machine Learning will not only open multiple doors of opportunity for you; once you have mastered it, you are unlikely to be short of job offers.
Machine Learning has been gaining popularity ever since it came into the picture and it won’t stop any time soon.
PySpark and Pandas are two libraries that we use for data science tasks in Python. In this article, we will compare PySpark and Pandas in terms of memory consumption, speed, and performance in different situations.
How Do PySpark Machine Learning Pipelines Work?
In Machine Learning, running several algorithms and extracting insights from the data can be done easily using Pipelines. For example, let's assume you have to extract information from a text document:
Convert the document's text into words.
Convert the words into a numerical representation, since machine learning algorithms accept only numerical values.
Train a model on the resulting input features.
The PySpark Machine Learning library (MLlib) represents this workflow as a Pipeline, which consists of multiple stages that run sequentially.
The way a pipeline works is straightforward: it has a specified number of stages that run sequentially, each one either a Transformer or an Estimator. A Transformer stage calls the transform() method, and an Estimator stage calls the fit() method; both operate on a DataFrame. The Pipeline's fit() method is usually called on the original DataFrame, which holds the raw text documents and labels. The Tokenizer's transform() method then converts the text documents into words, adding a new column/feature to the DataFrame.
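A minimal sketch of such a text pipeline, assuming a DataFrame named training with "text" and "label" columns:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # text -> words
hashingTF = HashingTF(inputCol="words", outputCol="features")   # words -> numeric feature vectors
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)                                  # runs all stages in order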
There are two ways to pass parameters to a machine learning algorithm:
Set parameters on an instance. For example, if LR is a LogisticRegression instance, calling LR.setMaxIter(30) makes LR.fit() use at most 30 iterations.
Pass a ParamMap directly to fit() or transform(); any parameter in the ParamMap overrides the value previously set through setter methods.
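A sketch of both approaches, reusing the hypothetical training DataFrame from above:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression()
lr.setMaxIter(30)                              # 1) set the parameter on the instance
model = lr.fit(training, {lr.maxIter: 20})     # 2) pass a ParamMap to fit(); it overrides the setter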
An Introduction to Apache Spark
Apache Spark is a distributed processing system used to perform big data and machine learning tasks on large datasets.
As a data science enthusiast, you are probably familiar with storing files on your local device and processing them using languages like R and Python. However, local workstations have their limitations and cannot handle extremely large datasets.
This is where a distributed processing system like Apache Spark comes in. Distributed processing is a setup in which multiple processors are used to run an application. Instead of trying to process large datasets on a single computer, the task can be divided between multiple devices that communicate with each other.
With Apache Spark, users can run queries and machine learning workflows on petabytes of data, which is impossible to do on your local device.
This framework is even faster than previous data processing engines like Hadoop, and has increased in popularity in the past eight years. Companies like IBM, Amazon, and Yahoo are using Apache Spark as their computational framework.
The ability to analyze data and train machine learning models on large-scale datasets is a valuable skill to have if you want to become a data scientist. Having the expertise to work with big data frameworks like Apache Spark will set you apart from others in the field.
Practice using Pyspark with hands-on exercises in our Introduction to PySpark course.
Machine Learning in the Industry
Computer systems with the ability to learn, predict from given data, and improve themselves without having to be reprogrammed used to be a dream until recent years. Now it has been made possible using Machine Learning. Today, Machine Learning is the most widely used branch of Artificial Intelligence and is being adopted by big industries in order to benefit their businesses. Machine Learning is a highly in-demand skill in the industry right now, and Machine Learning experts command higher pay.
Following are some of the organizations where Machine Learning has various use cases:
PayPal: PayPal uses Machine Learning to detect suspicious activity.
IBM: There is a Machine Learning technology patented by IBM that helps decide when to hand over the control of a self-driving vehicle between a vehicle control processor and a human driver.
Google: Machine Learning is used to gather information from users that is further used to improve its search engine results.
Walmart: Machine Learning in Walmart is used to improve its efficiency.
Amazon: Amazon uses Machine learning to design and implement personalized product recommendations.
Facebook: Machine Learning is used to filter out poor quality content.
Know everything about Spark through our Spark Tutorial.
How to install PySpark
Pre-requisites:
Before installing Apache Spark and PySpark, you need to have the following software set up on your device:
Python
If you don’t already have Python installed, follow our Python developer set-up guide to set it up before you proceed to the next step.
Java
Next, follow this tutorial to get Java installed on your computer if you are using Windows. Here is an installation guide for macOS, and here's one for Linux.
Jupyter Notebook
A Jupyter Notebook is a web application that you can use to write code and display equations, visualizations, and text. It is one of the most commonly used programming editors by data scientists. We will use a Jupyter Notebook to write all the PySpark code in this tutorial, so make sure to have it installed.
You can follow our tutorial to get Jupyter up and running on your local device.
Dataset
We will be using Datacamp’s e-commerce dataset for all the analysis in this tutorial, so make sure to have it downloaded. We’ve renamed the file to “datacamp_ecommerce.csv” and saved it to the parent directory, and you can do the same so it’s easier to code along.
Why PySpark?
Companies that collect terabytes of data will have a big data framework like Apache Spark in place. To work with these large-scale datasets, knowledge of Python and R frameworks alone will not suffice.
You need to learn a framework that allows you to manipulate datasets on top of a distributed processing system, as most data-driven organizations will require you to do so. PySpark is a great place to get started, since its syntax is simple and can be picked up easily if you are already familiar with Python.
The reason companies choose to use a framework like PySpark is how quickly it can process big data. It is faster than libraries like Pandas and Dask and can handle larger amounts of data than these frameworks. If you had petabytes of data to process, for instance, Pandas and Dask would struggle, but PySpark could handle it easily.
While it is also possible to write Python code on top of a distributed system like Hadoop, many organizations choose to use Spark instead and use the PySpark API since it is faster and can handle real-time data. With PySpark, you can write code to collect data from a source that is continuously updated, while data can only be processed in batch mode with Hadoop.
Apache Flink is a distributed processing system that has a Python API called PyFlink, and is actually faster than Spark in terms of performance. However, Apache Spark has been around for a longer period of time and has better community support, which means that it is more reliable.
Furthermore, PySpark provides fault tolerance, which means it can recover from loss after a failure occurs. The framework also uses in-memory computation, keeping data in random access memory (RAM) so that it can be processed without constantly reading from and writing to a hard drive or SSD.
Hyperparameter Tuning using PySpark MLlib
One of the most critical tasks in Machine learning is model selection or using the given data to find the best model or parameters for a given business problem. MLlib has many tools for model selection, such as CrossValidator, and TrainValidationSplit; below is the functionality of these tools:
They split the data into training and test sets (usually, an 80%-20% allocation is given).
For each training and test pair, they iterate through the specified set of parameter combinations.
Finally, the parameters that produce the best results are selected for the model.
Cross-Validation in PySpark
Cross-validation splits the dataset into a number of folds that are used in turn for training and testing. It is especially helpful when the amount of data is limited, since every data point gets used for both training and testing, and it helps guard against overfitting.
You can implement cross-validation in PySpark by importing this library:
from pyspark.ml.tuning import CrossValidator
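A sketch of how CrossValidator could be wired up for a regression model, assuming a train_data DataFrame with a "features" vector column and a "label" column:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="label"),
                    numFolds=5)
cv_model = cv.fit(train_data)    # the best model is available as cv_model.bestModel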
Commonly Used Parameters and Methods in Spark MLlib
load(): Reads an ML instance from the input path, a shortcut of read().load(path).
Rank: It shows the Rank of the feature matrices computed (number of features).
Iterations: This defines the number of Iterations. (default is 5).
Lambda: It is a Regularization parameter. (default: 0.01).
Blocks: It is used to parallelize the computation. (default: -1).
Nonnegative: Gets the value of nonnegative or its default value.
fit(): Fits a model to the input dataset with optional parameters.
clear(param): Clears a param from the param map if it has been explicitly set.
write(): Returns an MLWriter instance for this ML instance.
save(): Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
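A sketch showing fit(), write().save() and load() together (`training` is an assumed DataFrame with "features" and "label" columns):
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr = LogisticRegression(maxIter=10)
model = lr.fit(training)                                    # fit a model to the input dataset
model.write().overwrite().save("lr_model_path")             # persist the fitted model
reloaded = LogisticRegressionModel.load("lr_model_path")    # read it back from the given path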
Advantages of PySpark Over Pandas
Just like pandas, PySpark also has many advantages. Let us discuss some of them.
Scalability: PySpark is designed to handle large-scale datasets and distributed computing. Using PySpark, we can split data into smaller partitions and process them in parallel across a cluster of machines. This makes PySpark faster and more efficient than Pandas for large-scale data processing.
Distributed Computing: PySpark can distribute computations across a cluster of machines. This helps us process large-scale data that may not fit into the memory of a single machine. Due to this, PySpark is ideal for big data processing.
Speed: PySpark is faster than Pandas when processing large datasets. It can leverage the computing power of a cluster of machines to perform parallel processing. This can significantly reduce processing times.
Integration with Big Data Tools: PySpark integrates with a wide range of big data tools and technologies, including Hadoop, Hive, Cassandra, and HBase. This makes it easier to work with large datasets stored in distributed file systems and other big data stores.
Integration with Hadoop Ecosystem: PySpark integrates seamlessly with the Hadoop ecosystem. This enables us to work with data stored in Hadoop Distributed File System (HDFS) and other data sources such as HBase, Hive, and Cassandra.
Streaming Data Processing: PySpark Streaming allows users to process real-time data streams using Spark’s distributed computing capabilities. It can ingest data from various sources, including Kafka, Flume, and Twitter, and process them in near real-time. Pandas doesn’t have any such feature.
Introduction to Spark MLlib
Apache Spark comes with a library named MLlib to perform Machine Learning tasks using the Spark framework. Since there is a Python API for Apache Spark, i.e., PySpark, you can also use this Spark ML library in PySpark. MLlib contains many algorithms and Machine Learning utilities.
In this tutorial, you will learn how to use Machine Learning in PySpark. The dataset of Fortune 500 is used in this tutorial to implement this. This dataset consists of the information related to the top 5 companies ranked by Fortune 500 in the year 2017. This tutorial will use the first five fields. You can download the dataset by clicking here.
The dataset looks like below:
Rank | Title | Website | Employees | Sector
1 | Walmart | http://www.walmart.com | 2,300,000 | Retail
2 | Berkshire Hathaway | http://www.berkshirehathaway.com | 367,700 | Finance
3 | Apple | http://www.apple.com | 116,000 | Technology
4 | ExxonMobil | http://www.exxonmobil.com | 72,700 | Energy
5 | McKesson | http://www.mckesson.com | 68,000 | Wholesale
In this Spark ML tutorial, you will implement Machine Learning to find out which of these fields is the most important factor in predicting the ranking of the above-mentioned companies in the coming years. You will also use DataFrames to implement Machine Learning.
Before diving right into this Spark MLlib tutorial, have a quick rundown of all the topics included in this tutorial:
DataFrames
What are DataFrames?
DataFrame is a new API for Apache Spark. It is basically a distributed, strongly-typed collection of data, i.e., a dataset, which is organized into named columns. A DataFrame is equivalent to what a table is in a relational database, except for the fact that it has richer optimization options.
How to create DataFrames?
There are multiple ways to create DataFrames in Apache Spark:
DataFrames can be created using an existing RDD
You can create a DataFrame by loading a CSV file directly
You can programmatically specify a schema to create a DataFrame
This tutorial uses DataFrames created from an existing CSV file.
Installation Guide
Now that you have all the prerequisites set up, you can proceed to install Apache Spark and PySpark.
Installing Apache Spark
To get Apache Spark set up, navigate to the download page and download the .tgz file displayed on the page:
Then, if you are using Windows, create a folder in your C directory called “spark.” If you use Linux or Mac, you can paste this into a new folder in your home directory.
Next, extract the file you just downloaded and paste its contents into this “spark” folder. This is what the folder path should look like:
Now, you need to set your environment variables. There are two ways you can do this:
Method 1: Changing Environment Variables Using Powershell
If you are using a Windows machine, the first way to change your environment variables is by using Powershell:
Step 1: Click on Start -> Windows Powershell -> Run as administrator
Step 2: Type the following line into Windows Powershell to set SPARK_HOME:
setx SPARK_HOME "C:\spark\spark-3.3.0-bin-hadoop3" # change this to your path
Step 3: Next, set your Spark bin directory as a path variable:
setx PATH "C:\spark\spark-3.3.0-bin-hadoop3\bin"
Method 2: Changing Environment Variables Manually
Step 1: Navigate to Start -> System -> Settings -> Advanced Settings
Step 2: Click on Environment Variables
Step 3: In the Environment Variables tab, click on New.
Step 4: Enter the following values into Variable name and Variable value. Note that the version you install might be different from the one shown below, so copy and paste the path to your Spark directory.
Step 5: Next, in the Environment Variables tab, click on Path and select Edit.
Step 6: Click on New and paste in the path to your Spark bin directory. Here is an example of what the bin directory looks like:
C:\spark\spark-3.3.0-bin-hadoop3\bin
Here is a guide on setting your environment variables if you use a Linux device, and here’s one for MacOS.
Installing PySpark
Now that you have successfully installed Apache Spark and all other necessary prerequisites, open a Python file in your Jupyter Notebook and run the following lines of code in the first cell:
!pip install pyspark
Alternatively, you can follow along to this end-to-end PySpark installation guide to get the software installed on your device.
PySpark vs Pandas Memory Consumption
When it comes to memory consumption, PySpark is better than Pandas. PySpark uses lazy evaluation and does not keep all the data in memory: data is retrieved from disk only when it is required. The Pandas module, on the other hand, keeps all the data in memory, so the memory consumption of code written with Pandas is generally greater than that of PySpark.
What is Machine Learning?
Machine Learning is one of the many applications of Artificial Intelligence (AI) where the primary aim is to enable computers to learn automatically without any human assistance. With the help of Machine Learning, computers are able to tackle the tasks that were, until now, only handled and carried out by people. It is basically a process of teaching a system how to make accurate predictions when fed with the right data. It has the ability to learn and improve from past experience without being specifically programmed for a task. Machine Learning mainly focuses on developing computer programs and algorithms that make predictions and learn from the provided data.
First, learn the basics of DataFrames in PySpark to get started with Machine Learning in PySpark.
What is PySpark MLlib?
Basic Introduction to PySpark MLlib
Spark MLlib is short for the Spark Machine Learning library. Machine Learning in PySpark is easy to use and scalable, and it works on distributed systems. You can use Spark MLlib for data analysis and apply machine learning techniques such as regression and classification, all through PySpark.
Parameters in PySpark MLlib
Some of the main parameters of PySpark MLlib are listed below:
Ratings: This parameter is used to create an RDD of ratings, rows, or tuples.
Rank: It specifies the number of features (latent factors) to compute.
Lambda: Lambda is a regularization parameter.
Blocks: Blocks are used to parallelize the computation. The default value for this is -1.
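These parameter names come from the RDD-based ALS recommender in spark.mllib; a minimal sketch with toy ratings, assuming spark is an active SparkSession:
from pyspark.mllib.recommendation import ALS, Rating

# Toy RDD of (user, product, rating) tuples
ratings = spark.sparkContext.parallelize([Rating(0, 0, 4.0), Rating(0, 1, 2.0), Rating(1, 1, 3.0)])
model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False)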
What is PySpark?
PySpark is an interface for Apache Spark in Python. With PySpark, you can write Python and SQL-like commands to manipulate and analyze data in a distributed processing environment. To learn the basics of the language, you can take Datacamp’s Introduction to PySpark course. This is a beginner program that will take you through manipulating data, building machine learning pipelines, and tuning models with PySpark.
Machine Learning Library (MLlib) Guide
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:
ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction, transformation, dimensionality reduction, and selection
Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
Persistence: saving and loading algorithms, models, and Pipelines
Utilities: linear algebra, statistics, data handling, etc.
Announcement: DataFrame-based API is primary API
The MLlib RDD-based API is now in maintenance mode.
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
What are the implications?
MLlib will still support the RDD-based API in spark.mllib with bug fixes.
MLlib will not add new features to the RDD-based API.
In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
Why is MLlib switching to the DataFrame-based API?
DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
What is “Spark ML”?
“Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API.
This is majorly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and the "Spark ML Pipelines" term we used initially to emphasize the pipeline concept.
Is MLlib deprecated?
No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode. But neither API is deprecated, nor MLlib as a whole.
Dependencies
MLlib uses the linear algebra packages Breeze and dev.ludovic.netlib for optimised numerical processing.[1] Those packages may call native acceleration libraries such as Intel MKL or OpenBLAS if they are available as system libraries or in runtime library paths.
However, native acceleration libraries can’t be distributed with Spark. See MLlib Linear Algebra Acceleration Guide for how to enable accelerated linear algebra processing. If accelerated native libraries are not enabled, you will see a warning message like below and a pure JVM implementation will be used instead:
WARNING: Failed to load implementation from: dev.ludovic.netlib.blas.JNIBLAS
To use MLlib in Python, you will need NumPy version 1.4 or newer.
Highlights in 3.0
The list below highlights some of the new features and enhancements added to MLlib in the 3.0 release of Spark:
Multiple columns support was added to Binarizer (SPARK-23578), StringIndexer (SPARK-11215), StopWordsRemover (SPARK-29808) and PySpark QuantileDiscretizer (SPARK-22796).
Tree-based feature transformation was added (SPARK-13677).
Two new evaluators, MultilabelClassificationEvaluator (SPARK-16692) and RankingEvaluator (SPARK-28045), were added.
Sample weights support was added in DecisionTreeClassifier/Regressor (SPARK-19591), RandomForestClassifier/Regressor (SPARK-9478), GBTClassifier/Regressor (SPARK-9612), MulticlassClassificationEvaluator (SPARK-24101), RegressionEvaluator (SPARK-24102), BinaryClassificationEvaluator (SPARK-24103), BisectingKMeans (SPARK-30351), KMeans (SPARK-29967) and GaussianMixture (SPARK-30102).
R API for PowerIterationClustering was added (SPARK-19827).
Added a Spark ML listener for tracking ML pipeline status (SPARK-23674).
Fit with validation set was added to Gradient Boosted Trees in Python (SPARK-24333).
RobustScaler transformer was added (SPARK-28399).
Factorization Machines classifier and regressor were added (SPARK-29224).
Gaussian Naive Bayes Classifier (SPARK-16872) and Complement Naive Bayes Classifier (SPARK-29942) were added.
ML function parity between Scala and Python (SPARK-28958).
predictRaw is made public in all the Classification models. predictProbability is made public in all the Classification models except LinearSVCModel (SPARK-30358).
Migration Guide
The migration guide is now archived on this page.
[1] To learn more about the benefits and background of system optimised natives, you may wish to watch Sam Halliday's ScalaX talk on High Performance Linear Algebra in Scala.
PySpark vs Pandas Performance
PySpark was created to help us work with big data on distributed systems, whereas the Pandas module is designed to manipulate and analyze datasets of up to a few gigabytes (roughly less than 10 GB). So PySpark, when used with a distributed computing system, gives better performance than Pandas. PySpark also uses resilient distributed datasets (RDDs) to process data in parallel, which is another reason it outperforms Pandas on large datasets.
Master Logistic and Linear Regression in PySpark
Logistic and linear regression are essential machine learning techniques that are supported by PySpark. You’ll learn to build and evaluate logistic regression models, before moving on to creating linear regression models to help you refine your predictors to only the most relevant options.
By the end of the course, you’ll feel confident in applying your new-found machine learning knowledge, thanks to hands-on tasks and practice data sets found throughout the course.
Introduction
Spark is a framework for working with Big Data. In this chapter you’ll cover some background about Spark and Machine Learning. You’ll then find out how to connect to Spark using Python and load CSV data.
Classification
Now that you are familiar with getting data into Spark, you’ll move onto building two types of classification model: Decision Trees and Logistic Regression. You’ll also find out about a few approaches to data preparation.
Regression
Next you’ll learn to create Linear Regression models. You’ll also find out how to augment your data by engineering new predictors as well as a robust approach to selecting only the most relevant predictors.
Finally you’ll learn how to make your models more efficient. You’ll find out how to use pipelines to make your code clearer and easier to maintain. Then you’ll use cross-validation to better test your models and select good model parameters. Finally you’ll dabble in two types of ensemble model.
Learning PySpark From Scratch – Next Steps:
If you managed to follow along with this entire PySpark tutorial, congratulations! You have now successfully installed PySpark onto your local device, analyzed an e-commerce dataset, and built a machine learning algorithm using the framework.
One caveat of the analysis above is that it was conducted on 2,500 rows of e-commerce data collected on a single day. The outcome of this analysis could be solidified with a larger amount of data, as techniques like RFM modeling are usually applied to months of historical data.
However, you can take the principles learned in this article and apply them to a wide variety of larger datasets in the unsupervised machine learning space.
Check out this cheat sheet by Datacamp to learn more about PySpark’s syntax and its modules.
Finally, if you’d like to go beyond the concepts covered in this tutorial and learn the fundamentals of programming with PySpark, you can take the Big Data with PySpark learning track on Datacamp. This track contains a series of courses that will teach you to do the following with PySpark:
Data Management, Analysis, and Pre-processing
Building and Tuning Machine Learning Pipelines
Big Data Analysis
Feature Engineering
Building Recommendation Engines
Build and Test Decision Trees
Building your own decision trees is a great way to start exploring machine learning models. You’ll use an algorithm called ‘Recursive Partitioning’ to divide data into two classes and find a predictor within your data that results in the most informative split of the two classes, and repeat this action with further nodes. You can then use your decision tree to make predictions with new data.
Now is the Best Time to Learn PySpark!
PySpark has risen in demand because of its compatibility with data science workflows, its ability to handle large amounts of data, and its cluster computing capabilities. MLlib fills the machine learning gap in this ecosystem, especially for engineers who are unfamiliar with machine learning theory but would like to include predictive features in their big data projects. The Spark ecosystem is growing very fast and is now one of the biggest names in cluster computing. Both PySpark and Hadoop are production-ready and used by hundreds of companies worldwide. Competition is healthy, and data engineers and scientists should learn PySpark for machine learning.
When to Use PySpark vs Pandas?
The choice between PySpark and Pandas depends on the specific data analysis tasks and requirements. Here are some factors you can consider when deciding whether to use PySpark or Pandas:
Dataset Size: If you are working with small to medium-sized datasets that can fit in the memory of a single machine, Pandas is likely to be the better choice. However, if you are dealing with large-scale datasets that cannot fit in the memory of a single machine, PySpark is the better choice.
Computing Resources: PySpark is designed to leverage distributed computing resources to process large-scale datasets across a cluster of machines. If you have access to a distributed computing environment, such as a Hadoop cluster, PySpark can provide significant performance benefits.
The complexity of Data Processing Tasks: PySpark is more suitable for complex data processing tasks that involve multiple stages of data transformation and analysis. Pandas is more suitable for simple data analysis tasks that involve filtering, selecting, and aggregating data.
Learning curve: Understanding the Spark architecture and using PySpark can be a tedious task. On the other hand, if you know Python, you can start working with pandas within an hour. So, if you have a small dataset and you immediately want to perform analytical tasks on the data, go for pandas.
What is PySpark used for?
Most data scientists and analysts are familiar with Python and use it to implement machine learning workflows. PySpark allows them to work with a familiar language on large-scale distributed datasets.
Apache Spark can also be used with other data science programming languages like R. If this is something you are interested in learning, the Introduction to Spark with sparklyr in R course is a great place to start.
What is Pandas?
Pandas is a popular open-source data analysis library for the Python programming language. If you are a data analyst or data scientist who works with python, you must have used pandas in data analysis tasks.
Pandas provides data structures and functions for working with structured data. It provides us with the Series and DataFrame data structures using which we can analyze one-dimensional and tabular data respectively.
The Pandas library also provides a range of functions for data manipulation, including filtering, selecting, joining, grouping, and aggregating data. It also provides functionality for handling missing data, reshaping data, and handling time-series data.
Pandas also provides great data visualization capabilities: it integrates with the Matplotlib library, so we can plot different metrics from dataframes with a single function call.
Data scientists and data analysts use Pandas extensively due to its simplicity and ease of use. Pandas is built on top of the NumPy library, so it also performs well in numerical data analysis.
End-to-end Machine Learning PySpark Tutorial
Now that you have PySpark up and running, we will show you how to execute an end-to-end customer segmentation project using the library.
Customer segmentation is a marketing technique companies use to identify and group users who display similar characteristics. For instance, if you visit Starbucks only during the summer to purchase cold beverages, you can be segmented as a “seasonal shopper” and enticed with special promotions curated for the summer season.
Data scientists usually build unsupervised machine learning algorithms such as K-Means clustering or hierarchical clustering to perform customer segmentation. These models are great at identifying similar patterns between user groups that often go unnoticed by the human eye.
In this tutorial, we will use K-Means clustering to perform customer segmentation on the e-commerce dataset we downloaded earlier.
By the end of this tutorial, you will be familiar with the following concepts:
Reading csv files with PySpark
Exploratory Data Analysis with PySpark
Grouping and sorting data
Performing arithmetic operations
Aggregating datasets
Data Pre-Processing with PySpark
Working with datetime values
Type conversion
Joining two dataframes
The rank() function
PySpark Machine Learning
Creating a feature vector
Standardizing data
Building a K-Means clustering model
Interpreting the model
Step 1: Creating a SparkSession
A SparkSession is an entry point into all functionality in Spark, and is required if you want to build a dataframe in PySpark. Run the following lines of code to initialize a SparkSession:
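A sketch of such a session, matching the description below (the app name and off-heap memory size are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Customer Segmentation") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "10g") \
    .getOrCreate()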
Using the code above, we built a Spark session and set a name for the application. Then, the data was cached in off-heap memory to avoid storing it directly on disk, and the amount of memory was manually specified.
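Next, the e-commerce file downloaded earlier can be read into a dataframe; a sketch, assuming the renamed file sits in the working directory:
# Read the CSV; escape='"' handles quoted item descriptions that contain commas
df = spark.read.csv("datacamp_ecommerce.csv", header=True, escape="\"")
df.show(5, 0)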
To find the country from which most purchases are made, we need to use the groupBy() clause in PySpark:
from pyspark.sql.functions import *
from pyspark.sql.types import *

df.groupBy('Country').agg(countDistinct('CustomerID').alias('country_count')).show()
The following table will be rendered after running the codes above:
Almost all the purchases on the platform were made from the United Kingdom, and only a handful were made from countries like Germany, Australia, and France.
Notice that the data in the table above isn't sorted by the number of purchases. To sort this table, we can include the orderBy() clause:
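A sketch of the sorted query:
from pyspark.sql.functions import countDistinct, desc

df.groupBy('Country').agg(countDistinct('CustomerID').alias('country_count')) \
  .orderBy(desc('country_count')).show()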
The output displayed is now sorted in descending order:
When was the most recent purchase made by a customer on the e-commerce platform?
To find when the latest purchase was made on the platform, we need to convert the "InvoiceDate" column into a timestamp format and use the max() function in PySpark:
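A sketch of this step; the date format string is an assumption about how dates appear in the raw file:
from pyspark.sql import functions as F

df = df.withColumn('date', F.to_timestamp(F.col('InvoiceDate'), 'dd/MM/yy HH:mm'))
df.select(F.max('date')).show()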
You should see the following table appear after running the code above:
When was the earliest purchase made by a customer on the e-commerce platform?
Similar to what we did above, the min() function can be used to find the earliest purchase date and time:
df.select(min("date")).show()
Notice that the most recent and earliest purchases were made on the same day just a few hours apart. This means that the dataset we downloaded contains information of only purchases made on a single day.
Step 4: Data Pre-processing
Now that we have analyzed the dataset and have a better understanding of each data point, we need to prepare the data to feed into the machine learning algorithm.
Let’s take a look at the head of the dataframe once again to understand how the pre-processing will be done:
df.show(5,0)
From the dataset above, we need to create multiple customer segments based on each user’s purchase behavior.
The variables in this dataset are in a format that cannot be easily ingested into the customer segmentation model. These features individually do not tell us much about customer purchase behavior.
Due to this, we will use the existing variables to derive three new informative features – recency, frequency, and monetary value (RFM).
RFM is commonly used in marketing to evaluate a client’s value based on their:
Recency: How recently has each customer made a purchase?
Frequency: How often have they bought something?
Monetary Value: How much money do they spend on average when making purchases?
We will now preprocess the dataframe to create the above variables.
Recency
First, let’s calculate the value of recency – the latest date and time a purchase was made on the platform. This can be achieved in two steps:
i) Assign a recency score to each customer
We will subtract every date in the dataframe from the earliest date. This will tell us how recently a customer was seen in the dataframe. A value of 0 indicates the lowest recency, as it will be assigned to the person who was seen making a purchase on the earliest date.
One customer can make multiple purchases at different times. We need to select only the last time they were seen buying a product, as this is indicative of when the most recent purchase was made:
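A sketch of these two steps (column names and the exact expressions are assumptions):
from pyspark.sql import functions as F

# Recency in seconds relative to the earliest purchase date in the data
earliest = df.select(F.min("date")).collect()[0][0]
df2 = df.withColumn("recency", F.col("date").cast("long") - F.lit(earliest).cast("long"))

# Keep only each customer's most recent purchase row
max_rec = df2.groupBy("CustomerID").agg(F.max("recency").alias("recency"))
df2 = df2.join(max_rec, on=["CustomerID", "recency"], how="inner")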
Let’s look at the head of the new dataframe. It now has a variable called “recency” appended to it:
df2.show(5,0)
An easier way to view all the variables present in a PySpark dataframe is to use its printSchema() function. This is the equivalent of the info() function in Pandas:
df2.printSchema()
The output rendered should look like this:
Frequency
Let’s now calculate the value of frequency – how often a customer bought something on the platform. To do this, we just need to group by each customer ID and count the number of items they purchased:
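A sketch of the frequency calculation (column names are assumptions):
from pyspark.sql import functions as F

df_freq = df2.groupBy("CustomerID").agg(F.count("InvoiceDate").alias("frequency"))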
Look at the head of this new dataframe we just created:
df_freq.show(5,0)
There is a frequency value appended to each customer in the dataframe. This new dataframe only has two columns, and we need to join it with the previous one:
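A sketch of the join:
df3 = df2.join(df_freq, on="CustomerID", how="inner")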
Now that we have created all the necessary variables to build the model, run the following lines of code to select only the required columns and drop duplicate rows from the dataframe:
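A sketch of these final preparation steps, including the monetary value feature and the standardization that produces the scaled features used by the clustering code below (column names such as Quantity and UnitPrice are assumptions about this dataset):
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Monetary value: total amount spent per customer
m_val = (df3.withColumn("TotalAmount", F.col("Quantity") * F.col("UnitPrice"))
            .groupBy("CustomerID")
            .agg(F.sum("TotalAmount").alias("monetary_value")))

# Keep one row per customer with the three RFM features
finaldf = (df3.join(m_val, on="CustomerID", how="inner")
              .select("CustomerID", "recency", "frequency", "monetary_value")
              .distinct())

# Assemble and standardize the features
assembler = VectorAssembler(inputCols=["recency", "frequency", "monetary_value"],
                            outputCol="features")
assembled = assembler.transform(finaldf)
scaler = StandardScaler(inputCol="features", outputCol="standardized")
data_scale_output = scaler.fit(assembled).transform(assembled)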
These are the scaled features that will be fed into the clustering algorithm.
If you’d like to learn more about data preparation with PySpark, take this feature engineering course on Datacamp.
Step 5: Building the Machine Learning Model
Now that we have completed all the data analysis and preparation, let’s build the K-Means clustering model.
The algorithm will be created using PySpark’s machine learning API.
i) Finding the number of clusters to use
When building a K-Means clustering model, we first need to determine the number of clusters or groups we want the algorithm to return. If we decide on three clusters, for instance, then we will have three customer segments.
The most popular technique used to decide on how many clusters to use in K-Means is called the “elbow-method.”
This is done by simply running the K-Means algorithm for a wide range of cluster counts and visualizing the model results for each. The plot will have an inflection point that looks like an elbow, and we pick the number of clusters at this point.
Read this Datacamp K-Means clustering tutorial to learn more about how the algorithm works.
Let’s run the following lines of code to build a K-Means clustering algorithm from 2 to 10 clusters:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
import numpy as np

cost = np.zeros(10)
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='standardized',
                                metricName='silhouette', distanceMeasure='squaredEuclidean')

for i in range(2, 10):
    KMeans_algo = KMeans(featuresCol='standardized', k=i)
    KMeans_fit = KMeans_algo.fit(data_scale_output)
    output = KMeans_fit.transform(data_scale_output)
    cost[i] = KMeans_fit.summary.trainingCost
With the codes above, we have successfully built and evaluated a K-Means clustering model with 2 to 10 clusters. The results have been placed in an array, and can now be visualized in a line chart:
import pandas as pd
import pylab as pl

df_cost = pd.DataFrame(cost[2:])
df_cost.columns = ["cost"]
new_col = range(2, 10)
df_cost.insert(0, 'cluster', new_col)

pl.plot(df_cost.cluster, df_cost.cost)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
The codes above will render the following chart:
ii) Building the K-Means Clustering Model
From the plot above, we can see that there is an inflection point that looks like an elbow at four. Due to this, we will proceed to build the K-Means algorithm with four clusters:
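A sketch of the final model fit:
from pyspark.ml.clustering import KMeans

KMeans_algo = KMeans(featuresCol='standardized', k=4)
KMeans_fit = KMeans_algo.fit(data_scale_output)
preds = KMeans_fit.transform(data_scale_output)
preds.show(5, 0)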
Notice that there is a “prediction” column in this dataframe that tells us which cluster each CustomerID belongs to:
Step 6: Cluster Analysis
The final step in this entire tutorial is to analyze the customer segments we just built.
Run the following lines of code to visualize the recency, frequency, and monetary value of each customerID in the dataframe:
import matplotlib.pyplot as plt
import seaborn as sns

df_viz = preds.select('recency', 'frequency', 'monetary_value', 'prediction')
df_viz = df_viz.toPandas()
avg_df = df_viz.groupby(['prediction'], as_index=False).mean()

list1 = ['recency', 'frequency', 'monetary_value']
for i in list1:
    sns.barplot(x='prediction', y=str(i), data=avg_df)
    plt.show()
The codes above will render the following plots:
Here is an overview of characteristics displayed by customers in each cluster:
Cluster 0: Customers in this segment display low recency, frequency, and monetary value. They rarely shop on the platform and are low potential customers who are likely to stop doing business with the ecommerce company.
Cluster 1: Users in this cluster display high recency but haven’t been seen spending much on the platform. They also don’t visit the site often. This indicates that they might be newer customers who have just started doing business with the company.
Cluster 2: Customers in this segment display medium recency and frequency and spend a lot of money on the platform. This indicates that they tend to buy high-value items or make bulk purchases.
Cluster 3: The final segment comprises users who display high recency and make frequent purchases on the platform. However, they don’t spend much on the platform, which might mean that they tend to select cheaper items in each purchase.
To go beyond the predictive modeling concepts covered in this tutorial, you can take the Machine Learning with PySpark course on Datacamp.
Let’s Create some Predictions from the Model
PySpark Linear Regression Predict
We can see that the R-squared value obtained is 0.85 or 85% which is quite good, considering that we did not perform exploratory data analysis and hyperparameter Tuning.
In this image, you can see the features and the predictions made by the model on the test data.
PySpark Machine Learning Example to Implement Linear Regression
The Linear Regression model predicts a continuous value from the given features. A regression model estimates the target value based on independent variables, and it can also be used to find relationships between variables and for forecasting.
Importing the Libraries
Creating a Spark session and reading the CSV file, Insurance.csv.
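A sketch of the setup, assuming the file is named Insurance.csv and sits in the working directory:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()
df = spark.read.csv("Insurance.csv", inferSchema=True, header=True)
df.show(5)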
Data Exploration for better understanding via data samples.
Our data has 7 columns where the ‘charges’ column is our dependent variable and ‘age’, ‘gender’, ‘BMI’, ‘children’, ‘smoker’, and ‘region’ are the independent variables.
Performing Exploratory Data Analysis
Exploratory Data Analysis is a process of performing initial experiments on data to discover patterns, detect outliers, and test assumptions with the help of statistics and graphical representations. Here we are using the describe data function to get the statistics of our data.
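For example:
# Summary statistics (count, mean, stddev, min, max) for every column
df.describe().show()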
Using StringIndexer for dealing with Categorical data
StringIndexer: It is used to convert a string column into numerical form. It allocates unique values to each of the categories present in the respective column.
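A sketch of indexing the categorical columns (the chosen columns and output names are assumptions):
from pyspark.ml.feature import StringIndexer

for cat_col in ["gender", "smoker", "region"]:
    indexer = StringIndexer(inputCol=cat_col, outputCol=cat_col + "_index")
    df = indexer.fit(df).transform(df)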
Output Variables
We can see from the output that columns like children, region, and smoker have been assigned numeric values in place of the original strings.
Feature Engineering
Given the multiple columns, we need to merge them into a single column using VectorAssembler. It is a feature transformer that merges multiple columns into a vector column. One can select the number of columns used as input features and pass only those columns through the VectorAssembler. We will pass all seven input columns to create a single feature vector column in our case.
VectorAssembler: It is a transformer that helps to select the input columns that we need to create a single feature vector to train our Machine Learning models.
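A sketch of the assembly step, reusing the indexed columns from above (column names are assumptions):
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["age", "gender_index", "BMI", "children", "smoker_index", "region_index"],
    outputCol="features")
df = assembler.transform(df)
df.select("features", "charges").show(5, truncate=False)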
Output: Here the column features represent the VectorAssembler values.
Now that we have our features ready and the input variables are known, let’s get them together and dive into the machine-learning technique.
Splitting the Dataset
Splitting the Data into Train and Test sets to train our model and check its efficiency.
Checking the statistics of our train and test sets.
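A sketch of the split and the quick check (the 70/30 ratio is an assumption):
train_data, test_data = df.randomSplit([0.7, 0.3], seed=42)
train_data.describe("charges").show()
test_data.describe("charges").show()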
Build and Train Linear Regression Model
In this part, we build and train the Linear Regression model using the input features and the output column. We can fetch the model's coefficients and intercept values, and we can also evaluate its performance on the training data using r2. The model achieves a good fit (around 77%) on the training dataset, as we will see shortly.
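A sketch of training the model and inspecting its coefficients, intercept, and training r2:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="charges")
lr_model = lr.fit(train_data)
print(lr_model.coefficients)
print(lr_model.intercept)
print(lr_model.summary.r2)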
The Math Behind Measuring R-Squared Value
Assume that without Linear Regression the total squared error around the mean comes to 158, and that after applying the Linear Regression model the remaining squared error drops to 1.2. Let's walk through how the R-squared value is calculated:
TSS (Total Sum of Squares) = SSE (Sum of Squared Errors, the part left unexplained by the model) + SSR (Sum of Squares due to Regression, the part explained by the model)
The total sum of squares is the sum of the squared differences between the actual values and their mean, and it is fixed for a given dataset. Here it equals 158.
The SSE is the sum of the squared differences between the actual and predicted values of the target variable, which, as mentioned, is 1.2 after using Linear Regression.
SSR is the Sum of Squares explained by the Regression and can be calculated as (TSS - SSE).
SSR = 158 - 1.2 = 156.8
R-squared (coefficient of determination) = SSR/TSS = 156.8/158 = 0.99
An R-squared of 0.99 means the Linear Regression model explains 99% of the variation in the target variable from the given input features; the remaining 1% is variation the model cannot explain. Our Linear Regression line fits the data very well, but such a close fit can also be a case of overfitting.
To check how the model performs beyond the training data, we use the evaluate function to make predictions for the test data and read off the r2 value there as well.
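A minimal sketch of that evaluation step, reusing the lr_model and test_df names from the earlier sketches:
# Evaluate the fitted model on the held-out test set
test_results = lr_model.evaluate(test_df)

# R-squared on the unseen test data
print(test_results.r2)

# Root mean squared error on the test data
print(test_results.rootMeanSquaredError)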
Overfitting occurs when a model achieves high accuracy on the training data but its performance drops on unseen/test data, resulting in a weak predictive model. The techniques used to address overfitting are collectively known as regularization, and there are different types of regularization techniques.
In terms of Linear Regression, one can use Ridge, Lasso, or Elasticnet Regularization techniques to handle overfitting. Ridge Regression is also known as L2 regularization. It focuses on restricting the coefficient values of input features close to zero, whereas Lasso regression (L1) makes some of the coefficients zero to improve the model’s generalization. Elasticnet is a combination of both techniques.
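In Spark's LinearRegression, the regParam and elasticNetParam arguments control the regularization strength and type; a short sketch (the regParam value of 0.1 is purely illustrative):
from pyspark.ml.regression import LinearRegression

# Ridge (L2) regularization: elasticNetParam = 0.0
ridge = LinearRegression(featuresCol='features', labelCol='charges', regParam=0.1, elasticNetParam=0.0)

# Lasso (L1) regularization: elasticNetParam = 1.0
lasso = LinearRegression(featuresCol='features', labelCol='charges', regParam=0.1, elasticNetParam=1.0)

# Elasticnet: a mix of L1 and L2
elastic_net = LinearRegression(featuresCol='features', labelCol='charges', regParam=0.1, elasticNetParam=0.5)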
Linear Regression is a parametric approach and makes a few underlying assumptions about the distribution of the input data. The model does not perform well if the input data does not satisfy those assumptions, so it is essential to go over them before using the Linear Regression model –
There must be a linear relationship between the input and output variables.
The independent variables (input features) should not be correlated to each other (also known as multicollinearity).
There must be no correlation between the error values.
The error terms must have constant variance (homoscedasticity).
The error values must be normally distributed.
What is PySpark?
PySpark is a Python library that provides an interface for Apache Spark. Spark is an open-source framework for big data processing. Spark is built to process large amounts of data quickly by distributing computing tasks across a cluster of machines.
PySpark allows us to use Apache Spark and its ecosystem of libraries, such as Spark SQL for working with structured data.
We can also use Spark MLlib for machine learning and GraphX for graph processing through PySpark.
PySpark supports many data sources, including Hadoop Distributed File System (HDFS), Apache Cassandra, and Amazon S3.
Along with these data processing capabilities, we can also use PySpark together with popular Python libraries such as NumPy and pandas.
Learn to Use Apache Spark for Machine Learning
Spark is a powerful, general-purpose tool for working with Big Data. Spark transparently handles the distribution of compute tasks across a cluster. This means that operations are fast, and it also allows you to focus on the analysis rather than worry about technical details. In this section, you'll see how to get data into Spark and then work through the fundamental Spark Machine Learning topics: Linear Regression, Logistic Regression and other classifiers, and creating pipelines.
What is Machine Learning?
Machine Learning is one of the many applications of Artificial Intelligence (AI) where the primary aim is to enable computers to learn automatically without any human assistance. With the help of Machine Learning, computers are able to tackle the tasks that were, until now, only handled and carried out by people. It is basically a process of teaching a system how to make accurate predictions when fed with the right data. It has the ability to learn and improve from past experience without being specifically programmed for a task. Machine Learning mainly focuses on developing computer programs and algorithms that make predictions and learn from the provided data.
First, learn the basics of DataFrames in PySpark to get started with Machine Learning in PySpark.
What is PySpark MLlib?
Basic Introduction to PySpark MLlib
Spark MLlib is short for the Spark Machine Learning library. Machine Learning in PySpark is easy to use and scalable, and it works on distributed systems. You can use Spark Machine Learning for data analysis and apply techniques such as regression and classification, all through PySpark MLlib.
Parameters in PySpark MLlib
Some of the main parameters of PySpark MLlib (as used by its ALS recommendation algorithm) are listed below; a short sketch follows the list:
Ratings: An RDD of Rating objects or (user, product, rating) tuples used to train the model.
Rank: The number of latent features (factors) to compute for each user and product.
Lambda: Lambda is a regularization parameter.
Blocks: The number of blocks used to parallelize the computation. The default value of -1 lets Spark configure this automatically.
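These parameters appear in the ALS API from pyspark.mllib; a minimal sketch with made-up ratings, assuming an existing SparkSession named spark:
from pyspark.mllib.recommendation import ALS, Rating

# A tiny RDD of (user, product, rating) tuples
ratings = spark.sparkContext.parallelize([
    Rating(1, 10, 4.0),
    Rating(1, 20, 3.0),
    Rating(2, 10, 5.0),
])

# rank = number of latent features, lambda_ = regularization, blocks = -1 lets Spark decide
model = ALS.train(ratings, rank=5, iterations=5, lambda_=0.01, blocks=-1)
print(model.predict(2, 20))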
Why PySpark for Machine Learning?
Over 3,000 companies use Apache Spark, including top players like Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Databricks, and Amazon. Spark has made waves in recent years as the Big Data product with the shortest learning curve, popular with SMBs and Enterprise teams. Big companies rely heavily on machine learning models for predictive analysis to stay ahead in the market; models such as logistic regression, XGBoost, SVMs, random forest classifiers, decision trees, and gradient-boosted tree classifiers are deployed heavily on their end.
You gain the ability to work with Big Data.
Very good native SQL support.
Great documentation is available.
Supports supervised machine learning algorithms.
Provides more scalable analysis.
PySpark for Machine Learning is fast and uses multiple machines for large-scale data processing. It runs on distributed cluster managers such as YARN, Mesos, and Standalone clusters. PySpark has two main abstractions:
RDD – A distributed collection of objects.
Dataframe – Distributed dataset for tabular data.
RDD – Resilient Distributed Datasets are distributed collections of immutable JVM objects that allow you to perform calculations very quickly, and they are the backbone of Apache Spark. As the name suggests, the dataset is distributed: it is split into chunks based on some key and spread across the nodes, which allows calculations to be performed in parallel. RDDs are schema-less data structures.
Dataframe – An immutable distributed collection of data organized into named columns, much like a table in a relational database. Anyone who works with pandas DataFrames can relate to this abstraction.
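A quick sketch contrasting the two abstractions, assuming an existing SparkSession named spark:
# RDD: a schema-less distributed collection of Python objects
rdd = spark.sparkContext.parallelize([('Walmart', 2300000), ('Apple', 116000)])
print(rdd.map(lambda x: x[1]).sum())

# DataFrame: the same data organized into named columns
df = spark.createDataFrame(rdd, ['company', 'employees'])
df.show()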
About this book
Master the new features in PySpark 3.1 to develop data-driven, intelligent applications. This updated edition covers topics ranging from building scalable machine learning models, to natural language processing, to recommender systems.
Machine Learning with PySpark, Second Edition begins with the fundamentals of Apache Spark, including the latest updates to the framework. Next, you will learn the full spectrum of traditional machine learning algorithm implementations, along with natural language processing and recommender systems. You’ll gain familiarity with the critical process of selecting machine learning algorithms, data ingestion, and data processing to solve business problems. You’ll see a demonstration of how to build supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests. You’ll also learn how to automate the steps using Spark pipelines, followed by unsupervised models such as K-means and hierarchical clustering. A section on Natural Language Processing (NLP) covers text processing, text mining, and embeddings for classification. This new edition also introduces Koalas in Spark and how to automate data workflow using Airflow and PySpark’s latest ML library.
After completing this book, you will understand how to use PySpark's machine learning library to build and train various machine learning models, along with related components such as data ingestion, processing, and visualization, to develop data-driven intelligent applications.
What you will learn:
Build a spectrum of supervised and unsupervised machine learning algorithms
Use PySpark’s machine learning library to implement machine learning and recommender systems
Leverage the new features in PySpark’s machine learning library
Understand data processing using Koalas in Spark
Handle issues around feature engineering, class balance, bias and variance, and cross validation to build optimally fit models
Who This Book Is For
Data science and machine learning professionals.
Advantages of Pandas Over PySpark
Pandas and PySpark are both popular tools for data analysis and processing. However, they have different strengths and weaknesses. Here are some advantages of Pandas over PySpark.
Ease of Use: Pandas is generally easier to use and has a lower learning curve compared to PySpark. The pandas API is simple and the syntax is similar to SQL and Excel. This makes it easy for analysts and data scientists to get started with data analysis and manipulation using pandas.
Interactivity: Pandas provides an interactive environment for data exploration and analysis through Jupyter notebooks. This allows us to visualize data and experiment with code more easily. PySpark, on the other hand, can have a higher barrier to entry, since it typically involves configuring a Spark session (and, for large workloads, a distributed computing cluster) before running code.
Well-suited for small to medium-sized data: Pandas is well-suited for handling small to medium-sized datasets that can fit in memory. It provides fast and efficient data manipulation and processing on a single machine, without requiring distributed computing resources.
Flexibility: The pandas module is highly flexible and can work with a wide variety of data sources. We can use CSV, Excel, SQL databases, parquet files, and more. It also provides a wide range of data manipulation functions that can handle complex data transformation tasks.
Integration with Other Libraries: Pandas integrates well with other data science libraries in the Python ecosystem, such as NumPy, Matplotlib, and Scikit-learn. This makes it easy to build end-to-end data analysis pipelines and machine learning workflows using a variety of tools.
Community Support: Pandas has a large and active community of users and contributors. It also has extensive documentation that explains each function with examples. This makes it easy to find help and resources when working with Pandas.
Spark MLLib to Leverage PySpark for Machine Learning
MLlib is Spark's machine learning (ML) library; it helps make practical machine learning models scalable and manageable.
Spark MLlib consists of built-in tools such as:
Machine Learning Algorithms: Common learning algorithms such as logistic regression, linear regression, decision trees, random forests, and other ensemble methods.
Featurization: Feature extraction, feature scaling, feature selection, and dimensionality reduction.
Pipelines: Tools for constructing, evaluating, and tuning machine learning pipelines.
Persistence: Saving and loading algorithms, models, and pipelines.
While building a machine learning model, data scientists must perform many tasks, including data cleaning, feature engineering, making inferences from data, building machine learning pipelines, saving models, and finally deploying it. Spark MLlib library provides multiple features, making it a go-to choice for data scientists across top tech companies.
Performing Linear Regression on a Real-world Dataset
Let's understand Machine Learning better by implementing full-fledged code to perform linear regression on a dataset of the top 5 Fortune 500 companies in the year 2017.
Loading Data
As mentioned above, you are going to use a DataFrame that is created directly from a CSV file. Following are the commands to load data into a DataFrame and to view the loaded data.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Create the Spark context and the SQL context used to read the data
sc = SparkContext()
sqlContext = SQLContext(sc)
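The load itself does not appear above; a minimal sketch, with the file name fortune500.csv being an assumption:
# Load the CSV into a DataFrame and take a quick look at the first rows
company_df = sqlContext.read.csv('fortune500.csv', inferSchema=True, header=True)
company_df.show(5)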
To check the data type of every column of a DataFrame and to print the schema of the DataFrame in a tree format, you can use the following commands, respectively:
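For example, using the company_df name assumed above:
# Data type of every column
print(company_df.dtypes)

# Schema printed in a tree format
company_df.printSchema()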
Performing Descriptive Analysis
company_df.describe().toPandas().transpose()
Summary      count   mean        stddev              min             max
Rank         5       3.0         1.581138830084      1               5
Title        5       None        None                Apple           Walmart
Website      5       None        None                www.apple.com   www.walmart.com
Employees    5       584880.0    966714.2168190142   68000           2300000
Sector       5       None        None                Energy          Wholesalers
Finding the Correlation Between Independent Variables
To find out whether any of the variables, i.e., fields, have correlations or dependencies, you can plot a scatter matrix. Plotting a scatter matrix is one of the best ways to identify linear correlations between variables, if any exist.
You can plot a scatter matrix on your DataFrame using the following code:
import pandas as pd

# Keep only the numeric columns, sample 80% of the rows, and convert to pandas for plotting
numeric_features = [t[0] for t in company_df.dtypes if t[1] == 'int' or t[1] == 'double']
sampled_data = company_df.select(numeric_features).sample(False, 0.8).toPandas()

axs = pd.plotting.scatter_matrix(sampled_data, figsize=(10, 10))

# Tidy up the axis labels and ticks
n = len(sampled_data.columns)
for i in range(n):
    v = axs[i, 0]
    v.yaxis.label.set_rotation(0)
    v.yaxis.label.set_ha('right')
    v.set_yticks(())
    h = axs[n-1, i]
    h.xaxis.label.set_rotation(90)
    h.set_xticks(())
Here, you can come to the conclusion that in the dataset, the “Rank” and “Employees” columns have a correlation. Let’s dig a little deeper into finding the correlation specifically between these two columns.
Correlation Between Independent Variables
import six

# Print the correlation of every numeric column with 'Employees'
for i in company_df.columns:
    if not isinstance(company_df.select(i).take(1)[0][0], six.string_types):
        print("Correlation to Employees for", i, company_df.stat.corr('Employees', i))
Correlation to Employees for Rank -0.778372714650932
Correlation to Employees for Employees 1.0
The value of correlation ranges from -1 to 1. The closer it is to 1, the stronger the positive correlation between the fields; a value closer to -1 indicates a strong negative correlation. Now, you can analyze your output and see whether there is a correlation and, if there is, whether it is a strong positive or negative one.
After performing linear regression on the dataset, you can conclude that 'Employees' is the most important field in the given dataset for predicting the ranking of the companies in the future. 'Rank' has a strong linear correlation with 'Employees', indicating that the number of employees in a particular year, for the companies in our dataset, has a direct bearing on the Rank of those companies.
Introduction to Spark MLlib
Apache Spark comes with a library named MLlib to perform Machine Learning tasks using the Spark framework. Since there is a Python API for Apache Spark, i.e., PySpark, you can also use this Spark ML library in PySpark. MLlib contains many algorithms and Machine Learning utilities.
In this tutorial, you will learn how to use Machine Learning in PySpark. The dataset of Fortune 500 is used in this tutorial to implement this. This dataset consists of the information related to the top 5 companies ranked by Fortune 500 in the year 2017. This tutorial will use the first five fields. You can download the dataset by clicking here.
The dataset looks like below:
Rank   Title                Website                            Employees   Sector
1      Walmart              http://www.walmart.com             2,300,000   Retail
2      Berkshire Hathaway   http://www.berkshirehathaway.com   367,700     Finance
3      Apple                http://www.apple.com               116,000     Technology
4      ExxonMobil           http://www.exxonmobil.com          72,700      Energy
5      McKesson             http://www.mckesson.com            68,000      Wholesale
In this Spark ML tutorial, you will implement Machine Learning to predict which one of the fields is the most important factor to predict the ranking of the above-mentioned companies in the coming years. Also, you will use DataFrames to implement Machine Learning.
Conclusion
In this article, we discussed PySpark vs. pandas to compare their performance, speed, memory consumption, and use cases. To learn more about programming, you can read this article on Spark vs. Hadoop. You might also like this article on the best Python debugging tools.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!
Data Scientists love working with PySpark as it helps streamline the overall process of deploying production-grade machine learning models from the prototyping stage. Data scientists across organizations claim that it helps them reduce the support required from the development team to scale machine learning models from prototyping to production. If you’re a data scientist or a machine learning engineer keen on getting your hands dirty with PySpark, you’re on the right page. This PySpark Machine Learning Tutorial is a beginner’s guide to building and deploying machine learning pipelines at scale using Apache Spark with Python.
Data scientists spend around 80% of their time wrangling and cleaning data, but as soon as we start to work with Big Data, Python pandas can become ineffective for large machine learning datasets. PySpark comes to the rescue here, as it provides features such as SQL querying, DataFrames, Streaming, Machine Learning, and other abstractions that help when working with Big Data. PySpark is an easy-to-use interface for writing Apache Spark code.
PySpark Linear Regression Example with Source Code
Now, the next step is to pass the training set to our model and evaluate it on the test data.
We can see that the R-squared value obtained is 0.77, or 77%, which is quite decent considering that we did not perform exploratory data analysis or hyperparameter tuning.
R-Squared (R² or the coefficient of determination) is a statistical term used in a regression model that helps to determine the proportion of variance in the target variable that the independent variable can explain. In other words, r-squared shows how well the data fits in the regression model.
How to Build Machine Learning Pipelines with PySpark?
PySpark provides several APIs to work with DataFrames that ease the job of Data Scientists when building machine learning pipelines.
PySpark Machine Learning Pipeline Fundamentals
DataFrame: The Machine Learning API uses DataFrames for data representation, a concept similar to the DataFrame in Python pandas.
Transformers: A Transformer is an algorithm that can transform one DataFrame into another DataFrame.
Estimator: An Estimator is an algorithm that is fit on a DataFrame to produce a trained model (a Transformer) when building machine learning models.
Pipeline: A Pipeline chains numerous Transformers and Estimators together to carry out a Machine Learning workflow, as shown in the sketch below.
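A minimal end-to-end sketch that ties these pieces together, reusing the insurance-style column and DataFrame names from earlier (all names are illustrative):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression

# Transformers and an Estimator that make up the workflow
indexer = StringIndexer(inputCol='smoker', outputCol='smoker_indexed')
assembler = VectorAssembler(inputCols=['age', 'BMI', 'children', 'smoker_indexed'], outputCol='features')
lin_reg = LinearRegression(featuresCol='features', labelCol='charges')

# Chain them into a single Pipeline and fit it on the training data
pipeline = Pipeline(stages=[indexer, assembler, lin_reg])
pipeline_model = pipeline.fit(train_df)

# The fitted PipelineModel applies every stage to new data in one call
predictions = pipeline_model.transform(test_df)
predictions.select('features', 'charges', 'prediction').show(5)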