Build machine learning applications that train quickly and run anywhere
Apache MXNet is a fast and scalable training and inference framework with an easy-to-use, concise API for machine learning.
MXNet includes the Gluon interface that allows developers of all skill levels to get started with deep learning on the cloud, on edge devices, and on mobile apps. In just a few lines of Gluon code, you can build linear regression, convolutional networks and recurrent LSTMs for object detection, speech recognition, recommendation, and personalization.
You can get started with MXNet on AWS with a fully managed experience using Amazon SageMaker, a platform to build, train, and deploy machine learning models at scale. Or, you can use the AWS Deep Learning AMIs to build custom environments and workflows with MXNet as well as other frameworks, including TensorFlow, PyTorch, Chainer, Keras, Caffe, Caffe2, and Microsoft Cognitive Toolkit.
Grab sample code, notebooks, and tutorial content at the GitHub project page.
Benefits of deep learning using MXNet
Ease-of-Use with Gluon
Greater Performance
For IoT & the Edge
Flexibility & Choice
Customer momentum
Case studies
There are over 500 contributors to the MXNet project including developers from Amazon, NVIDIA, Intel, Samsung, and Microsoft. Learn about how customers are using MXNet for deep learning projects. For more case studies, see the AWS machine learning blog and the MXNet blog.
Amazon SageMaker for machine learning
Amazon SageMaker is a fully-managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.
Spark is an open-source distributed computing framework widely used in data science to process large amounts of data. Its distributed computing capabilities make it ideal for cost-effectively analyzing large datasets, and its open-source framework ensures that data scientists have access to the latest innovations.
Developed at UC Berkeley’s AMPLab, Spark provides a unified API for working with diverse data sources, along with high-level libraries for machine learning, graph processing, and stream processing. This makes Spark well suited to data wrangling, preprocessing, and analysis in data science workflows.
In this article, we will explore Spark’s features and capabilities. Then, we will discuss why it is essential for processing Big Data. Finally, we will explore strategies for best utilizing Spark in your data science workflows.
Apache Spark for Big Data
Big Data refers to the vast amounts of structured, semi-structured, and unstructured data that organizations generate on a daily basis. It comes from various sources, such as social media, mobile devices, internet searches, and IoT devices. This data’s sheer volume, velocity, and variety make it challenging to store, process, and analyze using traditional methods.
To address these challenges, companies have turned to Big Data Analytics to extract valuable insights from their data. Big Data Analytics involves using advanced analytics techniques to process and analyze large and complex data sets.
One of the most popular tools for Big Data Analytics is Apache Spark.
Spark’s distributed computing capabilities allow companies to process large data sets quickly and efficiently. Spark also provides high-level libraries for machine learning, graph processing, and real-time data processing, making it a versatile tool for Big Data Analytics.
Additionally, using Spark for data science allows companies to process and analyze massive data sets in real time, giving them actionable insights and business intelligence.
Spark’s machine learning libraries enable companies to build predictive models that support better business decisions, while its graph processing capabilities help identify patterns and relationships in their data.
Big Data Analytics is a critical component of modern business strategy. Apache Spark is one of the most popular and versatile tools for processing and analyzing large and complex data sets.
Usable in Java, Scala, Python, and R.
MLlib fits into Spark’s APIs and interoperates with NumPy in Python (as of Spark 0.9) and R libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
High-quality algorithms, 100x faster than MapReduce.
Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
MLlib contains many algorithms and utilities.
ML algorithms include:
- Classification: logistic regression, naive Bayes, and more
- Regression: generalized linear regression, survival regression, and more
- Decision trees, random forests, and gradient-boosted trees
- Recommendation: alternating least squares (ALS)
- Clustering: K-means, Gaussian mixture models (GMMs), and more
- Topic modeling: latent Dirichlet allocation (LDA)
- Frequent itemsets, association rules, and sequential pattern mining
ML workflow utilities include:
- Feature transformations: standardization, normalization, hashing, and more
- ML Pipeline construction
- Model evaluation and hyper-parameter tuning
- ML persistence: saving and loading models and Pipelines
Other utilities include:
- Distributed linear algebra: SVD, PCA, and more
- Statistics: summary statistics, hypothesis testing, and more
Refer to the MLlib guide for usage examples.
MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each Spark release.
If you have questions about the library, ask on the Spark mailing lists.
MLlib is still a rapidly growing project and welcomes contributions. If you’d like to submit an algorithm to MLlib, read how to contribute to Spark and send us a patch!
To get started with MLlib:
This article was published as a part of the Data Science Blogathon.
Digital transformation generates massive amounts of data every second, and company servers often cannot bear the load. Storing and processing such volumes is hard, and harder still when the data is real-time or streaming. When Hadoop arrived, companies had to use MapReduce, which works only in Java and requires writing many lines of code. Spark was later released as a live data processing engine that can handle streaming data and apply machine learning and analytics on top of it. In this article, we will learn about Spark MLlib, a Python API for working with Spark and running machine learning models on massive amounts of data.
Spark is an open-source, distributed, unified analytics engine used for real-time data processing and fast cluster computing. Spark is popular for its in-memory computation, which increases data processing speed and makes it capable of handling huge amounts of data. Apache Spark improves on Hadoop in this respect: Hadoop’s MapReduce reads data from disk, forms key-value pairs, and writes intermediate results back to disk, which is time-consuming, whereas Spark keeps data in main memory (RAM), so compute time is reduced and operations run much faster.
Spark is written in Scala, a language that runs on the JVM. Spark provides high-level APIs so that we can use it from several languages, including Java, Python, Scala, and R. Working with Spark through Python is known as PySpark.
MLlib is Spark's machine learning library. It aims to make practical machine learning scalable and easy to implement. It provides tools for machine learning algorithms such as regression and classification, dimensionality reduction, transformation and feature extraction, pipelines (including tuning), saving and loading models, and utilities for linear algebra and statistics.
Spark MLlib has a DataFrame-based API; as of Spark 2.0, the RDD-based API entered maintenance mode, and the primary ML API is now the DataFrame-based API in the spark.ml package.
Spark provides a range of machine learning tools for performing different tasks and actions.
Spark supports several data types. Spark MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs.
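For illustration, here is a minimal sketch of these local data types using the pyspark.ml.linalg module (the values are arbitrary):

from pyspark.ml.linalg import Vectors, Matrices

dense_vec = Vectors.dense([1.0, 0.0, 3.0])              # local dense vector
sparse_vec = Vectors.sparse(3, [0, 2], [1.0, 3.0])      # size, non-zero indices, values
dense_mat = Matrices.dense(2, 2, [1.0, 3.0, 2.0, 4.0])  # 2x2 local matrix, column-major values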
An ML pipeline is built from stages such as transformers, estimators, and evaluators. Machine learning pipelines provide uniform, high-level APIs built on top of DataFrames and are used to create and tune practical machine learning workflows, mainly on structured data. A minimal pipeline sketch follows.
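As a minimal sketch (not from the original article), a pipeline chaining two transformers and one estimator could look like the following; training_df and test_df are hypothetical DataFrames with "text" and "label" columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")        # transformer: text -> tokens
hashing_tf = HashingTF(inputCol="words", outputCol="features")   # transformer: tokens -> feature vector
lr = LogisticRegression(maxIter=10, labelCol="label")            # estimator

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
model = pipeline.fit(training_df)        # fit all stages in order
predictions = model.transform(test_df)   # apply the fitted pipeline to new data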
We now know what Spark is and why organizations use it to process their data. To get hands-on, let us first install and set up Spark. We start by installing PySpark for use in a Jupyter notebook.
# ENVIRONMENT VARIABLES
SPARK_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
HADOOP_HOME = C:\spark\spark-2.3.1-bin-hadoop2.7
# PATH VARIABLE
C:\spark\spark-2.3.1-bin-hadoop2.7\bin
Running Spark on Google Colab
Running PySpark on Google Colab is very simple; just visit the Colab website and create a new Colab notebook. In the first cell, run the pip command below to install PySpark.
! pip install pyspark
Once the cell runs successfully, you are ready to use PySpark for the practical sections that follow.
Now that PySpark is installed, before diving into MLlib and building a machine learning model with Spark, let us refresh some PySpark basics and see how it works with DataFrames and processes data. PySpark is the tool through which we work with Spark using Python. Let us get some hands-on practice with PySpark.
Create Spark context
The Spark context is the main entry point to Spark features. It creates a connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one Spark context may be active per JVM. When creating a Spark context you can set an application name and a master, which we define here as local, and then create the SparkContext object. You can also create a Spark context without defining any configuration.
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pyspark_practice").setMaster("local")
sc = SparkContext(conf=conf)
To see all the configurations and system details currently in use, you can use the command below.
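A minimal sketch using the SparkContext API:

# List every configuration key/value pair of the active Spark context
print(sc.getConf().getAll())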
RDD stands for Resilient Distributed Dataset. An RDD in Spark is an immutable, distributed collection of objects; unlike a Pandas DataFrame, it is spread across the cluster rather than held on a single machine. Each RDD is split into multiple partitions, which may be computed on different cluster nodes. An RDD can be created in two ways: by loading an external dataset such as a CSV or text file, or by transforming one RDD into another.
Before applying a machine learning algorithm to an RDD (dataset), we perform tasks such as handling missing values, dealing with categorical data, and feature scaling. These tasks are carried out through operations such as sorting, filtering, and summarizing. In Spark, operations are divided into two kinds: transformations and actions. First, we will create an RDD and then walk through the operations we can perform on it.
To create an RDD, the parallelize function is used. It accepts a Python collection such as a list of numbers, strings, tuples, or dictionaries.
names = sc.parallelize(['Shubham', 'rishi', 'prayag', 'shivam', 'rahul', 'Madhav', 'Nihal', 'sourik', 'Rishabh', 'Yash', 'kaushik', 'shivani'])
print(type(names))
The second way to create an RDD or DataFrame is from an external file, which we read by referencing its path.
# Reading a CSV file
csv_file = spark.read.csv('/content/students_placement.csv', inferSchema=True, header=True)
## Reading a TXT file
# txt_file = spark.read.text("example.txt")
## Reading a JSON file
# json_file = spark.read.json("sample.json", multiLine=True)
Actions execute the scheduled work on the dataset. Applying a transformation only builds a DAG; when we apply an action, the tasks actually run and an output is produced. We will look at some commonly used actions.
1. Collect
Collect is an action that returns all the values of the RDD to the driver as a list. If you only apply transformations, nothing is displayed until an action such as collect is called.
names.collect()
2. countByValue
If you want the count of each distinct value in the data, you can use this action. Alternatively, you can use the simple count action, which returns the total number of elements.
names.countByValue()
3. foreach
foreach takes a function and applies it to each element of the RDD, typically for side effects such as writing logs.
def f(x):
    print(x)

a = sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: f(x))
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka", "spark vs hadoop", "pyspark", "pyspark and spark"])
fore = words.foreach(lambda x: x.startswith('p'))
4. Take
We avoid the collect action in production environments because it brings the complete dataset from the cluster into driver memory, which can exhaust it. Instead, we use the take action to fetch only the required number of rows.
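For example, with the names RDD created earlier:

names.take(3)  # returns only the first three elements to the driver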
5. First
The first action returns the first element from an RDD.
names.first()
6. Glom
glom transforms each partition into a list of its elements, producing an RDD with one list per partition. Here we create an RDD with 3 partitions: collect returns a single flat list, while glom().collect() returns the data grouped into 3 lists, one per partition.
nums = sc.parallelize([1,3,5,3,4,2,5,7], 3) nums.glom().collect()
7. Reduce
It works like Python's reduce function: it takes a function and combines all the elements of the RDD according to it. Below is an example that adds all the data using the reduce action.
sample_rdd = sc.parallelize([2,4,6,8]) print(sample_rdd.reduce(lambda x, y: x+y))
8. saveAsTextFile
To persist the resulting RDD to a text file, we use saveAsTextFile, specifying the path where the RDD should be saved. This is mainly used to save results and analysis when working with a large amount of data.
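A minimal sketch with a hypothetical output path (the target directory must not already exist):

sample_rdd.saveAsTextFile("/tmp/sample_rdd_output")  # writes one part file per partition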
Transformations help us shape the dataset. They are lazily evaluated: each transformation creates a new RDD, and its contents are only computed when an action is performed on it. Spark builds a DAG (directed acyclic graph) of transformations and keeps extending it until an action runs, which is why transformations are called lazy. Because RDDs are immutable, we cannot modify an existing RDD; a transformation takes an RDD as input and produces a new RDD. We will discuss some of the most used RDD transformations.
1. Map
As the name suggests, the map transformation applies a function to each element of the input RDD and returns the results. For example, if we want to add 10 to each element of an RDD, map makes this easy.
sample_rdd.map(lambda x: x+10).collect()
2. Filter Transformation
The filter transformation keeps RDD elements that satisfy a condition. It accepts a function, applies it to each element, and includes the element in the new RDD only if the condition holds.
s_rdd = sc.parallelize([1, 2, 4, 5, 7, 8, 2])
print(s_rdd.filter(lambda x: x % 2 == 0).collect())

filter_rdd_2 = sc.parallelize(['Rahul', 'Swati', 'Rohan', 'Shreya', 'Priya'])
filter_rdd_2.filter(lambda x: x.startswith('R')).collect()
3. Union Transformation
Like the UNION operation in SQL, the union transformation takes two RDDs and combines them into a single RDD containing the elements of both.
union_inp = sc.parallelize([2, 4, 5, 6, 7, 8, 9])
union_rdd_1 = union_inp.filter(lambda x: x % 2 == 0)
union_rdd_2 = union_inp.filter(lambda x: x % 3 == 0)
print(union_rdd_1.union(union_rdd_2).collect())
4. FlatMap Transformation
It performs the same operation as map, except that flatMap flattens the results, returning the individual values produced for each element of the original RDD.
ft_rdd = sc.parallelize(["Hey there", "This is Pyspark Rdd transformation"])
ft_rdd.flatMap(lambda x: x.split(" ")).collect()
5. Join Transformation
The join transformation returns an RDD of pairs with matching keys and the values for each key from both RDDs. In simple terms, it joins two key-value RDDs on their keys and groups the corresponding values together.
x = sc.parallelize([('Spark', 1), ('Hadoop', 4)])
y = sc.parallelize([('Spark', 2), ('Hadoop', 5)])
print(x.join(y).collect())
6. Distinct Transformation
It returns an RDD containing only the distinct elements of the source RDD.
sample_rdd.distinct().collect()
Next, we will walk through a simple demo of building a linear regression model using Spark MLlib, covering each step of the ML project lifecycle, including preparing and processing the data. We use a simple student grades dataset, which you can download from here.
Install all dependencies and start the spark session
A SparkSession serves the same role as the Spark context: it is the entry point for working with DataFrames and Datasets, and it was introduced in Spark 2.0.
# Start Spark Session
!pip install findspark
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
The builder attribute gives you access to builder APIs for configuring the session. getOrCreate returns an existing SparkSession if one is already active; otherwise it creates a new one.
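As an illustrative sketch (the application name and configuration value are hypothetical), the builder can also set options before the session is created:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("grades_regression")                 # hypothetical application name
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")  # example configuration option
         .getOrCreate())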
Read Dataset
I hope you have downloaded the dataset. If you are working in a local Jupyter notebook, you can read the dataset directly with the Spark read command. If you are working in Colab, upload the file using the left navigation pane, or use the command below.
from google.colab import files
files.upload()
Now you can call the Spark CSV reader with the file path to load the dataset.
data = spark.read.csv('Student_Grades_Data.csv', header=True, inferSchema=True)
Display the data
To take a quick look at a few rows of data, we can use the show function. You can also use the printSchema function to print the data type of each column and see what types inferSchema (set to True) assigned.
data.printSchema()
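For example, to display the first few rows in tabular form:

data.show(5)  # show the first five rows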
Separate the Independent columns
Now we will create the feature array by omitting the last (dependent) column of the dataset. Recall that to train a machine learning model we feed it features and labels, so that it can later predict a label for new features.
# Create a feature array by omitting the last column
feature_cols = data.columns[:-1]

from pyspark.ml.feature import VectorAssembler
vect_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Use the assembler created above to add the feature column
data_w_features = vect_assembler.transform(data)
VectorAssembler is a transformer that assembles multiple numeric columns into a single feature vector column. Now select only the features and the label to create the machine learning model.
finalized_data = data_w_features.select("features", "Grades")
finalized_data.show()
Train-Test Split
Now we split the prepared data into two sets: a train set used to train the model and a test set used to evaluate how the model performs on unseen data.
train_dataset, test_dataset = finalized_data.randomSplit([0.7, 0.3])
You can also compute summary statistics for the dataset; just as we use describe in Pandas, the same can be done in Spark.
# Peek into the training data
train_dataset.describe().show()
Linear Regression Model Creation
The process with Spark MLlib is the same as with scikit-learn: first import the model class, then create its object, defining the parameters.
# Import the Linear Regression class called LinearRegression
from pyspark.ml.regression import LinearRegression
LinReg = LinearRegression(featuresCol="features", labelCol="Grades")
Model training and testing
In model training, the input features and the correct labels are fed to the model using the fit function. To obtain predictions on the unseen test dataset, the evaluate function is used.
# Train the model on the training data using the fit() method
model = LinReg.fit(train_dataset)

# Predict the Grades on the test data using the evaluate() method
pred = model.evaluate(test_dataset)
Print coefficients and Intercept
A simple linear regression model fits a straight line, computing the coefficient from the covariance and variance of the data. To display the coefficients and intercept, you can use the commands below.
# Find the coefficient value
coefficient = model.coefficients
print("The coefficient of the model is : %a" % coefficient)

# Find the intercept value
intercept = model.intercept
print("The Intercept of the model is : %f" % intercept)
Evaluate Model using Metric
The error is the difference between the actual and predicted values, and metrics help us evaluate the model: how accurate it is, and where it performs best and worst. Here we calculate MAE, MSE, RMSE, and R-squared.
# Evaluate the model using metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and R-squared
from pyspark.ml.evaluation import RegressionEvaluator
evaluation = RegressionEvaluator(labelCol="Grades", predictionCol="prediction")

# Root Mean Square Error
rmse = evaluation.evaluate(pred.predictions, {evaluation.metricName: "rmse"})
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = evaluation.evaluate(pred.predictions, {evaluation.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = evaluation.evaluate(pred.predictions, {evaluation.metricName: "mae"})
print("MAE: %.3f" % mae)

# R2 - coefficient of determination
r2 = evaluation.evaluate(pred.predictions, {evaluation.metricName: "r2"})
print("r2: %.3f" % r2)
Spark is a big data processing engine that helps us work with huge amounts of data in real time. Machine learning is one of the capabilities Spark provides, letting us analyze and build ML-based systems on large volumes of data. In this article, we studied Spark machine learning, PySpark, and PySpark MLlib. Here are a few key takeaways to remember about Spark and MLlib.
Thank You Note
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Apache MXNet on AWS
Prepare and visualize data for ML algorithms
In supervised learning, such as a regression algorithm, you typically define a label and a set of features. In this linear regression example, the label is the 2015 median sales price and the feature is the 2014 population estimate. That is, you use the feature (population) to predict the label (sales price).
First, drop rows with missing values and rename the feature and label columns, replacing spaces with underscores ('_').
from pyspark.sql.functions import col

data = data.dropna()  # drop rows with missing values
exprs = [col(column).alias(column.replace(' ', '_')) for column in data.columns]
To simplify the creation of features, register a UDF to convert the feature (2014_Population_estimate) column to a VectorUDT type and apply it to the column.
from pyspark.ml.linalg import Vectors, VectorUDT

spark.udf.register("oneElementVec", lambda d: Vectors.dense([d]), returnType=VectorUDT())
tdata = data.select(*exprs).selectExpr("oneElementVec(2014_Population_estimate) as features", "2015_median_sales_price as label")
Then display the new DataFrame:
display(tdata)
Spark vs. Hadoop: Which one is better?
- Performance
Performance is one of the primary differences between Apache Spark and Hadoop. Apache Spark can process data in memory, while Hadoop writes the results of each intermediate step to disk.
Hadoop can read data from the local HDFS, but its performance falls short of Spark's in-memory processing. For iterative workloads, Apache Spark can be up to 100 times faster than Hadoop.
- Data processing
Apache Spark offers far more than Hadoop's batch processing, and the two frameworks approach data in different ways. Hadoop uses MapReduce to batch and distribute sizable datasets across a cluster for parallel data processing.
Apache Spark, on the other hand, offers real-time streaming processing and graph analysis. Hadoop must be used in conjunction with other technologies to accomplish such capability.
- Real-time processing
Hadoop only supports batch processing using MapReduce. It does not support real-time processing. This makes it difficult to use Hadoop for applications that require low-latency execution.
Apache Spark offers low-latency processing, delivering results in near real time via Spark Streaming. It is designed for high scalability and fault tolerance.
Spark can read streams from the stock market, Twitter, or Facebook, which include millions of events each second.
- Cost
Both Spark and Apache Hadoop are open source, so companies using them do not pay a licensing fee. Nonetheless, infrastructure and development costs must still be accounted for.
Hadoop is the less expensive choice because it is disk-based and runs on commodity hardware. Apache Spark workloads, on the other hand, are memory-intensive and demand a lot of RAM, which drives up infrastructure cost.
- Security
Spark offers only shared-secret password authentication, which makes it slightly less secure than MapReduce. Hadoop MapReduce is more secure thanks to Kerberos, and it also supports the more conventional file-permission model with Access Control Lists (ACLs).
- Fault tolerance
Spark is fault-tolerant, so if part of the application fails, there is no need to start over from the beginning. MapReduce is fault-tolerant as well, so a failure likewise does not require the application to start from scratch.
Which one to choose?
The debate between Hadoop and Spark for big data processing is common. The choice of which platform to use depends largely on the specific needs and requirements of the project at hand. Here are some key factors to consider when deciding which platform is best:
Apache Spark
- Runs as a stand-alone utility without Apache Hadoop.
- Provides scheduling, I/O capabilities, and distributed job dispatching.
- Supports several languages, including Python, Java, and Scala.
- Provides fault tolerance and implicit data parallelism.
You may choose Apache Spark if you are a data scientist focusing largely on machine learning methods and big data processing.
Apache Hadoop
- Provides a comprehensive framework for the processing and storing of large amounts of data.
- Offers a staggering variety of products, including Apache Spark.
- Builds on a distributed, scalable, and portable file system.
- Utilizes additional programs for parallel computing, machine learning, and data warehousing.
You may choose Apache Hadoop if you need a wide range of data science tools for processing and storing massive amounts of data.
What is Apache Spark?
Apache Spark is a popular open-source distributed computing framework. It enables the processing of large-scale data sets across multiple nodes in a cluster. It was originally developed at the University of California, Berkeley’s AMPLab, in 2009 and later donated to Apache Software Foundation (ASF). Spark enables users to access data quickly from HDFS and other sources like S3, MySQL, Cassandra, and MongoDB.
Spark can run in Hadoop clusters through YARN or stand-alone mode without any extra installation. The main features of Spark include its speed and scalability, which make it ideal for iterative machine-learning algorithms used by data scientists.
With the help of in-memory caching, Spark can process queries at lightning speeds compared to MapReduce jobs.
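As a rough illustration of caching (the file and column names are hypothetical):

df = spark.read.csv("transactions.csv", header=True, inferSchema=True)  # hypothetical input
df.cache()                           # mark the DataFrame for in-memory storage
df.count()                           # first action materializes the cache
df.filter("amount > 100").count()    # later queries reuse the cached data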
Spark provides a library of machine learning algorithms that can be used to create useful insights from data quickly and cost-effectively. It also offers support for graph analytics, making it easier to analyze the relationships between different elements in unstructured data sets.
Spark is helping organizations revolutionize how they approach data processing and analysis, unlocking the potential to make better decisions faster.
Busting Apache Spark Myths
#1. Apache Spark is a database.
To be clear, Apache Spark is not a database. It is a framework designed for distributed computing that lets users process data in parallel across a cluster of machines. Although it is not a database itself, it can read data from various sources such as databases, file systems, and streaming sources.
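For instance, a sketch of reading from a relational database over JDBC (connection details are hypothetical, and the matching JDBC driver must be on the classpath):

jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/sales")  # hypothetical database
           .option("dbtable", "public.orders")
           .option("user", "reader")
           .option("password", "secret")
           .load())
jdbc_df.show(5)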
#2. Apache Spark can replace Hadoop.
Apache Spark is not a replacement for Hadoop. While Spark can run on Hadoop clusters, it is not a Hadoop replacement. Rather, it is a complementary tool that provides faster and more efficient data processing than Hadoop’s MapReduce algorithm.
#3. Apache Spark can solve all big data problems.
Apache Spark is not useful for all big data problems. It is best suited for tasks that require complex data processing. These include machine learning, graph processing, and real-time data processing. It may not be the best tool for simpler tasks, and users may be better off with a more straightforward solution.
#4. Apache Spark is a programming language.
Apache Spark is not a programming language. While it provides APIs for multiple programming languages, such as Python, Java, and Scala, it is not a programming language itself.
The Architecture of Spark
Spark follows the Master/Slave architecture where the master node controls and manages slaves. Its main components are:
- Apache Spark Core
The primary execution engine for the Spark platform, on which all other functionalities are built, is called Spark Core. Fault tolerance, in-memory computing, resource management, and access to external data sources are all features of the Spark core.
For simplicity of development, it also offers a distributed execution infrastructure that supports a wide range of programming languages, including Python, Java, and Scala.
- Spark SQL
The Spark SQL module lets you work with structured data. You can use Spark SQL to query structured data inside Spark programs, and Java, Python, R, and SQL are all supported.
Hive, Avro, Parquet, JSON, and JDBC are just a few of the data sources to which Spark SQL can connect. Moreover, Spark SQL supports the HiveQL syntax, enabling access to existing Hive warehouses.
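A minimal sketch of querying structured data with Spark SQL (the input file is hypothetical):

df = spark.read.json("people.json")   # hypothetical JSON source
df.createOrReplaceTempView("people")  # expose the DataFrame to SQL
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()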
- Spark Streaming
Spark Streaming uses Spark's fault-tolerance semantics out of the box. It builds on the Spark Core API to perform real-time, interactive data analytics, and it can be coupled with popular data sources such as HDFS, Flume, Kafka, and Twitter.
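A minimal word-count sketch using the classic DStream API, assuming text is being written to a local socket on port 9999:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)                    # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical socket source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                  # print each batch's counts

ssc.start()
ssc.awaitTermination()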
- MLlib (Machine Learning Library)
MLlib, the scalable machine learning library built on top of the Spark Core API, delivers high-quality algorithms.
MLlib is simple to integrate into Hadoop workflows, since it can work with any Hadoop data source, including HDFS, HBase, or local storage. The library is usable from Java, Scala, and Python in Spark applications.
- GraphX
The Spark API for graph-parallel processing is called GraphX. Users can construct and alter graph-structured data interactively and at scale with GraphX.
Conclusion
Spark has a promising future because it outperforms many big data solutions now in use. Even in workloads where MapReduce has traditionally been used, Apache Spark can operate up to 100 times faster. While some experts predict Spark will eventually displace Hadoop, others assert that the two complement one another's strengths.
Spark for Data Science will be used more due to its adaptability, emphasis on Big Data insights, and backing from numerous businesses. Businesses will continue to leverage Apache Spark for its edge in speed, scalability, and machine learning advantages.
With its open-source roots, an ever-growing library of supporting tools and services, and impressive performance gains over legacy MapReduce-based systems, Spark has quickly become a staple tool in the data science community.
Machine Learning Library (MLlib) Guide
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.
Announcement: DataFrame-based API is primary API
The MLlib RDD-based API is now in maintenance mode.
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
What are the implications?
- MLlib will still support the RDD-based API in spark.mllib with bug fixes.
- MLlib will not add new features to the RDD-based API.
- In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
Why is MLlib switching to the DataFrame-based API?
- DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
- The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
- DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
What is “Spark ML”?
- “Spark ML” is not an official name but is occasionally used to refer to the MLlib DataFrame-based API. This is mainly due to the org.apache.spark.ml Scala package name used by the DataFrame-based API, and the “Spark ML Pipelines” term we used initially to emphasize the pipeline concept.
Is MLlib deprecated?
- No. MLlib includes both the RDD-based API and the DataFrame-based API. The RDD-based API is now in maintenance mode. But neither API is deprecated, nor MLlib as a whole.
Dependencies
MLlib uses the linear algebra packages Breeze and dev.ludovic.netlib for optimised numerical processing [1]. Those packages may call native acceleration libraries such as Intel MKL or OpenBLAS if they are available as system libraries or in runtime library paths.
However, native acceleration libraries can’t be distributed with Spark. See MLlib Linear Algebra Acceleration Guide for how to enable accelerated linear algebra processing. If accelerated native libraries are not enabled, you will see a warning message like below and a pure JVM implementation will be used instead:
WARNING: Failed to load implementation from: dev.ludovic.netlib.blas.JNIBLAS
To use MLlib in Python, you will need NumPy version 1.4 or newer.
Highlights in 3.0
The list below highlights some of the new features and enhancements added to MLlib in the 3.0 release of Spark:
- Multiple columns support was added to Binarizer (SPARK-23578), StringIndexer (SPARK-11215), StopWordsRemover (SPARK-29808) and PySpark QuantileDiscretizer (SPARK-22796).
- Tree-Based Feature Transformation was added (SPARK-13677).
- Two new evaluators, MultilabelClassificationEvaluator (SPARK-16692) and RankingEvaluator (SPARK-28045), were added.
- Sample weights support was added in DecisionTreeClassifier/Regressor (SPARK-19591), RandomForestClassifier/Regressor (SPARK-9478), GBTClassifier/Regressor (SPARK-9612), MulticlassClassificationEvaluator (SPARK-24101), RegressionEvaluator (SPARK-24102), BinaryClassificationEvaluator (SPARK-24103), BisectingKMeans (SPARK-30351), KMeans (SPARK-29967) and GaussianMixture (SPARK-30102).
- R API for PowerIterationClustering was added (SPARK-19827).
- Added Spark ML listener for tracking ML pipeline status (SPARK-23674).
- Fit with validation set was added to Gradient Boosted Trees in Python (SPARK-24333).
- RobustScaler transformer was added (SPARK-28399).
- Factorization Machines classifier and regressor were added (SPARK-29224).
- Gaussian Naive Bayes Classifier (SPARK-16872) and Complement Naive Bayes Classifier (SPARK-29942) were added.
- ML function parity between Scala and Python (SPARK-28958).
- predictRaw is made public in all the Classification models. predictProbability is made public in all the Classification models except LinearSVCModel (SPARK-30358).
Migration Guide
The migration guide is now archived on this page.
[1] To learn more about the benefits and background of system optimised natives, you may wish to watch Sam Halliday’s ScalaX talk on High Performance Linear Algebra in Scala.
Apache Spark™ Tutorial: Getting Started with Apache Spark on Databricks
Visualize the model
As is typical for many machine learning algorithms, you want to visualize the scatterplot. Since Databricks supports pandas and ggplot, the code below creates a linear regression plot using pandas DataFrame (pydf) and ggplot to display the scatterplot and the two regression models.
# Import numpy, pandas, and ggplot
import numpy as np
from pandas import *
from ggplot import *

# Create Python lists from the Spark data
pop = data.map(lambda p: (p.features[0])).collect()
price = data.map(lambda p: (p.label)).collect()
predA = predictionsA.select("prediction").map(lambda r: r[0]).collect()
predB = predictionsB.select("prediction").map(lambda r: r[0]).collect()

# Create a Pandas DataFrame
pydf = DataFrame({'pop': pop, 'price': price, 'predA': predA, 'predB': predB})

# Create scatter plot and two regression models (scaling exponential) using ggplot
p = ggplot(pydf, aes('pop', 'price')) + geom_point(color='blue') + geom_line(pydf, aes('pop', 'predA'), color='red') + geom_line(pydf, aes('pop', 'predB'), color='green') + scale_x_log10() + scale_y_log10()
display(p)
We also provide a sample notebook that you can import to access and run all of the code examples included in the module.
Apache Spark – Machine Learning with PySpark and MLlib
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
df = spark.createDataFrame(data, ["label", "features"])

# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)

# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)

# Fit the model to the training data
model = rf.fit(train_df)

# Generate predictions on the test dataset
model.transform(test_df).show()
df = spark.read.csv("accounts.csv", header=True)

# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

# Generate summary statistics
filtered_df.summary().show()
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-sql
Explore the exciting world of machine learning with this IBM course.
Start by learning ML fundamentals before unlocking the power of Apache Spark to build and deploy ML models for data engineering applications. Dive into supervised and unsupervised learning techniques and discover the revolutionary possibilities of generative AI through instructional readings and videos.

Gain hands-on experience with Spark Structured Streaming, develop an understanding of data engineering and ML pipelines, and become proficient in evaluating ML models using SparkML. In practical labs, you'll use SparkML for regression, classification, and clustering, enabling you to construct prediction and classification models. Connect to Spark clusters, analyze SparkSQL datasets, perform ETL activities, and create ML models using Spark ML and scikit-learn. Finally, demonstrate your acquired skills through a final assignment.

This intermediate course is suitable for aspiring and experienced data engineers, as well as working professionals in data analysis and machine learning. Prior knowledge of Big Data, Hadoop, Spark, Python, and ETL is highly recommended for this course.
This article was published as a part of the Data Science Blogathon.
The digital transformation has given rise to the release of massive amounts of data each second, and companies’ servers are not that powerful to bear the load. It is tough to store and process a massive amount of data and more difficult when we have real-time or streaming data. When Hadoop came into the picture, then the companies need to use MapReduce, which only works in Java and needs to write many lines of code. After that, Spark, a live data processing tool, was released that helps to process live data and apply various machine learning and analytics on top of it. In this article, we will learn about Spark MLLIB, a python API to work on spark and run a machine learning model on top of the massive amount of data.
Spark is an open-source, distributed, unified analytics engine used for real-time data processing and acts as a faster cluster computing framework. Spark is popular due to its in-memory computation power, which increases the data processing speed and makes it capable to handle huge amounts of data. Apache spark is an advanced version of Hadoop because Hadoop is a framework that uses map-reduce for the processing, which reads the data from disk and forms key-value pair so if we read data from disk, process it, and write it again to disk so it is very time-consuming and spark does all things in main memory means data store in RAM was to compute time reduces and operations happen very fast.
Spark is built on Scala, an advanced Java version that runs on JVM. Spark provides high-level APIs through which we can code and use spark in any language, including Java, Python, Scala, R, etc. And working with spark through Python is known as Pyspark.
MLLIB stands for Machine learning library in Spark. This library aims to make practical machine learning scalable and easy to implement. It provides tools to implement all machine learning algorithms, including Regression, classification, dimensionality reduction tools, transformation, feature extraction, pipelines (tunning), save and load algorithm, and utilities for linear algebra and statistics.
When we talk about spark MLLIB so, it has a dataframe-based API, and as of spark 2 onwards, the RDD-based API entered the maintenance phase, and the primary ML API is now a dataframe-based API that is a spark. ml.0
Spark provides a different set of machine learning tools to perform different tasks and take different actions.
Spark supports different data types. Spark MLLIB supports local vectors and Matrices stored on a single machine and distributed matrices. So it supports many data types packed with one or many RDDs.
When we talk about ML pipelines, it is all about understanding different stages, including estimator, evaluator, transformer, etc. Machine learning pipelines provide uniform high-level APIs built on top of data frames. It is used to create and tune practical machine learning pipelines. It is mainly used with structured data.
We now know about spark and why today it is used by each organization to process their data. To get hands-on practical knowledge about spark let up first install and set up complete spark on our system. First, we are installing Pyspark on the Jupyter notebook.
#ENVIRONMENT VARIABLE SPARK_HOME = C:sparkspark-2.3.1-bin-hadoop2.7 HADOOP_HOME = C:sparkspark-2.3.1-bin-hadoop2.7 #PATH VARIABLE C:sparkspark-2.3.1-bin-hadoop2.7bin
Running Spark on Google Colab
Running Pyspark on Google colab is very simple; you must visit the collab website and create a new Colab Notebook. In the first cell run the below PIP command to install Pyspark.
! pip install pyspark
As the cell successfully runs and you are good to go to use Pyspark for further practicals.
We have installed PySpark on our system so before directly using MLLIB and developing a machine learning model using Spark, let us refresh some basic Pyspark and how it works with data frames and process data. Pyspark is a tool through which we can work with spark using Python as a programming language. Let us give some hands-on practice to Pyspark.
Create Spark context
spark context is the main entry to use the spark feature. It will create a connection with a spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. Only one spark context may be active per JVM. While creating spark context you have to set the app name and a master name, which we have defined as local. And we create an object of spark context. Without defining any configurations also, you can create a spark context.
from pyspark import SparkContext, SparkConf conf = SparkConf().setAppName(“pyspark_practice”).setMaster(“local”) sc = SparkContext(conf=conf)
To see all the configurations, you can use the below command and read all the system details.
RDD stands for Resilient distributed datasets. An RDD in spark is simply an immutable distribution of objects. Spark RDDs are the same as Pandas Dataframes. Each RDD is split into multiple partitions, which may be computed on different cluster nodes. RDD can be created in two ways by loading some external dataset like CSV, excel files, or you can also create it by transforming one RDD into another.
When we have RDD (dataset) so before applying a machine learning algorithm, we perform the different tasks on data like handling missing values, dealing with categorical data, and feature scaling. And all these tasks are known as operations, and operations can be anything like sorting, filtering, summarizing, etc. In spark, operations are divided into 2 things as Transformation and Actions. First, we will create one RDD and learn different operations that we can perform on it.
To create RDD parallelize function is used that accepts a list in which you can simply have a collection of numbers, strings, tuples, dictionary
names = sc.parallelize([‘Shubham’,’rishi’,’prayag’,’shivam’,’rahul’,’Madhav’,’Nihal’,’sourik’,’Rishabh’, ‘Yash’, ‘kaushik’,’shivani’]) print(type(names))
The second way to create an RDD is from an external file in which we read some files by referencing its URL.
#Reading any csv file csv_file = spark.read.csv(‘/content/students_placement.csv’, inferSchema = True, header = True)
## Reading TXT file #txt_file = spark.read.text(“example.txt”)
## Reading JSON file #json_file = spark.read.json(“sample.json”, multiLine=True)
Actions are used to execute the scheduled task on the dataset because when we apply the transformation, It only creates a DAG and when we act then tasks task tasks tasked display an output. We will study some popular actions used on the dataset.
1. Collect
This is the first action that will display all values right away. It has created a list. If you will perform the transformation, then nothing will be displayed.
names.collect()
source – SS Of Code output
2. count By Value
If you want a count of a particular value in data, then you can use this action. The alternative to this function you can also use a simple count function which is also one action.
names.countByValue()
3. For Each
It is a unique operation that takes each value and applies a function to it to perform a certain task. It is used to create logs.
def f(x): print(x) a = sc.parallelize([1,2,3,4,5]).foreach(lambda x: f(x))
words = sc.parallelize ([“scala”, “java”,”hadoop”, “spark”, “akka”,”spark vs hadoop”,”pyspark”, “pyspark and spark”]) fore = words.foreach(lambda x: x.startswith(‘p’))
4. Take
we do not use the collect function in a production environment because it gives me complete details of data in a cluster that can collapse the memory. After all, complete data will come in memory so we use the take function to intake the required number of rows.
5. First
The first action returns the first element from an RDD.
names.first()
6. Glom
It transforms each partition into a tuple of elements. It will create an RDD of tuples so it will store 1 tuple per RDD. We use GLOM function to make each partition into a tuple. We have created a 3 partition, and when you use to collect, then it will same as 1 partition, but when you use glom function, then it will be stored in 3 partitions.
nums = sc.parallelize([1,3,5,3,4,2,5,7], 3) nums.glom().collect()
7. Reduce
It is the same as the python reduce function, which accepts a function and reduces all the elements per a particular function. Below is an example of adding all the data using reduced action.
sample_rdd = sc.parallelize([2,4,6,8]) print(sample_rdd.reduce(lambda x, y: x+y))
7. save RDD as a Text File
To serve the resultant RDD into a text file, we use to save ait n RDD as a text file. You can specify the path where you want the RDD to be saved. This function is mainly used to save our results and analysis when working with a large amount of data.
Transformation helps us to shape our dataset. Changes are lazily evaluated because whenever we use transformation then it will create a new RDD and you can display the content of the new RDD only when you perform any action on it. It will create DAG(directed acyclic graph) and keep building the graph till you perform any action on it that is why it is called lazy. As RDD is immutable so we cannot make any change in the existing RDD so transformation takes an RDD as input and generate another RDD. we will discuss some of the most used RDD transformations.
1. Map
As the name suggests, the Map transformation maps a certain value to the elements of the input RDD. The map takes a function as a parameter and applies the function to each RDD element. For example, if we want to add 10 to each element in RDD then using MAP, in this case, will be easy and handy.
sample_rdd.map(lambda x: x+10).collect()
2. Filter Transformation
Filter transformation filters out the RDD elements according to certain conditions. It accepts a function and applies that function to each element and if the element meets the condition then added to a new RDD and creates a new RDD.
s_rdd = sc.parallelize([1,2,4,5,7,8,2]) print(s_rdd.filter(lambda x: x%2 == 0).collect())
filter_rdd_2 = sc.parallelize([‘Rahul’, ‘Swati’, ‘Rohan’, ‘Shreya’, ‘Priya’]) filter_rdd_2.filter(lambda x: x.startswith(‘R’)).collect()
3. Union Transformation
We have read the Union function in SQL and its task is the same as it accepts the 2 RDD and combines them to generate one single RDD, which is a combination of both the RDD.
union_inp = sc.parallelize([2,4,5,6,7,8,9]) union_rdd_1 = union_inp.filter(lambda x : x%2==0) union_rdd_2 = union_inp.filter(lambda x: x%3 == 0) print(union_rdd_1.union(union_rdd_2).collect())
4. FlatMapTransformation
It performs the same operation as the Map operation except for the fact that flat map transformation return separate (flatter) values for each element from the original RDD.
ft_rdd = sc.parallelize([“Hey there”, “This is Pyspark Rdd transformation”]) ft_rdd.flatMap(lambda x: x.split(” “)).collect()
5. Join Transformation
The operation returns an RDD with a pair of elements with the matching keys and the values for that key. In simple words, it joins two RDD based on certain keys and keeps their values in a list.
x = sc.parallelize([(‘Spark’, 1), (‘Hadoop’, 4)]) y = sc.parallelize([(‘Spark’, 2), (‘Hadoop’, 5)]) print(x.join(y).collect())
6. Distinct Transformation
It returns the distinct values in a specified column from an RDD.
sample_rdd.distinct().collect()
We will learn a simple demo of developing a simple linear regression using spark MLLIB. We will walk through each step of the ML project lifecycle including preparing and processing data. We are using a simple dataset which is a student grade dataset that you can download from here.
Install all dependencies and start the spark session
spark session is the same as the spark context which is used as an entry point to start working with the dataframe and datasets which were introduced when spark 2. O was introduced.
# Start Spark Session !pip install findspark import findspark findspark.init() from pyspark.sql import SparkSession spark = SparkSession.builder.master(“local[*]”).getOrCreate()
The builder method gives you access to builder APIs that you can use to configure the session. Get or create is used when you want to share a spark context.
Read Dataset
I hope that you have downloaded the dataset. And If you are working on the local Jupyter Notebook, then directly use the spark read command to read the dataset. If working on Collab, then you have to upload a file using the left navigation pane, or you can also use the below command.
from google.colab import files files.upload()
Now you can run the spark Read CSV file function and a file path to load the dataset.
data = spark.read.csv(‘Student_Grades_Data.csv’, header=True, inferSchema=True)
Display the data
To have a gentle look over a few rows of data we can use the show function. Apart from this, you can use the print schema function to print the data type of each column to observe what data type infer schema has set for a set column by keeping it True.
data.printSchema()
Separate the Independent columns
Now we will create the feature array by omitting the last column or dependent column of the dataset. If you remember that to train a machine learning model, we want to feed features and labels to predict a label for new features.
#create a feature array by omitting the last column feature_cols = data.columns[:-1] from pyspark.ml.feature import VectorAssembler vect_assembler = VectorAssembler(inputCols = feature_cols, outputCol=”features”) #Utilize Assembler created above in order to add the feature column data_w_features = vect_assembler.transform(data)
Vector assembler is a transformer that assembles all the features into one vector from multiple columns that are of type double you can also observe in the below-given figure. Now select only the feature and label to create the machine learning model.
finalized_data = data_w_features.select(“features”,”Grades”) finalized_data.show()
Train-Test Split
Now we will split the prepared final data into two sets train set and a test set where the train set is used for model training, and the test set is used to evaluate the model, like how it is performing on unknown features.
train_dataset, test_dataset = finalized_data.randomSplit([0.7, 0.3])
You can also statistically analyze the dataset like in Pandas we directly use to describe the function. The same can be used with Spark also.
#Peek into training data train_dataset.describe().show()
Linear Regression Model Creation
The process with Spark MLLIB is the same as you perform with sciket-learn, which is first importing the model and creating its object defining the parameters.
#Import Linear Regression class called LinearRegression from pyspark.ml.regression import LinearRegression LinReg = LinearRegression(featuresCol=”features”, labelCol=”Grades”)
Model training and testing
In model training, the input data and some correct labels are fed to a model, which is implemented using the fit function. And to find out the predictions on the unknown dataset (test dataset) evaluate function is used.
#Train the model on the training using fit() method. model = LinReg.fit(train_dataset) #Predict the Grades using the evulate method pred = model.evaluate(test_dataset)
Print coefficients and Intercept
A simple linear regression model simply built a straight line, and it calculates the coefficients using covariance and variance. So to display the coefficients and intercept, you can simply use the below command.
# Find out coefficient value
coefficient = model.coefficients
print("The coefficient of the model is : %a" % coefficient)

# Find out intercept value
intercept = model.intercept
print("The Intercept of the model is : %f" % intercept)
Evaluate the Model Using Metrics
The error is the difference between the actual and predicted values, and evaluation metrics help us judge the model: roughly how accurate it is, and where it performs best and worst. Here we calculate MAE, MSE, RMSE, and R-squared.
# Evaluate the model using metrics like Mean Absolute Error (MAE),
# Root Mean Square Error (RMSE) and R-Square
from pyspark.ml.evaluation import RegressionEvaluator
evaluation = RegressionEvaluator(labelCol="Grades", predictionCol="prediction")

# Root Mean Square Error
rmse = evaluation.evaluate(pred.predictions, {evaluation.metricName: "rmse"})
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = evaluation.evaluate(pred.predictions, {evaluation.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = evaluation.evaluate(pred.predictions, {evaluation.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = evaluation.evaluate(pred.predictions, {evaluation.metricName: "r2"})
print("r2: %.3f" % r2)
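As an alternative to constructing a RegressionEvaluator, the evaluation summary returned by evaluate() also exposes these metrics directly as attributes; a minimal sketch, assuming pred is the summary obtained above:

# The same metrics, read straight from the evaluation summary
print("RMSE: %.3f" % pred.rootMeanSquaredError)
print("MSE: %.3f" % pred.meanSquaredError)
print("MAE: %.3f" % pred.meanAbsoluteError)
print("r2: %.3f" % pred.r2)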
Spark is a big data processing engine that lets us work with huge amounts of data, including in real time. Machine learning is one of the workloads Spark supports, allowing us to analyze and build ML-based systems on large volumes of data. In this article, we covered Spark machine learning, PySpark, and PySpark MLlib.
Overview
As organizations create more diverse and more user-focused data products and services, there is a growing need for machine learning, which can be used to develop personalizations, recommendations, and predictive insights. The Apache Spark machine learning library (MLlib) allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configurations, and so on).
In this tutorial module, you will learn how to:
- Load sample data
- Prepare and visualize data for ML algorithms
- Run a linear regression model
- Evaluate a linear regression model
- Visualize a linear regression model
We also provide a sample notebook that you can import to access and run all of the code examples included in the module.
Apache Spark Alternatives
Dask
Dask seeks to offer a robust parallel computing framework that is very user-friendly for Python programmers and can run on anything from a single laptop to a cluster. Integrating Dask with existing hardware and code is simpler and lighter-weight than doing so with Spark.
Relevant reading: Dask vs. Spark
Because of its own API and execution model, Spark requires a certain amount of study and experience. In contrast, Dask is a pure-Python framework that works directly with Pandas DataFrames and NumPy arrays, so most data scientists can use it almost right away, which is a huge advantage.
Dask also integrates with scikit-learn through Joblib: with only a few modest code tweaks, scikit-learn programs can be run in parallel using Dask as the Joblib backend.
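A rough sketch of that pattern, assuming Dask (with distributed), scikit-learn, and joblib are installed; the dataset and estimator below are placeholders chosen only for illustration:

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # start a local Dask cluster

# Placeholder data and model for illustration
X, y = make_classification(n_samples=10000, n_features=20)
model = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# Route scikit-learn's internal joblib parallelism through Dask
with joblib.parallel_backend("dask"):
    model.fit(X, y)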
Ray
The Ray framework, developed at UC Berkeley's RISELab, provides a straightforward, general-purpose API for building distributed applications. The following four libraries are included with Ray core to speed up deep learning and machine learning workloads:
- Tune (Scalable Hyperparameter Tuning)
- RLlib (Scalable Reinforcement Learning)
- Ray Train (Distributed Deep Learning)
- Datasets (Distributed Data Loading and Compute, beta phase)
Ray is used in major machine learning use cases such as simulation, distributed training, complex computations, and deployment in interactive settings, while preserving the advantageous traits of Hadoop and Spark.
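As a rough illustration of the Ray core API that those libraries build on, a task is just a Python function decorated with @ray.remote; the toy computation below is purely illustrative:

import ray

ray.init()  # start Ray on the local machine

@ray.remote
def square(x):
    # Each call runs as an independent task scheduled by Ray
    return x * x

# Launch ten tasks in parallel and collect their results
futures = [square.remote(i) for i in range(10)]
print(ray.get(futures))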
Load sample data
The easiest way to start working with machine learning is to use an example Databricks dataset available in the /databricks-datasets folder accessible within the Databricks workspace. For example, to access the file that compares city population to median sale prices of homes, you can access the file /databricks-datasets/samples/population-vs-price/data_geo.csv.
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

data.cache()  # Cache data for faster reuse
To view this data in a tabular format, instead of exporting it to a third-party tool, you can use the display() command in your Databricks notebook.
display(data)
Evaluate the model
To evaluate the regression analysis, calculate the root mean square error using the RegressionEvaluator. Here is the Python code for evaluating the two models and their output.
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(predictionsA)
print("ModelA: Root Mean Squared Error = " + str(RMSE))
# ModelA: Root Mean Squared Error = 128.602026843

predictionsB = modelB.transform(data)
RMSE = evaluator.evaluate(predictionsB)
print("ModelB: Root Mean Squared Error = " + str(RMSE))
# ModelB: Root Mean Squared Error = 129.496300193
Best Practices to Use Spark for Data Science
As a data scientist, you often rely on the power of Apache Spark to run your computations. Here are some best practices for using Spark for data science:
- Have a good understanding of the Spark architecture. Know the data abstractions (RDDs, DataFrames, and Datasets) and have a clear picture of the roles and duties of the driver, the executors, and how work is divided into tasks.
- Balance out the workload of all jobs and ensure a smooth distribution for parallel processing to prevent bottlenecks. Choose the correct number of cores, partitions, and tasks.
- As far as possible, avoid frequent shuffles, spilling to disk, and excessive garbage collection. Spill data to disk only when an executor reports that the worker node it runs on is out of memory.
- Debugging skills are essential, because processing large amounts of data is typically difficult and can result in errors.
- Partition appropriately to optimize query planning. Partitioning should be based on the columns used for filtering and sorting, as sketched below.
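A minimal sketch of that idea, using a hypothetical sales_df DataFrame and a country column that queries commonly filter on:

# Write data partitioned by a column that queries filter on,
# so Spark can prune partitions at read time (names are hypothetical)
sales_df.write.partitionBy("country").parquet("/tmp/sales_partitioned")

# Repartition in memory by the same key before a wide operation such as a groupBy
sales_df = sales_df.repartition("country")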
You can enroll in our Data Science certification course and our exclusive PG in Data Science course at your convenience, or you can book a demo with us.
Why is Spark Important for Big Data Companies?
Big Data companies rely on Spark for Data Science to generate, store, and process massive amounts of data on a daily basis. They require a tool that can handle data efficiently to extract insights and valuable information that can be used for further improvements in business operations.
Spark’s scalability enables companies to process massive datasets rapidly and efficiently. It works by distributing data and processing across a cluster of computers. This feature makes it an ideal tool for handling big data workloads.
Why Spark?
- Spark is faster than many other big data processing tools, including Hadoop, because of its in-memory computing capabilities. It can also process data in real time, making it an ideal tool for handling streaming data, and it enables businesses to perform complex calculations, analytics, and machine learning tasks faster than traditional systems can.
- Spark for Data Science provides a variety of libraries for machine learning, graph processing, and real-time data processing. These libraries enable businesses to build complex models and extract valuable insights from their data. Thus, making it an essential tool for Big Data Analytics.
- It can also integrate with various data sources, including SQL databases, NoSQL databases, HDFS file systems, and Amazon S3. This makes it easy to access data from different sources, enabling businesses to use all their available data for analysis.
- In addition, Apache Spark can be deployed in a wide range of production environments. It can run on-premise or cloud-based deployments and hybrid solutions that combine the two. This makes it highly flexible and suitable for any enterprise’s needs. Finally, Spark’s open-source nature means that it is available for free. And this makes it a cost-effective solution for businesses of all sizes.
Features of Apache Spark
- In-memory processing: Spark processes data in memory faster than traditional disk-based processing. This means that data can be accessed and processed more quickly.
- Distributed computing: Spark’s distributed computing capabilities enable it to handle large-scale data.
- Versatility: Spark is a versatile tool that supports various data processing and analysis tasks. That includes batch processing, stream processing, and machine learning.
- Real-time data processing: Spark’s real-time data processing capabilities enable it to process and analyze data as it is generated. Thereby making it ideal for critical applications such as fraud detection and recommendation engines.
- Integration: Spark integrates with many data sources, including Hadoop, databases, and cloud storage solutions. Therefore, businesses find incorporating Spark into their existing Big Data infrastructure convenient.
- Easy-to-use APIs: Spark provides easy-to-use APIs for programming in languages such as Scala, Python, and Java, making it accessible to a wide range of developers.
Use cases
- Data processing: Apache Spark is ideal for processing large volumes of data, including batch processing and real-time processing. It is commonly used for data preprocessing, wrangling, and cleansing.
- Streaming data: Spark’s real-time data processing capabilities make it ideal for streaming data applications. It includes social media monitoring, real-time fraud detection, and IoT data processing.
- Machine learning: Spark’s machine learning libraries, such as MLlib, enable data scientists to build and train machine learning models on large datasets.
- Graph processing: Spark’s GraphX library enables graph processing. Thus, making it ideal for social network analysis, recommendation systems, and fraud detection.
- Image processing: Spark can be used for image processing applications, such as image recognition and classification.
- ETL processing: Spark is widely used for ETL (extract, transform, load) processing, enabling businesses to extract data from various sources, transform it into a usable format, and load it into a data warehouse or data lake.
- Data visualization: Spark can be used to generate visualizations of large and complex datasets, making it easier to identify patterns and trends.
- Ad hoc analysis: Spark’s interactive shell and SQL support enable ad hoc analysis, allowing businesses to explore their data and uncover insights; a small sketch follows this list.
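As a small sketch of such ad hoc analysis, you can register any DataFrame as a temporary view and query it with plain SQL; the view name students below is hypothetical, and data is the DataFrame loaded earlier:

# Register the DataFrame as a temporary SQL view and explore it interactively
data.createOrReplaceTempView("students")
spark.sql("SELECT COUNT(*) AS n_rows FROM students").show()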
Run the linear regression model
In this section, you run two different linear regression models with different regularization parameters to determine how well either of these two models predicts the sales price (label) based on the population (feature).
Build the model
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

# Define LinearRegression algorithm
lr = LinearRegression()

# Fit 2 models, using different regularization parameters
modelA = lr.fit(data, {lr.regParam: 0.0})
modelB = lr.fit(data, {lr.regParam: 100.0})
Using the model, you can also make predictions with the transform() function, which adds a new column of predictions. For example, the code below takes the first model (modelA) and shows you both the label (original sales price) and prediction (predicted sales price) based on the features (population).
# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)
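To line the original labels up against the predictions without the Databricks display() helper, you can select just those two columns; this assumes the prepared DataFrame uses the label and prediction column names described above:

# Compare actual sales prices (label) with predicted ones (prediction)
predictionsA.select("label", "prediction").show(5)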