
Amazon Kinesis Data Analytics | Kafka vs. Kinesis Comparison


How it works

Amazon Kinesis cost-effectively processes and analyzes streaming data at any scale as a fully managed service. With Kinesis, you can ingest real-time data, such as video, audio, application logs, website clickstreams, and IoT telemetry data, for machine learning (ML), analytics, and other applications.

  • Kinesis Data Streams: Amazon Kinesis Data Streams is a serverless streaming data service that simplifies the capture, processing, and storage of data streams at any scale.

  • Kinesis Video Streams: With Amazon Kinesis Video Streams, you can more easily and securely stream video from connected devices to AWS for analytics, ML, playback, and other processing.

Use cases

Amazon Kinesis Data Analytics is ideal for solving a wide range of streaming data use cases, including:

Streaming ETL for Internet-of-Things (IoT) with Apache Flink Applications

You can develop applications with Apache Flink libraries and use Amazon Kinesis Data Analytics to transform, aggregate, and filter streaming data from IoT devices such as consumer appliances, embedded sensors, and TV set-top boxes. You can then use the data to send real-time alerts when a sensor exceeds certain operating thresholds.
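To make that concrete, here is a minimal Flink job sketch that reads sensor events from a Kinesis data stream and keeps only readings flagged as out of range. The stream name, Region, and the JSON flag it checks are illustrative assumptions rather than values from this article:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

import java.util.Properties;

public class SensorAlertJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: a Kinesis data stream of JSON sensor readings.
        // "sensor-events" and the Region are placeholder assumptions.
        Properties props = new Properties();
        props.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
        props.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        DataStream<String> readings = env.addSource(
            new FlinkKinesisConsumer<>("sensor-events", new SimpleStringSchema(), props));

        // Keep only readings a device has flagged as exceeding its threshold.
        // A real job would parse the JSON into a POJO; this check is a stand-in.
        DataStream<String> alerts = readings
            .filter(json -> json.contains("\"overThreshold\":true"));

        alerts.print(); // in a real application, sink to a notification stream instead

        env.execute("sensor-threshold-alerts");
    }
}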

Real-time log analytics with SQL

You can stream billions of small messages to Amazon Kinesis Data Analytics and calculate key metrics, which you can then use to refresh content performance dashboards in real time and improve content performance.

Build end-to-end streaming pipelines with Amazon Managed Service for Apache Flink Blueprints with a single click. Learn more.

Amazon Managed Service for Apache Flink

Build and run fully managed Apache Flink applications

Build and run Apache Flink applications, without setting up infrastructure and managing resources and clusters.

Process gigabytes of data per second with subsecond latencies and respond to events in real time.

Deploy highly available and durable applications with Multi-AZ deployments and APIs for application lifecycle management.


Apache Kafka vs. Amazon Kinesis

Kafka and Kinesis are both important components for facilitating data processing in modern data pipelines. And although both solutions are widely used in today’s businesses, they have some stark differences that every business should know about.

To better understand these event streaming platforms, we’ve put together a deep dive comparison analyzing the similarities and differences of Kafka and Kinesis.

Specifically, in this piece, we’ll look at how Kafka and Kinesis vary in performance, cost, scalability, and ease of use. With that, let’s dig in.

What is Kafka?

Apache Kafka is an open-source distributed event streaming platform (also known as a “pub/sub” messaging system) that brokers communication between bare-metal servers, virtual machines, and cloud-native services.

At a high level, Apache Kafka is a distributed system of servers and clients that communicate through a publish/subscribe messaging model. Streaming data is published to (written) and subscribed to (read from) these distributed servers and clients. Just like Kinesis, this asynchronous service-to-service communication model allows subscribers to a topic to immediately receive any message published to it.
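To make the model concrete, here is a minimal producer sketch using the Kafka Java client; the broker address and topic name are placeholder assumptions:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ClickPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the "clickstream" topic; every subscriber
            // to that topic receives it as soon as it is committed.
            producer.send(new ProducerRecord<>("clickstream", "user-42", "/checkout"));
        }
    }
}

A consumer subscribes to the same topic name and receives the message without the producer knowing anything about it, which is what decouples the communicating services.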

Kafka has been a long-time favorite for on-premises data lakes. Used by thousands of companies, including the majority of the Fortune 100, Kafka has become the go-to open-source distributed event streaming platform for high-performance streaming data processing. Here, streaming data is defined as data continuously generated by thousands of data sources. It’s Kafka’s responsibility to ingest all of these data sources in real time and to process and store the data in the order it’s received. This attribute of the Kafka event streaming platform enables businesses to build high-performance Kafka data pipelines, streaming analytics tools, data integration applications, and an array of other mission-critical applications.

What is Kinesis?

Amazon Kinesis is a proprietary Amazon service for real-time data streaming. It collects, processes, and analyzes real-time streaming data within AWS (Amazon Web Services). As a replacement for the common SNS-SQS messaging pattern, AWS Kinesis enables organizations to run critical applications and support baseline business processes in real time, rather than waiting hours or days until all the data has been collected and cataloged.

As a cost-effective AWS-native service for collecting, processing, and analyzing streaming data at scale, Kinesis is designed to integrate seamlessly with a host of AWS-native services, such as AWS Lambda and Amazon Redshift, via the Amazon Kinesis Data Streams APIs for stream processing. In doing so, Amazon Kinesis can ingest, catalog, and analyze incoming data for data analytics, sensor metrics, machine learning, artificial intelligence, and other modern-day applications.
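As a minimal sketch of that ingestion path using the AWS SDK for Java v2 (the stream name, Region, and payload are illustrative assumptions):

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class TelemetryIngest {
    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.builder().region(Region.US_EAST_1).build()) {
            kinesis.putRecord(PutRecordRequest.builder()
                .streamName("iot-telemetry")   // placeholder stream name
                .partitionKey("device-17")     // records with the same key land on the same shard
                .data(SdkBytes.fromUtf8String("{\"temp\":72.4}"))
                .build());
        }
    }
}

Consumers such as AWS Lambda or a Kinesis Data Analytics application can then read the record from the stream shortly after it is written.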

Further, as a cloud-native solution, Kinesis is fault-tolerant by default, supports auto-scaling, and integrates seamlessly with AWS dashboards designed to monitor key metrics.

How to get started

Learn how Amazon Managed Service for Apache Flink works

Find out more about running serverless Apache Flink applications and Amazon Managed Service for Apache Flink Studio.

Review the step-by-step guide

Explore how to build applications in your integrated development environment (IDE) or in a Studio notebook.

Run your Apache Flink application

Start building your streaming application with no minimum fees or upfront commitments.

Process Streaming Data with Amazon Kinesis Data Analytics Studio | Amazon Web Services
Process Streaming Data with Amazon Kinesis Data Analytics Studio | Amazon Web Services

Pricing and billing

With Amazon Kinesis Data Analytics, you pay only for what you use. There are no resources to provision or upfront costs associated with Amazon Kinesis Data Analytics.

You are charged an hourly rate based on the number of Amazon Kinesis Processing Units (or KPUs) used to run your streaming application. A single KPU is a unit of stream processing capacity consisting of 1 vCPU of compute and 4 GB of memory. Amazon Kinesis Data Analytics automatically scales the number of KPUs required by your stream processing application as its memory and compute demands vary with processing complexity and the throughput of the streaming data processed.

For Apache Flink applications, you are charged a minimum of two KPUs and 50 GB of running application storage while your Kinesis Data Analytics application is running. For SQL applications, you are charged a minimum of one KPU while your Kinesis Data Analytics application is running.
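As a rough worked example (KPU rates vary by Region; the rate here is purely illustrative): at an assumed $0.11 per KPU-hour, an Apache Flink application running at its two-KPU minimum for a 30-day month would cost roughly 2 KPUs × 720 hours × $0.11 ≈ $158, plus the charge for running application storage and whatever you pay for the services the application reads from and writes to.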

Kinesis Data Analytics is a fully managed stream processing solution, independent from the streaming source that it reads data from and the destinations it writes processed data to. You will be billed independently for the services you read from and write to in your application.

When Should I Use Amazon Kinesis Data Analytics?

Amazon Kinesis Data Analytics enables you to quickly author SQL code that continuously reads, processes, and stores data in near real time. Using standard SQL queries on the streaming data, you can construct applications that transform your data and provide insights into it. Following are some example scenarios for using Kinesis Data Analytics:

  • Generate time-series analytics – You can calculate metrics over time windows, and then stream values to Amazon S3 or Amazon Redshift through a Kinesis data delivery stream.

  • Feed real-time dashboards – You can send aggregated and processed streaming data results downstream to feed real-time dashboards.

  • Create real-time metrics – You can create custom metrics and triggers for use in real-time monitoring, notifications, and alarms.

For information about the SQL language elements that are supported by Kinesis Data Analytics, see Amazon Kinesis Data Analytics SQL Reference.


StreamSets’ Approach

StreamSets supports Apache Kafka as a source, broker, and destination, allowing you to build complex Kafka pipelines with message brokering at every stage, and it provides supported stages for Kinesis as well. To learn more, contact us today or get started building pipelines for free.

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. Amazon Kinesis Data Analytics reduces the complexity of building, managing, and integrating streaming applications with other Amazon Web Services services.

You can build sophisticated applications using Apache Flink. Apache Flink is an open source framework and engine for processing data streams. Your applications can transform and analyze data in real time, and integrate with other Amazon Web Services services in as little as one line of code. You can also use an interactive SQL editor to easily query streaming data and build streaming applications. Simply point to a streaming data source like Amazon Kinesis Data Streams and use standard SQL to analyze your data in real time.

Amazon Kinesis Data Analytics takes care of everything required to run your real-time applications continuously and scales automatically to match the volume and throughput of your incoming data. With Amazon Kinesis Data Analytics, you only pay for the resources your streaming applications consume. There is no minimum fee or setup cost.

Benefits

Powerful real-time processing

Amazon Kinesis Data Analytics provides built-in functions to filter, aggregate, and transform streaming data for advanced analytics. It processes streaming data with sub-second latencies, enabling you to analyze and respond to incoming data and streaming events in real time.

No servers to manage

Amazon Kinesis Data Analytics is serverless; there are no servers to manage. It runs your streaming applications without requiring you to provision or manage any infrastructure. Amazon Kinesis Data Analytics automatically scales the infrastructure up and down as required to run your applications with low latency.

Pay only for what you use

With Amazon Kinesis Data Analytics, you pay only for the processing resources that your streaming applications use. There are no minimum fees or upfront commitments.


Key concepts

An application is the Kinesis Data Analytics entity that you work with. Kinesis Data Analytics applications continuously read and process streaming data in real time. You write application code in a language supported by Apache Flink to process the incoming streaming data and produce output. Then, Kinesis Data Analytics writes the output to a configured destination.

Input – The streaming source for your application. In the input configuration, you map the streaming source to one or more in-application data streams. Data flows from your data sources into your in-application data streams. You process data from these in-application data streams using your application code, sending processed data to subsequent in-application data streams or destinations. You add inputs inside application code for Apache Flink applications and Studio notebooks, and via the API for Kinesis Data Analytics for SQL applications.

An in-application data stream is an entity that continuously stores data in your application for you to process. Your applications continuously write to and read from in-application data streams. For Apache Flink and Studio applications, you interact with an in-application stream by processing data with stream operators. Operators transform one or more data streams into a new data stream. For SQL applications, you interact with an in-application stream the same way you would a SQL table, by using SQL statements. You apply SQL statements to one or more data streams and insert the results into a new data stream.

For Apache Flink applications, Kinesis Data Analytics supports applications built using the Apache Flink open source libraries and the Amazon SDKs. For SQL applications, Kinesis Data Analytics supports the ANSI SQL with some extensions to the SQL standard to make it easier to work with streaming data. Kinesis Data Analytics Studio supports code built using Apache Flink-compatible SQL, Python, and Scala.

How to get started

Explore Kinesis capabilities

Learn more about implementing real-time applications with Kinesis.

Contact an expert

Reach out to an AWS expert to get your questions answered.

Get started with a free account

Gain instant access to the AWS Free Tier.

For new projects, we recommend that you use the new Managed Service for Apache Flink Studio over Kinesis Data Analytics for SQL Applications. Managed Service for Apache Flink Studio combines ease of use with advanced analytical capabilities, enabling you to build sophisticated stream processing applications in minutes.

Amazon Kinesis Data Analytics for SQL Applications: How It Works

Note

After September 12, 2023, you will not be able to create new applications using Kinesis Data Firehose as a source if you do not already use Kinesis Data Analytics for SQL. For more information, see Limits.

An application is the primary resource in Amazon Kinesis Data Analytics that you can create in your account. You can create and manage applications using the AWS Management Console or the Kinesis Data Analytics API. Kinesis Data Analytics provides API operations to manage applications. For a list of API operations, see Actions.

Kinesis Data Analytics applications continuously read and process streaming data in real time. You write application code using SQL to process the incoming streaming data and produce output. Then, Kinesis Data Analytics writes the output to a configured destination.

Each application has a name, description, version ID, and status. Amazon Kinesis Data Analytics assigns a version ID when you first create an application. This version ID is updated when you update any application configuration. For example, if you add an input configuration, add or delete a reference data source, add or delete an output configuration, or update application code, Kinesis Data Analytics updates the current application version ID. Kinesis Data Analytics also maintains timestamps for when an application was created and last updated.

In addition to these basic properties, each application consists of the following:

  • Input – The streaming source for your application. You can select either a Kinesis data stream or a Firehose data delivery stream as the streaming source. In the input configuration, you map the streaming source to an in-application input stream. The in-application stream is like a continuously updating table upon which you can perform the SELECT and INSERT SQL operations. In your application code, you can create additional in-application streams to store intermediate query results.

    You can optionally partition a single streaming source in multiple in-application input streams to improve the throughput. For more information, see Limits and Configuring Application Input.

    Amazon Kinesis Data Analytics provides a timestamp column, ROWTIME, in each in-application stream. You can use this column in time-based windowed queries. For more information, see Timestamps and the ROWTIME Column and Windowed Queries.

    You can optionally configure a reference data source to enrich your input data stream within the application. It results in an in-application reference table. You must store your reference data as an object in your S3 bucket. When the application starts, Amazon Kinesis Data Analytics reads the Amazon S3 object and creates an in-application table. For more information, see Configuring Application Input.

  • Application code – A series of SQL statements that process input and produce output. You can write SQL statements against in-application streams and reference tables. You can also write JOIN queries to combine data from both of these sources.

    For information about the SQL language elements that are supported by Kinesis Data Analytics, see Amazon Kinesis Data Analytics SQL Reference.

    In its simplest form, application code can be a single SQL statement that selects from a streaming input and inserts results into a streaming output. It can also be a series of SQL statements where output of one feeds into the input of the next SQL statement. Further, you can write application code to split an input stream into multiple streams. You can then apply additional queries to process these streams. For more information, see Application Code.

  • Output – In application code, query results go to in-application streams. In your application code, you can create one or more in-application streams to hold intermediate results. You can then optionally configure the application output to persist data in the in-application streams that hold your application output (also referred to as in-application output streams) to external destinations. External destinations can be a Firehose delivery stream or a Kinesis data stream. Note the following about these destinations:

    • You can configure a Firehose delivery stream to write results to Amazon S3, Amazon Redshift, or Amazon OpenSearch Service (OpenSearch Service).

    • You can also write application output to a custom destination instead of Amazon S3 or Amazon Redshift. To do that, you specify a Kinesis data stream as the destination in your output configuration. Then, you configure AWS Lambda to poll the stream and invoke your Lambda function. Your Lambda function code receives stream data as input and can write the incoming data to your custom destination; a minimal sketch of such a handler appears after this list. For more information, see Using AWS Lambda with Amazon Kinesis Data Analytics.

    For more information, see Configuring Application Output.
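The following is a minimal sketch of such a Lambda handler using the AWS Lambda Java events library; the class name and the stubbed-out deliver method are illustrative assumptions:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;

import java.nio.charset.StandardCharsets;

// Receives batches of records from the output Kinesis data stream and
// forwards each payload to a custom destination (stubbed out here).
public class OutputForwarder implements RequestHandler<KinesisEvent, Void> {

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            String payload = StandardCharsets.UTF_8
                .decode(record.getKinesis().getData())
                .toString();
            deliver(payload);
        }
        return null;
    }

    private void deliver(String payload) {
        // Placeholder: write to your custom destination here.
        System.out.println(payload);
    }
}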

In addition, note the following:

  • Amazon Kinesis Data Analytics needs permissions to read records from a streaming source and write application output to the external destinations. You use IAM roles to grant these permissions.

  • Kinesis Data Analytics automatically provides an in-application error stream for each application. If your application has issues while processing certain records (for example, because of a type mismatch or late arrival), that record is written to the error stream. You can configure application output to direct Kinesis Data Analytics to persist the error stream data to an external destination for further evaluation. For more information, see Error Handling.

  • Amazon Kinesis Data Analytics ensures that your application output records are written to the configured destination. It uses an “at least once” processing and delivery model, even if you experience an application interruption. For more information, see Delivery Model for Persisting Application Output to an External Destination.


Are You a First-Time User of Amazon Kinesis Data Analytics?

If you are a first-time user of Amazon Kinesis Data Analytics, we recommend that you read the following sections in order:

  1. Read the How It Works section of this guide. This section introduces various Kinesis Data Analytics components that you work with to create an end-to-end experience. For more information, see Amazon Kinesis Data Analytics for SQL Applications: How It Works.

  2. Try the Getting Started exercises. For more information, see Getting Started with Amazon Kinesis Data Analytics for SQL Applications.

  3. Explore the streaming SQL concepts. For more information, see Streaming SQL Concepts.

  4. Try additional examples. For more information, see Kinesis Data Analytics for SQL examples.

Data comes at businesses today at a relentless pace – and it never stops. It’s a good thing too. The data-driven enterprise is more likely to succeed. According to McKinsey, “companies with the greatest overall growth in revenue and earnings receive a significant proportion of that boost from data and analytics.” But there’s a secret to fueling those analytics: data ingest frameworks that help deliver data in real-time across a business. This is where the Kafka vs. Kinesis discussion begins.


Both Apache Kafka and Amazon Kinesis handle real-time data feeds. Both are capable of ingesting thousands of data feeds simultaneously to support high-speed data processing. Whether to support machine learning, artificial intelligence, big data, IoT, or general stream processing, today’s business is hyper-focused on investing in data stream processing solutions, facilitated by these message brokering services.


Source: What Is Amazon Kinesis Data Analytics? – Amazon Kinesis Data Analytics


Managing applications

For information about monitoring your application, see:

• Monitoring Kinesis Data Analytics in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.
• Monitoring Kinesis Data Analytics in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Q: How do I manage and control access to my Kinesis Data Analytics applications? Kinesis Data Analytics needs permissions to read records from the streaming data sources that you specify in your application. Kinesis Data Analytics also needs permissions to write your application output to destinations that you specify in your application output configuration. You can grant these permissions by creating IAM roles that Kinesis Data Analytics can assume. The permissions you grant to this role determine what Kinesis Data Analytics can do when the service assumes the role. For more information, see:

• Granting Permissions in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.
• Granting Permissions in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Q: How does Kinesis Data Analytics scale my application? Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. Kinesis Data Analytics provisions capacity in the form of Amazon Kinesis Processing Units (KPU). One KPU provides you with 1 vCPU and 4 GB of memory.

For Apache Flink applications, Kinesis Data Analytics assigns 50 GB of running application storage per KPU, which your application uses for checkpoints and which is available to you as temporary disk. A checkpoint is an up-to-date backup of a running application that is used to recover immediately from an application disruption. You can also control the parallel execution of your Kinesis Data Analytics for Apache Flink application tasks (such as reading from a source or executing an operator) using the Parallelism and ParallelismPerKPU parameters in the API. Parallelism defines the number of concurrent instances of a task; all operators, sources, and sinks execute with a defined parallelism (1 by default). ParallelismPerKPU defines the number of parallel tasks that can be scheduled per Kinesis Processing Unit (KPU) of your application (1 by default). For more information, see Scaling in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.

For SQL applications, each streaming source is mapped to a corresponding in-application stream. While this is not required for many customers, you can use KPUs more efficiently by increasing the number of in-application streams that your source is mapped to, by specifying the input parallelism parameter. Kinesis Data Analytics evenly assigns the streaming data source’s partitions, such as an Amazon Kinesis data stream’s shards, to the number of in-application data streams that you specified. For example, if you have a 10-shard Amazon Kinesis data stream as a streaming data source and you specify an input parallelism of two, Kinesis Data Analytics assigns five Amazon Kinesis shards each to two in-application streams named “SOURCE_SQL_STREAM_001” and “SOURCE_SQL_STREAM_002”. For more information, see Configuring Application Input in the Amazon Kinesis Data Analytics for SQL Developer Guide.
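As a quick worked example of the two parameters described above: an application configured with a Parallelism of 8 and a ParallelismPerKPU of 2 runs eight concurrent instances of each task and is allocated 8 / 2 = 4 KPUs.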

Q: What are the best practices associated for building and managing my Kinesis Data Analytics applications? For information about best practices for Apache Flink applications, see the Best Practices section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide. The section covers best practices for fault tolerance, performance, logging, coding, and more.

For information about best practices for SQL applications, see the Best Practices section of the Amazon Kinesis Data Analytics for SQL Developer Guide. The section covers managing applications, defining input schema, connecting to outputs, and authoring application code.

Q: Can I access resources behind an Amazon VPC with a Kinesis Data Analytics for Apache Flink application? Yes. You can access resources behind an Amazon VPC. You can learn how to configure your application for VPC access in the Using an Amazon VPC section of the Amazon Kinesis Data Analytics Developer Guide.

Q: Can a single Kinesis Data Analytics for Apache Flink application have access to multiple VPCs? No. If multiple subnets are specified, they must all be in the same VPC. You can connect to other VPCs by peering your VPCs.

Q: Can a Kinesis Data Analytics for Apache Flink application connected to a VPC also access the internet and Amazon Web Services service endpoints? Kinesis Data Analytics for Apache Flink applications and Kinesis Data Analytics Studio notebooks configured to access resources in a particular VPC do not have internet access by default. You can learn how to configure internet access for your application in the Internet and Service Access section of the Amazon Kinesis Data Analytics Developer Guide.

General

Data is coming at us at lightning speeds due to explosive growth in real-time data sources. Whether it is log data from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, the data delivers information that can help companies learn what their customers, organization, and business are doing right now. By having visibility into this data as it arrives, you can monitor your business in real time and quickly act on new business opportunities, such as making promotional offers to customers based on where they might be at a specific time, or monitoring social sentiment and changing customer attitudes to identify and act on new opportunities.

To take advantage of these opportunities, you need a different set of analytics tools for collecting and analyzing real-time streaming data than what has been available traditionally for static, stored data. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach, different tools, and different services. Instead of running database queries on stored data, streaming analytics services process the data continuously before the data is stored. Streaming data flows at an incredible rate that can vary up and down all the time. Streaming analytics services need to process this data when it arrives, often at speeds of millions of events per hour.

You can use Kinesis Data Analytics for many use cases to process data continuously and get insights in seconds or minutes rather than waiting days or even weeks. Kinesis Data Analytics enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The four most common use cases are streaming extract-transform-load (ETL), continuous metric generation, responsive real-time analytics, and interactive querying of data streams.

Streaming ETL

Streaming ETL applications enable you to clean, enrich, organize, and transform raw data prior to loading your data lake or data warehouse in real time, reducing or eliminating batch ETL steps. These applications can buffer small records into larger files prior to delivery, and perform sophisticated joins across streams and tables. For example, you can build an application that continuously reads IoT sensor data stored in Amazon Managed Streaming for Apache Kafka (Amazon MSK), organizes the data by sensor type, removes duplicate data, normalizes the data per a specified schema, and then delivers the data to Amazon S3.
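A minimal sketch of such a pipeline with the open source Flink libraries might look like the following. The MSK bootstrap address, topic, bucket, and the trivial cleanup step are illustrative assumptions standing in for real schema normalization and deduplication:

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class SensorEtlJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "b-1.msk.example.com:9092"); // placeholder
        kafkaProps.setProperty("group.id", "sensor-etl");

        // Read raw IoT sensor records from an MSK topic.
        DataStream<String> raw = env.addSource(
            new FlinkKafkaConsumer<>("iot-sensors", new SimpleStringSchema(), kafkaProps));

        // Toy cleanup standing in for schema mapping and deduplication.
        DataStream<String> cleaned = raw
            .filter(line -> !line.isEmpty())
            .map(String::trim);

        // Buffer records into larger files and deliver them to Amazon S3.
        cleaned.addSink(StreamingFileSink
            .forRowFormat(new Path("s3://example-bucket/sensors/"),
                          new SimpleStringEncoder<String>("UTF-8"))
            .build());

        env.execute("iot-streaming-etl");
    }
}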

Continuous metric generation

Continuous metric generation applications enable you to monitor and understand how your data is trending over time. Your applications can aggregate streaming data into critical information and seamlessly integrate it with reporting databases and monitoring services to serve your applications and users in real time. With Kinesis Data Analytics, you can use SQL or Apache Flink code in a supported language to continuously generate time-series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending it to Amazon DynamoDB. Or, you can track the traffic to your website by calculating the number of unique website visitors every five minutes and then sending the processed results to Amazon Redshift.

Responsive real-time analytics

Responsive real-time analytics applications send real-time alarms or notifications when certain metrics reach predefined thresholds or, in more advanced cases, when your application detects anomalies using machine learning algorithms. These applications enable you to respond immediately to changes in your business in real time, such as predicting user abandonment in mobile apps and identifying degraded systems. For example, an application can compute the availability or success rate of a customer-facing API over time, and then send results to Amazon CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using Amazon Kinesis Data Streams and Amazon Simple Notification Service (SNS).

Interactive analysis

Interactive analysis enables streaming data exploration in real time. With ad hoc queries or programs, you can inspect streams from Amazon MSK or Amazon Kinesis Data Streams and visualize how data looks within those streams. For example, you can view how a real-time metric that computes the average over a time window behaves and send the aggregated data to a destination of your choice. Interactive analysis also helps with iterative development of stream processing applications. The queries you build continuously update as new data arrives. With Amazon Kinesis Data Analytics Studio, you can deploy these queries to run continuously with autoscaling and durable state backups enabled.

Sign in to the Amazon Kinesis Data Analytics console and create a new stream processing application. You can also use the Amazon CLI and Amazon SDKs. Once you create an application, go to your favorite integrated development environment, connect to Amazon Web Services, and install the open source Apache Flink libraries and Amazon SDKs in your language of choice. Apache Flink is an open source framework and engine for processing data streams. The extensible libraries include more than 25 pre-built stream processing operators, like window and aggregate, and Amazon Web Services service integrations, like Amazon MSK, Amazon Kinesis Data Streams, and Amazon Kinesis Data Firehose. Once built, you upload your code to Amazon Kinesis Data Analytics, and the service takes care of everything required to run your real-time applications continuously, including scaling automatically to match the volume and throughput of your incoming data.

Using Apache Beam to create your Kinesis Data Analytics application is very similar to getting started with Apache Flink. Follow the instructions above and be sure to install any components necessary for applications to run on Apache Beam, per the instructions in the Developer Guide. Note that Kinesis Data Analytics supports the Java SDK only when running on Apache Beam.

Sign in to the Amazon Kinesis Data Analytics console and create a new stream processing application. You can also use the Amazon CLI and Amazon SDKs. You can build an end-to-end application in three simple steps: 1) configure incoming streaming data, 2) write your SQL queries, and 3) point to where you want the results loaded. Kinesis Data Analytics recognizes standard data formats such as JSON, CSV, and TSV, and automatically creates a baseline schema. You can refine this schema, or if your data is unstructured, you can define a new one using our intuitive schema editor. Then, the service applies the schema to the input stream and makes it look like a SQL table that is continually updated, so that you can write standard SQL queries against it. You use our SQL editor to build your queries.

The SQL editor comes with all the bells and whistles including syntax checking and testing against live data. We also give you templates that provide the SQL code for anything from a simple stream filter to advanced anomaly detection and top-K analysis. Kinesis Data Analytics takes care of provisioning and elastically scaling all of the infrastructure to handle any data throughput. You don’t need to plan, provision, or manage infrastructure.

Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. For detailed information on service limits for Apache Flink applications, visit the Limits section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide. For detailed information on service limits for SQL applications, see Limits in the Amazon Kinesis Data Analytics for SQL Developer Guide.


Building Apache Flink Applications

Authoring application code for applications using Apache Flink

Apache Flink is an open source framework and engine for stream and batch data processing. It makes streaming applications easy to build, because it provides powerful operators and solves the core streaming problems like duplicate processing very well. Apache Flink provides data distribution, communication, and fault tolerance for distributed computations over data streams.

You can start by downloading the open source libraries that include the Amazon SDK, Apache Flink, and connectors for Amazon Web Services services. You can get instructions on how to download the libraries and create your first application in the Amazon Kinesis Data Analytics for Apache Flink Developer Guide.

You write your Apache Flink application code using data streams and stream operators. Application data streams are the data structure you perform processing against using your code. Data continuously flows from the sources into application data streams. One or more stream operators are used to define your processing on the application data streams, including transform, partition, aggregate, join, and window. Data streams and operators can be put together in serial and parallel chains. A short example using pseudocode is shown below.


DataStream<Event> rawEvents = env.addSource(new KinesisStreamSource("input_events"));

DataStream<UserPerLevel> gameStream = rawEvents
    .map(event -> new UserPerLevel(event.gameMetadata.gameId,
                                   event.gameMetadata.levelId,
                                   event.userId));

gameStream
    .keyBy(event -> event.gameId)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .apply(...);

gameStream.addSink(new KinesisStreamSink("myGameStateStream"));


Operators take an application data stream as input and send processed data to an application data stream as output. Operators can be put together to build applications with multiple steps and don’t require advanced knowledge of distributed systems to implement and operate.

Kinesis Data Analytics for Apache Flink includes over 25 operators that can be used to solve a wide variety of use cases, including Map, KeyBy, aggregations, Window Join, and Window. Map allows you to perform arbitrary processing, taking one element from an incoming data stream and producing another element. KeyBy logically organizes data using a specified key, enabling you to process similar data points together. Aggregations perform processing across multiple keys, like sum, min, and max. Window Join joins two data streams together on a given key and window. Window groups data using a key and a typically time-based operation, like counting the number of unique items over a 5-minute time period.

You can build custom operators if these do not meet your needs. You can find more examples in the Operators section of the Amazon Kinesis Data Analytics for Apache Flink Developer Guide. You can find a full list of operators in the Operators section of the Apache Flink documentation.

You can set up pre-built integrations with minimal code, or build your own integration to connect to virtually any data source. The open source libraries based on Apache Flink support streaming sources and destinations, or sinks, for the delivery of processed data. This also includes support for data enrichment via asynchronous input/output connectors. A list of specific connectors included in the open source libraries is shown below.

• Streaming data sources: Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams

• Destinations, or sinks: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon DynamoDB, Amazon Elasticsearch Service, and Amazon S3 (through file sink integrations).

Apache Flink also includes other connectors including Apache Kafka, Apache Cassandra, Elasticsearch and more.

You can add a source or destination to your application by building upon a set of primitives that enable you to read and write from files, directories, sockets, or anything that you can access over the internet. Apache Flink provides these primitives for data sources and data sinks. The primitives come with configurations such as the ability to read and write data continuously or once, asynchronously or synchronously, and much more. For example, you can set up an application to read continuously from Amazon S3 by extending the existing file-based source integration.

Apache Flink applications in Kinesis Data Analytics use an “exactly once” delivery model if an application is built using idempotent operators, including sources and sinks. This means the processed data will impact downstream results once and only once. Checkpoints save the current application state and enable Kinesis Data Analytics for Apache Flink applications to recover the position of the application to provide the same semantics as a failure-free execution. Checkpoints for applications are provided via Apache Flink’s checkpointing functionality. By default, Kinesis Data Analytics for Apache Flink applications uses exactly-once semantics. Your application will support exactly once processing semantics if you design your applications using sources, operators, and sinks that utilize Apache Flink’s exactly once semantics.
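Kinesis Data Analytics manages checkpoint configuration for you; in open source Apache Flink, the equivalent setup on the execution environment looks roughly like this (the 60-second interval is an illustrative choice):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 60 seconds.
        env.enableCheckpointing(60_000);

        // Align checkpoint barriers so that, on recovery, each record
        // affects state exactly once.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    }
}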

Kinesis Data Analytics for Apache Flink provides your application with 50 GB of running application storage per Kinesis Processing Unit (KPU). Kinesis Data Analytics scales storage with your application. Running application storage is used for saving application state via checkpoints. It is also accessible to your application code as temporary disk for caching data or any other purpose. Kinesis Data Analytics can remove data that is not saved via checkpoints (for example, by operators, sources, and sinks) from running application storage at any time. All data stored in running application storage is encrypted at rest.

Kinesis Data Analytics automatically backs up your running application’s state using checkpoints and snapshots. Checkpoints save the current application state and enable Kinesis Data Analytics for Apache Flink applications to recover the position of the application to provide the same semantics as a failure-free execution. Checkpoints utilize running application storage. Snapshots save a point in time recovery point for applications. Snapshots utilize durable application backups.

Snapshots enable you to create and restore your application to a previous point in time. This enables you to maintain previous application state and roll back your application at any time. You control how many snapshots you have at any given time, from zero to thousands. Snapshots use durable application backups, and Kinesis Data Analytics charges you based on their size. Kinesis Data Analytics encrypts data saved in snapshots by default. You can delete individual snapshots through the API or all snapshots by deleting your application.
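For example, you might take a snapshot before each deployment so you can roll back if the new version misbehaves. A minimal sketch with the AWS SDK for Java v2 (the application and snapshot names are illustrative assumptions):

import software.amazon.awssdk.services.kinesisanalyticsv2.KinesisAnalyticsV2Client;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.CreateApplicationSnapshotRequest;

public class SnapshotBeforeDeploy {
    public static void main(String[] args) {
        try (KinesisAnalyticsV2Client client = KinesisAnalyticsV2Client.create()) {
            // Record a named point-in-time recovery point for the application.
            client.createApplicationSnapshot(CreateApplicationSnapshotRequest.builder()
                .applicationName("my-flink-app")
                .snapshotName("pre-deploy-2023-09-01")
                .build());
        }
    }
}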

Amazon Kinesis Data Analytics for Apache Flink supports Apache Flink 1.6, 1.8, and 1.11. Apache Flink 1.11 in Kinesis Data Analytics supports Java Development Kit version 11, Python 3.7, and Scala 2.12. You can find more information in the Creating Applications section of the Amazon Web Services Developer Guide.

Comparison to other stream processing solutions

The Amazon Kinesis Client Library (KCL) is a pre-built library that helps you build consumer applications for reading and processing data from an Amazon Kinesis data stream. The KCL handles complex issues such as adapting to changes in data stream volume, load balancing streaming data, coordinating distributed services, and processing data with fault-tolerance. The KCL enables you to focus on business logic while building applications.
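For a sense of what that looks like in practice, here is a minimal record processor sketch against the KCL 2.x interface; the class name and the logging body are illustrative assumptions:

import software.amazon.kinesis.lifecycle.events.InitializationInput;
import software.amazon.kinesis.lifecycle.events.LeaseLostInput;
import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;
import software.amazon.kinesis.lifecycle.events.ShardEndedInput;
import software.amazon.kinesis.lifecycle.events.ShutdownRequestedInput;
import software.amazon.kinesis.processor.ShardRecordProcessor;

// Business logic lives in processRecords; the KCL handles shard leases,
// load balancing, and failover around it.
public class LoggingRecordProcessor implements ShardRecordProcessor {

    @Override
    public void initialize(InitializationInput input) { }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        input.records().forEach(record ->
            System.out.println("received seq=" + record.sequenceNumber()));
    }

    @Override
    public void leaseLost(LeaseLostInput input) { }

    @Override
    public void shardEnded(ShardEndedInput input) {
        try {
            // Checkpoint at shard end so the KCL can move on to child shards.
            input.checkpointer().checkpoint();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void shutdownRequested(ShutdownRequestedInput input) { }
}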

With Kinesis Data Analytics, you can process and query real-time, streaming data. You use standard SQL to process your data streams, so you don’t have to learn any new programming languages. You just point Kinesis Data Analytics to an incoming data stream, write your SQL queries, and then specify where you want the results loaded. Kinesis Data Analytics uses the KCL to read data from streaming data sources as one part of your underlying application. The service abstracts this from you, as well as many of the more complex concepts associated with using the KCL, such as checkpointing.

If you want a fully managed solution and you want to use SQL to process the data from your data stream, you should use Kinesis Data Analytics. Use the KCL if you need to build a custom processing solution whose requirements are not met by Kinesis Data Analytics, and you are able to manage the resulting consumer application.

What Is Amazon Kinesis Data Analytics?

With Amazon Kinesis Data Analytics, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics.

To get started with Kinesis Data Analytics, you create a Kinesis data analytics application that continuously reads and processes streaming data. The service supports ingesting data from Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose streaming sources. Then, you author your SQL code using the interactive editor and test it with live streaming data. You can also configure destinations where you want Kinesis Data Analytics to send the results.

Kinesis Data Analytics supports Amazon Kinesis Data Firehose (Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service), AWS Lambda, and Amazon Kinesis Data Streams as destinations.




Use cases

Create real-time applications

Build apps for application monitoring, fraud detection, and live leaderboards. Analyze data and emit the results to any data store or application. Learn more about streaming data solutions on AWS »

Evolve from batch to real-time analytics

Perform real-time analytics on data that has been traditionally analyzed using batch processing. Get the latest information without delay. Learn more about building a log analytics solution »

Analyze IoT device data

Process streaming data from IoT devices, and then use the data to programmatically send real-time alerts and respond when a sensor exceeds certain operating thresholds.

Build video analytics applications

Securely stream video from camera-equipped devices. Use streams for video playback, security monitoring, face detection, ML, and other analytics. Learn more about building video streaming apps »


Easy to use

Amazon Kinesis Data Analytics enables you to easily and quickly build queries and sophisticated streaming applications in three simple steps: set up your streaming data sources, write your queries or streaming applications, and set up your destination for processed data.

Build sophisticated streaming analytics applications with Apache Flink

Amazon Kinesis Data Analytics includes open source libraries and runtimes based on Apache Flink that enable you to build an application in hours instead of months using your favorite IDE. The extensible libraries include different APIs that are specialized for different use cases including stateful stream processing, streaming ETL, and real-time analytics. You can use the libraries to integrate with Amazon Web Services services like Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), Amazon S3, Amazon DynamoDB, and more.

Quickly get started with streaming data processing with Amazon Kinesis Data Analytics Studio

With Amazon Kinesis Data Analytics Studio, you can interactively query data streams and rapidly develop stream processing applications in a development environment powered by Apache Zeppelin notebooks. Apache Flink powers the stream processing.

Use your favorite language

Amazon Kinesis Data Analytics supports building applications in SQL, Java, Scala, and Python. You can extend the open source libraries and include custom libraries from the language of your choice. Using Amazon Kinesis Data Analytics Studio, you can build applications interactively in SQL, Scala, and Python using Apache Zeppelin notebooks.

How to get started

Explore Kinesis capabilities

Learn more about implementing real-time applications with Kinesis.

Contact an expert

Reach out to an AWS expert to get your questions answered.

Get started with a free account

Gain instant access to the AWS Free Tier.

For new projects, we recommend that you use the new Managed Service for Apache Flink Studio over Kinesis Data Analytics for SQL Applications. Managed Service for Apache Flink Studio combines ease of use with advanced analytical capabilities, enabling you to build sophisticated stream processing applications in minutes.

What Is Amazon Kinesis Data Analytics for SQL Applications?

With Amazon Kinesis Data Analytics for SQL Applications, you can process and analyze streaming data using standard SQL. The service enables you to quickly author and run powerful SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics.

To get started with Kinesis Data Analytics, you create a Kinesis Data Analytics application that continuously reads and processes streaming data. The service supports ingesting data from Amazon Kinesis Data Streams and Amazon Data Firehose streaming sources. Then, you author your SQL code using the interactive editor and test it with live streaming data. You can also configure destinations where you want Kinesis Data Analytics to send the results.

Kinesis Data Analytics supports Amazon Data Firehose (Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk), AWS Lambda, and Amazon Kinesis Data Streams as destinations.

Are You a First-Time User of Amazon Kinesis Data Analytics?

If you are a first-time user of Amazon Kinesis Data Analytics, we recommend that you read the following sections in order:

  1. Read the How It Works section of this guide. This section introduces various Kinesis Data Analytics components that you work with to create an end-to-end experience. For more information, see Amazon Kinesis Data Analytics for SQL Applications: How It Works.

  2. Try the Getting Started exercises. For more information, see Getting Started with Amazon Kinesis Data Analytics for SQL Applications.

  3. Explore the streaming SQL concepts. For more information, see Streaming SQL Concepts.

  4. Try additional examples. For more information, see Kinesis Data Analytics for SQL examples.

Amazon Kinesis

Collect, process, and analyze real-time video and data streams

Ingest, buffer, and process streaming data in real time to derive insights in minutes, not days.

Run your streaming applications on serverless infrastructure with a fully managed service.

Handle any amount of streaming data from thousands of sources and process it with low latencies.

Building SQL applications

Configuring input for SQL applications

SQL applications in Kinesis Data Analytics support two types of inputs: streaming data sources and reference data sources. A streaming data source is continuously generated data that is read into your application for processing. A reference data source is static data that your application uses to enrich data coming in from streaming sources. Each application can have no more than one streaming data source and no more than one reference data source. An application continuously reads and processes new data from streaming data sources, including Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose. An application reads a reference data source, including Amazon S3, in its entirety for use in enriching the streaming data source through SQL JOINs.

You store reference data as an object in your S3 bucket. When the SQL application starts, Kinesis Data Analytics reads the S3 object and creates an in-application SQL table to store the reference data. Your application code can then join that table with an in-application stream. You can update the data in the SQL table by calling the UpdateApplication API.

A streaming data source can be an Amazon Kinesis data stream or an Amazon Kinesis Data Firehose delivery stream. Your Kinesis Data Analytics SQL application continuously reads new data from streaming data sources as it arrives in real time. The data is made accessible in your SQL code through an in-application stream. An in-application stream acts like a SQL table because you can create, insert, and select from it. However, the difference is that an in-application stream is continuously updated with new data from the streaming data source.

You can use the Amazon Web Services Management Console to add a streaming data source. You can learn more about sources in the Configuring Application Input section of the Kinesis Data Analytics for SQL Developer Guide.

A reference data source can be an Amazon S3 object. Your Kinesis Data Analytics SQL application reads the S3 object in its entirety when it starts running. The data is made accessible in your SQL code through a table. The most common use case for using a reference data source is to enrich the data coming from the streaming data source using a SQL JOIN.
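As a minimal sketch of this enrichment pattern (the reference table name "COMPANY_NAMES" and its columns are illustrative, not taken from the service documentation), a pump can join each arriving record against the in-application table built from the S3 object:

CREATE OR REPLACE STREAM "ENRICHED_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    company_name  VARCHAR(32),
    price         DOUBLE);

-- Continuously join each arriving record with the static reference
-- table and insert the enriched row into the output stream.
CREATE OR REPLACE PUMP "ENRICH_PUMP" AS
    INSERT INTO "ENRICHED_SQL_STREAM"
        SELECT STREAM "SOURCE_SQL_STREAM_001".ticker_symbol,
                      "c".company_name,
                      "SOURCE_SQL_STREAM_001".price
        FROM "SOURCE_SQL_STREAM_001"
        LEFT JOIN "COMPANY_NAMES" AS "c"
        ON "SOURCE_SQL_STREAM_001".ticker_symbol = "c".ticker_symbol;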

Using the AWS CLI, you can add a reference data source by specifying the S3 bucket, object, IAM role, and associated schema. Kinesis Data Analytics loads this data when you start the application and reloads it each time you make an update API call.

SQL applications in Kinesis Data Analytics can automatically detect the schema and parse UTF-8 encoded JSON and CSV records using the DiscoverInputSchema API. This schema is applied to the data read from the stream as part of the insertion into an in-application stream.

For other UTF-8 encoded data that does not use a delimiter, that uses a delimiter other than CSV, or in cases where the discovery API did not fully discover the schema, you can define a schema using the interactive schema editor or use string manipulation functions to structure your data. For more information, see Using the Schema Discovery Feature and Related Editing in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Kinesis Data Analytics for SQL applies your specified schema and inserts your data into one or more in-application streams for streaming sources, and into a single SQL table for reference sources. The default number of in-application streams meets the needs of most use cases. You should increase it if you find that your application is not keeping up with the latest data in your source stream, as measured by the CloudWatch metric MillisBehindLatest. The number of in-application streams required depends on both the throughput of your source stream and the complexity of your queries. The parameter that specifies the number of in-application streams mapped to your source stream is called input parallelism.

Authoring application code for SQL applications

You can use the following pattern to work with in-application streams:

• Always use a SELECT statement in the context of an INSERT statement. When you select rows, you insert results into another in-application stream.

• Use an INSERT statement in the context of a pump. You use a pump to make an INSERT statement continuous, and write to an in-application stream.

• You use a pump to tie in-application streams together, selecting from one in-application stream and inserting into another in-application stream.


For example, the following code creates a destination in-application stream and a pump that continuously selects the ticker symbol, price change, and price from the source stream and inserts the results into it:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    change        DOUBLE,
    price         DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
        SELECT STREAM ticker_symbol, change, price
        FROM "SOURCE_SQL_STREAM_001";

Kinesis Data Analytics includes a library of analytics templates for common use cases including streaming filters, tumbling time windows, and anomaly detection. You can access these templates from the SQL editor in the Amazon Web Services Management Console. After you create an application and navigate to the SQL editor, the templates are available in the upper-left corner of the console.
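For instance, the tumbling time window template produces per-interval aggregates. A sketch along those lines (again using the illustrative ticker stream from above) counts records per ticker symbol over 60-second windows using the STEP function:

CREATE OR REPLACE STREAM "TICKER_COUNT_STREAM" (
    ticker_symbol VARCHAR(4),
    ticker_count  INTEGER);

-- STEP truncates ROWTIME to 60-second intervals, so the GROUP BY
-- emits one row per ticker symbol per window.
CREATE OR REPLACE PUMP "COUNT_PUMP" AS
    INSERT INTO "TICKER_COUNT_STREAM"
        SELECT STREAM ticker_symbol, COUNT(*)
        FROM "SOURCE_SQL_STREAM_001"
        GROUP BY ticker_symbol,
                 STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);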

Kinesis Data Analytics includes pre-built SQL functions for several advanced analytics tasks, including anomaly detection. You can simply call this function from your SQL code to detect anomalies in real time. Kinesis Data Analytics uses the Random Cut Forest algorithm to implement anomaly detection.
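A hedged sketch of calling that function follows; RANDOM_CUT_FOREST is the documented function name, while the stream and column names are illustrative. The function appends an anomaly score to each record, computed over the numeric columns of the cursor it is given:

CREATE OR REPLACE STREAM "ANOMALY_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    price         DOUBLE,
    anomaly_score DOUBLE);

-- Higher scores indicate records that look anomalous relative to
-- the recent history of the stream.
CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
    INSERT INTO "ANOMALY_SQL_STREAM"
        SELECT STREAM ticker_symbol, price, ANOMALY_SCORE
        FROM TABLE(RANDOM_CUT_FOREST(
            CURSOR(SELECT STREAM ticker_symbol, price FROM "SOURCE_SQL_STREAM_001")));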

Configuring destinations in SQL applications

Kinesis Data Analytics for SQL supports up to three destinations per application. You can persist SQL results to Amazon S3, Amazon Redshift, and Amazon OpenSearch Service (through Amazon Kinesis Data Firehose), as well as to Amazon Kinesis Data Streams. You can write to a destination not directly supported by Kinesis Data Analytics by sending SQL results to Amazon Kinesis Data Streams and leveraging its integration with AWS Lambda to send them to a destination of your choice.

In your application code, you write the output of SQL statements to one or more in-application streams. Optionally, you can add an output configuration to your application to persist everything written to specific in-application streams to up to three external destinations. These external destinations can be an Amazon S3 bucket, an Amazon Redshift table, an Amazon OpenSearch Service domain (through Amazon Kinesis Data Firehose), or an Amazon Kinesis data stream, in any combination. For more information, see Configuring Application Output in the Amazon Kinesis Data Analytics for SQL Developer Guide.

You can use AWS Lambda to write to a destination that is not directly supported by Kinesis Data Analytics for SQL applications. We recommend that you write results to an Amazon Kinesis data stream, and then use AWS Lambda to read the processed results and send them to the destination of your choice. For more information, see Example: AWS Lambda Integration in the Amazon Kinesis Data Analytics for SQL Developer Guide. Alternatively, you can use a Kinesis Data Firehose delivery stream to load the data into Amazon S3, and then trigger an AWS Lambda function to read that data and send it to the destination of your choice.

SQL applications in Kinesis Data Analytics use an "at least once" delivery model for application output to the configured destinations. Kinesis Data Analytics applications take internal checkpoints, which are points in time when output records have been delivered to the destinations without data loss. The service uses the checkpoints as needed to ensure that your application output is delivered at least once to the configured destinations. For more information about the delivery model, see Configuring Application Output in the Amazon Kinesis Data Analytics for SQL Developer Guide.

Kafka vs. Kinesis Comparison

Performance
Kafka: More highly configurable than Kinesis. With Kafka, it's possible to write data to as little as one server, whereas Kinesis must write simultaneously to three servers, which makes Kafka the better-performing solution.
Kinesis: Configurability is limited because Kinesis must write synchronously to three servers within AWS. This requirement adds overhead to the platform, leading to degraded performance.

Cost
Kafka: Requires more engineering hours for implementation and maintenance, leading to a higher total cost of ownership (TCO).
Kinesis: As an AWS cloud-native service, Kinesis supports a pay-as-you-go model, leading to lower costs to achieve the same outcome.

Scalability
Kafka: Scalability is determined by brokers and partitions. A standard configuration of Kafka can reach a throughput of 30k messages per second. Kafka requires manual configuration for cross-replication.
Kinesis: Scalability is determined by shards. A shard provides a write capacity of 1 MB per second, or 1,000 records per second, and a read capacity of 2 MB per second, or 5 transactions per second. Kinesis has built-in cross-replication between geo-locations.

Security
Kafka: Requires a heavy amount of engineering to implement for its on-premises deployment, leading to unforeseen misconfigurations, vulnerabilities, and bugs.
Kinesis: Leverages more automated cloud-native services, removing the human element and mitigating the risk of unforeseen misconfigurations, vulnerabilities, and bugs.

Ease of Use
Kafka: Requires a heavy lift for implementation, making it a more challenging solution to implement, use, and maintain.
Kinesis: Designed for easy implementation. Spinning up Kinesis within AWS can be done with just a few clicks, making it a much easier service to spin up, use, and maintain.

Performance

When considering a larger data ecosystem, performance is a major concern. Businesses need to know that their data stream processing architecture and associated message brokering service will keep up with their stream processing requirements. That said, when looking at Kafka vs. Kinesis, there are some stark differences that influence performance.

One of the major considerations is how these tools are designed to operate. By design, Kinesis synchronously brokers data streams, writing and replicating ingested data across three different AWS machines. This replication cannot be reconfigured, and it adds overhead that affects throughput and latency.

Kafka gives the operator more control through its configurability than Kinesis does. It allows operators to configure the data publishing process to write to as little as one machine, removing some of the overhead seen with Kinesis. Here, Kafka is the clear winner.

Cost

Amazon's Kinesis follows the typical cloud pricing structure: pay-as-you-go, removing the requirement for on-premises data centers. Kinesis requires no upfront costs to set up (unless an organization seeks third-party services to configure its Kinesis environment). Amazon Kinesis also has no minimum fees, and businesses pay only for the resources they require. Kinesis Data Streams can be purchased via two capacity modes: on-demand and provisioned.

When we look at Kafka, whether in an on-premises or cloud deployment, cost is measured more in data engineering time. It takes significant technical resources to implement the solution fully and keep it running efficiently. For this reason, Kinesis is generally more cost-effective than Kafka.

Scalability

Although Kafka and Kinesis are highly configurable to meet the scale required of a data streaming environment, these two services offer that configurability in distinctly different ways.

For Kinesis, scaling is enabled by an abstraction of the Kinesis framework known as a shard.

A shard is the base throughput unit of a Kinesis data stream. A shard provides a write capacity of 1 MB per second, or up to 1,000 PUT records per second, and a read capacity of 2 MB per second, or 5 read transactions per second.
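As a rough sizing sketch under those limits (the workload numbers are illustrative): a producer writing 2,500 records per second at 2 KB each generates about 5 MB per second, so the stream needs max(5 MB/s ÷ 1 MB/s, 2,500 ÷ 1,000) = 5 shards on the write side, and those 5 shards together offer 10 MB per second of read capacity to consumers.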

With Kafka, scalability is highly configurable by the end user, which provides both benefits and challenges. At a high level, two primary components of the Kafka architecture influence throughput: brokers and partitions. When first configuring a Kafka environment, you start by configuring a Kafka cluster and its brokers, the underlying servers of the cluster. Here, choosing the right instance type for the Kafka cluster and the right number of brokers profoundly impacts throughput.

Unfortunately, selecting an instance type and the number of brokers isn't entirely straightforward. Typically, this comes down to some fine-tuning on the fly. Following Amazon's sizing guide can help, but most organizations will reconfigure the instance type and number of brokers according to their throughput needs as they scale.

A Kafka partition offers functionality comparable to a Kinesis shard. Much like the Kinesis shard, the more Kafka partitions configured within a Kafka cluster, the more simultaneous reads and writes Kafka can perform. And if you're wondering how this all boils down to throughput capabilities for Kafka: as a quick rule of thumb, a standard Kafka configuration can reach a throughput of 30k messages per second.

Aside from the scaling nuances between Kafka and Kinesis mentioned above, cross-replication is a major concern for those looking to replicate streaming data. Amazon Kinesis offers built-in cross-replication between geo-locations by default, while Kafka requires replication to be configured manually, a major consideration regarding scalability.

Security

Kafka and Kinesis are similarly positioned when it comes to security, with a couple of key differences.

First on the list is immutability. Both Kafka and Kinesis write immutably to their respective data stores: once an entry is written, no user or service can change it. This promotes a high degree of dependability and data durability in both Kafka and Kinesis and greatly mitigates the risk of data destruction or security vulnerabilities.

We also come to a draw when it comes to the security inherent to the cloud vs. the higher configurability of security available in Kafka. Here, arguments for and against could be made on both sides, and it’s largely a matter of preference.

However, the human element (or lack thereof) is where Amazon Kinesis may gain an edge over Kafka regarding security. Since Kafka requires such a heavy lift during implementation compared to Kinesis, it inherently introduces risk into the equation. Any time a large number of engineering hours is required for implementation, the chance of bugs, misconfigurations, and vulnerabilities rises with it.

Ease of Use

Lastly, let’s address ease of use. Since we’ve hit on this quite a bit in this piece, we’re sure you can guess the winner here. Right? Yep. Amazon Kinesis. Since Amazon Kinesis is a cloud-native pay-as-you-go service, it can be spun up easily and preconfigured to integrate with other AWS cloud-native services on the fly. On the flip side, Kafka typically requires physical on-premises self-managed infrastructure – lots of engineering hours and even third-party managed services to get it up and running.

Introduction to Event Streaming Platforms

As modern business needs have evolved, the monolithic app and singular database paradigm is quickly being replaced by a microservices architectural approach. The concept of microservices is to create a larger architectural ecosystem by stitching together many individual programs or systems, each of which can be patched and reworked on its own.

This architectural evolution to microservices requires a new approach to facilitate near-instantaneous communication between these interconnected microservices. Enter message brokering from event streaming platforms like Apache Kafka and Amazon Kinesis.

How it works

Amazon Kinesis cost-effectively processes and analyzes streaming data at any scale as a fully managed service. With Kinesis, you can ingest real-time data, such as video, audio, application logs, website clickstreams, and IoT telemetry data, for machine learning (ML), analytics, and other applications.

  • Kinesis Data Streams: Amazon Kinesis Data Streams is a serverless streaming data service that simplifies the capture, processing, and storage of data streams at any scale.

  • Kinesis Data Firehose: Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.

  • Kinesis Video Streams: With Amazon Kinesis Video Streams, you can more easily and securely stream video from connected devices to AWS for analytics, ML, playback, and other processing.

Use cases

Deliver streaming data in seconds

Develop applications that transform and deliver data to Amazon Simple Storage Service (Amazon S3), Amazon OpenSearch Service, and more.

Create real-time analytics

Interactively query and analyze data in real time and continuously produce insights for time-sensitive use cases.

Perform stateful processing

Use long-running, stateful computations to initiate real-time actions such as anomaly detection based on historical data trends.

Building Amazon Kinesis Data Analytics Studio applications

Authoring application code in Studio notebooks

You can write code in the notebook in your preferred language of SQL, Python, or Scala using Apache Flink's Table API. The Table API is a high-level abstraction and relational API that supports a superset of SQL's capabilities. It offers familiar operations such as select, filter, join, group by, and aggregate, along with stream-specific concepts like windowing. You use % to specify the language to be used in a section of the notebook, and can easily switch between languages. Interpreters are Apache Zeppelin plug-ins that enable developers to specify a language or data processing engine for each section of the notebook. You can also build user-defined functions and reference them to improve code functionality.

You can perform SQL operations such as Scan and Filter (SELECT, WHERE), Aggregations (GROUP BY, GROUP BY WINDOW, HAVING), Set (UNION, UNION ALL, INTERSECT, IN, EXISTS), Order (ORDER BY, LIMIT), Joins (INNER, OUTER, Timed Window – BETWEEN, AND, joining with temporal tables – tables that track changes over time), Top N, deduplication, and pattern recognition. Some of these queries, such as GROUP BY, OUTER JOIN, and Top N, are "results updating" for streaming data, which means that the results are continuously updated as the streaming data is processed. Other DDL statements such as CREATE, ALTER, and DROP are also supported. For a complete list of queries and samples, see https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/queries.html.
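As an illustrative sketch, assuming a table named stock_trades has been registered over a source stream with an event-time column trade_time, and assuming the paragraph is marked as Flink streaming SQL with the %flink.ssql interpreter magic, a one-minute tumbling-window aggregation looks like this:

%flink.ssql
-- Count trades per ticker over one-minute tumbling windows;
-- TUMBLE_END reports when each window closes.
SELECT
    ticker_symbol,
    TUMBLE_END(trade_time, INTERVAL '1' MINUTE) AS window_end,
    COUNT(*) AS trade_count
FROM stock_trades
GROUP BY ticker_symbol, TUMBLE(trade_time, INTERVAL '1' MINUTE);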

Apache Flink's Table API supports Python and Scala through language integration using Python strings and Scala expressions. The operations supported are very similar to the SQL operations supported, including select, order, group, join, filter, and windowing. A full list of operations and samples is available here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/tableApi.html

Kinesis Data Analytics Studio supports Apache Flink 1.11 and Apache Zeppelin 0.9.

  • Data sources: Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon S3
  • Destinations, or sinks: Amazon MSK, Amazon Kinesis Data Streams, and Amazon S3

You can configure additional integrations with a few more steps and lines of Apache Flink code (Python, Scala, or Java) to define connections with all Apache Flink supported integrations, including destinations such as Amazon Elasticsearch Service, Amazon ElastiCache for Redis, Amazon Aurora, Amazon Redshift, Amazon DynamoDB, Amazon Keyspaces, and more. You can attach executables for these custom connectors when you create or configure your Studio application.

We recommend getting started with Kinesis Data Analytics Studio, as it offers a more comprehensive stream processing experience with exactly-once processing. Kinesis Data Analytics Studio offers stream processing application development in your language of choice (SQL, Python, and Scala), scales to GB/s of processing, supports long-running computations over hours or even days, performs code updates within seconds, and handles multiple input streams, including Amazon Kinesis Data Streams and Amazon MSK.
