AWS Glue FAQs: Managed ETL Service (Amazon Web Services)
Q: How do I get started with AWS Glue?
To start using AWS Glue, simply sign into the AWS Management Console and navigate to “Glue” under the “Analytics” category. You can follow one of our guided tutorials that will walk you through an example use case for AWS Glue. You can also find sample ETL code in our GitHub repository under AWS Labs.
Q: How does AWS Glue relate to AWS Lake Formation?
Lake Formation leverages a shared infrastructure with AWS Glue, including console controls, ETL code creation and job monitoring, a common data catalog, and a serverless architecture. While AWS Glue remains focused on these types of functions, Lake Formation encompasses all AWS Glue features and provides additional capabilities designed to help you build, secure, and manage a data lake. See the AWS Lake Formation pages for more details.
Q: What is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a central repository that stores structural and operational metadata for all your data assets. For a given data set, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.
Q: How do I import data from my existing Apache Hive Metastore to the AWS Glue Data Catalog?
You simply run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.
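For illustration, here is a minimal boto3 sketch of the final step of such a migration, registering one exported Hive table definition in the Glue Data Catalog. The database name, columns, and S3 location are hypothetical placeholders, not part of any migration utility.

```python
# Minimal sketch: register one exported Hive table definition in the
# Glue Data Catalog with boto3. The database/table names, columns, and
# S3 location are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_database(DatabaseInput={"Name": "sales_db"})

glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "order_date", "Type": "string"},
            ],
            "Location": "s3://my-bucket/warehouse/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```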
Q: Do I need to maintain my Apache Hive Metastore if I am storing my metadata in the AWS Glue Data Catalog?
No. AWS Glue Data Catalog is Apache Hive Metastore compatible. You can point to the Glue Data Catalog endpoint and use it as an Apache Hive Metastore replacement. For more information on how to configure your cluster to use AWS Glue Data Catalog as an Apache Hive Metastore, please read our documentation here.
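On Amazon EMR, for example, the switch amounts to one hive-site property. A minimal boto3 sketch follows; the cluster name, sizing, and IAM roles are illustrative placeholders, while the hive.metastore.client.factory.class value is the documented one.

```python
# Minimal sketch: launch an EMR cluster whose Hive uses the Glue Data
# Catalog as its metastore. Cluster name, sizing, and roles are
# illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="hive-on-glue-catalog",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```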
Q: If I am already using Amazon Athena or Amazon Redshift Spectrum and have tables in Amazon Athena’s internal data catalog, how can I start using the AWS Glue Data Catalog as my common metadata repository?
Before you can start using AWS Glue Data Catalog as a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and AWS Glue, you must upgrade your Amazon Athena data catalog to AWS Glue Data Catalog. The steps required for the upgrade are detailed here.
Q: What analytics services use the AWS Glue Data Catalog?
The metadata stored in the AWS Glue Data Catalog can be readily accessed from Glue ETL, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and third-party services.
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
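For Python, a Glue ETL script typically follows the skeleton below. The catalog database, table name, and S3 output path are hypothetical placeholders.

```python
# Minimal Glue ETL job skeleton in Python (PySpark). The catalog
# database/table and the S3 output path are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that was registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Resolve ambiguous column types, then write the result to S3 as Parquet.
clean = dyf.resolveChoice(choice="make_cols")
glue_context.write_dynamic_frame.from_options(
    frame=clean,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)
job.commit()
```

The same job could equally be written in Scala against the Glue Scala library.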
Q: Can I import custom libraries as part of my ETL script?
Yes. You can import custom Python libraries and Jar files into your AWS Glue ETL job. For more details, please check our documentation here.
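As a sketch, custom dependencies can be attached when the job is defined, using the documented --extra-py-files and --extra-jars job parameters. The job name, IAM role, and S3 paths below are hypothetical placeholders.

```python
# Minimal sketch: attach custom Python libraries and JARs to a Glue job
# via the --extra-py-files / --extra-jars job parameters. The job name,
# role, and S3 paths are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-py-files": "s3://my-bucket/libs/mylib.zip",
        "--extra-jars": "s3://my-bucket/libs/custom-serde.jar",
    },
)
```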
Q: Can I bring my own code?
Yes. You can write your own code using AWS Glue’s ETL library, or write your own Scala or Python code and upload it to a Glue ETL job. For more details, please check our documentation here.
Q: How can I develop my ETL code using my own IDE?
You can create and connect to development endpoints, which let you connect your notebooks and IDEs to AWS Glue.
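As a sketch, a development endpoint can be provisioned with boto3 and then reached over SSH from a notebook or IDE. The endpoint name, IAM role, and public key below are hypothetical placeholders.

```python
# Minimal sketch: provision a development endpoint and check its status.
# The endpoint name, role, and SSH public key are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_dev_endpoint(
    EndpointName="etl-dev",
    RoleArn="arn:aws:iam::123456789012:role/GlueDevEndpointRole",
    PublicKey="ssh-rsa AAAA... dev@example.com",
    NumberOfNodes=2,  # 2 DPUs is the documented minimum for an endpoint
)

# Poll until the endpoint is READY, then SSH-tunnel your notebook/IDE to it.
print(glue.get_dev_endpoint(EndpointName="etl-dev")["DevEndpoint"]["Status"])
```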
Q: How can I build end-to-end ETL workflow using multiple jobs in AWS Glue?
In addition to the ETL library and code generation, AWS Glue provides a robust set of orchestration features that allow you to manage dependencies between multiple jobs to build end-to-end ETL workflows. AWS Glue ETL jobs can either be triggered on a schedule or on a job completion event. Multiple jobs can be triggered in parallel or sequentially by triggering them on a job completion event. You can also trigger one or more Glue jobs from an external source such as an AWS Lambda function.
Q: How does AWS Glue monitor dependencies?
AWS Glue manages dependencies between two or more jobs or dependencies on external events using triggers. Triggers can watch one or more jobs as well as invoke one or more jobs. You can either have a scheduled trigger that invokes jobs periodically, an on-demand trigger, or a job completion trigger.
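For illustration, here is a minimal boto3 sketch of a conditional trigger that chains two jobs, starting "load-job" only after "extract-job" succeeds. The job and trigger names are hypothetical placeholders.

```python
# Minimal sketch: a conditional trigger so that "load-job" starts only
# after "extract-job" succeeds. Job and trigger names are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="run-load-after-extract",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "extract-job",
                "State": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "load-job"}],
    StartOnCreation=True,
)
```

Setting Type to "SCHEDULED" with a cron-style Schedule, or to "ON_DEMAND", covers the other two trigger kinds described above.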
Q: Can I run my existing ETL jobs with AWS Glue?
Yes. You can run your existing Scala or Python code on AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
Q: How can I use AWS Glue to ETL streaming data?
AWS Glue supports ETL on streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon MSK. Add the stream to the Glue Data Catalog and then choose it as the data source when setting up your AWS Glue job.
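As a sketch, a streaming job that reads a cataloged Kinesis stream and processes it in micro-batches might look like the following. The database, table, window size, and S3 paths are hypothetical placeholders.

```python
# Minimal sketch of a streaming ETL job reading a stream registered in
# the Glue Data Catalog. Database/table, window size, and paths are
# hypothetical placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# The catalog table points at the Kinesis stream (or Kafka/MSK topic).
stream_df = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # Transform each micro-batch and land it in S3 as Parquet.
    batch_df.write.mode("append").parquet("s3://my-bucket/clickstream/")

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/clickstream/",
    },
)
```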
Q: Do I have to use both AWS Glue Data Catalog and Glue ETL to use the service?
No. While we do believe that using both the AWS Glue Data Catalog and ETL provides an end-to-end ETL experience, you can use either one of them independently without using the other.
Q: When should I use AWS Glue Streaming and when should I use Amazon Kinesis Data Analytics?
Both AWS Glue and Amazon Kinesis Data Analytics can be used to process streaming data. AWS Glue is recommended when your use cases are primarily ETL and when you want to run jobs on a serverless Apache Spark-based platform. Amazon Kinesis Data Analytics is recommended when your use cases are primarily analytics and when you want to run jobs on a serverless Apache Flink-based platform.
Q: When should I use AWS Glue and when should I use Amazon Kinesis Data Firehose?
Both AWS Glue and Amazon Kinesis Data Firehose can be used for streaming ETL. AWS Glue is recommended for complex ETL, including joining streams and partitioning the output in Amazon S3 based on the data content. Amazon Kinesis Data Firehose is recommended when your use cases focus on data delivery and preparing data to be processed after it is delivered.
Q: How do ML Transforms work?
AWS Glue includes specialized ML-based dataset transformation algorithms that you can use to create your own ML Transforms. These include record de-duplication and match finding.
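As a sketch, applying a trained FindMatches ML Transform from a Glue ETL script might look like this. The transform ID and the catalog database/table are hypothetical placeholders.

```python
# Minimal sketch: apply a previously trained FindMatches ML Transform to
# a DynamicFrame to flag likely duplicate records. The transform ID and
# catalog database/table are hypothetical placeholders.
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="crm_db", table_name="customers"
)

# The returned frame carries match columns grouping records that the
# trained transform considers the same real-world entity.
matched = FindMatches.apply(frame=dyf, transformId="tfm-0123456789abcdef")
```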
Q: Can I see a presentation on using AWS Glue (and AWS Lake Formation) to find matches and deduplicate records?
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Q: When should I use AWS Glue vs AWS Database Migration Service?
AWS Database Migration Service (DMS) helps you migrate databases to AWS easily and securely. For use cases that require a database migration from on-premises to AWS or database replication between on-premises sources and sources on AWS, we recommend you use AWS DMS. Once your data is in AWS, you can use AWS Glue to move and transform data from your data source into another database or data warehouse, such as Amazon Redshift.
Q: When should I use AWS Glue vs Amazon Kinesis Data Analytics?
Amazon Kinesis Data Analytics allows you to run standard SQL queries on your incoming data stream. You can specify a destination like Amazon S3 to write your results. Once your data is available in your target data source, you can kick off an AWS Glue ETL job to further transform your data and prepare it for additional analytics and reporting.
Q: When does billing for my AWS Glue jobs begin and end?
Billing commences as soon as the job is scheduled for execution and continues until the entire job completes. With AWS Glue, you only pay for the time for which your job runs and not for the environment provisioning or shutdown time.
Q: How does AWS Glue keep my data secure?
We provide server-side encryption for data at rest and SSL for data in motion.
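As a sketch, encryption at rest can be enabled through a security configuration attached to your jobs. The configuration name below is a hypothetical placeholder.

```python
# Minimal sketch: a Glue security configuration enabling server-side
# encryption for job output at rest. The name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_security_configuration(
    Name="encrypt-at-rest",
    EncryptionConfiguration={
        "S3Encryption": [{"S3EncryptionMode": "SSE-S3"}],
    },
)
```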
Q: What are the service limits associated with AWS Glue?
Please refer to our documentation to learn more about service limits.
Q: What regions is AWS Glue in?
Please refer to the AWS Region Table for details of AWS Glue service availability by region.
Q: How many DPUs (Data Processing Units) are allocated to the development endpoint?
A development endpoint is provisioned with 5 DPUs by default. You can configure a development endpoint with a minimum of 2 DPUs and a maximum of 5 DPUs.
Q: How do I scale the size and performance of my AWS Glue ETL jobs?
You can simply specify the number of DPUs (Data Processing Units) you want to allocate to your ETL job. A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job.
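As a sketch, the capacity can also be overridden for a single run at start time. The job name below is a hypothetical placeholder.

```python
# Minimal sketch: allocate more capacity to one run by overriding the
# DPU count at start time. The job name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_job_run(
    JobName="orders-etl",
    MaxCapacity=20.0,  # DPUs for this run; a Glue ETL job requires at least 2
)
print(run["JobRunId"])
```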
Q: How do I monitor the execution of my AWS Glue jobs?
AWS Glue provides the status of each job and pushes all notifications to Amazon CloudWatch. You can set up SNS notifications via CloudWatch actions to be informed of job failures or completions.
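As a sketch, a CloudWatch Events rule can route Glue "Job State Change" events for failures to an SNS topic. The rule name and topic ARN below are hypothetical placeholders.

```python
# Minimal sketch: route Glue job-failure events to an SNS topic via a
# CloudWatch Events rule. Rule name and topic ARN are hypothetical.
import json

import boto3

events = boto3.client("events", region_name="us-east-1")

events.put_rule(
    Name="glue-job-failures",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
)

events.put_targets(
    Rule="glue-job-failures",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
)
```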
Q: What does the AWS Glue SLA guarantee?
Our AWS Glue SLA guarantees a Monthly Uptime Percentage of at least 99.9% for AWS Glue.
Q: How do I know if I qualify for an SLA Service Credit?
You are eligible for an SLA credit for AWS Glue under the AWS Glue SLA if more than one Availability Zone in which you are running a task within the same region has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle.