AWS Glue FAQs: Managed ETL Service (Amazon Web Services)
Q: How do I get started with AWS Glue?
To start using AWS Glue, simply sign into the AWS Management Console and navigate to “Glue” under the “Analytics” category. You can follow one of our guided tutorials that will walk you through an example use case for AWS Glue. You can also find sample ETL code in our GitHub repository under AWS Labs.
Q: How does AWS Glue relate to AWS Lake Formation?
Lake Formation leverages a shared infrastructure with AWS Glue, including console controls, ETL code creation and job monitoring, a common data catalog, and a serverless architecture. While AWS Glue remains focused on these types of functions, Lake Formation encompasses all AWS Glue features and provides additional capabilities designed to help you build, secure, and manage a data lake. See the AWS Lake Formation pages for more details.
Q: What is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a central repository that stores structural and operational metadata for all your data assets. For a given data set, you can store its table definition and physical location, add business-relevant attributes, and track how the data has changed over time.
Q: How do I import data from my existing Apache Hive Metastore to the AWS Glue Data Catalog?
You simply run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.
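For illustration, here is a minimal boto3 sketch of the final step of such a migration, registering one exported Hive table definition in the Glue Data Catalog. The database name, columns, and S3 location are hypothetical placeholders, not part of any migration utility.

```python
# Minimal sketch: register one exported Hive table definition in the
# Glue Data Catalog with boto3. The database/table names, columns, and
# S3 location are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_database(DatabaseInput={"Name": "sales_db"})

glue.create_table(
    DatabaseName="sales_db",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "order_date", "Type": "string"},
            ],
            "Location": "s3://my-bucket/warehouse/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```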
Q: Do I need to maintain my Apache Hive Metastore if I am storing my metadata in the AWS Glue Data Catalog?
No. AWS Glue Data Catalog is Apache Hive Metastore compatible. You can point to the Glue Data Catalog endpoint and use it as an Apache Hive Metastore replacement. For more information on how to configure your cluster to use AWS Glue Data Catalog as an Apache Hive Metastore, please read our documentation here.
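On Amazon EMR, for example, the switch amounts to one hive-site property. A minimal boto3 sketch follows; the cluster name, sizing, and IAM roles are illustrative placeholders, while the hive.metastore.client.factory.class value is the documented one.

```python
# Minimal sketch: launch an EMR cluster whose Hive uses the Glue Data
# Catalog as its metastore. Cluster name, sizing, and roles are
# illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="hive-on-glue-catalog",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hive"}],
    Configurations=[
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceCount": 3,
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```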
Q: If I am already using Amazon Athena or Amazon Redshift Spectrum and have tables in Amazon Athena’s internal data catalog, how can I start using the AWS Glue Data Catalog as my common metadata repository?
Before you can start using AWS Glue Data Catalog as a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and AWS Glue, you must upgrade your Amazon Athena data catalog to AWS Glue Data Catalog. The steps required for the upgrade are detailed here.
Q: What analytics services use the AWS Glue Data Catalog?
The metadata stored in the AWS Glue Data Catalog can be readily accessed from Glue ETL, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and third-party services.
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
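For Python, a Glue ETL script typically follows the skeleton below. The catalog database, table name, and S3 output path are hypothetical placeholders.

```python
# Minimal Glue ETL job skeleton in Python (PySpark). The catalog
# database/table and the S3 output path are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that was registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Resolve ambiguous column types, then write the result to S3 as Parquet.
clean = dyf.resolveChoice(choice="make_cols")
glue_context.write_dynamic_frame.from_options(
    frame=clean,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)
job.commit()
```

The same job could equally be written in Scala against the Glue Scala library.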
Q: Can I import custom libraries as part of my ETL script?
Yes. You can import custom Python libraries and Jar files into your AWS Glue ETL job. For more details, please check our documentation here.
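As a sketch, custom dependencies can be attached when the job is defined, using the documented --extra-py-files and --extra-jars job parameters. The job name, IAM role, and S3 paths below are hypothetical placeholders.

```python
# Minimal sketch: attach custom Python libraries and JARs to a Glue job
# via the --extra-py-files / --extra-jars job parameters. The job name,
# role, and S3 paths are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-py-files": "s3://my-bucket/libs/mylib.zip",
        "--extra-jars": "s3://my-bucket/libs/custom-serde.jar",
    },
)
```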
Q: Can I bring my own code?
Yes. You can write your own code using AWS Glue’s ETL library, or write your own Scala or Python code and upload it to a Glue ETL job. For more details, please check our documentation here.
Q: How can I develop my ETL code using my own IDE?
You can create and connect to development endpoints, which let you connect your notebooks and IDEs to AWS Glue.
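As a sketch, a development endpoint can be provisioned with boto3 and then reached over SSH from a notebook or IDE. The endpoint name, IAM role, and public key below are hypothetical placeholders.

```python
# Minimal sketch: provision a development endpoint and check its status.
# The endpoint name, role, and SSH public key are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_dev_endpoint(
    EndpointName="etl-dev",
    RoleArn="arn:aws:iam::123456789012:role/GlueDevEndpointRole",
    PublicKey="ssh-rsa AAAA... dev@example.com",
    NumberOfNodes=2,  # 2 DPUs is the documented minimum for an endpoint
)

# Poll until the endpoint is READY, then SSH-tunnel your notebook/IDE to it.
print(glue.get_dev_endpoint(EndpointName="etl-dev")["DevEndpoint"]["Status"])
```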
Q: How can I build end-to-end ETL workflow using multiple jobs in AWS Glue?
In addition to the ETL library and code generation, AWS Glue provides a robust set of orchestration features that allow you to manage dependencies between multiple jobs to build end-to-end ETL workflows. AWS Glue ETL jobs can either be triggered on a schedule or on a job completion event. Multiple jobs can be triggered in parallel or sequentially by triggering them on a job completion event. You can also trigger one or more Glue jobs from an external source such as an AWS Lambda function.
Q: How does AWS Glue monitor dependencies?
AWS Glue manages dependencies between two or more jobs or dependencies on external events using triggers. Triggers can watch one or more jobs as well as invoke one or more jobs. You can either have a scheduled trigger that invokes jobs periodically, an on-demand trigger, or a job completion trigger.
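For illustration, here is a minimal boto3 sketch of a conditional trigger that chains two jobs, starting "load-job" only after "extract-job" succeeds. The job and trigger names are hypothetical placeholders.

```python
# Minimal sketch: a conditional trigger so that "load-job" starts only
# after "extract-job" succeeds. Job and trigger names are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_trigger(
    Name="run-load-after-extract",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "extract-job",
                "State": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "load-job"}],
    StartOnCreation=True,
)
```

Setting Type to "SCHEDULED" with a cron-style Schedule, or to "ON_DEMAND", covers the other two trigger kinds described above.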
Q: Can I run my existing ETL jobs with AWS Glue?
Yes. You can run your existing Scala or Python code on AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
Q: How can I use AWS Glue to ETL streaming data?
AWS Glue supports ETL on streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon MSK. Add the stream to the Glue Data Catalog and then choose it as the data source when setting up your AWS Glue job.
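As a sketch, a streaming job that reads a cataloged Kinesis stream and processes it in micro-batches might look like the following. The database, table, window size, and S3 paths are hypothetical placeholders.

```python
# Minimal sketch of a streaming ETL job reading a stream registered in
# the Glue Data Catalog. Database/table, window size, and paths are
# hypothetical placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# The catalog table points at the Kinesis stream (or Kafka/MSK topic).
stream_df = glue_context.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

def process_batch(batch_df, batch_id):
    # Transform each micro-batch and land it in S3 as Parquet.
    batch_df.write.mode("append").parquet("s3://my-bucket/clickstream/")

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/clickstream/",
    },
)
```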
Q: Do I have to use both AWS Glue Data Catalog and Glue ETL to use the service?
No. While we do believe that using both the AWS Glue Data Catalog and ETL provides an end-to-end ETL experience, you can use either one of them independently without using the other.
Q: When should I use AWS Glue Streaming and when should I use Amazon Kinesis Data Analytics?
Both AWS Glue and Amazon Kinesis Data Analytics can be used to process streaming data. AWS Glue is recommended when your use cases are primarily ETL and when you want to run jobs on a serverless Apache Spark-based platform. Amazon Kinesis Data Analytics is recommended when your use cases are primarily analytics and when you want to run jobs on a serverless Apache Flink-based platform.
Q: When should I use AWS Glue and when should I use Amazon Kinesis Data Firehose?
Both AWS Glue and Amazon Kinesis Data Firehose can be used for streaming ETL. AWS Glue is recommended for complex ETL, including joining streams and partitioning the output in Amazon S3 based on the data content. Amazon Kinesis Data Firehose is recommended when your use cases focus on data delivery and preparing data to be processed after it is delivered.
Q: How do ML Transforms work?
AWS Glue includes specialized ML-based dataset transformation algorithms that you can use to create your own ML Transforms. These include record de-duplication and match finding.
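As a sketch, applying a trained FindMatches ML Transform from a Glue ETL script might look like this. The transform ID and the catalog database/table are hypothetical placeholders.

```python
# Minimal sketch: apply a previously trained FindMatches ML Transform to
# a DynamicFrame to flag likely duplicate records. The transform ID and
# catalog database/table are hypothetical placeholders.
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="crm_db", table_name="customers"
)

# The returned frame carries match columns grouping records that the
# trained transform considers the same real-world entity.
matched = FindMatches.apply(frame=dyf, transformId="tfm-0123456789abcdef")
```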
Q: Can I see a presentation on using AWS Glue (and AWS Lake Formation) to find matches and deduplicate records?
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Q: When should I use AWS Glue vs AWS Database Migration Service?
AWS Database Migration Service (DMS) helps you migrate databases to AWS easily and securely. For use cases that require a database migration from on-premises to AWS or database replication between on-premises sources and sources on AWS, we recommend you use AWS DMS. Once your data is in AWS, you can use AWS Glue to move and transform data from your data source into another database or data warehouse, such as Amazon Redshift.
Q: When should I use AWS Glue vs Amazon Kinesis Data Analytics?
Amazon Kinesis Data Analytics allows you to run standard SQL queries on your incoming data stream. You can specify a destination like Amazon S3 to write your results. Once your data is available in your target data source, you can kick off an AWS Glue ETL job to further transform your data and prepare it for additional analytics and reporting.
Q: When does billing for my AWS Glue jobs begin and end?
Billing commences as soon as the job is scheduled for execution and continues until the entire job completes. With AWS Glue, you only pay for the time for which your job runs and not for the environment provisioning or shutdown time.
Q: How does AWS Glue keep my data secure?
We provide server-side encryption for data at rest and SSL for data in motion.
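As a sketch, encryption at rest can be enabled through a security configuration attached to your jobs. The configuration name below is a hypothetical placeholder.

```python
# Minimal sketch: a Glue security configuration enabling server-side
# encryption for job output at rest. The name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_security_configuration(
    Name="encrypt-at-rest",
    EncryptionConfiguration={
        "S3Encryption": [{"S3EncryptionMode": "SSE-S3"}],
    },
)
```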
Q: What are the service limits associated with AWS Glue?
Please refer to our documentation to learn more about service limits.
Q: What regions is AWS Glue in?
Please refer to the AWS Region Table for details of AWS Glue service availability by region.
Q: How many DPUs (Data Processing Units) are allocated to the development endpoint?
A development endpoint is provisioned with 5 DPUs by default. You can configure a development endpoint with a minimum of 2 DPUs and a maximum of 5 DPUs.
Q: How do I scale the size and performance of my AWS Glue ETL jobs?
You can simply specify the number of DPUs (Data Processing Units) you want to allocate to your ETL job. A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job.
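As a sketch, the capacity can also be overridden for a single run at start time. The job name below is a hypothetical placeholder.

```python
# Minimal sketch: allocate more capacity to one run by overriding the
# DPU count at start time. The job name is a hypothetical placeholder.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_job_run(
    JobName="orders-etl",
    MaxCapacity=20.0,  # DPUs for this run; a Glue ETL job requires at least 2
)
print(run["JobRunId"])
```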
Q: How do I monitor the execution of my AWS Glue jobs?
AWS Glue provides the status of each job and pushes all notifications to Amazon CloudWatch. You can set up SNS notifications via CloudWatch actions to be informed of job failures or completions.
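As a sketch, a CloudWatch Events rule can route Glue "Job State Change" events for failures to an SNS topic. The rule name and topic ARN below are hypothetical placeholders.

```python
# Minimal sketch: route Glue job-failure events to an SNS topic via a
# CloudWatch Events rule. Rule name and topic ARN are hypothetical.
import json

import boto3

events = boto3.client("events", region_name="us-east-1")

events.put_rule(
    Name="glue-job-failures",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
)

events.put_targets(
    Rule="glue-job-failures",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
)
```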
Q: What does the AWS Glue SLA guarantee?
Our AWS Glue SLA guarantees a Monthly Uptime Percentage of at least 99.9% for AWS Glue.
Q: How do I know if I qualify for an SLA Service Credit?
You are eligible for an SLA credit for AWS Glue under the AWS Glue SLA if more than one Availability Zone in which you are running a task within the same region has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle.