Amazon Redshift FAQs Cloud Data Warehouse Amazon Web Services Flashcards
AQUA is a new distributed and hardware-accelerated cache that enables Redshift to run up to 10x faster than any other cloud data warehouse. Existing data warehousing architectures with centralized storage require data be moved to compute clusters for processing. As data warehouses continue to grow over the next few years, the network bandwidth needed to move all this data becomes a bottleneck on query performance.
Q: What is AQUA (Advanced Query Accelerator) for Amazon Redshift?
If you are already using Amazon Redshift DS or DC nodes, you can upgrade your existing clusters to the new RA3 compute instances to use managed storage. You can also create a new cluster based on RA3 instances, in which case managed storage is automatically included. No other action is required to use this capability.
Q: How do I use Amazon Redshift’s managed storage?
Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse. This includes provisioning infrastructure capacity, automating ongoing administrative tasks such as backups and patching, and monitoring nodes and drives to recover from failures. For Redshift Spectrum, Amazon Redshift manages all the computing infrastructure, load balancing, planning, scheduling, and execution of your queries on data stored in Amazon S3.
Q: How does Amazon Redshift simplify data warehouse management?
Amazon Redshift uses a variety of innovations to achieve up to ten times better performance than traditional databases for data warehousing and analytics workloads. They include the following:
Q: How does the performance of Amazon Redshift compare to most on-premises databases for data warehousing and analytics?
You can sign up and get started within minutes from the Amazon Redshift detail page or via the AWS Management Console. If you don't already have an AWS account, you'll be prompted to create one. Visit the Getting Started page to see how to try Amazon Redshift for free.
Q: How do I get started with Amazon Redshift?
You can easily create an Amazon Redshift data warehouse cluster by using the AWS Management Console or the Amazon Redshift APIs. You can start with a single node, 160GB data warehouse and scale all the way to petabytes or more with a few clicks in the AWS Console or a single API call.
Q: How do I create and access an Amazon Redshift data warehouse cluster?
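A minimal sketch of the "single API call" path from the cluster-creation card above. It only assembles the request parameters for the CreateCluster operation as exposed by boto3's `create_cluster`. The helper name, the cluster identifier, the user name, and the node types are illustrative assumptions, and a real password should come from a secret store rather than source code.

```python
def build_create_cluster_params(cluster_id, node_type, master_user,
                                master_password, number_of_nodes=1):
    """Assemble keyword arguments for a CreateCluster request.

    Hypothetical helper: it builds the parameter dict only and makes
    no network call.
    """
    if number_of_nodes < 1:
        raise ValueError("number_of_nodes must be at least 1")
    params = {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
    }
    if number_of_nodes == 1:
        # A single-node cluster does not take a NumberOfNodes value.
        params["ClusterType"] = "single-node"
    else:
        params["ClusterType"] = "multi-node"
        params["NumberOfNodes"] = number_of_nodes
    return params


# Placeholder values, not real resources.
params = build_create_cluster_params(
    "example-cluster", "dc2.large", "exampleadmin", "REPLACE_ME")

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("redshift").create_cluster(**params)
```

Keeping the parameter assembly separate from the API call makes the request easy to inspect before anything is provisioned.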
You can create a cluster using either RA3, DC, or DS node types. RA3 node types enable you to scale and pay for compute and storage independently. You choose the number of instances you need based on performance requirements, and only pay for the managed storage that you use.
Q: What is the maximum storage capacity per compute node? What is the recommended amount of data per compute node for optimal performance?
You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with big data processing frameworks such as Apache Spark, Hadoop, Presto, or HBase. Amazon EMR gives you full control over the configuration of your clusters and the software you install on them.
Q: When would I use Amazon Redshift or Redshift Spectrum vs. Amazon EMR?
Amazon Athena is the simplest way to give any employee the ability to run ad-hoc queries on data in Amazon S3. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately.
Q: When should I use Amazon Athena vs. Redshift Spectrum?
Amazon Redshift automatically handles many of the time-consuming tasks associated with managing your own data warehouse, including:
Q: Why should I use Amazon Redshift instead of running my own MPP data warehouse cluster on Amazon EC2?
You pay only for what you use, and there are no minimum or setup fees. Amazon Redshift supports the ability to pause and resume a cluster, allowing you to easily suspend on-demand billing while the cluster is not being used. For example, a cluster used for development can have compute billing suspended when not in use. While the cluster is paused, you are only charged for the cluster’s storage. For steady-state production workloads, you can get significant discounts over on-demand pricing by switching to Reserved Instances.
Q: How will I be charged and billed for my use of Amazon Redshift?
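A rough sketch of the pause and resume behavior described in the billing card above. It assumes a boto3 Redshift client, which provides `pause_cluster` and `resume_cluster`. The wrapper function and the cluster identifier are hypothetical.

```python
def set_cluster_running(client, cluster_id, running):
    """Resume the cluster when `running` is True, otherwise pause it.

    `client` is expected to behave like boto3.client("redshift").
    """
    if running:
        return client.resume_cluster(ClusterIdentifier=cluster_id)
    return client.pause_cluster(ClusterIdentifier=cluster_id)


# Example usage with boto3 installed and credentials configured:
#   import boto3
#   redshift = boto3.client("redshift")
#   set_cluster_running(redshift, "example-dev-cluster", running=False)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, for instance in a scheduled job that pauses development clusters overnight.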
You can use our COPY command to load data in parallel directly to Amazon Redshift from Amazon EMR, Amazon DynamoDB, or any SSH-enabled host. Redshift Spectrum also enables you to load data from Amazon S3 into your cluster with a simple INSERT INTO command. This could enable you to load data from various formats such as Parquet and RC into your cluster. Note that if you use this approach, you will accrue Redshift Spectrum charges for the data scanned from Amazon S3.
Q: How do I load data from my existing Amazon RDS, Amazon EMR, Amazon DynamoDB, and Amazon EC2 data sources to Amazon Redshift?
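The loading paths in the card above might look roughly like the statements generated below. This is a sketch only: the table names, the role ARN, the DynamoDB table, and the external schema are placeholders, and the string formatting assumes trusted identifiers rather than user input.

```python
def copy_statement(table, source_uri, iam_role_arn, options=""):
    """Build a Redshift COPY statement as a string (illustration only)."""
    stmt = f"COPY {table} FROM '{source_uri}' IAM_ROLE '{iam_role_arn}'"
    if options:
        stmt += f" {options}"
    return stmt + ";"


def insert_from_external(local_table, external_schema, external_table):
    """Build an INSERT INTO ... SELECT that reads a Spectrum external table."""
    return (f"INSERT INTO {local_table} "
            f"SELECT * FROM {external_schema}.{external_table};")


ROLE = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # placeholder ARN

# COPY from DynamoDB; READRATIO caps the share of the DynamoDB table's
# provisioned read throughput that the load may consume.
print(copy_statement("sales", "dynamodb://ExampleSales", ROLE, "READRATIO 50"))

# Load from an S3-backed external table through Redshift Spectrum.
print(insert_from_external("sales", "example_spectrum", "sales_parquet"))
```

The INSERT path scans data in Amazon S3 through Redshift Spectrum, so the Spectrum charges mentioned in the card apply to it.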
You can use AWS Import/Export to transfer the data to Amazon S3 using portable storage devices. In addition, you can use AWS Direct Connect to establish a private network connection between your network or data center and AWS. You can choose 1Gbit/sec or 10Gbit/sec connection ports to transfer your data.
Q: I have a lot of data for initial loading into Amazon Redshift. Transferring via the Internet would take a long time. How do I load this data?
Yes. Granular column-level security controls ensure users see only the data they should have access to. Amazon Redshift supports column-level access control for local tables, so you can control access to individual columns of a table or view by granting or revoking column-level privileges for a user or user group. Redshift is integrated with AWS Lake Formation, ensuring Lake Formation’s column-level access controls are also enforced for Redshift queries on the data in the data lake.
Q: Does Redshift support granular access controls like column level security?
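As an illustration of the column-level privileges in the card above, the sketch below generates GRANT and REVOKE statements that name specific columns. The table, column, user, and group names are made up, and the helpers assume trusted identifiers.

```python
COLUMN_PRIVILEGES = ("SELECT", "UPDATE")


def _column_clause(privilege, columns):
    """Return e.g. 'SELECT (col_a, col_b)' after basic validation."""
    privilege = privilege.upper()
    if privilege not in COLUMN_PRIVILEGES:
        raise ValueError(f"unsupported column privilege: {privilege}")
    if not columns:
        raise ValueError("at least one column is required")
    return f"{privilege} ({', '.join(columns)})"


def grant_columns(privilege, columns, table, grantee, is_group=False):
    """Build a column-level GRANT for a user or a user group."""
    target = f"GROUP {grantee}" if is_group else grantee
    return f"GRANT {_column_clause(privilege, columns)} ON {table} TO {target};"


def revoke_columns(privilege, columns, table, grantee, is_group=False):
    """Build the matching column-level REVOKE."""
    target = f"GROUP {grantee}" if is_group else grantee
    return f"REVOKE {_column_clause(privilege, columns)} ON {table} FROM {target};"


print(grant_columns("select", ["cust_name", "cust_phone"],
                    "cust_profile", "example_user"))
print(revoke_columns("select", ["cust_phone"],
                     "cust_profile", "example_analysts", is_group=True))
```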
Yes. Customers who want to use their corporate identity providers, such as Microsoft Azure Active Directory, Active Directory Federation Services, Okta, Ping Federate, or other SAML-compliant identity providers, can configure Amazon Redshift to provide single sign-on.
Q: Does Redshift support single sign-on?
You can sign on to an Amazon Redshift cluster with Microsoft Azure Active Directory (AD) identities. This lets you sign on to Redshift without duplicating Azure Active Directory identities in Redshift.
Q: How does Redshift support single sign-on with Microsoft Azure Active Directory?
Yes. You can use multi-factor authentication (MFA) for additional security when authenticating to your Amazon Redshift cluster.
Q: Does Amazon Redshift support multi-factor authentication (MFA)?
Yes, you can use Amazon Redshift as part of your VPC configuration. With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own data center. This gives you complete control over who can access your Amazon Redshift data warehouse cluster. You can use Redshift Spectrum with an Amazon Redshift cluster that is part of your VPC.
Q: Can I use Amazon Redshift in Amazon Virtual Private Cloud (Amazon VPC)?
No. Your Amazon Redshift compute nodes are in a private network space and can only be accessed from your data warehouse cluster's leader node. This provides an additional layer of security for your data.
Q: Can I access my Amazon Redshift compute nodes directly?
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. The data warehouse cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the DB. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from S3 first to allow you to resume querying your data as quickly as possible. Single-node clusters do not support data replication. In the event of a drive failure, you will need to restore the cluster from a snapshot on S3. We recommend using at least two nodes for production.
Q: What happens to my data warehouse cluster availability and data durability if a drive on one of my nodes fails?
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. The data warehouse cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the DB. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from S3 first to allow you to resume querying your data as quickly as possible. Single-node clusters do not support data replication. In the event of a drive failure, you will need to restore the cluster from a snapshot on S3. We recommend using at least two nodes for production.
Q: What happens to my data warehouse cluster availability and data durability in the event of individual node failure?
If your Amazon Redshift data warehouse cluster's Availability Zone becomes unavailable, you will not be able to use your cluster until power and network access to the AZ are restored. Your data warehouse cluster's data is preserved, so you can start using your Amazon Redshift data warehouse as soon as the AZ becomes available again. You can also choose to restore any existing snapshots to a new AZ in the same Region. Amazon Redshift will restore your most frequently accessed data first so you can resume queries as quickly as possible.
Q: What happens to my data warehouse cluster availability and data durability if my data warehouse cluster's Availability Zone (AZ) has an outage?
Amazon Redshift replicates all your data within your data warehouse cluster when it is loaded and also continuously backs up your data to Amazon S3. Amazon Redshift always attempts to maintain at least three copies of your data (the original and replica on the compute nodes, and a backup in Amazon S3). Redshift can also asynchronously replicate your snapshots to S3 in another region for disaster recovery.
Q: How does Amazon Redshift back up my data? How do I restore my cluster from a backup?
You can use the AWS Management Console or ModifyCluster API to manage the period of time your automated backups are retained by modifying the RetentionPeriod parameter. If you wish to turn off automated backups altogether, you can set the retention period to 0 (not recommended).
Q: How do I manage the retention of my automated backups and snapshots?
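A hedged sketch of the retention change described in the card above. In the ModifyCluster API (boto3's `modify_cluster`), the automated-backup retention setting is carried, to the best of my knowledge, by the `AutomatedSnapshotRetentionPeriod` field. The 0 to 35 day range checked below is likewise my understanding of the API limit and should be confirmed against the API reference. The helper name and cluster identifier are hypothetical.

```python
def build_retention_params(cluster_id, retention_days):
    """Assemble ModifyCluster arguments for the automated snapshot retention.

    retention_days == 0 disables automated backups (not recommended).
    The upper bound of 35 is an assumption to verify in the API docs.
    """
    if not 0 <= retention_days <= 35:
        raise ValueError("retention_days must be between 0 and 35")
    return {
        "ClusterIdentifier": cluster_id,
        "AutomatedSnapshotRetentionPeriod": retention_days,
    }


params = build_retention_params("example-cluster", 7)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("redshift").modify_cluster(**params)
```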
When you delete a data warehouse cluster you have the ability to specify whether a final snapshot is created upon deletion. This enables a restore of the deleted data warehouse cluster at a later date. All previously created manual snapshots of your data warehouse cluster will be retained and billed at standard Amazon S3 rates, unless you choose to delete them.
Q: What happens to my backups if I delete my data warehouse cluster?
If you would like to increase query performance or respond to CPU, memory, or I/O over-utilization, you can increase the number of nodes within your data warehouse cluster using Elastic Resize via the AWS Management Console or the ModifyCluster API. When you modify your data warehouse cluster, your requested changes will be applied immediately. Metrics for compute utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add user-defined metrics via Amazon CloudWatch custom metric functionality.
Q: How do I scale the size and performance of my Amazon Redshift data warehouse cluster?
It depends. When you use the Concurrency Scaling feature, the cluster is fully available for reads and writes during concurrency scaling. With Elastic Resize, the cluster is unavailable for four to eight minutes of the resize period. With the Redshift RA3 storage elasticity in managed storage, the cluster is fully available and data is automatically moved between managed storage and compute nodes.
Q: Will my data warehouse cluster remain available during scaling?
A typical data warehouse has significant variance in concurrent query usage over the course of a day. It is more cost-effective to add resources just for the period during which they are required rather than provisioning to peak demand. Amazon Redshift handles this automatically on your behalf.
Q: How do I manage resources to ensure that my Redshift cluster can provide consistently fast performance during periods of high concurrency?
Elastic Resize adds or removes nodes from a single Redshift cluster within minutes to manage its query throughput. For example, an ETL workload for certain hours in a day or month-end reporting may need additional Redshift resources to complete on time. Concurrency Scaling adds additional cluster resources to increase the overall query concurrency.
Q: What is Elastic Resize and how is it different from Concurrency Scaling?
No. Concurrency Scaling is a massively scalable pool of Redshift resources and customers do not have direct access.
Q: Can I access the Concurrency Scaling clusters directly?
Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers. You can download Amazon Redshift custom JDBC and ODBC drivers from the Connect Client tab of the Redshift Console. We have validated integrations with popular BI and ETL vendors, a number of which are offering free trials to help you get started loading and analyzing your data. You can also go to the AWS Marketplace to deploy and configure solutions designed to work with Amazon Redshift in minutes.
Q: Are Amazon Redshift and Redshift Spectrum compatible with my preferred business intelligence software package and ETL tools?
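For the JDBC access mentioned in the card above, BI and ETL tools typically ask for a connection URL. The sketch below builds one in the `jdbc:redshift://host:port/database` form used by the Redshift JDBC driver. The endpoint and database name are placeholders, and 5439 is the default Redshift port.

```python
def build_jdbc_url(host, database, port=5439):
    """Return a Redshift JDBC URL for the given cluster endpoint."""
    if not host or not database:
        raise ValueError("host and database are required")
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:redshift://{host}:{port}/{database}"


# Placeholder endpoint; copy the real one from the cluster's details page.
url = build_jdbc_url(
    "example-cluster.abc123xyz789.us-west-2.redshift.amazonaws.com", "dev")
print(url)
```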
Redshift Spectrum currently supports many open source data formats, including Avro, CSV, Grok, Amazon Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, Sequence, Text, and TSV.
Q: What data formats and compression formats does Redshift Spectrum support?
Just like with local tables, you can use the schema name to pick exactly which one you mean by using schema_name.table_name in your query.
Q: What happens if a table in my local storage has the same name as an external table?
Yes. The CREATE EXTERNAL SCHEMA command supports Hive Metastores. We do not currently support DDL against the Hive Metastore.
Q: I use a Hive Metastore to store metadata about my S3 data lake. Can I use Redshift Spectrum?
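A minimal sketch of the CREATE EXTERNAL SCHEMA statement that the Hive Metastore card above refers to. The schema name, Hive database, metastore address, and role ARN are placeholders. Port 9083 is used as the default on the assumption that the metastore listens on the usual Hive Metastore port.

```python
def create_hive_external_schema(schema, hive_database, metastore_uri,
                                iam_role_arn, port=9083):
    """Build a CREATE EXTERNAL SCHEMA ... FROM HIVE METASTORE statement."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM HIVE METASTORE\n"
        f"DATABASE '{hive_database}'\n"
        f"URI '{metastore_uri}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


print(create_hive_external_schema(
    "example_hive_schema",
    "example_hive_db",
    "172.10.10.10",                                      # placeholder address
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",  # placeholder ARN
))
```

Because DDL against the Hive Metastore is not supported, tables would still be defined on the Hive side. The external schema only makes them queryable from Redshift.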
You can query the system table SVV_EXTERNAL_TABLES to get that information.
Q: How do I get a list of all external database tables created in my cluster?
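The SVV_EXTERNAL_TABLES lookup from the card above could be run as sketched here. The function accepts any Python DB-API cursor connected to the cluster, and the choice of driver is left open. The three selected columns are the ones I would expect to be most useful, and the view exposes others.

```python
EXTERNAL_TABLES_SQL = (
    "SELECT schemaname, tablename, location "
    "FROM svv_external_tables "
    "ORDER BY schemaname, tablename;"
)


def list_external_tables(cursor):
    """Return (schema, table, location) tuples for all external tables.

    `cursor` is any DB-API 2.0 cursor connected to the Redshift cluster.
    """
    cursor.execute(EXTERNAL_TABLES_SQL)
    return [(schema, table, location)
            for schema, table, location in cursor.fetchall()]


# Example usage, assuming `conn` is an open DB-API connection:
#   with conn.cursor() as cur:
#       for schema, table, location in list_external_tables(cur):
#           print(f"{schema}.{table} -> {location}")
```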
Amazon Redshift periodically performs maintenance to apply fixes, enhancements and new features to your cluster. You can change the scheduled maintenance windows by modifying the cluster, either programmatically or by using the Redshift Console. During these maintenance windows, your Amazon Redshift cluster is not available for normal operations. For more information about maintenance windows and schedules by region, see Maintenance Windows in the Amazon Redshift Management Guide.
Q: What is a maintenance window? Will my data warehouse cluster be available during software maintenance?
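Changing the maintenance window programmatically, as the card above mentions, goes through the cluster's `PreferredMaintenanceWindow` setting, which uses a `ddd:hh24:mi-ddd:hh24:mi` string in UTC. The sketch below validates such a string before building ModifyCluster arguments. The 30-minute minimum reflects my understanding of the API constraint, and the helper names and cluster identifier are hypothetical.

```python
import re

_DAYS = ("mon", "tue", "wed", "thu", "fri", "sat", "sun")
_DAY = "(" + "|".join(_DAYS) + ")"
_TIME = r"([01]\d|2[0-3]):([0-5]\d)"
_WINDOW_RE = re.compile(rf"^{_DAY}:{_TIME}-{_DAY}:{_TIME}$")
_WEEK_MINUTES = 7 * 24 * 60


def window_minutes(window):
    """Return the window length in minutes, wrapping across the week end."""
    match = _WINDOW_RE.match(window.lower())
    if not match:
        raise ValueError(f"not in ddd:hh24:mi-ddd:hh24:mi format: {window!r}")
    d1, h1, m1, d2, h2, m2 = match.groups()
    start = _DAYS.index(d1) * 1440 + int(h1) * 60 + int(m1)
    end = _DAYS.index(d2) * 1440 + int(h2) * 60 + int(m2)
    return (end - start) % _WEEK_MINUTES


def build_maintenance_params(cluster_id, window):
    """Assemble ModifyCluster arguments for a new maintenance window."""
    if window_minutes(window) < 30:  # assumed minimum length
        raise ValueError("maintenance window must be at least 30 minutes")
    return {
        "ClusterIdentifier": cluster_id,
        "PreferredMaintenanceWindow": window.lower(),
    }


params = build_maintenance_params("example-cluster", "sun:23:45-mon:00:15")

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("redshift").modify_cluster(**params)
```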