Amazon Redshift FAQs Cloud Data Warehouse Amazon Web Services Flashcards

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. Most results come back in seconds. With Redshift, you can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional on-premises solutions. Amazon Redshift also includes Amazon Redshift Spectrum, allowing you to run SQL queries directly against exabytes of unstructured data in Amazon S3 data lakes. No loading or transformation is required, and you can use open data formats, including Avro, CSV, Grok, Amazon Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, Sequence, Text, and TSV. Redshift Spectrum automatically scales query compute capacity based on the data retrieved, so queries against Amazon S3 run fast, regardless of data set size.

Q: What is Amazon Redshift?

On-premises data warehouses require significant time and resources to administer, especially for large datasets. In addition, the financial costs associated with building, maintaining, and growing self-managed, on-premises data warehouses are very high. As your data grows, you have to constantly trade off what data to load into your data warehouse and what data to archive in storage so you can manage costs, keep ETL complexity low, and deliver good performance. Amazon Redshift not only significantly lowers the cost and operational overhead of a data warehouse, but with Redshift Spectrum, it also makes it easy to analyze large amounts of data in its native format without requiring you to load the data.

Q: Why would I use Amazon Redshift over an on-premises data warehouse?

AQUA is a new distributed and hardware-accelerated cache that enables Redshift to run up to 10x faster than any other cloud data warehouse. Existing data warehousing architectures with centralized storage require data be moved to compute clusters for processing. As data warehouses continue to grow over the next few years, the network bandwidth needed to move all this data becomes a bottleneck on query performance.

Q: What is AQUA (Advanced Query Accelerator) for Amazon Redshift?

Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required. When you issue a query, it goes to the Amazon Redshift SQL endpoint, which generates and optimizes a query plan. Amazon Redshift determines what data is local and what is in Amazon S3, generates a plan to minimize the amount of Amazon S3 data that needs to be read, and requests Redshift Spectrum workers out of a shared resource pool to read and process data from Amazon S3.

Q: What is Redshift Spectrum?

Amazon Redshift managed storage is available with RA3 node types and enables you to scale and pay for compute and storage independently, so you can size your cluster based only on your compute needs. It automatically uses high-performance SSD-based local storage as a tier-1 cache and takes advantage of optimizations such as data block temperature, data block age, and workload patterns to deliver high performance, while scaling storage automatically to Amazon S3 when needed without requiring any action.

Q: What is Amazon Redshift managed storage?

If you are already using Amazon Redshift DS or DC nodes, you can upgrade your existing clusters to the new RA3 compute instances to use managed storage. You can also create a new cluster based on RA3 instances, in which case managed storage is automatically included. No other action is required to use this capability.

Q: How do I use Amazon Redshift's managed storage?

Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse. For example, it provisions the infrastructure capacity, automates ongoing administrative tasks such as backups and patching, and monitors nodes and drives to recover from failures. For Redshift Spectrum, Amazon Redshift manages all the computing infrastructure, load balancing, planning, scheduling, and execution of your queries on data stored in Amazon S3.

Q: How does Amazon Redshift simplify data warehouse management?

Amazon Redshift uses a variety of innovations to achieve up to ten times better performance than traditional databases for data warehousing and analytics workloads. They include the following:

Q: How does the performance of Amazon Redshift compare to most on-premises databases for data warehousing and analytics?

Q: How do I get started with Amazon Redshift?

You can easily create an Amazon Redshift data warehouse cluster by using the AWS Management Console or the Amazon Redshift APIs. You can start with a single-node, 160 GB data warehouse and scale all the way to petabytes or more with a few clicks in the AWS Console or a single API call.

Q: How do I create and access an Amazon Redshift data warehouse cluster?

You can create a cluster using either RA3, DC, or DS node types. RA3 node types enable you to scale and pay for compute and storage independently. You choose the number of instances you need based on performance requirements, and only pay for the managed storage that you use.

Q: What is the maximum storage capacity per compute node? What is the recommended amount of data per compute node for optimal performance?

Both Amazon Redshift and Amazon RDS enable you to run traditional relational databases in the cloud while offloading database administration. Customers use Amazon RDS databases primarily for online transaction processing (OLTP) workloads, while Redshift is used primarily for reporting and analytics. OLTP workloads require quickly querying specific information and support for transactions like insert, update, and delete, and are best handled by Amazon RDS. Amazon Redshift harnesses the scale and resources of multiple nodes and uses a variety of optimizations to provide order-of-magnitude improvements over traditional databases for analytic and reporting workloads against very large data sets. Amazon Redshift provides an excellent scale-out option as your data and query complexity grow, especially if you want to prevent your reporting and analytic processing from interfering with the performance of your OLTP workload. With the Federated Query feature, you can also query data in your Amazon RDS or Aurora databases from Amazon Redshift.

Q: When would I use Amazon Redshift vs. Amazon RDS?

You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with big data processing frameworks such as Apache Spark, Hadoop, Presto, or HBase. Amazon EMR gives you full control over the configuration of your clusters and the software you install on them.

Q: When would I use Amazon Redshift or Redshift Spectrum vs. Amazon EMR?

Amazon Athena is the simplest way to give any employee the ability to run ad-hoc queries on data in Amazon S3. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing your data immediately.

Q: When should I use Amazon Athena vs. Redshift Spectrum?

Amazon Redshift automatically handles many of the time-consuming tasks associated with managing your own data warehouse, including:

Q: Why should I use Amazon Redshift instead of running my own MPP data warehouse cluster on Amazon EC2?

You pay only for what you use, and there are no minimum or setup fees. Amazon Redshift supports the ability to pause and resume a cluster, allowing you to easily suspend on-demand billing while the cluster is not being used. For example, a cluster used for development can have compute billing suspended when not in use. While the cluster is paused, you are only charged for the cluster's storage. For steady-state production workloads, you can get significant discounts over on-demand pricing by switching to Reserved Instances.

Q: How will I be charged and billed for my use of Amazon Redshift?

You can load data into Amazon Redshift from a range of data sources, including Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR, AWS Glue, AWS Data Pipeline, or any SSH-enabled host on Amazon EC2 or on-premises. Amazon Redshift attempts to load your data in parallel into each compute node to maximize the rate at which you can ingest data into your data warehouse cluster. Clients can also connect to Amazon Redshift using ODBC or JDBC and issue 'insert' SQL commands to insert the data. Note that this is slower than loading from S3 or DynamoDB, since those methods load data in parallel to each compute node while SQL insert statements load via the single leader node. For more details on loading data into Amazon Redshift, please view our Getting Started Guide.

Q: How do I load data into my Amazon Redshift data warehouse?

You can use our COPY command to load data in parallel directly to Amazon Redshift from Amazon EMR, Amazon DynamoDB, or any SSH-enabled host. Redshift Spectrum also enables you to load data from Amazon S3 into your cluster with a simple INSERT INTO command. This could enable you to load data from various formats such as Parquet and RC into your cluster. Note that if you use this approach, you will accrue Redshift Spectrum charges for the data scanned from Amazon S3.

Q: How do I load data from my existing Amazon RDS, Amazon EMR, Amazon DynamoDB, and Amazon EC2 data sources to Amazon Redshift?

You can use AWS Import/Export to transfer the data to Amazon S3 using portable storage devices. In addition, you can use AWS Direct Connect to establish a private network connection between your network or data center and AWS. You can choose 1Gbit/sec or 10Gbit/sec connection ports to transfer your data.
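To get a feel for when a network transfer is practical, the following rough calculation estimates transfer time at the two port speeds mentioned above. It is a sketch under simplifying assumptions: decimal units (1 TB = 10^12 bytes), a fully utilized link, and no protocol overhead, so real transfers will take longer.

```python
def transfer_hours(terabytes: float, gbit_per_sec: float) -> float:
    """Estimate hours needed to move `terabytes` over a `gbit_per_sec` link.

    Assumes decimal units and 100% sustained link utilization.
    """
    bits = terabytes * 1e12 * 8          # bytes to bits
    seconds = bits / (gbit_per_sec * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    for speed in (1, 10):
        hours = transfer_hours(10, speed)
        print(f"10 TB at {speed} Gbit/sec: about {hours:.1f} hours")
```

Under these assumptions, 10 TB takes roughly 22 hours at 1 Gbit/sec and a little over 2 hours at 10 Gbit/sec, which is a useful baseline when deciding between a network transfer and shipping storage devices.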

Q: I have a lot of data for initial loading into Amazon Redshift. Transferring via the Internet would take a long time. How do I load this data?

Amazon Redshift supports industry-leading security with built-in AWS IAM integration, identity federation for single sign-on (SSO), multi-factor authentication, column-level access control, Amazon Virtual Private Cloud (Amazon VPC), and built-in AWS KMS integration to protect your data in transit and at rest. Amazon Redshift encrypts and keeps your data secure in transit and at rest using industry-standard encryption techniques. To keep data secure in transit, Amazon Redshift supports SSL-enabled connections between your client application and your Redshift data warehouse cluster. To keep your data secure at rest, Amazon Redshift encrypts each block using hardware-accelerated AES-256 as it is written to disk. This takes place at a low level in the I/O subsystem, which encrypts everything written to disk, including intermediate query results. The blocks are backed up as is, which means that backups are encrypted as well. By default, Amazon Redshift takes care of key management, but you can choose to manage your keys through AWS Key Management Service. All Amazon Redshift security features are offered at no additional cost. Redshift Spectrum supports Amazon S3's Server Side Encryption (SSE) using your account's default key managed by the AWS Key Management Service (KMS).

Q: How does Amazon Redshift keep my data secure?

Yes. Granular column-level security controls ensure users see only the data they should have access to. Amazon Redshift supports column-level access control for local tables, so you can control access to individual columns of a table or view by granting or revoking column-level privileges for a user or a user group. Redshift is integrated with AWS Lake Formation, ensuring Lake Formation's column-level access controls are also enforced for Redshift queries on the data in the data lake.
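As an illustration of granting and revoking column-level privileges, the sketch below assembles the corresponding GRANT and REVOKE statements as text. The table, column, user, and group names are hypothetical, and the helpers insert identifiers as-is, so they assume trusted input.

```python
from typing import Sequence


def grant_column_select(table: str, columns: Sequence[str], grantee: str,
                        is_group: bool = False) -> str:
    """Assemble a GRANT that allows SELECT on only the listed columns."""
    target = f"GROUP {grantee}" if is_group else grantee
    return f"GRANT SELECT ({', '.join(columns)}) ON {table} TO {target};"


def revoke_column_select(table: str, columns: Sequence[str], grantee: str,
                         is_group: bool = False) -> str:
    """Assemble the matching REVOKE for the same columns."""
    target = f"GROUP {grantee}" if is_group else grantee
    return f"REVOKE SELECT ({', '.join(columns)}) ON {table} FROM {target};"


if __name__ == "__main__":
    # Hypothetical names: expose only non-sensitive customer columns.
    print(grant_column_select("cust_profile", ["cust_name", "cust_phone"], "analyst1"))
    print(revoke_column_select("cust_profile", ["cust_phone"], "support", is_group=True))
```

A user granted access this way can select the listed columns but not the rest of the table, which is the behavior described above.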

Q: Does Redshift support granular access controls like column level security?

Yes. Customers who want to use their corporate identity providers such as Microsoft Azure Active Directory, Active Directory Federation Services, Okta, Ping Federate, or other SAML compliant identity providers can configure Amazon Redshift to provide single-sign on.

Q: Does Redshift support single sign-on?

You can sign on to an Amazon Redshift cluster with Microsoft Azure Active Directory (AD) identities. This allows you to sign on to Redshift without duplicating Azure Active Directory identities in Redshift.

Q: How does Redshift support single sign-on with Microsoft Azure Active Directory?

Yes. You can use multi-factor authentication (MFA) for additional security when authenticating to your Amazon Redshift cluster.

Q: Does Amazon Redshift support multi-factor authentication (MFA)?

Yes, you can use Amazon Redshift as part of your VPC configuration. With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own data center. This gives you complete control over who can access your Amazon Redshift data warehouse cluster. You can use Redshift Spectrum with an Amazon Redshift cluster that is part of your VPC.

Q: Can I use Amazon Redshift in Amazon Virtual Private Cloud (Amazon VPC)?

No. Your Amazon Redshift compute nodes are in a private network space and can only be accessed from your data warehouse cluster's leader node. This provides an additional layer of security for your data.

Q: Can I access my Amazon Redshift compute nodes directly?

Q: What happens to my data warehouse cluster availability and data durability if a drive on one of my nodes fails?

Q: What happens to my data warehouse cluster availability and data durability in the event of individual node failure?

If your Amazon Redshift data warehouse cluster's Availability Zone becomes unavailable, you will not be able to use your cluster until power and network access to the AZ are restored. Your data warehouse cluster's data is preserved so you can start using your Amazon Redshift data warehouse as soon as the AZ becomes available again. In addition, you can also choose to restore any existing snapshots to a new AZ in the same Region. Amazon Redshift will restore your most frequently accessed data first so you can resume queries as quickly as possible.

Q: What happens to my data warehouse cluster availability and data durability if my data warehouse cluster's Availability Zone (AZ) has an outage?

Currently, Amazon Redshift only supports Single-AZ deployments. You can run data warehouse clusters in multiple AZs by loading data into two Amazon Redshift data warehouse clusters in separate AZs from the same set of Amazon S3 input files. With Redshift Spectrum, you can spin up multiple clusters across AZs and access data in Amazon S3 without having to load it into your cluster. In addition, you can restore a data warehouse cluster to a different AZ from your data warehouse cluster snapshots.

Q: Does Amazon Redshift support Multi-AZ Deployments?

Amazon Redshift replicates all your data within your data warehouse cluster when it is loaded and also continuously backs up your data to Amazon S3. Amazon Redshift always attempts to maintain at least three copies of your data (the original and replica on the compute nodes, and a backup in Amazon S3). Redshift can also asynchronously replicate your snapshots to S3 in another region for disaster recovery.

Q: How does Amazon Redshift back up my data? How do I restore my cluster from a backup?

You can use the AWS Management Console or the ModifyCluster API to manage the period of time your automated backups are retained by modifying the RetentionPeriod parameter. If you wish to turn off automated backups altogether, you can set the retention period to 0 (not recommended).
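The sketch below shows one way to prepare a retention change for the ModifyCluster API from Python. It only builds and validates the request parameters. To the best of our knowledge, the API spells the retention parameter `AutomatedSnapshotRetentionPeriod` and accepts 0 to 35 days; treat both as assumptions to confirm against the API reference. The cluster identifier is a hypothetical placeholder.

```python
MAX_RETENTION_DAYS = 35  # assumed upper bound; confirm in the API reference


def retention_update(cluster_id: str, days: int) -> dict:
    """Build ModifyCluster parameters that set automated backup retention.

    A value of 0 turns automated backups off (not recommended).
    """
    if not 0 <= days <= MAX_RETENTION_DAYS:
        raise ValueError(f"retention must be 0..{MAX_RETENTION_DAYS} days, got {days}")
    return {
        "ClusterIdentifier": cluster_id,
        "AutomatedSnapshotRetentionPeriod": days,
    }


if __name__ == "__main__":
    params = retention_update("example-cluster", 7)  # hypothetical cluster name
    print(params)
    # With boto3 installed and credentials configured, these could be sent as:
    #   boto3.client("redshift").modify_cluster(**params)
```

Validating before the call keeps an out-of-range value from reaching the API, and separating the parameter dictionary from the call makes the change easy to review or log.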

Q: How do I manage the retention of my automated backups and snapshots?

When you delete a data warehouse cluster you have the ability to specify whether a final snapshot is created upon deletion. This enables a restore of the deleted data warehouse cluster at a later date. All previously created manual snapshots of your data warehouse cluster will be retained and billed at standard Amazon S3 rates, unless you choose to delete them.

Q: What happens to my backups if I delete my data warehouse cluster?

If you would like to increase query performance or respond to CPU, memory, or I/O over-utilization, you can increase the number of nodes within your data warehouse cluster using Elastic Resize via the AWS Management Console or the ModifyCluster API. When you modify your data warehouse cluster, your requested changes will be applied immediately. Metrics for compute utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add user-defined metrics via Amazon CloudWatch custom metric functionality.

Q: How do I scale the size and performance of my Amazon Redshift data warehouse cluster?

It depends. When you use the Concurrency Scaling feature, the cluster is fully available for reads and writes during concurrency scaling. With Elastic Resize, the cluster is unavailable for four to eight minutes of the resize period. With Redshift RA3 storage elasticity in managed storage, the cluster is fully available and data is automatically moved between managed storage and compute nodes.

Q: Will my data warehouse cluster remain available during scaling?

A typical data warehouse has significant variance in concurrent query usage over the course of a day. It is more cost-effective to add resources just for the period during which they are required rather than provisioning to peak demand. Amazon Redshift handles this automatically on your behalf.

Q: How do I manage resources to ensure that my Redshift cluster can provide consistently fast performance during periods of high concurrency?

Elastic Resize adds or removes nodes from a single Redshift cluster within minutes to manage its query throughput. For example, an ETL workload for certain hours in a day or month-end reporting may need additional Redshift resources to complete on time. Concurrency Scaling adds additional cluster resources to increase the overall query concurrency.

Q: What is Elastic Resize and how is it different from Concurrency Scaling?

No. Concurrency Scaling is a massively scalable pool of Redshift resources and customers do not have direct access.

Q: Can I access the Concurrency Scaling clusters directly?

Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers. You can download Amazon Redshift custom JDBC and ODBC drivers from the Connect Client tab of the Redshift Console. We have validated integrations with popular BI and ETL vendors, a number of which are offering free trials to help you get started loading and analyzing your data. You can also go to the AWS Marketplace to deploy and configure solutions designed to work with Amazon Redshift in minutes.

Q: Are Amazon Redshift and Redshift Spectrum compatible with my preferred business intelligence software package and ETL tools?

Redshift Spectrum currently supports many open source data formats, including Avro, CSV, Grok, Amazon Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, Sequence, Text, and TSV.

Q: What data formats and compression formats does Redshift Spectrum support?

Just like with local tables, you can use the schema name to pick exactly which one you mean by using schema_name.table_name in your query.
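The idea of schema-qualified names can be tried locally. The sketch below is an analogy only: it uses Python's built-in sqlite3 module, with an attached database named `spectrum` standing in for an external schema, so that two tables called `sales` coexist. It does not talk to Redshift, and the schema and table names are hypothetical.

```python
import sqlite3


def demo_qualified_names() -> tuple:
    """Create two same-named tables in different schemas and read each one."""
    conn = sqlite3.connect(":memory:")
    try:
        # 'main' plays the role of the local schema, 'spectrum' the external one.
        conn.execute("ATTACH DATABASE ':memory:' AS spectrum")
        conn.execute("CREATE TABLE main.sales (origin TEXT)")
        conn.execute("CREATE TABLE spectrum.sales (origin TEXT)")
        conn.execute("INSERT INTO main.sales VALUES ('local')")
        conn.execute("INSERT INTO spectrum.sales VALUES ('external')")
        # schema_name.table_name selects exactly one of the two tables.
        local = conn.execute("SELECT origin FROM main.sales").fetchone()[0]
        external = conn.execute("SELECT origin FROM spectrum.sales").fetchone()[0]
        return local, external
    finally:
        conn.close()


if __name__ == "__main__":
    print(demo_qualified_names())
```

In Redshift the same pattern applies: qualifying the table with its schema name removes any ambiguity between a local table and an external table that share a name.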

Q: What happens if a table in my local storage has the same name as an external table?

Yes. The CREATE EXTERNAL SCHEMA command supports Hive Metastores. We do not currently support DDL against the Hive Metastore.
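A minimal sketch of such a statement is assembled below as text. The schema name, Hive database, metastore host, and IAM role ARN are hypothetical placeholders, and the default port of 9083 is an assumption based on the usual Hive Metastore port; adjust it for your environment.

```python
def build_hive_external_schema(schema: str, hive_db: str, metastore_host: str,
                               iam_role: str, port: int = 9083) -> str:
    """Assemble CREATE EXTERNAL SCHEMA text that points at a Hive Metastore.

    Values are inserted as-is, so they should come from trusted input.
    """
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM HIVE METASTORE\n"
        f"DATABASE '{hive_db}'\n"
        f"URI '{metastore_host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role}';"
    )


if __name__ == "__main__":
    print(build_hive_external_schema(
        schema="hive_schema",                  # hypothetical names throughout
        hive_db="hive_db",
        metastore_host="172.10.10.10",
        iam_role="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    ))
```

Once the schema exists, tables already defined in the metastore can be queried through it, while table definitions themselves still have to be managed on the Hive side.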

Q: I use a Hive Metastore to store metadata about my S3 data lake. Can I use Redshift Spectrum?

You can query the system table SVV_EXTERNAL_TABLES to get that information.
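For example, a query along the following lines could list external tables, optionally restricted to one schema. The helper returns the SQL text together with its bind parameters; the `%s` placeholder style is an assumption about the database driver in use, and the schema name in the usage example is hypothetical.

```python
from typing import Optional, Tuple


def list_external_tables_sql(schema: Optional[str] = None) -> Tuple[str, tuple]:
    """Return (sql, params) for listing external tables from SVV_EXTERNAL_TABLES."""
    sql = "SELECT schemaname, tablename, location\nFROM svv_external_tables"
    params: tuple = ()
    if schema is not None:
        # Bind the schema name as a parameter instead of formatting it in.
        sql += "\nWHERE schemaname = %s"
        params = (schema,)
    sql += "\nORDER BY schemaname, tablename;"
    return sql, params


if __name__ == "__main__":
    sql, params = list_external_tables_sql("spectrum_schema")
    print(sql)
    print(params)
```

The `location` column shows where each table's data lives in Amazon S3, which helps when checking that an external table points at the expected prefix.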

Q: How do I get a list of all external database tables created in my cluster?

Metrics for compute utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon CloudWatch's custom metric functionality. The AWS Management Console provides a monitoring dashboard that helps you monitor the health and performance of all your clusters. Amazon Redshift also provides information on query and cluster performance via the AWS Management Console. This information enables you to see which users and queries are consuming the most system resources to diagnose performance issues by viewing query plans and execution statistics. In addition, you can see the resource utilization on each of your compute nodes to ensure that you have data and queries that are well-balanced across all nodes.

Q: How do I monitor the performance of my Amazon Redshift data warehouse cluster?

Amazon Redshift periodically performs maintenance to apply fixes, enhancements and new features to your cluster. You can change the scheduled maintenance windows by modifying the cluster, either programmatically or by using the Redshift Console. During these maintenance windows, your Amazon Redshift cluster is not available for normal operations. For more information about maintenance windows and schedules by region, see Maintenance Windows in the Amazon Redshift Management Guide.

Q: What is a maintenance window? Will my data warehouse cluster be available during software maintenance?
