AWS Data Pipeline FAQs: Managed ETL Service (Amazon Web Services)
Q: How is AWS Data Pipeline different from Amazon Simple Workflow Service?
While both services provide execution tracking, retry and exception handling, and the ability to run arbitrary actions, AWS Data Pipeline is purpose-built for the steps that are common to most data-driven workflows: executing activities after their input data meets readiness criteria, copying data between different data stores, and scheduling chained transforms. This narrow focus means that Data Pipeline workflow definitions can be created rapidly, with no code or programming knowledge.
Q: What is an activity?
An activity is an action that AWS Data Pipeline initiates on your behalf as part of a pipeline. Example activities are EMR or Hive jobs, copies, SQL queries, or command-line scripts.
Q: Does AWS Data Pipeline supply any standard activities?
Yes, AWS Data Pipeline provides built-in support for standard activities such as CopyActivity (copying data between data stores), HiveActivity and EmrActivity (running Hive queries and Amazon EMR jobs), SqlActivity (running SQL queries), and ShellCommandActivity (running arbitrary shell commands or scripts).
Q: Does AWS Data Pipeline supply any standard preconditions?
Yes, AWS Data Pipeline provides built-in support for standard preconditions, such as checks that an Amazon S3 key or prefix exists, checks that a DynamoDB table exists or contains data, and ShellCommandPrecondition for arbitrary custom checks.
Q: Can I supply my own custom activities?
Yes, you can use the ShellCommandActivity to run arbitrary activity logic.
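As an illustration, here is a minimal sketch of what such a custom activity might look like in a pipeline definition, written as a Python dict that serializes to the JSON definition format; the object IDs, script path, and the referenced schedule and resource objects are hypothetical placeholders.

    import json

    # Sketch of a ShellCommandActivity object for a pipeline definition file.
    # "command" holds the arbitrary shell logic; "runsOn" and "schedule" reference
    # other objects (not shown) defined elsewhere in the same file.
    custom_activity = {
        "id": "MyCustomStep",                           # placeholder object id
        "type": "ShellCommandActivity",
        "command": "bash /home/ec2-user/transform.sh",  # hypothetical script
        "runsOn": {"ref": "MyEc2Resource"},             # assumed compute resource object
        "schedule": {"ref": "HourlySchedule"},          # assumed schedule object
    }

    print(json.dumps({"objects": [custom_activity]}, indent=2))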
Q: Can I supply my own custom preconditions?
Yes, you can use the ShellCommandPrecondition to run arbitrary precondition logic.
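For instance, a sketch (with hypothetical IDs and script paths) of a custom precondition attached to an activity through its precondition field:

    import json

    # A ShellCommandPrecondition runs an arbitrary script; the activity only
    # executes if the script succeeds. IDs and paths below are illustrative.
    objects = [
        {
            "id": "InputLooksValid",
            "type": "ShellCommandPrecondition",
            "command": "bash /home/ec2-user/check_input.sh",  # hypothetical check
        },
        {
            "id": "MyCustomStep",
            "type": "ShellCommandActivity",
            "command": "bash /home/ec2-user/transform.sh",
            "precondition": {"ref": "InputLooksValid"},  # gate the activity on the check
        },
    ]

    print(json.dumps({"objects": objects}, indent=2))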
Q: Can you define multiple schedules for different activities in the same pipeline?
Yes. Simply define multiple Schedule objects in your pipeline definition file and associate the desired schedule with the correct activity via its schedule field. This allows you to define a pipeline in which, for example, log files are stored in Amazon S3 each hour to drive the generation of an aggregate report once per day.
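An abridged sketch of that hourly-ingest/daily-report pattern; required slots such as inputs, outputs, and compute resources are omitted, and all IDs are placeholders.

    import json

    # Two Schedule objects, plus one activity pinned to each via its "schedule" field.
    objects = [
        {"id": "HourlySchedule", "type": "Schedule",
         "period": "1 hour", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        {"id": "DailySchedule", "type": "Schedule",
         "period": "1 day", "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        # Hourly copy of log files into Amazon S3 (data nodes omitted for brevity).
        {"id": "StoreLogsHourly", "type": "CopyActivity",
         "schedule": {"ref": "HourlySchedule"}},
        # Daily aggregate report driven by the hourly output.
        {"id": "DailyReport", "type": "ShellCommandActivity",
         "command": "bash /home/ec2-user/report.sh",  # hypothetical script
         "schedule": {"ref": "DailySchedule"}},
    ]

    print(json.dumps({"objects": objects}, indent=2))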
Q: How do I add alarms to an activity?
You can define Amazon SNS alarms to trigger on activity success, failure, or delay. Create an alarm object and reference it in the onFail, onSuccess, or onLate slots of the activity object.
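A sketch of an SnsAlarm object wired into an activity's onFail slot; the topic ARN, message, and IDs are placeholders.

    import json

    objects = [
        # The alarm object: publishes to an SNS topic when triggered.
        {"id": "FailureAlarm", "type": "SnsAlarm",
         "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",  # placeholder
         "subject": "Pipeline activity failed",
         "message": "A pipeline activity failed; check the AWS Data Pipeline console."},
        # The activity references the alarm in its onFail slot
        # (onSuccess and onLate are referenced the same way).
        {"id": "DailyReport", "type": "ShellCommandActivity",
         "command": "bash /home/ec2-user/report.sh",
         "onFail": {"ref": "FailureAlarm"}},
    ]

    print(json.dumps({"objects": objects}, indent=2))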
Q: Can I manually rerun activities that have failed?
Yes. You can rerun a set of completed or failed activities by resetting their state to SCHEDULED. This can be done by using the Rerun button in the UI or by modifying their state via the command line or API. This will immediately schedule a re-check of all activity dependencies, followed by the execution of additional activity attempts. Upon subsequent failures, the activity will perform the original number of retry attempts.
Q: Will AWS Data Pipeline provision and terminate AWS Data Pipeline-managed compute resources for me?
Yes. Compute resources will be provisioned when the first activity for a scheduled time that uses those resources is ready to run, and those instances will be terminated when the final activity that uses the resources has completed successfully or failed.
Q: Can multiple compute resources be used on the same pipeline?
Yes. Simply define multiple cluster objects in your definition file and associate the cluster to use for each activity via its runsOn field. This allows pipelines to combine AWS and on-premise resources, or to use a mix of instance types for their activities. For example, you may want to use a t1.micro to execute a quick script cheaply, while later in the pipeline an Amazon EMR job requires the power of a cluster of larger instances.
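An abridged sketch showing two managed resources and activities routed to them with runsOn; instance types, counts, and IDs are illustrative, and other required slots are omitted.

    import json

    objects = [
        # A small instance for the cheap script.
        {"id": "SmallBox", "type": "Ec2Resource", "instanceType": "t1.micro"},
        # A larger Amazon EMR cluster for the heavy job.
        {"id": "BigCluster", "type": "EmrCluster",
         "masterInstanceType": "m3.xlarge",
         "coreInstanceType": "m3.xlarge",
         "coreInstanceCount": "10"},
        # Each activity picks its compute resource via "runsOn".
        {"id": "QuickScript", "type": "ShellCommandActivity",
         "command": "bash /home/ec2-user/prep.sh",       # hypothetical script
         "runsOn": {"ref": "SmallBox"}},
        {"id": "HeavyJob", "type": "EmrActivity",
         "step": "s3://my-bucket/my-job.jar,arg1,arg2",   # placeholder EMR step
         "runsOn": {"ref": "BigCluster"}},
    ]

    print(json.dumps({"objects": objects}, indent=2))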
Q: Can I execute activities on on-premise resources, or AWS resources that I manage?
Yes. To enable running activities using on-premise resources, AWS Data Pipeline supplies a Task Runner package that can be installed on your on-premise hosts. This package continuously polls the AWS Data Pipeline service for work to perform. When it's time to run a particular activity on your on-premise resources, for example, executing a DB stored procedure or a database dump, AWS Data Pipeline will issue the appropriate command to the Task Runner. To ensure that your pipeline activities are highly available, you can optionally assign multiple Task Runners to poll for a given job. This way, if one Task Runner becomes unavailable, the others simply pick up its work.
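On-premise hosts are targeted with a worker group name rather than a runsOn reference; a minimal sketch, assuming a worker group called "wg-onprem-db" that your Task Runners poll, with a hypothetical script path:

    import json

    # Instead of "runsOn", the activity names a worker group; any Task Runner
    # started with that worker group (for example, on an on-premise host) can pick it up.
    db_dump = {
        "id": "NightlyDbDump",
        "type": "ShellCommandActivity",
        "command": "bash /opt/etl/dump_database.sh",  # hypothetical on-premise script
        "workerGroup": "wg-onprem-db",                # assumed worker group name
    }

    print(json.dumps({"objects": [db_dump]}, indent=2))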
Q: How do I install a Task Runner on my on-premise hosts?
You can install the Task Runner package on your on-premise hosts using the following steps:
1. Download the Task Runner package from AWS.
2. Create a configuration file that includes your AWS credentials.
3. Start the Task Runner agent, pointing it at that configuration file and at a worker group name of your choosing.
4. When defining activities, set them to run on that worker group so they are dispatched to the hosts you installed.
Q: How can I get started with AWS Data Pipeline?
To get started with AWS Data Pipeline, simply visit the AWS Management Console and go to the AWS Data Pipeline tab. From there, you can create a pipeline using a simple graphical editor.
Q: What can I do with AWS Data Pipeline?
With AWS Data Pipeline, you can schedule and manage periodic data-processing jobs. You can use it to replace simple systems that are currently managed by brittle, cron-based solutions, or you can use it to build complex, multi-stage data-processing jobs.
Q: Are there sample pipelines that I can use to try out AWS Data Pipeline?
Yes, there are sample pipelines in our documentation. Additionally, the console has several pipeline templates that you can use to get started.
Q: How many pipelines can I create in AWS Data Pipeline?
By default, your account can have 100 pipelines.
Q: Are there limits on what I can put inside a single pipeline?
By default, each pipeline you create can have 100 objects.
Q: Can my limits be changed?
Yes. If you would like to increase your limits, simply contact us.