Data engineering
Storage topics
- S3 data lakes: upload files and partition the data
S3 storage tiers (frequent -> infrequent access)
S3 lifecycle rules: move objects between storage classes (General Purpose -> Infrequent Access -> Glacier, for data that is almost never accessed); transition actions (e.g., after 60 days, after 6 months); expiration actions (delete)
S3 security: user-based / resource-based
- DynamoDB
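A minimal sketch of the S3 lifecycle rules noted above (transition after 60 days, again after 6 months, then expire). The dict follows the shape that boto3's put_bucket_lifecycle_configuration accepts. The prefix "raw/", the rule ID, the bucket name in the comment, and the 365-day expiration are assumptions made for illustration only. Nothing here calls AWS.

```python
def build_lifecycle_rules(prefix, ia_days=60, glacier_days=180, expire_days=365):
    """Return a lifecycle configuration dict: Standard -> Standard-IA -> Glacier -> delete."""
    if not ia_days < glacier_days < expire_days:
        raise ValueError("days must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",           # hypothetical rule name
                "Filter": {"Prefix": prefix},       # only objects under this prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_days},  # expiration action (delete)
            }
        ]
    }


config = build_lifecycle_rules("raw/")

# To apply it (needs boto3 and AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-example-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review the day thresholds before anything is sent to S3.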
Data Transformation
- Glue
- Glue ETL (runs on a serverless Apache Spark platform)
Streaming
- Kinesis: managed alternative to Kafka; well suited to application logs, metrics, IoT, clickstreams, and real-time big data
- Kinesis Video Streams: streaming video in real time
- Kinesis Data Streams:
-Streams are divided into ordered shards/partitions. Shards have to be provisioned in advance (capacity planning).
-Ability to reprocess/replay data
-Multiple applications can consume the same stream
-Records are fast to write and small in size
- Kinesis Data Firehose: data delivery
-Data ingestion into Redshift/Amazon S3/Elasticsearch/Splunk
-Automatic scaling
-Supports many data formats/conversions
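A rough sketch of how records land on shards in a stream, as described in the Kinesis notes above. Kinesis hashes each record's partition key with MD5 into a 128-bit hash key, and each shard owns a range of that key space. The equal-sized ranges and the function names below are assumptions for illustration; a real stream's ranges depend on how it was created and resharded.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_ranges(num_shards):
    """Split the hash key space into contiguous, roughly equal ranges (assumed layout)."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


ranges = shard_ranges(4)
assignments = {key: shard_for_key(key, ranges) for key in ["device-1", "device-2", "device-3"]}
```

Records with the same partition key always map to the same shard, which is what preserves per-key ordering. It is also why a fixed shard count calls for capacity planning.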
Workflows
- AWS Data Pipeline
- AWS Batch
- Step Functions
Exploratory Data Analysis
-Scikit-learn
-Data Distribution
-Trends and Seasonality
- Athena
- QuickSight
- EMR
- Apache Spark
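To make the "Trends and Seasonality" item above concrete, here is a small sketch of an additive decomposition: a centered moving average estimates the trend, and averaging the detrended values by position in the cycle estimates the seasonal pattern. The toy series, the period of 3, and the function names are assumptions chosen so the arithmetic is easy to follow. A library such as pandas or statsmodels would normally be used instead of hand-rolled code.

```python
def centered_moving_average(series, window):
    """Trend estimate; positions where the window does not fit are left as None."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd windows")
    half = window // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half:i + half + 1]) / window
    return trend


def seasonal_profile(series, trend, period):
    """Average detrended value for each position within the seasonal cycle."""
    buckets = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(series, trend)):
        if level is not None:
            buckets[i % period].append(value - level)
    return [sum(b) / len(b) for b in buckets]


# Toy data: a linear trend (0, 1, 2, ...) plus a repeating pattern of period 3.
pattern = [0, 1, -1]
series = [i + pattern[i % 3] for i in range(12)]

trend = centered_moving_average(series, window=3)
seasonal = seasonal_profile(series, trend, period=3)
```

Because the window equals the seasonal period, the repeating pattern cancels out inside each average and only the trend remains.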
Feature Engineering
- Imputation methods
- Outliers
- Binning
- Log transforms
- One-hot encoding
- Scaling and normalization
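The feature engineering techniques listed above can each be sketched in a few lines. This version uses only the Python standard library so it runs anywhere. All function names are illustrative, and the specific choices (median imputation, IQR-based clipping with k=1.5, equal-width bins) are assumptions rather than the only options. In practice scikit-learn or pandas would do this work.

```python
import math
import statistics


def impute_median(values):
    """Imputation: replace missing entries (None) with the median of observed ones."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Outliers: clip values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]


def bin_equal_width(values, num_bins):
    """Binning: map each value to the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def log_transform(values):
    """Log transform: log(1 + v), which compresses a long right tail (v > -1)."""
    return [math.log1p(v) for v in values]


def one_hot(labels):
    """One-hot encoding: one 0/1 column per category, in sorted order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def min_max_scale(values):
    """Scaling: rescale linearly to the range [0, 1] (values must not all be equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Normalization: zero mean and unit (population) standard deviation."""
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]


ages = impute_median([20, None, 40])
colors = one_hot(["red", "blue", "red"])
scaled = min_max_scale([0, 5, 10])
```

Statistics such as the median, the bin edges, and the scaling range should be computed on the training data only and then reused on validation and test data.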
Model
- Deep learning crash course (MLPs, CNNs, RNNs)
- Tuning neural networks and regularization techniques
SageMaker
- Using containers
- Security
- Choosing instance types
- A/B testing
- TensorFlow integration
- SageMaker Neo and Greengrass
- SageMaker Pipe mode
- Elastic Inference
- Inference Pipelines