AWS

yueyuan
3 min read · Feb 6, 2021

Data engineering

Storage topics

  • S3 data lakes: upload files and partition the data
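As a rough illustration of partitioning in an S3 data lake, objects are often written under Hive-style `key=value` prefixes so that query engines can skip partitions they do not need. The function name, the `logs` prefix, and the year/month/day scheme below are assumptions for illustration, not something prescribed by these notes.

```python
from datetime import date


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 object key (hypothetical layout)."""
    return (
        f"{prefix}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Example: the key an upload for Feb 6, 2021 would use under this layout.
key = partitioned_key("logs", date(2021, 2, 6), "events.json")
print(key)  # logs/year=2021/month=02/day=06/events.json
```

The resulting string would be passed as the object key when uploading the file.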

S3 storage tiers (ordered from frequent to infrequent access)

S3 Lifecycle rules: move objects between storage classes, General Purpose -> Infrequent Access -> Glacier (for data that is rarely or never accessed). Transition actions move objects after a set age (e.g. 60 days, 6 months); expiration actions delete them.
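A minimal sketch of such a lifecycle rule, written as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name, the `logs/` prefix, the rule ID, and the 365-day expiration are assumptions for illustration; the 60-day and roughly 6-month transitions follow the figures in the notes.

```python
import json


def build_lifecycle_configuration(prefix: str, ia_days: int = 60,
                                  glacier_days: int = 180,
                                  expire_days: int = 365) -> dict:
    """Return a lifecycle configuration with two transitions and one expiration."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Transition actions: Standard -> Standard-IA -> Glacier
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                # Expiration action: delete the object after this many days
                "Expiration": {"Days": expire_days},
            }
        ]
    }


config = build_lifecycle_configuration("logs/")
print(json.dumps(config, indent=2))

# With AWS credentials and a real bucket, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=config)
# ("my-bucket" is a placeholder.)
```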

S3 Security: user-based (IAM policies) / resource-based (bucket policies, ACLs)

  • DynamoDB

Data Transformation

  • Glue
  • Glue ETL (runs on a serverless Apache Spark platform)

Streaming

  • Kinesis: a managed alternative to Apache Kafka; well suited to application logs, metrics, IoT, clickstreams, and real-time big data
  • Kinesis Video Streams: streaming video in real time

-Streams are divided into ordered shards/partitions. Shards have to be provisioned in advance (capacity planning).

-Ability to reprocess/replay data

-multiple applications can consume the same stream

-Records are written quickly and are small in size

  • Kinesis Data Firehose: delivers streaming data to storage and analytics destinations

-Data ingestion into Redshift/Amazon S3/Elasticsearch/Splunk

-Automatic scaling

-Supports many data formats and format conversion
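To make the shard model noted above for Kinesis streams more concrete: each record carries a partition key, Kinesis takes the MD5 hash of that key as a 128-bit integer, and each shard owns a range of that hash space. The sketch below assumes the shards split the hash space evenly, which may not hold after shards have been split or merged; the function name is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the MD5-based hash key space


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes shard_count shards with equal-sized, contiguous hash ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the top of the hash space falls into the last shard.
    return min(hash_key // range_size, shard_count - 1)


# Records with the same partition key always land in the same shard,
# which is what preserves ordering per key.
for device in ["device-1", "device-2", "device-3"]:
    print(device, "-> shard", shard_for_key(device, shard_count=4))
```

Because the mapping depends only on the key, a poorly chosen partition key (for example, one constant value) would send all traffic to a single shard.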

Workflows

  • AWS Data Pipeline
  • AWS Batch
  • Step Functions

Exploratory Data Analysis

-Scikit-learn

-Data Distribution

-Trends and Seasonality

  • Athena
  • QuickSight
  • EMR
  • Apache Spark

Feature Engineering

  • Imputation methods
  • Outliers
  • Binning
  • log transforms
  • one hot encoding
  • scaling and normalization
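The techniques listed above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function names, the median as the imputation strategy, the z-score rule for outliers, and the sample values are all assumptions, and real pipelines would typically use a library such as scikit-learn.

```python
import math
from statistics import mean, median, pstdev


def impute_median(values):
    """Imputation: replace missing values (None) with the median of the rest."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.0):
    """Outliers: flag values more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) > threshold * sigma for v in values]


def bin_value(value, edges):
    """Binning: index of the first bucket whose upper edge exceeds the value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def log_transform(values):
    """Log transform: log(1 + x), which compresses long right tails."""
    return [math.log1p(v) for v in values]


def one_hot(value, categories):
    """One-hot encoding: 1 in the position of the matching category."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values):
    """Scaling: map values linearly onto [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Normalization: zero mean, unit variance (assumes non-constant input)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = impute_median([22, None, 35, 58])
print(ages)
print([bin_value(a, [18, 40, 65]) for a in ages])
print(one_hot("b", ["a", "b", "c"]))
print(min_max_scale([0, 5, 10]))
```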

Model

  • Deep learning crash course (MLPs, CNNs, RNNs)
  • Tuning neural networks and regularization techniques

SageMaker

  • Using containers
  • Security
  • Choosing instance types
  • A/B testing
  • TensorFlow integration
  • SageMaker Neo and Greengrass
  • SageMaker Pipes
  • Elastic Inference
  • Inference Pipelines
