big data

1 min read · Jan 15, 2021

Name node: keeps track of what is stored on all the data nodes. The individual data nodes are what your client application ultimately talks to when it reads or writes data.

Using HDFS: through a UI (Ambari)

Why MapReduce?

Distribute the processing of data on your cluster
Divide your data up into partitions that are mapped (transformed) and reduced (aggregated) by mapper and reducer functions you define
Resilient to failure: an application master monitors your mappers and reducers on each partition

mapper <k, v> → shuffle and sort → reducer (count)
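The mapper → shuffle and sort → reducer flow can be sketched in plain Python as a word count. This is only a single-process illustration of the idea, not Hadoop's API: the names `mapper`, `shuffle_and_sort`, `reducer`, and `run_job` are made up for this example, and a real cluster would run the map and reduce steps in parallel across partitions.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Aggregate all values for one key; here the aggregation is a count."""
    return key, sum(values)


def run_job(lines):
    """Run map, then shuffle and sort, then reduce, all in one process."""
    mapped = [pair for line in lines for pair in mapper(line)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
    # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

The mapper only ever sees one line and the reducer only ever sees one key, which is what lets a framework spread both steps over many machines.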

Hive is not suitable for online transaction processing (OLTP). It is not designed to be hit with a large number of queries all at once, for example from a website. For that kind of workload, use HBase instead.

NoSQL:

large-scale data
fast transactions

HBase

built on HDFS
accessible as a web service
high transaction rate
<key, value> storage
sparse data -> column families
horizontal scalability
each cell can have many versions, distinguished by timestamps
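To make the data model concrete, here is a toy in-memory sketch of sparse, versioned cells: row key → column (family:qualifier) → timestamp → value. It is not the HBase client API. The class `VersionedTable` and its `max_versions` parameter are hypothetical names chosen for this illustration, and the real system persists data on HDFS rather than in a Python dict.

```python
import time
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, max_versions=3):
        # Hypothetical knob for this sketch: how many versions a cell retains.
        self.max_versions = max_versions
        # Rows and columns are created lazily, so absent cells take no space.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, timestamp=None):
        """Write one version of a cell, defaulting to the current time in ms."""
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        cell = self.rows[row][f"{family}:{qualifier}"]
        cell[ts] = value
        # Drop the oldest versions beyond the retention limit.
        for old_ts in sorted(cell)[:-self.max_versions]:
            del cell[old_ts]

    def get(self, row, family, qualifier, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self.rows.get(row, {}).get(f"{family}:{qualifier}", {})
        return sorted(cell.items(), reverse=True)[:versions]


if __name__ == "__main__":
    table = VersionedTable(max_versions=2)
    table.put("user1", "info", "email", "a@example.com", timestamp=1)
    table.put("user1", "info", "email", "b@example.com", timestamp=2)
    table.put("user1", "info", "email", "c@example.com", timestamp=3)
    print(table.get("user1", "info", "email", versions=2))
    # [(3, 'c@example.com'), (2, 'b@example.com')]
```

Reading a row or column that was never written simply returns an empty list, which mirrors how a sparse store does not materialize missing cells.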

MongoDB

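No need for cross-document or cross-table transactions, or for complex join queries // Transactions are now supported, and join support keeps improving.
Fast-iterating business with frequently changing requirements, where the data model cannot be fixed in advance
Stored data has a flexible, non-fixed format, or is semi-structured
High concurrent access, requiring thousands of QPS
Massive data storage at TB scale and above, with data volume that keeps growing
Stored data must be persisted and must not be lost
Requires 99.999% high availability of data
Requires many geolocation queries and text queries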

Written by yueyuan