NameNode: keeps track of what's on all the DataNodes. The individual DataNodes are what your client application ultimately talks to.
Using HDFS: UI (Ambari)
Why MapReduce?
- Distribute the processing of data on your cluster
- Divide your data up into partitions that are mapped (transformed) and reduced (aggregated) by mapper and reducer functions you define
- Resilient to failure: an application master monitors your mappers and reducers on each partition
mapper <k,v> → shuffle and sort → reducer (count)
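The mapper → shuffle and sort → reducer flow above can be sketched in plain Python. This is only a single-process illustration of the idea, assuming a word-count job; the function names (`mapper`, `shuffle_and_sort`, `reducer`, `word_count`) are made up for this note and are not Hadoop APIs. On a real cluster the framework does the shuffle and sort and runs the mappers and reducers in parallel across partitions.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort step: group values by key, then order by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce step: aggregate all values for one key (here, a count)."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
    # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```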
Hive is not suitable for online transaction processing (OLTP). It's not meant to be hit with tons of queries all at once, e.g. from a website ==> use HBase for that.
NoSQL:
- large-scale data
- fast transactions
HBase
- built on HDFS
- web service
- high transaction rate
- <key, value> storage
- sparse data -> column family
- horizontal scalability
- each cell can hold many versions, identified by timestamp
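To make the data model in the list above concrete, here is a toy in-memory sketch: row key → column (`family:qualifier`) → timestamped versions. It is not the HBase client API; `ToyTable`, `put`, and `get` are hypothetical names for this note, and the example rows and columns are invented. Sparseness shows up in that a row only stores the columns it actually has.

```python
import time
from collections import defaultdict


class ToyTable:
    """In-memory illustration of the HBase data model (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers inside them are free-form.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp=None):
        """Store one version of a cell; reject unknown column families."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            timestamp = time.time_ns() // 1_000_000  # current time in ms
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self.rows.get(row_key, {}).get(column, {})
        return sorted(cell.items(), reverse=True)[:versions]


if __name__ == "__main__":
    table = ToyTable(["info", "rating"])
    table.put("user1", "info:age", 30, timestamp=1)
    table.put("user1", "info:age", 31, timestamp=2)
    table.put("user2", "rating:movie50", 4, timestamp=1)
    print(table.get("user1", "info:age"))              # [(2, 31)]
    print(table.get("user1", "info:age", versions=2))  # [(2, 31), (1, 30)]
    print(table.get("user2", "info:age"))              # [] (sparse: nothing stored)
```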
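MongoDB
- No need for cross-document or cross-table transactions or for complex join queries // Transactions are now supported, and join support keeps getting better.
- Fast-iterating business with frequently changing requirements, where the data model cannot be fixed up front
- Stored data has a flexible, non-fixed format, or is semi-structured
- High concurrent access, requiring thousands of QPS
- Massive data storage at TB scale and above, with the data volume still growing
- Stored data must be durable and must not be lost
- 99.999% high availability of data is required
- Heavy use of geospatial queries and text queries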