Week 3: Batch Analysis

K-MOOC / Big Data and Machine Learning Software, 2020. 3. 24. 17:35

###map, flatMap

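rdd : {"apple pear", "apple orange", "apple lemon grape"}

rdd.map(tokenize)

["apple", "pear"], ["apple", "orange"], ...

Each element is split into its own list of tokens.

rdd.flatMap(tokenize)

["apple", "pear", "apple", "orange", ...]

The tokens from every element are flattened into one collection.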
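The difference can be checked without a Spark cluster. The following is a minimal plain-Python sketch, not Spark code: `tokenize` is assumed to split on whitespace (the notes do not define it), and the list comprehension and `itertools.chain` only stand in for `rdd.map` and `rdd.flatMap`.

```python
from itertools import chain

# Assumption: tokenize splits a line on whitespace; the notes do not define it.
def tokenize(line):
    return line.split()

data = ["apple pear", "apple orange", "apple lemon grape"]

# Like rdd.map(tokenize): one output element (a list of tokens) per input element.
mapped = [tokenize(line) for line in data]

# Like rdd.flatMap(tokenize): the per-line lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(tokenize(line) for line in data))

print(mapped)
print(flat_mapped)
```

`mapped` keeps three nested lists (one per input line), while `flat_mapped` is a single list of seven tokens.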

###

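@@Output (key-value pair transformations)

reduceByKey() merges the values of each key using the given function

groupByKey() groups the values that share the same key

keys() returns only the keys

values() returns only the values

sortByKey() returns the pairs sorted by key

join() joins two key-value RDDs on matching keys and produces the combined result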
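To make the semantics of these operations concrete, here is a small plain-Python sketch. It is not Spark code: `group_by_key` and `reduce_by_key` are hypothetical helper names written for this illustration, and the sample pairs are made up. Real RDDs are partitioned and evaluated lazily, which this sketch ignores.

```python
from collections import defaultdict

# Made-up sample data: (key, value) pairs such as word counts.
pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("orange", 1)]

def group_by_key(pairs):
    """Like groupByKey(): collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(pairs, func):
    """Like reduceByKey(func): fold the values of each key into one value."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

grouped = group_by_key(pairs)
counts = reduce_by_key(pairs, lambda a, b: a + b)
keys = [k for k, _ in pairs]                         # like keys()
values = [v for _, v in pairs]                       # like values()
sorted_pairs = sorted(pairs, key=lambda kv: kv[0])   # like sortByKey()

print(grouped)
print(counts)
print(sorted_pairs)
```

The contrast between `grouped` and `counts` is the main point: grouping keeps every value, while reducing combines them into one value per key.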

@@Action

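collect() gathers all elements and returns them to the driver program

count() returns the number of elements

first() returns the first element

take(n) returns the first n elements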

@@Saving results

saveAsTextFile(path) saves the data as text files

saveAsSequenceFile(path) saves key-value pairs as a SequenceFile

lines = sc.textFile("hdfs:/data/logs")

errors = lines.filter(lambda line: line.startswith("ERROR"))

messages = errors.map(lambda line: line.split()).map(lambda words: words[1])

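messages.filter(lambda line: "sshd" in line).count()

messages.filter(lambda line: "nginx" in line).count()

This analyzes the log: it keeps the ERROR lines, extracts the second word of each, and counts the ones that mention sshd or nginx.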

Log mining with Caching

A more efficient version:

messages = errors.map(lambda line: line.split()).map(lambda words: words[1])

messages.persist() # caching messages

With the result cached, the shared steps are not executed again for each count.
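The effect of caching can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `LazyMessages` is a hypothetical class written for this note, the log lines are invented, and real Spark persistence (storage levels, partitions, eviction) is far richer than this. It only shows that, without caching, each action reruns the whole pipeline.

```python
class LazyMessages:
    """Toy stand-in for an RDD: recomputes on every action unless persisted."""

    def __init__(self, lines):
        self.lines = lines
        self.compute_count = 0    # how many times the pipeline actually ran
        self._persist = False
        self._cached = None

    def persist(self):
        # Mark the result to be kept after the first computation.
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        # Same steps as the notes: keep ERROR lines, take the second word.
        result = [line.split()[1] for line in self.lines
                  if line.startswith("ERROR")]
        if self._persist:
            self._cached = result
        return result

    def count_containing(self, text):
        # An "action": triggers the computation.
        return sum(1 for message in self._compute() if text in message)


# Invented log lines; the real log format is not given in the notes.
logs = [
    "ERROR sshd failed login",
    "INFO nginx started",
    "ERROR nginx timeout",
    "ERROR sshd disconnect",
]

uncached = LazyMessages(logs)
uncached_counts = (uncached.count_containing("sshd"),
                   uncached.count_containing("nginx"))

cached = LazyMessages(logs).persist()
cached_counts = (cached.count_containing("sshd"),
                 cached.count_containing("nginx"))

print(uncached_counts, uncached.compute_count)
print(cached_counts, cached.compute_count)
```

Both versions return the same counts, but the persisted one runs the filter-and-split pipeline once instead of once per action.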

###Natural join

It seems similar to VLOOKUP in Excel.

It looks up a matching value and attaches the neighboring columns from the other table to the row.
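A natural join matches rows on every column name the two tables share. The sketch below shows the idea in plain Python on lists of dicts; `natural_join`, the column names, and the sample rows are all made up for illustration, and the function assumes every row in a table has the same columns.

```python
def natural_join(left, right):
    """Join two lists of dict rows on every column name they share."""
    if not left or not right:
        return []
    # Columns present in both tables become the join key.
    common = sorted(set(left[0]) & set(right[0]))
    # Index the right table by its values in the shared columns.
    index = {}
    for row in right:
        index.setdefault(tuple(row[c] for c in common), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[c] for c in common), []):
            joined.append({**row, **match})   # attach the extra columns
    return joined


# Made-up sample tables sharing the column "customer_id".
orders = [
    {"customer_id": 1, "item": "apple"},
    {"customer_id": 2, "item": "pear"},
    {"customer_id": 3, "item": "grape"},
]
customers = [
    {"customer_id": 1, "name": "Kim"},
    {"customer_id": 2, "name": "Lee"},
]

result = natural_join(orders, customers)
print(result)
```

The order for `customer_id` 3 has no matching customer, so it is dropped, much like a VLOOKUP that finds no match, except that the row disappears instead of showing an error value.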

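### DataFrame

df = spark.read.json("data/customer.json")

df.show() ## displays the rows

df.printSchema() ## displays the column types

select() df.select("name", df["age"] + 10)

filter() df.filter(df["age"] > 30)

groupBy() df.groupBy("age").count()

It feels just like a database.

Results are printed as a table, as in a database, and you can group rows or show only the rows for which the condition is True.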

df.createOrReplaceTempView("customer")

sqlDF = spark.sql("SELECT age, name FROM customer")

sqlDF.show()
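Because these DataFrame operations correspond closely to SQL, the same three queries can be tried without Spark. The sketch below swaps in Python's standard-library `sqlite3` module in place of Spark SQL, purely as an illustration; the `customer` rows are invented, since the contents of the JSON file are not shown in these notes.

```python
import sqlite3

# In-memory database standing in for the "customer" temp view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("Kim", 25), ("Lee", 35), ("Park", 35)],   # invented sample rows
)

# Counterpart of df.select("name", df["age"] + 10)
selected = conn.execute(
    "SELECT name, age + 10 FROM customer ORDER BY name"
).fetchall()

# Counterpart of df.filter(df["age"] > 30)
over_30 = conn.execute(
    "SELECT name FROM customer WHERE age > 30 ORDER BY name"
).fetchall()

# Counterpart of df.groupBy("age").count()
by_age = conn.execute(
    "SELECT age, COUNT(*) FROM customer GROUP BY age ORDER BY age"
).fetchall()

conn.close()

print(selected)
print(over_30)
print(by_age)
```

The `ORDER BY` clauses are there only to make the output order predictable; Spark does not guarantee row order either unless you sort explicitly.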


Or71nH
