

Goal

Monitor YARN resource usage.

Pipeline

Data collection: gather YARN metrics.

Data processing: real-time processing with Spark Streaming.

Data output: MySQL (with indexes) or an OLAP store (millisecond-level queries).

Data visualization: Superset or DataEase.

OLAP options: ClickHouse, Doris, TiDB, Phoenix.

OLTP: stores that support transactions.

Data flow

yarn -> collector jar -> kafka -> sparkstreaming -> ck -> superset/dataease

Collected record formats:

  • Delimited text: uses less network I/O, but the delimiter has to be chosen carefully.
  • JSON: costs more network I/O, but is easy to parse (a small sketch follows below).
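To make the trade-off concrete, here is a small hypothetical sketch in Scala (the AppReport case class and its fields are invented for illustration; the collector below actually emits a multi-line text record):

case class AppReport(id: String, user: String, memorySeconds: Long, vcoreSeconds: Long)

val r = AppReport("application_1675390427337_0001", "hadoop", 3804903L, 2232L)

// delimited text: compact on the wire, but the delimiter must never occur inside a value
val asText = Seq(r.id, r.user, r.memorySeconds, r.vcoreSeconds).mkString("|")

// JSON: a larger payload, but self-describing and easy to parse downstream
val asJson =
  s"""{"id":"${r.id}","user":"${r.user}","memorySeconds":${r.memorySeconds},"vcoreSeconds":${r.vcoreSeconds}}"""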

start

Collecting the data

The YARN API

Add the YARN dependencies:

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-common</artifactId>
<version>3.3.4</version>
</dependency>

Develop in IDEA.

Define a trait that fetches the YARN data:

package sparkfirst

import java.util

import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

trait YarnInfo {

def getYarnInfo={
val client = YarnClient.createYarnClient()
/*
 init() expects a conf object, i.e. the YARN configuration:
 put yarn-site.xml under resources, create a YarnConfiguration,
 then start the client.
*/
val configuration = new YarnConfiguration()
client.init(configuration)
client.start()

val states = util.EnumSet.noneOf(classOf[YarnApplicationState])
states.add(YarnApplicationState.ACCEPTED)
states.add(YarnApplicationState.RUNNING)
states.add(YarnApplicationState.NEW)
states.add(YarnApplicationState.SUBMITTED)
states.add(YarnApplicationState.KILLED)
states.add(YarnApplicationState.NEW_SAVING)
states.add(YarnApplicationState.FAILED)


val reports = client.getApplications(states)

val value = reports.iterator()

val builder = new StringBuilder
while (value.hasNext){
val report = value.next()
val report1 = report.getApplicationResourceUsageReport
val id = report.getApplicationId
val host = report.getHost
val applicationType = report.getApplicationType
val name = report.getName
val starttime = report.getStartTime
val user = report.getUser
val finishtime = report.getFinishTime
val mem = report1.getMemorySeconds
val vcore = report1.getVcoreSeconds
val size = report1.getUsedResources.getMemorySize
val cores = report1.getUsedResources.getVirtualCores
val resources = report1.getUsedResources.getResources
val state = report.getYarnApplicationState
val url = report.getTrackingUrl
val margin =
s"""
|report: ${report}
|report1 : ${report1}
|id:${id}
|host:${host}
|applicationtype : ${applicationType}
|name : ${name}
|starttime ${starttime}
|finishtime : ${finishtime}
|user:${user}
|memeveryscends:${mem}
|vcoreeveryscends:${vcore}
|size:${size}
|cores${cores}
|state:${state}
|url:${url}
|resources:${resources.mkString(",")}
|---
|""".stripMargin
builder.appendAll(margin)
}

println(builder)
// return the report string so callers (e.g. the Kafka producer below) can split and send it
builder.toString()

}
}

In the driver class, extend the trait and implement a method that invokes it:

package sparkfirst
import org.apache.flink.api.java.utils.ParameterTool
object testyarn {

def apply(parameterTool: ParameterTool): testyarn = new testyarn(parameterTool)
def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
testyarn(tool).excute()
}
}
class testyarn(parameterTool: ParameterTool) extends YarnInfo {




def excute(): Unit ={

getYarnInfo
}
}

Start a Spark SQL session on the cluster, deployed in YARN mode.

The collected data looks like this:

report: applicationId { id: 1 cluster_timestamp: 1675390427337 } user: "hadoop" queue: "default" name: "SparkSQL::192.168.41.132" host: "192.168.41.133" rpc_port: -1 yarn_application_state: RUNNING trackingUrl: "http://bigdata4:9999/proxy/application_1675390427337_0001/" diagnostics: "" startTime: 1675390547814 finishTime: 0 final_application_status: APP_UNDEFINED app_resource_Usage { num_used_containers: 3 num_reserved_containers: 0 used_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } reserved_resources { memory: 0 virtual_cores: 0 resource_value_map { key: "memory-mb" value: 0 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 0 units: "" type: COUNTABLE } } needed_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } memory_seconds: 3804903 vcore_seconds: 2232 queue_usage_percentage: 41.666664 cluster_usage_percentage: 41.666664 preempted_memory_seconds: 0 preempted_vcore_seconds: 0 application_resource_usage_map { key: "memory-mb" value: 3804903 } application_resource_usage_map { key: "vcores" value: 2232 } application_preempted_resource_usage_map { key: "memory-mb" value: 0 } application_preempted_resource_usage_map { key: "vcores" value: 0 } } originalTrackingUrl: "http://bigdata3:4040" currentApplicationAttemptId { application_id { id: 1 cluster_timestamp: 1675390427337 } attemptId: 1 } progress: 0.1 applicationType: "SPARK" log_aggregation_status: LOG_NOT_START unmanaged_application: false priority { priority: 0 } appNodeLabelExpression: "<Not set>" amNodeLabelExpression: "<DEFAULT_PARTITION>" appTimeouts { application_timeout_type: APP_TIMEOUT_LIFETIME application_timeout { application_timeout_type: APP_TIMEOUT_LIFETIME expire_time: "UNLIMITED" remaining_time: -1 } } launchTime: 1675390548575
report1 : num_used_containers: 3 num_reserved_containers: 0 used_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } reserved_resources { memory: 0 virtual_cores: 0 resource_value_map { key: "memory-mb" value: 0 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 0 units: "" type: COUNTABLE } } needed_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } memory_seconds: 3804903 vcore_seconds: 2232 queue_usage_percentage: 41.666664 cluster_usage_percentage: 41.666664 preempted_memory_seconds: 0 preempted_vcore_seconds: 0 application_resource_usage_map { key: "memory-mb" value: 3804903 } application_resource_usage_map { key: "vcores" value: 2232 } application_preempted_resource_usage_map { key: "memory-mb" value: 0 } application_preempted_resource_usage_map { key: "vcores" value: 0 }
id:application_1675390427337_0001
host:192.168.41.133
applicationtype : SPARK
name : SparkSQL::192.168.41.132
starttime 1675390547814
finishtime : 0
user:hadoop
memeveryscends:3804903
vcoreeveryscends:2232
size:5120
cores3
state:RUNNING
url:http://bigdata4:9999/proxy/application_1675390427337_0001/
resources:name: memory-mb, units: Mi, type: COUNTABLE, value: 5120, minimum allocation: 0, maximum allocation: 9223372036854775807, tags: [], attributes {},name: vcores, units: , type: COUNTABLE, value: 3, minimum allocation: 0, maximum allocation: 9223372036854775807, tags: [], attributes {}
---

The same application as shown in the YARN UI:

(screenshot omitted) The figures match the collected data.

Sending the data to Kafka

As follows:

package sparkfirst
import java.util.Properties

import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.flink.api.java.utils.ParameterTool

import scala.util.Random
object testyarn {
def apply(parameterTool: ParameterTool): testyarn = new testyarn(parameterTool)
def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
testyarn(tool).excute()
}
}

class testyarn(parameterTool: ParameterTool) extends YarnInfo {

val properties = new Properties
properties.put("bootstrap.servers", "bigdata3:9092,bigdata4:9092,bigdata5:9092 ")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("acks", "all")


def excute() ={
val producer: Producer[String, String] = new KafkaProducer[String, String](properties)
val i = new Random().nextInt(10) % 3
val strings = getYarnInfo.split("-------------------------------------------------------------")
for (elem <- strings) {
println(elem)
producer.send(new ProducerRecord[String, String]("yarninfo", i, " ", elem))
}

producer.close()
}
}

Consuming the data

Consume it with Spark Streaming:

package project

import java.lang.reflect.Field
import java.util.Properties

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.poi.ss.formula.functions.T

import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql
import org.apache.spark.sql.catalyst.plans.logical.MapPartitions
import tool._
object makeYArninfo {

def apply(parameterTool: ParameterTool): makeYArninfo = new makeYArninfo(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
makeYArninfo(tool).excute()
}
}

class makeYArninfo(parameterTool: ParameterTool) extends Serializable {
import org.apache.spark.streaming.kafka010._
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.TaskContext


val kafkaip = parameterTool.get("kafkaip","bigdata3:9092,bigdata4:9092,bigdata5:9092")
val groupid = parameterTool.get("groupid","test-3")
val offsetreset = parameterTool.get("offsetset" , "earliest")
val topicid = parameterTool.get("topic","yarninfo")
val mideng = parameterTool.get("mideng","timestamp")
val url = parameterTool.get("url","jdbc:clickhouse://ip:8123/bigdata")
val root = parameterTool.get("root","default")
val password = parameterTool.get("password","123456")
val driver = parameterTool.get("driver","com.clickhouse.jdbc.ClickHouseDriver")
val dbtable = parameterTool.get("dbtable","yarninfo_zihang")
val mode = parameterTool.get("mode","append")



val kafkaParams = Map[String,Object](
"bootstrap.servers" -> kafkaip, // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> groupid, // 指定消费者组
"auto.offset.reset" -> offsetreset, // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)
private val streamingcontext = new streamingcontext

private val savefile = new savefile

def excute()={

val streaming = streamingcontext.getstreamingnocheckpoint()
val topic = Array(topicid)
val stream = KafkaUtils.createDirectStream(
streaming,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams)
)
// capture the offset ranges of this batch
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(rdd.partitions.size)
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
println("----------------------------------------------")
val wordsDataFrametmp = rdd.map(_.value()).filter(_.nonEmpty).map(line => {
var str:String = line
if (line.startsWith("\r\n\r\n")){
if (line.startsWith("\r\n\r\n")){
str = line.replace("\r\n\r\n", "\r\n")
}
}
str.split("\r\n")
}).filter(_.nonEmpty)

var wordsDataFrame:sql.DataFrame = null

// def getTypeTag[T: ru.TypeTag](obj: T) = ru.typeTag[T]
//
// val tpe = getTypeTag(wordsDataFrametmp).tpe
//
// tpe.dealias.getClass.getFields.foreach(println(_))
// println("---------------------------------")
// tpe.getClass.getDeclaredFields.foreach(println(_))



// println("\"wordsDataFrametmp的数据\" ")
// //wordsDataFrametmp.collect().foreach(_.foreach(println(_)))
// println("wordsDataFrametmp")
// wordsDataFrametmp.toDF("total").show(false)
// println("rdd.map(_.value())")
// rdd.map(_.value()).toDF("total").show(false)
// println("rdd.map(_.value()).map(_.split(\"\\r\\n\"))")
// rdd.map(_.value()).map(line => {
// var str:String = line
// if (line.startsWith("\r\n\r\n")){
// str = line.replace("\r\n\r\n", "\r\n")
// }
// str.split("\r\n")
// }).toDF("total").show(false)




// ------------------------------------------------------------------------------------------------------
if(!((wordsDataFrametmp.collect().length == 1)&&(wordsDataFrametmp.collect().length == 0))){
wordsDataFrame= wordsDataFrametmp.map(strings=>{
val id = strings(1).split(":")(1)
val host = strings(2).split(":")(1)
val applicationtype = strings(3).split(":")(1)
val name = strings(4).split("&&")(1)
val startime = strings(5).split(":")(1)
val endtime = strings(6).split(":")(1)
val user = strings(7).split(":")(1)
val memeveryscends = strings(8).split(":")(1)
val vcoreeveryscends = strings(9).split(":")(1)
val size = strings(10).split(":")(1).toLong
val cores = strings(11).split(":")(1).toLong
val state = strings(12).split(":")(1)
val url = strings(13).split("&&")(1)
val queue = strings(14).split(":")(1)
val timestamp = strings(15).split("&&")(1)
(id,host,applicationtype,name,startime,endtime,user,memeveryscends,vcoreeveryscends,size,cores,state,url,queue,timestamp)
})
.toDF("id","host",
"applicationtype","name",
"startime","endtime",
"user","memeveryscends",
"vcoreeveryscends","size",
"cores","state","url","queue","timestamp")
if (!wordsDataFrame.isEmpty){
wordsDataFrame.show()
savefile.savetojdbc(spark, wordsDataFrame, url , root , password,dbtable,driver,mideng,mode)
}
}



// commit the offsets once the batch has been processed
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
streaming.start()
streaming.awaitTermination()
}


}

Deployment

When deploying on the cluster, dependencies can be supplied with --jars.

For example:

spark-submit \
--master yarn \
--deploy-mode client \
--name 录入yarninfo \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/jar/kafka/spark-streaming-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/spark-token-provider-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/kafka-clients-2.2.1.jar,/home/hadoop/software/jar/connect/clickhouse-jdbc-0.3.2.jar,/home/hadoop/software/jar/connect/clickhouse-http-client-0.3.2.jar,/home/hadoop/software/jar/connect/clickhouse-client-0.3.2.jar,/home/hadoop/software/jar/flink/flink-clients_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-core-1.13.6.jar,/home/hadoop/software/jar/flink/flink-scala_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-java-1.13.6.jar \
--class project.makeYArninfo \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
--kafkaip namenode:9092,resourcemanager:9092,workers:9092
-------------------------------------------------------------------
spark-submit \
--master yarn \
--name 采集yarn \
--deploy-mode client \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/jar/kafka/spark-streaming-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/spark-token-provider-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/kafka-clients-2.2.1.jar,/home/hadoop/software/jar/flink/flink-clients_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-core-1.13.6.jar,/home/hadoop/software/jar/flink/flink-scala_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-java-1.13.6.jar \
--class sparkfirst.testyarn \
--queue zihan \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
--kafkaip namenode:9092,resourcemanager:9092,workers:9092


Alternatively, let Maven resolve the dependencies

by passing the parameters

--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \

--packages lists the dependency coordinates, and --repositories above points at the Maven repository to resolve them from.

The artifacts are downloaded only on first use and cached afterwards.

Or build a fat jar with the assembly plugin:

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>

Scheduling and alerting with XXL-JOB

Wrap the launch command in a shell script and schedule it with XXL-JOB.

The script:

pid=$(jps |  grep SparkSubmit | awk '{print $1}')
if [ ! -n "$pid" ];then
yarninfo.sh
ssh bigdata3 "/home/hadoop/shell/ding.sh 梅花十三 采集yarn日志 请登录查看 192.168.41.133 15046528047"
else
echo "信息正常"
fi

The above is only a simple script; for genuine real-time monitoring, write something more robust.

api

jdbctohive

package project


import java.util

import org.apache.spark.sql.catalog.Catalog
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import tool.sqlUtils
import tool.getmysqldf
import tool.savefile
import tool.readfile
import org.apache.flink.api.java.utils.ParameterTool
object jdbctohive{
def apply(parameterTool: ParameterTool): jdbctohive = new jdbctohive(parameterTool)

def main(args: Array[String]): Unit = {
if (args.length==0){
println(
"""
|欢迎使用本程序
|参数详情 mysql hive
|-------------------------mysql
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|tablename => 支持谓词下压 例子 : emp 或者 select * from emp 等
|driver => com.mysql.jdbc.Driver
|---------------------------hive
|mode模式 overwrite append 等
|hive中的table 例子 bigdata.emp
|可选参数 分区字段 自动开启的是动态分区 例子 deptno
|分区字段 [字段值] [标志位]:代表是不是只更新这一个分区的数据
|jdbc:mysql://bigdata2:3306/try root liuzihan010616 "select * from emp " com.mysql.jdbc.Driver append default.tmp deptno,sal,test,re 999,888
|""".stripMargin)
}
val tool = ParameterTool.fromArgs(args)
jdbctohive(tool).excute(args)
}
}




class jdbctohive(parameterTool: ParameterTool) {
System.setProperty("HADOOP_USER_NAME","hadoop")
val spark = SparkSession.builder().appName("sqoop").master("local[4]").enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")

val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog
var changecolunm = false
import spark.implicits._
import org.apache.spark.sql.functions._

val url = parameterTool.getRequired("url")
val user = parameterTool.getRequired("user")
val password = parameterTool.getRequired("password")
val table = parameterTool.getRequired("table")
val driver = parameterTool.getRequired("driver")
val mode = parameterTool.getRequired("mode")
val hivetable = parameterTool.getRequired("hivetable")
val hivepartition = parameterTool.get("hivepartition",null)
val partitionValues = parameterTool.get("partitionValues")
val insertpartition = parameterTool.get("insertpartition")


def excute(args: Array[String]): Unit = {


// 获取jdbc的df
val mysqlconnect = getmysqldf.getmysqldataframe(spark, url, user, password, table , driver)
// 验证指示
mysqlconnect.show()
// 生成hive参数数组
// var hiveconf = new Array[String](args.length-5)
// hiveconf = util.Arrays.copyOfRange(args, 5, args.length)
//hiveconf.foreach(println(_))
jdbctohive(args.length,catalog,mysqlconnect)
spark.stop()
}



def changecolnums(int: Int,resourcesql:DataFrame) ={
var finallyresult:Dataset[Row] = null // 最终结果集
var frame:DataFrame = null // 中间变量
val strings2 = hivepartition.split(",")
var hiveconclumns = spark.table(hivetable).columns // hive的列数
//hiveconclumns.foreach(println((_))) // 验证hive的列数
var mysqlconnect:DataFrame = resourcesql // 设置数据源的resource

// 判断分区字段在不在jdbc的数据里,如果不在,则在jdbc的数据源中先添加上分区字段
var strings1:Array[String] =null
if (int > 8 && partitionValues != null){
strings1 = partitionValues.split(",")
}
var flagtmp:Int = 0;
for (elem <- strings2){
if (!mysqlconnect.columns.contains(elem)){
println(elem)
println(strings1(flagtmp))
mysqlconnect = mysqlconnect.withColumn(elem,lit(strings1(flagtmp)))
flagtmp = flagtmp + 1
mysqlconnect.show()
}
}



val jdbcconclumns = mysqlconnect.columns // jdbc的列数


var jdbcoldsource:Dataset[Row] = null // 源数据库的数据 checkpoint是为了破坏数据均衡,以后能编写变读取

if (int == 10){
hivepartition.split(",")(0) match {
case "" => {
println("-------------------------无操作")
}
case _ => {
hivepartition.split(",").length match {
case 1 =>
{
jdbcoldsource = spark.sql(
s"""
|select * from ${hivetable} where ${hivepartition} != ${partitionValues}
|""".stripMargin).checkpoint()
}
case _ =>
{
var tmpstring:String = null
var flag:Int = 0
val flagvalue = partitionValues.split(",")
for (elem <- hivepartition.split(",")){
if (elem == hivepartition.split(",")(hivepartition.split(",").length-1)){
tmpstring = tmpstring + elem + "!=" + flagvalue(flag)
}else{
tmpstring = tmpstring + elem + "!=" + flagvalue(flag) + "and"
}
flag = flag + 1
}
jdbcoldsource = spark.sql(
s"""
|select * from ${hivetable} where ${tmpstring}
|""".stripMargin).checkpoint()
}
}
}
}


}else{
jdbcoldsource = spark.sql(
s"""
|select * from ${hivetable}
|""".stripMargin).checkpoint()
}

var existcolunms: Array[String] = null // 设置hive或者mysql的额外列
var resultdf: DataFrame = jdbcoldsource // 获取hive的数据原始数据

// 判断是hive的列多,还是数据源的列数多
if (hiveconclumns.length >= jdbcconclumns.length){
// 判断额外列的存在
existcolunms= hiveconclumns.filter(hivecol => {
val bool = jdbcconclumns.map(jdbccol => {
jdbccol == hivecol
}).contains(true)
!bool
})
// 判断两个列数是不是相等
if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
// 列数不相等的时候让列数少的加列
resultdf = mysqlconnect
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
// 对字段进行排序 , 让分区数据的分区字段在最后一列
frame = resultdf.selectExpr(hiveconclumns: _*)
// 验证数据
frame.show()
// 整合历史数据
finallyresult = jdbcoldsource.union(frame)
// 验证数据
finallyresult.show()
changecolunm = true
finallyresult
}
}else{
// 数据的列多
existcolunms= jdbcconclumns.filter(jdbccol => {
val bool = hiveconclumns.map(hivecol => {
jdbccol == hivecol
}).contains(true)
!bool
})

if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
frame = resultdf.selectExpr(jdbcconclumns: _*)
finallyresult = frame.union(mysqlconnect)
changecolunm = true
finallyresult
}
}
}






def jdbctohive(int: Int,catalog: Catalog,mysqlconnect: DataFrame)={
// 分割字符串获取hive的 表和数据库
val hivedbandtables = hivetable.split("\\.")
val hivepart = hivepartition.split(",")
hivepart.foreach(println(_))
// catalog的方法 获取表存不存在的方法
// catalog.listTables(strings(0)).show()
// val empty = catalog.listTables(strings(0)).filter(x => {
// x.name == strings(1)
// }).isEmpty
val empty = catalog.tableExists(hivedbandtables(0),hivedbandtables(1))
//-----------------------------------------------------------------------------
// sql的方法
// val empty1 = spark.sql(
// """
// |show tables in hivedb
// |""".stripMargin).filter("tableName = 'hivetablename'").isEmpty
// --------------------------------------------------------------------------


// 判断列数是不是相等
var frameresult:DataFrame = mysqlconnect
// 先判断表存不存在 ,因为判断列数的方法要求表存在
empty match {
// 表不存在
case false => {
// 判断输入的变量个数执行 判断分区表还是普通表
if (int > 7) {
println("-----------------分区表")
// 判断分区的参数在不在列中 如果不在 ,则加上 ,在的话就自动往下走
var hivepartval:Array[String] =null
if (int > 8 && partitionValues != null){
hivepartval = partitionValues.split(",")
}
var flagtmp:Int = 0;
for (elem <- hivepart){
if (!mysqlconnect.columns.contains(elem)){
println(elem)
println(hivepartval(flagtmp))
frameresult = frameresult.withColumn(elem,lit(hivepartval(flagtmp)))
flagtmp = flagtmp + 1
}
}
}else{
println("-----------普通表")
frameresult = mysqlconnect
mysqlconnect.show()
}
}

case true => {
// 表存在
// 判断是不是分区表
frameresult = changecolnums(int, mysqlconnect)
// if (args.length > 7) {
// println("-----------------分区表")
// if (!mysqlconnect.columns.contains(args(7))){
// frameresult = changecolnums(args, hiveconf, mysqlconnect)
// }
// }else{
// println("-----------普通表")
// frameresult = mysqlconnect
// }
frameresult.show()}
}









spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
println(empty)
saveFile.savetohiveapi(spark,empty,frameresult,hivetable,mode,hivepartition,changecolunm)
}

}

hivetojdbc

package project

import java.util

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalog.Catalog
import tool.{getmysqldf, savefile, sqlUtils,readfile}
import org.apache.flink.api.java.utils.ParameterTool
object hivetojdbc{
def apply(parameterTool: ParameterTool): hivetojdbc = new hivetojdbc(parameterTool)

def main(args: Array[String]): Unit = {
if (args.length==0){
println(
"""
|欢迎使用本程序
|参数说明
|总体参数种类 hive mysql
|---------------------------hive
|hive中要选择的字段 例子 : "sal,big / * "
|hive的table的名字 例子 : bigdata_hive3.emp
|hive中的 条件可以为空 例子 : where sal > '300'
|---------------------------mysql
|savemode overwrite append 等
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|dbtable 例子 : emp
|幂等性的列 : 例子 : sal
|驱动名称 : 例子 com.mysql.jdbc.Driver
|""".stripMargin)
}
val tool = ParameterTool.fromArgs(args)
hivetojdbc(tool).excute()
}
}




class hivetojdbc(parameterTool: ParameterTool) {
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog
val hiveconclunms = parameterTool.getRequired("hiveconclumns")
val hivetable = parameterTool.getRequired("hivetable")
val hiveoption = parameterTool.get("hiveoption",null)
val url = parameterTool.get("url","jdbc:mysql://bigdata2:3306/bigdata")
val user = parameterTool.get("user","root")
val pasword = parameterTool.get("password","liuzihan010616")
val dbtable = parameterTool.getRequired("dbtable")
val driver = parameterTool.getRequired("driver")
val midengconclumns = parameterTool.getRequired("col")
val mode = parameterTool.getRequired("mode")

def excute(): Unit = {

val frame = sqlUtils.checksql(spark, sqlUtils.hivesqlchoose(hiveconclunms,hivetable,hiveoption))
saveFile.savetojdbc(spark,frame,url,user,pasword,dbtable,driver,midengconclumns,mode)
}

}

SQL approach

jdbctohive

package sparkfirst

import org.apache.spark.sql.SparkSession
import tool.savefile
import tool.sqlUtils
import org.apache.spark.sql.functions._
object test {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
private val savefile = new savefile
private val utils = new sqlUtils
def main(args: Array[String]): Unit = {
val df = spark.read.format("JDBC")
.option("url","jdbc:mysql://bigdata2:3306/try")
.option("dbtable", "emp")
.option("user", "root")
.option("password", "liuzihan010616")
.load()
df.select("sal").tail(1).foreach(println(_))
println(df.select("sal").tail(1)(0)(0))
df.show()


var str:String = null
val bool = spark.catalog.tableExists("default.tmp")
if (bool){
spark.sql(
s"""
|drop table default.tmp
|""".stripMargin)
str = utils.mkcreatesql(df, "default.tmp", "text", "','","deptno,hiredate")
utils.checksql(spark,str)
}else{
str = utils.mkcreatesql(df, "default.tmp", "text", "','","deptno,hiredate")
utils.checksql(spark,str)
}

spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
val frame = df.withColumn("ee", lit("aaa"))
utils.insertmake(spark,df,"default.tmp","','","deptno,hiredate")
utils.changecolunms(spark,frame,"default.tmp")
utils.insertmake(spark,frame,"default.tmp","','","deptno,hiredate")


}
}

hivetojdbc

Use the custom API defined in sqlUtils.

Design and implemented features

source

Receive the JDBC data through the API,

using

getmysqldf.getmysqldataframe(spark, url, user, password, table , driver)
----------------------------------------------------------------------------------------------
def getmysqldataframe(sparkSession: SparkSession,string: String*) ={
val sql = string(3)
val frame: DataFrame = sparkSession.read.format("jdbc").options(Map("url" -> string(0), "user" -> string(1), "password" -> string(2), "dbtable" -> s"($sql) as tmp","driver"->string(4))).load()
frame
}

to obtain the JDBC DataFrame.

Alternative: the single options(Map(...)) call can be replaced with several option(key, value) calls, as sketched below.
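For example, a minimal sketch using the same example URL and credentials that appear elsewhere in this post:

import org.apache.spark.sql.{DataFrame, SparkSession}

// the same JDBC read expressed with individual option(key, value) calls
def readJdbc(spark: SparkSession, sql: String): DataFrame =
  spark.read.format("jdbc")
    .option("url", "jdbc:mysql://bigdata2:3306/try")
    .option("user", "root")
    .option("password", "liuzihan010616")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", s"($sql) as tmp")   // wrap the query so predicates can be pushed down
    .load()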

Using

 val frame = sqlUtils.checksql(spark, sqlUtils.hivesqlchoose(hiveconclunms,hivetable,hiveoption))
-----------------------------------------------------------------------------------------------------------
def checksql(spark:SparkSession, string: String)={
spark.sql(string)
}
---------------------------------------------------------------------------------------------------------------
def hivesqlchoose(hiveconclumns:String,hivetable:String,hiveoptions:String)={

"select" + " " + hiveconclumns + " " + "from" + " " + hiveconclumns + " " + hiveoptions
}

Prerequisite: the SparkSession must be created with enableHiveSupport().
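A minimal sketch of that prerequisite (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hivetojdbc")
  .enableHiveSupport()   // lets spark.sql(...) and saveAsTable resolve the Hive metastore
  .getOrCreate()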

todo

Integrate and process the data through the API.

Features:

  • jdbctohive
    • Basic features
      • Sync non-partitioned tables
      • Sync partitioned tables
        • Single partition
        • Multiple partitions
    • Extra features (API)
      • User-defined partition columns and values
      • When the user adds a column to the JDBC data, the Hive table gains the column automatically
      • Partition columns can be changed without losing the source data
      • Partitioning by a mix of columns from the table and user-defined columns
      • Append to, or rewrite, a single partition
      • Configure the Hive table's storage and compression format
      • Append to, or rewrite, all partitions
      • Deployed via Flink's ParameterTool
    • Extra features (SQL)
      • User-defined partition columns and values
      • When the user adds a column to the JDBC data, the Hive table gains the column automatically
      • Partitioning by a mix of columns from the table and user-defined columns
      • Append to, or rewrite, a single partition
      • Append to, or rewrite, all partitions
      • Configure the Hive table's storage format
      • Support storage formats such as text
      • Deployed via Flink's ParameterTool
  • hivetojdbc
    • Basic features
      • Sync data
    • Extra features
      • Idempotent writes

The basic features have no special caveats. For multiple partitions, I build the partition clause by taking the partition string, splitting it, mapping each column to append its data type, and joining the pieces with mkString.

sql:

Partition columns come in two kinds: columns already present in the data and columns defined by the user. User-defined partition columns are simply declared as string. For partition columns that come from the data I keep the original type: I look up the column in the DataFrame schema and use its dataType when building the clause. The two parts are then concatenated and prefixed with partitioned by. The tricky points are extracting the variables, detecting the last element while joining, and checking whether a partition column already exists among the data's columns (see the sketch below).
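A hedged sketch of that idea (the helper name and signature are mine, not from the project code): keep the original type for partition columns that exist in the source DataFrame, default user-defined ones to string, and assemble the clause.

import org.apache.spark.sql.DataFrame

def partitionClause(df: DataFrame, partitionCols: Seq[String]): String = {
  val parts = partitionCols.map { col =>
    df.schema.find(_.name == col) match {
      case Some(field) => s"$col ${field.dataType.simpleString}"  // column from the data: keep its type
      case None        => s"$col string"                          // user-defined partition: default to string
    }
  }
  s"partitioned by (${parts.mkString(", ")})"
}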

api:

The API route is simpler: call partitionBy and pass in the partition columns obtained by splitting the string, expanded with : _*. The caveat is that the partition columns must already exist in the DataFrame, so first check whether each one is present and add it beforehand if it is not (sketched below).
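A sketch of that API path under the same assumptions (the helper is illustrative, not the project's savetohiveapi): add any missing partition column as a literal, then hand the columns to partitionBy.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def writePartitioned(df: DataFrame, hivetable: String, hivepartition: String,
                     values: Map[String, String]): Unit = {
  val cols = hivepartition.split(",")
  val withPartitionCols = cols.foldLeft(df) { (acc, col) =>
    if (acc.columns.contains(col)) acc
    else acc.withColumn(col, lit(values.getOrElse(col, null)))   // add a missing partition column as a literal
  }
  withPartitionCols.write
    .partitionBy(cols: _*)        // partition columns must already exist on the DataFrame
    .mode("append")
    .format("hive")
    .saveAsTable(hivetable)
}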

Extra features:

sink

Write the data out to the target tables through the API.

Sink to Hive

api

    saveFile.savetohiveapi(spark,empty,frameresult,hivetable,mode,hivepartition,changecolunm,fileformated,codec)
------------------------------------------------------------------------------------------------------------------
def savetohiveapi(sparkSession: SparkSession,boolean: Boolean,spark: DataFrame,hivetable:String,mode:String,hivepartition:String,changecolnums:Boolean,fileformat:String,codec:String) = {




if (!boolean){
if (hivepartition != null){
spark.write.partitionBy(hivepartition.split(","):_*).option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}else {

println(hivetable)
println(hivepartition)
spark.write.option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}

}else{
changecolnums match {
case true => {
if (hivepartition != null){
if(sparkSession.table(hivetable).columns.length != spark.columns.length){
sparkSession.sql(
s"""
|drop table ${hivetable}
|""".stripMargin)
}
spark.write.partitionBy(hivepartition.split(","):_*).option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}else {
spark.write.option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}
}
case false => spark.write.option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").insertInto(hivetable)
}

spark.show()
println(spark.count())

}
}

sql

def insertmake(sparkSession: SparkSession,dataFrame: DataFrame,tablename:String,otheroptions:String*) ={

var strings:Array[String] = null

dataFrame.selectExpr(sparkSession.table(tablename).columns:_*).createOrReplaceTempView("tmp")

// val partitionstring = sparkSession.table(tablename).columns.tail(sparkSession.table(tablename).columns.length - 2)
otheroptions.length match {
case 0 => {
sparkSession.sql(
s"""
|insert overwrite ${tablename}
|select * from tmp
|""".stripMargin)
}
case _ => {

if (otheroptions.length > 1){
strings = otheroptions(1).split(",").filter(conclunms => {
!dataFrame.columns.contains(conclunms)
})
val fuzhiarray:Array[String] = util.Arrays.copyOfRange(otheroptions.toArray, 2, otheroptions.length)
fuzhiarray.foreach(println(_))
strings.isEmpty match {
case true => {

sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(1).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select * from tmp
|""".stripMargin)
}
case false => {
var tmpdf:DataFrame = dataFrame
for (i <- 0 to strings.length-1){
tmpdf = tmpdf.withColumn(strings(i),lit(fuzhiarray(i)))
}
tmpdf.show()
tmpdf.printSchema()
tmpdf = tmpdf.selectExpr(sparkSession.table(tablename).columns: _*)
tmpdf.show()
tmpdf.printSchema()
val str = tmpdf.columns.mkString(",\n")
tmpdf.createOrReplaceTempView("smp")
sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(1).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select ${str} from smp
|""".stripMargin)
}
}



}else{
strings = otheroptions(0).split(",").filter(conclunms => {
!dataFrame.columns.contains(conclunms)
})
val fuzhiarray:Array[String] = util.Arrays.copyOfRange(otheroptions.toArray, 1, otheroptions.length)

strings.isEmpty match {
case true => {
sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(0).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select * from tmp
|""".stripMargin)
}

case false => {
var tmpdf:DataFrame = dataFrame

for (i <- 0 to strings.length-1){
tmpdf = tmpdf.withColumn(strings(i),lit(fuzhiarray(i)))
}
tmpdf.show()
tmpdf.printSchema()
tmpdf = tmpdf.selectExpr(sparkSession.table(tablename).columns: _*)
val str = tmpdf.columns.mkString(",\n")
tmpdf.createOrReplaceTempView("smp")
sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(0).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select ${str} from smp
|""".stripMargin)
}
}
}
}
}
}

Sink to JDBC

api

def savetojdbc(spark: SparkSession,df: DataFrame, url:String,user:String,password:String,dbtable:String,driver:String,mideng:String,mode:String)={
val map = Map("url" -> url,
"user" -> user,
"password" -> password,
"dbtable" -> dbtable,
"driver"-> driver)

// df.write.mode(string(2)).format(string(1)).options(map).save()
// -------------------------------------幂等性
val connection = jdbcconnect.getconncet(driver,url,user,password)
try{
val bool = connection.createStatement().executeQuery(s"show tables like '${dbtable}'").next()
if (!bool){
throw new NullPointerException(s"写入的结果表${dbtable} 尚未创建!!!")
}else{
var flag:Any = null
val flagbool = mysqldf.getmyqsldffromMap(spark, map).select(mideng).isEmpty
if (!flagbool){
flag = mysqldf.getmyqsldffromMap(spark, map).select(mideng).tail(1)(0)(0)
}
val tmpresult = df.select(mideng).filter(line => {
line.getString(0) != flag
})

if (df.isEmpty){
println("数据集为空")
}else{
if (tmpresult.isEmpty){
println("你的数据已经插入过")
df.show(false)
}else {
// df.show()
// println(df.count())
//val insertresult = tmpresult.join(df, string(5))
tmpresult.show()
println(tmpresult.count())
// insertresult.show()
// println(insertresult.count())
tmpresult.write.mode(mode).format("jdbc").options(map).save()
}
}
}
}finally {
connection.close()
}


}

jdbctohive

package project


import java.util

import org.apache.spark.sql.catalog.Catalog
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import tool.sqlUtils
import tool.getmysqldf
import tool.savefile
import tool.readfile
object jdbctohive {
System.setProperty("HADOOP_USER_NAME","hadoop")
val spark = SparkSession.builder().appName("sqoop").master("local[4]").enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")

val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog
var changecolunm = false
import spark.implicits._
import org.apache.spark.sql.functions._


def main(args: Array[String]): Unit = {
if (args.length==0){
println(
"""
|欢迎使用本程序
|参数详情 mysql hive
|-------------------------mysql
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|tablename => 支持谓词下压 例子 : emp 或者 select * from emp 等
|driver => com.mysql.jdbc.Driver
|---------------------------hive
|mode模式 overwrite append 等
|hive中的table 例子 bigdata.emp
|可选参数 分区字段 自动开启的是动态分区 例子 deptno
|""".stripMargin)
}

val url = args(0)
val user = args(1)
val password = args(2)
val table = args(3)
val driver = args(4)
// 获取jdbc的df
val mysqlconnect = getmysqldf.getmysqldataframe(spark, url, user, password, table , driver)
// 验证指示
mysqlconnect.show()
// 生成hive参数数组
var hiveconf = new Array[String](args.length-5)
hiveconf = util.Arrays.copyOfRange(args, 5, args.length)
//hiveconf.foreach(println(_))
jdbctohive(args,catalog,mysqlconnect,hiveconf)
spark.stop()
}



def changecolnums(args:Array[String],hiveconf:Array[String],resourcesql:DataFrame) ={
var finallyresult:Dataset[Row] = null // 最终结果集
var frame:DataFrame = null // 中间变量
var hiveconclumns = spark.table(args(6)).columns // hive的列数
hiveconclumns.foreach(println((_))) // 验证hive的列数
var mysqlconnect:DataFrame = resourcesql // 设置数据源的resource

// 判断分区字段在不在jdbc的数据里,如果不在,则在jdbc的数据源中先添加上分区字段
if (args.length > 7){
if (!resourcesql.columns.contains(args(7))){
mysqlconnect = resourcesql.withColumn(args(7),lit(args(8)))
}
}

val jdbcconclumns = mysqlconnect.columns // jdbc的列数


var jdbcoldsource:Dataset[Row] = null // 源数据库的数据 checkpoint是为了破坏数据均衡,以后能编写变读取

if (args.length == 10){
jdbcoldsource = spark.sql(
s"""
|select * from ${hiveconf(1)} where ${hiveconf(2)} != ${hiveconf(3)}
|""".stripMargin).checkpoint()
}else{
jdbcoldsource = spark.sql(
s"""
|select * from ${hiveconf(1)}
|""".stripMargin).checkpoint()
}

var existcolunms: Array[String] = null // 设置hive或者mysql的额外列
var resultdf: DataFrame = jdbcoldsource // 获取hive的数据原始数据

// 判断是hive的列多,还是数据源的列数多
if (hiveconclumns.length >= jdbcconclumns.length){
// 判断额外列的存在
existcolunms= hiveconclumns.filter(hivecol => {
val bool = jdbcconclumns.map(jdbccol => {
jdbccol == hivecol
}).contains(true)
!bool
})
// 判断两个列数是不是相等
if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
// 列数不相等的时候让列数少的加列
resultdf = mysqlconnect
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
// 对字段进行排序 , 让分区数据的分区字段在最后一列
frame = resultdf.selectExpr(hiveconclumns: _*)
// 验证数据
frame.show()
// 整合历史数据
finallyresult = jdbcoldsource.union(frame)
// 验证数据
finallyresult.show()
changecolunm = true
finallyresult
}
}else{
// 数据的列多
existcolunms= jdbcconclumns.filter(jdbccol => {
val bool = hiveconclumns.map(hivecol => {
jdbccol == hivecol
}).contains(true)
!bool
})

if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
frame = resultdf.selectExpr(jdbcconclumns: _*)
finallyresult = resultdf.union(mysqlconnect)
changecolunm = true
finallyresult
}
}
}






def jdbctohive(args:Array[String],catalog: Catalog,mysqlconnect: DataFrame, hiveconf: Array[String])={
// 分割字符串获取hive的 表和数据库
val strings = hiveconf(1).split("\\.")

// catalog的方法 获取表存不存在的方法
// catalog.listTables(strings(0)).show()
// val empty = catalog.listTables(strings(0)).filter(x => {
// x.name == strings(1)
// }).isEmpty
val empty = catalog.tableExists(strings(0),strings(1))
//-----------------------------------------------------------------------------
// sql的方法
// val empty1 = spark.sql(
// """
// |show tables in hivedb
// |""".stripMargin).filter("tableName = 'hivetablename'").isEmpty
// --------------------------------------------------------------------------


// 判断列数是不是相等
var frameresult:DataFrame = null
// 先判断表存不存在 ,因为判断列数的方法要求表存在
empty match {
// 表不存在
case false => {
// 判断输入的变量个数执行 判断分区表还是普通表
if (args.length > 7) {
println("-----------------分区表")
// 判断分区的参数在不在列中 如果不在 ,则加上 ,在的话就自动往下走
if (!mysqlconnect.columns.contains(args(7))){
frameresult = mysqlconnect.withColumn(args(7),lit(args(8)))
frameresult.show()
}
}else{
println("-----------普通表")
frameresult = mysqlconnect
mysqlconnect.show()
}
}

case true => {
// 表存在
// 判断是不是分区表
frameresult = changecolnums(args, hiveconf, mysqlconnect)
// if (args.length > 7) {
// println("-----------------分区表")
// if (!mysqlconnect.columns.contains(args(7))){
// frameresult = changecolnums(args, hiveconf, mysqlconnect)
// }
// }else{
// println("-----------普通表")
// frameresult = mysqlconnect
// }
frameresult.show()}
}









spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
println(empty)
hiveconf.foreach(println(_))
saveFile.savetohiveapi(empty,frameresult,hiveconf,changecolunm)
}

}

hivetojdbc

package project

import java.util

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalog.Catalog
import project.jdbctohive.spark
import tool.{getmysqldf, savefile, sqlUtils,readfile}

object hivetojdbc {
val spark = SparkSession.builder().appName("sqoop").master("local[4]").enableHiveSupport().getOrCreate()
val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog

def main(args: Array[String]): Unit = {

if (args.length==0){
println(
"""
|欢迎使用本程序
|参数说明
|总体参数种类 hive mysql
|---------------------------hive
|hive中要选择的字段 例子 : "sal,big / * "
|hive的table的名字 例子 : bigdata_hive3.emp
|hive中的 条件可以为空 例子 : where sal > '300'
|---------------------------mysql
|savemode overwrite append
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|dbtable 例子 : emp
|幂等性的列 : 例子 : sal
|驱动名称 : 例子 com.mysql.jdbc.Driver
|""".stripMargin)
}
val frame = sqlUtils.checksql(spark, sqlUtils.hivesqlchoose(args))

var mysqlconf = new Array[String](args.length-3)
mysqlconf = util.Arrays.copyOfRange(args, 2, args.length)
saveFile.savetojdbc(spark,frame,mysqlconf)


}

}

flink

Overview

The project formally got under way between 2014 and January 2015.

"Flink" itself is a German word meaning quick and nimble.

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Here, we explain important aspects of Flink’s architecture.

Data types

Bounded streams: streams with a defined start and end, also known as batch processing.

Unbounded streams: streams that have a start but no defined end.

Flink characteristics

The data Flink targets has gradually shifted from bounded to unbounded streams: it used to be mostly bounded data, now it is unbounded.

Flink is a distributed system.

Flink is optimized for local state access: task state is always kept in memory, and if it grows beyond the available memory it is spilled to disk data structures that can still be accessed efficiently. Tasks perform all computations against local (usually in-memory) state, which yields very low processing latency. Flink guarantees exactly-once state consistency under failure by periodically and asynchronously persisting this local state.

Flink applications

Data

  • Bounded
  • Unbounded
  • Real-time
  • Offline

State

Only applications that transform each individual event independently need no state; in other words, every stream processing application of any real complexity is stateful.

Flink provides many features for state management, including:

  • Multiple state primitives: for example value, map and list state.
  • Pluggable state backends: a state backend manages application state and checkpoints it when needed. Flink ships with state backends that keep state in memory or in RocksDB, an efficient embedded, persistent key-value store, and also supports custom, pluggable state backends.
  • Exactly-once semantics: the same exactly-once guarantee as in Kafka, which means Flink supports transactional behaviour.
  • Very large state: thanks to its asynchronous and incremental checkpointing algorithm, Flink can maintain application state of several terabytes.
  • Elastic applications: Flink scales stateful applications horizontally by redistributing state across more or fewer worker nodes.

Time

Time is another important ingredient of stream processing. Events always occur at a specific point in time, so most event streams carry inherent time semantics, and many common stream computations are time-based: window aggregations, session computation, pattern detection and time-based joins. A key aspect of stream processing is how the application measures time, i.e. the distinction between event time and processing time.

  • Event-time mode: applications using event-time semantics compute results from the timestamps carried by the events themselves, so results are accurate and consistent whether the events are historical or arriving live.
  • Watermark support: Flink uses watermarks to track progress in event time; watermarks are also a flexible mechanism for trading off latency against completeness.
  • Late data handling: when processing in event-time mode with watermarks, data may still arrive after a computation has completed; such events are called late events. Flink offers several ways to handle them, such as redirecting them to a side output or updating previously emitted results.
  • Processing-time mode: besides event time, Flink also supports processing-time semantics, which trigger computations off the machine clock of the processing engine. This suits applications with strict low-latency requirements that can tolerate approximate results.

Layered APIs

Flink offers three APIs at different levels of abstraction. Each one trades conciseness against expressiveness and targets different use cases.

As follows:

ProcessFunction

ProcessFunction is Flink's most expressive interface. It can process individual events from one or two input streams, or events grouped into a specific window. It provides fine-grained control over time and state: handlers can modify state arbitrarily and register timers that trigger callbacks at a future point in time, so ProcessFunction can implement the per-event business logic that many stateful, event-driven applications require.

It essentially acts as a timer; below is the official example.

The example records a START event and registers a four-hour timer. If an END event arrives first, it emits the elapsed time; when the timer fires after four hours, the state is simply cleared.

/**

* 将相邻的 keyed START 和 END 事件相匹配并计算两者的时间间隔
* 输入数据为 Tuple2<String, String> 类型,第一个字段为 key 值,
* 第二个字段标记 START 和 END 事件。
*/
public static class StartEndDuration
extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, Long>> {

private ValueState<Long> startTime;

@Override
public void open(Configuration conf) {
// obtain state handle
startTime = getRuntimeContext()
.getState(new ValueStateDescriptor<Long>("startTime", Long.class));
}

/** Called for each processed event. */
@Override
public void processElement(
Tuple2<String, String> in,
Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {

switch (in.f1) {
case "START":
// set the start time if we receive a start event.
startTime.update(ctx.timestamp());
// register a timer in four hours from the start event.
ctx.timerService()
.registerEventTimeTimer(ctx.timestamp() + 4 * 60 * 60 * 1000);
break;
case "END":
// emit the duration between start and end event
Long sTime = startTime.value();
if (sTime != null) {
out.collect(Tuple2.of(in.f0, ctx.timestamp() - sTime));
// clear the state
startTime.clear();
}
default:
// do nothing
}
}

/** Called when a timer fires. */
@Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) {

// Timeout interval exceeded. Cleaning up the state.
startTime.clear();
}
}

DataStream API

The DataStream API provides primitives for many common stream processing operations: windows, record-at-a-time transformations, enriching events with external database lookups, and so on. It is available for Java and Scala and comes with predefined functions such as map(), reduce() and aggregate(); custom functions can be defined by implementing the interfaces or with Java/Scala lambda expressions.

The following example captures all click events within a session window and counts the clicks of each session.

// 网站点击 Click 的数据流
DataStream<Click> clicks = ...

DataStream<Tuple2<String, Long>> result = clicks
// 将网站点击映射为 (userId, 1) 以便计数
.map(
// 实现 MapFunction 接口定义函数
new MapFunction<Click, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(Click click) {
return Tuple2.of(click.userId, 1L);
}
})
// 以 userId (field 0) 作为 key
.keyBy(0)
// 定义 30 分钟超时的会话窗口
.window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
// 对每个会话窗口的点击进行计数,使用 lambda 表达式定义 reduce 函数
.reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
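For comparison, a rough Scala sketch of the same logic (it assumes a Click(userId) case class and a clicks stream whose event-time timestamps and watermarks have already been assigned):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Click(userId: String)

def countClicksPerSession(clicks: DataStream[Click]): DataStream[(String, Long)] =
  clicks
    .map(click => (click.userId, 1L))                            // map each click to (userId, 1)
    .keyBy(_._1)                                                 // key by userId
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))   // 30-minute session windows
    .reduce((a, b) => (a._1, a._2 + b._2))                       // count the clicks per session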

SQL & Table API

Flink has two relational APIs, the Table API and SQL. Both are unified for batch and stream processing: on unbounded real-time streams and on bounded historical data, queries run with the same semantics and produce the same results. Both use Apache Calcite for query parsing, validation and optimization, integrate seamlessly with the DataStream and DataSet APIs, and support user-defined scalar, aggregate and table-valued functions.

Flink's relational APIs are designed to simplify the definition of data analytics, data pipelines and ETL applications.

The following SQL query expresses the same logic as the DataStream example above: capture the click events of each session and count its clicks.

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId

Flink ships with several libraries for common data processing scenarios. They are embedded in the APIs rather than fully standalone, so they benefit from all API features and integrate with other libraries.

  • Complex Event Processing (CEP): pattern detection is a very common use case in event stream processing. Flink's CEP library provides an API to specify event patterns, for example as regular expressions or state machines. It integrates with the DataStream API so that patterns are evaluated on DataStreams. Applications include network intrusion detection, business process monitoring and fraud detection.
  • DataSet API: the core API for batch applications. Its basic operators include map, reduce, (outer) join, co-group and iterate. All operators are backed by algorithms and data structures that operate on serialized data in memory and spill to disk when the data exceeds the memory budget. The algorithms borrow from classic database implementations such as hybrid hash-join and external merge-sort.
  • Gelly: a scalable graph processing and analytics library implemented on top of, and integrated with, the DataSet API, so it benefits from its scalable and robust operators. Gelly provides built-in algorithms such as label propagation, triangle enumeration and PageRank, as well as a Graph API that simplifies custom graph algorithms.

Flink operations

Flink keeps applications running continuously and consistently through several mechanisms:

  • Consistent checkpoints: Flink's failure recovery is based on consistent, distributed checkpoints of application state. On failure the application restarts and reloads the state from the last successful checkpoint; combined with a replayable source this guarantees exactly-once state consistency.
  • Efficient checkpoints: checkpointing an application with terabytes of state is expensive, so to limit the impact on latency SLAs Flink takes checkpoints asynchronously and incrementally.
  • End-to-end exactly-once: for certain storage systems Flink supports transactional sinks, which guarantee exactly-once output even under failure.
  • Integration with cluster managers: Flink integrates tightly with resource managers such as Hadoop YARN, Mesos and Kubernetes; when a process fails, a new one is started automatically to take over.
  • Built-in high availability: Flink ships with a high-availability mode, based on Apache ZooKeeper, that removes single points of failure.

Flink also makes it easy to upgrade, migrate, suspend and resume applications.

Flink's savepoints exist precisely to preserve application state across upgrades and similar operations. A savepoint is a consistent snapshot of an application's state, very much like a checkpoint, except that it is triggered manually and is not deleted automatically when the application stops. Savepoints are typically used to start a stateful application and initialize its state from a backup. They enable the following:

  • Application version upgrades: a new version of an application can be restarted from a savepoint taken by the previous version; an older savepoint can also be used to roll back results produced by a buggy release.
  • Cluster migration: with savepoints, applications can be moved freely between clusters.
  • Flink version upgrades: savepoints make upgrading the Flink runtime itself safer and simpler.
  • Rescaling: savepoints are commonly used when increasing or decreasing an application's parallelism.
  • A/B testing and what-if analysis: starting two versions of an application from the same savepoint makes it possible to compare their performance and output quality.
  • Pause and resume: an application can take a savepoint and stop, then resume later from that savepoint at any point in time.
  • Archiving: savepoints can be archived so that an application's state can be reset to a specific point in time for recovery.

Monitoring and controlling applications

Like any long-running service, streaming applications need to be monitored and integrated with operational infrastructure such as monitoring and logging services. Monitoring helps anticipate problems and react early; logging helps trace, investigate and analyse the root cause of failures. Finally, convenient interfaces for controlling running applications are another important Flink feature.

Flink integrates well with common logging and monitoring services and provides a REST API to control applications and query their status:

  • Web UI: Flink provides a web UI to inspect, monitor and debug running applications, and to trigger or cancel jobs and tasks.
  • Logging: Flink implements the popular slf4j interface and integrates with the log4j and logback frameworks.
  • Metrics: Flink has a sophisticated metrics system for collecting and reporting system and user-defined metrics, with reporters for JMX, Ganglia, Graphite, Prometheus, StatsD, Datadog and Slf4j.
  • REST API: Flink exposes REST endpoints to submit new applications, take savepoints of running applications, cancel applications, and query metadata and metrics of running or completed jobs.

Flink advantages

Stream processing

Event-driven

Low latency

High throughput

Accuracy and fault tolerance

Exactly-once support

Typical application:

data sources -> ETL -> data warehouse -> flink -> reporting and other downstream uses

Flink concepts

state: data kept in memory; memory responds quickly but is not durable.

checkpoint: a periodic backup of that state, so data can be recovered when a machine fails (a minimal sketch of enabling it follows below).
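A minimal sketch of turning checkpointing on in the Scala streaming environment used later in this post (the interval and the HDFS path are assumptions; Flink 1.13+):

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// snapshot all task state every 10 seconds with exactly-once guarantees
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)
// where the snapshots are persisted (an assumed HDFS directory)
env.getCheckpointConfig.setCheckpointStorage("hdfs://bigdata3:9000/tmp/flinktmp")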

On result correctness, take the Lambda architecture as an example.

It combines two systems (the Lambda layers, e.g. Spark Streaming for the speed layer):

  • Stream processing -> for speed
  • Batch processing -> to guarantee ordering and correctness

Data first goes through the streaming layer for fast results; once a time or volume threshold is reached, it is handed on to the batch layer, which produces the final, correctly ordered results.

Flink grew out of this lineage:

Storm was the first generation.

Lambda was the second generation.

Flink is the third generation, combining the strengths of all of the above.

Data model

Spark Streaming:

uses RDDs; a stream is effectively a sequence of small RDD batches.

Flink:

is fundamentally a data stream, a sequence of events.

Runtime architecture

Spark is a batch engine: the DAG is split into stages, and one stage must finish before the next starts.

Flink is a true streaming engine: once an event has been processed on one node it can be sent straight on to the next node.

Configuration

jobmanager: manages the whole job (analogous to the driver); it runs on the machine it is started on and talks to the taskmanagers, by default on port 6123.

rpc.address: the machine the JobManager starts on, set in the config file.

rpc.port: the RPC port.

heap.size: the JVM heap memory.

process.size: the total memory a taskmanager occupies, including the JVM and off-heap memory; this is the setting used by default.

flink.size: the memory used by the tasks themselves, including state; process.size includes flink.size.

numberOfTaskSlots: how many slots a taskmanager offers for executing tasks.

parallelism: the default parallelism. Unlike the slot count, which is allocated up front, this is what is actually used at runtime (see the short sketch after this list).

taskmanager: handles an individual task of the job (analogous to a worker).
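A short sketch of the difference (the socket host and port are the ones used later in this post):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(2)                                  // default parallelism for every operator of the job

val counts = env.socketTextStream("bigdata3", 8888)    // socket sources are fixed at parallelism 1
  .flatMap(_.split(","))
  .map((_, 1)).setParallelism(1)                       // a single operator can override the default
  .keyBy(_._1)
  .sum(1)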

Setting up a Flink project

In IDEA

pom

   <dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.12</artifactId>
<version>1.13.6</version>
</dependency>

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.12</artifactId>
<version>1.13.6</version>
</dependency>

Code:

package flinklearn

import org.apache.flink.api.scala.ExecutionEnvironment


object frist {
def apply(): frist = new frist()
def main(args: Array[String]): Unit = {
// frist().piwc()
frist().Streamwc()
}
}



class frist() {

//创建批处理执行环境类比sparksession
val pienvironment =org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment
val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

// 批处理wc
def piwc()={

import org.apache.flink.api.scala._
// 从文件中取数据
val path = "F:\\bigdatajava\\src\\main\\resources\\wc.data"
val value = pienvironment.readTextFile(path)

// 对数据进行转换处理
val resultds:DataSet[(String,Int)] = value.flatMap(_.split(",")).map((_,1)).groupBy(0).sum(1)
resultds.print()
}

// 流处理wc
def Streamwc() = {

import org.apache.flink.streaming.api.scala._

// 设置并行度 -> 界面的数字就是并行度,10> (flume,2) 前面的数字就是哪一个任务的id -> 是根据hash值进行分的 -> 默认是电脑的最大配置
// 下面是全局设置
// 还可按照每个算子后面设置
// 因为每个算子都算一个单独的任务
// val value1 = value.flatMap(_.split(",")).filter(_.nonEmpty).setParallelism(3).map((_, 1)).keyBy(0).sum(1).setParallelism(1)
streamingenv.setParallelism(1)


// 接受一个socket文本流
val value = streamingenv.socketTextStream("bigdata3",8888)

val value1 = value.flatMap(_.split(",")).filter(_.nonEmpty).map((_, 1)).keyBy(0).sum(1)

value1.print().setParallelism(1)

// 启动任务执行
streamingenv.execute("first")

}

}

部署flink并运行

先下载flink的包,我用的scala是2.12的所以下的是flink_scala_2.12的

根据自己的版本选择

地址:flink

下载完成上传到服务器

然后解压 -> 设置环境变量 -> 进入到flink的conf文件夹,编辑 flink-conf.yaml 文件 把参数 jobmanager.rpc.address:设置成主节点,然后按照需求是不是开启高可用,以及设置检查点的文件夹(如果文件夹放在hdfs上,则flink要两个依赖包,flink自己没有的分别是 flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar以及 commons-cli-1.5.0.jar)可以区maven官网下载,然后放到flink/lib下,这两个jar包要按照自己hadoop的版本进行下载

->然后再编辑workers -> 添加上子节点的名字 -> 分发到各个机器上

然后在主节点启动start-cluster.sh

就成功了 访问 主节点:8081

就可以访问flink的web页面

编写启动脚本如下:

case $1 in 
"start")
ssh bigdata5 "/home/hadoop/app/flink/bin/start-cluster.sh"
;;
"stop")
ssh bigdata5 "/home/hadoop/app/flink/bin/stop-cluster.sh"
;;
"status")
echo "web ui : bigdata5:8081"
jps| grep TaskManagerRunner
ssh bigdata4 "jps| grep TaskManagerRunner"
ssh bigdata5 "jps| grep StandaloneSessionClusterEntrypoint"
;;
*)
echo "error input you should use by start|stop|status"
;;
esac

把上述scala代码打包成jar包

web

上传到服务器的web界面如下

然后设置运行主类,以及并行度、参数、checkpoint路径就好

如果上面没有放置那两个jar包,则是无法在hdfs上设置checkpoint目录的

上述的代码执行之后输出在哪里呢?

他会输出在task manager里,至于具体在哪个里,应该点击输出任务

如下:


然后在web界面上点击task-manager -> 点击相应机器 -> 点击Stdout 就会看见控制台信息了

这就是web部署成功了

然后停止如下:

命令行

执行:flink run -m bigdata5:8081 -c flinklearn.frist -p 2 ./bigdatajava-1.0-SNAPSHOT.jar

就可以了,参数以及checkpoint可以加在后面,如果不设置,就走默认的

因为对于socket文本流他的并行度就是1,所以外面无法改变

如下:

经过ctrl + c 或者其他操作之后,这个作业并不会停掉

通过 flink list 9723a168e896e048b777473cb871e10a,后面的是job的id,其实只是为了查询更精确一些,这个参数是可选的

还可以接-a 代表查看所有的

通过 flink cancel jobid 就可以对指定的jobid进行停止

如下 :
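
命令行操作的一个小草图(jobid 沿用上面的示例):

# 查看正在运行的作业
flink list
# 查看所有作业(包括已结束的)
flink list -a
# 停止指定的作业
flink cancel 9723a168e896e048b777473cb871e10a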

部署模式

flink为我们的不同场景设置了不同的模式

  • 会话模式
  • 单作业模式
  • 应用模式

会话

先启动集群,然后其他的进行提交作业,就是我们上述的模式

优点:相当于集群先启动,所需的资源已经固定好了,集群的生命周期高于任何的job,不随job的结束而改变

缺点:资源不够的时候会出问题

和另外的资源管理平台结合用

单作业

每个作业都启动一个flink集群,就不会出现上述资源不够的问题

就是按照把资源按照作业来划分

相当于container

一般的时候是首选的,但是flink本身是没有办法用单作业的

他要借助别人的容器化的管理机制-> yarn/ k8s

应用模式

上述两种都是先在客户端执行的,然后再发送给jobmanager,但是会占用网络带宽,

而且对于单作业模式的情况,很可能会在客户端拆分成好几个作业,那么按照每个作业就启动一个集群的做法,会造成大量的资源浪费

然后我们直接把作业发送到jobmanager上直接由他做处理,就是应用模式

和单作业很像

单作业是作业和集群一对一

应用是应用和集群一对一

独立模式

不依赖任何外部资源管理平台

最基本,也是最简单的

在实际项目中使用会比较少

因为对资源的管理有要求

在独立模式下,没有单作业模式,因为单作业必须依赖外部资源管理平台

应用模式 -> 可以 但是使用少

首先把要运行的jar包放在flink的lib文件夹下

然后执行 standalone-job.sh start --job-classname flinklearn.frist,因为flink会默认扫描lib目录下所有的jar包,所以这里只指定入口类就好

然后 执行 taskmanager.sh start

停掉集群:

standalone-job.sh stop

taskmanager.sh stop

yarn模式

客户端先把flink的一个应用提交到yarn上

yarn的resourcemanager会在nodemanager上申请容器

在这些容器上flink会部署他的作业,flink会根据作业所需要的slot数量动态分配taskmanager的资源

hadoop至少是2.2及其以上

flink在1.8之前hadoop的版本和正常的版本是分开的,就是人家给了你两套

但是1.8-1.11我们要下载的仅仅只是hadoop的插件

但是1.11之后就更不用下载hadoop的插件了,我们主要就进行环境变量的配置就好了

要配置

export HADOOP_HOME=/home/hadoop/app/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

就好

然后要创建一个Yarnsession

在flink的主节点下用 yarn-session.sh -nm name就能关联上yarn

如下

但是仅仅这样启动的集群,在web界面查看后会发现插槽是0,如下

这是因为yarn模式下taskmanager是按需动态分配的,还没有提交作业时插槽就是0

当我们关掉它的时候yarnsession就会关掉了,我们可以加如下参数对它进行控制

-d : 分离模式,前台关掉后,yarn session不会跟着关掉

-jm : 配置jobmanager所需的内存 默认单位 MB

-nm : 配置名字

-qu : 指定yarn的队列名字

-tm: 配置每个taskmanager的内存

注意:flink 从1.11之后就不再使用 -s和-n 指定插槽数量以及taskmanager的数量了,yarn会动态的进行分配的

然后用户还是可以通过web和命令行两种方式提交作业,和上面standalone的时候是一样的

其实上述就是很简单的会话模式

单作业

在yarn模式的时候由于有了外部资源管理平台,就可以进行单作业模式了

执行 flink run -d -t yarn-per-job -c flinklearn.frist jar包的绝对路径

-d : 就是分离模式

-t :是指定yarn模式的模式 yarn-per-job 就是单作业

-c : 是class入口

后面还可以接参数等等

早期还有一种把 -t yarn-per-job 用 -m yarn-cluster 代替的写法

应用模式

和单作业模式很像,就是运行的参数不同

flink run-application -t yarn-application -c ....

查看作业

flink list -t yarn-application -Dyarn.application.id=....

取消作业

flink cancel -t yarn-application -Dyarn.application.id=....

还可以通过yarn.provided.lib.dirs配置选项指定位置 ,把jar上传到远程

flink run-application -t yarn-application -Dyarn.provided.lib.dirs="hdfs://bigdata3:9000/tmp/flinktmp" hdfs://bigdata3:9000/tmp/flinktmp

上传到hdfs上运行

flink运行时的架构

flink系统架构

作业管理器(jobmanager)

是flink集群中的任务管理中心以及调度中心

最核心的组件,负责单独处理job

在作业提交的时候jobmaster会先接受到要执行的应用,一般是客户端提交的,包括:jar,数据流图,作业图

jobmaster会把jobGraph转换成一个物理层面的数据流图,这个图被叫做执行图(ExecutionGraph),它包含了所有可以并发的任务,jobMaster会向资源管理器(ResourceManager)发送请求,申请执行任务必要的资源,一旦它获取了足够的资源,就会将执行图分别发到他们真正运行的TaskManager上

在运行过程中jobmaster会负责监控指标以及调度,比如说检查点的协调

资源管理器(resourcemanager)

在一个flink集群里只有一个,负责分配资源。所谓资源其实主要是taskmanager的任务槽(slot),任务槽就是flink集群中的资源调度单位,包含机器用来计算的cpu和内存资源。每一个任务都要分配到一个slot上执行,slot之间主要隔离的是内存

分发器(Dispatcher)

他主要负责提供一个rest接口,用来提交应用,并且为每一个新提交的作业启动一个新的jobMaster组件。Dispatcher也会启动一个web UI,用来方便展示和监控作业的信息。Dispatcher在架构中并不是必须的,在不同的部署模式中可能会被忽略

任务管理器(taskmanager)

flink中的worker

每一个taskmanager包含了一定数量的slot

插槽的数量限制了并行度 : 设置并行度的优先级 代码最高 其次是命令 其次是配置文件

启动之后taskmanager 会将一个或者多个插槽提供给jobmaster调用,jobmaster就可以向插槽分配任务来执行

执行过程中,一个taskManager可以和其他的与运行同一job的taskmanager来交互数据

一些执行流程图如下:

flink的细节

程序和数据流:

所有的flink程序都是要由三部分组成的 source transform sink

在运行flink项目的时候flink的程序会被映射成逻辑数据流(dataflow),它包含了三个部分 ,每一个dataflow都以一个或者多个source开始,以一个或者多个sink结束,其类似有向无环图(DAG)

大部分情况,程序中的转换操作(transform)和dataflow的算子(operation)是一一对应的关系

并行度

每一个算子可以包含多个或者一个子任务 ,这些子任务在不同的线程,不同的物理机,不同的容器中是完全独立的

一个特定的算子的子任务的个数就被称为并行度

任务并行:就是相当于多个线程
数据并行:同一个算子可以拆成多份,分别处理多份数据

例子:source的时候如何设置多并行?

它是把数据源进行复制,然后让每一个线程去处理不同的数据,最后再合到一起

数据传输形式

一个程序中,不同的算子可能有不同的并行度

算子之间传输数据的形式可以是one-to-one,也可以是redistributing的模式,具体是哪种取决于算子的种类

one-to-one:streaming维护着分区的顺序以及元素的顺序(比如source和map之间),这意味着元素的个数和顺序都相同。map,filter,flatMap等算子都是one-to-one的

Redistributing:指分区可能会发生改变,每一个算子的子任务依据所选择的transform发送数据到不同的目标任务

例如:keyBy基于hashcode重新分区,而broadcast和rebalance会随机重新分区,这些算子都会引起redistributing,而这个过程就相当于spark中的shuffle

于是就诞生了算子链

flink使用一种称为任务链的优化技术,减少通信的开销。为了满足任务链的要求,要将两个或者多个算子设为相同的并行度,并通过本地转发(local forward)的方式进行连接

相同并行度的one-to-one操作,flink放在一起,链接形成一个task,并行度相同,并且是one-to-one操作,两个条件缺一不可

执行图

flink中的执行图可以分为StreamGraph -> JobGraph -> ExecutionGraph -> 物理执行图

  • StreamGraph:是根据用户的api自动生成的最初的图,用来表示程序的拓扑结构
  • JobGraph:上一个图经过优化后,提交给jobmanager的数据结构,会将多个符合条件的节点chain到一起作为一个节点
  • ExecutionGraph : jobmanager根据JobGraph生成的并行化版本,是调度层的核心数据结构
  • 物理执行图:部署到各个taskmanager上之后形成的"图",描述任务具体怎么执行,并不是一个具体的数据结构

如下:

任务和任务槽

flink中每一个taskmanager就相当于是一个进程,他会在独立的线程上执行一个或者多个子任务

为了控制taskmanager能接收多少个task,taskmanager通过task slot来进行控制(一个taskmanager最少有一个slot)

slot最主要的作用就是隔离内存,因为cpu是没有办法真正隔离开的

flink里默认是允许子任务共享slot的,简单来说就是一个slot可以保存作业的整个处理管道

当我们将资源密集型和资源非密集型的任务放到同一个slot中,它们就可以自行分配资源占用的比例,从而保证把最重的活平均分摊给所有的taskmanager

slot和并行度

slot:静态概念,是指taskmanager具有的并发执行的能力

通过参数taskmanager.numberOfTaskSlots进行配置

并行度:动态概念,就是真正所用到的并发能力

通过参数:parallelism.default进行设置

简单来说就是我可以拿起多沉的东西,但是我不用那么大的力气

flink控制任务调度(代码)

可以禁用算子链

通过 xxx.disableChaining()

可以实现一个slot单独给一个算子用,同时也不能把他纳入任何一条算子链

还可以用 xxx.startNewChain()

可以实现从xxx开始一个新的算子链,不管前面如何都要分开

还可以设置slot共享组

就是在一个共享组里的slot才可以共享slot

不在一个共享组里的slot他们必须分开

通过 xxx.slotSharingGroup(String)实现 代表后面的算子默认情况下就是在String所在的共享组
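
把这几个 api 串起来的一个调用草图(算子逻辑是随便写的,只为演示调用的位置):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = env.socketTextStream("bigdata3", 8888)

stream
  .map(_.trim).disableChaining()          // 这个 map 不和前后算子合并成算子链
  .filter(_.nonEmpty).startNewChain()     // 从 filter 开始一条新的算子链
  .map((_, 1)).slotSharingGroup("group1") // 从这里开始的算子进入 group1 共享组
  .print()

env.execute("chain demo")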

DataStreamAPI

对于以后的api,DataSet API即将被弃用

所以我们用DataStream API

可以把DataStream(DS)看成一种比较特殊的java集合类型

比如一个socket文本流底层就是DataStream

如果想调用DS的api,要先创建执行环境

创建环境

getExecutionEnvironment

它是相当于把下面两种放在一起了,自动判断

val pienvironment =org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment
val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

上述的getExecutionEnvironment方法是很智能的,它会自动识别我们是在本地调试还是在集群中调试,它会自动进行转换

createLocalEnvironment

是创建一个本地的环境,在调用的时候可以传入一个参数指定默认的并行度,如果不传入默认就是当前电脑的cpu核心数量

private val environment: StreamExecutionEnvironment = org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.createLocalEnvironment()

createRemoteEnvironment

调用远程的执行环境

private val environment: StreamExecutionEnvironment = org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.createRemoteEnvironment("bigdata5",8081,1,"jar包的路径")

它底层是这样定义的

def createRemoteEnvironment(
host: String,
port: Int,
parallelism: Int,
jarFiles: String*): StreamExecutionEnvironment = {

val javaEnv = JavaEnv.createRemoteEnvironment(host, port, jarFiles: _*)
javaEnv.setParallelism(parallelism)
new StreamExecutionEnvironment(javaEnv)
}

执行模式

经过上面获取的环境,我们就可以开始对其设置执行模式

在早期的代码中它把批处理和流处理分开了

通过代码

val pienvironment =org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment
val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

这样的方式

上面一个是批处理的

下面一个是流处理的

他们的api是基本相同的,但是包不同

但是现在的做法是直接用下面的那个

对于批处理而言:我们只要在提交的时候通过命令

flink run -Dexecution.runtime-mode=BATCH 。。。。

就可以证明他是批处理的

如果不处理上述的参数默认是STREAMING :就是流处理的格式

或者在代码的时候直接通过

val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment
streamingenv.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC)
streamingenv.setRuntimeMode(RuntimeExecutionMode.BATCH)
streamingenv.setRuntimeMode(RuntimeExecutionMode.STREAMING)

里面传入相应的参数即可

但是一般不推荐这样做,因为这相当于固定死了,直接当命令行参数传递更好一点

在flink中批处理数据被划分到有界流中了,为什么还要批处理模式?

因为性能问题。流处理是来一条数据处理一条、发送一条;批处理是来一堆数据一起处理,然后再一起发送

对于批处理数据,它来的时候就是一批一批来的。如果用流处理,要一条一条发送,发送的次数就多了;而批处理只用处理完再一次发过去就好了

这就是批处理还在flink中的原因

我们的flink代码是懒执行的,和懒加载是一个道理,只有调用execute才开始真正的执行

source

源算子:就是读取数据源的算子

有界数据

读取有界数据的简单的测试方法

 val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

case class event(uaer:String,url:String,timestamp:Long)

streamingenv.setParallelism(1)
// 从元素中读取数据
streamingenv.fromElements(1,2,3,4,5,65,67,7,7).print("from elem")
streamingenv.fromElements(
event("zihan","1211",1111),
event("bob","1333",22222)
).print("from case class")

// 这个可以从迭代器中读取数据,具体可以ctrl + p 查看
val events = List(event("zihan", "1211", 1111), event("bob", "1333", 22222))
streamingenv.fromCollection(events).print("from list")

// 读取文本文件
streamingenv.readTextFile("F:\\bigdatajava\\src\\main\\resources\\wc.data").print("from text")

输出结果为

from elem> 1
from elem> 2
from elem> 3
from elem> 4
from elem> 5
from elem> 65
from elem> 67
from elem> 7
from elem> 7
from case class> event(zihan,1211,1111)
from list> event(zihan,1211,1111)
from case class> event(bob,1333,22222)
from list> event(bob,1333,22222)
from text> spark,linux,spark,spark
from text> hadoop
from text> linux,hive
from text> flume,flink
from text> gg,dd
from text> ttm,ff
from text> "zihan","1211",1111
from text> "bob","1333",22222
[WARN ][2023-02-04 16:15:57][org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$ReaderState$6.prepareToProcessRecord(ContinuousFileReaderOperator.java:178)]not processing any records while closed

Process finished with exit code 0

我们还可以把一些数据写进文本文件中然后进行读取

无界数据

我们一般是从kafka来接受数据的

我们先要引入链接kafka的依赖

如下:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.12</artifactId>
<version>1.13.6</version>
</dependency>

值得注意的是这个是官方的,他会自动根据kafka的版本进行更新,目前支持kafka0.10.0版本及以上的

有特殊需要就去找特殊的版本的

而且1.14版本之后,引入数据源的方式有了更改,从FlinkKafkaConsumer变成了KafkaSource

代码如下

 
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
streamingenv.setParallelism(1)

// 链接kafka
val properties = new Properties()
properties.put("bootstrap.servers", "bigdata3:9092,bigdata4:9092,bigdata5:9092 ")

// 注意使用下面的那个方法的时候不用在此设置下面的参数,因为这个FlinkKafkaConsumer[T]里面已经封装好了,而且默认采用的就是精准一次
// properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// properties.put("acks", "all")
/*
传入参数说明FlinkKafkaConsumer[T]
topic , 反序列化器 , kafka配置参数
上面的T是代表把获得的数据当作什么类型
*/

streamingenv.addSource(new FlinkKafkaConsumer[String]("dl2262",new SimpleStringSchema(),properties)).print("kafka")

读取自定义数据源

如下

streamingenv.setParallelism(1)

/*
自己定义外部数据源
实现SourceFunction接口
重写两个方法run()和cancel()
run()获取数据的方法
cance()控制停止的方法
*/

import flinklearn.clickSource

val stream = streamingenv.addSource(new clickSource)

stream.print("makebyself")

source方法

package flinklearn

import java.util.Calendar

import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

object clickSource {

def apply(): clickSource = new clickSource()

def main(args: Array[String]): Unit = {

}
}

/*
SourceFunction[T]
其中的泛型就是我们对应的返回的数据的类型
*/


class clickSource extends SourceFunction[event]{

// 标志位
var flag = true


def excute(): Unit ={

}


override def run(sourceContext: SourceFunction.SourceContext[event]): Unit = {
// 随机数生成器
val random =new Random()

// 定义选择的范围
val user = Array("1","2","3")
val url = Array("/cat","/.dog","/info")

//使用循环不停的发送数据,标志位做为判断题条件,不停的发送数据
while (flag){
val eventtmp = event(user(random.nextInt(2)),url(random.nextInt(2)),Calendar.getInstance().getTimeInMillis)
// 调用上下文sourceContext的方法向下游发送数据
sourceContext.collect(eventtmp)
// 每隔1s发送一条数据
Thread.sleep(1000)
}

}

override def cancel(): Unit = {
flag = false
}

}

但是对于SourceFunction它本身就是个并行度只能为1的接口

和socket文本流一样

如果想设置多并行度的就要用ParallelSourceFunction这个接口,其使用和上面一样
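
一个 ParallelSourceFunction 的最小草图(数据内容是随意造的,只为演示接口用法),每个并行实例都会各自执行一遍 run 方法:

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

class parallelNumberSource extends ParallelSourceFunction[Long] {
  // 标志位,控制停止
  var flag = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    var i = 0L
    while (flag) {
      ctx.collect(i)      // 向下游发送数据
      i += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = flag = false
}

// 使用时就可以给 source 设置大于 1 的并行度了
// streamingenv.addSource(new parallelNumberSource).setParallelism(3)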

flink支持的类型

flink里DS的数据类型都是由他的泛型进行控制的

val stream:DataStream[event] = streamingenv.addSource(new clickSource)

基本上scala和java里所有的类型它都支持,但只是基本上。因为flink是分布式的,它在每个节点之间传输数据的时候要经过网络,需要序列化和反序列化,所以对于一些数据类型就无法支持

它的底层类型都封装在TypeInformation和Types中,可以点进去查看

泛型类型不是由flink自己序列化的,而是由Kryo进行的,所以就可能出现问题,要尽可能避免
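
可以用下面这个小草图确认一个类型是被 flink 自己的序列化器识别,还是落到了 Kryo 的泛型序列化上(event 是前面定义的样例类):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._ // 提供 createTypeInformation 等隐式支持

val eventType: TypeInformation[event] = createTypeInformation[event]
println(eventType) // 样例类会被识别为 CaseClassTypeInfo,而不是走 Kryo 的 GenericTypeInfo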

算子

转换算子

  • map
  • filter
  • FlatMap
  • KeyBy:根据key进行分组(分区),里面可以传入字符串,或者位置下标,或者和map里一样传入一个function
  • 简单聚合方法 -> sum , min ,max 等
  • reduce:就是和spark里的reducebykey一样

调用的时候都是得到DS进行调用

使用如下:

    val value: DataStream[String] = streamingenv.readTextFile("F:\\bigdatajava\\src\\main\\resources\\wc.data")
value.flatMap(_.split(",")).map((_,1)).keyBy(_._1).reduce((x,y)=>{
(x._1,x._2+y._2)
}).print()
-------------------------------------------数据
spark,linux,spark,spark
hadoop
linux,hive
flume,flink
gg,dd
ttm,ff
"zihan","1211",1111
"bob","1333",22222

函数类(udf)

为什么他们里面可以放function

查看底层源码可以看见

@Public
public interface Function extends java.io.Serializable {}

他们继承于这个接口,并实现了各自的方法,所以就可以传入Function

进而导出udf是如何实现的

对于flink里的udf我们可以让它继承不同的function,然后再放进去

测试自定义udf的做法

package flinklearn
import org.apache.flink.api.common.functions.FilterFunction
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv()

val testdata = List(
event("zihan", "1211", 1111),
event("bob", "1333", 22222)
)

val testDS = environment.fromCollection(testdata)

// 筛选特定数据
testDS.filter( new myfiliterfunction() ).print()

testDS.filter( new FilterFunction[event] {
override def filter(value: event): Boolean = {
value.uaer.contains("zihan")
}
}).print()

environment.execute()
}
}

// 实现自定义的function
class myfiliterfunction() extends FilterFunction[event]{
def filter(value: event): Boolean = {
value.uaer.contains("zi")
}

}

注意,这里不要引用错包,如果引用错包,就会报错,因为scala和java的api名字是一样的

富函数(udf)

因为我们上述所说的udf是针对一条数据进行操作的

但是假如我们想对一批数据进行操作,也就是数据来之前对其进行操作怎么办?

我们要通过更加复杂的用户自定义类,是函数类的扩展版本

最大的不同就是富函数类,可以获取运行环境的上下文,以及有生命周期等

富函数类的继承接口是Rich...Function(例如RichMapFunction)

它里面有两个方法:

  • open : 相当于算子初始化的时候 和spring 里的初始化一样
  • close : 结束的时候 和spring里的销毁是一样的

如下:

package flinklearn
import org.apache.flink.api.common.functions.{FilterFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv()

val testdata = List(
event("zihan", "1211", 1111),
event("bob", "1333", 22222)
)

val testDS = environment.fromCollection(testdata)

// 定义富函数
testDS.map( new myRichmap).print("2")

val result =
"""
|索引号0
|编号为4a4f30b513560b972fd0e372460b71c4
|2> 1211
|2> 1333
|这个是结束方法0
|""".stripMargin
environment.execute()
}
}


// 实现富函数
class myRichmap extends RichMapFunction[event,String]{
override def map(value: event): String = {
value.url
}

// 在所有数据到来之前进行处理
override def open(parameters: Configuration): Unit = {
println("索引号" + getRuntimeContext.getIndexOfThisSubtask)
println("编号为" + getRuntimeContext.getJobId)
}

// closa
override def close(): Unit = {
println("这个是结束方法" + getRuntimeContext.getIndexOfThisSubtask)
}


}

注意,当并行度大于1的时候,每一个并行实例都会各自执行一次初始化(open)和销毁(close)

分区函数

简单来说就是数据的重新分区的操作

简单介绍一下:keyby

keyby:是把每个key根据hash值再取模的方式进行分区,也就造成了:相同的key一定在同一分区,不同的key不一定不在同一分区

接下来我们要学习的算子,是可以真正控制分区的

如果用上面keyby有可能会造成数据倾斜,也就是我们现在的操作就是控制数据倾斜的

物理分区,一般在并行度减少的时候会自动进行

随机分区(shuffle)

使用方法很简单直接DS.shuffle就可以了

轮询分区(Round-Robin)

对比上面的shuffle是洗牌,则他就是发牌,和打扑克一样的那种,和kafka以及nginx是一样的

调用方式DS.rebalance(),其实Ds里上游到下游默认的就是轮询

重缩放分区(rescale)

它和上面的轮询很像

轮询是把每一个并行子任务的数据都进行轮询,就是如果上游是两个任务,下游是三个任务

轮询会让第一个子任务的第一个数据 给下游的第一个,第一个第二个给下游的第二个,第一个的第三个给下游的第三个

上游的第二个子任务同理

但是rescale并不是这样,它做了分组,只在当前的组内进行轮询

就是相当于玩游戏局大了,要分开玩一样

每一个上游任务都会对应下游的一个组,然后在组里进行轮询,不能发牌给其他组

其本质上是按照taskmanager进行分组。每个taskmanager之间如果要进行通信,则要经过网络传输,代价比较大,

而rebalance的轮询是在上游taskmanager和下游taskmanager之间两两通信,所以要建立 M(上游数量)* N(下游数量)个通信通道

而rescale则不是,因为它按taskmanager分了组,理论上只需要建立组内的 1(上游)* N(下游)个通道,而且这里的N比上面的小得多

但是要注意如果想优化性能要让上游子任务和下游子任务的数量是倍数的关系最好

使用的时候直接DS.rescale就好

可以用自定义数据源进行测试如下:

package flinklearn
import org.apache.flink.api.common.functions.{FilterFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv(3)

val testdata = List(
event("zihan", "1211", 1111),
event("bob", "1333", 22222)
)
val testDS = environment.addSource(new trysource).setParallelism(3)
testDS.rescale.print("rescale")
environment.execute()
val result =
"""
|rescale:1> 2
|rescale:2> 1
|rescale:2> 3
|rescale:1> 4
|rescale:2> 5
|rescale:1> 6
|rescale:2> 7
|rescale:1> 8
|""".stripMargin

}
}



class trysource extends RichParallelSourceFunction[Int]{
override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
for (i <- 0 to 7){
if (getRuntimeContext.getIndexOfThisSubtask == (i+1)%2){
ctx.collect(i+1)
}
}
}

override def cancel(): Unit = ???
}


通过结果我们可以知道1,3,5,7对应的子任务的id都是2 ,则2,4,6,8是1

满足我们设置的条件

广播分区(broadcast)

把一份数据复制成多个然后发送到下游所有子任务

但是一般会造成数据重复。不过还是有用处的,比如在用广播状态创建广播流的时候
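
把上面几种物理分区方式放在一起的调用草图(数据源沿用前面的 clickSource):

val stream = streamingenv.addSource(new clickSource)

stream.shuffle.print("shuffle")     // 随机分区
stream.rebalance.print("rebalance") // 轮询分区
stream.rescale.print("rescale")     // 重缩放分区,只在组内轮询
stream.broadcast.print("broadcast") // 广播,每条数据发给下游所有子任务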

自定义分区

接口叫做partitionCustom

源码如下:

/**
* Partitions a DataStream on the key returned by the selector, using a custom partitioner.
* This method takes the key selector to get the key to partition on, and a partitioner that
* accepts the key type.
*
* Note: This method works only on single field keys, i.e. the selector cannot return tuples
* of fields.
*/
def partitionCustom[K: TypeInformation](partitioner: Partitioner[K], fun: T => K)
: DataStream[T] = {

val keyType = implicitly[TypeInformation[K]]
val cleanFun = clean(fun)

val keyExtractor = new KeySelector[T, K] with ResultTypeQueryable[K] {
def getKey(in: T) = cleanFun(in)
override def getProducedType(): TypeInformation[K] = keyType
}

asScalaStream(stream.partitionCustom(partitioner, keyExtractor))
}

Partitioner是分区器,后面的lambda表达式是提取当前分区字段的方法

点进去查看发现

public interface Partitioner<K> extends java.io.Serializable, Function {

/**
* Computes the partition for the given key.
*
* @param key The key.
* @param numPartitions The number of partitions to partition into.
* @return The partition index.
*/
int partition(K key, int numPartitions);
}

它也是一个接口,他的返回值是要返回到下游子任务的编号,也就是分区的编号

如下:

package flinklearn
import org.apache.flink.api.common.functions.{FilterFunction, Partitioner, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv(3)

val testDS = environment.fromElements(1,1,2,3,4,5,6,67,7,8,8,5,6,4,3)
testDS.partitionCustom( new Partitioner[Int]{
override def partition(key: Int, numPartitions: Int): Int = {
key % 2
}
}, x=>x ).print("rescale")
environment.execute()
val result =
"""
|rescale:1> 2
|rescale:1> 4
|rescale:2> 1
|rescale:1> 6
|rescale:2> 1
|rescale:1> 8
|rescale:2> 3
|rescale:1> 8
|rescale:2> 5
|rescale:1> 6
|rescale:2> 67
|rescale:1> 4
|rescale:2> 7
|rescale:2> 5
|rescale:2> 3
|
|Process finished with exit code 0
|
|""".stripMargin

}
}





但是对于case class 可能不好使,我用就是不好用

输出算子

调用addSink就可以自定义一个sink

里面最关键的方法是invoke,具体在源码里

当然SinkFunction一般我们不用自己实现,因为官方给我们提供了很多现成的connector

接下来我们按照官网进行学习

JDBC

先在idea里添加依赖

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId>
<version>1.16.0</version>
</dependency>

已创建的 JDBC Sink 能够保证至少一次的语义。 更有效的精确执行一次可以通过 upsert 语句或幂等更新实现。

val value1: DataStreamSink[Yarninfo] = value.addSink(
JdbcSink.sink(
"insert into yarninfo(id,host,applicationtype,name,startime,endtime,user,memeveryscends,vcoreeveryscends,size,cores,state,url) values(?,?,?,?,?,?,?,?,?,?,?,?,?)",
new JdbcStatementBuilder[Yarninfo] {
override def accept(t: PreparedStatement, u: Yarninfo): Unit = {
t.setString(1, u.id)
t.setString(2, u.host)
t.setString(3, u.applicationtype)
t.setString(4, u.name)
t.setString(5, u.startime)
t.setString(6, u.endtime)
t.setString(7, u.user)
t.setString(8, u.memeveryscends)
t.setString(9, u.vcoreeveryscends)
t.setString(10, u.size)
t.setString(11, u.cores)
t.setString(12, u.state)
t.setString(13, u.url)
}
},
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://bigdata2:3306/bigdata")
.withDriverName("com.mysql.jdbc.Driver")
.withUsername("root")
.withPassword("liuzihan010616")
.build()
)
)

如果要实现幂等性等,要自己额外进行处理
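
比如在 mysql 里可以借助 upsert 语句做到幂等写入:把上面 JdbcSink 里的 insert 换成类似下面的 sql(这里假设 id 是主键,列只写了几个作示意):

val upsertSql =
  """
    |insert into yarninfo(id, host, state, url) values(?,?,?,?)
    |on duplicate key update host = values(host), state = values(state), url = values(url)
    |""".stripMargin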

文件

flink写入到文件中

如下:

testDS.map(_.toString).addSink(StreamingFileSink.
forRowFormat(new Path("./output"),
new SimpleStringEncoder[String]("UTF-8"))
.build())

分区数量等于生成的文件数量

还可以在.build之前用withXXX来设置一些写入的参数

  • withBucketCheckInterval():设置多长时间检查一次是否需要滚动
  • 等,具体自己看下就ok

写入到hdfs上的时候也直接改一下path就好
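
下面是一个带滚动策略的文件 sink 草图(时间和大小参数只是示例值):

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

// 滚动策略:满足任意一个条件就滚动生成新文件
val rollingPolicy: DefaultRollingPolicy[String, String] = DefaultRollingPolicy.builder()
  .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))  // 至少 15 分钟滚动一次
  .withInactivityInterval(TimeUnit.MINUTES.toMillis(5)) // 5 分钟没有新数据也滚动
  .withMaxPartSize(1024 * 1024 * 1024)                  // 单个文件最大 1GB
  .build()

val fileSink = StreamingFileSink
  .forRowFormat(new Path("./output"), new SimpleStringEncoder[String]("UTF-8"))
  .withRollingPolicy(rollingPolicy)
  .withBucketCheckInterval(1000L) // 每 1s 检查一次是否需要滚动
  .build()

// testDS.map(_.toString).addSink(fileSink)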

kafka

如下:

testDS.map(_.toString).addSink(
new FlinkKafkaProducer[String]
("bigdata3:9092,bigdata4:9092,bigdata5:9092","dl2262",new SimpleStringSchema())
)

就可以往kafka里写入了

自定义外部连接器

就是通过继承SinkFunction,或者对应的RichSinkFunction

实现invoke方法,在里面写入自己的输出逻辑
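
一个自定义 sink 的最小草图(这里用打印代替真正的外部写入,连接的创建和关闭分别放在 open 和 close 里):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class mySink extends RichSinkFunction[String] {

  // 所有数据到来之前:创建外部连接(示意)
  override def open(parameters: Configuration): Unit = {
    println("open connection")
  }

  // 每来一条数据调用一次
  override def invoke(value: String, context: SinkFunction.Context): Unit = {
    println("write: " + value)
  }

  // 结束时:关闭外部连接(示意)
  override def close(): Unit = {
    println("close connection")
  }
}

// testDS.map(_.toString).addSink(new mySink)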

时间语义

对于无界流,我们要查看它一定时间内的数据

对于分布式系统,我们没有一个绝对的时间指标

窗口进行数据的收集是以什么为标准的?

处理时间

就是我们对数据进行处理的时候的时间

事件时间

就是这个数据什么时候产生的

水位线

用来衡量事件时间进展的标记

当我们使用事件时间的时候,假如我们要采集8点到9点的数据

那么用事件时间,就是在数据生成的时候打上时间标记,按这个时间进行统计。

假如下游还有对时间进行操作的算子,就只能去提取事件时间的时间戳进行计算,

这样下游的操作就会延迟数据的输出时间,导致输出的数据是一段一段的

于是就把时间戳提出来,当作一个变量。当对这个数据进行处理的时候,在时间戳上打个标记,

并包装成一种特殊的数据形式,直接插入数据流,跟随数据一起流动,下游看见这个标记就知道时间推进到哪了

也就是说,在对每一条数据进行处理之后,我们会在这条数据之后加一个类似标记的东西,这个标记和数据的时间有关系,作用就是告诉下游:我当前处理到的数据是这个时间的

有序流中的水位线

就是按照时间顺序进行插入时间戳,保证了数据的顺序

但是如果事件生成得特别快、时间特别密,则水位线打上的时间会有很多相同的;而且因为数据量特别大,逐条打水位线所需要的时间和资源会特别多。于是我们改成每间隔一段时间插入一条水位线,插入的时间戳就取它之前最近一次提取到的最大时间戳。插入的时间周期默认是200ms(可以设置)。ps:这个插入周期是按照系统时间算的,每过200ms生成一次

但是假设:

上游是三个分区。下游是一个分区,那么则可能出现乱序,

就是假如第一个分区正常处理时间数据。而对于第二个分区则是有问题或者延迟什么的,它发送了一个在之前时间的数据,就会发生乱序

第一个分区发送的数据如下:1,2,3,4数据全到下面的分区了

第二个分区又发送了个2的数据,就会出现数据集乱序的问题

解决方法

设置一个标志位,保存之前最大的时间戳,然后用这个标志位推进时间,并和新来数据的时间戳进行对比;如果来的数据特别多,可以采用和上面一样的方法,周期性地判断最大时间戳

但是上述的方法会出现问题:假如我们定义一个0-9s的窗口,按照这个方法,可能会有迟到的数据,然后就会丢数据

解决方法

设置延迟,就是让水位线延迟2s:真实数据的时间是2s的时候,水位线的时间是0s,这样就可以减少丢数据的情况,因为窗口是按照水位线的时间来关闭的。但是上述的方法也不算严谨,最终的解决方法就是等足够长的时间

就是我们判断一个数据流中的最大乱序程度来设置这个时间:假如22s后面跟着一个17s的数据,则说他的最大乱序程度是22-17=5s,如果还有比这个大的,就更新这个时间,同时这个时间也就是要延迟的秒数

水位线特性:

  • 水位线是一个插入到数据流中的一个标记,可以认为是一个特殊的数据
  • 水位线的主要内容就是一个时间戳,用来表示当前事件时间的进展的
  • 水位线是基于数据的时间戳进行生成的
  • 水位线的时间戳必须是单调递增的,以确保时间的推进
  • 水位线可以通过设置延迟来进行处理迟到的数据

水位线为t就表示之后不会再出现时间戳小于等于t的数据了

但是如何确认最大乱序时间?

一般这个最大乱序时间是符合一个正态分布的,所以最终我们就是在正确性和延迟之间做一个权衡

在idea代码如下:

水位线的基本使用:

package flinklearn

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, Watermark, WatermarkGenerator, WatermarkGeneratorSupplier, WatermarkOutput, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool

object f3 {
def apply(parameterTool: ParameterTool): f3 = new f3(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
f3(tool).excute()
}
}




class f3(parameterTool: ParameterTool) {

import org.apache.flink.streaming.api.scala._

def excute()={
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// 设置水位线的生成周期,默认单位是毫秒
env.getConfig.setAutoWatermarkInterval(500)

val value = env.addSource(new clickSource)

// 有序流的水位线生成策略
value.assignTimestampsAndWatermarks( WatermarkStrategy.forMonotonousTimestamps().withTimestampAssigner(new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))


// 乱序流的水位线生成方法
// 这里的Duration 是java.time下的
value.assignTimestampsAndWatermarks( WatermarkStrategy.forBoundedOutOfOrderness[event](Duration.ofSeconds(5)).withTimestampAssigner(new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))

// 自定义水位线
value.assignTimestampsAndWatermarks(new WatermarkStrategy[event] {
override def createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context): WatermarkGenerator[event] = {
new WatermarkGenerator[event] {
// 底层默认要实现的两个方法 但是flink内置了几种基本的策略,在WatermarkStrategy源码中
// 事件触发
val delay = 5000L
// 定义属性保存最大时间戳
var maxtx = Long.MinValue + delay + 1

// 判断最大时间戳
// 按照系统时间做调度
override def onEvent(event: event, eventTimestamp: Long, output: WatermarkOutput): Unit = {
maxtx = Math.max(maxtx,event.timestamp)
}

// // 按照数据进行调度
// override def onEvent(event: event, eventTimestamp: Long, output: WatermarkOutput): Unit = {
// maxtx = Math.max(maxtx,event.timestamp)
// val watermark = new Watermark[event](maxtx)
// output.emitWatermark(watermark)
// }


// 周期行的生产水位线
override def onPeriodicEmit(output: WatermarkOutput): Unit = {
val watermark = new Watermark[event](maxtx -delay -1)
// 周期性发射
output.emitWatermark(watermark)
}

}
}
})

}
}

但是我们还可以在数据源处进行配置,自定义source的时候可以直接定义水位线等参数,如下

package flinklearn

import java.util.Calendar

import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.watermark.Watermark

import scala.util.Random

object clickSource {

def apply(): clickSource = new clickSource()

def main(args: Array[String]): Unit = {

}
}

/*
SourceFunction[T]
其中的泛型就是我们对应的返回的数据的类型
*/


class clickSource extends RichParallelSourceFunction[event]{

// 标志位
var flag = true


def excute(): Unit ={

}


override def run(sourceContext: SourceFunction.SourceContext[event]): Unit = {
// 随机数生成器
val random =new Random()

// 定义选择的范围
val user = Array("1","2","3")
val url = Array("/cat","/.dog","/info")




//使用循环不停的发送数据,标志位做为判断题条件,不停的发送数据
while (flag){
val eventtmp = event(user(random.nextInt(2)),url(random.nextInt(2)),Calendar.getInstance().getTimeInMillis)
// 为要发送的数据指定时间戳,按照下面指定完成之后发送数据的时候就会知道哪一个是时间戳,就可以不实现withTimestampAssigner了
sourceContext.collectWithTimestamp(eventtmp,eventtmp.timestamp)
// 往下游直接发送水位线,然后下游就可以不用assignTimestampsAndWatermarks这个方法了,因为水位线已经生成完了
sourceContext.emitWatermark(new Watermark(eventtmp.timestamp))

// 每隔1s发送一条数据
sourceContext.collect(eventtmp)
Thread.sleep(1000)
}

}

override def cancel(): Unit = {
flag = false
}

}

就可以了

水位线的正常就是像数据一样正常的流动,这个是单分区的时候

如果想发送到多个下游的子任务,我们应该广播出去,

但是如果上游有多个分区,那么他们广播的水位线如果不一样,下游该采用哪一个水位线?

答案是取最小的那个水位线

我们会设置一个分区水位线的概念,就是采取最小的分区水位线

窗口

我们要观察,或者对一定时间内的数据进行操作,一般定义窗口的时候都是左闭右开,滑动窗口是可以出现重复的数据

但是在事件时间语义下出现乱序的时候,就会有迟到的数据,然后我们就要设置延迟时间

但是,既然有迟到的数据,那么也就会有超前的数据落在这个窗口中,于是我们不能简单地把窗口想象成一段一段的框

我们可以想象成桶的概念:如果数据的时间戳符合这个窗口规定的时间范围,就会被放到对应的桶中,

这样就不会出现时间不对的数据导致观察错误

窗口的分类:

  • 时间窗口

    • 滚动窗口:就是头连着尾巴一样,一直看,生产很多都是基于滚动窗口的,就类似于把数据分成很多个框框,挨个看
    • 滑动窗口:基于上面的滚动窗口,就像一个滑块一样从头滑到尾,也叫跳动窗口。滑动窗口的参数是滑动步长,就是每次滑动的距离;如果把滑动步长调成和窗口一样长,就变成滚动窗口了
    • 会话窗口:他的标准并不是给窗口设置一个固定的大小,开始和结束的规律也是完全没有的,窗口之间一定没有重叠的,会复杂点
    • 全局窗口:就是全局的,默认是不会触发计算的因为数据不会停下,但是可以设置触发器,进行设置
  • 计数窗口

    • 滚动窗口:同上
    • 滑动窗口:同上
    • 会话窗口:同上
    • 全局窗口:同上

时间窗口略微的复杂点,计数则更为简单

窗口api:可以看成DataStream api的一小部分

首先,我们要确定我们做没做keyby

如果keyby了,则要通过调用.window进行开始,会在多个并行子任务上执行,针对每一个key进行执行

如果没做keyby,则是调用.windowAll(),相当于并行度变成1

无论是上面的哪一个window/windowall

都要接上窗口分配器,然后加上窗口函数

除了需要完全自定义的窗口分配器以外,flink都提供了内置的实现

如下:

package flinklearn



import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, ProcessingTimeSessionWindows, SlidingEventTimeWindows, SlidingProcessingTimeWindows, TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

object flinkWindos {
def apply(parameterTool: ParameterTool): flinkWindos = new flinkWindos(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
flinkWindos(tool).ecxcute()
}
}


class flinkWindos(parameterTool: ParameterTool){

import org.apache.flink.streaming.api.scala._

def ecxcute()={
val env = StreamExecutionEnvironment.getExecutionEnvironment
val zihan = env.addSource(new clickSource)

val zihan1 = zihan.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps().withTimestampAssigner(
new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))

zihan1.map(data => {(data.uaer,1)})
.keyBy(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(7)))// 基于事件时间的滚动窗口 , 偏移量为后面的参数
// .window(TumblingProcessingTimeWindows.of(Time.days(1),Time.hours(-8))) // 基于处理时间的滚动窗口
// .window(SlidingEventTimeWindows.of(Time.days(1),Time.minutes(10),Time.hours(-8))) // 基于事件时间的滑动窗口 步长为10min
// .window(SlidingProcessingTimeWindows.of(Time.days(1),Time.minutes(10),Time.hours(-8))) // 基于处理时间的滑动窗口 步长为10min
// .window(EventTimeSessionWindows.withGap(Time.seconds(10))) // 基于事件时间的会话窗口
// .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10))) // 基于处理时间的会话窗口
// .countWindow(10) // 大小为10的滚动计数窗口
// .countWindow(10,2) // 大小为10的滑动计数窗口,步长为2

// 窗口函数
// 分为增量窗口 和 全窗口
// 增量聚合 是每来一条数据,就处理一条数据,然后存储他的状态,等窗口满足条件,直接输出
// 全窗口,则是类似批处理的形式,把数据都聚集在一起,然后满足条件执行操作,在输出


/*
增量聚合函数包括(典型) : ReduceFunction AggregateFunction
规约聚合:reduceFunction -> 两两进行规约,就和之前简单函数的那个是一样的
*/
// reduce 他在规约的过程中,中间是不能变的,就是数据的输入,输出,规则都一样
// .reduce( (x,y)=> {
// (x._1,x._2+y._2)
// } )
// .print()
// aggre 则可以改变类型,比上面更为灵活
.aggregate(new tryFunction)

env.execute()

}



}


class tryFunction extends org.apache.flink.api.common.functions.AggregateFunction[(String,Int),(Long,Set[String]),Double] {
override def createAccumulator(): (Long, Set[String]) = {
(0,Set[String]()) // 赋初值
}

// 计算过程
override def add(value: (String, Int), accumulator: (Long, Set[String])): (Long, Set[String]) = {
(value._2 + accumulator._1 , accumulator._2 + value._1)
}

// 结果
override def getResult(accumulator: (Long, Set[String])): Double = {
accumulator._1.toDouble / accumulator._2.size
}

// 会话窗口要用的
override def merge(a: (Long, Set[String]), b: (Long, Set[String])): (Long, Set[String]) = ???
}

全窗口函数:

就相当于针对于全局的窗口函数,而且它可以获取更多的信息

窗口函数现在处于一个迭代的过程中,所以可能会略微复杂些

首先,原本的窗口函数是通过.apply进行调用的,里面传入的参数是WindowFunction,这是最早的用法,不过现在已经快被弃用了

因为出现了一个比它更好用的ProcessWindowFunction,它不光可以获取上下文window信息,还可以获取很多其他的属性

而且ProcessWindowFunction是富函数。WindowFunction的定义如下:

trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

/**
* Evaluates the window and outputs none or several elements.
*
* @param key The key for which this window is evaluated.
* @param window The window that is being evaluated.
* @param input The elements in the window being evaluated.
* @param out A collector for emitting elements.
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}

ProcessWindowFunction

abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window]
extends AbstractRichFunction {

/**
* Evaluates the window and outputs none or several elements.
*
* @param key The key for which this window is evaluated.
* @param context The context in which the window is being evaluated.
* @param elements The elements in the window being evaluated.
* @param out A collector for emitting elements.
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
@throws[Exception]
def process(key: KEY, context: Context, elements: Iterable[IN], out: Collector[OUT])

/**
* Deletes any state in the [[Context]] when the Window expires
* (the watermark passes its `maxTimestamp` + `allowedLateness`).
*
* @param context The context to which the window is being evaluated
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
@throws[Exception]
def clear(context: Context) {}

/**
* The context holding window metadata
*/
abstract class Context {
/**
* Returns the window that is being evaluated.
*/
def window: W

/**
* Returns the current processing time.
*/
def currentProcessingTime: Long

/**
* Returns the current event-time watermark.
*/
def currentWatermark: Long

/**
* State accessor for per-key and per-window state.
*/
def windowState: KeyedStateStore

/**
* State accessor for per-key global state.
*/
def globalState: KeyedStateStore

/**
* Emits a record to the side output identified by the [[OutputTag]].
*/
def output[X](outputTag: OutputTag[X], value: X);
}
}

下面我简单用ProcessWindowFunction进行创建

package flinklearn


import org.apache.flink.streaming.api.scala.function.{ProcessWindowFunction, WindowFunction}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object flinkwindowall {
def apply(): flinkwindowall = new flinkwindowall()

def main(args: Array[String]): Unit = {
flinkwindowall().excute()
}
}


class flinkwindowall(){

import org.apache.flink.streaming.api.scala._



def excute(): Unit ={
val env = StreamExecutionEnvironment.getExecutionEnvironment

val value = env.addSource(new clickSource)

// 指定一个无关的数据,代表全局
value.assignAscendingTimestamps(_.timestamp) // 创建水位线
.keyBy(data => "key") // 设置全局分区
.window(TumblingEventTimeWindows.of(Time.seconds(10))) // 开窗
.process(new firstProcessWindowFunction ) // 调用ProcessWimdowFunction的方法


env.execute()
}
}

class firstProcessWindowFunction extends ProcessWindowFunction[event,String,String,TimeWindow]{
override def process(key: String, context: Context, elements: Iterable[event], out: Collector[String]): Unit = {
// 使用set进行去重
var userset = Set[String]()


// 从element中提取元素
elements.map(userset += _.uaer)
val uv = userset.size
// 提取窗口信息,进行输出
val end = context.window.getEnd
val start = context.window.getStart

println(s"从${start}${end} 的uv是${uv}")


}
}

可以把上述的全窗口和增量放到一起:aggregate方法可以传入两个参数,一个是增量聚合函数,一个是全窗口函数

就表示增量聚合的结果变成了全窗口函数的输入,也就是两者结合,如下:

package flinklearn



import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, ProcessingTimeSessionWindows, SlidingEventTimeWindows, SlidingProcessingTimeWindows, TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object flinkWindos {
def apply(parameterTool: ParameterTool): flinkWindos = new flinkWindos(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
flinkWindos(tool).ecxcute()
}
}


class flinkWindos(parameterTool: ParameterTool){

import org.apache.flink.streaming.api.scala._

def ecxcute()={
val env = StreamExecutionEnvironment.getExecutionEnvironment
val zihan = env.addSource(new clickSource)

val zihan1 = zihan.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps().withTimestampAssigner(
new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))

val zihan2 = zihan1.map(data => {(data.uaer,1)})
.keyBy(data => "key")
// .window(TumblingEventTimeWindows.of(Time.seconds(7)))// 基于事件时间的滚动窗口 , 偏移量为后面的参数
// .window(TumblingProcessingTimeWindows.of(Time.days(1),Time.hours(-8))) // 基于处理时间的滚动窗口
.window(SlidingEventTimeWindows.of(Time.seconds(10),Time.minutes(2))) // 基于事件时间的滑动窗口,窗口大小10s,步长2min
// .window(SlidingProcessingTimeWindows.of(Time.days(1),Time.minutes(10),Time.hours(-8))) // 基于处理时间的滑动窗口 步长为10min
// .window(EventTimeSessionWindows.withGap(Time.seconds(10))) // 基于事件时间的会话窗口
// .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10))) // 基于处理时间的会话窗口
// .countWindow(10) // 大小为10的滚动计数窗口
// .countWindow(10,2) // 大小为10的滑动计数窗口,步长为2

// 窗口函数
// 分为增量窗口 和 全窗口
// 增量聚合 是每来一条数据,就处理一条数据,然后存储他的状态,等窗口满足条件,直接输出
// 全窗口,则是类似批处理的形式,把数据都聚集在一起,然后满足条件执行操作,在输出


/*
增量聚合函数包括(典型) : ReduceFunction AggregateFunction
规约聚合:reduceFunction -> 两两进行规约,就和之前简单函数的那个是一样的
*/
// reduce 他在规约的过程中,中间是不能变的,就是数据的输入,输出,规则都一样
// .reduce( (x,y)=> {
// (x._1,x._2+y._2)
// } )
// .print()
// aggre 则可以改变类型,比上面更为灵活
zihan2.aggregate(new tryFunction11, new firstProcessWindowFunction1).print()

env.execute()

}



}

import org.apache.flink.api.common.functions._
class tryFunction11 extends AggregateFunction[(String,Int),(Long,Set[String]),Double] {
override def createAccumulator(): (Long, Set[String]) = {
(0L,Set[String]()) // 赋初值
}

// 计算过程
override def add(value: (String, Int), accumulator: (Long, Set[String])) = {
(value._2 + accumulator._1 , accumulator._2 + value._1)
}

// 结果
override def getResult(accumulator: (Long, Set[String])): Double = {
accumulator._1.toDouble / accumulator._2.size
}

// 会话窗口要用的
override def merge(a: (Long, Set[String]), b: (Long, Set[String])): (Long, Set[String]) = ???
}

class firstProcessWindowFunction1 extends ProcessWindowFunction[Double,Double,String,TimeWindow]{
override def process(key: String, context: Context, elements:Iterable[Double], out: Collector[Double]): Unit ={


var total:Double = 0
elements.map(total+=_)
// 提取窗口信息,进行输出
val end = context.window.getEnd
val start = context.window.getStart
println(s"从${start}${end} 的rate是${elements}额外的统计信息是${total}")


}


}

处理迟到数据

可以允许迟到数据

通过调用windowStream下的allowedLateness,设置允许迟到时间,等到达时间,则会发送到下游

还可以通过侧输出流,收集过于迟到的数据,但是对这个侧输出流的操作是影响不到窗口的,和窗口相当于是分开的

代码:

package flinklearn
import java.time.Duration
import java.util.Calendar

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.{TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import tool._
object dealdelaydata {

def main(args: Array[String]): Unit = {
val environment = StreamExecutionEnvironment.getExecutionEnvironment
val value = environment.socketTextStream("43.140.193.43", 6000).map(data=>{
val strings = data.split(" ")
loginfo(strings(0),strings(1))
})
val resulttmp = value.assignTimestampsAndWatermarks(WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner[loginfo] {
override def extractTimestamp(element: loginfo, recordTimestamp: Long): Long = {
element.dt.toLong
}
}))

// val resulttmp2 = resulttmp.keyBy(_.log).window(new TumblingProcessingTimeWindows()).process(new myprocessTimeWindow)
//
// resulttmp2.print()

val flag = new OutputTag[loginfo]("test")
val resluttmp3 = resulttmp.keyBy(_.log).window( TumblingProcessingTimeWindows.of(Time.seconds(10))).allowedLateness(Time.seconds(10)).sideOutputLateData(flag).process( new myprocessTimeWindow)
resluttmp3.print("resulttmp3的原始数据")
resluttmp3.getSideOutput(flag).print("侧输出流")

environment.execute()
}
}

class myprocessTimeWindow extends ProcessWindowFunction[loginfo,String,String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[loginfo], out: Collector[String]): Unit = {

out.collect(s"处理时间${context.window.getStart}~${context.window.getEnd}用户${key}的点击次数${elements.size}当前水位线为${context.currentWatermark}")
}
}



import org.apache.flink.api.common.functions._
class myeventTimewindow extends AggregateFunction[loginfo,String,String]{
override def createAccumulator(): String = ???

override def add(value: loginfo, accumulator: String): String = ???

override def getResult(accumulator: String): String = ???

override def merge(a: String, b: String): String = ???
}

处理函数

基本处理函数(ProcessFunction)
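
这里先放一个 KeyedProcessFunction 的最小草图(注册一个处理时间定时器,定时器触发时再输出一条信息,event 沿用前面的样例类):

import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

class myKeyedProcessFunction extends KeyedProcessFunction[String, event, String] {

  // 每来一条数据调用一次
  override def processElement(value: event,
                              ctx: KeyedProcessFunction[String, event, String]#Context,
                              out: Collector[String]): Unit = {
    out.collect(s"数据到达,时间戳是 ${value.timestamp}")
    // 注册一个 10 秒之后的处理时间定时器
    val ts = ctx.timerService().currentProcessingTime() + 10 * 1000
    ctx.timerService().registerProcessingTimeTimer(ts)
  }

  // 定时器触发时调用
  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, event, String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    out.collect(s"定时器触发,触发时间是 ${timestamp}")
  }
}

// 使用:value.keyBy(_.uaer).process(new myKeyedProcessFunction).print()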

sparkstreaming

用于实时计算的模块 =》 sparkstreaming,structuredstreaming

流处理 : 实时

  • 实时 来一条数据处理一条 storm,flink 数据叫event
  • 近实时 来一批数据处理 mini-batch sparkstreaming
  • 数据会源源不断地来

批处理 : 离线

  • 代码或者程序处理一个批次的数据

    • 例子:数据放在hdfs上,我们对他进行处理 =》 ok

技术选型

生产上:

  • sparkstreaming,structuredstreaming 10%
  • flink 90%
  • storm 2%

开发角度:

  • code =》 flink > sparkstreaming
  • sql => flink > spark streaming

业务:

  • 实时指标 :都差不多
  • 实时数仓:
    • 代码 : 差不多
    • sql文件 : flinksql维护实时数仓 =》 ok

特性

容易使用 =》 客观看

批流一体的处理方法 =》 sparksql <=> 流处理

低延迟高吞吐

简介

  • sparkstreaming开发是spark-core的一个扩展
  • 接收数据的渠道多
  • 还可以对数据进行流处理的可以机器学习等

一般来说流式处理会比批处理负载小,但不绝对

数据源 :

  • kafka ****** 流式引擎重要的数据源 -》 通过topic进行数据缓冲,它会根据sp的吞吐量来进行处理,两个引擎之间会有联系
  • flume **** 可以使用但是一般不用。flume没有数据缓冲,这点很致命 -》 它直接把数据推到sp里,如果数据量特别多,而sp程序的吞吐量比较小,就会把sp程序压垮,flume和sp之间没有反压联系
  • hdfs

数据积压:kafka数据太大,导致sp程序一直处理不过来,一个出不来报表 =>解决方法

  • 吞吐量提高
  • 数据量减少

sparkstreaming运行机制

  • 接收数据
  • 拆分成batches

sparkstreaming -> kafka :

  • 5s处理数据
  • 每5s会切分成一次batch
  • 交给spark engine处理
  • 处理完的也是一个batch

sparkstreaming编程模型:Dstream

  • 外部数据源
  • 高级算子
  • 类似RDD

idea开发先配置pom文件

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.2.1</version>
</dependency>

idea代码

package sparkstreaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf

object sparkstreaming1 {
def main(args: Array[String]): Unit = {

val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))
// 或者通过sparkcontext进行创建
//val ssc = new StreamingContext(sc, Seconds(1))
// 数据源
val lines = ssc.socketTextStream("bigdata5", 9999)
// 处理数据
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// 打印数据当前批次
wordCounts.print()
ssc.start() // Start
ssc.awaitTermination() // Wait

// 配置数据源在目标机器上执行nc -lk 9999 然后输入数据就ok了
}

}

还可以在webui上查看

如下:

他的打印数据是处理当前批次的数据,不是累积批次的数据

双流join

api :

  • flink -》调用api
  • sparkstreaming code 很多 -》 api join state

延迟数据

  • processtime + udf
  • eventtime + watermark
    • 数据和离线对不上(容易)

如何构建DStream

  • 从inputstream的方式 生产上
  • receiver 测试用 为面试准备

构建Dstream

inputstream

比如卡夫卡

receiver

用receiver接收的时候,如果是本地模式则并行度要大于1 -> local[2+]

因为sparkstreaming最少分为接收和处理两部分,如果只给1个core,就没有资源进行处理了

所以针对receiver,local的并行度要大于等于2

上面仅仅是针对receiver

例子 :

val lines = ssc.socketTextStream("bigdata5", 9999)

因为他底层源码是

def socketTextStream(
hostname: String,
port: Int,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

Dstream算子

转换操作:

Similar to that of RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDD’s. Some of the common ones are as follows.
Transformation Meaning
map ( func ) Return a new DStream by passing each element of the source DStream through a functionfunc .
flatMap ( func ) Similar to map, but each input item can be mapped to 0 or more output items.
filter ( func ) Return a new DStream by selecting only the records of the source DStream on whichfunc returns true.
repartition ( numPartitions ) Changes the level of parallelism in this DStream by creating more or fewer partitions.
union ( otherStream ) Return a new DStream that contains the union of the elements in the source DStream andotherDStream .
count () Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce ( func ) Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a functionfunc (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
countByValue () When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey ( func , [ numTasks ]) When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join ( otherStream , [ numTasks ]) When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup ( otherStream , [ numTasks ]) When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform ( func ) Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey ( func ) Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

输出操作:

Output Operation Meaning
print () Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.``Python API This is calledpprint() in the Python API.
saveAsTextFiles ( prefix , [ suffix ]) Save this DStream’s contents as text files. The file name at each batch interval is generated based onprefix and suffix : “prefix-TIME_IN_MS[.suffix]” .
saveAsObjectFiles ( prefix , [ suffix ]) Save this DStream’s contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix : “prefix-TIME_IN_MS[.suffix]” .``Python API This is not available in the Python API.
saveAsHadoopFiles ( prefix , [ suffix ]) Save this DStream’s contents as Hadoop files. The file name at each batch interval is generated based onprefix and suffix : “prefix-TIME_IN_MS[.suffix]” .``Python API This is not available in the Python API.
foreachRDD ( func ) The most generic output operator that applies a function,func , to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

我们之前的计算代码只是计算当前批次的数据,也是sparkstreaming默认的

基于上面官方提出了状态

状态

  • 有状态 前后批次有联系
  • 无状态 前后批次无联系

用于解决统计类问题

updateStateByKey ( func ):这个算子

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount = ... // add the new values with the previous running count to get the new count
Some(newCount)
}
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)

代码如下:

package sparkstreaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
import tool.streamingcontext
object sparkstreaming1 {
private val streamingcontext = new streamingcontext
def main(args: Array[String]): Unit = {

val ssc = streamingcontext.getstreamcotext()
// 或者通过sparkcontext进行创建
//val ssc = new StreamingContext(sc, Seconds(1))
// 数据源
val lines = ssc.socketTextStream("bigdata5", 9999)
// 处理数据
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// 要指定checkpoint目录
ssc.checkpoint("file:///D:\\checkpoint")
val totalwc = pairs.updateStateByKey(updateFunction _)
//wordCounts.updateStateByKey()
// 打印数据当前批次
wordCounts.print()
totalwc.print()
ssc.start() // Start
ssc.awaitTermination() // Wait

// 配置数据源在目标机器上执行nc -lk 9999 然后输入数据就ok了
}

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
// add the new values with the previous running count to get the new count
val sum = newValues.sum
val i = runningCount.getOrElse(0)
Some(sum+i)
}

}

但是这样也产生了个新问题

我们观察checkpoint文件夹

生成很多个小文件

我们该如何解决

生产上我们不用

但是必备的知识还是要的

为了容错,恢复作业,和kafka里的一样

checkpoint的存储东西

metadata 元数据

  • conf 作业里的配置信息
  • 算子操作
  • 未完成的批次

Data

  • 就是批次的数据

使用场景

  • 作业失败需要恢复的时候用
  • 转换算子的时候

但是注意生产上用不了

如何使用

Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved. This is done by using streamingContext.checkpoint(checkpointDirectory). This will allow you to use the aforementioned stateful transformations. Additionally, if you want to make the application recover from driver failures, you should rewrite your streaming application to have the following behavior.

When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().
When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory.

idea代码

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()

缺点

小文件

修改代码之后checkpoint就作废了,要重新再来

checkpoint用不了 -》 累计批次指标就会出现问题

如何实现相同功能?

实现存储到外部,然后根据外部存储的数据进行累计

使用checkpoint

解决checkpoint修改代码报错和小文件问题

所以简历上不可以出现我在生产上用过updateStateByKey,坚决不会用

如何把处理好的数据存储到外部

如下:

dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}

idea

package sparkstreaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import tool.{mysqlutils, streamingcontext,savefile}
object sparkstreaming1 {
private val mysqlutils = new mysqlutils
private val streamingcontext = new streamingcontext
private val savefile = new savefile
def main(args: Array[String]): Unit = {


val ssc = streamingcontext.getstreamcotext()
// 或者通过sparkcontext进行创建
//val ssc = new StreamingContext(sc, Seconds(1))
// 数据源
val lines = ssc.socketTextStream("bigdata5", 9999)
// 处理数据
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
//val totalwc = pairs.updateStateByKey(updateFunction _)
//wordCounts.updateStateByKey()
// 打印数据当前批次
wordCounts.print()
//totalwc.print()
// 把结果输入到mysql里 先在mysql里创建完表了
// 下面会报错-> mysql链接没有进行序列化 ,我们不能加除非更改底层源码
// closure 闭包 -> 方法内使用了方法外的变量 比如下述的connect
wordCounts.foreachRDD(rdd=>{
val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
rdd.foreach { record =>
val sql = s"insert into wc values('${record._1}','${record._2}')"
connection.createStatement.execute(sql)
}
})
// --------------------------------------------------------
//对上述进行修改之后
//这样是可以的但是性能不高
//因为会一直拿链接,会造成性能下降
wordCounts.foreachRDD(rdd=>{
rdd.foreach { record =>
val sql = s"insert into wc values('${record._1}','${record._2}')"
val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
connection.createStatement.execute(sql)
}
})
//优化性能
wordCounts.foreachRDD(rdd=>{
rdd.foreachPartition(record=>{
val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
record.foreach(pari => {
val sql = s"insert into wc values('${pari._1}','${pari._2}')"
connection.createStatement.execute(sql)
})
mysqlutils.closeconnect(connection)
})
})
// 再次进行优化 原因 -》 partition的数量过高
// 通过连接池来进行
// 或者通过coalesce来控制这个分区数量
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
// Write via Spark SQL (the recommended approach)
// Performance is also good because it uses Spark's own JDBC writer
wordCounts.foreachRDD(rdd=>{
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
// Convert RDD[String] to DataFrame
val wordsDataFrame = rdd.toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)
})


ssc.start() // Start
ssc.awaitTermination() // Wait

// To feed the data source, run `nc -lk 9999` on the target host and type input lines
}

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
// add the new values with the previous running count to get the new count
val sum = newValues.sum
val i = runningCount.getOrElse(0)
Some(sum+i)
}

}
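A small sketch of the coalesce variant mentioned in the comments above, reusing wordCounts, the mysqlutils helper, and the wc table from the code above:

wordCounts.foreachRDD(rdd => {
  // Shrink the partition count first so each batch opens at most 2 connections
  rdd.coalesce(2).foreachPartition(part => {
    val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
    part.foreach(pair => {
      val sql = s"insert into wc values('${pair._1}','${pair._2}')"
      connection.createStatement.execute(sql)
    })
    mysqlutils.closeconnect(connection)
  })
})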

transform

The operator for combining a DStream with an RDD.

Requirement:

  • one dataset comes from MySQL / a text file: small volume, a lookup ("dimension") table
  • the other dataset comes from Kafka, read by Spark Streaming as a DStream: large volume, the main business stream

Example: a bullet-comment (danmaku) filter

  • offline
  • real time

The data looks like this:

Main table:
不好看
垃圾
女主真好看
666
Comments to filter out:
热巴真丑
鸡儿真美
王退出娱乐圈

Offline:

package sparkstreaming
import org.apache.spark.sql.SparkSession
import tool._
object sparkstreaming2 {

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
def main(args: Array[String]): Unit = {
var mainsql = List(
"不好看",
"垃圾",
"女主真好看",
"666",
"热巴真丑",
"鸡儿真美",
"王退出娱乐圈")
val maintable = spark.sparkContext.parallelize(mainsql)

var black = List(
"热巴真丑",
"鸡儿真美",
"王退出娱乐圈"
)

val blacktable = spark.sparkContext.parallelize(black)

val value1 = maintable.map(x => {
(x, 1)
})
val value = blacktable.map(x => {
(x, true)
})
value1.leftOuterJoin(value).filter(_._2._2.getOrElse(false)!=true).map(_._1).foreach(println(_))
}

}

Real time:

private val streamingcontext = new streamingcontext
def main(args: Array[String]): Unit = {
val ssc = streamingcontext.getstreamcotext()
val maintable = ssc.socketTextStream("bigdata5", 9099)
var black = List(
"热巴真丑",
"鸡儿真美",
"王退出娱乐圈"
)

val blacktable = ssc.sparkContext.parallelize(black)

val value = blacktable.map(x => {
(x, true)
})

val value1 = maintable.map(x => {
(x, 1)
})
val value2 = value1.transform(x => {
x.leftOuterJoin(value).filter(_._2._2.getOrElse(false) != true).map(_._1)
})

value2.print()

ssc.start()
ssc.awaitTermination()

Spark Streaming + Kafka integration

Reading Kafka data (the spark-streaming-kafka-0-10 integration used below is the direct approach; the older receiver-based approach only exists in the 0-8 integration).

The Kafka version we use is 2.2.1.

Spark Streaming's default delivery semantics are at-least-once.

The DStream that Spark builds from Kafka has one partition per partition of the Kafka topic.

Partitions map one-to-one to tasks, so the topic's partition count determines the parallelism.

Official docs: the "Spark Streaming + Kafka Integration Guide".

Dependency in IDEA:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.2.1</version>
</dependency>

Usage:

------------------------------- consume data from Kafka
kafka-console-consumer.sh \
--bootstrap-server bigdata3:9092,bigdata4:9092,bigdata5:9092 \
--topic dl2262 \
--from-beginning
----------------------------- create the topic
kafka-topics.sh \
--create \
--zookeeper bigdata3:2181,bigdata4:2181,bigdata5:2181/kafka \
--topic dl2262 \
--partitions 6 \
--replication-factor 3
------------------------------- produce data to Kafka
kafka-console-producer.sh \
--broker-list bigdata3:9092,bigdata4:9092,bigdata5:9092 \
--topic dl2262
------------------------------- describe the topic
kafka-topics.sh \
--describe \
--zookeeper bigdata3:2181,bigdata4:2181,bigdata5:2181/kafka \
--topic dl2262
----------------------------- code
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object sparkstreaming2 {

private val streamingcontext = new streamingcontext

def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "bigdata3:9092,bigdata4:9092,bigdata5:9092", // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> "dl2262-1", // 指定消费者组
"auto.offset.reset" -> "latest", // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)

val example = streamingcontext.getstreamcotext()
val topics = Array("dl2262")
val stream = KafkaUtils.createDirectStream[String, String](
example,
PreferConsistent, // location strategy: spread partitions evenly across executors (one of three strategies)
Subscribe[String, String](topics, kafkaParams) // standard subscription form
)

stream.map(record => (record.value)).print()

example.start()
example.awaitTermination()
}
}

The above uses the newer Kafka consumer API.

Consume Kafka data, run a word count, and write the result to MySQL:

package sparkstreaming
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object sparkstreaming2 {

private val streamingcontext = new streamingcontext
private val savefile = new savefile
def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "bigdata3:9092,bigdata4:9092,bigdata5:9092", // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> "dl2262-1", // 指定消费者组
"auto.offset.reset" -> "latest", // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)


val example = streamingcontext.getstreamcotext()
val topics = Array("dl2262")
val stream = KafkaUtils.createDirectStream[String, String](
example,
PreferConsistent, // location strategy: spread partitions evenly across executors (one of three strategies)
Subscribe[String, String](topics, kafkaParams) // standard subscription form
)

stream.map(record => (record.value)).print()

val value = stream.map(record => (record.value)).flatMap(x => {
x.split(",")
}).map(word => {
(word, 1)
}).reduceByKey(_ + _)

value.foreachRDD(rdd=>{
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val wordsDataFrame = rdd.toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)
})

example.start()
example.awaitTermination()
}}

To resume from where consumption stopped after a restart, the settings

  • enable.auto.commit
  • auto.offset.reset

are involved: auto.offset.reset only decides where to start when no committed offset exists, and with enable.auto.commit set to false the job will not resume from the last position unless we manage the offsets ourselves.

Solution:

  • obtain the Kafka offsets
  • commit the Kafka offsets

Getting the Kafka offset information:

stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(rdd.partitions.size)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
}

Interpreting the offset output: when the last two columns (fromOffset and untilOffset) are equal for every partition, everything currently in the topic has been consumed.

6
dl2262 5 19 19
dl2262 4 18 18
dl2262 0 19 19
dl2262 2 77 77
dl2262 1 19 19
dl2262 3 46 47
-------------------------------------------
Time: 1673939535000 ms
-------------------------------------------
bidhashdas

6
dl2262 4 18 18
dl2262 3 47 47
dl2262 0 19 19
dl2262 5 19 19
dl2262 2 77 77
dl2262 1 19 19

Note: read the offset ranges first thing inside foreachRDD, as soon as the batch arrives; that is how the offset information is obtained.

Any other processing of the data can then be done inside the same foreachRDD block.

Like this:

stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(rdd.partitions.size)
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
// wc
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._

val wordsDataFrame = rdd.map(_.value).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)

// store the offsets


// commit the offsets


}

Next we commit the offsets.

Before committing offsets, we need somewhere to store them.

Spark Streaming's default semantics are at-least-once.

Storing the offsets

  • checkpoints: unusable (for the reasons above)
  • Kafka itself: simple and efficient, giving at-least-once / at-most-once semantics; we never use at-most-once, and because this store is not transactional with our output it cannot give exactly-once
  • a storage system that supports transactions can be used to achieve exactly-once delivery semantics

Kafka itself stores the offset information in a special internal topic, __consumer_offsets.

stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
------------------------------------------- complete code
package sparkstreaming
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object sparkstreaming2 {

private val streamingcontext = new streamingcontext
private val savefile = new savefile
def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "bigdata3:9092,bigdata4:9092,bigdata5:9092", // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> "dl2262-1", // 指定消费者组
"auto.offset.reset" -> "latest", // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)


val example = streamingcontext.getstreamcotext()
val topics = Array("dl2262")
val stream = KafkaUtils.createDirectStream[String, String](
example,
PreferConsistent, // location strategy: spread partitions evenly across executors (one of three strategies)
Subscribe[String, String](topics, kafkaParams) // standard subscription form
)
// get the offset info
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(rdd.partitions.size)
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
// wc
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._

val wordsDataFrame = rdd.map(_.value).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)

// store and commit the offsets (here committed back to Kafka)
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)




}
example.start()
example.awaitTermination()
}
}

Other offset stores (a transactional database): the official sketch:

// The details depend on your data store, but the general idea looks like this

// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

val results = yourCalculation(rdd)

// begin your transaction

// update results
// update offsets where the end of existing offsets matches the beginning of this batch of offsets
// assert that offsets were updated correctly

// end your transaction
}
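A concrete (hypothetical) reading of the transaction skeleton above, assuming the wc table plus a kafka_offsets(topic, partition, offset) table in the same MySQL database; the per-batch result is collected to the driver so that a single JDBC transaction can cover both writes. This is a sketch of the idea, not production code:

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // small word-count result, collected to the driver for a single transaction
  val counts = rdd.map(_.value).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).collect()

  val conn = java.sql.DriverManager.getConnection("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
  try {
    conn.setAutoCommit(false) // begin your transaction
    val insert = conn.prepareStatement("insert into wc(word, cnt) values (?, ?)")
    counts.foreach { case (w, c) =>
      insert.setString(1, w); insert.setInt(2, c); insert.executeUpdate()
    }
    // update offsets in the same transaction
    val upsert = conn.prepareStatement("replace into kafka_offsets(topic, `partition`, `offset`) values (?, ?, ?)")
    offsetRanges.foreach { o =>
      upsert.setString(1, o.topic); upsert.setInt(2, o.partition); upsert.setLong(3, o.untilOffset)
      upsert.executeUpdate()
    }
    conn.commit() // results and offsets succeed or fail together -> exactly-once
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}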

Some SSL configuration:

val kafkaParams = Map[String, Object](
// the usual params, make sure to change the port in bootstrap.servers if 9092 is not TLS
"security.protocol" -> "SSL",
"ssl.truststore.location" -> "/some-directory/kafka.client.truststore.jks",
"ssl.truststore.password" -> "test1234",
"ssl.keystore.location" -> "/some-directory/kafka.client.keystore.jks",
"ssl.keystore.password" -> "test1234",
"ssl.key.password" -> "test1234"
)

Case study: business data + log data

Business data, in MySQL:

  • city_info
  • user_info

Log data, in Hive (read as a text file from HDFS in the code below):

  • user_click

Approach: first load each source into a DataFrame in code, then operate on the DataFrames.

The code is as follows:

package sparkfirst

import org.apache.spark.sql.SparkSession

object xiangmu1 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
import spark.implicits._
val city_info = spark.read.format("jdbc")
.options(Map("url"->args(0),"dbtable"->args(3),"user"->args(1),"password"->args(2),"driver"->"com.mysql.jdbc.Driver")).load()

val user_info = spark.sql(
"""
|select * from bigdata.product_info
|""".stripMargin)

// city_info.show()
// user_info.show()

val product_info = spark.read.textFile("hdfs://bigdata3:9000/data/user_click.txt")
// product_info.show(false)

val userlog = product_info.map(line => {
val strings = line.split(",")
val userid = strings(0)
val sessionid = strings(1)
val dt = strings(2)
val cityid = strings(3)
val shopid = strings(4)
(userid, sessionid, dt, cityid, shopid)
}).toDF("userid", "sessionid", "dt", "cityid", "shopid")


// userlog.show(false)



//----------------------------------------------------------------
city_info.createOrReplaceTempView("city_info")
userlog.createOrReplaceTempView("user_log")
user_info.createOrReplaceTempView("product_info")
//--------------------------------------------------------------
spark.sql(
"""
|drop table if exists bigdata.tmp
|""".stripMargin)
spark.sql(
"""
|
|
|create table bigdata.tmp as
|select
|*
|from (
| select * from city_info left join user_log on city_info.city_id = user_log.cityid left join product_info on user_log.shopid = product_info.product_id
|)
|""".stripMargin)
spark.sql(
"""
|drop table if exists bigdata.sparkfinish
|""".stripMargin)
spark.sql(
"""
|create table bigdata.sparkfinish as
|select
|*
|from(
|select
|area,
|product_name,
|rank() over(partition by area order by cnt) as rk
|from (
|select
|area,
|product_name,
|count(1) as cnt
|from bigdata.tmp
|group by area,product_name
|)
|)where rk < 3;
|""".stripMargin)


}
}

Then we package the jar and upload it.

Before packaging, we need to comment out

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()

because when we submit with spark-submit, the master (and app name) are specified on the command line.

After packaging and uploading, we can submit with the spark-submit script.

There is a choice to make here: spark-submit jobs are usually deployed on YARN, and there are several deploy modes. A brief look at the two modes:

Cluster:

  • the client submits the job; once it is submitted the client can be closed without affecting the Spark job
  • the driver runs on a machine inside the cluster
  • logs are viewed on YARN

client:

  • the client submits the job; if the client is closed, the driver process dies and the Spark job is affected
  • the driver runs on the client machine
  • logs can be seen directly in the client console

The submit commands for each mode:

spark-submit \
--master yarn \
--deploy-mode client \
--name userlog \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info
-------------------------------------------------------- cluster (the command above is client mode)
spark-submit \
--master yarn \
--deploy-mode cluster \
--name userlog \
--executor-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info
-------------------------------------------------------------- the general form is below; the two commands above work without specifying the driver jar only because I had already copied the MySQL driver into Spark's jars directory
spark-submit \
--master yarn \
--deploy-mode client \
--name userlog \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-class-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-library-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info user_info
--------------------------------------------------------------------------------------------------
spark-submit \
--master yarn \
--deploy-mode cluster \
--name userlog \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-class-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-library-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info user_info

There are a couple of ways to add the JDBC driver:

  • pass it on the command line when submitting (--jars / --driver-class-path), as above
  • copy it directly into Spark's jars directory

The second way is the less recommended one, because the jar may conflict with Spark's own packages.

Execution flow

The execution flow of Spark on YARN is basically the same as Hadoop's; apart from Spark's persistence going through cache, everything else matches.

driver => manager

executor => container

catalog

The Hive metastore data lives in MySQL.

Spark reads the Hive metastore (the metastore database sits behind JDBC).

Spark provides a Catalog API.

Calling the catalog directly gives you Hive's metadata, which is handy, for example, when building a big-data analysis platform.

Get the catalog via sparkSession.catalog.

It exposes many methods; in IDEA you can browse them with Ctrl + F12.
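A minimal sketch of the Catalog API, assuming a Hive-enabled SparkSession and the bigdata database used elsewhere in these notes:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalog-demo")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

// Browse the Hive metastore through spark.catalog
spark.catalog.listDatabases().show(false)
spark.catalog.listTables("bigdata").show(false)
spark.catalog.listColumns("bigdata", "tmp").show(false)

// Cache management also goes through the catalog
spark.catalog.cacheTable("bigdata.tmp")
println(spark.catalog.isCached("bigdata.tmp"))
spark.catalog.uncacheTable("bigdata.tmp")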

Cold data may be kept on COS or OSS (object storage).

udf

  • define a UDF in code (the block below)
  • a Hive UDF can be used directly in Spark (see the sketch after the block below)
Defining a UDF in IDEA
---------------------------------------
First import the udf helper:
import org.apache.spark.sql.functions.udf
then:


val spark = SparkSession
.builder()
.appName("Spark SQL UDF scalar example")
.getOrCreate()

// Define and register a zero-argument non-deterministic UDF
// UDF is deterministic by default, i.e. produces the same result for the same input.
val random = udf(() => Math.random())
spark.udf.register("random", random.asNondeterministic())
spark.sql("SELECT random()").show()
// +-------+
// |UDF() |
// +-------+
// |xxxxxxx|
// +-------+

// Define and register a one-argument UDF
val plusOne = udf((x: Int) => x + 1)
spark.udf.register("plusOne", plusOne)
spark.sql("SELECT plusOne(5)").show()
// +------+
// |UDF(5)|
// +------+
// | 6|
// +------+

// Define a two-argument UDF and register it with Spark in one step
spark.udf.register("strLenScala", (_: String).length + (_: Int))
spark.sql("SELECT strLenScala('test', 1)").show()
// +--------------------+
// |strLenScala(test, 1)|
// +--------------------+
// | 5|
// +--------------------+

// UDF in a WHERE clause
spark.udf.register("oneArgFilter", (n: Int) => { n > 5 })
spark.range(1, 10).createOrReplaceTempView("test")
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show()
// +---+
// | id|
// +---+
// | 6|
// | 7|
// | 8|
// | 9|
// +---+
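A minimal sketch of the second route (calling a Hive UDF from Spark); the class name and jar path here are hypothetical, and Hive support must be enabled on the session:

val spark = SparkSession.builder()
  .appName("hive-udf-demo")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

// Register the Hive UDF exactly as you would in Hive itself
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.UpperCase' " +
  "USING JAR '/home/hadoop/project/jar/hive-udfs.jar'")

// Then call it like any built-in function
spark.sql("SELECT my_upper(product_name) FROM bigdata.tmp").show()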

Building a DataFrame

  • from an RDD (a short example follows this list)
  • from Hive
  • from external data sources
    • json, csv, jdbc/odbc
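A minimal sketch of the first route (RDD -> DataFrame), using toDF the same way as elsewhere in these notes:

import org.apache.spark.sql.SparkSession

object RddToDf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDf").master("local[2]").getOrCreate()
    import spark.implicits._

    // parallelize a local collection into an RDD, then name the columns
    val rdd = spark.sparkContext.parallelize(Seq(("spark", 1), ("kafka", 2)))
    val df = rdd.toDF("word", "cnt")
    df.printSchema()
    df.show()
  }
}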

Loading external data sources

API overview

TEXT

Property Name Default Meaning Scope
wholetext false If true, read each file from input path(s) as a single row. read
lineSep \r, \r\n, \n (for reading), \n (for writing) Defines the line separator that should be used for reading or writing. read/write
compression (none) Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). write

json

Property Name Default Meaning Scope
timeZone (value of spark.sql.session.timeZone configuration) Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of timeZone are supported:``* Region-based zone ID:
It should have the form ‘area/city’,
such as ‘America/Los_Angeles’.* Zone offset: It should be in the format ‘(+
-)HH:mm’, for example ‘-08:00’ or ‘+01:00’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’.Other short names like ‘CST’ are not recommended to use because they can be ambiguous.
primitivesAsString false Infers all primitive values as a string type. read
prefersDecimal false Infers all floating-point values as a decimal type.
If the values do not fit in decimal, then it infers them as doubles.
read
allowComments false Ignores Java/C++ style comment in JSON records. read
allowUnquotedFieldNames false Allows unquoted JSON field names. read
allowSingleQuotes true Allows single quotes in addition to double quotes. read
allowNumericLeadingZero false Allows leading zeros in numbers (e.g. 00012). read
allowBackslashEscapingAnyCharacter false Allows accepting quoting of all character using backslash quoting mechanism
.
read
mode PERMISSIVE Allows a mode for dealing with corrupt records during parsing.``PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema. DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.* FAILFAST: throws an exception when it meets corrupted records. read
columnNameOfCorruptRecord (value of spark.sql.columnNameOfCorruptRecord configuration) Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. read
dateFormat yyyy-MM-dd Sets the string that indicates a date format. Custom date formats follow the formats atdatetime pattern. This applies to date type. read/write
timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] Sets the string that indicates a timestamp format. Custom date formats follow the formats atdatetime pattern. This applies to timestamp type. read/write
timestampNTZFormat yyyy-MM-dd’T’HH:mm:ss[.SSS] Sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats atDatetime Patterns. This applies to timestamp without timezone type, note that zone-offset and time-zone components are not supported when writing or reading this data type. read/write
multiLine false Parse one record, which may span multiple lines, per file. JSON built-in functions ignore this option. read
allowUnquotedControlChars false Allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. read
encoding Detected automatically when multiLine is set to true (for reading), UTF-8 (for writing) For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files. JSON built-in functions ignore this option. read/write
lineSep \r, \r\n, \n (for reading), \n (for writing) Defines the line separator that should be used for parsing. JSON built-in functions ignore this option. read/write
samplingRatio 1.0 Defines fraction of input JSON objects used for schema inferring. read
dropFieldIfAllNull false Whether to ignore column of all null values or empty array/struct during schema inference. read
locale en-US Sets a locale as language tag in IETF BCP 47 format. For instance,locale is used while parsing dates and timestamps. read
allowNonNumericNumbers true Allows JSON parser to recognize set of “Not-a-Number” (NaN) tokens as legal floating number values.``+INF:
for positive infinity, as well as alias of +Infinity and Infinity.
-INF: for negative infinity, alias -Infinity.* NaN: for other not-a-numbers, like result of division by zero.
read
compression (none) Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names
(none, bzip2, gzip, lz4, snappy and deflate). JSON built-in functions ignore this option.
write
ignoreNullFields (value of spark.sql.jsonGenerator.ignoreNullFields configuration) Whether to ignore null fields when generating JSON objects. write

csv

Property Name Default Meaning Scope
sep , Sets a separator for each field and value. This separator can be one or more characters. read/write
encoding UTF-8 For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files. CSV built-in functions ignore this option. read/write
quote Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, you need to set not null but an empty string. For writing, if an empty string is set, it uses u0000 (null character). read/write
quoteAll false A flag indicating whether all values should always be enclosed in quotes. Default is to only escape values containing a quote character. write
escape \ Sets a single character used for escaping quotes inside an already quoted value. read/write
escapeQuotes true A flag indicating whether values containing quotes should always be enclosed in quotes. Default is to escape all values containing a quote character. write
comment Sets a single character used for skipping lines beginning with this character. By default, it is disabled. read
header false For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option. read/write
inferSchema false Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option. read
enforceSchema true If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforceSchema option to avoid incorrect results. CSV built-in functions ignore this option. read
ignoreLeadingWhiteSpace false (for reading), true (for writing) A flag indicating whether or not leading whitespaces from values being read/written should be skipped. read/write
ignoreTrailingWhiteSpace false (for reading), true (for writing) A flag indicating whether or not trailing whitespaces from values being read/written should be skipped. read/write
nullValue Sets the string representation of a null value. Since 2.0.1, this nullValue param applies to all supported types including the string type. read/write
nanValue NaN Sets the string representation of a non-number value. read
positiveInf Inf Sets the string representation of a positive infinity value. read
negativeInf -Inf Sets the string representation of a negative infinity value. read
dateFormat yyyy-MM-dd Sets the string that indicates a date format. Custom date formats follow the formats atDatetime Patterns. This applies to date type. read/write
timestampFormat yyyy-MM-dd’T’HH:mm:ss[.SSS][XXX] Sets the string that indicates a timestamp format. Custom date formats follow the formats atDatetime Patterns. This applies to timestamp type. read/write
timestampNTZFormat yyyy-MM-dd’T’HH:mm:ss[.SSS] Sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats atDatetime Patterns. This applies to timestamp without timezone type, note that zone-offset and time-zone components are not supported when writing or reading this data type. read/write
maxColumns 20480 Defines a hard limit of how many columns a record can have. read
maxCharsPerColumn -1 Defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length read
mode PERMISSIVE Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by spark.sql.csv.parser.columnPruning.enabled (enabled by default).``* PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with less/more tokens than schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets null to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.* DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.* FAILFAST: throws an exception when it meets corrupted records. read
columnNameOfCorruptRecord (value of spark.sql.columnNameOfCorruptRecord configuration) Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. read
multiLine false Parse one record, which may span multiple lines, per file. CSV built-in functions ignore this option. read
charToEscapeQuoteEscaping escape or \0 Sets a single character used for escaping the escape for the quote character. The default value is escape character when escape and quote characters are different,\0 otherwise. read/write
samplingRatio 1.0 Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option. read
emptyValue (for reading),"" (for writing) Sets the string representation of an empty value. read/write
locale en-US Sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. read
lineSep \r, \r\n and \n (for reading), \n (for writing) Defines the line separator that should be used for parsing/writing. Maximum length is 1 character. CSV built-in functions ignore this option. read/write
unescapedQuoteHandling STOP_AT_DELIMITER Defines how the CsvParser will handle values with unescaped quotes.``STOP_AT_CLOSING_QUOTE: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found. BACK_TO_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.* STOP_AT_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter or a line ending is found in the input.* SKIP_VALUE: If unescaped quotes are found in the input, the content parsed for the given value will be skipped and the value set in nullValue will be produced instead.* RAISE_ERROR: If unescaped quotes are found in the input, a TextParsingException will be thrown. read
compression (none) Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). CSV built-in functions ignore this option. write

jdbc

Property Name Default Meaning Scope
url (none) The JDBC URL of the form jdbc:subprotocol:subname to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret read/write
dbtable (none) The JDBC table that should be read from or written into. Note that when using it in the read path anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses. It is not allowed to specify dbtable and query options at the same time. read/write
query (none) A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. As an example, spark will issue a query of the following form to the JDBC Source.SELECT <columns> FROM (<user_specified_query>) spark_gen_aliasBelow are a couple of restrictions while using this option.1. It is not allowed to specify `dbtable` and `query` options at the same time.1. It is not allowed to specify `query` and `partitionColumn` options at the same time. When specifying `partitionColumn` option is required, the subquery can be specified using `dbtable` option instead and partition columns can be qualified using the subquery alias provided as part of `dbtable`.Example:spark.read.format("jdbc").option("url", jdbcUrl).option("query", "select c1, c2 from t1").load() read/write
driver (none) The class name of the JDBC driver to use to connect to this URL. read/write
partitionColumn, lowerBound, upperBound (none) These options must all be specified if any of them is specified. In addition,numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading. read
numPartitions (none) The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. read/write
queryTimeout 0 The number of seconds the driver will wait for a Statement object to execute to the given number of seconds. Zero means there is no limit. In the write path, this option depends on how JDBC drivers implement the API setQueryTimeout, e.g., the h2 JDBC driver checks the timeout of each query instead of an entire JDBC batch. read/write
fetchsize 0 The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (e.g. Oracle with 10 rows). read
batchsize 1000 The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. write
isolationLevel READ_UNCOMMITTED The transaction isolation level, which applies to current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to standard transaction isolation levels defined by JDBC’s Connection object, with default of READ_UNCOMMITTED. Please refer the documentation in java.sql.Connection. write
sessionInitStatement (none) After each database session is opened to the remote DB and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block). Use this to implement session initialization code. Example:option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""") read
truncate false This is a JDBC writer related option. When SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. In case of failures, users should turn off truncate option to use DROP TABLE again. Also, due to the different behavior of TRUNCATE TABLE among DBMS, it’s not always safe to use this. MySQLDialect, DB2Dialect, MsSqlServerDialect, DerbyDialect, and OracleDialect supports this while PostgresDialect and default JDBCDirect doesn’t. For unknown and unsupported JDBCDirect, the user option truncate is ignored. write
cascadeTruncate the default cascading truncate behaviour of the JDBC database in question, specified in the isCascadeTruncate in each JDBCDialect This is a JDBC writer related option. If enabled and supported by the JDBC database (PostgreSQL and Oracle at the moment), this options allows execution of a TRUNCATE TABLE t CASCADE (in the case of PostgreSQL a TRUNCATE TABLE ONLY t CASCADE is executed to prevent inadvertently truncating descendant tables). This will affect other tables, and thus should be used with care. write
createTableOptions This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g.,CREATE TABLE t (name string) ENGINE=InnoDB.). write
createTableColumnTypes (none) The database column data types to use instead of the defaults, when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g:"name CHAR(64), comments VARCHAR(1024)"). The specified types should be valid spark sql data types. write
customSchema (none) The custom schema to use for reading data from JDBC connectors. For example,"id DECIMAL(38, 0), name STRING". You can also specify partial fields, and the others use the default type mapping. For example, "id DECIMAL(38, 0)". The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. read
pushDownPredicate true The option to enable or disable predicate push-down into the JDBC data source. The default value is true, in which case Spark will push down filters to the JDBC data source as much as possible. Otherwise, if set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark. Predicate push-down is usually turned off when the predicate filtering is performed faster by Spark than by the JDBC data source. read
pushDownAggregate false The option to enable or disable aggregate push-down in V2 JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the related filters can be pushed down. If numPartitions equals to 1 or the group by key is the same as partitionColumn, Spark will push down aggregate to data source completely and not apply a final aggregate over the data source output. Otherwise, Spark will apply a final aggregate over the data source output. read
pushDownLimit false The option to enable or disable LIMIT push-down into V2 JDBC data source. The LIMIT push-down also includes LIMIT + SORT , a.k.a. the Top N operator. The default value is false, in which case Spark does not push down LIMIT or LIMIT with SORT to the JDBC data source. Otherwise, if sets to true, LIMIT or LIMIT with SORT is pushed down to the JDBC data source. If numPartitions is greater than 1, SPARK still applies LIMIT or LIMIT with SORT on the result from data source even if LIMIT or LIMIT with SORT is pushed down. Otherwise, if LIMIT or LIMIT with SORT is pushed down and numPartitions equals to 1, SPARK will not apply LIMIT or LIMIT with SORT on the result from data source. read
pushDownTableSample false The option to enable or disable TABLESAMPLE push-down into V2 JDBC data source. The default value is false, in which case Spark does not push down TABLESAMPLE to the JDBC data source. Otherwise, if value sets to true, TABLESAMPLE is pushed down to the JDBC data source. read
keytab (none) Location of the kerberos keytab file (which must be pre-uploaded to all nodes either by --files option of spark-submit or manually) for the JDBC client. When path information found then Spark considers the keytab distributed manually, otherwise --files assumed. If both keytab and principal are defined then Spark tries to do kerberos authentication. read/write
principal (none) Specifies kerberos principal name for the JDBC client. If both keytab and principal are defined then Spark tries to do kerberos authentication. read/write
refreshKrb5Config false This option controls whether the kerberos configuration is to be refreshed or not for the JDBC client before establishing a new connection. Set to true if you want to refresh the configuration, otherwise set to false. The default value is false. Note that if you set this option to true and try to establish multiple connections, a race condition can occur. One possble situation would be like as follows.1. refreshKrb5Config flag is set with security context 11. A JDBC connection provider is used for the corresponding DBMS1. The krb5.conf is modified but the JVM not yet realized that it must be reloaded1. Spark authenticates successfully for security context 11. The JVM loads security context 2 from the modified krb5.conf1. Spark restores the previously saved security context 11. The modified krb5.conf content just gone read/write
connectionProvider (none) The name of the JDBC connection provider to use to connect to this URL, e.g.db2, mssql. Must be one of the providers loaded with the JDBC data source. Used to disambiguate when more than one provider can handle the specified driver and options. The selected provider must not be disabled by spark.sql.sources.disabledJdbcConnProviderList. read/write
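The partitionColumn / lowerBound / upperBound / numPartitions options above are easiest to see in code. A minimal sketch, assuming a SparkSession named spark and the emp table on the MySQL instance used later in these notes; note that the bounds only set the partition stride, they do not filter rows:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://bigdata2:3306/try")
  .option("dbtable", "emp")
  .option("user", "root")
  .option("password", "liuzihan010616")
  // split the read into 4 parallel JDBC partitions on the numeric empno column
  .option("partitionColumn", "empno")
  .option("lowerBound", "7369")
  .option("upperBound", "7935")
  .option("numPartitions", "4")
  .load()

println(df.rdd.getNumPartitions) // 4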

excel

Not found yet; I will add it once I find it.

hive

For Hive we need to specify the input format, the output format, the serde, and the delimiters between fields and elements, as follows:

Property Name Meaning
fileFormat A fileFormat is kind of a package of storage format specifications, including “serde”, “input format” and “output format”. Currently we support 6 fileFormats: ‘sequencefile’, ‘rcfile’, ‘orc’, ‘parquet’, ‘textfile’ and ‘avro’.
inputFormat, outputFormat These 2 options specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. org.apache.hadoop.hive.ql.io.orc.OrcInputFormat. These 2 options must be appeared in a pair, and you can not specify them if you already specified the fileFormat option.
serde This option specifies the name of a serde class. When the fileFormat option is specified, do not specify this option if the given fileFormat already include the information of serde. Currently “sequencefile”, “textfile” and “rcfile” don’t include the serde information and you can use this option with these 3 fileFormats.
fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim These options can only be used with “textfile” fileFormat. They define how to read delimited files into rows.
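A small example of how these options are used when creating a Hive table through Spark SQL (a sketch; the table name is hypothetical and Hive support is assumed):

spark.sql(
  """
    |CREATE TABLE IF NOT EXISTS bigdata.hive_src (key INT, value STRING)
    |USING hive
    |OPTIONS (fileFormat 'textfile', fieldDelim ',')
    |""".stripMargin)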

Configuration for connecting different Spark versions to the Hive metastore:

One of the most important pieces of Spark SQL’s Hive support is interaction with Hive metastore, which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below. Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL will compile against built-in Hive and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
Property Name Default Meaning Since Version
spark.sql.hive.metastore.version 2.3.9 Version of the Hive metastore. Available options are 0.12.0 through 2.3.9 and 3.0.0 through 3.1.2. 1.40
spark.sql.hive.metastore.jars builtin Location of the jars that should be used to instantiate the HiveMetastoreClient.
This property can be one of four options:
1.builtin
Use Hive 2.3.9, which is bundled with the Spark assembly when -Phive is enabled. When this option is chosen, spark.sql.hive.metastore.version must be either 2.3.9 or not defined.
maven
Use Hive jars of specified version downloaded from Maven repositories. This configuration is not generally recommended for production deployments.
path
Use Hive jars configured by spark.sql.hive.metastore.jars.path in comma separated format. Support both local or remote paths. The provided jars should be the same version as spark.sql.hive.metastore.version.
A classpath in the standard format for the JVM. This classpath must include all of Hive and its dependencies, including the correct version of Hadoop. The provided jars should be the same version as spark.sql.hive.metastore.version. These jars only need to be present on the driver, but if you are running in yarn cluster mode then you must ensure they are packaged with your application.
1.40
spark.sql.hive.metastore.jars.path (empty) Comma-separated paths of the jars that used to instantiate the HiveMetastoreClient. This configuration is useful only when spark.sql.hive.metastore.jars is set as path.``The paths can be any of the following format:1. file://path/to/jar/foo.jar1. hdfs://nameservice/path/to/jar/foo.jar1. /path/to/jar/(path without URI scheme follow conf fs.defaultFS‘s URI schema)1. [http/https/ftp]://path/to/jar/foo.jarNote that 1, 2, and 3 support wildcard. For example:1. file://path/to/jar/*,file://path2/to/jar/*/*.jar1. hdfs://nameservice/path/to/jar/*,hdfs://nameservice2/path/to/jar/*/*.jar 3.10
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,
org.postgresql,
com.microsoft.sqlserver,
oracle.jdbc
A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared. For example, custom appenders that are used by log4j. 1.40
spark.sql.hive.metastore.barrierPrefixes (empty) A comma separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e.org.apache.spark.*). 1.40

Reading data

TEXT

Official description:

Spark SQL provides spark.read().text("file_name") to read a file or directory of text files into a Spark DataFrame, and dataframe.write().text("path") to write to a text file. When reading a text file, each line becomes each row that has string "value" column by default. The line separator can be changed as shown in the example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator, compression, and so on.

When reading we do not need to set the compression codec; like MapReduce, it decompresses automatically.

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df = spark.read.text("file:///D:\\test.txt") // 返回值是DF
df.show()
df.printSchema()


var result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
|,w,wq,e,w,ewq,we,...|
+--------------------+
"""

//The schema text() attaches (a single string `value` column) is added automatically, which is sometimes inconvenient
val df1 = spark.read.textFile("file:///D:\\test.txt") // 返回值是dataset
df1.printSchema()
//--------------------------------------------------------------------
//Use lineSep to change the line separator, as below
val df2 = spark.read.option("lineSep",",").text("file:///D:\\test.txt")
df2.show()
result =
"""
+---------+
| value|
+---------+
| as|
| s|
| ed|
| f|
| |
| |
| qq|
|eqedqwe\n|
| w|
| wq|
| e|
| w|
| ewq|
| we|
| q|
| e|
|wewqeqwel|
| qe|
| lqeweqwl|
| qw\n|
+---------+
"""
//---------------------------------------------------------------------
//Use wholetext to read the whole file as a single row
val df3 = spark.read.option("wholetext",true).text("file:///D:\\test.txt")
df3.show()
result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
+--------------------+
"""
//text and textFile are interchangeable here
val df4 = spark.read.option("wholetext",true).textFile("file:///D:\\test.txt")
df4.show()
result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
+--------------------+
"""
val df5 = spark.read.option("wholetext",true).format("text").load("file:///D:\\test.txt")
df5.show()
result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
+--------------------+
"""

}
}

Stepping into the source, we find that text() is implemented as

def text(paths: String*): DataFrame = format("text").load(paths : _*)

so we can equivalently write

val df5=spark.read.option("wholetext",true).format("text").load("file:///D:\\test.txt")

json

Overview:

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String], or a JSON file.

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

JSON comes in two forms: flat JSON and nested JSON.

import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()

//------------------------- flat JSON
val df = spark.read.json("file:///C:\\Users\\dell\\Desktop\\dept.json")
df.show()
df.printSchema()
//-------------------------- nested JSON: for a nested struct use dot notation; for an array type, explode it first, then use dot notation
var df1 = spark.read.format("json").load("file:///C:\\Users\\dell\\Desktop\\Skills.json")
df1.printSchema()
//-------------------------- DataFrame API
//-withColumn adds a column (or overwrites one) => use it to pull a nested field up to the top level
df1=df1.withColumn("critical",col("damage.critical"))
df1=df1.withColumn("elementId",explode(col("damage.elementId")))
df1.printSchema()
//------------------------ drop the original nested fields
df1=df1.drop("damage.critical","damage.elementId")
//------------------------- SQL
//------------------------ compare with Hive SQL
df1.createOrReplaceTempView("test")
//spark.sql("SELECT get_json_object('{\"a\":\"b\"}', '$.a');").show()
// a struct can be accessed with dot notation, as below
spark.sql(
"""
|select
|effects.ddd,
|damage.ddddds
|from
|test
|""".stripMargin).show()
//or use explode with a lateral view for array elements in nested JSON
spark.sql(
"""
|select
|effects.ddd,
|damage.ddddds
|from
|test
|lateral view explode(store.fruit) as fruit
|""".stripMargin)




csv

Overview:

Spark SQL provides spark.read().csv("file_name") to read a file or directory of files in CSV format into Spark DataFrame, and dataframe.write().csv("path") to write to a CSV file. Function option() can be used to customize the behavior of reading or writing, such as controlling behavior of the header, delimiter character, character set, and so on.

The default CSV field separator is a comma, but it can be changed.

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()

val df = spark.read.format("csv").load("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df.show()
var result=
"""
|+---+---------+------+----+----------+---+--------------------+------------+----------+
||_c0| _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8|
|+---+---------+------+----+----------+---+--------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_withi...|question_cnt|answer_cnt|
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学| 4| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+--------------------+------------+----------+
|""".stripMargin
// By default the file is split on commas
//The delimiter option changes the separator; sep does the same thing
val df1 = spark.read.option("delimiter",";").csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df1.show()
result =
"""
|+------------------------+
|| _c0|
|+------------------------+
|| "id","device_id",...|
|| 1,2138,male,21,北京...|
||2,3214,male,,复旦大学...|
|| 3,6543,female,20,...|
|| 4,2315,female,23,...|
|| 5,5432,male,25,山东...|
|| 6,2131,male,28,山东...|
|| 7,4321,male,28,复旦...|
|+------------------------+
|""".stripMargin
val df4 = spark.read.option("sep",";").csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df4.show()
result =
"""
|+------------------------+
|| _c0|
|+------------------------+
|| "id","device_id",...|
|| 1,2138,male,21,北京...|
||2,3214,male,,复旦大学...|
|| 3,6543,female,20,...|
|| 4,2315,female,23,...|
|| 5,5432,male,25,山东...|
|| 6,2131,male,28,山东...|
|| 7,4321,male,28,复旦...|
|+------------------------+
|""".stripMargin
//The header row can also be loaded from the CSV
val df2 = spark.read.option("delimiter",",").option("header","true").format("csv").load("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df2.show()
result=
"""
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_within_30|question_cnt|answer_cnt|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学| 4| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|""".stripMargin
//The two option calls above can also be merged into one options(Map(...)) call
val df3 = spark.read.options(Map("delimiter" -> "," ,"header" -> "true")).csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df3.show()
result=
"""
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_within_30|question_cnt|answer_cnt|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学| 4| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|""".stripMargin
//inferSchema adds automatic type inference; without it every column defaults to string
val df5 = spark.read.options(Map("sep"->",","header"->"true","inferSchema"->"true","encoding"->"UTF8")).csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df5.show()
result=
"""
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_within_30|question_cnt|answer_cnt|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学|4.0| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|""".stripMargin
df5.printSchema()
result=
"""
|root
| |-- id: integer (nullable = true)
| |-- device_id: integer (nullable = true)
| |-- gender: string (nullable = true)
| |-- age: integer (nullable = true)
| |-- university: string (nullable = true)
| |-- gpa: double (nullable = true)
| |-- active_days_within_30: integer (nullable = true)
| |-- question_cnt: integer (nullable = true)
| |-- answer_cnt: integer (nullable = true)
|""".stripMargin
// and so on -- see the API docs for the remaining options

df5.createOrReplaceTempView("csv")
spark.sql(
"""
|select
|gender,
|device_id,
|active_days_within_30,
|university
|from
|csv
|where university = '北京大学'
|""".stripMargin).show()

result=
"""
|+------+---------+---------------------+----------+
||gender|device_id|active_days_within_30|university|
|+------+---------+---------------------+----------+
|| male| 2138| 7| 北京大学|
||female| 6543| 12| 北京大学|
|+------+---------+---------------------+----------+
|""".stripMargin

jdbc

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()

// build the reader with explicit options in code
val df = spark.read.format("JDBC")
.option("url","jdbc:mysql://bigdata2:3306/try")
.option("dbtable", "emp")
.option("user", "root")
.option("password", "liuzihan010616")
.load()
df.printSchema()
df.show()
// loading it like this pulls the whole table in, but sometimes we only need part of it; passing a subquery as dbtable works like predicate pushdown
val sal =
"""
|select
|*
|from
|emp where sal > 1500
|""".stripMargin
val df1 = spark.read.format("JDBC")
.option("url","jdbc:mysql://bigdata2:3306/try")
.option("dbtable", s"($sal) as tmp")
.option("user", "root")
.option("password", "liuzihan010616")
.load()
df1.show()
result =
"""
|+-----+------+---------+----+-------------------+-------+-------+------+
||empno| ename| job| mgr| hiredate| sal| comm|deptno|
|+-----+------+---------+----+-------------------+-------+-------+------+
|| 7499| ALLEN| SALESMAN|7698|1981-02-20 00:00:00|1600.00| 300.00| 30|
|| 7566| JONES| MANAGER|7839|1981-04-02 00:00:00|2975.00| null| 20|
|| 7698| BLAKE| MANAGER|7839|1981-05-01 00:00:00|2850.00| null| 30|
|| 7782| CLARK| MANAGER|7839|1981-06-09 00:00:00|2450.00| null| 10|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7839| KING|PRESIDENT|null|1981-11-17 00:00:00|5000.00| null| 10|
|| 7902| FORD| ANALYST|7566|1981-12-03 00:00:00|3000.00| null| 20|
|| 7839| KING|PRESIDENT|null|1981-11-17 00:00:00|5000.00| null| 10|
|| 7654|MARTIN| SALESMAN|7698|1981-09-28 00:00:00|3200.00|1400.00| 30|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|+-----+------+---------+----+-------------------+-------+-------+------+
|only showing top 20 rows
|
|
|Process finished with exit code 0
|
|""".stripMargin
// or pass the credentials via a Properties object
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "liuzihan010616")
val jdbcDF2 = spark.read
.jdbc("jdbc:mysql://bigdata2:3306/try", "try.emp", connectionProperties)

excel

In IDEA you first need to add the spark-excel dependency to the pom; its version has to match your Scala version:

<dependency>
<groupId>com.crealytics</groupId>
<artifactId>spark-excel_2.12</artifactId>
<version>0.14.0</version>
</dependency>

Then the code:

package sparkfirst

import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.crealytics.spark.excel._
object sparksql2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df = spark.read.excel(header = true,inferSchema = true).load("file:////C:\\Users\\dell\\Desktop\\2023届毕业设计题目-计算机-选题志愿表.xlsx")
df.show()
val result =
"""
|+--------------------------------------+--------+-------------------------------------+------+--------+--------+--------+
||2023届计算机科学与技术专业毕业设计选题| _c1| _c2| _c3| _c4| _c5| _c6|
|+--------------------------------------+--------+-------------------------------------+------+--------+--------+--------+
|| 序号|指导老师| 题目|学生数|第一志愿|第二志愿|第三志愿|
|| 1| 王海涛| 基于android的房产中介app...| 1| null| null| null|
|| 2| 王海涛| 基于android的酒店预约入住a...| 1| null| null| null|
|| 3| 王海涛| 基于android的有声书app的...| 1| null| null| null|
|| 4| 王海涛| 基于android的掌上医院app...| 1| null| null| null|
|| 5| 王海涛| 基于web的考试管理系统的设计与实现| 1| null| null| null|
|| 6| 王琢| 电商平台产品评论爬虫的设计| 1| null| null| null|
|| 7| 王琢| 基于Django的智能水务系统前端开发| 1| null| null| null|
|| 8| 王琢| 个人账本管理微信小程序开发| 1| null| null| null|
|| 9| 王琢| 智能水务系统远程监控模块的开发| 1| null| null| null|
|| 10| 张文波|基于安卓系统的硕士研究生招生预报名...| 1| null| null| null|
|| 11| 张文波|面向工业互联网的联网设备故障检测技...| 1| null| null| null|
|| 12| 张文波|面向工业互联网的联网设备运行维护系...| 1| null| null| null|
|| 13| 曹烨| 疫情防控管理信息系统的设计与开发| 1| null| null| null|
|| 14| 曹烨| 多线程下载器的设计与开发| 1| null| null| null|
|| 15| 曹烨| 坦克对战游戏的设计与开发| 1| null| null| null|
|| 16| 曹烨| 五子棋游戏大厅的设计与开发| 1| null| null| null|
|| 17| 杜焱| 疫情封闭人员及物资管理系统开发| 1| null| null| null|
|| 18| 杜焱| 志愿者服务系统开发| 1| null| null| null|
|| 19| 杜焱| 高校教师工作绩效管理系统开发| 1| null| null| null|
|+--------------------------------------+--------+-------------------------------------+------+--------+--------+--------+
|only showing top 20 rows
|
|
|Process finished with exit code 0
|
|""".stripMargin


hive

In production we need to adjust the configuration files first.

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

Simply cp hive-site.xml from $HIVE_HOME/conf into Spark's conf directory, or symlink it there.

This also works when Hive and Spark are not on the same machine -- you just can't use a symlink in that case.

If the MySQL driver is missing, drop the MySQL driver jar into Spark's jars directory,

or pass it at launch time with --jars <path>.
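
The session can also be pointed at the metastore from code instead of relying on hive-site.xml on the classpath. A minimal sketch, assuming the metastore service runs on a host called bigdata2 on the default port 9083 (adjust the thrift URI to your own environment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveMetastoreDemo")
  .master("local[4]")
  .config("hive.metastore.uris", "thrift://bigdata2:9083")  // assumed host/port
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()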

Next, run the following directly in spark-shell:

scala> spark.sql("show databases").show
+-------------+
| namespace|
+-------------+
| bigdata|
| bigdata_hive|
|bigdata_hive2|
|bigdata_hive3|
|bigdata_hive4|
| default|
| test|
+-------------+

We can also use the spark-sql script to run Hive statements,

like this:

[hadoop@bigdata5 conf]$ spark-sql --master local[4]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/12 10:10:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/12 10:10:15 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/01/12 10:10:15 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Spark master: local[4], Application Id: local-1673489414801
spark-sql (default)> show databases;
namespace
bigdata
bigdata_hive
bigdata_hive2
bigdata_hive3
bigdata_hive4
default
test
Time taken: 2.213 seconds, Fetched 7 row(s)

We usually test SQL statements in spark-sql first and then deploy them in code => don't create tables from Spark SQL, it can cause problems => create them in Hive whenever possible.

Maintaining the warehouse => spark-sql -e / -f <sql file> => the recommended way to maintain an offline warehouse: simple and easy to keep up.

Switching Hive's execution engine to Spark => unstable => buggy => sometimes Spark functions cannot be used.

Any function Hive has, Spark also has,

but some functions Spark has do not exist in Hive.

In IDEA, hive-site.xml likewise has to go into the resources folder.

There is a lot more; look up whatever you need on the official site: https://spark.apache.org/docs/latest/sql-ref-syntax.html#ddl-statements

In IDEA, add the dependency first,

as follows:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.2.1</version>
</dependency>

Then running the following code is all it takes:

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql3 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
val frame = spark.sql(
"""
|select
|*
|from
|bigdata_hive3.emp
|""".stripMargin)
frame.show()
frame.printSchema()
val result =
"""
|+-----+--------+---------+----+--------+----+----+------+
||empno| ename| job| mgr|hiredate| sal|comm|deptno|
|+-----+--------+---------+----+--------+----+----+------+
|| 7369| SMITH| CLERK|7902| null| 800|null| 20|
|| 7499| ALLEN| SALESMAN|7698| null|1600| 300| 30|
|| 7521| WARD| SALESMAN|7698| null|1250| 500| 30|
|| 7566| JONES| MANAGER|7839| null|2975|null| 20|
|| 7654| MARTIN| SALESMAN|7698| null|1250|1400| 30|
|| 7698| BLAKE| MANAGER|7839| null|2850|null| 30|
|| 7782| CLARK| MANAGER|7839| null|2450|null| 10|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7839| KING|PRESIDENT|null| null|5000|null| 10|
|| 7844| TURNER| SALESMAN|7698| null|1500| 0| 30|
|| 7876| ADAMS| CLERK|7788| null|1100|null| 20|
|| 7900|lebulang| CLERK|7698| null| 950|null| 30|
|| 7902| FORD| ANALYST|7566| null|3000|null| 20|
|| 7934| MILLER| CLERK|7782| null|1300|null| 10|
|| 7839| KING|PRESIDENT|null| null|5000|null| 10|
|| 7654| MARTIN| SALESMAN|7698| null|3200|1400| 30|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|+-----+--------+---------+----+--------+----+----+------+
|only showing top 20 rows
|
|root
| |-- empno: string (nullable = true)
| |-- ename: string (nullable = true)
| |-- job: string (nullable = true)
| |-- mgr: long (nullable = true)
| |-- hiredate: date (nullable = true)
| |-- sal: decimal(10,0) (nullable = true)
| |-- comm: decimal(10,0) (nullable = true)
| |-- deptno: long (nullable = true)
|
|
|Process finished with exit code 0
|
|""".stripMargin
}
}

Writing data

Writes are usually accompanied by .crc files.

TEXT

Note that text output only supports a single column, not multiple columns. Because our resources folder contains config files, the write picks up their compression setting (bz2); you can specify the codec yourself, and when nothing is specified and no config file is present the output is uncompressed.

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df2 = spark.read.option("lineSep",",").text("file:///D:\\test.txt")
df2.show()
//----------------------------------------- write the data
df2.write.text("file:///D:\\test1.txt")
//------------------------------------------- with compression
df2.write.option("compression", "gzip").text("file:///D:\\test2.txt")
}
}

If you want to get around this you would have to define your own external data source, which effectively means changing the source code,

or convert the DataFrame to an RDD for the write, since saveAsTextFile can output multiple columns:

df2.rdd.saveAsTextFile("file:///D:\\test3.txt")
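
Another workaround is to collapse the columns into a single string column first so that text() accepts it. A minimal sketch, assuming the df5 DataFrame loaded from user_profile.csv earlier; the output path is just an example:

import org.apache.spark.sql.functions._

// concat_ws joins every column into one delimited string column,
// which satisfies the single-column restriction of text()
df5.select(concat_ws(",", df5.columns.map(col): _*).as("value"))
  .write
  .mode("overwrite")
  .text("file:///D:\\test4_concat")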

Check the file format of the output.

json

//common save modes: append, overwrite, ignore, error (the default, errorifexists)
//df.write.mode(saveMode = "overwrite").json("hdfs://bigdata3:9000/spark")
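
The same modes are available as the SaveMode enum. A tiny sketch, assuming the df5 DataFrame from the csv section and the same HDFS path as the commented line above:

import org.apache.spark.sql.SaveMode

// SaveMode.Overwrite / Append / Ignore / ErrorIfExists correspond to the string modes;
// ErrorIfExists is the default and fails if the target path already exists
df5.write.mode(SaveMode.Overwrite).json("hdfs://bigdata3:9000/spark")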

csv

//on write, sep sets the output delimiter, mode controls overwriting, and compression sets the codec
df5.write.options(Map("sep"->";","compression"->"gzip")).mode("overwrite").format("csv").save("file:///C:\\Users\\dell\\Desktop\\user_profile1.csv")

The result is as follows.

JDBC

//write out -------------------------- via format/options
//with overwrite the existing table is dropped and recreated, so the table's structure can change
df.write.mode("append")
.format("jdbc")
.option("url", "jdbc:mysql://bigdata2:3306/try")
.option("dbtable", "emp1")
.option("user", "root")
.option("password", "liuzihan010616")
.save()
// --------------------------Properties
df.write.mode("append")
.jdbc("jdbc:mysql://bigdata2:3306/try", "emp1", connectionProperties)

// extra column types can be declared when the table is created on write
df.write.mode("append")
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc("jdbc:mysql://bigdata2:3306/try", "try.emp", connectionProperties)

excel

As follows:

df.write.mode("overwrite").excel(header = true,"A1").save("file:///C:\\Users\\dell\\Desktop\\2023届毕业设计题目-计算机-选题志愿表1.xlsx")

hive

There are a few ways to do this:

  • ctas
  val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
def main(args: Array[String]): Unit = {

spark.sql(
"""
|create table bigdata.sparkfinish as
|select
|*
|from(
|select
|area,
|product_name,
|rank() over(partition by area order by cnt) as rk
|from (
|select
|area,
|product_name,
|count(1) as cnt
|from bigdata.tmp
|group by area,product_name
|)
|)where rk < 3;
|""".stripMargin)

// -------------------------------------- insert into: append data
spark.sql(
"""
|insert into table bigdata.sparkfinish
|select * from bigdata.sparkfinish
|""".stripMargin)

// ---------------------------------------- overwrite the data
spark.sql(
"""
|insert overwrite table bigdata.sparkfinish
|select * from bigdata.sparkfinish
|""".stripMargin)
// ---------------------------------------- partitioned tables: emp_partition is the original partitioned table, emp_partition1 was created afterwards
// --------------------------------------- dynamic partitioning needs these settings; static partitioning does not
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
spark.sql(
"""
|insert overwrite table bigdata_hive3.emp_partition1 partition(deptno)
|select * from bigdata_hive3.emp_partition
|""".stripMargin)
//-----------------------------------------------------------------api
val frame = checksql(hivesqlchoose("empno , ename , job , mgr , deptno ", "bigdata_hive3.emp", "where sal > 3000"))

// ----------------------------- overwrite mode here would wipe the whole table and rebuild it with the new data, so it is generally avoided, to prevent losing the other partitions when we only mean to touch one
// ------------------------------------- regular (non-partitioned) table
frame.write.mode(saveMode = "append").format("hive").saveAsTable("bigdata_hive3.emp89")
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
// ------------------------------------- partitioned table
frame.write.partitionBy("deptno").mode(saveMode = "append").format("hive").saveAsTable("bigdata_hive3.emp891")
// --------------------------------------- insertInto: inserts data; on a partitioned table it automatically uses dynamic partitioning, and it works for regular tables too
// Exception in thread "main" org.apache.spark.sql.AnalysisException: insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().
//frame.write.partitionBy("deptno").mode(saveMode = "overwrite").format("hive").insertInto("bigdata_hive3.emp891")
// data can also be written straight to HDFS under the table's path

frame.select("empno","ename","job","mgr").write.mode("overwrite").parquet("hdfs://bigdata3:9000/user/hive/warehouse/bigdata_hive3.db/emp_partition1/deptno=20")
// for a regular table this kind of write is fine, but for a partitioned table the metastore may not pick up the new partition because its metadata differs
// repairing the metastore fixes that
// note the files must be stored as parquet or orc; text output would come out garbled here
// repair the metadata:
// msck repair table table_name [ADD/DROP/SYNC partition]
// or store the data via the RDD API instead
}


def hivesqlchoose(string: String*)={

val str = "select" + " " + string(0) + " " + "from" + " " + string(1)
if (string.length > 2){
str +" " + string(2)
}else{
str
}
}

def descfunctionsql(string: String)={
s"""
|desc function extended $string
|""".stripMargin
}

def checksql(string: String)={
spark.sql(string).show(false)
spark.sql(string)
}
}

sparksql

Spark SQL is the module mainly used for processing structured data.

Structured data => data that carries schema information

Semi-structured data => csv, json, orc, parquet

Unstructured data => nosql => redis, hbase

Spark SQL => you can not only write SQL but also program against it

Features

Spark SQL => SQL + DataFrame API => for processing structured data

The spark-core APIs are also usable here

There is a unified data interface => it handles many external data sources => mysql/hive/excel/csv/... => one API

Hive integration => using Hive becomes very simple

Spark SQL is more than just SQL

hive on spark => Hive's query engine is Spark

spark on hive => Spark SQL queries data stored in Hive; this is what most people use

Spark SQL is performance-optimized => faster than plain RDDs

Basics

Why Spark SQL outperforms raw RDDs:

under the hood Spark SQL still runs on spark-core RDDs, just with optimizations applied,

because the user provides a schema.

spark-core => programming model: RDD

Spark SQL => RDD[the data] + schema[field names and types] => table
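
This split is visible directly on a DataFrame: the data lives in an underlying RDD and the field names and types live in a schema object. A small sketch, assuming the Skills.json file that is used later in this post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df = spark.read.json("file:///C:\\Users\\dell\\Desktop\\Skills.json")

df.rdd.take(3).foreach(println)   // the underlying RDD[Row] that holds the data
println(df.schema.treeString)     // the schema (field names + types) kept alongside it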

DataSet&DataFrame

Spark SQL programming model => Dataset / DataFrame

dataset

A distributed collection of data.

Has many advantages over RDDs => heavily optimized => more efficient => strongly typed => supports the usual operators => better query performance => introduced in Spark 1.6.

PySpark does not support the Dataset API.

dataframe

A DataFrame is itself a Dataset

DataFrame => like a table in an ordinary database => supports the operators => a Dataset of Rows

Row => one row of data, containing only the column values

DataFrame => table

Compared with spark-core

spark-core => RDD

Spark SQL => DataFrame [RDD of data + extra schema information]

Spark SQL history:

  • 1.0 => SchemaRDD: an RDD holding the data + a schema (like metadata: stores the extra information)
  • 1.6 => Dataset, which the DataFrame evolved into

Creating a DataFrame with SparkSession

Tools: IDEA, Linux

Dependencies: the Spark SQL dependency

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive.version}</version>
</dependency>

linux

When you start spark-shell on Linux, a SparkSession is provided for you automatically:

[hadoop@bigdata5 ~]$ spark-shell --master local[4]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/10 10:11:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://bigdata5:4040
Spark context available as 'sc' (master = local[4], app id = local-1673316684351).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Create a DataFrame like this:


scala> val df = spark.read.json("file:///home/hadoop/data/json/Skills.json")
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string, animationId: bigint ... 24 more fields]

scala> df.show
23/01/10 10:14:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
|_corrupt_record|animationId| damage| description| effects|hitType|iconIndex| id| message1|message2|messageType|mpCost| name| note|occasion|repeats|requiredWtypeId|requiredWtypeId1|requiredWtypeId2|scope|speed|stypeId|successRate|tpCost|tpGain|xianliCost|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
| [| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null,| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null| 1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 1| %1的攻击!| | 1| 0| 攻击| 1 号技能会在选择“攻击”指令时使...| 1| 1| null| 0| 0| 1| 0| 0| 100| 0| 10| null|
| null| 0|{false, 0, 0, 0, 20}| | [{21, 2, 1.0, 0}]| 0| 688| 2|%1正在保护自己。| | 1| 0| 防御|1 号技能会在选择“防御”指令时使用。| 1| 1| null| 0| 0| 11| 10| 0| 100| 0| 10| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 3| %1的攻击!| | 1| 0|连续攻击| | 1| 2| null| 0| 0| 1| 0| 2| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 4| %1的攻击!| | 1| 0|两次攻击| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 849| 5| %1的攻击!| | 1| 0|三次攻击| | 1| 1| null| 0| 0| 5| 0| 0| 100| 0| 4| null|
| null| 0|{false, 0, 0, 0, 20}| | [{41, 0, 0.0, 0}]| 0| 883| 6| %1逃跑了。| | 1| 0| 逃跑| | 1| 1| null| 0| 0| 11| 0| 0| 100| 0| 0| null|
| null| 0|{false, 0, 0, 0, 20}| | []| 0| 979| 7| %1正在观望。| | 1| 0| 观望| | 1| 1| null| 0| 0| 0| 0| 0| 100| 0| 10| null|
| null| 41|{false, 0, 200 + ...| |[{21, 4, 1.0, 0},...| 0| 72| 8| %1吟唱了%2!| | 1| 5| 治愈| | 0| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 66|{false, 2, 100 + ...| 魔法\n初级的圣光技能,能召唤微弱...| [{44, 30, 1.0, 0}]| 2| 64| 9| %1吟唱了%2!| | 1| 5| 火焰| | 1| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 0|{false, 2, 285 + ...|呼吸法\n常见的呼吸法,运转时能够...| [{21, 21, 1.0, 0}]| 0| 3084| 10| %1施放了%2!| | 1| 0|小吐纳法| <Cast Animation: ...| 0| 1| null| 0| 0| 11| 0| 4| 100| 0| 0| null|
| null| 152|{false, 2, 285 + ...| |[{21, 153, 1.0, 0...| 0| 499| 11| %1使用了%2!| | 1| 0| 灭魂术| <Cast Animation: 0> | 1| 1| null| 0| 0| 2| 0| 0| 100| 0| 0| null|
| null| 38|{true, 1, 20000, ...| 基因锁·一阶\n觉醒了脚上的力量,...| [{21, 72, 0.2, 0}]| 0| 479| 12| %1使出了 %2!| | 1| 0| 骑士踢| <setup action>\na...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 0| null|
| null| 125|{true, 0, 150 + a...| 基因锁·一阶\n觉醒了一种气功,能...| [{44, 24, 1.0, 0}]| 2| 4471| 13| %1施放了%2!| | 1| 50| 变身| <Cast Animation: ...| 1| 1| null| 0| 0| 11| 0| 2| 100| 0| 0| null|
| null| 23|{true, -1, 426500...| | [{21, 0, 1.0, 0}]| 1| 640| 14| %1的攻击!| | 1| 0|莫名剑法| <Cast Animation: ...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 10| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 15| | | 1| 0|无望三阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 16| | | 1| 0|无望四阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| -1|{true, -1, 10000+...| | [{21, 0, 1.0, 0}]| 0| 880| 17| %1的攻击!| | 1| 0| 骑士拳| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| 311|{false, 0, 300, 0...| 基因锁·四阶\n返祖·又北二百八十...| []| 0| 484| 18| %1施放了%2!| | 1| 0|孟极血脉| \n<passiveAPLUS:1...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
only showing top 20 rows

idea

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[2]").getOrCreate()
val frame = spark.read.json("file:///C:\\Users\\dell\\Desktop\\Skills.json")
frame.show()
spark.stop()
}
}

Data analysis with Spark SQL

  • SQL
  • code

Working with DataFrames

  • SQL => in IDEA, API + SQL used together, or SQL files run against Hive
  • API => generally used when building a big-data platform

Learning the API

Selecting a single column of a DataFrame: select

scala> df.select("description").show
+-----------------------------------+
| description|
+-----------------------------------+
| null|
| null|
| |
| |
| |
| |
| |
| |
| |
| |
| 魔法\n初级的圣光技能,能召唤微弱...|
|呼吸法\n常见的呼吸法,运转时能够...|
| |
| 基因锁·一阶\n觉醒了脚上的力量,...|
| 基因锁·一阶\n觉醒了一种气功,能...|
| |
| 基因锁·破碎\n已经达到身体的极限...|
| 基因锁·破碎\n已经达到身体的极限...|
| |
| 基因锁·四阶\n返祖·又北二百八十...|
+-----------------------------------+
only showing top 20 rows
----------------------------------------------------------------------------源码
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)

Pass the column name directly => select("colName"), or select($"colName"), or select('colName) => the $ and ' forms need the implicit conversions from import spark.implicits._, which is not required in spark-shell on Linux
select(col("age")) => needs import org.apache.spark.sql.functions._; again not required on Linux
------------------------------------------------------------------------------
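
The variants side by side, as a minimal sketch for an IDEA project (in spark-shell the implicits are already in scope); it assumes a DataFrame df that has a name column, like the Skills.json one above:

import org.apache.spark.sql.functions.col
import spark.implicits._       // enables the $"..." and 'symbol column syntax in an IDE project

df.select("name").show(3)      // plain column name
df.select($"name").show(3)     // string interpolator -> Column
df.select('name).show(3)       // Scala symbol -> Column
df.select(col("name")).show(3) // functions.col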

createOrReplaceTempView(): registers a temporary view => after that the DataFrame can be queried with SQL


scala> df.createOrReplaceTempView("test")

scala> spark.sql("select * from test")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string, animationId: bigint ... 24 more fields]

scala> spark.sql("select * from test").show()
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
|_corrupt_record|animationId| damage| description| effects|hitType|iconIndex| id| message1|message2|messageType|mpCost| name| note|occasion|repeats|requiredWtypeId|requiredWtypeId1|requiredWtypeId2|scope|speed|stypeId|successRate|tpCost|tpGain|xianliCost|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
| [| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null,| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null| 1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 1| %1的攻击!| | 1| 0| 攻击| 1 号技能会在选择“攻击”指令时使...| 1| 1| null| 0| 0| 1| 0| 0| 100| 0| 10| null|
| null| 0|{false, 0, 0, 0, 20}| | [{21, 2, 1.0, 0}]| 0| 688| 2|%1正在保护自己。| | 1| 0| 防御|1 号技能会在选择“防御”指令时使用。| 1| 1| null| 0| 0| 11| 10| 0| 100| 0| 10| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 3| %1的攻击!| | 1| 0|连续攻击| | 1| 2| null| 0| 0| 1| 0| 2| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 4| %1的攻击!| | 1| 0|两次攻击| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 849| 5| %1的攻击!| | 1| 0|三次攻击| | 1| 1| null| 0| 0| 5| 0| 0| 100| 0| 4| null|
| null| 0|{false, 0, 0, 0, 20}| | [{41, 0, 0.0, 0}]| 0| 883| 6| %1逃跑了。| | 1| 0| 逃跑| | 1| 1| null| 0| 0| 11| 0| 0| 100| 0| 0| null|
| null| 0|{false, 0, 0, 0, 20}| | []| 0| 979| 7| %1正在观望。| | 1| 0| 观望| | 1| 1| null| 0| 0| 0| 0| 0| 100| 0| 10| null|
| null| 41|{false, 0, 200 + ...| |[{21, 4, 1.0, 0},...| 0| 72| 8| %1吟唱了%2!| | 1| 5| 治愈| | 0| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 66|{false, 2, 100 + ...| 魔法\n初级的圣光技能,能召唤微弱...| [{44, 30, 1.0, 0}]| 2| 64| 9| %1吟唱了%2!| | 1| 5| 火焰| | 1| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 0|{false, 2, 285 + ...|呼吸法\n常见的呼吸法,运转时能够...| [{21, 21, 1.0, 0}]| 0| 3084| 10| %1施放了%2!| | 1| 0|小吐纳法| <Cast Animation: ...| 0| 1| null| 0| 0| 11| 0| 4| 100| 0| 0| null|
| null| 152|{false, 2, 285 + ...| |[{21, 153, 1.0, 0...| 0| 499| 11| %1使用了%2!| | 1| 0| 灭魂术| <Cast Animation: 0> | 1| 1| null| 0| 0| 2| 0| 0| 100| 0| 0| null|
| null| 38|{true, 1, 20000, ...| 基因锁·一阶\n觉醒了脚上的力量,...| [{21, 72, 0.2, 0}]| 0| 479| 12| %1使出了 %2!| | 1| 0| 骑士踢| <setup action>\na...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 0| null|
| null| 125|{true, 0, 150 + a...| 基因锁·一阶\n觉醒了一种气功,能...| [{44, 24, 1.0, 0}]| 2| 4471| 13| %1施放了%2!| | 1| 50| 变身| <Cast Animation: ...| 1| 1| null| 0| 0| 11| 0| 2| 100| 0| 0| null|
| null| 23|{true, -1, 426500...| | [{21, 0, 1.0, 0}]| 1| 640| 14| %1的攻击!| | 1| 0|莫名剑法| <Cast Animation: ...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 10| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 15| | | 1| 0|无望三阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 16| | | 1| 0|无望四阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| -1|{true, -1, 10000+...| | [{21, 0, 1.0, 0}]| 0| 880| 17| %1的攻击!| | 1| 0| 骑士拳| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| 311|{false, 0, 300, 0...| 基因锁·四阶\n返祖·又北二百八十...| []| 0| 484| 18| %1施放了%2!| | 1| 0|孟极血脉| \n<passiveAPLUS:1...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
only showing top 20 rows


Building the warehouse:

  • SQL files to maintain the warehouse => recommended, easy to maintain
  • IDEA
    • maintain the warehouse with SQL => example: Didi
    • maintain the warehouse with the API => harder to maintain => but convenient for defining UDFs
      • you can write generic code to maintain it => very powerful once it is written

How to build a DataFrame

Through the SparkSession

RDD => DataFrame

  • Reflection: a data structure (tuple or case class) -> becomes a DataFrame -> just call RDD.toDF; the arguments of toDF("colName", "colName", ...) are the column names inside the DataFrame (see the sketch after this block)
  • Programmatically -> build the DataFrame ->
    • prepare an RDD of Row objects

    • schema => the field names and field types

      • schema: think of it as a table's metadata => the field names and field types => maintained as a StructType
        • fields: the metadata of a single field is maintained as a StructField
    • createDataFrame => df

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{DataFrame, Row}
      import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

      // inputRDD is an RDD[String] of "uid,name,age" lines, e.g. read with spark.sparkContext.textFile(...)
      val rowRDD: RDD[Row] = inputRDD.map(line => {
        val splits = line.split(",")
        val uid = splits(0)
        val name = splits(1)
        val age = splits(2).toInt
        Row(uid, name, age)
      })

      // the schema describes the field names and types that go with the rows
      val schema = StructType(Array(
        StructField("uid", StringType),
        StructField("name", StringType),
        StructField("age", IntegerType)
      ))

      val inputDF: DataFrame = spark.createDataFrame(rowRDD, schema)
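
And the reflection route from the first bullet, as a minimal sketch: the User case class and the sample rows are made up for illustration, and in an IDEA project the case class has to sit outside the method that calls toDF:

import org.apache.spark.sql.SparkSession

// the case class supplies both the column names and the types via reflection
case class User(uid: String, name: String, age: Int)

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
import spark.implicits._   // brings the rdd.toDF() conversion into scope

val userDF = spark.sparkContext
  .parallelize(Seq("1,zhangsan,20", "2,lisi,25"))   // stand-in for a real input file
  .map(_.split(","))
  .map(a => User(a(0), a(1), a(2).toInt))
  .toDF()                // or .toDF("uid", "name", "age") to rename the columns

userDF.printSchema()
userDF.show()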

dataframe/dataset -> RDD

  • df.rdd

df -> ds

  • convert with as: df.as[T] => Dataset; T is usually a case class (see the sketch below)
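
A minimal sketch of the as conversion, assuming the emp DataFrame read from MySQL earlier; the Emp case class is only an illustration, and its field types have to match (or safely up-cast from) the actual column types:

import spark.implicits._   // provides the Encoder for the case class

case class Emp(empno: Long, ename: String, job: String)

// the selected column names must line up with the case class fields
val empDS: org.apache.spark.sql.Dataset[Emp] =
  df.select("empno", "ename", "job").as[Emp]

empDS.filter(_.job == "ANALYST").show()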

Homework: extract the MySQL emp and dept tables as JSON,

and redo the earlier requirements with Spark SQL.
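
A minimal sketch of the extraction part, reusing the MySQL connection details from the jdbc section; the output directories are just examples:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MysqlToJson").master("local[4]").getOrCreate()

// read one table over jdbc and dump it as json
def dumpAsJson(table: String, outDir: String): Unit =
  spark.read.format("jdbc")
    .option("url", "jdbc:mysql://bigdata2:3306/try")
    .option("dbtable", table)
    .option("user", "root")
    .option("password", "liuzihan010616")
    .load()
    .write.mode("overwrite")
    .json(outDir)

dumpAsJson("emp", "file:///D:\\emp_json")
dumpAsJson("dept", "file:///D:\\dept_json")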

The data is as follows:
-----------------------------------------dept
{
"dept": [
{
"deptno" : 10,
"dname" : "ACCOUNTING",
"loc" : "NEW YORK"
},
{
"deptno" : 20,
"dname" : "RESEARCH",
"loc" : "DALLAS"
},
{
"deptno" : 30,
"dname" : "SALES",
"loc" : "CHICAGO"
},
{
"deptno" : 40,
"dname" : "OPERATIONS",
"loc" : "BOSTON"
}
]}
-----------------------------------------emp
{
"emp": [
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
}
]}
---------------------------------------------------------------------------- Questions
1. List the employee number and name of every employee in department 30.
2. Find the full details of all managers in department 10 and of all salesmen in department 20.
3. List the full details of all employees, sorted by salary in descending order; for equal salaries, sort by hire date in ascending order.
4. Among employees earning more than 1500, list each job and the number of employees doing it.
5. List the names of the employees who work in the sales department, assuming the department number of SALES is unknown.
6. Find employees whose name starts with S / ends with S / contains S / has L as its second letter.
7. For each job, find the highest salary, the lowest salary, and the head count.
8. List the employee number, name, department name, manager, salary, and salary grade of every employee whose salary is above the company-wide average.
9. List the name, salary, and department name of every employee whose salary is above the average salary of their own department.

Solutions:

//Because the data above is not in the format Spark's default JSON reader expects (one record per line),
//we first reshape it in VS Code; after the conversion:
//-------------------------------------------------------
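//A hedged aside, not part of the original notes: Spark's JSON data source can also parse
//pretty-printed (multi-line) JSON directly via the multiLine option, which may avoid the manual
//reshaping step; the path below is hypothetical, and if the records are wrapped in a top-level
//object/array the result may still need to be flattened afterwards.
val empRaw = spark.read.option("multiLine", true).json("hdfs://bigdata3:9000/spark/emp.json")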
//List the employee number and name of every employee in department 30
//---------------------------api
val cluname = emp.columns.toList
cluname.foreach(println(_))
emp.select("deptno","empno","ename").rdd.filter(x=>{
x(0)==30
}).saveAsTextFile("hdfs://bigdata3:9000/spark/1")
//---------------------------sql
emp.createOrReplaceTempView("tableemp")
spark.sql("select deptno,empno,ename from tableemp where deptno=30").show()
//---------------------------------------------------------------2
//Find the full details of all managers in department 10 and of all salesmen in department 20
//----------------------------api
emp.select("comm","deptno","empno","ename","hiredate","job","mgr","sal").rdd.filter(x=>{
(x(1)==10&&x(5)=="MANAGER")||(x(1)==20&&x(5)=="SALESMAN")
}).saveAsTextFile("hdfs://bigdata3:9000/spark/2")
//--------------------------sql
spark.sql("select * from tableemp where deptno=10 and job= 'MANAGER' or deptno = 20 and job= 'SALESMAN'").show()
//--------------------------------------------------------------3
//List the full details of all employees, ordered by salary descending and, for equal salaries, by hire date ascending
//----------------------api
import java.text.SimpleDateFormat
emp.rdd.map(x=>{
if (x.isNullAt(0)){
var total = x.getDouble(7)
var hire = x.getString(4).split("Z")
var reallyhire = hire(0).split("T")
val date = reallyhire(0)+" " +reallyhire(1)
var Data = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(date)
(total,Data,x)
}else{
var total = x.getDouble(0)+x.getDouble(7)
var hire = x.getString(4).split("Z")
var reallyhire = hire(0).split("T")
val date = reallyhire(0)+" " +reallyhire(1)
var Data = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(date)
(total,Data,x)
}
}).sortBy(x=>(-x._1,x._2)).map(x=>{x._3}).saveAsTextFile("hdfs://bigdata3:9000/spark/3")

//------------------------------------sql
spark.sql("select * from (select (ifnull(comm,0)+sal) as total, hiredate,job,mgr,sal,empno,ename,deptno from tableemp ) order by total desc,hiredate asc")
//-------------------------------------------4
//Among employees earning more than 1500, list each job and the number of employees doing it
//----------------------------------------api
emp.filter(Row=>{
Row.getDouble(7) > 1500
}).groupBy("job").count().rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/4")
//----------------------------------------sql
spark.sql("select job,count(*) from tableemp where sal > 1500 group by job").show()
//--------------------------------------------5
//List the names of the employees who work in the sales department (SALES), assuming its department number is unknown
//---------------------------api
dept.filter(x=>{
x.getString(1)=="SALES"
}).join(emp,"deptno").select("ename").rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/5")

//------------------------sql
spark.sql("select ename from tableemp where deptno in(select deptno from tabledept where dname='SALES')").show()
//-----------------------------------------------------------6
//Find employees whose name starts with S / ends with S / contains S / has L as the second letter
//----------------------------------api
emp.filter(x=>{
(x.getString(3).contains("S"))||(x.getString(3).startsWith("S"))||(x.getString(3).endsWith("S")||x.getString(3).charAt(1)=='L')
}).rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/6")
//----------------------------------sql
spark.sql("select * from tableemp where ename like '%S%' or ename REGEXP '^.L'").show()
//-----------------------------------------------------------7
//For each job, find the highest salary, the lowest salary, and the head count
//---------------------------------api
//(the emp.map calls below assume import spark.implicits._ is already in scope)
val frame2 = emp.map(x => {
if (x.isNullAt(0)) {
val d = x.getDouble(7)
(x.getString(5), d)
} else {
val d = x.getDouble(0) + x.getDouble(7)
(x.getString(5), d)
}
}).groupBy("_1").max("_2")

val frame1 = emp.map(x => {
if (x.isNullAt(0)) {
val d = x.getDouble(7)
(x.getString(5), d)
} else {
val d = x.getDouble(0) + x.getDouble(7)
(x.getString(5), d)
}
}).groupBy("_1").min("_2")

val frame = emp.map(x => {
if (x.isNullAt(0)) {
val d = x.getDouble(7)
(x.getString(5), d)
} else {
val d = x.getDouble(0) + x.getDouble(7)
(x.getString(5), d)
}
}).groupBy("_1").count()

frame.join(frame1 , "_1").join(frame2,"_1").rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/7")

//-------------------------------sql
spark.sql("select max(ifnull(comm,0)+sal),min(ifnull(comm,0)+sal),count(*),job from tableemp group by job ").show()

//--------------------------------------------------------------8
//List the employee number, name, department name, manager, salary, and salary grade of every employee whose salary is above the company-wide average
//--------------------------------------------api
import org.apache.spark.sql.functions._
val list = emp.groupBy().avg("sal").rdd.map(_.getDouble(0)).collect().toList
println(list(0))
val value = emp.filter(x => {
x.getDouble(7) > list(0).toDouble
})
value.show()
val frame = emp.select($"empno".alias("mgr"), $"ename".alias("leader")).join(value, "mgr").join(dept,"deptno")
//----------------------------------------------------salary grade (salgrade table)
// insert into salgrade values (1, 700, 1200);
// insert into salgrade values (2, 1201, 1400);
// insert into salgrade values (3, 1401, 2000);
// insert into salgrade values (4, 2001, 3000);
// insert into salgrade values (5, 3001, 9999);
frame.printSchema()
// frame.rdd.collect().foreach(println(_))
val value1 = frame.rdd.map(x => {
// in this joined frame, column 3 is comm (may be null) and column 8 is sal
val earn = if (x.isNullAt(3)) x.getDouble(8) else x.getDouble(3) + x.getDouble(8)
// map the earnings onto the salgrade ranges above (boundaries inclusive)
val grade =
if (earn <= 1200) 1
else if (earn <= 1400) 2
else if (earn <= 2000) 3
else if (earn <= 3000) 4
else 5
(grade, x(1), x)
})

value1.saveAsTextFile("hdfs://bigdata3:9000/spark/8")

// --------------------------------------------sql
emp.createOrReplaceTempView("tableemp1")
spark.sql(
"""
|select king.ename, king.empno, e1.ename as leader, king.earn, s.grade as sallevel
|from (
|  select ename, empno, deptno, ifnull((sal + comm), sal) as earn, mgr
|  from tableemp
|  where sal > (select avg(sal) from tableemp1)
|) as king
|left join tabledept on king.deptno = tabledept.deptno
|left join (select empno, ename from tableemp) e1 on king.mgr = e1.empno
|left join salgrade as s on earn >= losal and earn <= hisal
|""".stripMargin).show()

//--------------------------------------------------------------------9
//List the name, salary, and department name of every employee whose salary is above the average salary of their own department
//----------------------------------------------api
val frame = emp.groupBy("deptno").avg("sal").rdd.collect().toList
frame.foreach(println(_))

for (elem <- frame){
var name = elem(0).toString + "deptno"
var frame1 = emp.filter(x => {
(x.getLong(1).toString == elem(0).toString) && (x.getDouble(7) > elem(1).toString.toDouble)
}).join(dept, "deptno").rdd.saveAsTextFile(s"hdfs://bigdata3:9000/spark/9/$name")
}
//----------------------------------------------sql
spark.sql(
"""
|select *
|from (
|  select *
|  from (
|    select avg(sal) as sal_avg, deptno as deptno1
|    from (select sal, deptno from tableemp group by sal, deptno) as king
|    group by deptno
|  ) as avg_basic
|  left join tableemp
|    on tableemp.deptno = avg_basic.deptno1 and tableemp.sal > avg_basic.sal_avg
|) as basicinfo
|where basicinfo.deptno1 in (select deptno from tabledept)
|""".stripMargin).show()



spark

Why did Spark come about?
MapReduce and Hive are batch/offline tools and have some limitations:

  • the MR API is complex to develop against
  • they only do offline computation, not real-time computation
  • performance is limited

A typical need:

sql => mr

One requirement usually turns into several MR jobs:

mr1 => mr2 => mr3

map => reduce

After the map side finishes, the data is spilled to disk before the reduce side of the MR job runs.

MR operates on key/value pairs and sorts by key.

What is Spark?

Official site: spark.apache.org

It is a compute engine: it does not concern itself with how the data is stored.

Features:

  • Batch/streaming data => unified batch and stream processing
  • SQL analytics
  • Data science at scale
  • Machine learning

Fast (a small sketch of the chaining follows this list):

  • in-memory computation
  • DAG => chained execution => mr1 => mr2 => mr3
  • pipelined execution
  • a thread-level programming model (tasks run as threads inside executors)
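A small sketch of the chaining, assuming a SparkContext sc is already available (as in spark-shell); the numbers are arbitrary. map and filter are chained transformations that Spark pipelines together, and nothing runs until the action at the end.

val nums = sc.parallelize(1 to 100)   // source RDD (think mr1)
val scaled = nums.map(_ * 2)          // chained transformation (mr2)
val big = scaled.filter(_ > 100)      // chained transformation (mr3)
println(big.count())                  // only this action triggers the whole chain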

Easy to use

  • development languages: Java, Scala, Python, SQL
  • external data sources
  • 80+ high-level operators => Scala-style operators
  • in MR, reading a MySQL database means writing a DBInputFormat
  • Spark ships with many external data sources out of the box => jdbc, json, csv
  • MR gives you only map/reduce
  • Spark gives you roughly 80 operators

General purpose

  • sub-modules:
  • Spark Core => offline (batch) computation
  • Spark SQL => offline (batch) computation
  • Spark Streaming / Structured Streaming => real-time computation
  • MLlib => machine learning
  • graph computation => graph processing

The Spark sub-modules can be combined and used with one another.

Where jobs run

  • yarn ***
  • mesos
  • k8s ***
  • standalone

Hadoop ecosystem vs Spark ecosystem

  • Batch: MR, Hive vs Spark Core, Spark SQL
  • SQL: Hive, Impala vs Spark SQL
  • Streaming: Storm vs Spark Streaming, Structured Streaming
  • ML: Mahout vs MLlib
  • Real-time storage: HBase, Cassandra vs the DataSource API

Can Spark replace Hadoop? No: Spark can replace MR as the compute engine, not Hadoop as a whole.

Spark versions:

  • spark 1.x
  • spark 2.x (mainstream)
  • spark 3.x (mainstream)

Programming models:

Spark Core => RDD

Spark SQL => DataFrame & Dataset

Spark Streaming => DStream

sparkcore

RDD: developing with RDDs lowers development cost compared with MR.

What is an RDD?

lower level => MR

high level => Spark's high-level operators

Advantages:

  • Resilient Distributed Dataset
  • dataset => partitions of elements => individual records
  • can be computed in parallel

Resilient?

  • fault tolerant => a failed computation can be retried

Distributed?

  • storage
    • rdd: 1 2 3 4 5 6
      • partition1: 1 2 3
      • partition2: 4 5
      • partition3: 6
    • bigdata3: p1
    • bigdata4: p2
    • bigdata5: p3
  • computation
    • operating on an RDD means operating on the data inside it
  • dataset
    • simply the data the RDD itself is built from
  • immutable
    • scala: val vs var
    • rdda => rddb
    • immutable => a computation on rdda yields a new RDD
  • a partitioned collection of elements => an RDD can be stored and computed partition by partition
  • one RDD is made up of multiple partitions
  • an RDD's data is stored in a distributed way, across nodes

abstract

The type parameter T constrains the type of the data inside the RDD, e.g. RDD[String], RDD[Int], RDD[Student]

Serializable => the data can be shipped over the network

The @transient annotation marks a field that is not serialized [good to know]

Characteristics of an RDD:

  • an RDD is backed by a list of partitions
  • computing on / operating on an RDD really means computing on its underlying partitions
  • dependencies between RDDs
    • rdda => rddb
    • RDDs are immutable
    • rdda => b => c
  • Partitioner => only for key/value RDDs
  • the default partitioning is hash partitioning
  • data locality => reduces the IO of shipping data around, a real advantage
    • the benefit when operating on an RDD:
      • preferably the task is scheduled on the node that already holds the data => the ideal case
      • otherwise => the task is scheduled on one node while the data lives on another, and the data has to be sent over the network before it can be processed

Inside an RDD you can use Scala's map and other higher-order functions; a small sketch of these points follows below.
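A minimal sketch of the partition-related points above, assuming a SparkContext sc; the data and partition counts are arbitrary.

import org.apache.spark.HashPartitioner

// a key/value RDD spread over 3 partitions
val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), 3)
println(kv.getNumPartitions)   // 3: the RDD is backed by a list of partitions

// RDDs are immutable: partitionBy returns a new RDD, kv itself is unchanged
val hashed = kv.partitionBy(new HashPartitioner(2))
println(hashed.partitioner)    // Some(HashPartitioner) - only key/value RDDs carry a Partitioner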

RDD operations:

Build a Spark Core job in IDEA.

Add the dependency:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.2.1</version>
</dependency>

In MapReduce, the program entry point is a Job.

Initializing Spark:

  • SparkContext => the entry point of a Spark Core program

  • SparkConf => describes the details of the Spark app

    • AppName => the job name
    • Master => where the job runs, i.e. the run mode of the Spark job
      • local, yarn, standalone, k8s, mesos
      • in production: yarn, k8s, local
      • for testing: local
      • a Spark application can have only one SparkContext
  • How to specify the Master (run mode) — a sketch follows this list:

    • local[K] mode

      • K is the number of threads
    • standalone => spark://HOST:PORT

    • yarn, two modes:

      • client mode
      • cluster mode
    • k8s

      • k8s://HOST:PORT
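A minimal sketch of the initialization, assuming the spark-core dependency above is on the classpath; the app name and master value are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // SparkConf describes the app: name + where it runs
    val conf = new SparkConf()
      .setAppName("first-spark-app")   // job name (placeholder)
      .setMaster("local[2]")           // local mode with 2 threads; use yarn/k8s outside of testing

    // SparkContext is the entry point of a Spark Core program (only one per application)
    val sc = new SparkContext(conf)

    println(sc.parallelize(1 to 10).sum())

    sc.stop()
  }
}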

Programming with RDDs

Creating an RDD (both ways are sketched below):

parallelize an existing collection

reference a dataset in an external storage system: HDFS, HBase, or another data store

External storage:
HDFS, local filesystem, HBase, S3, COS, ...
Data file types:
text files, SequenceFiles, and any other Hadoop InputFormat.
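A short sketch of both ways of creating an RDD, assuming a SparkContext sc; the HDFS path is a placeholder.

// 1) parallelize an existing collection
val fromCollection = sc.parallelize(Seq("spark", "yarn", "kafka"))

// 2) reference a dataset in an external storage system (a text file on HDFS here; local paths work too)
val fromFile = sc.textFile("hdfs://bigdata3:9000/spark/input/words.txt")

println(fromCollection.count())
println(fromFile.first())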

Deploying Spark:

Spark does not have to be deployed as a distributed service (compare with Hive).

Spark does support a distributed deployment => standalone mode.

Steps: unpack => create a symlink => source the environment file.

spark-shell is the Spark Core shell script.

Example: spark-shell --master local[2]

Start spark-shell to try out code:

  • web UI => one per Spark application, e.g. http://bigdata32:4040
  • the --master option => which mode the spark shell runs in
  • --name can be used to change the spark shell's name

The options in more detail:

spark-shell :
--master            where the Spark job runs (execution environment)
--deploy-mode       in yarn mode, choose client or cluster
--class             the main class (fully qualified name) inside the job jar
--name              the name of the Spark job
--jars              third-party dependency jars
--conf              Spark configuration parameters
extra yarn options:
--num-executors     how many executors to request
--executor-memory   how much memory per executor to request
--executor-cores    how many cores per executor to request
--queue             which yarn queue the job runs in

The interactive spark-shell command calls spark-submit under the hood.
spark-submit is the script developers mainly use to submit their own Spark jobs.

spark-shell:
spark-submit \
--class org.apache.spark.repl.Main \
--name "Spark shell" "$@"

spark-shell --master "local[2]":
spark-submit \
--class org.apache.spark.repl.Main \
--name "Spark shell" --master "local[2]"
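A hedged example of submitting a user-developed job to yarn with the resource options above; the jar path, main class, and queue name are placeholders, not from the original notes.

spark-submit \
--master yarn \
--deploy-mode cluster \
--name "my-spark-job" \
--class com.example.MySparkApp \
--num-executors 2 \
--executor-memory 2g \
--executor-cores 2 \
--queue default \
/home/hadoop/app/my-spark-job.jar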

Operators:

filter:

scala> test.collect
res15: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171...

scala> test.filter(_>999).collect
res16: Array[Int] = Array(1000)

mapPartitionsWithIndex: basically the same as mapPartitions, except that the partition index is passed in as well

val rdd = sc.makeRDD(List(1,2,3,4),numSlices = 2) // 2 partitions
scala> rdd.mapPartitionsWithIndex((index,iter)=> {if(index ==1) { iter } else { Nil.iterator}}).collect.foreach(println)
3
4
scala> rdd.mapPartitionsWithIndex((index,iter)=> {iter.map(num => {(index , num)})}).collect.foreach(println)
(0,1)
(0,2)
(1,3)
(1,4)
//---------------------------------------------------------mapPartitions
scala> test1.mapPartitions(_.map(_._2),true).collect.foreach(print)
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000

A common use case is inspecting which elements sit in each partition.

We normally do not need to care which partition each RDD element ends up in; partitioning is covered later.

mapValues: works only on key-value RDDs; it applies the function to each value and leaves the key untouched.

scala> test1.mapValues(_+3).collect.foreach(print)
(0,4)(0,5)(0,6)(0,7)(0,8)(0,9)(0,10)(0,11)(0,12)(0,13)(0,14)(0,15)(0,16)(0,17)(0,18)(0,19)(0,20)(0,21)(0,22)(0,23)(0,24)(0,25)(0,26)(0,27)(0,28)(0,29)(0,30)(0,31)(0,32)(0,33)(0,34)(0,35)(0,36)(0,37)(0,38)(0,39)(0,40)(0,41)(0,42)(0,43)(0,44)(0,45)(0,46)(0,47)(0,48)(0,49)(0,50)(0,51)(0,52)(0,53)(0,54)(0,55)(0,56)(0,57)(0,58)(0,59)(0,60)(0,61)(0,62)(0,63)(0,64)(0,65)(0,66)(0,67)(0,68)(0,69)(0,70)(0,71)(0,72)(0,73)(0,74)(0,75)(0,76)(0,77)(0,78)(0,79)(0,80)(0,81)(0,82)(0,83)(0,84)(0,85)(0,86)(0,87)(0,88)(0,89)(0,90)(0,91)(0,92)(0,93)(0,94)(0,95)(0,96)(0,97)(0,98)(0,99)(0,100)(0,101)(0,102)(0,103)(0,104)(0,105)(0,106)(0,107)(0,108)(0,109)(0,110)(0,111)(0,112)(0,113)(0,114)(0,115)(0,116)(0,117)(0,118)(0,119)(0,120)(0,121)(0,122)(0,123)(0,124)(0,125)(0,126)(0,127)(0,128)(0,129)(0,130)(0,131)(0,132)(0,133)(0,134)(0,135)(0,136)(0,137)(0,138)(0,139)(0,140)(0,141)(0,142)(0,143)(0,144)(0,145)(0,146)(0,147)(0,148)(0,149)(0,150)(0,151)(0,152)(0,153)(0,154)(0,155)(0,156)(0,157)(0,158)(0,159)(0,160)(0,161)(0,162)(0,163)(0,164)(0,165)(0,166)(0,167)(0,168)(0,169)(0,170)(0,171)(0,172)(0,173)(0,174)(0,175)(0,176)(0,177)(0,178)(0,179)(0,180)(0,181)(0,182)(0,183)(0,184)(0,185)(0,186)(0,187)(0,188)(0,189)(0,190)(0,191)(0,192)(0,193)(0,194)(0,195)(0,196)(0,197)(0,198)(0,199)(0,200)(0,201)(0,202)(0,203)(0,204)(0,205)(0,206)(0,207)(0,208)(0,209)(0,210)(0,211)(0,212)(0,213)(0,214)(0,215)(0,216)(0,217)(0,218)(0,219)(0,220)(0,221)(0,222)(0,223)(0,224)(0,225)(0,226)(0,227)(0,228)(0,229)(0,230)(0,231)(0,232)(0,233)(0,234)(0,235)(0,236)(0,237)(0,238)(0,239)(0,240)(0,241)(0,242)(0,243)(0,244)(0,245)(0,246)(0,247)(0,248)(0,249)(0,250)(0,251)(0,252)(0,253)(0,254)(0,255)(0,256)(0,257)(0,258)(0,259)(0,260)(0,261)(0,262)(0,263)(0,264)(0,265)(0,266)(0,267)(0,268)(0,269)(0,270)(0,271)(0,272)(0,273)(0,274)(0,275)(0,276)(0,277)(0,278)(0,279)(0,280)(0,281)(0,282)(0,283)(0,284)(0,285)(0,286)(0,287)(0,288)(0,289)(0,290)(0,291)(0,292)(0,293)(0,294)(0,295)(0,296)(0,297)(0,298)(0,299)(0,300)(0,301)(0,302)(0,303)(0,304)(0,305)(0,306)(0,307)(0,308)(0,309)(0,310)(0,311)(0,312)(0,313)(0,314)(0,315)(0,316)(0,317)(0,318)(0,319)(0,320)(0,321)(0,322)(0,323)(0,324)(0,325)(0,326)(0,327)(0,328)(0,329)(0,330)(0,331)(0,332)(0,333)(0,334)(0,335)(0,336)(0,337)(0,338)(0,339)(0,340)(0,341)(0,342)(0,343)(0,344)(0,345)(0,346)(0,347)(0,348)(0,349)(0,350)(0,351)(0,352)(0,353)(0,354)(0,355)(0,356)(0,357)(0,358)(0,359)(0,360)(0,361)(0,362)(0,363)(0,364)(0,365)(0,366)(0,367)(0,368)(0,369)(0,370)(0,371)(0,372)(0,373)(0,374)(0,375)(0,376)(0,377)(0,378)(0,379)(0,380)(0,381)(0,382)(0,383)(0,384)(0,385)(0,386)(0,387)(0,388)(0,389)(0,390)(0,391)(0,392)(0,393)(0,394)(0,395)(0,396)(0,397)(0,398)(0,399)(0,400)(0,401)(0,402)(0,403)(0,404)(0,405)(0,406)(0,407)(0,408)(0,409)(0,410)(0,411)(0,412)(0,413)(0,414)(0,415)(0,416)(0,417)(0,418)(0,419)(0,420)(0,421)(0,422)(0,423)(0,424)(0,425)(0,426)(0,427)(0,428)(0,429)(0,430)(0,431)(0,432)(0,433)(0,434)(0,435)(0,436)(0,437)(0,438)(0,439)(0,440)(0,441)(0,442)(0,443)(0,444)(0,445)(0,446)(0,447)(0,448)(0,449)(0,450)(0,451)(0,452)(0,453)(0,454)(0,455)(0,456)(0,457)(0,458)(0,459)(0,460)(0,461)(0,462)(0,463)(0,464)(0,465)(0,466)(0,467)(0,468)(0,469)(0,470)(0,471)(0,472)(0,473)(0,474)(0,475)(0,476)(0,477)(0,478)(0,479)(0,480)(0,481)(0,482)(0,483)(0,484)(0,485)(0,486)(0,487)(0,488)(0,489)(0,490)(0,491)(0,492)(0,493)(0,494)(0,495)(0,496)(0,497)(0,498)(0,499)(0,500)(0,501)(0,502)(0,503)(1,504)(1,505)(1,506)(1,507)(1,508)(1,509)(1,510)(1,511)(1,512)(1,513)(1,514)(1,515)(1,516)(1,517)(1,518)(1,519)(1,520)(1,521)(1,522)(1,523)(1,524)(1,525)(1,
526)(1,527)(1,528)(1,529)(1,530)(1,531)(1,532)(1,533)(1,534)(1,535)(1,536)(1,537)(1,538)(1,539)(1,540)(1,541)(1,542)(1,543)(1,544)(1,545)(1,546)(1,547)(1,548)(1,549)(1,550)(1,551)(1,552)(1,553)(1,554)(1,555)(1,556)(1,557)(1,558)(1,559)(1,560)(1,561)(1,562)(1,563)(1,564)(1,565)(1,566)(1,567)(1,568)(1,569)(1,570)(1,571)(1,572)(1,573)(1,574)(1,575)(1,576)(1,577)(1,578)(1,579)(1,580)(1,581)(1,582)(1,583)(1,584)(1,585)(1,586)(1,587)(1,588)(1,589)(1,590)(1,591)(1,592)(1,593)(1,594)(1,595)(1,596)(1,597)(1,598)(1,599)(1,600)(1,601)(1,602)(1,603)(1,604)(1,605)(1,606)(1,607)(1,608)(1,609)(1,610)(1,611)(1,612)(1,613)(1,614)(1,615)(1,616)(1,617)(1,618)(1,619)(1,620)(1,621)(1,622)(1,623)(1,624)(1,625)(1,626)(1,627)(1,628)(1,629)(1,630)(1,631)(1,632)(1,633)(1,634)(1,635)(1,636)(1,637)(1,638)(1,639)(1,640)(1,641)(1,642)(1,643)(1,644)(1,645)(1,646)(1,647)(1,648)(1,649)(1,650)(1,651)(1,652)(1,653)(1,654)(1,655)(1,656)(1,657)(1,658)(1,659)(1,660)(1,661)(1,662)(1,663)(1,664)(1,665)(1,666)(1,667)(1,668)(1,669)(1,670)(1,671)(1,672)(1,673)(1,674)(1,675)(1,676)(1,677)(1,678)(1,679)(1,680)(1,681)(1,682)(1,683)(1,684)(1,685)(1,686)(1,687)(1,688)(1,689)(1,690)(1,691)(1,692)(1,693)(1,694)(1,695)(1,696)(1,697)(1,698)(1,699)(1,700)(1,701)(1,702)(1,703)(1,704)(1,705)(1,706)(1,707)(1,708)(1,709)(1,710)(1,711)(1,712)(1,713)(1,714)(1,715)(1,716)(1,717)(1,718)(1,719)(1,720)(1,721)(1,722)(1,723)(1,724)(1,725)(1,726)(1,727)(1,728)(1,729)(1,730)(1,731)(1,732)(1,733)(1,734)(1,735)(1,736)(1,737)(1,738)(1,739)(1,740)(1,741)(1,742)(1,743)(1,744)(1,745)(1,746)(1,747)(1,748)(1,749)(1,750)(1,751)(1,752)(1,753)(1,754)(1,755)(1,756)(1,757)(1,758)(1,759)(1,760)(1,761)(1,762)(1,763)(1,764)(1,765)(1,766)(1,767)(1,768)(1,769)(1,770)(1,771)(1,772)(1,773)(1,774)(1,775)(1,776)(1,777)(1,778)(1,779)(1,780)(1,781)(1,782)(1,783)(1,784)(1,785)(1,786)(1,787)(1,788)(1,789)(1,790)(1,791)(1,792)(1,793)(1,794)(1,795)(1,796)(1,797)(1,798)(1,799)(1,800)(1,801)(1,802)(1,803)(1,804)(1,805)(1,806)(1,807)(1,808)(1,809)(1,810)(1,811)(1,812)(1,813)(1,814)(1,815)(1,816)(1,817)(1,818)(1,819)(1,820)(1,821)(1,822)(1,823)(1,824)(1,825)(1,826)(1,827)(1,828)(1,829)(1,830)(1,831)(1,832)(1,833)(1,834)(1,835)(1,836)(1,837)(1,838)(1,839)(1,840)(1,841)(1,842)(1,843)(1,844)(1,845)(1,846)(1,847)(1,848)(1,849)(1,850)(1,851)(1,852)(1,853)(1,854)(1,855)(1,856)(1,857)(1,858)(1,859)(1,860)(1,861)(1,862)(1,863)(1,864)(1,865)(1,866)(1,867)(1,868)(1,869)(1,870)(1,871)(1,872)(1,873)(1,874)(1,875)(1,876)(1,877)(1,878)(1,879)(1,880)(1,881)(1,882)(1,883)(1,884)(1,885)(1,886)(1,887)(1,888)(1,889)(1,890)(1,891)(1,892)(1,893)(1,894)(1,895)(1,896)(1,897)(1,898)(1,899)(1,900)(1,901)(1,902)(1,903)(1,904)(1,905)(1,906)(1,907)(1,908)(1,909)(1,910)(1,911)(1,912)(1,913)(1,914)(1,915)(1,916)(1,917)(1,918)(1,919)(1,920)(1,921)(1,922)(1,923)(1,924)(1,925)(1,926)(1,927)(1,928)(1,929)(1,930)(1,931)(1,932)(1,933)(1,934)(1,935)(1,936)(1,937)(1,938)(1,939)(1,940)(1,941)(1,942)(1,943)(1,944)(1,945)(1,946)(1,947)(1,948)(1,949)(1,950)(1,951)(1,952)(1,953)(1,954)(1,955)(1,956)(1,957)(1,958)(1,959)(1,960)(1,961)(1,962)(1,963)(1,964)(1,965)(1,966)(1,967)(1,968)(1,969)(1,970)(1,971)(1,972)(1,973)(1,974)(1,975)(1,976)(1,977)(1,978)(1,979)(1,980)(1,981)(1,982)(1,983)(1,984)(1,985)(1,986)(1,987)(1,988)(1,989)(1,990)(1,991)(1,992)(1,993)(1,994)(1,995)(1,996)(1,997)(1,998)(1,999)(1,1000)(1,1001)(1,1002)(1,1003)

flatMap: the same as flatMap on Scala collections.

scala> test.flatMap(x=>x.to(3)).collect
res98: Array[Int] = Array(1, 2, 3, 2, 3, 3)


Other operators

glom: turns the data of each partition into an array; often handier than mapPartitionsWithIndex for inspecting partitions.

scala> test.glom.collect
res62: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, ...

sample: random sampling.

scala> (test.sample(false,0.77)).collect.foreach(print)
236781112141516182022232425262728303132333435363738394041434446474850515254555657596061626364666768697172747576777879818285868789919293969798991001011021031041051061071081091101111121131161181191201211221241251271281301311321331341351361371381391401411441461471491501531541551561611621631641651661671681691701711721731741751761781791801811831841861871891901911931951961971982002012022042052072082092112122132142162172182202212222242252262272282302322332352362372382402412432442452472482502512522532542552562572582602612632642652662672682692702712732742752762772782812822832842892902922932942952962972982993003013023033043053063073083093113123133143153163173183193203213223233263293303313323333343353363373383393403413423443463473483493513523543573583593603613623633663673683693703713723743763773783793803813823833873903913923943954004014024034044054074084094114124134144154164174184194224234244264274284294304314324334344354364374384394404414424454464494504514524534544554564574584604614634644654664674684694714724744754764794804814824834844854864874884904914924934944954974984995005015025065075085105115125135145165175185195225235245255265275295305315325355375385395405415425445455465475485495515525545565575585595605625635645655675685715735755765795805815825835845855895905925935955965975985996016026036046066086096106116126136146156166176206216226236246256266276296316326336346356366376386396406416436446456466476486496506516536546556576586596606616626636646676696706716726736746756766776796806816836846856866876896906916926936946976986997007037047057067087097107117127157167177187197217227237267277297307327337347357367377387417427447457467477487497517527537547557567577587597607627637667677697707727737747757777787797807817827837847867887897907917937957967987998018028038048058068078098108128138148178198208228248258268278288308318328348358368378388408418428438458468478488498508528538548558588598608618628638648658668678688698708718758768778788798808828838848878888898908918958988999009019029039049059069079099129149159179189199209219229249259269279289299309319329339349359369379389399419429439449469489499509519549559569589599609619629639649659679689699709729749759769799809819829839859869879889899909919939949969979989991000

union: simple concatenation of two RDDs, without deduplication

intersection: the intersection of two RDDs

subtract: elements that appear in a but not in b

.collect: pulls the result back to the console (driver) as an array

distinct: deduplication, same effect as in SQL => under the hood it is implemented with reduceByKey

scala> val dd = sc.parallelize(List(1,2,2,3,4,5,6,7,8))
dd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> dd.collect.foreach(print)
122345678
scala> dd.collect.foreach(println)
1
2
2
3
4
5
6
7
8

scala> dd.distinct.collect.foreach(println)
4
6
8
2
1
3
7
5

Key-value operators: groupByKey => groups the values by key, just like the grouping step in wordcount => avoid it in real jobs, it is inefficient and inflexible

Pre-aggregation (map-side combine):

  • MR: input => map => combine (the tuning step) => reduce => output
  • combine => pre-aggregation: data is aggregated by the map output key before the shuffle

mapSideCombine = false means pre-aggregation is disabled.

groupByKey generally runs with pre-aggregation disabled, while reduceByKey enables it.

scala> test1.collect
res25: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (0,4), (0,5), (0,6), (0,7), (0,8), (0,9), (0,10), (0,11), (0,12), (0,13), (0,14), (0,15), (0,16), (0,17), (0,18), (0,19), (0,20), (0,21), (0,22), (0,23), (0,24), (0,25), (0,26), (0,27), (0,28), (0,29), (0,30), (0,31), (0,32), (0,33), (0,34), (0,35), (0,36), (0,37), (0,38), (0,39), (0,40), (0,41), (0,42), (0,43), (0,44), (0,45), (0,46), (0,47), (0,48), (0,49), (0,50), (0,51), (0,52), (0,53), (0,54), (0,55), (0,56), (0,57), (0,58), (0,59), (0,60), (0,61), (0,62), (0,63), (0,64), (0,65), (0,66), (0,67), (0,68), (0,69), (0,70), (0,71), (0,72), (0,73), (0,74), (0,75), (0,76), (0,77), (0,78), (0,79), (0,80), (0,81), (0,82), (0,83), (0,84), (0,85), (0,86), (0,87), (0,88), (0,89), (0,90), (0,91), (0,92), (0,93), (0,...
scala> test1.groupByKey.collect.foreach(print)
(0,CompactBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500))(1,CompactBuffer(501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 
726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000))

reducebykey:对比groupby是相当于可以统计之后进行计算的

scala> test1.reduceByKey((x,y)=>{x+y}).collect.foreach(print)
(0,125250)(1,375250)
--------------------------------------x+y means: once the shuffle has pulled a key's values together, they are added up
--------------------------------------implementing distinct with reduceByKey
scala> test1.reduceByKey((x,_)=>{x}).map(_._1).collect.foreach(println)
0
1

groupBy: grouping by a custom key

scala> test.groupBy(x=>{if(x%2==0){"2e"}else{"e2"}}).collect
res99: Array[(String, Iterable[Int])] = Array((e2,CompactBuffer(1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 169, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 193, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 217, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 241, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 265, 267, 269, 271, 273, 275, 277, 279, 281, 283, 285, 287, 289, 291, 293, 295, 297, 299, 301, 303, 30...

sortByKey: sorts by key, but only within each partition; for a global sort the RDD must have a single partition. Pass false instead of true for descending order.


scala> val r2 = sc.parallelize(List(("zuan",18),("kaige",20),("zihang",21)),1)
r2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[69] at parallelize at <console>:24

scala> r2.sortByKey(true)
res89: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[70] at sortByKey at <console>:25

scala> res89.collect
res90: Array[(String, Int)] = Array((kaige,20), (zihang,21), (zuan,18))

Custom sorting: sortBy

scala> r2.sortBy(x=>x._2,true).collect
res92: Array[(String, Int)] = Array((zuan,18), (kaige,20), (zihang,21))


join: joins two pair RDDs by key by default => implemented on top of cogroup

scala> val r3 = sc.parallelize(List(("zuan","广西"),("kaige","中国"),("zihang","黑龙江")))
r3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[77] at parallelize at <console>:24

scala> r1.join(r3).collect
res93: Array[(String, (Int, String))] = Array((zuan,(18,广西)), (kaige,(20,中国)), (zihang,(21,黑龙江)))

-------------------------cogroup
scala> r1.cogroup(r3).collect
res94: Array[(String, (Iterable[Int], Iterable[String]))] = Array((zuan,(CompactBuffer(18),CompactBuffer(广西))), (kaige,(CompactBuffer(20),CompactBuffer(中国))), (zihang,(CompactBuffer(21),CompactBuffer(黑龙江))))



Both join by key.

cogroup returns the grouped values as collections in the value position,

while join returns the plain values.

Partitioning rule (hash partitioning):

partition 0: element 4, because 4 % 4 = 0
partition 1: element 9, because 9 % 4 = 1
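A small sketch of that rule using Spark's HashPartitioner (for an Int key the hash code is the value itself, so this is just key % numPartitions):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)   // 4 partitions
partitioner.getPartition(4)                // 4 % 4 = 0 -> partition 0
partitioner.getPartition(9)                // 9 % 4 = 1 -> partition 1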

Action operators: the operators that actually trigger a job.

collect => pulls the RDD's data back to the console, i.e. to the driver

foreach

foreachPartition: processes one whole partition at a time => the preferred way to write to MySQL, because it opens far fewer connections (see the JDBC sketch after the example below)

  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
-----------------------------------usage
scala> test.foreachPartition(ax=>ax.foreach(print))
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000
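As mentioned above, foreachPartition is the preferred way to write to MySQL. A hedged sketch, assuming a pair RDD wordCounts of type (String, Int); the table wordcount(word, cnt), the JDBC URL and the credentials are made-up placeholders:

import java.sql.DriverManager

wordCounts.foreachPartition(iter => {
  // one JDBC connection per partition instead of one per record
  val conn = DriverManager.getConnection("jdbc:mysql://bigdata2:3306/try", "root", "password")
  val ps = conn.prepareStatement("insert into wordcount(word, cnt) values (?, ?)")
  iter.foreach { case (word, cnt) =>
    ps.setString(1, word)
    ps.setInt(2, cnt)
    ps.executeUpdate()
  }
  ps.close()
  conn.close()
})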

reduce: like reduce in MapReduce; it is an action, so you cannot chain collect after it, the result is already a plain value on the driver.

scala> test.reduce((x,y)=>x+y)
res108: Int = 500500

first: returns the first element of the dataset; implemented on top of take.

scala> test.first()
res109: Int = 1
-------------------------take
scala> test.take(77)
res112: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77)


takeOrdered: returns the first n elements in ascending order.

scala> test.takeOrdered(2)
res127: Array[Int] = Array(1, 2)

scala> test.takeOrdered(55)
res128: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55)

top: returns the top n elements; implemented on top of takeOrdered => fine when the amount of data returned is small.

scala> test.top(1)
res113: Array[Int] = Array(1000)

scala> test.top(5)
res114: Array[Int] = Array(1000, 999, 998, 997, 996)

saveAsTextFile

saveAsSequenceFile

saveAsObjectFile

countByKey: counts the number of records per key.

scala> test1.countByKey
res125: scala.collection.Map[Int,Long] = Map(0 -> 500, 1 -> 500)

collectAsMap: collects a pair RDD to the driver as a Map, keeping one value per key (so it looks a bit like countByKey, but it does not count).

count: returns the number of elements in the RDD.
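A tiny sketch of the difference between the three, on a made-up pair RDD:

val kv = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
kv.countByKey()      // Map(a -> 2, b -> 1): number of records per key
kv.collectAsMap()    // a Map with one value per key (for duplicate keys the last one seen wins)
kv.count()           // 3: total number of elements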

How to tell an action from an ordinary transformation => the action's implementation ends in a runJob call in the source

=> i.e. collect and the other action operators trigger a job when called; a paraphrased example follows.
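For reference, an action like count boils down to a call to sc.runJob; roughly (paraphrased, not copied verbatim, from Spark's RDD source):

// paraphrased; the real method lives in org.apache.spark.rdd.RDD
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum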

Case:

A table:
name price num
diar 300 1000
香奈儿 4000 2
螺蛳粉 200 98
30显卡 200 10
-----------------------------------------------sort by price [desc]; if prices are equal, sort by num (stock) [asc]

Solution:

Data types to carry a row: tuple [recommended], class, case class [recommended]
With a tuple:
-----------------------------------------------------------------
val value = sc.parallelize(List("diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toInt
val store=strings(2).toInt // make the stock numeric so the secondary sort is numeric, not lexicographic
(name,price,store)
})

etlData.sortBy(x => ( -x._2 , x._3)).saveAsTextFile("hdfs://bigdata3:9000/data")
----------------------------------------------------------class

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail, so keep it commented out

val value = sc.parallelize(List("diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
new skuu(name,price,store)
})
etlData.sortBy(x=>(-x.d,x.str1)).collect.foreach(print(_))

sc.stop()
}
class skuu(val str: String,val d: Double,val str1: Int) extends Serializable{
override def toString: String =str + "\t" + d + "\t" + str1
}
}
------------------------------------------------------------------------------------------------------case class
Why case classes are nicer => toString and hashCode are generated for you, serialization comes for free, and no new is needed to instantiate.
package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List("diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
skuu(name,price,store)
})
etlData.sortBy(x=>(-x.d,x.str1)).collect.foreach(print(_))

sc.stop()
}
case class skuu(val str: String,val d: Double,val str1: Int)
}


When you call saveAsTextFile("hdfs://bigdata3:9000/data/test")

it generates one output file per partition.

Requirement: do the comparison inside the class itself, by extending Ordered:

package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List(
"diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
skuu(name,price,store)
})
etlData.sortBy(x => x).collect.foreach(print(_)) // sorts with skuu's compare, via the implicit Ordering available for Ordered types

sc.stop()
}
case class skuu(val str: String,val d: Double,val str1: Int) extends Ordered[skuu]{
override def compare(that: skuu): Int = {
if (this.d == that.d){
this.str1 - that.str1 // same price: ascending by stock
}else {
that.d.compare(this.d) // descending by price; (this.d - that.d).toInt would lose differences smaller than 1
}
}
}
}



Implicit conversion

package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List(
"diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
skuu(name,price,store)
})
etlData.sortBy(x=>(-x.d,x.str1)).collect.foreach(print(_))

sc.stop()

implicit def skutooreder(sku:skuu):Ordered[skuu]={ // "implicit", not "implicitly": with this in scope, sortBy(x => x) would also work without skuu extending Ordered
new Ordered[skuu]{
override def compare(that: skuu): Int = {
if (sku.d == that.d){
sku.str1-that.str1
}else {
-(sku.d - that.d).toInt
}
}
}

}


}

case class skuu(val str: String,val d: Double,val str1: Int)
}



Example:

Data:
word show click
a,2,3
b,1,1
c,4,5
f,5,6
g,7,8
k,8,9
a,1,2
a,1,1
a,4,5
b,5,6
------------------------------------------------------------------------------
package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {

def sub(name: String, tuple: (Double, Int))={
(name , tuple)
}

def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List(
"a,2,3",
"b,1,1",
"c,4,5",
"f,5,6",
"g,7,8",
"k,8,9",
"a,1,2",
"a,1,1",
"a,4,5",
"b,5,6"),1)

val etlData = value.map(x=>{
val strings=x.split(",")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
sub(name,(price,store))
// (name,(price,store))
})

etlData.reduceByKey((x,y)=>{
(x._1+y._1,x._2+y._2)
}).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2).collect.foreach(println) // without an action after the map, the job never runs
// etlData.reduceByKey((x,y)=>{(x._1+y._1,x._2+y._2)}).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2+"\t")

sc.stop()
}
}

Theory:

Spark architecture:

  • Application => a Spark job => consists of two kinds of processes: a driver program and executors on the cluster
    • driver: runs the SparkContext
    • executor: runs tasks and keeps data in memory and on disk
  • web UI
  • SparkContext
  • application jar => the jar built from your code => contains the Spark job => contains the main method => deployed to a server once development is done
  • driver program => runs the main method in the jar => creates the SparkContext
  • Cluster manager => manages the cluster => acquires resources for the job
  • Deploy mode => when the job is submitted to YARN
    • cluster mode: the driver runs inside the cluster => on a machine managed by YARN
    • client mode: the driver runs outside the cluster
  • Worker node => a worker => runs the job's code == node manager
  • executor => roughly YARN's container => every Spark application has its own executors
  • task => partition => rdd
  • job => a Spark job => a job inside an application => one application may contain several jobs
  • stage => a sub-unit of a job; stages depend on each other
  • one application contains 1-n jobs, one job contains 1-n stages
  • one stage contains 1-n tasks
  • tasks correspond one-to-one to the RDD's partitions

How a Spark application runs

The SparkContext connects to the cluster manager.

The cluster manager allocates resources to the job.

Once Spark is connected to the cluster,

it starts executors => for storage and computation.

The SparkContext ships the code to the executors and sends them tasks to run.

Every application has its own executors.

An executor is roughly a container => resource isolation => scheduling isolation.

Data produced by different applications cannot be shared directly, but it can be once it is written to external storage.

spark-shell, simply put => submits many jobs => and relies on that external storage idea.

The user connects to the cluster without noticing it.

The driver monitors the executors' life cycle => there is communication between them.

As long as the driver can reach the cluster it works => which means an external driver works too.

Recommendation: keep the driver close to the worker nodes to cut network transfer time, i.e. run it local to the cluster.

Integrating Spark with YARN

In Spark's conf directory, first run cp spark-env.sh.template spark-env.sh

then vim spark-env.sh and add

HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop and YARN_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop

After adding these, restart Spark and run spark-shell --master yarn

and it works.

With no further configuration this typically grabs about 5G of memory.

Case:

user behaviour analysis with Spark

    val value = sc.parallelize(List(
"u01,英雄联盟|绝活&职业|云顶|奴神,1,1",
"u01,英雄联盟|绝活&职业|云顶|金潺潺,1,1",
"u01,英雄联盟|绝活&职业|云顶|带粉上车,1,0",
"u02,星秀|好声音|女团|三年一班,1,1",
"u02,星秀|好声音|女团|奴神,1,1",
"u02,星秀|好声音|女团|将神,1,0",
"u02,星秀|好声音|女团|西索,1,1"),1)

val etlData = value.flatMap(x=>{
val strings=x.split(",")
val name=strings(0)
val type_log_total=strings(1)
val show=strings(2).toInt
val click=strings(3).toInt
val type_log_total_ni=type_log_total.split("\\|")
type_log_total_ni.map(x=>{
((name,x),(show,click))
})
//sub(name,(price,store))
//(name,(price,store))
})

// etlData.reduceByKey((x,y)=>{
// (x._1+y._1,x._2+y._2)
// }).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2+"\t")
//etlData.reduceByKey((x,y)=>{(x._1+y._1,x._2+y._2)}).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2+"\t")

etlData.reduceByKey((x,y)=>{
(x._1+y._1,x._2+y._2)
}).collect().foreach(println)

Spark persistence:

  • RDD persistence => a pipeline may build many RDDs, say 100; to save time and recomputation we can persist, for example, RDD 99 and keep working from it
  • persisted data is fault tolerant (lost partitions are recomputed)
  • by default it is kept in memory
  • it works per partition of the RDD
  • persistence is lazy: the data is cached the next time it is actually computed, and later operations load it from the cache
  • the default storage level is memory only
  • both persist and cache work => they are lazy
  • when memory is not enough the data cannot all be cached; serialization can be configured to reduce the footprint
  • in general you trade memory against CPU => a 4-step choice:
    • official default => MEMORY_ONLY
    • MEMORY_ONLY_SER => saves space at the cost of CPU
    • spill to disk => MEMORY_AND_DISK / MEMORY_AND_DISK_SER => slower than staying purely in memory
    • fault tolerance => replicated levels are safer, but large data puts pressure on the disks
  • removing persisted data: eviction is automatic (LRU), or call unpersist(true) on the RDD, which takes effect immediately (see the sketch after this list)
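A minimal sketch of the storage levels and unpersist mentioned in the list above (the path and the level choice are just examples):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://bigdata3:9000/3.log")
logs.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: slower, but smaller
logs.count()                                 // the first action materializes the cache
logs.count()                                 // this one reads from the cache
logs.unpersist(true)                         // blocking = true: freed immediately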

Starting spark-shell is itself a Spark application.

scala> val test = sc.parallelize("hdfs://bigdata3:9000/flume/events/2022-12-13/events.1670898548750.log")
test: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> test.collect
collect collectAsync

scala> test.collect
res0: Array[Char] = Array(h, d, f, s, :, /, /, b, i, g, d, a, t, a, 3, :, 9, 0, 0, 0, /, f, l, u, m, e, /, e, v, e, n, t, s, /, 2, 0, 2, 2, -, 1, 2, -, 1, 3, /, e, v, e, n, t, s, ., 1, 6, 7, 0, 8, 9, 8, 5, 4, 8, 7, 5, 0, ., l, o, g)

scala> test.persist
res1: test.type = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> test.ca
cache cartesian

scala> test.cache
res2: test.type = ParallelCollectionRDD[0] at parallelize at <console>:23
---------------------------------------------------------------------------------------Java serialization (the default)
val names = Array[String]("刘子航","李信","花木兰","达摩","耀","貂蝉","吕布")
val gar = Array[String]("男","女")
val addres= Array[String]("山东","广西","大连")


val value1 = sc.parallelize(1 to 300000)

val value2 = new ArrayBuffer[persion]()
val value3 = value1.map(x => {
val name = names(Random.nextInt(names.length)) // use the full length; nextInt(6) would never pick the last name
val s = gar(Random.nextInt(gar.length))
val s1 = addres(Random.nextInt(addres.length))
value2 += (persion(name, s, s1))
})

value3.persist(StorageLevel.MEMORY_ONLY_SER)
value3.count()


case class persion(name: String,gre:String,add:String){

}
-------------------------------------------------------Kryo serialization
Faster than Java serialization, but it does not cover every type, and the classes you serialize should be registered before use.

Enable it on the SparkConf first:
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
then register the case classes / classes you use:
conf.registerKryoClasses(Array(classOf[Info]))
everything else stays the same as above (see the SparkConf sketch after this block).

val names = Array[String]("刘子航","李信","花木兰","达摩","耀","貂蝉","吕布")
val gar = Array[String]("男","女")
val addres= Array[String]("山东","广西","大连")


val value1 = sc.parallelize(1 to 300000)

val value2 = new ArrayBuffer[persion]()
val value3 = value1.map(x => {
val name = names(Random.nextInt(names.length)) // full length, as above
val s = gar(Random.nextInt(gar.length))
val s1 = addres(Random.nextInt(addres.length))
value2 += (persion(name, s, s1))
})

value3.persist(StorageLevel.MEMORY_ONLY_SER)
value3.count()


case class persion(name: String,gre:String,add:String){

}
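A minimal sketch of the Kryo setup described above, wired into the SparkConf (the app name and master are placeholders; persion is the case class from the example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .setMaster("local[2]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[persion]))   // register every class you serialize

val sc = new SparkContext(conf)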

Lineage and dependencies

Lineage => the chain of transformations from one RDD to the next.

Dependencies =>

  • wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD => causes a shuffle => a new stage is created; each shuffle splits the job into two stages => anything that shuffles creates a stage boundary
  • narrow dependency: each partition of the parent RDD is used by at most one child partition => stays within a single stage => no shuffle

Additional operators:

repartition: repartition(num) redistributes the data => always shuffles => under the hood it calls coalesce(num, shuffle = true) => can either increase or decrease the number of partitions

coalesce: usually used to reduce the number of partitions; coalesce(num) is a narrow dependency => no shuffle by default => it cannot increase the partition count unless you pass shuffle = true (see the sketch below)

In production these are used to tune the parallelism of a computation.
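A quick sketch of the difference (partition counts only, no data shown):

val rdd = sc.parallelize(1 to 1000, 4)
rdd.getNumPartitions                               // 4
rdd.repartition(8).getNumPartitions                // 8: always shuffles, up or down
rdd.coalesce(2).getNumPartitions                   // 2: narrow dependency, no shuffle
rdd.coalesce(8).getNumPartitions                   // still 4: cannot grow without a shuffle
rdd.coalesce(8, shuffle = true).getNumPartitions   // 8: behaves like repartition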

Where does code run?

Ask whether the code operates on the RDD's elements: if it does (it sits inside the function passed to an operator), it runs on the executors; otherwise it runs on the driver.
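The classic illustration: the body of the function passed to an operator runs on the executors against each element, so mutating a driver-side variable there does not behave as expected (this is exactly what accumulators, below, are for):

var sum = 0                                // lives in the driver
sc.parallelize(1 to 100).foreach(x => {
  sum += x                                 // runs on executors, on serialized copies of sum
})
println(sum)                               // still 0 on the driver (in cluster mode)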

Example: top 2


val value = sc.parallelize(List(
"www.bvaidu,u01,20",
"www.githuba,u02,2",
"www.bvaidu,u02,100",
"www.bibi,u02,199",
"www.githuba,u01,100",
"www.githuba,u01,1",
"www.githuba,u01,10",
"www.bibi,u02,19",
"www.bibi,u01,199",
"www.baidu.com,uid01,1",
"www.baidu.com,uid01,10",
"www.baidu.com,uid02,3",
"www.baidu.com,uid02,5",
"www.github.com,uid01,11",
"www.github.com,uid01,10",
"www.github.com,uid02,30",
"www.github.com,uid02,50",
"www.bibili.com,uid01,110",
"www.bibili.com,uid01,10",
"www.bibili.com,uid02,2",
"www.bibili.com,uid02,3"),1)

val etlData = value.map(x=>{
val strings=x.split(",")
val yuming=strings(0)
val user=strings(1)
val cishu=strings(2).toInt
((yuming,user),(cishu))
//sub(name,(price,store))
//(name,(price,store))
})
etlData.reduceByKey((x,y)=>{
x+y
}).sortBy(x=> -x._2,true).map(x=>{
(x._1._2,(x._1._1,x._2))
}).groupByKey().map(x=>{
x._2.map(s=>{
(x._1,s._1,s._2)
}).take(2)}).saveAsTextFile("hdfs://bigdata3:9000/input/10.txt")
------------------------------------------------------------------------------simplified version
val value = sc.parallelize(List(
"www.bvaidu,u01,20",
"www.githuba,u02,2",
"www.bvaidu,u02,100",
"www.bibi,u02,199",
"www.githuba,u01,100",
"www.githuba,u01,1",
"www.githuba,u01,10",
"www.bibi,u02,19",
"www.bibi,u01,199",
"www.baidu.com,uid01,1",
"www.baidu.com,uid01,10",
"www.baidu.com,uid02,3",
"www.baidu.com,uid02,5",
"www.github.com,uid01,11",
"www.github.com,uid01,10",
"www.github.com,uid02,30",
"www.github.com,uid02,50",
"www.bibili.com,uid01,110",
"www.bibili.com,uid01,10",
"www.bibili.com,uid02,2",
"www.bibili.com,uid02,3"),1)

val etlData = value.map(x=>{
val strings=x.split(",")
val yuming=strings(0)
val user=strings(1)
val cishu=strings(2).toInt
((yuming,user),(cishu))
//sub(name,(price,store))
//(name,(price,store))
})
val value4 = etlData.map(x => {
(x._1._2)
}).distinct().collect()


for (elem <- value4){
etlData.filter(_._1._2 == elem).reduceByKey(_ + _).sortBy( -_._2).take(2).foreach(println(_))
}

Accumulators and broadcast variables => to be covered in more detail later; a quick sketch follows.
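A minimal sketch of both, using the standard SparkContext API (paths and data are placeholders):

// accumulator: executors add to it, only the driver reads the final value
val errorCount = sc.longAccumulator("errorCount")
sc.textFile("hdfs://bigdata3:9000/3.log").foreach(line => {
  if (line.contains("ERROR")) errorCount.add(1)
})
println(errorCount.value)

// broadcast variable: ship a read-only lookup table to every executor once
val area = sc.broadcast(Map("u01" -> "广西", "u02" -> "黑龙江"))
sc.parallelize(List("u01", "u02")).map(uid => (uid, area.value.getOrElse(uid, "unknown"))).collect()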

Case: wordcount

val wc = sc.textFile("hdfs://bigdata3:9000/3.log")

wc.flatMap(x=>{
x.split(",")
}).map(x=>{
(x,1)
}).reduceByKey((x,y)=>{
x+y
}).saveAsTextFile("hdfs://bigdata3:9000/input/11.txt")

Deploying a Spark job

  • jar
  • spark-submit
spark-submit \
--class <fully qualified class name> \
--master <master / deploy mode> \
--name <job name> \
<path to the jar on the machine> \
<arguments to pass>
-------------------------------------example
spark-submit \
--class tool.jdbc.readjdbc \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/try" "root" "liuzihan010616" "emp"
-----------------------------------------no --master here because it is already set in the code

You can also pass values through Spark's own configuration instead of args:

pass them at submit time with --conf.

The code looks like this:

package com.dl2262.sparkcore.day02

import com.dl2262.sparkcore.util.{ContextUtils, FileUtils}
import org.apache.spark.SparkContext
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD

/**
*
* @author sxwang
* 01 05 8:28
*/
object WCApp extends Logging{

def main(args: Array[String]): Unit = {

// if(args.size != 2){
// logError("请正确输入2个参数:<input> <output>")
// System.exit(0)
// }
// val in = args(0)
// val out = args(1)


val sc: SparkContext = ContextUtils.getSparkContext(this.getClass.getSimpleName)

val in = sc.getConf.get("spark.input.path","hdfs://bigdata32:9000/input/")
val out = sc.getConf.get("spark.output.path","hdfs://bigdata32:9000/output/")


val input = sc.textFile(in)

FileUtils.deletePath( sc.hadoopConfiguration,out)

input.flatMap(line => {
line.split(",")
}).map(word => (word,1))
.reduceByKey(_+_)
.saveAsTextFile(out)

sc.stop()
}





}

Example:

The data looks like this:
domain, user, user location, show count, click count
For each of domain, user and location, compute its top 2.
-----------------------------------------------------------------------------------generate the data
val domin = Array[String]("www.baidu.com","www.taobao.com","www.github.com","www.bilbil.com","www.csdn.com","www.zihang.com")
val userList = Array[String]("zihang","zuan","zihao","shuangxi","yuhang")
val beianlocal = Array[String]("广州","江西","太原","新疆","上海")
var stringe:List[String] = Nil
for(i <- 1 to(30000)){
val Randomdomin = domin(Random.nextInt(domin.length)) // full length; length-1 would never pick the last entry
val RandomUserList = userList(Random.nextInt(userList.length))
val Randombeianlocal = beianlocal(Random.nextInt(beianlocal.length))
val tmp = List(Randomdomin + "," + RandomUserList + "," + Randombeianlocal )
stringe = stringe++tmp
}
-------------------------------------------------------------------------------by domain => the other dimensions work the same way and can be added later
val basicdata = sc.parallelize(stringe)
-------------------------------------------------------------------------------parse the data
val ETLDATA = basicdata.map(x=>{
val strings = x.split(",")
val currentdomin = strings(0)
val currentuser = strings(1)
val currentadd = strings(2)
val click = Random.nextInt(100).toInt
val show = Random.nextInt(200).toInt
((currentdomin,currentuser,currentadd),(click,show))
})
--------------------------------------------------------------------------------aggregate and sort
for (elem <- domin){
ETLDATA.filter(_._1._1==elem).reduceByKey((x,y)=>{
(x._1+y._1,x._2+y._2)
}).sortBy( -_._2._1).take(2).foreach(println(_))
}
---------------------------------------------------------------------------------method 2
ETLDATA.reduceByKey((x,y)=>{
(x._1+y._1,y._2+x._2)
}).sortBy(x=> -x._2._1).map(x=>{
(x._1._1,(x._1._2,x._1._3,x._2))
}).groupByKey().map(x=>{
x._2.map(s=>{
(x._1,(s._1,s._2,s._3))
}).take(2)
}).collect