

Goal

Monitor YARN resource usage.

Pipeline

Data collection: gather YARN metrics.

Data processing: real-time processing with Spark Streaming.

Data output: MySQL (with indexes) or an OLAP store (millisecond-level queries).

Data visualization: Superset or DataEase.

OLAP options: ClickHouse, Doris, TiDB, Phoenix.

OLTP: stores that support transactions.

Data flow

yarn -> collector jar -> kafka -> sparkstreaming -> ck -> superset/dataease

Collected record formats:

  • Delimited text: uses less network I/O, but the delimiter has to be chosen carefully.
  • JSON: costs more network I/O, but is easy to parse (a small sketch follows below).
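To make the trade-off concrete, here is a small hypothetical sketch in Scala (the AppReport case class and its fields are invented for illustration; the collector below actually emits a multi-line text record):

case class AppReport(id: String, user: String, memorySeconds: Long, vcoreSeconds: Long)

val r = AppReport("application_1675390427337_0001", "hadoop", 3804903L, 2232L)

// delimited text: compact on the wire, but the delimiter must never occur inside a value
val asText = Seq(r.id, r.user, r.memorySeconds, r.vcoreSeconds).mkString("|")

// JSON: a larger payload, but self-describing and easy to parse downstream
val asJson =
  s"""{"id":"${r.id}","user":"${r.user}","memorySeconds":${r.memorySeconds},"vcoreSeconds":${r.vcoreSeconds}}"""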

start

Collecting the data

The YARN API

Add the YARN dependencies:

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.4</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-common</artifactId>
<version>3.3.4</version>
</dependency>

Develop in IDEA.

Define a trait that fetches the YARN data:

package sparkfirst

import java.util

import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

trait YarnInfo {

def getYarnInfo={
val client = YarnClient.createYarnClient()
/*
 init() expects a conf object, i.e. the YARN configuration:
 put yarn-site.xml under resources, create a YarnConfiguration,
 then start the client.
*/
val configuration = new YarnConfiguration()
client.init(configuration)
client.start()

val states = util.EnumSet.noneOf(classOf[YarnApplicationState])
states.add(YarnApplicationState.ACCEPTED)
states.add(YarnApplicationState.RUNNING)
states.add(YarnApplicationState.NEW)
states.add(YarnApplicationState.SUBMITTED)
states.add(YarnApplicationState.KILLED)
states.add(YarnApplicationState.NEW_SAVING)
states.add(YarnApplicationState.FAILED)


val reports = client.getApplications(states)

val value = reports.iterator()

val builder = new StringBuilder
while (value.hasNext){
val report = value.next()
val report1 = report.getApplicationResourceUsageReport
val id = report.getApplicationId
val host = report.getHost
val applicationType = report.getApplicationType
val name = report.getName
val starttime = report.getStartTime
val user = report.getUser
val finishtime = report.getFinishTime
val mem = report1.getMemorySeconds
val vcore = report1.getVcoreSeconds
val size = report1.getUsedResources.getMemorySize
val cores = report1.getUsedResources.getVirtualCores
val resources = report1.getUsedResources.getResources
val state = report.getYarnApplicationState
val url = report.getTrackingUrl
val margin =
s"""
|report: ${report}
|report1 : ${report1}
|id:${id}
|host:${host}
|applicationtype : ${applicationType}
|name : ${name}
|starttime ${starttime}
|finishtime : ${finishtime}
|user:${user}
|memeveryscends:${mem}
|vcoreeveryscends:${vcore}
|size:${size}
|cores${cores}
|state:${state}
|url:${url}
|resources:${resources.mkString(",")}
|---
|""".stripMargin
builder.appendAll(margin)
}

println(builder)
// return the report string so callers (e.g. the Kafka producer below) can split and send it
builder.toString()

}
}

In the driver class, extend the trait and implement a method that invokes it:

package sparkfirst
import org.apache.flink.api.java.utils.ParameterTool
object testyarn {

def apply(parameterTool: ParameterTool): testyarn = new testyarn(parameterTool)
def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
testyarn(tool).excute()
}
}
class testyarn(parameterTool: ParameterTool) extends YarnInfo {




def excute(): Unit ={

getYarnInfo
}
}

Start a Spark SQL session on the cluster, deployed in YARN mode.

The collected data looks like this:

report: applicationId { id: 1 cluster_timestamp: 1675390427337 } user: "hadoop" queue: "default" name: "SparkSQL::192.168.41.132" host: "192.168.41.133" rpc_port: -1 yarn_application_state: RUNNING trackingUrl: "http://bigdata4:9999/proxy/application_1675390427337_0001/" diagnostics: "" startTime: 1675390547814 finishTime: 0 final_application_status: APP_UNDEFINED app_resource_Usage { num_used_containers: 3 num_reserved_containers: 0 used_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } reserved_resources { memory: 0 virtual_cores: 0 resource_value_map { key: "memory-mb" value: 0 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 0 units: "" type: COUNTABLE } } needed_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } memory_seconds: 3804903 vcore_seconds: 2232 queue_usage_percentage: 41.666664 cluster_usage_percentage: 41.666664 preempted_memory_seconds: 0 preempted_vcore_seconds: 0 application_resource_usage_map { key: "memory-mb" value: 3804903 } application_resource_usage_map { key: "vcores" value: 2232 } application_preempted_resource_usage_map { key: "memory-mb" value: 0 } application_preempted_resource_usage_map { key: "vcores" value: 0 } } originalTrackingUrl: "http://bigdata3:4040" currentApplicationAttemptId { application_id { id: 1 cluster_timestamp: 1675390427337 } attemptId: 1 } progress: 0.1 applicationType: "SPARK" log_aggregation_status: LOG_NOT_START unmanaged_application: false priority { priority: 0 } appNodeLabelExpression: "<Not set>" amNodeLabelExpression: "<DEFAULT_PARTITION>" appTimeouts { application_timeout_type: APP_TIMEOUT_LIFETIME application_timeout { application_timeout_type: APP_TIMEOUT_LIFETIME expire_time: "UNLIMITED" remaining_time: -1 } } launchTime: 1675390548575
report1 : num_used_containers: 3 num_reserved_containers: 0 used_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } reserved_resources { memory: 0 virtual_cores: 0 resource_value_map { key: "memory-mb" value: 0 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 0 units: "" type: COUNTABLE } } needed_resources { memory: 5120 virtual_cores: 3 resource_value_map { key: "memory-mb" value: 5120 units: "Mi" type: COUNTABLE } resource_value_map { key: "vcores" value: 3 units: "" type: COUNTABLE } } memory_seconds: 3804903 vcore_seconds: 2232 queue_usage_percentage: 41.666664 cluster_usage_percentage: 41.666664 preempted_memory_seconds: 0 preempted_vcore_seconds: 0 application_resource_usage_map { key: "memory-mb" value: 3804903 } application_resource_usage_map { key: "vcores" value: 2232 } application_preempted_resource_usage_map { key: "memory-mb" value: 0 } application_preempted_resource_usage_map { key: "vcores" value: 0 }
id:application_1675390427337_0001
host:192.168.41.133
applicationtype : SPARK
name : SparkSQL::192.168.41.132
starttime 1675390547814
finishtime : 0
user:hadoop
memeveryscends:3804903
vcoreeveryscends:2232
size:5120
cores3
state:RUNNING
url:http://bigdata4:9999/proxy/application_1675390427337_0001/
resources:name: memory-mb, units: Mi, type: COUNTABLE, value: 5120, minimum allocation: 0, maximum allocation: 9223372036854775807, tags: [], attributes {},name: vcores, units: , type: COUNTABLE, value: 3, minimum allocation: 0, maximum allocation: 9223372036854775807, tags: [], attributes {}
---

The same application as shown in the YARN UI:

(screenshot omitted) The figures match the collected data.

Sending the data to Kafka

As follows:

package sparkfirst
import java.util.Properties

import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.clients.producer.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.flink.api.java.utils.ParameterTool

import scala.util.Random
object testyarn {
def apply(parameterTool: ParameterTool): testyarn = new testyarn(parameterTool)
def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
testyarn(tool).excute()
}
}

class testyarn(parameterTool: ParameterTool) extends YarnInfo {

val properties = new Properties
properties.put("bootstrap.servers", "bigdata3:9092,bigdata4:9092,bigdata5:9092 ")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("acks", "all")


def excute() ={
val producer: Producer[String, String] = new KafkaProducer[String, String](properties)
val i = new Random().nextInt(10) % 3
val strings = getYarnInfo.split("-------------------------------------------------------------")
for (elem <- strings) {
println(elem)
producer.send(new ProducerRecord[String, String]("yarninfo", i, " ", elem))
}

producer.close()
}
}

Consuming the data

Consume it with Spark Streaming:

package project

import java.lang.reflect.Field
import java.util.Properties

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.poi.ss.formula.functions.T

import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql
import org.apache.spark.sql.catalyst.plans.logical.MapPartitions
import tool._
object makeYArninfo {

def apply(parameterTool: ParameterTool): makeYArninfo = new makeYArninfo(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
makeYArninfo(tool).excute()
}
}

class makeYArninfo(parameterTool: ParameterTool) extends Serializable {
import org.apache.spark.streaming.kafka010._
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.TaskContext


val kafkaip = parameterTool.get("kafkaip","bigdata3:9092,bigdata4:9092,bigdata5:9092")
val groupid = parameterTool.get("groupid","test-3")
val offsetreset = parameterTool.get("offsetset" , "earliest")
val topicid = parameterTool.get("topic","yarninfo")
val mideng = parameterTool.get("mideng","timestamp")
val url = parameterTool.get("url","jdbc:clickhouse://ip:8123/bigdata")
val root = parameterTool.get("root","default")
val password = parameterTool.get("password","123456")
val driver = parameterTool.get("driver","com.clickhouse.jdbc.ClickHouseDriver")
val dbtable = parameterTool.get("dbtable","yarninfo_zihang")
val mode = parameterTool.get("mode","append")



val kafkaParams = Map[String,Object](
"bootstrap.servers" -> kafkaip, // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> groupid, // 指定消费者组
"auto.offset.reset" -> offsetreset, // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)
private val streamingcontext = new streamingcontext

private val savefile = new savefile

def excute()={

val streaming = streamingcontext.getstreamingnocheckpoint()
val topic = Array(topicid)
val stream = KafkaUtils.createDirectStream(
streaming,
PreferConsistent,
Subscribe[String, String](topic, kafkaParams)
)
// capture the offset ranges of this batch
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(rdd.partitions.size)
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
println("----------------------------------------------")
val wordsDataFrametmp = rdd.map(_.value()).filter(_.nonEmpty).map(line => {
var str:String = line
if (line.startsWith("\r\n\r\n")){
if (line.startsWith("\r\n\r\n")){
str = line.replace("\r\n\r\n", "\r\n")
}
}
str.split("\r\n")
}).filter(_.nonEmpty)

var wordsDataFrame:sql.DataFrame = null

// def getTypeTag[T: ru.TypeTag](obj: T) = ru.typeTag[T]
//
// val tpe = getTypeTag(wordsDataFrametmp).tpe
//
// tpe.dealias.getClass.getFields.foreach(println(_))
// println("---------------------------------")
// tpe.getClass.getDeclaredFields.foreach(println(_))



// println("\"wordsDataFrametmp的数据\" ")
// //wordsDataFrametmp.collect().foreach(_.foreach(println(_)))
// println("wordsDataFrametmp")
// wordsDataFrametmp.toDF("total").show(false)
// println("rdd.map(_.value())")
// rdd.map(_.value()).toDF("total").show(false)
// println("rdd.map(_.value()).map(_.split(\"\\r\\n\"))")
// rdd.map(_.value()).map(line => {
// var str:String = line
// if (line.startsWith("\r\n\r\n")){
// str = line.replace("\r\n\r\n", "\r\n")
// }
// str.split("\r\n")
// }).toDF("total").show(false)




// ------------------------------------------------------------------------------------------------------
if(!((wordsDataFrametmp.collect().length == 1)&&(wordsDataFrametmp.collect().length == 0))){
wordsDataFrame= wordsDataFrametmp.map(strings=>{
val id = strings(1).split(":")(1)
val host = strings(2).split(":")(1)
val applicationtype = strings(3).split(":")(1)
val name = strings(4).split("&&")(1)
val startime = strings(5).split(":")(1)
val endtime = strings(6).split(":")(1)
val user = strings(7).split(":")(1)
val memeveryscends = strings(8).split(":")(1)
val vcoreeveryscends = strings(9).split(":")(1)
val size = strings(10).split(":")(1).toLong
val cores = strings(11).split(":")(1).toLong
val state = strings(12).split(":")(1)
val url = strings(13).split("&&")(1)
val queue = strings(14).split(":")(1)
val timestamp = strings(15).split("&&")(1)
(id,host,applicationtype,name,startime,endtime,user,memeveryscends,vcoreeveryscends,size,cores,state,url,queue,timestamp)
})
.toDF("id","host",
"applicationtype","name",
"startime","endtime",
"user","memeveryscends",
"vcoreeveryscends","size",
"cores","state","url","queue","timestamp")
if (!wordsDataFrame.isEmpty){
wordsDataFrame.show()
savefile.savetojdbc(spark, wordsDataFrame, url , root , password,dbtable,driver,mideng,mode)
}
}



// commit the offsets once the batch has been processed
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
streaming.start()
streaming.awaitTermination()
}


}

Deployment

When deploying on the cluster, dependencies can be supplied with --jars.

For example:

spark-submit \
--master yarn \
--deploy-mode client \
--name 录入yarninfo \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/jar/kafka/spark-streaming-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/spark-token-provider-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/kafka-clients-2.2.1.jar,/home/hadoop/software/jar/connect/clickhouse-jdbc-0.3.2.jar,/home/hadoop/software/jar/connect/clickhouse-http-client-0.3.2.jar,/home/hadoop/software/jar/connect/clickhouse-client-0.3.2.jar,/home/hadoop/software/jar/flink/flink-clients_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-core-1.13.6.jar,/home/hadoop/software/jar/flink/flink-scala_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-java-1.13.6.jar \
--class project.makeYArninfo \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
--kafkaip namenode:9092,resourcemanager:9092,workers:9092
-------------------------------------------------------------------
spark-submit \
--master yarn \
--name 采集yarn \
--deploy-mode client \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/jar/kafka/spark-streaming-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/spark-token-provider-kafka-0-10_2.12-3.2.1.jar,/home/hadoop/software/jar/kafka/kafka-clients-2.2.1.jar,/home/hadoop/software/jar/flink/flink-clients_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-core-1.13.6.jar,/home/hadoop/software/jar/flink/flink-scala_2.12-1.13.6.jar,/home/hadoop/software/jar/flink/flink-java-1.13.6.jar \
--class sparkfirst.testyarn \
--queue zihan \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
--kafkaip namenode:9092,resourcemanager:9092,workers:9092


Alternatively, let Maven resolve the dependencies

by passing the parameters

--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \

--packages lists the dependency coordinates, and --repositories above points at the Maven repository to resolve them from.

The artifacts are downloaded only on first use and cached afterwards.

Or build a fat jar with the assembly plugin:

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>

Scheduling and alerting with XXL-JOB

Wrap the launch command in a shell script and schedule it with XXL-JOB.

The script:

pid=$(jps |  grep SparkSubmit | awk '{print $1}')
if [ ! -n "$pid" ];then
yarninfo.sh
ssh bigdata3 "/home/hadoop/shell/ding.sh 梅花十三 采集yarn日志 请登录查看 192.168.41.133 15046528047"
else
echo "信息正常"
fi

The above is only a simple script; for genuine real-time monitoring, write something more robust.

api

jdbctohive

package project


import java.util

import org.apache.spark.sql.catalog.Catalog
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import tool.sqlUtils
import tool.getmysqldf
import tool.savefile
import tool.readfile
import org.apache.flink.api.java.utils.ParameterTool
object jdbctohive{
def apply(parameterTool: ParameterTool): jdbctohive = new jdbctohive(parameterTool)

def main(args: Array[String]): Unit = {
if (args.length==0){
println(
"""
|欢迎使用本程序
|参数详情 mysql hive
|-------------------------mysql
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|tablename => 支持谓词下压 例子 : emp 或者 select * from emp 等
|driver => com.mysql.jdbc.Driver
|---------------------------hive
|mode模式 overwrite append 等
|hive中的table 例子 bigdata.emp
|可选参数 分区字段 自动开启的是动态分区 例子 deptno
|分区字段 [字段值] [标志位]:代表是不是只更新这一个分区的数据
|jdbc:mysql://bigdata2:3306/try root liuzihan010616 "select * from emp " com.mysql.jdbc.Driver append default.tmp deptno,sal,test,re 999,888
|""".stripMargin)
}
val tool = ParameterTool.fromArgs(args)
jdbctohive(tool).excute(args)
}
}




class jdbctohive(parameterTool: ParameterTool) {
System.setProperty("HADOOP_USER_NAME","hadoop")
val spark = SparkSession.builder().appName("sqoop").master("local[4]").enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")

val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog
var changecolunm = false
import spark.implicits._
import org.apache.spark.sql.functions._

val url = parameterTool.getRequired("url")
val user = parameterTool.getRequired("user")
val password = parameterTool.getRequired("password")
val table = parameterTool.getRequired("table")
val driver = parameterTool.getRequired("driver")
val mode = parameterTool.getRequired("mode")
val hivetable = parameterTool.getRequired("hivetable")
val hivepartition = parameterTool.get("hivepartition",null)
val partitionValues = parameterTool.get("partitionValues")
val insertpartition = parameterTool.get("insertpartition")


def excute(args: Array[String]): Unit = {


// 获取jdbc的df
val mysqlconnect = getmysqldf.getmysqldataframe(spark, url, user, password, table , driver)
// 验证指示
mysqlconnect.show()
// 生成hive参数数组
// var hiveconf = new Array[String](args.length-5)
// hiveconf = util.Arrays.copyOfRange(args, 5, args.length)
//hiveconf.foreach(println(_))
jdbctohive(args.length,catalog,mysqlconnect)
spark.stop()
}



def changecolnums(int: Int,resourcesql:DataFrame) ={
var finallyresult:Dataset[Row] = null // 最终结果集
var frame:DataFrame = null // 中间变量
val strings2 = hivepartition.split(",")
var hiveconclumns = spark.table(hivetable).columns // hive的列数
//hiveconclumns.foreach(println((_))) // 验证hive的列数
var mysqlconnect:DataFrame = resourcesql // 设置数据源的resource

// 判断分区字段在不在jdbc的数据里,如果不在,则在jdbc的数据源中先添加上分区字段
var strings1:Array[String] =null
if (int > 8 && partitionValues != null){
strings1 = partitionValues.split(",")
}
var flagtmp:Int = 0;
for (elem <- strings2){
if (!mysqlconnect.columns.contains(elem)){
println(elem)
println(strings1(flagtmp))
mysqlconnect = mysqlconnect.withColumn(elem,lit(strings1(flagtmp)))
flagtmp = flagtmp + 1
mysqlconnect.show()
}
}



val jdbcconclumns = mysqlconnect.columns // jdbc的列数


var jdbcoldsource:Dataset[Row] = null // 源数据库的数据 checkpoint是为了破坏数据均衡,以后能编写变读取

if (int == 10){
hivepartition.split(",")(0) match {
case "" => {
println("-------------------------无操作")
}
case _ => {
hivepartition.split(",").length match {
case 1 =>
{
jdbcoldsource = spark.sql(
s"""
|select * from ${hivetable} where ${hivepartition} != ${partitionValues}
|""".stripMargin).checkpoint()
}
case _ =>
{
var tmpstring:String = null
var flag:Int = 0
val flagvalue = partitionValues.split(",")
for (elem <- hivepartition.split(",")){
if (elem == hivepartition.split(",")(hivepartition.split(",").length-1)){
tmpstring = tmpstring + elem + "!=" + flagvalue(flag)
}else{
tmpstring = tmpstring + elem + "!=" + flagvalue(flag) + "and"
}
flag = flag + 1
}
jdbcoldsource = spark.sql(
s"""
|select * from ${hivetable} where ${tmpstring}
|""".stripMargin).checkpoint()
}
}
}
}


}else{
jdbcoldsource = spark.sql(
s"""
|select * from ${hivetable}
|""".stripMargin).checkpoint()
}

var existcolunms: Array[String] = null // 设置hive或者mysql的额外列
var resultdf: DataFrame = jdbcoldsource // 获取hive的数据原始数据

// 判断是hive的列多,还是数据源的列数多
if (hiveconclumns.length >= jdbcconclumns.length){
// 判断额外列的存在
existcolunms= hiveconclumns.filter(hivecol => {
val bool = jdbcconclumns.map(jdbccol => {
jdbccol == hivecol
}).contains(true)
!bool
})
// 判断两个列数是不是相等
if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
// 列数不相等的时候让列数少的加列
resultdf = mysqlconnect
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
// 对字段进行排序 , 让分区数据的分区字段在最后一列
frame = resultdf.selectExpr(hiveconclumns: _*)
// 验证数据
frame.show()
// 整合历史数据
finallyresult = jdbcoldsource.union(frame)
// 验证数据
finallyresult.show()
changecolunm = true
finallyresult
}
}else{
// 数据的列多
existcolunms= jdbcconclumns.filter(jdbccol => {
val bool = hiveconclumns.map(hivecol => {
jdbccol == hivecol
}).contains(true)
!bool
})

if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
frame = resultdf.selectExpr(jdbcconclumns: _*)
finallyresult = frame.union(mysqlconnect)
changecolunm = true
finallyresult
}
}
}






def jdbctohive(int: Int,catalog: Catalog,mysqlconnect: DataFrame)={
// 分割字符串获取hive的 表和数据库
val hivedbandtables = hivetable.split("\\.")
val hivepart = hivepartition.split(",")
hivepart.foreach(println(_))
// catalog的方法 获取表存不存在的方法
// catalog.listTables(strings(0)).show()
// val empty = catalog.listTables(strings(0)).filter(x => {
// x.name == strings(1)
// }).isEmpty
val empty = catalog.tableExists(hivedbandtables(0),hivedbandtables(1))
//-----------------------------------------------------------------------------
// sql的方法
// val empty1 = spark.sql(
// """
// |show tables in hivedb
// |""".stripMargin).filter("tableName = 'hivetablename'").isEmpty
// --------------------------------------------------------------------------


// 判断列数是不是相等
var frameresult:DataFrame = mysqlconnect
// 先判断表存不存在 ,因为判断列数的方法要求表存在
empty match {
// 表不存在
case false => {
// 判断输入的变量个数执行 判断分区表还是普通表
if (int > 7) {
println("-----------------分区表")
// 判断分区的参数在不在列中 如果不在 ,则加上 ,在的话就自动往下走
var hivepartval:Array[String] =null
if (int > 8 && partitionValues != null){
hivepartval = partitionValues.split(",")
}
var flagtmp:Int = 0;
for (elem <- hivepart){
if (!mysqlconnect.columns.contains(elem)){
println(elem)
println(hivepartval(flagtmp))
frameresult = frameresult.withColumn(elem,lit(hivepartval(flagtmp)))
flagtmp = flagtmp + 1
}
}
}else{
println("-----------普通表")
frameresult = mysqlconnect
mysqlconnect.show()
}
}

case true => {
// 表存在
// 判断是不是分区表
frameresult = changecolnums(int, mysqlconnect)
// if (args.length > 7) {
// println("-----------------分区表")
// if (!mysqlconnect.columns.contains(args(7))){
// frameresult = changecolnums(args, hiveconf, mysqlconnect)
// }
// }else{
// println("-----------普通表")
// frameresult = mysqlconnect
// }
frameresult.show()}
}









spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
println(empty)
saveFile.savetohiveapi(spark,empty,frameresult,hivetable,mode,hivepartition,changecolunm)
}

}

hivetojdbc

package project

import java.util

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalog.Catalog
import tool.{getmysqldf, savefile, sqlUtils,readfile}
import org.apache.flink.api.java.utils.ParameterTool
object hivetojdbc{
def apply(parameterTool: ParameterTool): hivetojdbc = new hivetojdbc(parameterTool)

def main(args: Array[String]): Unit = {
if (args.length==0){
println(
"""
|欢迎使用本程序
|参数说明
|总体参数种类 hive mysql
|---------------------------hive
|hive中要选择的字段 例子 : "sal,big / * "
|hive的table的名字 例子 : bigdata_hive3.emp
|hive中的 条件可以为空 例子 : where sal > '300'
|---------------------------mysql
|savemode overwrite append 等
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|dbtable 例子 : emp
|幂等性的列 : 例子 : sal
|驱动名称 : 例子 com.mysql.jdbc.Driver
|""".stripMargin)
}
val tool = ParameterTool.fromArgs(args)
hivetojdbc(tool).excute()
}
}




class hivetojdbc(parameterTool: ParameterTool) {
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog
val hiveconclunms = parameterTool.getRequired("hiveconclumns")
val hivetable = parameterTool.getRequired("hivetable")
val hiveoption = parameterTool.get("hiveoption",null)
val url = parameterTool.get("url","jdbc:mysql://bigdata2:3306/bigdata")
val user = parameterTool.get("user","root")
val pasword = parameterTool.get("password","liuzihan010616")
val dbtable = parameterTool.getRequired("dbtable")
val driver = parameterTool.getRequired("driver")
val midengconclumns = parameterTool.getRequired("col")
val mode = parameterTool.getRequired("mode")

def excute(): Unit = {

val frame = sqlUtils.checksql(spark, sqlUtils.hivesqlchoose(hiveconclunms,hivetable,hiveoption))
saveFile.savetojdbc(spark,frame,url,user,pasword,dbtable,driver,midengconclumns,mode)
}

}

SQL approach

jdbctohive

package sparkfirst

import org.apache.spark.sql.SparkSession
import tool.savefile
import tool.sqlUtils
import org.apache.spark.sql.functions._
object test {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
private val savefile = new savefile
private val utils = new sqlUtils
def main(args: Array[String]): Unit = {
val df = spark.read.format("JDBC")
.option("url","jdbc:mysql://bigdata2:3306/try")
.option("dbtable", "emp")
.option("user", "root")
.option("password", "liuzihan010616")
.load()
df.select("sal").tail(1).foreach(println(_))
println(df.select("sal").tail(1)(0)(0))
df.show()


var str:String = null
val bool = spark.catalog.tableExists("default.tmp")
if (bool){
spark.sql(
s"""
|drop table default.tmp
|""".stripMargin)
str = utils.mkcreatesql(df, "default.tmp", "text", "','","deptno,hiredate")
utils.checksql(spark,str)
}else{
str = utils.mkcreatesql(df, "default.tmp", "text", "','","deptno,hiredate")
utils.checksql(spark,str)
}

spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
val frame = df.withColumn("ee", lit("aaa"))
utils.insertmake(spark,df,"default.tmp","','","deptno,hiredate")
utils.changecolunms(spark,frame,"default.tmp")
utils.insertmake(spark,frame,"default.tmp","','","deptno,hiredate")


}
}

hivetojdbc

Use the custom API defined in sqlUtils.

Design and implemented features

source

Receive the JDBC data through the API,

using

getmysqldf.getmysqldataframe(spark, url, user, password, table , driver)
----------------------------------------------------------------------------------------------
def getmysqldataframe(sparkSession: SparkSession,string: String*) ={
val sql = string(3)
val frame: DataFrame = sparkSession.read.format("jdbc").options(Map("url" -> string(0), "user" -> string(1), "password" -> string(2), "dbtable" -> s"($sql) as tmp","driver"->string(4))).load()
frame
}

to obtain the JDBC DataFrame.

Alternative: the single options(Map(...)) call can be replaced with several option(key, value) calls, as sketched below.
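For example, a minimal sketch using the same example URL and credentials that appear elsewhere in this post:

import org.apache.spark.sql.{DataFrame, SparkSession}

// the same JDBC read expressed with individual option(key, value) calls
def readJdbc(spark: SparkSession, sql: String): DataFrame =
  spark.read.format("jdbc")
    .option("url", "jdbc:mysql://bigdata2:3306/try")
    .option("user", "root")
    .option("password", "liuzihan010616")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", s"($sql) as tmp")   // wrap the query so predicates can be pushed down
    .load()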

Using

 val frame = sqlUtils.checksql(spark, sqlUtils.hivesqlchoose(hiveconclunms,hivetable,hiveoption))
-----------------------------------------------------------------------------------------------------------
def checksql(spark:SparkSession, string: String)={
spark.sql(string)
}
---------------------------------------------------------------------------------------------------------------
def hivesqlchoose(hiveconclumns:String,hivetable:String,hiveoptions:String)={

"select" + " " + hiveconclumns + " " + "from" + " " + hiveconclumns + " " + hiveoptions
}

Prerequisite: the SparkSession must be created with enableHiveSupport().
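A minimal sketch of that prerequisite (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hivetojdbc")
  .enableHiveSupport()   // lets spark.sql(...) and saveAsTable resolve the Hive metastore
  .getOrCreate()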

todo

Integrate and process the data through the API.

Features:

  • jdbctohive
    • Basic features
      • Sync non-partitioned tables
      • Sync partitioned tables
        • Single partition
        • Multiple partitions
    • Extra features (API)
      • User-defined partition columns and values
      • When the user adds a column to the JDBC data, the Hive table gains the column automatically
      • Partition columns can be changed without losing the source data
      • Partitioning by a mix of columns from the table and user-defined columns
      • Append to, or rewrite, a single partition
      • Configure the Hive table's storage and compression format
      • Append to, or rewrite, all partitions
      • Deployed via Flink's ParameterTool
    • Extra features (SQL)
      • User-defined partition columns and values
      • When the user adds a column to the JDBC data, the Hive table gains the column automatically
      • Partitioning by a mix of columns from the table and user-defined columns
      • Append to, or rewrite, a single partition
      • Append to, or rewrite, all partitions
      • Configure the Hive table's storage format
      • Support storage formats such as text
      • Deployed via Flink's ParameterTool
  • hivetojdbc
    • Basic features
      • Sync data
    • Extra features
      • Idempotent writes

The basic features have no special caveats. For multiple partitions, I build the partition clause by taking the partition string, splitting it, mapping each column to append its data type, and joining the pieces with mkString.

sql:

Partition columns come in two kinds: columns already present in the data and columns defined by the user. User-defined partition columns are simply declared as string. For partition columns that come from the data I keep the original type: I look up the column in the DataFrame schema and use its dataType when building the clause. The two parts are then concatenated and prefixed with partitioned by. The tricky points are extracting the variables, detecting the last element while joining, and checking whether a partition column already exists among the data's columns (see the sketch below).
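A hedged sketch of that idea (the helper name and signature are mine, not from the project code): keep the original type for partition columns that exist in the source DataFrame, default user-defined ones to string, and assemble the clause.

import org.apache.spark.sql.DataFrame

def partitionClause(df: DataFrame, partitionCols: Seq[String]): String = {
  val parts = partitionCols.map { col =>
    df.schema.find(_.name == col) match {
      case Some(field) => s"$col ${field.dataType.simpleString}"  // column from the data: keep its type
      case None        => s"$col string"                          // user-defined partition: default to string
    }
  }
  s"partitioned by (${parts.mkString(", ")})"
}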

api:

The API route is simpler: call partitionBy and pass in the partition columns obtained by splitting the string, expanded with : _*. The caveat is that the partition columns must already exist in the DataFrame, so first check whether each one is present and add it beforehand if it is not (sketched below).
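A sketch of that API path under the same assumptions (the helper is illustrative, not the project's savetohiveapi): add any missing partition column as a literal, then hand the columns to partitionBy.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def writePartitioned(df: DataFrame, hivetable: String, hivepartition: String,
                     values: Map[String, String]): Unit = {
  val cols = hivepartition.split(",")
  val withPartitionCols = cols.foldLeft(df) { (acc, col) =>
    if (acc.columns.contains(col)) acc
    else acc.withColumn(col, lit(values.getOrElse(col, null)))   // add a missing partition column as a literal
  }
  withPartitionCols.write
    .partitionBy(cols: _*)        // partition columns must already exist on the DataFrame
    .mode("append")
    .format("hive")
    .saveAsTable(hivetable)
}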

Extra features:

sink

Write the data out to the target tables through the API.

Sink to Hive

api

    saveFile.savetohiveapi(spark,empty,frameresult,hivetable,mode,hivepartition,changecolunm,fileformated,codec)
------------------------------------------------------------------------------------------------------------------
def savetohiveapi(sparkSession: SparkSession,boolean: Boolean,spark: DataFrame,hivetable:String,mode:String,hivepartition:String,changecolnums:Boolean,fileformat:String,codec:String) = {




if (!boolean){
if (hivepartition != null){
spark.write.partitionBy(hivepartition.split(","):_*).option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}else {

println(hivetable)
println(hivepartition)
spark.write.option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}

}else{
changecolnums match {
case true => {
if (hivepartition != null){
if(sparkSession.table(hivetable).columns.length != spark.columns.length){
sparkSession.sql(
s"""
|drop table ${hivetable}
|""".stripMargin)
}
spark.write.partitionBy(hivepartition.split(","):_*).option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}else {
spark.write.option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").saveAsTable(hivetable)
}
}
case false => spark.write.option("fileFormat",fileformat).option("compression",codec).mode(mode).format("hive").insertInto(hivetable)
}

spark.show()
println(spark.count())

}
}

sql

def insertmake(sparkSession: SparkSession,dataFrame: DataFrame,tablename:String,otheroptions:String*) ={

var strings:Array[String] = null

dataFrame.selectExpr(sparkSession.table(tablename).columns:_*).createOrReplaceTempView("tmp")

// val partitionstring = sparkSession.table(tablename).columns.tail(sparkSession.table(tablename).columns.length - 2)
otheroptions.length match {
case 0 => {
sparkSession.sql(
s"""
|insert overwrite ${tablename}
|select * from tmp
|""".stripMargin)
}
case _ => {

if (otheroptions.length > 1){
strings = otheroptions(1).split(",").filter(conclunms => {
!dataFrame.columns.contains(conclunms)
})
val fuzhiarray:Array[String] = util.Arrays.copyOfRange(otheroptions.toArray, 2, otheroptions.length)
fuzhiarray.foreach(println(_))
strings.isEmpty match {
case true => {

sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(1).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select * from tmp
|""".stripMargin)
}
case false => {
var tmpdf:DataFrame = dataFrame
for (i <- 0 to strings.length-1){
tmpdf = tmpdf.withColumn(strings(i),lit(fuzhiarray(i)))
}
tmpdf.show()
tmpdf.printSchema()
tmpdf = tmpdf.selectExpr(sparkSession.table(tablename).columns: _*)
tmpdf.show()
tmpdf.printSchema()
val str = tmpdf.columns.mkString(",\n")
tmpdf.createOrReplaceTempView("smp")
sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(1).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select ${str} from smp
|""".stripMargin)
}
}



}else{
strings = otheroptions(0).split(",").filter(conclunms => {
!dataFrame.columns.contains(conclunms)
})
val fuzhiarray:Array[String] = util.Arrays.copyOfRange(otheroptions.toArray, 1, otheroptions.length)

strings.isEmpty match {
case true => {
sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(0).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select * from tmp
|""".stripMargin)
}

case false => {
var tmpdf:DataFrame = dataFrame

for (i <- 0 to strings.length-1){
tmpdf = tmpdf.withColumn(strings(i),lit(fuzhiarray(i)))
}
tmpdf.show()
tmpdf.printSchema()
tmpdf = tmpdf.selectExpr(sparkSession.table(tablename).columns: _*)
val str = tmpdf.columns.mkString(",\n")
tmpdf.createOrReplaceTempView("smp")
sparkSession.sql(
s"""
|insert overwrite ${tablename} partition(${otheroptions(0).split(",").map(conclunms => {s"${conclunms}"}).mkString(",")})
|select ${str} from smp
|""".stripMargin)
}
}
}
}
}
}

Sink to JDBC

api

def savetojdbc(spark: SparkSession,df: DataFrame, url:String,user:String,password:String,dbtable:String,driver:String,mideng:String,mode:String)={
val map = Map("url" -> url,
"user" -> user,
"password" -> password,
"dbtable" -> dbtable,
"driver"-> driver)

// df.write.mode(string(2)).format(string(1)).options(map).save()
// -------------------------------------幂等性
val connection = jdbcconnect.getconncet(driver,url,user,password)
try{
val bool = connection.createStatement().executeQuery(s"show tables like '${dbtable}'").next()
if (!bool){
throw new NullPointerException(s"写入的结果表${dbtable} 尚未创建!!!")
}else{
var flag:Any = null
val flagbool = mysqldf.getmyqsldffromMap(spark, map).select(mideng).isEmpty
if (!flagbool){
flag = mysqldf.getmyqsldffromMap(spark, map).select(mideng).tail(1)(0)(0)
}
val tmpresult = df.select(mideng).filter(line => {
line.getString(0) != flag
})

if (df.isEmpty){
println("数据集为空")
}else{
if (tmpresult.isEmpty){
println("你的数据已经插入过")
df.show(false)
}else {
// df.show()
// println(df.count())
//val insertresult = tmpresult.join(df, string(5))
tmpresult.show()
println(tmpresult.count())
// insertresult.show()
// println(insertresult.count())
tmpresult.write.mode(mode).format("jdbc").options(map).save()
}
}
}
}finally {
connection.close()
}


}

jdbctohive

package project


import java.util

import org.apache.spark.sql.catalog.Catalog
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import tool.sqlUtils
import tool.getmysqldf
import tool.savefile
import tool.readfile
object jdbctohive {
System.setProperty("HADOOP_USER_NAME","hadoop")
val spark = SparkSession.builder().appName("sqoop").master("local[4]").enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")

val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog
var changecolunm = false
import spark.implicits._
import org.apache.spark.sql.functions._


def main(args: Array[String]): Unit = {
if (args.length==0){
println(
"""
|欢迎使用本程序
|参数详情 mysql hive
|-------------------------mysql
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|tablename => 支持谓词下压 例子 : emp 或者 select * from emp 等
|driver => com.mysql.jdbc.Driver
|---------------------------hive
|mode模式 overwrite append 等
|hive中的table 例子 bigdata.emp
|可选参数 分区字段 自动开启的是动态分区 例子 deptno
|""".stripMargin)
}

val url = args(0)
val user = args(1)
val password = args(2)
val table = args(3)
val driver = args(4)
// 获取jdbc的df
val mysqlconnect = getmysqldf.getmysqldataframe(spark, url, user, password, table , driver)
// 验证指示
mysqlconnect.show()
// 生成hive参数数组
var hiveconf = new Array[String](args.length-5)
hiveconf = util.Arrays.copyOfRange(args, 5, args.length)
//hiveconf.foreach(println(_))
jdbctohive(args,catalog,mysqlconnect,hiveconf)
spark.stop()
}



def changecolnums(args:Array[String],hiveconf:Array[String],resourcesql:DataFrame) ={
var finallyresult:Dataset[Row] = null // 最终结果集
var frame:DataFrame = null // 中间变量
var hiveconclumns = spark.table(args(6)).columns // hive的列数
hiveconclumns.foreach(println((_))) // 验证hive的列数
var mysqlconnect:DataFrame = resourcesql // 设置数据源的resource

// 判断分区字段在不在jdbc的数据里,如果不在,则在jdbc的数据源中先添加上分区字段
if (args.length > 7){
if (!resourcesql.columns.contains(args(7))){
mysqlconnect = resourcesql.withColumn(args(7),lit(args(8)))
}
}

val jdbcconclumns = mysqlconnect.columns // jdbc的列数


var jdbcoldsource:Dataset[Row] = null // 源数据库的数据 checkpoint是为了破坏数据均衡,以后能编写变读取

if (args.length == 10){
jdbcoldsource = spark.sql(
s"""
|select * from ${hiveconf(1)} where ${hiveconf(2)} != ${hiveconf(3)}
|""".stripMargin).checkpoint()
}else{
jdbcoldsource = spark.sql(
s"""
|select * from ${hiveconf(1)}
|""".stripMargin).checkpoint()
}

var existcolunms: Array[String] = null // 设置hive或者mysql的额外列
var resultdf: DataFrame = jdbcoldsource // 获取hive的数据原始数据

// 判断是hive的列多,还是数据源的列数多
if (hiveconclumns.length >= jdbcconclumns.length){
// 判断额外列的存在
existcolunms= hiveconclumns.filter(hivecol => {
val bool = jdbcconclumns.map(jdbccol => {
jdbccol == hivecol
}).contains(true)
!bool
})
// 判断两个列数是不是相等
if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
// 列数不相等的时候让列数少的加列
resultdf = mysqlconnect
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
// 对字段进行排序 , 让分区数据的分区字段在最后一列
frame = resultdf.selectExpr(hiveconclumns: _*)
// 验证数据
frame.show()
// 整合历史数据
finallyresult = jdbcoldsource.union(frame)
// 验证数据
finallyresult.show()
changecolunm = true
finallyresult
}
}else{
// 数据的列多
existcolunms= jdbcconclumns.filter(jdbccol => {
val bool = hiveconclumns.map(hivecol => {
jdbccol == hivecol
}).contains(true)
!bool
})

if (existcolunms.isEmpty) {
frame = mysqlconnect.selectExpr(hiveconclumns: _*)
frame
}else{
for (elem <- existcolunms){
resultdf = resultdf.withColumn(elem, lit(null))
}
frame = resultdf.selectExpr(jdbcconclumns: _*)
finallyresult = resultdf.union(mysqlconnect)
changecolunm = true
finallyresult
}
}
}






def jdbctohive(args:Array[String],catalog: Catalog,mysqlconnect: DataFrame, hiveconf: Array[String])={
// 分割字符串获取hive的 表和数据库
val strings = hiveconf(1).split("\\.")

// catalog的方法 获取表存不存在的方法
// catalog.listTables(strings(0)).show()
// val empty = catalog.listTables(strings(0)).filter(x => {
// x.name == strings(1)
// }).isEmpty
val empty = catalog.tableExists(strings(0),strings(1))
//-----------------------------------------------------------------------------
// sql的方法
// val empty1 = spark.sql(
// """
// |show tables in hivedb
// |""".stripMargin).filter("tableName = 'hivetablename'").isEmpty
// --------------------------------------------------------------------------


// 判断列数是不是相等
var frameresult:DataFrame = null
// 先判断表存不存在 ,因为判断列数的方法要求表存在
empty match {
// 表不存在
case false => {
// 判断输入的变量个数执行 判断分区表还是普通表
if (args.length > 7) {
println("-----------------分区表")
// 判断分区的参数在不在列中 如果不在 ,则加上 ,在的话就自动往下走
if (!mysqlconnect.columns.contains(args(7))){
frameresult = mysqlconnect.withColumn(args(7),lit(args(8)))
frameresult.show()
}
}else{
println("-----------普通表")
frameresult = mysqlconnect
mysqlconnect.show()
}
}

case true => {
// 表存在
// 判断是不是分区表
frameresult = changecolnums(args, hiveconf, mysqlconnect)
// if (args.length > 7) {
// println("-----------------分区表")
// if (!mysqlconnect.columns.contains(args(7))){
// frameresult = changecolnums(args, hiveconf, mysqlconnect)
// }
// }else{
// println("-----------普通表")
// frameresult = mysqlconnect
// }
frameresult.show()}
}









spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
println(empty)
hiveconf.foreach(println(_))
saveFile.savetohiveapi(empty,frameresult,hiveconf,changecolunm)
}

}

hivetojdbc

package project

import java.util

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalog.Catalog
import project.jdbctohive.spark
import tool.{getmysqldf, savefile, sqlUtils,readfile}

object hivetojdbc {
val spark = SparkSession.builder().appName("sqoop").master("local[4]").enableHiveSupport().getOrCreate()
val getmysqldf = new readfile
val sqlUtils = new sqlUtils
val saveFile = new savefile
private val catalog: Catalog = spark.catalog

def main(args: Array[String]): Unit = {

if (args.length==0){
println(
"""
|欢迎使用本程序
|参数说明
|总体参数种类 hive mysql
|---------------------------hive
|hive中要选择的字段 例子 : "sal,big / * "
|hive的table的名字 例子 : bigdata_hive3.emp
|hive中的 条件可以为空 例子 : where sal > '300'
|---------------------------mysql
|savemode overwrite append
|url 例子 : jdbc:mysql://bigdata2:3306/try
|user 例子 : root
|password 例子 : liuzihan010616
|dbtable 例子 : emp
|幂等性的列 : 例子 : sal
|驱动名称 : 例子 com.mysql.jdbc.Driver
|""".stripMargin)
}
val frame = sqlUtils.checksql(spark, sqlUtils.hivesqlchoose(args))

var mysqlconf = new Array[String](args.length-3)
mysqlconf = util.Arrays.copyOfRange(args, 2, args.length)
saveFile.savetojdbc(spark,frame,mysqlconf)


}

}

flink

Overview

The project formally got under way between 2014 and January 2015.

"Flink" itself is a German word meaning quick and nimble.

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Here, we explain important aspects of Flink’s architecture.

Data types

Bounded streams: streams with a defined start and end, also known as batch processing.

Unbounded streams: streams that have a start but no defined end.

Flink characteristics

The data Flink targets has gradually shifted from bounded to unbounded streams: it used to be mostly bounded data, now it is unbounded.

Flink is a distributed system.

Flink is optimized for local state access: task state is always kept in memory, and if it grows beyond the available memory it is spilled to disk data structures that can still be accessed efficiently. Tasks perform all computations against local (usually in-memory) state, which yields very low processing latency. Flink guarantees exactly-once state consistency under failure by periodically and asynchronously persisting this local state.

Flink applications

Data

  • Bounded
  • Unbounded
  • Real-time
  • Offline

State

Only applications that transform each individual event independently need no state; in other words, every stream processing application of any real complexity is stateful.

Flink provides many features for state management, including:

  • Multiple state primitives: for example value, map and list state.
  • Pluggable state backends: a state backend manages application state and checkpoints it when needed. Flink ships with state backends that keep state in memory or in RocksDB, an efficient embedded, persistent key-value store, and also supports custom, pluggable state backends.
  • Exactly-once semantics: the same exactly-once guarantee as in Kafka, which means Flink supports transactional behaviour.
  • Very large state: thanks to its asynchronous and incremental checkpointing algorithm, Flink can maintain application state of several terabytes.
  • Elastic applications: Flink scales stateful applications horizontally by redistributing state across more or fewer worker nodes.

Time

Time is another important ingredient of stream processing. Events always occur at a specific point in time, so most event streams carry inherent time semantics, and many common stream computations are time-based: window aggregations, session computation, pattern detection and time-based joins. A key aspect of stream processing is how the application measures time, i.e. the distinction between event time and processing time.

  • Event-time mode: applications using event-time semantics compute results from the timestamps carried by the events themselves, so results are accurate and consistent whether the events are historical or arriving live.
  • Watermark support: Flink uses watermarks to track progress in event time; watermarks are also a flexible mechanism for trading off latency against completeness.
  • Late data handling: when processing in event-time mode with watermarks, data may still arrive after a computation has completed; such events are called late events. Flink offers several ways to handle them, such as redirecting them to a side output or updating previously emitted results.
  • Processing-time mode: besides event time, Flink also supports processing-time semantics, which trigger computations off the machine clock of the processing engine. This suits applications with strict low-latency requirements that can tolerate approximate results.

Layered APIs

Flink offers three APIs at different levels of abstraction. Each one trades conciseness against expressiveness and targets different use cases.

As follows:

ProcessFunction

ProcessFunction is Flink's most expressive interface. It can process individual events from one or two input streams, or events grouped into a specific window. It provides fine-grained control over time and state: handlers can modify state arbitrarily and register timers that trigger callbacks at a future point in time, so ProcessFunction can implement the per-event business logic that many stateful, event-driven applications require.

It essentially acts as a timer; below is the official example.

The example records a START event and registers a four-hour timer. If an END event arrives first, it emits the elapsed time; when the timer fires after four hours, the state is simply cleared.

/**

* 将相邻的 keyed START 和 END 事件相匹配并计算两者的时间间隔
* 输入数据为 Tuple2<String, String> 类型,第一个字段为 key 值,
* 第二个字段标记 START 和 END 事件。
*/
public static class StartEndDuration
extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, Long>> {

private ValueState<Long> startTime;

@Override
public void open(Configuration conf) {
// obtain state handle
startTime = getRuntimeContext()
.getState(new ValueStateDescriptor<Long>("startTime", Long.class));
}

/** Called for each processed event. */
@Override
public void processElement(
Tuple2<String, String> in,
Context ctx,
Collector<Tuple2<String, Long>> out) throws Exception {

switch (in.f1) {
case "START":
// set the start time if we receive a start event.
startTime.update(ctx.timestamp());
// register a timer in four hours from the start event.
ctx.timerService()
.registerEventTimeTimer(ctx.timestamp() + 4 * 60 * 60 * 1000);
break;
case "END":
// emit the duration between start and end event
Long sTime = startTime.value();
if (sTime != null) {
out.collect(Tuple2.of(in.f0, ctx.timestamp() - sTime));
// clear the state
startTime.clear();
}
default:
// do nothing
}
}

/** Called when a timer fires. */
@Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Long>> out) {

// Timeout interval exceeded. Cleaning up the state.
startTime.clear();
}
}

DataStream API

The DataStream API provides primitives for many common stream processing operations: windows, record-at-a-time transformations, enriching events with external database lookups, and so on. It is available for Java and Scala and comes with predefined functions such as map(), reduce() and aggregate(); custom functions can be defined by implementing the interfaces or with Java/Scala lambda expressions.

The following example captures all click events within a session window and counts the clicks of each session.

// 网站点击 Click 的数据流
DataStream<Click> clicks = ...

DataStream<Tuple2<String, Long>> result = clicks
// 将网站点击映射为 (userId, 1) 以便计数
.map(
// 实现 MapFunction 接口定义函数
new MapFunction<Click, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(Click click) {
return Tuple2.of(click.userId, 1L);
}
})
// 以 userId (field 0) 作为 key
.keyBy(0)
// 定义 30 分钟超时的会话窗口
.window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
// 对每个会话窗口的点击进行计数,使用 lambda 表达式定义 reduce 函数
.reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
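For comparison, a rough Scala sketch of the same logic (it assumes a Click(userId) case class and a clicks stream whose event-time timestamps and watermarks have already been assigned):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Click(userId: String)

def countClicksPerSession(clicks: DataStream[Click]): DataStream[(String, Long)] =
  clicks
    .map(click => (click.userId, 1L))                            // map each click to (userId, 1)
    .keyBy(_._1)                                                 // key by userId
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))   // 30-minute session windows
    .reduce((a, b) => (a._1, a._2 + b._2))                       // count the clicks per session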

SQL & Table API

Flink has two relational APIs, the Table API and SQL. Both are unified for batch and stream processing: on unbounded real-time streams and on bounded historical data, queries run with the same semantics and produce the same results. Both use Apache Calcite for query parsing, validation and optimization, integrate seamlessly with the DataStream and DataSet APIs, and support user-defined scalar, aggregate and table-valued functions.

Flink's relational APIs are designed to simplify the definition of data analytics, data pipelines and ETL applications.

The following SQL query expresses the same logic as the DataStream example above: capture the click events of each session and count its clicks.

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId

Flink ships with several libraries for common data processing scenarios. They are embedded in the APIs rather than fully standalone, so they benefit from all API features and integrate with other libraries.

  • Complex Event Processing (CEP): pattern detection is a very common use case in event stream processing. Flink's CEP library provides an API to specify event patterns, for example as regular expressions or state machines. It integrates with the DataStream API so that patterns are evaluated on DataStreams. Applications include network intrusion detection, business process monitoring and fraud detection.
  • DataSet API: the core API for batch applications. Its basic operators include map, reduce, (outer) join, co-group and iterate. All operators are backed by algorithms and data structures that operate on serialized data in memory and spill to disk when the data exceeds the memory budget. The algorithms borrow from classic database implementations such as hybrid hash-join and external merge-sort.
  • Gelly: a scalable graph processing and analytics library implemented on top of, and integrated with, the DataSet API, so it benefits from its scalable and robust operators. Gelly provides built-in algorithms such as label propagation, triangle enumeration and PageRank, as well as a Graph API that simplifies custom graph algorithms.

Flink operations

Flink keeps applications running continuously and consistently through several mechanisms:

  • Consistent checkpoints: Flink's failure recovery is based on consistent, distributed checkpoints of application state. On failure the application restarts and reloads the state from the last successful checkpoint; combined with a replayable source this guarantees exactly-once state consistency.
  • Efficient checkpoints: checkpointing an application with terabytes of state is expensive, so to limit the impact on latency SLAs Flink takes checkpoints asynchronously and incrementally.
  • End-to-end exactly-once: for certain storage systems Flink supports transactional sinks, which guarantee exactly-once output even under failure.
  • Integration with cluster managers: Flink integrates tightly with resource managers such as Hadoop YARN, Mesos and Kubernetes; when a process fails, a new one is started automatically to take over.
  • Built-in high availability: Flink ships with a high-availability mode, based on Apache ZooKeeper, that removes single points of failure.

Flink also makes it easy to upgrade, migrate, suspend and resume applications.

Flink's savepoints exist precisely to preserve application state across upgrades and similar operations. A savepoint is a consistent snapshot of an application's state, very much like a checkpoint, except that it is triggered manually and is not deleted automatically when the application stops. Savepoints are typically used to start a stateful application and initialize its state from a backup. They enable the following:

  • Application version upgrades: a new version of an application can be restarted from a savepoint taken by the previous version; an older savepoint can also be used to roll back results produced by a buggy release.
  • Cluster migration: with savepoints, applications can be moved freely between clusters.
  • Flink version upgrades: savepoints make upgrading the Flink runtime itself safer and simpler.
  • Rescaling: savepoints are commonly used when increasing or decreasing an application's parallelism.
  • A/B testing and what-if analysis: starting two versions of an application from the same savepoint makes it possible to compare their performance and output quality.
  • Pause and resume: an application can take a savepoint and stop, then resume later from that savepoint at any point in time.
  • Archiving: savepoints can be archived so that an application's state can be reset to a specific point in time for recovery.

Monitoring and controlling applications

Like any long-running service, streaming applications need to be monitored and integrated with operational infrastructure such as monitoring and logging services. Monitoring helps anticipate problems and react early; logging helps trace, investigate and analyse the root cause of failures. Finally, convenient interfaces for controlling running applications are another important Flink feature.

Flink integrates well with common logging and monitoring services and provides a REST API to control applications and query their status:

  • Web UI: Flink provides a web UI to inspect, monitor and debug running applications, and to trigger or cancel jobs and tasks.
  • Logging: Flink implements the popular slf4j interface and integrates with the log4j and logback frameworks.
  • Metrics: Flink has a sophisticated metrics system for collecting and reporting system and user-defined metrics, with reporters for JMX, Ganglia, Graphite, Prometheus, StatsD, Datadog and Slf4j.
  • REST API: Flink exposes REST endpoints to submit new applications, take savepoints of running applications, cancel applications, and query metadata and metrics of running or completed jobs.

Flink advantages

Stream processing

Event-driven

Low latency

High throughput

Accuracy and fault tolerance

Exactly-once support

Typical application:

data sources -> ETL -> data warehouse -> flink -> reporting and other downstream uses

Flink concepts

state: data kept in memory; memory responds quickly but is not durable.

checkpoint: a periodic backup of that state, so data can be recovered when a machine fails (a minimal sketch of enabling it follows below).
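A minimal sketch of turning checkpointing on in the Scala streaming environment used later in this post (the interval and the HDFS path are assumptions; Flink 1.13+):

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// snapshot all task state every 10 seconds with exactly-once guarantees
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)
// where the snapshots are persisted (an assumed HDFS directory)
env.getCheckpointConfig.setCheckpointStorage("hdfs://bigdata3:9000/tmp/flinktmp")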

On result correctness, take the Lambda architecture as an example.

It combines two systems (the Lambda layers, e.g. Spark Streaming for the speed layer):

  • Stream processing -> for speed
  • Batch processing -> to guarantee ordering and correctness

Data first goes through the streaming layer for fast results; once a time or volume threshold is reached, it is handed on to the batch layer, which produces the final, correctly ordered results.

Flink grew out of this lineage:

Storm was the first generation.

Lambda was the second generation.

Flink is the third generation, combining the strengths of all of the above.

Data model

Spark Streaming:

uses RDDs; a stream is effectively a sequence of small RDD batches.

Flink:

is fundamentally a data stream, a sequence of events.

Runtime architecture

Spark is a batch engine: the DAG is split into stages, and one stage must finish before the next starts.

Flink is a true streaming engine: once an event has been processed on one node it can be sent straight on to the next node.

Configuration

jobmanager: manages the whole job (analogous to the driver); it runs on the machine it is started on and talks to the taskmanagers, by default on port 6123.

rpc.address: the machine the JobManager starts on, set in the config file.

rpc.port: the RPC port.

heap.size: the JVM heap memory.

process.size: the total memory a taskmanager occupies, including the JVM and off-heap memory; this is the setting used by default.

flink.size: the memory used by the tasks themselves, including state; process.size includes flink.size.

numberOfTaskSlots: how many slots a taskmanager offers for executing tasks.

parallelism: the default parallelism. Unlike the slot count, which is allocated up front, this is what is actually used at runtime (see the short sketch after this list).

taskmanager: handles an individual task of the job (analogous to a worker).
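A short sketch of the difference (the socket host and port are the ones used later in this post):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(2)                                  // default parallelism for every operator of the job

val counts = env.socketTextStream("bigdata3", 8888)    // socket sources are fixed at parallelism 1
  .flatMap(_.split(","))
  .map((_, 1)).setParallelism(1)                       // a single operator can override the default
  .keyBy(_._1)
  .sum(1)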

Setting up a Flink project

In IDEA

pom

   <dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.12</artifactId>
<version>1.13.6</version>
</dependency>

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.12</artifactId>
<version>1.13.6</version>
</dependency>

Code:

package flinklearn

import org.apache.flink.api.scala.ExecutionEnvironment


object frist {
def apply(): frist = new frist()
def main(args: Array[String]): Unit = {
// frist().piwc()
frist().Streamwc()
}
}



class frist() {

//创建批处理执行环境类比sparksession
val pienvironment =org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment
val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

// 批处理wc
def piwc()={

import org.apache.flink.api.scala._
// 从文件中取数据
val path = "F:\\bigdatajava\\src\\main\\resources\\wc.data"
val value = pienvironment.readTextFile(path)

// 对数据进行转换处理
val resultds:DataSet[(String,Int)] = value.flatMap(_.split(",")).map((_,1)).groupBy(0).sum(1)
resultds.print()
}

// 流处理wc
def Streamwc() = {

import org.apache.flink.streaming.api.scala._

// 设置并行度 -> 界面的数字就是并行度,10> (flume,2) 前面的数字就是哪一个任务的id -> 是根据hash值进行分的 -> 默认是电脑的最大配置
// 下面是全局设置
// 还可按照每个算子后面设置
// 因为每个算子都算一个单独的任务
// val value1 = value.flatMap(_.split(",")).filter(_.nonEmpty).setParallelism(3).map((_, 1)).keyBy(0).sum(1).setParallelism(1)
streamingenv.setParallelism(1)


// 接受一个socket文本流
val value = streamingenv.socketTextStream("bigdata3",8888)

val value1 = value.flatMap(_.split(",")).filter(_.nonEmpty).map((_, 1)).keyBy(0).sum(1)

value1.print().setParallelism(1)

// 启动任务执行
streamingenv.execute("first")

}

}

部署flink并运行

先下载flink的包,我用的scala是2.12的所以下的是flink_scala_2.12的

根据自己的版本选择

地址:flink

下载完成上传到服务器

然后解压 -> 设置环境变量 -> 进入到flink的conf文件夹,编辑 flink-conf.yaml 文件 把参数 jobmanager.rpc.address:设置成主节点,然后按照需求是不是开启高可用,以及设置检查点的文件夹(如果文件夹放在hdfs上,则flink要两个依赖包,flink自己没有的分别是 flink-shaded-hadoop-3-uber-3.1.1.7.2.9.0-173-9.0.jar以及 commons-cli-1.5.0.jar)可以区maven官网下载,然后放到flink/lib下,这两个jar包要按照自己hadoop的版本进行下载

->然后再编辑workers -> 添加上子节点的名字 -> 分发到各个机器上

然后在主节点启动start-cluster.sh

就成功了 访问 主节点:8081

就可以访问flink的web页面

编写启动脚本如下:

case $1 in 
"start")
ssh bigdata5 "/home/hadoop/app/flink/bin/start-cluster.sh"
;;
"stop")
ssh bigdata5 "/home/hadoop/app/flink/bin/stop-cluster.sh"
;;
"status")
echo "web ui : bigdata5:8081"
jps| grep TaskManagerRunner
ssh bigdata4 "jps| grep TaskManagerRunner"
ssh bigdata5 "jps| grep StandaloneSessionClusterEntrypoint"
;;
*)
echo "error input you should use by start|stop|status"
;;
esac

把上述scala代码打包成jar包

web

上传到服务器的web界面如下

然后设置运行主类,以及并行度、参数、checkpoint路径就好

如果上面没有放置那两个jar包,则是无法在hdfs上设置checkpoint目录的

上述的代码执行之后输出在哪里呢?

他会输出在task manager里,至于具体在哪个里,应该点击输出任务

如下:


然后在web界面上点击task-manager -> 点击相应机器 -> 点击Stdout 就会看见控制台信息了

这就是web部署成功了

然后停止如下:

命令行

执行:flink run -m bigdata5:8081 -c flinklearn.frist -p 2 ./bigdatajava-1.0-SNAPSHOT.jar

就可以了,参数以及checkpoint可以加在后面,如果不设置,就走默认的

因为对于socket文本流他的并行度就是1,所以外面无法改变

如下:

经过ctrl + c 或者其他操作之后,这个作业并不会停掉

通过 flink list 9723a168e896e048b777473cb871e10a,后面的是job的id,其实只是为了查询更精确一些,这个参数是可选的

还可以接-a 代表查看所有的

通过 flink cancel jobid 就可以对指定的jobid进行停止

如下 :
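
命令行操作的一个小草图(jobid 沿用上面的示例):

# 查看正在运行的作业
flink list
# 查看所有作业(包括已结束的)
flink list -a
# 停止指定的作业
flink cancel 9723a168e896e048b777473cb871e10a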

部署模式

flink为我们的不同场景设置了不同的模式

  • 会话模式
  • 单作业模式
  • 应用模式

会话

先启动集群,然后其他的进行提交作业,就是我们上述的模式

优点:相当于集群先启动,所需的资源已经固定好了,集群的生命周期高于任何的job,不随job的结束而改变

缺点:资源不够的时候会出问题

和另外的资源管理平台结合用

单作业

每个作业都启动一个flink集群,就不会出现上述资源不够的问题

就是按照把资源按照作业来划分

相当于container

一般的时候是首选的,但是flink本身是没有办法用单作业的

他要借助别人的容器化的管理机制-> yarn/ k8s

应用模式

上述两种都是先在客户端执行的,然后再发送给jobmanager,但是会占用网络带宽,

而且对于单作业模式的情况,很可能会在客户端拆分成好几个作业,那么按照每个作业就启动一个集群的做法,会造成大量的资源浪费

然后我们直接把作业发送到jobmanager上直接由他做处理,就是应用模式

和单作业很像

单作业是作业和集群一对一

应用是应用和集群一对一

独立模式

不依赖任何外部资源管理平台

最基本,也是最简单的

在实际项目中使用会比较少

因为对资源的管理有要求

在独立模式下,没有单作业模式,因为单作业必须依赖外部资源管理平台

应用模式 -> 可以 但是使用少

首先把要运行的jar包放在flink的lib文件夹下

然后执行 standalone-job.sh start --job-classname flinklearn.frist,因为flink会默认扫描lib目录下所有的jar包,所以这里只指定入口类就好

然后 执行 taskmanager.sh start

停掉集群:

standalone-job.sh stop

taskmanager.sh stop

yarn模式

客户端先把flink的一个应用提交到yarn上

yarn的resourcemanager会在nodemanager上申请容器

在这些容器上flink会部署他的作业,flink会根据作业所需要的slot数量动态分配taskmanager的资源

hadoop至少是2.2及其以上

flink在1.8之前hadoop的版本和正常的版本是分开的,就是人家给了你两套

但是1.8-1.11我们要下载的仅仅只是hadoop的插件

但是1.11之后就更不用下载hadoop的插件了,我们主要就进行环境变量的配置就好了

要配置

export HADOOP_HOME=/home/hadoop/app/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_CLASSPATH=`hadoop classpath`
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

就好

然后要创建一个Yarnsession

在flink的主节点下用 yarn-session.sh -nm name就能关联上yarn

如下

但是仅仅这样启动的集群,在web界面查看后会发现插槽是0,如下

这是因为yarn模式下taskmanager是按需动态分配的,还没有提交作业时插槽就是0

当我们关掉它的时候yarnsession就会关掉了,我们可以加如下参数对它进行控制

-d : 分离模式,前台关掉后,yarn session不会跟着关掉

-jm : 配置jobmanager所需的内存 默认单位 MB

-nm : 配置名字

-qu : 指定yarn的队列名字

-tm: 配置每个taskmanager的内存

注意:flink 从1.11之后就不再使用 -s和-n 指定插槽数量以及taskmanager的数量了,yarn会动态的进行分配的

然后用户还是可以通过web和命令行两种方式提交作业,和上面standalone的时候是一样的

其实上述就是很简单的会话模式

单作业

在yarn模式的时候由于有了外部资源管理平台,就可以进行单作业模式了

执行 flink run -d -t yarn-per-job -c flinklearn.frist jar包的绝对路径

-d : 就是分离模式

-t :是指定yarn模式的模式 yarn-per-job 就是单作业

-c : 是class入口

后面还可以接参数等等

早期还有一种把 -t yarn-per-job 用 -m yarn-cluster 代替的写法

应用模式

和单作业模式很像,就是运行的参数不同

flink run-application -t yarn-application -c ....

查看作业

flink list -t yarn-application -Dyarn.application.id=....

取消作业

flink cancel -t yarn-application -Dyarn.application.id=....

还可以通过yarn.provided.lib.dirs配置选项指定位置 ,把jar上传到远程

flink run-application -t yarn-application -Dyarn.provided.lib.dirs="hdfs://bigdata3:9000/tmp/flinktmp" hdfs://bigdata3:9000/tmp/flinktmp

上传到hdfs上运行

flink运行时的架构

flink系统架构

作业管理器(jobmanager)

是flink集群中的任务管理中心以及调度中心

最核心的组件,负责单独处理job

在作业提交的时候jobmaster会先接受到要执行的应用,一般是客户端提交的,包括:jar,数据流图,作业图

jobmaster会把jobGraph转换成一个物理层面的数据流图,这个图被叫做执行图(ExecutionGraph),它包含了所有可以并发的任务,jobMaster会向资源管理器(ResourceManager)发送请求,申请执行任务必要的资源,一旦它获取了足够的资源,就会将执行图分别发到他们真正运行的TaskManager上

在运行过程中jobmaster会负责监控指标以及调度,比如说检查点的协调

资源管理器(resourcemanager)

在一个flink集群里只有一个,负责分配资源。所谓资源其实主要是taskmanager的任务槽(slot),任务槽就是flink集群中的资源调度单位,包含机器用来计算的cpu和内存资源。每一个任务都要分配到一个slot上执行,slot之间主要隔离的是内存

分发器(Dispatcher)

他主要负责提供一个rest接口,用来提交应用,并且为每一个新提交的作业启动一个新的jobMaster组件。Dispatcher也会启动一个web UI,用来方便展示和监控作业的信息。Dispatcher在架构中并不是必须的,在不同的部署模式中可能会被忽略

任务管理器(taskmanager)

flink中的worker

每一个taskmanager包含了一定数量的slot

插槽的数量限制了并行度 : 设置并行度的优先级 代码最高 其次是命令 其次是配置文件

启动之后taskmanager 会将一个或者多个插槽提供给jobmaster调用,jobmaster就可以向插槽分配任务来执行

执行过程中,一个taskManager可以和其他的与运行同一job的taskmanager来交互数据

一些执行流程图如下:

flink的细节

程序和数据流:

所有的flink程序都是要由三部分组成的 source transform sink

在运行flink项目的时候flink的程序会被映射成逻辑数据流(dataflow),它包含了三个部分 ,每一个dataflow都以一个或者多个source开始,以一个或者多个sink结束,其类似有向无环图(DAG)

大部分情况,程序中的转换操作(transform)和dataflow的算子(operation)是一一对应的关系

并行度

每一个算子可以包含多个或者一个子任务 ,这些子任务在不同的线程,不同的物理机,不同的容器中是完全独立的

一个特定的算子的子任务的个数就被称为并行度

任务并行:就是相当于多个线程
数据并行:同一个算子可以拆成多份,分别处理多份数据

例子:source的时候如何设置多并行?

它是把数据源进行复制,然后让每一个线程去处理不同的数据,最后再合到一起

数据传输形式

一个程序中,不同的算子可能有不同的并行度

算子之间传输数据的形式可以是one-to-one,也可以是redistributing的模式,具体是哪种取决于算子的种类

one-to-one:streaming维护着分区的顺序以及元素的顺序(比如source和map之间),这意味着元素的个数和顺序都相同。map,filter,flatMap等算子都是one-to-one的

Redistributing:指分区可能会发生改变,每一个算子的子任务依据所选择的transform发送数据到不同的目标任务

例如:keyBy基于hashcode重新分区,而broadcast和rebalance会随机重新分区,这些算子都会引起redistributing,而这个过程就相当于spark中的shuffle

于是就诞生了算子链

flink使用一种称为任务链的优化技术,减少通信的开销。为了满足任务链的要求,要将两个或者多个算子设为相同的并行度,并通过本地转发(local forward)的方式进行连接

相同并行度的one-to-one操作,flink放在一起,链接形成一个task,并行度相同,并且是one-to-one操作,两个条件缺一不可

执行图

flink中的执行图可以分为StreamGraph -> JobGraph -> ExecutionGraph -> 物理执行图

  • StreamGraph:是根据用户的api自动生成的最初的图,用来表示程序的拓扑结构
  • JobGraph:上一个图经过优化后,提交给jobmanager的数据结构,会将多个符合条件的节点chain到一起作为一个节点
  • ExecutionGraph : jobmanager根据JobGraph生成的并行化版本,是调度层的核心数据结构
  • 物理执行图:部署到各个taskmanager上之后形成的"图",描述任务具体怎么执行,并不是一个具体的数据结构

如下:

任务和任务槽

flink中每一个taskmanager就相当于是一个进程,他会在独立的线程上执行一个或者多个子任务

为了控制taskmanager能接收多少个task,taskmanager通过task slot来进行控制(一个taskmanager最少有一个slot)

slot最主要的作用就是隔离内存,因为cpu是没有办法真正隔离开的

flink里默认是允许子任务共享slot的,简单来说就是一个slot可以保存作业的整个处理管道

当我们将资源密集型和资源非密集型的任务放到同一个slot中,它们就可以自行分配资源占用的比例,从而保证把最重的活平均分摊给所有的taskmanager

slot和并行度

slot:静态概念,是指taskmanager具有的并发执行的能力

通过参数taskmanager.numberOfTaskSlots进行配置

并行度:动态概念,就是真正所用到的并发能力

通过参数:parallelism.default进行设置

简单来说就是我可以拿起多沉的东西,但是我不用那么大的力气

flink控制任务调度(代码)

可以禁用算子链

通过 xxx.disableChaining()

可以实现一个slot单独给一个算子用,同时也不能把他纳入任何一条算子链

还可以用 xxx.startNewChain()

可以实现从xxx开始一个新的算子链,不管前面如何都要分开

还可以设置slot共享组

就是在一个共享组里的slot才可以共享slot

不在一个共享组里的slot他们必须分开

通过 xxx.slotSharingGroup(String)实现 代表后面的算子默认情况下就是在String所在的共享组
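
把这几个 api 串起来的一个调用草图(算子逻辑是随便写的,只为演示调用的位置):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream = env.socketTextStream("bigdata3", 8888)

stream
  .map(_.trim).disableChaining()          // 这个 map 不和前后算子合并成算子链
  .filter(_.nonEmpty).startNewChain()     // 从 filter 开始一条新的算子链
  .map((_, 1)).slotSharingGroup("group1") // 从这里开始的算子进入 group1 共享组
  .print()

env.execute("chain demo")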

DataStreamAPI

对于以后的api,DataSet API即将被弃用

所以我们用DataStream API

可以把DataStream(DS)看成一种比较特殊的java集合类型

比如一个socket文本流底层就是DataStream

如果想调用DS的api,要先创建执行环境

创建环境

getExecutionEnvironment

它是相当于把下面两种放在一起了,自动判断

val pienvironment =org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment
val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

上述的getExecutionEnvironment方法是很智能的,它会自动识别我们是在本地调试还是在集群中调试,它会自动进行转换

createLocalEnvironment

是创建一个本地的环境,在调用的时候可以传入一个参数指定默认的并行度,如果不传入默认就是当前电脑的cpu核心数量

private val environment: StreamExecutionEnvironment = org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.createLocalEnvironment()

createRemoteEnvironment

调用远程的执行环境

private val environment: StreamExecutionEnvironment = org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.createRemoteEnvironment("bigdata5",8081,1,"jar包的路径")

它底层是这样定义的

def createRemoteEnvironment(
host: String,
port: Int,
parallelism: Int,
jarFiles: String*): StreamExecutionEnvironment = {

val javaEnv = JavaEnv.createRemoteEnvironment(host, port, jarFiles: _*)
javaEnv.setParallelism(parallelism)
new StreamExecutionEnvironment(javaEnv)
}

执行模式

经过上面获取的环境,我们就可以开始对其设置执行模式

在早期的代码中它把批处理和流处理分开了

通过代码

val pienvironment =org.apache.flink.api.scala.ExecutionEnvironment.getExecutionEnvironment
val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

这样的方式

上面一个是批处理的

下面一个是流处理的

他们的api是基本相同的,但是包不同

但是现在的做法是直接用下面的那个

对于批处理而言:我们只要在提交的时候通过命令

flink run -Dexecution.runtime-mode=BATCH 。。。。

就可以证明他是批处理的

如果不处理上述的参数默认是STREAMING :就是流处理的格式

或者在代码的时候直接通过

val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment
streamingenv.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC)
streamingenv.setRuntimeMode(RuntimeExecutionMode.BATCH)
streamingenv.setRuntimeMode(RuntimeExecutionMode.STREAMING)

里面传入相应的参数即可

但是一般不推荐这样做,因为这相当于固定死了,直接当命令行参数传递更好一点

在flink中批处理数据被划分到有界流中了,为什么还要批处理模式?

因为性能问题。流处理是来一条数据处理一条、发送一条;批处理是来一堆数据一起处理,然后再一起发送

对于批处理数据,它来的时候就是一批一批来的。如果用流处理,要一条一条发送,发送的次数就多了;而批处理只用处理完再一次发过去就好了

这就是批处理还在flink中的原因

我们的flink代码是懒执行的,和懒加载是一个道理,只有调用execute才开始真正的执行

source

源算子:就是读取数据源的算子

有界数据

读取有界数据的简单的测试方法

 val streamingenv= org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment

case class event(uaer:String,url:String,timestamp:Long)

streamingenv.setParallelism(1)
// 从元素中读取数据
streamingenv.fromElements(1,2,3,4,5,65,67,7,7).print("from elem")
streamingenv.fromElements(
event("zihan","1211",1111),
event("bob","1333",22222)
).print("from case class")

// 这个可以从迭代器中读取数据,具体可以ctrl + p 查看
val events = List(event("zihan", "1211", 1111), event("bob", "1333", 22222))
streamingenv.fromCollection(events).print("from list")

// 读取文本文件
streamingenv.readTextFile("F:\\bigdatajava\\src\\main\\resources\\wc.data").print("from text")

输出结果为

from elem> 1
from elem> 2
from elem> 3
from elem> 4
from elem> 5
from elem> 65
from elem> 67
from elem> 7
from elem> 7
from case class> event(zihan,1211,1111)
from list> event(zihan,1211,1111)
from case class> event(bob,1333,22222)
from list> event(bob,1333,22222)
from text> spark,linux,spark,spark
from text> hadoop
from text> linux,hive
from text> flume,flink
from text> gg,dd
from text> ttm,ff
from text> "zihan","1211",1111
from text> "bob","1333",22222
[WARN ][2023-02-04 16:15:57][org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$ReaderState$6.prepareToProcessRecord(ContinuousFileReaderOperator.java:178)]not processing any records while closed

Process finished with exit code 0

我们还可以把一些数据写进文本文件中然后进行读取

无界数据

我们一般是从kafka来接受数据的

我们先要引入链接kafka的依赖

如下:

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.12</artifactId>
<version>1.13.6</version>
</dependency>

值得注意的是这个是官方的,他会自动根据kafka的版本进行更新,目前支持kafka0.10.0版本及以上的

有特殊需要就去找特殊的版本的

而且1.14版本之后,引入数据源的方式有了更改,从FlinkKafkaConsumer变成了KafkaSource

代码如下

 
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
streamingenv.setParallelism(1)

// 链接kafka
val properties = new Properties()
properties.put("bootstrap.servers", "bigdata3:9092,bigdata4:9092,bigdata5:9092 ")

// 注意使用下面的那个方法的时候不用在此设置下面的参数,因为这个FlinkKafkaConsumer[T]里面已经封装好了,而且默认采用的就是精准一次
// properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// properties.put("acks", "all")
/*
传入参数说明FlinkKafkaConsumer[T]
topic , 反序列化器 , kafka配置参数
上面的T是代表把获得的数据当作什么类型
*/

streamingenv.addSource(new FlinkKafkaConsumer[String]("dl2262",new SimpleStringSchema(),properties)).print("kafka")

读取自定义数据源

如下

streamingenv.setParallelism(1)

/*
自己定义外部数据源
实现SourceFunction接口
重写两个方法run()和cancel()
run()获取数据的方法
cance()控制停止的方法
*/

import flinklearn.clickSource

val stream = streamingenv.addSource(new clickSource)

stream.print("makebyself")

source方法

package flinklearn

import java.util.Calendar

import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.util.Random

object clickSource {

def apply(): clickSource = new clickSource()

def main(args: Array[String]): Unit = {

}
}

/*
SourceFunction[T]
其中的泛型就是我们对应的返回的数据的类型
*/


class clickSource extends SourceFunction[event]{

// 标志位
var flag = true


def excute(): Unit ={

}


override def run(sourceContext: SourceFunction.SourceContext[event]): Unit = {
// 随机数生成器
val random =new Random()

// 定义选择的范围
val user = Array("1","2","3")
val url = Array("/cat","/.dog","/info")

//使用循环不停的发送数据,标志位做为判断题条件,不停的发送数据
while (flag){
val eventtmp = event(user(random.nextInt(2)),url(random.nextInt(2)),Calendar.getInstance().getTimeInMillis)
// 调用上下文sourceContext的方法向下游发送数据
sourceContext.collect(eventtmp)
// 每隔1s发送一条数据
Thread.sleep(1000)
}

}

override def cancel(): Unit = {
flag = false
}

}

但是对于SourceFunction它本身就是个并行度只能为1的接口

和socket文本流一样

如果想设置多并行度的就要用ParallelSourceFunction这个接口,其使用和上面一样
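
一个 ParallelSourceFunction 的最小草图(数据内容是随意造的,只为演示接口用法),每个并行实例都会各自执行一遍 run 方法:

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

class parallelNumberSource extends ParallelSourceFunction[Long] {
  // 标志位,控制停止
  var flag = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    var i = 0L
    while (flag) {
      ctx.collect(i)      // 向下游发送数据
      i += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = flag = false
}

// 使用时就可以给 source 设置大于 1 的并行度了
// streamingenv.addSource(new parallelNumberSource).setParallelism(3)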

flink支持的类型

flink里DS的数据类型都是由他的泛型进行控制的

val stream:DataStream[event] = streamingenv.addSource(new clickSource)

基本上scala和java里所有的类型它都支持,但只是基本上。因为flink是分布式的,它在每个节点之间传输数据的时候要经过网络,需要序列化和反序列化,所以对于一些数据类型就无法支持

它的底层类型都封装在TypeInformation和Types中,可以点进去查看

泛型类型不是由flink自己序列化的,而是由Kryo进行的,所以就可能出现问题,要尽可能避免
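
可以用下面这个小草图确认一个类型是被 flink 自己的序列化器识别,还是落到了 Kryo 的泛型序列化上(event 是前面定义的样例类):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._ // 提供 createTypeInformation 等隐式支持

val eventType: TypeInformation[event] = createTypeInformation[event]
println(eventType) // 样例类会被识别为 CaseClassTypeInfo,而不是走 Kryo 的 GenericTypeInfo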

算子

转换算子

  • map
  • filter
  • FlatMap
  • KeyBy:根据key进行分组(分区),里面可以传入字符串,或者位置下标,或者和map里一样传入一个function
  • 简单聚合方法 -> sum , min ,max 等
  • reduce:就是和spark里的reducebykey一样

调用的时候都是得到DS进行调用

使用如下:

    val value: DataStream[String] = streamingenv.readTextFile("F:\\bigdatajava\\src\\main\\resources\\wc.data")
value.flatMap(_.split(",")).map((_,1)).keyBy(_._1).reduce((x,y)=>{
(x._1,x._2+y._2)
}).print()
-------------------------------------------数据
spark,linux,spark,spark
hadoop
linux,hive
flume,flink
gg,dd
ttm,ff
"zihan","1211",1111
"bob","1333",22222

函数类(udf)

为什么他们里面可以放function

查看底层源码可以看见

@Public
public interface Function extends java.io.Serializable {}

他们继承于这个接口,并实现了各自的方法,所以就可以传入Function

进而导出udf是如何实现的

对于flink里的udf我们可以让它继承不同的function,然后再放进去

测试自定义udf的做法

package flinklearn
import org.apache.flink.api.common.functions.FilterFunction
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv()

val testdata = List(
event("zihan", "1211", 1111),
event("bob", "1333", 22222)
)

val testDS = environment.fromCollection(testdata)

// 筛选特定数据
testDS.filter( new myfiliterfunction() ).print()

testDS.filter( new FilterFunction[event] {
override def filter(value: event): Boolean = {
value.uaer.contains("zihan")
}
}).print()

environment.execute()
}
}

// 实现自定义的function
class myfiliterfunction() extends FilterFunction[event]{
def filter(value: event): Boolean = {
value.uaer.contains("zi")
}

}

注意,这里不要引用错包,如果引用错包,就会报错,因为scala和java的api名字是一样的

富函数(udf)

因为我们上述所说的udf是针对一条数据进行操作的

但是假如我们想对一批数据进行操作,也就是数据来之前对其进行操作怎么办?

我们要通过更加复杂的用户自定义类,是函数类的扩展版本

最大的不同就是富函数类,可以获取运行环境的上下文,以及有生命周期等

富函数类的继承接口是Rich...Function(例如RichMapFunction)

它里面有两个方法:

  • open : 相当于算子初始化的时候 和spring 里的初始化一样
  • close : 结束的时候 和spring里的销毁是一样的

如下:

package flinklearn
import org.apache.flink.api.common.functions.{FilterFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv()

val testdata = List(
event("zihan", "1211", 1111),
event("bob", "1333", 22222)
)

val testDS = environment.fromCollection(testdata)

// 定义富函数
testDS.map( new myRichmap).print("2")

val result =
"""
|索引号0
|编号为4a4f30b513560b972fd0e372460b71c4
|2> 1211
|2> 1333
|这个是结束方法0
|""".stripMargin
environment.execute()
}
}


// 实现富函数
class myRichmap extends RichMapFunction[event,String]{
override def map(value: event): String = {
value.url
}

// 在所有数据到来之前进行处理
override def open(parameters: Configuration): Unit = {
println("索引号" + getRuntimeContext.getIndexOfThisSubtask)
println("编号为" + getRuntimeContext.getJobId)
}

// closa
override def close(): Unit = {
println("这个是结束方法" + getRuntimeContext.getIndexOfThisSubtask)
}


}

注意,当并行度大于1的时候,每一个并行实例都会各自执行一次初始化(open)和销毁(close)

分区函数

简单来说就是数据的重新分区的操作

简单介绍一下:keyby

keyby:是把每个key根据hash值再取模的方式进行分区,也就造成了:相同的key一定在同一分区,不同的key不一定不在同一分区

接下来我们要学习的算子,是可以真正控制分区的

如果用上面keyby有可能会造成数据倾斜,也就是我们现在的操作就是控制数据倾斜的

物理分区,一般在并行度减少的时候会自动进行

随机分区(shuffle)

使用方法很简单直接DS.shuffle就可以了

轮询分区(Round-Robin)

对比上面的shuffle是洗牌,则他就是发牌,和打扑克一样的那种,和kafka以及nginx是一样的

调用方式DS.rebalance(),其实Ds里上游到下游默认的就是轮询

重缩放分区(rescale)

它和上面的轮询很像

轮询是把每一个并行子任务的数据都进行轮询,就是如果上游是两个任务,下游是三个任务

轮询会让第一个子任务的第一个数据 给下游的第一个,第一个第二个给下游的第二个,第一个的第三个给下游的第三个

上游的第二个子任务同理

但是rescale并不是这样,它做了分组,只在当前的组内进行轮询

就是相当于玩游戏局大了,要分开玩一样

每一个上游任务都会对应下游的一个组,然后在组里进行轮询,不能发牌给其他组

其本质上是按照taskmanager进行分组。每个taskmanager之间如果要进行通信,则要经过网络传输,代价比较大,

而rebalance的轮询是在上游taskmanager和下游taskmanager之间两两通信,所以要建立 M(上游数量)* N(下游数量)个通信通道

而rescale则不是,因为它按taskmanager分了组,理论上只需要建立组内的 1(上游)* N(下游)个通道,而且这里的N比上面的小得多

但是要注意如果想优化性能要让上游子任务和下游子任务的数量是倍数的关系最好

使用的时候直接DS.rescale就好

可以用自定义数据源进行测试如下:

package flinklearn
import org.apache.flink.api.common.functions.{FilterFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv(3)

val testdata = List(
event("zihan", "1211", 1111),
event("bob", "1333", 22222)
)
val testDS = environment.addSource(new trysource).setParallelism(3)
testDS.rescale.print("rescale")
environment.execute()
val result =
"""
|rescale:1> 2
|rescale:2> 1
|rescale:2> 3
|rescale:1> 4
|rescale:2> 5
|rescale:1> 6
|rescale:2> 7
|rescale:1> 8
|""".stripMargin

}
}



class trysource extends RichParallelSourceFunction[Int]{
override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
for (i <- 0 to 7){
if (getRuntimeContext.getIndexOfThisSubtask == (i+1)%2){
ctx.collect(i+1)
}
}
}

override def cancel(): Unit = ???
}


通过结果我们可以知道1,3,5,7对应的子任务的id都是2 ,则2,4,6,8是1

满足我们设置的条件

广播分区(broadcast)

把一份数据复制成多个然后发送到下游所有子任务

但是一般会造成数据重复。不过还是有用处的,比如在用广播状态创建广播流的时候
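
把上面几种物理分区方式放在一起的调用草图(数据源沿用前面的 clickSource):

val stream = streamingenv.addSource(new clickSource)

stream.shuffle.print("shuffle")     // 随机分区
stream.rebalance.print("rebalance") // 轮询分区
stream.rescale.print("rescale")     // 重缩放分区,只在组内轮询
stream.broadcast.print("broadcast") // 广播,每条数据发给下游所有子任务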

自定义分区

接口叫做partitionCustom

源码如下:

/**
* Partitions a DataStream on the key returned by the selector, using a custom partitioner.
* This method takes the key selector to get the key to partition on, and a partitioner that
* accepts the key type.
*
* Note: This method works only on single field keys, i.e. the selector cannot return tuples
* of fields.
*/
def partitionCustom[K: TypeInformation](partitioner: Partitioner[K], fun: T => K)
: DataStream[T] = {

val keyType = implicitly[TypeInformation[K]]
val cleanFun = clean(fun)

val keyExtractor = new KeySelector[T, K] with ResultTypeQueryable[K] {
def getKey(in: T) = cleanFun(in)
override def getProducedType(): TypeInformation[K] = keyType
}

asScalaStream(stream.partitionCustom(partitioner, keyExtractor))
}

Partitioner是分区器,后面的lambda表达式是提取当前分区字段的方法

点进去查看发现

public interface Partitioner<K> extends java.io.Serializable, Function {

/**
* Computes the partition for the given key.
*
* @param key The key.
* @param numPartitions The number of partitions to partition into.
* @return The partition index.
*/
int partition(K key, int numPartitions);
}

它也是一个接口,他的返回值是要返回到下游子任务的编号,也就是分区的编号

如下:

package flinklearn
import org.apache.flink.api.common.functions.{FilterFunction, Partitioner, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import tool._
import org.apache.flink.streaming.api.scala._
object udf {

def apply(): udf = new udf()

def main(args: Array[String]): Unit = {
udf().udftest
}
}

class udf{

private val streamingcontext = new streamingcontext

def udftest = {
val environment = streamingcontext.getflinkenv(3)

val testDS = environment.fromElements(1,1,2,3,4,5,6,67,7,8,8,5,6,4,3)
testDS.partitionCustom( new Partitioner[Int]{
override def partition(key: Int, numPartitions: Int): Int = {
key % 2
}
}, x=>x ).print("rescale")
environment.execute()
val result =
"""
|rescale:1> 2
|rescale:1> 4
|rescale:2> 1
|rescale:1> 6
|rescale:2> 1
|rescale:1> 8
|rescale:2> 3
|rescale:1> 8
|rescale:2> 5
|rescale:1> 6
|rescale:2> 67
|rescale:1> 4
|rescale:2> 7
|rescale:2> 5
|rescale:2> 3
|
|Process finished with exit code 0
|
|""".stripMargin

}
}





但是对于case class 可能不好使,我用就是不好用

输出算子

调用addSink就可以自定义一个sink

里面最关键的方法是invoke,具体在源码里

当然SinkFunction一般我们不用自己实现,因为官方给我们提供了很多现成的connector

接下来我们按照官网进行学习

JDBC

先在idea里添加依赖

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId>
<version>1.16.0</version>
</dependency>

已创建的 JDBC Sink 能够保证至少一次的语义。 更有效的精确执行一次可以通过 upsert 语句或幂等更新实现。

val value1: DataStreamSink[Yarninfo] = value.addSink(
JdbcSink.sink(
"insert into yarninfo(id,host,applicationtype,name,startime,endtime,user,memeveryscends,vcoreeveryscends,size,cores,state,url) values(?,?,?,?,?,?,?,?,?,?,?,?,?)",
new JdbcStatementBuilder[Yarninfo] {
override def accept(t: PreparedStatement, u: Yarninfo): Unit = {
t.setString(1, u.id)
t.setString(2, u.host)
t.setString(3, u.applicationtype)
t.setString(4, u.name)
t.setString(5, u.startime)
t.setString(6, u.endtime)
t.setString(7, u.user)
t.setString(8, u.memeveryscends)
t.setString(9, u.vcoreeveryscends)
t.setString(10, u.size)
t.setString(11, u.cores)
t.setString(12, u.state)
t.setString(13, u.url)
}
},
new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
.withUrl("jdbc:mysql://bigdata2:3306/bigdata")
.withDriverName("com.mysql.jdbc.Driver")
.withUsername("root")
.withPassword("liuzihan010616")
.build()
)
)

如果要实现幂等性等,要自己额外进行处理
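
比如在 mysql 里可以借助 upsert 语句做到幂等写入:把上面 JdbcSink 里的 insert 换成类似下面的 sql(这里假设 id 是主键,列只写了几个作示意):

val upsertSql =
  """
    |insert into yarninfo(id, host, state, url) values(?,?,?,?)
    |on duplicate key update host = values(host), state = values(state), url = values(url)
    |""".stripMargin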

文件

flink写入到文件中

如下:

testDS.map(_.toString).addSink(StreamingFileSink.
forRowFormat(new Path("./output"),
new SimpleStringEncoder[String]("UTF-8"))
.build())

分区数量等于生成的文件数量

还可以在.build之前用withXXX来设置一些写入的参数

  • withBucketCheckInterval():设置多长时间检查一次是否需要滚动
  • 等,具体自己看下就ok

写入到hdfs上的时候也直接改一下path就好
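
下面是一个带滚动策略的文件 sink 草图(时间和大小参数只是示例值):

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

// 滚动策略:满足任意一个条件就滚动生成新文件
val rollingPolicy: DefaultRollingPolicy[String, String] = DefaultRollingPolicy.builder()
  .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))  // 至少 15 分钟滚动一次
  .withInactivityInterval(TimeUnit.MINUTES.toMillis(5)) // 5 分钟没有新数据也滚动
  .withMaxPartSize(1024 * 1024 * 1024)                  // 单个文件最大 1GB
  .build()

val fileSink = StreamingFileSink
  .forRowFormat(new Path("./output"), new SimpleStringEncoder[String]("UTF-8"))
  .withRollingPolicy(rollingPolicy)
  .withBucketCheckInterval(1000L) // 每 1s 检查一次是否需要滚动
  .build()

// testDS.map(_.toString).addSink(fileSink)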

kafka

如下:

testDS.map(_.toString).addSink(
new FlinkKafkaProducer[String]
("bigdata3:9092,bigdata4:9092,bigdata5:9092","dl2262",new SimpleStringSchema())
)

就可以往kafka里写入了

自定义外部连接器

就是通过继承SinkFunction,或者对应的RichSinkFunction

实现invoke方法,在里面写入自己的输出逻辑
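
一个自定义 sink 的最小草图(这里用打印代替真正的外部写入,连接的创建和关闭分别放在 open 和 close 里):

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class mySink extends RichSinkFunction[String] {

  // 所有数据到来之前:创建外部连接(示意)
  override def open(parameters: Configuration): Unit = {
    println("open connection")
  }

  // 每来一条数据调用一次
  override def invoke(value: String, context: SinkFunction.Context): Unit = {
    println("write: " + value)
  }

  // 结束时:关闭外部连接(示意)
  override def close(): Unit = {
    println("close connection")
  }
}

// testDS.map(_.toString).addSink(new mySink)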

时间语义

对于无界流,我们要查看它一定时间内的数据

对于分布式系统,我们没有一个绝对的时间指标

窗口进行数据的收集是以什么为标准的?

处理时间

就是我们对数据进行处理的时候的时间

事件时间

就是这个数据什么时候产生的

水位线

用来衡量事件时间进展的标记

当我们使用事件时间的时候,假如我们要采集8点到9点的数据

那么用事件时间,就是在数据生成的时候打上时间标记,按这个时间进行统计。

假如下游还有对时间进行操作的算子,就只能去提取事件时间的时间戳进行计算,

这样下游的操作就会延迟数据的输出时间,导致输出的数据是一段一段的

于是就把时间戳提出来,当作一个变量。当对这个数据进行处理的时候,在时间戳上打个标记,

并包装成一种特殊的数据形式,直接插入数据流,跟随数据一起流动,下游看见这个标记就知道时间推进到哪了

也就是说,在对每一条数据进行处理之后,我们会在这条数据之后加一个类似标记的东西,这个标记和数据的时间有关系,作用就是告诉下游:我当前处理到的数据是这个时间的

有序流中的水位线

就是按照时间顺序进行插入时间戳,保证了数据的顺序

但是如果事件生成得特别快、时间特别密,则水位线打上的时间会有很多相同的;而且因为数据量特别大,逐条打水位线所需要的时间和资源会特别多。于是我们改成每间隔一段时间插入一条水位线,插入的时间戳就取它之前最近一次提取到的最大时间戳。插入的时间周期默认是200ms(可以设置)。ps:这个插入周期是按照系统时间算的,每过200ms生成一次

但是假设:

上游是三个分区。下游是一个分区,那么则可能出现乱序,

就是假如第一个分区正常处理时间数据。而对于第二个分区则是有问题或者延迟什么的,它发送了一个在之前时间的数据,就会发生乱序

第一个分区发送的数据如下:1,2,3,4数据全到下面的分区了

第二个分区又发送了个2的数据,就会出现数据集乱序的问题

解决方法

设置一个标志位,保存之前最大的时间戳,然后用这个标志位推进时间,并和新来数据的时间戳进行对比;如果来的数据特别多,可以采用和上面一样的方法,周期性地判断最大时间戳

但是上述的方法会出现问题:假如我们定义一个0-9s的窗口,按照这个方法,可能会有迟到的数据,然后就会丢数据

解决方法

设置延迟,就是让水位线延迟2s:真实数据的时间是2s的时候,水位线的时间是0s,这样就可以减少丢数据的情况,因为窗口是按照水位线的时间来关闭的。但是上述的方法也不算严谨,最终的解决方法就是等足够长的时间

就是我们判断一个数据流中的最大乱序程度来设置这个时间:假如22s后面跟着一个17s的数据,则说他的最大乱序程度是22-17=5s,如果还有比这个大的,就更新这个时间,同时这个时间也就是要延迟的秒数

水位线特性:

  • 水位线是一个插入到数据流中的一个标记,可以认为是一个特殊的数据
  • 水位线的主要内容就是一个时间戳,用来表示当前事件时间的进展的
  • 水位线是基于数据的时间戳进行生成的
  • 水位线的时间戳必须是单调递增的,以确保时间的推进
  • 水位线可以通过设置延迟来进行处理迟到的数据

水位线为t就表示之后不会再出现时间戳小于等于t的数据了

但是如何确认最大乱序时间?

一般这个最大乱序时间是符合一个正态分布的,所以最终我们就是在正确性和延迟之间做一个权衡

在idea代码如下:

水位线的基本使用:

package flinklearn

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, Watermark, WatermarkGenerator, WatermarkGeneratorSupplier, WatermarkOutput, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool

object f3 {
def apply(parameterTool: ParameterTool): f3 = new f3(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
f3(tool).excute()
}
}




class f3(parameterTool: ParameterTool) {

import org.apache.flink.streaming.api.scala._

def excute()={
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// 设置水位线的生成周期,默认单位是毫秒
env.getConfig.setAutoWatermarkInterval(500)

val value = env.addSource(new clickSource)

// 有序流的水位线生成策略
value.assignTimestampsAndWatermarks( WatermarkStrategy.forMonotonousTimestamps().withTimestampAssigner(new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))


// 乱序流的水位线生成方法
// 这里的Duration 是java.time下的
value.assignTimestampsAndWatermarks( WatermarkStrategy.forBoundedOutOfOrderness[event](Duration.ofSeconds(5)).withTimestampAssigner(new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))

// 自定义水位线
value.assignTimestampsAndWatermarks(new WatermarkStrategy[event] {
override def createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context): WatermarkGenerator[event] = {
new WatermarkGenerator[event] {
// 底层默认要实现的两个方法 但是flink内置了几种基本的策略,在WatermarkStrategy源码中
// 事件触发
val delay = 5000L
// 定义属性保存最大时间戳
var maxtx = Long.MinValue + delay + 1

// 判断最大时间戳
// 按照系统时间做调度
override def onEvent(event: event, eventTimestamp: Long, output: WatermarkOutput): Unit = {
maxtx = Math.max(maxtx,event.timestamp)
}

// // 按照数据进行调度
// override def onEvent(event: event, eventTimestamp: Long, output: WatermarkOutput): Unit = {
// maxtx = Math.max(maxtx,event.timestamp)
// val watermark = new Watermark[event](maxtx)
// output.emitWatermark(watermark)
// }


// 周期行的生产水位线
override def onPeriodicEmit(output: WatermarkOutput): Unit = {
val watermark = new Watermark[event](maxtx -delay -1)
// 周期性发射
output.emitWatermark(watermark)
}

}
}
})

}
}

但是我们还可以在数据源处进行配置,自定义source的时候可以直接定义水位线等参数,如下

package flinklearn

import java.util.Calendar

import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.watermark.Watermark

import scala.util.Random

object clickSource {

def apply(): clickSource = new clickSource()

def main(args: Array[String]): Unit = {

}
}

/*
SourceFunction[T]
其中的泛型就是我们对应的返回的数据的类型
*/


class clickSource extends RichParallelSourceFunction[event]{

// 标志位
var flag = true


def excute(): Unit ={

}


override def run(sourceContext: SourceFunction.SourceContext[event]): Unit = {
// 随机数生成器
val random =new Random()

// 定义选择的范围
val user = Array("1","2","3")
val url = Array("/cat","/.dog","/info")




//使用循环不停的发送数据,标志位做为判断题条件,不停的发送数据
while (flag){
val eventtmp = event(user(random.nextInt(2)),url(random.nextInt(2)),Calendar.getInstance().getTimeInMillis)
// 为要发送的数据指定时间戳,按照下面指定完成之后发送数据的时候就会知道哪一个是时间戳,就可以不实现withTimestampAssigner了
sourceContext.collectWithTimestamp(eventtmp,eventtmp.timestamp)
// 往下游直接发送水位线,然后下游就可以不用assignTimestampsAndWatermarks这个方法了,因为水位线已经生成完了
sourceContext.emitWatermark(new Watermark(eventtmp.timestamp))

// 每隔1s发送一条数据
sourceContext.collect(eventtmp)
Thread.sleep(1000)
}

}

override def cancel(): Unit = {
flag = false
}

}

就可以了

水位线的正常就是像数据一样正常的流动,这个是单分区的时候

如果想发送到多个下游的子任务,我们应该广播出去,

但是如果上游有多个分区,那么他们广播的水位线如果不一样,下游该采用哪一个水位线?

答案是取最小的那个水位线

我们会设置一个分区水位线的概念,就是采取最小的分区水位线

窗口

我们要观察,或者对一定时间内的数据进行操作,一般定义窗口的时候都是左闭右开,滑动窗口是可以出现重复的数据

但是在事件时间语义下出现乱序的时候,就会有迟到的数据,然后我们就要设置延迟时间

但是,既然有迟到的数据,那么也就会有超前的数据落在这个窗口中,于是我们不能简单地把窗口想象成一段一段的框

我们可以想象成桶的概念:如果数据的时间戳符合这个窗口规定的时间范围,就会被放到对应的桶中,

这样就不会出现时间不对的数据导致观察错误

窗口的分类:

  • 时间窗口

    • 滚动窗口:就是头连着尾巴一样,一直看,生产很多都是基于滚动窗口的,就类似于把数据分成很多个框框,挨个看
    • 滑动窗口:基于上面的滚动窗口,就像一个滑块一样从头滑到尾,也叫跳动窗口。滑动窗口的参数是滑动步长,就是每次滑动的距离;如果把滑动步长调成和窗口一样长,就变成滚动窗口了
    • 会话窗口:他的标准并不是给窗口设置一个固定的大小,开始和结束的规律也是完全没有的,窗口之间一定没有重叠的,会复杂点
    • 全局窗口:就是全局的,默认是不会触发计算的因为数据不会停下,但是可以设置触发器,进行设置
  • 计数窗口

    • 滚动窗口:同上
    • 滑动窗口:同上
    • 会话窗口:同上
    • 全局窗口:同上

时间窗口略微的复杂点,计数则更为简单

窗口api:可以看成DataStream api的一小部分

首先,我们要确定我们做没做keyby

如果keyby了,则要通过调用.window进行开始,会在多个并行子任务上执行,针对每一个key进行执行

如果没做keyby,则是调用.windowAll(),相当于并行度变成1

无论是上面的哪一个window/windowall

都要接上窗口分配器,然后加上窗口函数

除了需要完全自定义的窗口分配器以外,flink都提供了内置的实现

如下:

package flinklearn



import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, ProcessingTimeSessionWindows, SlidingEventTimeWindows, SlidingProcessingTimeWindows, TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

object flinkWindos {
def apply(parameterTool: ParameterTool): flinkWindos = new flinkWindos(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
flinkWindos(tool).ecxcute()
}
}


class flinkWindos(parameterTool: ParameterTool){

import org.apache.flink.streaming.api.scala._

def ecxcute()={
val env = StreamExecutionEnvironment.getExecutionEnvironment
val zihan = env.addSource(new clickSource)

val zihan1 = zihan.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps().withTimestampAssigner(
new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))

zihan1.map(data => {(data.uaer,1)})
.keyBy(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(7)))// 基于事件时间的滚动窗口 , 偏移量为后面的参数
// .window(TumblingProcessingTimeWindows.of(Time.days(1),Time.hours(-8))) // 基于处理时间的滚动窗口
// .window(SlidingEventTimeWindows.of(Time.days(1),Time.minutes(10),Time.hours(-8))) // 基于事件时间的滑动窗口 步长为10min
// .window(SlidingProcessingTimeWindows.of(Time.days(1),Time.minutes(10),Time.hours(-8))) // 基于处理时间的滑动窗口 步长为10min
// .window(EventTimeSessionWindows.withGap(Time.seconds(10))) // 基于事件时间的会话窗口
// .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10))) // 基于处理时间的会话窗口
// .countWindow(10) // 大小为10的滚动计数窗口
// .countWindow(10,2) // 大小为10的滑动计数窗口,步长为2

// 窗口函数
// 分为增量窗口 和 全窗口
// 增量聚合 是每来一条数据,就处理一条数据,然后存储他的状态,等窗口满足条件,直接输出
// 全窗口,则是类似批处理的形式,把数据都聚集在一起,然后满足条件执行操作,在输出


/*
增量聚合函数包括(典型) : ReduceFunction AggregateFunction
规约聚合:reduceFunction -> 两两进行规约,就和之前简单函数的那个是一样的
*/
// reduce 他在规约的过程中,中间是不能变的,就是数据的输入,输出,规则都一样
// .reduce( (x,y)=> {
// (x._1,x._2+y._2)
// } )
// .print()
// aggre 则可以改变类型,比上面更为灵活
.aggregate(new tryFunction)

env.execute()

}



}


class tryFunction extends org.apache.flink.api.common.functions.AggregateFunction[(String,Int),(Long,Set[String]),Double] {
override def createAccumulator(): (Long, Set[String]) = {
(0,Set[String]()) // 赋初值
}

// 计算过程
override def add(value: (String, Int), accumulator: (Long, Set[String])): (Long, Set[String]) = {
(value._2 + accumulator._1 , accumulator._2 + value._1)
}

// 结果
override def getResult(accumulator: (Long, Set[String])): Double = {
accumulator._1.toDouble / accumulator._2.size
}

// 会话窗口要用的
override def merge(a: (Long, Set[String]), b: (Long, Set[String])): (Long, Set[String]) = ???
}

全窗口函数:

就相当于针对于全局的窗口函数,而且它可以获取更多的信息

窗口函数现在处于一个迭代的过程中,所以可能会略微复杂些

首先,原本的窗口函数是通过.apply进行调用的,里面传入的参数是WindowFunction,这是最早的用法,不过现在已经快被弃用了

因为出现了一个比它更好用的ProcessWindowFunction,它不光可以获取上下文window信息,还可以获取很多其他的属性

而且ProcessWindowFunction是富函数。WindowFunction的定义如下:

trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

/**
* Evaluates the window and outputs none or several elements.
*
* @param key The key for which this window is evaluated.
* @param window The window that is being evaluated.
* @param input The elements in the window being evaluated.
* @param out A collector for emitting elements.
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}

ProcessWindowFunction

abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window]
extends AbstractRichFunction {

/**
* Evaluates the window and outputs none or several elements.
*
* @param key The key for which this window is evaluated.
* @param context The context in which the window is being evaluated.
* @param elements The elements in the window being evaluated.
* @param out A collector for emitting elements.
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
@throws[Exception]
def process(key: KEY, context: Context, elements: Iterable[IN], out: Collector[OUT])

/**
* Deletes any state in the [[Context]] when the Window expires
* (the watermark passes its `maxTimestamp` + `allowedLateness`).
*
* @param context The context to which the window is being evaluated
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
@throws[Exception]
def clear(context: Context) {}

/**
* The context holding window metadata
*/
abstract class Context {
/**
* Returns the window that is being evaluated.
*/
def window: W

/**
* Returns the current processing time.
*/
def currentProcessingTime: Long

/**
* Returns the current event-time watermark.
*/
def currentWatermark: Long

/**
* State accessor for per-key and per-window state.
*/
def windowState: KeyedStateStore

/**
* State accessor for per-key global state.
*/
def globalState: KeyedStateStore

/**
* Emits a record to the side output identified by the [[OutputTag]].
*/
def output[X](outputTag: OutputTag[X], value: X);
}
}

下面我简单用ProcessWindowFunction进行创建

package flinklearn


import org.apache.flink.streaming.api.scala.function.{ProcessWindowFunction, WindowFunction}
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object flinkwindowall {
def apply(): flinkwindowall = new flinkwindowall()

def main(args: Array[String]): Unit = {
flinkwindowall().excute()
}
}


class flinkwindowall(){

import org.apache.flink.streaming.api.scala._



def excute(): Unit ={
val env = StreamExecutionEnvironment.getExecutionEnvironment

val value = env.addSource(new clickSource)

// 指定一个无关的数据,代表全局
value.assignAscendingTimestamps(_.timestamp) // 创建水位线
.keyBy(data => "key") // 设置全局分区
.window(TumblingEventTimeWindows.of(Time.seconds(10))) // 开窗
.process(new firstProcessWindowFunction ) // 调用ProcessWimdowFunction的方法


env.execute()
}
}

class firstProcessWindowFunction extends ProcessWindowFunction[event,String,String,TimeWindow]{
override def process(key: String, context: Context, elements: Iterable[event], out: Collector[String]): Unit = {
// 使用set进行去重
var userset = Set[String]()


// 从element中提取元素
elements.map(userset += _.uaer)
val uv = userset.size
// 提取窗口信息,进行输出
val end = context.window.getEnd
val start = context.window.getStart

println(s"从${start}${end} 的uv是${uv}")


}
}

可以把上述的全窗口和增量放到一起:aggregate方法可以传入两个参数,一个是增量聚合函数,一个是全窗口函数

就表示增量聚合的结果变成了全窗口函数的输入,也就是两者结合,如下:

package flinklearn



import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, ProcessingTimeSessionWindows, SlidingEventTimeWindows, SlidingProcessingTimeWindows, TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object flinkWindos {
def apply(parameterTool: ParameterTool): flinkWindos = new flinkWindos(parameterTool)

def main(args: Array[String]): Unit = {
val tool = ParameterTool.fromArgs(args)
flinkWindos(tool).ecxcute()
}
}


class flinkWindos(parameterTool: ParameterTool){

import org.apache.flink.streaming.api.scala._

def ecxcute()={
val env = StreamExecutionEnvironment.getExecutionEnvironment
val zihan = env.addSource(new clickSource)

val zihan1 = zihan.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps().withTimestampAssigner(
new SerializableTimestampAssigner[event] {
override def extractTimestamp(element: event, recordTimestamp: Long): Long = {
element.timestamp
}
}))

val zihan2 = zihan1.map(data => {(data.uaer,1)})
.keyBy(data => "key")
// .window(TumblingEventTimeWindows.of(Time.seconds(7)))// 基于事件时间的滚动窗口 , 偏移量为后面的参数
// .window(TumblingProcessingTimeWindows.of(Time.days(1),Time.hours(-8))) // 基于处理时间的滚动窗口
.window(SlidingEventTimeWindows.of(Time.seconds(10),Time.minutes(2))) // 基于事件时间的滑动窗口,窗口大小10s,步长2min
// .window(SlidingProcessingTimeWindows.of(Time.days(1),Time.minutes(10),Time.hours(-8))) // 基于处理时间的滑动窗口 步长为10min
// .window(EventTimeSessionWindows.withGap(Time.seconds(10))) // 基于事件时间的会话窗口
// .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10))) // 基于处理时间的会话窗口
// .countWindow(10) // 大小为10的滚动计数窗口
// .countWindow(10,2) // 大小为10的滑动计数窗口,步长为2

// 窗口函数
// 分为增量窗口 和 全窗口
// 增量聚合 是每来一条数据,就处理一条数据,然后存储他的状态,等窗口满足条件,直接输出
// 全窗口,则是类似批处理的形式,把数据都聚集在一起,然后满足条件执行操作,在输出


/*
增量聚合函数包括(典型) : ReduceFunction AggregateFunction
规约聚合:reduceFunction -> 两两进行规约,就和之前简单函数的那个是一样的
*/
// reduce 他在规约的过程中,中间是不能变的,就是数据的输入,输出,规则都一样
// .reduce( (x,y)=> {
// (x._1,x._2+y._2)
// } )
// .print()
// aggre 则可以改变类型,比上面更为灵活
zihan2.aggregate(new tryFunction11, new firstProcessWindowFunction1).print()

env.execute()

}



}

import org.apache.flink.api.common.functions._
class tryFunction11 extends AggregateFunction[(String,Int),(Long,Set[String]),Double] {
override def createAccumulator(): (Long, Set[String]) = {
(0L,Set[String]()) // 赋初值
}

// 计算过程
override def add(value: (String, Int), accumulator: (Long, Set[String])) = {
(value._2 + accumulator._1 , accumulator._2 + value._1)
}

// 结果
override def getResult(accumulator: (Long, Set[String])): Double = {
accumulator._1.toDouble / accumulator._2.size
}

// 会话窗口要用的
override def merge(a: (Long, Set[String]), b: (Long, Set[String])): (Long, Set[String]) = ???
}

class firstProcessWindowFunction1 extends ProcessWindowFunction[Double,Double,String,TimeWindow]{
override def process(key: String, context: Context, elements:Iterable[Double], out: Collector[Double]): Unit ={


var total:Double = 0
elements.map(total+=_)
// 提取窗口信息,进行输出
val end = context.window.getEnd
val start = context.window.getStart
println(s"从${start}${end} 的rate是${elements}额外的统计信息是${total}")


}


}

处理迟到数据

可以允许迟到数据

通过调用windowStream下的allowedLateness,设置允许迟到时间,等到达时间,则会发送到下游

还可以通过侧输出流,收集过于迟到的数据,但是对这个侧输出流的操作是影响不到窗口的,和窗口相当于是分开的

代码:

package flinklearn
import java.time.Duration
import java.util.Calendar

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.{TumblingEventTimeWindows, TumblingProcessingTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import tool._
object dealdelaydata {

def main(args: Array[String]): Unit = {
val environment = StreamExecutionEnvironment.getExecutionEnvironment
val value = environment.socketTextStream("43.140.193.43", 6000).map(data=>{
val strings = data.split(" ")
loginfo(strings(0),strings(1))
})
val resulttmp = value.assignTimestampsAndWatermarks(WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner[loginfo] {
override def extractTimestamp(element: loginfo, recordTimestamp: Long): Long = {
element.dt.toLong
}
}))

// val resulttmp2 = resulttmp.keyBy(_.log).window(new TumblingProcessingTimeWindows()).process(new myprocessTimeWindow)
//
// resulttmp2.print()

val flag = new OutputTag[loginfo]("test")
val resluttmp3 = resulttmp.keyBy(_.log).window( TumblingProcessingTimeWindows.of(Time.seconds(10))).allowedLateness(Time.seconds(10)).sideOutputLateData(flag).process( new myprocessTimeWindow)
resluttmp3.print("resulttmp3的原始数据")
resluttmp3.getSideOutput(flag).print("侧输出流")

environment.execute()
}
}

class myprocessTimeWindow extends ProcessWindowFunction[loginfo,String,String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[loginfo], out: Collector[String]): Unit = {

out.collect(s"处理时间${context.window.getStart}~${context.window.getEnd}用户${key}的点击次数${elements.size}当前水位线为${context.currentWatermark}")
}
}



import org.apache.flink.api.common.functions._
class myeventTimewindow extends AggregateFunction[loginfo,String,String]{
override def createAccumulator(): String = ???

override def add(value: loginfo, accumulator: String): String = ???

override def getResult(accumulator: String): String = ???

override def merge(a: String, b: String): String = ???
}

处理函数

基本处理函数(ProcessFunction)
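
这里先放一个 KeyedProcessFunction 的最小草图(注册一个处理时间定时器,定时器触发时再输出一条信息,event 沿用前面的样例类):

import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

class myKeyedProcessFunction extends KeyedProcessFunction[String, event, String] {

  // 每来一条数据调用一次
  override def processElement(value: event,
                              ctx: KeyedProcessFunction[String, event, String]#Context,
                              out: Collector[String]): Unit = {
    out.collect(s"数据到达,时间戳是 ${value.timestamp}")
    // 注册一个 10 秒之后的处理时间定时器
    val ts = ctx.timerService().currentProcessingTime() + 10 * 1000
    ctx.timerService().registerProcessingTimeTimer(ts)
  }

  // 定时器触发时调用
  override def onTimer(timestamp: Long,
                       ctx: KeyedProcessFunction[String, event, String]#OnTimerContext,
                       out: Collector[String]): Unit = {
    out.collect(s"定时器触发,触发时间是 ${timestamp}")
  }
}

// 使用:value.keyBy(_.uaer).process(new myKeyedProcessFunction).print()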

sparkstreaming

用于实时计算的模块 =》 sparkstreaming,structuredstreaming

流处理 : 实时

  • 实时 来一条数据处理一条 storm,flink 数据叫event
  • 近实时 来一批数据处理 mini-batch sparkstreaming
  • 数据会源源不断地来

批处理 : 离线

  • 代码或者程序处理一个批次的数据

    • 例子:数据放在hdfs上,我们对他进行处理 =》 ok

技术选型

生产上:

  • sparkstreaming,structuredstreaming 10%
  • flink 90%
  • storm 2%

开发角度:

  • code =》 flink > sparkstreaming
  • sql => flink > spark streaming

业务:

  • 实时指标 :都差不多
  • 实时数仓:
    • 代码 : 差不多
    • sql文件 : flinksql维护实时数仓 =》 ok

特性

容易使用 =》 客观看

批流一体的处理方法 =》 sparksql <=> 流处理

低延迟高吞吐

简介

  • sparkstreaming开发是spark-core的一个扩展
  • 接收数据的渠道多
  • 还可以对数据进行流处理的可以机器学习等

一般来说流式处理会比批处理负载小,但不绝对

数据源 :

  • kafka ****** 流式引擎重要的数据源 -》 通过topic进行数据缓冲,它会根据sp的吞吐量来进行处理,两个引擎之间会有联系
  • flume **** 可以使用但是一般不用。flume没有数据缓冲,这点很致命 -》 它直接把数据推到sp里,如果数据量特别多,而sp程序的吞吐量比较小,就会把sp程序压垮,flume和sp之间没有反压联系
  • hdfs

数据积压:kafka数据太大,导致sp程序一直处理不过来,一个出不来报表 =>解决方法

  • 吞吐量提高
  • 数据量减少

sparkstreaming运行机制

  • 接收数据
  • 拆分成batches

sparkstreaming -> kafka :

  • 5s处理数据
  • 每5s会切分成一次batch
  • 交给spark engine处理
  • 处理完的也是一个batch

sparkstreaming编程模型:Dstream

  • 外部数据源
  • 高级算子
  • 类似RDD

idea开发先配置pom文件

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>3.2.1</version>
</dependency>

idea代码

package sparkstreaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf

object sparkstreaming1 {
def main(args: Array[String]): Unit = {

val conf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))
// 或者通过sparkcontext进行创建
//val ssc = new StreamingContext(sc, Seconds(1))
// 数据源
val lines = ssc.socketTextStream("bigdata5", 9999)
// 处理数据
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// 打印数据当前批次
wordCounts.print()
ssc.start() // Start
ssc.awaitTermination() // Wait

// 配置数据源在目标机器上执行nc -lk 9999 然后输入数据就ok了
}

}

还可以在webui上查看

如下:

他的打印数据是处理当前批次的数据,不是累积批次的数据

双流join

api :

  • flink -》调用api
  • sparkstreaming code 很多 -》 api join state

延迟数据

  • processtime + udf
  • eventtime + watermark
    • 数据和离线对不上(容易)

如何构建DStream

  • 从inputstream的方式 生产上
  • receiver 测试用 为面试准备

构建Dstream

inputstream

比如卡夫卡

receiver

用receiver接收的时候,如果是本地模式则并行度要大于1 -> local[2+]

因为sparkstreaming最少分为接收和处理两部分,如果只给1个core,就没有资源进行处理了

所以针对receiver,local的并行度要大于等于2

上面仅仅是针对receiver

例子 :

val lines = ssc.socketTextStream("bigdata5", 9999)

因为他底层源码是

def socketTextStream(
hostname: String,
port: Int,
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}

Dstream算子

转换操作:

Similar to that of RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDD’s. Some of the common ones are as follows.
Transformation Meaning
map ( func ) Return a new DStream by passing each element of the source DStream through a functionfunc .
flatMap ( func ) Similar to map, but each input item can be mapped to 0 or more output items.
filter ( func ) Return a new DStream by selecting only the records of the source DStream on whichfunc returns true.
repartition ( numPartitions ) Changes the level of parallelism in this DStream by creating more or fewer partitions.
union ( otherStream ) Return a new DStream that contains the union of the elements in the source DStream andotherDStream .
count () Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce ( func ) Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a functionfunc (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
countByValue () When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey ( func , [ numTasks ]) When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join ( otherStream , [ numTasks ]) When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup ( otherStream , [ numTasks ]) When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform ( func ) Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey ( func ) Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

输出操作:

Output Operation Meaning
print () Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.``Python API This is calledpprint() in the Python API.
saveAsTextFiles ( prefix , [ suffix ]) Save this DStream’s contents as text files. The file name at each batch interval is generated based onprefix and suffix : “prefix-TIME_IN_MS[.suffix]” .
saveAsObjectFiles ( prefix , [ suffix ]) Save this DStream’s contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix : “prefix-TIME_IN_MS[.suffix]” .``Python API This is not available in the Python API.
saveAsHadoopFiles ( prefix , [ suffix ]) Save this DStream’s contents as Hadoop files. The file name at each batch interval is generated based onprefix and suffix : “prefix-TIME_IN_MS[.suffix]” .``Python API This is not available in the Python API.
foreachRDD ( func ) The most generic output operator that applies a function,func , to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

我们之前的计算代码只是计算当前批次的数据,也是sparkstreaming默认的

基于上面官方提出了状态

状态

  • 有状态 前后批次有联系
  • 无状态 前后批次无联系

用于解决统计类问题

updateStateByKey ( func ):这个算子

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount = ... // add the new values with the previous running count to get the new count
Some(newCount)
}
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)

代码如下:

package sparkstreaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
import tool.streamingcontext
object sparkstreaming1 {
private val streamingcontext = new streamingcontext
def main(args: Array[String]): Unit = {

val ssc = streamingcontext.getstreamcotext()
// 或者通过sparkcontext进行创建
//val ssc = new StreamingContext(sc, Seconds(1))
// 数据源
val lines = ssc.socketTextStream("bigdata5", 9999)
// 处理数据
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// 要指定checkpoint目录
ssc.checkpoint("file:///D:\\checkpoint")
val totalwc = pairs.updateStateByKey(updateFunction _)
//wordCounts.updateStateByKey()
// 打印数据当前批次
wordCounts.print()
totalwc.print()
ssc.start() // Start
ssc.awaitTermination() // Wait

// 配置数据源在目标机器上执行nc -lk 9999 然后输入数据就ok了
}

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
// add the new values with the previous running count to get the new count
val sum = newValues.sum
val i = runningCount.getOrElse(0)
Some(sum+i)
}

}

但是这样也产生了个新问题

我们观察checkpoint文件夹

生成很多个小文件

我们该如何解决

生产上我们不用

但是必备的知识还是要的

为了容错,恢复作业,和kafka里的一样

checkpoint的存储东西

metadata 元数据

  • conf 作业里的配置信息
  • 算子操作
  • 未完成的批次

Data

  • 就是批次的数据

使用场景

  • 作业失败需要恢复的时候用
  • 转换算子的时候

但是注意生产上用不了

如何使用

Checkpointing can be enabled by setting a directory in a fault-tolerant, reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved. This is done by using streamingContext.checkpoint(checkpointDirectory). This will allow you to use the aforementioned stateful transformations. Additionally, if you want to make the application recover from driver failures, you should rewrite your streaming application to have the following behavior.

When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().
When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory.

idea代码

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()

缺点

小文件

修改代码之后checkpoint就作废了,要重新再来

checkpoint用不了 -》 累计批次指标就会出现问题

如何实现相同功能?

实现存储到外部,然后根据外部存储的数据进行累计

使用checkpoint

解决checkpoint修改代码报错和小文件问题

所以简历上不可以出现我在生产上用过updateStateByKey,坚决不会用

如何把处理好的数据存储到外部

如下:

dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}

idea

package sparkstreaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import tool.{mysqlutils, streamingcontext,savefile}
object sparkstreaming1 {
private val mysqlutils = new mysqlutils
private val streamingcontext = new streamingcontext
private val savefile = new savefile
def main(args: Array[String]): Unit = {


val ssc = streamingcontext.getstreamcotext()
// 或者通过sparkcontext进行创建
//val ssc = new StreamingContext(sc, Seconds(1))
// 数据源
val lines = ssc.socketTextStream("bigdata5", 9999)
// 处理数据
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
//val totalwc = pairs.updateStateByKey(updateFunction _)
//wordCounts.updateStateByKey()
// 打印数据当前批次
wordCounts.print()
//totalwc.print()
// 把结果输入到mysql里 先在mysql里创建完表了
// 下面会报错-> mysql链接没有进行序列化 ,我们不能加除非更改底层源码
// closure 闭包 -> 方法内使用了方法外的变量 比如下述的connect
wordCounts.foreachRDD(rdd=>{
val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
rdd.foreach { record =>
val sql = s"insert into wc values('${record._1}','${record._2}')"
connection.createStatement.execute(sql)
}
})
// --------------------------------------------------------
//对上述进行修改之后
//这样是可以的但是性能不高
//因为会一直拿链接,会造成性能下降
wordCounts.foreachRDD(rdd=>{
rdd.foreach { record =>
val sql = s"insert into wc values('${record._1}','${record._2}')"
val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
connection.createStatement.execute(sql)
}
})
//优化性能
wordCounts.foreachRDD(rdd=>{
rdd.foreachPartition(record=>{
val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
record.foreach(pari => {
val sql = s"insert into wc values('${pari._1}','${pari._2}')"
connection.createStatement.execute(sql)
})
mysqlutils.closeconnect(connection)
})
})
// 再次进行优化 原因 -》 partition的数量过高
// 通过连接池来进行
// 或者通过coalesce来控制这个分区数量
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
}
// Write via Spark SQL (the recommended approach)
// Performance is also good because it uses Spark's own JDBC writer
wordCounts.foreachRDD(rdd=>{
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
// Convert RDD[String] to DataFrame
val wordsDataFrame = rdd.toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)
})


ssc.start() // Start
ssc.awaitTermination() // Wait

// To feed the data source, run `nc -lk 9999` on the target host and type input lines
}

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
// add the new values with the previous running count to get the new count
val sum = newValues.sum
val i = runningCount.getOrElse(0)
Some(sum+i)
}

}
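A small sketch of the coalesce variant mentioned in the comments above, reusing wordCounts, the mysqlutils helper, and the wc table from the code above:

wordCounts.foreachRDD(rdd => {
  // Shrink the partition count first so each batch opens at most 2 connections
  rdd.coalesce(2).foreachPartition(part => {
    val connection = mysqlutils.getconnect("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
    part.foreach(pair => {
      val sql = s"insert into wc values('${pair._1}','${pair._2}')"
      connection.createStatement.execute(sql)
    })
    mysqlutils.closeconnect(connection)
  })
})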

transform

The operator for combining a DStream with an RDD.

Requirement:

  • one dataset comes from MySQL / a text file: small volume, a lookup ("dimension") table
  • the other dataset comes from Kafka, read by Spark Streaming as a DStream: large volume, the main business stream

Example: a bullet-comment (danmaku) filter

  • offline
  • real time

The data looks like this:

Main table:
不好看
垃圾
女主真好看
666
Comments to filter out:
热巴真丑
鸡儿真美
王退出娱乐圈

Offline:

package sparkstreaming
import org.apache.spark.sql.SparkSession
import tool._
object sparkstreaming2 {

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
def main(args: Array[String]): Unit = {
var mainsql = List(
"不好看",
"垃圾",
"女主真好看",
"666",
"热巴真丑",
"鸡儿真美",
"王退出娱乐圈")
val maintable = spark.sparkContext.parallelize(mainsql)

var black = List(
"热巴真丑",
"鸡儿真美",
"王退出娱乐圈"
)

val blacktable = spark.sparkContext.parallelize(black)

val value1 = maintable.map(x => {
(x, 1)
})
val value = blacktable.map(x => {
(x, true)
})
value1.leftOuterJoin(value).filter(_._2._2.getOrElse(false)!=true).map(_._1).foreach(println(_))
}

}

Real time:

private val streamingcontext = new streamingcontext
def main(args: Array[String]): Unit = {
val ssc = streamingcontext.getstreamcotext()
val maintable = ssc.socketTextStream("bigdata5", 9099)
var black = List(
"热巴真丑",
"鸡儿真美",
"王退出娱乐圈"
)

val blacktable = ssc.sparkContext.parallelize(black)

val value = blacktable.map(x => {
(x, true)
})

val value1 = maintable.map(x => {
(x, 1)
})
val value2 = value1.transform(x => {
x.leftOuterJoin(value).filter(_._2._2.getOrElse(false) != true).map(_._1)
})

value2.print()

ssc.start()
ssc.awaitTermination()

Spark Streaming + Kafka integration

Reading Kafka data (the spark-streaming-kafka-0-10 integration used below is the direct approach; the older receiver-based approach only exists in the 0-8 integration).

The Kafka version we use is 2.2.1.

Spark Streaming's default delivery semantics are at-least-once.

The DStream that Spark builds from Kafka has one partition per partition of the Kafka topic.

Partitions map one-to-one to tasks, so the topic's partition count determines the parallelism.

Official docs: the "Spark Streaming + Kafka Integration Guide".

Dependency in IDEA:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>3.2.1</version>
</dependency>

Usage:

------------------------------- consume data from Kafka
kafka-console-consumer.sh \
--bootstrap-server bigdata3:9092,bigdata4:9092,bigdata5:9092 \
--topic dl2262 \
--from-beginning
----------------------------- create the topic
kafka-topics.sh \
--create \
--zookeeper bigdata3:2181,bigdata4:2181,bigdata5:2181/kafka \
--topic dl2262 \
--partitions 6 \
--replication-factor 3
------------------------------- produce data to Kafka
kafka-console-producer.sh \
--broker-list bigdata3:9092,bigdata4:9092,bigdata5:9092 \
--topic dl2262
------------------------------- describe the topic
kafka-topics.sh \
--describe \
--zookeeper bigdata3:2181,bigdata4:2181,bigdata5:2181/kafka \
--topic dl2262
----------------------------- code
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object sparkstreaming2 {

private val streamingcontext = new streamingcontext

def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "bigdata3:9092,bigdata4:9092,bigdata5:9092", // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> "dl2262-1", // 指定消费者组
"auto.offset.reset" -> "latest", // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)

val example = streamingcontext.getstreamcotext()
val topics = Array("dl2262")
val stream = KafkaUtils.createDirectStream[String, String](
example,
PreferConsistent, // location strategy: spread partitions evenly across executors (one of three strategies)
Subscribe[String, String](topics, kafkaParams) // standard subscription form
)

stream.map(record => (record.value)).print()

example.start()
example.awaitTermination()
}
}

The above uses the newer Kafka consumer API.

Consume Kafka data, run a word count, and write the result to MySQL:

package sparkstreaming
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object sparkstreaming2 {

private val streamingcontext = new streamingcontext
private val savefile = new savefile
def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "bigdata3:9092,bigdata4:9092,bigdata5:9092", // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> "dl2262-1", // 指定消费者组
"auto.offset.reset" -> "latest", // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)


val example = streamingcontext.getstreamcotext()
val topics = Array("dl2262")
val stream = KafkaUtils.createDirectStream[String, String](
example,
PreferConsistent, // location strategy: spread partitions evenly across executors (one of three strategies)
Subscribe[String, String](topics, kafkaParams) // standard subscription form
)

stream.map(record => (record.value)).print()

val value = stream.map(record => (record.value)).flatMap(x => {
x.split(",")
}).map(word => {
(word, 1)
}).reduceByKey(_ + _)

value.foreachRDD(rdd=>{
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
val wordsDataFrame = rdd.toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)
})

example.start()
example.awaitTermination()
}}

To resume from where consumption stopped after a restart, the settings

  • enable.auto.commit
  • auto.offset.reset

are involved: auto.offset.reset only decides where to start when no committed offset exists, and with enable.auto.commit set to false the job will not resume from the last position unless we manage the offsets ourselves.

Solution:

  • obtain the Kafka offsets
  • commit the Kafka offsets

Getting the Kafka offset information:

stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(rdd.partitions.size)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
}

Interpreting the offset output: when the last two columns (fromOffset and untilOffset) are equal for every partition, everything currently in the topic has been consumed.

6
dl2262 5 19 19
dl2262 4 18 18
dl2262 0 19 19
dl2262 2 77 77
dl2262 1 19 19
dl2262 3 46 47
-------------------------------------------
Time: 1673939535000 ms
-------------------------------------------
bidhashdas

6
dl2262 4 18 18
dl2262 3 47 47
dl2262 0 19 19
dl2262 5 19 19
dl2262 2 77 77
dl2262 1 19 19

Note: read the offset ranges first thing inside foreachRDD, as soon as the batch arrives; that is how the offset information is obtained.

Any other processing of the data can then be done inside the same foreachRDD block.

Like this:

stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(rdd.partitions.size)
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
// wc
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._

val wordsDataFrame = rdd.map(_.value).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)

// store the offsets


// commit the offsets


}

Next we commit the offsets.

Before committing offsets, we need somewhere to store them.

Spark Streaming's default semantics are at-least-once.

Storing the offsets

  • checkpoints: unusable (for the reasons above)
  • Kafka itself: simple and efficient, giving at-least-once / at-most-once semantics; we never use at-most-once, and because this store is not transactional with our output it cannot give exactly-once
  • a storage system that supports transactions can be used to achieve exactly-once delivery semantics

Kafka itself stores the offset information in a special internal topic, __consumer_offsets.

stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
------------------------------------------- complete code
package sparkstreaming
import org.apache.spark.sql.SparkSession
import tool._
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object sparkstreaming2 {

private val streamingcontext = new streamingcontext
private val savefile = new savefile
def main(args: Array[String]): Unit = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "bigdata3:9092,bigdata4:9092,bigdata5:9092", // kafka地址
"key.deserializer" -> classOf[StringDeserializer], // 反序列化
"value.deserializer" -> classOf[StringDeserializer], // 反序列化
"group.id" -> "dl2262-1", // 指定消费者组
"auto.offset.reset" -> "latest", // 从什么地方开始消费
"enable.auto.commit" -> (false: java.lang.Boolean) // offset的提交 是不是自动提交
)


val example = streamingcontext.getstreamcotext()
val topics = Array("dl2262")
val stream = KafkaUtils.createDirectStream[String, String](
example,
PreferConsistent, // location strategy: spread partitions evenly across executors (one of three strategies)
Subscribe[String, String](topics, kafkaParams) // standard subscription form
)
// get the offset info
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
println(rdd.partitions.size)
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
// wc
val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._

val wordsDataFrame = rdd.map(_.value).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).toDF("word","cnt")
val srray:Array[String] = Array("append","jdbc:mysql://bigdata2:3306/bigdata","root","liuzihan010616","wc","word")
savefile.savetojdbc(spark,wordsDataFrame,srray)

// store and commit the offsets (here committed back to Kafka)
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)




}
example.start()
example.awaitTermination()
}
}

Other offset stores (a transactional database): the official sketch:

// The details depend on your data store, but the general idea looks like this

// begin from the offsets committed to the database
val fromOffsets = selectOffsetsFromYourDatabase.map { resultSet =>
new TopicPartition(resultSet.string("topic"), resultSet.int("partition")) -> resultSet.long("offset")
}.toMap

val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)

stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

val results = yourCalculation(rdd)

// begin your transaction

// update results
// update offsets where the end of existing offsets matches the beginning of this batch of offsets
// assert that offsets were updated correctly

// end your transaction
}
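A concrete (hypothetical) reading of the transaction skeleton above, assuming the wc table plus a kafka_offsets(topic, partition, offset) table in the same MySQL database; the per-batch result is collected to the driver so that a single JDBC transaction can cover both writes. This is a sketch of the idea, not production code:

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // small word-count result, collected to the driver for a single transaction
  val counts = rdd.map(_.value).flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).collect()

  val conn = java.sql.DriverManager.getConnection("jdbc:mysql://bigdata2:3306/bigdata", "root", "liuzihan010616")
  try {
    conn.setAutoCommit(false) // begin your transaction
    val insert = conn.prepareStatement("insert into wc(word, cnt) values (?, ?)")
    counts.foreach { case (w, c) =>
      insert.setString(1, w); insert.setInt(2, c); insert.executeUpdate()
    }
    // update offsets in the same transaction
    val upsert = conn.prepareStatement("replace into kafka_offsets(topic, `partition`, `offset`) values (?, ?, ?)")
    offsetRanges.foreach { o =>
      upsert.setString(1, o.topic); upsert.setInt(2, o.partition); upsert.setLong(3, o.untilOffset)
      upsert.executeUpdate()
    }
    conn.commit() // results and offsets succeed or fail together -> exactly-once
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}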

Some SSL configuration:

val kafkaParams = Map[String, Object](
// the usual params, make sure to change the port in bootstrap.servers if 9092 is not TLS
"security.protocol" -> "SSL",
"ssl.truststore.location" -> "/some-directory/kafka.client.truststore.jks",
"ssl.truststore.password" -> "test1234",
"ssl.keystore.location" -> "/some-directory/kafka.client.keystore.jks",
"ssl.keystore.password" -> "test1234",
"ssl.key.password" -> "test1234"
)

Case study: business data + log data

Business data, in MySQL:

  • city_info
  • user_info

Log data, in Hive (read as a text file from HDFS in the code below):

  • user_click

Approach: first load each source into a DataFrame in code, then operate on the DataFrames.

The code is as follows:

package sparkfirst

import org.apache.spark.sql.SparkSession

object xiangmu1 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
import spark.implicits._
val city_info = spark.read.format("jdbc")
.options(Map("url"->args(0),"dbtable"->args(3),"user"->args(1),"password"->args(2),"driver"->"com.mysql.jdbc.Driver")).load()

val user_info = spark.sql(
"""
|select * from bigdata.product_info
|""".stripMargin)

// city_info.show()
// user_info.show()

val product_info = spark.read.textFile("hdfs://bigdata3:9000/data/user_click.txt")
// product_info.show(false)

val userlog = product_info.map(line => {
val strings = line.split(",")
val userid = strings(0)
val sessionid = strings(1)
val dt = strings(2)
val cityid = strings(3)
val shopid = strings(4)
(userid, sessionid, dt, cityid, shopid)
}).toDF("userid", "sessionid", "dt", "cityid", "shopid")


// userlog.show(false)



//----------------------------------------------------------------
city_info.createOrReplaceTempView("city_info")
userlog.createOrReplaceTempView("user_log")
user_info.createOrReplaceTempView("product_info")
//--------------------------------------------------------------
spark.sql(
"""
|drop table if exists bigdata.tmp
|""".stripMargin)
spark.sql(
"""
|
|
|create table bigdata.tmp as
|select
|*
|from (
| select * from city_info left join user_log on city_info.city_id = user_log.cityid left join product_info on user_log.shopid = product_info.product_id
|)
|""".stripMargin)
spark.sql(
"""
|drop table if exists bigdata.sparkfinish
|""".stripMargin)
spark.sql(
"""
|create table bigdata.sparkfinish as
|select
|*
|from(
|select
|area,
|product_name,
|rank() over(partition by area order by cnt) as rk
|from (
|select
|area,
|product_name,
|count(1) as cnt
|from bigdata.tmp
|group by area,product_name
|)
|)where rk < 3;
|""".stripMargin)


}
}

Then we package the jar and upload it.

Before packaging, we need to comment out

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()

because when we submit with spark-submit, the master (and app name) are specified on the command line.

After packaging and uploading, we can submit with the spark-submit script.

There is a choice to make here: spark-submit jobs are usually deployed on YARN, and there are several deploy modes. A brief look at the two modes:

Cluster:

  • the client submits the job; once it is submitted the client can be closed without affecting the Spark job
  • the driver runs on a machine inside the cluster
  • logs are viewed on YARN

client:

  • the client submits the job; if the client is closed, the driver process dies and the Spark job is affected
  • the driver runs on the client machine
  • logs can be seen directly in the client console

The submit commands for each mode:

spark-submit \
--master yarn \
--deploy-mode client \
--name userlog \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info
-------------------------------------------------------- cluster (the command above is client mode)
spark-submit \
--master yarn \
--deploy-mode cluster \
--name userlog \
--executor-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info
-------------------------------------------------------------- the general form is below; the two commands above work without specifying the driver jar only because I had already copied the MySQL driver into Spark's jars directory
spark-submit \
--master yarn \
--deploy-mode client \
--name userlog \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-class-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-library-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info user_info
--------------------------------------------------------------------------------------------------
spark-submit \
--master yarn \
--deploy-mode cluster \
--name userlog \
--executor-memory 1g \
--num-executors 1 \
--executor-cores 1 \
--jars /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-class-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--driver-library-path /home/hadoop/software/mysql-connector-java-5.1.28.jar \
--class sparkfirst.xiangmu1 \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/bigdata" root liuzihan010616 city_info user_info

There are a couple of ways to add the JDBC driver:

  • pass it on the command line when submitting (--jars / --driver-class-path), as above
  • copy it directly into Spark's jars directory

The second way is the less recommended one, because the jar may conflict with Spark's own packages.

Execution flow

The execution flow of Spark on YARN is basically the same as Hadoop's; apart from Spark's persistence going through cache, everything else matches.

driver => manager

executor => container

catalog

The Hive metastore data lives in MySQL.

Spark reads the Hive metastore (the metastore database sits behind JDBC).

Spark provides a Catalog API.

Calling the catalog directly gives you Hive's metadata, which is handy, for example, when building a big-data analysis platform.

Get the catalog via sparkSession.catalog.

It exposes many methods; in IDEA you can browse them with Ctrl + F12.
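A minimal sketch of the Catalog API, assuming a Hive-enabled SparkSession and the bigdata database used elsewhere in these notes:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalog-demo")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

// Browse the Hive metastore through spark.catalog
spark.catalog.listDatabases().show(false)
spark.catalog.listTables("bigdata").show(false)
spark.catalog.listColumns("bigdata", "tmp").show(false)

// Cache management also goes through the catalog
spark.catalog.cacheTable("bigdata.tmp")
println(spark.catalog.isCached("bigdata.tmp"))
spark.catalog.uncacheTable("bigdata.tmp")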

Cold data may be kept on COS or OSS (object storage).

udf

  • define a UDF in code (the block below)
  • a Hive UDF can be used directly in Spark (see the sketch after the block below)
Defining a UDF in IDEA
---------------------------------------
First import the udf helper:
import org.apache.spark.sql.functions.udf
then:


val spark = SparkSession
.builder()
.appName("Spark SQL UDF scalar example")
.getOrCreate()

// Define and register a zero-argument non-deterministic UDF
// UDF is deterministic by default, i.e. produces the same result for the same input.
val random = udf(() => Math.random())
spark.udf.register("random", random.asNondeterministic())
spark.sql("SELECT random()").show()
// +-------+
// |UDF() |
// +-------+
// |xxxxxxx|
// +-------+

// Define and register a one-argument UDF
val plusOne = udf((x: Int) => x + 1)
spark.udf.register("plusOne", plusOne)
spark.sql("SELECT plusOne(5)").show()
// +------+
// |UDF(5)|
// +------+
// | 6|
// +------+

// Define a two-argument UDF and register it with Spark in one step
spark.udf.register("strLenScala", (_: String).length + (_: Int))
spark.sql("SELECT strLenScala('test', 1)").show()
// +--------------------+
// |strLenScala(test, 1)|
// +--------------------+
// | 5|
// +--------------------+

// UDF in a WHERE clause
spark.udf.register("oneArgFilter", (n: Int) => { n > 5 })
spark.range(1, 10).createOrReplaceTempView("test")
spark.sql("SELECT * FROM test WHERE oneArgFilter(id)").show()
// +---+
// | id|
// +---+
// | 6|
// | 7|
// | 8|
// | 9|
// +---+
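A minimal sketch of the second route (calling a Hive UDF from Spark); the class name and jar path here are hypothetical, and Hive support must be enabled on the session:

val spark = SparkSession.builder()
  .appName("hive-udf-demo")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

// Register the Hive UDF exactly as you would in Hive itself
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.UpperCase' " +
  "USING JAR '/home/hadoop/project/jar/hive-udfs.jar'")

// Then call it like any built-in function
spark.sql("SELECT my_upper(product_name) FROM bigdata.tmp").show()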

Building a DataFrame

  • from an RDD (a short example follows this list)
  • from Hive
  • from external data sources
    • json, csv, jdbc/odbc
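A minimal sketch of the first route (RDD -> DataFrame), using toDF the same way as elsewhere in these notes:

import org.apache.spark.sql.SparkSession

object RddToDf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDf").master("local[2]").getOrCreate()
    import spark.implicits._

    // parallelize a local collection into an RDD, then name the columns
    val rdd = spark.sparkContext.parallelize(Seq(("spark", 1), ("kafka", 2)))
    val df = rdd.toDF("word", "cnt")
    df.printSchema()
    df.show()
  }
}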

Loading external data sources

API overview

TEXT

Property Name Default Meaning Scope
wholetext false If true, read each file from input path(s) as a single row. read
lineSep \r, \r\n, \n (for reading), \n (for writing) Defines the line separator that should be used for reading or writing. read/write
compression (none) Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). write

json

Property Name Default Meaning Scope
timeZone (value of spark.sql.session.timeZone configuration) Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of timeZone are supported:``* Region-based zone ID:
It should have the form ‘area/city’,
such as ‘America/Los_Angeles’.* Zone offset: It should be in the format ‘(+
-)HH:mm’, for example ‘-08:00’ or ‘+01:00’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’.Other short names like ‘CST’ are not recommended to use because they can be ambiguous.
primitivesAsString false Infers all primitive values as a string type. read
prefersDecimal false Infers all floating-point values as a decimal type.
If the values do not fit in decimal, then it infers them as doubles.
read
allowComments false Ignores Java/C++ style comment in JSON records. read
allowUnquotedFieldNames false Allows unquoted JSON field names. read
allowSingleQuotes true Allows single quotes in addition to double quotes. read
allowNumericLeadingZero false Allows leading zeros in numbers (e.g. 00012). read
allowBackslashEscapingAnyCharacter false Allows accepting quoting of all character using backslash quoting mechanism
.
read
mode PERMISSIVE Allows a mode for dealing with corrupt records during parsing.``PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema. DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions.* FAILFAST: throws an exception when it meets corrupted records. read
columnNameOfCorruptRecord (value of spark.sql.columnNameOfCorruptRecord configuration) Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. read
dateFormat yyyy-MM-dd Sets the string that indicates a date format. Custom date formats follow the formats atdatetime pattern. This applies to date type. read/write
timestampFormat yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX] Sets the string that indicates a timestamp format. Custom date formats follow the formats atdatetime pattern. This applies to timestamp type. read/write
timestampNTZFormat yyyy-MM-dd’T’HH:mm:ss[.SSS] Sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats atDatetime Patterns. This applies to timestamp without timezone type, note that zone-offset and time-zone components are not supported when writing or reading this data type. read/write
multiLine false Parse one record, which may span multiple lines, per file. JSON built-in functions ignore this option. read
allowUnquotedControlChars false Allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. read
encoding Detected automatically when multiLine is set to true (for reading), UTF-8 (for writing) For reading, allows to forcibly set one of standard basic or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. For writing, Specifies encoding (charset) of saved json files. JSON built-in functions ignore this option. read/write
lineSep \r, \r\n, \n (for reading), \n (for writing) Defines the line separator that should be used for parsing. JSON built-in functions ignore this option. read/write
samplingRatio 1.0 Defines fraction of input JSON objects used for schema inferring. read
dropFieldIfAllNull false Whether to ignore column of all null values or empty array/struct during schema inference. read
locale en-US Sets a locale as language tag in IETF BCP 47 format. For instance,locale is used while parsing dates and timestamps. read
allowNonNumericNumbers true Allows JSON parser to recognize set of “Not-a-Number” (NaN) tokens as legal floating number values.``+INF:
for positive infinity, as well as alias of +Infinity and Infinity.
-INF: for negative infinity, alias -Infinity.* NaN: for other not-a-numbers, like result of division by zero.
read
compression (none) Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names
(none, bzip2, gzip, lz4, snappy and deflate). JSON built-in functions ignore this option.
write
ignoreNullFields (value of spark.sql.jsonGenerator.ignoreNullFields configuration) Whether to ignore null fields when generating JSON objects. write

csv

Property Name Default Meaning Scope
sep , Sets a separator for each field and value. This separator can be one or more characters. read/write
encoding UTF-8 For reading, decodes the CSV files by the given encoding type. For writing, specifies encoding (charset) of saved CSV files. CSV built-in functions ignore this option. read/write
quote Sets a single character used for escaping quoted values where the separator can be part of the value. For reading, if you would like to turn off quotations, you need to set not null but an empty string. For writing, if an empty string is set, it uses u0000 (null character). read/write
quoteAll false A flag indicating whether all values should always be enclosed in quotes. Default is to only escape values containing a quote character. write
escape \ Sets a single character used for escaping quotes inside an already quoted value. read/write
escapeQuotes true A flag indicating whether values containing quotes should always be enclosed in quotes. Default is to escape all values containing a quote character. write
comment Sets a single character used for skipping lines beginning with this character. By default, it is disabled. read
header false For reading, uses the first line as names of columns. For writing, writes the names of columns as the first line. Note that if the given path is a RDD of Strings, this header option will remove all lines same with the header if exists. CSV built-in functions ignore this option. read/write
inferSchema false Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option. read
enforceSchema true If it is set to true, the specified or inferred schema will be forcibly applied to datasource files, and headers in CSV files will be ignored. If the option is set to false, the schema will be validated against all headers in CSV files in the case when the header option is set to true. Field names in the schema and column names in CSV headers are checked by their positions taking into account spark.sql.caseSensitive. Though the default value is true, it is recommended to disable the enforceSchema option to avoid incorrect results. CSV built-in functions ignore this option. read
ignoreLeadingWhiteSpace false (for reading), true (for writing) A flag indicating whether or not leading whitespaces from values being read/written should be skipped. read/write
ignoreTrailingWhiteSpace false (for reading), true (for writing) A flag indicating whether or not trailing whitespaces from values being read/written should be skipped. read/write
nullValue Sets the string representation of a null value. Since 2.0.1, this nullValue param applies to all supported types including the string type. read/write
nanValue NaN Sets the string representation of a non-number value. read
positiveInf Inf Sets the string representation of a positive infinity value. read
negativeInf -Inf Sets the string representation of a negative infinity value. read
dateFormat yyyy-MM-dd Sets the string that indicates a date format. Custom date formats follow the formats atDatetime Patterns. This applies to date type. read/write
timestampFormat yyyy-MM-dd’T’HH:mm:ss[.SSS][XXX] Sets the string that indicates a timestamp format. Custom date formats follow the formats atDatetime Patterns. This applies to timestamp type. read/write
timestampNTZFormat yyyy-MM-dd’T’HH:mm:ss[.SSS] Sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats atDatetime Patterns. This applies to timestamp without timezone type, note that zone-offset and time-zone components are not supported when writing or reading this data type. read/write
maxColumns 20480 Defines a hard limit of how many columns a record can have. read
maxCharsPerColumn -1 Defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length read
mode PERMISSIVE Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes. Note that Spark tries to parse only required columns in CSV under column pruning. Therefore, corrupt records can be different based on required set of fields. This behavior can be controlled by spark.sql.csv.parser.columnPruning.enabled (enabled by default).``* PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. A record with less/more tokens than schema is not a corrupted record to CSV. When it meets a record having fewer tokens than the length of the schema, sets null to extra fields. When the record has more tokens than the length of the schema, it drops extra tokens.* DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the CSV built-in functions.* FAILFAST: throws an exception when it meets corrupted records. read
columnNameOfCorruptRecord (value of spark.sql.columnNameOfCorruptRecord configuration) Allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. read
multiLine false Parse one record, which may span multiple lines, per file. CSV built-in functions ignore this option. read
charToEscapeQuoteEscaping escape or \0 Sets a single character used for escaping the escape for the quote character. The default value is escape character when escape and quote characters are different,\0 otherwise. read/write
samplingRatio 1.0 Defines fraction of rows used for schema inferring. CSV built-in functions ignore this option. read
emptyValue (for reading),"" (for writing) Sets the string representation of an empty value. read/write
locale en-US Sets a locale as language tag in IETF BCP 47 format. For instance, this is used while parsing dates and timestamps. read
lineSep \r, \r\n and \n (for reading), \n (for writing) Defines the line separator that should be used for parsing/writing. Maximum length is 1 character. CSV built-in functions ignore this option. read/write
unescapedQuoteHandling STOP_AT_DELIMITER Defines how the CsvParser will handle values with unescaped quotes.``STOP_AT_CLOSING_QUOTE: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found. BACK_TO_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found.* STOP_AT_DELIMITER: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter or a line ending is found in the input.* SKIP_VALUE: If unescaped quotes are found in the input, the content parsed for the given value will be skipped and the value set in nullValue will be produced instead.* RAISE_ERROR: If unescaped quotes are found in the input, a TextParsingException will be thrown. read
compression (none) Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate). CSV built-in functions ignore this option. write

jdbc

Property Name Default Meaning Scope
url (none) The JDBC URL of the form jdbc:subprotocol:subname to connect to. The source-specific connection properties may be specified in the URL. e.g., jdbc:postgresql://localhost/test?user=fred&password=secret read/write
dbtable (none) The JDBC table that should be read from or written into. Note that when using it in the read path anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses. It is not allowed to specify dbtable and query options at the same time. read/write
query (none) A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. As an example, spark will issue a query of the following form to the JDBC Source.SELECT <columns> FROM (<user_specified_query>) spark_gen_aliasBelow are a couple of restrictions while using this option.1. It is not allowed to specify `dbtable` and `query` options at the same time.1. It is not allowed to specify `query` and `partitionColumn` options at the same time. When specifying `partitionColumn` option is required, the subquery can be specified using `dbtable` option instead and partition columns can be qualified using the subquery alias provided as part of `dbtable`.Example:spark.read.format("jdbc").option("url", jdbcUrl).option("query", "select c1, c2 from t1").load() read/write
driver (none) The class name of the JDBC driver to use to connect to this URL. read/write
partitionColumn, lowerBound, upperBound (none) These options must all be specified if any of them is specified. In addition,numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading. read
numPartitions (none) The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. read/write
queryTimeout 0 The number of seconds the driver will wait for a Statement object to execute to the given number of seconds. Zero means there is no limit. In the write path, this option depends on how JDBC drivers implement the API setQueryTimeout, e.g., the h2 JDBC driver checks the timeout of each query instead of an entire JDBC batch. read/write
fetchsize 0 The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (e.g. Oracle with 10 rows). read
batchsize 1000 The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. write
isolationLevel READ_UNCOMMITTED The transaction isolation level, which applies to current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to standard transaction isolation levels defined by JDBC’s Connection object, with default of READ_UNCOMMITTED. Please refer the documentation in java.sql.Connection. write
sessionInitStatement (none) After each database session is opened to the remote DB and before starting to read data, this option executes a custom SQL statement (or a PL/SQL block). Use this to implement session initialization code. Example:option("sessionInitStatement", """BEGIN execute immediate 'alter session set "_serial_direct_read"=true'; END;""") read
truncate false This is a JDBC writer related option. When SaveMode.Overwrite is enabled, this option causes Spark to truncate an existing table instead of dropping and recreating it. This can be more efficient, and prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. In case of failures, users should turn off truncate option to use DROP TABLE again. Also, due to the different behavior of TRUNCATE TABLE among DBMS, it’s not always safe to use this. MySQLDialect, DB2Dialect, MsSqlServerDialect, DerbyDialect, and OracleDialect supports this while PostgresDialect and default JDBCDirect doesn’t. For unknown and unsupported JDBCDirect, the user option truncate is ignored. write
cascadeTruncate the default cascading truncate behaviour of the JDBC database in question, specified in the isCascadeTruncate in each JDBCDialect This is a JDBC writer related option. If enabled and supported by the JDBC database (PostgreSQL and Oracle at the moment), this options allows execution of a TRUNCATE TABLE t CASCADE (in the case of PostgreSQL a TRUNCATE TABLE ONLY t CASCADE is executed to prevent inadvertently truncating descendant tables). This will affect other tables, and thus should be used with care. write
createTableOptions This is a JDBC writer related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g.,CREATE TABLE t (name string) ENGINE=InnoDB.). write
createTableColumnTypes (none) The database column data types to use instead of the defaults, when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g:"name CHAR(64), comments VARCHAR(1024)"). The specified types should be valid spark sql data types. write
customSchema (none) The custom schema to use for reading data from JDBC connectors. For example,"id DECIMAL(38, 0), name STRING". You can also specify partial fields, and the others use the default type mapping. For example, "id DECIMAL(38, 0)". The column names should be identical to the corresponding column names of JDBC table. Users can specify the corresponding data types of Spark SQL instead of using the defaults. read
pushDownPredicate true The option to enable or disable predicate push-down into the JDBC data source. The default value is true, in which case Spark will push down filters to the JDBC data source as much as possible. Otherwise, if set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark. Predicate push-down is usually turned off when the predicate filtering is performed faster by Spark than by the JDBC data source. read
pushDownAggregate false The option to enable or disable aggregate push-down in V2 JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the related filters can be pushed down. If numPartitions equals to 1 or the group by key is the same as partitionColumn, Spark will push down aggregate to data source completely and not apply a final aggregate over the data source output. Otherwise, Spark will apply a final aggregate over the data source output. read
pushDownLimit false The option to enable or disable LIMIT push-down into V2 JDBC data source. The LIMIT push-down also includes LIMIT + SORT , a.k.a. the Top N operator. The default value is false, in which case Spark does not push down LIMIT or LIMIT with SORT to the JDBC data source. Otherwise, if sets to true, LIMIT or LIMIT with SORT is pushed down to the JDBC data source. If numPartitions is greater than 1, SPARK still applies LIMIT or LIMIT with SORT on the result from data source even if LIMIT or LIMIT with SORT is pushed down. Otherwise, if LIMIT or LIMIT with SORT is pushed down and numPartitions equals to 1, SPARK will not apply LIMIT or LIMIT with SORT on the result from data source. read
pushDownTableSample false The option to enable or disable TABLESAMPLE push-down into V2 JDBC data source. The default value is false, in which case Spark does not push down TABLESAMPLE to the JDBC data source. Otherwise, if value sets to true, TABLESAMPLE is pushed down to the JDBC data source. read
keytab (none) Location of the kerberos keytab file (which must be pre-uploaded to all nodes either by --files option of spark-submit or manually) for the JDBC client. When path information found then Spark considers the keytab distributed manually, otherwise --files assumed. If both keytab and principal are defined then Spark tries to do kerberos authentication. read/write
principal (none) Specifies kerberos principal name for the JDBC client. If both keytab and principal are defined then Spark tries to do kerberos authentication. read/write
refreshKrb5Config false This option controls whether the kerberos configuration is to be refreshed or not for the JDBC client before establishing a new connection. Set to true if you want to refresh the configuration, otherwise set to false. The default value is false. Note that if you set this option to true and try to establish multiple connections, a race condition can occur. One possble situation would be like as follows.1. refreshKrb5Config flag is set with security context 11. A JDBC connection provider is used for the corresponding DBMS1. The krb5.conf is modified but the JVM not yet realized that it must be reloaded1. Spark authenticates successfully for security context 11. The JVM loads security context 2 from the modified krb5.conf1. Spark restores the previously saved security context 11. The modified krb5.conf content just gone read/write
connectionProvider (none) The name of the JDBC connection provider to use to connect to this URL, e.g.db2, mssql. Must be one of the providers loaded with the JDBC data source. Used to disambiguate when more than one provider can handle the specified driver and options. The selected provider must not be disabled by spark.sql.sources.disabledJdbcConnProviderList. read/write
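The partitionColumn / lowerBound / upperBound / numPartitions options above are easiest to see in code. A minimal sketch, assuming a SparkSession named spark and the emp table on the MySQL instance used later in these notes; note that the bounds only set the partition stride, they do not filter rows:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://bigdata2:3306/try")
  .option("dbtable", "emp")
  .option("user", "root")
  .option("password", "liuzihan010616")
  // split the read into 4 parallel JDBC partitions on the numeric empno column
  .option("partitionColumn", "empno")
  .option("lowerBound", "7369")
  .option("upperBound", "7935")
  .option("numPartitions", "4")
  .load()

println(df.rdd.getNumPartitions) // 4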

excel

Not found yet; I will add it once I find it.

hive

For Hive we need to specify the input format, the output format, the serde, and the delimiters between fields and elements, as follows:

Property Name Meaning
fileFormat A fileFormat is kind of a package of storage format specifications, including “serde”, “input format” and “output format”. Currently we support 6 fileFormats: ‘sequencefile’, ‘rcfile’, ‘orc’, ‘parquet’, ‘textfile’ and ‘avro’.
inputFormat, outputFormat These 2 options specify the name of a corresponding InputFormat and OutputFormat class as a string literal, e.g. org.apache.hadoop.hive.ql.io.orc.OrcInputFormat. These 2 options must be appeared in a pair, and you can not specify them if you already specified the fileFormat option.
serde This option specifies the name of a serde class. When the fileFormat option is specified, do not specify this option if the given fileFormat already include the information of serde. Currently “sequencefile”, “textfile” and “rcfile” don’t include the serde information and you can use this option with these 3 fileFormats.
fieldDelim, escapeDelim, collectionDelim, mapkeyDelim, lineDelim These options can only be used with “textfile” fileFormat. They define how to read delimited files into rows.
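A small example of how these options are used when creating a Hive table through Spark SQL (a sketch; the table name is hypothetical and Hive support is assumed):

spark.sql(
  """
    |CREATE TABLE IF NOT EXISTS bigdata.hive_src (key INT, value STRING)
    |USING hive
    |OPTIONS (fileFormat 'textfile', fieldDelim ',')
    |""".stripMargin)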

Configuration for connecting different Spark versions to the Hive metastore:

One of the most important pieces of Spark SQL’s Hive support is interaction with Hive metastore, which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below. Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL will compile against built-in Hive and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
Property Name Default Meaning Since Version
spark.sql.hive.metastore.version 2.3.9 Version of the Hive metastore. Available options are 0.12.0 through 2.3.9 and 3.0.0 through 3.1.2. 1.40
spark.sql.hive.metastore.jars builtin Location of the jars that should be used to instantiate the HiveMetastoreClient.
This property can be one of four options:
1.builtin
Use Hive 2.3.9, which is bundled with the Spark assembly when -Phive is enabled. When this option is chosen, spark.sql.hive.metastore.version must be either 2.3.9 or not defined.
maven
Use Hive jars of specified version downloaded from Maven repositories. This configuration is not generally recommended for production deployments.
path
Use Hive jars configured by spark.sql.hive.metastore.jars.path in comma separated format. Support both local or remote paths. The provided jars should be the same version as spark.sql.hive.metastore.version.
A classpath in the standard format for the JVM. This classpath must include all of Hive and its dependencies, including the correct version of Hadoop. The provided jars should be the same version as spark.sql.hive.metastore.version. These jars only need to be present on the driver, but if you are running in yarn cluster mode then you must ensure they are packaged with your application.
1.40
spark.sql.hive.metastore.jars.path (empty) Comma-separated paths of the jars that used to instantiate the HiveMetastoreClient. This configuration is useful only when spark.sql.hive.metastore.jars is set as path.``The paths can be any of the following format:1. file://path/to/jar/foo.jar1. hdfs://nameservice/path/to/jar/foo.jar1. /path/to/jar/(path without URI scheme follow conf fs.defaultFS‘s URI schema)1. [http/https/ftp]://path/to/jar/foo.jarNote that 1, 2, and 3 support wildcard. For example:1. file://path/to/jar/*,file://path2/to/jar/*/*.jar1. hdfs://nameservice/path/to/jar/*,hdfs://nameservice2/path/to/jar/*/*.jar 3.10
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,
org.postgresql,
com.microsoft.sqlserver,
oracle.jdbc
A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared. For example, custom appenders that are used by log4j. 1.40
spark.sql.hive.metastore.barrierPrefixes (empty) A comma separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e.org.apache.spark.*). 1.40

Reading data

TEXT

Official description:

Spark SQL provides spark.read().text("file_name") to read a file or directory of text files into a Spark DataFrame, and dataframe.write().text("path") to write to a text file. When reading a text file, each line becomes each row that has string "value" column by default. The line separator can be changed as shown in the example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator, compression, and so on.

When reading we do not need to set the compression codec; like MapReduce, it decompresses automatically.

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df = spark.read.text("file:///D:\\test.txt") // 返回值是DF
df.show()
df.printSchema()


var result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
|,w,wq,e,w,ewq,we,...|
+--------------------+
"""

//The schema text() attaches (a single string `value` column) is added automatically, which is sometimes inconvenient
val df1 = spark.read.textFile("file:///D:\\test.txt") // 返回值是dataset
df1.printSchema()
//--------------------------------------------------------------------
//Use lineSep to change the line separator, as below
val df2 = spark.read.option("lineSep",",").text("file:///D:\\test.txt")
df2.show()
result =
"""
+---------+
| value|
+---------+
| as|
| s|
| ed|
| f|
| |
| |
| qq|
|eqedqwe\n|
| w|
| wq|
| e|
| w|
| ewq|
| we|
| q|
| e|
|wewqeqwel|
| qe|
| lqeweqwl|
| qw\n|
+---------+
"""
//---------------------------------------------------------------------
//Use wholetext to read the whole file as a single row
val df3 = spark.read.option("wholetext",true).text("file:///D:\\test.txt")
df3.show()
result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
+--------------------+
"""
//text and textFile are interchangeable here
val df4 = spark.read.option("wholetext",true).textFile("file:///D:\\test.txt")
df4.show()
result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
+--------------------+
"""
val df5 = spark.read.option("wholetext",true).format("text").load("file:///D:\\test.txt")
df5.show()
result =
"""
+--------------------+
| value|
+--------------------+
|as,s,ed,f,,,qq,eq...|
+--------------------+
"""

}
}

Stepping into the source, we find that text() is implemented as

def text(paths: String*): DataFrame = format("text").load(paths : _*)

so we can equivalently write

val df5=spark.read.option("wholetext",true).format("text").load("file:///D:\\test.txt")

json

Overview:

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String], or a JSON file.

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

JSON comes in two forms: flat JSON and nested JSON.

import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()

//------------------------- flat JSON
val df = spark.read.json("file:///C:\\Users\\dell\\Desktop\\dept.json")
df.show()
df.printSchema()
//-------------------------- nested JSON: for a nested struct use dot notation; for an array type, explode it first, then use dot notation
var df1 = spark.read.format("json").load("file:///C:\\Users\\dell\\Desktop\\Skills.json")
df1.printSchema()
//-------------------------- DataFrame API
//-withColumn adds a column (or overwrites one) => use it to pull a nested field up to the top level
df1=df1.withColumn("critical",col("damage.critical"))
df1=df1.withColumn("elementId",explode(col("damage.elementId")))
df1.printSchema()
//------------------------ drop the original nested fields
df1=df1.drop("damage.critical","damage.elementId")
//------------------------- SQL
//------------------------ compare with Hive SQL
df1.createOrReplaceTempView("test")
//spark.sql("SELECT get_json_object('{\"a\":\"b\"}', '$.a');").show()
// a struct can be accessed with dot notation, as below
spark.sql(
"""
|select
|effects.ddd,
|damage.ddddds
|from
|test
|""".stripMargin).show()
//or use explode with a lateral view for array elements in nested JSON
spark.sql(
"""
|select
|effects.ddd,
|damage.ddddds
|from
|test
|lateral view explode(store.fruit) as fruit
|""".stripMargin)




csv

Overview:

Spark SQL provides spark.read().csv("file_name") to read a file or directory of files in CSV format into Spark DataFrame, and dataframe.write().csv("path") to write to a CSV file. Function option() can be used to customize the behavior of reading or writing, such as controlling behavior of the header, delimiter character, character set, and so on.

The default CSV field separator is a comma, but it can be changed.

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()

val df = spark.read.format("csv").load("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df.show()
var result=
"""
|+---+---------+------+----+----------+---+--------------------+------------+----------+
||_c0| _c1| _c2| _c3| _c4|_c5| _c6| _c7| _c8|
|+---+---------+------+----+----------+---+--------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_withi...|question_cnt|answer_cnt|
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学| 4| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+--------------------+------------+----------+
|""".stripMargin
// By default the file is split on commas
//The delimiter option changes the separator; sep does the same thing
val df1 = spark.read.option("delimiter",";").csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df1.show()
result =
"""
|+------------------------+
|| _c0|
|+------------------------+
|| "id","device_id",...|
|| 1,2138,male,21,北京...|
||2,3214,male,,复旦大学...|
|| 3,6543,female,20,...|
|| 4,2315,female,23,...|
|| 5,5432,male,25,山东...|
|| 6,2131,male,28,山东...|
|| 7,4321,male,28,复旦...|
|+------------------------+
|""".stripMargin
val df4 = spark.read.option("sep",";").csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df4.show()
result =
"""
|+------------------------+
|| _c0|
|+------------------------+
|| "id","device_id",...|
|| 1,2138,male,21,北京...|
||2,3214,male,,复旦大学...|
|| 3,6543,female,20,...|
|| 4,2315,female,23,...|
|| 5,5432,male,25,山东...|
|| 6,2131,male,28,山东...|
|| 7,4321,male,28,复旦...|
|+------------------------+
|""".stripMargin
//The header row can also be loaded from the CSV
val df2 = spark.read.option("delimiter",",").option("header","true").format("csv").load("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df2.show()
result=
"""
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_within_30|question_cnt|answer_cnt|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学| 4| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|""".stripMargin
//The two option calls above can also be merged into one options(Map(...)) call
val df3 = spark.read.options(Map("delimiter" -> "," ,"header" -> "true")).csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df3.show()
result=
"""
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_within_30|question_cnt|answer_cnt|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学| 4| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|""".stripMargin
//inferSchema adds automatic type inference; without it every column defaults to string
val df5 = spark.read.options(Map("sep"->",","header"->"true","inferSchema"->"true","encoding"->"UTF8")).csv("file:///C:\\Users\\dell\\Desktop\\user_profile.csv")
df5.show()
result=
"""
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| id|device_id|gender| age|university|gpa|active_days_within_30|question_cnt|answer_cnt|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|| 1| 2138| male| 21| 北京大学|3.4| 7| 2| 12|
|| 2| 3214| male|null| 复旦大学|4.0| 15| 5| 25|
|| 3| 6543|female| 20| 北京大学|3.2| 12| 3| 30|
|| 4| 2315|female| 23| 浙江大学|3.6| 5| 1| 2|
|| 5| 5432| male| 25| 山东大学|3.8| 20| 15| 70|
|| 6| 2131| male| 28| 山东大学|3.3| 15| 7| 13|
|| 7| 4321| male| 28| 复旦大学|3.6| 9| 6| 52|
|+---+---------+------+----+----------+---+---------------------+------------+----------+
|""".stripMargin
df5.printSchema()
result=
"""
|root
| |-- id: integer (nullable = true)
| |-- device_id: integer (nullable = true)
| |-- gender: string (nullable = true)
| |-- age: integer (nullable = true)
| |-- university: string (nullable = true)
| |-- gpa: double (nullable = true)
| |-- active_days_within_30: integer (nullable = true)
| |-- question_cnt: integer (nullable = true)
| |-- answer_cnt: integer (nullable = true)
|""".stripMargin
// and so on -- see the API docs for the remaining options

df5.createOrReplaceTempView("csv")
spark.sql(
"""
|select
|gender,
|device_id,
|active_days_within_30,
|university
|from
|csv
|where university = '北京大学'
|""".stripMargin).show()

result=
"""
|+------+---------+---------------------+----------+
||gender|device_id|active_days_within_30|university|
|+------+---------+---------------------+----------+
|| male| 2138| 7| 北京大学|
||female| 6543| 12| 北京大学|
|+------+---------+---------------------+----------+
|""".stripMargin

jdbc

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()

// build the reader with explicit options in code
val df = spark.read.format("JDBC")
.option("url","jdbc:mysql://bigdata2:3306/try")
.option("dbtable", "emp")
.option("user", "root")
.option("password", "liuzihan010616")
.load()
df.printSchema()
df.show()
// loading it like this pulls the whole table in, but sometimes we only need part of it; passing a subquery as dbtable works like predicate pushdown
val sal =
"""
|select
|*
|from
|emp where sal > 1500
|""".stripMargin
val df1 = spark.read.format("JDBC")
.option("url","jdbc:mysql://bigdata2:3306/try")
.option("dbtable", s"($sal) as tmp")
.option("user", "root")
.option("password", "liuzihan010616")
.load()
df1.show()
result =
"""
|+-----+------+---------+----+-------------------+-------+-------+------+
||empno| ename| job| mgr| hiredate| sal| comm|deptno|
|+-----+------+---------+----+-------------------+-------+-------+------+
|| 7499| ALLEN| SALESMAN|7698|1981-02-20 00:00:00|1600.00| 300.00| 30|
|| 7566| JONES| MANAGER|7839|1981-04-02 00:00:00|2975.00| null| 20|
|| 7698| BLAKE| MANAGER|7839|1981-05-01 00:00:00|2850.00| null| 30|
|| 7782| CLARK| MANAGER|7839|1981-06-09 00:00:00|2450.00| null| 10|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7839| KING|PRESIDENT|null|1981-11-17 00:00:00|5000.00| null| 10|
|| 7902| FORD| ANALYST|7566|1981-12-03 00:00:00|3000.00| null| 20|
|| 7839| KING|PRESIDENT|null|1981-11-17 00:00:00|5000.00| null| 10|
|| 7654|MARTIN| SALESMAN|7698|1981-09-28 00:00:00|3200.00|1400.00| 30|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|| 7788| SCOTT| ANALYST|7566|1982-12-09 00:00:00|3000.00| null| 20|
|+-----+------+---------+----+-------------------+-------+-------+------+
|only showing top 20 rows
|
|
|Process finished with exit code 0
|
|""".stripMargin
// or pass the credentials via a Properties object
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "liuzihan010616")
val jdbcDF2 = spark.read
.jdbc("jdbc:mysql://bigdata2:3306/try", "try.emp", connectionProperties)

excel

In IDEA you first need to add the spark-excel dependency to the pom; its version has to match your Scala version:

<dependency>
<groupId>com.crealytics</groupId>
<artifactId>spark-excel_2.12</artifactId>
<version>0.14.0</version>
</dependency>

Then the code:

package sparkfirst

import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import com.crealytics.spark.excel._
object sparksql2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df = spark.read.excel(header = true,inferSchema = true).load("file:////C:\\Users\\dell\\Desktop\\2023届毕业设计题目-计算机-选题志愿表.xlsx")
df.show()
val result =
"""
|+--------------------------------------+--------+-------------------------------------+------+--------+--------+--------+
||2023届计算机科学与技术专业毕业设计选题| _c1| _c2| _c3| _c4| _c5| _c6|
|+--------------------------------------+--------+-------------------------------------+------+--------+--------+--------+
|| 序号|指导老师| 题目|学生数|第一志愿|第二志愿|第三志愿|
|| 1| 王海涛| 基于android的房产中介app...| 1| null| null| null|
|| 2| 王海涛| 基于android的酒店预约入住a...| 1| null| null| null|
|| 3| 王海涛| 基于android的有声书app的...| 1| null| null| null|
|| 4| 王海涛| 基于android的掌上医院app...| 1| null| null| null|
|| 5| 王海涛| 基于web的考试管理系统的设计与实现| 1| null| null| null|
|| 6| 王琢| 电商平台产品评论爬虫的设计| 1| null| null| null|
|| 7| 王琢| 基于Django的智能水务系统前端开发| 1| null| null| null|
|| 8| 王琢| 个人账本管理微信小程序开发| 1| null| null| null|
|| 9| 王琢| 智能水务系统远程监控模块的开发| 1| null| null| null|
|| 10| 张文波|基于安卓系统的硕士研究生招生预报名...| 1| null| null| null|
|| 11| 张文波|面向工业互联网的联网设备故障检测技...| 1| null| null| null|
|| 12| 张文波|面向工业互联网的联网设备运行维护系...| 1| null| null| null|
|| 13| 曹烨| 疫情防控管理信息系统的设计与开发| 1| null| null| null|
|| 14| 曹烨| 多线程下载器的设计与开发| 1| null| null| null|
|| 15| 曹烨| 坦克对战游戏的设计与开发| 1| null| null| null|
|| 16| 曹烨| 五子棋游戏大厅的设计与开发| 1| null| null| null|
|| 17| 杜焱| 疫情封闭人员及物资管理系统开发| 1| null| null| null|
|| 18| 杜焱| 志愿者服务系统开发| 1| null| null| null|
|| 19| 杜焱| 高校教师工作绩效管理系统开发| 1| null| null| null|
|+--------------------------------------+--------+-------------------------------------+------+--------+--------+--------+
|only showing top 20 rows
|
|
|Process finished with exit code 0
|
|""".stripMargin


hive

In production we need to adjust the configuration files first.

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

Simply cp hive-site.xml from $HIVE_HOME/conf into Spark's conf directory, or symlink it there.

This also works when Hive and Spark are not on the same machine -- you just can't use a symlink in that case.

If the MySQL driver is missing, drop the MySQL driver jar into Spark's jars directory,

or pass it at launch time with --jars <path>.
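
The session can also be pointed at the metastore from code instead of relying on hive-site.xml on the classpath. A minimal sketch, assuming the metastore service runs on a host called bigdata2 on the default port 9083 (adjust the thrift URI to your own environment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveMetastoreDemo")
  .master("local[4]")
  .config("hive.metastore.uris", "thrift://bigdata2:9083")  // assumed host/port
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()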

Next, run the following directly in spark-shell:

scala> spark.sql("show databases").show
+-------------+
| namespace|
+-------------+
| bigdata|
| bigdata_hive|
|bigdata_hive2|
|bigdata_hive3|
|bigdata_hive4|
| default|
| test|
+-------------+

We can also use the spark-sql script to run Hive statements,

like this:

[hadoop@bigdata5 conf]$ spark-sql --master local[4]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/12 10:10:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/12 10:10:15 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
23/01/12 10:10:15 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
Spark master: local[4], Application Id: local-1673489414801
spark-sql (default)> show databases;
namespace
bigdata
bigdata_hive
bigdata_hive2
bigdata_hive3
bigdata_hive4
default
test
Time taken: 2.213 seconds, Fetched 7 row(s)

We usually test SQL statements in spark-sql first and then deploy them in code => don't create tables from Spark SQL, it can cause problems => create them in Hive whenever possible.

Maintaining the warehouse => spark-sql -e / -f <sql file> => the recommended way to maintain an offline warehouse: simple and easy to keep up.

Switching Hive's execution engine to Spark => unstable => buggy => sometimes Spark functions cannot be used.

Any function Hive has, Spark also has,

but some functions Spark has do not exist in Hive.

In IDEA, hive-site.xml likewise has to go into the resources folder.

There is a lot more; look up whatever you need on the official site: https://spark.apache.org/docs/latest/sql-ref-syntax.html#ddl-statements

In IDEA, add the dependency first,

as follows:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>3.2.1</version>
</dependency>

Then running the following code is all it takes:

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql3 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
val frame = spark.sql(
"""
|select
|*
|from
|bigdata_hive3.emp
|""".stripMargin)
frame.show()
frame.printSchema()
val result =
"""
|+-----+--------+---------+----+--------+----+----+------+
||empno| ename| job| mgr|hiredate| sal|comm|deptno|
|+-----+--------+---------+----+--------+----+----+------+
|| 7369| SMITH| CLERK|7902| null| 800|null| 20|
|| 7499| ALLEN| SALESMAN|7698| null|1600| 300| 30|
|| 7521| WARD| SALESMAN|7698| null|1250| 500| 30|
|| 7566| JONES| MANAGER|7839| null|2975|null| 20|
|| 7654| MARTIN| SALESMAN|7698| null|1250|1400| 30|
|| 7698| BLAKE| MANAGER|7839| null|2850|null| 30|
|| 7782| CLARK| MANAGER|7839| null|2450|null| 10|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7839| KING|PRESIDENT|null| null|5000|null| 10|
|| 7844| TURNER| SALESMAN|7698| null|1500| 0| 30|
|| 7876| ADAMS| CLERK|7788| null|1100|null| 20|
|| 7900|lebulang| CLERK|7698| null| 950|null| 30|
|| 7902| FORD| ANALYST|7566| null|3000|null| 20|
|| 7934| MILLER| CLERK|7782| null|1300|null| 10|
|| 7839| KING|PRESIDENT|null| null|5000|null| 10|
|| 7654| MARTIN| SALESMAN|7698| null|3200|1400| 30|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|| 7788| SCOTT| ANALYST|7566| null|3000|null| 20|
|+-----+--------+---------+----+--------+----+----+------+
|only showing top 20 rows
|
|root
| |-- empno: string (nullable = true)
| |-- ename: string (nullable = true)
| |-- job: string (nullable = true)
| |-- mgr: long (nullable = true)
| |-- hiredate: date (nullable = true)
| |-- sal: decimal(10,0) (nullable = true)
| |-- comm: decimal(10,0) (nullable = true)
| |-- deptno: long (nullable = true)
|
|
|Process finished with exit code 0
|
|""".stripMargin
}
}

Writing data

Writes are usually accompanied by .crc files.

TEXT

Note that text output only supports a single column, not multiple columns. Because our resources folder contains config files, the write picks up their compression setting (bz2); you can specify the codec yourself, and when nothing is specified and no config file is present the output is uncompressed.

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql2 {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df2 = spark.read.option("lineSep",",").text("file:///D:\\test.txt")
df2.show()
//----------------------------------------- write the data
df2.write.text("file:///D:\\test1.txt")
//------------------------------------------- with compression
df2.write.option("compression", "gzip").text("file:///D:\\test2.txt")
}
}

If you want to get around this you would have to define your own external data source, which effectively means changing the source code,

or convert the DataFrame to an RDD for the write, since saveAsTextFile can output multiple columns:

df2.rdd.saveAsTextFile("file:///D:\\test3.txt")
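
Another workaround is to collapse the columns into a single string column first so that text() accepts it. A minimal sketch, assuming the df5 DataFrame loaded from user_profile.csv earlier; the output path is just an example:

import org.apache.spark.sql.functions._

// concat_ws joins every column into one delimited string column,
// which satisfies the single-column restriction of text()
df5.select(concat_ws(",", df5.columns.map(col): _*).as("value"))
  .write
  .mode("overwrite")
  .text("file:///D:\\test4_concat")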

Check the file format of the output.

json

//common save modes: append, overwrite, ignore, error (the default, errorifexists)
//df.write.mode(saveMode = "overwrite").json("hdfs://bigdata3:9000/spark")
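
The same modes are available as the SaveMode enum. A tiny sketch, assuming the df5 DataFrame from the csv section and the same HDFS path as the commented line above:

import org.apache.spark.sql.SaveMode

// SaveMode.Overwrite / Append / Ignore / ErrorIfExists correspond to the string modes;
// ErrorIfExists is the default and fails if the target path already exists
df5.write.mode(SaveMode.Overwrite).json("hdfs://bigdata3:9000/spark")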

csv

//on write, sep sets the output delimiter, mode controls overwriting, and compression sets the codec
df5.write.options(Map("sep"->";","compression"->"gzip")).mode("overwrite").format("csv").save("file:///C:\\Users\\dell\\Desktop\\user_profile1.csv")

The result is as follows.

JDBC

//write out -------------------------- via format/options
//with overwrite the existing table is dropped and recreated, so the table's structure can change
df.write.mode("append")
.format("jdbc")
.option("url", "jdbc:mysql://bigdata2:3306/try")
.option("dbtable", "emp1")
.option("user", "root")
.option("password", "liuzihan010616")
.save()
// --------------------------Properties
df.write.mode("append")
.jdbc("jdbc:mysql://bigdata2:3306/try", "emp1", connectionProperties)

// extra column types can be declared when the table is created on write
df.write.mode("append")
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc("jdbc:mysql://bigdata2:3306/try", "try.emp", connectionProperties)

excel

As follows:

df.write.mode("overwrite").excel(header = true,"A1").save("file:///C:\\Users\\dell\\Desktop\\2023届毕业设计题目-计算机-选题志愿表1.xlsx")

hive

There are a few ways to do this:

  • ctas
  val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").enableHiveSupport().getOrCreate()
def main(args: Array[String]): Unit = {

spark.sql(
"""
|create table bigdata.sparkfinish as
|select
|*
|from(
|select
|area,
|product_name,
|rank() over(partition by area order by cnt) as rk
|from (
|select
|area,
|product_name,
|count(1) as cnt
|from bigdata.tmp
|group by area,product_name
|)
|)where rk < 3;
|""".stripMargin)

// -------------------------------------- insert into: append data
spark.sql(
"""
|insert into table bigdata.sparkfinish
|select * from bigdata.sparkfinish
|""".stripMargin)

// ---------------------------------------- overwrite the data
spark.sql(
"""
|insert overwrite table bigdata.sparkfinish
|select * from bigdata.sparkfinish
|""".stripMargin)
// ---------------------------------------- partitioned tables: emp_partition is the original partitioned table, emp_partition1 was created afterwards
// --------------------------------------- dynamic partitioning needs these settings; static partitioning does not
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
spark.sql(
"""
|insert overwrite table bigdata_hive3.emp_partition1 partition(deptno)
|select * from bigdata_hive3.emp_partition
|""".stripMargin)
//-----------------------------------------------------------------api
val frame = checksql(hivesqlchoose("empno , ename , job , mgr , deptno ", "bigdata_hive3.emp", "where sal > 3000"))

// ----------------------------- overwrite mode here would wipe the whole table and rebuild it with the new data, so it is generally avoided, to prevent losing the other partitions when we only mean to touch one
// ------------------------------------- regular (non-partitioned) table
frame.write.mode(saveMode = "append").format("hive").saveAsTable("bigdata_hive3.emp89")
spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
spark.conf.set("hive.exec.dynamic.partition","true")
// ------------------------------------- partitioned table
frame.write.partitionBy("deptno").mode(saveMode = "append").format("hive").saveAsTable("bigdata_hive3.emp891")
// --------------------------------------- insertInto: inserts data; on a partitioned table it automatically uses dynamic partitioning, and it works for regular tables too
// Exception in thread "main" org.apache.spark.sql.AnalysisException: insertInto() can't be used together with partitionBy(). Partition columns have already been defined for the table. It is not necessary to use partitionBy().
//frame.write.partitionBy("deptno").mode(saveMode = "overwrite").format("hive").insertInto("bigdata_hive3.emp891")
// data can also be written straight to HDFS under the table's path

frame.select("empno","ename","job","mgr").write.mode("overwrite").parquet("hdfs://bigdata3:9000/user/hive/warehouse/bigdata_hive3.db/emp_partition1/deptno=20")
// for a regular table this kind of write is fine, but for a partitioned table the metastore may not pick up the new partition because its metadata differs
// repairing the metastore fixes that
// note the files must be stored as parquet or orc; text output would come out garbled here
// repair the metadata:
// msck repair table table_name [ADD/DROP/SYNC partition]
// or store the data via the RDD API instead
}


def hivesqlchoose(string: String*)={

val str = "select" + " " + string(0) + " " + "from" + " " + string(1)
if (string.length > 2){
str +" " + string(2)
}else{
str
}
}

def descfunctionsql(string: String)={
s"""
|desc function extended $string
|""".stripMargin
}

def checksql(string: String)={
spark.sql(string).show(false)
spark.sql(string)
}
}

sparksql

Spark SQL is the module mainly used for processing structured data.

Structured data => data that carries schema information

Semi-structured data => csv, json, orc, parquet

Unstructured data => nosql => redis, hbase

Spark SQL => you can not only write SQL but also program against it

Features

Spark SQL => SQL + DataFrame API => for processing structured data

The spark-core APIs are also usable here

There is a unified data interface => it handles many external data sources => mysql/hive/excel/csv/... => one API

Hive integration => using Hive becomes very simple

Spark SQL is more than just SQL

hive on spark => Hive's query engine is Spark

spark on hive => Spark SQL queries data stored in Hive; this is what most people use

Spark SQL is performance-optimized => faster than plain RDDs

Basics

Why Spark SQL outperforms raw RDDs:

under the hood Spark SQL still runs on spark-core RDDs, just with optimizations applied,

because the user provides a schema.

spark-core => programming model: RDD

Spark SQL => RDD[the data] + schema[field names and types] => table
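
This split is visible directly on a DataFrame: the data lives in an underlying RDD and the field names and types live in a schema object. A small sketch, assuming the Skills.json file that is used later in this post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
val df = spark.read.json("file:///C:\\Users\\dell\\Desktop\\Skills.json")

df.rdd.take(3).foreach(println)   // the underlying RDD[Row] that holds the data
println(df.schema.treeString)     // the schema (field names + types) kept alongside it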

DataSet&DataFrame

Spark SQL programming model => Dataset / DataFrame

dataset

A distributed collection of data.

Has many advantages over RDDs => heavily optimized => more efficient => strongly typed => supports the usual operators => better query performance => introduced in Spark 1.6.

PySpark does not support the Dataset API.

dataframe

A DataFrame is itself a Dataset

DataFrame => like a table in an ordinary database => supports the operators => a Dataset of Rows

Row => one row of data, containing only the column values

DataFrame => table

Compared with spark-core

spark-core => RDD

Spark SQL => DataFrame [RDD of data + extra schema information]

Spark SQL history:

  • 1.0 => SchemaRDD: an RDD holding the data + a schema (like metadata: stores the extra information)
  • 1.6 => Dataset, which the DataFrame evolved into

Creating a DataFrame with SparkSession

Tools: IDEA, Linux

Dependencies: the Spark SQL dependency

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>${hive.version}</version>
</dependency>

linux

When you start spark-shell on Linux, a SparkSession is provided for you automatically:

[hadoop@bigdata5 ~]$ spark-shell --master local[4]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/10 10:11:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://bigdata5:4040
Spark context available as 'sc' (master = local[4], app id = local-1673316684351).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Create a DataFrame like this:


scala> val df = spark.read.json("file:///home/hadoop/data/json/Skills.json")
df: org.apache.spark.sql.DataFrame = [_corrupt_record: string, animationId: bigint ... 24 more fields]

scala> df.show
23/01/10 10:14:42 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
|_corrupt_record|animationId| damage| description| effects|hitType|iconIndex| id| message1|message2|messageType|mpCost| name| note|occasion|repeats|requiredWtypeId|requiredWtypeId1|requiredWtypeId2|scope|speed|stypeId|successRate|tpCost|tpGain|xianliCost|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
| [| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null,| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null| 1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 1| %1的攻击!| | 1| 0| 攻击| 1 号技能会在选择“攻击”指令时使...| 1| 1| null| 0| 0| 1| 0| 0| 100| 0| 10| null|
| null| 0|{false, 0, 0, 0, 20}| | [{21, 2, 1.0, 0}]| 0| 688| 2|%1正在保护自己。| | 1| 0| 防御|1 号技能会在选择“防御”指令时使用。| 1| 1| null| 0| 0| 11| 10| 0| 100| 0| 10| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 3| %1的攻击!| | 1| 0|连续攻击| | 1| 2| null| 0| 0| 1| 0| 2| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 4| %1的攻击!| | 1| 0|两次攻击| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 849| 5| %1的攻击!| | 1| 0|三次攻击| | 1| 1| null| 0| 0| 5| 0| 0| 100| 0| 4| null|
| null| 0|{false, 0, 0, 0, 20}| | [{41, 0, 0.0, 0}]| 0| 883| 6| %1逃跑了。| | 1| 0| 逃跑| | 1| 1| null| 0| 0| 11| 0| 0| 100| 0| 0| null|
| null| 0|{false, 0, 0, 0, 20}| | []| 0| 979| 7| %1正在观望。| | 1| 0| 观望| | 1| 1| null| 0| 0| 0| 0| 0| 100| 0| 10| null|
| null| 41|{false, 0, 200 + ...| |[{21, 4, 1.0, 0},...| 0| 72| 8| %1吟唱了%2!| | 1| 5| 治愈| | 0| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 66|{false, 2, 100 + ...| 魔法\n初级的圣光技能,能召唤微弱...| [{44, 30, 1.0, 0}]| 2| 64| 9| %1吟唱了%2!| | 1| 5| 火焰| | 1| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 0|{false, 2, 285 + ...|呼吸法\n常见的呼吸法,运转时能够...| [{21, 21, 1.0, 0}]| 0| 3084| 10| %1施放了%2!| | 1| 0|小吐纳法| <Cast Animation: ...| 0| 1| null| 0| 0| 11| 0| 4| 100| 0| 0| null|
| null| 152|{false, 2, 285 + ...| |[{21, 153, 1.0, 0...| 0| 499| 11| %1使用了%2!| | 1| 0| 灭魂术| <Cast Animation: 0> | 1| 1| null| 0| 0| 2| 0| 0| 100| 0| 0| null|
| null| 38|{true, 1, 20000, ...| 基因锁·一阶\n觉醒了脚上的力量,...| [{21, 72, 0.2, 0}]| 0| 479| 12| %1使出了 %2!| | 1| 0| 骑士踢| <setup action>\na...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 0| null|
| null| 125|{true, 0, 150 + a...| 基因锁·一阶\n觉醒了一种气功,能...| [{44, 24, 1.0, 0}]| 2| 4471| 13| %1施放了%2!| | 1| 50| 变身| <Cast Animation: ...| 1| 1| null| 0| 0| 11| 0| 2| 100| 0| 0| null|
| null| 23|{true, -1, 426500...| | [{21, 0, 1.0, 0}]| 1| 640| 14| %1的攻击!| | 1| 0|莫名剑法| <Cast Animation: ...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 10| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 15| | | 1| 0|无望三阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 16| | | 1| 0|无望四阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| -1|{true, -1, 10000+...| | [{21, 0, 1.0, 0}]| 0| 880| 17| %1的攻击!| | 1| 0| 骑士拳| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| 311|{false, 0, 300, 0...| 基因锁·四阶\n返祖·又北二百八十...| []| 0| 484| 18| %1施放了%2!| | 1| 0|孟极血脉| \n<passiveAPLUS:1...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
only showing top 20 rows

idea

package sparkfirst

import org.apache.spark.sql.SparkSession

object sparksql {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("Sparksql01").master("local[2]").getOrCreate()
val frame = spark.read.json("file:///C:\\Users\\dell\\Desktop\\Skills.json")
frame.show()
spark.stop()
}
}

Data analysis with Spark SQL

  • SQL
  • code

Working with DataFrames

  • SQL => in IDEA, API + SQL used together, or SQL files run against Hive
  • API => generally used when building a big-data platform

Learning the API

Selecting a single column of a DataFrame: select

scala> df.select("description").show
+-----------------------------------+
| description|
+-----------------------------------+
| null|
| null|
| |
| |
| |
| |
| |
| |
| |
| |
| 魔法\n初级的圣光技能,能召唤微弱...|
|呼吸法\n常见的呼吸法,运转时能够...|
| |
| 基因锁·一阶\n觉醒了脚上的力量,...|
| 基因锁·一阶\n觉醒了一种气功,能...|
| |
| 基因锁·破碎\n已经达到身体的极限...|
| 基因锁·破碎\n已经达到身体的极限...|
| |
| 基因锁·四阶\n返祖·又北二百八十...|
+-----------------------------------+
only showing top 20 rows
----------------------------------------------------------------------------源码
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)

Pass the column name directly => select("colName"), or select($"colName"), or select('colName) => the $ and ' forms need the implicit conversions from import spark.implicits._, which is not required in spark-shell on Linux
select(col("age")) => needs import org.apache.spark.sql.functions._; again not required on Linux
------------------------------------------------------------------------------
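
The variants side by side, as a minimal sketch for an IDEA project (in spark-shell the implicits are already in scope); it assumes a DataFrame df that has a name column, like the Skills.json one above:

import org.apache.spark.sql.functions.col
import spark.implicits._       // enables the $"..." and 'symbol column syntax in an IDE project

df.select("name").show(3)      // plain column name
df.select($"name").show(3)     // string interpolator -> Column
df.select('name).show(3)       // Scala symbol -> Column
df.select(col("name")).show(3) // functions.col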

createOrReplaceTempView(): registers a temporary view => after that the DataFrame can be queried with SQL


scala> df.createOrReplaceTempView("test")

scala> spark.sql("select * from test")
res4: org.apache.spark.sql.DataFrame = [_corrupt_record: string, animationId: bigint ... 24 more fields]

scala> spark.sql("select * from test").show()
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
|_corrupt_record|animationId| damage| description| effects|hitType|iconIndex| id| message1|message2|messageType|mpCost| name| note|occasion|repeats|requiredWtypeId|requiredWtypeId1|requiredWtypeId2|scope|speed|stypeId|successRate|tpCost|tpGain|xianliCost|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
| [| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null,| null| null| null| null| null| null|null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null| null|
| null| 1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 1| %1的攻击!| | 1| 0| 攻击| 1 号技能会在选择“攻击”指令时使...| 1| 1| null| 0| 0| 1| 0| 0| 100| 0| 10| null|
| null| 0|{false, 0, 0, 0, 20}| | [{21, 2, 1.0, 0}]| 0| 688| 2|%1正在保护自己。| | 1| 0| 防御|1 号技能会在选择“防御”指令时使用。| 1| 1| null| 0| 0| 11| 10| 0| 100| 0| 10| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 3| %1的攻击!| | 1| 0|连续攻击| | 1| 2| null| 0| 0| 1| 0| 2| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 880| 4| %1的攻击!| | 1| 0|两次攻击| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| -1|{true, -1, a.atk ...| | [{21, 0, 1.0, 0}]| 1| 849| 5| %1的攻击!| | 1| 0|三次攻击| | 1| 1| null| 0| 0| 5| 0| 0| 100| 0| 4| null|
| null| 0|{false, 0, 0, 0, 20}| | [{41, 0, 0.0, 0}]| 0| 883| 6| %1逃跑了。| | 1| 0| 逃跑| | 1| 1| null| 0| 0| 11| 0| 0| 100| 0| 0| null|
| null| 0|{false, 0, 0, 0, 20}| | []| 0| 979| 7| %1正在观望。| | 1| 0| 观望| | 1| 1| null| 0| 0| 0| 0| 0| 100| 0| 10| null|
| null| 41|{false, 0, 200 + ...| |[{21, 4, 1.0, 0},...| 0| 72| 8| %1吟唱了%2!| | 1| 5| 治愈| | 0| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 66|{false, 2, 100 + ...| 魔法\n初级的圣光技能,能召唤微弱...| [{44, 30, 1.0, 0}]| 2| 64| 9| %1吟唱了%2!| | 1| 5| 火焰| | 1| 1| null| 0| 0| 1| 0| 1| 100| 0| 10| null|
| null| 0|{false, 2, 285 + ...|呼吸法\n常见的呼吸法,运转时能够...| [{21, 21, 1.0, 0}]| 0| 3084| 10| %1施放了%2!| | 1| 0|小吐纳法| <Cast Animation: ...| 0| 1| null| 0| 0| 11| 0| 4| 100| 0| 0| null|
| null| 152|{false, 2, 285 + ...| |[{21, 153, 1.0, 0...| 0| 499| 11| %1使用了%2!| | 1| 0| 灭魂术| <Cast Animation: 0> | 1| 1| null| 0| 0| 2| 0| 0| 100| 0| 0| null|
| null| 38|{true, 1, 20000, ...| 基因锁·一阶\n觉醒了脚上的力量,...| [{21, 72, 0.2, 0}]| 0| 479| 12| %1使出了 %2!| | 1| 0| 骑士踢| <setup action>\na...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 0| null|
| null| 125|{true, 0, 150 + a...| 基因锁·一阶\n觉醒了一种气功,能...| [{44, 24, 1.0, 0}]| 2| 4471| 13| %1施放了%2!| | 1| 50| 变身| <Cast Animation: ...| 1| 1| null| 0| 0| 11| 0| 2| 100| 0| 0| null|
| null| 23|{true, -1, 426500...| | [{21, 0, 1.0, 0}]| 1| 640| 14| %1的攻击!| | 1| 0|莫名剑法| <Cast Animation: ...| 1| 1| null| 0| 0| 1| 0| 2| 100| 0| 10| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 15| | | 1| 0|无望三阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| 0|{false, 4, 100 + ...| 基因锁·破碎\n已经达到身体的极限...| []| 0| 943| 16| | | 1| 0|无望四阶| <Hide in Battle>\...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
| null| -1|{true, -1, 10000+...| | [{21, 0, 1.0, 0}]| 0| 880| 17| %1的攻击!| | 1| 0| 骑士拳| | 1| 1| null| 0| 0| 4| 0| 0| 100| 0| 5| null|
| null| 311|{false, 0, 300, 0...| 基因锁·四阶\n返祖·又北二百八十...| []| 0| 484| 18| %1施放了%2!| | 1| 0|孟极血脉| \n<passiveAPLUS:1...| 3| 1| null| 0| 0| 0| 0| 3| 100| 0| 0| null|
+---------------+-----------+--------------------+-----------------------------------+--------------------+-------+---------+----+----------------+--------+-----------+------+--------+----------------------------------+--------+-------+---------------+----------------+----------------+-----+-----+-------+-----------+------+------+----------+
only showing top 20 rows


Building the warehouse:

  • SQL files to maintain the warehouse => recommended, easy to maintain
  • IDEA
    • maintain the warehouse with SQL => example: Didi
    • maintain the warehouse with the API => harder to maintain => but convenient for defining UDFs
      • you can write generic code to maintain it => very powerful once it is written

How to build a DataFrame

Through the SparkSession

RDD => DataFrame

  • Reflection: a data structure (tuple or case class) -> becomes a DataFrame -> just call RDD.toDF; the arguments of toDF("colName", "colName", ...) are the column names inside the DataFrame (see the sketch after this block)
  • Programmatically -> build the DataFrame ->
    • prepare an RDD of Row objects

    • schema => the field names and field types

      • schema: think of it as a table's metadata => the field names and field types => maintained as a StructType
        • fields: the metadata of a single field is maintained as a StructField
    • createDataFrame => df

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{DataFrame, Row}
      import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

      // inputRDD is an RDD[String] of "uid,name,age" lines, e.g. read with spark.sparkContext.textFile(...)
      val rowRDD: RDD[Row] = inputRDD.map(line => {
        val splits = line.split(",")
        val uid = splits(0)
        val name = splits(1)
        val age = splits(2).toInt
        Row(uid, name, age)
      })

      // the schema describes the field names and types that go with the rows
      val schema = StructType(Array(
        StructField("uid", StringType),
        StructField("name", StringType),
        StructField("age", IntegerType)
      ))

      val inputDF: DataFrame = spark.createDataFrame(rowRDD, schema)
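
And the reflection route from the first bullet, as a minimal sketch: the User case class and the sample rows are made up for illustration, and in an IDEA project the case class has to sit outside the method that calls toDF:

import org.apache.spark.sql.SparkSession

// the case class supplies both the column names and the types via reflection
case class User(uid: String, name: String, age: Int)

val spark = SparkSession.builder().appName("Sparksql01").master("local[4]").getOrCreate()
import spark.implicits._   // brings the rdd.toDF() conversion into scope

val userDF = spark.sparkContext
  .parallelize(Seq("1,zhangsan,20", "2,lisi,25"))   // stand-in for a real input file
  .map(_.split(","))
  .map(a => User(a(0), a(1), a(2).toInt))
  .toDF()                // or .toDF("uid", "name", "age") to rename the columns

userDF.printSchema()
userDF.show()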

dataframe/dataset -> RDD

  • df.rdd

df -> ds

  • convert with as: df.as[T] => Dataset; T is usually a case class (see the sketch below)
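
A minimal sketch of the as conversion, assuming the emp DataFrame read from MySQL earlier; the Emp case class is only an illustration, and its field types have to match (or safely up-cast from) the actual column types:

import spark.implicits._   // provides the Encoder for the case class

case class Emp(empno: Long, ename: String, job: String)

// the selected column names must line up with the case class fields
val empDS: org.apache.spark.sql.Dataset[Emp] =
  df.select("empno", "ename", "job").as[Emp]

empDS.filter(_.job == "ANALYST").show()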

Homework: extract the MySQL emp and dept tables as JSON,

and redo the earlier requirements with Spark SQL.
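
A minimal sketch of the extraction part, reusing the MySQL connection details from the jdbc section; the output directories are just examples:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MysqlToJson").master("local[4]").getOrCreate()

// read one table over jdbc and dump it as json
def dumpAsJson(table: String, outDir: String): Unit =
  spark.read.format("jdbc")
    .option("url", "jdbc:mysql://bigdata2:3306/try")
    .option("dbtable", table)
    .option("user", "root")
    .option("password", "liuzihan010616")
    .load()
    .write.mode("overwrite")
    .json(outDir)

dumpAsJson("emp", "file:///D:\\emp_json")
dumpAsJson("dept", "file:///D:\\dept_json")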

The data is as follows:
-----------------------------------------dept
{
"dept": [
{
"deptno" : 10,
"dname" : "ACCOUNTING",
"loc" : "NEW YORK"
},
{
"deptno" : 20,
"dname" : "RESEARCH",
"loc" : "DALLAS"
},
{
"deptno" : 30,
"dname" : "SALES",
"loc" : "CHICAGO"
},
{
"deptno" : 40,
"dname" : "OPERATIONS",
"loc" : "BOSTON"
}
]}
-----------------------------------------emp
{
"emp": [
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T06:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T06:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T06:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T06:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T05:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T05:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T05:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T06:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T06:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T06:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T05:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T06:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 3200.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7369,
"ename" : "SMITH",
"job" : "CLERK",
"mgr" : 7902,
"hiredate" : "1980-12-17T20:00:00Z",
"sal" : 800.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7499,
"ename" : "ALLEN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-20T20:00:00Z",
"sal" : 1600.00,
"comm" : 300.00,
"deptno" : 30
},
{
"empno" : 7521,
"ename" : "WARD",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-02-22T20:00:00Z",
"sal" : 1250.00,
"comm" : 500.00,
"deptno" : 30
},
{
"empno" : 7566,
"ename" : "JONES",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-04-02T20:00:00Z",
"sal" : 2975.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7654,
"ename" : "MARTIN",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-28T18:00:00Z",
"sal" : 1250.00,
"comm" : 1400.00,
"deptno" : 30
},
{
"empno" : 7698,
"ename" : "BLAKE",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-05-01T18:00:00Z",
"sal" : 2850.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7782,
"ename" : "CLARK",
"job" : "MANAGER",
"mgr" : 7839,
"hiredate" : "1981-06-09T18:00:00Z",
"sal" : 2450.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7788,
"ename" : "SCOTT",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1982-12-09T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7839,
"ename" : "KING",
"job" : "PRESIDENT",
"mgr" : null,
"hiredate" : "1981-11-17T20:00:00Z",
"sal" : 5000.00,
"comm" : null,
"deptno" : 10
},
{
"empno" : 7844,
"ename" : "TURNER",
"job" : "SALESMAN",
"mgr" : 7698,
"hiredate" : "1981-09-08T18:00:00Z",
"sal" : 1500.00,
"comm" : 0.00,
"deptno" : 30
},
{
"empno" : 7876,
"ename" : "ADAMS",
"job" : "CLERK",
"mgr" : 7788,
"hiredate" : "1983-01-12T20:00:00Z",
"sal" : 1100.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7900,
"ename" : "lebulang",
"job" : "CLERK",
"mgr" : 7698,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 950.00,
"comm" : null,
"deptno" : 30
},
{
"empno" : 7902,
"ename" : "FORD",
"job" : "ANALYST",
"mgr" : 7566,
"hiredate" : "1981-12-03T20:00:00Z",
"sal" : 3000.00,
"comm" : null,
"deptno" : 20
},
{
"empno" : 7934,
"ename" : "MILLER",
"job" : "CLERK",
"mgr" : 7782,
"hiredate" : "1982-01-23T20:00:00Z",
"sal" : 1300.00,
"comm" : null,
"deptno" : 10
}
]}
---------------------------------------------------------------------------- Questions
1. List the employee number and name of every employee in department 30.
2. Find the full details of all managers in department 10 and of all salesmen in department 20.
3. List the full details of all employees, sorted by salary in descending order; for equal salaries, sort by hire date in ascending order.
4. Among employees earning more than 1500, list each job and the number of employees doing it.
5. List the names of the employees who work in the sales department, assuming the department number of SALES is unknown.
6. Find employees whose name starts with S / ends with S / contains S / has L as its second letter.
7. For each job, find the highest salary, the lowest salary, and the head count.
8. List the employee number, name, department name, manager, salary, and salary grade of every employee whose salary is above the company-wide average.
9. List the name, salary, and department name of every employee whose salary is above the average salary of their own department.

Solutions:

//Because the data above is not in the format Spark's default JSON reader expects (one record per line),
//we first reshape it in VS Code; after the conversion:
//-------------------------------------------------------
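//A hedged aside, not part of the original notes: Spark's JSON data source can also parse
//pretty-printed (multi-line) JSON directly via the multiLine option, which may avoid the manual
//reshaping step; the path below is hypothetical, and if the records are wrapped in a top-level
//object/array the result may still need to be flattened afterwards.
val empRaw = spark.read.option("multiLine", true).json("hdfs://bigdata3:9000/spark/emp.json")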
//List the employee number and name of every employee in department 30
//---------------------------api
val cluname = emp.columns.toList
cluname.foreach(println(_))
emp.select("deptno","empno","ename").rdd.filter(x=>{
x(0)==30
}).saveAsTextFile("hdfs://bigdata3:9000/spark/1")
//---------------------------sql
emp.createOrReplaceTempView("tableemp")
spark.sql("select deptno,empno,ename from tableemp where deptno=30").show()
//---------------------------------------------------------------2
//Find the full details of all managers in department 10 and of all salesmen in department 20
//----------------------------api
emp.select("comm","deptno","empno","ename","hiredate","job","mgr","sal").rdd.filter(x=>{
(x(1)==10&&x(5)=="MANAGER")||(x(1)==20&&x(5)=="SALESMAN")
}).saveAsTextFile("hdfs://bigdata3:9000/spark/2")
//--------------------------sql
spark.sql("select * from tableemp where deptno=10 and job= 'MANAGER' or deptno = 20 and job= 'SALESMAN'").show()
//--------------------------------------------------------------3
//List the full details of all employees, ordered by salary descending and, for equal salaries, by hire date ascending
//----------------------api
import java.text.SimpleDateFormat
emp.rdd.map(x=>{
if (x.isNullAt(0)){
var total = x.getDouble(7)
var hire = x.getString(4).split("Z")
var reallyhire = hire(0).split("T")
val date = reallyhire(0)+" " +reallyhire(1)
var Data = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(date)
(total,Data,x)
}else{
var total = x.getDouble(0)+x.getDouble(7)
var hire = x.getString(4).split("Z")
var reallyhire = hire(0).split("T")
val date = reallyhire(0)+" " +reallyhire(1)
var Data = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(date)
(total,Data,x)
}
}).sortBy(x=>(-x._1,x._2)).map(x=>{x._3}).saveAsTextFile("hdfs://bigdata3:9000/spark/3")

//------------------------------------sql
spark.sql("select * from (select (ifnull(comm,0)+sal) as total, hiredate,job,mgr,sal,empno,ename,deptno from tableemp ) order by total desc,hiredate asc")
//-------------------------------------------4
//Among employees earning more than 1500, list each job and the number of employees doing it
//----------------------------------------api
emp.filter(Row=>{
Row.getDouble(7) > 1500
}).groupBy("job").count().rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/4")
//----------------------------------------sql
spark.sql("select job,count(*) from tableemp where sal > 1500 group by job").show()
//--------------------------------------------5
//List the names of the employees who work in the sales department (SALES), assuming its department number is unknown
//---------------------------api
dept.filter(x=>{
x.getString(1)=="SALES"
}).join(emp,"deptno").select("ename").rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/5")

//------------------------sql
spark.sql("select ename from tableemp where deptno in(select deptno from tabledept where dname='SALES')").show()
//-----------------------------------------------------------6
//Find employees whose name starts with S / ends with S / contains S / has L as the second letter
//----------------------------------api
emp.filter(x=>{
(x.getString(3).contains("S"))||(x.getString(3).startsWith("S"))||(x.getString(3).endsWith("S")||x.getString(3).charAt(1)=='L')
}).rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/6")
//----------------------------------sql
spark.sql("select * from tableemp where ename like '%S%' or ename REGEXP '^.L'").show()
//-----------------------------------------------------------7
//For each job, find the highest salary, the lowest salary, and the head count
//---------------------------------api
//(the emp.map calls below assume import spark.implicits._ is already in scope)
val frame2 = emp.map(x => {
if (x.isNullAt(0)) {
val d = x.getDouble(7)
(x.getString(5), d)
} else {
val d = x.getDouble(0) + x.getDouble(7)
(x.getString(5), d)
}
}).groupBy("_1").max("_2")

val frame1 = emp.map(x => {
if (x.isNullAt(0)) {
val d = x.getDouble(7)
(x.getString(5), d)
} else {
val d = x.getDouble(0) + x.getDouble(7)
(x.getString(5), d)
}
}).groupBy("_1").min("_2")

val frame = emp.map(x => {
if (x.isNullAt(0)) {
val d = x.getDouble(7)
(x.getString(5), d)
} else {
val d = x.getDouble(0) + x.getDouble(7)
(x.getString(5), d)
}
}).groupBy("_1").count()

frame.join(frame1 , "_1").join(frame2,"_1").rdd.saveAsTextFile("hdfs://bigdata3:9000/spark/7")

//-------------------------------sql
spark.sql("select max(ifnull(comm,0)+sal),min(ifnull(comm,0)+sal),count(*),job from tableemp group by job ").show()

//--------------------------------------------------------------8
//List the employee number, name, department name, manager, salary, and salary grade of every employee whose salary is above the company-wide average
//--------------------------------------------api
import org.apache.spark.sql.functions._
val list = emp.groupBy().avg("sal").rdd.map(_.getDouble(0)).collect().toList
println(list(0))
val value = emp.filter(x => {
x.getDouble(7) > list(0).toDouble
})
value.show()
val frame = emp.select($"empno".alias("mgr"), $"ename".alias("leader")).join(value, "mgr").join(dept,"deptno")
//----------------------------------------------------salary grade (salgrade table)
// insert into salgrade values (1, 700, 1200);
// insert into salgrade values (2, 1201, 1400);
// insert into salgrade values (3, 1401, 2000);
// insert into salgrade values (4, 2001, 3000);
// insert into salgrade values (5, 3001, 9999);
frame.printSchema()
// frame.rdd.collect().foreach(println(_))
val value1 = frame.rdd.map(x => {
// in this joined frame, column 3 is comm (may be null) and column 8 is sal
val earn = if (x.isNullAt(3)) x.getDouble(8) else x.getDouble(3) + x.getDouble(8)
// map the earnings onto the salgrade ranges above (boundaries inclusive)
val grade =
if (earn <= 1200) 1
else if (earn <= 1400) 2
else if (earn <= 2000) 3
else if (earn <= 3000) 4
else 5
(grade, x(1), x)
})

value1.saveAsTextFile("hdfs://bigdata3:9000/spark/8")

// --------------------------------------------sql
emp.createOrReplaceTempView("tableemp1")
spark.sql(
"""
|select king.ename, king.empno, e1.ename as leader, king.earn, s.grade as sallevel
|from (
|  select ename, empno, deptno, ifnull((sal + comm), sal) as earn, mgr
|  from tableemp
|  where sal > (select avg(sal) from tableemp1)
|) as king
|left join tabledept on king.deptno = tabledept.deptno
|left join (select empno, ename from tableemp) e1 on king.mgr = e1.empno
|left join salgrade as s on earn >= losal and earn <= hisal
|""".stripMargin).show()

//--------------------------------------------------------------------9
//List the name, salary, and department name of every employee whose salary is above the average salary of their own department
//----------------------------------------------api
val frame = emp.groupBy("deptno").avg("sal").rdd.collect().toList
frame.foreach(println(_))

for (elem <- frame){
var name = elem(0).toString + "deptno"
var frame1 = emp.filter(x => {
(x.getLong(1).toString == elem(0).toString) && (x.getDouble(7) > elem(1).toString.toDouble)
}).join(dept, "deptno").rdd.saveAsTextFile(s"hdfs://bigdata3:9000/spark/9/$name")
}
//----------------------------------------------sql
spark.sql(
"""
|select *
|from (
|  select *
|  from (
|    select avg(sal) as sal_avg, deptno as deptno1
|    from (select sal, deptno from tableemp group by sal, deptno) as king
|    group by deptno
|  ) as avg_basic
|  left join tableemp
|    on tableemp.deptno = avg_basic.deptno1 and tableemp.sal > avg_basic.sal_avg
|) as basicinfo
|where basicinfo.deptno1 in (select deptno from tabledept)
|""".stripMargin).show()



spark

Why did Spark come about?
MapReduce and Hive are batch/offline tools and have some limitations:

  • the MR API is complex to develop against
  • they only do offline computation, not real-time computation
  • performance is limited

A typical need:

sql => mr

One requirement usually turns into several MR jobs:

mr1 => mr2 => mr3

map => reduce

After the map side finishes, the data is spilled to disk before the reduce side of the MR job runs.

MR operates on key/value pairs and sorts by key.

What is Spark?

Official site: spark.apache.org

It is a compute engine: it does not concern itself with how the data is stored.

Features:

  • Batch/streaming data => unified batch and stream processing
  • SQL analytics
  • Data science at scale
  • Machine learning

Fast (a small sketch of the chaining follows this list):

  • in-memory computation
  • DAG => chained execution => mr1 => mr2 => mr3
  • pipelined execution
  • a thread-level programming model (tasks run as threads inside executors)
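A small sketch of the chaining, assuming a SparkContext sc is already available (as in spark-shell); the numbers are arbitrary. map and filter are chained transformations that Spark pipelines together, and nothing runs until the action at the end.

val nums = sc.parallelize(1 to 100)   // source RDD (think mr1)
val scaled = nums.map(_ * 2)          // chained transformation (mr2)
val big = scaled.filter(_ > 100)      // chained transformation (mr3)
println(big.count())                  // only this action triggers the whole chain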

Easy to use

  • development languages: Java, Scala, Python, SQL
  • external data sources
  • 80+ high-level operators => Scala-style operators
  • in MR, reading a MySQL database means writing a DBInputFormat
  • Spark ships with many external data sources out of the box => jdbc, json, csv
  • MR gives you only map/reduce
  • Spark gives you roughly 80 operators

General purpose

  • sub-modules:
  • Spark Core => offline (batch) computation
  • Spark SQL => offline (batch) computation
  • Spark Streaming / Structured Streaming => real-time computation
  • MLlib => machine learning
  • graph computation => graph processing

The Spark sub-modules can be combined and used with one another.

Where jobs run

  • yarn ***
  • mesos
  • k8s ***
  • standalone

Hadoop ecosystem vs Spark ecosystem

  • Batch: MR, Hive vs Spark Core, Spark SQL
  • SQL: Hive, Impala vs Spark SQL
  • Streaming: Storm vs Spark Streaming, Structured Streaming
  • ML: Mahout vs MLlib
  • Real-time storage: HBase, Cassandra vs the DataSource API

Can Spark replace Hadoop? No: Spark can replace MR as the compute engine, not Hadoop as a whole.

Spark versions:

  • spark 1.x
  • spark 2.x (mainstream)
  • spark 3.x (mainstream)

Programming models:

Spark Core => RDD

Spark SQL => DataFrame & Dataset

Spark Streaming => DStream

sparkcore

RDD: developing with RDDs lowers development cost compared with MR.

What is an RDD?

lower level => MR

high level => Spark's high-level operators

Advantages:

  • Resilient Distributed Dataset
  • dataset => partitions of elements => individual records
  • can be computed in parallel

Resilient?

  • fault tolerant => a failed computation can be retried

Distributed?

  • storage
    • rdd: 1 2 3 4 5 6
      • partition1: 1 2 3
      • partition2: 4 5
      • partition3: 6
    • bigdata3: p1
    • bigdata4: p2
    • bigdata5: p3
  • computation
    • operating on an RDD means operating on the data inside it
  • dataset
    • simply the data the RDD itself is built from
  • immutable
    • scala: val vs var
    • rdda => rddb
    • immutable => a computation on rdda yields a new RDD
  • a partitioned collection of elements => an RDD can be stored and computed partition by partition
  • one RDD is made up of multiple partitions
  • an RDD's data is stored in a distributed way, across nodes

abstract

The type parameter T constrains the type of the data inside the RDD, e.g. RDD[String], RDD[Int], RDD[Student]

Serializable => the data can be shipped over the network

The @transient annotation marks a field that is not serialized [good to know]

Characteristics of an RDD:

  • an RDD is backed by a list of partitions
  • computing on / operating on an RDD really means computing on its underlying partitions
  • dependencies between RDDs
    • rdda => rddb
    • RDDs are immutable
    • rdda => b => c
  • Partitioner => only for key/value RDDs
  • the default partitioning is hash partitioning
  • data locality => reduces the IO of shipping data around, a real advantage
    • the benefit when operating on an RDD:
      • preferably the task is scheduled on the node that already holds the data => the ideal case
      • otherwise => the task is scheduled on one node while the data lives on another, and the data has to be sent over the network before it can be processed

Inside an RDD you can use Scala's map and other higher-order functions; a small sketch of these points follows below.
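A minimal sketch of the partition-related points above, assuming a SparkContext sc; the data and partition counts are arbitrary.

import org.apache.spark.HashPartitioner

// a key/value RDD spread over 3 partitions
val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), 3)
println(kv.getNumPartitions)   // 3: the RDD is backed by a list of partitions

// RDDs are immutable: partitionBy returns a new RDD, kv itself is unchanged
val hashed = kv.partitionBy(new HashPartitioner(2))
println(hashed.partitioner)    // Some(HashPartitioner) - only key/value RDDs carry a Partitioner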

RDD operations:

Build a Spark Core job in IDEA.

Add the dependency:

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.2.1</version>
</dependency>

In MapReduce, the program entry point is a Job.

Initializing Spark:

  • SparkContext => the entry point of a Spark Core program

  • SparkConf => describes the details of the Spark app

    • AppName => the job name
    • Master => where the job runs, i.e. the run mode of the Spark job
      • local, yarn, standalone, k8s, mesos
      • in production: yarn, k8s, local
      • for testing: local
      • a Spark application can have only one SparkContext
  • How to specify the Master (run mode) — a sketch follows this list:

    • local[K] mode

      • K is the number of threads
    • standalone => spark://HOST:PORT

    • yarn, two modes:

      • client mode
      • cluster mode
    • k8s

      • k8s://HOST:PORT
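A minimal sketch of the initialization, assuming the spark-core dependency above is on the classpath; the app name and master value are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // SparkConf describes the app: name + where it runs
    val conf = new SparkConf()
      .setAppName("first-spark-app")   // job name (placeholder)
      .setMaster("local[2]")           // local mode with 2 threads; use yarn/k8s outside of testing

    // SparkContext is the entry point of a Spark Core program (only one per application)
    val sc = new SparkContext(conf)

    println(sc.parallelize(1 to 10).sum())

    sc.stop()
  }
}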

Programming with RDDs

Creating an RDD (both ways are sketched below):

parallelize an existing collection

reference a dataset in an external storage system: HDFS, HBase, or another data store

External storage:
HDFS, local filesystem, HBase, S3, COS, ...
Data file types:
text files, SequenceFiles, and any other Hadoop InputFormat.
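A short sketch of both ways of creating an RDD, assuming a SparkContext sc; the HDFS path is a placeholder.

// 1) parallelize an existing collection
val fromCollection = sc.parallelize(Seq("spark", "yarn", "kafka"))

// 2) reference a dataset in an external storage system (a text file on HDFS here; local paths work too)
val fromFile = sc.textFile("hdfs://bigdata3:9000/spark/input/words.txt")

println(fromCollection.count())
println(fromFile.first())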

Deploying Spark:

Spark does not have to be deployed as a distributed service (compare with Hive).

Spark does support a distributed deployment => standalone mode.

Steps: unpack => create a symlink => source the environment file.

spark-shell is the Spark Core shell script.

Example: spark-shell --master local[2]

Start spark-shell to try out code:

  • web UI => one per Spark application, e.g. http://bigdata32:4040
  • the --master option => which mode the spark shell runs in
  • --name can be used to change the spark shell's name

The options in more detail:

spark-shell :
--master            where the Spark job runs (execution environment)
--deploy-mode       in yarn mode, choose client or cluster
--class             the main class (fully qualified name) inside the job jar
--name              the name of the Spark job
--jars              third-party dependency jars
--conf              Spark configuration parameters
extra yarn options:
--num-executors     how many executors to request
--executor-memory   how much memory per executor to request
--executor-cores    how many cores per executor to request
--queue             which yarn queue the job runs in

The interactive spark-shell command calls spark-submit under the hood.
spark-submit is the script developers mainly use to submit their own Spark jobs.

spark-shell:
spark-submit \
--class org.apache.spark.repl.Main \
--name "Spark shell" "$@"

spark-shell --master "local[2]":
spark-submit \
--class org.apache.spark.repl.Main \
--name "Spark shell" --master "local[2]"
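A hedged example of submitting a user-developed job to yarn with the resource options above; the jar path, main class, and queue name are placeholders, not from the original notes.

spark-submit \
--master yarn \
--deploy-mode cluster \
--name "my-spark-job" \
--class com.example.MySparkApp \
--num-executors 2 \
--executor-memory 2g \
--executor-cores 2 \
--queue default \
/home/hadoop/app/my-spark-job.jar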

Operators:

filter:

scala> test.collect
res15: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171...

scala> test.filter(_>999).collect
res16: Array[Int] = Array(1000)

mapPartitionsWithIndex: basically the same as mapPartitions, except that the partition index is passed in as well

val rdd = sc.makeRDD(List(1,2,3,4),numSlices = 2) // 2 partitions
scala> rdd.mapPartitionsWithIndex((index,iter)=> {if(index ==1) { iter } else { Nil.iterator}}).collect.foreach(println)
3
4
scala> rdd.mapPartitionsWithIndex((index,iter)=> {iter.map(num => {(index , num)})}).collect.foreach(println)
(0,1)
(0,2)
(1,3)
(1,4)
//---------------------------------------------------------mapPartitions
scala> test1.mapPartitions(_.map(_._2),true).collect.foreach(print)
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000

A common use case is inspecting which elements sit in each partition.

We normally do not need to care which partition each RDD element ends up in; partitioning is covered later.

mapValues: works only on key-value RDDs; it applies the function to each value and leaves the key untouched.

scala> test1.mapValues(_+3).collect.foreach(print)
(0,4)(0,5)(0,6)(0,7)(0,8)(0,9)(0,10)(0,11)(0,12)(0,13)(0,14)(0,15)(0,16)(0,17)(0,18)(0,19)(0,20)(0,21)(0,22)(0,23)(0,24)(0,25)(0,26)(0,27)(0,28)(0,29)(0,30)(0,31)(0,32)(0,33)(0,34)(0,35)(0,36)(0,37)(0,38)(0,39)(0,40)(0,41)(0,42)(0,43)(0,44)(0,45)(0,46)(0,47)(0,48)(0,49)(0,50)(0,51)(0,52)(0,53)(0,54)(0,55)(0,56)(0,57)(0,58)(0,59)(0,60)(0,61)(0,62)(0,63)(0,64)(0,65)(0,66)(0,67)(0,68)(0,69)(0,70)(0,71)(0,72)(0,73)(0,74)(0,75)(0,76)(0,77)(0,78)(0,79)(0,80)(0,81)(0,82)(0,83)(0,84)(0,85)(0,86)(0,87)(0,88)(0,89)(0,90)(0,91)(0,92)(0,93)(0,94)(0,95)(0,96)(0,97)(0,98)(0,99)(0,100)(0,101)(0,102)(0,103)(0,104)(0,105)(0,106)(0,107)(0,108)(0,109)(0,110)(0,111)(0,112)(0,113)(0,114)(0,115)(0,116)(0,117)(0,118)(0,119)(0,120)(0,121)(0,122)(0,123)(0,124)(0,125)(0,126)(0,127)(0,128)(0,129)(0,130)(0,131)(0,132)(0,133)(0,134)(0,135)(0,136)(0,137)(0,138)(0,139)(0,140)(0,141)(0,142)(0,143)(0,144)(0,145)(0,146)(0,147)(0,148)(0,149)(0,150)(0,151)(0,152)(0,153)(0,154)(0,155)(0,156)(0,157)(0,158)(0,159)(0,160)(0,161)(0,162)(0,163)(0,164)(0,165)(0,166)(0,167)(0,168)(0,169)(0,170)(0,171)(0,172)(0,173)(0,174)(0,175)(0,176)(0,177)(0,178)(0,179)(0,180)(0,181)(0,182)(0,183)(0,184)(0,185)(0,186)(0,187)(0,188)(0,189)(0,190)(0,191)(0,192)(0,193)(0,194)(0,195)(0,196)(0,197)(0,198)(0,199)(0,200)(0,201)(0,202)(0,203)(0,204)(0,205)(0,206)(0,207)(0,208)(0,209)(0,210)(0,211)(0,212)(0,213)(0,214)(0,215)(0,216)(0,217)(0,218)(0,219)(0,220)(0,221)(0,222)(0,223)(0,224)(0,225)(0,226)(0,227)(0,228)(0,229)(0,230)(0,231)(0,232)(0,233)(0,234)(0,235)(0,236)(0,237)(0,238)(0,239)(0,240)(0,241)(0,242)(0,243)(0,244)(0,245)(0,246)(0,247)(0,248)(0,249)(0,250)(0,251)(0,252)(0,253)(0,254)(0,255)(0,256)(0,257)(0,258)(0,259)(0,260)(0,261)(0,262)(0,263)(0,264)(0,265)(0,266)(0,267)(0,268)(0,269)(0,270)(0,271)(0,272)(0,273)(0,274)(0,275)(0,276)(0,277)(0,278)(0,279)(0,280)(0,281)(0,282)(0,283)(0,284)(0,285)(0,286)(0,287)(0,288)(0,289)(0,290)(0,291)(0,292)(0,293)(0,294)(0,295)(0,296)(0,297)(0,298)(0,299)(0,300)(0,301)(0,302)(0,303)(0,304)(0,305)(0,306)(0,307)(0,308)(0,309)(0,310)(0,311)(0,312)(0,313)(0,314)(0,315)(0,316)(0,317)(0,318)(0,319)(0,320)(0,321)(0,322)(0,323)(0,324)(0,325)(0,326)(0,327)(0,328)(0,329)(0,330)(0,331)(0,332)(0,333)(0,334)(0,335)(0,336)(0,337)(0,338)(0,339)(0,340)(0,341)(0,342)(0,343)(0,344)(0,345)(0,346)(0,347)(0,348)(0,349)(0,350)(0,351)(0,352)(0,353)(0,354)(0,355)(0,356)(0,357)(0,358)(0,359)(0,360)(0,361)(0,362)(0,363)(0,364)(0,365)(0,366)(0,367)(0,368)(0,369)(0,370)(0,371)(0,372)(0,373)(0,374)(0,375)(0,376)(0,377)(0,378)(0,379)(0,380)(0,381)(0,382)(0,383)(0,384)(0,385)(0,386)(0,387)(0,388)(0,389)(0,390)(0,391)(0,392)(0,393)(0,394)(0,395)(0,396)(0,397)(0,398)(0,399)(0,400)(0,401)(0,402)(0,403)(0,404)(0,405)(0,406)(0,407)(0,408)(0,409)(0,410)(0,411)(0,412)(0,413)(0,414)(0,415)(0,416)(0,417)(0,418)(0,419)(0,420)(0,421)(0,422)(0,423)(0,424)(0,425)(0,426)(0,427)(0,428)(0,429)(0,430)(0,431)(0,432)(0,433)(0,434)(0,435)(0,436)(0,437)(0,438)(0,439)(0,440)(0,441)(0,442)(0,443)(0,444)(0,445)(0,446)(0,447)(0,448)(0,449)(0,450)(0,451)(0,452)(0,453)(0,454)(0,455)(0,456)(0,457)(0,458)(0,459)(0,460)(0,461)(0,462)(0,463)(0,464)(0,465)(0,466)(0,467)(0,468)(0,469)(0,470)(0,471)(0,472)(0,473)(0,474)(0,475)(0,476)(0,477)(0,478)(0,479)(0,480)(0,481)(0,482)(0,483)(0,484)(0,485)(0,486)(0,487)(0,488)(0,489)(0,490)(0,491)(0,492)(0,493)(0,494)(0,495)(0,496)(0,497)(0,498)(0,499)(0,500)(0,501)(0,502)(0,503)(1,504)(1,505)(1,506)(1,507)(1,508)(1,509)(1,510)(1,511)(1,512)(1,513)(1,514)(1,515)(1,516)(1,517)(1,518)(1,519)(1,520)(1,521)(1,522)(1,523)(1,524)(1,525)(1,
526)(1,527)(1,528)(1,529)(1,530)(1,531)(1,532)(1,533)(1,534)(1,535)(1,536)(1,537)(1,538)(1,539)(1,540)(1,541)(1,542)(1,543)(1,544)(1,545)(1,546)(1,547)(1,548)(1,549)(1,550)(1,551)(1,552)(1,553)(1,554)(1,555)(1,556)(1,557)(1,558)(1,559)(1,560)(1,561)(1,562)(1,563)(1,564)(1,565)(1,566)(1,567)(1,568)(1,569)(1,570)(1,571)(1,572)(1,573)(1,574)(1,575)(1,576)(1,577)(1,578)(1,579)(1,580)(1,581)(1,582)(1,583)(1,584)(1,585)(1,586)(1,587)(1,588)(1,589)(1,590)(1,591)(1,592)(1,593)(1,594)(1,595)(1,596)(1,597)(1,598)(1,599)(1,600)(1,601)(1,602)(1,603)(1,604)(1,605)(1,606)(1,607)(1,608)(1,609)(1,610)(1,611)(1,612)(1,613)(1,614)(1,615)(1,616)(1,617)(1,618)(1,619)(1,620)(1,621)(1,622)(1,623)(1,624)(1,625)(1,626)(1,627)(1,628)(1,629)(1,630)(1,631)(1,632)(1,633)(1,634)(1,635)(1,636)(1,637)(1,638)(1,639)(1,640)(1,641)(1,642)(1,643)(1,644)(1,645)(1,646)(1,647)(1,648)(1,649)(1,650)(1,651)(1,652)(1,653)(1,654)(1,655)(1,656)(1,657)(1,658)(1,659)(1,660)(1,661)(1,662)(1,663)(1,664)(1,665)(1,666)(1,667)(1,668)(1,669)(1,670)(1,671)(1,672)(1,673)(1,674)(1,675)(1,676)(1,677)(1,678)(1,679)(1,680)(1,681)(1,682)(1,683)(1,684)(1,685)(1,686)(1,687)(1,688)(1,689)(1,690)(1,691)(1,692)(1,693)(1,694)(1,695)(1,696)(1,697)(1,698)(1,699)(1,700)(1,701)(1,702)(1,703)(1,704)(1,705)(1,706)(1,707)(1,708)(1,709)(1,710)(1,711)(1,712)(1,713)(1,714)(1,715)(1,716)(1,717)(1,718)(1,719)(1,720)(1,721)(1,722)(1,723)(1,724)(1,725)(1,726)(1,727)(1,728)(1,729)(1,730)(1,731)(1,732)(1,733)(1,734)(1,735)(1,736)(1,737)(1,738)(1,739)(1,740)(1,741)(1,742)(1,743)(1,744)(1,745)(1,746)(1,747)(1,748)(1,749)(1,750)(1,751)(1,752)(1,753)(1,754)(1,755)(1,756)(1,757)(1,758)(1,759)(1,760)(1,761)(1,762)(1,763)(1,764)(1,765)(1,766)(1,767)(1,768)(1,769)(1,770)(1,771)(1,772)(1,773)(1,774)(1,775)(1,776)(1,777)(1,778)(1,779)(1,780)(1,781)(1,782)(1,783)(1,784)(1,785)(1,786)(1,787)(1,788)(1,789)(1,790)(1,791)(1,792)(1,793)(1,794)(1,795)(1,796)(1,797)(1,798)(1,799)(1,800)(1,801)(1,802)(1,803)(1,804)(1,805)(1,806)(1,807)(1,808)(1,809)(1,810)(1,811)(1,812)(1,813)(1,814)(1,815)(1,816)(1,817)(1,818)(1,819)(1,820)(1,821)(1,822)(1,823)(1,824)(1,825)(1,826)(1,827)(1,828)(1,829)(1,830)(1,831)(1,832)(1,833)(1,834)(1,835)(1,836)(1,837)(1,838)(1,839)(1,840)(1,841)(1,842)(1,843)(1,844)(1,845)(1,846)(1,847)(1,848)(1,849)(1,850)(1,851)(1,852)(1,853)(1,854)(1,855)(1,856)(1,857)(1,858)(1,859)(1,860)(1,861)(1,862)(1,863)(1,864)(1,865)(1,866)(1,867)(1,868)(1,869)(1,870)(1,871)(1,872)(1,873)(1,874)(1,875)(1,876)(1,877)(1,878)(1,879)(1,880)(1,881)(1,882)(1,883)(1,884)(1,885)(1,886)(1,887)(1,888)(1,889)(1,890)(1,891)(1,892)(1,893)(1,894)(1,895)(1,896)(1,897)(1,898)(1,899)(1,900)(1,901)(1,902)(1,903)(1,904)(1,905)(1,906)(1,907)(1,908)(1,909)(1,910)(1,911)(1,912)(1,913)(1,914)(1,915)(1,916)(1,917)(1,918)(1,919)(1,920)(1,921)(1,922)(1,923)(1,924)(1,925)(1,926)(1,927)(1,928)(1,929)(1,930)(1,931)(1,932)(1,933)(1,934)(1,935)(1,936)(1,937)(1,938)(1,939)(1,940)(1,941)(1,942)(1,943)(1,944)(1,945)(1,946)(1,947)(1,948)(1,949)(1,950)(1,951)(1,952)(1,953)(1,954)(1,955)(1,956)(1,957)(1,958)(1,959)(1,960)(1,961)(1,962)(1,963)(1,964)(1,965)(1,966)(1,967)(1,968)(1,969)(1,970)(1,971)(1,972)(1,973)(1,974)(1,975)(1,976)(1,977)(1,978)(1,979)(1,980)(1,981)(1,982)(1,983)(1,984)(1,985)(1,986)(1,987)(1,988)(1,989)(1,990)(1,991)(1,992)(1,993)(1,994)(1,995)(1,996)(1,997)(1,998)(1,999)(1,1000)(1,1001)(1,1002)(1,1003)

flatMap: the same as flatMap on Scala collections.

scala> test.flatMap(x=>x.to(3)).collect
res98: Array[Int] = Array(1, 2, 3, 2, 3, 3)


Other operators

glom: turns the data of each partition into an array; often handier than mapPartitionsWithIndex for inspecting partitions.

scala> test.glom.collect
res62: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, ...

sample: random sampling.

scala> (test.sample(false,0.77)).collect.foreach(print)
236781112141516182022232425262728303132333435363738394041434446474850515254555657596061626364666768697172747576777879818285868789919293969798991001011021031041051061071081091101111121131161181191201211221241251271281301311321331341351361371381391401411441461471491501531541551561611621631641651661671681691701711721731741751761781791801811831841861871891901911931951961971982002012022042052072082092112122132142162172182202212222242252262272282302322332352362372382402412432442452472482502512522532542552562572582602612632642652662672682692702712732742752762772782812822832842892902922932942952962972982993003013023033043053063073083093113123133143153163173183193203213223233263293303313323333343353363373383393403413423443463473483493513523543573583593603613623633663673683693703713723743763773783793803813823833873903913923943954004014024034044054074084094114124134144154164174184194224234244264274284294304314324334344354364374384394404414424454464494504514524534544554564574584604614634644654664674684694714724744754764794804814824834844854864874884904914924934944954974984995005015025065075085105115125135145165175185195225235245255265275295305315325355375385395405415425445455465475485495515525545565575585595605625635645655675685715735755765795805815825835845855895905925935955965975985996016026036046066086096106116126136146156166176206216226236246256266276296316326336346356366376386396406416436446456466476486496506516536546556576586596606616626636646676696706716726736746756766776796806816836846856866876896906916926936946976986997007037047057067087097107117127157167177187197217227237267277297307327337347357367377387417427447457467477487497517527537547557567577587597607627637667677697707727737747757777787797807817827837847867887897907917937957967987998018028038048058068078098108128138148178198208228248258268278288308318328348358368378388408418428438458468478488498508528538548558588598608618628638648658668678688698708718758768778788798808828838848878888898908918958988999009019029039049059069079099129149159179189199209219229249259269279289299309319329339349359369379389399419429439449469489499509519549559569589599609619629639649659679689699709729749759769799809819829839859869879889899909919939949969979989991000

union: simple concatenation of two RDDs, without deduplication

intersection: the intersection of two RDDs

subtract: elements that appear in a but not in b

.collect: pulls the result back to the console (driver) as an array

distinct: deduplication, same effect as in SQL => under the hood it is implemented with reduceByKey

scala> val dd = sc.parallelize(List(1,2,2,3,4,5,6,7,8))
dd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> dd.collect.foreach(print)
122345678
scala> dd.collect.foreach(println)
1
2
2
3
4
5
6
7
8

scala> dd.distinct.collect.foreach(println)
4
6
8
2
1
3
7
5

Key-value operators: groupByKey => groups the values by key, just like the grouping step in wordcount => avoid it in real jobs, it is inefficient and inflexible

Pre-aggregation (map-side combine):

  • MR: input => map => combine (the tuning step) => reduce => output
  • combine => pre-aggregation: data is aggregated by the map output key before the shuffle

mapSideCombine = false means pre-aggregation is disabled.

groupByKey generally runs with pre-aggregation disabled, while reduceByKey enables it.

scala> test1.collect
res25: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (0,4), (0,5), (0,6), (0,7), (0,8), (0,9), (0,10), (0,11), (0,12), (0,13), (0,14), (0,15), (0,16), (0,17), (0,18), (0,19), (0,20), (0,21), (0,22), (0,23), (0,24), (0,25), (0,26), (0,27), (0,28), (0,29), (0,30), (0,31), (0,32), (0,33), (0,34), (0,35), (0,36), (0,37), (0,38), (0,39), (0,40), (0,41), (0,42), (0,43), (0,44), (0,45), (0,46), (0,47), (0,48), (0,49), (0,50), (0,51), (0,52), (0,53), (0,54), (0,55), (0,56), (0,57), (0,58), (0,59), (0,60), (0,61), (0,62), (0,63), (0,64), (0,65), (0,66), (0,67), (0,68), (0,69), (0,70), (0,71), (0,72), (0,73), (0,74), (0,75), (0,76), (0,77), (0,78), (0,79), (0,80), (0,81), (0,82), (0,83), (0,84), (0,85), (0,86), (0,87), (0,88), (0,89), (0,90), (0,91), (0,92), (0,93), (0,...
scala> test1.groupByKey.collect.foreach(print)
(0,CompactBuffer(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500))(1,CompactBuffer(501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597, 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623, 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 
726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753, 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 910, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 992, 993, 994, 995, 996, 997, 998, 999, 1000))

reducebykey:对比groupby是相当于可以统计之后进行计算的

scala> test1.reduceByKey((x,y)=>{x+y}).collect.foreach(print)
(0,125250)(1,375250)
--------------------------------------x+y means: once the shuffle has pulled a key's values together, they are added up
--------------------------------------implementing distinct with reduceByKey
scala> test1.reduceByKey((x,_)=>{x}).map(_._1).collect.foreach(println)
0
1

groupBy: grouping by a custom key

scala> test.groupBy(x=>{if(x%2==0){"2e"}else{"e2"}}).collect
res99: Array[(String, Iterable[Int])] = Array((e2,CompactBuffer(1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123, 125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 169, 171, 173, 175, 177, 179, 181, 183, 185, 187, 189, 191, 193, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 217, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 241, 243, 245, 247, 249, 251, 253, 255, 257, 259, 261, 263, 265, 267, 269, 271, 273, 275, 277, 279, 281, 283, 285, 287, 289, 291, 293, 295, 297, 299, 301, 303, 30...

sortByKey: sorts by key, but only within each partition; for a global sort the RDD must have a single partition. Pass false instead of true for descending order.


scala> val r2 = sc.parallelize(List(("zuan",18),("kaige",20),("zihang",21)),1)
r2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[69] at parallelize at <console>:24

scala> r2.sortByKey(true)
res89: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[70] at sortByKey at <console>:25

scala> res89.collect
res90: Array[(String, Int)] = Array((kaige,20), (zihang,21), (zuan,18))

Custom sorting: sortBy

scala> r2.sortBy(x=>x._2,true).collect
res92: Array[(String, Int)] = Array((zuan,18), (kaige,20), (zihang,21))


join: joins two pair RDDs by key by default => implemented on top of cogroup

scala> val r3 = sc.parallelize(List(("zuan","广西"),("kaige","中国"),("zihang","黑龙江")))
r3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[77] at parallelize at <console>:24

scala> r1.join(r3).collect
res93: Array[(String, (Int, String))] = Array((zuan,(18,广西)), (kaige,(20,中国)), (zihang,(21,黑龙江)))

-------------------------cogroup
scala> r1.cogroup(r3).collect
res94: Array[(String, (Iterable[Int], Iterable[String]))] = Array((zuan,(CompactBuffer(18),CompactBuffer(广西))), (kaige,(CompactBuffer(20),CompactBuffer(中国))), (zihang,(CompactBuffer(21),CompactBuffer(黑龙江))))



Both join by key.

cogroup returns the grouped values as collections in the value position,

while join returns the plain values.

Partitioning rule (hash partitioning):

partition 0: element 4, because 4 % 4 = 0
partition 1: element 9, because 9 % 4 = 1
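A small sketch of that rule using Spark's HashPartitioner (for an Int key the hash code is the value itself, so this is just key % numPartitions):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(4)   // 4 partitions
partitioner.getPartition(4)                // 4 % 4 = 0 -> partition 0
partitioner.getPartition(9)                // 9 % 4 = 1 -> partition 1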

Action operators: the operators that actually trigger a job.

collect => pulls the RDD's data back to the console, i.e. to the driver

foreach

foreachPartition: processes one whole partition at a time => the preferred way to write to MySQL, because it opens far fewer connections (see the JDBC sketch after the example below)

  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
-----------------------------------usage
scala> test.foreachPartition(ax=>ax.foreach(print))
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000
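As mentioned above, foreachPartition is the preferred way to write to MySQL. A hedged sketch, assuming a pair RDD wordCounts of type (String, Int); the table wordcount(word, cnt), the JDBC URL and the credentials are made-up placeholders:

import java.sql.DriverManager

wordCounts.foreachPartition(iter => {
  // one JDBC connection per partition instead of one per record
  val conn = DriverManager.getConnection("jdbc:mysql://bigdata2:3306/try", "root", "password")
  val ps = conn.prepareStatement("insert into wordcount(word, cnt) values (?, ?)")
  iter.foreach { case (word, cnt) =>
    ps.setString(1, word)
    ps.setInt(2, cnt)
    ps.executeUpdate()
  }
  ps.close()
  conn.close()
})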

reduce: like reduce in MapReduce; it is an action, so you cannot chain collect after it, the result is already a plain value on the driver.

scala> test.reduce((x,y)=>x+y)
res108: Int = 500500

first: returns the first element of the dataset; implemented on top of take.

scala> test.first()
res109: Int = 1
-------------------------take
scala> test.take(77)
res112: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77)


takeOrdered: returns the first n elements in ascending order.

scala> test.takeOrdered(2)
res127: Array[Int] = Array(1, 2)

scala> test.takeOrdered(55)
res128: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55)

top: returns the top n elements; implemented on top of takeOrdered => fine when the amount of data returned is small.

scala> test.top(1)
res113: Array[Int] = Array(1000)

scala> test.top(5)
res114: Array[Int] = Array(1000, 999, 998, 997, 996)

saveAsTextFile

saveAsSequenceFile

saveAsObjectFile

countByKey: counts the number of records per key.

scala> test1.countByKey
res125: scala.collection.Map[Int,Long] = Map(0 -> 500, 1 -> 500)

collectAsMap: collects a pair RDD to the driver as a Map, keeping one value per key (so it looks a bit like countByKey, but it does not count).

count: returns the number of elements in the RDD.
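A tiny sketch of the difference between the three, on a made-up pair RDD:

val kv = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3)))
kv.countByKey()      // Map(a -> 2, b -> 1): number of records per key
kv.collectAsMap()    // a Map with one value per key (for duplicate keys the last one seen wins)
kv.count()           // 3: total number of elements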

How to tell an action from an ordinary transformation => the action's implementation ends in a runJob call in the source

=> i.e. collect and the other action operators trigger a job when called; a paraphrased example follows.
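For reference, an action like count boils down to a call to sc.runJob; roughly (paraphrased, not copied verbatim, from Spark's RDD source):

// paraphrased; the real method lives in org.apache.spark.rdd.RDD
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum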

Case:

A table:
name price num
diar 300 1000
香奈儿 4000 2
螺蛳粉 200 98
30显卡 200 10
-----------------------------------------------sort by price [desc]; if prices are equal, sort by num (stock) [asc]

Solution:

Data types to carry a row: tuple [recommended], class, case class [recommended]
With a tuple:
-----------------------------------------------------------------
val value = sc.parallelize(List("diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toInt
val store=strings(2).toInt // make the stock numeric so the secondary sort is numeric, not lexicographic
(name,price,store)
})

etlData.sortBy(x => ( -x._2 , x._3)).saveAsTextFile("hdfs://bigdata3:9000/data")
----------------------------------------------------------class

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail, so keep it commented out

val value = sc.parallelize(List("diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
new skuu(name,price,store)
})
etlData.sortBy(x=>(-x.d,x.str1)).collect.foreach(print(_))

sc.stop()
}
class skuu(val str: String,val d: Double,val str1: Int) extends Serializable{
override def toString: String =str + "\t" + d + "\t" + str1
}
}
------------------------------------------------------------------------------------------------------case class
Why case classes are nicer => toString and hashCode are generated for you, serialization comes for free, and no new is needed to instantiate.
package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List("diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
skuu(name,price,store)
})
etlData.sortBy(x=>(-x.d,x.str1)).collect.foreach(print(_))

sc.stop()
}
case class skuu(val str: String,val d: Double,val str1: Int)
}


When you call saveAsTextFile("hdfs://bigdata3:9000/data/test")

it generates one output file per partition.

Requirement: do the comparison inside the class itself, by extending Ordered:

package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List(
"diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
skuu(name,price,store)
})
etlData.sortBy(x => x).collect.foreach(print(_)) // sorts with skuu's compare, via the implicit Ordering available for Ordered types

sc.stop()
}
case class skuu(val str: String,val d: Double,val str1: Int) extends Ordered[skuu]{
override def compare(that: skuu): Int = {
if (this.d == that.d){
this.str1 - that.str1 // same price: ascending by stock
}else {
that.d.compare(this.d) // descending by price; (this.d - that.d).toInt would lose differences smaller than 1
}
}
}
}



Implicit conversion

package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {
def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List(
"diar 300 1000",
"香奈儿 4000 2",
"螺蛳粉 200 98",
"30显卡 200 10"),1)

val etlData = value.map(x=>{
val strings=x.split(" ")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
skuu(name,price,store)
})
etlData.sortBy(x=>(-x.d,x.str1)).collect.foreach(print(_))

sc.stop()

implicit def skutooreder(sku:skuu):Ordered[skuu]={ // "implicit", not "implicitly": with this in scope, sortBy(x => x) would also work without skuu extending Ordered
new Ordered[skuu]{
override def compare(that: skuu): Int = {
if (sku.d == that.d){
sku.str1-that.str1
}else {
-(sku.d - that.d).toInt
}
}
}

}


}

case class skuu(val str: String,val d: Double,val str1: Int)
}



Example:

Data:
word show click
a,2,3
b,1,1
c,4,5
f,5,6
g,7,8
k,8,9
a,1,2
a,1,1
a,4,5
b,5,6
------------------------------------------------------------------------------
package sparkfirst

import org.apache.spark.{SparkConf, SparkContext}
import sparkfirst.ContextUtils
object sparktest {

def sub(name: String, tuple: (Double, Int))={
(name , tuple)
}

def main(args: Array[String]): Unit = {
val sc:SparkContext = ContextUtils.getSparkContext("test")
// val bb = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]")) // a second SparkContext would fail

val value = sc.parallelize(List(
"a,2,3",
"b,1,1",
"c,4,5",
"f,5,6",
"g,7,8",
"k,8,9",
"a,1,2",
"a,1,1",
"a,4,5",
"b,5,6"),1)

val etlData = value.map(x=>{
val strings=x.split(",")
val name=strings(0)
val price=strings(1).toDouble
val store=strings(2).toInt
sub(name,(price,store))
// (name,(price,store))
})

etlData.reduceByKey((x,y)=>{
(x._1+y._1,x._2+y._2)
}).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2).collect.foreach(println) // without an action after the map, the job never runs
// etlData.reduceByKey((x,y)=>{(x._1+y._1,x._2+y._2)}).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2+"\t")

sc.stop()
}
}

Theory:

Spark architecture:

  • Application => a Spark job => consists of two kinds of processes: a driver program and executors on the cluster
    • driver: runs the SparkContext
    • executor: runs tasks and keeps data in memory and on disk
  • web UI
  • SparkContext
  • application jar => the jar built from your code => contains the Spark job => contains the main method => deployed to a server once development is done
  • driver program => runs the main method in the jar => creates the SparkContext
  • Cluster manager => manages the cluster => acquires resources for the job
  • Deploy mode => when the job is submitted to YARN
    • cluster mode: the driver runs inside the cluster => on a machine managed by YARN
    • client mode: the driver runs outside the cluster
  • Worker node => a worker => runs the job's code == node manager
  • executor => roughly YARN's container => every Spark application has its own executors
  • task => partition => rdd
  • job => a Spark job => a job inside an application => one application may contain several jobs
  • stage => a sub-unit of a job; stages depend on each other
  • one application contains 1-n jobs, one job contains 1-n stages
  • one stage contains 1-n tasks
  • tasks correspond one-to-one to the RDD's partitions

How a Spark application runs

The SparkContext connects to the cluster manager.

The cluster manager allocates resources to the job.

Once Spark is connected to the cluster,

it starts executors => for storage and computation.

The SparkContext ships the code to the executors and sends them tasks to run.

Every application has its own executors.

An executor is roughly a container => resource isolation => scheduling isolation.

Data produced by different applications cannot be shared directly, but it can be once it is written to external storage.

spark-shell, simply put => submits many jobs => and relies on that external storage idea.

The user connects to the cluster without noticing it.

The driver monitors the executors' life cycle => there is communication between them.

As long as the driver can reach the cluster it works => which means an external driver works too.

Recommendation: keep the driver close to the worker nodes to cut network transfer time, i.e. run it local to the cluster.

Integrating Spark with YARN

In Spark's conf directory, first run cp spark-env.sh.template spark-env.sh

then vim spark-env.sh and add

HADOOP_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop and YARN_CONF_DIR=/home/hadoop/app/hadoop/etc/hadoop

After adding these, restart Spark and run spark-shell --master yarn

and it works.

With no further configuration this typically grabs about 5G of memory.

Case:

user behaviour analysis with Spark

    val value = sc.parallelize(List(
"u01,英雄联盟|绝活&职业|云顶|奴神,1,1",
"u01,英雄联盟|绝活&职业|云顶|金潺潺,1,1",
"u01,英雄联盟|绝活&职业|云顶|带粉上车,1,0",
"u02,星秀|好声音|女团|三年一班,1,1",
"u02,星秀|好声音|女团|奴神,1,1",
"u02,星秀|好声音|女团|将神,1,0",
"u02,星秀|好声音|女团|西索,1,1"),1)

val etlData = value.flatMap(x=>{
val strings=x.split(",")
val name=strings(0)
val type_log_total=strings(1)
val show=strings(2).toInt
val click=strings(3).toInt
val type_log_total_ni=type_log_total.split("\\|")
type_log_total_ni.map(x=>{
((name,x),(show,click))
})
//sub(name,(price,store))
//(name,(price,store))
})

// etlData.reduceByKey((x,y)=>{
// (x._1+y._1,x._2+y._2)
// }).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2+"\t")
//etlData.reduceByKey((x,y)=>{(x._1+y._1,x._2+y._2)}).map(x=>x._1+"\t"+x._2._1+"\t"+x._2._2+"\t")

etlData.reduceByKey((x,y)=>{
(x._1+y._1,x._2+y._2)
}).collect().foreach(println)

Spark persistence:

  • RDD persistence => a pipeline may build many RDDs, say 100; to save time and recomputation we can persist, for example, RDD 99 and keep working from it
  • persisted data is fault tolerant (lost partitions are recomputed)
  • by default it is kept in memory
  • it works per partition of the RDD
  • persistence is lazy: the data is cached the next time it is actually computed, and later operations load it from the cache
  • the default storage level is memory only
  • both persist and cache work => they are lazy
  • when memory is not enough the data cannot all be cached; serialization can be configured to reduce the footprint
  • in general you trade memory against CPU => a 4-step choice:
    • official default => MEMORY_ONLY
    • MEMORY_ONLY_SER => saves space at the cost of CPU
    • spill to disk => MEMORY_AND_DISK / MEMORY_AND_DISK_SER => slower than staying purely in memory
    • fault tolerance => replicated levels are safer, but large data puts pressure on the disks
  • removing persisted data: eviction is automatic (LRU), or call unpersist(true) on the RDD, which takes effect immediately (see the sketch after this list)
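A minimal sketch of the storage levels and unpersist mentioned in the list above (the path and the level choice are just examples):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://bigdata3:9000/3.log")
logs.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory: slower, but smaller
logs.count()                                 // the first action materializes the cache
logs.count()                                 // this one reads from the cache
logs.unpersist(true)                         // blocking = true: freed immediately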

Starting spark-shell is itself a Spark application.

scala> val test = sc.parallelize("hdfs://bigdata3:9000/flume/events/2022-12-13/events.1670898548750.log")
test: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> test.collect
collect collectAsync

scala> test.collect
res0: Array[Char] = Array(h, d, f, s, :, /, /, b, i, g, d, a, t, a, 3, :, 9, 0, 0, 0, /, f, l, u, m, e, /, e, v, e, n, t, s, /, 2, 0, 2, 2, -, 1, 2, -, 1, 3, /, e, v, e, n, t, s, ., 1, 6, 7, 0, 8, 9, 8, 5, 4, 8, 7, 5, 0, ., l, o, g)

scala> test.persist
res1: test.type = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> test.ca
cache cartesian

scala> test.cache
res2: test.type = ParallelCollectionRDD[0] at parallelize at <console>:23
---------------------------------------------------------------------------------------Java serialization (the default)
val names = Array[String]("刘子航","李信","花木兰","达摩","耀","貂蝉","吕布")
val gar = Array[String]("男","女")
val addres= Array[String]("山东","广西","大连")


val value1 = sc.parallelize(1 to 300000)

val value2 = new ArrayBuffer[persion]()
val value3 = value1.map(x => {
val name = names(Random.nextInt(names.length)) // use the full length; nextInt(6) would never pick the last name
val s = gar(Random.nextInt(gar.length))
val s1 = addres(Random.nextInt(addres.length))
value2 += (persion(name, s, s1))
})

value3.persist(StorageLevel.MEMORY_ONLY_SER)
value3.count()


case class persion(name: String,gre:String,add:String){

}
-------------------------------------------------------Kryo serialization
Faster than Java serialization, but it does not cover every type, and the classes you serialize should be registered before use.

Enable it on the SparkConf first:
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
then register the case classes / classes you use:
conf.registerKryoClasses(Array(classOf[Info]))
everything else stays the same as above (see the SparkConf sketch after this block).

val names = Array[String]("刘子航","李信","花木兰","达摩","耀","貂蝉","吕布")
val gar = Array[String]("男","女")
val addres= Array[String]("山东","广西","大连")


val value1 = sc.parallelize(1 to 300000)

val value2 = new ArrayBuffer[persion]()
val value3 = value1.map(x => {
val name = names(Random.nextInt(names.length)) // full length, as above
val s = gar(Random.nextInt(gar.length))
val s1 = addres(Random.nextInt(addres.length))
value2 += (persion(name, s, s1))
})

value3.persist(StorageLevel.MEMORY_ONLY_SER)
value3.count()


case class persion(name: String,gre:String,add:String){

}
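A minimal sketch of the Kryo setup described above, wired into the SparkConf (the app name and master are placeholders; persion is the case class from the example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .setMaster("local[2]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[persion]))   // register every class you serialize

val sc = new SparkContext(conf)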

Lineage and dependencies

Lineage => the chain of transformations from one RDD to the next.

Dependencies =>

  • wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD => causes a shuffle => a new stage is created; each shuffle splits the job into two stages => anything that shuffles creates a stage boundary
  • narrow dependency: each partition of the parent RDD is used by at most one child partition => stays within a single stage => no shuffle

Additional operators:

repartition: repartition(num) redistributes the data => always shuffles => under the hood it calls coalesce(num, shuffle = true) => can either increase or decrease the number of partitions

coalesce: usually used to reduce the number of partitions; coalesce(num) is a narrow dependency => no shuffle by default => it cannot increase the partition count unless you pass shuffle = true (see the sketch below)

In production these are used to tune the parallelism of a computation.
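A quick sketch of the difference (partition counts only, no data shown):

val rdd = sc.parallelize(1 to 1000, 4)
rdd.getNumPartitions                               // 4
rdd.repartition(8).getNumPartitions                // 8: always shuffles, up or down
rdd.coalesce(2).getNumPartitions                   // 2: narrow dependency, no shuffle
rdd.coalesce(8).getNumPartitions                   // still 4: cannot grow without a shuffle
rdd.coalesce(8, shuffle = true).getNumPartitions   // 8: behaves like repartition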

Where does code run?

Ask whether the code operates on the RDD's elements: if it does (it sits inside the function passed to an operator), it runs on the executors; otherwise it runs on the driver.
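The classic illustration: the body of the function passed to an operator runs on the executors against each element, so mutating a driver-side variable there does not behave as expected (this is exactly what accumulators, below, are for):

var sum = 0                                // lives in the driver
sc.parallelize(1 to 100).foreach(x => {
  sum += x                                 // runs on executors, on serialized copies of sum
})
println(sum)                               // still 0 on the driver (in cluster mode)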

Example: top 2


val value = sc.parallelize(List(
"www.bvaidu,u01,20",
"www.githuba,u02,2",
"www.bvaidu,u02,100",
"www.bibi,u02,199",
"www.githuba,u01,100",
"www.githuba,u01,1",
"www.githuba,u01,10",
"www.bibi,u02,19",
"www.bibi,u01,199",
"www.baidu.com,uid01,1",
"www.baidu.com,uid01,10",
"www.baidu.com,uid02,3",
"www.baidu.com,uid02,5",
"www.github.com,uid01,11",
"www.github.com,uid01,10",
"www.github.com,uid02,30",
"www.github.com,uid02,50",
"www.bibili.com,uid01,110",
"www.bibili.com,uid01,10",
"www.bibili.com,uid02,2",
"www.bibili.com,uid02,3"),1)

val etlData = value.map(x=>{
val strings=x.split(",")
val yuming=strings(0)
val user=strings(1)
val cishu=strings(2).toInt
((yuming,user),(cishu))
//sub(name,(price,store))
//(name,(price,store))
})
etlData.reduceByKey((x,y)=>{
x+y
}).sortBy(x=> -x._2,true).map(x=>{
(x._1._2,(x._1._1,x._2))
}).groupByKey().map(x=>{
x._2.map(s=>{
(x._1,s._1,s._2)
}).take(2)}).saveAsTextFile("hdfs://bigdata3:9000/input/10.txt")
------------------------------------------------------------------------------simplified version
val value = sc.parallelize(List(
"www.bvaidu,u01,20",
"www.githuba,u02,2",
"www.bvaidu,u02,100",
"www.bibi,u02,199",
"www.githuba,u01,100",
"www.githuba,u01,1",
"www.githuba,u01,10",
"www.bibi,u02,19",
"www.bibi,u01,199",
"www.baidu.com,uid01,1",
"www.baidu.com,uid01,10",
"www.baidu.com,uid02,3",
"www.baidu.com,uid02,5",
"www.github.com,uid01,11",
"www.github.com,uid01,10",
"www.github.com,uid02,30",
"www.github.com,uid02,50",
"www.bibili.com,uid01,110",
"www.bibili.com,uid01,10",
"www.bibili.com,uid02,2",
"www.bibili.com,uid02,3"),1)

val etlData = value.map(x=>{
val strings=x.split(",")
val yuming=strings(0)
val user=strings(1)
val cishu=strings(2).toInt
((yuming,user),(cishu))
//sub(name,(price,store))
//(name,(price,store))
})
val value4 = etlData.map(x => {
(x._1._2)
}).distinct().collect()


for (elem <- value4){
etlData.filter(_._1._2 == elem).reduceByKey(_ + _).sortBy( -_._2).take(2).foreach(println(_))
}

Accumulators and broadcast variables => to be covered in more detail later; a quick sketch follows.
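A minimal sketch of both, using the standard SparkContext API (paths and data are placeholders):

// accumulator: executors add to it, only the driver reads the final value
val errorCount = sc.longAccumulator("errorCount")
sc.textFile("hdfs://bigdata3:9000/3.log").foreach(line => {
  if (line.contains("ERROR")) errorCount.add(1)
})
println(errorCount.value)

// broadcast variable: ship a read-only lookup table to every executor once
val area = sc.broadcast(Map("u01" -> "广西", "u02" -> "黑龙江"))
sc.parallelize(List("u01", "u02")).map(uid => (uid, area.value.getOrElse(uid, "unknown"))).collect()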

Case: wordcount

val wc = sc.textFile("hdfs://bigdata3:9000/3.log")

wc.flatMap(x=>{
x.split(",")
}).map(x=>{
(x,1)
}).reduceByKey((x,y)=>{
x+y
}).saveAsTextFile("hdfs://bigdata3:9000/input/11.txt")

Deploying a Spark job

  • jar
  • spark-submit
spark-submit \
--class <fully qualified class name> \
--master <master / deploy mode> \
--name <job name> \
<path to the jar on the machine> \
<arguments to pass>
-------------------------------------example
spark-submit \
--class tool.jdbc.readjdbc \
/home/hadoop/project/jar/bigdatajava-1.0-SNAPSHOT.jar \
"jdbc:mysql://bigdata2:3306/try" "root" "liuzihan010616" "emp"
-----------------------------------------no --master here because it is already set in the code

You can also pass values through Spark's own configuration instead of args:

pass them at submit time with --conf.

The code looks like this:

package com.dl2262.sparkcore.day02

import com.dl2262.sparkcore.util.{ContextUtils, FileUtils}
import org.apache.spark.SparkContext
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD

/**
*
* @author sxwang
* 01 05 8:28
*/
object WCApp extends Logging{

def main(args: Array[String]): Unit = {

// if(args.size != 2){
// logError("请正确输入2个参数:<input> <output>")
// System.exit(0)
// }
// val in = args(0)
// val out = args(1)


val sc: SparkContext = ContextUtils.getSparkContext(this.getClass.getSimpleName)

val in = sc.getConf.get("spark.input.path","hdfs://bigdata32:9000/input/")
val out = sc.getConf.get("spark.output.path","hdfs://bigdata32:9000/output/")


val input = sc.textFile(in)

FileUtils.deletePath( sc.hadoopConfiguration,out)

input.flatMap(line => {
line.split(",")
}).map(word => (word,1))
.reduceByKey(_+_)
.saveAsTextFile(out)

sc.stop()
}





}

Example:

The data looks like this:
domain, user, user location, show count, click count
For each of domain, user and location, compute its top 2.
-----------------------------------------------------------------------------------generate the data
val domin = Array[String]("www.baidu.com","www.taobao.com","www.github.com","www.bilbil.com","www.csdn.com","www.zihang.com")
val userList = Array[String]("zihang","zuan","zihao","shuangxi","yuhang")
val beianlocal = Array[String]("广州","江西","太原","新疆","上海")
var stringe:List[String] = Nil
for(i <- 1 to(30000)){
val Randomdomin = domin(Random.nextInt(domin.length)) // full length; length-1 would never pick the last entry
val RandomUserList = userList(Random.nextInt(userList.length))
val Randombeianlocal = beianlocal(Random.nextInt(beianlocal.length))
val tmp = List(Randomdomin + "," + RandomUserList + "," + Randombeianlocal )
stringe = stringe++tmp
}
-------------------------------------------------------------------------------by domain => the other dimensions work the same way and can be added later
val basicdata = sc.parallelize(stringe)
-------------------------------------------------------------------------------parse the data
val ETLDATA = basicdata.map(x=>{
val strings = x.split(",")
val currentdomin = strings(0)
val currentuser = strings(1)
val currentadd = strings(2)
val click = Random.nextInt(100).toInt
val show = Random.nextInt(200).toInt
((currentdomin,currentuser,currentadd),(click,show))
})
--------------------------------------------------------------------------------aggregate and sort
for (elem <- domin){
ETLDATA.filter(_._1._1==elem).reduceByKey((x,y)=>{
(x._1+y._1,x._2+y._2)
}).sortBy( -_._2._1).take(2).foreach(println(_))
}
---------------------------------------------------------------------------------method 2
ETLDATA.reduceByKey((x,y)=>{
(x._1+y._1,y._2+x._2)
}).sortBy(x=> -x._2._1).map(x=>{
(x._1._1,(x._1._2,x._1._3,x._2))
}).groupByKey().map(x=>{
x._2.map(s=>{
(x._1,(s._1,s._2,s._3))
}).take(2)
}).collect