spark mllib (2)


Spark Machine Learning Notes (2)

Basic Statistics - RDD-based API

Summary statistics

colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

That is, colStats() returns a MultivariateStatisticalSummary instance, and you then call its methods to read off each statistic.

SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaRDD<Vector> mat = jsc.parallelize(
    Arrays.asList(
            Vectors.dense(1.0, 10.0, 100.0),
            Vectors.dense(2.0, 20.0, 200.0),
            Vectors.dense(3.0, 30.0, 300.0)
    )
); // an RDD of Vectors -- colStats() expects an RDD<Vector>

// compute column summary statistics
MultivariateStatisticalSummary summary = Statistics.colStats(mat.rdd());
System.out.println(summary.mean());  // a dense vector containing the mean value for each column
System.out.println(summary.variance());  // column-wise variance
System.out.println(summary.numNonzeros());  // number of nonzeros in each column
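To make the numbers concrete, here is a plain-Java hand computation (no Spark required) of the column-wise mean and variance for the same 3×3 matrix. It assumes, as MLlib's summarizer does, the unbiased sample variance (dividing by n − 1); the class and method names are just illustrative.

```java
import java.util.Arrays;

public class ColStatsSketch {
    // column-wise means of a row-major matrix
    static double[] colMeans(double[][] rows) {
        int n = rows.length, d = rows[0].length;
        double[] means = new double[d];
        for (double[] row : rows)
            for (int j = 0; j < d; j++) means[j] += row[j];
        for (int j = 0; j < d; j++) means[j] /= n;
        return means;
    }

    // column-wise unbiased sample variance (divides by n - 1, like MLlib)
    static double[] colVariances(double[][] rows) {
        int n = rows.length, d = rows[0].length;
        double[] means = colMeans(rows), vars = new double[d];
        for (double[] row : rows)
            for (int j = 0; j < d; j++) {
                double diff = row[j] - means[j];
                vars[j] += diff * diff / (n - 1);
            }
        return vars;
    }

    public static void main(String[] args) {
        double[][] mat = {{1.0, 10.0, 100.0}, {2.0, 20.0, 200.0}, {3.0, 30.0, 300.0}};
        System.out.println(Arrays.toString(colMeans(mat)));      // [2.0, 20.0, 200.0]
        System.out.println(Arrays.toString(colVariances(mat)));  // [1.0, 100.0, 10000.0]
    }
}
```

These match the vectors that summary.mean() and summary.variance() print for the RDD above.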

Correlations

Calculating the correlation between two series of data is a common operation in Statistics. In spark.mllib we provide the flexibility to calculate pairwise correlations among many series. The supported correlation methods are currently Pearson’s and Spearman’s correlation.

That is, computing the correlation between two data series, or the pairwise correlations among many series.

JavaDoubleRDD seriesX = jsc.parallelizeDoubles(
        Arrays.asList(1.0, 2.0, 3.0, 3.0, 5.0));  // a series of doubles as an RDD

// must have the same number of partitions and cardinality as seriesX
JavaDoubleRDD seriesY = jsc.parallelizeDoubles(
        Arrays.asList(11.0, 22.0, 33.0, 33.0, 555.0));

// compute the correlation using Pearson's method; pass "spearman" for Spearman's method
// if the method is omitted, Pearson's method is used by default
Double correlation = Statistics.corr(seriesX.srdd(), seriesY.srdd(), "pearson");
System.out.println("Correlation is: " + correlation);

JavaRDD<Vector> data = jsc.parallelize(
        Arrays.asList(
                Vectors.dense(1.0, 10.0, 100.0),
                Vectors.dense(2.0, 20.0, 200.0),
                Vectors.dense(5.0, 33.0, 366.0)
        )
);

// calculate the correlation matrix using Pearson's method
Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson");
System.out.println(correlMatrix.toString());

In statistics, Spearman's rank correlation coefficient, named after Charles Spearman and commonly denoted by the Greek letter ρ, is a non-parametric measure of the dependence between two variables. It assesses how well the relationship between two variables can be described by a monotonic function. When there are no repeated data values, the Spearman coefficient is exactly +1 or −1 if the two variables are perfectly monotonically related.

The Pearson correlation coefficient measures how close two data sets lie to a straight line; it quantifies the linear relationship between interval-scaled variables.
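The difference between the two coefficients can be seen in a plain-Java sketch (no Spark; the helper names are hypothetical, and the ranking assumes no tied values for brevity): Spearman is simply Pearson applied to ranks, so it reaches 1.0 for any strictly monotonic relationship, even a non-linear one.

```java
public class CorrSketch {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    // 1-based rank of each element; assumes all values are distinct
    static double[] ranks(double[] x) {
        double[] r = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            for (double v : x) if (v < x[i]) r[i]++;  // count smaller elements
            r[i] += 1;                                // shift to 1-based ranks
        }
        return r;
    }

    // Spearman = Pearson correlation of the rank vectors
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {1, 4, 9, 16, 25};     // y = x^2: monotonic but not linear
        System.out.println(pearson(x, y));   // ~0.981, less than 1: not perfectly linear
        System.out.println(spearman(x, y));  // ~1.0: perfectly monotonic
    }
}
```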

Stratified sampling

The sampleByKey method will flip a coin to decide whether an observation will be sampled or not, therefore requires one pass over the data, and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey.

In other words, sampleByKey samples only approximately (each element is kept or dropped by a coin flip, so only the expected sample size is controlled), while sampleByKeyExact guarantees exactly ⌈fₖ·nₖ⌉ elements for every key k (where fₖ is the desired fraction and nₖ the number of elements for that key), at the cost of additional passes over the data.

List<Tuple2<Integer, Character>> list = Arrays.asList(
        new Tuple2<>(1, 'a'),
        new Tuple2<>(1, 'b'),
        new Tuple2<>(2, 'c'),
        new Tuple2<>(2, 'd'),
        new Tuple2<>(2, 'e'),
        new Tuple2<>(3, 'f')
);
JavaPairRDD<Integer, Character> data = jsc.parallelizePairs(list);
// specify the exact fraction desired from each key as a Map<K, Double>
// note: 0.1, 0.6 and 0.3 are the sampling fractions for keys 1, 2 and 3
ImmutableMap<Integer, Double> fractions = ImmutableMap.of(1, 0.1, 2, 0.6, 3, 0.3);
// Get an approximate sample from each stratum
JavaPairRDD<Integer, Character> approxSample = data.sampleByKey(false, fractions);
// Get an exact sample from each stratum
JavaPairRDD<Integer, Character> exactSample = data.sampleByKeyExact(false, fractions);
approxSample.foreach((integerCharacterTuple2)->{
    System.out.println(integerCharacterTuple2._1+"   "+integerCharacterTuple2._2);
});
exactSample.foreach((integerCharacterTuple2)->{
    System.out.println(integerCharacterTuple2._1+"   "+integerCharacterTuple2._2);
});
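The per-element coin flip that sampleByKey performs can be sketched in plain Java (no Spark; class and method names are hypothetical). Each element is kept with the probability configured for its key, which is why the resulting sample sizes are only correct in expectation, and why sampleByKeyExact exists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class SampleByKeySketch {
    // keep each (key, value) pair with probability fractions.get(key)
    static <K, V> List<Map.Entry<K, V>> sampleByKey(
            List<Map.Entry<K, V>> data, Map<K, Double> fractions, long seed) {
        Random rng = new Random(seed);
        List<Map.Entry<K, V>> sample = new ArrayList<>();
        for (Map.Entry<K, V> e : data)
            if (rng.nextDouble() < fractions.get(e.getKey()))  // coin flip per element
                sample.add(e);
        return sample;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Character>> data = List.of(
                Map.entry(1, 'a'), Map.entry(1, 'b'), Map.entry(2, 'c'),
                Map.entry(2, 'd'), Map.entry(2, 'e'), Map.entry(3, 'f'));
        Map<Integer, Double> fractions = Map.of(1, 0.1, 2, 0.6, 3, 0.3);
        System.out.println(sampleByKey(data, fractions, 42L));
    }
}
```

Run it several times with different seeds and the sample size for each key fluctuates around fₖ·nₖ, matching the "expected sample size" behaviour described above.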

Hypothesis testing

spark.mllib supports Pearson's chi-squared tests, both for goodness of fit and for independence.

// a vector composed of the frequencies of events
Vector vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25);

// compute the goodness of fit. If a second vector to test against is not supplied
// as a parameter, the test runs against a uniform distribution.
ChiSqTestResult goodnessOfFitTestResult = Statistics.chiSqTest(vec);
// summary of the test including the p-value, degrees of freedom, test statistic,
// the method used, and the null hypothesis.
System.out.println(goodnessOfFitTestResult + "\n");

// Create a contingency matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
Matrix mat = Matrices.dense(3, 2, new double[]{1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// conduct Pearson's independence test on the input contingency matrix
ChiSqTestResult independenceTestResult = Statistics.chiSqTest(mat);
// summary of the test including the p-value, degrees of freedom...
System.out.println(independenceTestResult + "\n");

// an RDD of labeled points
JavaRDD<LabeledPoint> obs = jsc.parallelize(
        Arrays.asList(
                new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
                new LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
                new LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
        )
);

// The contingency table is constructed from the raw (label, feature) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
// against the label.
ChiSqTestResult[] featureTestResults = Statistics.chiSqTest(obs.rdd());
int i = 1;
for (ChiSqTestResult result : featureTestResults) {
    System.out.println("Column " + i + ":");
    System.out.println(result + "\n");  // summary of the test
    i++;
}
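To see what the goodness-of-fit result above actually contains, the Pearson chi-squared statistic can be computed by hand in plain Java (no Spark; the class name is illustrative). With no expected vector supplied, the test compares the observed frequencies against a uniform distribution over the same total mass, and the degrees of freedom are the number of categories minus one.

```java
public class ChiSqSketch {
    // Pearson chi-squared statistic against a uniform expected distribution:
    // sum over categories of (observed - expected)^2 / expected
    static double chiSqUniform(double[] observed) {
        double total = 0;
        for (double o : observed) total += o;
        double expected = total / observed.length;  // uniform expectation per category
        double stat = 0;
        for (double o : observed)
            stat += (o - expected) * (o - expected) / expected;
        return stat;
    }

    public static void main(String[] args) {
        double[] vec = {0.1, 0.15, 0.2, 0.3, 0.25};
        System.out.println("statistic = " + chiSqUniform(vec));          // ~0.125
        System.out.println("degrees of freedom = " + (vec.length - 1));  // 4
    }
}
```

For the vector (0.1, 0.15, 0.2, 0.3, 0.25) the expected value per category is 0.2, so the statistic is (0.01 + 0.0025 + 0 + 0.01 + 0.0025) / 0.2 = 0.125, with 4 degrees of freedom, which is what the Spark summary reports.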

Random data generation

RandomRDDs provides factory methods to generate random RDDs drawn from the uniform, standard normal, and Poisson distributions.

// The factory methods below are static imports from org.apache.spark.mllib.random.RandomRDDs.
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);         // standard normal distribution
JavaDoubleRDD p = poissonJavaRDD(jsc, 1.0, 1000000L, 10);   // Poisson distribution with mean 1.0
JavaDoubleRDD uniform = uniformJavaRDD(jsc, 1000000L, 10);  // uniform distribution on [0, 1]
// Apply a transform to get a random double RDD following `N(1, 4)`:
// shift by the mean (1.0) and scale by the standard deviation (2.0)
JavaDoubleRDD v = u.mapToDouble(x -> 1.0 + 2.0 * x);
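The affine transform works because for z ~ N(0, 1), the value 1 + 2z has mean 1 and variance 2² = 4. A quick plain-Java check of the sample moments (no Spark; java.util.Random stands in for normalJavaRDD, and the helper name is hypothetical):

```java
import java.util.Random;

public class NormalTransformSketch {
    // draw n samples of 1 + 2z with z ~ N(0, 1); return {sample mean, sample variance}
    static double[] sampleMeanVar(long seed, int n) {
        Random rng = new Random(seed);
        double sum = 0, sumSq = 0;
        for (int i = 0; i < n; i++) {
            double v = 1.0 + 2.0 * rng.nextGaussian();  // same transform as the mapToDouble above
            sum += v;
            sumSq += v * v;
        }
        double mean = sum / n;
        double variance = sumSq / n - mean * mean;
        return new double[]{mean, variance};
    }

    public static void main(String[] args) {
        double[] mv = sampleMeanVar(7L, 1_000_000);
        System.out.printf("mean ~ %.3f, variance ~ %.3f%n", mv[0], mv[1]);  // close to 1 and 4
    }
}
```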