File Encoding Detection: Principles, a Java Implementation, and Constructing a File That Defeats the Detector

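Table of Contents

  • Constructing a File That Defeats the Detector
  • How File Encoding Detection Works
  • A Java Implementation of the Detector
  • Testing

Constructing a File That Defeats the Detector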

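When we open a text file in VS Code, it uses UTF-8 by default, so a file that is not UTF-8 encoded shows up as garbled text.

Editors such as Notepad--, however, seem to always open text files with the correct encoding.

Why?

Can editors like Notepad-- really always pick the right encoding and never show garbled text?

The answer is no. If you don't believe it, generate a file with the code below and check whether a Notepad--style editor opens it correctly.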

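// Needs java.io.BufferedWriter, java.io.IOException, java.nio.charset.Charset,
// java.nio.file.Files, java.nio.file.Path, java.nio.file.Paths and a JUnit @Test annotation.
@Test
public void write() throws IOException {
    Path path = Paths.get("F:\\tmp\\gb2312.txt");
    // try-with-resources closes the writer even if a write fails
    try (BufferedWriter bw = Files.newBufferedWriter(path, Charset.forName("GB2312"))) {
        for (int i = 1; i <= 100000; i++) {
            bw.write("瑜多爱");
            // Start a new line after every 100 repetitions
            if (i % 100 == 0) {
                bw.newLine();
            }
        }
    }
}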

Time to witness the miracle: the editor opens the generated file as garbled text (screenshot omitted).

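Why does this happen?

How File Encoding Detection Works

In fact, nearly every file encoding detector you can find today is derived from the detector open-sourced by Mozilla.

The basic idea is to compute, over all bytes of the file being examined, the probability that they fall within the byte ranges of each candidate encoding.

The candidate with the highest probability is then chosen as the file's encoding.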
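To make the ambiguity concrete, here is a minimal sketch (not part of the original detector; the class and method names are made up for illustration). It encodes the test string from the previous section as GB2312 and checks the resulting bytes against two candidates: a strict UTF-8 decode, and a simplified GB2312 range check modeled on the idea described above. The six bytes E8 A4 B6 E0 B0 AE happen to be three valid GB2312 pairs and, at the same time, two well-formed three-byte UTF-8 sequences, so a range-based detector has no structural evidence to prefer one over the other.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class AmbiguousBytesDemo {

    /** Returns true if the bytes decode as UTF-8 with no malformed sequence. */
    public static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Simplified range check: the fraction of high-bit byte pairs whose
     * lead byte is in A1..F7 and whose trail byte is in A1..FE.
     */
    public static double gb2312RangeRatio(byte[] bytes) {
        int pairs = 0;
        int inRange = 0;
        for (int i = 0; i < bytes.length - 1; i++) {
            int lead = bytes[i] & 0xFF;
            if (lead < 0x80) {
                continue; // ASCII byte, not part of a pair
            }
            int trail = bytes[i + 1] & 0xFF;
            pairs++;
            if (lead >= 0xA1 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE) {
                inRange++;
            }
            i++; // consume the trail byte
        }
        return pairs == 0 ? 0.0 : (double) inRange / pairs;
    }

    public static void main(String[] args) {
        // "\u745c\u591a\u7231" is the string 瑜多爱, escaped to avoid source-encoding issues
        byte[] bytes = "\u745c\u591a\u7231".getBytes(Charset.forName("GB2312"));
        System.out.println("valid UTF-8:        " + isStrictUtf8(bytes));
        System.out.println("GB2312 range ratio: " + gb2312RangeRatio(bytes));
        System.out.println("read as UTF-8:      " + new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Because both checks pass, the outcome depends entirely on how the detector weighs its candidates. A detector that gives a clean UTF-8 parse a near-perfect score, as the utf8Probability method below does, will likely report UTF-8 for this file.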

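A Java Implementation of the Detector

Most Java options require pulling in an extra jar. I found some code that has been ported who knows how many times, converted it to Java, and made a few optimizations.

If your requirements are modest, give it a try.

The code is long and cannot be listed in full below; the complete version can be downloaded from https://download.csdn.net/download/trayvontang/89005882.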

import lombok.Getter;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class EncodingDetectHelper {

    private static final int[][] GBFreq = new int[94][94];

    private static final int[][] GBKFreq = new int[126][191];

    private static final int[][] Big5Freq = new int[94][158];

    private static final int[][] EUC_TWFreq = new int[94][94];

    private static final int[][] KRFreq = new int[94][94];

    private static final int[][] JPFreq = new int[94][94];

    static {
        initializeFrequencies();
    }

    /**
     * Detects the file encoding; by default inspects at most the first 300 MB.
     * By default checks UTF-8, GB2312, GBK, GB18030, UTF_16 and ASCII.
     * @param contentFile the file
     * @return the detected encoding
     * @throws IOException on I/O error
     */
    public static DetectEncoding detectEncoding(File contentFile) throws IOException {
        // Inspect at most 300 MB by default
        int length = 1024 * 1024 * 300;
        Set<DetectEncoding> checkEncodingSet = Set.of(
                DetectEncoding.UTF8, DetectEncoding.GB2312,
                DetectEncoding.GBK, DetectEncoding.GB18030,
                DetectEncoding.UTF_16,DetectEncoding.ASCII);
        return detectEncoding(contentFile, length, checkEncodingSet);
    }

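    /**
     * Guesses the file encoding from its content.
     * @param contentFile the file to inspect
     * @param detectLength number of bytes to inspect; null or 0 means the whole file
     * @param checkEncodingSet the encodings to check
     * @return the most likely encoding, or null if the file is shorter than 4 bytes
     * @throws IOException on I/O error
     */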
    public static DetectEncoding detectEncoding(File contentFile, Integer detectLength, Set<DetectEncoding> checkEncodingSet) throws IOException {
        long length = contentFile.length();
        if (length < 4) {
            return null;
        }
        if (detectLength == null || detectLength == 0) {
            if (length < Integer.MAX_VALUE) {
                detectLength = Math.toIntExact(length);
            } else {
                detectLength = Integer.MAX_VALUE;
            }
        }
        if (detectLength > length) {
            detectLength = Math.toIntExact(length);
        }
        byte[] contentByte = new byte[detectLength];
        try (FileInputStream fis = new FileInputStream(contentFile)) {
            int read = fis.read(contentByte, 0, 4);
            if (read == -1) {
                throw new RuntimeException("No data read from file: " + contentFile.getAbsolutePath());
            }
            // Check the BOM first for a quick decision
            if (contentByte[0] == -17 && contentByte[1] == -69 && contentByte[2] == -65) { // EF BB BF
                return DetectEncoding.UTF8;
            } else if (contentByte[0] == -1 && contentByte[1] == -2
                    && contentByte[2] == 0 && contentByte[3] == 0) { // FF FE 00 00 is the little-endian UTF-32 BOM
                return DetectEncoding.UTF_32LE;
            } else if (contentByte[0] == 0 && contentByte[1] == 0
                    && contentByte[2] == -2 && contentByte[3] == -1) { // 00 00 FE FF is the big-endian UTF-32 BOM
                return DetectEncoding.UTF_32BE;
            } else if (contentByte[0] == -2 && contentByte[1] == -1) { // FE FF
                return DetectEncoding.UTF_16BE;
            } else if (contentByte[0] == -1 && contentByte[1] == -2) { // FF FE
                return DetectEncoding.UTF_16LE;
            }
            read = fis.read(contentByte, 4, detectLength - 4);
            if (read == -1) {
                throw new RuntimeException("Failed to read file data: " + contentFile.getAbsolutePath());
            }
            return detectEncoding(contentByte, checkEncodingSet);
        }
    }

    private static DetectEncoding detectEncoding(byte[] contentByte, Set<DetectEncoding> checkEncodingSet) {
        Map<DetectEncoding, Integer> indexScoreMap = new HashMap<>();
        if (checkEncodingSet.contains(DetectEncoding.UTF8)) {
            indexScoreMap.put(DetectEncoding.UTF8, utf8Probability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.GB2312)) {
            indexScoreMap.put(DetectEncoding.GB2312, gb2312Probability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.GBK)) {
            indexScoreMap.put(DetectEncoding.GBK, gbkProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.GB18030)) {
            indexScoreMap.put(DetectEncoding.GB18030, gb18030Probability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.HZ)) {
            indexScoreMap.put(DetectEncoding.HZ, hzProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.BIG5)) {
            indexScoreMap.put(DetectEncoding.BIG5, big5Probability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.CNS11643)) {
            indexScoreMap.put(DetectEncoding.CNS11643, eucTwProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.ISO2022CN)) {
            indexScoreMap.put(DetectEncoding.ISO2022CN, iso2022CnProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.UNICODE)) {
            indexScoreMap.put(DetectEncoding.UNICODE, utf16Probability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.EUC_KR)) {
            indexScoreMap.put(DetectEncoding.EUC_KR, eucKrProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.CP949)) {
            indexScoreMap.put(DetectEncoding.CP949, cp949Probability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.JOHAB)) {
            indexScoreMap.put(DetectEncoding.JOHAB, 0);
        }
        if (checkEncodingSet.contains(DetectEncoding.ISO2022KR)) {
            indexScoreMap.put(DetectEncoding.ISO2022KR, iso2022KrProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.ASCII)) {
            indexScoreMap.put(DetectEncoding.ASCII, asciiProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.SJIS)) {
            indexScoreMap.put(DetectEncoding.SJIS, sjisProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.EUC_JP)) {
            indexScoreMap.put(DetectEncoding.EUC_JP, eucJpProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.ISO2022JP)) {
            indexScoreMap.put(DetectEncoding.ISO2022JP, iso2022JpProbability(contentByte));
        }
        if (checkEncodingSet.contains(DetectEncoding.UNICODET)) {
            indexScoreMap.put(DetectEncoding.UNICODET, 0);
        }
        if (checkEncodingSet.contains(DetectEncoding.UNICODES)) {
            indexScoreMap.put(DetectEncoding.UNICODES, 0);
        }
        if (checkEncodingSet.contains(DetectEncoding.ISO2022CN_GB)) {
            indexScoreMap.put(DetectEncoding.ISO2022CN_GB, 0);
        }
        if (checkEncodingSet.contains(DetectEncoding.ISO2022CN_CNS)) {
            indexScoreMap.put(DetectEncoding.ISO2022CN_CNS, 0);
        }
        if (checkEncodingSet.contains(DetectEncoding.OTHER)) {
            indexScoreMap.put(DetectEncoding.OTHER, 0);
        }
//        System.out.println(indexScoreMap);

        Optional<Map.Entry<DetectEncoding, Integer>> max = indexScoreMap.entrySet()
                .stream().max(Map.Entry.comparingByValue());

        if (max.isPresent()) {
            Map.Entry<DetectEncoding, Integer> entry = max.get();
            Integer value = entry.getValue();
            if (50 > value) { // Return OTHER if nothing scored above 50
                return DetectEncoding.OTHER;
            } else {
                return entry.getKey();
            }
        } else {
            return DetectEncoding.OTHER;
        }
    }

    private static int gb2312Probability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, gbchars = 1;
        long gbfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) { // non-ASCII
                dbchars++;
                if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xF7 && (byte) 0xA1 <= contentByte[i + 1]
                        && contentByte[i + 1] <= (byte) 0xFE) {
                    gbchars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0xA1;
                    column = contentByte[i + 1] + 256 - 0xA1;
                    if (GBFreq[row][column] != 0) {
                        gbfreq += GBFreq[row][column];
                    } else if (15 <= row && row < 55) {
                        // In GB high-freq character range
                        gbfreq += 200;
                    }
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) gbchars / (float) dbchars);
        freqValue = 50 * ((float) gbfreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int gbkProbability(byte[] rawtext) {
        int i, rawTextLen = rawtext.length;
        int dbchars = 1, gbchars = 1;
        long gbfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < rawTextLen - 1; i++) {
            if (rawtext[i] < 0) {
                dbchars++;
                // Extended GB range
                if ((byte) 0xA1 <= rawtext[i] && rawtext[i] <= (byte) 0xF7 && // Original GB range
                        (byte) 0xA1 <= rawtext[i + 1] && rawtext[i + 1] <= (byte) 0xFE) {
                    gbchars++;
                    totalfreq += 500;
                    row = rawtext[i] + 256 - 0xA1;
                    column = rawtext[i + 1] + 256 - 0xA1;
                    if (GBFreq[row][column] != 0) {
                        gbfreq += GBFreq[row][column];
                    } else if (15 <= row && row < 55) {
                        gbfreq += 200;
                    }
                } else if ((byte) 0x81 <= rawtext[i] && rawtext[i] <= (byte) 0xFE && (rawtext[i + 1] <= (byte) 0xFE || (byte) 0x40 <= rawtext[i + 1] && rawtext[i + 1] <= (byte) 0x7E)) {
                    gbchars++;
                    totalfreq += 500;
                    row = rawtext[i] + 256 - 0x81;
                    if (0x40 <= rawtext[i + 1] && rawtext[i + 1] <= 0x7E) {
                        column = rawtext[i + 1] - 0x40;
                    } else {
                        column = rawtext[i + 1] + 256 - 0x40;
                    }
                    if (GBKFreq[row][column] != 0) {
                        gbfreq += GBKFreq[row][column];
                    }
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) gbchars / (float) dbchars);
        freqValue = 50 * ((float) gbfreq / (float) totalfreq);
        // For regular GB files, this would give the same score, so I handicap it slightly
        return (int) (rangeValue + freqValue) - 1;
    }

    private static int gb18030Probability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, gbchars = 1;
        long gbfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) {
                dbchars++;
                // Extended GB range
                if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xF7 && // Original GB range
                        i + 1 < contentLength && (byte) 0xA1 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0xFE) {
                    gbchars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0xA1;
                    column = contentByte[i + 1] + 256 - 0xA1;
                    if (GBFreq[row][column] != 0) {
                        gbfreq += GBFreq[row][column];
                    } else if (15 <= row && row < 55) {
                        gbfreq += 200;
                    }
                } else if ((byte) 0x81 <= contentByte[i] && contentByte[i] <= (byte) 0xFE && i + 1 < contentLength && (contentByte[i + 1] <= (byte) 0xFE || (byte) 0x40 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0x7E)) {
                    gbchars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0x81;
                    if (0x40 <= contentByte[i + 1] && contentByte[i + 1] <= 0x7E) {
                        column = contentByte[i + 1] - 0x40;
                    } else {
                        column = contentByte[i + 1] + 256 - 0x40;
                    }
                    if (GBKFreq[row][column] != 0) {
                        gbfreq += GBKFreq[row][column];
                    }
                } else if ((byte) 0x81 <= contentByte[i]
                        && contentByte[i] <= (byte) 0xFE
                        && // Extended GB range
                        i + 3 < contentLength && (byte) 0x30 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0x39
                        && (byte) 0x81 <= contentByte[i + 2] && contentByte[i + 2] <= (byte) 0xFE && (byte) 0x30 <= contentByte[i + 3]
                        && contentByte[i + 3] <= (byte) 0x39) {
                    gbchars++;
                    i += 2; // four-byte sequence: skip two extra bytes on top of the i++ below
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) gbchars / (float) dbchars);
        freqValue = 50 * ((float) gbfreq / (float) totalfreq);
        // For regular GB files, this would give the same score, so I handicap it slightly
        return (int) (rangeValue + freqValue) - 1;
    }

    private static int hzProbability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        long hzfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int hzstart = 0;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) { // stop one byte early: the body reads contentByte[i + 1]
            if (contentByte[i] == '~') {
                if (contentByte[i + 1] == '{') {
                    hzstart++;
                    i += 2;
                    while (i < contentLength - 1) {
                        if (contentByte[i] == 0x0A || contentByte[i] == 0x0D) {
                            break;
                        } else if (contentByte[i] == '~' && contentByte[i + 1] == '}') {
                            i++;
                            break;
                        } else if ((0x21 <= contentByte[i] && contentByte[i] <= 0x77) && (0x21 <= contentByte[i + 1] && contentByte[i + 1] <= 0x77)) {
                            row = contentByte[i] - 0x21;
                            column = contentByte[i + 1] - 0x21;
                            totalfreq += 500;
                            if (GBFreq[row][column] != 0) {
                                hzfreq += GBFreq[row][column];
                            } else if (15 <= row && row < 55) {
                                hzfreq += 200;
                            }
                        }
                        i += 2;
                    }
                } else if (contentByte[i + 1] == '}') {
                    i++;
                } else if (contentByte[i + 1] == '~') {
                    i++;
                }
            }
        }
        if (hzstart > 4) {
            rangeValue = 50;
        } else if (hzstart > 1) {
            rangeValue = 41;
        } else if (hzstart > 0) { // Only 39 in case the sequence happened to occur
            rangeValue = 39; // in otherwise non-Hz text
        } else {
            rangeValue = 0;
        }
        freqValue = 50 * ((float) hzfreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int big5Probability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, bfchars = 1;
        float rangeValue, freqValue;
        long bffreq = 0, totalfreq = 1;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) {
                dbchars++;
                if ((byte) 0xA1 <= contentByte[i]
                        && contentByte[i] <= (byte) 0xF9
                        && (((byte) 0x40 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0x7E) || ((byte) 0xA1 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0xFE))) {
                    bfchars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0xA1;
                    if (0x40 <= contentByte[i + 1] && contentByte[i + 1] <= 0x7E) {
                        column = contentByte[i + 1] - 0x40;
                    } else {
                        column = contentByte[i + 1] + 256 - 0x61;
                    }
                    if (Big5Freq[row][column] != 0) {
                        bffreq += Big5Freq[row][column];
                    } else if (3 <= row && row <= 37) {
                        bffreq += 200;
                    }
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) bfchars / (float) dbchars);
        freqValue = 50 * ((float) bffreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }


    /**
     * EUC-TW (CNS 11643) encoding
     *
     * @param contentByte the content bytes
     * @return the likelihood score
     */
    private static int eucTwProbability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, cnschars = 1;
        long cnsfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) { // high bit set, i.e. not ASCII
                dbchars++;
                if (i + 3 < contentLength && (byte) 0x8E == contentByte[i] && (byte) 0xA1 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0xB0
                        && (byte) 0xA1 <= contentByte[i + 2] && contentByte[i + 2] <= (byte) 0xFE && (byte) 0xA1 <= contentByte[i + 3]
                        && contentByte[i + 3] <= (byte) 0xFE) { // Planes 1 - 16
                    cnschars++;
                    // System.out.println("plane 2 or above CNS char");
                    // These are all less frequent chars so just ignore freq
                    i += 3;
                } else if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xFE && // Plane 1
                        (byte) 0xA1 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0xFE) {
                    cnschars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0xA1;
                    column = contentByte[i + 1] + 256 - 0xA1;
                    if (EUC_TWFreq[row][column] != 0) {
                        cnsfreq += EUC_TWFreq[row][column];
                    } else if (35 <= row && row <= 92) {
                        cnsfreq += 150;
                    }
                    i++;
                }
            }
        }
        rangeValue = 50 * ((float) cnschars / (float) dbchars);
        freqValue = 50 * ((float) cnsfreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int iso2022CnProbability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, isochars = 1;
        long isofreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] == (byte) 0x1B && i + 3 < contentLength) { // Escape char ESC
                if (contentByte[i + 1] == (byte) 0x24 && contentByte[i + 2] == 0x29 && contentByte[i + 3] == (byte) 0x41) { // GB Escape $ ) A
                    i += 4;
                    while (i < contentLength - 1 && contentByte[i] != (byte) 0x1B) { // bounds check: no ESC may follow
                        dbchars++;
                        if ((0x21 <= contentByte[i] && contentByte[i] <= 0x77) && (0x21 <= contentByte[i + 1] && contentByte[i + 1] <= 0x77)) {
                            isochars++;
                            row = contentByte[i] - 0x21;
                            column = contentByte[i + 1] - 0x21;
                            totalfreq += 500;
                            if (GBFreq[row][column] != 0) {
                                isofreq += GBFreq[row][column];
                            } else if (15 <= row && row < 55) {
                                isofreq += 200;
                            }
                            i++;
                        }
                        i++;
                    }
                } else if (i + 3 < contentLength && contentByte[i + 1] == (byte) 0x24 && contentByte[i + 2] == (byte) 0x29
                        && contentByte[i + 3] == (byte) 0x47) {
                    // CNS Escape $ ) G
                    i += 4;
                    while (i < contentLength - 1 && contentByte[i] != (byte) 0x1B) { // bounds check: no ESC may follow
                        dbchars++;
                        if ((byte) 0x21 <= contentByte[i] && contentByte[i] <= (byte) 0x7E && (byte) 0x21 <= contentByte[i + 1]
                                && contentByte[i + 1] <= (byte) 0x7E) {
                            isochars++;
                            totalfreq += 500;
                            row = contentByte[i] - 0x21;
                            column = contentByte[i + 1] - 0x21;
                            if (EUC_TWFreq[row][column] != 0) {
                                isofreq += EUC_TWFreq[row][column];
                            } else if (35 <= row && row <= 92) {
                                isofreq += 150;
                            }
                            i++;
                        }
                        i++;
                    }
                }
                if (i + 2 < contentLength && contentByte[i] == (byte) 0x1B && contentByte[i + 1] == (byte) 0x28 && contentByte[i + 2] == (byte) 0x42) { // ASCII escape: ESC ( B
                    i += 2;
                }
            }
        }
        rangeValue = 50 * ((float) isochars / (float) dbchars);
        freqValue = 50 * ((float) isofreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int utf8Probability(byte[] contentByte) {
        int score;
        int i, contentLength = contentByte.length;
        int goodbytes = 0, asciibytes = 0;
        for (i = 0; i < contentLength; i++) {
            if ((contentByte[i] & (byte) 0x7F) == contentByte[i]) { // One byte
                asciibytes++;
                // Ignore ASCII, can throw off count
            } else {
                if (-64 <= contentByte[i] && contentByte[i] <= -33
                        && i + 1 < contentLength && contentByte[i + 1] <= -65) {
                    // Two bytes
                    goodbytes += 2;
                    i++;
                } else if (-32 <= contentByte[i] && contentByte[i] <= -17 && i + 2 < contentLength
                        && contentByte[i + 1] <= -65 && contentByte[i + 2] <= -65) {
                    // Three bytes
                    goodbytes += 3;
                    i += 2;
                }
            }
        }
        if (asciibytes == contentLength) {
            return 0;
        }
        score = (int) (100 * ((float) goodbytes / (float) (contentLength - asciibytes)));
        // If not above 98, reduce to zero to prevent coincidental matches
        // Allows for some (few) bad formed sequences
        if (score > 98) {
            return score;
        } else if (score > 95 && goodbytes > 30) {
            return score;
        } else {
            return 0;
        }
    }

    private static int utf16Probability(byte[] rawtext) {
        // Parenthesized so the length check guards both BOM comparisons
        if (rawtext.length > 1 && (((byte) 0xFE == rawtext[0] && (byte) 0xFF == rawtext[1]) // Big-endian
                || ((byte) 0xFF == rawtext[0] && (byte) 0xFE == rawtext[1]))) { // Little-endian
            return 100;
        }
        return 0;
    }

    private static int asciiProbability(byte[] rawtext) {
        int score = 75;
        int i, rawTextLen = rawtext.length;
        for (i = 0; i < rawTextLen; i++) {
            if (rawtext[i] < 0) {
                score = score - 5;
            } else if (rawtext[i] == (byte) 0x1B) { // ESC (used by ISO 2022)
                score = score - 5;
            }
            if (score <= 0) {
                return 0;
            }
        }
        return score;
    }


    private static int eucKrProbability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, krchars = 1;
        long krfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) {
                dbchars++;
                if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xFE && (byte) 0xA1 <= contentByte[i + 1]
                        && contentByte[i + 1] <= (byte) 0xFE) {
                    krchars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0xA1;
                    column = contentByte[i + 1] + 256 - 0xA1;
                    if (KRFreq[row][column] != 0) {
                        krfreq += KRFreq[row][column];
                    } else if (15 <= row && row < 55) {
                        krfreq += 0;
                    }
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) krchars / (float) dbchars);
        freqValue = 50 * ((float) krfreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int cp949Probability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, krchars = 1;
        long krfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) {
                dbchars++;
                if ((byte) 0x81 <= contentByte[i]
                        && contentByte[i] <= (byte) 0xFE
                        && ((byte) 0x41 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0x5A || (byte) 0x61 <= contentByte[i + 1]
                        && contentByte[i + 1] <= (byte) 0x7A || (byte) 0x81 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0xFE)) {
                    krchars++;
                    totalfreq += 500;
                    if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xFE && (byte) 0xA1 <= contentByte[i + 1]
                            && contentByte[i + 1] <= (byte) 0xFE) {
                        row = contentByte[i] + 256 - 0xA1;
                        column = contentByte[i + 1] + 256 - 0xA1;
                        if (KRFreq[row][column] != 0) {
                            krfreq += KRFreq[row][column];
                        }
                    }
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) krchars / (float) dbchars);
        freqValue = 50 * ((float) krfreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int iso2022KrProbability(byte[] rawtext) {
        int i;
        for (i = 0; i < rawtext.length; i++) {
            if (i + 3 < rawtext.length && rawtext[i] == 0x1b && (char) rawtext[i + 1] == '$' && (char) rawtext[i + 2] == ')'
                    && (char) rawtext[i + 3] == 'C') {
                return 100;
            }
        }
        return 0;
    }

    private static int eucJpProbability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, jpchars = 1;
        long jpfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) {
                dbchars++;
                if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xFE && (byte) 0xA1 <= contentByte[i + 1]
                        && contentByte[i + 1] <= (byte) 0xFE) {
                    jpchars++;
                    totalfreq += 500;
                    row = contentByte[i] + 256 - 0xA1;
                    column = contentByte[i + 1] + 256 - 0xA1;
                    if (JPFreq[row][column] != 0) {
                        jpfreq += JPFreq[row][column];
                    } else if (15 <= row && row < 55) {
                        jpfreq += 0;
                    }
                }
                i++;
            }
        }
        rangeValue = 50 * ((float) jpchars / (float) dbchars);
        freqValue = 50 * ((float) jpfreq / (float) totalfreq);
        return (int) (rangeValue + freqValue);
    }

    private static int iso2022JpProbability(byte[] rawtext) {
        int i;
        for (i = 0; i < rawtext.length; i++) {
            if (i + 2 < rawtext.length && rawtext[i] == 0x1b && (char) rawtext[i + 1] == '$' && (char) rawtext[i + 2] == 'B') {
                return 100;
            }
        }
        return 0;
    }

    private static int sjisProbability(byte[] contentByte) {
        int i, contentLength = contentByte.length;
        int dbchars = 1, jpchars = 1;
        long jpfreq = 0, totalfreq = 1;
        float rangeValue, freqValue;
        int row, column, adjust;
        for (i = 0; i < contentLength - 1; i++) {
            if (contentByte[i] < 0) {
                dbchars++;
                if (i + 1 < contentByte.length && ((byte) 0x81 <= contentByte[i] && contentByte[i] <= (byte) 0x9F || (byte) 0xE0 <= contentByte[i] && contentByte[i] <= (byte) 0xEF) && ((byte) 0x40 <= contentByte[i + 1] && contentByte[i + 1] <= (byte) 0x7E || contentByte[i + 1] <= (byte) 0xFC)) {
                    jpchars++;
                    totalfreq += 500;
                    row = contentByte[i] & 0xFF; // unsigned lead byte
                    column = contentByte[i + 1] & 0xFF; // unsigned trail byte (can be below 0x80, so "+ 256" would be wrong)
                    if (column < 0x9f) {
                        adjust = 1;
                        column -= 0x20;
                    } else {
                        adjust = 0;
                        column -= 0x7e;
                    }
                    if (row < 0xa0) {
                        row = ((row - 0x70) << 1) - adjust;
                    } else {
                        row = ((row - 0xb0) << 1) - adjust;
                    }
                    row -= 0x20;
                    column = 0x20; // What is going on here? This overwrites the column computed above.
                    if (row < JPFreq.length && column < JPFreq[row].length && JPFreq[row][column] != 0) {
                        jpfreq += JPFreq[row][column];
                    }
                    i++;
                } else if ((byte) 0xA1 <= contentByte[i] && contentByte[i] <= (byte) 0xDF) {
                    // half-width katakana, convert to full-width
                }
            }
        }
        rangeValue = 50 * ((float) jpchars / (float) dbchars);
        freqValue = 50 * ((float) jpfreq / (float) totalfreq);
        // For regular GB files, this would give the same score, so I handicap it slightly
        return (int) (rangeValue + freqValue) - 1;
    }

    private static void initializeFrequencies() {
        // Java zero-initializes int arrays, so the frequency tables need no explicit clearing.
        // The full tables are too large to list here; take the missing entries from the download linked above.
        GBFreq[20][35] = 599;
        JPFreq[26][89] = 0;
    }

    @Getter
    public enum DetectEncoding {
        ISO2022CN_GB(1, "ISO2022CN_GB", "ISO-2022-CN-EXT", "ISO2022CN-GB"),
        ISO2022CN_CNS(2, "ISO2022CN_CNS", "ISO-2022-CN-EXT", "ISO2022CN-CNS"),
        CP949(3, "MS949", "x-windows-949", "CP949"),
        UNICODES(4, "Unicode", "UTF-16", "Unicode (Simp)"),
        UNICODET(5, "Unicode", "UTF-16", "Unicode (Trad)"),
        SJIS(6, "SJIS", "Shift_JIS", "Shift-JIS"),
        BIG5(7, "BIG5", "BIG5", "Big5"),
        ASCII(8, "ASCII", "ASCII", "ASCII"),
        GB18030(9, "GB18030", "GB18030", "GB18030"),
        CNS11643(10, "EUC-TW", "EUC-TW", "CNS11643"),
        UNICODE(11, "Unicode", "UTF-16", "Unicode"),
        OTHER(12, "ISO8859_1", "ISO8859-1", "OTHER"),
        GBK(13, "GBK", "GBK", "GBK"),
        ISO2022CN(14, "ISO2022CN", "ISO-2022-CN", "ISO2022 CN"),
        HZ(15, "ASCII", "HZ-GB-2312", "HZ"),
        JOHAB(16, "Johab", "x-Johab", "Johab"),
        ISO2022KR(17, "ISO2022KR", "ISO-2022-KR", "ISO 2022 KR"),
        UTF8(18, "UTF-8", "UTF-8", "UTF-8"),
        UTF8T(19, "UTF-8", "UTF-8", "UTF-8 (Trad)"),
        ISO2022JP(20, "ISO2022JP", "ISO-2022-JP", "ISO 2022 JP"),
        UTF8S(21, "UTF-8", "UTF-8", "UTF-8 (Simp)"),
        GB2312(22, "GB2312", "GB2312", "GB-2312"),
        EUC_JP(23, "EUC_JP", "EUC-JP", "EUC-JP"),
        EUC_KR(24, "EUC_KR", "EUC-KR", "EUC-KR"),
        UTF_16(25, "UTF-16", "UTF-16", "UTF-16"),
        UTF_16BE(26, "UTF-16BE", "UTF-16BE", "UTF-16BE"),
        UTF_16LE(27, "UTF-16LE", "UTF-16LE", "UTF-16LE"),
        UTF_32(28, "UTF-32", "UTF-32", "UTF-32"),
        UTF_32BE(29, "UTF-32BE", "UTF-32BE", "UTF-32BE"),
        UTF_32LE(30, "UTF-32LE", "UTF-32LE", "UTF-32LE");

        private final Integer id;
        private final String javaName;
        private final String htmlName;
        private final String niceName;

        DetectEncoding(Integer id, String javaName, String htmlName, String niceName) {
            this.id = id;
            this.javaName = javaName;
            this.htmlName = htmlName;
            this.niceName = niceName;
        }

    }

}
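
The `sjisProbability` method above does two things: it turns each Shift_JIS byte pair into a row/column index into a frequency table, and it combines a range-match ratio with a frequency ratio into a score. The sketch below isolates both steps. It uses the standard Shift_JIS to JIS X 0208 mapping, which is an assumption on my part and may not match the indexing the article's `JPFreq` table expects. The class and method names (`SjisScoreSketch`, `toRowCell`, `score`) are hypothetical. The article's code also starts its counters at 1 to avoid division by zero and subtracts 1 as a handicap; the sketch leaves both out.

```java
public class SjisScoreSketch {

    /**
     * Maps a Shift_JIS double-byte pair to zero-based JIS X 0208 {row, cell}
     * indexes (each 0..93). Returns null when the pair is not a valid
     * double-byte sequence.
     */
    static int[] toRowCell(byte lead, byte trail) {
        // Java bytes are signed; mask to get the unsigned value 0..255.
        int c1 = lead & 0xFF;
        int c2 = trail & 0xFF;
        boolean leadOk = (0x81 <= c1 && c1 <= 0x9F) || (0xE0 <= c1 && c1 <= 0xEF);
        boolean trailOk = (0x40 <= c2 && c2 <= 0x7E) || (0x80 <= c2 && c2 <= 0xFC);
        if (!leadOk || !trailOk) {
            return null;
        }
        // Each lead byte covers two JIS rows; the trail byte selects which one.
        int adjust = c2 < 0x9F ? 1 : 0;
        int jisRow = ((c1 - (c1 < 0xA0 ? 0x70 : 0xB0)) << 1) - adjust;
        int jisCell = adjust == 1 ? c2 - (c2 > 0x7F ? 0x20 : 0x1F) : c2 - 0x7E;
        // JIS bytes run from 0x21 to 0x7E; shift them to 0..93.
        return new int[] {jisRow - 0x21, jisCell - 0x21};
    }

    /** Combines the range-match ratio and the frequency ratio into a 0..100 score. */
    static int score(int matchedPairs, int highBytes, long freqSum, long freqMax) {
        float rangeValue = 50f * matchedPairs / highBytes; // share of high bytes that form valid pairs
        float freqValue = 50f * freqSum / freqMax;         // share of the maximum possible frequency weight
        return (int) (rangeValue + freqValue);
    }

    public static void main(String[] args) {
        int[] rc = toRowCell((byte) 0x82, (byte) 0xA0); // hiragana "a", JIS 0x2422
        System.out.println(rc[0] + "," + rc[1]);           // prints 3,1
        System.out.println(score(80, 100, 20000, 40000));  // prints 65
    }
}
```

Half of the score comes from how many high bytes fall into valid ranges and half from how common the decoded characters are. A file made only of valid but rare characters therefore still scores about 50, which is what lets a carefully chosen text confuse the detector.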

Testing

@Test
public void detect() throws IOException {
    File file = new File("F:\\tmp\\gb2312.txt");
    System.out.println(EncodingDetectHelper.detectEncoding(file));
}
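
Because the detector returns a statistical best guess, a cheap sanity check is to confirm that the bytes at least decode without errors in the reported charset. Here is a minimal sketch using the JDK's `CharsetDecoder` in strict mode. `StrictDecodeCheck` and `decodesCleanly` are hypothetical names and are not part of the article's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeCheck {

    /** Returns true if every byte sequence in data is valid and mappable in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two CJK characters written as Unicode escapes, encoded as 6 UTF-8 bytes.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));      // true
        System.out.println(decodesCleanly(utf8, StandardCharsets.US_ASCII));   // false
        System.out.println(decodesCleanly(utf8, StandardCharsets.ISO_8859_1)); // true
    }
}
```

A passing check does not prove the guess is right. ISO-8859-1 accepts every byte sequence, and many legacy double-byte encodings overlap heavily, so strict decoding can only rule candidates out. That is why detectors fall back on frequency statistics.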
