protobuf: the difference between string and bytes (purple尘's blog)

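protobuf provides several scalar field types, including string and bytes. Taken literally, bytes is meant for arbitrary binary byte sequences. For a C++ programmer, however, std::string can store ASCII text just as well as binary data containing any number of \0 bytes. So where does the difference lie?
In practice, we also occasionally run into runtime errors like these: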
[libprotobuf ERROR google/protobuf/wire_format.cc:1091] String field 'str' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. 
[libprotobuf ERROR google/protobuf/wire_format.cc:1091] String field 'str' contains invalid UTF-8 data when parsing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes. 
This article looks at the source code to explain how the string and bytes types differ.
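As a quick illustration of the earlier point that std::string is happy to carry binary data with embedded \0 bytes, here is a minimal sketch. It uses only the standard library, and the helper names `MakeBinaryPayload` and `CountNulBytes` are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a 5-byte payload that contains two NUL bytes.
// The (pointer, length) constructor is what keeps them; constructing from a
// plain C string would stop at the first '\0'.
std::string MakeBinaryPayload() {
  const char raw[] = {'a', '\0', 'b', '\0', 'c'};
  return std::string(raw, sizeof(raw));
}

// Hypothetical helper: counts NUL bytes, showing that std::string stores
// them like any other byte.
std::size_t CountNulBytes(const std::string& s) {
  std::size_t count = 0;
  for (char c : s) {
    if (c == '\0') ++count;
  }
  return count;
}
```

So at the C++ type level nothing separates "text" from "raw bytes"; any difference has to come from what protobuf does with the field.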
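An earlier article walked through how protobuf serialization works; here we look at how string and bytes fields are serialized.
All of this serialization happens in the SerializeFieldWithCachedSizes function, which calls the serialization routine that matches the field type. For the string type, for example: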
      case FieldDescriptor::TYPE_STRING: {
        string scratch;
        const string& value = field->is_repeated() ?
          message_reflection->GetRepeatedStringReference(
            message, field, j, &scratch) :
          message_reflection->GetStringReference(message, field, &scratch);
        VerifyUTF8StringNamedField(value.data(), value.length(), SERIALIZE,
                                   field->name().c_str());
        WireFormatLite::WriteString(field->number(), value, output);
        break;
      }
And for the bytes type:
      case FieldDescriptor::TYPE_BYTES: {
        string scratch;
        const string& value = field->is_repeated() ?
          message_reflection->GetRepeatedStringReference(
            message, field, j, &scratch) :
          message_reflection->GetStringReference(message, field, &scratch);
        WireFormatLite::WriteBytes(field->number(), value, output);
        break;
      }
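So there are two differences during serialization:
1. The string type calls the VerifyUTF8StringNamedField function.
2. The serialization functions differ: WriteString vs WriteBytes.
On the second point, both functions are defined in wire_format_lite.cc and their implementations are identical.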
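To see why the two write functions can share an implementation, recall that string and bytes both use protobuf's length-delimited wire type: a tag, then the payload length as a varint, then the raw bytes. The following is a rough standalone sketch of that layout, not the library's code; `AppendVarint` and `EncodeLengthDelimited` are hypothetical names for this example.

```cpp
#include <cstdint>
#include <string>

// Appends `value` as a base-128 varint: 7 payload bits per byte, with the
// high bit set on every byte except the last.
void AppendVarint(uint64_t value, std::string* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Hypothetical stand-in for what both WriteString and WriteBytes put on the
// wire: tag, length, then the payload bytes untouched.
std::string EncodeLengthDelimited(uint32_t field_number,
                                  const std::string& payload) {
  const uint64_t kWireTypeLengthDelimited = 2;
  std::string out;
  AppendVarint((static_cast<uint64_t>(field_number) << 3) |
                   kWireTypeLengthDelimited,
               &out);
  AppendVarint(payload.size(), &out);
  out.append(payload);
  return out;
}
```

For field number 1 and the payload "abc", this produces the bytes 0x0A 0x03 'a' 'b' 'c'. Nothing in the layout records whether the field was declared as string or bytes.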
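Now let's look at the first point. VerifyUTF8StringNamedField calls VerifyUTF8StringFallback (as an aside, I have never quite understood what "fallback" means here, even though the suffix shows up all over the protobuf source). Here is the implementation: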
void WireFormat::VerifyUTF8StringFallback(const char* data,
                                          int size,
                                          Operation op,
                                          const char* field_name) {
  if (!IsStructurallyValidUTF8(data, size)) {
    const char* operation_str = NULL;
    switch (op) {
      case PARSE:
        operation_str = "parsing";
        break;
      case SERIALIZE:
        operation_str = "serializing";
        break;
      // no default case: have the compiler warn if a case is not covered.
    }
    string quoted_field_name = "";
    if (field_name != NULL) {
      quoted_field_name = StringPrintf(" '%s'", field_name);
    }
    // no space below to avoid double space when the field name is missing.
    GOOGLE_LOG(ERROR) << "String field" << quoted_field_name << " contains invalid "
               << "UTF-8 data when " << operation_str << " a protocol "
               << "buffer. Use the 'bytes' type if you intend to send raw "
               << "bytes. ";
  }
}
This is where the runtime error is printed. The key is the IsStructurallyValidUTF8 function, implemented in structurally_valid.cc:
bool IsStructurallyValidUTF8(const char* buf, int len) {
  if (!module_initialized_) return true;
  int bytes_consumed = 0;
  UTF8GenericScanFastAscii(&utf8acceptnonsurrogates_obj,
                           buf, len, &bytes_consumed);
  return (bytes_consumed == len);
}
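This scans the buffer byte by byte and checks that it conforms to UTF-8, for example that a two-byte sequence has the form 110xxxxx 10xxxxxx. See the UTF-8 encoding standard for the details.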
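The kind of check involved can be sketched with a small hand-written validator. This is a simplified illustration and not protobuf's implementation, which drives a table-based state machine; `IsValidUtf8Sketch` is a hypothetical name. It rejects bad lead bytes, missing continuation bytes, overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Simplified UTF-8 validity check (illustrative only).
bool IsValidUtf8Sketch(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    std::size_t extra = 0;    // continuation bytes expected after b0
    unsigned int cp = 0;      // code point decoded so far
    unsigned int min_cp = 0;  // smallest code point allowed at this length
    if (b0 < 0x80) {                 // 0xxxxxxx
      cp = b0;
    } else if ((b0 & 0xE0) == 0xC0) {  // 110xxxxx
      extra = 1; cp = b0 & 0x1F; min_cp = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {  // 1110xxxx
      extra = 2; cp = b0 & 0x0F; min_cp = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {  // 11110xxx
      extra = 3; cp = b0 & 0x07; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (extra > n - i - 1) return false;  // sequence is cut off
    for (std::size_t k = 1; k <= extra; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp > 0x10FFFF) return false;                 // outside Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    i += extra + 1;
  }
  return true;
}
```

A lone byte such as 0xFF fails this check, which is the sort of input that triggers the "contains invalid UTF-8 data" error for a string field, while the same byte in a bytes field is never inspected.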
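Deserialization works in a similar way.
At this point we can draw the following conclusions:
In the C++ API, both string and bytes fields are implemented as std::string.
The two use the same format for serialization and deserialization, except that string fields also go through a UTF-8 validity check.
For efficiency, once the encoding of a field is settled we can use bytes directly, which skips the UTF-8 check and should be somewhat faster.
Note that the code above is from protobuf 2.6; version 2.4 does not print field_name.
From what I understand, the Java API does differ: the two types map to String and ByteString respectively.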
Reposted from: http://izualzhy.cn/c/cpp/2017/03/20/protobuf-difference-between-string-and-bytes
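In C++, a BytesValue object can be created and assigned through the `google::protobuf::BytesValue` class. Here is a sample:
```cpp
#include <google/protobuf/wrappers.pb.h>

#include <iostream>
#include <string>

int main() {
  // Create a BytesValue object
  google::protobuf::BytesValue bytes_value;
  // Assign a value to the BytesValue object
  std::string bytes_str = "your bytes value here";
  bytes_value.set_value(bytes_str);
  // Print the value stored in the BytesValue object
  std::cout << "BytesValue: " << bytes_value.value() << std::endl;
  return 0;
}
```
In the code above, we first include the `google/protobuf/wrappers.pb.h` header and create an empty `BytesValue` object. We then assign to it with the `set_value()` method. Note that the value is passed as a `std::string`: the C++ API has no separate byte-string type, and a `std::string` may hold arbitrary bytes. Finally, we read the value back with `value()` and print it.
When using the C++ Protocol Buffers library, your own messages are written in a `.proto` file and compiled to C++ with the protoc tool. `BytesValue` itself is defined in `google/protobuf/wrappers.proto`, and this example assumes the generated header `wrappers.pb.h` is available from the installed protobuf library.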