Table of Contents
  1. 1. Pitfalls I've Run Into
    1. 1.1. 1.undefined reference to …
    2. 1.2. 3. invalid initialization of non-const reference of type
    3. 1.3. 4. …multiple definition of …
    4. 1.4. 5. …error: cannot allocate an object of abstract type …
    5. 1.5. 6. … Error in `./xx’: free(): invalid pointer: 0x00000000006042e0 …
    6. 1.6. 7. as ‘this’ argument discards qualifiers [-fpermissive] …
    7. 1.7. 8. double free / free: invalid pointer
    8. 1.8. 9. the following virtual functions are pure within ‘mit::FFM’…
      1. 1.8.0.1. 17. duplicate symbol __ZN6openmi4zeroE in:
      2. 1.8.0.2. 18. tools/logging2.cc:40:23: error: member function 'Length' not viable: 'this' argument has type 'const openmi::LogStream::Buffer'
      3. 1.8.0.3. 19. note: candidate template ignored: invalid explicitly-specified argument for template parameter 'NDIMS' typename TTypes::Tensor TensorType();
      4. 1.8.0.4. 20. Bus error: 10 (core dumped)
      5. 1.8.0.5. 21. error: allocation of incomplete type 'Eigen::ThreadPoolDevice'
      6. 1.8.0.6. 22. error: C++ requires a type specifier for all declarations
      7. 1.8.0.7. 23. [malloc: *** error for object 0x7ff62a6010e8: incorrect checksum for freed object - object was probably modified after being freed.]
      8. 1.8.0.8. 24. ['operator()' cannot be the name of a variable or data member]
      9. 1.8.0.9. 25. [Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function evalSubExprsIfNeeded]
      10. 1.8.0.10. 26. [strncpy core dump]
    9. 1.8.1. 27. basic_string::_S_construct NULL not valid
    10. 1.8.2. 28. Conditional jump or move depends on uninitialised value(s)
      1. 1.8.2.1. 29: Cannot access memory at address 0x7f0ca4de1d08 …
      2. 1.8.2.2. 30 unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h:39:7: error: class template partial
      3. 1.8.2.3. 31. dyld: lazy symbol binding failed: Symbol not found: ___emutls_get_address
      4. 1.8.2.4. 32. libtool: error: unrecognised option: ‘-static’
    11. 1.8.3. 33. /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20’ not found
  2. 1.9. Programming Tips
  • author: zhouyongsdzh@foxmail.com
  • date: 2014-12-25
  • weibo: @周永_52ML

Contents

Pitfalls I've Run Into

1. undefined reference to …

Reference: http://blog.csdn.net/jfkidear/article/details/8276203

Example 1:

undefined reference to `dmlc::Config::Config(std::istream&, bool)'

The main cause is that the library providing the symbol is not linked. Add it in CMake with `target_link_libraries(${exec_name} dmlc)`, which is equivalent to passing `-ldmlc` to g++.

> A similar problem: `undefined reference to 'pthread_create'` means `-lpthread` has to be added.

Example 2:

~/workplace/DiMLSys/third_party/root/lib/libdmlc.a(hdfs_filesys.o): In function `dmlc::io::HDFSFileSystem::HDFSFileSystem(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
hdfs_filesys.cc:(.text+0xb1): undefined reference to `hdfsConnect'
~/workplace/DiMLSys/third_party/root/lib/libdmlc.a(hdfs_filesys.o): In function `dmlc::io::HDFSFileSystem::GetPathInfo(dmlc::io::URI const&)':
hdfs_filesys.cc:(.text+0xcd0): undefined reference to `hdfsGetPathInfo'
hdfs_filesys.cc:(.text+0x143a): undefined reference to `hdfsFreeFileInfo'

This came up while building dmlc-core. The unresolved symbols (`hdfsConnect`, `hdfsGetPathInfo`, `hdfsFreeFileInfo`) come from the HDFS C client library (libhdfs), so a `libdmlc.a` built with HDFS support has to be linked together with that library.

Example 3:

```c++
undefined reference to `omp_get_num_procs'
```

The code uses OpenMP, but the build was not configured for it. In CMakeLists.txt, add the OpenMP compile flags (the variable `OpenMP_CXX_FLAGS` is filled in by `find_package(OpenMP)`): `set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")`.

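Example 4:

```c++
undefined reference to `mit::FFM<unsigned long, float>::Predict(dmlc::Row<unsigned long> const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&)'
```

Cause: the class template was compiled separately, with its declaration in a header and its member definitions in a source file. A template is only instantiated where its definition is visible, so the file that calls `Predict` sees no definition, and the `*.cc` file that holds the definition never instantiates `FFM<unsigned long, float>`. The linker therefore cannot find the function.

Fix: either avoid the class template, or put the declaration and the definition in the same `*.h` file. Reference: http://blog.sina.com.cn/s/blog_6cef0cb50100nb7o.html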
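The header-only layout can be sketched as follows. `LinearModel` is a hypothetical stand-in for a model template such as `mit::FFM`, not code from the project. The trailing comment shows a third option the note does not mention, explicit instantiation, which should also work when the definitions have to stay in a `.cc` file.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a model template such as mit::FFM.
// Everything below would live in one header, e.g. linear_model.h.
template <typename IndexT, typename ValueT>
class LinearModel {
 public:
  explicit LinearModel(std::vector<ValueT> weights)
      : weights_(std::move(weights)) {}

  // Defined inside the class, so each .cc file that calls Predict()
  // can instantiate it for the types it needs.
  ValueT Predict(const std::vector<IndexT>& keys) const {
    ValueT sum = ValueT();
    for (IndexT key : keys) {
      assert(static_cast<std::size_t>(key) < weights_.size());
      sum += weights_[static_cast<std::size_t>(key)];
    }
    return sum;
  }

 private:
  std::vector<ValueT> weights_;
};

// Alternative when the definitions must stay in a .cc file:
// add an explicit instantiation there for every combination in use, e.g.
//   template class LinearModel<unsigned long, float>;
```

Any source file can then include the header and call `Predict` on `LinearModel<unsigned long, float>` without a separate link step for the template.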

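Example 5:

thrift-0.10.0/lib/cpp/src/thrift/server/TNonblockingServer.cpp:1602: undefined reference to `event_del'

`nm libevent.a | grep event_del` shows that the static archive does contain the function, and both the headers and the library are present. In that situation the most likely cause is the order of the libraries on the link line: the linker resolves symbols from left to right, so a library has to come after the code that uses it. Placing `libevent.a` after `libthrift.a` fixed the build (see rpc_thrift/CMakeLists.txt).

A variant of Example 5:

```
/src/thrift/transport/TBufferTransports.h:544: undefined reference to 'vtable for apache::thrift::transport::TMemoryBuffer'
```

Here the order of `libthrift.a` and `libthriftnb.a` is the problem. **`libthriftnb.a` must come first and `libthrift.a` after it**.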

> Note on installing thrift: if `./configure` fails with the error below, edit the configure script and comment out all `PKG_CHECK_MODULES` checks.

./configure: line 18262: syntax error near unexpected token `QT,'
./configure: line 18262: `PKG_CHECK_MODULES(QT, QtCore >= 4.3, QtNetwork >= 4.3, have_qt=yes, have_qt=no)'

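Example 6: on Linux, building the openmit-ps executable fails when it links the static library but succeeds when it links the shared library. This indicates that the static library contains undefined functions or variables.

Fix: add `-shared -fPIC` to the build options of the static library (for a static archive, `-fPIC` is the option that matters). A shared library that depends on a static library built this way should then link successfully. ⚠️ **Shared-library options must not appear in a build whose target is an executable; `-shared` in particular must not be used there**.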

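Example 7:

```bash
Undefined symbols for architecture x86_64:
"openmi::LocalDevice::use_global_threadpool_", referenced from:
openmi::LocalDevice::LocalDevice(openmi::Allocator*, int) in libopenmi_core.a(local_device.o)
```

Cause: the static data member `use_global_threadpool_` of `LocalDevice` is declared in the class but never defined. Define and initialize it in the corresponding *.cc file.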

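So when an `undefined reference to ...` error shows up, the usual causes and checks are:

1. Check that the required headers can be found; if not, add `include_directories()`.
2. Check that the library is actually linked; if not, add `target_link_libraries(${exec_name} dmlc)`.
3. Check whether a build setting is missing; pthread and OpenMP, for example, both need extra g++ flags.
4. Check whether the class is a template. A class template should not keep its definitions in a separate *.cc file, because the definition must be visible where the template is instantiated.
5. If both the headers and the libraries are present, try **changing the order of the libraries**.
6. Linking against the static library fails while linking against the shared library succeeds. Fix: **rebuild the static library with `-shared -fPIC`**. Note: these shared-library options must not be added to the executable.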
    
    
<h2 id="2.error-while-loading-shared-libraries">2. error while loading shared libraries: *.so : cannot open shared object file</h2>

[... error while loading shared libraries: *.so : cannot open shared object file: No such file or directory](http://blog.csdn.net/sahusoft/article/details/7388617)

The program cannot load the shared library `*.so` at run time: the library either does not exist or is not on the loader's search path.

Fix:

1. Use `locate *.so` to check whether the shared library exists. If it does not, download and install it; if it does, continue with step 2.
2. Add the directory that contains `*.so` to `LD_LIBRARY_PATH`, for example:

    ```sh
    LD_LIBRARY_PATH=${JAVA_HOME}/jre/lib/amd64/server:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH
    ```

A Makefile can read this environment variable directly. In CMakeLists.txt, environment variables are accessed through the `ENV` keyword: `$ENV{LD_LIBRARY_PATH}`.

A similar error appeared with an automake build: `./openmit: error while loading shared libraries: libprotobuf.so.12: cannot open shared object file: No such file or directory`. Because this is a run-time loader error, it does not depend on the build system, and the same `LD_LIBRARY_PATH` fix should apply.

3. invalid initialization of non-const reference of type

~/workspace/openmit/openmit/include/openmit/data.h:24:41: error: invalid initialization of non-const reference of type ‘std::__cxx11::string& {aka std::__cxx11::basic_string<char>&}’ from an rvalue of type ‘std::__cxx11::string {aka std::__cxx11::basic_string<char>}’
std::string & data_format = "auto") {

Meaning: in C++ a temporary cannot bind to a non-const (lvalue) reference parameter. The default argument `"auto"` creates a temporary `std::string`, so the parameter has to be declared `const std::string&` (or taken by value).
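A minimal sketch of the corrected signature. `DescribeFormat` is a hypothetical function used only for illustration; the real function in data.h is not shown in this note.

```cpp
#include <cassert>
#include <string>

// Hypothetical function mirroring the failing signature in data.h.
// A const reference can bind to the temporary std::string created
// from the default argument "auto"; a plain std::string& cannot.
inline std::string DescribeFormat(const std::string& data_format = "auto") {
  assert(!data_format.empty());
  return "format=" + data_format;
}
```

Calling `DescribeFormat()` uses the default, and `DescribeFormat("libsvm")` binds the parameter to another temporary in the same way.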

4. …multiple definition of …

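CMakeFiles/openmit.dir/worker.cc.o: In function `mit::WorkerParam::__MANAGER__()':
worker.cc:(.text+0x176): multiple definition of `mit::WorkerParam::__MANAGER__()' // reported for worker.cc
CMakeFiles/openmit.dir/cli_main.cc.o:cli_main.cc:(.text+0x3b6): first defined here // first defined in cli_main.cc
collect2: error: ld returned 1 exit status

Cause: the definition (`DMLC_REGISTER_PARAMETER(WorkerParam);`) was placed in worker.h. Both worker.cc and cli_main.cc include worker.h, so the same symbol is defined twice and the linker reports an error.

Fix: move the definition from worker.h into worker.cc, so that it exists exactly once.

  1. Compilation works on one file at a time, while linking works on all .o files of the project together;
  2. `#ifndef` include guards only prevent a header from being included twice within one translation unit;
  3. Define global variables in a .cpp file and put an `extern` declaration in the .h file; a definition in the .h file easily leads to redefinition errors at link time.

If a shared helper function has to live in a header such as base.h, for example `void NewKey(...) { ... }`, prefix it with `inline` to avoid the `multiple definition of ...` error: `inline void NewKey(...) { ... }`.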
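The `extern` and `inline` rules can be sketched in one listing. The split into base.h and base.cc is marked by comments, and the signature of `NewKey` and the counter `g_key_count` are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// ---- base.h (hypothetical) ----
// Declaration only: every file that includes the header sees the name,
// but no storage is created here.
extern int g_key_count;

// inline allows the same definition to appear in many translation units.
inline std::string NewKey(const std::string& prefix, int id) {
  assert(id >= 0);
  ++g_key_count;
  return prefix + "_" + std::to_string(id);
}

// ---- base.cc (hypothetical) ----
// The single definition of the global variable.
int g_key_count = 0;
```

With this layout, any number of source files can include base.h, and the linker still sees exactly one `g_key_count`.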

5. …error: cannot allocate an object of abstract type …

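Every pure virtual function declared in the base class must be overridden in the derived class; otherwise the derived class is itself abstract and `new` on it fails with this error.

Also, write `Unit * base = new SimpleUnit();`, not `Unit base = new SimpleUnit();`.

In C++, the result of `new` is a pointer and must be stored in one. Reference: "C++创建对象,new与不new的区别" (creating C++ objects with and without `new`).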

6. … Error in `./xx’: free(): invalid pointer: 0x00000000006042e0 …

*** Error in `./xx': free(): invalid pointer: 0x00000000006042e0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f3989dea7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f3989df2e0a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f3989df698c]
./xx[0x4015dc]
./xx[0x401d02]
./xx[0x401bdb]
./xx[0x401397]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f3989d93830]
./xx[0x401039]
======= Memory map: ========
00400000-00403000 r-xp 00000000 08:01 2490760 ~/workspace/openmit/openmit/language/xx
...
7f398a351000-7f398a352000 rw-p 00015000 08:01 398344 /lib/x86_64-linux-gnu/libgcc_s.so.1
...
7f398a6d0000-7f398a6d4000 rw-p 00000000 00:00 0
7f398a6d4000-7f398a6fa000 r-xp 00000000 08:01 397534 /lib/x86_64-linux-gnu/ld-2.23.so
...
7f398a8fb000-7f398a8fc000 rw-p 00000000 00:00 0
7fffc3ff7000-7fffc4019000 rw-p 00000000 00:00 0 [stack]
7fffc41e1000-7fffc41e3000 r--p 00000000 00:00 0 [vvar]
7fffc41e3000-7fffc41e5000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted (core dumped)

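Context: a factory method returns the derived-class implementation, and the caller wraps it as `std::unique_ptr<A> a(A::Create("b", 10));`. The error above was reported with this implementation:

```c++
static B * Get(std::string type, int a) {
//return new B(type, a);
static B b(type, a); // static storage, not the heap
return & b;
}
```

`std::unique_ptr` calls `delete` on the pointer it owns, and deleting the address of a static object is what triggers `free(): invalid pointer`. The following implementation works:

```c++
static B * Get(std::string type, int a) {
return new B(type, a); // heap space
//static B b(type, a);
//return & b;
}
```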
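One way to make the ownership rule hard to break is to let the factory return `std::unique_ptr` itself. This is a sketch with hypothetical classes `A` and `B`; the real classes are not shown in this note.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical classes standing in for A and B from the text.
struct A {
  virtual ~A() = default;
  virtual int Value() const = 0;
};

struct B : A {
  B(std::string type, int a) : type_(std::move(type)), a_(a) {}
  int Value() const override { return a_; }
  std::string type_;
  int a_;
};

// The factory states the ownership transfer in its return type;
// the object always comes from new, so unique_ptr may delete it.
inline std::unique_ptr<A> Create(const std::string& type, int a) {
  assert(!type.empty());
  return std::unique_ptr<A>(new B(type, a));
}
```

Because the return type is already a smart pointer, a static or stack object cannot be handed out by accident.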

7. as ‘this’ argument discards qualifiers [-fpermissive] …

… error: passing ‘const std::unordered_map’ as ‘this’ argument discards qualifiers [-fpermissive] …

The full error:

~/workspace/openmit/openmit/test/unittests/unittest_openmit_unit.cc: In function ‘void run(const std::unordered_map<int, mit::Unit*>&, int)’:
~/workspace/openmit/openmit/test/unittests/unittest_openmit_unit.cc:7:37: error: passing ‘const std::unordered_map<int, mit::Unit*>’ as ‘this’ argument discards qualifiers [-fpermissive]
mit::Unit * unit = map_weight_[key];

^
In file included from /usr/include/c++/5/unordered_map:48:0,
from ~/workspace/openmit/openmit/include/openmit/unit.h:4,
from ~/workspace/openmit/openmit/test/unittests/unittest_openmit_unit.cc:1:
/usr/include/c++/5/bits/unordered_map.h:667:7: note: in call to ‘std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::mapped_type& std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::operator[](const key_type&) [with _Key = int; _Tp = mit::Unit*; _Hash = std::hash<int>; _Pred = std::equal_to<int>; _Alloc = std::allocator<std::pair<const int, mit::Unit*> >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::mapped_type = mit::Unit*; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_type = int]’
operator[](const key_type& __k)

Context:

void run(const std::unordered_map<int, mit::Unit * > & map_weight_, int key) {
mit::Unit * unit = map_weight_[key];
unit->SetLinearItem(0.0999);
std::cout << "unit.Linear: " << unit->LinearWeight() << std::endl;
}

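Cause: `operator[]` is called on the const object `map_weight_`. Calling a non-const member function on a const object is not allowed, because a non-const member function gives no guarantee that it leaves the object unchanged.

The compiler does not look at what the call would do at run time. `operator[]` is declared non-const, so the call is rejected on a const object.

For `unordered_map`, `operator[]` inserts a default-constructed value when the key is missing, so it can modify the map. That is why it cannot be used on a const map.

Fix: drop the `const`, or use a const lookup such as `find()` or `at()` instead of `operator[]` (making `operator[]` itself const is not practical).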

void run(std::unordered_map<int, mit::Unit * > & map_weight_, int key) { ... }
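If the parameter should stay const, a lookup through `find()` avoids the problem. A minimal sketch, with `Unit` as a hypothetical stand-in for `mit::Unit`:

```cpp
#include <cassert>
#include <unordered_map>

struct Unit {  // hypothetical stand-in for mit::Unit
  float linear = 0.0f;
};

// find() is a const member function, so it works on a const map and
// never inserts. Returns nullptr when the key is absent.
inline Unit* FindUnit(const std::unordered_map<int, Unit*>& map_weight, int key) {
  auto it = map_weight.find(key);
  return it == map_weight.end() ? nullptr : it->second;
}
```

Unlike `operator[]`, a missing key leaves the map untouched, so the caller has to handle the `nullptr` case explicitly.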

8. double free / free: invalid pointer

A memory error (reported as a double free) showed up in src/learner.cc:

mit_float * pvals = map_grad[keys[0]]->Data();
auto offset = map_grad[keys[0]]->Size();
for (size_t i = 1; i < nfeature; ++i) {
std::cout << "Learner::Run i: " << i << std::endl;
memcpy(pvals + offset,
map_grad[keys[i]]->Data(),
map_grad[keys[i]]->Size() * sizeof(mit_float));

offset += map_grad[keys[i]]->Size();
}

std::cout << "nfeature done" << std::endl;
std::vector<mit_float> grad_vals(pvals, pvals + offset);
vals = &grad_vals;

The following version works:

// map_grad_ --> vals
vals->clear();
for (auto i = 0u; i < nfeature; i++) {
mit::Unit * unit = map_grad[keys[i]];
vals->insert(vals->end(),
unit->Data(), unit->Data() + unit->Size());
delete unit;
}

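Follow-up:

The first version most likely fails for two reasons: `pvals` points into the buffer of the first unit, so the `memcpy` to `pvals + offset` writes past the end of that buffer, and `vals = &grad_vals` stores the address of a local vector that is destroyed at the end of the scope.

Note:

  1. A heap block must be freed by exactly one owner. If several pointers refer to the same address and more than one of them frees it, the result is `double free or corruption (fasttop) ...`.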
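The working version boils down to appending each block to a vector that manages its own memory. A sketch under the assumption that `mit_float` is a `float` typedef; `FlattenGradients` is a hypothetical helper name.

```cpp
#include <cassert>
#include <vector>

typedef float mit_float;  // assumption: mit_float is a float typedef

// Hypothetical helper: copies every gradient block into *vals.
// The vector grows as needed, so nothing is written past a buffer,
// and the caller keeps ownership of both the inputs and the output.
inline void FlattenGradients(const std::vector<std::vector<mit_float> >& blocks,
                             std::vector<mit_float>* vals) {
  assert(vals != nullptr);
  vals->clear();
  for (const std::vector<mit_float>& block : blocks) {
    vals->insert(vals->end(), block.begin(), block.end());
  }
}
```

No raw pointer is stored or freed here, which removes both the out-of-bounds write and the dangling address.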

9. the following virtual functions are pure within ‘mit::FFM’…

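`mit::Model` declares four pure virtual functions (`virtual type method() = 0;`), but the subclass overrides only two of them, hence the message that the remaining virtual functions are still pure and must be overridden.

> Pure virtual function: `virtual type method() = 0;`. Without the `= 0`, i.e. `virtual type method();`, the subclass does not have to override it, but the base class must then provide a definition (otherwise the link step fails), so this has its own pitfalls.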

<h2 id="10.binding-const-value_type-to-reference-of-type-discards-qualifiers">10. binding ‘const `value_type` {aka const float}’ to reference of type ‘`mit::mit_float`& {aka float&}’ discards qualifiers</h2> 

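Symptom: a parameter declared as `const std::vector<int> & keys` is indexed with `keys[i]`, and the element is passed to another function `func(int & key)`.

Fix: add `const` to the function parameter: `func(const int & key)`.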

<h2 id="11.Program-terminated-with-signal-11-Segmentation-fault.0">11. Program terminated with signal 11, Segmentation fault.0</h2>

[Program terminated with signal 11, Segmentation fault.0  0x00007f7113d037f1 in ?? ()](https://blog.csdn.net/xufandecsdn/article/details/80609546) 

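Context: openmit-ps was built on Linux, producing the executables bin/{client,server}. Running bin/server crashed, and the core file showed the error above, i.e. **the program segfaults before it even enters `main`**.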

```sh
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7113d037f1 in ?? ()
(gdb) bt
#0  0x00007f7113d037f1 in ?? ()
#1  0x0000000000000000 in ?? ()
(gdb)
```

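Diagnosis: single-stepping in gdb and then checking the compile options showed that `CMAKE_CXX_FLAGS` contained `-shared -fPIC`. **The flags for building a shared library had been applied to an executable by mistake!**

Fix: **remove the shared-library flags from the options used to build executables**.

Related problem: [adding -static to an executable gives /bin/ld: cannot find -lopenmit_idl (the directory contains only the shared library, no static one); without -static the build succeeds](https://www.cnblogs.com/yunsicai/p/3191002.html).

Reason: the linker (ld) prefers shared libraries by default, but with `-static` it links static libraries only, **so if the static library does not exist it reports `cannot find`**.

Similar errors: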

```sh
Reading symbols from featurex/featurex/test/test...done.
[New Thread 6541]
Core was generated by `./test'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000000005d6 in ?? ()
(gdb) bt
#0  0x00000000000005d6 in ?? ()
#1  0x00007f4f639d5600 in main (argc=Unhandled dwarf expression opcode 0xf3
) at /opt/meituan/zhouyong03/featurex/featurex/test/test.cc:4
(gdb)
```

<h2 id="12.Heap-check-constructor-called-twice.">12. Check failed: !`internal_init_start_has_run`: Heap-check constructor called twice.  Perhaps you both linked in the heap checker, and also used LD_PRELOAD to load it? Aborted (core dumped)</h2>

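Context: both the shared library and the executable built from openmit-ps linked the static tcmalloc library (libtcmalloc.a), and running the executable `/bin/server` failed with the error above (with the static archive linked into both, the heap checker is presumably initialized twice). Linking the shared library (libtcmalloc.so) instead works.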

---
### 13. **[-bash: ./bin/server: /lib/ld64.so.1: bad ELF interpreter: No such file or directory](https://stackoverflow.com/questions/23604471/ld64-so-present-in-ldd-missing-at-runtime)**

Context: openmit-ps/test/server.cc was built with the command below, and running `./bin/server` produced the error above.

```sh
/bin/g++ -std=c++11 -g -O3 -Wall -static -DNDEBUG -DHAVE_NETINET_IN_H -fopenmp -O2 -DNDEBUG -rdynamic CMakeFiles/server.dir/server.o -o ../../bin/server -L/home/sankuai/.openmit_deps/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps/../openmit-common/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps/../openmit-idl/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps -L/home/sankuai/.openmit_deps/lib64 -static -Wl,--whole-archive -lopenmit_ps -Wl,--no-whole-archive -Wl,--eh-frame-hdr -Wl,-Bstatic -lopenmit_idl -lopenmit_common -lboost_system -lboost_thread -lthriftnb -lthrift -lprotobuf -lprotoc -lprotobuf-lite -lsnappy -levent -lssl -lglog -lgflags -ltcmalloc_minimal -Wl,-Bdynamic -lpthread -lrt
```

Diagnosis: run `ldd` on the executable to see the details.

```
openmit-ps[master*]$ ldd bin/server
     linux-vdso.so.1 =>  (0x00007fff1f5fe000)
     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f451a3ef000)
     librt.so.1 => /lib64/librt.so.1 (0x00007f451a1e6000)
     libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f4519edd000)
     libm.so.6 => /lib64/libm.so.6 (0x00007f4519bdb000)
     libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f45199b4000)
     libc.so.6 => /lib64/libc.so.6 (0x00007f45195f3000)
     /lib/ld64.so.1 => /lib64/ld-linux-x86-64.so.2 (0x00007f451a624000)
     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f45193dd000)
```

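The dynamic linker path recorded at link time does not match the one available at run time.

Fix: specify the dynamic linker explicitly in `LINK_FLAGS` by adding `-Wl,--dynamic-linker=/lib64/ld-linux-x86-64.so.2`. The executable then runs.

Setting link options in CMake: [`SET_TARGET_PROPERTIES(foo PROPERTIES LINK_FLAGS -Wl,-whole-archive -lopenmit_ps ...)`](https://cmake.org/pipermail/cmake/2003-August/004244.html)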

---
### 14. [`Exception in thread "main" java.lang.UnsatisfiedLinkError: libfeature_extractor.so: libfeature_extractor.so: cannot allocate memory in static TLS block`]()

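The error occurs when JNI loads the .so: the static TLS (thread-local storage) block cannot be allocated. The requirement is that the .so avoids static or global variables that need static TLS. The fix here: build the .so without `-ltcmalloc_minimal`, then call System.load() again. The build script: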

```sh
env_config['CPPPATH'] += path
env_config['LIBPATH'] += lib_path
env_config['LIBS'] = libs
env_config['CXXFLAGS'] += ' -Wno-unused-local-typedefs -Wno-unused-variable'
env_config['LINKFLAGS'] += ' -Wl,--eh-frame-hdr'

env = Environment(**env_config)
#env['_LIBFLAGS'] = '-Wl,-Bstatic -lmlxproto -lmlxcommon -lprotobuf -lglog -lgflags -lboost_system -lboost_timer -lboost_chrono -ltcmalloc_minimal -lcityhash -lrt -Wl,-Bdynamic -lpthread'
env['_LIBFLAGS'] = '-Wl,-Bstatic -lmlxproto -lmlxcommon -lprotobuf -lglog -lgflags -lboost_system -lboost_timer -lboost_chrono -lcityhash -lrt -Wl,-Bdynamic -lpthread'

#sources = [Glob('../../src/*.cc')] + [Glob('../../proto/*.cc')] + [Glob('../../src/features/*.cc')] + [Glob("./*.cc")]
env.SharedLibrary('feature_extractor', sources)
```

---
### 15. [malloc: *** error for object 0x7f86a3c08620: pointer being freed was not allocated]()

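Cause: the pointer was never initialized, so it held some arbitrary non-null address. Freeing it then reports that the pointer being freed was not allocated.

Fix: always initialize pointers, at least to null. For example: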

```c++
class Tensor {
// ....
private:
  TensorShape tensor_shape_;
  TensorBuffer<T>* buf_ = nullptr;    // must be initialized
}; // class Tensor
```

Case2: 

```c++
core_framework_executor_test(69510,0x7fff7b20f000) malloc: *** error for object 0x3b03126f3b03126f: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
^CAbort trap: 6 (core dumped)
```

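Cause: the error appeared after the code in lines 1-13 of the snippet below was changed to the version in lines 15-23 (now commented out). The root cause has not been confirmed. A likely culprit is `auto` with Eigen expression templates: `X00` has the type of the reshape expression, so `X00 = X00.broadcast(...)` does not rebind the variable but assigns through the expression into the tensor's memory.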

```c++
if (is_lbcast && is_rbcast) {
    Y.device(d) = X0.reshape(lreshape_dims).broadcast(lbcast_dims).binaryExpr(
      X1.reshape(rreshape_dims).broadcast(rbcast_dims), typename FUNCTOR::func());
  } else if (is_lbcast && !is_rbcast) {
    Y.device(d) = X0.reshape(lreshape_dims).broadcast(lbcast_dims).binaryExpr(
      X1.reshape(rreshape_dims), typename FUNCTOR::func());
  } else if (!is_lbcast && is_rbcast) {
    Y.device(d) = X0.reshape(lreshape_dims).binaryExpr(
      X1.reshape(rreshape_dims).broadcast(rbcast_dims), typename FUNCTOR::func());
  } else {
    Y.device(d) = X0.reshape(lreshape_dims).binaryExpr(
      X1.reshape(rreshape_dims), typename FUNCTOR::func());
  }
  /*
    auto X00 = X0.reshape(lreshape_dims);
    if (is_lbcast) {
      X00 = X00.broadcast(lbcast_dims);
    }
    auto X11 = X1.reshape(rreshape_dims);
    if (is_rbcast) {
      X11 = X11.broadcast(rbcast_dims);
    }
    Y.device(d) = X00.binaryExpr(X11, typename FUNCTOR::func());
  */

  LOG(INFO) << "Y:\n" << Y;
```

Case 3: assigning to a pointer parameter instead of through it

```c++
void GradientRegistry::Lookup(const std::string& op, GradCreator* creator) {
  auto it = grad_creator_mapper_.find(op);
  CHECK(it != grad_creator_mapper_.end())
    << op << " not in gradient registry.";
  LOG(DEBUG) << op << " has exists.";
  //creator = &(it->second);    // Wrong: core dump. This only rebinds the local pointer; hard to trace because GradCreator is a function pointer
  *creator = it->second;
}
```

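On operator overloading:

> One thing to watch out for: **the return type must not be a reference**. A reference would refer to v0, a local variable that is destroyed when the function returns, so the reference would point to an object that no longer exists. Returning a MyVector by value instead copies v0 before it is destroyed, and **the caller receives that copy**. A custom copy constructor may therefore be needed in some cases.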
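
The rule about returning by value can be sketched with a two-field `MyVector`. The fields `x` and `y` are assumptions; the note only names the type and the local `v0`.

```cpp
#include <cassert>

struct MyVector {  // minimal stand-in for the MyVector in the note
  double x;
  double y;
};

// Returns by value: v0 is copied (or moved/elided) into the caller's
// object before it is destroyed. Returning MyVector& here would leave
// the caller with a dangling reference.
inline MyVector operator+(const MyVector& lhs, const MyVector& rhs) {
  MyVector v0 = {lhs.x + rhs.x, lhs.y + rhs.y};
  return v0;
}
```

For a plain aggregate like this one the compiler-generated copy is sufficient; a hand-written copy constructor only becomes necessary when the class owns resources.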

---
### 16. [error: no matching constructor for initialization of 'std::thread']()

Error message:

```c++
thread_local_test.cc:44:15: error: no matching constructor for initialization of 'std::thread'
  std::thread t3(foo, 22);
              ^  ~~~~

/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/thread:379:9: note: candidate constructor template not viable: requires single argument '__f', but 2 arguments were provided
thread::thread(_Fp __f)
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/thread:268:5: note: candidate constructor not viable: requires 1 argument, but 2 were provided
thread(const thread&);
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/thread:275:5: note: candidate constructor not viable: requires 0 arguments, but 2 were provided
thread() _NOEXCEPT : __t_(0) {}
^

```

The code:

```c++
#include <iostream>
#include <thread> // c++11

// openmi::ThreadLocal comes from the project's own header (not shown here)
void foo(int value) {
openmi::ThreadLocal<int> g_i;
g_i.Value() = value;
std::cout << "foo tid=" << std::this_thread::get_id() << ", n=" << g_i.Value() << std::endl;
}

int main() {
std::thread t3(foo, 22);
t3.join();
return 0;
}
```

Compile command: `g++ -g -pthread thread_local_test.cc -o xx`

Fix: add `-std=c++11` to the compile command.


17. duplicate symbol __ZN6openmi4zeroE in:

The full build error:

duplicate symbol __ZN6openmi4zeroE in:
/var/folders/5v/rh3q6_sx62998l6x__3_zllr0000gn/T/log_stream_test-4dfdef.o
/var/folders/5v/rh3q6_sx62998l6x__3_zllr0000gn/T/log_stream-5415a7.o
ld: 1 duplicate symbol for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Compile command: `g++ log_stream_test.cc log_stream.cc -o xx`

Analysis: the linker found the same symbol twice, namely the variable `zero` in namespace `openmi`. log_stream.h does define that variable, derived from `digits`:

const char digits[] = "9876543210123456789";
const char digitsHex[] = "0123456789ABCDEF";
const char* zero = digits + 9;

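In your own code, the most likely cause is that a header (*.h) _defines_ a variable or function and is included by several source files (.cc/.cpp).

Fix: move the definition of `zero` into the .cc file. To avoid this in general, keep declarations in headers and definitions in source files. Headers should not contain definitions, except for things like inline functions, templates and constants with internal linkage.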
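It may help to see why only `zero` clashed while `digits` and `digitsHex` did not. Namespace-scope objects that are themselves `const` have internal linkage, whereas `const char* zero` is a non-const pointer to const characters. The sketch below uses a hypothetical namespace `openmi_sketch` and a hypothetical helper `DigitChar`, and keeps the definition in the header by making the pointer itself const.

```cpp
#include <cassert>

namespace openmi_sketch {  // hypothetical namespace, not the real openmi

// const objects at namespace scope have internal linkage by default,
// so this array may sit in a header without clashing.
const char digits[] = "9876543210123456789";

// "const char* zero" is a non-const pointer (only the chars are const),
// which gives it external linkage. Making the pointer itself const
// gives it internal linkage as well.
const char* const zero = digits + 9;

// zero[d] is the character for digit |d|, for d in [-9, 9].
inline char DigitChar(int d) {
  assert(d >= -9 && d <= 9);
  return zero[d];
}

}  // namespace openmi_sketch
```

Moving the definition into the .cc file, as described above, remains the simpler choice when other files do not need the value at compile time.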


18. tools/logging2.cc:40:23: error: member function 'Length' not viable: 'this' argument has type 'const openmi::LogStream::Buffer'

The full error:

openmit-common/tools/logging2.cc:40:23: error: member function 'Length' not viable: 'this' argument has type 'const openmi::LogStream::Buffer'
(aka 'const DataBuffer<openmi::kSmallBuffer>'), but function is not marked const
output_(buf.Data(), buf.Length());

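Problem: a non-const member function is called on a const object. Either mark the member function (`Length()` here) as `const`, or do not pass the object as const.

The same error elsewhere: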

tensor_shape.cc:59:23: error: member function 'Shape' not viable: 'this' argument has type 'const openmi::TensorShape',
but function is not marked const
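
A minimal sketch of the fix, with a hypothetical `Buffer` class standing in for `openmi::LogStream::Buffer`: accessors that do not modify the object are marked `const`, so they can be called through a const reference.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

class Buffer {  // hypothetical stand-in for openmi::LogStream::Buffer
 public:
  void Append(const std::string& text) { data_ += text; }  // mutates
  const char* Data() const { return data_.c_str(); }       // read-only
  std::size_t Length() const { return data_.size(); }      // read-only

 private:
  std::string data_;
};

// Compiles only because Data() and Length() are marked const.
inline std::string Flush(const Buffer& buf) {
  return std::string(buf.Data(), buf.Length());
}
```

Removing `const` from `Length()` in this sketch reproduces the "not marked const" diagnostic at the call in `Flush`.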

19. note: candidate template ignored: invalid explicitly-specified argument for template parameter 'NDIMS' typename TTypes<T, NDIMS>::Tensor TensorType();

The full error:

  /Users/zhouyong03/myhome/openmit/openmix/openmit-mix/unittest/core_framework_tensor_test.cc:13:21: error: no matching member function for call to 'TensorType'
auto tt = tensor->TensorType<dsize>();
~~~~~~~~^~~~~~~~~~~~~~~~~
/Users/zhouyong03/myhome/openmit/openmix/openmit-mix/graph_new/core/framework/tensor.h:28:37: note: candidate template ignored: invalid explicitly-specified argument for template
parameter 'NDIMS'
typename TTypes<T, NDIMS>::Tensor TensorType();

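Problem: `NDIMS` is a non-type template parameter, while `dsize` is a local variable. C++ requires the argument bound to a non-type template parameter to be a constant expression (for example a `constexpr` value), so a run-time local variable is not allowed.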
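
The constraint can be illustrated with a small function template. `RankOf` and the two helpers are hypothetical, and the parameter type `int` is an assumption, since the real declaration of `NDIMS` is not shown. The second helper sketches a common workaround when the rank is only known at run time.

```cpp
#include <cassert>

template <int NDIMS>   // assumption: the parameter type is int
int RankOf() { return NDIMS; }

constexpr int kDims = 2;  // constant expression: usable as a template argument

inline int RankFromConstant() {
  // int dsize = 2; RankOf<dsize>();  // error: dsize is not a constant expression
  return RankOf<kDims>();
}

// When the rank is only known at run time, dispatch to the
// instantiations that were compiled in.
inline int RankFromRuntime(int dsize) {
  switch (dsize) {
    case 1: return RankOf<1>();
    case 2: return RankOf<2>();
    case 3: return RankOf<3>();
    default: return -1;  // unsupported rank in this sketch
  }
}
```

The switch keeps the set of instantiations finite and explicit, at the cost of listing each supported rank by hand.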


20. Bus error: 10 (core dumped)

Problem: the program aborts with a bus error.
Cause: an object is used before it has been allocated with `new`.

data_->fullname_ = file;
data_->basename_ = ConstShortFileName(data_->fullname_);
data_->log_severity_ = severity;
allocated_ = NULL;
allocated_ = new LogMessageData();
data_ = allocated_;
data_->first_fatal_ = false;

Fix: move lines 1-3 after line 6, so the fields are written only after `data_` points to the newly allocated object.


21. error: allocation of incomplete type 'Eigen::ThreadPoolDevice'

The build error:

openmi/core/common_runtime/local_device.cc:13:11: error: allocation of incomplete type 'Eigen::ThreadPoolDevice'
new Eigen::ThreadPoolDevice(
^~~~~~~~~~~~~~~~~~~~~~~
openmi-base/base/device.h:7:8: note: forward declaration of 'Eigen::ThreadPoolDevice'
struct ThreadPoolDevice;
^
openmi/core/common_runtime/local_device.cc:44:43: error: member access into incomplete type 'Eigen::ThreadPoolDevice'
SetEigenCpuDevice(tp_info->eigen_device_->get());

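Cause: the compiler only sees a forward declaration of the struct or class and not its full definition; the forward declaration at line 7 of device.h has no matching definition in scope. `Eigen::ThreadPoolDevice` is incomplete here because the Eigen thread pool is only available when the code is compiled with `-DEIGEN_USE_THREADS`.

Fix: add `-DEIGEN_USE_THREADS` to the compile options.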


22. error: C++ requires a type specifier for all declarations

Cause: a statement was written outside of any function.
Fix: move the statement into a function.


23. [malloc: *** error for object 0x7ff62a6010e8: incorrect checksum for freed object - object was probably modified after being freed.]

The full error message:

core_framework_executor_test(87702,0x7fff7b20f000) malloc: *** error for object 0x7ff62a6010e8: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug

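Cause: an Eigen matrix multiplication with mismatched rows and cols. For example, in `w*x = y`, a wrong transpose in the forward pass corrupts memory, and the damage is only detected later.

Fix: check in code that the shape of `W*X` equals the shape of `Y`, so the problem surfaces early.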


24. ['operator()' cannot be the name of a variable or data member]

Error log and code:

openmi/core/softmax_op.cc:10:17: error: expected ')'
void operator(Device& d, typename TTypes<T>::ConstMatrix logits,
^
openmi/core/softmax_op.cc:10:16: note: to match this '('
void operator(Device& d, typename TTypes<T>::ConstMatrix logits,
^
openmi/core/softmax_op.cc:10:8: error: 'operator()' cannot be the name of a variable or data member

void operator(Device& d, typename TTypes<T>::ConstMatrix logits,
^

Code:

template <typename Device, typename T>
struct SoftmaxFunctor {
void operator(Device& d, typename TTypes<T>::ConstMatrix logits, typename TTypes<T>::Matrix softmax, const bool is_log);
}; // struct SoftmaxFunctor

Cause: the functor is declared incorrectly. It should be `operator()(....)`; the `()` between `operator` and the parameter list must not be omitted.
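
A correctly declared functor looks like the sketch below. `ScaleFunctor` is a hypothetical example and has nothing to do with the real `SoftmaxFunctor` beyond the syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical functor; the first () is part of the name operator(),
// the second () holds the parameter list.
template <typename T>
struct ScaleFunctor {
  void operator()(const std::vector<T>& in, std::vector<T>* out, T factor) const {
    assert(out != nullptr);
    out->resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
      (*out)[i] = in[i] * factor;
    }
  }
};
```

It is invoked like a function: `ScaleFunctor<float>()(in, &out, 2.0f);`.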


25. [Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function evalSubExprsIfNeeded]

Error log and code:

Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function evalSubExprsIfNeeded, file /Users/zhouyong03/myhome/openmit/tech-stacks/ml_eigen/third_party/deps/eigen/include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorAssign.h, line 122.

Code:

auto d = context->template eigen_device<Device>();
Y.device(d) = X.sum(depth_dim);

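The message means that the ranks on the two sides of the assignment differ. `X.sum()` yields rank 0, and `X.sum(depth_dim)` yields rank NDIM-1, where NDIM is the rank of X. There are two ways to fix it:

  1. Give Y the matching rank, e.g. TensorMap<T, NDIM-1> Y(${Y_dims})
  2. Reshape the result: Y.device(d) = X.sum(depth_dim).eval().reshape(${Y_dims})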

26. [strncpy core dump]

Code:

#include <cstring>  // strncpy, strsep
#include <string>
#include <iostream>

int size = 20;

int main(int argc, char** argv) {
char* buf = new char[size];

std::string x("hello,hello");
strncpy(buf, x.c_str(), size);
buf[size-1] = 0;
std::cout << "1 buff: " << buf << std::endl; // right

std::string y("hello,hello");
strncpy(buf, y.c_str(), size);
buf[size-1] = 0;
std::cout << "2 buff: " << buf << std::endl; // right

const char* delim_token_ = ",";
char* token = nullptr;
while ((token = strsep(&buf, delim_token_)) != nullptr) {
std::cout << "token: " << token;
}
std::cout << "strsep buf: " << buf << std::endl; // error

std::string z("hello,hello");
strncpy(buf, z.c_str(), size);
buf[size-1] = 0;
std::cout << "z buff: " << buf << std::endl; // core dump


return 0;
}

Output of a build and run:

zhouyong03deMacBook-Pro:unittest zhouyong03$ g++ -std=c++11 strncpy_test.cc -o xx
zhouyong03deMacBook-Pro:unittest zhouyong03$ ./xx
1 buff: hello,hello
2 buff: hello,hello
Segmentation fault: 11 (core dumped)

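Cause: `strsep` keeps advancing the `buf` pointer, and once the input is exhausted it sets `buf` to null. Every later use of `buf` (printing it, then `strncpy` into it) therefore goes through a null pointer.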
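
A safer pattern is to keep the owning buffer separate from the cursor that `strsep` modifies. The sketch below assumes a platform that provides `strsep` (glibc, BSD, macOS); `SplitWithStrsep` is a hypothetical helper name.

```cpp
#include <cassert>
#include <string.h>  // strsep (BSD/glibc extension, not ISO C++)
#include <string>
#include <vector>

// Splits text on any character in delims. The buffer is owned by a
// std::vector, and only a separate cursor is handed to strsep.
inline std::vector<std::string> SplitWithStrsep(const std::string& text,
                                                const char* delims) {
  std::vector<char> buf(text.begin(), text.end());
  buf.push_back('\0');

  char* cursor = buf.data();  // strsep advances this, never buf itself
  std::vector<std::string> tokens;
  char* token = nullptr;
  while ((token = strsep(&cursor, delims)) != nullptr) {
    tokens.push_back(token);
  }
  assert(cursor == nullptr);  // strsep nulls the cursor when it is done
  return tokens;
}
```

Because the vector owns the memory, the buffer stays valid and is released automatically, whatever `strsep` does to the cursor.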


27. basic_string::_S_construct NULL not valid

The full error:

terminate called after throwing an instance of 'std::logic_error'
what(): basic_string::_S_construct NULL not valid
Aborted (core dumped)

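The error itself is clear: a `std::string` was initialized from a null pointer. The key is to locate the code where the initializing pointer can be null. valgrind was used here and reported: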

==29007== 144 bytes in 1 blocks are possibly lost in loss record 5 of 7
==29007== at 0x4C26E85: malloc (vg_replace_malloc.c:309)
==29007== by 0x4EEE746: __cxa_allocate_exception (in /usr/lib64/libstdc++.so.6.0.13)
==29007== by 0x4E94801: std::__throw_logic_error(char const*) (in /usr/lib64/libstdc++.so.6.0.13)
==29007== by 0x4ECFE58: ??? (in /usr/lib64/libstdc++.so.6.0.13)
==29007== by 0x4ECFF32: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /usr/lib64/libstdc++.so.6.0.13)
==29007== by 0x4059F3: featurex::internal::SystemInfo::Hostname() (logging.cc:357)
==29007== by 0x4068FC: featurex::internal::LogFile::PrettyLogFileName(long*) (logging.cc:540)
==29007== by 0x40637A: featurex::internal::LogFile::RollFile() (logging.cc:494)
==29007== by 0x405ED9: featurex::internal::LogFile::LogFile(char const*, char const*, long, std::string, bool, int, int) (logging.cc:446)
==29007== by 0x40457B: featurex::LogDestination::LogDestination(long, int) (logging.cc:77)
==29007== by 0x404421: featurex::LogDestination::log_destination(int, long) (logging.cc:56)
==29007== by 0x404F5C: featurex::LogMessage::Flush() (logging.cc:208)

The trace leads to `Hostname()`: `gethostname` failed and the function returned NULL. The faulty code:

std::string SystemInfo::Hostname() {
if (host_name_.empty()) {
char hostname[32];
if (gethostname(hostname,sizeof(hostname)) != 0) {
std::runtime_error("get host name error.");
printf("[%s:%s:%d] host name get failed.\n", __FILE__, __func__, __LINE__);
return NULL; // error code.
}
std::string host_name(hostname);
//host_name.pop_back(); // drop '\n' in hostname
host_name_ = host_name;
}

return host_name_;
}

Fix: 1. when `gethostname` fails, do not return NULL; return a default value instead. 2. The call failed because the `char hostname[32]` array was too small, so enlarge it. The revised code:

std::string SystemInfo::Hostname() {
if (host_name_.empty()) {
const int length = 128;
char hostname[length];
if (gethostname(hostname, sizeof(hostname)) != 0) {
printf("[%s:%s:%d] host name get failed.\n", __FILE__, __func__, __LINE__);
return "unknown"; // placeholder default instead of NULL
}
hostname[length - 1] = '\0'; // gethostname may leave a truncated name unterminated
host_name_ = hostname;
}

return host_name_;
}

28. Conditional jump or move depends on uninitialised value(s)

==6081== Conditional jump or move depends on uninitialised value(s)
==6081== at 0x1002770E5: featurex::internal::LogFile::WriteUnlocked(char const*, unsigned long) (logging.cc:480)
==6081== by 0x10027552D: featurex::LogMessage::Flush() (logging.cc:470)
==6081== by 0x100275762: featurex::LogMessage::~LogMessage() (logging.cc:151)
==6081== by 0x10026E093: featurex::Combine::InternalRun(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (combine.cc:82)
==6081== by 0x10026E3FE: featurex::Combine::InternalRun(featurex::FeatureValue*, featurex::FeatureValue*) (combine.cc:119)
==6081== by 0x10026D9D3: featurex::Combine::Compute(std::__1::shared_ptr<featurex::InputValue> const&) (combine.cc:62)
==6081== by 0x100264027: featurex::FeatureManager::Compute(std::__1::shared_ptr<featurex::FeatureInfo>&) (feature_manager.cc:251)
==6081== by 0x10026488D: featurex::FeatureManager::ExtractUniqueFeature(std::__1::shared_ptr<featurex::RawFeature> const&, std::__1::shared_ptr<featurex::RawFeature> const&, std::__1::shared_ptr<featurex::proto::FeatureResult>&) (feature_manager.cc:358)
==6081== by 0x100260010: featurex::FeatureExtractor::ExtractUniqueFeature() (feature_extractor.cc:263)
==6081== by 0x100260C62: featurex::FeatureExtractor::Extract(featurex::proto::RawFeature&) (feature_extractor.cc:349)
==6081== by 0x100002943: main (feature_extractor_valgrind_test.cc:61)
==6081== Uninitialised value was created by a heap allocation
==6081== at 0x10023FD11: malloc (vg_replace_malloc.c:302)
==6081== by 0x1006887DD: operator new(unsigned long) (in /usr/lib/libc++.1.dylib)
==6081== by 0x100274968: featurex::LogDestination::LogDestination(long long, int) (logging.cc:73)
==6081== by 0x1002754CE: featurex::LogMessage::Flush() (logging.cc:71)
==6081== by 0x100275762: featurex::LogMessage::~LogMessage() (logging.cc:151)
==6081== by 0x100259077: featurex::FeatureConf::LoadConf(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (feature_conf.cc:24)
==6081== by 0x10025EE1C: featurex::FeatureExtractor::Init(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) (feature_extractor.cc:84)
==6081== by 0x1000025EA: main (feature_extractor_valgrind_test.cc:28)

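Cause:

Valgrind messages such as "Conditional jump or move depends on uninitialised value(s)" and "Uninitialised value was created by a heap allocation" mean that a branch decision depended on a variable that was never initialized.

Fix: 1. Find the uninitialized variables (including members of built-in types) and initialize them; 2. For variables backed by allocated memory, try calloc instead of malloc so the memory starts out zero-filled.

In this case the cause was that members such as LogFile::thread_safe_ and LogFile::flush_interval_ were left uninitialized.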
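A minimal sketch of both fixes, under stated assumptions: the class below is only a stand-in for the real LogFile, and the member types and default values are guesses for illustration (only the member names thread_safe_ and flush_interval_ come from this post). NewZeroedBuffer is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Stand-in for the LogFile class mentioned above. The member types and the
// default values are assumptions, not the original code.
class LogFile {
 public:
  LogFile() = default;

  bool thread_safe() const { return thread_safe_; }
  int64_t flush_interval() const { return flush_interval_; }

 private:
  // Default member initializers (C++11) give every constructor a defined
  // starting value unless that constructor explicitly overrides it.
  bool thread_safe_ = true;
  int64_t flush_interval_ = 3;
};

// Hypothetical helper: calloc returns zero-filled memory, whereas malloc
// leaves the contents indeterminate. The caller releases it with std::free.
inline char* NewZeroedBuffer(std::size_t size) {
  assert(size > 0);
  return static_cast<char*>(std::calloc(size, sizeof(char)));
}
```

With the initializers in place, a LogFile created through operator new no longer exposes indeterminate members to a later branch, which is the pattern Valgrind flagged at logging.cc:480.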


29: Cannot access memory at address 0x7f0ca4de1d08 …

Failed to read a valid object file image from memory.
Core was generated by `/usr/local/java8/bin/java -server -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-'.
Program terminated with signal 6, Aborted.
#0 0x00007f1083a234f5 in ?? ()
(gdb) bt
#0 0x00007f1083a234f5 in ?? ()
Cannot access memory at address 0x7f0ca4de1d08

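This error was reported while initializing feature_extractor when the featurex framework went live. Scenario: the 到综 (in-store services) and 美食 (food) configurations were initialized at the same time, and each initialization creates num_thread instances.

Comments found online suggest the following possible causes:

"a valid object file image" has nothing to do with pictures! It is most likely a memory conflict caused by multithreading. Hard to track down.

At a glance this looks like a memory leak caused by multithreading. The message means that reading a valid memory image failed. The core file was generated by /usr/local/niker, so you can debug it with gdb and trace the thread call stacks after loading that file.

Here, opening the core file with gdb java8 ${corefile} and running bt shows the following: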

Loaded symbols for /usr/local/jdk1.8.0_45/jre/lib/amd64/libawt.so
Reading symbols from /usr/local/jdk1.8.0_45/jre/lib/amd64/libawt_headless.so...(no debugging symbols found)...done.
Loaded symbols for /usr/local/jdk1.8.0_45/jre/lib/amd64/libawt_headless.so
Reading symbols from /tmp/libidxdb-jni-c.so4690097535416358788...done.
Loaded symbols for /tmp/libidxdb-jni-c.so4690097535416358788
Reading symbols from /opt/meituan/mobile/adsms/webroot/WEB-INF/classes/libgomp.so.1...done.
Loaded symbols for /opt/meituan/mobile/adsms/webroot/WEB-INF/classes/libgomp.so.1
Reading symbols from /opt/meituan/mobile/adsms/webroot/WEB-INF/classes/libfeaturex.so...done.
Loaded symbols for /opt/meituan/mobile/adsms/webroot/WEB-INF/classes/libfeaturex.so
Core was generated by `/usr/local/java8/bin/java -server -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-'.
Program terminated with signal 6, Aborted.
#0 0x00007f64168364f5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.212.el6_10.3.x86_64 jdk-1.8.0-45.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-37.el6_6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libgcc-4.4.7-23.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libstdc++-4.4.7-23.el6.x86_64 openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x00007f64168364f5 in raise () from /lib64/libc.so.6
#1 0x00007f6416837cd5 in abort () from /lib64/libc.so.6
#2 0x00007f641614b6b5 in os::abort(bool) () from /usr/local/jdk1.8.0_45/jre/lib/amd64/server/libjvm.so
#3 0x00007f64162e8da3 in VMError::report_and_die() () from /usr/local/jdk1.8.0_45/jre/lib/amd64/server/libjvm.so
#4 0x00007f6416150bdf in JVM_handle_linux_signal () from /usr/local/jdk1.8.0_45/jre/lib/amd64/server/libjvm.so
#5 0x00007f6416147493 in signalHandler(int, siginfo*, void*) () from /usr/local/jdk1.8.0_45/jre/lib/amd64/server/libjvm.so
#6 <signal handler called>
#7 0x00007f6416fbd64f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#8 0x00007f61d85c32fc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#9 0x00007f6049efde6b in wait<featurex::FeatureExtractor::ClearProtoObject()::<lambda()> > (this=0x7f62a6fe4980)
at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/condition_variable:98
#10 featurex::FeatureExtractor::ClearProtoObject (this=0x7f62a6fe4980)
at /opt/meituan/zhouyong03/featurex/featurex/feature/feature_extractor.cc:575
#11 0x00007f61d85c3470 in ?? () from /usr/lib64/libstdc++.so.6
#12 0x00007f6416fb9aa1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f64168ecc4d in clone () from /lib64/libc.so.6

The backtrace shows that the crash is in ClearProtoObject at feature_extractor.cc:575, which is implemented as:

void FeatureExtractor::ClearProtoObject() {
  assert(stop_ == false);
  while (!stop_) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] {
      if (stop_) {
        return true;
      }
      if (full_request_batch_ptr_ == nullptr) {
        return false;
      }
      return full_request_batch_ptr_->has_instances();
    });

    if (full_request_batch_ptr_ != nullptr) {
      full_request_batch_ptr_->Clear();
    }
  }
  DLOG(INFO) << "Async ClearProtoObject done.";
}

The failure is in the cond_.wait call. As a temporary workaround, the background thread that cleans up the proto object was removed.
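The root cause was not pinned down in this post. One common reason for a crash inside a condition-variable wait is that the object owning the mutex and the condition variable is destroyed while the background thread is still waiting on them. The sketch below shows a shutdown pattern that rules this out. BackgroundCleaner and all of its members are hypothetical names for illustration, not the featurex code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical background cleaner; it only counts how many times it "cleared".
class BackgroundCleaner {
 public:
  BackgroundCleaner() : worker_([this] { Run(); }) {}

  ~BackgroundCleaner() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cond_.notify_all();
    worker_.join();  // the thread never outlives mutex_ and cond_
  }

  // Marks the data as ready for cleaning and wakes the worker.
  void Submit() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      dirty_ = true;
    }
    cond_.notify_one();
  }

  int cleared_count() {
    std::lock_guard<std::mutex> lock(mutex_);
    return cleared_count_;
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (true) {
      // Every flag is read while holding the lock, so there is no data race.
      cond_.wait(lock, [this] { return stop_ || dirty_; });
      if (dirty_) {
        dirty_ = false;
        ++cleared_count_;
      }
      if (stop_) return;
    }
  }

  std::mutex mutex_;
  std::condition_variable cond_;
  bool stop_ = false;
  bool dirty_ = false;
  int cleared_count_ = 0;
  std::thread worker_;  // declared last so the members above exist before it starts
};
```

The two design points are that stop_ is only touched under the mutex, and that the destructor joins the thread before any member is destroyed.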

The core file is opened with gdb java8 because opening it with plain gdb prints: Core was generated by `/usr/local/java8/bin/java -server -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-'. The executable therefore has to be given as java8 (note: not java).


30 unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h:39:7: error: class template partial

The exact error:

~/.openmi_deps/include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h:39:7: error: class template partial
specialization is not more specialized than the primary template [-Winvalid-partial-specialization]
class TensorStorage<T, FixedDimensions, Options_>
^
~/.openmi_deps/include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h:34:63: note: template is declared here
template<typename T, typename Dimensions, int Options_> class TensorStorage;

Cause: after upgrading the Mac to macOS 10.15, the newer clang checks templates more strictly.

Fix:

In unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h, replace
template<typename T, typename Dimensions, int Options_> class TensorStorage;
with
template<typename T, typename Dimensions, int Options_, typename empty = void> class TensorStorage;
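To see why the extra defaulted parameter helps, here is a reduced sketch. Storage, FixedDims and DynDims are simplified stand-ins invented for illustration, not the Eigen definitions. With a three-parameter primary template, the specialization Storage<T, FixedDimensions, Options_> would repeat the primary's parameter list exactly, which is what clang rejects. With the fourth parameter defaulted to void, the same specialization fixes that parameter and becomes genuinely more specialized.

```cpp
#include <cassert>

// Simplified stand-ins for the dimension types; not the Eigen classes.
template <int N> struct FixedDims {};
template <int N> struct DynDims {};

// Primary template with the extra defaulted parameter, as in the patched header.
template <typename T, typename Dimensions, int Options_, typename empty = void>
struct Storage {
  static constexpr int kind = 0;  // generic fallback
};

// Means Storage<T, FixedDimensions, Options_, void>: the last argument is
// pinned to void, so this is more specialized than the primary template.
template <typename T, typename FixedDimensions, int Options_>
struct Storage<T, FixedDimensions, Options_> {
  static constexpr int kind = 1;  // fixed-size storage
};

// An even more specialized case for the dynamic dimension type.
template <typename T, int N, int Options_>
struct Storage<T, DynDims<N>, Options_> {
  static constexpr int kind = 2;  // dynamically sized storage
};

static_assert(Storage<float, FixedDims<3>, 0>::kind == 1,
              "three-argument uses pick the fixed-size specialization");
static_assert(Storage<float, DynDims<2>, 0>::kind == 2,
              "the DynDims specialization wins by partial ordering");
```

Existing code that names the template with three arguments keeps compiling unchanged, because the fourth argument is filled in by its default.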

31. dyld: lazy symbol binding failed: Symbol not found: ___emutls_get_address

 ./xgboost demo/binary_classification/mushroom.conf
dyld: lazy symbol binding failed: Symbol not found: ___emutls_get_address
Referenced from: ~/myhome/openmit/openmi/openmi-base/project_deps/third_party/xgboost/./xgboost
Expected in: /usr/lib/libSystem.B.dylib

dyld: Symbol not found: ___emutls_get_address

Referenced from: ~/myhome/openmit/openmi/openmi-base/project_deps/third_party/xgboost/./xgboost
Expected in: /usr/lib/libSystem.B.dylib

Abort trap: 6

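The build succeeds, but at run time the symbol cannot be found. The main cause is that the old gcc runtime libraries are loaded at run time, while the binary was built against the new ones (the build on macOS used gcc-9; the default is 4.2.1).

Solution 1: in ~/.bash_profile, set DYLD_FALLBACK_LIBRARY_PATH to the lib directory of the new gcc: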

export DYLD_FALLBACK_LIBRARY_PATH=/usr/local/Cellar/gcc/9.2.0_2/lib/gcc/9

Solution 2: on macOS, use DYLD_LIBRARY_PATH instead of LD_LIBRARY_PATH.

The underlying cause is that on macOS dynamic library paths belong in DYLD_LIBRARY_PATH, not LD_LIBRARY_PATH.

macOS searches for dynamic libraries in this order: DYLD_LIBRARY_PATH -> DYLD_FALLBACK_FRAMEWORK_PATH | DYLD_FALLBACK_LIBRARY_PATH. See the description of DYLD_LIBRARY_PATH below:

This is a colon separated list of directories that contain libraries. The dynamic linker searches these directories before it searches the default locations for libraries. It allows you to test new versions of existing libraries. For each library that a program uses, the dynamic linker looks for it in each directory in DYLD_LIBRARY_PATH in turn. If it still can't find the library, it then searches DYLD_FALLBACK_FRAMEWORK_PATH and DYLD_FALLBACK_LIBRARY_PATH in turn.


32. libtool: error: unrecognised option: ‘-static’

Problem: installing boost on a Mac fails with the error below; the libtool in use was installed with brew.

libtool: unrecognized option `-static'
libtool: Try `libtool --help' for more information.

Fix: use the libtool that ships with macOS (located in /Library/Developer/CommandLineTools/usr/bin).

33. /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20’ not found

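Problem: /libcaml_featurex.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found

Cause: the build used the libstdc++.so.6 from the gcc 4.9.2 directory, while the offline cluster loads /usr/lib64/libstdc++.so.6 by default at run time; the former is the newer version.

Solution: have the project's CMakeLists.txt link the static libstdc++ from gcc 4.9.2, for example: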

add_library(stdc++ STATIC IMPORTED)
set_target_properties(stdc++ PROPERTIES IMPORTED_LOCATION ${stdcpp_path}) # stdcpp_path is the path to libstdc++.a

Programming Tips

Callback functions

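#include <functional>

// Status is the project's own status type (not shown here).
class OpKernelBase {
 public:
  using Callback = std::function<void(Status)>;
};

class OpKernelContext {
 public:
  // Stores the callback to run once the kernel launch has finished.
  void SetLaunchDone(OpKernelBase::Callback launch_done) {
    launch_done_ = launch_done;
  }

 private:
  OpKernelBase::Callback launch_done_;
};

// At the call site, capture whatever the completion handler needs.
ctx->SetLaunchDone([this, node_id, ctx](Status st) { LaunchDone(node_id, ctx, st); });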
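For reference, a self-contained variant of the same pattern that compiles on its own. Status here is a hypothetical two-field struct, and Executor, NotifyLaunchDone and done_nodes are invented for illustration; only OpKernelBase::Callback, OpKernelContext::SetLaunchDone and LaunchDone mirror names from the snippet above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal status type; the real Status is not shown in this post.
struct Status {
  bool ok;
  std::string message;
};

class OpKernelBase {
 public:
  using Callback = std::function<void(Status)>;
};

class OpKernelContext {
 public:
  void SetLaunchDone(OpKernelBase::Callback launch_done) {
    launch_done_ = std::move(launch_done);
  }

  // Invokes the stored callback, if one was set (hypothetical helper).
  void NotifyLaunchDone(const Status& st) {
    if (launch_done_) launch_done_(st);
  }

 private:
  OpKernelBase::Callback launch_done_;
};

// Hypothetical owner that records which nodes finished successfully.
class Executor {
 public:
  void Launch(int node_id, OpKernelContext* ctx) {
    // The lambda captures this and node_id by value, so it is safe to call
    // later as long as the Executor is still alive.
    ctx->SetLaunchDone([this, node_id](Status st) { LaunchDone(node_id, st); });
  }

  const std::vector<int>& done_nodes() const { return done_nodes_; }

 private:
  void LaunchDone(int node_id, const Status& st) {
    if (st.ok) done_nodes_.push_back(node_id);
  }

  std::vector<int> done_nodes_;
};
```

Storing the callback as a std::function keeps OpKernelContext independent of whoever launches the kernel: any callable taking a Status can be registered.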