C++ Summary Notes
- author: zhouyongsdzh@foxmail.com
- date: 2014-12-25
- weibo: @周永_52ML
Contents
- Preface
- Debugging and program analysis tools
- gdb
- gperf
- Multithreaded programming
- Pitfalls
Pitfalls
1. undefined reference to …
Reference: http://blog.csdn.net/jfkidear/article/details/8276203
Error example 1:
undefined reference to `dmlc::Config::Config(std::istream&, bool)'
The main cause is that the library defining the symbol (here dmlc) was not passed to the linker.
> A similar problem: `undefined reference to 'pthread_create'` requires adding `-lpthread`.
Error example 2:
~/workplace/DiMLSys/third_party/root/lib/libdmlc.a(hdfs_filesys.o): In function `dmlc::io::HDFSFileSystem::HDFSFileSystem(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
This appeared while building dmlc-core; the cause turned out to be …
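Error example 3:
```c++
undefined reference to `omp_get_num_procs`
```
The main cause is that the code uses OpenMP but the build does not enable it. When using OpenMP, add the OpenMP compiler flags in CMakeLists.txt (after `find_package(OpenMP)` has populated `OpenMP_CXX_FLAGS`): `set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")`.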
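Error example 4:
```c++
undefined reference to `mit::FFM<unsigned long, float>::Predict(dmlc::Row<unsigned long> const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&)'
```
Main cause: the class template was compiled separately, with its declaration in a header and its implementation in a source file. The compiler can only instantiate a template where its definition is visible, so no code for `FFM<unsigned long, float>::Predict` is ever emitted and the linker cannot find the implementation that lives in the *.cc file.
Solution: either do not use a class template, or put the declaration and the definition in the same *.h file (explicitly instantiating the needed types in the *.cc file also works). Reference: http://blog.sina.com.cn/s/blog_6cef0cb50100nb7o.html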
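
As a minimal sketch of the two fixes, assuming a hypothetical class template `Scorer` that stands in for `mit::FFM` (its members and behavior are invented for illustration), the definition can either stay visible in the header or be explicitly instantiated in the source file:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a class template such as mit::FFM<K, V>.
// Header part (for example scorer.h): the class with a declared member.
template <typename K, typename V>
class Scorer {
 public:
  V Predict(const std::vector<K>& keys, const std::vector<V>& weights) const;
};

// Fix 1: keep the member definition in the header too, so every translation
// unit that uses Scorer<K, V> can instantiate it.
template <typename K, typename V>
V Scorer<K, V>::Predict(const std::vector<K>& keys,
                        const std::vector<V>& weights) const {
  V sum = V();
  for (const K& k : keys) {
    if (k < weights.size()) sum += weights[k];  // skip out-of-range keys
  }
  return sum;
}

// Fix 2: if the definition has to live in a .cc file, explicitly instantiate
// every combination that callers need, in that same .cc file.
template class Scorer<unsigned long, float>;
```

With fix 2, only the instantiations listed in the source file can be linked; any other type combination would bring the undefined reference back.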
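Error example 5:
thrift-0.10.0/lib/cpp/src/thrift/server/TNonblockingServer.cpp:1602: undefined reference to `event_del'
Running `nm libevent.a | grep event_del` shows that the static archive does contain the function, and both the headers and the library are present. In that situation the most likely cause is the order of the libraries on the link line. Placing libevent.a after libthrift.a fixes the build, because a static library has to come after the libraries that depend on it. See rpc_thrift/CMakeLists.txt.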
Same as example 5:
/src/thrift/transport/TBufferTransports.h:544: undefined reference to 'vtable for apache::thrift::transport::TMemoryBuffer'
./configure: line 18262: syntax error near unexpected token `QT,'
./configure: line 18262: `PKG_CHECK_MODULES(QT, QtCore >= 4.3, QtNetwork >= 4.3, have_qt=yes, have_qt=no)'
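Error example 6: when building the openmit-ps executable on Linux, linking against the static library fails while linking against the shared library succeeds. This shows that the static library contains undefined functions and variables.
Solution: add `-shared -fPIC` to the build flags used for the static library. A shared library that depends on a static library built with `-shared -fPIC` should then link successfully. ⚠️ **Shared-library options must not appear in a build whose target is an executable; `-shared` in particular must not be used there**.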
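Error example 7:
```bash
Undefined symbols for architecture x86_64:
"openmi::LocalDevice::use_global_threadpool_", referenced from:
openmi::LocalDevice::LocalDevice(openmi::Allocator*, int) in libopenmi_core.a(local_device.o)
```
Cause: the static data member `use_global_threadpool_` of LocalDevice is declared in the class but never defined. Defining (and initializing) it in the corresponding *.cc file fixes the error.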
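
A minimal sketch of that fix, using a hypothetical class `Device` in place of `openmi::LocalDevice` and assuming the member is a `bool` (the original type is not shown):

```cpp
#include <cassert>

// Hypothetical stand-in for openmi::LocalDevice.
class Device {
 public:
  static bool UseGlobalThreadpool() { return use_global_threadpool_; }
  static void SetUseGlobalThreadpool(bool v) { use_global_threadpool_ = v; }

 private:
  // Declaration only: this line reserves no storage.
  static bool use_global_threadpool_;
};

// Definition: belongs in exactly one .cc file. Without it, every use of the
// member becomes an undefined symbol at link time.
bool Device::use_global_threadpool_ = false;
```

Since C++17 the member could instead be declared `inline static` inside the class, which removes the need for the out-of-class definition.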
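So when `undefined reference to …` appears:
1. Check that the header is on the include path; if not, add it with `include_directories()`.
2. Check that the corresponding library is linked; if not, add it to the target's link libraries (for example `dmlc`).
3. Check whether a required build option is missing; pthread and OpenMP, for example, both need their flags on the g++ command line.
4. Check whether the class is a template. A class template should not have its implementation in a separate *.cc file, because the definition must be visible where the template is instantiated.
5. If both the headers and the libraries are present, try **changing the order of the libraries**.
6. Linking the static library fails but linking the shared library succeeds. Solution: **rebuild the static library with `-shared -fPIC`**. Note: do not add the shared-library options when building the executable.
<h2 id="2.error-while-loading-shared-libraries">2. error while loading shared libraries: *.so : cannot open shared object file</h2>
[... error while loading shared libraries: *.so : cannot open shared object file: No such file or directory](http://blog.csdn.net/sahusoft/article/details/7388617)
The message means that the program could not load the shared library `*.so` at run time: the file either does not exist or was not found.
Solution:
1. First check whether the shared library exists with `locate *.so`. If it does not, download and install it. If it does, go to step 2.
2. Add the directory containing `*.so` to `LD_LIBRARY_PATH`, for example:
```sh
LD_LIBRARY_PATH=${JAVA_HOME}/jre/lib/amd64/server:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
```
In a Makefile the environment variable can be referenced directly. In CMakeLists, environment variables are read with the `ENV` keyword: `$ENV{LD_LIBRARY_PATH}`.
The same kind of error also appeared in an automake build: `./openmit: error while loading shared libraries: libprotobuf.so.12: cannot open shared object file: No such file or directory`. What is the fix under automake?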
3. invalid initialization of non-const reference of type
…invalid initialization of non-const reference of type…
~/workspace/openmit/openmit/include/openmit/data.h:24:41: error: invalid initialization of non-const reference of type ‘std::__cxx11::string& {aka std::__cxx11::basic_string<char>&}’ from an rvalue of type ‘std::__cxx11::string {aka std::__cxx11::basic_string<char>}’
Meaning: in C++ a temporary (an rvalue) cannot be bound to a non-const lvalue reference parameter.
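
The rule can be sketched as follows. The function names are hypothetical and only illustrate the two usual ways out: take a `const` reference when the argument is only read, or pass a named object when it has to be modified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns a temporary (an rvalue).
std::string MakeName() { return "data"; }

// Read-only access: a const reference binds to lvalues and to temporaries,
// so NameLength(MakeName()) compiles.
std::size_t NameLength(const std::string& s) { return s.size(); }

// Modifying access: a non-const reference needs a named object (an lvalue).
// AppendSuffix(MakeName()) would be rejected with the error above.
void AppendSuffix(std::string& s) { s += ".txt"; }
```

To modify the result of `MakeName()`, store it in a local variable first and pass that variable to `AppendSuffix`.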
—
4. …multiple definition of …
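CMakeFiles/openmit.dir/worker.cc.o: In function `mit::WorkerParam::__MANAGER__()':
Cause: the variable definition (`DMLC_REGISTER_PARAMETER(WorkerParam);`) was placed in worker.h, and both worker.cc and cli_main.cc include worker.h, so the variable is defined twice and the linker reports the error.
Solution: move the definition from worker.h into worker.cc, which avoids the duplicate definition.
- Compilation works on one file at a time, while linking works on all the .o files of the project;
- An `#ifndef` include guard only prevents one file from being included twice within a single compilation;
- Global variables are best defined in a .cpp file and declared with extern in the .h file; defining them in the .h file easily leads to duplicate definitions at link time;
If a "common function" has to live in base.h, for example `void NewKey(...) { ... }`, put `inline` in front of it to avoid the `multiple definition of ...` problem, i.e. `inline void NewKey(...) { ... }`.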
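
A minimal sketch of such a header-defined helper. The signature and body of `NewKey` below are assumptions made for illustration; the original only shows `void NewKey(...) { ... }`.

```cpp
#include <cassert>
#include <cstdint>

// base.h (sketch): `inline` allows the same definition to appear in every
// translation unit that includes this header without a link-time clash.
// Hypothetical body: pack a field id and a feature id into one 64-bit key.
inline std::uint64_t NewKey(std::uint32_t field, std::uint32_t feature) {
  return (static_cast<std::uint64_t>(field) << 32) | feature;
}
```

Without `inline`, two source files including this header would each emit a definition of `NewKey` and the link would fail with `multiple definition of ...`.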
5. …error: cannot allocate an object of abstract type …
…error: cannot allocate an object of abstract type …
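Every pure virtual function declared in the base class must be overridden in the derived class; only then can the derived class be instantiated with new without this error.
Also, write `Unit * base = new SimpleUnit();`, not `Unit base = new SimpleUnit();`.
In C++, the result of `new` is a pointer and has to be stored in a pointer. Reference: "C++创建对象,new与不new的区别" (creating objects in C++ with and without new)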
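
A short sketch of both points, reusing the names `Unit` and `SimpleUnit` from the text; the member function `Size` is hypothetical:

```cpp
#include <cassert>
#include <memory>

class Unit {
 public:
  virtual ~Unit() = default;
  virtual int Size() const = 0;  // pure virtual: Unit itself is abstract
};

class SimpleUnit : public Unit {
 public:
  // Overriding every pure virtual makes SimpleUnit a concrete class.
  int Size() const override { return 1; }
};

// The result of `new` is a pointer; here a smart pointer owns it.
std::unique_ptr<Unit> MakeUnit() {
  return std::unique_ptr<Unit>(new SimpleUnit());
}
```

If `SimpleUnit` left `Size` unimplemented, `new SimpleUnit()` would fail with `cannot allocate an object of abstract type`.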
6. … Error in `./xx’: free(): invalid pointer: 0x00000000006042e0 …
*** Error in `./xx': free(): invalid pointer: 0x00000000006042e0 ***
Background: a factory method returns a derived-class implementation, and the caller uses it as `std::unique_ptr<A> a(A::Create("b", 10));`
The code involved in the error:
static B * Get(std::string type, int a) {
7. as ‘this’ argument discards qualifiers [-fpermissive] …
… error: passing ‘const std::unordered_map
The full error:
~/workspace/openmit/openmit/test/unittests/unittest_openmit_unit.cc: In function ‘void run(const std::unordered_map<int, mit::Unit*>&, int)’:
Context:
void run(const std::unordered_map<int, mit::Unit * > & map_weight_, int key) {
The main cause: the compiler rejects the call to `operator[]` on the const object `map_weight_`. Calling a non-const member function on a const object is not allowed, because a non-const member function does not promise to leave the object unchanged.
The compiler assumes that `operator[]` may modify `map_weight_`; since `map_weight_` is const, any operation that may modify it is an error.
The `[]` operator of `unordered_map` inserts a default-constructed value when the key is missing, which can change the map itself, so `[]` cannot be used on a const map.
Fix: drop the const, or make `operator[]` a const method (hard to do here).
void run(std::unordered_map<int, mit::Unit * > & map_weight_, int key) { ... }
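
An alternative that keeps the `const` is to use the const-qualified lookup functions `find()` or `at()` instead of `[]`. A minimal sketch, with a hypothetical helper `Lookup` and an `int` value type chosen only for illustration:

```cpp
#include <cassert>
#include <unordered_map>

// find() is const-qualified and never inserts, so it works on a const map.
// Returns `fallback` when the key is absent.
int Lookup(const std::unordered_map<int, int>& m, int key, int fallback) {
  auto it = m.find(key);
  return it == m.end() ? fallback : it->second;
}
```

`m.at(key)` is also usable on a const map; it throws `std::out_of_range` when the key is missing instead of returning a fallback.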
8. double free / free: invalid pointer
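A memory error occurs in src/learner.cc:
mit_float * pvals = map_grad[keys[0]]->Data();
Switching to the following code works:
// map_grad_ --> vals
Follow-up on the problem:
Note:
- A heap allocation must be freed exactly once, by a single owner. Several pointers may refer to the same address, but if more than one of them frees it, the program fails with `double free or corruption (fasttop) ...`.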
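
One common way to enforce single ownership is `std::unique_ptr` for the owner and raw pointers for observers that never delete. The sketch below uses a hypothetical `Grad` type and is not the code from src/learner.cc:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical payload type standing in for a gradient buffer.
struct Grad {
  std::vector<float> data;
};

float SumOwned() {
  // Sole owner: the allocation is freed exactly once, when `owner` goes
  // out of scope.
  std::unique_ptr<Grad> owner(new Grad);
  owner->data = {1.0f, 2.0f};

  // Non-owning view: it may alias the object but must never delete it.
  const Grad* view = owner.get();
  return view->data[0] + view->data[1];
}
```

Calling `delete` on `view` as well would free the same block twice, which is exactly the situation that produces the double free report.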
9. the following virtual functions are pure within ‘mit::FFM’…
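Specifically, `mit::Model` declares 4 pure virtual functions (`virtual type method() = 0;`) but the subclass overrides only two of them (`override`), hence the message that the following virtual functions are pure and must be overridden.
> Pure virtual function: `virtual type method() = 0;`. Without the `= 0`, i.e. just `virtual type method();`, the subclass does not have to override it (though this is error-prone), but the base class must then provide a definition.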
<h2 id="10.binding-const-value_type-to-reference-of-type-discards-qualifiers">10. binding ‘const `value_type` {aka const float}’ to reference of type ‘`mit::mit_float`& {aka float&}’ discards qualifiers</h2>
Symptom: the parameter is `const std::vector<int> & keys`, and passing `keys[i]` to another function `func(int & key)` produces this error.
Fix: add const to the function's parameter, i.e. `func(const int & key)`.
<h2 id="11.Program-terminated-with-signal-11-Segmentation-fault.0">11. Program terminated with signal 11, Segmentation fault.0</h2>
[Program terminated with signal 11, Segmentation fault.0 0x00007f7113d037f1 in ?? ()](https://blog.csdn.net/xufandecsdn/article/details/80609546)
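Background: building openmit-ps on Linux produces the executables bin/{client,server}. Running bin/server crashes, and the core file shows the error above, i.e. **the program segfaults before it even enters main**.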
```sh
Program terminated with signal 11, Segmentation fault.
#0 0x00007f7113d037f1 in ?? ()
(gdb) bt
#0 0x00007f7113d037f1 in ?? ()
#1 0x0000000000000000 in ?? ()
(gdb)
```
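Diagnosis: single-stepping in gdb exposed the problem. Checking the build options showed that `CMAKE_CXX_FLAGS` contained `-shared -fPIC`. **The flags for building a shared library had been mistakenly applied to the executable!**
Solution: **remove the shared-library flags from the options used to build the executable**.
Question: [adding -static to the executable gives /bin/ld: cannot find -lopenmit_idl (the directory contains only the shared library, no static one); without -static the build succeeds](https://www.cnblogs.com/yunsicai/p/3191002.html).
The reason: the linker (ld) prefers shared libraries by default, but with -static it links only static libraries when building the executable, **so if the static library does not exist it reports cannot find**.
Similar errors: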
```sh
Reading symbols from featurex/featurex/test/test...done.
[New Thread 6541]
Core was generated by `./test'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000000005d6 in ?? ()
(gdb) bt
#0 0x00000000000005d6 in ?? ()
#1 0x00007f4f639d5600 in main (argc=Unhandled dwarf expression opcode 0xf3
) at /opt/meituan/zhouyong03/featurex/featurex/test/test.cc:4
(gdb)
```
<h2 id="12.Heap-check-constructor-called-twice.">12. Check failed: !`internal_init_start_has_run`: Heap-check constructor called twice. Perhaps you both linked in the heap checker, and also used LD_PRELOAD to load it? Aborted (core dumped)</h2>
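Background: in openmit-ps both the shared library and the executable were linked against the static tcmalloc library (libtcmalloc.a), and running `/bin/server` produced the error above. Linking the shared library (libtcmalloc.so) instead runs normally.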
---
### 13. **[-bash: ./bin/server: /lib/ld64.so.1: bad ELF interpreter: No such file or directory](https://stackoverflow.com/questions/23604471/ld64-so-present-in-ldd-missing-at-runtime)**
Background: after building openmit-ps/test/server.cc with the command below, running `./bin/server` reports the error above.
```sh
/bin/g++ -std=c++11 -g -O3 -Wall -static -DNDEBUG -DHAVE_NETINET_IN_H -fopenmp -O2 -DNDEBUG -rdynamic CMakeFiles/server.dir/server.o -o ../../bin/server -L/home/sankuai/.openmit_deps/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps/../openmit-common/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps/../openmit-idl/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps/lib -L/opt/meituan/zhouyong03/openmix/openmit-ps -L/home/sankuai/.openmit_deps/lib64 -static -Wl,--whole-archive -lopenmit_ps -Wl,--no-whole-archive -Wl,--eh-frame-hdr -Wl,-Bstatic -lopenmit_idl -lopenmit_common -lboost_system -lboost_thread -lthriftnb -lthrift -lprotobuf -lprotoc -lprotobuf-lite -lsnappy -levent -lssl -lglog -lgflags -ltcmalloc_minimal -Wl,-Bdynamic -lpthread -lrt
```
Solution: run `ldd` on the executable to see the details.
```
openmit-ps[master*]$ ldd bin/server
linux-vdso.so.1 => (0x00007fff1f5fe000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f451a3ef000)
librt.so.1 => /lib64/librt.so.1 (0x00007f451a1e6000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f4519edd000)
libm.so.6 => /lib64/libm.so.6 (0x00007f4519bdb000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f45199b4000)
libc.so.6 => /lib64/libc.so.6 (0x00007f45195f3000)
/lib/ld64.so.1 => /lib64/ld-linux-x86-64.so.2 (0x00007f451a624000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f45193dd000)
```
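The interpreter path recorded at link time (/lib/ld64.so.1) does not match the one available at run time.
Fix: specify the dynamic linker explicitly in `LINK_FLAGS` by adding `-Wl,--dynamic-linker=/lib64/ld-linux-x86-64.so.2`. The executable then runs.
How to set link options in CMake: [`SET_TARGET_PROPERTIES(foo PROPERTIES LINK_FLAGS -Wl,-whole-archive -lopenmit_ps ...)`](https://cmake.org/pipermail/cmake/2003-August/004244.html)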
---
### 14. [`Exception in thread "main" java.lang.UnsatisfiedLinkError: libfeature_extractor.so: libfeature_extractor.so: cannot allocate memory in static TLS block`]()
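The error occurs when JNI loads the .so: the static TLS (thread-local storage) block cannot be allocated. The .so must not contain static variables, global variables, and the like. The fix here: drop `-ltcmalloc_minimal` when building the .so, rebuild it, then call System.load() again. The build configuration: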
```sh
env_config['CPPPATH'] += path
env_config['LIBPATH'] += lib_path
env_config['LIBS'] = libs
env_config['CXXFLAGS'] += ' -Wno-unused-local-typedefs -Wno-unused-variable'
env_config['LINKFLAGS'] += ' -Wl,--eh-frame-hdr'
env = Environment(**env_config)
#env['_LIBFLAGS'] = '-Wl,-Bstatic -lmlxproto -lmlxcommon -lprotobuf -lglog -lgflags -lboost_system -lboost_timer -lboost_chrono -ltcmalloc_minimal -lcityhash -lrt -Wl,-Bdynamic -lpthread'
env['_LIBFLAGS'] = '-Wl,-Bstatic -lmlxproto -lmlxcommon -lprotobuf -lglog -lgflags -lboost_system -lboost_timer -lboost_chrono -lcityhash -lrt -Wl,-Bdynamic -lpthread'
#sources = [Glob('../../src/*.cc')] + [Glob('../../proto/*.cc')] + [Glob('../../src/features/*.cc')] + [Glob("./*.cc")]
env.SharedLibrary('feature_extractor', sources)
```
---
### 15. [malloc: *** error for object 0x7f86a3c08620: pointer being freed was not allocated]()
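Cause: the pointer was never initialized and held an arbitrary non-null address; freeing it reports "pointer being freed was not allocated".
Solution: always initialize pointers, or set them to null. For example: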
```c++
class Tensor {
// ....
private:
TensorShape tensor_shape_;
TensorBuffer<T>* buf_ = nullptr; // must be initialized
}; // class Tensor
```
Case2:
```c++
core_framework_executor_test(69510,0x7fff7b20f000) malloc: *** error for object 0x3b03126f3b03126f: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
^CAbort trap: 6 (core dumped)
```
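Cause: the error appeared after the code was changed from the if/else chain (lines 1~13 of the snippet below) to the form now shown in the comment (lines 15~23); the root cause is still to be investigated.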
```c++
if (is_lbcast && is_rbcast) {
Y.device(d) = X0.reshape(lreshape_dims).broadcast(lbcast_dims).binaryExpr(
X1.reshape(rreshape_dims).broadcast(rbcast_dims), typename FUNCTOR::func());
} else if (is_lbcast && !is_rbcast) {
Y.device(d) = X0.reshape(lreshape_dims).broadcast(lbcast_dims).binaryExpr(
X1.reshape(rreshape_dims), typename FUNCTOR::func());
} else if (!is_lbcast && is_rbcast) {
Y.device(d) = X0.reshape(lreshape_dims).binaryExpr(
X1.reshape(rreshape_dims).broadcast(rbcast_dims), typename FUNCTOR::func());
} else {
Y.device(d) = X0.reshape(lreshape_dims).binaryExpr(
X1.reshape(rreshape_dims), typename FUNCTOR::func());
}
/*
auto X00 = X0.reshape(lreshape_dims);
if (is_lbcast) {
X00 = X00.broadcast(lbcast_dims);
}
auto X11 = X1.reshape(rreshape_dims);
if (is_rbcast) {
X11 = X11.broadcast(rbcast_dims);
}
Y.device(d) = X00.binaryExpr(X11, typename FUNCTOR::func());
*/
LOG(INFO) << "Y:\n" << Y;
```
Case 3: assigning through a pointer parameter versus reassigning the local pointer
```c++
void GradientRegistry::Lookup(const std::string& op, GradCreator* creator) {
auto it = grad_creator_mapper_.find(op);
CHECK(it != grad_creator_mapper_.end())
<< op << " not in gradient registry.";
LOG(DEBUG) << op << " has exists.";
//creator = &(it->second); // Error: core dump. Hard to track down because GradCreator is a function pointer
*creator = it->second;
}
```
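
The difference between the two assignments can be reproduced in isolation. In this sketch `Creator`, `Twice`, `LookupWrong` and `LookupRight` are hypothetical names, not part of GradientRegistry:

```cpp
#include <cassert>

// Hypothetical function-pointer type, standing in for GradCreator.
using Creator = int (*)(int);

int Twice(int x) { return 2 * x; }

// Bug: `out` is a local copy of the caller's pointer. Reassigning it changes
// only that copy, so the caller's Creator is left untouched.
void LookupWrong(Creator* out) {
  static Creator stored = &Twice;
  out = &stored;
  (void)out;  // silence the unused-value warning
}

// Fix: write through the pointer so the caller's object receives the value.
void LookupRight(Creator* out) {
  static Creator stored = &Twice;
  *out = stored;
}
```

After `LookupWrong`, the caller still holds whatever value it had before, and calling through it crashes if that value was null or uninitialized.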
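About operator overloading:
> One point to note: **the return value must not be a reference**. A reference would refer to v0, a local variable that is destroyed when the function returns, so the reference would dangle. Returning MyVector by value instead hands the caller its own object, copied or moved from v0 before v0 is destroyed, so **the caller gets a valid copy**. The copy constructor therefore has to be written by hand when the default one is not sufficient.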
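
A minimal sketch of an `operator+` that returns by value. The layout of `MyVector` is an assumption; the original definition is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout for the MyVector type mentioned above.
struct MyVector {
  std::vector<int> v;
};

// Returns by value. Returning `MyVector&` here would hand the caller a
// reference to v0, which no longer exists once the function returns.
MyVector operator+(const MyVector& a, const MyVector& b) {
  MyVector v0;
  v0.v.resize(a.v.size());
  for (std::size_t i = 0; i < a.v.size(); ++i) {
    // Elements missing from b are treated as 0.
    v0.v[i] = a.v[i] + (i < b.v.size() ? b.v[i] : 0);
  }
  return v0;
}
```

Because `MyVector` holds only a `std::vector`, the compiler-generated copy and move operations are sufficient in this sketch; a hand-written copy constructor matters when the class owns raw resources.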
---
### 16. [error: no matching constructor for initialization of 'std::thread']()
Error message:
```c++
thread_local_test.cc:44:15: error: no matching constructor for initialization of 'std::thread'
std::thread t3(foo, 22);
^ ~~~~
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/thread:379:9: note: candidate constructor template not viable: requires single argument ‘f’, but 2 arguments were
provided
thread::thread(_Fp f)
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/thread:268:5: note: candidate constructor not viable: requires 1 argument, but 2 were provided
thread(const thread&);
^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/thread:275:5: note: candidate constructor not viable: requires 0 arguments, but 2 were provided
thread() _NOEXCEPT : __t_(0) {}
^
```
The code:
```c++
#include <thread> // c++11
void foo(int value) {
openmi::ThreadLocal<int> g_i;
g_i.Value() = value;
std::cout << "foo tid=" << std::this_thread::get_id() << ", n=" << g_i.Value() << std::endl;
}
std::thread t3(foo, 22);
t3.join();
```
Compile command: `g++ -g -pthread thread_local_test.cc -o xx`.
Solution: add `-std=c++11` to the compile command.
17. duplicate symbol __ZN6openmi4zeroE in:
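The full error at build time:
duplicate symbol __ZN6openmi4zeroE in:
Compile command: `g++ log_stream_test.cc log_stream.cc -o xx`
Analysis: the message says that a symbol is defined more than once, namely the variable zero in the openmi namespace. log_stream.h does indeed define this variable, derived from digits:
const char digits[] = "9876543210123456789";
In your own code, the most likely cause is that a header (*.h) _defines_ a variable or function and is included by more than one source file (.cc/.cpp).
Solution: move the computation of `zero` into the .cc file. To keep this from happening again, put declarations in headers and definitions in source files. Avoid definitions in headers, except for inline functions; globals should only be declared there with `extern`.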
18. tools/logging2.cc:40:23: error: member function 'Length' not viable: 'this' argument has type 'const openmi::LogStream::Buffer'
The full error:
openmit-common/tools/logging2.cc:40:23: error: member function 'Length' not viable: 'this' argument has type 'const openmi::LogStream::Buffer'
Problem: a non-const member function is called on a const object.
The same error elsewhere:
tensor_shape.cc:59:23: error: member function 'Shape' not viable: 'this' argument has type 'const openmi::TensorShape',
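
The usual fix is to mark read-only member functions `const`. A sketch with a hypothetical `Buffer` class (not the real `openmi::LogStream::Buffer`):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a small fixed-size log buffer.
class Buffer {
 public:
  // Modifies the object, so it stays non-const.
  void Append(char c) {
    if (len_ < sizeof(data_)) data_[len_++] = c;
  }

  // Only reads state: the trailing `const` makes it callable on const objects.
  std::size_t Length() const { return len_; }

 private:
  char data_[16] = {};
  std::size_t len_ = 0;
};

// Takes a const reference; compiles only because Length() is const-qualified.
std::size_t Measure(const Buffer& b) { return b.Length(); }
```

Removing the `const` from `Length()` makes `Measure` fail with the same "'this' argument has type 'const ...'" diagnostic.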
19. note: candidate template ignored: invalid explicitly-specified argument for template parameter 'NDIMS' typename TTypes<T, NDIMS>::Tensor TensorType();
The full error:
/Users/zhouyong03/myhome/openmit/openmix/openmit-mix/unittest/core_framework_tensor_test.cc:13:21: error: no matching member function for call to 'TensorType'
Problem: the template parameter NDIMS is declared as template …
20. Bus error: 10 (core dumped)
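Problem: the program reports a bus error.
Cause: an object was used before it had been created with new (i.e. before any memory was allocated for it).
data_->fullname_ = file;
Fix: move lines 1-3 of the original snippet (the assignments through `data_`) after its line 6, i.e. after the object has been allocated.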
21. error: allocation of incomplete type 'Eigen::ThreadPoolDevice'
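Error at compile time:
openmi/core/common_runtime/local_device.cc:13:11: error: allocation of incomplete type 'Eigen::ThreadPoolDevice'
Cause: this error means the compiler has seen only a declaration of the struct or class, not its definition; here the forward declaration on line 7 of device.h has no matching definition. `Eigen::ThreadPoolDevice` is incomplete because the Eigen thread pool is only available when compiling with -DEIGEN_USE_THREADS.
Solution: add the option -DEIGEN_USE_THREADS when compiling.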
22. error: C++ requires a type specifier for all declarations
Cause: the statements were written outside any function.
Fix: move the statements into a function.
23. malloc: *** error for object 0x7ff62a6010e8: incorrect checksum for freed object - object was probably modified after being freed.
The full error message:
core_framework_executor_test(87702,0x7fff7b20f000) malloc: *** error for object 0x7ff62a6010e8: incorrect checksum for freed object - object was probably modified after being freed.
Cause: an Eigen matrix multiplication whose rows and cols do not match. For example, in w*x = y, a transpose mistake in the forward computation also corrupts what follows.
Fix: check in code that the shape of W*X equals the shape of Y, so the problem surfaces early.
24. 'operator()' cannot be the name of a variable or data member
Error log and code
openmi/core/softmax_op.cc:10:17: error: expected ')'
Code
template <typename Device, typename T>
Cause: the functor is declared incorrectly. It should be `operator()(....)`; the `()` between `operator` and the parameter list must not be omitted.
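
For reference, a minimal well-formed functor looks like this. `Scale` is a hypothetical example, unrelated to softmax_op.cc:

```cpp
#include <cassert>

// A functor: an object that can be called like a function.
template <typename T>
struct Scale {
  T factor;

  // The first `()` is part of the operator's name; the second pair holds
  // the parameter list. Writing `T operator(const T& x)` does not compile.
  T operator()(const T& x) const { return factor * x; }
};
```

An instance is then called like a function, for example `Scale<int> s{3}; s(4);`.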
25. Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function evalSubExprsIfNeeded
Error log and code:
Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function evalSubExprsIfNeeded, file /Users/zhouyong03/myhome/openmit/tech-stacks/ml_eigen/third_party/deps/eigen/include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorAssign.h, line 122.
Code:
auto d = context->template eigen_device<Device>();
The message means that the two sides of the assignment have different ranks. `X.sum()` yields rank 0, while `X.sum(depth_dim)` yields rank NDIM-1, where NDIM is the rank of X. There are two fixes:
- Redefine the rank of Y, e.g. `TensorMap<T, NDIM-1> Y(${Y_dims})`
- Change the computation to `Y.device(d) = X.sum(depth_dim).eval().reshape(${Y_dims})`
26. strncpy core dump
Code
#include <string>
zhouyong03deMacBook-Pro:unittest zhouyong03$ g++ -std=c++11 strncpy_test.cc -o xx
Cause: strsep keeps advancing the buf pointer and finally sets it to NULL, so the next call that uses buf crashes.
27. basic_string::_S_construct NULL not valid
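The full error:
terminate called after throwing an instance of 'std::logic_error'
The error itself is clear: a std::string was constructed from a null pointer. The hard part is locating the code where the initializer can be null. valgrind pointed to the following:
==29007== 144 bytes in 1 blocks are possibly lost in loss record 5 of 7
This led to a failed hostname lookup, after which the code returned NULL. The faulty code:
std::string SystemInfo::Hostname() {
Fix: 1. when gethostname fails, do not return NULL; return a default value instead. 2. The lookup failed because the array `char hostname[32]` was too small, so enlarge it. The corrected code:
std::string SystemInfo::Hostname() {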
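
Since the original listing is cut off, here is a hedged sketch of what such a function might look like on a POSIX system. It is written as a free function, and the buffer size of 256 and the fallback string `"unknownhost"` are assumptions, not the author's values:

```cpp
#include <cassert>
#include <string>
#include <unistd.h>  // gethostname (POSIX)

// Sketch only: never constructs a std::string from a null pointer.
std::string Hostname() {
  // Zero-filled and passed as size - 1, so the result is always terminated.
  char buf[256] = {0};
  if (gethostname(buf, sizeof(buf) - 1) != 0 || buf[0] == '\0') {
    return "unknownhost";  // hypothetical default value
  }
  return std::string(buf);
}
```

The key point is that every return path yields a valid string, so `basic_string::_S_construct NULL not valid` cannot be triggered from here.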
28. Conditional jump or move depends on uninitialised value(s)
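==6081== Conditional jump or move depends on uninitialised value(s)
Cause:
Errors such as `Uninitialised value was created by a heap allocation` / `Conditional jump or move depends on uninitialised value(s)` mean that some variable is used before it is initialized.
Solution: 1. find the uninitialized variables (including variables of built-in types) and initialize them; 2. for variables backed by allocated memory, try calloc instead of malloc.
In this case the cause was that `LogFile::thread_safe_`, `LogFile::flush_interval_`, and other members were never initialized.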
29: Cannot access memory at address 0x7f0ca4de1d08 …
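Failed to read a valid object file image from memory.
This error appeared when the featurex framework initialized `feature_extractor` in production. Scenario: the 到综 and 美食 (food) configurations are initialized at the same time, and each initialization creates `num_thread` instances.
A comment found online says this about the error:
"a valid object file image" has nothing to do with pictures! It is probably memory corruption caused by multithreading, which is hard to track down.
Here, opening the core with `gdb java8 ${corefile}` and running bt shows:
Loaded symbols for /usr/local/jdk1.8.0_45/jre/lib/amd64/libawt.so
The failure is in ClearProtoObject at feature_extractor.cc:575, implemented as:
void FeatureExtractor::ClearProtoObject() {
The call to `cond_.wait` is what fails; for now the background thread that cleans up proto objects has been removed.
The core file is opened with `gdb java8` because opening it with plain gdb prints: Core was generated by `/usr/local/java8/bin/java -server -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-'.
So the executable java8 has to be given explicitly (note: not java).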
30 unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h:39:7: error: class template partial
The full error:
~/.openmi_deps/include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h:39:7: error: class template partial
Cause: after upgrading macOS to 10.15, the newer clang checks templates more strictly.
Fix:
In unsupported/Eigen/CXX11/src/Tensor/TensorStorage.h, change …
31. dyld: lazy symbol binding failed: Symbol not found: ___emutls_get_address
./xgboost demo/binary_classification/mushroom.conf
The build succeeds, but at run time the symbol cannot be found. The main cause is that an old gcc runtime library is loaded at run time while the build used the new gcc libraries (the build on macOS used gcc-9; the default is 4.2.1).
Solution 1: in ~/.bash_profile, set `DYLD_FALLBACK_LIBRARY_PATH` to the lib path of the new gcc:
export DYLD_FALLBACK_LIBRARY_PATH=/usr/local/Cellar/gcc/9.2.0_2/lib/gcc/9
Solution 2: on macOS, use `DYLD_LIBRARY_PATH` in place of `LD_LIBRARY_PATH`.
The underlying reason is that on macOS dynamic library paths belong in `DYLD_LIBRARY_PATH`, not `LD_LIBRARY_PATH`.
macOS searches for dynamic libraries in this order: `DYLD_LIBRARY_PATH` -> `DYLD_FALLBACK_FRAMEWORK_PATH` | `DYLD_FALLBACK_LIBRARY_PATH`. The documentation of `DYLD_LIBRARY_PATH` reads:
This is a colon separated list of directories that contain libraries. The dynamic linker searches these directories before it searches the default locations for libraries. It allows you to test new versions of existing libraries. For each library that a program uses, the dynamic linker looks for it in each directory in DYLD_LIBRARY_PATH in turn. If it still can't find the library, it then searches DYLD_FALLBACK_FRAMEWORK_PATH and DYLD_FALLBACK_LIBRARY_PATH in turn.
32. libtool: error: unrecognised option: ‘-static’
Problem: installing boost on macOS fails with the error below; libtool had been installed with brew.
libtool: unrecognized option `-static'
Fix: use the libtool that ships with macOS (located in /Library/Developer/CommandLineTools/usr/bin).
33. /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20’ not found
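Problem: /libcaml_featurex.so: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found
Cause: the build used the libstdc++.so.6 from the gcc 4.9.2 directory, while the offline cluster loads /usr/lib64/libstdc++.so.6 by default at run time; the former is the newer version.
Solution: have the project's CMakeLists.txt link the static libstdc++ from gcc 4.9.2, for example:
add_library(stdc++ STATIC IMPORTED)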
Programming tips
Callback functions
class OpKernelBase {