问题背景
最近工作中遇到了一个非常奇怪的crash问题,反反复复分析了好久。由于保密原因,这里只记录一下分析思路,不会涉及到代码。
初步log分析
tombstone显示这是一个abort:
Cmdline: /vendor/bin/hw/vendor.qti.camera.provider@2.7-service_64
pid: 1189, tid: 20934, name: HwBinder:1189_5 >>> /vendor/bin/hw/vendor.qti.camera.provider@2.7-service_64 <<<
uid: 1047
tagged_addr_ctrl: 0000000000000001
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
Abort message: 'terminating with uncaught exception of type NSt6__ndk112system_errorE'
x0 0000000000000000 x1 00000000000051c6 x2 0000000000000006 x3 0000006fefc1b180
x4 736f646277641f73 x5 736f646277641f73 x6 736f646277641f73 x7 7f7f7f7f7f7f7f7f
x8 00000000000000f0 x9 883ddec157d9b7a4 x10 0000000000000000 x11 ffffff80ffffffdf
x12 0000000000000001 x13 000000007fffffff x14 0000000000097c44 x15 000000082449409b
x16 0000007197d09d30 x17 0000007197ce3650 x18 0000006fd46cc000 x19 00000000000004a5
x20 00000000000051c6 x21 00000000ffffffff x22 0000007197d0d210 x23 0000006fefc1b2f0
x24 0000006fefc1b2b0 x25 0000000000000048 x26 b40000708493eb00 x27 b40000708636d880
x28 b40000708636d880 x29 0000006fefc1b200
lr 0000007197c939fc sp 0000006fefc1b160 pc 0000007197c93a28 pst 0000000000001000
backtrace:
#00 pc 000000000008aa28 /apex/com.android.runtime/lib64/bionic/libc.so (abort+168) (BuildId: 94065bf91428f6ae9fb310c478171302)
#01 pc 000000000004911c /apex/com.android.vndk.v31/lib64/libc++.so (abort_message+252) (BuildId: 3800ebf760648abf274c4f36afc38008)
#02 pc 0000000000049310 /apex/com.android.vndk.v31/lib64/libc++.so (demangling_terminate_handler()+240) (BuildId: 3800ebf760648abf274c4f36afc38008)
#03 pc 0000000000049ecc /apex/com.android.vndk.v31/lib64/libc++.so (std::__terminate(void (*)())+16) (BuildId: 3800ebf760648abf274c4f36afc38008)
#04 pc 0000000000049e64 /apex/com.android.vndk.v31/lib64/libc++.so (std::terminate()+56) (BuildId: 3800ebf760648abf274c4f36afc38008)
#05 pc 000000000058694c /vendor/lib64/hw/camera.qcom.so (__clang_call_terminate+12) (BuildId: 37fa0aa7aeb902db4b7d78f719568eb3)
#06 pc 0000000000eb2190 /vendor/lib64/hw/camera.qcom.so (CamX::MiTrackerNode::~MiTrackerNode()+680) (BuildId: 37fa0aa7aeb902db4b7d78f719568eb3)
#07 pc 0000000000eb21ac /vendor/lib64/hw/camera.qcom.so (CamX::MiTrackerNode::~MiTrackerNode()+20) (BuildId: 37fa0aa7aeb902db4b7d78f719568eb3)
#08 pc 0000000000f9a448 /vendor/lib64/hw/camera.qcom.so (CamX::Node::Destroy()+5392) (BuildId: 37fa0aa7aeb902db4b7d78f719568eb3)
#09 pc 0000000000fd72e4 /vendor/lib64/hw/camera.qcom.so (CamX::Pipeline::DestroyNodes()+108) (BuildId: 37fa0aa7aeb902db4b7d78f719568eb3)
第一反应是通过addr2line找到异常发生的点:
addr2line -Cife out/target/product/zeus/symbols/vendor/lib64/hw/camera.qcom.so 0000000000eb2190 ~MiTrackerNode vendor/qcom/proprietary/camx-common/../camx/src/swl/mitrackaf/camxmitrackafnode.cpp:229 (discriminator 8)
addr2line -Cife out/target/product/zeus/symbols/vendor/lib64/hw/camera.qcom.so 0000000000eb21ac ~MiTrackerNode vendor/qcom/proprietary/camx-common/../camx/src/swl/mitrackaf/camxmitrackafnode.cpp:175
解析地址,发现addr2line的结果指向了函数的两个括号处{}:
/// Node::~Node
MiTrackerNode::~MiTrackerNode()
{ //175行
CamxResult result = CamxResultSuccess;
CAMX_LOG_VERBOSE(CamxLogGroupTracker,"m_algoInitFlag %d m_pMiTracker %p cameraId %d",
m_algoInitFlag,m_pMiTracker,m_cameraInfo.cameraId);
CamxTime exitstartTime;
CamxTime exitendTime;
OsUtils::GetTime(&exitstartTime);
if (pthread_join(m_hInitThread, NULL) != 0)
{
CAMX_LOG_INFO(CamxLogGroupTracker,"m_hInitThread dead already\n");
}
//
.................................
//
OsUtils::GetTime(&exitendTime);
CAMX_LOG_VERBOSE(CamxLogGroupTracker, " ~MiTrackerNode time = %d ms",
(OsUtils::CamxTimeToMillis(&exitendTime) - OsUtils::CamxTimeToMillis(&exitstartTime)));
} //229行
意思是第一行代码的执行就出了问题?听起来不太合理,看来addr2line不能帮忙定位到问题。
使用objdump分析
换到objdump看下这段代码到底有什么特别的地方,使得addr2line不能工作。以下是通过aarch64-linux-gnu-objdump.exe -D E:\log\camera.qcom.so | less定位到的出问题附近的汇编代码块:
根据上面的分析,分别对这几个地址做解码:
E:\log\camlog-1112150636>D:\tools\gcc-linaro-7.5.0-2019.12-i686-mingw32_aarch64-linux-gnu\bin\aarch64-linux-gnu-addr2line.exe -e E:\log\camera.qcom.so 0xeb2108
vendor/qcom/proprietary/camx-common/../camx/src/swl/mitrackaf/camxmitrackafnode.cpp:227 (discriminator 14)
E:\log\camlog-1112150636>D:\tools\gcc-linaro-7.5.0-2019.12-i686-mingw32_aarch64-linux-gnu\bin\aarch64-linux-gnu-addr2line.exe -e E:\log\camera.qcom.so 0xeb20c4
vendor/qcom/proprietary/camx-api/camx/utils/camxosutils.h:1518 (discriminator 4)
E:\log\camlog-1112150636>D:\tools\gcc-linaro-7.5.0-2019.12-i686-mingw32_aarch64-linux-gnu\bin\aarch64-linux-gnu-addr2line.exe -e E:\log\camera.qcom.so 0xeb20d8
vendor/qcom/proprietary/camx-api/camx/utils/camxosutils.h:1518 (discriminator 4)
E:\log\camlog-1112150636>D:\tools\gcc-linaro-7.5.0-2019.12-i686-mingw32_aarch64-linux-gnu\bin\aarch64-linux-gnu-addr2line.exe -e E:\log\camera.qcom.so 0xeb2150
vendor/qcom/proprietary/camx-common/../camx/src/swl/mitrackaf/camxmitrackafnode.cpp:?
指向的代码分别是这些:
static UINT32 CamxTimeToMillis(
CamxTime* pTime)
{
return (pTime == NULL) ? 0 : ((1000 * pTime->seconds) + ((pTime->nanoSeconds) / 1000000L));//这行代码就是vendor/qcom/proprietary/camx-api/camx/utils/camxosutils.h:1518
}
/// MiTrackerNode::~MiTrackerNode
MiTrackerNode::~MiTrackerNode()
{ //175行
OsUtils::GetTime(&exitendTime);
CAMX_LOG_VERBOSE(CamxLogGroupTracker, " ~MiTrackerNode time = %d ms", //这行代码就是vendor/qcom/proprietary/camx-common/../camx/src/swl/mitrackaf/camxmitrackafnode.cpp:227
(OsUtils::CamxTimeToMillis(&exitendTime) - OsUtils::CamxTimeToMillis(&exitstartTime)));
} //229行
由此可以推测,这个terminate的源头很可能就在这个CamxTimeToMillis的调用上。具体产生的原因暂不明了,但删除这行代码后,这个错误就消失了。
复现后继续分析
不久,问题再次复现,并且这一次现象变得更加奇怪了。
这一次的backtrace跟上一次一模一样:
11-09 12:15:57.206 29142 29142 F DEBUG : pid: 12760, tid: 28582, name: HwBinder:12760_ >>> /vendor/bin/hw/vendor.qti.camera.provider@2.7-service_64 <<<
11-09 12:15:57.207 29142 29142 F DEBUG : uid: 1047
11-09 12:15:57.207 29142 29142 F DEBUG : tagged_addr_ctrl: 0000000000000001
11-09 12:15:57.207 29142 29142 F DEBUG : signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
11-09 12:15:57.207 29142 29142 F DEBUG : Abort message: 'terminating with uncaught exception of type NSt6__ndk112system_errorE'
11-09 12:15:57.207 29142 29142 F DEBUG : x0 0000000000000000 x1 0000000000006fa6 x2 0000000000000006 x3 0000007559da01a0
11-09 12:15:57.207 29142 29142 F DEBUG : x4 736f646277641f73 x5 736f646277641f73 x6 736f646277641f73 x7 7f7f7f7f7f7f7f7f
11-09 12:15:57.208 29142 29142 F DEBUG : x8 00000000000000f0 x9 54725e8097ae4b13 x10 0000000000000000 x11 ffffff80ffffffdf
11-09 12:15:57.208 29142 29142 F DEBUG : x12 0000000000000001 x13 000000007fffffff x14 000000000037366a x15 000000258cff67cb
11-09 12:15:57.208 29142 29142 F DEBUG : x16 00000079b89f9060 x17 00000079b89d54e0 x18 000000752fcf4000 x19 00000000000031d8
11-09 12:15:57.208 29142 29142 F DEBUG : x20 0000000000006fa6 x21 00000000ffffffff x22 00000079b89fb350 x23 0000007559da0310
11-09 12:15:57.208 29142 29142 F DEBUG : x24 0000007559da02d0 x25 0000000000000048 x26 b4000078083bda60 x27 b4000078287262c0
11-09 12:15:57.208 29142 29142 F DEBUG : x28 b4000078287262c0 x29 0000007559da0220
11-09 12:15:57.208 29142 29142 F DEBUG : lr 00000079b898583c sp 0000007559da0180 pc 00000079b8985868 pst 0000000000001000
11-09 12:15:57.209 29142 29142 F DEBUG : backtrace:
11-09 12:15:57.209 29142 29142 F DEBUG : #00 pc 0000000000051868 /apex/com.android.runtime/lib64/bionic/libc.so (abort+168) (BuildId: b15ac47006c5c2933d0526c330c56105)
11-09 12:15:57.209 29142 29142 F DEBUG : #01 pc 000000000004911c /apex/com.android.vndk.v31/lib64/libc++.so (abort_message+252) (BuildId: 3800ebf760648abf274c4f36afc38008)
11-09 12:15:57.209 29142 29142 F DEBUG : #02 pc 0000000000049310 /apex/com.android.vndk.v31/lib64/libc++.so (demangling_terminate_handler()+240) (BuildId: 3800ebf760648abf274c4f36afc38008)
11-09 12:15:57.209 29142 29142 F DEBUG : #03 pc 0000000000049ecc /apex/com.android.vndk.v31/lib64/libc++.so (std::__terminate(void (*)())+16) (BuildId: 3800ebf760648abf274c4f36afc38008)
11-09 12:15:57.209 29142 29142 F DEBUG : #04 pc 0000000000049e64 /apex/com.android.vndk.v31/lib64/libc++.so (std::terminate()+56) (BuildId: 3800ebf760648abf274c4f36afc38008)
11-09 12:15:57.209 29142 29142 F DEBUG : #05 pc 00000000005afcf4 /vendor/lib64/hw/camera.qcom.so (__clang_call_terminate+12) (BuildId: 882d57000dbf93c8161629a99404f1c4)
11-09 12:15:57.209 29142 29142 F DEBUG : #06 pc 0000000000f97fec /vendor/lib64/hw/camera.qcom.so (CamX::MiTrackerNode::~MiTrackerNode()+444) (BuildId: 882d57000dbf93c8161629a99404f1c4)
11-09 12:15:57.209 29142 29142 F DEBUG : #07 pc 0000000000f98004 /vendor/lib64/hw/camera.qcom.so (CamX::MiTrackerNode::~MiTrackerNode()+20) (BuildId: 882d57000dbf93c8161629a99404f1c4)
11-09 12:15:57.209 29142 29142 F DEBUG : #08 pc 0000000001082d10 /vendor/lib64/hw/camera.qcom.so (CamX::Node::Destroy()+5440) (BuildId: 882d57000dbf93c8161629a99404f1c4)
同样addr2line也是指向CamX::MiTrackerNode::~MiTrackerNode()的开头。所以addr2line对于这个问题也同样不好使。更糟糕的是,这一次连objdump也不好使了:
死磕问题
山穷水尽的情况下只能反过来思考,__clang_call_terminate这个函数在什么情况下会被调用到。opengrok一搜,调用这个函数的地方太多了。只好再看下这个abort发生前的log信息,下面几行log让我像突然触电一样:
11-09 12:15:51.979 12760 28582 W libc : invalid pthread_t (0) passed to pthread_join
11-09 12:15:52.067 12760 28582 E libc++abi: terminating with uncaught exception of type NSt6__ndk112system_errorE
11-09 12:15:52.206 12760 28582 F libc : Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 28582 (HwBinder:12760_), pid 12760 (vendor.qti.came)
第一行提示了pthread处出现了一个警告,紧接着出现一个关键字:uncaught exception,这不就是C++中的exception机制吗?当一个exception没有通过try catch捕获的时候,就会出现这种情况。
C++的exception机制包括这几部分:
1. 对于每个 catch
语句的存在,编译器会在其所在的函数末尾加上一些特殊信息,包括当前函数可以捕获的异常表,以及清理表(cleanup table)。在进行栈展开时,会调用 libstdc++
提供的特殊函数(称为 personality routine),会检查栈上的所有函数哪个异常可以被捕获。如果其调用栈上没有任何一个函数安装了try-catch,那么就会转入默认的处理handler。
2. 类的析构函数如果没有try-catch语句,编译器会自动在这个析构函数最后面放一个默认的处理handler,也就是terminate。在出现异常的时候,直接跳转到这个默认处理handler里面执行。这也是为什么在这一次的汇编代码里面看不到有地方跳转到terminate,偏偏又执行到这里的原因。
详细介绍:C++ 异常是如何实现的 - 知乎
到这问题就比较清晰了,MiTrackerNode::~MiTrackerNode函数内部某个地方很可能发生了exception,而这个函数又没有安装try-catch,这导致exception的默认处理方法terminate被执行了。所以要找到问题,就需要在这个函数里面安装一个try-catch预语句,涵盖整个析构函数。这样就可以轻易抓到exception发生的源头,从而解决问题。目前正在等待问题复现并做最终的解决方案。我的预感是离问题解决也就是一步之遥了。