Problem overview
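The problem started when a PDU power outlet failed and the server shut down abnormally. After the Gerrit service was restarted, pushes to one of the projects failed with an error.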
- Production environment
OS: CentOS 6
Gerrit: 2.13.11
Database: H2, embedded mode
This server has been in service for a long time, and its Gerrit version is also quite old.
- Test environment
OS: Ubuntu 18.04
IP: 172.16.1.111
Gerrit: 2.13.11
Database: MySQL
Below is the error reported by the git client (already reproduced in the test environment) when running:
git push origin HEAD:refs/for/master
Output:
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 258 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1)
error: unpack failed: error Short read of block.
fatal: Unpack error, check server log
To ssh://172.16.1.111:29418/TEST_1
! [remote rejected] HEAD -> refs/for/master (n/a (unpacker error))
error: failed to push some refs to 'ssh://admin@172.16.1.111:29418/TEST_1'
Analyzing the problem
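I started by searching for related issues, for example:
git unpack error on push to gerrit
Most of the results were about a different error, unpack failed: error Missing unknown 613fd2557fba30aff2dbd51c3807cc57561bab08, rather than the one we had, error: unpack failed: error Short read of block.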
Below is the more detailed Gerrit error log, from gerrit/logs/error_log:
[2023-11-23 10:02:52,635] [SSH git-receive-pack '/TEST_1' (admin)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user admin account 1) during git-receive-pack '/TEST_1'
com.google.gerrit.sshd.BaseCommand$Failure: fatal: Unpack error, check server log
at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:159)
at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Unpack error on project "TEST_1":
AdvertiseRefsHook: org.eclipse.jgit.transport.AdvertiseRefsHookChain@144ab7f3class org.eclipse.jgit.transport.AdvertiseRefsHookChain
at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:158)
... 12 more
Caused by: org.eclipse.jgit.errors.UnpackException: Exception while parsing pack stream
at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:307)
at org.eclipse.jgit.transport.ReceivePack.receive(ReceivePack.java:206)
at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:97)
... 12 more
Caused by: java.io.EOFException: Short read of block.
at org.eclipse.jgit.util.IO.readFully(IO.java:249)
at org.eclipse.jgit.internal.storage.file.UnpackedObject.open(UnpackedObject.java:105)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openLooseObject(ObjectDirectory.java:444)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openLooseFromSelfOrAlternate(ObjectDirectory.java:403)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObject(ObjectDirectory.java:385)
at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:154)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.transport.BaseReceivePack.checkConnectivity(BaseReceivePack.java:1354)
at org.eclipse.jgit.transport.BaseReceivePack.receivePackAndCheckConnectivity(BaseReceivePack.java:1047)
at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:250)
... 14 more
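In a trace like this, the innermost "Caused by" is usually the one that matters. Here it is java.io.EOFException: Short read of block., thrown from IO.readFully inside UnpackedObject.open, which suggests JGit hit a truncated loose object while checking connectivity, rather than a problem with the pushed data itself. As a minimal sketch (root_cause is a hypothetical helper of my own, not part of Gerrit or JGit), the innermost cause and the frame that threw it can be pulled out of a log excerpt like this:

```python
def root_cause(log_text):
    """Return (exception, first_frame) for the last 'Caused by:' in a Java stack trace.

    Returns None if the text contains no 'Caused by:' line.
    """
    lines = log_text.splitlines()
    last = None
    for i, line in enumerate(lines):
        if line.lstrip().startswith("Caused by: "):
            last = i  # keep overwriting, so we end with the innermost (last) cause
    if last is None:
        return None
    exception = lines[last].lstrip()[len("Caused by: "):].strip()
    first_frame = None
    for line in lines[last + 1:]:
        stripped = line.strip()
        if stripped.startswith("at "):
            first_frame = stripped[len("at "):]  # the method that actually threw
            break
    return exception, first_frame
```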
Following suggestions found online, I tried pushing with the --no-thin option (which disables the thin-pack optimization), but got the same error.
Reproducing the problem
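At first I suspected a problem with Gerrit reviews, perhaps some review in the database that had not been closed properly. I couldn't find a similar error online, so I decided to set up Gerrit in a test environment and reproduce it there: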
- Set up a fresh Gerrit on a test server with the same version as production, using MySQL as the database; the test server runs Ubuntu 18.04
- Copy the production project to the test environment
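- Push again from the client. The problem reproduced, which means the database is not the cause; the repository itself (.git) is damaged. That made the next step clear: figure out how to repair .git.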
Running git log on the server prints an error:
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt
This confirms that the loose object file 1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 under objects/cf/ is missing or corrupt.
Running ll objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9:
-r--r--r-- 1 root root 0 Nov 24 09:29 objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
It is indeed an empty (0-byte) file.
Running git cat-file -p cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9:
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: Not a valid object name cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
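Why does one empty file break everything? A loose object is stored at objects/<first 2 hex digits>/<remaining 38 hex digits> as a zlib-compressed stream of "<type> <size>\0<content>", and its name is the SHA-1 of that uncompressed data. A zero-byte file cannot be inflated at all, which lines up with both git's "is empty" message and JGit's "Short read of block". The sketch below (function names are my own, for illustration only; this is not how git or JGit implement the check internally) validates a single loose object along those lines:

```python
import hashlib
import os
import zlib


def check_loose_bytes(oid, compressed):
    """Validate the raw bytes of a loose object file against its expected id."""
    if not compressed:
        return "empty"  # the case git reports as "object file ... is empty"
    try:
        raw = zlib.decompress(compressed)
    except zlib.error:
        return "corrupt: bad zlib stream"  # e.g. a truncated file
    header, sep, body = raw.partition(b"\0")
    if not sep:
        return "corrupt: no header"
    try:
        obj_type, size = header.split(b" ")
        size = int(size)
    except ValueError:
        return "corrupt: bad header"
    if size != len(body):
        return "corrupt: size mismatch"
    # The object id is the SHA-1 of the whole uncompressed "<type> <size>\0<content>".
    if hashlib.sha1(raw).hexdigest() != oid:
        return "corrupt: hash mismatch"
    return "ok " + obj_type.decode("ascii", "replace")


def check_loose_object(git_dir, oid):
    """Read objects/<2 hex>/<38 hex> under git_dir and validate it."""
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    if not os.path.isfile(path):
        return "missing"
    with open(path, "rb") as f:
        return check_loose_bytes(oid, f.read())
```

On the damaged repository, check_loose_object(".", "cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9") run from the bare repository directory would be expected to report "empty".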
Repairing .git
Reference repair guide:
How to fix the Git error "object file … is empty"
Run git fsck --full --no-dangling to check for corrupt object files:
error: object file ./objects/2a/cdc20d19d8cae08ed8adb741511139bb316b86 is empty
error: unable to mmap ./objects/2a/cdc20d19d8cae08ed8adb741511139bb316b86: No such file or directory
error: 2acdc20d19d8cae08ed8adb741511139bb316b86: object corrupt or missing: ./objects/2a/cdc20d19d8cae08ed8adb741511139bb316b86
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: unable to mmap ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9: No such file or directory
error: cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9: object corrupt or missing: ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
error: object file ./objects/de/08318ad845bd960aedf0ab1ce85fc6e26f608f is empty
error: unable to mmap ./objects/de/08318ad845bd960aedf0ab1ce85fc6e26f608f: No such file or directory
error: de08318ad845bd960aedf0ab1ce85fc6e26f608f: object corrupt or missing: ./objects/de/08318ad845bd960aedf0ab1ce85fc6e26f608f
error: object file ./objects/df/8aded019cb1b23af4a4f3c5171472e76461a56 is empty
error: unable to mmap ./objects/df/8aded019cb1b23af4a4f3c5171472e76461a56: No such file or directory
error: df8aded019cb1b23af4a4f3c5171472e76461a56: object corrupt or missing: ./objects/df/8aded019cb1b23af4a4f3c5171472e76461a56
Checking object directories: 100% (256/256), done.
Checking objects: 100% (389140/389140), done.
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt
There are three other empty object files as well, but the final fatal error still names cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 as corrupt.
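To list every zero-byte loose object in one go (on Linux, find objects -type f -empty does roughly the same), a small sketch like the following can be used. It assumes a bare repository such as Gerrit's, where objects/ sits directly in the repository directory, and it only looks at loose objects, not packfiles; the function name is hypothetical.

```python
import os
import re

HEX2 = re.compile(r"^[0-9a-f]{2}$")    # fan-out directory name, e.g. "cf"
HEX38 = re.compile(r"^[0-9a-f]{38}$")  # rest of the object id


def find_empty_loose_objects(git_dir):
    """Return the ids of loose objects whose files are 0 bytes long."""
    objects = os.path.join(git_dir, "objects")
    found = []
    for sub in sorted(os.listdir(objects)):
        subdir = os.path.join(objects, sub)
        if not HEX2.match(sub) or not os.path.isdir(subdir):
            continue  # skips objects/info and objects/pack
        for name in sorted(os.listdir(subdir)):
            path = os.path.join(subdir, name)
            if HEX38.match(name) and os.path.getsize(path) == 0:
                found.append(sub + name)
    return found
```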
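Next I tried git prune, hoping it would remove the invalid object files from the repository (strictly speaking, git prune removes unreachable loose objects):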
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt
That failed as well.
Then I tried git gc, which cleans up unneeded objects and repacks the remaining ones:
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt
error: failed to run repack
Still an error.
So I kept searching and eventually found:
How can I fix a corrupted Git repository?
One of the answers there uses the git-repair tool:
sudo apt install git-repair
git-repair # Fix a broken Git repository
or
git-repair --force # Force repair, even if data is lost
git fsck # To verify it was fixed
Install it with apt install git-repair
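Then run the repair command, git-repair. Because this project is large (about 5 GB, with a great many small files), it ran for a long time and finally printed: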
Initialized empty Git repository in /tmp/tmprepoB5dr2w/.git/
1 missing objects could not be recovered!
If you have a clone of this bare repository, you should add it as a remote of this repository, and retry.
If there are no clones of this repository, you can instead retry with the --force parameter to force recovery to a possibly usable state.
It reports one missing object that could not be recovered. Running git fsck --full --no-dangling again to check gives:
Checking object directories: 100% (256/256), done.
error: refs/notes/review: invalid sha1 pointer cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
An invalid pointer. Let's look at it with cat refs/notes/review:
cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
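fsck's "invalid sha1 pointer" means a ref names an object that no longer exists. As a rough sketch of that idea (not git's implementation; all function names are hypothetical), one can read every ref from packed-refs and the refs/ directory and test each target. The default predicate only checks loose object files, so an object that lives in a pack would wrongly show up as missing; on a real repository, a predicate that runs git cat-file -e <oid> would be more accurate.

```python
import os


def read_refs(git_dir):
    """Map ref name -> object id from packed-refs plus loose files under refs/."""
    refs = {}
    packed = os.path.join(git_dir, "packed-refs")
    if os.path.isfile(packed):
        with open(packed) as f:
            for line in f:
                line = line.strip()
                if not line or line[0] in "#^":
                    continue  # header comment or peeled-tag line
                oid, name = line.split(" ", 1)
                refs[name] = oid
    for root, _dirs, files in os.walk(os.path.join(git_dir, "refs")):
        for fname in files:
            path = os.path.join(root, fname)
            with open(path) as f:
                value = f.read().strip()
            if value.startswith("ref: "):
                continue  # symbolic ref, nothing to check directly
            name = os.path.relpath(path, git_dir).replace(os.sep, "/")
            refs[name] = value  # a loose ref takes precedence over packed-refs
    return refs


def loose_object_exists(git_dir):
    """Predicate factory: True if a non-empty loose object file exists (packs are ignored)."""
    def exists(oid):
        path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
        return os.path.isfile(path) and os.path.getsize(path) > 0
    return exists


def refs_with_missing_objects(git_dir, object_exists=None):
    """Return sorted ref names whose target is rejected by object_exists()."""
    if object_exists is None:
        object_exists = loose_object_exists(git_dir)
    return sorted(name for name, oid in read_refs(git_dir).items()
                  if not object_exists(oid))
```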
So what is this ref?
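Based on a Git internals post I reposted earlier, this ref most likely belongs to Gerrit.
According to an online article on Gerrit access control, refs/notes/review is the branch where Gerrit stores code review information. Most likely this branch was lost, i.e. it had not been completely written to disk when the power failed.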
Push again from the client:
git push origin HEAD:refs/for/master
Output:
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 1.42 KiB | 0 bytes/s, done.
Total 2 (delta 0), reused 0 (delta 0)
error: unpack failed: error Missing unknown cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
fatal: Unpack error, check server log
To ssh://172.16.1.111:29418/TEST_1
! [remote rejected] HEAD -> refs/for/master (n/a (unpacker error))
error: failed to push some refs to 'ssh://admin@172.16.1.111:29418/TEST_1'
Now the error is the same one that most of our search results described. Let's look at the detailed server log again:
[2023-11-23 14:11:04,445] [SSH git-receive-pack '/TEST_1' (admin)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user admin account 1) during git-receive-pack '/TEST_1'
com.google.gerrit.sshd.BaseCommand$Failure: fatal: Unpack error, check server log
at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:159)
at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Unpack error on project "TEST_1":
AdvertiseRefsHook: org.eclipse.jgit.transport.AdvertiseRefsHookChain@6bca808eclass org.eclipse.jgit.transport.AdvertiseRefsHookChain
at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:158)
... 12 more
Caused by: org.eclipse.jgit.errors.UnpackException: Exception while parsing pack stream
at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:307)
at org.eclipse.jgit.transport.ReceivePack.receive(ReceivePack.java:206)
at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:97)
... 12 more
Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing unknown cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
at org.eclipse.jgit.transport.BaseReceivePack.checkConnectivity(BaseReceivePack.java:1354)
at org.eclipse.jgit.transport.BaseReceivePack.receivePackAndCheckConnectivity(BaseReceivePack.java:1047)
at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:250)
... 14 more
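Given the power outage, the lost branch probably cannot be recovered (and even if it could, it would likely cost considerable time and effort), so I simply removed the ref that points to cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9.
Run rm refs/notes/review, then check again with git fsck --full --no-dangling.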
Output:
Checking object directories: 100% (256/256), done.
Checking objects: 100% (2/2), done.
The check is now clean. Push again from the client:
git push origin HEAD:refs/for/master
Output:
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 258 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1)
remote: Counting objects: 2, done
remote: Processing changes: refs: 1, done
remote: ERROR: [d3ba5be] missing Change-Id in commit message footer
remote:
remote: Hint: To automatically insert Change-Id, install the hook:
remote: gitdir=$(git rev-parse --git-dir); scp -p -P 29418 admin@172.16.1.111:hooks/commit-msg ${gitdir}/hooks/
remote: And then amend the commit:
remote: git commit --amend
remote:
To ssh://172.16.1.111:29418/TEST_1
! [remote rejected] HEAD -> refs/for/master ([d3ba5be] missing Change-Id in commit message footer)
error: failed to push some refs to 'ssh://admin@172.16.1.111:29418/TEST_1'
The server now responds normally; it only complains about a missing Change-Id. Following the hint, run:
gitdir=$(git rev-parse --git-dir); scp -p -P 29418 admin@172.16.1.111:hooks/commit-msg ${gitdir}/hooks/
git commit --amend
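Gerrit rejected the first push because the footer (last paragraph) of the commit message had no Change-Id line; the commit-msg hook adds one of the form "Change-Id: I" followed by 40 hex digits. The sketch below is a simplified, client-side approximation of that footer check (Gerrit's real validation has more rules); the function name is hypothetical.

```python
import re

# Footer line as written by Gerrit's commit-msg hook: "Change-Id: I" + 40 lowercase hex digits.
CHANGE_ID_LINE = re.compile(r"^Change-Id: I[0-9a-f]{40}$")


def has_change_id_footer(message):
    """Simplified check: does the last paragraph of the message contain a Change-Id line?"""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    return any(CHANGE_ID_LINE.match(line.strip()) for line in paragraphs[-1].splitlines())
```

For example, a message consisting only of "test" fails the check, while the same message after git commit --amend with the hook installed should pass.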
Then push again:
Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 302 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1)
remote: Counting objects: 2, done
remote: Processing changes: new: 1, refs: 1, done
remote:
remote: New Changes:
remote: http://172.16.1.111/1 test
remote:
To ssh://172.16.1.111:29418/TEST_1
* [new branch] HEAD -> refs/for/master
At this point the test environment was repaired, and no further problems turned up.
- A note on how git-repair works
From the official site https://git-repair.branchable.com/, "how it works":
git-repair starts by deleting all corrupt objects, and retrieving all missing objects that it can from the remotes of the repository.
If that is not sufficient to fully recover the repository, it can also reset branches back to commits before the corruption happened, delete branches that are no longer available due to the lost data, and remove any missing files from the index. It will only do this if run with the --force option, since that rewrites history and throws out missing data.
After running this command, you will probably want to run git fsck to verify it fixed the repository.
Note that fsck may still complain about objects referenced by the reflog, or the stash, if they were unable to be recovered. This command does not try to clean up either the reflog or the stash.
Also note that the --force option never touches tags, even if they are no longer usable due to missing data, so fsck may also find problems with tags.
Since this command unpacks all packs in the repository, you may want to run git gc afterwards.
Deciding on the fix
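Production runs CentOS 6, which is quite old and for which git-repair is no longer available; building it from source would pull in many dependencies, and I didn't want to change the production environment.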
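In the end I decided to complete the repair in the test environment and then copy the git repository to production, taking a backup before doing so.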
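After copying the repaired repository to production, a test push worked normally. Finally, I asked the developers to confirm whether any files were missing from the repository.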
Final thoughts
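Although git is distributed and can survive file corruption (at worst you create a new repository and push again), it is still best to keep full off-site backups so that a situation like ours does not happen. Our current server is also quite old, so we are considering buying a new server and setting up RAID.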
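We currently use H2 in embedded mode, which only supports table-level locking. This is another problem: requests frequently time out with HTTP 500 errors, and the database log shows lock timeouts: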
org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "PATCH_SETS"; SQL statement:
SELECT T.revision,T.uploader_account_id,T.created_on,T.draft,T.groups,T.push_certificate,T.change_id,T.patch_set_id FROM patch_sets T WHERE T.change_id=? ORDER BY T.patch_set_id [50200-176]
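Before choosing between MVCC and MySQL, it may help to see which tables actually hit lock timeouts. A minimal sketch that counts H2's "Timeout trying to lock table" messages per table in a log (the function name is hypothetical, and it only matches the message format shown above):

```python
import re
from collections import Counter

# Matches e.g.: Timeout trying to lock table "PATCH_SETS"; SQL statement:
LOCK_TIMEOUT = re.compile(r'Timeout trying to lock table "([^"]+)"')


def count_lock_timeouts(lines):
    """Count lock-timeout messages per table name across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOCK_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file object works too, since it iterates line by line; counts.most_common() then lists the most contended tables first.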
We are considering running H2 in mixed mode with MVCC enabled, or simply switching the database to MySQL. That is a job for later and needs more study.
Let this be a lesson.