Repairing a Corrupted Gerrit Git Repository After a Server Power Loss


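Problem Overview

The problem started when a faulty PDU power strip caused the server to shut down abnormally. After the Gerrit service was restarted, pushes to one of the projects failed with an error.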

  • Production environment
    OS: CentOS 6
    Gerrit: 2.13.11
    Database: H2, embedded mode

This server is quite old, and its Gerrit version is very outdated as well.

  • Test environment
    OS: Ubuntu 18.04
    IP: 172.16.1.111
    Gerrit: 2.13.11
    Database: MySQL

Below is the error from the git client (this is the error as reproduced in the test environment).
Output of git push origin HEAD:refs/for/master:

Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 258 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1)
error: unpack failed: error Short read of block.
fatal: Unpack error, check server log
To ssh://172.16.1.111:29418/TEST_1
 ! [remote rejected] HEAD -> refs/for/master (n/a (unpacker error))
error: failed to push some refs to 'ssh://admin@172.16.1.111:29418/TEST_1'

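Analyzing the Problem

My first searches turned up related issues:
git unpack error on push to gerrit

Solving Gerrit's git unpack error (解决Gerrit的git unpack error问题)

Most of the results found online deal with a different cause: unpack failed: error Missing unknown 613fd2557fba30aff2dbd51c3807cc57561bab08

That is not our error, which is error: unpack failed: error Short read of block

Below is the more detailed error from the Gerrit log, gerrit/logs/error_log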

[2023-11-23 10:02:52,635] [SSH git-receive-pack '/TEST_1' (admin)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user admin account 1) during git-receive-pack '/TEST_1'
com.google.gerrit.sshd.BaseCommand$Failure: fatal: Unpack error, check server log
        at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:159)
        at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
        at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
        at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:442)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:417)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Unpack error on project "TEST_1":
  AdvertiseRefsHook: org.eclipse.jgit.transport.AdvertiseRefsHookChain@144ab7f3class org.eclipse.jgit.transport.AdvertiseRefsHookChain

        at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:158)
        ... 12 more
Caused by: org.eclipse.jgit.errors.UnpackException: Exception while parsing pack stream
        at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:307)
        at org.eclipse.jgit.transport.ReceivePack.receive(ReceivePack.java:206)
        at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:97)
        ... 12 more
Caused by: java.io.EOFException: Short read of block.
        at org.eclipse.jgit.util.IO.readFully(IO.java:249)
        at org.eclipse.jgit.internal.storage.file.UnpackedObject.open(UnpackedObject.java:105)
        at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openLooseObject(ObjectDirectory.java:444)
        at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openLooseFromSelfOrAlternate(ObjectDirectory.java:403)
        at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObject(ObjectDirectory.java:385)
        at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:154)
        at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
        at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
        at org.eclipse.jgit.transport.BaseReceivePack.checkConnectivity(BaseReceivePack.java:1354)
        at org.eclipse.jgit.transport.BaseReceivePack.receivePackAndCheckConnectivity(BaseReceivePack.java:1047)
        at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:250)
        ... 14 more

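Following some suggestions found online, I tried pushing with the --no-thin option (which disables the thin-pack optimization), but got the same error.

Reproducing the Problem

At first I suspected a problem with Gerrit's review data, perhaps a change in the database that had never been closed. I could not find a similar error online, so I decided to set up Gerrit in a test environment and reproduce it there.

  • Set up a Gerrit instance on the test server with the same version as production, using MySQL as the database

The test environment runs Ubuntu 18.04

  • Copy the project from production to the test environment
  • Push again from the client: the problem reproduces. Because it reproduces with a different database, the cause is not the database but the .git repository itself. That settles the direction: how to repair .git

Running git log on the server prints errors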

error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt

So the object file 1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (under objects/cf/) is either missing or corrupt.

Run ll objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (ll is a common alias for ls -l)

-r--r--r-- 1 root root 0 Nov 24 09:29 objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9

It is indeed an empty file.

Run git cat-file -p cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9

error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: Not a valid object name cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
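
The manual check above can be generalized. The following is a minimal sketch, not something from the original troubleshooting session: a hypothetical helper, list_empty_objects, that scans a repository for zero-byte loose object files and prints their object IDs. It assumes the standard loose-object layout objects/<2 hex chars>/<38 hex chars> seen in the error messages.

```shell
# Hypothetical helper: print the object ID of every zero-byte loose object file.
list_empty_objects() {
    # $1 = git directory of the (bare) repository
    # -size 0 matches only empty files; the -path pattern skips objects/pack and objects/info
    find "$1/objects" -type f -size 0 -path '*/objects/[0-9a-f][0-9a-f]/*' |
    while IFS= read -r path; do
        # Object ID = directory name (2 hex chars) + file name (38 hex chars)
        printf '%s%s\n' "$(basename "$(dirname "$path")")" "$(basename "$path")"
    done | sort
}
```

Run from inside the bare repository as list_empty_objects . to get one object ID per line. It only detects empty files; objects that are truncated but not empty still need git fsck.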

Repairing .git

Reference document:
How to fix the Git error "object file … is empty"

Run git fsck --full --no-dangling to check for corrupt object files

error: object file ./objects/2a/cdc20d19d8cae08ed8adb741511139bb316b86 is empty
error: unable to mmap ./objects/2a/cdc20d19d8cae08ed8adb741511139bb316b86: No such file or directory
error: 2acdc20d19d8cae08ed8adb741511139bb316b86: object corrupt or missing: ./objects/2a/cdc20d19d8cae08ed8adb741511139bb316b86
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: unable to mmap ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9: No such file or directory
error: cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9: object corrupt or missing: ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
error: object file ./objects/de/08318ad845bd960aedf0ab1ce85fc6e26f608f is empty
error: unable to mmap ./objects/de/08318ad845bd960aedf0ab1ce85fc6e26f608f: No such file or directory
error: de08318ad845bd960aedf0ab1ce85fc6e26f608f: object corrupt or missing: ./objects/de/08318ad845bd960aedf0ab1ce85fc6e26f608f
error: object file ./objects/df/8aded019cb1b23af4a4f3c5171472e76461a56 is empty
error: unable to mmap ./objects/df/8aded019cb1b23af4a4f3c5171472e76461a56: No such file or directory
error: df8aded019cb1b23af4a4f3c5171472e76461a56: object corrupt or missing: ./objects/df/8aded019cb1b23af4a4f3c5171472e76461a56
Checking object directories: 100% (256/256), done.
Checking objects: 100% (389140/389140), done.
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt

Three other object files are empty as well, but the final fatal error still points at cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 being corrupt.
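
With this much output it helps to reduce it to a list of affected object IDs. The sketch below is a hypothetical helper, corrupt_ids_from_fsck, which assumes the "object corrupt or missing" message format shown in the output above; other git versions may word it differently.

```shell
# Hypothetical helper: read `git fsck` output on stdin and print each object ID
# reported as "object corrupt or missing", once, in sorted order.
corrupt_ids_from_fsck() {
    sed -n 's/^error: \([0-9a-f]\{40\}\): object corrupt or missing.*/\1/p' | sort -u
}
```

git fsck writes these errors to stderr, so a typical call would be git fsck --full --no-dangling 2>&1 | corrupt_ids_from_fsck.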

Next I tried git prune, which removes unreachable loose objects, hoping it would clear out the invalid object files

error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt

It fails.

Then I tried git gc, which cleans up objects that are no longer used and repacks the existing ones

error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
error: object file ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 is empty
fatal: loose object cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 (stored in ./objects/cf/1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9) is corrupt
error: failed to run repack

Still errors.
I had to keep searching, and eventually found:
How can I fix a corrupted Git repository?

One of the answers there uses a tool called git-repair

sudo apt install git-repair
git-repair  # Fix a broken Git repository
or
git-repair --force  # Force repair, even if data is lost
git fsck  # To verify it was fixed

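Install it with apt install git-repair

Run the repair command git-repair. Because this project contains a lot of files (about 5 GB, mostly small files), it ran for a long time and finally printed: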

Initialized empty Git repository in /tmp/tmprepoB5dr2w/.git/
1 missing objects could not be recovered!
If you have a clone of this bare repository, you should add it as a remote of this repository, and retry.
If there are no clones of this repository, you can instead retry with the --force parameter to force recovery to a possibly usable state.

It reports one missing object that could not be recovered. Running git fsck --full --no-dangling again prints:

Checking object directories: 100% (256/256), done.
error: refs/notes/review: invalid sha1 pointer cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9

An invalid pointer. Look at the ref with cat refs/notes/review

cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9

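What is this file?
Reference: a blog post I reposted earlier, "How Git Works" (Git原理)
This file probably belongs to Gerrit
Reference: the online article "Gerrit access control" (gerrit权限控制)

refs/notes/review is the ref where Gerrit stores code review information. This ref was probably lost; that is, it was not completely written to disk when the power went out.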
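
To check whether any other ref still points at the unrecoverable object, something like the following could be used. This is a sketch with a hypothetical helper, refs_pointing_at; it assumes refs are stored either as loose files under refs/ or in the packed-refs file, and that the full 40-character object ID is given.

```shell
# Hypothetical helper: print every ref whose stored value is the given object ID.
refs_pointing_at() {
    # $1 = git directory, $2 = full 40-character object ID
    (
        cd "$1" || exit 1
        # Loose refs: files under refs/ containing a line exactly equal to the ID
        [ -d refs ] && grep -rlx -- "$2" refs
        # Packed refs: lines of the form "<id> <refname>"
        [ -f packed-refs ] && awk -v id="$2" '$1 == id { print $2 }' packed-refs
        exit 0
    )
}
```

In this case, refs_pointing_at . cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9 run inside the repository would be expected to list refs/notes/review.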

Push again from the client.
Output of git push origin HEAD:refs/for/master:

Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 1.42 KiB | 0 bytes/s, done.
Total 2 (delta 0), reused 0 (delta 0)
error: unpack failed: error Missing unknown cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
fatal: Unpack error, check server log
To ssh://172.16.1.111:29418/TEST_1
 ! [remote rejected] HEAD -> refs/for/master (n/a (unpacker error))
error: failed to push some refs to 'ssh://admin@172.16.1.111:29418/TEST_1'

Now the error is the same one that my earlier searches had turned up. Here is the detailed server-side log again:

[2023-11-23 14:11:04,445] [SSH git-receive-pack '/TEST_1' (admin)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user admin account 1) during git-receive-pack '/TEST_1'
com.google.gerrit.sshd.BaseCommand$Failure: fatal: Unpack error, check server log
        at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:159)
        at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
        at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
        at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
        at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:442)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:417)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Unpack error on project "TEST_1":
  AdvertiseRefsHook: org.eclipse.jgit.transport.AdvertiseRefsHookChain@6bca808eclass org.eclipse.jgit.transport.AdvertiseRefsHookChain

        at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:158)
        ... 12 more
Caused by: org.eclipse.jgit.errors.UnpackException: Exception while parsing pack stream
        at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:307)
        at org.eclipse.jgit.transport.ReceivePack.receive(ReceivePack.java:206)
        at com.google.gerrit.sshd.commands.Receive.runImpl(Receive.java:97)
        ... 12 more
Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing unknown cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9
        at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:158)
        at org.eclipse.jgit.lib.ObjectReader.open(ObjectReader.java:227)
        at org.eclipse.jgit.revwalk.RevWalk.parseAny(RevWalk.java:859)
        at org.eclipse.jgit.transport.BaseReceivePack.checkConnectivity(BaseReceivePack.java:1354)
        at org.eclipse.jgit.transport.BaseReceivePack.receivePackAndCheckConnectivity(BaseReceivePack.java:1047)
        at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:250)
        ... 14 more

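Given the power loss, the data behind this ref probably cannot be recovered (and even if it could, the cost and effort would be considerable), so I simply deleted the ref that points at cf1ce10a08b7c5fb3e0cc24561f51292bcb9d1f9.

Run rm refs/notes/review, then check again with git fsck --full --no-dangling. Output: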

Checking object directories: 100% (256/256), done.
Checking objects: 100% (2/2), done.

git fsck no longer reports any problems. Push again from the client.
Output of git push origin HEAD:refs/for/master:

Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 258 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1)
remote: Counting objects: 2, done
remote: Processing changes: refs: 1, done    
remote: ERROR: [d3ba5be] missing Change-Id in commit message footer
remote: 
remote: Hint: To automatically insert Change-Id, install the hook:
remote:   gitdir=$(git rev-parse --git-dir); scp -p -P 29418 admin@172.16.1.111:hooks/commit-msg ${gitdir}/hooks/
remote: And then amend the commit:
remote:   git commit --amend
remote: 
To ssh://172.16.1.111:29418/TEST_1
 ! [remote rejected] HEAD -> refs/for/master ([d3ba5be] missing Change-Id in commit message footer)
error: failed to push some refs to 'ssh://admin@172.16.1.111:29418/TEST_1'

The server now responds normally; the push is rejected only because the Change-Id is missing. Follow the hint:
gitdir=$(git rev-parse --git-dir); scp -p -P 29418 admin@172.16.1.111:hooks/commit-msg ${gitdir}/hooks/
git commit --amend

Push again

Counting objects: 2, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 302 bytes | 0 bytes/s, done.
Total 2 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1)
remote: Counting objects: 2, done
remote: Processing changes: new: 1, refs: 1, done    
remote: 
remote: New Changes:
remote:   http://172.16.1.111/1 test
remote: 
To ssh://172.16.1.111:29418/TEST_1
 * [new branch]      HEAD -> refs/for/master

At this point the test environment is repaired, and I found no further problems.

  • A note on how git-repair works

Reference: the official site, https://git-repair.branchable.com/

how it works

git-repair starts by deleting all corrupt objects, and retrieving all missing objects that it can from the remotes of the repository.

If that is not sufficient to fully recover the repository, it can also reset branches back to commits before the corruption happened, delete branches that are no longer available due to the lost data, and remove any missing files from the index. It will only do this if run with the --force option, since that rewrites history and throws out missing data.

After running this command, you will probably want to run git fsck to verify it fixed the repository.

Note that fsck may still complain about objects referenced by the reflog, or the stash, if they were unable to be recovered. This command does not try to clean up either the reflog or the stash.

Also note that the --force option never touches tags, even if they are no longer usable due to missing data, so fsck may also find problems with tags.

Since this command unpacks all packs in the repository, you may want to run git gc afterwards.

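Deciding on the Repair Plan

Production runs CentOS 6, which is old enough that git-repair is no longer available for it. Building it from source would pull in many dependencies, and I did not want to change the production environment.
In the end I decided to do the repair in the test environment and then copy the git repository to production, taking backups before touching anything.
I copied the repaired git repository to production and tested a push; everything worked. Finally, I asked the developers to confirm that no files were missing from the repository.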
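
For the backup step, a minimal sketch of what could be run on the production server before the repository is replaced. backup_repo is a hypothetical helper, and the paths passed to it are placeholders, since the post does not give the actual repository location.

```shell
# Hypothetical helper: create a timestamped tar.gz backup of a repository directory.
# $1 = repository directory, $2 = backup directory; prints the archive path.
backup_repo() {
    repo_parent=$(dirname "$1")
    repo_name=$(basename "$1")
    mkdir -p "$2" || return 1
    archive="$2/$repo_name-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C makes the paths inside the archive relative to the repository's parent directory
    tar -czf "$archive" -C "$repo_parent" "$repo_name" || return 1
    echo "$archive"
}
```

Use absolute paths for both arguments. The archive can be inspected with tar -tzf before the original directory is touched.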

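Final Notes

Git is distributed, so a damaged server-side repository is not fatal: at worst you can create a new repository and push to it again from a clone. Even so, it is better to keep full off-site backups to avoid ending up in a situation like ours. Our current server is also old, so we are considering buying a new one and setting up RAID.

We currently run H2 in embedded mode, which only supports table-level locking. That is a problem too: requests often time out with an HTTP 500, and the database log shows lock timeouts.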

org.h2.jdbc.JdbcSQLException: Timeout trying to lock table "PATCH_SETS"; SQL statement:
SELECT T.revision,T.uploader_account_id,T.created_on,T.draft,T.groups,T.push_certificate,T.change_id,T.patch_set_id FROM patch_sets T WHERE T.change_id=? ORDER BY T.patch_set_id [50200-176]

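Options are to run H2 in mixed mode with MVCC enabled, or to switch the database to MySQL outright. That is for later, and I still need to study both further.

Let the above serve as a lesson.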


Author: 江湖义气
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 江湖义气 when reposting.